Clustering
Reference book: Bishop: “Pattern Recognition and Machine Learning” Chapter 9.1
Clustering
• Unsupervised learning
• Generating “classes”
• Distance/similarity measures
• Agglomerative methods
• Divisive methods
Unsupervised Learning
• No labels/responses. Finding structure in data.
• Dimensionality Reduction.

Clustering: T: R^D → {1, 2, …, K}
Subspace learning: T: R^D → R^m, m < D
Uses of Unsupervised Learning
• Data compression

Example: color quantization. Store a small dictionary of representative colors, e.g. 1 ~ (10, 160, 70), 2 ~ (40, 240, 20), …, and replace each RGB pixel, such as (12, 165, 73) or (11, 167, 79), by the label of its nearest dictionary entry.
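A minimal sketch of this compression scheme; the dictionary entries and the first two pixels are the values from the slide, and a third pixel is added to hit the second dictionary entry:

```python
# Illustrative sketch of dictionary-based compression (color quantization);
# only the small labels need to be stored, plus the dictionary itself.
dictionary = {1: (10, 160, 70), 2: (40, 240, 20)}

def nearest_label(pixel):
    """Label of the dictionary color closest to `pixel` (squared Euclidean distance)."""
    return min(dictionary,
               key=lambda k: sum((p - d) ** 2 for p, d in zip(pixel, dictionary[k])))

pixels = [(12, 165, 73), (11, 167, 79), (38, 235, 25)]
labels = [nearest_label(p) for p in pixels]
print(labels)  # [1, 1, 2]
```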
Uses of Unsupervised Learning
• To improve classification/regression (semi-supervised learning)
1. From unlabeled data, learn a good feature transform T: R^D → R^m.
2. To labeled data, apply the transformation: (T(x_1), y_1), …, (T(x_N), y_N).
3. Perform classification/regression on the transformed low-dimensional data.
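One minimal way to realize such a transform, sketched below: T maps each point to its vector of distances to cluster centres learned from the unlabeled data. The centres here are made-up stand-ins for the output of a clustering step:

```python
import math

# Sketch: a feature transform T learned from unlabeled data. Here T maps each
# point to its distances to K cluster centres; the centres below are made-up
# stand-ins for the output of a clustering step on the unlabeled data.
centres = [(0.0, 0.0, 0.0, 0.0), (5.0, 5.0, 5.0, 5.0)]  # K = 2 centres in R^4

def T(x):
    """T: R^D -> R^K, the vector of distances from x to each centre."""
    return tuple(math.dist(x, mu) for mu in centres)

# Apply T to the (hypothetical) labeled data, keeping the labels.
labeled = [((0.1, 0.2, 0.0, 0.1), 0), ((4.9, 5.1, 5.0, 4.8), 1)]
transformed = [(T(x), y) for x, y in labeled]
# Each 4-dimensional input is now a 2-dimensional feature vector.
```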
What is Clustering?
• Unsupervised learning - no information from teacher
• The process of partitioning a set of data into a set of (hopefully) meaningful sub-classes, called clusters
• Cluster:
  • a collection of data points that are “similar” to one another and should collectively be treated as a group (not the “group” we learned in linear algebra)
  • as a collection, sufficiently different from other groups
What is Clustering
Clustering Problem.
Input. Training data S_N = {x_1, x_2, …, x_N}, x_n ∈ R^D. Integer K.
Output. Clusters C_1, C_2, …, C_K ⊂ {1, 2, …, N} such that every data point is in one and only one cluster.
Cluster representatives μ_1, …, μ_K.
Some clusters could be empty!
Clusters
How to Specify a Cluster
• By listing all its elements
C_1 = {1, 3, 4, 7, 8, 11}, C_2 = {2, 5, 6, 9, 10}
[Figure: data points 1–11 plotted as two spatial groups]
How to Specify a Cluster
• Using a representative
a. A point at the center of the cluster (centroid)
b. A point in the training data (exemplar)
Each point x_n will be assigned to the closest representative.
[Figures: the same points 1–11 clustered using a centroid representative (left) and an exemplar representative (right)]
Voronoi Diagram
We can partition all the points in the space into regions, according to their closest representative.
Distance/Similarity Measures
(sometimes also used as loss functions)
A measure of how close two data points are. Nearby points (i.e. distance is small) are more likely to belong to the same cluster.
• Euclidean Distance
dist(x, y) = ‖x − y‖_2 := √( Σ_{i=1}^{D} (x_i − y_i)² )
Distance/Similarity Measures
A measure of how alike two data points are. Similar points (i.e. similarity is large) are more likely to belong to the same cluster.
• Cosine Similarity: cos(x, y) = xᵀy / (‖x‖ ‖y‖)
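Both measures as minimal Python functions:

```python
import math

# Minimal implementations of the two measures above.
def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine(x, y):
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny)

print(euclidean((0, 0), (3, 4)))  # 5.0
print(cosine((1, 0), (1, 1)))     # 0.707...: vectors pointing the same way score near 1
```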
Distance (Similarity) Matrix
• Similarity (Distance) Matrix
• based on the distance or similarity measure, we can construct a symmetric matrix of distance (or similarity) values
• the (i, j)-th entry in the matrix is the distance (similarity) between items i and j
      I1    I2    …   In
I1    •     d12   …   d1n
I2    d21   •     …   d2n
⋮     ⋮     ⋮     •   ⋮
In    dn1   dn2   …   •

d_ij = similarity (or distance) of item I_i to item I_j
Note that d_ij = d_ji (i.e., the matrix is symmetric), so we only need the lower-triangle part of the matrix.
The diagonal is all 1’s (similarity) or all 0’s (distance).
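A small sketch of building such a matrix, assuming Euclidean distance on three toy 2-D points:

```python
import math

# Build the symmetric distance matrix d[i][j] for a toy data set.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
n = len(points)
d = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]

# Symmetric, with a zero diagonal (a distance matrix).
assert all(d[i][j] == d[j][i] for i in range(n) for j in range(n))
assert all(d[i][i] == 0.0 for i in range(n))
print(d[0][1], d[0][2])  # 5.0 10.0
```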
Example: Term Similarities in Documents

Term-document weight matrix (w_ik = weight of term T_i in document Doc_k):

        T1  T2  T3  T4  T5  T6  T7  T8
Doc1     0   4   0   0   0   2   1   3
Doc2     3   1   4   3   1   2   0   1
Doc3     3   0   0   0   3   0   3   0
Doc4     0   1   0   3   0   0   2   0
Doc5     2   2   2   3   1   4   0   2

sim(T_i, T_j) = Σ_{k=1}^{N} (w_ik × w_jk), where N is the number of documents.
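The formula can be checked directly against the table (indices in the code are 0-based):

```python
# The document-term weights from the table above; w[k][i] is the weight of
# term T(i+1) in document Doc(k+1).
w = [
    [0, 4, 0, 0, 0, 2, 1, 3],  # Doc1
    [3, 1, 4, 3, 1, 2, 0, 1],  # Doc2
    [3, 0, 0, 0, 3, 0, 3, 0],  # Doc3
    [0, 1, 0, 3, 0, 0, 2, 0],  # Doc4
    [2, 2, 2, 3, 1, 4, 0, 2],  # Doc5
]

def sim(i, j):
    """sim(T_{i+1}, T_{j+1}): sum over documents of the products of the weights."""
    return sum(doc[i] * doc[j] for doc in w)

print(sim(0, 1), sim(0, 2), sim(1, 5))  # 7 16 18, matching the term-term matrix
```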
Term-Term Similarity Matrix

      T1  T2  T3  T4  T5  T6  T7
T2     7
T3    16   8
T4    15  12  18
T5    14   3   6   6
T6    14  18  16  18   6
T7     9   6   0   6   9   2
T8     7  17   8   9   3  16   3
Similarity (Distance) Thresholds
• A similarity (distance) threshold may be used to mark pairs that are “sufficiently” similar
      T1  T2  T3  T4  T5  T6  T7
T2     7
T3    16   8
T4    15  12  18
T5    14   3   6   6
T6    14  18  16  18   6
T7     9   6   0   6   9   2
T8     7  17   8   9   3  16   3
Using a threshold value of 10 in the previous example:

      T1  T2  T3  T4  T5  T6  T7
T2     0
T3     1   0
T4     1   1   1
T5     1   0   0   0
T6     1   1   1   1   0
T7     0   0   0   0   0   0
T8     0   1   0   0   0   1   0
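The thresholding step, sketched on the lower-triangular similarity values from the matrix above:

```python
# Mark "sufficiently similar" pairs in the lower-triangular term-term
# similarity matrix, using threshold 10.
sim_rows = [            # row r holds sim(T(r+2), T1..T(r+1))
    [7],
    [16, 8],
    [15, 12, 18],
    [14, 3, 6, 6],
    [14, 18, 16, 18, 6],
    [9, 6, 0, 6, 9, 2],
    [7, 17, 8, 9, 3, 16, 3],
]
threshold = 10
marked = [[1 if s >= threshold else 0 for s in row] for row in sim_rows]
for row in marked:
    print(row)
# First rows: [0], [1, 0], [1, 1, 1], ...
```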
Graph Representation
• The similarity matrix can be visualized as an undirected graph
• each item is represented by a node, and edges represent the fact that two items are similar (a 1 in the similarity threshold matrix)
[Figure: undirected graph on nodes T1–T8; edges connect the pairs marked 1 in the thresholded similarity matrix]
If no threshold is used, then the matrix can be represented as a weighted graph.
Training Loss
Sum of squared distances to closest representative.
loss ≈ 11 × 1² = 11
(assume the length of each edge is about 1)
Training Loss
Sum of squared distances to closest representative (cluster center).
loss ≈ 9 × 0.1² = 0.09
(assume the length of each edge is about 0.1)
Training loss
Optimizing both clusters and representatives.
L(R, M; S_N) = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ‖x_n − μ_k‖²

where
S_N = {x_1, x_2, …, x_N}, x_n ∈ R^D
r_nk ∈ {0, 1} denotes which of the K clusters data point x_n is assigned to. If x_n is assigned to cluster k, then r_nk = 1, and r_nj = 0 if j ≠ k.
R = (r_nk) ∈ {0, 1}^{N×K}, n = 1, …, N, k = 1, …, K
M = (μ_1, …, μ_K)

Goal: find values for R and M so as to minimize L.
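A direct evaluation of L on a toy configuration; the data, centres, and assignments below are made up for illustration:

```python
# Evaluate the training loss L(R, M; S_N) on a toy configuration.
data = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]  # S_N with N = 3
mus = [(0.5, 0.0), (10.0, 0.0)]               # M: K = 2 representatives
R = [[1, 0], [1, 0], [0, 1]]                  # r_nk: exactly one 1 per row

def sq_dist(x, m):
    return sum((xi - mi) ** 2 for xi, mi in zip(x, m))

loss = sum(R[n][k] * sq_dist(data[n], mus[k])
           for n in range(len(data)) for k in range(len(mus)))
print(loss)  # 0.25 + 0.25 + 0.0 = 0.5
```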
Basic Clustering Methodology Two approaches:
Agglomerative: pairs of items/clusters are successively linked to produce larger clusters
Divisive (partitioning): items are initially placed in one cluster and then divided into separate groups
Need to define an update rule for how to combine two clusters together.
Single-link clustering: the distance between two clusters is the shortest distance between any pair of points, one from each cluster.
Complete-link clustering: the distance between two clusters is the longest distance between any pair of points, one from each cluster.
Average-link clustering: the distance is the average distance over all pairs of points, one from each cluster.
Centroid clustering: the distance is the distance between the cluster centroids.
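The four rules as minimal Python functions, for clusters represented as lists of points:

```python
import math
from itertools import product

# The four cluster-distance rules above.
def single_link(A, B):
    return min(math.dist(a, b) for a, b in product(A, B))

def complete_link(A, B):
    return max(math.dist(a, b) for a, b in product(A, B))

def average_link(A, B):
    return sum(math.dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))

def centroid_link(A, B):
    cA = tuple(sum(c) / len(A) for c in zip(*A))
    cB = tuple(sum(c) / len(B) for c in zip(*B))
    return math.dist(cA, cB)

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (6.0, 0.0)]
print(single_link(A, B), complete_link(A, B), average_link(A, B), centroid_link(A, B))
# 3.0 6.0 4.5 4.5
```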
Hierarchical/Agglomerative Methods
• Based on some methods of representing hierarchy of data points
• One idea: hierarchical dendrogram (connects points based on similarity)
      T1  T2  T3  T4  T5  T6  T7
T2     7
T3    16   8
T4    15  12  18
T5    14   3   6   6
T6    14  18  16  18   6
T7     9   6   0   6   9   2
T8     7  17   8   9   3  16   3

[Dendrogram built from the matrix, leaf order: T5 T1 T3 T4 T2 T6 T8 T7]
Hierarchical Agglomerative
• Compute distance matrix
• Put each data point in its own cluster
• Find the most similar pair of clusters
• merge the pair (show the merger in the dendrogram)
• update the similarity matrix
• repeat until all data points are in one cluster
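A naive sketch of these steps with single-link distance, merging clusters directly rather than drawing a dendrogram:

```python
import math

# Naive agglomerative clustering: start with singletons, repeatedly merge
# the closest pair of clusters (single-link distance).
def single_link(A, B, points):
    return min(math.dist(points[i], points[j]) for i in A for j in B)

def agglomerate(points, num_clusters):
    clusters = [[i] for i in range(len(points))]  # each point in its own cluster
    while len(clusters) > num_clusters:
        # find the most similar (closest) pair of clusters ...
        a, b = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]], points))
        # ... and merge it
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

pts = [(0.0, 0.0), (0.5, 0.0), (10.0, 0.0), (10.5, 0.0)]
print(agglomerate(pts, 2))  # [[0, 1], [2, 3]]
```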
K-Means
Optimization Algorithm
Goal. Minimize L(x, y).

L(R, M; S_N) = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ‖x_n − μ_k‖²
Coordinate Descent (Optimization).
Repeat until convergence:
1. Find the optimal x while holding y constant.
2. Find the optimal y while holding x constant.
Coordinate descent is an optimization algorithm that successively minimizes along coordinate directions to find the minimum of a function.
Coordinate optimisation
Optimization Algorithm Coordinate Descent (Optimization)
Repeat until convergence:
• Find best clusters given centres • Find best centres given clusters
L(R, M; S_N) = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ‖x_n − μ_k‖²
Optimization Algorithm
1. Initialize centers μ_1, …, μ_K from the data.
2. Repeat until no further change in training loss:
a. For each n ∈ {1, …, N},
   r_nk = 1 if k = argmin_j ‖x_n − μ_j‖², and r_nk = 0 otherwise.
   We assign the n-th data point to the closest cluster centre.
b. For each k ∈ {1, …, K},
   μ_k = (Σ_n r_nk x_n) / (Σ_n r_nk)
   We recompute the cluster means.
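A minimal implementation of steps a and b; the initial centres are passed in, and empty clusters are simply left where they are:

```python
import math

# A minimal K-means following steps a (assignment) and b (mean update).
def kmeans(data, centres, max_iter=100):
    centres = [tuple(c) for c in centres]
    assign = None
    for _ in range(max_iter):
        # Step a: assign each point to its closest centre.
        new_assign = [min(range(len(centres)),
                          key=lambda k: math.dist(x, centres[k])) for x in data]
        if new_assign == assign:       # no change in assignments: converged
            break
        assign = new_assign
        # Step b: recompute each centre as the mean of its assigned points.
        for k in range(len(centres)):
            members = [x for x, a in zip(data, assign) if a == k]
            if members:                # leave empty clusters where they are
                centres[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centres, assign

data = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
centres, assign = kmeans(data, [(0.0, 0.0), (10.0, 0.0)])
print(centres, assign)  # [(0.5, 0.0), (9.5, 0.0)] [0, 0, 1, 1]
```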
Convergence
• Training loss always decreases in each step (coordinate descent).
• Converges to local minimum, not necessarily global minimum.
Repeat the algorithm from many different initial points, and pick the configuration with the smallest training loss.
Training loss always decreases after every update step, so K-means must terminate in finite time.
Proof sketch: suppose K-means could run forever. There are only finitely many possible assignment matrices R (at most K^N). Running forever, some matrix R would have to be chosen now and chosen again later; but the training loss strictly decreases between successive changes, so no configuration can recur, a contradiction.
Discussion
Initialization
• Empty clusters
  o Pick data points to initialize the clusters, i.e. choose points in your dataset as the initial representatives μ_k.
• Bad local minima
  o Initialize many times and pick the solution with the smallest training loss.
  o Pick good starting positions.
Initialization
Starting position of centroids
Final position of centroids
Problem. How do we choose good starting positions?
Solution. Place them far apart with high probability.
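One standard way to do this is k-means++-style seeding: choose the first centre uniformly at random, then choose each further centre with probability proportional to its squared distance to the nearest centre already chosen. A sketch:

```python
import math
import random

# k-means++-style seeding: far-apart starting positions with high probability.
def seed_centres(data, K, seed=0):
    rng = random.Random(seed)
    centres = [rng.choice(data)]
    while len(centres) < K:
        # squared distance of every point to its nearest existing centre
        d2 = [min(math.dist(x, c) ** 2 for c in centres) for x in data]
        centres.append(rng.choices(data, weights=d2, k=1)[0])
    return centres

data = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0)]
print(seed_centres(data, K=2))  # the two centres almost surely come from opposite ends
```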
Number of Clusters
Number of Clusters
How do we choose K, the optimal number of clusters?
• Elbow method
• Training loss
• Validation loss
[Figure: training loss vs. number of clusters; the “elbow” of the curve suggests a good K]
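A sketch of the elbow method on toy data; initializing each run with the first K data points is an assumption made here for reproducibility, not part of the method:

```python
import math

# Elbow method: run K-means for several K and inspect the training loss.
def kmeans_loss(data, K, iters=20):
    centres = list(data[:K])               # simple deterministic init (assumption)
    for _ in range(iters):
        assign = [min(range(K), key=lambda k: math.dist(x, centres[k])) for x in data]
        for k in range(K):
            members = [x for x, a in zip(data, assign) if a == k]
            if members:
                centres[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(math.dist(x, centres[a]) ** 2 for x, a in zip(data, assign))

data = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0), (0.2, 0.0), (5.2, 0.0), (10.2, 0.0)]
for K in (1, 2, 3, 4):
    print(K, round(kmeans_loss(data, K), 2))
# The loss drops sharply up to K = 3, then flattens: the elbow suggests K = 3.
```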
Check your understanding
• Clustering gives you an idea of how the data is distributed.
• You have 1000 data vectors that are very similar to each other (e.g., Euclidean distance less than 0.0001). You can only divide them into a few clusters.
• When you use K-means, you usually obtain a global minimum of the loss function.
• When you use K-means, you can never achieve the global minimum of the loss function.
• The number of clusters is a parameter that can be optimized by K-means.
• In K-fold cross validation, K is a hyperparameter.
• Agglomerative clustering, as introduced in this lecture, has a fixed clustering result given the distance measurement, the number of clusters desired, and the tie-breaker.
Increasing K for K-means always ensures the optimal solution for that K has smaller loss.