K-Means Clustering

Last modified on 01 Oct 2021.

What’s the idea of K-Means?
How to choose number of clusters?
Discussion
Using K-Means with Scikit-learn
K-Means in action
K-medoids clustering
References

K-Means is one of the most popular clustering methods and one that any learner should know. In this note, we look at the idea behind K-Means and how to use it with Scikit-learn. We also touch on its variants (K-medoids, K-modes, K-medians).

What’s the idea of K-Means?

1. Randomly choose $k$ centroids.
2. Go through each example and assign it to the nearest centroid (the example takes the class of that centroid).
3. Move each centroid to the average of the data points assigned to its class.
4. Repeat steps 2 and 3 until convergence, i.e. until the assignments (and hence the centroids) stop changing.
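
The steps above can be sketched in plain NumPy. This is only a minimal illustration, not how Scikit-learn implements it: the function name `kmeans`, its parameters (`n_iters`, `seed`), the toy data, and the choice to initialise centroids from randomly picked data points are all assumptions made for this example.

```python
import numpy as np


def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-Means on an (n_samples, n_features) array (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 3: move each centroid to the mean of the points assigned to it.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)

        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids


# Two well-separated blobs of toy data (made up for illustration).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])
labels, centroids = kmeans(X, k=2)
print(centroids)
```

Because the starting centroids are random, different seeds can lead to different final clusters on harder data, which is why library implementations usually run several initialisations and keep the best one.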

KMeans idea

How to choose number of clusters?

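Use the “Elbow” method: run K-Means for several values of $k$, plot the within-cluster sum of squared distances against $k$, and choose the $k$ at the “elbow”, where adding more clusters stops reducing that sum by much.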
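
A minimal sketch of the Elbow method with Scikit-learn, assuming made-up toy data with three blobs. It relies on the `inertia_` attribute of a fitted `KMeans` model, which holds the sum of squared distances from each sample to its closest centroid; the range of `k` values tried here is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three well-separated blobs (made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(100, 2))
    for center in (0.0, 5.0, 10.0)
])

# Fit K-Means for a range of k and record the inertia
# (sum of squared distances from each point to its closest centroid).
ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

On data like this, the inertia should fall sharply up to k=3 and only slowly afterwards, so the elbow sits at k=3. Plotting `ks` against `inertias` (for example with Matplotlib) makes the bend easier to spot.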


Discussion

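A type of partitioning clustering.
Sensitive to outliers and noise, because each centroid is a mean and a mean is pulled toward extreme values.
K-medoids clustering, or PAM (Partitioning Around Medoids), is less sensitive to outliers because each cluster is represented by an actual data point (the medoid) rather than by a mean. [ref]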
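
A small numeric illustration of that difference, using made-up 1-D values. It only compares the two kinds of cluster representative (mean versus medoid) for a single cluster; it is not a full K-medoids implementation.

```python
import numpy as np

# One tight 1-D cluster plus a single extreme outlier (made-up values).
points = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# K-Means represents the cluster by its mean, which the outlier drags away.
mean_center = points.mean()

# K-medoids represents it by the medoid: the actual data point with the
# smallest total distance to all other points in the cluster.
pairwise = np.abs(points[:, None] - points[None, :])
medoid = points[pairwise.sum(axis=1).argmin()]

print(mean_center)  # 22.0
print(medoid)       # 3.0
```

The mean lands far from every regular point, while the medoid stays inside the tight group.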

Using K-Means with Scikit-learn

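from sklearn.cluster import KMeans

# X: array-like of shape (n_samples, n_features)
kmeans = KMeans(n_clusters=10, random_state=0)  # n_clusters defaults to 8

kmeans.fit(X)               # learn the centroids
labels = kmeans.predict(X)  # index of the nearest centroid for each sample

# or, in a single step
labels = kmeans.fit_predict(X)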

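Some notable parameters and attributes (see the Scikit-learn documentation for the full list):

max_iter: maximum number of iterations of the k-means algorithm for a single run (parameter, default 300).
kmeans.labels_: label of each training point (attribute, available after fitting).
kmeans.cluster_centers_: coordinates of the cluster centroids (attribute, available after fitting).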
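
A short sketch showing these in use, assuming six made-up 2-D points that form two obvious groups. The specific values passed for `n_init` and `random_state` are choices for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points forming two obvious groups (made-up values).
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, max_iter=300, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster index of each training point
print(kmeans.cluster_centers_)  # one row per centroid
print(kmeans.predict([[0, 0], [12, 3]]))  # nearest centroid for new points
```

Which group is numbered 0 and which is numbered 1 depends on the initialisation, so compare labels with each other rather than relying on specific numbers.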

K-Means in action

K-Means clustering on the handwritten digits data.
Image compression using K-Means.
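
As a rough sketch of the image compression idea (colour quantisation), assuming a synthetic random image in place of a real photo and an arbitrary palette size of 8 colours: every pixel is treated as a point in RGB space, the colours are clustered, and each pixel is replaced by the centroid of its cluster. This is not the code from the linked notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 32x32 RGB image with values in [0, 1] (a stand-in for a real photo).
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))

# Treat every pixel as a point in 3-D colour space.
pixels = image.reshape(-1, 3)

# Cluster the colours; each centroid becomes one entry of the palette.
n_colors = 8
kmeans = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(pixels)

# Replace each pixel by the centroid of its cluster.
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)

print(compressed.shape)
```

The compressed image needs only the small palette plus one palette index per pixel, instead of three colour values per pixel.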

K-medoids clustering

References

Luis Serrano – [Video] Clustering: K-means and Hierarchical.
Andrew Ng – My raw note of the course “Machine Learning” on Coursera.