What You Will Learn in This Section
- How K-Means works
- How to determine the number of clusters in K-Means
Detailed Steps of K-Means
In this section, we will explore the detailed steps involved in the K-Means clustering algorithm. First, let's define the problem: We are given a data matrix of size \( (N, M) \) — \( N \) data points, each with \( M \) features — and our goal is to partition the data into \( K \) clusters. \( K \) is a hyperparameter that the user must specify before running the algorithm. The algorithm consists of three main steps:
Step 1: Initialization of Cluster Centroids
The algorithm begins by randomly initializing \( K \) cluster centroids. We can select \( K \) data points randomly from the given dataset. However, selecting the initial centroids carefully is crucial, as poor choices can affect the final cluster quality. When initializing cluster centroids, consider the following:
- Ensure that the \( K \) centroids are well-separated. If the initial centroids are too close to each other, the resulting clusters may not be well-defined.
After the initialization step, K-Means iterates between Step 2 and Step 3 until convergence.
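As a sketch, the simplest initialization — sampling \( K \) distinct data points at random — might look like this (the function name and signature are illustrative, not part of any standard API):

```python
import numpy as np

def init_centroids(X, K, rng=None):
    """Pick K distinct data points at random as initial centroids.

    X   : array of shape (N, M)
    K   : number of clusters
    rng : seed or numpy Generator for reproducibility
    """
    rng = np.random.default_rng(rng)
    # replace=False guarantees K distinct rows are chosen.
    idx = rng.choice(X.shape[0], size=K, replace=False)
    return X[idx].copy()
```

More careful schemes (such as k-means++) bias the sampling toward points that are far from the centroids already chosen, which addresses the separation concern above.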
Step 2: Cluster Assignment
In this step, each data point is assigned to its nearest cluster. The algorithm computes the distance (typically Euclidean) between each data point and every centroid, and assigns the point to the cluster whose centroid is closest.
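This step can be vectorized with broadcasting instead of an explicit double loop — a minimal sketch (the function name is illustrative):

```python
import numpy as np

def assign_clusters(X, centroids):
    """Return, for each row of X, the index of the nearest centroid.

    X         : array of shape (N, M)
    centroids : array of shape (K, M)
    """
    # Broadcasting gives pairwise differences of shape (N, K, M);
    # the norm over the last axis yields an (N, K) distance matrix.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)
```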
Step 3: Cluster Centroid Update
Once all data points have been assigned to their respective clusters, the algorithm updates the cluster centroids. The new centroids are computed as the mean of all data points within each cluster.
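A sketch of the update step (the function name is illustrative; note the guard for clusters that end up with no assigned points, a detail any real implementation must decide how to handle):

```python
import numpy as np

def update_centroids(X, labels, centroids):
    """Recompute each centroid as the mean of its assigned points.

    X         : array of shape (N, M)
    labels    : array of shape (N,) with cluster indices
    centroids : current centroids, shape (K, M)
    """
    new = centroids.copy()
    for k in range(len(centroids)):
        members = X[labels == k]
        if len(members) > 0:
            new[k] = members.mean(axis=0)
        # If a cluster is empty, we simply keep its old centroid here;
        # another common choice is to re-seed it at a random data point.
    return new
```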
The algorithm repeats Steps 2 and 3 until a stopping criterion is met. Below are some common stopping criteria:
- Running the algorithm for a fixed number of iterations.
- Checking for minimal changes in cluster centroids. The Euclidean distance between new and previous centroids can be computed, and if it is below a certain threshold (e.g., 0.00001), the algorithm stops.
- Monitoring cluster quality using the Within-Cluster Sum of Squares (WCSS) metric. If the improvement in WCSS is below a predefined threshold (e.g., 0.00001), the algorithm stops.
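Putting the three steps and a stopping criterion together, a minimal self-contained K-Means loop might look like this (parameter names and the centroid-shift threshold are illustrative choices, not a reference implementation):

```python
import numpy as np

def kmeans(X, K, max_iters=100, tol=1e-5, seed=0):
    """Minimal K-Means: random init, then alternate assignment and update.

    Stops after max_iters iterations or when the total centroid
    movement (Euclidean norm) falls below tol.
    Returns (centroids, labels, wcss).
    """
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids as K distinct random data points.
    centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute centroids as cluster means.
        new_centroids = centroids.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:
                new_centroids[k] = members.mean(axis=0)
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # WCSS: sum of squared distances of points to their assigned centroid.
    wcss = ((X - centroids[labels]) ** 2).sum()
    return centroids, labels, wcss
```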
How to Determine the Optimal Number of Clusters
In K-Means, the number of clusters \( K \) is a hyperparameter that must be set before training begins. Note that WCSS always decreases as \( K \) grows (in the extreme, \( K = N \) gives a WCSS of zero), so we cannot simply minimize it. Instead, we run the K-Means algorithm for a range of \( K \) values and plot WCSS against \( K \). The optimal number of clusters is then chosen using the elbow method, which identifies the point beyond which increasing \( K \) yields only a marginal decrease in WCSS. Based on the following diagram, \( K = 4 \) appears to be the optimal choice.
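A sketch of how such an elbow curve can be produced — here with a compact inline K-Means on synthetic four-blob data (all names and the data are illustrative; a library routine such as scikit-learn's `KMeans`, whose `inertia_` attribute is the WCSS, would work equally well):

```python
import numpy as np

def wcss_for_k(X, K, iters=50, seed=0):
    """Run a basic K-Means for a given K and return the final WCSS."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                c[k] = X[labels == k].mean(axis=0)
    return ((X - c[labels]) ** 2).sum()

# Synthetic data with 4 well-separated blobs, so the elbow should sit near K = 4.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(30, 2))
               for m in [(0, 0), (5, 5), (0, 5), (5, 0)]])

# Sweep K and inspect where the WCSS stops dropping sharply (the elbow),
# e.g. by plotting range(1, 8) against curve.
curve = [wcss_for_k(X, K) for K in range(1, 8)]
```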