
Model Evaluation Metrics for the K-Means Algorithm

What You Will Learn in This Section
  • Metrics used for unsupervised learning
  • WCSS (Within-Cluster Sum of Squares)
  • Silhouette Score

There are numerous metrics available to evaluate the quality of clusters identified by a clustering algorithm. These metrics can be categorized into two types:

  • When true labels are available
  • When true labels are not available

The Scikit-learn documentation provides a comprehensive list of clustering metrics.

We will focus on the second category, where true labels are unavailable. In this section, we will discuss two widely used metrics, but readers are encouraged to explore the Scikit-learn documentation for additional metrics.

  • WCSS (Within-Cluster Sum of Squares)
    This metric measures cluster coherence, that is, how tightly the points in each cluster are grouped around their centroid. Lower values indicate more coherent clusters. The calculation of WCSS involves two steps:
    1. For each cluster, compute the mean squared distance between its data points and the cluster centroid.
      \begin{align} C_i = \frac{1}{p_i} \sum_{j=1}^{p_i} \lVert X_j - \text{centroid}_i \rVert^2 \end{align} where:
      \( p_i \) = number of data points in cluster \( i \)
      \( X_j \) = the \( j \)-th data point belonging to cluster \( i \)
      \( \text{centroid}_i \) = centroid of cluster \( i \)

    2. Compute the average value of \( C_i \) across all clusters.
      \begin{align} \text{WCSS} = \frac{1}{K} \sum_{i=1}^{K} C_i \end{align} where \( K \) denotes the number of clusters.

    This is a straightforward metric to understand: a lower WCSS value indicates tighter clusters. However, WCSS is bounded below only by 0 and has no upper bound, and its magnitude depends on the scale of the data, so a single value is hard to interpret in isolation. Consider the following example:
    Model 1: WCSS = 500
    Model 2: WCSS = 400
    Assuming both models were fitted on the same data, Model 2 has the more coherent clusters, but WCSS alone does not indicate how close either model is to an optimal solution.
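The two-step WCSS calculation described above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the helper names (`squared_distance`, `centroid_of`, `cluster_mean_sq_distance`, `wcss`) and the toy data are assumptions made for this example. Note that scikit-learn's `KMeans.inertia_` reports the un-normalized sum of squared distances, so it will generally not match the averaged value computed here.

```python
def squared_distance(point, centroid):
    """Squared Euclidean distance between a point and a centroid."""
    return sum((x - c) ** 2 for x, c in zip(point, centroid))


def centroid_of(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def cluster_mean_sq_distance(points, centroid):
    """Step 1: C_i, the mean squared distance of a cluster's points to its centroid."""
    return sum(squared_distance(p, centroid) for p in points) / len(points)


def wcss(clusters):
    """Step 2: average of C_i over all K clusters.

    `clusters` is a list of point lists, one list per cluster (hypothetical layout).
    """
    per_cluster = [cluster_mean_sq_distance(pts, centroid_of(pts)) for pts in clusters]
    return sum(per_cluster) / len(per_cluster)


# Toy data: two small 2-D clusters (illustrative values only).
cluster_a = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]  # centroid (1, 1), C_a = 2
cluster_b = [(10.0, 10.0), (10.0, 14.0)]                      # centroid (10, 12), C_b = 4

score = wcss([cluster_a, cluster_b])
print(score)  # (2 + 4) / 2 = 3.0
```

Because each cluster's mean is averaged with equal weight, a small cluster influences this value as much as a large one; that is a property of the definition used in this section.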
  • Silhouette Score
    This score measures how similar a data point is to its assigned cluster compared to other clusters. The silhouette score ranges from -1 to +1. Values near +1 indicate that the data point is well matched to its own cluster and distinct from other clusters, values near 0 indicate a point close to the boundary between two clusters, and negative values suggest the point may have been assigned to the wrong cluster.
    Calculation of Silhouette Score
    The silhouette score is calculated for each data point, and the overall score is the average over all data points. For each sample, the score is based on two values:
    \( a \) : The mean distance between a sample and all other points within its cluster
    \( b \) : The mean distance between the sample and all points in the nearest neighboring cluster
    \begin{align} s_i &= \frac{b-a}{\max(a,b)} \\ S &= \frac{1}{N} \sum_{i=1}^N s_i \end{align} where:
    \( s_i \) = silhouette score for sample \( i \), with \( a \) and \( b \) computed for that sample
    \( S \) = overall silhouette score
    \( N \) = total number of samples

    Silhouette analysis can be performed at the individual sample level and is often used to determine the optimal number of clusters. The Scikit-learn documentation provides a detailed example of this analysis.
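The per-sample and overall silhouette formulas above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function names and toy data are hypothetical, distances are Euclidean, and a sample that is alone in its cluster (or a labeling with only one cluster) is given a score of 0, which is a convention chosen here for simplicity. In practice, scikit-learn's `silhouette_score` and `silhouette_samples` functions cover the same calculation.

```python
from math import sqrt


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def silhouette_samples_sketch(points, labels):
    """Return the silhouette score s_i for every sample."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Collect distances from sample i to every other sample, grouped by cluster.
        by_cluster = {}
        for j, (q, other_label) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other_label, []).append(euclidean(p, q))
        own = by_cluster.pop(label, [])
        if not own or not by_cluster:
            # Assumed convention: singleton cluster or single-cluster labeling scores 0.
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return scores


def silhouette_score_sketch(points, labels):
    """Overall score S: the mean of s_i over all N samples."""
    s = silhouette_samples_sketch(points, labels)
    return sum(s) / len(s)


# Toy data: four points on a line forming two obvious groups (illustrative only).
points = [(0.0, 0.0), (0.0, 1.0), (0.0, 4.0), (0.0, 5.0)]
good_labels = [0, 0, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1]   # deliberately mixes the groups

good = silhouette_score_sketch(points, good_labels)
bad = silhouette_score_sketch(points, bad_labels)
print(good, bad)  # the sensible labeling scores higher than the mixed one
```

When choosing the number of clusters, the same function can be evaluated on the labelings produced for several candidate values of K, and the value of K with the highest overall score is a common starting point.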