What You Will Learn in This Section
- Metrics used for unsupervised learning
- WCSS (Within-Cluster Sum of Squares)
- Silhouette Score
There are numerous metrics available to evaluate the quality of clusters identified by a clustering algorithm. These metrics can be categorized into two types:
- When true labels are available
- When true labels are not available
We will focus on the second category, where true labels are unavailable. In this section, we will discuss two widely used metrics, but readers are encouraged to explore the Scikit-learn documentation for additional metrics.
WCSS (Within-Cluster Sum of Squares)
This metric measures cluster coherence: lower values indicate tighter, more coherent clusters, while higher values indicate looser ones. The calculation of WCSS involves two steps:
- For each cluster, compute the mean squared distance between its data points and the cluster centroid.
\begin{align} C_i = \frac{1}{p_i} \sum_{j=1}^{p_i} \lVert X_j - \text{centroid}_i \rVert^2 \end{align} where:
\( p_i \) = number of data points in cluster \( i \)
\( X_j \) = the \( j \)-th data point belonging to cluster \( i \)
- Compute the average value of \( C_i \) across all clusters.
\begin{align} \text{WCSS} = \frac{1}{K} \sum_{i=1}^{K} C_i \end{align} where \( K \) denotes the number of clusters.
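The two steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names (`centroid_of`, `mean_squared_distance`, `averaged_wcss`) and the toy data are assumptions made for this example, and distance is taken to be Euclidean. Note that scikit-learn's `KMeans` exposes `inertia_`, which is the sum of squared distances to the nearest centroid rather than the averaged form defined here, so the two numbers are not interchangeable.

```python
def centroid_of(points):
    """Component-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def mean_squared_distance(points, centroid):
    """C_i: mean squared Euclidean distance from a cluster's points to its centroid."""
    total = 0.0
    for point in points:
        total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total / len(points)


def averaged_wcss(clusters):
    """WCSS as defined in this section: the average of C_i over all K clusters."""
    per_cluster = [mean_squared_distance(pts, centroid_of(pts)) for pts in clusters]
    return sum(per_cluster) / len(per_cluster)


# Hypothetical toy data: two clusters of 2-D points.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],    # centroid (0, 1),  C_1 = 1
    [(10.0, 0.0), (14.0, 0.0)],  # centroid (12, 0), C_2 = 4
]
wcss = averaged_wcss(clusters)
print(wcss)  # (1 + 4) / 2 = 2.5
```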
This is a straightforward metric to understand. A lower WCSS value indicates better cluster quality. However, WCSS has no upper bound (its only bound is the lower limit of 0) and its magnitude depends on the scale of the data, making it difficult to judge a model from its WCSS value alone. Consider the following example:
Model 1: WCSS = 500
Model 2: WCSS = 400
Model 2 has the lower WCSS and is therefore the better of the two, assuming both were fit on the same data. However, WCSS alone does not indicate how close either model is to an optimal solution.
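One reason a raw WCSS value is hard to interpret is that it changes with the units of the data. The sketch below, using a hypothetical helper `averaged_wcss` and made-up measurements, computes the averaged WCSS defined above for the same clustering expressed in metres and in centimetres.

```python
def averaged_wcss(clusters):
    """Average over clusters of the mean squared distance to each cluster's centroid."""
    per_cluster = []
    for points in clusters:
        centroid = [sum(c) / len(points) for c in zip(*points)]
        squared = [sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in points]
        per_cluster.append(sum(squared) / len(points))
    return sum(per_cluster) / len(per_cluster)


# Hypothetical measurements in metres, already grouped into two clusters.
clusters_m = [[(1.0, 1.0), (1.0, 3.0)], [(8.0, 8.0), (10.0, 8.0)]]
# The same measurements expressed in centimetres.
clusters_cm = [[(x * 100, y * 100) for x, y in pts] for pts in clusters_m]

wcss_m = averaged_wcss(clusters_m)
wcss_cm = averaged_wcss(clusters_cm)
print(wcss_m, wcss_cm)
```

The clustering is identical in both cases, yet the second value is 10,000 times the first, because squared distances grow with the square of the unit conversion factor.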
Silhouette Score
This score measures how similar a data point is to its own cluster compared with other clusters. The silhouette score ranges from -1 to +1: values near +1 indicate that the data point is well matched to its cluster and far from neighboring clusters, values near 0 indicate that it lies close to the boundary between two clusters, and negative values suggest that it may have been assigned to the wrong cluster.
Calculation of Silhouette Score
A silhouette score is calculated for each data point, and the overall score is the average across all data points. Each per-sample score is based on two values:
\( a \) : The mean distance between a sample and all other points within its cluster
\( b \) : The mean distance between the sample and all points in the nearest neighboring cluster
\begin{align} s_i &= \frac{b-a}{\max(a,b)} \\ S &= \frac{1}{N} \sum_{i=1}^{N} s_i \end{align} where:
\( s_i \) = silhouette score for sample \( i \)
\( S \) = overall silhouette score
\( N \) = total number of samples
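The calculation above can be traced by hand on a small data set. The following is a minimal sketch in plain Python, assuming Euclidean distance; the function name `silhouette_per_sample` and the toy points are hypothetical. Assigning a score of 0 to a sample that is alone in its cluster is a common convention and is an assumption here, since the formula itself does not cover that case.

```python
import math


def silhouette_per_sample(points, labels):
    """Return s_i for every sample. Assumes at least two clusters."""
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # Group the distances from sample i to every other sample by cluster.
        dists = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(other, []).append(math.dist(p, q))
        if label not in dists:
            # Sample is alone in its cluster: use 0 by convention (assumption).
            scores.append(0.0)
            continue
        # a: mean distance to the other points in the same cluster.
        a = sum(dists[label]) / len(dists[label])
        # b: smallest mean distance to the points of any other cluster.
        b = min(sum(d) / len(d) for lab, d in dists.items() if lab != label)
        scores.append((b - a) / max(a, b))
    return scores


# Hypothetical 1-D toy data: two tight, well-separated clusters.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = [0, 0, 1, 1]

scores = silhouette_per_sample(points, labels)
overall = sum(scores) / len(scores)  # S: average of s_i over all N samples
print(scores, overall)
```

For the first point, \( a = 1 \) and \( b = (10 + 11)/2 = 10.5 \), so \( s = 9.5/10.5 \); every sample scores close to +1 because the two clusters are far apart relative to their width.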
Silhouette analysis can be performed at the individual sample level and is often used to determine the optimal number of clusters. The Scikit-learn documentation provides a detailed example of this analysis.
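As a rough illustration of using the silhouette score to choose the number of clusters, the sketch below fits scikit-learn's `KMeans` for several values of K and keeps the one with the highest `silhouette_score`. The synthetic data from `make_blobs`, the candidate range of K, and the choice of K-means itself are assumptions for this example, not part of the method described above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data (assumption for illustration): three well-separated blobs.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean s_i over all samples

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"K={k}: silhouette={score:.3f}")
print("Best K by silhouette score:", best_k)
```

On data this cleanly separated the highest score should occur at the true number of blobs; on real data the peak is often less pronounced, which is where the per-sample analysis becomes useful.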