What You Will Learn in This Section
- Why Unsupervised Learning Is Needed
- A Brief Overview of the K-Means Algorithm
Why Do We Need Unsupervised Learning?
Regression and classification algorithms require labeled data. The variables \( X \) are called independent variables, while
the \( y \) variable is called the dependent variable or label. Consider the house price prediction problem: the number of
rooms, area of the house, and locality are possible independent variables, while the house price is the dependent variable.
In a regression problem, the dependent variable is continuous. Since house prices vary over a continuous range, house price prediction is a regression problem.
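To make this concrete, here is a minimal regression sketch using scikit-learn. The feature values, prices, and choice of linear regression are illustrative assumptions, not part of the original example.

```python
# A minimal regression sketch: predict house price from independent variables.
# All numbers below are made up for illustration.
from sklearn.linear_model import LinearRegression

# Independent variables X: [number of rooms, area in sq. ft., locality score]
X = [
    [3, 1200, 7],
    [2,  800, 5],
    [4, 2000, 9],
    [3, 1500, 6],
]
# Dependent variable y: house price (a continuous label)
y = [250_000, 150_000, 420_000, 300_000]

model = LinearRegression()
model.fit(X, y)

# The model outputs a continuous value -- the hallmark of regression.
print(model.predict([[3, 1400, 8]]))
```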
Classification problems, on the other hand, have discrete dependent variables. The simplest case is a binary classification
problem, where the dependent variable has only two classes \( \{ 0,1 \} \). An example is email classification, where class 1 denotes
a spam email and class 0 denotes a non-spam email. The dependent variable can also have more than two classes, in which case
it is a multiclass problem. An example of this is customer classification into premium, good, and bad categories.
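The spam example above can be sketched in the same way. The two features and the toy data here are invented purely for illustration; any classifier would do, and logistic regression is used only as a simple default.

```python
# A minimal binary-classification sketch: spam (1) vs. non-spam (0).
from sklearn.linear_model import LogisticRegression

# Independent variables X: [number of links, count of words like "free"/"winner"]
X = [
    [0, 0], [1, 0], [8, 5], [6, 4],
    [0, 1], [7, 6], [1, 1], [9, 7],
]
# Dependent variable y: discrete classes {0, 1}
y = [0, 0, 1, 1, 0, 1, 0, 1]

clf = LogisticRegression()
clf.fit(X, y)

# The model outputs a discrete class label -- the hallmark of classification.
print(clf.predict([[5, 3]]))  # likely predicts 1 (spam)
```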
Regression and classification problems fall under supervised learning because they rely on a dependent variable. The dependent
variable guides the algorithm to learn patterns in the data. These dependent variables serve as ground truth, and manual labeling
is often required. For example, in email classification, a human annotator manually marks emails as spam or non-spam. The final
labeled dataset is then used to train a machine learning model. Collecting and labeling data at this scale can be expensive and
time-consuming, which makes supervised learning challenging in practice.
What if we do not have a dependent variable in our dataset? Can we still extract meaningful patterns from the independent variables?
The answer is yes. We can use unsupervised learning algorithms to group similar data points into clusters.
Examples of unsupervised learning algorithms include K-Means, Gaussian Mixture Models (GMM), and hierarchical clustering. In this
section, we will take a brief look at the K-Means algorithm.
A Brief Overview of K-Means Clustering
Imagine you are given a dataset containing features such as income, spending capacity, age, and gender, and you need to categorize the data into multiple clusters. The K-Means algorithm provides a solution to this problem. It is one of the simplest yet most effective clustering algorithms. First, we specify the desired number of clusters (K), and the algorithm then assigns a cluster ID to each data point. The algorithm follows three main steps:
- Randomly select K cluster centroids. K is specified by the user and represents the number of clusters.
- Compute the Euclidean distance between each data point and the cluster centroids. Assign each data point to the closest cluster.
- Update the cluster centroids by calculating the mean of all data points within each cluster.
Steps 2 and 3 are repeated until the centroids stop moving (or a maximum number of iterations is reached), as shown in the sketch below.
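The following is a bare-bones NumPy sketch of these steps. It is illustrative rather than production-ready (no k-means++ initialization, no handling of empty clusters), and it assumes the input is an (n, d) NumPy array.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Toy K-Means on an (n, d) array X with k clusters."""
    rng = np.random.default_rng(seed)

    # Step 1: randomly pick K data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iters):
        # Step 2: Euclidean distance from every point to every centroid,
        # then assign each point to its closest cluster.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 3: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids

# Example usage with toy 2-D data:
X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```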
Let's understand this with an example. For simplicity, we will focus on two features:
income and spending capacity (as these are easier to visualize).
Before applying the K-Means algorithm, we normalize the data so that both features are on a comparable scale. The diagram below
illustrates the output of the K-Means algorithm. The x-axis represents normalized income, while the y-axis represents normalized
spending. The K-Means algorithm has identified three clusters (an end-to-end sketch of this pipeline follows the list):
- Cluster 0 consists of individuals with high income and high spending. This segment represents premium customers.
- Cluster 2 consists of individuals with high income but low spending. These can be classified as good customers.
- Cluster 1 consists of individuals with low income and low spending, which is not an ideal customer segment.
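Below is a hedged end-to-end sketch of this segmentation pipeline using scikit-learn. The synthetic income/spending data, the group means, and the scale values are assumptions made purely so the example runs; the original dataset is not shown in this section.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Three synthetic customer groups; columns are (income, spending).
premium = rng.normal([90_000, 70_000], 5_000, size=(50, 2))  # high income, high spending
good    = rng.normal([90_000, 20_000], 5_000, size=(50, 2))  # high income, low spending
low     = rng.normal([30_000, 15_000], 5_000, size=(50, 2))  # low income, low spending
X = np.vstack([premium, good, low])

# Normalize the features before clustering, as described above.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# K = 3 clusters, matching the example.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Map the centroids back to the original units to interpret each segment.
print(scaler.inverse_transform(kmeans.cluster_centers_))
```

Note that K-Means assigns cluster IDs arbitrarily, which is why the clusters in the description above are numbered 0, 2, and 1 rather than in any meaningful order; the interpretation comes from inspecting each cluster's centroid, not from its ID.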