What You Will Learn in This Section
- A brief introduction to the logistic regression model
Logistic regression is a supervised learning algorithm used for classification problems. In classification tasks, the dependent variable has discrete categories. For example, in email spam classification, y = 1 indicates that an email is spam, while y = 0 means it is not spam. Similarly, in credit default prediction, y = 1 represents a customer who has defaulted on credit, whereas y = 0 signifies a non-defaulting customer. In this discussion, we focus on binary classification problems, though logistic regression can also handle multiple classes.
Let's explore the credit default problem in more detail. We have historical customer data, including attributes such as income and credit utilization. The dependent variable, y, has two levels: y = 1 (customer defaulted) and y = 0 (customer did not default). The table below presents some example data points.
| Income | Credit Utilization (%) | Class |
|---|---|---|
| 9 | 12 | 1 |
| 10 | 6 | 0 |
| 8 | 11 | 1 |
| 12 | 8 | 0 |
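To make this concrete, below is a minimal sketch of fitting a logistic regression classifier to the four example rows above. It uses scikit-learn's `LogisticRegression` purely as an illustration; the lesson does not prescribe a particular library, and the new-customer values are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Example rows from the table above: [income, credit utilization (%)]
X = np.array([[9, 12],
              [10, 6],
              [8, 11],
              [12, 8]])
y = np.array([1, 0, 1, 0])  # 1 = defaulted, 0 = did not default

# Fit a logistic regression model to the historical data
model = LogisticRegression()
model.fit(X, y)

# Probability of default (class 1) for a hypothetical new customer
new_customer = np.array([[11, 10]])
print(model.predict_proba(new_customer)[:, 1])
```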
The plot below illustrates the relationship between credit utilization and class labels. It is evident that individuals with higher credit utilization are more likely to default on credit.
Logistic regression fits a sigmoid curve to the data, assigning a probability score to each data point instead of directly predicting the class. The red curve in the plot represents the sigmoid function's values for the given data points. Once the sigmoid curve is fitted, we can choose a threshold (e.g., 0.5) to classify each data point: if the sigmoid value is greater than or equal to the threshold, the point is assigned to the positive class; otherwise, it is assigned to the negative class.
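As a rough illustration of this thresholding step, the sketch below applies the sigmoid function to a few hypothetical model scores and converts the resulting probabilities into class labels with a 0.5 cutoff. The scores are made up for the example; the sigmoid curve itself is covered in detail in the next chapter.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def classify(probabilities, threshold=0.5):
    """Assign class 1 when the probability meets the threshold, else class 0."""
    return (probabilities >= threshold).astype(int)

scores = np.array([-2.0, 0.3, 1.5])   # hypothetical model scores
probs = sigmoid(scores)
print(probs)            # approximately [0.119 0.574 0.818]
print(classify(probs))  # [0 1 1]
```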
Summary of This Lesson
Logistic regression fits a sigmoid curve to the data and generates probability scores for each data point. In the next chapter, we will explore the sigmoid curve in greater detail.