
Mathematical Details of Logistic Regression

What You Will Learn in This Section
  • Logistic Regression Model
  • Cost Function Used in Logistic Regression
  • Gradient Descent Optimizer for Logistic Regression

In this section, we will systematically explore the key components of the Logistic Regression algorithm and then integrate them to understand the complete model. The major components of the Logistic Regression Model are:

  1. Model
  2. Cost Function
  3. Optimizer

1. Model

The model defines the relationship between the dependent variable \(y\) and independent variables \(x\). Each algorithm has its own model equation. Logistic Regression directly models the probability of data points belonging to the positive class. It computes these probabilities in two steps:

  • First, it calculates the logit using Equation (1).
  • Next, it computes the probability by passing the logit values through the sigmoid function.
\begin{align} h(\theta,X) &= \theta_0+\theta_1*x_1+\theta_2*x_2+\dots+\theta_k*x_k \tag{1} \\ P(y=1\mid X,\theta) &= \text{sigmoid}(h(\theta,X)) \tag{2} \end{align} \( \theta_0,\theta_1,\dots,\theta_k \) are learnable parameters. The right-hand side of Equation (1) is called the logit. Equation (2) represents the probability that a data point belongs to class 1 given \(X\) (the input data point) and \(\theta\) (the learned parameters). Combining these equations, the final Logistic Regression model equation is:
\begin{align} \hat{y} = \frac{1}{1+e^{-(\theta_0+\theta_1*x_1+\dots+\theta_k*x_k)}} \tag{3} \end{align}
\( \hat{y} \) is the predicted probability score, which represents the likelihood that a given data point belongs to class 1. A threshold is used to classify each data point as positive or negative. For example, if the probability score is greater than or equal to 0.5, the predicted class is 1; otherwise, it is 0.
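As a minimal sketch, the two-step computation in Equations (1)–(3) can be written in Python. The parameter and feature values below are made-up numbers used only for illustration:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Illustrative learned parameters: intercept theta_0 and theta_1..theta_3
    theta_0 = -1.5
    theta = np.array([0.8, 2.0, -0.4])

    # One input data point with k = 3 feature values (made up)
    x = np.array([1.0, 0.5, 2.0])

    logit = theta_0 + np.dot(theta, x)            # Equation (1): the logit
    y_hat = sigmoid(logit)                        # Equations (2)/(3): probability of class 1
    predicted_class = 1 if y_hat >= 0.5 else 0    # threshold at 0.5
    print(y_hat, predicted_class)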

2. Cost Function

The cost function measures the difference between the actual and predicted values of the dependent variable. The model equation provides the predicted values \( \hat{y} \), while the dataset contains the actual values. Logistic Regression uses the binary cross-entropy loss function, given by: \begin{align} J(\theta_0,\theta_1,\dots,\theta_k) = -\frac{1}{m} \sum_{i=1}^m \left[ y_i*\log(\hat{y_i}) + (1-y_i)*\log(1-\hat{y_i}) \right] \end{align} This function computes the average cost over the entire dataset, where \( m \) is the number of data points.
The core idea behind the cost function is that if the predicted and actual values are close, the cost should be low; otherwise, it should be high. Below are some example cases demonstrating this concept:

  1. Actual class = 1, Predicted Probability = 0.9 \begin{align} \text{error} &= - \{ 1 * \log(0.9) + (1-1)*\log(1-0.9) \} \\ &= 0.105 \end{align}
  2. Actual class = 1, Predicted Probability = 0.1 \begin{align} \text{error} &= - \{ 1 * \log(0.1) + (1-1)* \log(1-0.1) \} \\ &= 2.302 \end{align}
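These costs can be reproduced with a short Python sketch; np.log is the natural logarithm, which is what gives the values 0.105 and 2.302 above:

    import numpy as np

    def binary_cross_entropy(y, y_hat):
        # Average binary cross-entropy cost over m data points
        y = np.asarray(y, dtype=float)
        y_hat = np.asarray(y_hat, dtype=float)
        return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    print(binary_cross_entropy([1], [0.9]))   # ~0.105 (case 1)
    print(binary_cross_entropy([1], [0.1]))   # ~2.302 (case 2)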

3. Optimizer

The optimizer finds the optimal values of parameters \( \theta_0, \theta_1, ..., \theta_k \) to minimize the average cost. In this section, we focus on the gradient descent optimizer in the context of logistic regression.
Gradient descent consists of three main steps:

  1. Parameter Initialization

    All parameters are initialized, typically to zero for simplicity.

  2. Gradient Calculation

    Gradients are computed using the following formulas (a short code sketch after these steps shows the same computation in vectorized form):

    \begin{align} \frac{\partial J}{\partial \theta_0} &= \frac{1}{m} * \sum_{i=1}^m (\hat{y_i} - y_i) \\ \frac{\partial J}{\partial \theta_1} &= \frac{1}{m} * \sum_{i=1}^m (\hat{y_i} - y_i) * x_{i,1} \\ &\;\;\vdots \\ \frac{\partial J}{\partial \theta_k} &= \frac{1}{m} * \sum_{i=1}^m (\hat{y_i} - y_i) * x_{i,k} \end{align}

    where \( x_{i,j} \) denotes the value of feature \( j \) for the \( i \)-th data point.
  3. Updating Model Parameters

    Parameters are updated using the learning rate \( \alpha \):

    \begin{align} \theta_0 &= \theta_0 - \alpha * \frac{\partial J}{\partial \theta_0} \\ \theta_1 &= \theta_1 - \alpha * \frac{\partial J}{\partial \theta_1} \\ &\;\;\vdots \\ \theta_k &= \theta_k - \alpha * \frac{\partial J}{\partial \theta_k} \end{align}
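As a minimal sketch of steps 2 and 3 for a single iteration, assume the features are stored in a NumPy array X of shape (m, k) and the labels in a 0/1 vector y of length m (these names and shapes are illustrative assumptions, not fixed by the text above):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_descent_step(theta_0, theta, X, y, alpha):
        # Step 2: compute gradients of the average cost J
        m = X.shape[0]
        y_hat = sigmoid(theta_0 + X @ theta)     # predicted probabilities for all m points
        error = y_hat - y                        # (y_hat_i - y_i) terms
        d_theta_0 = np.sum(error) / m            # dJ/d(theta_0)
        d_theta = X.T @ error / m                # dJ/d(theta_j), one entry per feature
        # Step 3: update parameters with learning rate alpha
        theta_0 = theta_0 - alpha * d_theta_0
        theta = theta - alpha * d_theta
        return theta_0, theta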

Bringing It All Together

We have examined the model equation and objective function used in Logistic Regression. We then explored how Gradient Descent iteratively determines the model parameters.
The entire process is summarized in the following Python implementation, written as a minimal sketch that assumes the features are stored in an (m, k) NumPy array X and the labels in a 0/1 vector y of length m.

                
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_logistic_regression(X, y, alpha, n=1000):
        m, k = X.shape
        theta_0 = 0.0             # initialize the intercept to zero
        theta = np.zeros(k)       # initialize theta_1..theta_k to zero
        for _ in range(n):        # n = number of iterations
            # compute gradients
            y_hat = sigmoid(theta_0 + X @ theta)
            error = y_hat - y
            d_theta_0 = np.sum(error) / m
            d_theta = X.T @ error / m
            # update parameters
            theta_0 = theta_0 - alpha * d_theta_0
            theta = theta - alpha * d_theta
        return theta_0, theta