What You Will Learn in This Section
- Mathematical equations for each component of the Linear Regression Model
- Detailed explanation of the Gradient Descent algorithm with mathematical derivations
- End-to-end understanding of the complete algorithm
In the previous section, we discussed how Gradient Descent can be used to find the optimal values for the model parameters, and we solved a simple objective function with it.
If you need to build a more solid understanding of Gradient Descent, revisit that section before continuing.
In this section, we will see how the model, loss function, and Gradient Descent come together to train a Linear Regression model.
The Linear Regression algorithm consists of three major steps:
1. Define the Model Equation: This equation is used to make predictions on new data and also identifies the model parameters. Equation 1 defines the Linear Regression model, where \( a_1 \) and \( b \) are the model parameters: \begin{align} \hat{y}_i = a_1 \cdot x_i + b \end{align}
Here, \( \hat{y}_i \) represents the predicted target value for a given input \( x_i \).
2. Define the Cost Function: The cost function measures how close the predicted values are to the actual target values. Linear Regression uses the Least Squares cost function, and training aims to make this prediction error as small as possible. Equation 2 defines the cost function: \begin{align} J(a_1, b) = \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2 \end{align}
3. Optimization Algorithm: Gradient Descent is used to optimize the model; it iteratively updates the model parameters to minimize the cost function.
The optimization process consists of three main steps:
I. Initialization: The model parameters are initialized, typically with zeros or small random values. The choice of initial values can affect how quickly Gradient Descent converges. For this example, let's start with \( a_1 = 0 \) and \( b = 0 \).
II. Compute Gradients: The gradient of the cost function is computed with respect to each parameter. For a single data point (\( N = 1 \)), the cost function simplifies to: \begin{align} J(a_1, b) = \frac{1}{2} \left( y - a_1 \cdot x - b \right)^2 \end{align} The gradients follow from the chain rule (using \( \hat{y} = a_1 \cdot x + b \)): \begin{align} \frac{\partial J(a_1, b)}{\partial a_1} &= - (y - \hat{y}) \cdot x \\ \frac{\partial J(a_1, b)}{\partial b} &= - (y - \hat{y}) \end{align} For \( N \) data points, we compute the gradients over all samples and take the average: \begin{align} \partial a_1 &= \frac{-1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right) \cdot x_i \\ \partial b &= \frac{-1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right) \end{align}
III. Update Step: Using the computed gradients, we update the parameter values: \begin{align} a_1 &= a_1 - \alpha \cdot \partial a_1 \\ b &= b - \alpha \cdot \partial b \end{align} Here, \( \alpha \) is the learning rate, which controls the size of each step towards convergence. Selecting an appropriate learning rate is crucial and will be discussed under hyperparameter tuning. A minimal code sketch of these steps is given right after this list.
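Before putting everything together, here is a minimal NumPy sketch of these building blocks. The function names (`predict`, `least_squares_cost`, `compute_gradients`, `update_parameters`) and the choice of NumPy are illustrative assumptions, not a fixed implementation.

```python
import numpy as np

def predict(x, a1, b):
    """Model equation (Equation 1): y_hat = a1 * x + b."""
    return a1 * x + b

def least_squares_cost(x, y, a1, b):
    """Least Squares cost (Equation 2): J = 1/(2N) * sum((y - y_hat)^2)."""
    y_hat = predict(x, a1, b)
    return np.mean((y - y_hat) ** 2) / 2.0

def compute_gradients(x, y, a1, b):
    """Averaged gradients of the cost with respect to a1 and b."""
    y_hat = predict(x, a1, b)
    da1 = -np.mean((y - y_hat) * x)   # dJ/da1
    db = -np.mean(y - y_hat)          # dJ/db
    return da1, db

def update_parameters(a1, b, da1, db, alpha):
    """One Gradient Descent update step with learning rate alpha."""
    return a1 - alpha * da1, b - alpha * db
```

Each function maps one-to-one onto the equations above; the training loop at the end of this section performs the same computations inline.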
Now, Let's Put Everything Together
We have defined the model equation and the objective function used in Linear Regression. Then, we explored how the Gradient Descent optimizer iteratively updates the model parameters.
The Gradient Descent process is summarized in the following Python code (the small dataset here is only a placeholder for illustration):

import numpy as np

# Placeholder training data; replace with your own dataset.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])  # roughly y = 2x + 1

a1, b = 0.0, 0.0   # initialize parameters
alpha = 0.05       # learning rate
n = 1000           # number of iterations

for i in range(n):
    y_hat = a1 * x + b                # predictions with current parameters
    da1 = -np.mean((y - y_hat) * x)   # gradient with respect to a1
    db = -np.mean(y - y_hat)          # gradient with respect to b
    a1 = a1 - alpha * da1             # update parameters
    b = b - alpha * db

print(a1, b)   # for this toy dataset, approaches a1 ≈ 2, b ≈ 1
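Once training finishes, the learned parameters plug straight back into the model equation (Equation 1) to make predictions on new data. A small illustrative example, assuming `a1` and `b` hold values close to those produced by the loop above:

```python
import numpy as np

# Assumed values, roughly what the loop above learns on the placeholder data.
a1, b = 2.0, 1.0

x_new = np.array([6.0, 7.5, 10.0])   # unseen inputs
y_pred = a1 * x_new + b              # apply the model equation: y_hat = a1 * x + b
print(y_pred)                        # [13. 16. 21.]
```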