
Probabilistic Interpretation of Linear Regression

Key Takeaways
  • Understanding the probabilistic assumptions behind Linear Regression
  • Derivation of the Least-Squares Cost Function using Maximum Likelihood Estimation
  • Connecting the Gaussian Noise Assumption with Least-Squares Optimization

In this section we explain why the least-squares cost function is a reasonable choice for linear regression.

1. Assumption of a Linear Relationship

In a regression problem, we assume that the target variable \( y^{(i)} \) and input \( x^{(i)} \) are related by:
\[ y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)} \] where \( \epsilon^{(i)} \) is an error term that captures unmodeled effects or random noise.
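
As a concrete illustration, here is a minimal sketch that simulates data from this model. The variable names (`theta_true`, `sigma`) and the specific values are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 200, 3                             # hypothetical dataset size and feature count
theta_true = np.array([2.0, -1.0, 0.5])   # hypothetical "true" parameter vector
sigma = 0.3                               # standard deviation of the noise

X = rng.normal(size=(n, d))               # rows are the inputs x^{(i)}
epsilon = rng.normal(0.0, sigma, size=n)  # IID error terms epsilon^{(i)}
y = X @ theta_true + epsilon              # y^{(i)} = theta^T x^{(i)} + epsilon^{(i)}
```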

2. Gaussian Noise Assumption

We assume the errors \( \epsilon^{(i)} \) are independently and identically distributed (IID) according to a Gaussian distribution:
\[ \epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2) \] which implies that the conditional probability of \( y^{(i)} \) given \( x^{(i)} \) and parameterized by \( \theta \) follows:
\[ p(y^{(i)} | x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \]
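
Building on the synthetic data above, a minimal sketch of this conditional density might look as follows; the helper name `gaussian_likelihood` is assumed for illustration.

```python
def gaussian_likelihood(y_i, x_i, theta, sigma):
    """p(y^{(i)} | x^{(i)}; theta) under the Gaussian noise assumption."""
    residual = y_i - x_i @ theta
    return np.exp(-residual**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Density of the first training example under the true parameters (illustrative).
p_first = gaussian_likelihood(y[0], X[0], theta_true, sigma)
```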

3. Maximum Likelihood Estimation

Given the dataset \( X \), the likelihood function is:
Given the \( n \) training examples, and using the independence of the errors, the likelihood of \( \theta \) factorizes as:
\[ L(\theta) = \prod_{i=1}^{n} p(y^{(i)} | x^{(i)}; \theta) \] Taking the log-likelihood:
\[ \ell(\theta) = \sum_{i=1}^{n} \left( \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) \]
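
Continuing the sketch, the log-likelihood can be evaluated directly from this expression; the function name `log_likelihood` is an assumed helper.

```python
def log_likelihood(theta, X, y, sigma):
    """ell(theta): sum over i of log p(y^{(i)} | x^{(i)}; theta)."""
    residuals = y - X @ theta
    n = len(y)
    return n * np.log(1.0 / np.sqrt(2 * np.pi * sigma**2)) \
        - np.sum(residuals**2) / (2 * sigma**2)
```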

4. Deriving the Least-Squares Cost Function

The first term inside the sum, \( \log \frac{1}{\sqrt{2\pi\sigma^2}} \), does not depend on \( \theta \), so maximizing the log-likelihood \( \ell(\theta) \) is equivalent to minimizing the sum of squared errors:
\[ J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (y^{(i)} - \theta^T x^{(i)})^2 \] Thus, least-squares regression corresponds to Maximum Likelihood Estimation (MLE) of \( \theta \) under the Gaussian noise assumption. Note that the resulting estimate of \( \theta \) does not depend on the value of \( \sigma^2 \).
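
To see this equivalence numerically, the sketch below compares the closed-form least-squares solution with a numerical maximizer of the log-likelihood, assuming the synthetic data and the `log_likelihood` helper defined above (and that SciPy is available).

```python
from scipy.optimize import minimize

# theta that minimizes J(theta): the ordinary least-squares solution.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# theta that maximizes ell(theta), found by numerically minimizing -ell(theta).
theta_mle = minimize(lambda t: -log_likelihood(t, X, y, sigma),
                     x0=np.zeros(X.shape[1])).x

print(np.allclose(theta_ls, theta_mle, atol=1e-4))  # True: the two estimates agree
```

Both routes recover essentially the same parameter vector, which is exactly what the derivation above predicts.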