Key Takeaways
- Understanding the probabilistic assumptions behind Linear Regression
- Deriving the Least-Squares Cost Function using Maximum Likelihood Estimation
- Connecting the Gaussian Noise Assumption to Least-Squares Optimization
1. Assumption of a Linear Relationship
In a regression problem, we assume that the target variable \( y^{(i)} \) and input \( x^{(i)} \) are related by:
\[
y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}
\]
where \( \epsilon^{(i)} \) is an error term that captures unmodeled effects or random noise.
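To make this assumption concrete, here is a minimal NumPy sketch that generates synthetic data according to \( y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)} \). The dimensions, the "true" \( \theta \), and the noise scale are illustrative choices, and drawing \( \epsilon^{(i)} \) from a Gaussian simply anticipates the assumption formalized in the next section.

```python
import numpy as np

# Illustrative synthetic setup: 100 examples, an intercept column plus 2 features.
rng = np.random.default_rng(0)
n, d = 100, 3
theta_true = np.array([1.5, -2.0, 0.7])                        # assumed "true" parameters

X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])  # inputs x^{(i)} (with intercept)
eps = rng.normal(0.0, 0.5, size=n)                             # error terms epsilon^{(i)}
y = X @ theta_true + eps                                       # y^{(i)} = theta^T x^{(i)} + epsilon^{(i)}
```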
2. Gaussian Noise Assumption
We assume the errors \( \epsilon^{(i)} \) are independently and identically distributed (IID) according to a Gaussian distribution:
\[
\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)
\]
which implies that the conditional density of \( y^{(i)} \) given \( x^{(i)} \), parameterized by \( \theta \), is:
\[
p(y^{(i)} | x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)
\]
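As a minimal sketch of the density above, the following function evaluates \( p(y^{(i)} \mid x^{(i)}; \theta) \) for a single example; the function name and the example values are purely illustrative.

```python
import numpy as np

def gaussian_conditional_density(y_i, x_i, theta, sigma):
    """Evaluate p(y^{(i)} | x^{(i)}; theta) under Gaussian noise with standard deviation sigma."""
    mean = theta @ x_i                                          # theta^T x^{(i)}
    return np.exp(-(y_i - mean) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Illustrative call: density of observing y = 1.2 at x = [1.0, 0.3] with theta = [0.5, 2.0]
p = gaussian_conditional_density(1.2, np.array([1.0, 0.3]), np.array([0.5, 2.0]), sigma=0.5)
```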
3. Maximum Likelihood Estimation
Given the design matrix \( X \) (whose rows are the \( x^{(i)} \)) and the corresponding targets, the independence of the errors lets us write the likelihood of \( \theta \) as a product over examples:
\[
L(\theta) = \prod_{i=1}^{n} p(y^{(i)} | x^{(i)}; \theta)
\]
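A direct translation of this product into code looks like the sketch below; the variable names are illustrative, and `X`, `y`, and `sigma` are assumed to come from a setup like the one in Section 1.

```python
import numpy as np

def likelihood(theta, X, y, sigma):
    """L(theta) = product over i of p(y^{(i)} | x^{(i)}; theta), using the IID Gaussian assumption."""
    residuals = y - X @ theta
    densities = np.exp(-residuals ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return np.prod(densities)                                   # product of the n per-example densities
```

In practice this product of many small densities underflows to zero in floating point for even moderately large \( n \), which is one practical reason, besides algebraic convenience, to work with the log-likelihood instead.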
Since the logarithm is strictly increasing, maximizing \( L(\theta) \) is equivalent to maximizing the log-likelihood, which is easier to work with:
\[
\ell(\theta) = \sum_{i=1}^{n} \left( \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)
\]
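A sketch of the log-likelihood as a function of \( \theta \), with \( \sigma \) treated as a known constant (names and values are illustrative):

```python
import numpy as np

def log_likelihood(theta, X, y, sigma):
    """ell(theta): a constant term plus the negative, scaled sum of squared residuals."""
    residuals = y - X @ theta
    n = len(y)
    return n * np.log(1.0 / np.sqrt(2 * np.pi * sigma ** 2)) - np.sum(residuals ** 2) / (2 * sigma ** 2)
```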
4. Deriving the Least-Squares Cost Function
The first term inside the sum does not depend on \( \theta \), and \( \sigma^2 \) is a fixed constant, so maximizing the log-likelihood \( \ell(\theta) \) is equivalent to minimizing the sum of squared errors:
\[
J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (y^{(i)} - \theta^T x^{(i)})^2
\]
Thus, least-squares regression corresponds to the maximum likelihood estimate (MLE) of \( \theta \) under the IID Gaussian noise assumption; notably, this estimate does not depend on the value of \( \sigma^2 \).
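The equivalence can be checked numerically: fitting \( \theta \) by ordinary least squares and by maximizing the log-likelihood with a generic optimizer should return the same estimate, up to solver tolerance. The sketch below assumes synthetic data like that generated in Section 1 and a fixed \( \sigma \); all specific values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# Re-create the illustrative synthetic data from Section 1.
rng = np.random.default_rng(0)
n, d = 100, 3
theta_true = np.array([1.5, -2.0, 0.7])
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])
y = X @ theta_true + rng.normal(0.0, 0.5, size=n)

# Least-squares estimate: minimizes J(theta) = (1/2) * sum of squared errors.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Numerical MLE: minimize the negative log-likelihood with sigma held fixed.
sigma = 0.5
def neg_log_likelihood(theta):
    residuals = y - X @ theta
    return -(n * np.log(1.0 / np.sqrt(2 * np.pi * sigma ** 2)) - np.sum(residuals ** 2) / (2 * sigma ** 2))

theta_mle = minimize(neg_log_likelihood, x0=np.zeros(d)).x
print(np.allclose(theta_ls, theta_mle, atol=1e-4))              # True: the two estimates coincide
```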