Machine Learning: Linear Regression
Let $y$ be a dependent variable of a feature vector $x = (x_1, …, x_n)$.
Problem: Given a training set $\langle x^{(i)}, y^{(i)} \rangle$, $1 \le i \le m$, predict the value of $y$ for any input vector $x$.
We solve this problem by constructing a hypothesis function $h_\theta(x)$ using one of the methods below.
Notation
$m$: number of training examples; $n$: number of features.
$x^{(i)}$: feature vector of the $i$-th training example; $y^{(i)}$: its target value; $x^{(i)}_j$: value of feature $j$ in the $i$-th example.
$\theta = (\theta_0, …, \theta_n)^T$: parameter vector; $\alpha$: learning rate; $\lambda$: regularization parameter; $\mu_j$, $\sigma_j$: mean and standard deviation of feature $j$.
Optimization Objective
Step 1. Normalize each feature $(x^{(1)}_j, …, x^{(m)}_j)$, $1 \le j \le n$, by its mean $\mu_j$ and standard deviation $\sigma_j$:

$$x^{(i)}_j \leftarrow \frac{x^{(i)}_j - \mu_j}{\sigma_j}$$

Step 2. Minimize the cost function

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)}_1 + … + \theta_n x^{(i)}_n - y^{(i)} \right)^2$$

where $\theta = (\theta_0, …, \theta_n)^T$.

Step 3. Compute the hypothesis function as

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + … + \theta_n x_n$$

where the input vector $x$ is normalized using the same values of $\mu_j$ and $\sigma_j$ as in Step 1.
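
To make Steps 1 through 3 concrete, here is a minimal NumPy sketch; the function names (`normalize`, `cost`, `predict`) and the toy data are illustrative choices, not part of any particular library.

```python
import numpy as np

def normalize(X):
    """Step 1: scale each feature column by its mean and standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

def cost(theta, X, y):
    """Step 2: squared-error cost J(theta); X is the design matrix with a leading column of ones."""
    m = len(y)
    residuals = X @ theta - y
    return residuals @ residuals / (2 * m)

def predict(theta, x, mu, sigma):
    """Step 3: hypothesis h_theta(x) for a raw (unnormalized) input vector x."""
    x_norm = (x - mu) / sigma
    return theta[0] + x_norm @ theta[1:]

# Toy training set: m = 4 examples, n = 2 features.
X_raw = np.array([[1.0, 50.0], [2.0, 60.0], [3.0, 80.0], [4.0, 90.0]])
y = np.array([10.0, 14.0, 20.0, 24.0])

X_norm, mu, sigma = normalize(X_raw)
X = np.hstack([np.ones((len(y), 1)), X_norm])  # prepend x_0 = 1
theta = np.zeros(X.shape[1])                   # initial parameters, to be fitted by the methods below
print(cost(theta, X, y), predict(theta, X_raw[0], mu, sigma))
```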
Gradient Descent
Gradient descent is a method for finding the (global) minimum of the cost function $J(\theta)$. For linear regression $J(\theta)$ is convex, so a local minimum is also the global one. There are a few ways to implement this method.
Direct method
Choose a small learning rate $\alpha > 0$ and find the fixed point of the function

$$F(\theta) = \theta - \alpha \nabla J(\theta)$$

that is, repeat the update $\theta := \theta - \alpha \nabla J(\theta)$ until $\theta$ stops changing.
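
A sketch of this direct iteration in NumPy; it assumes the vectorized gradient $\nabla J(\theta) = \frac{1}{m} X^T (X\theta - y)$ (the componentwise form is given in the next subsection), and the stopping tolerance and iteration cap are arbitrary illustrative values.

```python
import numpy as np

def gradient(theta, X, y):
    """Gradient of J(theta): (1/m) * X^T (X theta - y)."""
    return X.T @ (X @ theta - y) / len(y)

def gradient_descent(X, y, alpha=0.1, tol=1e-9, max_iters=10_000):
    """Repeat theta := theta - alpha * grad J(theta) until the step becomes negligible."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iters):
        step = alpha * gradient(theta, X, y)
        theta -= step
        if np.linalg.norm(step) < tol:  # approximate fixed point of F(theta)
            break
    return theta

# Toy usage: fit y = 1 + 2x on four points (design matrix already has the x_0 = 1 column).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(gradient_descent(X, y))  # converges to approximately [1, 2]
```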
Optimized method
Many mathematical software packages already include optimized minimization routines (refinements of gradient descent) that choose the learning rate $\alpha$ automatically. These methods accept the cost function $J(\theta)$ and its gradient $\nabla J(\theta)$ as arguments, which for linear regression is computed as follows:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_j, \qquad 0 \le j \le n, \quad x^{(i)}_0 = 1$$
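
For example, in Python such a routine is available as `scipy.optimize.minimize`; the choice of SciPy, the `L-BFGS-B` method, and the toy data below are illustrative assumptions rather than something prescribed by these notes.

```python
import numpy as np
from scipy.optimize import minimize

def cost(theta, X, y):
    """Squared-error cost J(theta)."""
    r = X @ theta - y
    return r @ r / (2 * len(y))

def gradient(theta, X, y):
    """Gradient of J(theta) in vectorized form: (1/m) * X^T (X theta - y)."""
    return X.T @ (X @ theta - y) / len(y)

# Toy problem: design matrix with a leading column of ones, targets y = 1 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# The solver chooses the step size itself; we only supply J(theta) and its gradient.
result = minimize(cost, x0=np.zeros(X.shape[1]), args=(X, y),
                  jac=gradient, method="L-BFGS-B")
print(result.x)  # approximately [1, 2]
```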
Normal Equation
Unlike Gradient Descent, this method requires neither feature normalization (Step 1) nor a convergence loop. The normal equation gives the closed-form solution to linear regression:

$$\theta = (X^T X)^{-1} X^T y$$

where $X$ is the $m \times (n + 1)$ design matrix whose $i$-th row is $(1, x^{(i)}_1, …, x^{(i)}_n)$ and $y = (y^{(1)}, …, y^{(m)})^T$.
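
A minimal NumPy sketch of the normal equation on the same toy data; it solves the linear system $(X^T X)\theta = X^T y$ with `np.linalg.solve` instead of forming the inverse explicitly, which is a common numerical choice rather than part of the formula itself.

```python
import numpy as np

# Design matrix X with x_0 = 1 in the first column and targets y = 1 + 2x (toy data).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Solve (X^T X) theta = X^T y rather than computing the matrix inverse.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [1, 2]
```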
Regularization
In case of overfitting (for example, after introducing polynomial features), both methods can be adjusted by regularization as follows. Let $\lambda > 0$ and let $E$ be the $(n + 1) \times (n + 1)$ diagonal matrix

$$E = \mathrm{diag}(0, 1, 1, …, 1)$$

so that $\theta_0$ is not regularized. Then the cost function for gradient descent becomes

$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$

and the normal equation becomes

$$\theta = (X^T X + \lambda E)^{-1} X^T y$$
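
A corresponding sketch of the regularized normal equation, using the same toy data and an arbitrary illustrative value $\lambda = 1$.

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
lam = 1.0  # illustrative regularization parameter lambda

# E is the identity matrix with a 0 in the top-left corner, so theta_0 is not penalized.
E = np.eye(X.shape[1])
E[0, 0] = 0.0

theta = np.linalg.solve(X.T @ X + lam * E, X.T @ y)
print(theta)  # the slope theta_1 is shrunk relative to the unregularized solution [1, 2]
```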