Let $y$ be a variable that depends on a feature vector $x = (x_1, …, x_n)$.

Problem: Given a training set $\langle x^{(i)}, y^{(i)} \rangle$, $1 \le i \le m$, predict the value of $y$ for any input vector $x$.

We solve this problem by constructing a hypothesis function $h_\theta(x)$ using one of the methods below.

Optimization Objective

Step 1. Normalize each feature $(x^{(1)}_j, …, x^{(m)}_j)$, $1 \le j \le n$, by its mean $\mu_j$ and standard deviation $\sigma_j$:

$$x^{(i)}_j := \frac{x^{(i)}_j - \mu_j}{\sigma_j}$$
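A minimal sketch of this normalization step with NumPy; the 4×2 training matrix is a made-up example, not from the notes:

```python
import numpy as np

# Hypothetical training matrix: m = 4 examples, n = 2 features.
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 3.0],
              [1416.0, 2.0]])

mu = X.mean(axis=0)         # per-feature mean mu_j
sigma = X.std(axis=0)       # per-feature standard deviation sigma_j
X_norm = (X - mu) / sigma   # z-score normalization

# Each normalized column now has mean ~0 and standard deviation ~1.
```

Keep `mu` and `sigma` around: the same values are reused in Step 3 to normalize new inputs.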

Step 2. Minimize the cost function

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

where $\theta = (\theta_0, …, \theta_n)^T$
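A small sketch of evaluating this cost function $J(\theta)$ in NumPy, assuming the design matrix already has the bias column of ones prepended; the toy data is hypothetical:

```python
import numpy as np

def cost(theta, X, y):
    """Mean squared error cost: J(theta) = 1/(2m) * sum((X @ theta - y)^2)."""
    m = len(y)
    residuals = X @ theta - y
    return residuals @ residuals / (2 * m)

# Toy data for h(x) = theta_0 + theta_1 * x_1, bias column prepended.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

print(cost(np.array([0.0, 1.0]), X, y))  # exact fit, so the cost is 0.0
```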

Step 3. Compute the hypothesis function as

$$h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + … + \theta_n x_n, \qquad x_0 = 1,$$

where the vector $x$ is normalized using the same values of $\mu$ and $\sigma$ as in Step 1.
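A sketch of making a prediction for a new input, reusing the training-time $\mu$ and $\sigma$; the parameter values here are hypothetical, not fitted:

```python
import numpy as np

# Hypothetical fitted parameters and the mu/sigma saved from Step 1.
theta = np.array([2.0, 0.5])   # (theta_0, theta_1)
mu = np.array([3.0])
sigma = np.array([1.5])

def predict(x, theta, mu, sigma):
    """h_theta(x): normalize x with the training mu/sigma, prepend the bias 1."""
    x_norm = (np.asarray(x) - mu) / sigma
    return theta @ np.concatenate(([1.0], x_norm))

print(predict([4.5], theta, mu, sigma))  # (4.5-3)/1.5 = 1, so 2 + 0.5*1 = 2.5
```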

Gradient descent is a method for finding the (global) minimum of the cost function $J(\theta)$; for linear regression $J(\theta)$ is convex, so any local minimum is global. There are a few ways to implement this method.

Direct method

Choose a small learning rate $\alpha > 0$ and find the fixed point of the map

$$\theta := \theta - \alpha \nabla J(\theta),$$

i.e., repeat the update until $\theta$ stops changing.
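A minimal sketch of this direct iteration, assuming a fixed learning rate and a fixed iteration count in place of a convergence test; the toy data is hypothetical:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=500):
    """Iterate theta := theta - alpha * grad J(theta) from theta = 0."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # gradient of the squared-error cost
        theta = theta - alpha * grad
    return theta

# Toy problem: y = 1 + 2*x, with the bias column already prepended.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(gradient_descent(X, y))  # converges to approximately [1, 2]
```

In practice the loop would stop when the update falls below a tolerance rather than after a fixed number of iterations.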

Optimized method

Many mathematical software packages already include implementations of gradient descent that choose the learning rate $\alpha$ automatically. These methods accept the cost function $J(\theta)$ and its gradient $\nabla J(\theta)$ as arguments; for linear regression the gradient is

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}_j, \qquad 0 \le j \le n.$$
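As one concrete example of such a package (SciPy is an assumption here, not named in the notes), `scipy.optimize.minimize` takes $J(\theta)$ and $\nabla J(\theta)$ and picks step sizes itself:

```python
import numpy as np
from scipy.optimize import minimize  # one off-the-shelf optimizer

def cost(theta, X, y):
    r = X @ theta - y
    return r @ r / (2 * len(y))

def grad(theta, X, y):
    return X.T @ (X @ theta - y) / len(y)

# Toy problem: y = 1 + 2*x, bias column prepended.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

# BFGS chooses the step size itself; we only supply J and its gradient.
res = minimize(cost, x0=np.zeros(2), args=(X, y), jac=grad, method="BFGS")
print(res.x)  # approximately [1, 2]
```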

Normal Equation

Unlike gradient descent, this method requires neither feature normalization (Step 1) nor a convergence loop. The normal equation gives the closed-form solution to linear regression:

$$\theta = (X^T X)^{-1} X^T y,$$

where $X$ is the $m \times (n+1)$ design matrix whose $i$-th row is $(1, x^{(i)}_1, …, x^{(i)}_n)$ and $y = (y^{(1)}, …, y^{(m)})^T$.
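A sketch of the normal equation in NumPy on the same kind of toy data; solving the linear system is preferred over forming the inverse explicitly:

```python
import numpy as np

# Design matrix with the bias column, for the toy data y = 1 + 2*x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

# theta = (X^T X)^{-1} X^T y, computed via a linear solve.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [1, 2] exactly recovers y = 1 + 2x
```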

Regularization

In case of overfitting (which often appears after introducing polynomial features), both methods can be regularized by adjusting the equations as follows. Let $\lambda > 0$ and let $E$ be the $(n+1) \times (n+1)$ diagonal matrix

$$E = \mathrm{diag}(0, 1, …, 1),$$

i.e., the identity matrix with its first diagonal entry set to $0$, so that $\theta_0$ is not penalized.

Then the cost function for gradient descent becomes

$$J(\theta) = \frac{1}{2m} \left( \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right)$$

and the normal equation becomes

$$\theta = (X^T X + \lambda E)^{-1} X^T y.$$
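A sketch of the regularized normal equation, with a hypothetical $\lambda = 0.5$ on the same toy data; note how the slope shrinks relative to the unregularized solution $[1, 2]$:

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
lam = 0.5  # hypothetical regularization strength

# E is the identity with its top-left entry zeroed, so theta_0 is not penalized.
E = np.eye(X.shape[1])
E[0, 0] = 0.0

theta = np.linalg.solve(X.T @ X + lam * E, X.T @ y)
print(theta)  # [1.4, 1.6]: theta_1 is shrunk below the unregularized value 2
```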