Let $y$ be a dependent variable of a feature vector $x$

Problem: Given a training set $\langle x^{(i)}, y^{(i)} \rangle$, $1 \le i \le m$, find the value of $y$ on any input vector $x$.

We solve this problem by constructing a hypothesis funciton $h_\theta(x)$ using one of the methods below.

## Optimization Objective

Step 1. Normalize each feature $(x^{(0)}_j, …, x^{(m)}_j)$, $1 \le j \le n$ by mean $\mu$ and standard deviation $\sigma$

Step 2. Minimize the cost function

where $\theta = (\theta_0, …, \theta_n)^T$

Step 3. Compute hypothesis function as

where vector $x$ is normalized using the same values of $\mu$ and $\sigma$ as in Step 1.

Gradient descent is the method for finding (global) minimum of cost funtion $J(\theta)$. There are few ways to implement this method.

### Direct method

Choose small learning rate $\alpha > 0$ and find the fixed point of the function

### Optimized method

Many mathematical software packages already include implementations of gradient descent that compute learning rate $\alpha$ automatically. These methods accept cost function $J(\theta)$ and its gradient $\nabla J(\theta)$ as arguments, which for the linear regression is computed as follows

## Normal Equation

Unlike Gradient Descent this method does not require feature normalization (Step 1) and convergence loop. Normal equation gives the closed-form solution to linear regression

## Regularization

In case of overfitting both methods can be tweaked by introducing polynomial features and adjusting equations as follows. Let $\lambda > 0$ and E be the diagonal matrix

Then the cost function for gradient descent becomes

and normal equation