Let $y$ be a dependent variable of a feature vector $x$.
Problem: Given a training set $\langle x^{(i)}, y^{(i)} \rangle$, $1 \le i \le m$, predict the value of $y$ for any input vector $x$.
We solve this problem by constructing a hypothesis function $h_\theta(x)$ using one of the methods below.
Notation
Here $m$ is the number of training examples, $n$ is the number of features, and $x^{(i)}_j$ is the value of feature $j$ in the $i$-th example. Each example is augmented with an intercept term $x^{(i)}_0 = 1$; $X$ denotes the $m \times (n+1)$ design matrix whose $i$-th row is $x^{(i)T}$, and $y = (y^{(1)}, \ldots, y^{(m)})^T$ is the vector of targets.
Optimization Objective
Step 1. Normalize each feature $(x^{(1)}_j, \ldots, x^{(m)}_j)$, $1 \le j \le n$, by its mean $\mu_j$ and standard deviation $\sigma_j$:
$$x^{(i)}_j := \frac{x^{(i)}_j - \mu_j}{\sigma_j}.$$
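A minimal Python/NumPy sketch of this step, assuming the raw features are stored row-wise in an $m \times n$ array X (the function name is illustrative):

```python
import numpy as np

def normalize_features(X):
    """Scale each column (feature) of the m x n matrix X to zero mean
    and unit standard deviation; returns mu and sigma for later reuse."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma
```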
Step 2. Minimize the cost function
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2,$$
where $\theta = (\theta_0, …, \theta_n)^T$
Step 3. Compute the hypothesis function as
$$h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n,$$
where vector $x$ is normalized using the same values of $\mu$ and $\sigma$ as in Step 1.
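As an illustration, prediction with the learned parameters could look as below; the helper name predict and the arguments theta, mu, sigma (carried over from Steps 1-2) are assumptions of this sketch:

```python
import numpy as np

def predict(x, theta, mu, sigma):
    """Evaluate h_theta(x) for one raw (unnormalized) feature vector x."""
    x_norm = (x - np.asarray(mu)) / np.asarray(sigma)  # same mu, sigma as Step 1
    x_aug = np.concatenate(([1.0], x_norm))            # prepend intercept x_0 = 1
    return theta @ x_aug                               # theta^T x
```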
Gradient Descent
Gradient descent is a method for finding the (global) minimum of the cost function $J(\theta)$; for linear regression $J$ is convex, so any local minimum is global. There are a few ways to implement this method.
Direct method
Choose a small learning rate $\alpha > 0$ and find a fixed point of the map
$$\theta \mapsto \theta - \alpha \nabla J(\theta),$$
that is, iterate $\theta := \theta - \alpha \nabla J(\theta)$ until convergence.
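A minimal sketch of this iteration in Python/NumPy, assuming X is the design matrix with a leading column of ones; the function name, learning rate, and iteration count are illustrative, not prescribed:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1500):
    """Repeat theta := theta - alpha * grad J(theta).
    X is the m x (n+1) design matrix with a leading column of ones."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        grad = (X.T @ (X @ theta - y)) / m   # gradient of the squared-error cost
        theta = theta - alpha * grad         # one gradient step
    return theta
```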
Optimized method
Many mathematical software packages already include implementations of gradient descent that compute the learning rate $\alpha$ automatically. These methods accept the cost function $J(\theta)$ and its gradient $\nabla J(\theta)$ as arguments; for linear regression the gradient is
$$\nabla J(\theta) = \frac{1}{m} X^T (X\theta - y).$$
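A hedged Python sketch of such a cost/gradient pair, together with a call to scipy.optimize.minimize as one possible off-the-shelf optimizer; the function names and the choice of L-BFGS-B are assumptions of this sketch:

```python
import numpy as np
from scipy.optimize import minimize

def cost_and_gradient(theta, X, y):
    """Return J(theta) and grad J(theta) for linear regression.
    X is the m x (n+1) design matrix, y the vector of targets."""
    m = y.shape[0]
    residual = X @ theta - y
    J = (residual @ residual) / (2 * m)   # squared-error cost
    grad = (X.T @ residual) / m           # gradient of the cost
    return J, grad

def fit(X, y):
    """Minimize J with an off-the-shelf optimizer; jac=True tells minimize
    that the objective returns the pair (cost, gradient)."""
    result = minimize(cost_and_gradient, x0=np.zeros(X.shape[1]),
                      args=(X, y), jac=True, method="L-BFGS-B")
    return result.x
```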
Normal Equation
Unlike Gradient Descent, this method does not require feature normalization (Step 1) or a convergence loop. The normal equation gives the closed-form solution to linear regression:
$$\theta = (X^T X)^{-1} X^T y.$$
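A minimal NumPy sketch of this computation; using the pseudo-inverse, which guards against a singular $X^T X$, is a choice of this sketch:

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y for the m x (n+1) design matrix X."""
    return np.linalg.pinv(X.T @ X) @ (X.T @ y)
```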
Regularization
In case of overfitting (for example, after introducing polynomial features), both methods can be tweaked by adjusting the equations as follows. Let $\lambda > 0$ and let $E$ be the $(n+1) \times (n+1)$ diagonal matrix
$$E = \mathrm{diag}(0, 1, \ldots, 1),$$
i.e., the identity matrix with its first diagonal entry set to zero, so that the intercept $\theta_0$ is not regularized. Then the cost function for gradient descent becomes
$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right],$$
and the normal equation becomes
$$\theta = (X^T X + \lambda E)^{-1} X^T y.$$
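A hedged NumPy sketch of both regularized variants; the function names and the handling of $\theta_0$ follow the convention above, and lam stands for $\lambda$:

```python
import numpy as np

def regularized_cost_and_gradient(theta, X, y, lam):
    """Regularized squared-error cost and gradient (theta_0 not penalized)."""
    m = y.shape[0]
    residual = X @ theta - y
    reg = np.concatenate(([0.0], theta[1:]))   # zero out the intercept term
    J = (residual @ residual + lam * (reg @ reg)) / (2 * m)
    grad = (X.T @ residual + lam * reg) / m
    return J, grad

def regularized_normal_equation(X, y, lam):
    """theta = (X^T X + lambda * E)^{-1} X^T y with E = diag(0, 1, ..., 1)."""
    E = np.eye(X.shape[1])
    E[0, 0] = 0.0                              # do not regularize theta_0
    return np.linalg.pinv(X.T @ X + lam * E) @ (X.T @ y)
```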