Let $y$ be a dependent variable of a feature vector $x = (x_1, …, x_n)^T$.

Problem: Given a training set $\langle x^{(i)}, y^{(i)} \rangle$, $1 \le i \le m$, predict the value of $y$ for any input vector $x$.

We solve this problem by constructing a hypothesis function $h_\theta(x)$ using one of the methods below.


Optimization Objective

Step 1. Normalize each feature $(x^{(1)}_j, …, x^{(m)}_j)$, $1 \le j \le n$, by its mean $\mu_j$ and standard deviation $\sigma_j$

function [X_norm, mu, sigma] = featureNormalize(X)
  mu = mean(X, 1);             % row vector of per-feature (column) means
  sigma = std(X, 0, 1);        % row vector of per-feature standard deviations
  X_norm = (X - mu) ./ sigma;  % broadcast: normalize each column
end
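For illustration, a worked example on a small made-up matrix (the values are chosen only to make the arithmetic obvious):

```matlab
% Made-up 3x2 feature matrix: three examples, two features
X = [1 200; 2 300; 3 400];
mu = mean(X, 1);              % [2 300]
sigma = std(X, 0, 1);         % [1 100]
X_norm = (X - mu) ./ sigma;   % every column becomes [-1; 0; 1]
```

Each column of X_norm now has zero mean and unit standard deviation.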

Step 2. Minimize the cost function

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

where $\theta = (\theta_0, …, \theta_n)^T$

Step 3. Compute the hypothesis function as

$$h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + … + \theta_n x_n, \qquad x_0 = 1,$$

where vector $x$ is normalized using the same values of $\mu$ and $\sigma$ as in Step 1.
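A prediction sketch under assumed values (theta, mu and sigma below are hypothetical placeholders, not fitted to real data):

```matlab
% Hypothetical fitted parameters for a single-feature model
theta = [340; 110];              % theta_0 (intercept) and theta_1
mu = 2000; sigma = 800;          % normalization constants from Step 1
x_new = 2400;                    % raw (unnormalized) input
x_norm = (x_new - mu) / sigma;   % reuse the same mu and sigma
y_pred = [1, x_norm] * theta;    % h_theta(x) with x_0 = 1; gives 395
```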

Gradient Descent

Gradient descent is a method for finding the (global) minimum of the cost function $J(\theta)$. There are a few ways to implement this method.

Direct method

Choose a small learning rate $\alpha > 0$ and find the fixed point of the function

$$f(\theta) = \theta - \alpha \nabla J(\theta) = \theta - \frac{\alpha}{m} X^T (X\theta - y)$$

by iterating it until convergence, where $X$ is the $m \times (n + 1)$ design matrix whose $i$-th row is $(1, x^{(i)T})$.

function [theta] = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);                         % number of training examples
  for iter = 1:num_iters
    h = X * theta;                       % current predictions
    delta = h - y;                       % prediction errors
    theta = theta - alpha / m * X' * delta;
  end
end
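A run on a tiny made-up dataset generated from $y = 1 + 2x$; the loop below inlines the same update so the sketch is self-contained (the first column of ones corresponds to the intercept $\theta_0$):

```matlab
% Tiny dataset sampled (without noise) from y = 1 + 2*x
X = [1 1; 1 2; 1 3];    % design matrix with a column of ones
y = [3; 5; 7];
theta = zeros(2, 1);
alpha = 0.1;            % learning rate
m = length(y);
for iter = 1:1000
  theta = theta - alpha / m * X' * (X * theta - y);
end
% theta is now close to [1; 2]
```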

Optimized method

Many mathematical software packages already include implementations of gradient descent that choose the learning rate $\alpha$ automatically. These methods accept the cost function $J(\theta)$ and its gradient $\nabla J(\theta)$ as arguments, which for linear regression are computed as follows

function [J, grad] = costFunction(theta, X, y)
  m = length(y);
  h = X * theta;                  % predictions
  delta = h - y;                  % prediction errors
  J = delta' * delta / (2 * m);   % cost
  grad = X' * delta / m;          % gradient
end

function [theta] = gradientDescent(X, y, initial_theta)
  options = optimset('GradObj', 'on', 'MaxIter', 400);
  [theta, cost] = fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);
end
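A quick self-contained sanity check of the fminunc pattern on a one-dimensional quadratic cost (the function below is a stand-in for illustration, not the linear-regression cost):

```matlab
% J(t) = (t - 3)^2 with gradient 2*(t - 3); deal returns both values
f = @(t) deal((t - 3) ^ 2, 2 * (t - 3));
options = optimset('GradObj', 'on', 'MaxIter', 400);
[t_opt, cost] = fminunc(f, 0, options);
% t_opt should be close to the minimizer 3
```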

Normal Equation

Unlike gradient descent, this method requires neither feature normalization (Step 1) nor a convergence loop. The normal equation gives the closed-form solution to linear regression

$$\theta = (X^T X)^{-1} X^T y$$

function [theta] = normalEquation(X, y)
  % pinv also handles the case where X' * X is singular
  theta = pinv(X' * X) * X' * y;
end
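On a tiny made-up dataset sampled from $y = 1 + 2x$, the closed form recovers the parameters in one step:

```matlab
X = [1 1; 1 2; 1 3];             % column of ones for the intercept
y = [3; 5; 7];
theta = pinv(X' * X) * X' * y;   % [1; 2] up to rounding error
```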


Regularization

In case of overfitting (for example, after introducing polynomial features) both methods can be tweaked by adding a regularization term. Let $\lambda > 0$ and $E$ be the diagonal $(n + 1) \times (n + 1)$ matrix

$$E = \mathrm{diag}(0, 1, …, 1),$$

whose first diagonal entry is zero so that the intercept $\theta_0$ is not penalized. Then the cost function for gradient descent becomes

$$J(\theta) = \frac{1}{2m} \left( \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right)$$

and normal equation

$$\theta = (X^T X + \lambda E)^{-1} X^T y$$
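A sketch of the regularized normal equation on a tiny made-up dataset ($\lambda = 1$ is an arbitrary choice; E is the identity with its first diagonal entry zeroed so the intercept is not penalized):

```matlab
X = [1 1; 1 2; 1 3];    % column of ones for the intercept
y = [3; 5; 7];
lambda = 1;
E = eye(size(X, 2));
E(1, 1) = 0;            % do not regularize theta_0
theta = pinv(X' * X + lambda * E) * X' * y;
% the slope is shrunk toward zero compared to the unregularized [1; 2]
```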