Side Notes

never finished…

Machine Learning: Linear Regression

Let $y$ be a dependent variable of a feature vector $x = (x_1, \dots, x_n)$.

Problem: Given a training set $\langle x^{(i)}, y^{(i)} \rangle$, $1 \le i \le m$, predict the value of $y$ for any input vector $x$.

We solve this problem by constructing a hypothesis function $h_\theta(x)$ using one of the methods below.

Notation

Optimization Objective

Step 1. Normalize each feature $(x^{(1)}_j, \dots, x^{(m)}_j)$, $1 \le j \le n$, by subtracting its mean $\mu_j$ and dividing by its standard deviation $\sigma_j$.

featureNormalize.m
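function [X_norm, mu, sigma] = featureNormalize(X)
    % X is m-by-n: one row per training example, one column per feature.
    mu = mean(X, 1);        % 1-by-n row of column means
    sigma = std(X, 0, 1);   % 1-by-n row of sample standard deviations
    sigma(sigma == 0) = 1;  % constant feature: avoid dividing by zero
    X_norm = (X - mu) ./ sigma;  % uses broadcasting (Octave, MATLAB R2016b+)
end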
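
A quick sanity check for Step 1: after normalization every column should have mean 0 and standard deviation 1. The sketch below repeats the computation of featureNormalize inline so that it runs on its own; the data values are made up for illustration.

```octave
% Hypothetical data: 4 examples, 2 features.
X = [2104 3; 1600 3; 2400 4; 1416 2];

mu = mean(X, 1);
sigma = std(X, 0, 1);
X_norm = (X - mu) ./ sigma;   % same computation as featureNormalize

disp(mean(X_norm, 1));     % each entry is 0 up to rounding error
disp(std(X_norm, 0, 1));   % each entry is 1 up to rounding error
```

The returned mu and sigma have to be kept, because any later input must be scaled with these same values.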

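Step 2. Minimize the cost function

$$J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y)$$

where $\theta = (\theta_0, \dots, \theta_n)^T$, $X$ is the $m \times (n+1)$ matrix whose $i$-th row is $(1, x^{(i)}_1, \dots, x^{(i)}_n)$, and $y = (y^{(1)}, \dots, y^{(m)})^T$.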

Step 3. Compute hypothesis function as

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$$

where vector $x$ is normalized using the same values of $\mu$ and $\sigma$ as in Step 1.
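
Step 3 is easy to get wrong in practice: a new input has to be scaled with the training-set $\mu$ and $\sigma$, and a leading 1 has to be prepended for $\theta_0$. A minimal sketch, assuming made-up values for mu, sigma and theta (they are not results from any real data set):

```octave
% Assumed outputs of Steps 1 and 2 (illustrative values only).
mu = [2000 3];
sigma = [500 1];
theta = [300; 100; -20];      % [theta_0; theta_1; theta_2]

x_new = [1500 4];                     % raw, unnormalized input
x_scaled = (x_new - mu) ./ sigma;     % reuse the training mu and sigma
prediction = [1, x_scaled] * theta;   % prepend 1 for the intercept theta_0
disp(prediction);                     % displays 180
```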

Gradient Descent

Gradient descent is a method for finding a minimum of the cost function $J(\theta)$. For linear regression $J$ is convex, so that minimum is global. There are a few ways to implement this method.

Direct method

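Choose a small learning rate $\alpha > 0$ and find the fixed point of the function

$$f(\theta) = \theta - \alpha \nabla J(\theta) = \theta - \frac{\alpha}{m} X^T (X\theta - y)$$

by repeatedly applying $\theta := f(\theta)$.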

gradientDescent.m
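function [theta] = gradientDescent(X, y, theta, alpha, num_iters)
    % X: design matrix with one row per example; y: column vector of targets.
    m = length(y);
    for iter = 1:num_iters
        h = X * theta;                            % predictions for all examples
        delta = h - y;                            % residuals
        theta = theta - X' * delta * alpha / m;   % update every theta_j at once
    end
end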
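
One way to check that $\alpha$ is small enough is to record the cost after every update: it should never increase. The sketch below inlines the same update on made-up data that lies exactly on the line $y = 1 + 2x$; the name J_history and the values of alpha and num_iters are choices made here for illustration.

```octave
% Made-up data lying exactly on y = 1 + 2*x.
x = [0; 1; 2; 3];
y = 1 + 2 * x;
X = [ones(length(y), 1), x];   % leading column of ones for theta_0

m = length(y);
theta = zeros(2, 1);
alpha = 0.1;
num_iters = 2000;
J_history = zeros(num_iters, 1);

for iter = 1:num_iters
    delta = X * theta - y;
    theta = theta - X' * delta * alpha / m;
    J_history(iter) = sum((X * theta - y) .^ 2) / (2 * m);   % cost after the update
end

disp(theta');   % close to [1 2]
```

If J_history grows instead of shrinking, the learning rate is too large for the data and should be reduced.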

Optimized method

Many mathematical software packages already include gradient-based optimization routines that choose the step size automatically, so no learning rate $\alpha$ has to be picked by hand. These routines accept the cost function $J(\theta)$ and its gradient $\nabla J(\theta)$ as arguments, which for linear regression are computed as follows.

costFunction.m
function [J, grad] = costFunction(theta, X, y)
    m = length(y);
    h = X * theta;
    delta = h - y;
    J = delta' * delta / (2 * m);   % half the mean squared residual
    grad = X' * delta / m;          % gradient of J with respect to theta
end
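
Before handing a cost function to an optimizer it can be worth comparing the analytic gradient with a finite-difference estimate. The sketch below does this with anonymous functions that mirror costFunction; the data, the test point theta and the step size are arbitrary choices.

```octave
% Arbitrary small problem: 3 examples, intercept plus one feature.
X = [1 0.5; 1 -1.2; 1 2.0];
y = [1.0; -0.5; 3.0];
m = length(y);

J = @(theta) sum((X * theta - y) .^ 2) / (2 * m);   % same J as costFunction
grad = @(theta) X' * (X * theta - y) / m;           % same gradient as costFunction

theta = [0.3; -0.7];
step = 1e-4;
num_grad = zeros(size(theta));
for j = 1:numel(theta)
    bump = zeros(size(theta));
    bump(j) = step;
    % Central difference approximation of the j-th partial derivative.
    num_grad(j) = (J(theta + bump) - J(theta - bump)) / (2 * step);
end

disp(max(abs(num_grad - grad(theta))));   % should be very small
```

A large difference usually points to a sign error or a missing factor of $1/m$ in the gradient.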
gradientDescent.m
function [theta] = gradientDescent(X, y, initial_theta)
    options = optimset('GradObj', 'on', 'MaxIter', 400);
    [theta, cost] = fminunc(@(t)(costFunction(t, X, y)), initial_theta, options);
end

Normal Equation

Unlike gradient descent, this method requires neither feature normalization (Step 1) nor a convergence loop. The normal equation gives the closed-form solution to linear regression

$$\theta = (X^T X)^{+} X^T y$$

where $A^{+}$ denotes the pseudoinverse of $A$.
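
The closed form can be derived in two lines, assuming the cost $J(\theta) = \frac{1}{2m}(X\theta - y)^T(X\theta - y)$ computed by costFunction.m. Setting its gradient to zero gives

$$\nabla J(\theta) = \frac{1}{m} X^T (X\theta - y) = 0 \quad\Longrightarrow\quad X^T X \, \theta = X^T y$$

and, when $X^T X$ is invertible,

$$\theta = (X^T X)^{-1} X^T y$$

Using the pseudoinverse instead of the inverse extends this to a singular $X^T X$, for example when two features are linearly dependent.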

normalEquation.m
function [theta] = normalEquation(X, y)
    % pinv keeps the solve well defined even when X' * X is singular.
    theta = pinv(X' * X) * X' * y;
end
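
On toy data the normal equation recovers the coefficients in a single step, without scaling the features. The sketch below uses made-up points on the line $y = 1 + 2x$ and, for comparison only, the backslash operator, which also solves the least-squares problem.

```octave
% Made-up data lying exactly on y = 1 + 2*x; no feature normalization applied.
x = [0; 1; 2; 3];
y = 1 + 2 * x;
X = [ones(length(y), 1), x];   % leading column of ones for theta_0

theta = pinv(X' * X) * X' * y;   % normal equation, as in normalEquation.m
theta_ls = X \ y;                % least-squares solve, shown for comparison

disp(theta');      % approximately [1 2]
disp(theta_ls');   % approximately [1 2]
```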

Regularization

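If the model overfits, for example after introducing polynomial features, both methods can be regularized by adjusting the equations as follows. Let $\lambda > 0$ and let $E$ be the $(n+1) \times (n+1)$ diagonal matrix

$$E = \mathrm{diag}(0, 1, \dots, 1)$$

whose zero entry leaves the intercept $\theta_0$ unpenalized.

Then the cost function for gradient descent becomes

$$J(\theta) = \frac{1}{2m} \left[ (X\theta - y)^T (X\theta - y) + \lambda \, \theta^T E \, \theta \right]$$

and normal equation

$$\theta = (X^T X + \lambda E)^{-1} X^T y$$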
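
The effect of $\lambda$ is easiest to see on the normal equation. The sketch below assumes that $E$ is the identity matrix with a zero in the intercept position, so that $\theta_0$ is not penalized; the data and the value of lambda are made up.

```octave
% Made-up noisy data; the value of lambda is arbitrary.
x = [0; 1; 2; 3; 4];
y = [1.1; 2.9; 5.2; 6.8; 9.1];
X = [ones(length(y), 1), x];   % leading column of ones for theta_0
lambda = 10;

% Assumed form of E: identity with a zero in the intercept position.
E = eye(size(X, 2));
E(1, 1) = 0;

theta_plain = pinv(X' * X) * X' * y;            % unregularized solution
theta_reg = (X' * X + lambda * E) \ (X' * y);   % regularized solution

disp([theta_plain, theta_reg]);   % the slope in the second column is smaller
```

Larger values of lambda pull the non-intercept coefficients further toward zero, trading a worse fit on the training set for a smoother hypothesis.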