Logistic regression is the classification counterpart of linear regression, in which the dependent variable $y$ takes binary values.

Problem: Given a training set $\langle x^{(i)}, y^{(i)} \rangle$, $1 \le i \le m$, with $x^{(i)} \in \mathbb{R}^{n+1}$, $x^{(i)} _ 0 = 1$, $y^{(i)} \in $ {0,1}, find a classification function $h_\theta(x)$ that predicts $y$ from $x$.

Gradient Descent

Let’s define the function $h_\theta(x)$ as the sigmoid function of $\theta\cdot x$:

$$h_\theta(x) = g(\theta \cdot x), \qquad g(z) = \frac{1}{1 + e^{-z}}$$

The sigmoid function is applied element-wise, so the same implementation operates on scalars, vectors, and matrices.

function g = sigmoid(z)
% Element-wise logistic function; z may be a scalar, vector, or matrix.
g = 1 ./ (1 + exp(-z));
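As a quick check of the element-wise behavior, the following Octave sketch applies the same function to a scalar, a row vector, and a matrix. The input values are arbitrary illustrations, not taken from the lecture.

```octave
1;  % not a function file: lets Octave define functions inside a script

function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end

s = sigmoid(0);              % scalar input: exactly 0.5
v = sigmoid([-100 0 100]);   % row vector: values very close to [0 0.5 1]
M = sigmoid(zeros(2, 3));    % matrix input: output keeps the 2x3 shape

disp(s); disp(v); disp(size(M));
```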

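To find the optimal parameter $\theta \in \mathbb{R}^{n+1}$ we are going to use an optimized gradient descent method, which takes as arguments the cost function $J(\theta)$ and its gradient. For logistic regression, with $h = g(X\theta)$, they are

$$J(\theta) = -\frac{1}{m} \left[ y^T \log h + (1 - y)^T \log (1 - h) \right], \qquad \nabla J(\theta) = \frac{1}{m} X^T (h - y)$$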

where $X = (x^{(i)}_j) _{m \times (n+1)}$ is the matrix of training examples from the previous lecture.
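The gradient has this simple form because of the identity $g'(z) = g(z)(1 - g(z))$. A short derivation, writing $h^{(i)} = g(\theta \cdot x^{(i)})$ and assuming the standard cross-entropy cost (the one the cost function code in this lecture computes):

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h^{(i)} + (1 - y^{(i)}) \log (1 - h^{(i)}) \right]$$

$$\frac{\partial h^{(i)}}{\partial \theta_j} = h^{(i)} (1 - h^{(i)}) \, x^{(i)}_j$$

$$\frac{\partial J}{\partial \theta_j} = -\frac{1}{m} \sum_{i=1}^{m} \left[ \frac{y^{(i)}}{h^{(i)}} - \frac{1 - y^{(i)}}{1 - h^{(i)}} \right] h^{(i)} (1 - h^{(i)}) \, x^{(i)}_j = \frac{1}{m} \sum_{i=1}^{m} (h^{(i)} - y^{(i)}) \, x^{(i)}_j$$

Stacking the components over $j$ gives the vectorized expression $X^T (h - y) / m$.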

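As with linear regression, logistic regression can be regularized by penalizing $\theta_1, \dots, \theta_n$ (the intercept $\theta_0$ is left out of the penalty):

$$J_{reg}(\theta) = J(\theta) + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2, \qquad \frac{\partial J_{reg}}{\partial \theta_j} = \frac{\partial J}{\partial \theta_j} + \frac{\lambda}{m} \theta_j \quad (j \ge 1)$$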

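function [J, grad] = costFunction(theta, X, y, lambda)
% Regularized logistic regression cost J and its gradient grad.
m = length(y); % number of training examples
h = sigmoid(X * theta); % predicted probabilities, m x 1
% Cross-entropy cost and its gradient (unregularized part)
J = (y' * log(h) + (1 - y)' * log(1 - h)) / -m;
grad = X' * (h - y) / m;
% Regularization: copy theta with the intercept zeroed so it is not penalized
th = theta; th(1) = 0;
J = J + th' * th * lambda / m / 2;
grad = grad + th * lambda / m;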
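The lecture does not name the optimizer. The sketch below assumes Octave's `fminunc` with the `GradObj` option turned on, which accepts an objective that returns both the cost and its gradient. The four-point data set and `lambda = 0.1` are made up for illustration.

```octave
1;  % not a function file: lets Octave define functions inside a script

function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end

function [J, grad] = costFunction(theta, X, y, lambda)
  m = length(y);
  h = sigmoid(X * theta);
  J = (y' * log(h) + (1 - y)' * log(1 - h)) / -m;
  grad = X' * (h - y) / m;
  th = theta; th(1) = 0;                 % do not penalize the intercept
  J = J + th' * th * lambda / m / 2;
  grad = grad + th * lambda / m;
end

% Toy data (assumed): one feature, label is 1 when the feature exceeds 2.5.
X = [ones(4, 1), [1; 2; 3; 4]];          % first column is x_0 = 1
y = [0; 0; 1; 1];
lambda = 0.1;

% GradObj = on tells fminunc that the objective also returns its gradient.
options = optimset('GradObj', 'on', 'MaxIter', 400);
theta0 = zeros(size(X, 2), 1);
[theta, J] = fminunc(@(t) costFunction(t, X, y, lambda), theta0, options);

boundary = -theta(1) / theta(2);         % feature value where h = 0.5
fprintf('cost = %g, decision boundary at x = %g\n', J, boundary);
```

Wrapping `costFunction` in an anonymous function fixes `X`, `y`, and `lambda`, so the optimizer only sees a function of `theta`.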

Having computed $\theta$, we can now implement the prediction function, which thresholds $h_\theta(x)$ at 0.5

function p = predict(theta, X)
p = sigmoid(X * theta) >= 0.5;

which can be used to classify new examples and check the prediction accuracy on the training set

function a = accuracy(theta, X, y)
p = predict(theta, X);
a = mean(double(p == y)) * 100;
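A small worked example of the two functions above, using a hand-picked `theta` and four made-up training examples (both are assumptions for illustration, not results from the lecture):

```octave
1;  % not a function file: lets Octave define functions inside a script

function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end

function p = predict(theta, X)
  p = sigmoid(X * theta) >= 0.5;
end

function a = accuracy(theta, X, y)
  p = predict(theta, X);
  a = mean(double(p == y)) * 100;
end

theta = [-2.5; 1];                 % hand-picked: boundary at feature = 2.5
X = [ones(4, 1), [1; 2; 3; 4]];    % first column is x_0 = 1
y = [0; 1; 1; 1];                  % second example lies on the wrong side

p = predict(theta, X);             % [0; 0; 1; 1]
a = accuracy(theta, X, y);         % 75, since 3 of 4 labels match
disp(a);
```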

Multi-class Classification

Logistic regression works for binary $y$. Suppose now that $y^{(i)} \in ${$1,…,K$}, where $K > 2$. In this case we can use the one-vs-all variation of the algorithm.

Step 1. Convert the vector $y$ into a binary matrix $Y$

$$Y = (y^{(i)}_k) _{m \times K}$$

where $y^{(i)}_k = \delta _{k y^{(i)}}$, i.e. $y^{(i)}_k = 1$ when $y^{(i)} = k$, otherwise $y^{(i)}_k = 0$.
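Step 1 can be sketched in one line of Octave by comparing the label column against the row `1:K`. The label vector here is an assumed example.

```octave
y = [2; 1; 3; 2];                 % example labels (assumed), K = 3 classes
K = 3;

% Compare each label against 1..K; broadcasting yields an m x K matrix.
Y = double(y == (1:K));

disp(Y);
% Row i has a single 1 in column y(i):
%   0 1 0
%   1 0 0
%   0 0 1
%   0 1 0
```

On versions without implicit broadcasting, `bsxfun(@eq, y, 1:K)` should give the same result.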

Step 2. Train a logistic classifier on every column of the matrix $Y$. The result is a matrix $\Theta = (\theta_{jk}) _{(n+1) \times K}$ whose $k$-th column holds the parameters of the classifier for class $k$.
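A minimal sketch of Step 2, assuming `fminunc` as the optimizer and a hypothetical helper named `oneVsAll` (neither is specified in the lecture). The toy data and `lambda = 0.1` are also assumptions.

```octave
1;  % not a function file: lets Octave define functions inside a script

function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end

function [J, grad] = costFunction(theta, X, y, lambda)
  m = length(y);
  h = sigmoid(X * theta);
  J = (y' * log(h) + (1 - y)' * log(1 - h)) / -m;
  grad = X' * (h - y) / m;
  th = theta; th(1) = 0;                 % do not penalize the intercept
  J = J + th' * th * lambda / m / 2;
  grad = grad + th * lambda / m;
end

% Hypothetical helper: trains one regularized classifier per column of Y.
function Theta = oneVsAll(X, Y, lambda)
  K = size(Y, 2);
  Theta = zeros(size(X, 2), K);
  options = optimset('GradObj', 'on', 'MaxIter', 400);
  for k = 1:K
    f = @(t) costFunction(t, X, Y(:, k), lambda);
    Theta(:, k) = fminunc(f, Theta(:, k), options);
  end
end

% Toy data (assumed): one feature, three classes in separate ranges.
X = [ones(6, 1), [1; 2; 5; 6; 9; 10]];
y = [1; 1; 2; 2; 3; 3];
Y = double(y == (1:3));                  % binary matrix from Step 1

Theta = oneVsAll(X, Y, 0.1);
disp(size(Theta));                       % (n+1) x K, here 2 x 3
```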

Step 3. For any given vector $x$, compute the vector $h = x^T \Theta$. The predicted value $y$ is then the index of the largest component of $h$:

$$y = \arg\max _{1 \le k \le K} h_k$$

To compute the accuracy of the one-vs-all classifier on the training set, use the accuracy.m script from above with a modified predict.m

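function p = predict(Theta, X)
% Column k of a holds the class-k score of every example.
a = sigmoid(X * Theta);
% The predicted class is the column index of the row-wise maximum.
[~, p] = max(a, [], 2);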
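To illustrate the multi-class prediction, here is a small example with hand-picked parameters for $K = 3$ classes and a single feature. The values of `Theta`, `X`, and `y` are assumptions chosen so the result is easy to verify by hand.

```octave
1;  % not a function file: lets Octave define functions inside a script

function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end

function p = predict(Theta, X)
  a = sigmoid(X * Theta);
  [~, p] = max(a, [], 2);          % index of the largest score in each row
end

% Column k of Theta scores class k (values are hand-picked, not trained).
Theta = [ 3  1 -3;                 % intercepts
         -1  0  1];                % feature weights
X = [ones(3, 1), [0; 3; 6]];       % first column is x_0 = 1
y = [1; 2; 3];

% X * Theta = [3 1 -3; 0 1 0; -3 1 3], so the row-wise maxima
% fall in columns 1, 2, and 3 respectively.
p = predict(Theta, X);
a = mean(double(p == y)) * 100;    % all three labels match
disp(p'); disp(a);
```

Because the sigmoid is monotonic, taking the maximum of `sigmoid(X * Theta)` selects the same class as taking the maximum of `X * Theta`.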