Motivation. In the input space, draw a line (more generally, a hyperplane) that separates the inputs into two categories. This is achievable using a linear model, much like linear regression.

Discriminant Function

def. Discriminant Function takes an input vector $\mathbf{x}$ and assigns it to one of $K$ classes, $\mathcal{C}_1, \dots, \mathcal{C}_K$.

  • def. Decision Surface is the boundary that splits the classes in input space.
  • def. Decision Region is a region generated by the decision surfaces that corresponds to a single class.

When there are two classes, the discriminant takes the form:

$$y(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + w_0$$

And assign $\mathbf{x}$ to $\mathcal{C}_1$ if $y(\mathbf{x}) \ge 0$, and to $\mathcal{C}_2$ otherwise, where:

  • $\mathbf{w}$ is orthogonal to the decision surface hyperplane (via linear algebra; green in image)
  • $w_0$ determines the distance of the decision surface from the origin (the signed normal distance is $-w_0 / \lVert\mathbf{w}\rVert$).

When there are $K > 2$ classes, we can't just split the input space naively with several two-class discriminants (one-versus-rest or one-versus-one), because such constructions leave ambiguous regions. Therefore we instead introduce one discriminant function per class, $K$ in total, one $y_k$ for each $\mathcal{C}_k$:

$$y_k(\mathbf{x}) = \mathbf{w}_k^\top \mathbf{x} + w_{k0}$$

and then assign $\mathbf{x}$ to class $\mathcal{C}_k$ with $k = \arg\max_j y_j(\mathbf{x})$, i.e. the maximum of all discriminant functions. The pairwise decision surface between classes $\mathcal{C}_k$ and $\mathcal{C}_j$ is where $y_k(\mathbf{x}) = y_j(\mathbf{x})$, i.e.:

$$(\mathbf{w}_k - \mathbf{w}_j)^\top \mathbf{x} + (w_{k0} - w_{j0}) = 0$$

thm. Such decision regions are singly connected and convex.1
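To make the argmax rule concrete, here is a minimal NumPy sketch of a $K$-class linear discriminant; the weights `W`, biases `b`, and the toy inputs are made up purely for illustration.

```python
import numpy as np

def classify(X, W, b):
    """Assign each row of X to the class with the largest discriminant y_k(x) = w_k^T x + w_k0."""
    scores = X @ W.T + b        # shape (n_samples, K): one y_k(x) per class
    return np.argmax(scores, axis=1)

# Toy example: K = 3 classes in a 2-D input space (weights and points chosen arbitrarily).
W = np.array([[ 1.0, 0.0],
              [-1.0, 0.0],
              [ 0.0, 1.0]])
b = np.array([0.0, 0.0, -0.5])
X = np.array([[ 2.0, 0.1],
              [-1.5, 0.2],
              [ 0.1, 3.0]])
print(classify(X, W, b))        # -> [0 1 2]
```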

Linear Discriminant Models

def. Perceptron Algorithm. (only works for $K = 2$ classes, but is illustrative; see the sketch after the list below.)

$$y(\mathbf{x}) = f\!\left(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x})\right)$$

where:

  • $\boldsymbol{\phi}(\mathbf{x})$ is a fixed (=non-trained) feature function. It includes a bias component $\phi_0(\mathbf{x}) = 1$.
  • $f$ is a step activation function where:

$$f(a) = \begin{cases} +1, & a \ge 0 \\ -1, & a < 0 \end{cases}$$
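A minimal sketch of this decision rule, assuming the identity feature map plus a constant bias feature (the function names are mine, not from the notes):

```python
import numpy as np

def phi(x):
    """Fixed (non-trained) feature function: here just the identity, plus the bias component phi_0(x) = 1."""
    return np.concatenate(([1.0], np.asarray(x, dtype=float)))

def perceptron_predict(w, x):
    """y(x) = f(w^T phi(x)), with the step activation f(a) = +1 if a >= 0 else -1."""
    a = w @ phi(x)
    return 1 if a >= 0 else -1
```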

Optimizing the Perceptron Algorithm

def. Perceptron Error Function.

$$E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^\top \boldsymbol{\phi}_n t_n$$

where

  • The sum runs over $\mathcal{M}$, the set of all misclassified $\mathbf{x}_n$'s (with $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)$).
  • $t_n \in \{-1, +1\}$ indicates which class $\mathbf{x}_n$ is in.

Intuition. In this error function:

  1. Correct classification has error $0$ (the pattern is not in $\mathcal{M}$).
  2. Incorrect classification has error $-\mathbf{w}^\top \boldsymbol{\phi}_n t_n > 0$.

Optimizing this globally over the whole dataset at once is too computationally intensive. We instead use stochastic gradient descent, one misclassified pattern at a time (a minimal sketch follows the list below):

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta\, \boldsymbol{\phi}_n t_n$$

where:

  • $\tau$ is simply the learning step index
  • $\eta$ is the learning rate (adjustable)
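As a rough illustration of the error function and a single stochastic update (the variable names and the misclassification test are my own, not from the notes):

```python
import numpy as np

def perceptron_error(w, Phi, t):
    """E_P(w) = -sum over misclassified n of w^T phi_n t_n, with t_n in {-1, +1}."""
    scores = Phi @ w                   # w^T phi_n for every pattern n
    wrong = scores * t <= 0            # misclassified (or on the boundary) patterns
    return -np.sum(scores[wrong] * t[wrong]), wrong

def sgd_step(w, phi_n, t_n, eta=1.0):
    """One stochastic update on a single misclassified pattern: w <- w + eta * phi_n * t_n."""
    return w + eta * phi_n * t_n
```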
Visualization. Following are two steps of a perceptron optimization.

  1. Top Left: the decision boundary (=black line) is initialized, defined by the orthogonal vector $\mathbf{w}$ (=black arrow). The green point is misclassified; its error contribution is shown (=red arrow).
  2. Top Right: the update $\eta\,\boldsymbol{\phi}_n t_n$ is added to the parameters to obtain a new decision boundary and a new orthogonal vector.
  3. Bottom Left: a green point is again misclassified; its error contribution is shown (=red arrow).
  4. Bottom Right: the update $\eta\,\boldsymbol{\phi}_n t_n$ is again added to the parameters to obtain a new decision boundary and a new orthogonal vector.

Motivation. We are updating the parameters for every single new data point that comes in. Doesn't this mean that the error for other data points may go up? Individual updates can indeed increase other points' contributions, but the algorithm is still guaranteed to reach a solution:

thm. Perceptron Convergence Theorem. If the training data is linearly separable, the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps.2
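To see the convergence theorem in action, here is a small self-contained sketch that runs the full perceptron loop on a made-up linearly separable 2-D dataset until no point is misclassified (the data and all names are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up linearly separable data: class +1 clustered around (2, 2), class -1 around (-2, -2).
X = np.vstack([rng.normal(+2.0, 0.5, size=(20, 2)),
               rng.normal(-2.0, 0.5, size=(20, 2))])
t = np.concatenate([np.ones(20), -np.ones(20)])

Phi = np.hstack([np.ones((40, 1)), X])    # phi(x) = (1, x): identity features plus a bias component
w = np.zeros(3)
eta = 1.0

converged = False
for step in range(1000):                   # finite cap; separable data converges long before this
    wrong = (Phi @ w) * t <= 0             # t_n * w^T phi_n <= 0 marks a misclassified (or boundary) pattern
    if not wrong.any():
        converged = True
        break
    n = np.flatnonzero(wrong)[0]           # pick one misclassified pattern
    w = w + eta * Phi[n] * t[n]            # stochastic update: w <- w + eta * phi_n * t_n

print("converged:", converged, "| updates:", step, "| w:", w)
```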

Footnotes

  1. proof

  2. proof