The theory of Operations Research has been used to develop models in many fields, such as statistics and machine learning.

## Linear Regression

We try to find `α` and `β` such that a line `g = α + β x` minimizes the sum of squared errors over all the data points:

min_{α,β} Σ^{n}_{i=1} [y_{i} - (α + β x_{i})]^{2}

Despite the word *linear* in “simple linear regression”, note that this is actually solving a *nonlinear* program with no constraints. Suppose the objective function is `f(α, β) = Σ^{n}_{i=1} [y_{i} - (α + β x_{i})]^{2}`, then its gradient and Hessian are:

∇f = ⎡ -2 Σ^{n}_{i=1} [y_{i} - (α + β x_{i})]       ⎤
     ⎣ -2 Σ^{n}_{i=1} [y_{i} - (α + β x_{i})] x_{i} ⎦
∇^{2}f = ⎡ 2n                  2 Σ^{n}_{i=1} x_{i}     ⎤
         ⎣ 2 Σ^{n}_{i=1} x_{i} 2 Σ^{n}_{i=1} x^{2}_{i} ⎦

Because `n > 0`, the first leading principal minor `2n` is positive, and the determinant `4[n Σ^{n}_{i=1} x^{2}_{i} - (Σ^{n}_{i=1} x_{i})^{2}]` is nonnegative by the Cauchy–Schwarz inequality, so the Hessian is positive semidefinite and the objective function is convex. Awesome. We have shown that this is an unconstrained convex program. Here we may use numerical algorithms to solve for the optimal α and β. However, an analytical approach gives us a closed-form formula:
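As a sanity check, we can compute these leading principal minors for a small made-up data set (a sketch in Python; the numbers are arbitrary):

```python
# Leading principal minors of the Hessian [[2n, 2Σx], [2Σx, 2Σx²]]
# for a made-up data set. det = 4[nΣx² - (Σx)²] ≥ 0 holds for any data
# by the Cauchy–Schwarz inequality.
xs = [0.5, 1.0, 2.0, 4.0]
n = len(xs)
h11 = 2 * n                        # ∂²f/∂α²
h12 = 2 * sum(xs)                  # ∂²f/∂α∂β
h22 = 2 * sum(x * x for x in xs)   # ∂²f/∂β²

minor1 = h11
minor2 = h11 * h22 - h12 * h12
print(minor1 > 0, minor2 >= 0)  # → True True
```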

```
∇f(α, β) = 0
⟹ -2 Σ^{n}_{i=1} [y_{i} - (α + β x_{i})] = 0, and
   -2 Σ^{n}_{i=1} [y_{i} - (α + β x_{i})] x_{i} = 0
⟹ n α + (Σ^{n}_{i=1} x_{i}) β = Σ^{n}_{i=1} y_{i}, and
   (Σ^{n}_{i=1} x_{i}) α + (Σ^{n}_{i=1} x^{2}_{i}) β = Σ^{n}_{i=1} x_{i} y_{i}
```

Now we have a linear system in the variables α and β; solving it gives the optimal solution directly.
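The closed-form solution is straightforward to implement. A minimal sketch (the helper name `fit_simple_ols` and the sample data are illustrative, not from the course) that solves the 2×2 normal equations by Cramer's rule:

```python
# Solve the 2x2 normal equations for simple linear regression by Cramer's
# rule. Helper name and sample data are illustrative, not from the course.
def fit_simple_ols(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # n*α + sx*β = sy  ;  sx*α + sxx*β = sxy
    det = n * sxx - sx * sx
    beta = (n * sxy - sx * sy) / det
    alpha = (sy - sx * beta) / n
    return alpha, beta

# Points on the line y = 1 + 2x are recovered exactly.
alpha, beta = fit_simple_ols([0, 1, 2, 3], [1, 3, 5, 7])
print(alpha, beta)  # → 1.0 2.0
```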

The same idea applies to multiple linear regression. Given a data set `{ x^{i}_{1}, x^{i}_{2}, ..., x^{i}_{p}, y^{i} }_{i=1,...,n}`, find `α, β_{1}, β_{2}, ..., β_{p}` to solve:

min_{α,β_j} Σ^{n}_{i=1} [y_{i} - (α + Σ^{p}_{j=1} β_{j} x^{i}_{j})]^{2}

Note one reason to define fitting error as the *sum of squared errors* `Σ^{n}_{i=1} [y_{i} - (α + β x_{i})]^{2}` rather than the *sum of absolute errors* `Σ^{n}_{i=1} |y_{i} - (α + β x_{i})|` is that the latter is not differentiable.

### Ridge and LASSO Regression

When you apply linear regression for prediction, you want to avoid **overfitting** by using regularization. In practice there are often hundreds (if not millions) of variables, and you usually have no idea which of them are really useful. Let `λ > 0` be the penalty for using variables, then we can define:

```
Ridge Regression:
min_{α,β_j} Σ^{n}_{i=1} [y^{i} - (α + β^{T} x^{i})]^{2} + λ Σ^{p}_{j=1} β^{2}_{j}
LASSO Regression:
min_{α,β_j} Σ^{n}_{i=1} [y^{i} - (α + β^{T} x^{i})]^{2} + λ Σ^{p}_{j=1} |β_{j}|
```

We want to minimize the sum of everything, while ideally hoping all the `β_{j}` are zero. The optimization will choose the most effective `β_{j}` and make them positive or negative, so that the sum of squared errors is minimized without incurring too large a penalty.

Note that both models are **unconstrained convex programs**.
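Since ridge regression is an unconstrained convex program, plain gradient descent will find its optimum. A minimal one-feature sketch (the function name, data, step size and iteration count are illustrative choices, not from the course):

```python
# Ridge regression with one feature, solved by gradient descent on the
# unconstrained convex objective Σ[y - (α + βx)]² + λβ².
# Data, step size and iteration count are arbitrary illustrative choices.
def fit_ridge_1d(xs, ys, lam, lr=0.01, steps=5000):
    alpha, beta = 0.0, 0.0
    for _ in range(steps):
        r = [y - (alpha + beta * x) for x, y in zip(xs, ys)]  # residuals
        g_a = -2 * sum(r)                                                # ∂/∂α
        g_b = -2 * sum(ri * x for ri, x in zip(r, xs)) + 2 * lam * beta  # ∂/∂β
        alpha -= lr * g_a
        beta -= lr * g_b
    return alpha, beta

# With λ = 0 this reduces to ordinary least squares: approximately (1.0, 2.0).
alpha, beta = fit_ridge_1d([0, 1, 2, 3], [1, 3, 5, 7], lam=0.0)
```

Increasing `λ` shrinks `β` toward zero, which is exactly the regularization effect described above.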

## Support Vector Machine

Classification is an important subject in machine learning. Given a data set `{ x^{i}_{1}, x^{i}_{2}, ..., x^{i}_{n}, y^{i} }_{i=1,...,m}` where `y^{i} ∈ {1, -1}`, we want to find a classifier that assigns each data point `i` to a class according to the features `x^{i}`, minimizing the total number of classification errors.

In the 2-dimensional setting, i.e. `R^{2}`, linear classification is like drawing a straight line to separate a set of data points (with `y^{i} = 1`) from the others (with `y^{i} = -1`). In 3 dimensions, `R^{3}`, it is drawing a plane, and in higher dimensions `R^{n}` for n > 3, it is drawing a hyperplane.

The best line is the one that is **farthest** from both sets. A set’s *supporting hyperplane* is a hyperplane that separates all the data points in the set to one side of it. The data points that are closest to or on the supporting hyperplane are called *support vectors*. Support Vector Machine is a model/algorithm that tries to find the best separating hyperplane `α + β^{T} x = 0`, called the *classifier*.

We can classify a point as in set 1 if `α + β^{T} x ≥ 1`, or in set 2 if `α + β^{T} x ≤ -1`. It is equivalent to use `k` and `-k` instead of `1` and `-1`, since we may scale `α` and `β` in any way we like.

Next we want to maximize the distance between the two **supporting** hyperplanes. Suppose `x^{1}` and `x^{2}` are two support vectors on the supporting hyperplanes of set 1 and set 2 respectively. Then the distance is the projection of `x^{1} - x^{2}` onto the normal vector of the *separating* hyperplane.

Recall that when projecting a vector `a ∈ R^{n}` onto a vector `w ∈ R^{n}`, the projection gives a new vector `a_{w} = α w`, where `α ∈ R` is a scalar. We then have two equations about the norms (lengths) of the vectors `a`, `w` and `a_{w}`:

∥a_{w}∥ = ∥a∥ cosθ
∥a_{w}∥ = ∥w∥ α

They imply `α = ∥a∥ cosθ / ∥w∥`. It then follows that:

a_{w} = α w
      = (∥a∥ cosθ / ∥w∥) w
      = (∥a∥ (a^{T}w/∥a∥∥w∥) / ∥w∥) w
      = (a^{T}w/∥w∥^{2}) w
∥a_{w}∥ = a^{T}w/∥w∥
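A quick numeric check of the projection formula (the vectors are made up for illustration):

```python
# Numeric check of a_w = (aᵀw / ∥w∥²) w with made-up vectors.
a, w = [3.0, 4.0], [1.0, 0.0]
coef = sum(ai * wi for ai, wi in zip(a, w)) / sum(wi * wi for wi in w)
a_w = [coef * wi for wi in w]
print(a_w)  # → [3.0, 0.0]  (projection of (3, 4) onto the x-axis)
```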

Going back to the support vector machine, `a = x^{1} - x^{2}` and `w = β`, so the distance is `(x^{1} - x^{2})^{T}β / ∥β∥`. Finally the objective function and constraints are:

max_{α,β} (x^{1} - x^{2})^{T}β / ∥β∥
s.t. y^{i}(α + β^{T} x^{i}) ≥ 1 for all i = 1,...,m
⟹ α + β^{T} x^{i} ≥ 1 when y^{i} = 1
⟹ α + β^{T} x^{i} ≤ -1 when y^{i} = -1

We could further simplify the objective function. When we plug `x^{1}` and `x^{2}` into `α + β^{T} x`, we exactly have:

α + β^{T} x^{1} = 1
α + β^{T} x^{2} = -1
(x^{1} - x^{2})^{T}β = β^{T}x^{1} - β^{T}x^{2}
                     = (1 - α) - (-1 - α) = 2

The objective function finally becomes `max_{α,β} 2/∥β∥`. Instead of maximization, we could do `min_{α,β} ∥β∥/2` (and then drop the square root, which does not change the minimizer):

min_{α,β} ∥β∥/2
⟹ min_{α,β} √(β^{2}_{1} + β^{2}_{2} + ... + β^{2}_{n})/2
⟹ min_{α,β} (β^{2}_{1} + β^{2}_{2} + ... + β^{2}_{n})/2
⟹ min_{α,β} 1/2 Σ^{n}_{k=1} β^{2}_{k}
s.t. y^{i}(α + β^{T} x^{i}) ≥ 1 for all i = 1,...,m

This is a convex program! Awesome.
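We can check the margin derivation numerically: for support vectors on `α + β^{T} x = ±1`, the projected distance equals `2/∥β∥`. The hyperplane and points below are made up for illustration:

```python
# Check: for support vectors on α + βᵀx = ±1, the projected distance
# (x¹ - x²)ᵀβ / ∥β∥ equals 2 / ∥β∥. Hyperplane and points are made up.
import math

alpha, beta = -1.0, [1.0, 1.0]   # hyperplane x₁ + x₂ = 1
x1 = [1.0, 1.0]                  # satisfies α + βᵀx¹ = +1
x2 = [0.0, 0.0]                  # satisfies α + βᵀx² = -1
dot = sum(b * (u - v) for b, u, v in zip(beta, x1, x2))
norm = math.sqrt(sum(b * b for b in beta))
print(dot / norm, 2 / norm)  # both equal √2
```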

### Imperfect Separation

Given a separating hyperplane `α + β^{T} x = 0`, ideally we have `y^{i}(α + β^{T} x^{i}) ≥ 1` satisfied for each data point `i`. However, in practice perfect separation is usually impossible, which means the constraint `y^{i}(α + β^{T} x^{i}) ≥ 1` will be violated. We need to allow errors, by adding a “degree of error” `γ^{i} ≥ 0` into the objective function.

min_{α,β,γ} 1/2 Σ^{n}_{k=1} β^{2}_{k} + C Σ^{m}_{i=1} γ^{i}
s.t. y^{i}(α + β^{T} x^{i}) ≥ 1 - γ^{i} for all i = 1,...,m
where γ^{i} ≥ 0, C ≥ 0

`C` is a given parameter: the larger it is, the greater the penalty. Again, this is a convex program.
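At the optimum, `γ^{i} = max(0, 1 - y^{i}(α + β^{T} x^{i}))`, so the program can be rewritten as an unconstrained hinge-loss problem that a simple subgradient method can solve. A minimal one-feature sketch with made-up data and arbitrary parameter choices:

```python
# Soft-margin SVM with one feature, rewritten with γⁱ = max(0, 1 - yⁱ(α + βxⁱ))
# as the unconstrained problem min (1/2)β² + C Σ max(0, 1 - y(α + βx)),
# solved by subgradient descent. Data, C, step size and steps are illustrative.
def fit_svm_1d(xs, ys, C=10.0, lr=0.001, steps=20000):
    alpha, beta = 0.0, 0.0
    for _ in range(steps):
        g_a, g_b = 0.0, beta                   # gradient of (1/2)β²
        for x, y in zip(xs, ys):
            if y * (alpha + beta * x) < 1:     # hinge active for this point
                g_a -= C * y
                g_b -= C * y * x
        alpha -= lr * g_a
        beta -= lr * g_b
    return alpha, beta

# Separable data: class -1 at x ≤ -1, class +1 at x ≥ 1.
# The widest margin is near α = 0, β = 1 (support vectors at x = ±1).
alpha, beta = fit_svm_1d([-2.0, -1.0, 1.0, 2.0], [-1, -1, 1, 1])
```

Subgradient descent only oscillates around the optimum with a constant step size, so the result is approximate rather than exact.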

### Dualization of SVM

Researchers who studied the SVM problem found that the dual program of SVM is easier to solve. We can get the dual program by using Lagrange duality. Let:

- `λ^{i} ≥ 0` be the Lagrange multiplier for the constraint of imperfect separation
- `μ^{i} ≥ 0` be the Lagrange multiplier for the constraint that `γ^{i} ≥ 0`

Then the Lagrangian of the SVM program is:

```
L(α,β,γ|λ,μ) = (1/2) Σ^{n}_{k=1} β^{2}_{k} + C Σ^{m}_{i=1} γ^{i}
               - Σ^{m}_{i=1} λ^{i} [y^{i} (α + Σ^{n}_{k=1} x^{i}_{k} β_{k}) - 1 + γ^{i}]
               - Σ^{m}_{i=1} μ^{i} γ^{i}
```

Then the Lagrange dual program chooses λ and μ to maximize the outcome of the inner minimization of the Lagrangian, which always gives us a lower bound.

max_{λ≥0,μ≥0} min_{α,β,γ} L(α,β,γ|λ,μ)

The first order condition (FOC) of the Lagrangian is necessary and sufficient. Taking the derivative of `L(α, β, γ | λ, μ)` with respect to `α`, `β_{k}`, `γ^{i}` respectively gives us:

∂L/∂α = 0:     Σ^{m}_{i=1} λ^{i} y^{i} = 0
∂L/∂β_{k} = 0: β_{k} = Σ^{m}_{i=1} λ^{i} y^{i} x^{i}_{k}
∂L/∂γ^{i} = 0: C = λ^{i} + μ^{i}

Note that no primal variable `α, β_{k}, γ^{i}` appears in the conditions from `∂L/∂α` and `∂L/∂γ^{i}`, so when you want to minimize `L(α,β,γ|λ,μ)`, you must choose `λ` and `μ` so that these two conditions hold, otherwise `min_{α,β,γ} L(α,β,γ|λ,μ)` will be unbounded (i.e. negative infinity). Since the conditions from `∂L/∂α` and `∂L/∂γ^{i}` must be met, plugging in the condition from `∂L/∂β_{k}`, `L(α,β,γ|λ,μ)` can be simplified:

```
L(α,β,γ|λ,μ)
= (1/2) Σ^{n}_{k=1} β^{2}_{k} + C Σ^{m}_{i=1} γ^{i}
  - Σ^{m}_{i=1} λ^{i} [y^{i} (α + Σ^{n}_{k=1} x^{i}_{k} β_{k}) - 1 + γ^{i}]
  - Σ^{m}_{i=1} μ^{i} γ^{i}
= (1/2) Σ^{n}_{k=1} β^{2}_{k}
  - Σ^{m}_{i=1} λ^{i} y^{i} Σ^{n}_{k=1} x^{i}_{k} β_{k}
  + Σ^{m}_{i=1} λ^{i}
  (plugging in ∂L/∂β_{k})
= (1/2) Σ^{n}_{k=1} (Σ^{m}_{j=1} λ^{j} y^{j} x^{j}_{k})^{2}
  - Σ^{m}_{i=1} λ^{i} y^{i} Σ^{n}_{k=1} x^{i}_{k} (Σ^{m}_{j=1} λ^{j} y^{j} x^{j}_{k})
  + Σ^{m}_{i=1} λ^{i}
= (-1/2) Σ^{m}_{i=1} Σ^{m}_{j=1} λ^{i} λ^{j} y^{i} y^{j} (x^{i})^{T}x^{j}
  + Σ^{m}_{i=1} λ^{i}
```

After the simplification above, the Lagrangian dual program now becomes:

max_{λ,μ} (-1/2) Σ^{m}_{i=1} Σ^{m}_{j=1} λ^{i} λ^{j} y^{i} y^{j} (x^{i})^{T}x^{j} + Σ^{m}_{i=1} λ^{i}
s.t. Σ^{m}_{i=1} λ^{i} y^{i} = 0
C = λ^{i} + μ^{i} ∀i = 1,...,m
λ^{i} ≥ 0, μ^{i} ≥ 0 ∀i = 1,...,m

In fact the last two constraints are telling you `λ^{i} ≤ C`: you can always adjust `μ^{i}` to make this happen. Now the final dual program (`μ` is eliminated) is:

max_{λ} (-1/2) Σ^{m}_{i=1} Σ^{m}_{j=1} λ^{i} λ^{j} y^{i} y^{j} (x^{i})^{T}x^{j} + Σ^{m}_{i=1} λ^{i}
s.t. Σ^{m}_{i=1} λ^{i} y^{i} = 0
0 ≤ λ^{i} ≤ C ∀i = 1,...,m

Now you only have one set of variables `λ`, which must satisfy the equality constraint and lie between 0 and C.

Both primal and dual programs are constrained nonlinear programs. But why is the dual program easier to solve than the primal?

|  | Primal | Dual |
| --- | --- | --- |
| Variables | `α` (1 variable), `β` (n variables, usually very large), `γ` (m variables) | `λ` (m variables) |
| Constraints | Complicated: `y^{i}(α + β^{T} x^{i}) ≥ 1 - γ^{i}` (m constraints), `γ^{i} ≥ 0` (m constraints) | Simple: `Σ^{m}_{i=1} λ^{i} y^{i} = 0` (1 constraint), `λ^{i} ≥ 0` (m constraints), `λ^{i} ≤ C` (m constraints) |
| Objective function | Convex | Convex |
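As a tiny sanity check of the dual, with two symmetric points the equality constraint forces `λ^{1} = λ^{2}`, so the dual collapses to a one-dimensional concave maximization we can scan numerically (the data and grid resolution are illustrative):

```python
# Two-point sanity check of the dual. With y = (1, -1) the constraint
# Σ λⁱyⁱ = 0 forces λ¹ = λ² = t, so the dual is a concave function of t
# alone, maximized here by a grid scan. Data and grid step are illustrative.
xs, ys = [1.0, -1.0], [1, -1]
C = 10.0

def dual_obj(t):
    lam = [t, t]
    q = sum(lam[i] * lam[j] * ys[i] * ys[j] * xs[i] * xs[j]
            for i in range(2) for j in range(2))
    return -0.5 * q + sum(lam)

best_t = max((k * 0.001 for k in range(int(C * 1000) + 1)), key=dual_obj)
beta = sum(best_t * y * x for x, y in zip(xs, ys))  # β = Σ λⁱ yⁱ xⁱ
print(best_t, beta)  # → 0.5 1.0
```

Recovering `β = 1` matches the primal solution for these two points, with the support vectors sitting exactly on `β x = ±1`.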

## My Certificate

For more on **Operations Research: The Theory for Regression and SVM**, please refer to the wonderful course here: https://www.coursera.org/learn/operations-research-theory


*I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai*

Don't forget to sign up for the newsletter, so you don't miss any chance to learn.

Or share what you've learned with friends!
