The theory of Operations Research has been used to develop models in many fields, such as statistics and machine learning.

## Linear Regression

We try to find `α` and `β` such that a line `g = α + β x` minimizes the sum of squared errors over all the data points:

min_{α,β} Σ^{n}_{i=1} [y_{i} - (α + β x_{i})]^{2}

Despite the word *linear* in “simple linear regression”, note that this is actually solving a *nonlinear* program with no constraints. Suppose the objective function is `f(α, β) = Σ^{n}_{i=1} [y_{i} - (α + β x_{i})]^{2}`, then its gradient and Hessian are:

∇f = ⎡ -2 Σ^{n}_{i=1} [y_{i} - (α + β x_{i})]       ⎤
     ⎣ -2 Σ^{n}_{i=1} [y_{i} - (α + β x_{i})] x_{i} ⎦
∇^{2}f = ⎡ 2n                  2 Σ^{n}_{i=1} x_{i}     ⎤
         ⎣ 2 Σ^{n}_{i=1} x_{i} 2 Σ^{n}_{i=1} x^{2}_{i} ⎦

Because `n > 0`, the first leading principal minor `2n` is positive, and the determinant `4[n Σ^{n}_{i=1} x^{2}_{i} - (Σ^{n}_{i=1} x_{i})^{2}]` is nonnegative by the Cauchy–Schwarz inequality, so the Hessian is positive semidefinite and the objective function is convex. Awesome. We have shown that this is an unconstrained convex program. Here we may use numerical algorithms to solve for the optimal α and β. However, an analytical approach gives us a closed-form formula:
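As a sanity check, we can compute these leading principal minors for a small made-up data set (a sketch in Python; the numbers are arbitrary):

```python
# Leading principal minors of the Hessian [[2n, 2Σx], [2Σx, 2Σx²]]
# for a made-up data set. det = 4[nΣx² - (Σx)²] ≥ 0 holds for any data
# by the Cauchy–Schwarz inequality.
xs = [0.5, 1.0, 2.0, 4.0]
n = len(xs)
h11 = 2 * n                        # ∂²f/∂α²
h12 = 2 * sum(xs)                  # ∂²f/∂α∂β
h22 = 2 * sum(x * x for x in xs)   # ∂²f/∂β²

minor1 = h11
minor2 = h11 * h22 - h12 * h12
print(minor1 > 0, minor2 >= 0)  # → True True
```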

```
∇f(α, β) = 0
⟹ -2 Σ^{n}_{i=1} [y_{i} - (α + β x_{i})] = 0, and
   -2 Σ^{n}_{i=1} [y_{i} - (α + β x_{i})] x_{i} = 0
⟹ n α + (Σ^{n}_{i=1} x_{i}) β = Σ^{n}_{i=1} y_{i}, and
   (Σ^{n}_{i=1} x_{i}) α + (Σ^{n}_{i=1} x^{2}_{i}) β = Σ^{n}_{i=1} x_{i} y_{i}
```

Now we have a linear system in the variables α and β; solving it gives the optimal solution directly.
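The closed-form solution is straightforward to implement. A minimal sketch (the helper name `fit_simple_ols` and the sample data are illustrative, not from the course) that solves the 2×2 normal equations by Cramer's rule:

```python
# Solve the 2x2 normal equations for simple linear regression by Cramer's
# rule. Helper name and sample data are illustrative, not from the course.
def fit_simple_ols(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # n*α + sx*β = sy  ;  sx*α + sxx*β = sxy
    det = n * sxx - sx * sx
    beta = (n * sxy - sx * sy) / det
    alpha = (sy - sx * beta) / n
    return alpha, beta

# Points on the line y = 1 + 2x are recovered exactly.
alpha, beta = fit_simple_ols([0, 1, 2, 3], [1, 3, 5, 7])
print(alpha, beta)  # → 1.0 2.0
```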

The same idea applies to multiple linear regression. Given a data set `{ x^{i}_{1}, x^{i}_{2}, ..., x^{i}_{p}, y^{i} }_{i=1,...,n}`, find `α, β_{1}, β_{2}, ..., β_{p}` to solve:

min_{α,β_j} Σ^{n}_{i=1} [y_{i} - (α + Σ^{p}_{j=1} β_{j} x^{i}_{j})]^{2}

Note one reason to define fitting error as the *sum of squared errors* `Σ^{n}_{i=1} [y_{i} - (α + β x_{i})]^{2}` rather than the *sum of absolute errors* `Σ^{n}_{i=1} |y_{i} - (α + β x_{i})|` is that the latter is not differentiable.

### Ridge and LASSO Regression

When you apply linear regression for prediction, you want to avoid **overfitting** by using regularization. In practice there are often hundreds (if not millions) of variables, and you usually have no idea which of them are really useful. Let `λ > 0` be the penalty for using variables, then we can define:

```
Ridge Regression:
min_{α,β_j} Σ^{n}_{i=1} [y^{i} - (α + β^{T} x^{i})]^{2} + λ Σ^{p}_{j=1} β^{2}_{j}
LASSO Regression:
min_{α,β_j} Σ^{n}_{i=1} [y^{i} - (α + β^{T} x^{i})]^{2} + λ Σ^{p}_{j=1} |β_{j}|
```

We want to minimize the sum of everything, while ideally hoping all the `β_{j}` are zero. The optimization will choose the most effective `β_{j}` and make them positive or negative, so that the sum of squared errors is minimized without incurring too large a penalty.

Note that both models are **unconstrained convex programs**.
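Since ridge regression is an unconstrained convex program, plain gradient descent will find its optimum. A minimal one-feature sketch (the function name, data, step size and iteration count are illustrative choices, not from the course):

```python
# Ridge regression with one feature, solved by gradient descent on the
# unconstrained convex objective Σ[y - (α + βx)]² + λβ².
# Data, step size and iteration count are arbitrary illustrative choices.
def fit_ridge_1d(xs, ys, lam, lr=0.01, steps=5000):
    alpha, beta = 0.0, 0.0
    for _ in range(steps):
        r = [y - (alpha + beta * x) for x, y in zip(xs, ys)]  # residuals
        g_a = -2 * sum(r)                                                # ∂/∂α
        g_b = -2 * sum(ri * x for ri, x in zip(r, xs)) + 2 * lam * beta  # ∂/∂β
        alpha -= lr * g_a
        beta -= lr * g_b
    return alpha, beta

# With λ = 0 this reduces to ordinary least squares: approximately (1.0, 2.0).
alpha, beta = fit_ridge_1d([0, 1, 2, 3], [1, 3, 5, 7], lam=0.0)
```

Increasing `λ` shrinks `β` toward zero, which is exactly the regularization effect described above.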

## Support Vector Machine

Classification is an important subject in machine learning. Given a data set `{ x^{i}_{1}, x^{i}_{2}, ..., x^{i}_{n}, y^{i} }_{i=1,...,m}` where `y^{i} ∈ {1, -1}`, we want to find a classifier that assigns each data point `i` to a class according to the features `x^{i}`, minimizing the total number of classification errors.

In the 2-dimensional setting, i.e. `R^{2}`, linear classification is like drawing a straight line to separate a set of data points (with `y^{i} = 1`) from the others (with `y^{i} = -1`). In 3 dimensions, `R^{3}`, it is drawing a plane, and in higher dimensions `R^{n}` for n > 3, it is drawing a hyperplane.

The best line is the one that is **farthest** from both sets. A set’s *supporting hyperplane* is a hyperplane that separates all the data points in the set to one side of it. The data points that are closest to or on the supporting hyperplane are called *support vectors*. Support Vector Machine is a model/algorithm that tries to find the best separating hyperplane `α + β^{T} x = 0`, called the *classifier*.

We can classify a point as in set 1 if `α + β^{T} x ≥ 1`, or in set 2 if `α + β^{T} x ≤ -1`. It is equivalent to use `k` and `-k` instead of `1` and `-1`, since we may scale `α` and `β` in any way we like.

Next we want to maximize the distance between the two **supporting** hyperplanes. Suppose `x^{1}` and `x^{2}` are two support vectors on the supporting hyperplanes of set 1 and set 2 respectively. Then the distance is the projection of `x^{1} - x^{2}` onto the normal vector of the *separating* hyperplane.

Recall that when projecting a vector `a ∈ R^{n}` onto a vector `w ∈ R^{n}`, the projection gives a new vector `a_{w} = α w`, where `α ∈ R` is a scalar. We then have two equations about the norms (lengths) of the vectors `a`, `w` and `a_{w}`:

∥a_{w}∥ = ∥a∥ cosθ
∥a_{w}∥ = ∥w∥ α

They imply `α = ∥a∥ cosθ / ∥w∥`. It then follows that:

a_{w} = α w
      = (∥a∥ cosθ / ∥w∥) w
      = (∥a∥ (a^{T}w/∥a∥∥w∥) / ∥w∥) w
      = (a^{T}w/∥w∥^{2}) w
∥a_{w}∥ = a^{T}w/∥w∥
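A quick numeric check of the projection formula (the vectors are made up for illustration):

```python
# Numeric check of a_w = (aᵀw / ∥w∥²) w with made-up vectors.
a, w = [3.0, 4.0], [1.0, 0.0]
coef = sum(ai * wi for ai, wi in zip(a, w)) / sum(wi * wi for wi in w)
a_w = [coef * wi for wi in w]
print(a_w)  # → [3.0, 0.0]  (projection of (3, 4) onto the x-axis)
```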

Going back to the support vector machine, `a = x^{1} - x^{2}` and `w = β`, so the distance is `(x^{1} - x^{2})^{T}β / ∥β∥`. Finally the objective function and constraints are:

max_{α,β} (x^{1} - x^{2})^{T}β / ∥β∥
s.t. y^{i}(α + β^{T} x^{i}) ≥ 1 for all i = 1,...,m
⟹ α + β^{T} x^{i} ≥ 1 when y^{i} = 1
⟹ α + β^{T} x^{i} ≤ -1 when y^{i} = -1

We could further simplify the objective function. When we plug `x^{1}` and `x^{2}` into `α + β^{T} x`, we exactly have:

α + β^{T} x^{1} = 1
α + β^{T} x^{2} = -1
(x^{1} - x^{2})^{T}β = β^{T}x^{1} - β^{T}x^{2}
                     = (1 - α) - (-1 - α) = 2

The objective function finally becomes `max_{α,β} 2/∥β∥`. Instead of maximization, we could do `min_{α,β} ∥β∥/2` (and then drop the square root, which does not change the minimizer):

min_{α,β} ∥β∥/2
⟹ min_{α,β} √(β^{2}_{1} + β^{2}_{2} + ... + β^{2}_{n})/2
⟹ min_{α,β} (β^{2}_{1} + β^{2}_{2} + ... + β^{2}_{n})/2
⟹ min_{α,β} 1/2 Σ^{n}_{k=1} β^{2}_{k}
s.t. y^{i}(α + β^{T} x^{i}) ≥ 1 for all i = 1,...,m

This is a convex program! Awesome.
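We can check the margin derivation numerically: for support vectors on `α + β^{T} x = ±1`, the projected distance equals `2/∥β∥`. The hyperplane and points below are made up for illustration:

```python
# Check: for support vectors on α + βᵀx = ±1, the projected distance
# (x¹ - x²)ᵀβ / ∥β∥ equals 2 / ∥β∥. Hyperplane and points are made up.
import math

alpha, beta = -1.0, [1.0, 1.0]   # hyperplane x₁ + x₂ = 1
x1 = [1.0, 1.0]                  # satisfies α + βᵀx¹ = +1
x2 = [0.0, 0.0]                  # satisfies α + βᵀx² = -1
dot = sum(b * (u - v) for b, u, v in zip(beta, x1, x2))
norm = math.sqrt(sum(b * b for b in beta))
print(dot / norm, 2 / norm)  # both equal √2
```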

### Imperfect Separation

Given a separating hyperplane `α + β^{T} x = 0`, ideally we have `y^{i}(α + β^{T} x^{i}) ≥ 1` satisfied for each data point `i`. However, in practice perfect separation is usually impossible, which means the constraint `y^{i}(α + β^{T} x^{i}) ≥ 1` will be violated. We need to allow errors, by adding a “degree of error” `γ^{i} ≥ 0` into the objective function.

min_{α,β,γ} 1/2 Σ^{n}_{k=1} β^{2}_{k} + C Σ^{m}_{i=1} γ^{i}
s.t. y^{i}(α + β^{T} x^{i}) ≥ 1 - γ^{i} for all i = 1,...,m
where γ^{i} ≥ 0, C ≥ 0

`C` is a given parameter: the larger it is, the greater the penalty. Again, this is a convex program.
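At the optimum, `γ^{i} = max(0, 1 - y^{i}(α + β^{T} x^{i}))`, so the program can be rewritten as an unconstrained hinge-loss problem that a simple subgradient method can solve. A minimal one-feature sketch with made-up data and arbitrary parameter choices:

```python
# Soft-margin SVM with one feature, rewritten with γⁱ = max(0, 1 - yⁱ(α + βxⁱ))
# as the unconstrained problem min (1/2)β² + C Σ max(0, 1 - y(α + βx)),
# solved by subgradient descent. Data, C, step size and steps are illustrative.
def fit_svm_1d(xs, ys, C=10.0, lr=0.001, steps=20000):
    alpha, beta = 0.0, 0.0
    for _ in range(steps):
        g_a, g_b = 0.0, beta                   # gradient of (1/2)β²
        for x, y in zip(xs, ys):
            if y * (alpha + beta * x) < 1:     # hinge active for this point
                g_a -= C * y
                g_b -= C * y * x
        alpha -= lr * g_a
        beta -= lr * g_b
    return alpha, beta

# Separable data: class -1 at x ≤ -1, class +1 at x ≥ 1.
# The widest margin is near α = 0, β = 1 (support vectors at x = ±1).
alpha, beta = fit_svm_1d([-2.0, -1.0, 1.0, 2.0], [-1, -1, 1, 1])
```

Subgradient descent only oscillates around the optimum with a constant step size, so the result is approximate rather than exact.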

### Dualization of SVM

Researchers who studied the SVM problem found that the dual program of SVM is easier to solve. We can get the dual program by using Lagrange duality. Let:

- `λ^{i} ≥ 0` be the Lagrange multiplier for the constraint of imperfect separation
- `μ^{i} ≥ 0` be the Lagrange multiplier for the constraint that `γ^{i} ≥ 0`

Then the Lagrangian of the SVM program is:

```
L(α,β,γ|λ,μ) = (1/2) Σ^{n}_{k=1} β^{2}_{k} + C Σ^{m}_{i=1} γ^{i}
               - Σ^{m}_{i=1} λ^{i} [y^{i} (α + Σ^{n}_{k=1} x^{i}_{k} β_{k}) - 1 + γ^{i}]
               - Σ^{m}_{i=1} μ^{i} γ^{i}
```

Then the Lagrange dual program chooses λ and μ to maximize the outcome of the inner minimization of the Lagrangian, which always gives us a lower bound.

max_{λ≥0,μ≥0} min_{α,β,γ} L(α,β,γ|λ,μ)

The first order condition (FOC) of the Lagrangian is necessary and sufficient. Taking the derivative of `L(α, β, γ | λ, μ)` with respect to `α`, `β_{k}`, `γ^{i}` respectively gives us:

∂L/∂α = 0:     Σ^{m}_{i=1} λ^{i} y^{i} = 0
∂L/∂β_{k} = 0: β_{k} = Σ^{m}_{i=1} λ^{i} y^{i} x^{i}_{k}
∂L/∂γ^{i} = 0: C = λ^{i} + μ^{i}

Note that no primal variable `α, β_{k}, γ^{i}` appears in the conditions from `∂L/∂α` and `∂L/∂γ^{i}`, so when you want to minimize `L(α,β,γ|λ,μ)`, you must choose `λ` and `μ` so that these two conditions hold, otherwise `min_{α,β,γ} L(α,β,γ|λ,μ)` will be unbounded (i.e. negative infinity). Since the conditions from `∂L/∂α` and `∂L/∂γ^{i}` must be met, plugging in the condition from `∂L/∂β_{k}`, `L(α,β,γ|λ,μ)` can be simplified:

```
L(α,β,γ|λ,μ)
= (1/2) Σ^{n}_{k=1} β^{2}_{k} + C Σ^{m}_{i=1} γ^{i}
  - Σ^{m}_{i=1} λ^{i} [y^{i} (α + Σ^{n}_{k=1} x^{i}_{k} β_{k}) - 1 + γ^{i}]
  - Σ^{m}_{i=1} μ^{i} γ^{i}
= (1/2) Σ^{n}_{k=1} β^{2}_{k}
  - Σ^{m}_{i=1} λ^{i} y^{i} Σ^{n}_{k=1} x^{i}_{k} β_{k}
  + Σ^{m}_{i=1} λ^{i}
  (plugging in ∂L/∂β_{k})
= (1/2) Σ^{n}_{k=1} (Σ^{m}_{j=1} λ^{j} y^{j} x^{j}_{k})^{2}
  - Σ^{m}_{i=1} λ^{i} y^{i} Σ^{n}_{k=1} x^{i}_{k} (Σ^{m}_{j=1} λ^{j} y^{j} x^{j}_{k})
  + Σ^{m}_{i=1} λ^{i}
= (-1/2) Σ^{m}_{i=1} Σ^{m}_{j=1} λ^{i} λ^{j} y^{i} y^{j} (x^{i})^{T}x^{j}
  + Σ^{m}_{i=1} λ^{i}
```

After the simplification above, the Lagrangian dual program now becomes:

max_{λ,μ} (-1/2) Σ^{m}_{i=1} Σ^{m}_{j=1} λ^{i} λ^{j} y^{i} y^{j} (x^{i})^{T}x^{j} + Σ^{m}_{i=1} λ^{i}
s.t. Σ^{m}_{i=1} λ^{i} y^{i} = 0
C = λ^{i} + μ^{i} ∀i = 1,...,m
λ^{i} ≥ 0, μ^{i} ≥ 0 ∀i = 1,...,m

In fact the last two constraints are telling you `λ^{i} ≤ C`: you can always adjust `μ^{i}` to make this happen. Now the final dual program (`μ` is eliminated) is:

max_{λ} (-1/2) Σ^{m}_{i=1} Σ^{m}_{j=1} λ^{i} λ^{j} y^{i} y^{j} (x^{i})^{T}x^{j} + Σ^{m}_{i=1} λ^{i}
s.t. Σ^{m}_{i=1} λ^{i} y^{i} = 0
0 ≤ λ^{i} ≤ C ∀i = 1,...,m

Now you only have one set of variables `λ`, which must satisfy the equality constraint and lie between 0 and C.

Both primal and dual programs are constrained nonlinear programs. But why is the dual program easier to solve than the primal?

|  | Primal | Dual |
| --- | --- | --- |
| Variables | `α` (1 variable), `β` (n variables, usually very large), `γ` (m variables) | `λ` (m variables) |
| Constraints | Complicated: `y^{i}(α + β^{T} x^{i}) ≥ 1 - γ^{i}` (m constraints), `γ^{i} ≥ 0` (m constraints) | Simple: `Σ^{m}_{i=1} λ^{i} y^{i} = 0` (1 constraint), `λ^{i} ≥ 0` (m constraints), `λ^{i} ≤ C` (m constraints) |
| Objective function | Convex | Convex |
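As a tiny sanity check of the dual, with two symmetric points the equality constraint forces `λ^{1} = λ^{2}`, so the dual collapses to a one-dimensional concave maximization we can scan numerically (the data and grid resolution are illustrative):

```python
# Two-point sanity check of the dual. With y = (1, -1) the constraint
# Σ λⁱyⁱ = 0 forces λ¹ = λ² = t, so the dual is a concave function of t
# alone, maximized here by a grid scan. Data and grid step are illustrative.
xs, ys = [1.0, -1.0], [1, -1]
C = 10.0

def dual_obj(t):
    lam = [t, t]
    q = sum(lam[i] * lam[j] * ys[i] * ys[j] * xs[i] * xs[j]
            for i in range(2) for j in range(2))
    return -0.5 * q + sum(lam)

best_t = max((k * 0.001 for k in range(int(C * 1000) + 1)), key=dual_obj)
beta = sum(best_t * y * x for x, y in zip(xs, ys))  # β = Σ λⁱ yⁱ xⁱ
print(best_t, beta)  # → 0.5 1.0
```

Recovering `β = 1` matches the primal solution for these two points, with the support vectors sitting exactly on `β x = ±1`.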

## My Certificate

For more on **Operations Research: The Theory for Regression and SVM**, please refer to the wonderful course here: https://www.coursera.org/learn/operations-research-theory


*I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai*

Don't forget to sign up for the newsletter, so you don't miss any chance to learn.

Or share what you've learned with friends!
