The theory of Operations Research has been used to develop models in many fields, such as statistics and machine learning.

## Linear Regression

We try to find `α` and `β` such that a line `g = α + β x` minimizes the sum of squared errors over all the data points:

min_{α,β} Σ^{n}_{i=1} [y_{i} - (α + β x_{i})]^{2}

Despite the word *linear* in “simple linear regression”, note that this is actually solving a *nonlinear* program with no constraints. Suppose the objective function is `f(α, β) = Σ^{n}_{i=1} [y_{i} - (α + β x_{i})]^{2}`, then its gradient and Hessian are:

∇f = ⎡ -2 Σ^{n}_{i=1} [y_{i} - (α + β x_{i})]       ⎤
     ⎣ -2 Σ^{n}_{i=1} [y_{i} - (α + β x_{i})] x_{i} ⎦
∇^{2}f = ⎡ 2n                  2 Σ^{n}_{i=1} x_{i}     ⎤
         ⎣ 2 Σ^{n}_{i=1} x_{i} 2 Σ^{n}_{i=1} x^{2}_{i} ⎦

Because `n > 0`, the first leading principal minor `2n` is positive, and the determinant `4[n Σ^{n}_{i=1} x^{2}_{i} - (Σ^{n}_{i=1} x_{i})^{2}]` is nonnegative by the Cauchy–Schwarz inequality, so the Hessian is positive semidefinite and the objective function is convex. Awesome. We have shown that this is an unconstrained convex program. Here we may use numerical algorithms to solve for the optimal α and β. However, an analytical approach gives us a closed-form formula:
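As a sanity check, we can compute these leading principal minors for a small made-up data set (a sketch in Python; the numbers are arbitrary):

```python
# Leading principal minors of the Hessian [[2n, 2Σx], [2Σx, 2Σx²]]
# for a made-up data set. det = 4[nΣx² - (Σx)²] ≥ 0 holds for any data
# by the Cauchy–Schwarz inequality.
xs = [0.5, 1.0, 2.0, 4.0]
n = len(xs)
h11 = 2 * n                        # ∂²f/∂α²
h12 = 2 * sum(xs)                  # ∂²f/∂α∂β
h22 = 2 * sum(x * x for x in xs)   # ∂²f/∂β²

minor1 = h11
minor2 = h11 * h22 - h12 * h12
print(minor1 > 0, minor2 >= 0)  # → True True
```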

```
∇f(α, β) = 0
⟹ -2 Σ^{n}_{i=1} [y_{i} - (α + β x_{i})] = 0, and
   -2 Σ^{n}_{i=1} [y_{i} - (α + β x_{i})] x_{i} = 0
⟹ n α + (Σ^{n}_{i=1} x_{i}) β = Σ^{n}_{i=1} y_{i}, and
   (Σ^{n}_{i=1} x_{i}) α + (Σ^{n}_{i=1} x^{2}_{i}) β = Σ^{n}_{i=1} x_{i} y_{i}
```

Now we have a linear system in the variables α and β; solving it gives the optimal solution directly.
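The closed-form solution is straightforward to implement. A minimal sketch (the helper name `fit_simple_ols` and the sample data are illustrative, not from the course) that solves the 2×2 normal equations by Cramer's rule:

```python
# Solve the 2x2 normal equations for simple linear regression by Cramer's
# rule. Helper name and sample data are illustrative, not from the course.
def fit_simple_ols(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # n*α + sx*β = sy  ;  sx*α + sxx*β = sxy
    det = n * sxx - sx * sx
    beta = (n * sxy - sx * sy) / det
    alpha = (sy - sx * beta) / n
    return alpha, beta

# Points on the line y = 1 + 2x are recovered exactly.
alpha, beta = fit_simple_ols([0, 1, 2, 3], [1, 3, 5, 7])
print(alpha, beta)  # → 1.0 2.0
```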

The same idea applies to multiple linear regression. Given a data set `{ x^{i}_{1}, x^{i}_{2}, ..., x^{i}_{p}, y^{i} }_{i=1,...,n}`, find `α, β_{1}, β_{2}, ..., β_{p}` to solve:

min_{α,β_j} Σ^{n}_{i=1} [y_{i} - (α + Σ^{p}_{j=1} β_{j} x^{i}_{j})]^{2}

Note one reason to define fitting error as the *sum of squared errors* `Σ^{n}_{i=1} [y_{i} - (α + β x_{i})]^{2}` rather than the *sum of absolute errors* `Σ^{n}_{i=1} |y_{i} - (α + β x_{i})|` is that the latter is not differentiable.

### Ridge and LASSO Regression

When you apply linear regression for prediction, you want to avoid **overfitting** by using regularization. In practice there are often hundreds (if not millions) of variables, and you usually have no idea which of them are really useful. Let `λ > 0` be the penalty for using variables, then we can define:

```
Ridge Regression:
min_{α,β_j} Σ^{n}_{i=1} [y^{i} - (α + β^{T} x^{i})]^{2} + λ Σ^{p}_{j=1} β^{2}_{j}
LASSO Regression:
min_{α,β_j} Σ^{n}_{i=1} [y^{i} - (α + β^{T} x^{i})]^{2} + λ Σ^{p}_{j=1} |β_{j}|
```

We want to minimize the sum of everything, while ideally hoping all the `β_{j}` are zero. The optimization will choose the most effective `β_{j}` and make them positive or negative, so that the sum of squared errors is minimized without incurring too large a penalty.

Note that both models are **unconstrained convex programs**.
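Since ridge regression is an unconstrained convex program, plain gradient descent will find its optimum. A minimal one-feature sketch (the function name, data, step size and iteration count are illustrative choices, not from the course):

```python
# Ridge regression with one feature, solved by gradient descent on the
# unconstrained convex objective Σ[y - (α + βx)]² + λβ².
# Data, step size and iteration count are arbitrary illustrative choices.
def fit_ridge_1d(xs, ys, lam, lr=0.01, steps=5000):
    alpha, beta = 0.0, 0.0
    for _ in range(steps):
        r = [y - (alpha + beta * x) for x, y in zip(xs, ys)]  # residuals
        g_a = -2 * sum(r)                                                # ∂/∂α
        g_b = -2 * sum(ri * x for ri, x in zip(r, xs)) + 2 * lam * beta  # ∂/∂β
        alpha -= lr * g_a
        beta -= lr * g_b
    return alpha, beta

# With λ = 0 this reduces to ordinary least squares: approximately (1.0, 2.0).
alpha, beta = fit_ridge_1d([0, 1, 2, 3], [1, 3, 5, 7], lam=0.0)
```

Increasing `λ` shrinks `β` toward zero, which is exactly the regularization effect described above.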

## Support Vector Machine

Classification is an important subject in machine learning. Given a data set `{ x^{i}_{1}, x^{i}_{2}, ..., x^{i}_{n}, y^{i} }_{i=1,...,m}` where `y^{i} ∈ {1, -1}`, we want to find a classifier that assigns each data point `i` to a class according to the features `x^{i}`, minimizing the total number of classification errors.

In the 2-dimensional setting, i.e. `R^{2}`, linear classification is like drawing a straight line to separate a set of data points (with `y^{i} = 1`) from the others (with `y^{i} = -1`). In 3 dimensions, `R^{3}`, it is drawing a plane, and in higher dimensions `R^{n}` for n > 3, it is drawing a hyperplane.

The best line is the one that is **farthest** from both sets. A set’s *supporting hyperplane* is a hyperplane that separates all the data points in the set to one side of it. The data points that are closest to or on the supporting hyperplane are called *support vectors*. Support Vector Machine is a model/algorithm that tries to find the best separating hyperplane `α + β^{T} x = 0`, called the *classifier*.

We can classify a point as in set 1 if `α + β^{T} x ≥ 1`, or in set 2 if `α + β^{T} x ≤ -1`. It is equivalent to use `k` and `-k` instead of `1` and `-1`, since we may scale `α` and `β` in any way we like.

Next we want to maximize the distance between the two **supporting** hyperplanes. Suppose `x^{1}` and `x^{2}` are two support vectors on the supporting hyperplanes of set 1 and set 2 respectively. Then the distance is the projection of `x^{1} - x^{2}` onto the normal vector of the *separating* hyperplane.

Recall that when projecting a vector `a ∈ R^{n}` onto a vector `w ∈ R^{n}`, the projection gives a new vector `a_{w} = α w`, where `α ∈ R` is a scalar. We then have two equations about the norms (lengths) of the vectors `a`, `w` and `a_{w}`:

∥a_{w}∥ = ∥a∥ cosθ
∥a_{w}∥ = ∥w∥ α

They imply `α = ∥a∥ cosθ / ∥w∥`. It then follows that:

a_{w} = α w
      = (∥a∥ cosθ / ∥w∥) w
      = (∥a∥ (a^{T}w/∥a∥∥w∥) / ∥w∥) w
      = (a^{T}w/∥w∥^{2}) w
∥a_{w}∥ = a^{T}w/∥w∥
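A quick numeric check of the projection formula (the vectors are made up for illustration):

```python
# Numeric check of a_w = (aᵀw / ∥w∥²) w with made-up vectors.
a, w = [3.0, 4.0], [1.0, 0.0]
coef = sum(ai * wi for ai, wi in zip(a, w)) / sum(wi * wi for wi in w)
a_w = [coef * wi for wi in w]
print(a_w)  # → [3.0, 0.0]  (projection of (3, 4) onto the x-axis)
```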

Going back to the support vector machine, `a = x^{1} - x^{2}` and `w = β`, so the distance is `(x^{1} - x^{2})^{T}β / ∥β∥`. Finally the objective function and constraints are:

max_{α,β} (x^{1} - x^{2})^{T}β / ∥β∥
s.t. y^{i}(α + β^{T} x^{i}) ≥ 1 for all i = 1,...,m
⟹ α + β^{T} x^{i} ≥ 1 when y^{i} = 1
⟹ α + β^{T} x^{i} ≤ -1 when y^{i} = -1

We could further simplify the objective function. When we plug `x^{1}` and `x^{2}` into `α + β^{T} x`, we exactly have:

α + β^{T} x^{1} = 1
α + β^{T} x^{2} = -1
(x^{1} - x^{2})^{T}β = β^{T}x^{1} - β^{T}x^{2}
                     = (1 - α) - (-1 - α) = 2

The objective function finally becomes `max_{α,β} 2/∥β∥`. Instead of maximization, we could do `min_{α,β} ∥β∥/2` (and then drop the square root, which does not change the minimizer):

min_{α,β} ∥β∥/2
⟹ min_{α,β} √(β^{2}_{1} + β^{2}_{2} + ... + β^{2}_{n})/2
⟹ min_{α,β} (β^{2}_{1} + β^{2}_{2} + ... + β^{2}_{n})/2
⟹ min_{α,β} 1/2 Σ^{n}_{k=1} β^{2}_{k}
s.t. y^{i}(α + β^{T} x^{i}) ≥ 1 for all i = 1,...,m

This is a convex program! Awesome.
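We can check the margin derivation numerically: for support vectors on `α + β^{T} x = ±1`, the projected distance equals `2/∥β∥`. The hyperplane and points below are made up for illustration:

```python
# Check: for support vectors on α + βᵀx = ±1, the projected distance
# (x¹ - x²)ᵀβ / ∥β∥ equals 2 / ∥β∥. Hyperplane and points are made up.
import math

alpha, beta = -1.0, [1.0, 1.0]   # hyperplane x₁ + x₂ = 1
x1 = [1.0, 1.0]                  # satisfies α + βᵀx¹ = +1
x2 = [0.0, 0.0]                  # satisfies α + βᵀx² = -1
dot = sum(b * (u - v) for b, u, v in zip(beta, x1, x2))
norm = math.sqrt(sum(b * b for b in beta))
print(dot / norm, 2 / norm)  # both equal √2
```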

### Imperfect Separation

Given a separating hyperplane `α + β^{T} x = 0`, ideally we have `y^{i}(α + β^{T} x^{i}) ≥ 1` satisfied for each data point `i`. However, in practice perfect separation is usually impossible, which means the constraint `y^{i}(α + β^{T} x^{i}) ≥ 1` will be violated. We need to allow errors, by adding a “degree of error” `γ^{i} ≥ 0` into the objective function.

min_{α,β,γ} 1/2 Σ^{n}_{k=1} β^{2}_{k} + C Σ^{m}_{i=1} γ^{i}
s.t. y^{i}(α + β^{T} x^{i}) ≥ 1 - γ^{i} for all i = 1,...,m
where γ^{i} ≥ 0, C ≥ 0

`C` is a given parameter: the larger it is, the greater the penalty. Again, this is a convex program.
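At the optimum, `γ^{i} = max(0, 1 - y^{i}(α + β^{T} x^{i}))`, so the program can be rewritten as an unconstrained hinge-loss problem that a simple subgradient method can solve. A minimal one-feature sketch with made-up data and arbitrary parameter choices:

```python
# Soft-margin SVM with one feature, rewritten with γⁱ = max(0, 1 - yⁱ(α + βxⁱ))
# as the unconstrained problem min (1/2)β² + C Σ max(0, 1 - y(α + βx)),
# solved by subgradient descent. Data, C, step size and steps are illustrative.
def fit_svm_1d(xs, ys, C=10.0, lr=0.001, steps=20000):
    alpha, beta = 0.0, 0.0
    for _ in range(steps):
        g_a, g_b = 0.0, beta                   # gradient of (1/2)β²
        for x, y in zip(xs, ys):
            if y * (alpha + beta * x) < 1:     # hinge active for this point
                g_a -= C * y
                g_b -= C * y * x
        alpha -= lr * g_a
        beta -= lr * g_b
    return alpha, beta

# Separable data: class -1 at x ≤ -1, class +1 at x ≥ 1.
# The widest margin is near α = 0, β = 1 (support vectors at x = ±1).
alpha, beta = fit_svm_1d([-2.0, -1.0, 1.0, 2.0], [-1, -1, 1, 1])
```

Subgradient descent only oscillates around the optimum with a constant step size, so the result is approximate rather than exact.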

### Dualization of SVM

Researchers who studied the SVM problem found that the dual program of SVM is easier to solve. We can get the dual program by using Lagrange duality. Let:

- `λ^{i} ≥ 0` be the Lagrange multiplier for the constraint of imperfect separation
- `μ^{i} ≥ 0` be the Lagrange multiplier for the constraint that `γ^{i} ≥ 0`

Then the Lagrangian of the SVM program is:

```
L(α,β,γ|λ,μ) = (1/2) Σ^{n}_{k=1} β^{2}_{k} + C Σ^{m}_{i=1} γ^{i}
               - Σ^{m}_{i=1} λ^{i} [y^{i} (α + Σ^{n}_{k=1} x^{i}_{k} β_{k}) - 1 + γ^{i}]
               - Σ^{m}_{i=1} μ^{i} γ^{i}
```

Then the Lagrange dual program chooses λ and μ to maximize the outcome of the inner minimization of the Lagrangian, which always gives us a lower bound.

max_{λ≥0,μ≥0} min_{α,β,γ} L(α,β,γ|λ,μ)

The first order condition (FOC) of the Lagrangian is necessary and sufficient. Taking the derivative of `L(α, β, γ | λ, μ)` with respect to `α`, `β_{k}`, `γ^{i}` respectively gives us:

∂L/∂α = 0:     Σ^{m}_{i=1} λ^{i} y^{i} = 0
∂L/∂β_{k} = 0: β_{k} = Σ^{m}_{i=1} λ^{i} y^{i} x^{i}_{k}
∂L/∂γ^{i} = 0: C = λ^{i} + μ^{i}

Note that no primal variable `α, β_{k}, γ^{i}` appears in the conditions from `∂L/∂α` and `∂L/∂γ^{i}`, so when you want to minimize `L(α,β,γ|λ,μ)`, you must choose `λ` and `μ` so that these two conditions hold, otherwise `min_{α,β,γ} L(α,β,γ|λ,μ)` will be unbounded (i.e. negative infinity). Since the conditions from `∂L/∂α` and `∂L/∂γ^{i}` must be met, plugging in the condition from `∂L/∂β_{k}`, `L(α,β,γ|λ,μ)` can be simplified:

```
L(α,β,γ|λ,μ)
= (1/2) Σ^{n}_{k=1} β^{2}_{k} + C Σ^{m}_{i=1} γ^{i}
  - Σ^{m}_{i=1} λ^{i} [y^{i} (α + Σ^{n}_{k=1} x^{i}_{k} β_{k}) - 1 + γ^{i}]
  - Σ^{m}_{i=1} μ^{i} γ^{i}
= (1/2) Σ^{n}_{k=1} β^{2}_{k}
  - Σ^{m}_{i=1} λ^{i} y^{i} Σ^{n}_{k=1} x^{i}_{k} β_{k}
  + Σ^{m}_{i=1} λ^{i}
  (plugging in ∂L/∂β_{k})
= (1/2) Σ^{n}_{k=1} (Σ^{m}_{j=1} λ^{j} y^{j} x^{j}_{k})^{2}
  - Σ^{m}_{i=1} λ^{i} y^{i} Σ^{n}_{k=1} x^{i}_{k} (Σ^{m}_{j=1} λ^{j} y^{j} x^{j}_{k})
  + Σ^{m}_{i=1} λ^{i}
= (-1/2) Σ^{m}_{i=1} Σ^{m}_{j=1} λ^{i} λ^{j} y^{i} y^{j} (x^{i})^{T}x^{j}
  + Σ^{m}_{i=1} λ^{i}
```

After the simplification above, the Lagrangian dual program now becomes:

max_{λ,μ} (-1/2) Σ^{m}_{i=1} Σ^{m}_{j=1} λ^{i} λ^{j} y^{i} y^{j} (x^{i})^{T}x^{j} + Σ^{m}_{i=1} λ^{i}
s.t. Σ^{m}_{i=1} λ^{i} y^{i} = 0
C = λ^{i} + μ^{i} ∀i = 1,...,m
λ^{i} ≥ 0, μ^{i} ≥ 0 ∀i = 1,...,m

In fact the last two constraints are telling you `λ^{i} ≤ C`: you can always adjust `μ^{i}` to make this happen. Now the final dual program (`μ` is eliminated) is:

max_{λ} (-1/2) Σ^{m}_{i=1} Σ^{m}_{j=1} λ^{i} λ^{j} y^{i} y^{j} (x^{i})^{T}x^{j} + Σ^{m}_{i=1} λ^{i}
s.t. Σ^{m}_{i=1} λ^{i} y^{i} = 0
0 ≤ λ^{i} ≤ C ∀i = 1,...,m

Now you only have one set of variables `λ`, which must satisfy the equality constraint and lie between 0 and C.

Both primal and dual programs are constrained nonlinear programs. But why is the dual program easier to solve than the primal?

|  | Primal | Dual |
| --- | --- | --- |
| Variables | `α` (1 variable), `β` (n variables, usually very large), `γ` (m variables) | `λ` (m variables) |
| Constraints | Complicated: `y^{i}(α + β^{T} x^{i}) ≥ 1 - γ^{i}` (m constraints), `γ^{i} ≥ 0` (m constraints) | Simple: `Σ^{m}_{i=1} λ^{i} y^{i} = 0` (1 constraint), `λ^{i} ≥ 0` (m constraints), `λ^{i} ≤ C` (m constraints) |
| Objective function | Convex | Convex |
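As a tiny sanity check of the dual, with two symmetric points the equality constraint forces `λ^{1} = λ^{2}`, so the dual collapses to a one-dimensional concave maximization we can scan numerically (the data and grid resolution are illustrative):

```python
# Two-point sanity check of the dual. With y = (1, -1) the constraint
# Σ λⁱyⁱ = 0 forces λ¹ = λ² = t, so the dual is a concave function of t
# alone, maximized here by a grid scan. Data and grid step are illustrative.
xs, ys = [1.0, -1.0], [1, -1]
C = 10.0

def dual_obj(t):
    lam = [t, t]
    q = sum(lam[i] * lam[j] * ys[i] * ys[j] * xs[i] * xs[j]
            for i in range(2) for j in range(2))
    return -0.5 * q + sum(lam)

best_t = max((k * 0.001 for k in range(int(C * 1000) + 1)), key=dual_obj)
beta = sum(best_t * y * x for x, y in zip(xs, ys))  # β = Σ λⁱ yⁱ xⁱ
print(best_t, beta)  # → 0.5 1.0
```

Recovering `β = 1` matches the primal solution for these two points, with the support vectors sitting exactly on `β x = ±1`.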

## My Certificate

For more on **Operations Research: The Theory for Regression and SVM**, please refer to the wonderful course here: https://www.coursera.org/learn/operations-research-theory


*I am Kesler Zhu, thank you for visiting my website. Check out more course reviews at https://KZHU.ai*

Don't forget to sign up for the newsletter, so you don't miss any chance to learn.

Or share what you've learned with friends!
