Least Squares
Wanho Choi
(wanochoi.com)
Linear Least Squares Problem

Ax = b, \qquad A : m \times n, \quad x : n \times 1, \quad b : m \times 1

m > n : (# of equations) > (# of unknowns)

over-determined system: more equations than unknowns, so in general no exact solution exists
No Exact Solution → The Nearest Solution

Ax \approx b \quad \Leftrightarrow \quad \hat{x} = \arg\min_x \|Ax - b\|^2

residual: r = Ax - b

A : m \times n, \quad x : n \times 1, \quad b : m \times 1
How to Solve

• Normal equations
• QR decomposition
• SVD (Singular Value Decomposition)

The three methods form a spectrum: the normal equations are the fastest and cheapest, SVD is the most accurate and reliable, and QR sits in between.
Normal Equation

\|r\|_2^2 = r^T r
= (Ax - b)^T (Ax - b)
= (x^T A^T - b^T)(Ax - b)
= x^T A^T A x - x^T A^T b - b^T A x + b^T b
= x^T A^T A x - x^T A^T b - x^T A^T b + b^T b \quad (\because b^T A x \text{ is a scalar, so } b^T A x = (b^T A x)^T = x^T A^T b)
= x^T A^T A x - 2 x^T A^T b + b^T b

Setting the gradient with respect to x to zero:

\frac{\partial}{\partial x} \|r\|_2^2 = 2 A^T A x - 2 A^T b = 0

\therefore A^T A x = A^T b : the normal equations
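As a numerical sanity check of this derivation, here is a minimal Eigen 3 sketch; the 3×2 matrix reuses the example that appears in the SVD slides below, and the right-hand side b is an arbitrary assumption. It solves the normal equations with an LDLT (Cholesky-type) factorization and verifies that the gradient vanishes at the minimizer.

#include <iostream>
#include <Eigen/Dense>

int main()
{
    // An over-determined system: m = 3 equations, n = 2 unknowns.
    Eigen::MatrixXd A( 3, 2 );
    A << 1, 1,
         0, 1,
        -1, 1;
    Eigen::VectorXd b( 3 );          // assumed right-hand side
    b << 2, 0, 1;

    // Normal equations: solve (A^T A) x = A^T b.
    // LDLT is a Cholesky-type factorization suited to the symmetric matrix A^T A.
    Eigen::VectorXd x = ( A.transpose() * A ).ldlt().solve( A.transpose() * b );

    std::cout << "x =\n" << x << "\n";

    // The gradient 2 A^T A x - 2 A^T b should be numerically zero at the minimizer.
    std::cout << "gradient norm = "
              << ( 2.0 * A.transpose() * A * x - 2.0 * A.transpose() * b ).norm()
              << std::endl;
    return 0;
}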
Geometric Derivation

The residual r = b - Ax is minimized when p = Ax is the orthogonal projection of b onto the column space of A, i.e. when r is orthogonal to every vector in that space. Any vector of the form Ay lies in the column space of A:

Ay = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = y_1 \begin{bmatrix} a_{11} \\ a_{21} \\ \vdots \\ a_{m1} \end{bmatrix} + y_2 \begin{bmatrix} a_{12} \\ a_{22} \\ \vdots \\ a_{m2} \end{bmatrix} + \cdots + y_n \begin{bmatrix} a_{1n} \\ a_{2n} \\ \vdots \\ a_{mn} \end{bmatrix}

Requiring orthogonality for every y:

(Ay) \cdot (b - Ax) = (Ay)^T (b - Ax) = y^T A^T (b - Ax) = 0 \quad \forall y

\therefore A^T (b - Ax) = 0 \quad \Leftrightarrow \quad A^T A x = A^T b
Pseudo Inverse

Starting from the normal equations and assuming A^T A is invertible:

A^T A x = A^T b
(A^T A)^{-1} (A^T A) x = (A^T A)^{-1} A^T b
x = (A^T A)^{-1} A^T b

The projection of b onto the column space of A is then

p = Ax = A (A^T A)^{-1} A^T b

The matrix (A^T A)^{-1} A^T is called the pseudo inverse of A.
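The formula can be exercised directly with Eigen 3; this is a sketch on the same assumed example system as above, forming the explicit inverse for illustration only (a factorization-based solve is preferable in real code).

#include <iostream>
#include <Eigen/Dense>

int main()
{
    Eigen::MatrixXd A( 3, 2 );       // same assumed 3x2 example as before
    A << 1, 1,
         0, 1,
        -1, 1;
    Eigen::VectorXd b( 3 );          // assumed right-hand side
    b << 2, 0, 1;

    // Pseudo inverse (A^T A)^{-1} A^T, formed explicitly for illustration.
    Eigen::MatrixXd Aplus = ( A.transpose() * A ).inverse() * A.transpose();

    Eigen::VectorXd x = Aplus * b;   // least-squares solution
    Eigen::VectorXd p = A * x;       // projection of b onto the column space of A

    std::cout << "x =\n" << x << "\np =\n" << p << std::endl;
    return 0;
}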
Condition Number of a Function

It measures:
• how much the output value changes for a small change of the input
• how sensitive the output is to changes or errors in the input
• how much error in the output results from an error in the input

For a matrix A:

\mathrm{cond}(A) = \kappa = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)} \qquad \sigma_i(A) : \text{the } i\text{-th singular value of } A

The relative error in the solution of a perturbed linear system is bounded by

\frac{\|\Delta x\|}{\|x\|} \le \mathrm{cond}(A) \, \frac{\|\Delta A\|}{\|A\|}
Condition Number of a Function

• Well-conditioned: low condition number
• Ill-conditioned: high condition number
• The normal equations are sometimes ill-conditioned, since forming A^T A squares the condition number:

A = \begin{bmatrix} 1 + 10^{-10} & -1 \\ -1 & 1 \end{bmatrix}

A^T A = \begin{bmatrix} 2 + 2 \cdot 10^{-10} + 10^{-20} & -2 - 10^{-10} \\ -2 - 10^{-10} & 2 \end{bmatrix}

\mathrm{cond}(A^T A) = \mathrm{cond}(A)^2
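The sketch below (assuming Eigen 3) computes both condition numbers for this example from singular values. In double precision the 10^{-20} term is lost when A^T A is formed, so A^T A comes out numerically singular, which is exactly the hazard this slide illustrates.

#include <iostream>
#include <Eigen/Dense>

int main()
{
    const double eps = 1e-10;

    // The nearly singular matrix from the slide.
    Eigen::Matrix2d A;
    A << 1.0 + eps, -1.0,
        -1.0,        1.0;

    // cond(M) = sigma_max / sigma_min, computed from the singular values.
    auto cond = []( const Eigen::MatrixXd& M )
    {
        Eigen::JacobiSVD<Eigen::MatrixXd> svd( M );
        const Eigen::VectorXd& s = svd.singularValues();
        return s( 0 ) / s( s.size() - 1 );
    };

    std::cout << "cond(A)     = " << cond( A )                 << "\n";
    std::cout << "cond(A^T A) = " << cond( A.transpose() * A ) << std::endl;
    return 0;
}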
QR Decomposition

Decompose A = QU, where Q is orthogonal and U is upper triangular. Since orthogonal transformations preserve the Euclidean norm:

\|r\|_2^2 = \|Ax - b\|_2^2 = \|QUx - b\|_2^2 = \|Q^T Q U x - Q^T b\|_2^2 = \|Ux - Q^T b\|_2^2

\therefore Ux = Q^T b

U = \begin{bmatrix} * & * & \cdots & * \\ 0 & * & \cdots & * \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & * \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} \quad (m \times n : \text{the top } n \text{ rows are upper triangular, the bottom } m - n \text{ rows are zero})

Only the top n rows of Ux = Q^T b can be satisfied exactly; back-substitution on that triangular block gives the least-squares solution.
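Here is a sketch of carrying out Ux = Q^T b explicitly with Eigen's Householder QR, on the same assumed example system; in practice A.colPivHouseholderQr().solve(b) (used in the final example) does all of this internally.

#include <iostream>
#include <Eigen/Dense>

int main()
{
    Eigen::MatrixXd A( 3, 2 );       // same assumed 3x2 example as before
    A << 1, 1,
         0, 1,
        -1, 1;
    Eigen::VectorXd b( 3 );          // assumed right-hand side
    b << 2, 0, 1;

    const Eigen::Index m = A.rows(), n = A.cols();

    // A = QU with Q orthogonal and U upper triangular.
    Eigen::HouseholderQR<Eigen::MatrixXd> qr( A );

    // Thin Q: the first n columns of the full m x m Q.
    Eigen::MatrixXd Q = qr.householderQ() * Eigen::MatrixXd::Identity( m, n );

    // The top n x n block of the packed factorization holds the triangular U;
    // solve U x = Q^T b by back-substitution.
    Eigen::VectorXd x = qr.matrixQR()
                          .topLeftCorner( n, n )
                          .triangularView<Eigen::Upper>()
                          .solve( Q.transpose() * b );

    std::cout << "x =\n" << x << std::endl;
    return 0;
}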
Singular Value Decomposition

A = USV^T, \qquad A : m \times n, \quad U : m \times m, \quad S : m \times n, \quad V^T : n \times n

ex)

\begin{bmatrix} 1 & 1 \\ 0 & 1 \\ -1 & 1 \end{bmatrix} = \begin{bmatrix} 0.5774 & 0.7071 & 0.4082 \\ 0.5774 & 0 & -0.8165 \\ 0.5774 & -0.7071 & 0.4082 \end{bmatrix} \begin{bmatrix} 1.7321 & 0 \\ 0 & 1.4142 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}

Singular Value Decomposition

A = USV^T, \qquad S = \begin{bmatrix} D & 0 \\ 0 & 0 \end{bmatrix}, \qquad D = \begin{bmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_r \end{bmatrix}

\sigma_i(A) : \text{the } i\text{-th singular value of } A, \quad i = 1, \ldots, r

r = \mathrm{rank}(A); \quad r = \min(m, n) \text{ if } A \text{ is full rank}
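The worked example above can be reproduced with Eigen's JacobiSVD, as in this sketch; the signs of corresponding columns of U and V may differ from the slide, since an SVD is unique only up to such paired sign flips.

#include <iostream>
#include <Eigen/Dense>

int main()
{
    // The 3x2 example matrix from the slide.
    Eigen::MatrixXd A( 3, 2 );
    A << 1, 1,
         0, 1,
        -1, 1;

    // Full SVD: A = U S V^T with U (3x3), S (3x2), V (2x2).
    Eigen::JacobiSVD<Eigen::MatrixXd> svd( A, Eigen::ComputeFullU | Eigen::ComputeFullV );

    std::cout << "singular values:\n" << svd.singularValues() << "\n";  // ~1.7321, ~1.4142
    std::cout << "U =\n" << svd.matrixU() << "\n";
    std::cout << "V =\n" << svd.matrixV() << std::endl;
    return 0;
}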
Singular Value Decomposition

A = USV^T, \qquad U : m \times m, \quad S : m \times n, \quad V^T : n \times n

S : the singular values of A = the square roots of the eigenvalues of AA^T (or A^T A)
U : the left singular vectors of A = the eigenvectors of AA^T
V : the right singular vectors of A = the eigenvectors of A^T A

AA^T = (USV^T)(USV^T)^T = (USV^T)(VS^T U^T) = U S V^T V S^T U^T = U S^2 U^T
A^T A = (USV^T)^T (USV^T) = (V S^T U^T)(USV^T) = V S^T U^T U S V^T = V S^2 V^T

(Here S^2 is shorthand for the square diagonal matrix of squared singular values.) The shared eigenvalues satisfy

\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > \lambda_{r+1} = \cdots = \lambda_n = 0
EVD vs SVD

Singular value decomposition lets us decompose any matrix A as U\Sigma V^T, where U and V are orthogonal and \Sigma is a diagonal matrix whose non-zero entries are the square roots of the eigenvalues of A^T A. The columns of U and V give bases for the four fundamental subspaces.

If A is symmetric, A is diagonalizable: we can find an orthogonal matrix Q such that A = Q\Lambda Q^T, where \Lambda is a diagonal matrix holding the eigenvalues of A. This is called the eigenvalue decomposition.

Furthermore, when A is symmetric and positive semi-definite (so every \lambda \ge 0):

A = Q\Lambda Q^T = Q\Lambda^{1/2}\Lambda^{1/2}Q^T = Q\Lambda^{1/2}Q^T Q\Lambda^{1/2}Q^T = A^{1/2}A^{1/2}

A^T A = (Q\Lambda Q^T)^T Q\Lambda Q^T = Q\Lambda Q^T Q\Lambda Q^T = Q\Lambda^2 Q^T
AA^T = Q\Lambda Q^T (Q\Lambda Q^T)^T = Q\Lambda Q^T Q\Lambda Q^T = Q\Lambda^2 Q^T
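A small sketch of this relationship on the running example: the eigenvalues of the symmetric PSD matrix A^T A equal the squared singular values of A (Eigen reports eigenvalues in ascending and singular values in descending order, so one list is the reverse of the other).

#include <iostream>
#include <Eigen/Dense>

int main()
{
    Eigen::MatrixXd A( 3, 2 );       // same assumed 3x2 example as before
    A << 1, 1,
         0, 1,
        -1, 1;

    // Eigenvalues of A^T A via the eigenvalue decomposition (EVD)...
    Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> evd( A.transpose() * A );
    std::cout << "eigenvalues of A^T A:\n" << evd.eigenvalues() << "\n";

    // ...match the squared singular values of A from the SVD.
    Eigen::JacobiSVD<Eigen::MatrixXd> svd( A );
    std::cout << "squared singular values of A:\n"
              << svd.singularValues().array().square() << std::endl;
    return 0;
}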
Pseudo Inverse

A^+ = (A^T A)^{-1} A^T = V S^{-1} U^T

A = USV^T, \qquad U : m \times m, \quad S : m \times n, \quad V^T : n \times n

From the normal equations A^T A x = A^T b:

\therefore x = (A^T A)^{-1} A^T b

(A^T A)^{-1} A^T = \left( (USV^T)^T (USV^T) \right)^{-1} (USV^T)^T
= \left( V S^T U^T U S V^T \right)^{-1} (V S^T U^T)
= (V S^2 V^T)^{-1} (V S^T U^T)
= V^{-T} S^{-2} V^{-1} V S^T U^T
= V S^{-1} U^T \quad (\because V \text{ is orthogonal: } V^{-1} = V^T \text{ and } V^{-T} = V)
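A sketch of assembling A^+ = V S^{-1} U^T from a thin SVD of the running example; for a rank-deficient A one would invert only the singular values above a tolerance and zero out the rest, which is what makes the SVD route more robust than (A^T A)^{-1} A^T.

#include <iostream>
#include <Eigen/Dense>

int main()
{
    Eigen::MatrixXd A( 3, 2 );       // same assumed 3x2 example as before
    A << 1, 1,
         0, 1,
        -1, 1;

    // Thin SVD: A = U S V^T with U (3x2), S (2x2 diagonal), V (2x2).
    Eigen::JacobiSVD<Eigen::MatrixXd> svd( A, Eigen::ComputeThinU | Eigen::ComputeThinV );

    // A^+ = V S^{-1} U^T (all singular values are non-zero here).
    Eigen::MatrixXd Aplus = svd.matrixV()
                          * svd.singularValues().cwiseInverse().asDiagonal()
                          * svd.matrixU().transpose();

    // Equals (A^T A)^{-1} A^T because A has full column rank.
    std::cout << "A^+ =\n" << Aplus << std::endl;
    return 0;
}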
Comparison

• The standard recommendation for linear least-squares is to use QR factorization.
• These days QR seems to be the consensus for most practical problems.
• Cholesky (solving the normal equations) is cheaper, but suffers from "conditioning squared": it operates on A^T A, whose condition number is cond(A)^2, rather than on A itself.
• SVD has some numerical advantages, but is more costly than QR decomposition.
Eigen3 Example
2x - y = 2
x - 3y = -3
2x + y = 2

\begin{bmatrix} 2 & -1 \\ 1 & -3 \\ 2 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 2 \\ -3 \\ 2 \end{bmatrix}

Eigen3 Example
#include <iostream>
#include <Eigen/Dense>

int main( int argc, char* argv[] )
{
    Eigen::MatrixXd A( 3, 2 );
    A(0,0) = 2;  A(0,1) = -1;
    A(1,0) = 1;  A(1,1) = -3;
    A(2,0) = 2;  A(2,1) = 1;

    Eigen::VectorXd b( 3 );
    b(0) = 2;  b(1) = -3;  b(2) = 2;

    Eigen::VectorXd x( 2 );

    // Three equivalent ways to solve the same least-squares problem;
    // each assignment overwrites x, so only the last result is printed.
    x = ( A.transpose() * A ).ldlt().solve( A.transpose() * b );             // normal equations
    x = A.colPivHouseholderQr().solve( b );                                  // QR decomposition
    x = A.jacobiSvd( Eigen::ComputeThinU | Eigen::ComputeThinV ).solve( b ); // SVD

    std::cout << x << std::endl;

    return 0;
}
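All three solves compute the same minimizer (approximately x ≈ 0.911, y ≈ 1.067 for this system); since each assignment overwrites x, only the SVD result is printed here. In practice you would pick a single method, typically QR, as the Comparison slide recommends.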