semi-supervised learning
SSL
Outline of Lecture (1)
Graph Spectral Theory
Definition of Graph
Graph Laplacian
Laplacian Smoothing
Graph Node Clustering
Minimum Graph Cut
Ratio Graph Cut
Normalized Graph Cut
Manifold Learning
Spectral Analysis in Riemannian Manifolds
Dimension Reduction, Node Embedding
Semi-supervised Learning (SSL)
Self-Training Methods
SSL with SVM
SSL with Graph using MinCut
Semi-supervised Learning (SSL): continued
SSL with Graph using Regularized Harmonic Functions
SSL with Graph using Soft Harmonic Functions
SSL with Graph using Manifold Regularization
SSL with Graph using Laplacian SVMs
SSL with Graph using Max-Margin Graph Cuts
Online SSL and SSL for large graph
Graph Convolution Networks (GCN)
Graph Filtering in GCN
Graph Pooling in GCN
Spectral Filtering in GCN
Spatial Filtering in GCN
SSL: Classical SVM (Review)
[Figure: maximal-margin separating hyperplane; the decision boundary is $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b = 0$, with $f(\mathbf{x}) > 0$ on one side, $f(\mathbf{x}) < 0$ on the other, and the margin between them.]
Maximal Margin Hyperplane = Optimal Hyperplane
SSL: Classical SVM (Review)
max-margin classification, separable case:
$$\min_{\mathbf{w},b} \|\mathbf{w}\|^2 \quad \text{s.t. } y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1, \;\forall i = 1,\dots,n$$
max-margin classification, non-separable case:
$$\min_{\mathbf{w},b} \lambda\|\mathbf{w}\|^2 + \sum_i \xi_i \quad \text{s.t. } y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0, \;\forall i = 1,\dots,n$$
SSL: Classical SVM (Review)
max-margin classification, non-separable case:
$$\min_{\mathbf{w},b} \lambda\|\mathbf{w}\|^2 + \sum_i \xi_i \quad \text{s.t. } y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0, \;\forall i = 1,\dots,n$$
Unconstrained formulation using hinge loss:
$$\min_{\mathbf{w},b} \lambda\|\mathbf{w}\|^2 + \sum_i \max(1 - y_i(\mathbf{w}^T\mathbf{x}_i + b),\, 0)$$
General formulation:
$$\min \; \lambda\,\Omega(f(\mathbf{w},b)) + \sum_i \Phi(\mathbf{x}_i, y_i, f(\mathbf{w},b;\mathbf{x}_i))$$
SSL: Classical SVM (Review)
hinge loss:
$$\Phi(\mathbf{x}_i, y_i, f(\mathbf{w},b;\mathbf{x}_i)) = \max(1 - y_i(\mathbf{w}^T\mathbf{x}_i + b),\, 0)$$
SSL: SVM: Unlabeled Examples
Unconstrained formulation using hinge loss:
$$\min_{\mathbf{w},b} \lambda\|\mathbf{w}\|^2 + \sum_{i=1}^{n_l} \max(1 - y_i(\mathbf{w}^T\mathbf{x}_i + b),\, 0)$$
How to incorporate unlabeled examples?
Prediction of $f$ for (any) $\mathbf{x}$: $\hat{y}_i = \mathrm{sgn}(f(\mathbf{x}_i)) = \mathrm{sgn}(\mathbf{w}^T\mathbf{x}_i + b)$
Use $\hat{y}_i$ instead of $y_i$:
$$\Phi(\mathbf{x}_i, \hat{y}_i, f(\mathbf{w},b;\mathbf{x}_i)) = \max(1 - \hat{y}_i(\mathbf{w}^T\mathbf{x}_i + b),\, 0)$$
$$= \max(1 - \mathrm{sgn}(\mathbf{w}^T\mathbf{x}_i + b)(\mathbf{w}^T\mathbf{x}_i + b),\, 0)$$
$$= \max(1 - |\mathbf{w}^T\mathbf{x}_i + b|,\, 0) \qquad \text{(hat loss)}$$
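As a quick illustration (not from the slides; the function names are mine), the two losses can be compared numerically:

```python
import numpy as np

def hinge_loss(y, f):
    """Hinge loss for a labeled example: penalizes predictions on the
    wrong side of the boundary or inside the margin."""
    return np.maximum(1 - y * f, 0)

def hat_loss(f):
    """Hat loss for an unlabeled example: the true label is replaced by
    sgn(f(x)), giving max(1 - |f(x)|, 0). It penalizes any point that
    falls inside the margin, regardless of its side."""
    return np.maximum(1 - np.abs(f), 0)

f = np.linspace(-2, 2, 9)
print(hinge_loss(+1, f))  # large when f < 1 for a positive example
print(hat_loss(f))        # peaks at f = 0, zero once |f| >= 1
```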
SSL: SVM: Unlabeled Examples
What is the difference in the objectives?
What does hinge loss penalize?
What does hat loss penalize?
SSL: SVM: Formulation
Formulation of SSL via SVM:
$$\min_{\mathbf{w},b} \;\sum_{i=1}^{n_l} \max(1 - y_i(\mathbf{w}^T\mathbf{x}_i + b),\, 0) + \lambda_1\|\mathbf{w}\|^2 + \lambda_2 \sum_{i=n_l+1}^{n_l+n_u} \max(1 - |\mathbf{w}^T\mathbf{x}_i + b|,\, 0)$$
The labeled-data term is the loss that fits the model to the labeled examples.
The unlabeled-data term works as a regularizer: it penalizes unlabeled points that fall inside the margin, pushing the decision boundary through low-density regions.
The term $\|\mathbf{w}\|^2$ works as a regularizer encouraging a large margin.
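A minimal NumPy sketch of this objective (the function and parameter names are mine; the math follows the formulation above). The hat-loss term makes the objective non-convex, which is why semi-supervised SVM solvers rely on heuristics:

```python
import numpy as np

def s3vm_objective(w, b, X_l, y_l, X_u, lam1, lam2):
    """Semi-supervised SVM objective: hinge loss on labeled data +
    margin regularizer + hat loss on unlabeled data.
    Rows of X_l / X_u are examples; y_l holds labels in {-1, +1}."""
    f_l = X_l @ w + b                              # margins on labeled data
    f_u = X_u @ w + b                              # margins on unlabeled data
    hinge = np.maximum(1 - y_l * f_l, 0).sum()     # labeled-data loss
    hat = np.maximum(1 - np.abs(f_u), 0).sum()     # unlabeled-data penalty
    return hinge + lam1 * (w @ w) + lam2 * hat
```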
semi-supervised learning
with graphs and harmonic functions
SSL(𝒢)
SSL with Graphs: Prehistory
Blum/Chawla: Learning from Labeled and Unlabeled Data using Graph Mincuts
http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/mincut.pdf
Some insights come from vision research in the 1980s
SSL with Graphs: MinCut
MinCut SSL: an idea similar to MinCut clustering
Where is the link?
What is the formal statement? We look for $f(\mathbf{x}_i) \in \{\pm 1\}$ minimizing
$$\mathrm{cut} = \frac{1}{2}\sum_{i,j=1}^{n_l+n_u} w_{ij}\,\big(f(\mathbf{x}_i) - f(\mathbf{x}_j)\big)^2 = \mathbf{f}^T\mathbf{L}\mathbf{f} = \Omega(f)$$
$\min_{f(\mathbf{x}_i):\,\mathbf{x}_i\in\mathcal{U}} \Omega(f)$ gives the minimal smoothness penalty, i.e. unsupervised clustering. What should we do for semi-supervised learning?
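The identity $\mathrm{cut} = \mathbf{f}^T\mathbf{L}\mathbf{f}$ can be checked numerically; the toy weight matrix below is made up for illustration:

```python
import numpy as np

# Toy symmetric weight matrix: two clusters joined by one weak edge.
W = np.array([[0.0, 2.0, 2.0, 0.1, 0.0],
              [2.0, 0.0, 2.0, 0.0, 0.0],
              [2.0, 2.0, 0.0, 0.0, 0.0],
              [0.1, 0.0, 0.0, 0.0, 2.0],
              [0.0, 0.0, 0.0, 2.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W                 # unnormalized graph Laplacian

f = np.array([1.0, 1.0, 1.0, -1.0, -1.0])      # candidate +/-1 labeling
cut = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2
                for i in range(5) for j in range(5))
assert np.isclose(f @ L @ f, cut)              # f^T L f equals the weighted cut
print(f @ L @ f)                               # 0.4: only the weak edge is cut
```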
SSL with Graphs: using $f(\boldsymbol{\theta};\mathbf{x})$
Inductive SSL with a graph: use a classifier $f(\boldsymbol{\theta};\mathbf{x}_i)$ with parameters $\boldsymbol{\theta}$:
$$\min_{\boldsymbol{\theta}} \;\lambda \sum_{i,j=1}^{n_l+n_u} w_{ij}\,\big(f(\boldsymbol{\theta};\mathbf{x}_i) - f(\boldsymbol{\theta};\mathbf{x}_j)\big)^2 + \gamma \sum_{i=1}^{n_l} \big(f(\boldsymbol{\theta};\mathbf{x}_i) - y_i\big)^2$$
General Formulation
Regularization: Laplacian smoothing
$$\Omega\big(\{f(\boldsymbol{\theta};\mathbf{x}_i)\}_{i=1}^{n_l+n_u}\big) = \frac{1}{2}\sum_{i,j=1}^{n_l+n_u} w_{ij}\,\big(f(\boldsymbol{\theta};\mathbf{x}_i) - f(\boldsymbol{\theta};\mathbf{x}_j)\big)^2 = \mathbf{f}^T\mathbf{L}\mathbf{f}$$
Loss:
$$\Phi\big(f(\boldsymbol{\theta};\mathbf{x}_i), y_i\big) = \big(f(\boldsymbol{\theta};\mathbf{x}_i) - y_i\big)^2 \quad \forall i \in \{1,\dots,n_l\}$$
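As a sketch, for a linear model $f(\boldsymbol{\theta};\mathbf{x}) = \boldsymbol{\theta}^T\mathbf{x}$ (my choice for illustration, not fixed by the slides) the objective can be evaluated as:

```python
import numpy as np

def inductive_ssl_objective(theta, X, W, y_l, lam, gamma):
    """Laplacian smoothness over all n_l + n_u points plus squared loss
    on the labeled points, assumed to be the first n_l rows of X."""
    n_l = len(y_l)
    f = X @ theta                            # f(theta; x_i) for every node
    L = np.diag(W.sum(axis=1)) - W           # graph Laplacian
    smoothness = f @ L @ f                   # = 1/2 sum_ij w_ij (f_i - f_j)^2
    loss = np.sum((f[:n_l] - y_l) ** 2)      # supervised squared loss
    return lam * smoothness + gamma * loss
```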
SSL with Graphs: Harmonic Functions
Transductive SSL with a graph: fixing $f(\mathbf{x}_i)$ for $i \in \mathcal{L}$
$$\min_{\mathbf{f}\in\{\pm 1\}^{n_l+n_u}} \;\lambda \sum_{i,j=1}^{n_l+n_u} w_{ij}\,\big(f(\mathbf{x}_i) - f(\mathbf{x}_j)\big)^2 + \infty \cdot \sum_{i=1}^{n_l} \big(f(\mathbf{x}_i) - y_i\big)^2$$
Solution:
An integer program: NP-hard
Can we use eigenvectors? No. Why?
We need a better way to reflect the confidence.
SSL with Graphs: Harmonic Functions
Relaxation: transductive SSL with a graph, fixing $f(\mathbf{x}_i)$ for $i \in \mathcal{L}$
$$\min_{\mathbf{f}\in\mathbb{R}^{n_l+n_u}} \;\lambda \sum_{i,j=1}^{n_l+n_u} w_{ij}\,\big(f(\mathbf{x}_i) - f(\mathbf{x}_j)\big)^2 + \infty \cdot \sum_{i=1}^{n_l} \big(f(\mathbf{x}_i) - y_i\big)^2$$
Naïve Solution
Right term solution: constrain $f$ to match the supervised data: $f(\mathbf{x}_i) = y_i \;\forall i \in \{1,\dots,n_l\}$
Left term solution: enforce the solution $f$ to be harmonic (cf. aggregation, random walk):
$$f(\mathbf{x}_i) = \frac{\sum_j f(\mathbf{x}_j)\,w_{ij}}{\sum_j w_{ij}} \quad \forall i \in \{n_l+1,\dots,n_l+n_u\}$$
How can we handle unlabeled data?
SSL with Graphs: Harmonic Functions
Properties of the relaxation from $\{\pm 1\}$ to $\mathbb{R}$:
There is a closed-form solution for $f$; it is unique and globally optimal.
$f(\mathbf{x}_i)$ may not be an integer, but we can threshold it.
It admits an electric-network interpretation and a random-walk interpretation.
SSL with Graphs: Harmonic Functions
Random walk interpretation:
1) Start from the vertex you want to label and walk randomly.
2) Transition probabilities: $P(j \mid i) = \dfrac{w_{ij}}{\sum_k w_{ik}} \iff \mathbf{P} = \mathbf{D}^{-1}\mathbf{W}$
3) Finish when a labeled vertex is hit.
4) $f(\mathbf{x}_i)$ is the average of the labels of the labeled vertices where such walks end; it satisfies the harmonic property $f(\mathbf{x}_i) = \frac{\sum_j f(\mathbf{x}_j)\,w_{ij}}{\sum_j w_{ij}}$.
SSL with Graphs: Harmonic Functions
Iterative Solution: propagation
Step 1: Set $f$ to match the supervised data: $f(\mathbf{x}_i) = y_i \;\forall i \in \{1,\dots,n_l\}$
Step 2: Propagate iteratively (only on unlabeled nodes):
$$f(\mathbf{x}_i) \leftarrow \frac{\sum_j f(\mathbf{x}_j)\,w_{ij}}{\sum_j w_{ij}} \quad \forall i \in \{n_l+1,\dots,n_l+n_u\}$$
Properties (see the sketch below):
this converges to the harmonic solution
the initial values of the unlabeled nodes can be set arbitrarily
an interesting option for large-scale data
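A minimal sketch of this propagation, assuming the labeled nodes occupy the first $n_l$ indices of the weight matrix (the helper name is mine):

```python
import numpy as np

def harmonic_propagation(W, y_l, n_iter=500):
    """Iterative harmonic solution: clamp the labeled nodes to their
    labels, then repeatedly replace each unlabeled value with the
    weighted average of its neighbors' values."""
    n_l = len(y_l)
    f = np.zeros(W.shape[0])          # arbitrary start for unlabeled nodes
    f[:n_l] = y_l                     # Step 1: match the supervised data
    d = W.sum(axis=1)                 # node degrees (assumed nonzero)
    for _ in range(n_iter):           # Step 2: propagate on unlabeled nodes
        f[n_l:] = (W[n_l:] @ f) / d[n_l:]
    return f
```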
SSL with Graphs: Harmonic Functions
Closed-form Solution:
Define $f_i \triangleq f(\mathbf{x}_i)$, $\mathbf{f} \triangleq [f_1,\dots,f_{n_l+n_u}]^T$:
$$\Omega(\mathbf{f}) = \frac{1}{2}\sum_{i,j=1}^{n_l+n_u} w_{ij}(f_i - f_j)^2 = \mathbf{f}^T\mathbf{L}\mathbf{f}$$
Then $\mathbf{L}$ is an $(n_l+n_u)\times(n_l+n_u)$ matrix in block form:
$$\mathbf{L} = \begin{pmatrix}\mathbf{L}_{ll} & \mathbf{L}_{lu}\\ \mathbf{L}_{ul} & \mathbf{L}_{uu}\end{pmatrix}$$
The problem becomes
$$\min_{\mathbf{f}} \Omega(\mathbf{f}) = \mathbf{f}_l^T\mathbf{L}_{ll}\mathbf{f}_l + \mathbf{f}_l^T\mathbf{L}_{lu}\mathbf{f}_u + \mathbf{f}_u^T\mathbf{L}_{ul}\mathbf{f}_l + \mathbf{f}_u^T\mathbf{L}_{uu}\mathbf{f}_u$$
The solution is obtained by setting the gradient with respect to $\mathbf{f}_u$ to zero:
$$\nabla_{\mathbf{f}_u}\Omega(\mathbf{f}) = 2\mathbf{L}_{ul}\mathbf{f}_l + 2\mathbf{L}_{uu}\mathbf{f}_u = \mathbf{0} \;\Longrightarrow\; \mathbf{f}_u = \mathbf{L}_{uu}^{-1}(-\mathbf{L}_{ul}\mathbf{f}_l) = \mathbf{L}_{uu}^{-1}\mathbf{W}_{ul}\mathbf{f}_l$$
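This translates directly into one linear solve; a sketch, again assuming the labeled nodes come first:

```python
import numpy as np

def harmonic_closed_form(W, y_l):
    """Closed-form harmonic solution: f_u = L_uu^{-1} (W_ul f_l)."""
    n_l = len(y_l)
    L = np.diag(W.sum(axis=1)) - W
    L_uu = L[n_l:, n_l:]              # unlabeled-unlabeled block
    W_ul = W[n_l:, :n_l]              # unlabeled-labeled block (= -L_ul)
    f_u = np.linalg.solve(L_uu, W_ul @ y_l)
    return np.concatenate([y_l, f_u])
```

On a connected graph with at least one labeled node, $\mathbf{L}_{uu}$ is positive definite, so the solve is well posed and agrees with the limit of the iterative propagation.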
SSL with Graphs: Harmonic Functions
Relation between the closed-form solution and the random walk:
$$\mathbf{f}_u = \mathbf{L}_{uu}^{-1}\mathbf{W}_{ul}\mathbf{f}_l, \qquad \mathbf{P} = \mathbf{D}^{-1}\mathbf{W}$$
Note that
$$\mathbf{P} = \mathbf{D}^{-1}\mathbf{W} \;\Rightarrow\; \mathbf{L} = \mathbf{D} - \mathbf{W} = \mathbf{D}(\mathbf{I} - \mathbf{D}^{-1}\mathbf{W}) = \mathbf{D}(\mathbf{I} - \mathbf{P})$$
This yields
$$\mathbf{f}_u = (\mathbf{I} - \mathbf{P}_{uu})^{-1}\mathbf{D}_{uu}^{-1}(\mathbf{D}_{uu}\mathbf{P}_{ul}\mathbf{f}_l) = (\mathbf{I} - \mathbf{P}_{uu})^{-1}\mathbf{P}_{ul}\mathbf{f}_l$$
For $i \in \mathcal{U}$,
$$f_i = \big[(\mathbf{I} - \mathbf{P}_{uu})^{-1}\mathbf{P}_{ul}\mathbf{f}_l\big]_i = \sum_{j:\,y_j=+1} \big[(\mathbf{I} - \mathbf{P}_{uu})^{-1}\mathbf{P}_{ul}\big]_{ij} - \sum_{j:\,y_j=-1} \big[(\mathbf{I} - \mathbf{P}_{uu})^{-1}\mathbf{P}_{ul}\big]_{ij} = p_i(+1) - p_i(-1)$$
where $p_i(\pm 1)$ is the probability that the walk started at node $i$ is absorbed at a vertex labeled $\pm 1$.
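The absorbing-walk form can be computed (and compared against the Laplacian form above) directly; a sketch:

```python
import numpy as np

def harmonic_via_absorption(W, y_l):
    """Random-walk form of the harmonic solution:
    f_u = (I - P_uu)^{-1} P_ul f_l with P = D^{-1} W, i.e. the expected
    label at the labeled vertex where the walk is absorbed."""
    n_l = len(y_l)
    P = W / W.sum(axis=1, keepdims=True)       # row-stochastic transitions
    P_uu, P_ul = P[n_l:, n_l:], P[n_l:, :n_l]
    n_u = P_uu.shape[0]
    return np.linalg.solve(np.eye(n_u) - P_uu, P_ul @ y_l)
```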
SSL with Graphs: Regularized Harmonic Functions
For $i \in \mathcal{U}$,
$$f_i = p_i(+1) - p_i(-1) \;\Longrightarrow\; f_i = |f_i| \cdot \mathrm{sgn}(f_i) = \text{confidence} \times \text{label}$$
What happens if an outlier sneaks in?
⟹ The prediction for the outlier can be confident yet misleading.
How can we control the confidence of the inference?
⟹ Allow the random walk to die!
⟹ We add a sink to the graph: an artificial node with label 0, connected to every other vertex.
What will the sink do in the predictions?
SSL with Graphs: Regularized Harmonic Functions
Regularized Harmonic solution on the graph with sink
$$\nabla_{\mathbf{f}_u}\Omega(\mathbf{f}) = \mathbf{0}, \qquad \Omega(\mathbf{f}) = \mathbf{f}^T\mathbf{L}\mathbf{f}$$
where the Laplacian of the augmented graph (sink appended as the last node) gives
$$\begin{pmatrix}
\mathbf{L}_{ll} + \gamma_g\mathbf{I}_{n_l} & \mathbf{L}_{lu} & -\gamma_g\mathbf{1}_{n_l\times 1}\\
\mathbf{L}_{ul} & \mathbf{L}_{uu} + \gamma_g\mathbf{I}_{n_u} & -\gamma_g\mathbf{1}_{n_u\times 1}\\
-\gamma_g\mathbf{1}_{1\times n_l} & -\gamma_g\mathbf{1}_{1\times n_u} & (n_l+n_u)\gamma_g
\end{pmatrix}
\begin{pmatrix}\mathbf{f}_l\\ \mathbf{f}_u\\ 0\end{pmatrix}
= \begin{pmatrix}\vdots\\ \mathbf{0}_u\\ \vdots\end{pmatrix}$$
Since the sink is clamped to 0, we can disregard the last column and row:
$$\begin{pmatrix}\mathbf{L}_{ll} + \gamma_g\mathbf{I}_{n_l} & \mathbf{L}_{lu}\\ \mathbf{L}_{ul} & \mathbf{L}_{uu} + \gamma_g\mathbf{I}_{n_u}\end{pmatrix}\begin{pmatrix}\mathbf{f}_l\\ \mathbf{f}_u\end{pmatrix} = \begin{pmatrix}\vdots\\ \mathbf{0}_u\end{pmatrix}$$
$$\Longrightarrow\; \mathbf{L}_{ul}\,\mathbf{f}_l + (\mathbf{L}_{uu} + \gamma_g\mathbf{I}_{n_u})\,\mathbf{f}_u = \mathbf{0}_u$$
SSL with Graphs: Regularized Harmonic Functions
How do we compute this regularized random walk?
$$\mathbf{f}_u = (\mathbf{L}_{uu} + \gamma_g\mathbf{I})^{-1}\,\mathbf{W}_{ul}\,\mathbf{f}_l$$
How does $\gamma_g$ influence the solution? What happens to sneaky outliers?
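A sketch of the regularized solve (labeled-first indexing assumed); experimenting with $\gamma_g$ answers the questions above:

```python
import numpy as np

def regularized_harmonic(W, y_l, gamma_g):
    """Regularized harmonic solution with a sink of label 0:
    f_u = (L_uu + gamma_g I)^{-1} (W_ul f_l)."""
    n_l = len(y_l)
    n_u = W.shape[0] - n_l
    L = np.diag(W.sum(axis=1)) - W
    A = L[n_l:, n_l:] + gamma_g * np.eye(n_u)   # sink adds gamma_g to the diagonal
    return np.linalg.solve(A, W[n_l:, :n_l] @ y_l)
```

A weakly connected outlier has small entries in $\mathbf{W}_{ul}$, so even a modest $\gamma_g$ shrinks its $f_i$, and hence its confidence $|f_i|$, toward the sink label 0.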
SSL with Graphs: Soft Harmonic Functions
Regularized harmonic objective with $\mathbf{Q} = \mathbf{L} + \gamma_g\mathbf{I}$. Define $f_i \triangleq f(\mathbf{x}_i)$, $\mathbf{f} \triangleq [f_1,\dots,f_{n_l+n_u}]^T$:
$$\min_{\mathbf{f}\in\mathbb{R}^{n_l+n_u}} \;\infty \cdot \sum_{i=1}^{n_l} (f_i - y_i)^2 + \lambda\,\mathbf{f}^T\mathbf{Q}\mathbf{f}$$
Soft constraints for $f(\mathbf{x}_i) = y_i \;\forall i \in \mathcal{L}$: the $\infty$ is replaced by finite weights:
$$\min_{\mathbf{f}\in\mathbb{R}^{n_l+n_u}} \;(\mathbf{f}-\mathbf{y})^T\mathbf{C}(\mathbf{f}-\mathbf{y}) + \mathbf{f}^T\mathbf{Q}\mathbf{f}$$
$\mathbf{C}$ is diagonal with $C_{ii} = C_l$ for labeled examples and $C_{ii} = C_u$ for unlabeled examples.
$\mathbf{y}$ holds pseudo-targets: $y_i$ is the true label for labeled examples and $0$ for unlabeled examples.
SSL with Graphs: Soft Harmonic Functions
Closed-form soft harmonic solution:
$$\mathbf{f}^* = \arg\min_{\mathbf{f}\in\mathbb{R}^{n_l+n_u}} \big[(\mathbf{f}-\mathbf{y})^T\mathbf{C}(\mathbf{f}-\mathbf{y}) + \mathbf{f}^T\mathbf{Q}\mathbf{f}\big] \;\Longrightarrow\; \mathbf{f}^* = (\mathbf{C}^{-1}\mathbf{Q} + \mathbf{I})^{-1}\mathbf{y}$$
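A sketch of this solve (the constants $C_l$, $C_u$, $\gamma_g$ below are illustrative defaults, not values from the slides):

```python
import numpy as np

def soft_harmonic(W, y, n_l, C_l=100.0, C_u=0.01, gamma_g=1e-3):
    """Soft harmonic solution f* = (C^{-1} Q + I)^{-1} y, where
    Q = L + gamma_g I, C is diagonal with C_l on labeled and C_u on
    unlabeled entries, and y is 0 on unlabeled entries."""
    n = W.shape[0]
    Q = np.diag(W.sum(axis=1)) - W + gamma_g * np.eye(n)
    c = np.r_[np.full(n_l, C_l), np.full(n - n_l, C_u)]
    return np.linalg.solve(Q / c[:, None] + np.eye(n), y)  # (C^{-1}Q + I) f = y
```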
What are the differences between hard and soft?
Not much difference in practice.
Noisy labels may be smoothed by the soft constraints.
Generalization is improved by the sink node.
Summary questions on the lecture
What does hinge loss in SVM penalize?
What does hat loss in SVM for semi-supervised learning penalize?
Why can't we use eigenvectors to solve MinCut-based SSL on a graph?
What is the meaning of a harmonic function in SSL?
What is the random-walk interpretation of harmonic-function-based SSL on a graph?
What is the key point of regularized harmonic-function-based SSL on a graph?
What is the key point of soft harmonic-function-based SSL on a graph?