배 움

(1)

J. Y. Choi. SNU

배 움

1

(2)

배움의 자세

 자신이 익힌 지식에 집착함으로 인하여, 우월감과 타인에 대한 멸시함이 자라나지 않도록 한다.

 지식을 과시하고자 하는 충동으로 인하여 무분별한 지식 습득과 사념에 빠지지 않도록 한다.

 자신의 무지와 몰이해에 대해 자책하거나, 일깨움이 더딘 것을 탓하지 말 고 그저 부지런히 행함을 놓치지 않는다.

 언제나 자신에게 솔직할 것이며, 나눌 준비와 겸허한 태도로 자신을 있는 그대로 꾸밈없이 표현하도록 한다.

 매일 일상에서 이루어지는 순간순간의 과정이 그대를 이끌 것이니 미래를 걱정하며 계획에 연연해 하지 않는다.

(3)

J. Y. Choi. SNU

Summary Questions of the last lecture

3

What are the two options for out of sample extension in SSL?

→ One is to add the new sample to the graph and re-compute HF solution transductively and the other is to make the algorithms inductive.

Why do we have to make a classifier be smooth in inductive SSL for out of sample extension?

→ The smoothness deal with noisy samples by providing reasonable interpolation for new samples.

(4)

Summary Questions of the last lecture

What is the meaning of manifold regularization?

→ It enforces the classifier function to yield a reasonably interpolated value for any sample which is not given in the training data.

(5)

J. Y. Choi. SNU

Summary Questions of the last lecture

5

What is the key idea of Max-Margin Graph Cuts for SSL?

→ The self-training for SSL is done by using the confidently predicted labels obtained by regularized HF solution of SSL.

(6)

Summary Questions of the last lecture

What are the two options for online SSL?

→ One is to add the new sample to the graph and re-compute HF solution transductively and the other is to make the algorithms inductive. The key issue of the former option is to keep the computation cost and memory within a reasonable level even for continuous increment of unlabeled samples.

(7)

J. Y. Choi. SNU

Summary Questions of the last lecture

7

What is the key idea for keeping # of representative nodes?

→ Determine the minimum distance (R) between any two centroids and each new sample within a ball centered by a centroid and with radius R is

added to the cluster. If there are samples that do not belong to any cluster, R is doubled.

(8)

Outline of Lecture (1)

 Graph Spectral Theory

 Definition of Graph

 Graph Laplacian

 Laplacian Smoothing

 Graph Node Clustering

 Minimum Graph Cut

 Ratio Graph Cut

 Normalized Graph Cut

 Manifold Learning

 Spectral Analysis in Riemannian Manifolds

 Dimension Reduction, Node Embedding

 Semi-supervised Learning (SSL)

 Self-Training Methods

 SSL with SVM

 SSL with Graph using MinCut

 SSL with Graph using Harmonic Functions

 Semi-supervised Learning (SSL) : conti.

 SSL with Graph using Regularized Harmonic Functions

 SSL with Graph using Soft Harmonic Functions

 SSL with Graph using Manifold

Regularization (out of sample extension)

 SSL with Graph using Laplacian SVMs

 SSL with Graph using Max-Margin Graph Cuts

 Online SSL

 SSL for large graph

 Graph Convolution Networks (GCN)

 Graph Filtering in GCN

 Graph Pooling in GCN

 Spectral Filtering in GCN

 Spatial Filtering in GCN

 Recent GCN papers

(9)

J. Y. Choi. SNU

when 𝓖 does not fit to memory

9

SSL, Huge 𝒢

(10)

Scaling SSL with Graphs to Millions

Soft HS for SSL

𝒇^∗ = argmin

𝒇∈ℝ^{𝑛𝑙+𝑛𝑢}

𝒇 − 𝒚 ^𝑇𝑪 𝒇 − 𝒚 + 𝒇^𝑇𝑳𝒇

Using eigenbasis of 𝑳 = 𝑼𝚲𝑼^𝑻 and 𝒇 = 𝑼𝜶 𝜶^∗ = argmin

𝜶∈ℝ^{𝑛𝑙+𝑛𝑢}

𝑼𝜶 − 𝒚 ^𝑇𝑪 𝑼𝜶 − 𝒚 + 𝜶^𝑇𝚲𝜶

What is the problem with scalability?

What

Let’s take

can we do?

only first 𝑘 eigenvectors in 𝒇 = ഥ𝑼ഥ𝜶!

(11)

J. Y. Choi. SNU

Scaling SSL with Graphs to Millions

11

𝑼 is a 𝑁 × 𝑘 matrix, ഥഥ 𝜶 ∈ ℝ^𝑘, 𝑘 ≪ 𝑁 𝜶ഥ^∗ = argmin

𝜶∈ℝഥ ^𝑘

𝑼ഥഥ𝜶 − 𝒚 ^𝑇𝑪 ഥ𝑼ഥ𝜶 − 𝒚 + ഥ𝜶^𝑇𝚲ഥഥ𝜶

Closed form solution is

𝜶ഥ^∗ = (ഥ𝚲 + ഥ𝑼^𝑇𝑪ഥ𝑼)⁻¹𝑼ഥ^𝑇𝑪𝒚

What is the size of this system of equation now?

It requires 𝑘 × 𝑘 𝑚𝑎𝑡𝑟𝑖𝑥 𝑖𝑛𝑣𝑒𝑟𝑠𝑖𝑜𝑛

https://cs.nyu.edu/~fergus/papers/fwt_ssl.pdf

(12)

Scaling SSL with Graphs to Millions

Investigation on eigenbasis depending on 𝑵

𝑵 𝑵

(13)

J. Y. Choi. SNU

Scaling SSL with Graphs to Millions

13

What happens to 𝑳 when 𝑁 → ∞?

We have data 𝒙_𝑖 ∈ ℝ^𝑑 sampled from 𝑝(𝒙).

When 𝑁 → ∞, instead of vectors 𝒇, we consider functions 𝐹(𝒙).

Instead of 𝑳, we define ℒ_𝑝 − weighted smoothness operator ℒ_𝑝 𝐹 = ^𝟏

𝟐׬ 𝐹 𝒙₁ − 𝐹 𝒙₂ ^𝟐 𝑊 𝒙₁, 𝒙₂ 𝑝 𝒙₁ 𝑝 𝒙₂ 𝑑𝒙₁𝑑𝒙₂ where 𝑊 𝒙₁, 𝒙₂ = exp − 𝒙₁ − 𝒙₂ ² /(2𝜎²)

As 𝑁 → ∞, ¹

𝑁² 𝒇^𝑻𝑳𝒇 = ¹

2𝑁² σ_𝑖,𝑗 𝑊_𝑖𝑗 𝑓_𝑖 − 𝑓_𝑗 ² → ℒ_𝑝 𝐹

(14)

Scaling SSL with Graphs to Millions

𝑳 defines the eigenvectors for increasing smoothness.

As 𝑁 → ∞, ¹

𝑁² 𝒇^𝑻𝑳𝒇 = ¹

2𝑁² σ_𝑖,𝑗 𝑊_𝑖𝑗 𝑓_𝑖 − 𝑓_𝑗 ² → ℒ_𝑝 𝐹

What does ℒ_𝑝 define?

ℒ_𝑝 𝐹 = 𝟏

𝟐න 𝐹 𝒙₁ − 𝐹 𝒙₂ ^𝟐 𝑊 𝒙₁, 𝒙₂ 𝑝 𝒙₁ 𝑝 𝒙₂ 𝑑𝒙₁𝑑𝒙₂

→ Eigenfunctions!

(15)

J. Y. Choi. SNU

Scaling SSL with Graphs to Millions

15

ℒ_𝑝 𝐹 = 𝟏

𝟐න 𝐹 𝒙₁ − 𝐹 𝒙₂ ^𝟐 𝑊 𝒙₁, 𝒙₂ 𝑝 𝒙₁ 𝑝 𝒙₂ 𝑑𝒙₁𝑑𝒙₂ First eigenfunction: 𝜙₁ 𝒙 = min

𝐹:׬ 𝐹^𝟐 𝒙 𝑝 𝒙 𝐷 𝒙 𝑑𝒙=1 ℒ_𝑝 𝐹 where 𝐷 𝒙 = ׬ 𝑊(𝒙, 𝒙₂)𝑝 𝒙₂ 𝑑𝒙₂, lim

𝑁→∞

1

𝑁 σ_𝑖 𝑓_𝑖²𝐷 𝑖, 𝑖 = ׬ 𝐹^𝟐 𝒙 𝑝 𝒙 𝐷 𝒙 𝑑𝒙

What is the solution?

How to define 𝜙₂ 𝒙 ?

𝜙₁ 𝒙 = 1 because ℒ_𝑝 1 = 0

same, constraining to be orthogonal to 𝜙₁ 𝒙

׬ 𝐹(𝒙)𝜙₁ 𝒙 𝑝 𝒙 𝐷 𝒙 𝑑𝒙 = 0

(16)

Scaling SSL with Graphs to Millions

Eeigenfunctions of ℒ_𝑝

𝜙₃ 𝒙 as before, orthogonal to 𝜙₁ 𝒙 and 𝜙₂ 𝒙 etc.

How to define eigenvalues?

Relationship to the discrete Laplacian

1

𝑁² 𝒇^𝑇𝑳𝒇 = ¹

2𝑁² σ_𝑖,𝑗 𝑤_𝑖𝑗 𝑓_𝑖 − 𝑓_𝑗 ²

𝑁→∞ℒ_𝑝(𝐹) Can we compute eigenfunctions from 𝑝 𝒙 numerically?

𝜆_𝑘 = ℒ_𝑝 𝜙_𝑘

(17)

J. Y. Choi. SNU

Scaling SSL with Graphs to Millions

17

(18)

Scaling SSL with Graphs to Millions

Factorized data distribution:

𝑝 𝒔 = 𝑝(𝑠₁)𝑝(𝑠₂) ∙∙∙ 𝑝(𝑠_𝑑)

In general, this is not true. But we can rotate data with 𝒔 = 𝑹𝒙.

Treating each factor individually 𝑝_𝑘 ≜ marginal distribution of 𝑠_𝑘

𝜙_𝑖(𝑠_𝑘) ≜ eigenfunction of ℒ_𝑝_𝑘 with eigenvalue 𝜆_𝑖

Then: 𝜙_𝑖 𝒔 = 𝜙_𝑖(𝑠_𝑘) is eigenfunction of ℒ_𝑝 with 𝜆_𝑖 (proof)

(19)

J. Y. Choi. SNU

Scaling SSL with Graphs to Millions

19

Graph Quantization 𝑾^𝑞 = 𝑽෪𝑾^𝑞𝑽

How to approximate 1𝐷 density? Histograms!

Algorithm in [fergus2009semi-supervised] for eigenfunctions

 Find 𝑹 such that 𝒔 = 𝑹𝒙

 For each “independent” 𝑠_𝑘 approximate 𝑝(𝑠_𝑘)

 Given 𝑝(𝑠_𝑘) numerically solve for eigensystem of ℒ_𝑝_𝑘 𝑫 − 𝑷෪෩ 𝑾𝑷 𝑔 = 𝜆 ෩𝑫𝑔

 𝑔 − vector of length 𝐵 ≡ number of bins

 𝑷 − diagonal elements denote density values at discrete points

 ෩𝑫 − diagonal sum of the columns of 𝑷෪𝑾𝑷

 Order eigenfunctions by increasing eigenvalues

(20)

Scaling SSL with Graphs to Millions

Numerical 𝟏𝑫 Eigenfunctions

𝜙₂(𝑠₁) 𝜙₃(𝑠₂) 𝜙₁ 𝑠 = 1

(21)

J. Y. Choi. SNU

Scaling SSL with Graphs to Millions

21

Computational complexity for 𝑁 × 𝑑 dataset Typical harmonic approach

one diagonalization of 𝑁 × 𝑁 system (too complex)

Numerical eigenfunctions with 𝐵 bins and 𝑘 eigenvectors 𝑫 − 𝑷෪෩ 𝑾𝑷 𝑔 = 𝜆 ෩𝑫𝑔

Laplacian smoothing problem: 𝑘 × 𝑘 least squares solution (𝚲 + ഥ𝑼^𝑇𝑪ഥ𝑼) ഥ𝜶 = ഥ𝑼^𝑇𝑪𝒚

Final solution:

𝒇 = ഥ𝑼ഥ𝜶^∗

(22)

Scaling SSL with Graphs to Millions

Tiny Image CIFAR label set

(23)

J. Y. Choi. SNU

SSL without Graph (remind)

23

SVM formulation using hinge loss:

min𝒘,𝑏 𝜆 𝒘 ² + σ_𝑖=1^𝑛^𝑙 max(1 − 𝑦_𝑖 𝒘^𝑇𝒙_𝑖 + 𝑏 , 0)

Formulation for Transductive SVM min𝒘,𝑏 ෍

𝑖=1 𝑛_𝑙

max(1 − 𝑦_𝑖 𝒘^𝑇𝒙_𝑖 + 𝑏 , 0) + 𝜆₁ 𝒘 ² +𝜆₂ σ_𝑖=𝑛

𝑙+1 𝑛_𝑙+𝑛_𝑢

max(1 − 𝒘^𝑇𝒙_𝑖 + 𝑏 , 0)

(24)

SSL with Graphs (remind)

Inductive SSL with graph: using 𝑓 𝜽; 𝒙_𝑖 classifier with parameters 𝜽

min𝜽 𝜆 ෍

𝑖,𝑗=1 𝑛_𝑙+𝑛_𝑢

𝑤_𝑖𝑗 𝑓 𝜽; 𝒙_𝑖 − 𝑓 𝜽; 𝒙_𝑗 ² + 𝛾 ෍

𝑖 𝑛_𝑙

𝑓 𝜽; 𝒙_𝑖 − 𝑦_𝑖 ² Transductive SSL with graph: fixing 𝑓 𝒙_𝑖 for 𝑖 ∈ ℒ

𝒇∈{±𝟏}min^{𝑛𝑙+𝑛𝑢} 𝜆 ෍

𝑤_𝑖𝑗 𝑓 𝒙_𝑖 − 𝑓 𝒙_𝑗 ² + ∞෍

𝑖 𝑛_𝑙

𝑓 𝒙_𝑖 − 𝑦_𝑖 ²

𝒇 ∈ 𝑹^𝑛^𝑙^+𝑛^𝑢

(25)

J. Y. Choi. SNU

SSL with Graphs (remind)

25

Transductive SSL with graph: fixing 𝑓 𝒙_𝑖 for 𝑖 ∈ ℒ

𝒇∈𝑹min^{𝑛𝑙+𝑛𝑢} 𝜆 ෍

𝑤_𝑖𝑗 𝑓 𝒙_𝑖 − 𝑓 𝒙_𝑗 ² + ∞ ෍

𝑖 𝑛_𝑙

𝑓 𝒙_𝑖 − 𝑦_𝑖 ²

𝒇_𝑙 = 𝒚_𝑙 Ω 𝒇 = 𝒇^𝑇𝑳𝒇

𝛻_𝒇_𝑢𝛺 𝒇 = 0

⟹ 𝒇_𝑢 = 𝑳_𝑢𝑢⁻¹(𝑾_𝑢𝑙𝒇_𝑙) cf. random work HS: harmonic solution:

⟹ 𝒇_𝑢= (𝑳_𝑢𝑢 + 𝛾_𝑔𝐈)⁻¹ 𝑾_𝑢𝑙𝒇_𝑙 Regularized HS:

sink node weight

robust to outliers

(26)

SSL with Graphs (remind)

Regularized Soft HS 𝒇^∗ = argmin

𝒇∈ℝ^{𝑛𝑙+𝑛𝑢}

𝒇 − 𝒚 ^𝑇𝑪 𝒇 − 𝒚 + 𝒇^𝑇𝑸𝒇 = (𝑪⁻¹𝑸 + 𝐈)⁻¹𝒚 , 𝑸 = 𝑳 + 𝛾_𝑔𝐈 , 𝒚_𝒖 = 𝟎

Out of sample extension: Laplacian SVM with kernels (Inductive)

𝑓∈ℋmin_𝒦 ෍

𝑖 𝑛_𝑙

max(0, 1 − 𝑦_𝑖𝑓 𝑥_𝑖 ) + 𝜆_𝐴 𝑓 _𝒦² + 𝜆_𝑙𝒇^𝑇𝑳𝒇 manifold regularization

𝑓^∗ 𝒙 = ෍

𝑛_𝑙+𝑛_𝑢

𝛼^∗𝒦(𝒙, 𝒙 )

(27)

J. Y. Choi. SNU

SSL with Graphs (remind)

27

Self-training with the confident data (MM Graph Cuts) 𝑓^∗ = 𝑎𝑟𝑔min

𝑓∈ℋ_𝒦

෍

𝑖: 𝑙_𝑖^∗ ≥𝜀

Φ(𝒙_𝑖, 𝑠𝑔𝑛 ℓ_𝑖^∗ , 𝑓(𝒙_𝑖)) + 𝜆 𝑓 _𝒦² s. t. ℓ^∗ = 𝑎𝑟𝑔min

ℓ∈ℝ^𝑁

ℓ^𝑇 (𝑳 + 𝛾_𝑔𝐈) ℓ

s. t. ℓ_𝑖 = 𝑦_𝑖, ∀ 𝑖 = 1, … , 𝑛_𝑙

Representer theorem is still cool:

𝑓^∗ 𝒙 = ෍

𝑖: 𝑙_𝑖^∗ ≥𝜀

𝛼_𝑖^∗𝒦(𝒙, 𝒙_𝑖)

(28)

SSL with Graphs (remind)

Online SSL

(29)

J. Y. Choi. SNU

Scaling SSL with Graphs to Millions (remind)

29

Computational complexity for 𝑁 × 𝑑 dataset Typical harmonic approach

one diagonalization of 𝑁 × 𝑁 system (too complex)

Numerical eigenfunctions with 𝐵 bins and 𝑘 eigenvectors 𝑫 − 𝑷෪෩ 𝑾𝑷 𝑔 = 𝜆 ෩𝑫𝑔

Laplacian smoothing problem: 𝑘 × 𝑘 least squares solution (𝚲 + ഥ𝑼^𝑇𝑪ഥ𝑼) ഥ𝜶 = ഥ𝑼^𝑇𝑪𝒚

Final solution:

𝒇 = ഥ𝑼ഥ𝜶^∗

(30)

SSL with Graphs:

Graph-based SSL is obviously sensitive to graph construction!

(31)

J. Y. Choi. SNU

Summary Questions of the lecture

31

 What is the key idea to mitigate the scalability problem of SSL for millions of data?

 What happens to 𝑳 when 𝑁 → ∞?

 What does the weighted smoothness operator in Hilbert space define, instead of eigenvectors of Laplacian 𝐿?

 How can we obtain numerical eigenfunctions with sampled large dada?

 How can we determine eigenvectors from the numerical eigenfunctions?