J. Y. Choi. SNU
배 움
1
배움의 자세
자신이 익힌 지식에 집착함으로 인하여, 우월감과 타인에 대한 멸시함이 자라나지 않도록 한다.
지식을 과시하고자 하는 충동으로 인하여 무분별한 지식 습득과 사념에 빠지지 않도록 한다.
자신의 무지와 몰이해에 대해 자책하거나, 일깨움이 더딘 것을 탓하지 말 고 그저 부지런히 행함을 놓치지 않는다.
언제나 자신에게 솔직할 것이며, 나눌 준비와 겸허한 태도로 자신을 있는 그대로 꾸밈없이 표현하도록 한다.
매일 일상에서 이루어지는 순간순간의 과정이 그대를 이끌 것이니 미래를 걱정하며 계획에 연연해 하지 않는다.
J. Y. Choi. SNU
Summary Questions of the last lecture
3
What are the two options for out of sample extension in SSL?
→ One is to add the new sample to the graph and re-compute HF solution transductively and the other is to make the algorithms inductive.
Why do we have to make a classifier be smooth in inductive SSL for out of sample extension?
→ The smoothness deal with noisy samples by providing reasonable interpolation for new samples.
Summary Questions of the last lecture
What is the meaning of manifold regularization?
→ It enforces the classifier function to yield a reasonably interpolated value for any sample which is not given in the training data.
J. Y. Choi. SNU
Summary Questions of the last lecture
5
What is the key idea of Max-Margin Graph Cuts for SSL?
→ The self-training for SSL is done by using the confidently predicted labels obtained by regularized HF solution of SSL.
Summary Questions of the last lecture
What are the two options for online SSL?
→ One is to add the new sample to the graph and re-compute HF solution transductively and the other is to make the algorithms inductive. The key issue of the former option is to keep the computation cost and memory within a reasonable level even for continuous increment of unlabeled samples.
J. Y. Choi. SNU
Summary Questions of the last lecture
7
What is the key idea for keeping # of representative nodes?
→ Determine the minimum distance (R) between any two centroids and each new sample within a ball centered by a centroid and with radius R is
added to the cluster. If there are samples that do not belong to any cluster, R is doubled.
Outline of Lecture (1)
Graph Spectral Theory
Definition of Graph
Graph Laplacian
Laplacian Smoothing
Graph Node Clustering
Minimum Graph Cut
Ratio Graph Cut
Normalized Graph Cut
Manifold Learning
Spectral Analysis in Riemannian Manifolds
Dimension Reduction, Node Embedding
Semi-supervised Learning (SSL)
Self-Training Methods
SSL with SVM
SSL with Graph using MinCut
SSL with Graph using Harmonic Functions
Semi-supervised Learning (SSL) : conti.
SSL with Graph using Regularized Harmonic Functions
SSL with Graph using Soft Harmonic Functions
SSL with Graph using Manifold
Regularization (out of sample extension)
SSL with Graph using Laplacian SVMs
SSL with Graph using Max-Margin Graph Cuts
Online SSL
SSL for large graph
Graph Convolution Networks (GCN)
Graph Filtering in GCN
Graph Pooling in GCN
Spectral Filtering in GCN
Spatial Filtering in GCN
Recent GCN papers
J. Y. Choi. SNU
when 𝓖 does not fit to memory
9
SSL, Huge 𝒢
Scaling SSL with Graphs to Millions
Soft HS for SSL
𝒇∗ = argmin
𝒇∈ℝ𝑛𝑙+𝑛𝑢
𝒇 − 𝒚 𝑇𝑪 𝒇 − 𝒚 + 𝒇𝑇𝑳𝒇
Using eigenbasis of 𝑳 = 𝑼𝚲𝑼𝑻 and 𝒇 = 𝑼𝜶 𝜶∗ = argmin
𝜶∈ℝ𝑛𝑙+𝑛𝑢
𝑼𝜶 − 𝒚 𝑇𝑪 𝑼𝜶 − 𝒚 + 𝜶𝑇𝚲𝜶
What is the problem with scalability?
What
Let’s take
can we do?
only first 𝑘 eigenvectors in 𝒇 = ഥ𝑼ഥ𝜶!
J. Y. Choi. SNU
Scaling SSL with Graphs to Millions
11
𝑼 is a 𝑁 × 𝑘 matrix, ഥഥ 𝜶 ∈ ℝ𝑘, 𝑘 ≪ 𝑁 𝜶ഥ∗ = argmin
𝜶∈ℝഥ 𝑘
𝑼ഥഥ𝜶 − 𝒚 𝑇𝑪 ഥ𝑼ഥ𝜶 − 𝒚 + ഥ𝜶𝑇𝚲ഥഥ𝜶
Closed form solution is
𝜶ഥ∗ = (ഥ𝚲 + ഥ𝑼𝑇𝑪ഥ𝑼)−1𝑼ഥ𝑇𝑪𝒚
What is the size of this system of equation now?
It requires 𝑘 × 𝑘 𝑚𝑎𝑡𝑟𝑖𝑥 𝑖𝑛𝑣𝑒𝑟𝑠𝑖𝑜𝑛
https://cs.nyu.edu/~fergus/papers/fwt_ssl.pdf
Scaling SSL with Graphs to Millions
Investigation on eigenbasis depending on 𝑵
https://cs.nyu.edu/~fergus/papers/fwt_ssl.pdf
𝑵 𝑵
J. Y. Choi. SNU
Scaling SSL with Graphs to Millions
13
What happens to 𝑳 when 𝑁 → ∞?
We have data 𝒙𝑖 ∈ ℝ𝑑 sampled from 𝑝(𝒙).
When 𝑁 → ∞, instead of vectors 𝒇, we consider functions 𝐹(𝒙).
Instead of 𝑳, we define ℒ𝑝 − weighted smoothness operator ℒ𝑝 𝐹 = 𝟏
𝟐 𝐹 𝒙1 − 𝐹 𝒙2 𝟐 𝑊 𝒙1, 𝒙2 𝑝 𝒙1 𝑝 𝒙2 𝑑𝒙1𝑑𝒙2 where 𝑊 𝒙1, 𝒙2 = exp − 𝒙1 − 𝒙2 2 /(2𝜎2)
As 𝑁 → ∞, 1
𝑁2 𝒇𝑻𝑳𝒇 = 1
2𝑁2 σ𝑖,𝑗 𝑊𝑖𝑗 𝑓𝑖 − 𝑓𝑗 2 → ℒ𝑝 𝐹
Scaling SSL with Graphs to Millions
𝑳 defines the eigenvectors for increasing smoothness.
As 𝑁 → ∞, 1
𝑁2 𝒇𝑻𝑳𝒇 = 1
2𝑁2 σ𝑖,𝑗 𝑊𝑖𝑗 𝑓𝑖 − 𝑓𝑗 2 → ℒ𝑝 𝐹
What does ℒ𝑝 define?
ℒ𝑝 𝐹 = 𝟏
𝟐න 𝐹 𝒙1 − 𝐹 𝒙2 𝟐 𝑊 𝒙1, 𝒙2 𝑝 𝒙1 𝑝 𝒙2 𝑑𝒙1𝑑𝒙2
→ Eigenfunctions!
J. Y. Choi. SNU
Scaling SSL with Graphs to Millions
15
ℒ𝑝 𝐹 = 𝟏
𝟐න 𝐹 𝒙1 − 𝐹 𝒙2 𝟐 𝑊 𝒙1, 𝒙2 𝑝 𝒙1 𝑝 𝒙2 𝑑𝒙1𝑑𝒙2 First eigenfunction: 𝜙1 𝒙 = min
𝐹: 𝐹𝟐 𝒙 𝑝 𝒙 𝐷 𝒙 𝑑𝒙=1 ℒ𝑝 𝐹 where 𝐷 𝒙 = 𝑊(𝒙, 𝒙2)𝑝 𝒙2 𝑑𝒙2, lim
𝑁→∞
1
𝑁 σ𝑖 𝑓𝑖2𝐷 𝑖, 𝑖 = 𝐹𝟐 𝒙 𝑝 𝒙 𝐷 𝒙 𝑑𝒙
What is the solution?
How to define 𝜙2 𝒙 ?
𝜙1 𝒙 = 1 because ℒ𝑝 1 = 0
same, constraining to be orthogonal to 𝜙1 𝒙
𝐹(𝒙)𝜙1 𝒙 𝑝 𝒙 𝐷 𝒙 𝑑𝒙 = 0
Scaling SSL with Graphs to Millions
Eeigenfunctions of ℒ𝑝
𝜙3 𝒙 as before, orthogonal to 𝜙1 𝒙 and 𝜙2 𝒙 etc.
How to define eigenvalues?
Relationship to the discrete Laplacian
1
𝑁2 𝒇𝑇𝑳𝒇 = 1
2𝑁2 σ𝑖,𝑗 𝑤𝑖𝑗 𝑓𝑖 − 𝑓𝑗 2
𝑁→∞ℒ𝑝(𝐹) Can we compute eigenfunctions from 𝑝 𝒙 numerically?
𝜆𝑘 = ℒ𝑝 𝜙𝑘
J. Y. Choi. SNU
Scaling SSL with Graphs to Millions
17
Scaling SSL with Graphs to Millions
https://cs.nyu.edu/~fergus/papers/fwt_ssl.pdf
Factorized data distribution:
𝑝 𝒔 = 𝑝(𝑠1)𝑝(𝑠2) ∙∙∙ 𝑝(𝑠𝑑)
In general, this is not true. But we can rotate data with 𝒔 = 𝑹𝒙.
Treating each factor individually 𝑝𝑘 ≜ marginal distribution of 𝑠𝑘
𝜙𝑖(𝑠𝑘) ≜ eigenfunction of ℒ𝑝𝑘 with eigenvalue 𝜆𝑖
Then: 𝜙𝑖 𝒔 = 𝜙𝑖(𝑠𝑘) is eigenfunction of ℒ𝑝 with 𝜆𝑖 (proof)
J. Y. Choi. SNU
Scaling SSL with Graphs to Millions
19
https://cs.nyu.edu/~fergus/papers/fwt_ssl.pdf
Graph Quantization 𝑾𝑞 = 𝑽෪𝑾𝑞𝑽
How to approximate 1𝐷 density? Histograms!
Algorithm in [fergus2009semi-supervised] for eigenfunctions
Find 𝑹 such that 𝒔 = 𝑹𝒙
For each “independent” 𝑠𝑘 approximate 𝑝(𝑠𝑘)
Given 𝑝(𝑠𝑘) numerically solve for eigensystem of ℒ𝑝𝑘 𝑫 − 𝑷෪෩ 𝑾𝑷 𝑔 = 𝜆 ෩𝑫𝑔
𝑔 − vector of length 𝐵 ≡ number of bins
𝑷 − diagonal elements denote density values at discrete points
෩𝑫 − diagonal sum of the columns of 𝑷෪𝑾𝑷
Order eigenfunctions by increasing eigenvalues
Scaling SSL with Graphs to Millions
Numerical 𝟏𝑫 Eigenfunctions
𝜙2(𝑠1) 𝜙3(𝑠2) 𝜙1 𝑠 = 1
J. Y. Choi. SNU
Scaling SSL with Graphs to Millions
21
Computational complexity for 𝑁 × 𝑑 dataset Typical harmonic approach
one diagonalization of 𝑁 × 𝑁 system (too complex)
Numerical eigenfunctions with 𝐵 bins and 𝑘 eigenvectors 𝑫 − 𝑷෪෩ 𝑾𝑷 𝑔 = 𝜆 ෩𝑫𝑔
Laplacian smoothing problem: 𝑘 × 𝑘 least squares solution (𝚲 + ഥ𝑼𝑇𝑪ഥ𝑼) ഥ𝜶 = ഥ𝑼𝑇𝑪𝒚
Final solution:
𝒇 = ഥ𝑼ഥ𝜶∗
Scaling SSL with Graphs to Millions
Tiny Image CIFAR label set
J. Y. Choi. SNU
SSL without Graph (remind)
23
SVM formulation using hinge loss:
min𝒘,𝑏 𝜆 𝒘 2 + σ𝑖=1𝑛𝑙 max(1 − 𝑦𝑖 𝒘𝑇𝒙𝑖 + 𝑏 , 0)
Formulation for Transductive SVM min𝒘,𝑏
𝑖=1 𝑛𝑙
max(1 − 𝑦𝑖 𝒘𝑇𝒙𝑖 + 𝑏 , 0) + 𝜆1 𝒘 2 +𝜆2 σ𝑖=𝑛
𝑙+1 𝑛𝑙+𝑛𝑢
max(1 − 𝒘𝑇𝒙𝑖 + 𝑏 , 0)
SSL with Graphs (remind)
Inductive SSL with graph: using 𝑓 𝜽; 𝒙𝑖 classifier with parameters 𝜽
min𝜽 𝜆
𝑖,𝑗=1 𝑛𝑙+𝑛𝑢
𝑤𝑖𝑗 𝑓 𝜽; 𝒙𝑖 − 𝑓 𝜽; 𝒙𝑗 2 + 𝛾
𝑖 𝑛𝑙
𝑓 𝜽; 𝒙𝑖 − 𝑦𝑖 2 Transductive SSL with graph: fixing 𝑓 𝒙𝑖 for 𝑖 ∈ ℒ
𝒇∈{±𝟏}min𝑛𝑙+𝑛𝑢 𝜆
𝑖,𝑗=1 𝑛𝑙+𝑛𝑢
𝑤𝑖𝑗 𝑓 𝒙𝑖 − 𝑓 𝒙𝑗 2 + ∞
𝑖 𝑛𝑙
𝑓 𝒙𝑖 − 𝑦𝑖 2
𝒇 ∈ 𝑹𝑛𝑙+𝑛𝑢
J. Y. Choi. SNU
SSL with Graphs (remind)
25
Transductive SSL with graph: fixing 𝑓 𝒙𝑖 for 𝑖 ∈ ℒ
𝒇∈𝑹min𝑛𝑙+𝑛𝑢 𝜆
𝑖,𝑗=1 𝑛𝑙+𝑛𝑢
𝑤𝑖𝑗 𝑓 𝒙𝑖 − 𝑓 𝒙𝑗 2 + ∞
𝑖 𝑛𝑙
𝑓 𝒙𝑖 − 𝑦𝑖 2
𝒇𝑙 = 𝒚𝑙 Ω 𝒇 = 𝒇𝑇𝑳𝒇
𝛻𝒇𝑢𝛺 𝒇 = 0
⟹ 𝒇𝑢 = 𝑳𝑢𝑢−1(𝑾𝑢𝑙𝒇𝑙) cf. random work HS: harmonic solution:
⟹ 𝒇𝑢= (𝑳𝑢𝑢 + 𝛾𝑔𝐈)−1 𝑾𝑢𝑙𝒇𝑙 Regularized HS:
sink node weight
robust to outliers
SSL with Graphs (remind)
Regularized Soft HS 𝒇∗ = argmin
𝒇∈ℝ𝑛𝑙+𝑛𝑢
𝒇 − 𝒚 𝑇𝑪 𝒇 − 𝒚 + 𝒇𝑇𝑸𝒇 = (𝑪−1𝑸 + 𝐈)−1𝒚 , 𝑸 = 𝑳 + 𝛾𝑔𝐈 , 𝒚𝒖 = 𝟎
Out of sample extension: Laplacian SVM with kernels (Inductive)
𝑓∈ℋmin𝒦
𝑖 𝑛𝑙
max(0, 1 − 𝑦𝑖𝑓 𝑥𝑖 ) + 𝜆𝐴 𝑓 𝒦2 + 𝜆𝑙𝒇𝑇𝑳𝒇 manifold regularization
𝑓∗ 𝒙 =
𝑛𝑙+𝑛𝑢
𝛼∗𝒦(𝒙, 𝒙 )
J. Y. Choi. SNU
SSL with Graphs (remind)
27
Self-training with the confident data (MM Graph Cuts) 𝑓∗ = 𝑎𝑟𝑔min
𝑓∈ℋ𝒦
𝑖: 𝑙𝑖∗ ≥𝜀
Φ(𝒙𝑖, 𝑠𝑔𝑛 ℓ𝑖∗ , 𝑓(𝒙𝑖)) + 𝜆 𝑓 𝒦2 s. t. ℓ∗ = 𝑎𝑟𝑔min
ℓ∈ℝ𝑁
ℓ𝑇 (𝑳 + 𝛾𝑔𝐈) ℓ
s. t. ℓ𝑖 = 𝑦𝑖, ∀ 𝑖 = 1, … , 𝑛𝑙
Representer theorem is still cool:
𝑓∗ 𝒙 =
𝑖: 𝑙𝑖∗ ≥𝜀
𝛼𝑖∗𝒦(𝒙, 𝒙𝑖)
SSL with Graphs (remind)
Online SSL
J. Y. Choi. SNU
Scaling SSL with Graphs to Millions (remind)
29
Computational complexity for 𝑁 × 𝑑 dataset Typical harmonic approach
one diagonalization of 𝑁 × 𝑁 system (too complex)
Numerical eigenfunctions with 𝐵 bins and 𝑘 eigenvectors 𝑫 − 𝑷෪෩ 𝑾𝑷 𝑔 = 𝜆 ෩𝑫𝑔
Laplacian smoothing problem: 𝑘 × 𝑘 least squares solution (𝚲 + ഥ𝑼𝑇𝑪ഥ𝑼) ഥ𝜶 = ഥ𝑼𝑇𝑪𝒚
Final solution:
𝒇 = ഥ𝑼ഥ𝜶∗
SSL with Graphs:
Graph-based SSL is obviously sensitive to graph construction!
J. Y. Choi. SNU
Summary Questions of the lecture
31
What is the key idea to mitigate the scalability problem of SSL for millions of data?
What happens to 𝑳 when 𝑁 → ∞?
What does the weighted smoothness operator in Hilbert space define, instead of eigenvectors of Laplacian 𝐿?
How can we obtain numerical eigenfunctions with sampled large dada?
How can we determine eigenvectors from the numerical eigenfunctions?