Review: Summary questions from the last lecture
Explain the meaning of Spectral Clustering in one sentence. Why "spectral"?
→ Nodes of a graph are clustered on the basis of the frequency (spectral) components of the graph signal that represents the cluster labels of the nodes.
What is represented by the solution of the Laplacian formulation relaxing the balanced graph-cut problem?
→ It is the second eigenvector of an appropriate Laplacian, which encodes the given balancing condition.
What does the solution of the balanced MinCut problem mean?
→ It is the second eigenvector of the Laplacian $\mathbf{L}$, which gives balanced cardinalities $|A|$ and $|B|$ of the clustered groups.
What does the solution of the NormalizedCut problem mean?
→ It is the second eigenvector of the symmetric Laplacian $\mathbf{L}_{sym}$, which gives clusters balanced by their volumes $vol(A)$ and $vol(B)$.
Spectral Clustering: Relaxing Balanced Cuts
Optimization formulation for spectral clustering
$$\min_{\mathbf f} \mathbf f^T \mathbf L \mathbf f \quad \text{subject to} \quad f_i \in \mathbb R,\ \mathbf f \perp \mathbf 1_N,\ \|\mathbf f\| = \sqrt{N}$$
The solution (Rayleigh–Ritz):
$$\lambda_2 = \min_{\mathbf x^T \mathbf x = 1,\ \mathbf x \perp \mathbf u_1} \mathbf x^T \mathbf L \mathbf x, \quad \text{where } \mathbf u_1 = \mathbf 1_N \text{ for } \lambda_1 = 0$$
→ second eigenvector $\mathbf x$ of $\mathbf L$.
Since the elements of $\mathbf x$ are not integers and $\|\mathbf x\| = 1$, $\mathbf f$ can be obtained by
$$f_i = \begin{cases} 1 & \text{if } x_i \ge 0 \\ -1 & \text{if } x_i < 0 \end{cases} \quad\rightarrow\quad \|\mathbf f\| = \sqrt{N},\ f_i \in \{1, -1\}$$
(for $f_i \in \{1, -1\}$, the balance $|A| = |B|$ corresponds to $\sum_i f_i = 0$, i.e., $\mathbf f \perp \mathbf 1_N$; the rounded solution need not satisfy it exactly).
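As a concrete illustration (not from the lecture), here is a minimal NumPy sketch of this recipe: build the unnormalized Laplacian, take its second eigenvector, and round it by sign. The example weight matrix and the function name are illustrative assumptions.

```python
# A minimal sketch of the relaxed balanced-cut solution (assumes an
# undirected graph given by a symmetric weight matrix W).
import numpy as np

def spectral_bipartition(W):
    """Cluster nodes by the sign of the second eigenvector of L = D - W."""
    d = W.sum(axis=1)                 # node degrees
    L = np.diag(d) - W                # unnormalized graph Laplacian
    evals, evecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    x = evecs[:, 1]                   # second eigenvector (Fiedler vector)
    return np.where(x >= 0, 1, -1)    # round the relaxed solution to {+1, -1}

# Toy graph: two dense triangles joined by one weak edge.
W = np.array([[0,   1, 1, 0.1, 0, 0],
              [1,   0, 1, 0,   0, 0],
              [1,   1, 0, 0,   0, 0],
              [0.1, 0, 0, 0,   1, 1],
              [0,   0, 0, 1,   0, 1],
              [0,   0, 0, 1,   1, 0]], dtype=float)
print(spectral_bipartition(W))  # e.g. [1 1 1 -1 -1 -1], up to a global sign flip
```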
Spectral Clustering: Approximating RatioCut
RatioCut:
$$\min_{A,B} RCut(A,B) = \min_{A,B} \sum_{i \in A,\, j \in B} w_{ij}\left(\frac{1}{|A|} + \frac{1}{|B|}\right)$$
Define the graph function $\mathbf f$ for cluster membership of RatioCut:
$$f_i = \begin{cases} \sqrt{|B|/|A|} & \text{if } v_i \in A \\ -\sqrt{|A|/|B|} & \text{if } v_i \in B \end{cases}$$
Then
$$\mathbf f^T \mathbf L \mathbf f = \frac{1}{2}\sum_{i,j} w_{ij}(f_i - f_j)^2 = (|A| + |B|)\, RCut(A,B).$$
Since $|A| + |B| = N$ is constant,
$$\min_{A,B} RCut(A,B) = \min_{\mathbf f} \mathbf f^T \mathbf L \mathbf f \quad \text{subject to} \quad f_i \in \left\{\sqrt{|B|/|A|},\ -\sqrt{|A|/|B|}\right\}.$$
Spectral Clustering: Approximating RatioCut
Optimization formulation for RatioCut (same as for the balanced mincut):
$$\min_{\mathbf f} \mathbf f^T \mathbf L \mathbf f \quad \text{subject to} \quad f_i \in \mathbb R,\ \|\mathbf f\| = \sqrt{N}$$
$$\text{s.t. } f_i = \begin{cases} \sqrt{|B|/|A|} & \text{if } v_i \in A \\ -\sqrt{|A|/|B|} & \text{if } v_i \in B \end{cases} \quad\Rightarrow\quad \min_{A,B} RCut(A,B) = \min_{A,B} \mathbf f^T \mathbf L \mathbf f$$
The norm constraint is consistent with these membership values:
$$\|\mathbf f\|^2 = \sum_i f_i^2 = |A|\frac{|B|}{|A|} + |B|\frac{|A|}{|B|} = |A| + |B| = N,$$
but $\|\mathbf f\| = \sqrt{N}$ alone is not sufficient for $f_i \in \{\sqrt{|B|/|A|},\ -\sqrt{|A|/|B|}\}$: for a general $\mathbf f$ satisfying only the relaxed constraints,
$$\mathbf f^T \mathbf L \mathbf f \ne (|A|+|B|)\sum_{i \in A,\, j \in B} w_{ij}\left(\frac{1}{|A|} + \frac{1}{|B|}\right).$$
Spectral Clustering: Approximating RatioCut
Optimization formulation for RatioCut (same as for the balanced mincut):
$$\min_{\mathbf f} \mathbf f^T \mathbf L \mathbf f \quad \text{subject to} \quad f_i \in \mathbb R,\ \|\mathbf f\| = \sqrt{N}$$
The membership function additionally satisfies
$$\sum_i f_i = |A|\sqrt{\tfrac{|B|}{|A|}} - |B|\sqrt{\tfrac{|A|}{|B|}} = \sqrt{|A||B|} - \sqrt{|A||B|} = 0 \;\leftrightarrow\; \mathbf f \perp \mathbf 1_N,$$
so the orthogonality constraint can be added:
$$\min_{\mathbf f} \mathbf f^T \mathbf L \mathbf f \quad \text{subject to} \quad f_i \in \mathbb R,\ \mathbf f \perp \mathbf 1_N,\ \|\mathbf f\| = \sqrt{N}$$
Spectral Clustering: Approximating RatioCut
The solution (Rayleigh–Ritz):
$$\lambda_2 = \min_{\mathbf x^T \mathbf x = 1,\ \mathbf x \perp \mathbf u_1} \mathbf x^T \mathbf L \mathbf x, \quad \text{where } \mathbf u_1 = \mathbf 1_N \text{ for } \lambda_1 = 0.$$
→ second eigenvector $\mathbf x$ of $\mathbf L$.
Since the elements of $\mathbf x$ are not the discrete membership values and $\|\mathbf x\| = 1$, $\mathbf f$ can be obtained by
$$f_i = \begin{cases} \sqrt{|B|/|A|} & \text{if } x_i \ge 0 \\ -\sqrt{|A|/|B|} & \text{if } x_i < 0 \end{cases} \quad\rightarrow\quad \|\mathbf f\| = \sqrt{N}$$
(after this rounding the identity $\mathbf f^T \mathbf L \mathbf f = (|A|+|B|)\sum_{i \in A,\, j \in B} w_{ij}(1/|A| + 1/|B|)$ need not hold exactly, i.e., the relaxation is an approximation).
Spectral Clustering: Approximating NormalizedCut
NormalizedCut: balancing the clusters by considering the degrees of the nodes.
$$\min_{A,B} NCut(A,B) = \min_{A,B} \sum_{i \in A,\, j \in B} w_{ij}\left(\frac{1}{vol(A)} + \frac{1}{vol(B)}\right)$$
Define the graph function $\mathbf f$ for cluster membership of NCut:
$$f_i = \begin{cases} \sqrt{vol(B)/vol(A)} & \text{if } v_i \in A \\ -\sqrt{vol(A)/vol(B)} & \text{if } v_i \in B \end{cases}$$
Then
$$\mathbf f^T \mathbf L \mathbf f = \frac{1}{2}\sum_{i,j} w_{ij}(f_i - f_j)^2 = \sum_{i \in A,\, j \in B} w_{ij}\left(\sqrt{\tfrac{vol(B)}{vol(A)}} + \sqrt{\tfrac{vol(A)}{vol(B)}}\right)^2$$
$$= \sum_{i \in A,\, j \in B} w_{ij}\, \frac{\big(vol(B) + vol(A)\big)^2}{vol(A)\, vol(B)} = vol(\mathcal V)\, NCut(A,B),$$
using $vol(A) + vol(B) = vol(\mathcal V)$; hence
$$\min_{A,B} \mathbf f^T \mathbf L \mathbf f = vol(\mathcal V)\, NCut(A,B) \quad \text{for } f_i \in \left\{\sqrt{vol(B)/vol(A)},\ -\sqrt{vol(A)/vol(B)}\right\}.$$
Spectral Clustering: Approximating NormalizedCut
NormalizedCut
Define the graph function $\mathbf f$ for cluster membership of NCut:
$$f_i = \begin{cases} \sqrt{vol(B)/vol(A)} & \text{if } v_i \in A \\ -\sqrt{vol(A)/vol(B)} & \text{if } v_i \in B \end{cases}$$
$$\min_{A,B} NCut(A,B) = \min_{A,B} \sum_{i \in A,\, j \in B} w_{ij}\left(\frac{1}{vol(A)} + \frac{1}{vol(B)}\right)$$
Necessary conditions for $f_i \in \{\sqrt{vol(B)/vol(A)},\ -\sqrt{vol(A)/vol(B)}\}$:
$$(\mathbf D \mathbf f)^T \mathbf 1_N = 0, \qquad \mathbf f^T \mathbf D \mathbf f = vol(\mathcal V)$$
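A quick check of these two conditions with the membership values above:
$$(\mathbf D \mathbf f)^T \mathbf 1_N = \sum_i d_i f_i = vol(A)\sqrt{\tfrac{vol(B)}{vol(A)}} - vol(B)\sqrt{\tfrac{vol(A)}{vol(B)}} = \sqrt{vol(A)\,vol(B)} - \sqrt{vol(A)\,vol(B)} = 0,$$
$$\mathbf f^T \mathbf D \mathbf f = \sum_i d_i f_i^2 = vol(A)\frac{vol(B)}{vol(A)} + vol(B)\frac{vol(A)}{vol(B)} = vol(B) + vol(A) = vol(\mathcal V).$$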
Spectral Clustering: Approximating NormalizedCut
Optimization formulation for NormalizedCut:
$$\min_{\mathbf f} \mathbf f^T \mathbf L \mathbf f \quad \text{subject to} \quad f_i \in \mathbb R,\ \mathbf D \mathbf f \perp \mathbf 1_N,\ \mathbf f^T \mathbf D \mathbf f = vol(\mathcal V)$$
since, for the NCut membership function above,
$$\mathbf f^T \mathbf L \mathbf f = vol(\mathcal V)\, NCut(A,B), \qquad (\mathbf D \mathbf f)^T \mathbf 1_N = 0, \qquad \mathbf f^T \mathbf D \mathbf f = vol(\mathcal V).$$
Spectral Clustering: Approximating NormalizedCut
Optimization formulation for NormalizedCut:
$$\min_{\mathbf f} \mathbf f^T \mathbf L \mathbf f \quad \text{subject to} \quad f_i \in \mathbb R,\ \mathbf D \mathbf f \perp \mathbf 1_N,\ \mathbf f^T \mathbf D \mathbf f = vol(\mathcal V)$$
Can we apply Rayleigh–Ritz now? Not yet, because of $\mathbf D$ in the constraints. Define $\mathbf h = \mathbf D^{1/2} \mathbf f$. The problem becomes
$$\min_{\mathbf h} \mathbf h^T \mathbf D^{-1/2} \mathbf L \mathbf D^{-1/2} \mathbf h \quad \text{subject to} \quad h_i \in \mathbb R,\ \mathbf h \perp \mathbf u_{1, L_{sym}},\ \mathbf h^T \mathbf h = vol(\mathcal V)$$
i.e.,
$$\min_{\mathbf h} \mathbf h^T \mathbf L_{sym} \mathbf h \quad \text{subject to} \quad h_i \in \mathbb R,\ \mathbf h \perp \mathbf u_{1, L_{sym}},\ \|\mathbf h\|^2 = vol(\mathcal V)$$
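Step by step, the substitution $\mathbf h = \mathbf D^{1/2} \mathbf f$ maps each ingredient of the problem:
$$\mathbf f^T \mathbf L \mathbf f = \mathbf h^T \mathbf D^{-1/2} \mathbf L \mathbf D^{-1/2} \mathbf h = \mathbf h^T \mathbf L_{sym} \mathbf h,$$
$$(\mathbf D \mathbf f)^T \mathbf 1_N = \mathbf h^T \mathbf D^{1/2} \mathbf 1_N = 0 \;\leftrightarrow\; \mathbf h \perp \mathbf D^{1/2} \mathbf 1_N = \mathbf u_{1, L_{sym}},$$
$$\mathbf f^T \mathbf D \mathbf f = \mathbf h^T \mathbf h = vol(\mathcal V),$$
where $\mathbf D^{1/2} \mathbf 1_N$ is indeed the eigenvector of $\mathbf L_{sym}$ for $\lambda = 0$, since $\mathbf L_{sym} \mathbf D^{1/2} \mathbf 1_N = \mathbf D^{-1/2} \mathbf L \mathbf 1_N = \mathbf 0$.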
Spectral Clustering: Approximating NormalizedCut
Optimization formulation for NormalizedCut:
$$\min_{\mathbf h} \mathbf h^T \mathbf L_{sym} \mathbf h \quad \text{subject to} \quad h_i \in \mathbb R,\ \mathbf h \perp \mathbf u_{1, L_{sym}},\ \|\mathbf h\|^2 = vol(\mathcal V)$$
Solution by Rayleigh–Ritz: $\mathbf h = \mathbf u_{2, L_{sym}}$, $\mathbf f = \mathbf D^{-1/2} \mathbf h$.
The cluster labels are recovered by
$$f_i \leftarrow \begin{cases} \sqrt{vol(B)/vol(A)} & \text{if } h_i \ge 0 \\ -\sqrt{vol(A)/vol(B)} & \text{if } h_i < 0 \end{cases} \quad\leftrightarrow\quad \mathbf f^T \mathbf D \mathbf f = vol(\mathcal V)$$
(the rounding itself does not enforce $vol(A) = vol(B)$).
Moreover, $\mathbf f = \mathbf D^{-1/2} \mathbf h$ is an eigenvector of $\mathbf L_{rw} = \mathbf D^{-1} \mathbf L$, i.e., a solution of the generalized eigenproblem
$$\mathbf L \mathbf u = \lambda \mathbf D \mathbf u.$$
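A minimal sketch (not from the lecture) of this procedure in NumPy/SciPy, solving the generalized eigenproblem $\mathbf L \mathbf u = \lambda \mathbf D \mathbf u$ directly; the function name and the connectivity assumption are mine:

```python
# A minimal sketch of the relaxed NormalizedCut bipartition.
# Assumes a connected undirected graph with symmetric weights W and
# strictly positive node degrees (so that D is positive definite).
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(W):
    d = W.sum(axis=1)
    D = np.diag(d)
    L = D - W
    # Generalized symmetric eigenproblem L u = lambda D u
    # (equivalently, eigenvectors of L_rw = D^{-1} L).
    evals, evecs = eigh(L, D)
    f = evecs[:, 1]                  # second generalized eigenvector
    return np.where(f >= 0, 1, -1)   # round the relaxed solution by sign
```

Computing the second eigenvector $\mathbf h$ of $\mathbf L_{sym}$ and setting $\mathbf f = \mathbf D^{-1/2} \mathbf h$ gives the same partition.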
Outline of Lecture
Graph Spectral Theory
Definition of Graph
Graph Laplacian
Laplacian Smoothing
Graph Node Clustering
Minimum Graph Cut
Ratio Graph Cut
Normalized Graph Cut
Manifold Learning
Spectral Analysis in Riemannian Manifolds
Dimension Reduction, Node Embedding
Semi-supervised Learning (SSL)
Self-Training Methods
SSL with SVM
SSL with Graph using MinCut
SSL with Graph using Harmonic Functions
Semi-supervised Learning (SSL): cont.
SSL with Graph using Regularized Harmonic Functions
SSL with Graph using Soft Harmonic Functions
SSL with Graph using Manifold Regularization
SSL with Graph using Laplacian SVMs
SSL with Graph using Max-Margin Graph Cuts
Online SSL and SSL for large graph
Graph Convolution Networks (GCN)
Graph Filtering in GCN
Graph Pooling in GCN
Spectral Filtering in GCN
Spatial Filtering in GCN
Manifold Learning
$\mathbb R^d \Longrightarrow \mathbb R^m$
Manifold Learning
Problem definition: dimensionality reduction / manifold learning
Given $\{\mathbf x_i\}_{i=1}^N$ from $\mathbb R^d$, find $\{\mathbf y_i\}_{i=1}^N$ in $\mathbb R^m$, where $m \ll d$.
What do we know about the dimensionality reduction?
representation/visualization (2D or 3D)
an old example: globe to a map (3D → 2D)
often assuming a manifold $\mathcal M \subset \mathbb R^d$
feature extraction
linear vs. nonlinear dimensionality reduction
What do we know about linear vs. nonlinear methods?
linear: ICA, PCA, LDA, SVD, ...
nonlinear methods often preserve only local distances
Manifold Learning: Linear vs. Non-linear
Manifold Learning: Preserving (just) local distances
$d(\mathbf y_i, \mathbf y_j) = d(\mathbf x_i, \mathbf x_j)$ only if $d(\mathbf x_i, \mathbf x_j)$ and $d(\mathbf y_i, \mathbf y_j)$ are small.
$$\min_{\mathbf y} \sum_{i,j} w_{ij}\, \|\mathbf y_i - \mathbf y_j\|^2$$
Does $\mathbf y_i$ look similar to $\mathbf y_j$? Yes in Euclidean space, but not on manifolds.
Manifold Learning: Riemannian manifolds
Manifold $\mathcal X$ = topological space.
Tangent space $T_{\mathbf x}\mathcal X$ = local Euclidean representation of the manifold $\mathcal X$ around $\mathbf x$.
Riemannian metric: describes the local intrinsic structure at $\mathbf x$,
$$\langle \cdot, \cdot \rangle_{T_{\mathbf x}\mathcal X}: T_{\mathbf x}\mathcal X \times T_{\mathbf x}\mathcal X \to \mathbb R$$
Scalar fields $f: \mathcal X \to \mathbb R$ and vector fields $F: \mathcal X \to T_{\mathbf x}\mathcal X$.
[Figure: a manifold and its tangent space; from Geometric deep learning on graphs and manifolds, Michael Bronstein et al.]
Hilbert spaces with inner products:
$$\langle f, g \rangle_{L^2(\mathcal X)} = \int_{\mathcal X} f(\mathbf x)\, g(\mathbf x)\, d\mathbf x$$
$$\langle F, G \rangle_{L^2(T\mathcal X)} = \int_{\mathcal X} \langle F(\mathbf x), G(\mathbf x) \rangle_{T_{\mathbf x}\mathcal X}\, d\mathbf x$$
$L^2(\mathcal X, \mu)$: square-integrable functions on the measure space $\mathcal X$ w.r.t. the measure $\mu$.
Is $f(t) = 1/t$ in $L^2((0,1), dt)$? (No: $\int_0^1 t^{-2}\, dt$ diverges.)
Manifold Learning: Manifold Laplacian
Laplacian $\Delta: L^2(\mathcal X) \to L^2(\mathcal X)$:
$$\Delta f(\mathbf x) = -\mathrm{div}\, \nabla f(\mathbf x),$$
where the gradient $\nabla: L^2(\mathcal X) \to L^2(T\mathcal X)$ and the divergence $\mathrm{div}: L^2(T\mathcal X) \to L^2(\mathcal X)$ are adjoint operators, i.e.,
$$\langle \nabla f, G \rangle_{L^2(T\mathcal X)} = \langle f, -\mathrm{div}\, G \rangle_{L^2(\mathcal X)}.$$
(ex) Let $\mathbf F = F_x \mathbf i + F_y \mathbf j + F_z \mathbf k$ be a differentiable vector field. Then
$$\mathrm{div}\, \mathbf F = \frac{\partial F_x}{\partial x} + \frac{\partial F_y}{\partial y} + \frac{\partial F_z}{\partial z}.$$
(Book: Laplacian on Riemannian Manifold)
The manifold Laplacian is self-adjoint:
$$\langle \Delta f, f \rangle_{L^2(\mathcal X)} = \langle f, \Delta f \rangle_{L^2(\mathcal X)}.$$
Dirichlet energy of $f$:
$$\langle \nabla f, \nabla f \rangle_{L^2(T\mathcal X)} = \int_{\mathcal X} f(\mathbf x)\, \Delta f(\mathbf x)\, d\mathbf x \;\Longleftrightarrow\; \mathbf f^T \mathbf L \mathbf f \text{ (graph analogue)}.$$
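On a graph, this equivalence can be checked numerically; a small sketch (my own, not from the lecture) verifying $\mathbf f^T \mathbf L \mathbf f = \tfrac{1}{2}\sum_{i,j} w_{ij}(f_i - f_j)^2$ for a random weighted graph:

```python
# Numerical check that the graph Dirichlet energy f^T L f equals
# (1/2) * sum_ij w_ij (f_i - f_j)^2 for a random symmetric weight matrix.
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((5, 5))
W = (W + W.T) / 2            # symmetrize the weights
np.fill_diagonal(W, 0)       # no self-loops
L = np.diag(W.sum(axis=1)) - W
f = rng.standard_normal(5)   # an arbitrary graph signal

lhs = f @ L @ f
rhs = 0.5 * sum(W[i, j] * (f[i] - f[j])**2
                for i in range(5) for j in range(5))
assert np.isclose(lhs, rhs)  # the two expressions agree
```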
Manifold Learning: Laplacian Eigenmaps
Step 1 (Adjacency graph): Given $N$ points $\mathbf x_1, \mathbf x_2, \dots, \mathbf x_N$ in $\mathbb R^l$, construct a graph of $N$ nodes $\mathbf x_i$ with weighted edges
$$w_{ij} = \begin{cases} e^{-\|\mathbf x_i - \mathbf x_j\|^2 / t} \text{ or } 1 & \text{if } \|\mathbf x_i - \mathbf x_j\| \le \epsilon \text{ or } j \in kNN(i) \\ 0 & \text{otherwise} \end{cases}$$
Step 2: Solve the generalized eigenproblem (as in Normalized Cut):
$$\mathbf L \mathbf f = \lambda \mathbf D \mathbf f \;\rightarrow\; \mathbf f_1, \mathbf f_2, \dots, \mathbf f_N, \quad \text{where } 0 = \lambda_1 \le \dots \le \lambda_N$$
Step 3: Assign $m$ new coordinates: use the $m$ eigenvectors $\mathbf f_2, \dots, \mathbf f_{m+1}$ to embed $\mathbf x_i$ in $m$-dimensional Euclidean space:
$$\mathbf x_i \rightarrow \big(\mathbf f_2(i), \dots, \mathbf f_{m+1}(i)\big)$$
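A compact sketch of the three steps (assumptions mine: a kNN graph with heat-kernel weights and dense linear algebra; for large $N$ one would use sparse matrices and `scipy.sparse.linalg` instead):

```python
# A minimal Laplacian Eigenmaps sketch; parameter names are illustrative.
import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(X, k=10, t=1.0, m=2):
    # Step 1: kNN graph with heat-kernel weights.
    D2 = ((X[:, None, :] - X[None, :, :])**2).sum(-1)  # pairwise squared distances
    N = X.shape[0]
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(D2[i])[1:k+1]                # k nearest neighbors (skip self)
        W[i, nbrs] = np.exp(-D2[i, nbrs] / t)
    W = np.maximum(W, W.T)                             # symmetrize
    # Step 2: generalized eigenproblem L f = lambda D f.
    d = W.sum(axis=1)
    L = np.diag(d) - W
    evals, evecs = eigh(L, np.diag(d))                 # ascending eigenvalues
    # Step 3: embed with eigenvectors 2..m+1 (skip the constant one).
    return evecs[:, 1:m+1]
```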
Manifold Learning: Laplacian Eigenmaps
[Figure: Swiss Roll data and its 2D embeddings for different numbers of nearest neighbors $N$ and heat-kernel parameters $t$]
Manifold Learning: Laplacian Eigenmaps
[Figure: Laplacian eigenmap / diffusion map manifold learning demo (Taylor, MathWorks)]
Semi-supervised Learning (SSL)
How is semi-supervised learning possible? This is how children learn! (hypothesis)
Semi-supervised learning (SSL)
SSL problem definition:
Given $\{\mathbf x_i\}_{i=1}^N$ from $\mathbb R^d$ and labels $\{y_i\}_{i=1}^n$ with $n \ll N$, find $\{y_i\}_{i=n+1}^N$ (transductive), or find $\mathbf f$ predicting $y_i = \mathbf f(\mathbf x_i)$, $i = n+1, \dots, N$, well (inductive).
Some facts about SSL
assumes that the unlabeled data is useful
works with data geometry assumptions
cluster assumption — low-density separation
manifold assumption
smoothness assumptions, …
inductive or transductive/out-of-sample extension
SSL: Self-Training
Input: $\mathcal L = \{\mathbf x_i, y_i\}_{i=1}^n$ and $\mathcal U = \{\mathbf x_i\}_{i=n+1}^N$
Repeat:
train $\mathbf f$ using $\mathcal L$
apply $\mathbf f$ to some of $\mathcal U$ and move them, with their predicted labels, to $\mathcal L$
What are the properties of self-training? (a code sketch follows this list)
heavily depends on the classifier
nobody uses it anymore
errors propagate (unless the clusters are well separated)
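A minimal self-training sketch (assumptions mine: scikit-learn's LogisticRegression as the base classifier, NumPy arrays as inputs, and a confidence threshold for picking which unlabeled points to pseudo-label; none of this is prescribed by the lecture):

```python
# Self-training: repeatedly fit on L, pseudo-label confident points from U.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_l, y_l, X_u, threshold=0.95, max_iter=10):
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(max_iter):
        clf = LogisticRegression().fit(X_l, y_l)   # train f using L
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        sure = proba.max(axis=1) >= threshold      # pseudo-label only confident points
        if not sure.any():
            break
        X_l = np.vstack([X_l, X_u[sure]])          # move them to L ...
        y_l = np.concatenate([y_l, clf.predict(X_u[sure])])  # ... with predicted labels
        X_u = X_u[~sure]
    return clf
```

The threshold controls the trade-off the list above describes: a low threshold adds many pseudo-labels quickly but lets errors propagate.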
SSL: Self-Training : Bad Case
SSL: Classical SVM (Review)
Decision function $f(\mathbf x) = \mathbf w^T \mathbf x + b$: the hyperplane $f(\mathbf x) = 0$ separates the two classes, with $f(\mathbf x) > 0$ on one side and $f(\mathbf x) < 0$ on the other; the margin is the distance from the hyperplane to the closest points.
Maximal Margin Hyperplane = Optimal Hyperplane
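For completeness, the classical hard-margin SVM program that defines the optimal hyperplane (a standard formulation, not reproduced from the slide):
$$\min_{\mathbf w, b}\ \frac{1}{2}\|\mathbf w\|^2 \quad \text{subject to} \quad y_i(\mathbf w^T \mathbf x_i + b) \ge 1,\ i = 1, \dots, n,$$
whose geometric margin is $2 / \|\mathbf w\|$.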
Summary questions on the lecture
What is manifold learning?
What is a Riemannian manifold?
What is the role of the heat kernel?
What is the cluster assumption for semi-supervised learning?
What is the manifold assumption for semi-supervised learning?