• 검색 결과가 없습니다.

Introduction to Data Mining

N/A
N/A
Protected

Academic year: 2022

Share "Introduction to Data Mining"

Copied!
37
0
0

로드 중.... (전체 텍스트 보기)

전체 글

(1)

Introduction to Data Mining

Lecture #20: Analysis of Large Graphs-2 U Kang

Seoul National University

(2)

In This Lecture

Learn the min-cut problem in graphs, and its solutions

Learn an example of spectral graph theory (how linear algebra and graph problems interact)

Learn how to find small communities in graphs

(3)

Outline

Spectral Clustering

Small Communities

(4)

Graph Partitioning

Undirected graph 𝑮𝑮(𝑽𝑽, 𝑬𝑬):

Bi-partitioning task:

Divide vertices into two disjoint groups 𝑨𝑨, 𝑩𝑩

Questions:

How can we define a “good” partition of 𝑮𝑮?

How can we efficiently identify such a partition?

1 3 2

5

4 6

A B

1

3 2

5

4 6

(5)

Graph Partitioning

What makes a good partition?

Maximize the number of within-group connections

Minimize the number of between-group connections

1

3 2

5

4 6

A B

(6)

A B

Graph Cuts

Express partitioning objectives as a function of the “edge cut” of the partition

Cut: Set of edges with only one vertex in a group:

cut(A,B) = 2

1

3 2

5

4 6

(7)

Graph Cut Criterion

Criterion: Minimum-cut

Minimize weight of connections between groups

Degenerate case:

Problem:

Only considers external cluster connections

Does not consider internal cluster connectivity

arg min

A,B

cut(A,B)

“Optimal cut”

Minimum cut

(8)

Graph Cut Criteria

Criterion: Normalized-cut [Shi-Malik, ’97]

Connectivity between groups relative to the density of each group

𝒗𝒗𝒗𝒗𝒗𝒗(𝑨𝑨): total weight of the edges with at least one endpoint in 𝑨𝑨: 𝒗𝒗𝒗𝒗𝒗𝒗 𝑨𝑨 = ∑𝒊𝒊∈𝑨𝑨 𝒌𝒌𝒊𝒊

Why use this criterion?

Produces more balanced partitions

How do we efficiently find a good partition?

Problem: Computing optimal cut is NP-hard

[Shi-Malik]

(9)

Spectral Graph Partitioning

A : adjacency matrix of undirected G

Aij =1 if (𝒊𝒊, 𝒋𝒋) is an edge, else 0

x : a vector in ℜ

n

with components (𝒙𝒙

𝟏𝟏

, … , 𝒙𝒙

𝒏𝒏

)

Think of it as a label/value of each node of 𝑮𝑮

What is the meaning of Ax ?

Entry y is a sum of labels x of neighbors of i

(10)

What is the meaning of Ax?

j

th

coordinate of Ax :

Sum of the x-values of neighbors of j

Make this a new value at node j

Spectral Graph Theory:

Analyze the “spectrum” of matrix representing 𝑮𝑮

Spectrum: Eigenvectors 𝒙𝒙𝒊𝒊 of a graph, ordered by the magnitude (strength) of their corresponding

eigenvalues 𝝀𝝀𝒊𝒊:

𝑨𝑨 ⋅ 𝒙𝒙 = 𝝀𝝀 ⋅ 𝒙𝒙

(11)

Example: d-regular graph

Suppose all nodes in 𝑮𝑮 have degree 𝒅𝒅 and 𝑮𝑮 is connected

What are some eigenvalues/vectors of 𝑮𝑮?

𝑨𝑨⋅ 𝒙𝒙 = 𝝀𝝀 ⋅ 𝒙𝒙 What is λ ? What x?

Let’s try: 𝒙𝒙 = (𝟏𝟏, 𝟏𝟏, … , 𝟏𝟏)

Then: 𝑨𝑨 ⋅ 𝒙𝒙 = 𝒅𝒅, 𝒅𝒅, … , 𝒅𝒅 = 𝝀𝝀 ⋅ 𝒙𝒙. So: 𝝀𝝀 = 𝒅𝒅

We found eigenpair of 𝑮𝑮: 𝒙𝒙 = (𝟏𝟏, 𝟏𝟏, … , 𝟏𝟏), 𝝀𝝀 = 𝒅𝒅

Remember the meaning of 𝒚𝒚 = 𝑨𝑨⋅ 𝒙𝒙:

definition of d-regular graph

(12)

Matrix Representations

Adjacency matrix (A):

n

×

n matrix

A=[aij], aij=1 if edge between node i and j

1

3 2

5

4 6

1 2 3 4 5 6

1 0 1 1 0 1 0

2 1 0 1 0 0 0

3 1 1 0 1 0 0

4 0 0 1 0 1 1

5 1 0 0 1 0 1

6 0 0 0 1 1 0

(13)

Matrix Representations

Degree matrix (D):

n

×

n diagonal matrix

D=[dii], dii = degree of node i

1

3 2

5

4 6

1 2 3 4 5 6

1 3 0 0 0 0 0

2 0 2 0 0 0 0

3 0 0 3 0 0 0

4 0 0 0 3 0 0

5 0 0 0 0 3 0

6 0 0 0 0 0 2

(14)

Matrix Representations

Laplacian matrix (L):

n

×

n symmetric matrix

What is trivial eigenpair?

𝒙𝒙 = (𝟏𝟏, … , 𝟏𝟏) then 𝑳𝑳 ⋅ 𝒙𝒙 = 𝟎𝟎 and so 𝝀𝝀 = 𝝀𝝀𝟏𝟏 = 𝟎𝟎

Important properties of L:

Eigenvalues are non-negative real numbers

Eigenvectors are real and orthogonal

𝑳𝑳 = 𝑫𝑫 − 𝑨𝑨

1

3 2

5

4 6

1 2 3 4 5 6

1 3 -1 -1 0 -1 0

2 -1 2 -1 0 0 0

3 -1 -1 3 -1 0 0

4 0 0 -1 3 -1 -1

5 -1 0 0 -1 3 -1

6 0 0 0 -1 -1 2

(15)

Our Goal

Recall: our goal is to find the minimum cut (or normalized cut)

1

3 2

5

4 6

A B

(16)

New Formulation of Min Cut [Fiedler’73]

Express partition (A,B) as a vector 𝒚𝒚

𝒊𝒊

= �+𝟏𝟏 −𝟏𝟏 𝒊𝒊𝒇𝒇 𝒊𝒊 ∈ 𝑨𝑨

𝒊𝒊𝒇𝒇 𝒊𝒊 ∈ 𝑩𝑩

Problem 1: Find a non-trivial vector y that minimizes f(y):

Problem 1 is equivalent to finding the min-cut

(A,B)

(17)

Connection of min-cut problem and Laplacian L

Why?

yTL y = ∑𝑖𝑖,𝑗𝑗=1𝑛𝑛 𝐿𝐿𝑖𝑖𝑗𝑗 𝑦𝑦𝑖𝑖𝑦𝑦𝑗𝑗 = ∑𝑖𝑖,𝑗𝑗=1𝑛𝑛 𝐷𝐷𝑖𝑖𝑗𝑗 − 𝐴𝐴𝑖𝑖𝑗𝑗 𝑦𝑦𝑖𝑖𝑦𝑦𝑗𝑗

= ∑𝑖𝑖 𝐷𝐷𝑖𝑖𝑖𝑖𝑦𝑦𝑖𝑖2 − ∑ 𝑖𝑖,𝑗𝑗 ∈𝐸𝐸 2𝑦𝑦𝑖𝑖𝑦𝑦𝑗𝑗

= ∑ 𝑖𝑖,𝑗𝑗 ∈𝐸𝐸(𝑦𝑦𝑖𝑖2 + 𝑦𝑦𝑗𝑗2 − 2𝑦𝑦𝑖𝑖𝑦𝑦𝑗𝑗) = ∑ 𝒊𝒊,𝒋𝒋 ∈𝑬𝑬 𝒚𝒚𝒊𝒊 − 𝒚𝒚𝒋𝒋 𝟐𝟐

New Formulation of Min Cut [Fiedler’73]

(18)

Until now: finding the min-cut is equiv. to solving the following problem

But, the problem is NP-hard

So, we relax the constraint to make it viable

𝒚𝒚𝒊𝒊 can be any number

Surprisingly, the solution of the problem is tightly connected with the eigenvector of L

Detail: next slide

New Formulation of Min Cut [Fiedler’73]

(19)

Rayleigh Theorem

𝝀𝝀𝟐𝟐 = 𝐦𝐦𝐦𝐦𝐧𝐧𝐲𝐲 𝒇𝒇 𝒚𝒚 : The minimum value of 𝒇𝒇(𝒚𝒚) is given by the 2nd smallest eigenvalue λ2 of the Laplacian matrix L

𝐱𝐱 = 𝐚𝐚𝐚𝐚𝐚𝐚 𝐦𝐦𝐦𝐦𝐧𝐧𝐲𝐲 𝒇𝒇 𝒚𝒚 : The optimal solution for y is given by the corresponding eigenvector 𝒙𝒙, referred as the Fiedler vector

𝑥𝑥𝑖𝑖 0 𝑥𝑥𝑗𝑗 x

(20)

Rayleigh Theorem

In fact, the minimum solution is given by y = 1 vector (the smallest eigenvector w/ eigenvalue 0); however, this does not say anything about the partition

Thus, we find the next best solution which is the Fiedler vector

Let the Fiedler vector = (𝛼𝛼1, … 𝛼𝛼𝑛𝑛). Then,

∑𝛼𝛼𝑖𝑖2 = 1 (unit length), ∑𝛼𝛼𝑖𝑖 = 0 (orthogonal to the

𝑥𝑥𝑖𝑖 0 𝑥𝑥𝑗𝑗 x

(21)

So far…

How to define a “good” partition of a graph?

Minimize a given graph cut criterion

How to efficiently identify such a partition?

Approximate using information provided by the eigenvalues and eigenvectors of Laplacian

Spectral Clustering

(22)

Spectral Clustering Algorithms

Three basic stages:

1) Pre-processing

Construct a matrix representation of the graph

2) Decomposition

Compute eigenvalues and eigenvectors of Laplacian

Map each point to a lower-dimensional representation based on one or more eigenvectors

3) Grouping

Assign points to two or more clusters, based on the new representation

(23)

Spectral Partitioning Algorithm

1) Pre-processing:

Build Laplacian matrix L of the graph

2) Decomposition:

Find eigenvalues λ and eigenvectors x of the matrix L

Map vertices to corresponding components of x

0.0 -0.4 -0.4 0.4 -0.6 0.4

0.5 0.4 -0.2 -0.5 -0.3 0.4

0.4 -0.5 0.6 0.1 -0.3 0.4

0.5 -0.4 0.6 0.1 0.3 0.4

0.0 0.4 -0.4 0.4 0.6 0.4

-0.5 -0.4 -0.2 -0.5 0.3 0.4

5.0 4.0 3.0 3.0 1.0 0.0

λ= X =

5 - -

4 0.3

3 0.3

2 0.6

1 0.3

1 2 3 4 5 6

1 3 -1 -1 0 -1 0

2 -1 2 -1 0 0 0

3 -1 -1 3 -1 0 0

4 0 0 -1 3 -1 -1

5 -1 0 0 -1 3 -1

6 0 0 0 -1 -1 2

(24)

Spectral Partitioning

3) Grouping:

Sort components of reduced 1-dimensional vector

Identify clusters by splitting the sorted vector in two

How to choose a splitting point?

Naïve approaches:

Split at 0 or median value

More expensive approaches:

Attempt to minimize normalized cut in 1-dimension

(sweep over ordering of nodes induced by the eigenvector)

4 -0.3 3 0.3 2 0.6

1 0.3 Split at 0:

Cluster A: Positive points Cluster B: Negative points

1 0.3 4 -0.3

A B

(25)

Example: Spectral Partitioning

Rank in x2 Value of x 2

(26)

Example: Spectral Partitioning

Value of x 2

Components of x

2

(27)

Example: Spectral partitioning

Components of x1

(28)

k-Way Spectral Clustering

How do we partition a graph into k clusters?

Two basic approaches:

Recursive bi-partitioning [Hagen et al., ’92]

Recursively apply bi-partitioning algorithm in a hierarchical divisive manner

Disadvantages: Inefficient, unstable

Cluster multiple eigenvectors [Shi-Malik, ’00]

Build a reduced space from multiple eigenvectors

Commonly used in recent papers

(29)

Outline

Spectral Clustering

Small Communities

(30)

Small Communities in Web

What is the signature of a community / discussion in a Web graph?

[Kumar et al. ‘99]

Dense 2-layer graph

Intuition: Many people all talking about the same things

Use this to define “topics”:

What the same people on the left talk about on the right Remember HITS!

(31)

Searching for Small Communities

A more well-defined problem:

Enumerate complete bipartite subgraphs K

s,t

Where Ks,t : s nodes on the “left” where each links to the same t other nodes on the “right”

K

3,4

|X| = s = 3

|Y| = t = 4

X Y

(32)

Frequent Itemset Enumeration

Market basket analysis. Setting:

Market: Universe U of n items

Baskets: m subsets of U: S1, S2, …, Sm ⊆ U (Si is a set of items one person bought)

Support: minimum number of occurrence f

Goal:

Find all subsets T appearing in at least f sets Si (items in T were bought together at least f times)

What’s the connection between the

itemsets and complete bipartite graphs?

[Agrawal-Srikant ‘99]

(33)

From Itemsets to Bipartite K s,t

Frequent itemsets = complete bipartite graphs!

How?

View each node i as a set Si of nodes i points to

Ks,t = a set Y of size t that occurs in s sets Si

Looking for Ks,t  finding frequent sets of size t

with support s

[Kumar et al. ‘99]

i

b c d a

Si={a,b,c,d}

j i k

b c d a

X Y

s … minimum support (|X|=s)

(34)

From Itemsets to Bipartite K s,t

[Kumar et al. ‘99]

i

b c d a

Si={a,b,c,d}

x y

b a

Find frequent itemsets:

s … minimum support t … itemset size

x

b c a

We found K

s,t

!

K

s,t

= a set Y of size t View each node i as a

set S

i

of nodes i points to Say we find a frequent

itemset Y={a,b,c} of supp s So, there are s nodes that link to all of {a,b,c}:

z

a b c y

b c a

(35)

Example (1)

Minimum support s=2

{b,d}: support 3

{e,f}: support 2

And we just found 2 bipartite subgraphs:

c

a b

d

f

Itemsets:

a = {b,c,d}

b = {d}

c = {b,d,e,f}

d = {e,f}

e = {b,d}

e

c

a b

d

c

d

e f

(36)

Example (2)

Example of a community from a web graph

Nodes on the right Nodes on the left

(37)

Questions?

참조

관련 문서

Usefulness of co-treatment with immunomodulators in patients with inflammatory bowel disease treated with scheduled infliximab maintenance therapy.. Oussalah A, Chevaux JB, Fay

However, the 432 Detector output does not change to the selected balance until the detector is autozeroed by a contact closure at the Auto Zero input terminals on the

The index is calculated with the latest 5-year auction data of 400 selected Classic, Modern, and Contemporary Chinese painting artists from major auction houses..

The key issue is whether HTS can be defined as the 6th generation of violent extremism. That is, whether it will first safely settle as a locally embedded group

The “Asset Allocation” portfolio assumes the following weights: 25% in the S&P 500, 10% in the Russell 2000, 15% in the MSCI EAFE, 5% in the MSCI EME, 25% in the

The eigenvalue problem can be used to identify the limit state of the process, in which the state vector x is reproduced under the multiplication by the.. stochastic matrix

If you started writing your hypothetical novel using L A TEX instead of a WYSIWYG editor, and used custom environments for typesetting events in different dimensions, and

Step 2 The subsidiary equation is solved by purely algebraic manipulations3. Step 3 The solution in Step 2 is transformed back, resulting in the solution of