Introduction to Data Mining

(1)

Introduction to Data Mining

Lecture #9: Link Analysis U Kang

Seoul National University

(2)

Outline

Overview

PageRank: Flow Formulation

(3)

Graph Data: Social Networks

Facebook social graph

4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]

(4)

Graph Data: Media Networks

Connections between political blogs

(5)

Graph Data: Information Nets

Citation networks and Maps of science

[Börner et al., 2012]

(6)

domain2

domain1

domain3 router

Internet

Graph Data: Communication Nets

(7)

Graph Data: Classic Example

Seven Bridges of Königsberg

[Euler, 1735]

Return to the starting point by traveling each link of the graph once and only once.

(8)



Web as a directed graph:

 Nodes: Webpages

 Edges: Hyperlinks

This is a class on

Data

Mining. Classes

are in the 302

building 302

building is located near SNU

SNU homepage

Web as a Graph

(9)

Web as a Graph



Web as a directed graph:

 Nodes: Webpages

 Edges: Hyperlinks

This is a class on

Data

Mining. Classes

are in the 302

building 302

building is located near SNU

SNU homepage

(10)

Web as a Directed Graph

(11)

Broad Question



How to organize the Web?



First try: Human curated Web directories

 Yahoo, DMOZ, LookSmart



Second try: Web Search

 Information Retrieval investigates:

Find relevant docs in a small and trusted set

 Newspaper articles, Patents, etc.

 But: Web is huge, full of untrusted documents, random things, web spam, etc.

(12)

Web Search: 2 Challenges

2 challenges of web search:



(1) Web contains many sources of information Who to “trust”?

 Idea: Trustworthy pages may point to each other!



(2) What is the “best” answer to the query

“newspaper”?

 No single right answer

 Idea: Pages that actually know about newspapers might all be pointing to many newspapers

(13)



All web pages are not equally “important”

www.joe-schmoe.com vs. www.snu.ac.kr



There is large diversity in the web-graph

node connectivity.

Let’s rank the pages by the link structure!

Ranking Nodes on the Graph

(14)

Link Analysis Algorithms



We will cover the following Link Analysis approaches for computing importances of nodes in a graph:

 Page Rank

 Topic-Specific (Personalized) Page Rank

 Web Spam Detection Algorithms

(15)

Outline

Overview

PageRank: Flow Formulation

(16)



Idea: Links as votes

 A page is more important if it has more links

 In-coming links? Out-going links?



Think of in-links as votes:

 www.snu.ac.kr has 100,000 in-links

 www.joe-schmoe.com has 1 in-link



Are all in-links equal?

 Links from important pages count more

 Recursive question!

Links as Votes

(17)

Example: PageRank Scores

B

38.4 C

34.3

E 8.1

F 3.9 D

3.9 A

3.3

1.6

1.6 1.6 1.6 1.6

(18)



Each link’s vote is proportional to the importance of its source page



If page j with importance r

_j

has n out-links, each link gets r

_j

/ n votes



Page j ’ s own importance is the sum of the votes on its in-links

j

k i

r_j/3 r_j/3 r_j/3

r_j = r_i/3+r_k/4

r_i /3 r_k/4

Simple Recursive Formulation

(19)

PageRank: The “Flow” Model



A “vote” from an important page is worth more



A page is important if it is

pointed to by other important pages



Define a “rank” r

_j

for page j

∑

→

=

j i

i j

r r

d

i

y

m a a/2

y/2 a/2

m y/2

“Flow” equations:

r_y = r_y/2 + r_a /2 r_a = r_y/2 + r_m r_m = r_a /2

𝒅𝒅_𝒊𝒊 … out-degree of node 𝒊𝒊 i -> j : all i that point to j

(20)

 3 equations, 3 unknowns, no constants

 No unique solution

 All solutions equivalent modulo the scale factor

 I.e., Multiplying c to given a solution r_y, r_a, r_m will give you another solution

 Additional constraint forces uniqueness:

 𝒓𝒓_𝒚𝒚 + 𝒓𝒓_𝒂𝒂 + 𝒓𝒓_𝒎𝒎 = 𝟏𝟏

 Solution: 𝒓𝒓_𝒚𝒚 = ^𝟐𝟐_𝟓𝟓 , 𝒓𝒓_𝒂𝒂 = ^𝟐𝟐_𝟓𝟓, 𝒓𝒓_𝒎𝒎 = ^𝟏𝟏_𝟓𝟓



Gaussian elimination method works for

small examples, but we need a better method for large web-size graphs

 We need a new formulation!

Flow equations:

Solving the Flow Equations

(21)

21 U Kang



Stochastic adjacency matrix 𝑴𝑴

 Let page 𝑖𝑖 has 𝑑𝑑_𝑖𝑖 out-links

 If 𝑖𝑖 → 𝑗𝑗, then

𝑀𝑀

_{𝑗𝑗𝑖𝑖}

=

_𝑑𝑑¹

𝑖𝑖

else

𝑀𝑀

_{𝑗𝑗𝑖𝑖}

= 0

 𝑴𝑴 is a column stochastic matrix

 Columns sum to 1

NOTE: A matrix M is called `column

stochastic’ if the sum of each column is 1

y

a m

y a m

y ½ ½ 0

a ½ 0 1

m 0 ½ 0

𝑀𝑀

Source

Destination

PageRank: Matrix Formulation

(22)



Rank vector 𝒓𝒓: vector with an entry per page

 𝑟𝑟_𝑖𝑖 is the importance score of page 𝑖𝑖

 ∑_𝑖𝑖 𝑟𝑟_𝑖𝑖 = 1



The flow equations can be written

𝒓𝒓 = 𝑴𝑴 ⋅ 𝒓𝒓

Why?

∑

→

=

j i

i j

r r

di

= x

r_j

d2

1

d5

1

d9

1

r₂

r₅

PageRank: Matrix Formulation

(23)



The flow equations can be written

𝒓𝒓 = 𝑴𝑴 � 𝒓𝒓



So the rank vector r is an eigenvector of the web matrix M, with the corresponding eigenvalue 1



Fact: The largest eigenvalue of a column stochastic matrix is 1



We can now efficiently solve for r!

The method is called Power iteration

NOTE1: x is an eigenvector with the corresponding eigenvalue λ if:

𝑨𝑨𝑨𝑨 = 𝝀𝝀𝑨𝑨

Eigenvector Formulation

(24)

r = M∙r

y ½ ½ 0 y a = ½ 0 1 a m 0 ½ 0 m y

a m

y a m

y ½ ½ 0

a ½ 0 1

m 0 ½ 0

r_y = r_y /2 + r_a /2 r_a = r_y /2 + r_m r_m = r_a /2

Example: Flow Equations & M

(25)



Given a web graph with n nodes, where the nodes are pages and edges are hyperlinks



Power iteration: a simple iterative scheme

 Suppose there are N web pages

 Initialize: r⁽⁰⁾ = [1/N,….,1/N]^T

 Iterate: r^(t+1) = M ∙ r^(t)

 Stop when |r^(t+1)– r^(t)|₁ < ε

∑

→

+ =

j i

t t i

j

r r

i ) ( )

1 (

d

d_i …. out-degree of node i

|x|₁ = ∑_1≤i≤N|x_i| (called the L¹ norm)

Can use any other vector norm, e.g., Euclidean

Power Iteration Method

(26)

Power Iteration Method



Power iteration:

A method for finding dominant eigenvector (the vector corresponding to the largest eigenvalue)

 𝒓𝒓^(𝟏𝟏) = 𝑴𝑴 ⋅ 𝒓𝒓^(𝟎𝟎)

 𝒓𝒓^(𝟐𝟐) = 𝑴𝑴 ⋅ 𝒓𝒓 ^𝟏𝟏 = 𝑴𝑴 𝑴𝑴𝒓𝒓 ^𝟏𝟏 = 𝑴𝑴^𝟐𝟐 ⋅ 𝒓𝒓 ^𝟎𝟎

 𝒓𝒓^(𝟑𝟑) = 𝑴𝑴 ⋅ 𝒓𝒓 ^𝟐𝟐 = 𝑴𝑴 𝑴𝑴^𝟐𝟐𝒓𝒓 ^𝟎𝟎 = 𝑴𝑴^𝟑𝟑 ⋅ 𝒓𝒓 ^𝟎𝟎



Fact:

Sequence 𝑴𝑴 ⋅ 𝒓𝒓

^𝟎𝟎

, 𝑴𝑴

^𝟐𝟐

⋅ 𝒓𝒓

^𝟎𝟎

, … 𝑴𝑴

^𝒌𝒌

⋅ 𝒓𝒓

^𝟎𝟎

, … approaches the dominant eigenvector of 𝑴𝑴

 Dominant eigenvector = the one corresponding to the largest eigenvalue

(27)

PageRank: How to solve?



Power Iteration:

 Set 𝑟𝑟_𝑗𝑗 = 1/N

 1: 𝑟𝑟𝑟_𝑗𝑗 = ∑_{𝑖𝑖→𝑗𝑗} _𝑑𝑑^𝑟𝑟^𝑖𝑖

𝑖𝑖

 2: 𝑟𝑟 = 𝑟𝑟𝑟

 Goto 1



Example:

r_y 1/3 1/3 5/12 9/24 6/15

r_a = 1/3 3/6 1/3 11/24 … 6/15

r_m 1/3 1/6 3/12 1/6 3/15

y

a m

y a m

y ½ ½ 0

a ½ 0 1

m 0 ½ 0

Iteration 0, 1, 2, …

(28)



Imagine a random web surfer:

 At any time 𝒕𝒕, surfer is on some page 𝒊𝒊

 At time 𝒕𝒕 + 𝟏𝟏, the surfer follows an out-link from 𝒊𝒊 uniformly at random

 Ends up on some page 𝒋𝒋 linked from 𝒊𝒊

 Process repeats indefinitely



Let:

 𝒑𝒑(𝒕𝒕) … vector whose 𝒊𝒊^th coordinate is the prob. that the surfer is at page 𝒊𝒊 at time 𝒕𝒕

 So, 𝒑𝒑(𝒕𝒕) is a probability distribution over pages

∑

→

=

j i

i j

r r

(i) d_out

j

i₁ i₂ i₃

Random Walk Interpretation

(29)



Where is the surfer at time t+1?

 Follows a link uniformly at random

𝒑𝒑 𝒕𝒕 + 𝟏𝟏 = 𝑴𝑴 ⋅ 𝒑𝒑(𝒕𝒕)



Suppose the random walk reaches a state 𝒑𝒑 𝒕𝒕 + 𝟏𝟏 = 𝑴𝑴 ⋅ 𝒑𝒑(𝒕𝒕) = 𝒑𝒑(𝒕𝒕)

then 𝒑𝒑(𝒕𝒕) is called stationary distribution of a random walk



Our original rank vector 𝒓𝒓 satisfies 𝒓𝒓 = 𝑴𝑴 ⋅ 𝒓𝒓

 So, 𝒓𝒓 is a stationary distribution for the random walk

) ( M

) 1

(t p t

p + = ⋅

j

i₁ i₂ i₃

The Stationary Distribution

(30)

Existence and Uniqueness



A central result from the theory of random walks (a.k.a. Markov processes):

For graphs that satisfy certain conditions, the stationary distribution is unique and

eventually will be reached no matter what the initial probability distribution at time t = 0

Certain conditions: a walk starting from a random page can reach any other page

(31)