Introduction to Data Mining
Lecture #9: Link Analysis U Kang
Seoul National University
Outline
Overview
PageRank: Flow Formulation
Graph Data: Social Networks
Facebook social graph
4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]
Graph Data: Media Networks
Connections between political blogs
Graph Data: Information Nets
Citation networks and Maps of science
[Börner et al., 2012]
domain2
domain1
domain3 router
Internet
Graph Data: Communication Nets
Graph Data: Classic Example
Seven Bridges of Königsberg
[Euler, 1735]
Return to the starting point by traveling each link of the graph once and only once.
Web as a directed graph:
Nodes: Webpages
Edges: Hyperlinks
This is a class on
Data
Mining. Classes
are in the 302
building 302
building is located near SNU
SNU homepage
Web as a Graph
Web as a Graph
Web as a directed graph:
Nodes: Webpages
Edges: Hyperlinks
This is a class on
Data
Mining. Classes
are in the 302
building 302
building is located near SNU
SNU homepage
Web as a Directed Graph
Broad Question
How to organize the Web?
First try: Human curated Web directories
Yahoo, DMOZ, LookSmart
Second try: Web Search
Information Retrieval investigates:
Find relevant docs in a small and trusted set
Newspaper articles, Patents, etc.
But: Web is huge, full of untrusted documents, random things, web spam, etc.
Web Search: 2 Challenges
2 challenges of web search:
(1) Web contains many sources of information Who to “trust”?
Idea: Trustworthy pages may point to each other!
(2) What is the “best” answer to the query
“newspaper”?
No single right answer
Idea: Pages that actually know about newspapers might all be pointing to many newspapers
All web pages are not equally “important”
www.joe-schmoe.com vs. www.snu.ac.kr
There is large diversity in the web-graph
node connectivity.
Let’s rank the pages by the link structure!
Ranking Nodes on the Graph
Link Analysis Algorithms
We will cover the following Link Analysis approaches for computing importances of nodes in a graph:
Page Rank
Topic-Specific (Personalized) Page Rank
Web Spam Detection Algorithms
Outline
Overview
PageRank: Flow Formulation
Idea: Links as votes
A page is more important if it has more links
In-coming links? Out-going links?
Think of in-links as votes:
www.snu.ac.kr has 100,000 in-links
www.joe-schmoe.com has 1 in-link
Are all in-links equal?
Links from important pages count more
Recursive question!
Links as Votes
Example: PageRank Scores
B
38.4 C
34.3
E 8.1
F 3.9 D
3.9 A
3.3
1.6
1.6 1.6 1.6 1.6
Each link’s vote is proportional to the importance of its source page
If page j with importance r
jhas n out-links, each link gets r
j/ n votes
Page j ’ s own importance is the sum of the votes on its in-links
j
k i
rj/3 rj/3 rj/3
rj = ri/3+rk/4
ri /3 rk/4
Simple Recursive Formulation
PageRank: The “Flow” Model
A “vote” from an important page is worth more
A page is important if it is
pointed to by other important pages
Define a “rank” r
jfor page j
∑
→=
j i
i j
r r
d
iy
m a a/2
y/2 a/2
m y/2
“Flow” equations:
ry = ry/2 + ra /2 ra = ry/2 + rm rm = ra /2
𝒅𝒅𝒊𝒊 … out-degree of node 𝒊𝒊 i -> j : all i that point to j
3 equations, 3 unknowns, no constants
No unique solution
All solutions equivalent modulo the scale factor
I.e., Multiplying c to given a solution ry, ra, rm will give you another solution
Additional constraint forces uniqueness:
𝒓𝒓𝒚𝒚 + 𝒓𝒓𝒂𝒂 + 𝒓𝒓𝒎𝒎 = 𝟏𝟏
Solution: 𝒓𝒓𝒚𝒚 = 𝟐𝟐𝟓𝟓 , 𝒓𝒓𝒂𝒂 = 𝟐𝟐𝟓𝟓, 𝒓𝒓𝒎𝒎 = 𝟏𝟏𝟓𝟓
Gaussian elimination method works for
small examples, but we need a better method for large web-size graphs
We need a new formulation!
ry = ry/2 + ra /2 ra = ry/2 + rm rm = ra /2
Flow equations:
Solving the Flow Equations
21 U Kang
Stochastic adjacency matrix 𝑴𝑴
Let page 𝑖𝑖 has 𝑑𝑑𝑖𝑖 out-links
If 𝑖𝑖 → 𝑗𝑗, then
𝑀𝑀
𝑗𝑗𝑖𝑖=
𝑑𝑑1𝑖𝑖
else
𝑀𝑀
𝑗𝑗𝑖𝑖= 0
𝑴𝑴 is a column stochastic matrix
Columns sum to 1
NOTE: A matrix M is called `column
stochastic’ if the sum of each column is 1
y
a m
y a m
y ½ ½ 0
a ½ 0 1
m 0 ½ 0
𝑀𝑀
Source
Destination
PageRank: Matrix Formulation
Rank vector 𝒓𝒓: vector with an entry per page
𝑟𝑟𝑖𝑖 is the importance score of page 𝑖𝑖
∑𝑖𝑖 𝑟𝑟𝑖𝑖 = 1
The flow equations can be written
𝒓𝒓 = 𝑴𝑴 ⋅ 𝒓𝒓
Why?
∑
→=
j i
i j
r r
di
= x
rj
d2
1
d5
1
d9
1
r2
r5
PageRank: Matrix Formulation
The flow equations can be written
𝒓𝒓 = 𝑴𝑴 � 𝒓𝒓
So the rank vector r is an eigenvector of the web matrix M, with the corresponding eigenvalue 1
Fact: The largest eigenvalue of a column stochastic matrix is 1
We can now efficiently solve for r!
The method is called Power iteration
NOTE1: x is an eigenvector with the corresponding eigenvalue λ if:
𝑨𝑨𝑨𝑨 = 𝝀𝝀𝑨𝑨
Eigenvector Formulation
r = M∙r
y ½ ½ 0 y a = ½ 0 1 a m 0 ½ 0 m y
a m
y a m
y ½ ½ 0
a ½ 0 1
m 0 ½ 0
ry = ry /2 + ra /2 ra = ry /2 + rm rm = ra /2
Example: Flow Equations & M
Given a web graph with n nodes, where the nodes are pages and edges are hyperlinks
Power iteration: a simple iterative scheme
Suppose there are N web pages
Initialize: r(0) = [1/N,….,1/N]T
Iterate: r(t+1) = M ∙ r(t)
Stop when |r(t+1) – r(t)|1 < ε
∑
→+ =
j i
t t i
j
r r
i ) ( )
1 (
d
di …. out-degree of node i
|x|1 = ∑1≤i≤N|xi| (called the L1 norm)
Can use any other vector norm, e.g., Euclidean
Power Iteration Method
Power Iteration Method
Power iteration:
A method for finding dominant eigenvector (the vector corresponding to the largest eigenvalue)
𝒓𝒓(𝟏𝟏) = 𝑴𝑴 ⋅ 𝒓𝒓(𝟎𝟎)
𝒓𝒓(𝟐𝟐) = 𝑴𝑴 ⋅ 𝒓𝒓 𝟏𝟏 = 𝑴𝑴 𝑴𝑴𝒓𝒓 𝟏𝟏 = 𝑴𝑴𝟐𝟐 ⋅ 𝒓𝒓 𝟎𝟎
𝒓𝒓(𝟑𝟑) = 𝑴𝑴 ⋅ 𝒓𝒓 𝟐𝟐 = 𝑴𝑴 𝑴𝑴𝟐𝟐𝒓𝒓 𝟎𝟎 = 𝑴𝑴𝟑𝟑 ⋅ 𝒓𝒓 𝟎𝟎
Fact:
Sequence 𝑴𝑴 ⋅ 𝒓𝒓
𝟎𝟎, 𝑴𝑴
𝟐𝟐⋅ 𝒓𝒓
𝟎𝟎, … 𝑴𝑴
𝒌𝒌⋅ 𝒓𝒓
𝟎𝟎, … approaches the dominant eigenvector of 𝑴𝑴
Dominant eigenvector = the one corresponding to the largest eigenvalue
PageRank: How to solve?
Power Iteration:
Set 𝑟𝑟𝑗𝑗 = 1/N
1: 𝑟𝑟𝑟𝑗𝑗 = ∑𝑖𝑖→𝑗𝑗 𝑑𝑑𝑟𝑟𝑖𝑖
𝑖𝑖
2: 𝑟𝑟 = 𝑟𝑟𝑟
Goto 1
Example:
ry 1/3 1/3 5/12 9/24 6/15
ra = 1/3 3/6 1/3 11/24 … 6/15
rm 1/3 1/6 3/12 1/6 3/15
y
a m
y a m
y ½ ½ 0
a ½ 0 1
m 0 ½ 0
Iteration 0, 1, 2, …
ry = ry/2 + ra /2 ra = ry/2 + rm rm = ra /2
Imagine a random web surfer:
At any time 𝒕𝒕, surfer is on some page 𝒊𝒊
At time 𝒕𝒕 + 𝟏𝟏, the surfer follows an out-link from 𝒊𝒊 uniformly at random
Ends up on some page 𝒋𝒋 linked from 𝒊𝒊
Process repeats indefinitely
Let:
𝒑𝒑(𝒕𝒕) … vector whose 𝒊𝒊th coordinate is the prob. that the surfer is at page 𝒊𝒊 at time 𝒕𝒕
So, 𝒑𝒑(𝒕𝒕) is a probability distribution over pages
∑
→=
j i
i j
r r
(i) dout
j
i1 i2 i3
Random Walk Interpretation
Where is the surfer at time t+1?
Follows a link uniformly at random
𝒑𝒑 𝒕𝒕 + 𝟏𝟏 = 𝑴𝑴 ⋅ 𝒑𝒑(𝒕𝒕)
Suppose the random walk reaches a state 𝒑𝒑 𝒕𝒕 + 𝟏𝟏 = 𝑴𝑴 ⋅ 𝒑𝒑(𝒕𝒕) = 𝒑𝒑(𝒕𝒕)
then 𝒑𝒑(𝒕𝒕) is called stationary distribution of a random walk
Our original rank vector 𝒓𝒓 satisfies 𝒓𝒓 = 𝑴𝑴 ⋅ 𝒓𝒓
So, 𝒓𝒓 is a stationary distribution for the random walk
) ( M
) 1
(t p t
p + = ⋅
j
i1 i2 i3
The Stationary Distribution
Existence and Uniqueness
A central result from the theory of random walks (a.k.a. Markov processes):
For graphs that satisfy certain conditions, the stationary distribution is unique and
eventually will be reached no matter what the initial probability distribution at time t = 0
Certain conditions: a walk starting from a random page can reach any other page