• 검색 결과가 없습니다.

Introduction to Data Mining

N/A
N/A
Protected

Academic year: 2022

Share "Introduction to Data Mining"

Copied!
31
0
0

로드 중.... (전체 텍스트 보기)

전체 글

(1)

Introduction to Data Mining

Lecture #9: Link Analysis U Kang

Seoul National University

(2)

Outline

Overview

PageRank: Flow Formulation

(3)

Graph Data: Social Networks

Facebook social graph

4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]

(4)

Graph Data: Media Networks

Connections between political blogs

(5)

Graph Data: Information Nets

Citation networks and Maps of science

[Börner et al., 2012]

(6)

domain2

domain1

domain3 router

Internet

Graph Data: Communication Nets

(7)

Graph Data: Classic Example

Seven Bridges of Königsberg

[Euler, 1735]

Return to the starting point by traveling each link of the graph once and only once.

(8)

Web as a directed graph:

Nodes: Webpages

Edges: Hyperlinks

This is a class on

Data

Mining. Classes

are in the 302

building 302

building is located near SNU

SNU homepage

Web as a Graph

(9)

Web as a Graph

Web as a directed graph:

Nodes: Webpages

Edges: Hyperlinks

This is a class on

Data

Mining. Classes

are in the 302

building 302

building is located near SNU

SNU homepage

(10)

Web as a Directed Graph

(11)

Broad Question

How to organize the Web?

First try: Human curated Web directories

Yahoo, DMOZ, LookSmart

Second try: Web Search

Information Retrieval investigates:

Find relevant docs in a small and trusted set

Newspaper articles, Patents, etc.

But: Web is huge, full of untrusted documents, random things, web spam, etc.

(12)

Web Search: 2 Challenges

2 challenges of web search:

(1) Web contains many sources of information Who to “trust”?

Idea: Trustworthy pages may point to each other!

(2) What is the “best” answer to the query

“newspaper”?

No single right answer

Idea: Pages that actually know about newspapers might all be pointing to many newspapers

(13)

All web pages are not equally “important”

www.joe-schmoe.com vs. www.snu.ac.kr

There is large diversity in the web-graph

node connectivity.

Let’s rank the pages by the link structure!

Ranking Nodes on the Graph

(14)

Link Analysis Algorithms

We will cover the following Link Analysis approaches for computing importances of nodes in a graph:

Page Rank

Topic-Specific (Personalized) Page Rank

Web Spam Detection Algorithms

(15)

Outline

Overview

PageRank: Flow Formulation

(16)

Idea: Links as votes

A page is more important if it has more links

In-coming links? Out-going links?

Think of in-links as votes:

www.snu.ac.kr has 100,000 in-links

www.joe-schmoe.com has 1 in-link

Are all in-links equal?

Links from important pages count more

Recursive question!

Links as Votes

(17)

Example: PageRank Scores

B

38.4 C

34.3

E 8.1

F 3.9 D

3.9 A

3.3

1.6

1.6 1.6 1.6 1.6

(18)

Each link’s vote is proportional to the importance of its source page

If page j with importance r

j

has n out-links, each link gets r

j

/ n votes

Page j ’ s own importance is the sum of the votes on its in-links

j

k i

rj/3 rj/3 rj/3

rj = ri/3+rk/4

ri /3 rk/4

Simple Recursive Formulation

(19)

PageRank: The “Flow” Model

A “vote” from an important page is worth more

A page is important if it is

pointed to by other important pages

Define a “rank” r

j

for page j

=

j i

i j

r r

d

i

y

m a a/2

y/2 a/2

m y/2

“Flow” equations:

ry = ry/2 + ra /2 ra = ry/2 + rm rm = ra /2

𝒅𝒅𝒊𝒊 … out-degree of node 𝒊𝒊 i -> j : all i that point to j

(20)

3 equations, 3 unknowns, no constants

No unique solution

All solutions equivalent modulo the scale factor

I.e., Multiplying c to given a solution ry, ra, rm will give you another solution

Additional constraint forces uniqueness:

𝒓𝒓𝒚𝒚 + 𝒓𝒓𝒂𝒂 + 𝒓𝒓𝒎𝒎 = 𝟏𝟏

Solution: 𝒓𝒓𝒚𝒚 = 𝟐𝟐𝟓𝟓 , 𝒓𝒓𝒂𝒂 = 𝟐𝟐𝟓𝟓, 𝒓𝒓𝒎𝒎 = 𝟏𝟏𝟓𝟓

Gaussian elimination method works for

small examples, but we need a better method for large web-size graphs

We need a new formulation!

ry = ry/2 + ra /2 ra = ry/2 + rm rm = ra /2

Flow equations:

Solving the Flow Equations

(21)

21 U Kang

Stochastic adjacency matrix 𝑴𝑴

Let page 𝑖𝑖 has 𝑑𝑑𝑖𝑖 out-links

If 𝑖𝑖 → 𝑗𝑗, then

𝑀𝑀

𝑗𝑗𝑖𝑖

=

𝑑𝑑1

𝑖𝑖

else

𝑀𝑀

𝑗𝑗𝑖𝑖

= 0

𝑴𝑴 is a column stochastic matrix

Columns sum to 1

NOTE: A matrix M is called `column

stochastic’ if the sum of each column is 1

y

a m

y a m

y ½ ½ 0

a ½ 0 1

m 0 ½ 0

𝑀𝑀

Source

Destination

PageRank: Matrix Formulation

(22)

Rank vector 𝒓𝒓: vector with an entry per page

𝑟𝑟𝑖𝑖 is the importance score of page 𝑖𝑖

𝑖𝑖 𝑟𝑟𝑖𝑖 = 1

The flow equations can be written

𝒓𝒓 = 𝑴𝑴 ⋅ 𝒓𝒓

Why?

=

j i

i j

r r

di

= x

rj

d2

1

d5

1

d9

1

r2

r5

PageRank: Matrix Formulation

(23)

The flow equations can be written

𝒓𝒓 = 𝑴𝑴 � 𝒓𝒓

So the rank vector r is an eigenvector of the web matrix M, with the corresponding eigenvalue 1

Fact: The largest eigenvalue of a column stochastic matrix is 1

We can now efficiently solve for r!

The method is called Power iteration

NOTE1: x is an eigenvector with the corresponding eigenvalue λ if:

𝑨𝑨𝑨𝑨 = 𝝀𝝀𝑨𝑨

Eigenvector Formulation

(24)

r = M∙r

y ½ ½ 0 y a = ½ 0 1 a m 0 ½ 0 m y

a m

y a m

y ½ ½ 0

a ½ 0 1

m 0 ½ 0

ry = ry /2 + ra /2 ra = ry /2 + rm rm = ra /2

Example: Flow Equations & M

(25)

Given a web graph with n nodes, where the nodes are pages and edges are hyperlinks

Power iteration: a simple iterative scheme

Suppose there are N web pages

Initialize: r(0) = [1/N,….,1/N]T

Iterate: r(t+1) = M ∙ r(t)

Stop when |r(t+1) – r(t)|1 < ε

+ =

j i

t t i

j

r r

i ) ( )

1 (

d

di …. out-degree of node i

|x|1 = ∑1≤i≤N|xi| (called the L1 norm)

Can use any other vector norm, e.g., Euclidean

Power Iteration Method

(26)

Power Iteration Method

Power iteration:

A method for finding dominant eigenvector (the vector corresponding to the largest eigenvalue)

𝒓𝒓(𝟏𝟏) = 𝑴𝑴 ⋅ 𝒓𝒓(𝟎𝟎)

𝒓𝒓(𝟐𝟐) = 𝑴𝑴 ⋅ 𝒓𝒓 𝟏𝟏 = 𝑴𝑴 𝑴𝑴𝒓𝒓 𝟏𝟏 = 𝑴𝑴𝟐𝟐 ⋅ 𝒓𝒓 𝟎𝟎

𝒓𝒓(𝟑𝟑) = 𝑴𝑴 ⋅ 𝒓𝒓 𝟐𝟐 = 𝑴𝑴 𝑴𝑴𝟐𝟐𝒓𝒓 𝟎𝟎 = 𝑴𝑴𝟑𝟑 ⋅ 𝒓𝒓 𝟎𝟎

Fact:

Sequence 𝑴𝑴 ⋅ 𝒓𝒓

𝟎𝟎

, 𝑴𝑴

𝟐𝟐

⋅ 𝒓𝒓

𝟎𝟎

, … 𝑴𝑴

𝒌𝒌

⋅ 𝒓𝒓

𝟎𝟎

, … approaches the dominant eigenvector of 𝑴𝑴

Dominant eigenvector = the one corresponding to the largest eigenvalue

(27)

PageRank: How to solve?

Power Iteration:

Set 𝑟𝑟𝑗𝑗 = 1/N

1: 𝑟𝑟𝑟𝑗𝑗 = ∑𝑖𝑖→𝑗𝑗 𝑑𝑑𝑟𝑟𝑖𝑖

𝑖𝑖

2: 𝑟𝑟 = 𝑟𝑟𝑟

Goto 1

Example:

ry 1/3 1/3 5/12 9/24 6/15

ra = 1/3 3/6 1/3 11/24 … 6/15

rm 1/3 1/6 3/12 1/6 3/15

y

a m

y a m

y ½ ½ 0

a ½ 0 1

m 0 ½ 0

Iteration 0, 1, 2, …

ry = ry/2 + ra /2 ra = ry/2 + rm rm = ra /2

(28)

Imagine a random web surfer:

At any time 𝒕𝒕, surfer is on some page 𝒊𝒊

At time 𝒕𝒕 + 𝟏𝟏, the surfer follows an out-link from 𝒊𝒊 uniformly at random

Ends up on some page 𝒋𝒋 linked from 𝒊𝒊

Process repeats indefinitely

Let:

𝒑𝒑(𝒕𝒕) … vector whose 𝒊𝒊th coordinate is the prob. that the surfer is at page 𝒊𝒊 at time 𝒕𝒕

So, 𝒑𝒑(𝒕𝒕) is a probability distribution over pages

=

j i

i j

r r

(i) dout

j

i1 i2 i3

Random Walk Interpretation

(29)

Where is the surfer at time t+1?

Follows a link uniformly at random

𝒑𝒑 𝒕𝒕 + 𝟏𝟏 = 𝑴𝑴 ⋅ 𝒑𝒑(𝒕𝒕)

Suppose the random walk reaches a state 𝒑𝒑 𝒕𝒕 + 𝟏𝟏 = 𝑴𝑴 ⋅ 𝒑𝒑(𝒕𝒕) = 𝒑𝒑(𝒕𝒕)

then 𝒑𝒑(𝒕𝒕) is called stationary distribution of a random walk

Our original rank vector 𝒓𝒓 satisfies 𝒓𝒓 = 𝑴𝑴 ⋅ 𝒓𝒓

So, 𝒓𝒓 is a stationary distribution for the random walk

) ( M

) 1

(t p t

p + = ⋅

j

i1 i2 i3

The Stationary Distribution

(30)

Existence and Uniqueness

A central result from the theory of random walks (a.k.a. Markov processes):

For graphs that satisfy certain conditions, the stationary distribution is unique and

eventually will be reached no matter what the initial probability distribution at time t = 0

Certain conditions: a walk starting from a random page can reach any other page

(31)

Questions?

참조

관련 문서

Average of the indexed values of the following data: (1) Total value of the indexed score for disability-adjusted life years (the number of years lost due to illness,

The “Asset Allocation” portfolio assumes the following weights: 25% in the S&amp;P 500, 10% in the Russell 2000, 15% in the MSCI EAFE, 5% in the MSCI EME, 25% in the

1 John Owen, Justification by Faith Alone, in The Works of John Owen, ed. John Bolt, trans. Scott Clark, &#34;Do This and Live: Christ's Active Obedience as the

The eigenvalue problem can be used to identify the limit state of the process, in which the state vector x is reproduced under the multiplication by the.. stochastic matrix

If you started writing your hypothetical novel using L A TEX instead of a WYSIWYG editor, and used custom environments for typesetting events in different dimensions, and

For Practical determination of the inverse A -1 of a nonsingular nxn matrix A, Gauss elimination can be used.. : This method is

For Practical determination of the inverse A -1 of a nonsingular nxn matrix A, Gauss elimination can be used.. : This method is

The struggle was successful, and in January 1926 the first of four papers on “Quantization as an Eigenvalue