• 검색 결과가 없습니다.

Introduction to Data Mining

N/A
N/A
Protected

Academic year: 2022

Share "Introduction to Data Mining"

Copied!
27
0
0

로드 중.... (전체 텍스트 보기)

전체 글

(1)

1 U Kang

Introduction to Data Mining

Lecture #19: Analysis of Large Graphs U Kang

Seoul National University

(2)

In This Lecture

Understand the definition and the motivation of community detection problem

Learn the Girvan-Newman algorithm for community detection

Learn how graph theories (connected

component, betweenness, BFS, modularity) are

applied for data mining

(3)

3 U Kang

Outline

Overview

Community Detection

(4)

Networks & Communities

We often think of networks being organized into

modules, cluster, communities:

(5)

5 U Kang

Goal: Find Densely Linked Clusters

(6)

Micro-Markets in Sponsored Search

Find micro-markets by partitioning the query-to- advertiser graph:

advertiser

query

[Andersen, Lang: Communities from seed sets, 2006]

(7)

7 U Kang

Movies and Actors

Clusters in Movies-to-Actors graph:

[Andersen, Lang: Communities from seed sets, 2006]

(8)

Twitter & Facebook

Discovering social circles, circles of trust:

[McAuley, Leskovec: Discovering social circles in ego networks, 2012]

(9)

9 U Kang

Outline

Overview

Community Detection

(10)

Community Detection

How to find communities?

(11)

11 U Kang

Method 1: Strength of Weak Ties

Edge betweenness: Number of

shortest paths passing over the edge

Intuition:

Edge strengths (call volume) in a real network

Edge betweenness in a real network

b=16 b=7.5

(12)

Method 1: Girvan-Newman

Divisive hierarchical clustering based on the notion of edge betweenness:

Number of shortest paths passing through the edge

Girvan-Newman Algorithm:

Undirected unweighted networks

Repeat until no edges are left:

Calculate betweenness of edges

Remove edges with highest betweenness

Connected components are communities

Gives a hierarchical decomposition of the network

[Girvan-Newman ‘02]

(13)

13 U Kang

Girvan-Newman: Example

Need to re-compute betweenness at

every step

33 49 1 12

(14)

Girvan-Newman: Example

Step 1: Step 2:

Step 3: Hierarchical network decomposition:

(15)

15 U Kang

Girvan-Newman: Results

Communities in physics collaborations

(16)

Girvan-Newman: Results

Zachary’s Karate club:

Hierarchical decomposition

(17)

17 U Kang

WE NEED TO RESOLVE 2 QUESTIONS

1.

How to compute betweenness?

2.

How to select the number of clusters?

(18)

How to Compute Betweenness?

Want to compute betweenness of paths starting at node E

Overview

Step 1: construct the BFS tree starting from E

Step 2: label each node by the # of shortest

paths from the root E

Step 3: for each edge, calculate the sum (over all nodes Y) of the

fraction of shortest paths from E to Y

Note: if we do this for all the starting nodes,

and divide by 2, we get the betweenness (BFS: Breadth First Search)

(19)

19 U Kang

How to Compute Betweenness?

Step 1: construct the BFS (breadth-first-search) tree starting from E

(20)

How to Compute Betweenness?

Step 2: label each node by the # of shortest paths from the root E

(21)

21 U Kang

How to Compute Betweenness?

Step 3: for each edge, calculate the sum (over all

nodes Y) of the fraction of shortest paths from E to Y

Leaf node gets credit 1

Each non-leaf node gets a credit (1 + sum of child edges’ credits)

Each node’s credit is sent to the edges above it (divide by labels from step 2)

(22)

WE NEED TO RESOLVE 2 QUESTIONS

1.

How to compute betweenness?

2.

How to select the number of clusters?

(23)

23 U Kang

Network Communities

Communities: sets of tightly connected nodes

Define: Modularity 𝑸𝑸

A measure of how well a network is partitioned into communities

Given a partitioning of the network into groups 𝒔𝒔 ∈ 𝑺𝑺:

Qs S [ (# edges within group s) –

(expected # edges within group s) ]

Need a null model!

(24)

Null Model: Configuration Model

Given real 𝑮𝑮 on 𝒏𝒏 nodes and 𝒎𝒎 edges, construct rewired network 𝑮𝑮𝑮

Same degree distribution but random connections

Consider 𝑮𝑮𝑮 as a multigraph

The expected number of edges between nodes

𝒊𝒊 and 𝒋𝒋 of degrees 𝒌𝒌𝒊𝒊 and 𝒌𝒌𝒋𝒋 equals to: 𝒌𝒌𝒊𝒊𝟐𝟐𝒎𝒎𝒌𝒌𝒋𝒋 = 𝒌𝒌𝟐𝟐𝒎𝒎𝒊𝒊𝒌𝒌𝒋𝒋

Check - the expected number of edges in (multigraph) G’:

= 𝟏𝟏𝟐𝟐𝒊𝒊∈𝑵𝑵 𝒋𝒋∈𝑵𝑵𝒌𝒌𝟐𝟐𝒎𝒎𝒊𝒊𝒌𝒌𝒋𝒋 = 𝟏𝟏𝟐𝟐 𝟐𝟐𝒎𝒎𝟏𝟏 𝒊𝒊∈𝑵𝑵𝒌𝒌𝒊𝒊 𝒋𝒋∈𝑵𝑵𝒌𝒌𝒋𝒋 =

= 𝟒𝟒𝒎𝒎𝟏𝟏 𝟐𝟐𝒎𝒎 ⋅ 𝟐𝟐𝒎𝒎 = 𝒎𝒎

j i

𝑢𝑢∈𝑁𝑁

𝑘𝑘𝑢𝑢 = 2𝑚𝑚

Note:

(25)

25 U Kang

Modularity

Modularity of partitioning S of graph G:

Q ∝ s S [ (# edges within group s) –

(expected # edges within group s) ]

𝑸𝑸 𝑮𝑮, 𝑺𝑺 = 𝟐𝟐𝒎𝒎𝟏𝟏𝒔𝒔∈𝑺𝑺𝒊𝒊∈𝒔𝒔𝒋𝒋∈𝒔𝒔 𝑨𝑨𝒊𝒊𝒋𝒋𝒌𝒌𝟐𝟐𝒎𝒎𝒊𝒊𝒌𝒌𝒋𝒋

Modularity values take range [−1,1]

It is positive if the number of edges within groups exceeds the expected number

0.3-0.7<Q means significant community structure

Aij = 1 if i→j, 0 else Normalizing cost.: -1<Q<1

(26)

Modularity: Number of clusters

Modularity is useful for selecting the

number of clusters:

Q

(27)

27 U Kang

Questions?

참조

관련 문서

 Given a minimum support s, then sets of items that appear in at least s baskets are called frequent itemsets.

 Learn the definition and properties of SVD, one of the most important tools in data mining!.  Learn how to interpret the results of SVD, and how to use it

 Drineas et al., Fast Monte Carlo Algorithms for Matrices III: Com puting a Compressed Approximate Matrix Decomposition, SIAM Journal on Computing,

 Communication cost = input file size + 2 × (sum of the sizes of all files passed from Map processes to Reduce processes) + the sum of the output sizes of the Reduce

 Because output is spread across R files (each reduce task creates one file in DFS).. Task

 Data stream consists of a universe of elements chosen from a set of size N.  Maintain a count of the number of distinct elements

 In fact, the minimum solution is given by y = 1 vector (the smallest eigenvector w/ eigenvalue 0); however, this does not say anything about the partition.  Thus, we find

 Main idea: Recommend items to customer x similar to previous items rated highly by x.  Andy enjoyed