• 검색 결과가 없습니다.

Introduction to Data Mining

N/A
N/A
Protected

Academic year: 2022

Share "Introduction to Data Mining"

Copied!
21
0
0

로드 중.... (전체 텍스트 보기)

전체 글

(1)

Introduction to Data Mining

Lecture #4: MapReduce-2 U Kang

Seoul National University

(2)

Outline

Problems Suited For Map-Reduce

Pointers and Further Reading

(3)

Example: Host size

Suppose we have a large web corpus

Look at the metadata file

Lines of the form: (URL, size, date, …)

For each host, find the total number of bytes

That is, the sum of the page sizes for all URLs from that particular host

(4)

Example: Language Model

Statistical machine translation:

Need to count number of times every 5-word sequence occurs in a large corpus of documents

Very easy with MapReduce:

Map:

Extract (5-word sequence, count) from document

Reduce:

Combine the counts

(5)

More Examples

Distributed Grep

Map() : emits a line if it matches a supplied pattern

Reverse Web-Link graph

Map() : output <target, source> for each target in a source web page

Reduce: output <target, list(source)>

(6)

More Examples

Term-Vector per Host

Term vector : summarizes most important words that occur in a given host

Map: output <hostname, term vector> for a given document

Reduce: output <hostname, term vector> for frequent terms

(7)

More Examples

Inverted index

Map(): output <word, document ID>

Reduce(): output <word, list(document ID)>

(8)

Example: Join By Map-Reduce

Compute the natural join R(A,B) ⋈ S(B,C)

R and S are each stored in files

Tuples are pairs (a,b) or (b,c)

A B

a1 b1

a2 b1

a3 b2

a4 b3

B C

b2 c1

b2 c2

b3 c3

A C

a3 c1

a3 c2

a4 c3

=

R

S Student id Course id Course id

Course

name Student id

Course name

(Enrollment)

(Course Info.)

(9)

Map-Reduce Join

Use a hash function h from B-values to 1...k

A Map process turns:

Each input tuple R(a,b) into key-value pair (b,(a,R))

Each input tuple S(b,c) into (b,(c,S))

Map processes send each key-value pair with key b to Reduce process h(b)

Hadoop does this automatically; just tell it what k is.

Each Reduce process matches all the pairs (b,(a,R))

with all (b,(c,S)) and outputs (a,b,c).

(10)

Cost Measures for Algorithms

In MapReduce we quantify the cost of an algorithm using

1.

Communication cost = total I/O of all processes

2.

Elapsed communication cost = max of I/O along any path

3.

(Elapsed) computation cost analogous, but

count only running time of processes

(11)

Example: Cost Measures

For a map-reduce algorithm:

Communication cost = input file size + 2 × (sum of the sizes of all files passed from Map processes to Reduce processes) + the sum of the output sizes of the Reduce processes.

Elapsed communication cost is the sum of the largest input + output for any map process, plus the same for any reduce process

(12)

What Cost Measures Mean

Either the I/O (communication) or processing (computation) cost dominates

Ignore one or the other

Total cost tells what you pay in rent from your friendly neighborhood cloud

Elapsed cost is wall-clock time using parallelism

(13)

Cost of Map-Reduce Join

Total communication cost

= O(|R|+|S|+|R ⋈ S|)

Elapsed communication cost = O(s)

We’re going to pick k (# of reducers) and the number of Map processes so that the I/O limit s is respected

We put a limit s on the amount of input or output that any one process can have. s could be:

What fits in main memory

What fits on local disk

In many cases, computation cost is linear in the input + output size

So computation cost is like comm. cost

(14)

Outline

Problems Suited For Map-Reduce

Pointers and Further Reading

(15)

Implementations

Google

Not available outside Google

Hadoop

An open-source implementation in Java

Uses HDFS for stable storage

Download: http://lucene.apache.org/hadoop/

(16)

Cloud Computing

Ability to rent computing by the hour

Additional services e.g., persistent storage

Amazon’s “Elastic Compute Cloud” (EC2)

(17)

Reading

Jeffrey Dean and Sanjay Ghemawat: MapReduce:

Simplified Data Processing on Large Clusters

http://static.googleusercontent.com/media/research.

google.com/ko//archive/mapreduce-osdi04.pdf

Must Read!

Sanjay Ghemawat, Howard Gobioff, and Shun- Tak Leung: The Google File System

http://static.googleusercontent.com/media/research.

google.com/ko//archive/gfs-sosp2003.pdf

(18)

Resources

Hadoop Wiki

Introduction

http://wiki.apache.org/lucene-hadoop/

Getting Started

http://wiki.apache.org/lucene-hadoop/GettingStartedWithHadoop

Map/Reduce Overview

http://wiki.apache.org/lucene-hadoop/HadoopMapReduce

http://wiki.apache.org/lucene-hadoop/HadoopMapRedClasses

Eclipse Environment

http://wiki.apache.org/lucene-hadoop/EclipseEnvironment

Javadoc

http://lucene.apache.org/hadoop/docs/api/

(19)

Resources

Releases from Apache download mirrors

http://www.apache.org/dyn/closer.cgi/lucene/hadoop /

Nightly builds of source

http://people.apache.org/dist/lucene/hadoop/nightly/

Source code from subversion

http://lucene.apache.org/hadoop/version_control.ht ml

(20)

Further Reading

Programming model inspired by functional language primitives

Partitioning/shuffling similar to many large-scale sorting systems

NOW-Sort ['97]

Re-execution for fault tolerance

BAD-FS ['04] and TACC ['97]

Locality optimization has parallels with Active Disks/Diamond w ork

Active Disks ['01], Diamond ['04]

Backup tasks similar to Eager Scheduling in Charlotte system

Charlotte ['96]

Dynamic load balancing solves similar problem as River's distribu ted queues

River ['99]

(21)

Questions?

참조

관련 문서

After first field tests, we expect electric passenger drones or eVTOL aircraft (short for electric vertical take-off and landing) to start providing commercial mobility

The “Asset Allocation” portfolio assumes the following weights: 25% in the S&amp;P 500, 10% in the Russell 2000, 15% in the MSCI EAFE, 5% in the MSCI EME, 25% in the

1 John Owen, Justification by Faith Alone, in The Works of John Owen, ed. John Bolt, trans. Scott Clark, &#34;Do This and Live: Christ's Active Obedience as the

Existing method has applied the size of the target to the virtual character by measuring manually, but now using Kinect sensor the motion data file can

• 이명의 치료에 대한 매커니즘과 디지털 음향 기술에 대한 상업적으로의 급속한 발전으로 인해 치료 옵션은 증가했 지만, 선택 가이드 라인은 거의 없음.. •

• The Office of International Affairs processes the dormitory application on behalf of a new student only in the first semester. • Students will be assigned to one of

웹 표준을 지원하는 플랫폼에서 큰 수정없이 실행 가능함 패키징을 통해 다양한 기기를 위한 앱을 작성할 수 있음 네이티브 앱과

_____ culture appears to be attractive (도시의) to the