• 검색 결과가 없습니다.

An Insight of Speedup

N/A
N/A
Protected

Academic year: 2021

Share "An Insight of Speedup"

Copied!
5
0
0

로드 중.... (전체 텍스트 보기)

전체 글

(1)

!N )NSIGHT OF 3PEEDUP

속도향상에 대한 고찰

!NDO +I  기안도

#OMPUTER 3YSTEM $EPARTMENT 하드웨어구조연구팀 선임연구원

3PEEDUP IS OFTEN USED TO SHOW SCALABILITY BUT ITS CLASSICAL DE`NITION FAILS TO EXPLAIN SOME REAL MEASUREMENTS SUCH AS SUPERLINEAR SPEEDUP 4HIS LEADS TO SCALED SPEEDUP WHICH SCALES OTHER SYSTEM PARAMETERS AS NUMBER OF PROCESSORS CHANGES )N THIS PAPER SCALED SPEEDUP AND ARCHITECTURAL SPEEDUP ARE INTRODUCED AND SUPERLINEAR SPEEDUP IS EXPLAINED WITH ITS CAUSE

) )NTRODUCTION

3CALABILITY IS MOST BASIC ONE TO EVALUATE MULTIPROCESSORS AND A MEASURE OF ABILITY TO ACHIEVE PERFORMANCE PROPORTIONAL TO THE NUM BER OF PROCESSORS 5NFORTUNATELY IT IS DIbCULT TO MEASURE THE SCALABILITY DIRECTLY SO THAT IN STEAD SOME OTHER METRICS ARE USED TO EVALU ATE THE SCALABILITY OF MULTIPROCESSORS SUCH AS SPEEDUP "ECAUSE SPEEDUP CAN BE EASILY DE TERMINED BY MEASURING EXECUTION TIME OF PRO GRAMS IT IS WIDELY ACCEPTED (OWEVER THE EX ECUTION TIME OF A CERTAIN PROGRAM ON A SYSTEM MAY VARY BECAUSE OF THE PROGRAMS CHARACTERIS TICS AND SYSTEM ENVIRONMENT INCLUDING EVERY THING WHICH A_ECTS THE PERFORMANCE 4HUS A DE`NITION OF SPEEDUP CALLED ARCHITECTURAL SPEEDUP IS INTRODUCED WHICH CAN SEPARATE AR CHITECTURAL ISSUES FROM MEASUREMENT

4HEORETICAL OR ANALYTICAL MODELS USUALLY USE ASYMPTOTIC ANALYSIS WHICH ASSUMES LARGE

NUMBER OF PROCESSORS LARGE SIZE OF PROBLEM AND IDEALIZED MACHINE MODELS 4HIS IMPLIES THAT ASYMPTOTIC ANALYSIS IS LIKELY TO IGNORE LOWER ORDER TERMS THAT CAN BE SIGNI`CANT FOR A GIVEN PROBLEM SIZE AND PROCESSOR NUMBER &UR THERMORE IDEALIZED MACHINE MODEL CAN BE VERY DI_ERENT FROM THE PHYSICAL MACHINE IN WHICH ONE IS INTERESTED !S A RESULT SPEEDUP ANALYSIS FOR REAL MACHINE NORMALLY USES REAL MEASURE MENT

)N THIS PAPER SPEEDUP IS DISCUSSED AND CLAS SI`ED 4HEN ARCHITECTURAL SPEEDUP IS INTRO DUCED AFTER THE SCALED SPEEDUP IS EXPLAINED

)N ADDITION LINEAR AND SUPERLINEAR SPEEDUP IS DISCUSSED

)) 3PEEDUP

4HE MOST FREQUENTLY USED PERFORMANCE METRIC OF PARALLEL PROCESSING ON MULTIPROCESSORS IS SPEEDUP



(2)

전자자통통신신동동향향분분석 제권 제호 년 

WHICH IS GIVEN BY

3N  4



4

N



WHERE N IS THE NUMBER OF PROCESSORS 4



IS SE QUENTIAL EXECUTION TIME AND 4

N

IS PARALLEL EXECU TION TIME #LEARLY THE LARGER THE SPEEDUP THE BETTER THE PARALLEL ALGORITHM OR ARCHITECTURE 3E QUENTIAL EXECUTION CAN BE BASED ON THE BEST SE RIAL ALGORITHM OR THE SINGLE PROCESSOR EXECUTION OF THE PARALLEL ALGORITHM 7HEN THE SEQUENTIAL EXECU TION TIME IS CHOSEN AS THE SINGLE PROCESSOR EXECUTION TIME OF THE PARALLEL ALGORITHM SPEEDUP IS REFERRED AS RELATIVE SPEEDUP /N THE OTHER HAND SPEEDUP IS CALLED ABSOLUTE SPEEDUP WHEN THE BEST SEQUENTIAL ALGORITHM IS USED

4HE ABSOLUTE SPEEDUP CAN BE USED TO EVALUATE PARALLEL ALGORITHMS 4HE RELATIVE SPEEDUP IS USED TO COMPARE THE ALGORITHM ITSELF WITH A DI_ERENT NUM BER OF PROCESSORS AND IT GIVES INFORMATION ON THE VARIATIONS AND DEGRADATIONS OF PARALLELISM )N THIS WORK RELATIVE SPEEDUP IS DISCUSSED BECAUSE WE ARE NOT INTERESTED IN COMPARING SEQUENTIAL AND PARALLEL ALGORITHMS BUT RATHER THE E_ECT OF ARCHITECTURE ON PARALLEL ALGORITHMS

2ELATIVE SPEEDUP CAN BE DIVIDED INTO `XED SIZE SPEEDUP AND SCALED SPEEDUP 4HE MOST COMMONLY USED SPEEDUP METRIC IS `XED SIZE SPEEDUP IN WHICH THE PROBLEM SIZE REMAINS CONSTANT AND IS INDEPEN DENT OF THE NUMBER OF PROCESSORS )N THIS CONTEXT THE LIMIT OF PARALLEL EXECUTION GIVEN BY !MDAHLS LAW ;= IMPLIES THAT THE SERIAL FRACTION _ ACTUALLY

IT IS THE RATIO OF THE SERIAL SECTION TO PARALLEL SEC TION LIMITS THE PARALLEL ALGORITHMS SPEEDUP TO AN ASYMPTOTIC SPEEDUP OF _ )F AN EXECUTION OF A PROGRAM CAN BE DIVIDED INTO S AND P WHERE S IS THE AMOUNT OF TIME SPENT ON SERIAL PART OF THE PROGRAM AND P IS THE AMOUNT OF TIME SPENT ON PARALLEL PART OF THE PROGRAM EQUATION  CAN BE THE FOLLOWING

3N  S P

S

NP



)N AN IDEAL CASE PN PART OF EQUATION  WOULD DIMINISH AS THE NUMBER OF PROCESSOR N INCREASES

4HEREFORE EQUATION  CAN BE EXPRESSED AS FOLLOWS

N

LIM 3N   P

S 

4HIS EQUATION IMPLIES THAT THE RATIO OF THE PAR ALLEL SECTION TO THE SERIAL SECTION DETERMINES THE MAXIMUM SPEEDUP

%QUATION  CAN BE EXPRESSED USING _ 

SP

AS SHOWN BELOW

3N  4



_4



p_N

4



 

_

p_N



WHICH WILL RESULT IN THE FOLLOWING AS N INCREASES

N

LIM 3N  

_ 

!BOVE EQUATION CLEARLY STATES !MDAHLS LAW THAT THE IMPROVEMENT IN PERFORMANCE OF A PARALLEL EXE CUTION OVER A CORRESPONDING SEQUENTIAL EXECUTION IS LIMITED BY THE FRACTION OF THE ALGORITHM THAT CANNOT BE PARALLELIZED



(3)

PROCESSORS INCREASES THE AMOUNT OF WORK TO BE PERFORMED INCREASES IN ORDER TO OBTAIN A MORE ACCURATE OR BETTER RESULT SUCH AS MERELY SPEC IFYING A `NER MESH OR HIGHER RESOLUTION FOR THE SOLUTION OF SOME PHYSICAL PROBLEM )T IS A FACT THAT IN MOST ENGINEERING AND SCIENTI`C COM PUTING PROBLEMS THE SERIAL FRACTION DEPENDS ON THE PROBLEM SIZE 4HIS MEANS THAT THE SE RIAL FRACTION _ DEPENDS ON THE PROBLEM SIZE X AND _X WOULD DIMINISH AS THE SIZE OF PROBLEM INCREASES

NX LIM 3N X  N 

4HIS IMPLIES THAT NEARLY LINEAR SPEEDUP CAN BE ACHIEVED WHEN THE E_ECTIVE ALGORITHM WITH SUbCIENTLY LARGE PROBLEMS IS CONSIDERED

4HE ABOVE CONCEPT LEAD TO SCALED SPEEDUP

; = 'USTAFSON EXAMINED HOW THE PROBLEM SIZE CAN BE SCALED UP WHILE THE EXECUTION TIME IS `XED )T IS `XED TIME SPEEDUP WHICH SCALES THE PROBLEM SIZE TO MEET THE `XED EXECUTION TIME &URTHERMORE 3UN AND OTHERS ; = SUG GESTED A MEMORY BOUNDED SPEEDUP WHICH SCALES THE PROBLEM SIZE BASED ON THE AVAILABLE MEM ORY -EMORY BOUNDED SPEEDUP IS BASED ON THE CONCEPT THAT THE SIZE OF SCALABLE PROBLEM IS OF TEN DETERMINED BY THE MEMORY AVAILABLE 4HE SHORTAGE OF MEMORY IS PAID FOR IN PROBLEM SO LUTION TIME DUE TO THE INPUT OUTPUT ACTIVITIES

RATHER THAN THE ALGORITHM ITSELF 4HUS ONE WANTS TO IGNORE RELATIONSHIP BETWEEN THE SERIAL WORK AND PARALLEL WORK IN A GIVEN ALGORITHM AS

!MDAHLS LAW IMPLIES

,ET US CONSIDER PARALLEL ALGORITHM CONSISTS OF SERIAL PARTS 3   3   q q q 3 A AND PARALLEL PARTS 0   0   q q q 0 B  0RACTICALLY SPEAKING THE PAR ALLEL WORKS CAN BE DE`NED BETWEEN THREAD CRE ATION AND JOIN POINTS ,ET 4 3

I

BE THE AMOUNT OF TIME SPENT ON SERIAL WORK 3 I AND 4 0

J

BE THE AMOUNT OF TIME SPENT ON PARALLEL WORK 0 J  ,ET 4 P N BE THE PARALLEL EXECUTION TIME USING N PROCESSORS ,ET ARCHITECTURAL SPEEDUP !3N BE DE`NED AS THE RATIO OF THE SINGLE PROCESSOR EXECUTION TIME 4 0  OF THE PARALLEL WORKS TO N PROCESSORS EXECUTION TIME 4 0 N OF THE PARALLEL WORKS SO THAT THE !3N IS

!3N  4 P 

4 P N 

4HE STRENGTH OF THIS DE`NITION IS THAT IT USES EXECUTION TIME OF THE PARALLELIZED PART AND THUS INCORPORATES ANY COMMUNICATION OR SYN CHRONIZATION OVERHEAD BUT EXCLUDES THE PERTUR BATION OF THE SERIAL PART 4HE ARCHITECTURAL SPEEDUP CAN GIVE INFORMATION ON THE VARIATIONS AND DEGRADATIONS OF ARCHITECTURAL ISSUES 4O



(4)

전자자통통신신동동향향분분석 제권 제호 년 

MAKE IT SIMPLER IT IS SUPPOSED THAT ALL BENCH MARK PROGRAMS IN THIS WORK ARE CLEARLY CODED IN SEQUENTIAL AND PARALLEL PARTS

)6 ,INEAR AND SUPERLINEAR SPEEDUP

3PEEDUP IS SAID TO BE LINEAR IF AN N PROCESSOR YIELDS A SPEEDUP OF N ,INEAR SPEEDUP IS NOT ACHIEVABLE IN GENERAL BECAUSE OF VARIOUS OVERHEADS ASSOCIATED WITH PARALLEL COMPUTATION SUCH AS CONTENTION FOR SHARED RE SOURCES AND THE TIME REQUIRED TO COMMUNICATE BETWEEN PROCESSORS AND BETWEEN PROCESSES OR THREADS ;=

5NDER THE PARTICULAR SITUATION WE CAN OBSERVE SUPERLINEAR SPEEDUP WHICH MEANS SPEEDUP BETTER THAN LINEAR 4HERE ARE SOME POSSIBLE CAUSES OF SUPERLINEAR SPEEDUP DIS CUSSED IN ; = CACHE SIZE INCREASED IN PAR ALLEL PROCESSING OVERHEAD REDUCED IN PARALLEL PROCESSING LATENCY HIDDEN IN PARALLEL PROCESS ING RANDOMIZED ALGORITHMS MATHEMATICAL IN EbCIENCY OF THE SERIAL ALGORITHM AND HIGHER MEMORY ACCESS LATENCY IN THE SEQUENTIAL PRO CESSING

,ET ASSUME A MULTIPROCESSOR IN WHICH EACH PROCESSOR HAS ITS PRIVATE CACHE ,ET SUPPOSE THAT THE DATA SIZE OF A GIVEN PROBLEM IS BIGGER THAN A SINGLE CACHE SIZE BUT SMALLER THAN THE SIZE OF TWO OR MORE CACHE SIZES 7HEN A SIN GLE PROCESSOR EXECUTES THIS PROBLEM THERE WILL BE CACHE MISSES AND IT CAN TAKE LONGER TIME

THAN EXPECTED (OWEVER MORE THAN ONE PRO CESSOR WILL EXECUTE THIS PROBLEM WITH LESS CACHE MISSES #ONSEQUENTLY THE RATIO SINGLE PROCES SORS EXECUTION TIME OVER MULTIPLE PROCESSORS

WILL BE BETTER THAN LINEAR 4HEREFORE THE PROB LEM SIZE AND MEMORY ARCHITECTURE HAVE STRONG RELATIONSHIP WHEN WE CONSIDER THE `XED SIZE SPEEDUP

6 3UMMARY

3PEEDUP IS WIDELY ACCEPTED AS A PERFOR MANCE METRIC SINCE IT CAN CLEARLY SHOW SUPE RIORITY BETWEEN COMPUTERS -EASURING OR CAL CULATING SPEEDUP IS NOT EASY TASK SINCE THERE ARE SEVERAL VARIATIONS ALTHOUGH THE CONCEPT OF SPEEDUP IS SIMPLE

!N INSIGHT CALLED !MDAHLS LAW SHOWING ACHIEVABLE SPEEDUP USING PARALLEL COMPUTER WAS SUPPORTED BY A DE`NITION OF SPEEDUP

(OWEVER THE FACT THAT MEASUREMENTS FROM REAL COMPUTERS DID NOT ALWAYS FOLLOW THE CLASSICAL DE`NITION AND SHOWED UNEXPECTED RESULTS RE QUIRED CORRECTION OF THE DE`NITION !S THE RE SULT SEVERAL NEW DE`NITIONS OF SPEEDUP WERE INTRODUCED AND THESE EXPLAINED UNEXPECTED RE SULTS OF REAL MEASUREMENTS

)N HERE SCALED SPEEDUP IS EXPLAINED IN WHICH OTHER SYSTEM PARAMETERS SUCH AS PROB LEM SIZE OR MEMORY SIZE ARE SCALED AS NUMBER OF PROCESSORS CHANGES )N ADDITION ARCHITEC TURAL SPEEDUP IS INTRODUCED WHICH CAN EXCLUDE E_ECT OF SERIAL PART OF EXECUTION



(5)

<3PEEDUP 6ERSUS %bCIENCY IN 0ARALLEL 3YSTEMS  )%%%

4RANSACTIONS ON #OMPUTERS 6OL   PP p

;= *OHN , 'USTAFSON <2EEVALUATING !MDAHLS ,AW  #OM

MUNICATIONS OF THE !#- 6OL

 .O   PP

p

;= *OHN , 'USTAFSON 'RAY 2 -ONTRY AND 2OBERT % "ENNER

<$EVELOPMENT OF 0ARALLEL -ETHODS FOR A  0ROCESSOR (YPERCUBE  3)!- *OURNAL OF 3CIT 3TAT #OMPUT 6OL

 .O   PP p

;= $AVID 0 (ELMBOLD AND #HARLES % -C$OWELL <-ODELING 3PEEDUPN 'REATER 4HAN .  )N 0ROCEEDINGS OF THE )N

TERNATIONAL #ONFERENCE ON 0ARALLEL 0ROCESSING  PP

)))pp

;= 8IAN (E 3UN AND *OHN , 'USTAFSON <4OWARD A "ETTER 0ARALLEL 0ERFORMANCE -ETRIC  0ARALLEL #OMPUTING 6OL 

 PP p

;= 8IAN (E 3UN AND ,IONEL - .I <3CALABLE 0ROBLEMS AND -EMORY "OUNDED 3PEEDUP  *OURNAL OF 0ARALLEL AND $IS

TRIBUTED #OMPUTING 6OL   PP p

;= 8IAN (E 3UN AND *IANPING :HU <3HARED 6IRTUAL -EM ORY AND 'ENERALIZED 3PEEDUP  )N 0ROCEEDINGS OF THE TH

)NTERNATIONAL 0ARALLEL 0ROCESSING 3YMPOSIUM  PP

p



참조

관련 문서

The total execution time of a parallel application on a high performance computing environment depends on a task scheduling algorithm.. Most studies on task

In a heterogeneous multicore platform, the execution time of an application depends on which core it executes on.. That is to say, the effectiveness of task

As a conclusion, the performance of the parallel hydraulic simulation algorithm, in terms of execution time, highly de- pends on the complexity of the underlying linear equations

However, this algorithm is not good for high-speed processing using parallel processor architectures including multimedia processors, because the computation of an output pixel is

Our testing framework provides the data of service response time to service developer by measuring the service execution time.. We develope an Aspect-based timer

This paper proposes a parallel implementation of the video based 4-stage fire detection algorithm using a general-purpose graphics processing unit (GPGPU) to

By the experiment result, the proposed method outperforms 3.34% in early-stage of the execution time that impacts the sensible performance, and exhibits negligible differences

Abstract When the trace-driven simulation is used for the performance analysis of widely used multicore processors in the initial design stage, much time and disk space