-EMORY ,ATENCY (IDING 4ECHNIQUES
메모리 지연을 감추는 기법들
!NDO +I 기안도 #OMPUTER 3YSTEM $EPARTMENT
컴퓨터 q 소프트웨어기술연구소 컴퓨터시스템연구부 선임연구원
4HE OBVIOUS WAY TO MAKE A COMPUTER SYSTEM MORE POWERFUL IS TO MAKE THE PROCESSOR AS FAST AS POSSIBLE &UR THERMORE ADOPTING A LARGE NUMBER OF SUCH FAST PROCESSORS WOULD BE THE NEXT STEP 4HIS MULTIPROCESSOR SYSTEM COULD BE USEFUL ONLY IF IT DISTRIBUTES WORKLOAD UNIFORMLY AND IF ITS PROCESSORS ARE FULLY UTILIZED 4O ACHIEVE A HIGHER PROCESSOR UTILIZATION MEMORY ACCESS LATENCY MUST BE REDUCED AS MUCH AS POSSIBLE AND EVEN MORE THE REMAINING LATENCY MUST BE HIDDEN 4HE ACTUAL LATENCY CAN BE REDUCED BY USING FAST LOGIC AND THE E_ECTIVE LATENCY CAN BE REDUCED BY USING CACHE 4HIS ARTICLE DISCUSSES WHAT THE MEMORY LATENCY PROBLEM IS HOW SERIOUS IT IS BY PRE SENTING ANALYTICAL AND SIMULATION RESULTS AND EXISTING TECHNIQUES FOR COPING WITH IT SUCH AS WRITE BU_ER RELAXED CONSISTENCY MODEL MULTI THREADING DATA LOCALITY OPTIMIZATION DATA FORWARDING AND DATA PREFETCHING
) -EMORY ,ATENCY 0ROBLEM
4HE LATENCY OF MEMORY ACCESSES IS THE TIME SPENT WAITING FOR THE DATA TO ARRIVE AT THE PRO CESSOR AND IT IS A WELL KNOWN OBSTACLE IN SHARED MEMORY COMPUTERS BECAUSE PROCESSORS STALL FOR THE DURATION OF MEMORY REQUESTS
4HE SPEED GAP BETWEEN PROCESSOR AND MEMORY HAS WIDENED IN THE LAST FEW YEARS AND THE TECH NOLOGY TRENDS INDICATE THAT THIS GAP IS NOT LIKELY TO DECREASE IN THE FUTURE ;= 4HE TENDENCY OF COMPUTER ARCHITECTURE IN WHICH PROCESSORS AND MEMORIES BECOME MORE PHYSICALLY DISTRIBUTED ;=
MAKES MEMORY ACCESS LATENCY INCREASE 4HE INTER CONNECTION NETWORK OF MULTIPROCESSORS BECOMES
COMPLEX RESULTING IN NETWORK DELAY WHICH CON TRIBUTES TO MEMORY ACCESS LATENCY 4HE CHARACTER ISTICS OF INCREASING NETWORK TRAbC IN PARALLEL WORK LOADS RESULT IN NETWORK AND MEMORY CONTENTION WHICH INCREASE THE MEMORY ACCESS LATENCY STILL FUR THER !S THE MEMORY ACCESS LATENCY BECOMES LARGER HIGH PERFORMANCE COMPUTERS BECOME MORE SENSI TIVE TO STALLS FOR MEMORY ACCESSES
4HE MOST FUNDAMENTAL APPROACH TO TACKLING THE LATENCY PROBLEM IS ADOPTING A MULTI LEVEL MEMORY HIERARCHY 4HE GOAL OF THE MEMORY HIERARCHY IS TO BRIDGE THE SPEED GAP BETWEEN HIGH SPEED PROCES SORS AND RELATIVELY LOW SPEED MEMORIES BY INTRO DUCING FASTER CACHE STORAGES BETWEEN THEM 4HE CACHE IS A SMALL BUT FAST MEMORY WHOSE PURPOSE
전 전 전
전자자자통자통통통신신신동신동동동향향향향분분분분석석석석 제권 제호 년 월
IS TO HOLD CURRENTLY ACTIVE DATA THEREBY REDUCING THE NUMBER OF ACCESSES TO MEMORY ;= 4HE MEM ORY HIERARCHY WORKS E_ECTIVELY BECAUSE OF THE PRIN CIPLE OF LOCALITY DURING AN EXECUTION OF A SPECI`C PART OF PROGRAM THE PROGRAM REFERS TO ONLY A SMALL PORTION OF ITS ADDRESS SPACE 4HEREFORE BY KEEP ING THE ACTIVE FRACTION OF THE ADDRESS SPACE IN FAST STORAGE THE PROGRAM RUNS AS IF THE WHOLE ADDRESS SPACE IS FAST %VEN THOUGH THE MEMORY HIERARCHY WORKS WELL THE PROCESSOR MUST WAIT UNTIL THE RE QUIRED DATA ARRIVES IF THE DATA IS NOT FOUND IN THE NEAREST STORAGE 4HEREFORE ALTHOUGH CACHES REDUCE THE E_ECTIVE MEMORY ACCESS TIME CACHE MISSES STILL SU_ER FROM LARGE MEMORY ACCESS LATENCY
7E WILL SHOW THAT THE LATENCY OF MEMORY AC CESSES IS A SERIOUS PROBLEM BY PRESENTING TWO EXAM PLES THAT ARE AN ANALYTICAL MODEL AND A SIMULATION RESULT
! 3IMPLE !NALYTICAL -ODEL
7E ASSUME A SIMPLE MEMORY HIERARCHY MODEL THAT CONSISTS OF ONE LEVEL CACHE AND AN IDEAL PRO CESSOR THAT EXECUTES ONE INSTRUCTION PER CYCLE 7E TAKE A TYPICAL PROGRAM IN WHICH THE NUMBER OF IN STRUCTIONS EXECUTED IS .
)THAT CAN BE DIVIDED INTO .
#.
3AND .
,WHERE .
#IS THE NUMBER OF IN STRUCTIONS THAT DO NOT REFER TO MEMORY .
3IS THE NUMBER OF STORES AND .
,IS THE NUMBER OF LOADS
)N THIS SIMPLE MODEL WE IGNORE SYNCHRONIZATION AND STORE STALL TIMES )N PRACTICE STORE STALL TIME CAN BE REMOVED BY USING A WRITE BU_ER AND RELAXED MEM
ORY CONSISTENCY MODEL ; =
4HE PROGRAM EXECUTION TIME T
EXEWITH CACHE MISS RATE M WILL BE
.
#.
3.
,.
,4
#M.
,4
-WHERE 4
#IS CACHE ACCESS TIME AND 4
-MEMORY AC CESS TIME ,OAD STALL TIME T
STALLDUE TO MEMORY READS WILL BE
.
,4
#M.
,4
-4HEREFORE THE LOAD STALL TIME OVERHEAD 2
3THAT IS THE RATIO
OF LOAD STALL TIME T
STALLTO THE PROGRAM EXECUTION TIME T
EXEWILL BE
L4
#ML4
-L4
#ML4
-WHERE L IS THE RATIO
OF THE NUMBER OF LOADS .
,TO THE NUMBER OF INSTRUCTIONS .
)&OR SIMPLICITY WE ONLY CONSIDER GLOBAL LOADS AND ACCORDING TO 30,!3( CHARACTERISTICS ;=
L RANGES BETWEEN TO AND M IS UP TO
WITH A REASONABLE CACHE CON`GURATION DIRECT MAPPED -"YTE SIZE
&IGURE SHOWS LOAD STALL TIME OVERHEAD 2
3VERSUS MISS RATE M WITH LOAD RATE L AND DIF FERENT VALUES FOR CACHE ACCESS DELAY 4
#AND MEM ORY ACCESS DELAY 4
-4HE RESULTS SHOWN IN &IG URE IMPLY THAT CACHE HIT TIME SHOULD BE FASTER
2
3TTSTALLEXE
. .,4#M.,4-#.3.,.,4#M.,4-
L
..,)
. .,#.3.,
&IG ,OAD STALL TIME OVERHEAD 2
3VERSUS MISS RATE M WITH DI_ERENT CACHE DELAY 4
#MEMORY LATENCY 4
-AND `XED LOAD RATE L 2
3IS DE`NED AS THE RATIO OF THE STALL TIME DUE TO MEMORY LOADS TO THE PROGRAM EXECUTION TIME
TO REDUCE LOAD STALL TIME BECAUSE MISS RATE BI ASES LOAD STALL TIMES 4HIS IS WHY A MULTI LEVEL CACHE HIERARCHY IS NEEDED BECAUSE A FASTER CACHE NEAR THE PROCESSOR E_ECTIVELY REDUCES THE CACHE HIT TIME &IGURE ALSO REVEALS THAT ALTHOUGH THE CACHE HIT RATE IS HIGH ENOUGH MISS RATE IS LESS THAN
THE PROGRAM SU_ERS LARGE AMOUNT OF MEMORY ACCESS STALL TIME THAT IS SPENT IN WAITING FOR DATA !S THE RESULTS WITH 4
#SHOW 2
3IS MORE SENSITIVE TO LATENCY OF MEMORY ACCESS WHEN THE CACHE ACCESS TIME IS SMALL
! 3IMULATION 2ESULT
&OR THE SAKE OF REALITY WE CONDUCT A SIMULATION OF A SET OF PARALLEL APPLICATIONS ON A SHARED MEMORY ARCHITECTURE TO SHOW THE IMPACT OF MEMORY ACCESS LATENCY
4HE PROGRAMS ARE TAKEN FROM THE 30,!3(
TWO LEVEL CACHE HIERARCHY +WORD CACHE SIZE AND
7ORD CACHE LINE SIZE FOR THE PRIMARY CACHE WITH DIRECT MAPPED WRITE THROUGH AND +7ORD CACHE SIZE AND 7ORD CACHE LINE SIZE FOR THE SECONDARY CACHE WITH DIRECT MAPPED WRITE BACK 7E USED NO WAIT REPLY
FOR THE PRIMARY CACHE CYCLE AC CESS TIME FOR THE SECONDARY CACHE CYCLE LATENCY FOR NETWORK AND CYCLE DELAY FOR MEMORY
&IGURE SHOWS A SIMULATION RESULT THAT BREAKS DOWN THE TOTAL EXECUTION TIMES INTO INSTRUCTION EXE CUTION STALLS DUE TO MEMORY ACCESSES AND SYNCHRO NIZATION !S WE CAN SEE MEMORY ACCESS STALLS STORE STALL AND LOAD STALL SECTIONS WAITING FOR MEMORY ACCESSES TO COMPLETE ARE SIGNI`CANT
&ROM THESE TWO DEMONSTRATIONS OF THE ANALYT ICAL AND SIMULATION RESULTS IT IS NOW CLEAR THAT A CONSIDERABLE AMOUNT OF EXECUTION TIME IS WASTED IN WAITING FOR DATA AND THIS CAUSES THE PROCESSOR TO BE UNDER UTILIZED (ENCE WE CONCLUDE THAT REDUC ING THE STALL TIME IS A CHALLENGING APPROACH TO GET
)N PRACTICE A WORD MEANS BYTE SO THAT +7ORD IS EQUAL TO +"YTE
7HEN AN ACCESS HITS ON THE PRIMARY CACHE THE PROCESSOR ACCESSES THE DATA WITHOUT ANY DELAY
)F WE ASSUME AN AGGRESSIVE ARCHITECTURAL MODEL IN WHICH THE PROCESSOR RUNS -(Z PROCESSOR CYCLE AND EXECUTES ONE IN STRUCTION PER CYCLE THEN CYCLE CACHE DELAY RESULTS IN
NSEC CYCLE MEANS NSEC DELAY
0.15
0.1 0.005 0.01 0.015 0.02 miss rate (m)
tC tM
0
전 전 전
전자자자통자통통통신신신동신동동동향향향향분분분분석석석석 제권 제호 년 월
&IG "REAKDOWN OF EXECUTION TIMES OF SCIENTI`C AND ENGI NEERING APPLICATIONS ON SHARED MEMORY MULTIPROCESSOR ARCHITECTURE
HIGHER COMPUTER PERFORMANCE
)) #OPING WITH -EMORY ,ATENCY
-ANY SOLUTIONS HAVE BEEN PROPOSED TO DEAL WITH THE MEMORY ACCESS LATENCY PROBLEM 4HE `RST TWO APPROACHES ARE MAINLY CONCERNED WITH HIDING WRITE LATENCY 4HE MIDDLE ONE APPROACH EXPLOITS OVERLAP PING OPPORTUNITY BETWEEN COMPUTATIONS AND MEM ORY ACCESSES 4HE LAST THREE APPROACHES FOCUS ON MAXIMIZING CACHE HIT RATE
{ 7RITE BUFFERS ; = CAN HIDE WRITE LA TENCY BY BU_ERING WRITE REQUESTS AND RELEASING THE PROCESSOR FROM WAITING FOR THE WRITE TO COM PLETE
{ 2ELAXED CONSISTENCY MODELS ; = ALLOW BU_ER ING AND PIPELINING OF MEMORY REQUESTS BY RE
LAXING THE ORDER OF REQUESTS "U_ERING WRITES USING A WRITE BU_ER AND A RELAXED CONSISTENCY MODEL CAN REMOVE ALMOST ALL WRITE STALL TIMES AND IS COMPLEMENTARY TO DATA PREFETCHING SINCE EACH TACKLES DI_ERENT PARTS OF STALL TIMES ; =
{ -ULTI THREADING ; = ALLOWS PROCES SORS TO HIDE MEMORY ACCESS LATENCY BY SWITCH ING FROM ONE THREAD TO ANOTHER WHEN A LONG LA TENCY REFERENCE IS ENCOUNTERED AND PROCEEDING WITH OTHER USEFUL WORK WHILE THE MISS IS BEING SERVICED -ULTI THREADING MAY NOT BE BENE`
CIAL WHEN THE CONTEXT SWITCHING COST IS LARGER THAN THE MEMORY ACCESS PENALTY
{ $ATA LOCALITY OPTIMIZATION ; = OPTIMIZES DATA PARTITIONING OR LAYOUT TO AVOID UNNECES SARY MEMORY REFERENCES -ANY TECHNIQUES SUCH AS BLOCKING ;= AND LOOP TRANSFORMATION ;=
ARE DEVISED MAINLY FOR NUMERICAL APPLICATIONS
4HE BENE`TS DEPEND ON PROGRAMS AND DATA SETS AND THIS OPTIMIZATION APPROACH MAKES THE COM PILER COMPLICATED AND COMPILATION TIME LONG
{ $ATA FORWARDING ; = CAN HIDE THE MEMORY LATENCY OF SHARING ACCESSES BY SENDING DATA TO DESTINATION PROCESSORS CACHES OR LOCAL MEMO RIES WHEN THE DATA IS AVAILABLE 4HIS TECHNIQUE IS ONLY APPLICABLE TO SHARED DATA
{ $ATA PREFETCHING ; = HIDES THE MEM ORY LATENCY BY BRINGING DATA CLOSE TO THE PRO CESSOR BEFORE THE DATA IS ACTUALLY USED
4HESE TECHNIQUES ARE DISCUSSED IN DETAIL IN THE
100
75
50
25
0
barnes choleskynormalised execution times (%)
fft fmm lu ocean radix water 0.6
10.9 12.9 21.5 5.2
3.8 16.2
37.1 0.6 2.9 5.6
1.2
18.0 31.2
26.6
4.2
synchronisation
store stall
load stall
code
,OAD INSTRUCTIONS USUALLY MAKE THE PROCESSOR STALL UNTIL THE REQUIRED DATA ARRIVES (OWEVER STORE INSTRUCTIONS DO NOT NEED TO MAKE THE PROCESSOR WAIT UNLESS THE WRITE REQUEST IS LOST )N OTHER WORDS A WRITE REQUEST GENERATED BY A STORE INSTRUCTION CAN RELEASE THE PROCESSOR FROM WAITING FOR THE WRITE TO COMPLETE AND THE WRITE REQUEST REACHES THE MEM ORY AT A LATER TIME 7RITE BU_ERS USE THE ABOVE IDEA 4HE WRITE BU_ER ENABLES THE PROCESSOR TO CONTINUE ITS EXECUTION WITHOUT STALLING BY BU_ER ING WRITE REQUESTS
4HERE ARE TWO POSSIBLE MECHANISMS TO MANAGE THE WRITE BU_ER NO BYPASS SCHEME IN WHICH WRITE REQUESTS ARE BU_ERED AND RELEASE THE PROCESSOR FROM WAITING FOR THE WRITE TO COMPLETE AND READ REQUESTS CANNOT BYPASS PENDING WRITES BYPASS SCHEME IN WHICH WRITE REQUESTS ARE TREATED AS THE NO BYPASS SCHEME DOES BUT READ REQUESTS CAN OVER TAKE PEND ING WRITES )T IS NOT DIbCULT TO IMAGINE THE NO BYPASS SCHEME CANNOT GET MUCH BENE`T BECAUSE THE BENE`T OF CYCLE SAVING IS LIMITED BY THE DIS TANCE BETWEEN A WRITE REQUEST AND THE NEXT READ REQUEST WHICH MUST WAIT ALL WRITE REQUESTS IN THE WRITE BU_ER `NISH 4HE BYPASS SCHEME HAS THE PO TENTIAL TO REMOVE STALL TIMES DUE TO WRITE REQUESTS SINCE READ REQUESTS CAN BE ISSUED IN SPITE OF PENDING
THE ORDER OF MEMORY ACCESSES 4HEREFORE THE PRO GRAMMER SHOULD BE AWARE OF A CERTAIN RULE WHICH SPECI`ES THE IMPLICATION OF MEMORY ACCESSES AND SHOULD CODE PROGRAMS USING EXPLICIT SYNCHRONIZA TIONS INCLUDING LOCK BARRIER AND MEMORY BARRIER
OR FENCE IF NECESSITATED
4HE STRICTEST ORDERING MODEL DOES NOT ALLOW ANY RELAXATION IE EACH LOAD AND STORE MAKES THE PROCESSOR WAIT FOR EACH OF THEM TO COMPLETE 4HE SEQUENTIAL CONSISTENCY MODEL ;= ALLOWS WRITE RE QUESTS TO BE BU_ERED BUT NOT TO BE BYPASSED BY READS BECAUSE ALL READS AND WRITES MUST BE OB SERVED IN THE SAME ORDER BY ALL PROCESSORS 4HE PROCESSOR CONSISTENCY MODEL ;= ALLOWS WRITE RE QUESTS TO BE BU_ERED AND READS TO BYPASS THE BU_ERED WRITES BECAUSE WRITE ISSUED FROM DI_ERENT PROCESSORS CAN BE SEEN IN DI_ERENT ORDER BY DIF FERENT PROCESSORS 4HERE ARE SEVERAL MORE RELAXED CONSISTENCY MODELS SUCH AS WEAK CONSISTENCY ;=
AND RELEASE CONSISTENCY ;= MODELS
.O MATTER HOW RELAXED THE CONSISTENCY MODEL IS THERE ARE ONLY TWO SCHEMES FOR THE WRITE BU_ER AS DISCUSSED AT THE BEGINNING OF THIS SECTION &ROM THE VIEWPOINT OF WRITE BU_ER THE STRICTEST CONSISTENCY MODEL MUST NOT USE WRITE BU_ER THE SEQUENTIAL CONSISTENCY MODEL CAN USE NO BYPASS WRITE BU_ER
전 전 전
전자자자통자통통통신신신동신동동동향향향향분분분분석석석석 제권 제호 년 월
AND THE OTHER MORE RELAXED CONSISTENCY MODELS IN CLUDING THE PROCESSOR THE WEAK AND THE RELEASE CONSISTENCY MODELS CAN USE BYPASS WRITE BU_ER
)6 -ULTITHREADING
&IGURE ILLUSTRATES HOW MULTITHREADING HIDES MEMORY ACCESS LATENCY 7HEN THREAD ENCOUN TERS A CACHE MISS CAUSED BY LOADING LOCATION IT IS SWAPPED OUT AND THE PROCESSOR BEGINS TO EXECUTE THREAD (OPEFULLY BY THE TIME THE THREAD NEEDS TO BE SWAPPED OUT WHEN IT ENCOUNTERS A CACHE MISS THE MEMORY FETCH FOR THE THREAD HAS COMPLETED AND THEREFORE THE THREAD IS READY TO RUN AGAIN
&IG )LLUSTRATE OF HOW MULTITHREADING HIDES MEMORY AC CESS LATENCY
-ULTITHREADING HAS THE FOLLOWING ADVANTAGES )T
CAN HANDLE ARBITRARILY COMPLICATED ACCESS PATTERNS AND IT CAN BE APPLICABLE TO EXISTING EXECUTABLE WITHOUT RECOMPILATIONS (OWEVER THERE ARE SEV ERAL DISADVANTAGES )T HAS THE OVERHEAD OF CONTEXT SWITCHING AND IT CANNOT IMPROVE EXECUTION TIME OF A SINGLE APPLICATION UNLESS THERE ARE SUbCIENT CON CURRENT THREADS OF THE APPLICATION
6 ,OCALITY /PTIMIZATION
,OCALITY OPTIMIZATIONS ATTEMPT TO MAKE CACHES MORE E_ECTIVE BY RESTRUCTURING COMPUTATION TO EN HANCE DATA LOCALITY SO THAT IT COULD AVOID UNNEC ESSARY MEMORY REFERENCE BY OPTIMIZING DATA PARTI TIONING OR LAYOUT
/NE PARTICULAR EXAMPLE IS BLOCKING OR TILING ALGORITHM WHICH OPERATES ON SUB MATRICES CALLED BLOCK RATHER THAN OPERATING ON ENTIRE ROWS OR COLUMNS OF AN ARRAY SO THAT DATA LOADED INTO THE CACHES ARE REUSED
7HILE LOCALITY OPTIMIZATION APPROACHES ARE QUITE USEFUL IN REDUCING ACCESSES TO MEMORY THEIR APPLICABILITY IS SOMEWHAT LIMITED SINCE EXTENSIVE DATA DEPENDENCY ANALYSIS AND RE PROGRAMMING ARE REQUIRED
6) $ATA 0REFETCHING AND $ATA
&ORWARDING
4HE BASIC IDEA OF DATA PREFETCHING IS TO GENER ATE DATA FETCH REQUESTS PRIOR TO THE ACTUAL USE OF
switching context switching context
thread 1
thread 2
fetch B Load A
thread 1
fetch A
Load B
time
PING COMPUTATION AND MEMORY ACCESSES
&IGURE A SHOWS A SIMPLE EXAMPLE OF HOW DE MAND FETCHES MAKE THE PROCESSOR WAIT 4HE DE MAND FETCHES FOR ! AND " THAT ARE NOT PRESENT IN THE CACHE EXPERIENCE MEMORY ACCESS DELAY )F SOMEHOW PREFETCHES FOR ! AND " CAN BE ISSUED THEN
! AND " WILL BE AVAILABLE BEFORE THE DEMAND FETCHES ARE ISSUED &IGURE B SHOWS HOW PREFETCH HIDES MEMORY ACCESS DELAY &OR THIS EXAMPLE WE ASSUME THAT PREFETCH CAN BE OVERLAPPED WITH COMPUTATION AND OTHER MEMORY ACCESSES
0REFETCHING SCHEMES DI_ER IN WHERE THE PREFETCHING DATA IS LOCATED HOW PREFETCHING AD DRESSES ARE DETERMINED AND WHEN PREFETCHINGS ARE INITIATED 0REFETCHING DATA IS USUALLY LOCATED INTO CACHE AND IT REMAINS VISIBLE TO THE MEMORY CO HERENCY ACTIVITY AND IT IS KEPT CONSISTENT UNTIL THE PROCESSOR ACTUALLY USES THE VALUE 0REFETCH ING ADDRESSES CAN BE DETERMINED BY SOFTWARE HARD WARE OR COMBINATION OF THESE TWO )N HARDWARE
CONTROLLED PREFETCHING THE HARDWARE ALONE DECIDES WHAT DATA TO PREFETCH AND WHEN AND WHERE TO PREFETCH THE DATA 4HEREFORE PREFETCHINGS ARE HAN DLED DYNAMICALLY WITHOUT INTERVENTION OF COMPILER OR PROGRAMMER AND THERE IS NO NEED TO CHANGE CODE 7ITH SOFTWARE DIRECTED PREFETCHING THE
&IG !N EXAMPLE OF DEMAND FETCH AND PREFETCHING ,EFT HAND SIDE `GURE SHOWS DEMAND FETCHES WHICH CAUSE THE PROCESSOR TO STALL RIGHT HAND SIDE `GURE SHOWS HOW PREFETCHING HIDES THE MEMORY ACCESS DELAY
COMPILER OR PROGRAMMER DIRECTS PREFETCHING BY IN SERTING PREFETCHING INSTRUCTIONS INTO THE CODE 4HE SOFTWARE PREFETCHING NEEDS THE NON BLOCKING LOAD OR PREFETCHING LOAD INSTRUCTION AND SPECIAL HARDWARE TO TREAT THIS INSTRUCTION 4HE SOFTWARE PREFETCH ING INCREASES COMPILATION TIME RUN TIME OVERHEAD AND CODE SIZE BECAUSE ADDITIONAL INSTRUCTIONS ARE NEEDED TO CALCULATE THE PREFETCHING ADDRESS AND IS SUE THE PREFETCHING LOADS )NTEGRATED PREFETCHING OR HYBRID PREFETCHING IS A TECHNIQUE COMBINING HARD WARE AND SOFTWARE
4HERE ARE TWO MAIN HARDWARE APPROACHES ONE IS SEQUENTIAL PREFETCH WHOSE SUCCESS DEPENDS ON SPA
time time
load A
add A, B load B
memory acess delay
memory acess delay
load A load B add A, B
B arrives
instruction execution waiting for data
(b) prefetching
(a) demand fetch
전 전 전
전자자자통자통통통신신신동신동동동향향향향분분분분석석석석 제권 제호 년 월
TIAL LOCALITY AND THE OTHER IS STRIDE PREFETCH THAT IS BASED ON STRIDE CALCULATION )N THE SEQUEN TIAL PREFETCH METHOD ACCESS TO CACHE LINE CAUSES PREFETCHING FOR SUCCESSIVE LINE OR LINES )F THERE IS SUbCIENT SPATIAL LOCALITY THIS APPROACH CAN WORK WELL )N A STRIDE PREFETCH METHOD THE INSTRUCTION AND OPERAND ADDRESS OF EACH LOAD IS STORED IN A HARDWARE LOOKUP TABLE AND THE INSTRUCTION ADDRESS ENABLES THE SYSTEM TO `ND THE ACCESS STREAM WHICH BELONGS TO THE SAME INSTRUCTION 4HE OPERAND AD DRESS OF EACH LOAD IS COMPARED WITH THE PREVIOUS ONE FROM THE SAME INSTRUCTION ADDRESS AND THE DIF FERENCE OF THESE TWO OPERAND ADDRESSES BECOMES THE STRIDE 4HE STRIDE IS USED TO DETERMINE THE POSSIBLE NEXT ACCESS BY ADDING THE STRIDE AND THE CURRENT OPERAND ADDRESS
7HEREAS DATA PREFETCHING INITIATES DATA MOVE MENT OPERATION BY CONSUMER DATA FORWARDING INI TIATES IT BY PRODUCER 7HEN A PROCESSOR UPDATES A WORD IN ITS CACHE DATA FORWARDING ALSO PROPAGATES THE UPDATE TO THE CACHES OF THE PROCESSORS THAT WILL USE THE WORD IN THE NEAR FUTURE (OPEFULLY WHEN THESE CONSUMER PROCESSORS REFER TO THE WORD LATER ON THEY WILL `ND THE UPDATED DATA IN THEIR CACHES AND THEREFORE PROCEED WITHOUT STALLING
$ATA FORWARDING NEEDS A SPECIAL INSTRUCTION WHICH COMBINES A WRITE AND A SENDER INITIATED FOR WARDING OPERATION AND REQUIRES AN EXTENSIVE COM PILER SUPPORT
4HE $!3( DELIVER OPERATION ;= TRANSFERS A COPY OF A CACHE BLOCK TO ONE OR MORE EXPLICITLY SPEC
I`ED CLUSTER CACHES 4HE +32 POSTSTORE INSTRUCTION
;= ENABLES A PROCESSOR TO FORWARD DATA TO OTHER PROCESSORS
6)) 3UMMARY
!S PROCESSOR SPEED IS INCREASING AT A MUCH FASTER RATE THAN MEMORY SPEED AND COMPUTER SYS TEM SCALES UP IN THE NUMBER OF PROCESSORS LATENCIES FOR MEMORY ACCESSES INCREASE 4HEREFORE LATENCY HIDING TECHNIQUES FOR COMPUTER SYSTEMS ARE CRITICAL FOR ACHIEVING A HIGHER PERFORMANCE 4HIS ARTICLE HAS ATTEMPTED TO ADDRESS LATENCY HIDING TECHNIQUES
WRITE BU_ER RELAXED CONSISTENCY MODEL DATA LO CALITY OPTIMIZATION DATA PREFETCHING AND DATA FOR WARDING
)N THIS SECTION THE ARTICLE IS CONCLUDED BY SUM MARIZING THE POSSIBLE BENE`TS OF THE LATENCY HIDING TECHNIQUES $ATA PREFETCHING IS VERY USEFUL IN RE DUCING THE STALLS DUE TO READ LATENCIES 7RITE BU_ER IN ASSOCIATION WITH RELAXED CONSISTENCY MODEL CAN REMOVE ALMOST ALL STALLS DUE TO WRITE LATENCIES
$ATA FORWARDING CAN BE USEFUL IN REDUCING LATEN CIES FOR ONLY CACHEABLE AND SHARED DATA
4O ACHIEVE OVERALL BENE`TS `RST THE LATENCY SHOULD BE REDUCED AS MUCH AS POSSIBLE THROUGH LO CALITY OPTIMIZATION !FTER THAT THE REMAINING LA TENCY SHOULD BE REDUCED BY USING WRITE BU_ER DATA PREFETCHING DATA FORWARDING AND MULTI THREADING
7HILE COMBINING MORE THAN ONE TECHNIQUE SUCH AS WRITE BU_ER DATA PREFETCHING AND DATA FOR
TERFERENCE
2EFERENCES
;= !NANT !GARWAL *OHN +UBIATOWICZ $AVID +RANZ "ENG (ONG ,IM $ONALD 9EUNG 'ODFREY $3OUZA AND -IKE 0ARKIN <3PARCLE !N %VOLUTIONARY 0ROCESSOR $ESIGN FOR ,ARGE 3CALE -ULTIPROCESSORS )%%% -ICRO
*UNE
;= !NANT !GARWAL " ( ,IM $ +RANZ AND * +UBIATOW ICZ <!PRIL ! 0ROCESSOR !RCHITECTURE FOR -ULTIPROCESSING
0ROC TH !NNUAL )NTL 3YMP ON #OMPUTER !RCHITECTURE-AY PP
;= 3TEVE #ARR +ATHRYN 3 -C+INLEY AND #HAU 7EN 4SENG
</MPILER /TIMIZATIONS FOR )MPROVING $ATA ,OCALITY 0ROC
TH )NTL #ONF ON !RCHITECTURAL 3UPPORT FOR 0ROGRAMMING ,ANGUAGES AND /PERATING 3YSTEMS /CTOBER PP
;= 4IEN &U #HEN AND *EAN ,OUP "AER <! 0ERFORMANCE 3TUDY OF 3OFTWARE AND (ARDWARE $ATA 0REFETCHING 3CHEMES
0ROC ST !NNUAL )NTL 3YMP ON #OMPUTER !RCHITECTUREPP
;= -ICHEL $UBOIS AND #HRISTOPH 3CHEURICH <-EMORY !CCESS
$EPENDENCIES IN 3HARED -EMORY -ULTIPROCESSORS )%%%
4RANSACTIONS ON 3OFTWARE %NGINEERING *UNE
;= -ICHEL $UBOIS #HRISTOPH 3CHEURICH AND &AYE "RIGGS
<-EMORY !CCESS "U_ERING IN -ULTIPROCESSORS 0ROC TH
!NNUAL )NTL 3YMP ON #OMPUTER !RCHITECTURE *UNE
PP
;= +OUROSH 'HARACHORLOO !NOOP 'UPTA AND *OHN (ENNESSY
<0ERFORMANCE %VALUATION OF -EMORY #ONSISTENCY -ODELS FOR 3HARED -EMORY -ULTIPROCESSORS 0ROC TH )NTL #ONF
;= *AMES 2 'OODMAN #ACHE #ONSISTENCY AND 3EQUENTIAL
#ONSISTENCY 4ECHNICAL 2EPORT 42 5NIVERSITY OF
7ISCONSIN -ADISON #OMPUTER 3CIENCES $EPARTMENT HTTPWWWCSWISCEDUTRSHTML &EBRUARY
;= %DWARD ( 'ORNISH %LANA $ 'RANSTON AND !LENANDER 6 6EIDENBAUM <#OMPILER $IRECTED $ATA 0REFETCHING IN -ULTIPROCESSORS WITH -EMORY (IERARCHIES0ROC TH )NTL
#ONF ON 3UPERCOMPUTING *UNE PP
;= *OHN , (ENNESSY AND .ORMAN 0 *OUPPI <#OMPUTER 4ECHNOLOGY AND !RCHITECTURE !N %VOLVING )NTERACTION
)%%% #OMPUTER 3EPTEMBER;= !NDO +I $ATA 0REFETCHING 4OWARDS (IDING -EMORY ,A
TENCY IN -ULTIPROCESSOR 3YSTEMS 0H$ THESIS 5NIVERSITYOF -ANCHESTER $EPARTMENT OF #OMPUTER 3CIENCE
;= !NDO +I 4HE %_ECTIVENESS OF 7RITE "U_ERS IN 2E
DUCING THE )MPACT OF -EMORY ,ATENCY4ECHNICAL 2EPORT 5-#3 5NIVERSITY OF -ANCHESTER
$EPARTMENT OF #OMPUTER 3CIENCE /CTOBER
HTTPWWWCSMANACUKCSONLYCSTECHREPINDEXHTML
;= $ ! +OUFATY 8 #HEN $ + 0OULSEN AND * 4ORREL LAS <$ATA &ORWARDING IN 3CALABLE 3HARED -EMORY -UL TIPROCESSORS 0ROC TH )NTL #ONF ON 3UPERCOMPUTING
"ARCELONA *ULY PP
;= -ONICA 3 ,AM %% 2OTHBERG AND -ICHAEL %DWARD 7OLF
<4HE #ACHE 0ERFORMANCE AND /PTIMIZATIONS OF "LOCKED !L GORITHMS 0ROC TH )NTL #ONF ON !RCHITECTURAL 3UPPORT
FOR 0ROGRAMMING ,ANGUAGES AND /PERATING 3YSTEMS !PRILPP
;= ,ESLIE ,AMPORT <(OW TO -AKE A -ULTIPROCESSOR #OMPUTER THAT #ORRECTLY %XECUTES -ULTIPROCESS 0ROGRAMS )%%%
4RANSACTIONS ON #OMPUTERS # 3EPTEMBER
;= $ANIEL ,ENOSKI *AMES ,AUDON +OUROSH 'HARACHORLOO
전 전 전
전자자자통자통통통신신신동신동동동향향향향분분분분석석석석 제권 제호 년 월
7OLF $IETRICH 7EBER AND !NOOP 'UPTA <4HE 3TAN FORD $!3( -ULTIPROCESSOR )%%% #OMPUTER
-ARCH
;= 4ODD # -OWRY AND !NOOP 'UPTA <4OLERATING ,A TENCY THROUGH 3OFTWARE #ONTROLLED 0REFETCHING IN 3HARED -EMORY -ULTIPROCESSORS *OURNAL OF 0ARALLEL AND $IS
TRIBUTED #OMPUTING;= (ENK , -ULLER 0AUL 7! 3TALLARD AND $AVID ($
7ARREN <(IDING -ISS ,ATENCIES WITH -ULTITHREADING ON THE $ATA $I_USION -ACHINE 0ROC )NTL #ONF
ON 0ARALLEL 0ROCESSING VOL !UGUST PP!LSO AVAILABLE AT HTTPWWWCSBRISACUK2ESEARCH
$$-PUBLICATIONSHTML
;= "ILL .ITZBERG AND 6IRGINIA ,O <$ISTRIBUTED 3HARED -EM ORY ! 3URVEY OF )SSUES AND !LGORITHMS )%%% #OMPUTER
!UGUST
;= !LAN *AY 3MITH <#ACHE -EMORIES !#- #OMPUTING 3UR
VEYS;= 7OLF $IETRICH 7EBER AND !NOOP 'UPTA <%XPLORING THE
"ENE`TS OF -ULTIPLE (ARDWARE #ONTEXTS IN A -ULTIPROCES SOR !RCHITECTURE 0RELIMINARY 2ESULTS 0ROC TH !NNUAL
)NTL 3YMP ON #OMPUTER !RCHITECTURE *UNE PP;= $ANIEL 7INDHEISER %RIC , "OYD %RIC (AO 3ANTOSH '
!BRAHAM AND %DWARD 3 $AVIDSON <+32 -ULTIPRO CESSOR !NALYSIS OF ,ATENCY (IDING 4ECHNIQUES IN 3PARSE 3OLVER 0ROC TH )NTL 0ARALLEL 0ROCESSING 3YMPOSIUM
PP
;= -ICHAEL %DWARD 7OLF )MPROVING ,OCALITY AND 0ARAL
LELISM IN .ESTED ,OOPS 0H$ THESIS 3TANFORD 5NIVERSITY !UGUST !VAILABLE AT 4HE 3TANFORD 35)&
#OMPILER 'ROUP $EPARTMENT OF #OMPUTER 3CIENCE VIA HTTPSUIFSTANFORDEDUPAPERS
;= -ICHAEL %DWARD 7OLF AND -ONICA 3 ,AM <! $ATA ,OCAL ITY /PTIMIZING !LGORITHM 0ROC 3)'0,!. #ONF ON
0ROGRAMMING ,ANGUAGE $ESIGN AND )MPLEMENTATION *UNEPP
;= 3TEVEN #AMERON 7OO -ORIYOSHI /HARA %VAN 4ORRIE
*ASWINDER 0AL 3INGH AND !NOOP 'UPTA <4HE 30,!3( 0ROGRAMS #HARACTERIZATION AND -ETHODOLOGICAL #ONSID
ERATIONS 0ROC ND !NNUAL )NTL 3YMP ON #OMPUTER
!RCHITECTURE PP