• 검색 결과가 없습니다.

Performance Analysis Model

N/A
N/A
Protected

Academic year: 2022

Share "Performance Analysis Model"

Copied!
63
0
0

로드 중.... (전체 텍스트 보기)

전체 글

(1)

Performance Analysis Model

Jihong Kim

Dept. of CSE, SNU

(2)

Performance Metrics

• Throughput

• Response Time

• Capacity

• Reliability

• Cost

• …

• Used for CPU, Memory, SSD, Network, ….

(3)

Memory Hierarchy

Higher Speed, Cost Larger

Size

(4)

Latency Numbers Every Programmer Should Know

(5)

Throughput

• IO rate

– IOPS (accesses/second)

– Used for applications where the size of each request is small

– e.g. transaction processing

• Data rate

– MB/Sec or bytes/Sec

– Used for applications where the size of each request is large

– e.g. scientific applications

(6)

Response Time

• How long a storage system takes to access data

• Depending on what you view as the storage system

– User’s perspective – OS’s perspective

– Disk controller’s perspective

Application

O/S

Storage

(7)

Disk I/O Performance

Response time = Queue + Device Service time Proc

Queue

IOC Device

Metrics:

Response Time Throughput

Q: how to balance response time &

throughput?

(e.g., adding more servers)

[Hennessy&Patterson ]

Knee of curve

e.g., disks at ~70%

(8)

• Better to understand key performance factors

• High-level insight on system performance

• Complementary to other performance evaluation methods:

– Measurement-based – Simulation-based

Performance Analytic Model

(9)

• Original Amdahl’s Law

– Useful in understanding an upper bound on overall speedup (insight with less accuracy!!)

Example Analytic Model

N - 1 1

N S

1 f

f (1 Nx

f(1 1 S

f 1 t

and t t

t f t

, N t

t

t S t

s p s

p s p s

s p

1

)) 1 1

1

= −

− +

=

− + =

= +

= +

S: speedup

N: no. of processors

f: fraction of sequential part tp and ts:

exec. time for parallel and sequential part

(10)

Example: 100 processors

S = 100

f = ?

S = 99

f = 1/(99*99) = 0.000102 N

- 1 1

N S

1 f

− 1

=

(11)
(12)

Amdahl’s law (in general)

Execution Time_after improvement

=

+ Execution Time_unaffected by improvement 𝐄𝐱𝐞𝐜𝐮𝐭𝐢𝐨𝐧 𝐓𝐢𝐦𝐞_𝑎𝑓𝑓𝑒𝑐𝑡𝑒𝑑 𝑏𝑦 𝑖𝑚𝑝𝑟𝑜𝑣𝑒𝑚𝑒𝑛𝑡

𝐀𝐦𝐨𝐮𝐧𝐭 𝐨𝐟 𝐢𝐦𝐩𝐫𝐨𝐯𝐞𝐦𝐞𝐧𝐭

(13)

• An upper bound on the throughput of a system from the throughput of its subsystems.

1. The max. throughput of K subsystems in parallel

= the sum of the subsystem throughputs.

2. The max. throughput of K subsystems in series

= the minimum of the subsystem throughputs.

Bottleneck Analysis

(14)

• Max Bandwidth = MIN(9, (4+3), 6, 10) = 6

• The most expensive subsystem should be bottlenecked while non-bottlenecked

throughputs should be decreased to save cost.

Example: Bottleneck Analysis

9 GB/S

4 GB/s 3 GB/s

6 GB/s 10 GB/s

(15)

• A visual way to identify performance bottlenecks and to find potential optimization directions

• Bottleneck analysis between

application characteristics and architectural capabilities

Roofline Model

FLOPS = #𝐹𝐿𝑂𝑃

𝑠𝑒𝑐 = #𝐹𝐿𝑂𝑃

𝑏𝑦𝑡𝑒 × 𝐵𝑦𝑡𝑒𝑠

𝑠𝑒𝑐

= 𝐴𝑟𝑖𝑡ℎ𝑚𝑒𝑡𝑖𝑐 𝐼𝑛𝑡𝑒𝑛𝑠𝑖𝑡𝑦 × 𝐵𝑎𝑛𝑑𝑤𝑖𝑑𝑡ℎ

= AI × 𝐵𝑊

Attainable FLOPS = min(AI x BW, Peak FLOPS)

application

characteristics

architectural

capabilities

(16)

• The number of operations per unit data

• Algorithm specific

• Example AI:

– Matrix Addition of two N x N matrices

• Total number of data accesses = 2NN (reads) + NN (writes)= 3NN

• Total number of operations = NN (adds)

• 𝐴𝐼 = 𝑁𝑁

3𝑁𝑁 = 1

3 (constant)

– Matrix Multiplication of two N x N matrices

• 𝐴𝐼 = 2𝑁 −1 ×𝑁𝑁

3𝑁𝑁 = 2𝑁−1

3 (as N gets higher, AI increases)

Arithmetic Intensity = #𝐹𝐿𝑂𝑃

𝑏𝑦𝑡𝑒

(17)

𝐹𝐿𝑂𝑃𝑆 = AI × 𝐵𝑊

log 𝐹𝐿𝑂𝑃𝑆 = log 𝐴𝐼 + log 𝐵𝑊 Y = X + C

Roofline Graph

log 𝐴𝐼 log 𝐹𝐿𝑂𝑃𝑆

log 𝐵𝑊 𝑃𝑒𝑎𝑘 𝐹𝐿𝑂𝑃𝑆

Compute Bound BW Bound

(18)

Alg 1: Move Right (higher data reuse) Move Up (higher BW memory) Alg 2: Move Up (more powerful CPU)

Optimization Guidelines

(19)

Roofline Model w/ Ceilings

(20)

• Original Roofline Model

– AI is computed over bytes moved out of DRAM.

• Cache-aware Roofline Model

– AI is computed over the total number of bytes moved to the CPU. (That is, L1-cache AI)

Cache-Aware Roofline Model

AI varies for a given algorithm:

(e.g., different loop iterations or different optimizations)

 AI is constant.

(21)
(22)

• In the original roofline model, if an application is in the memory bound region, it is difficult to achieve the maximum attainable performance.

Cache-aware vs Original Roofline

(23)

• Large red dot: takes up the most time,

→ candidate for optimization

Cache-Aware Roofline Model in Intel

Advisor

(24)

Little’s Law: L = λ W

L = average number of items in the queuing system

W = average waiting time in the system for an item (i.e. latency) λ = average number of items arriving per unit time

(i.e., bandwidth)

L = (average throughput) x (average latency)

Queuing System:

Items in Queue &

Items in Service

Arrivals Departures

If you know two, one can be easily computed

(25)

n(t) = the number of items in queue at time t T = a long period of time

A(T) = the area under n(t) over T N(T) = the number of arrivals in T

Intuitive Argument

(26)

λ(T) = N(T)/T L(T) = A(T)/T

W(T) = A(T)/N(T)

→L(T) = λ(T)W(T)

𝑇→∞

lim L 𝑇 = L

𝑇→∞

lim λ 𝑇 = λ

𝑇→∞

lim W 𝑇 = W

(27)

Example 1: 평균재학기간

• 매년 신입생이 2000명인 학교에서

– λ = 2000/년 (avg throughput)

• 평균적으로 학교를 다니는 학생이 3000명이면

– L = 3000

• 학생 1인당 학교에 다니는 기간은? (avg latency)

– W = 3000/2000 = 1.5년

(28)

Example 2: Disk Utilization

For a single disk,

every second, 50 I/O requests are coming in

&

avg disk service time = 10 ms, what is the disk utilization?

답: 50 x 0.01

= 0.50 I/O

(29)

Example 3: Buffer Requirements

Consider a cache memory with

the cache miss ratio = 6.25% &

the avg miss latency = 100 ns.

Q: If this cache receives 2 memory references per cycle at 2.5 GHz, how many buffers are needed for recording outstanding misses?

답: avg throughput =

2 refs/cycle x 2.5G cycles/s x 0.0625 misses/ref

= 0.3125 Gmisses/s

L = 0.3125 Gmisses/s x 100 ns = 31.25 misses

(30)

M/M/1 Queue

M: exponential interarrival times with mean 1/R M: exponential service times with mean S

1: Single server

M/M/1 queue will reveal trade-off among

(a) allowing unscheduled task arrivals, (b) minimizing latency , and (c ) maximizing throughput

L = Q + S

Q

(31)

• R/(1/S) = RS (1/S > R, RS < 1)

• From queuing theory, y = 1/(1-x)

M/M/1 Queue Latency vs. Throughput Tradeoff

Normalized average latency

frac. of max. throughput, 𝑅1

𝑆

𝐿 𝑆

(32)

High Throughput

→ High Latency

Low Throughput

→ Low Latency

Big Impact on Latency

X: 0.25 0.50 0.75 0.90

Y: 1.33 2.00 4.00 10.0

(33)

Q: When R = 10 packets/s, if we want to keep L = 100 ms, what S should be?

• Trial 1: Since 1/R = 100 ms/packet, how about S = 100 ms??

• In M/M/1, S = 50 ms with Q = 50 ms Solve for S in L/S = 1/(1-RxS)

1/10S = 1/(1 – 10S) 1 – 10S = 10S

S = 1/20 sec ( RxS = 0.5 → 50% of max throughput)

Example: M/M/1

(34)

• M. Hill, “Three Other Models of Computer System Performance,” 2019. Available at https://arxiv.org/pdf/1901.02926.pdf .

• S. Williams et al.,“Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures,” CACM, April 2008.

• A. Ilic et al., “Cache-Aware Roofline Model:

Upgrading the Loft,” IEEE CAL, 2014.

• A. Mohan, “Understanding Roofline Charts,”

http://telesens.co/2018/07/26/....

References

(35)

Performance Summary and Sampling

Ref. Chap. 12, The Art of Computer Systems Performance Analysis (by R. K. Jain)

(36)

Performance Summary

How to characterise with a single number?

Arithmetic mean =

Weighted AM =

Computer A Computer B Program 1 1 10

Program 2 1000 100 Total time 1001 110

= n

n i 1Timei 1

= n i 1

i i x Time w

(37)

Summary by a Single Number

Sample mean vs. Sample median vs. Sample mode

(38)

Mean vs. Median vs. Mode

Examples:

Most frequently-used microprocessor in smartphones

Average CPU time for each triangle drawing

Average system configuration

(39)

Common Misuses of Means

Using mean of significantly different values

Mean of CPU time for 10 ms and 1000 ms

Using mean without regard to the skewness of distribution

(40)

MySQL Command Latency

[Source]http://www.brendangregg.com/FrequencyTrails/mean.html

(41)

Common Misuses of Means

Multiplying means to get the mean of a product

E(xy) = E(x)E(y) if x and y are independent

E.g., the ave. no. of users = 23, the ave. no. of subprocesses per user

= 2, what is the ave. no. of subprocesses?

Taking a mean of a ratio with different bases

More discussions to follow

(42)

Geometric Mean

Used if the product of the observations is a quantity of interest

E.g., Improvement in each layer of network protocol

7 1.18 × 1.13 × 1.11 × 1.08 × 1.10 × 1.28 × 1.05 − 1

= 0.13

(43)

Multiplicativity Property in GM

GM of a ratio is the ratio of the GMs of the numerator and denominator

The choice of the base does not change the conclusion

(44)

Harmonic Mean

Should be used whenever an arithmetic mean can be justified for 1/xi

Example: Average MIPS rate for the processor

Assume that the benchmark has m million instructions.

: MIPS computed from the i-th measurement

(45)

Normalized execution time is handy for scaling performance

P1 P2 relative to A relative to B

secs secs P1 P2 AM P1 P2 AM

A 10 50 1 1 1 0.1 50 25 B 100 1 10 0.02 5 1 1 1

From total execution time, A is faster than B by the factor of 101/60 = 1.7 (equal # of runs of P1 and P2)

Arith. Mean of relative time say either

A is 5 times faster than B

B is 25 time faster than A

Normalized Performance

Do not summarize relative execution time with Arith. Mean.

(46)

n

Geometric mean used to summarize relative measure P1 P2 relative to A relative to B

secs secs P1 P2 GM P1 P2 GM

A 10 50 1 1 1 0.1 50 2.2 B 100 1 10 0.02 0.45 1 1 1

Relative to A : B is 2.2X faster

Relative to B : B is 2.2X faster

(BUT) two unfortunate properties

Drawback 1: does not track execution time

Drawback 2: rewards all improvements equally

On machine B, contribution of 50 sec in P1 = contribution of 0.5 sec in P2.

Geometric Mean Exec. Time Ratio

i

(47)

Mean of a Ratio: CASE 1

Both sum of numerators and sum of denominators have a physical meaning.

→ The average of the ratio is the ratio of the averages

(48)

Example of Case 1

(49)

Mean of a Ratio: CASE 2

Sum of numerators has a physical meaning && the denominator is a constant

The arithmetic mean of the ratios can be used

(50)

Mean of a Ratio: CASE 3

Sum of the denominators has a physical meaning && the numerators are constant

The harmonic mean of the ratios can be used

(51)

Mean of a Ratio: CASE 4

If ai = c * bi where c is approximately constant, c can be estimated by the geometric mean of ai/bi

A ratio of sums??

1. Workload sizes are widely different (119: 8612)

2. ai = c * bi

(52)

SPEC Benchmarking Process

• For each SPEC benchmark program p ( ), measure Tp.

• Compute geometric mean:

N N

p

=1 p p) reference(

T T

N p 1

SPECRatio

B A A

B B

A

e Performanc

e Performanc ime

ExecutionT

ime ExecutionT

SPECRatio

SPECRatio = =

(53)

SPEC 89

• Compiler enhancements and performance

0 100 200 300 400 500 600 700 800

tomcatv fpppp

matrix300 eqntott

li nasa7

doduc spice

espresso gcc

Benchmark

Compiler

Enhanced compiler

SPEC performance ratio

over

VAX-11/780

(54)

Combining Cost and Performance

Bandwidth per pin vs. Execution time per dollar

Can divide cost into performance if the choice of metric for performance grows in the opposite direction as the metric for cost

Must multiply cost and performance if the choice of metric for performance grows in the same direction as the metric for cost

(55)

More Examples

Transactions per second per dollar

Request latency per pin

Execution-time-dollars

Request-latency-pin-count product

(56)

Pareto Optimality

When multi-dimensional design spaces are considered

Using a single number is not really appropriate

Unless we know exact weights of different metrics or

Unless we care about one aspect to the exclusion of all others

Pareto Optimality

One candidate solution to a problem is better than another candidate solution if the first dominates the second

The first is better than or equal to the second in all figures of merit

(57)
(58)

Pareto Plot vs. Cost-Performance

(59)

Sampled Average

If sampling for data/X,

Must sample this metric in equal steps of dimension X.

Miles per gallon

Cycles per instruction

(60)

Example: Miles Per Gallon

(61)

Sampling over Time

(62)

Sampling over Distance

(63)

Sampling over Fuel Consumption

참조

관련 문서

In this thesis, tracked vehicle's dynamic performance of the two models, single-body model and Multi-body model, were compared through the numerical simulation.. The

MODEL STEP VOLTAGE CURRENT RESISTANCE INDUCTANCE HOLDING ROTOR NUMBER WEIGHT DIMENSION. ANGLE TORQUE

Using 1987 data, had 46,400 traffic fatalities in 1924 billion miles or 24 fatalities/billion vehicle miles.. 37% drivers, 18% passengers, 14% pedestrians, 9%

5 ) Stop Light Switch (Mechanical) 6 ) Front/Rear Balance Valve 6 ) Front/Rear Balance Valve 7 ) Pressure Differentiavl Valve 8 ) Brake Warning Lamp.. 9 ) Brake Fluid

▪ Local temperature of PMSM can be calculated through thermal resistance model and heat generation model, presenting the cooling performance thermal

Roles of Computer in Product Develoment Cycle. „ Quick generation

- flux of solute mass, that is, the mass of a solute crossing a unit area per unit time in a given direction, is proportional to the gradient of solute concentration

à For each subentity, create a table that includes the attributes of that entity set plus the primary key of the higher level