Performance Analysis Model

(1)

Performance Analysis Model

Jihong Kim

Dept. of CSE, SNU

(2)

Performance Metrics

• Throughput

• Response Time

• Capacity

• Reliability

• Cost

• …

• Used for CPU, Memory, SSD, Network, ….

(3)

Memory Hierarchy

Higher Speed, Cost Larger

Size

(4)

Latency Numbers Every Programmer Should Know

(5)

Throughput

• IO rate

– IOPS (accesses/second)

– Used for applications where the size of each request is small

– e.g. transaction processing

• Data rate

– MB/Sec or bytes/Sec

– Used for applications where the size of each request is large

– e.g. scientific applications

(6)

Response Time

• How long a storage system takes to access data

• Depending on what you view as the storage system

– User’s perspective – OS’s perspective

– Disk controller’s perspective

Application

O/S

Storage

(7)

Disk I/O Performance

Response time = Queue + Device Service time Proc

Queue

IOC Device

Metrics:

Response Time Throughput

Q: how to balance response time &

throughput?

(e.g., adding more servers)

[Hennessy&Patterson ]

Knee of curve

e.g., disks at ~70%

(8)

• Better to understand key performance factors

• High-level insight on system performance

• Complementary to other performance evaluation methods:

– Measurement-based – Simulation-based

Performance Analytic Model

(9)

• Original Amdahl’s Law

– Useful in understanding an upper bound on overall speedup (insight with less accuracy!!)

Example Analytic Model

N - 1 1

N S

1 f

f (1 Nx

f(1 1 S

f 1 t

and t t

t f t

, N t

t

t S t

s p s

p s p s

s p

1

)) 1 1

1

= −



− +

=

− + =

= +

S: speedup

N: no. of processors

f: fraction of sequential part tp and ts:

exec. time for parallel and sequential part

(10)

Example: 100 processors

• S = 100

• f = ?

• S = 99

• f = 1/(99*99) = 0.000102 N

- 1 1

N S

1 f

− 1

=

(11)

(12)

Amdahl’s law (in general)

Execution Time_after improvement

=

+ Execution Time_unaffected by improvement 𝐄𝐱𝐞𝐜𝐮𝐭𝐢𝐨𝐧 𝐓𝐢𝐦𝐞_𝑎𝑓𝑓𝑒𝑐𝑡𝑒𝑑 𝑏𝑦 𝑖𝑚𝑝𝑟𝑜𝑣𝑒𝑚𝑒𝑛𝑡

𝐀𝐦𝐨𝐮𝐧𝐭 𝐨𝐟 𝐢𝐦𝐩𝐫𝐨𝐯𝐞𝐦𝐞𝐧𝐭

(13)

• An upper bound on the throughput of a system from the throughput of its subsystems.

1. The max. throughput of K subsystems in parallel

= the sum of the subsystem throughputs.

2. The max. throughput of K subsystems in series

= the minimum of the subsystem throughputs.

Bottleneck Analysis

(14)

• Max Bandwidth = MIN(9, (4+3), 6, 10) = 6

• The most expensive subsystem should be bottlenecked while non-bottlenecked

throughputs should be decreased to save cost.

Example: Bottleneck Analysis

9 GB/S

4 GB/s 3 GB/s

6 GB/s 10 GB/s

(15)

• A visual way to identify performance bottlenecks and to find potential optimization directions

• Bottleneck analysis between

application characteristics and architectural capabilities

Roofline Model

FLOPS = ^{#𝐹𝐿𝑂𝑃}

𝑠𝑒𝑐 = ^{#𝐹𝐿𝑂𝑃}

𝑏𝑦𝑡𝑒 × ^{𝐵𝑦𝑡𝑒𝑠}

𝑠𝑒𝑐

= 𝐴𝑟𝑖𝑡ℎ𝑚𝑒𝑡𝑖𝑐 𝐼𝑛𝑡𝑒𝑛𝑠𝑖𝑡𝑦 × 𝐵𝑎𝑛𝑑𝑤𝑖𝑑𝑡ℎ

= AI × 𝐵𝑊

Attainable FLOPS = min(AI x BW, Peak FLOPS)

application

characteristics

architectural

capabilities

(16)

• The number of operations per unit data

• Algorithm specific

• Example AI:

– Matrix Addition of two N x N matrices

• Total number of data accesses = 2NN (reads) + NN (writes)= 3NN

• Total number of operations = NN (adds)

• 𝐴𝐼 = ^𝑁𝑁

3𝑁𝑁 = ¹

3 (constant)

– Matrix Multiplication of two N x N matrices

• 𝐴𝐼 = ^{2𝑁 −1 ×𝑁𝑁}

3𝑁𝑁 = ^2𝑁−1

3 (as N gets higher, AI increases)

Arithmetic Intensity = ^{#𝐹𝐿𝑂𝑃}

𝑏𝑦𝑡𝑒

(17)

𝐹𝐿𝑂𝑃𝑆 = AI × 𝐵𝑊

log 𝐹𝐿𝑂𝑃𝑆 = log 𝐴𝐼 + log 𝐵𝑊 Y = X + C

Roofline Graph

log 𝐴𝐼 log 𝐹𝐿𝑂𝑃𝑆

log 𝐵𝑊 𝑃𝑒𝑎𝑘 𝐹𝐿𝑂𝑃𝑆

Compute Bound BW Bound

(18)

Alg 1: Move Right (higher data reuse) Move Up (higher BW memory) Alg 2: Move Up (more powerful CPU)

Optimization Guidelines

(19)

Roofline Model w/ Ceilings

(20)

• Original Roofline Model

– AI is computed over bytes moved out of DRAM.

• Cache-aware Roofline Model

– AI is computed over the total number of bytes moved to the CPU. (That is, L1-cache AI)

Cache-Aware Roofline Model

AI varies for a given algorithm:

(e.g., different loop iterations or different optimizations)

 AI is constant.

(21)

(22)

• In the original roofline model, if an application is in the memory bound region, it is difficult to achieve the maximum attainable performance.

Cache-aware vs Original Roofline

(23)

• Large red dot: takes up the most time,

→ candidate for optimization

Cache-Aware Roofline Model in Intel

Advisor

(24)

Little’s Law: L = λ W

L = average number of items in the queuing system

W = average waiting time in the system for an item (i.e. latency) λ = average number of items arriving per unit time

(i.e., bandwidth)

L = (average throughput) x (average latency)

Queuing System:

Items in Queue &

Items in Service

Arrivals Departures

If you know two, one can be easily computed

(25)

n(t) = the number of items in queue at time t T = a long period of time

A(T) = the area under n(t) over T N(T) = the number of arrivals in T

Intuitive Argument

(26)

λ(T) = N(T)/T L(T) = A(T)/T

W(T) = A(T)/N(T)

→L(T) = λ(T)W(T)

𝑇→∞

lim L 𝑇 = L

𝑇→∞

lim λ 𝑇 = λ

𝑇→∞

lim W 𝑇 = W

(27)

Example 1: 평균재학기간

• 매년 신입생이 2000명인 학교에서

– λ = 2000/년 (avg throughput)

• 평균적으로 학교를 다니는 학생이 3000명이면

– L = 3000

• 학생 1인당 학교에 다니는 기간은? (avg latency)

– W = 3000/2000 = 1.5년

(28)

Example 2: Disk Utilization

For a single disk,

every second, 50 I/O requests are coming in

&

avg disk service time = 10 ms, what is the disk utilization?

답: 50 x 0.01

= 0.50 I/O

(29)

Example 3: Buffer Requirements

Consider a cache memory with

the cache miss ratio = 6.25% &

the avg miss latency = 100 ns.

Q: If this cache receives 2 memory references per cycle at 2.5 GHz, how many buffers are needed for recording outstanding misses?

답: avg throughput =

2 refs/cycle x 2.5G cycles/s x 0.0625 misses/ref

= 0.3125 Gmisses/s

L = 0.3125 Gmisses/s x 100 ns = 31.25 misses

(30)

M/M/1 Queue

M: exponential interarrival times with mean 1/R M: exponential service times with mean S

1: Single server

M/M/1 queue will reveal trade-off among

(a) allowing unscheduled task arrivals, (b) minimizing latency , and (c ) maximizing throughput

L = Q + S

Q

(31)

• R/(1/S) = RS (1/S > R, RS < 1)

• From queuing theory, y = 1/(1-x)

M/M/1 Queue Latency vs. Throughput Tradeoff

Normalized average latency

frac. of max. throughput, ^𝑅₁

𝑆

𝐿 𝑆

(32)

High Throughput

→ High Latency

Low Throughput

→ Low Latency

Big Impact on Latency

X: 0.25 0.50 0.75 0.90

Y: 1.33 2.00 4.00 10.0

(33)

Q: When R = 10 packets/s, if we want to keep L = 100 ms, what S should be?

• Trial 1: Since 1/R = 100 ms/packet, how about S = 100 ms??

• In M/M/1, S = 50 ms with Q = 50 ms Solve for S in L/S = 1/(1-RxS)

1/10S = 1/(1 – 10S) 1 – 10S = 10S

S = 1/20 sec ( RxS = 0.5 → 50% of max throughput)

Example: M/M/1

(34)

• M. Hill, “Three Other Models of Computer System Performance,” 2019. Available at https://arxiv.org/pdf/1901.02926.pdf .

• S. Williams et al.,“Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures,” CACM, April 2008.

• A. Ilic et al., “Cache-Aware Roofline Model:

Upgrading the Loft,” IEEE CAL, 2014.

• A. Mohan, “Understanding Roofline Charts,”

http://telesens.co/2018/07/26/....

References

(35)

Performance Summary and Sampling

Ref. Chap. 12, The Art of Computer Systems Performance Analysis (by R. K. Jain)

(36)

Performance Summary

⚫ How to characterise with a single number?

⚫ Arithmetic mean =

⚫ Weighted AM =

Computer A Computer B Program 1 1 10

Program 2 1000 100 Total time 1001 110

= n

n i ₁Timeⁱ 1

= n i 1

i i x Time w

(37)

Summary by a Single Number

⚫ Sample mean vs. Sample median vs. Sample mode

(38)

Mean vs. Median vs. Mode

Examples:

◆ Most frequently-used microprocessor in smartphones

◆ Average CPU time for each triangle drawing

◆ Average system configuration

(39)

Common Misuses of Means

⚫ Using mean of significantly different values

⚫ Mean of CPU time for 10 ms and 1000 ms

⚫ Using mean without regard to the skewness of distribution

(40)

MySQL Command Latency

[Source]http://www.brendangregg.com/FrequencyTrails/mean.html

(41)

Common Misuses of Means

⚫ Multiplying means to get the mean of a product

⚫ E(xy) = E(x)E(y) if x and y are independent

⚫ E.g., the ave. no. of users = 23, the ave. no. of subprocesses per user

= 2, what is the ave. no. of subprocesses?

⚫ Taking a mean of a ratio with different bases

⚫ More discussions to follow

(42)

Geometric Mean

⚫ Used if the product of the observations is a quantity of interest

⚫ E.g., Improvement in each layer of network protocol

7 1.18 × 1.13 × 1.11 × 1.08 × 1.10 × 1.28 × 1.05 − 1

= 0.13

(43)

Multiplicativity Property in GM

⚫ GM of a ratio is the ratio of the GMs of the numerator and denominator

⚫ The choice of the base does not change the conclusion

(44)

Harmonic Mean

⚫ Should be used whenever an arithmetic mean can be justified for 1/xi

⚫ Example: Average MIPS rate for the processor

⚫ Assume that the benchmark has m million instructions.

⚫ : MIPS computed from the i-th measurement

(45)

⚫ Normalized execution time is handy for scaling performance

P1 P2 relative to A relative to B

secs secs P1 P2 AM P1 P2 AM

A 10 50 1 1 1 0.1 50 25 B 100 1 10 0.02 5 1 1 1

⚫ From total execution time, A is faster than B by the factor of 101/60 = 1.7 (equal # of runs of P1 and P2)

⚫ Arith. Mean of relative time say either

⚫ A is 5 times faster than B

⚫ B is 25 time faster than A

Normalized Performance

Do not summarize relative execution time with Arith. Mean.

(46)

n

⚫ Geometric mean used to summarize relative measure P1 P2 relative to A relative to B

secs secs P1 P2 GM P1 P2 GM

A 10 50 1 1 1 0.1 50 2.2 B 100 1 10 0.02 0.45 1 1 1

⚫ Relative to A : B is 2.2X faster

⚫ Relative to B : B is 2.2X faster

⚫ (BUT) two unfortunate properties

⚫ Drawback 1: does not track execution time

⚫ Drawback 2: rewards all improvements equally

⚫ On machine B, contribution of 50 sec in P1 = contribution of 0.5 sec in P2.

Geometric Mean   Exec. Time Ratio

_i

(47)

Mean of a Ratio: CASE 1

Both sum of numerators and sum of denominators have a physical meaning.

→ The average of the ratio is the ratio of the averages

(48)

Example of Case 1

(49)

Mean of a Ratio: CASE 2

Sum of numerators has a physical meaning && the denominator is a constant

→ The arithmetic mean of the ratios can be used

(50)

Mean of a Ratio: CASE 3

Sum of the denominators has a physical meaning && the numerators are constant

→ The harmonic mean of the ratios can be used

(51)

Mean of a Ratio: CASE 4

If ai = c * bi where c is approximately constant, c can be estimated by the geometric mean of ai/bi

A ratio of sums??

1. Workload sizes are widely different (119: 8612)

2. ai = c * bi

(52)

SPEC Benchmarking Process

• For each SPEC benchmark program p ( ), measure T_p.

• Compute geometric mean:

N N

p



=1 p p) reference(

T T

N p 1 

SPECRatio

B A A

B B

A

e Performanc

e Performanc ime

ExecutionT

ime ExecutionT

SPECRatio

SPECRatio = =

(53)

SPEC 89

• Compiler enhancements and performance

0 100 200 300 400 500 600 700 800

tomcatv fpppp

matrix300 eqntott

li nasa7

doduc spice

espresso gcc

Benchmark

Compiler

Enhanced compiler

SPEC performance ratio

over

VAX-11/780

(54)

Combining Cost and Performance

⚫ Bandwidth per pin vs. Execution time per dollar

⚫ Can divide cost into performance if the choice of metric for performance grows in the opposite direction as the metric for cost

⚫ Must multiply cost and performance if the choice of metric for performance grows in the same direction as the metric for cost

(55)

More Examples

⚫ Transactions per second per dollar

⚫ Request latency per pin

Execution-time-dollars

Request-latency-pin-count product

(56)

Pareto Optimality

⚫ When multi-dimensional design spaces are considered

⚫ Using a single number is not really appropriate

⚫ Unless we know exact weights of different metrics or

⚫ Unless we care about one aspect to the exclusion of all others

⚫ Pareto Optimality

⚫ One candidate solution to a problem is better than another candidate solution if the first dominates the second

⚫ The first is better than or equal to the second in all figures of merit

(57)

(58)

Pareto Plot vs. Cost-Performance

(59)

Sampled Average

⚫ If sampling for data/X,

⚫ Must sample this metric in equal steps of dimension X.

⚫ Miles per gallon

⚫ Cycles per instruction

⚫ …

(60)

Example: Miles Per Gallon

(61)

Sampling over Time

(62)

Sampling over Distance

(63)

Performance Analysis Model