Performance Analysis Model
Jihong Kim
Dept. of CSE, SNU
Performance Metrics
• Throughput
• Response Time
• Capacity
• Reliability
• Cost
• …
• Used for CPU, Memory, SSD, Network, ….
Memory Hierarchy
Higher Speed, Cost Larger
Size
Latency Numbers Every Programmer Should Know
Throughput
• IO rate
– IOPS (accesses/second)
– Used for applications where the size of each request is small
– e.g. transaction processing
• Data rate
– MB/Sec or bytes/Sec
– Used for applications where the size of each request is large
– e.g. scientific applications
Response Time
• How long a storage system takes to access data
• Depending on what you view as the storage system
– User’s perspective – OS’s perspective
– Disk controller’s perspective
Application
O/S
Storage
Disk I/O Performance
Response time = Queue + Device Service time Proc
Queue
IOC Device
Metrics:
Response Time Throughput
Q: how to balance response time &
throughput?
(e.g., adding more servers)
[Hennessy&Patterson ]
Knee of curve
e.g., disks at ~70%
• Better to understand key performance factors
• High-level insight on system performance
• Complementary to other performance evaluation methods:
– Measurement-based – Simulation-based
Performance Analytic Model
• Original Amdahl’s Law
– Useful in understanding an upper bound on overall speedup (insight with less accuracy!!)
Example Analytic Model
N - 1 1
N S
1 f
f (1 Nx
f(1 1 S
f 1 t
and t t
t f t
, N t
t
t S t
s p s
p s p s
s p
1
)) 1 1
1
= −
− +
=
− + =
= +
= +
S: speedup
N: no. of processors
f: fraction of sequential part tp and ts:
exec. time for parallel and sequential part
Example: 100 processors
• S = 100
• f = ?
• S = 99
• f = 1/(99*99) = 0.000102 N
- 1 1
N S
1 f
− 1
=
Amdahl’s law (in general)
Execution Time_after improvement
=
+ Execution Time_unaffected by improvement 𝐄𝐱𝐞𝐜𝐮𝐭𝐢𝐨𝐧 𝐓𝐢𝐦𝐞_𝑎𝑓𝑓𝑒𝑐𝑡𝑒𝑑 𝑏𝑦 𝑖𝑚𝑝𝑟𝑜𝑣𝑒𝑚𝑒𝑛𝑡
𝐀𝐦𝐨𝐮𝐧𝐭 𝐨𝐟 𝐢𝐦𝐩𝐫𝐨𝐯𝐞𝐦𝐞𝐧𝐭
• An upper bound on the throughput of a system from the throughput of its subsystems.
1. The max. throughput of K subsystems in parallel
= the sum of the subsystem throughputs.
2. The max. throughput of K subsystems in series
= the minimum of the subsystem throughputs.
Bottleneck Analysis
• Max Bandwidth = MIN(9, (4+3), 6, 10) = 6
• The most expensive subsystem should be bottlenecked while non-bottlenecked
throughputs should be decreased to save cost.
Example: Bottleneck Analysis
9 GB/S
4 GB/s 3 GB/s
6 GB/s 10 GB/s
• A visual way to identify performance bottlenecks and to find potential optimization directions
• Bottleneck analysis between
application characteristics and architectural capabilities
Roofline Model
FLOPS = #𝐹𝐿𝑂𝑃
𝑠𝑒𝑐 = #𝐹𝐿𝑂𝑃
𝑏𝑦𝑡𝑒 × 𝐵𝑦𝑡𝑒𝑠
𝑠𝑒𝑐
= 𝐴𝑟𝑖𝑡ℎ𝑚𝑒𝑡𝑖𝑐 𝐼𝑛𝑡𝑒𝑛𝑠𝑖𝑡𝑦 × 𝐵𝑎𝑛𝑑𝑤𝑖𝑑𝑡ℎ
= AI × 𝐵𝑊
Attainable FLOPS = min(AI x BW, Peak FLOPS)
application
characteristics
architectural
capabilities
• The number of operations per unit data
• Algorithm specific
• Example AI:
– Matrix Addition of two N x N matrices
• Total number of data accesses = 2NN (reads) + NN (writes)= 3NN
• Total number of operations = NN (adds)
• 𝐴𝐼 = 𝑁𝑁
3𝑁𝑁 = 1
3 (constant)
– Matrix Multiplication of two N x N matrices
• 𝐴𝐼 = 2𝑁 −1 ×𝑁𝑁
3𝑁𝑁 = 2𝑁−1
3 (as N gets higher, AI increases)
Arithmetic Intensity = #𝐹𝐿𝑂𝑃
𝑏𝑦𝑡𝑒
𝐹𝐿𝑂𝑃𝑆 = AI × 𝐵𝑊
log 𝐹𝐿𝑂𝑃𝑆 = log 𝐴𝐼 + log 𝐵𝑊 Y = X + C
Roofline Graph
log 𝐴𝐼 log 𝐹𝐿𝑂𝑃𝑆
log 𝐵𝑊 𝑃𝑒𝑎𝑘 𝐹𝐿𝑂𝑃𝑆
Compute Bound BW Bound
Alg 1: Move Right (higher data reuse) Move Up (higher BW memory) Alg 2: Move Up (more powerful CPU)
Optimization Guidelines
Roofline Model w/ Ceilings
• Original Roofline Model
– AI is computed over bytes moved out of DRAM.
• Cache-aware Roofline Model
– AI is computed over the total number of bytes moved to the CPU. (That is, L1-cache AI)
Cache-Aware Roofline Model
AI varies for a given algorithm:
(e.g., different loop iterations or different optimizations)
AI is constant.
• In the original roofline model, if an application is in the memory bound region, it is difficult to achieve the maximum attainable performance.
Cache-aware vs Original Roofline
• Large red dot: takes up the most time,
→ candidate for optimization
Cache-Aware Roofline Model in Intel
Advisor
Little’s Law: L = λ W
L = average number of items in the queuing system
W = average waiting time in the system for an item (i.e. latency) λ = average number of items arriving per unit time
(i.e., bandwidth)
L = (average throughput) x (average latency)
Queuing System:
Items in Queue &
Items in Service
Arrivals Departures
If you know two, one can be easily computed
n(t) = the number of items in queue at time t T = a long period of time
A(T) = the area under n(t) over T N(T) = the number of arrivals in T
Intuitive Argument
λ(T) = N(T)/T L(T) = A(T)/T
W(T) = A(T)/N(T)
→L(T) = λ(T)W(T)
𝑇→∞
lim L 𝑇 = L
𝑇→∞
lim λ 𝑇 = λ
𝑇→∞
lim W 𝑇 = W
Example 1: 평균재학기간
• 매년 신입생이 2000명인 학교에서
– λ = 2000/년 (avg throughput)
• 평균적으로 학교를 다니는 학생이 3000명이면
– L = 3000
• 학생 1인당 학교에 다니는 기간은? (avg latency)
– W = 3000/2000 = 1.5년
Example 2: Disk Utilization
For a single disk,
every second, 50 I/O requests are coming in
&
avg disk service time = 10 ms, what is the disk utilization?
답: 50 x 0.01
= 0.50 I/O
Example 3: Buffer Requirements
Consider a cache memory with
the cache miss ratio = 6.25% &
the avg miss latency = 100 ns.
Q: If this cache receives 2 memory references per cycle at 2.5 GHz, how many buffers are needed for recording outstanding misses?
답: avg throughput =
2 refs/cycle x 2.5G cycles/s x 0.0625 misses/ref
= 0.3125 Gmisses/s
L = 0.3125 Gmisses/s x 100 ns = 31.25 misses
M/M/1 Queue
M: exponential interarrival times with mean 1/R M: exponential service times with mean S
1: Single server
M/M/1 queue will reveal trade-off among
(a) allowing unscheduled task arrivals, (b) minimizing latency , and (c ) maximizing throughput
L = Q + S
Q
• R/(1/S) = RS (1/S > R, RS < 1)
• From queuing theory, y = 1/(1-x)
M/M/1 Queue Latency vs. Throughput Tradeoff
Normalized average latency
frac. of max. throughput, 𝑅1
𝑆
𝐿 𝑆
High Throughput
→ High Latency
Low Throughput
→ Low Latency
Big Impact on Latency
X: 0.25 0.50 0.75 0.90
Y: 1.33 2.00 4.00 10.0
Q: When R = 10 packets/s, if we want to keep L = 100 ms, what S should be?
• Trial 1: Since 1/R = 100 ms/packet, how about S = 100 ms??
• In M/M/1, S = 50 ms with Q = 50 ms Solve for S in L/S = 1/(1-RxS)
1/10S = 1/(1 – 10S) 1 – 10S = 10S
S = 1/20 sec ( RxS = 0.5 → 50% of max throughput)
Example: M/M/1
• M. Hill, “Three Other Models of Computer System Performance,” 2019. Available at https://arxiv.org/pdf/1901.02926.pdf .
• S. Williams et al.,“Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures,” CACM, April 2008.
• A. Ilic et al., “Cache-Aware Roofline Model:
Upgrading the Loft,” IEEE CAL, 2014.
• A. Mohan, “Understanding Roofline Charts,”
http://telesens.co/2018/07/26/....
References
Performance Summary and Sampling
Ref. Chap. 12, The Art of Computer Systems Performance Analysis (by R. K. Jain)
Performance Summary
⚫ How to characterise with a single number?
⚫ Arithmetic mean =
⚫ Weighted AM =
Computer A Computer B Program 1 1 10
Program 2 1000 100 Total time 1001 110
= n
n i 1Timei 1
= n i 1
i i x Time w
Summary by a Single Number
⚫ Sample mean vs. Sample median vs. Sample mode
Mean vs. Median vs. Mode
Examples:
◆ Most frequently-used microprocessor in smartphones
◆ Average CPU time for each triangle drawing
◆ Average system configuration
Common Misuses of Means
⚫ Using mean of significantly different values
⚫ Mean of CPU time for 10 ms and 1000 ms
⚫ Using mean without regard to the skewness of distribution
MySQL Command Latency
[Source]http://www.brendangregg.com/FrequencyTrails/mean.html
Common Misuses of Means
⚫ Multiplying means to get the mean of a product
⚫ E(xy) = E(x)E(y) if x and y are independent
⚫ E.g., the ave. no. of users = 23, the ave. no. of subprocesses per user
= 2, what is the ave. no. of subprocesses?
⚫ Taking a mean of a ratio with different bases
⚫ More discussions to follow
Geometric Mean
⚫ Used if the product of the observations is a quantity of interest
⚫ E.g., Improvement in each layer of network protocol
7 1.18 × 1.13 × 1.11 × 1.08 × 1.10 × 1.28 × 1.05 − 1
= 0.13
Multiplicativity Property in GM
⚫ GM of a ratio is the ratio of the GMs of the numerator and denominator
⚫ The choice of the base does not change the conclusion
Harmonic Mean
⚫ Should be used whenever an arithmetic mean can be justified for 1/xi
⚫ Example: Average MIPS rate for the processor
⚫ Assume that the benchmark has m million instructions.
⚫ : MIPS computed from the i-th measurement
⚫ Normalized execution time is handy for scaling performance
P1 P2 relative to A relative to B
secs secs P1 P2 AM P1 P2 AM
A 10 50 1 1 1 0.1 50 25 B 100 1 10 0.02 5 1 1 1
⚫ From total execution time, A is faster than B by the factor of 101/60 = 1.7 (equal # of runs of P1 and P2)
⚫ Arith. Mean of relative time say either
⚫ A is 5 times faster than B
⚫ B is 25 time faster than A
Normalized Performance
Do not summarize relative execution time with Arith. Mean.
n
⚫ Geometric mean used to summarize relative measure P1 P2 relative to A relative to B
secs secs P1 P2 GM P1 P2 GM
A 10 50 1 1 1 0.1 50 2.2 B 100 1 10 0.02 0.45 1 1 1
⚫ Relative to A : B is 2.2X faster
⚫ Relative to B : B is 2.2X faster
⚫ (BUT) two unfortunate properties
⚫ Drawback 1: does not track execution time
⚫ Drawback 2: rewards all improvements equally
⚫ On machine B, contribution of 50 sec in P1 = contribution of 0.5 sec in P2.
Geometric Mean Exec. Time Ratio
iMean of a Ratio: CASE 1
Both sum of numerators and sum of denominators have a physical meaning.
→ The average of the ratio is the ratio of the averages
Example of Case 1
Mean of a Ratio: CASE 2
Sum of numerators has a physical meaning && the denominator is a constant
→ The arithmetic mean of the ratios can be used
Mean of a Ratio: CASE 3
Sum of the denominators has a physical meaning && the numerators are constant
→ The harmonic mean of the ratios can be used
Mean of a Ratio: CASE 4
If ai = c * bi where c is approximately constant, c can be estimated by the geometric mean of ai/bi
A ratio of sums??
1. Workload sizes are widely different (119: 8612)
2. ai = c * bi
SPEC Benchmarking Process
• For each SPEC benchmark program p ( ), measure Tp.
• Compute geometric mean:
N N
p
=1 p p) reference(T T
N p 1
SPECRatio
B A A
B B
A
e Performanc
e Performanc ime
ExecutionT
ime ExecutionT
SPECRatio
SPECRatio = =
SPEC 89
• Compiler enhancements and performance
0 100 200 300 400 500 600 700 800
tomcatv fpppp
matrix300 eqntott
li nasa7
doduc spice
espresso gcc
Benchmark
Compiler
Enhanced compiler
SPEC performance ratio
over
VAX-11/780
Combining Cost and Performance
⚫ Bandwidth per pin vs. Execution time per dollar
⚫ Can divide cost into performance if the choice of metric for performance grows in the opposite direction as the metric for cost
⚫ Must multiply cost and performance if the choice of metric for performance grows in the same direction as the metric for cost
More Examples
⚫ Transactions per second per dollar
⚫ Request latency per pin
Execution-time-dollars
Request-latency-pin-count product
Pareto Optimality
⚫ When multi-dimensional design spaces are considered
⚫ Using a single number is not really appropriate
⚫ Unless we know exact weights of different metrics or
⚫ Unless we care about one aspect to the exclusion of all others
⚫ Pareto Optimality
⚫ One candidate solution to a problem is better than another candidate solution if the first dominates the second
⚫ The first is better than or equal to the second in all figures of merit
Pareto Plot vs. Cost-Performance
Sampled Average
⚫ If sampling for data/X,
⚫ Must sample this metric in equal steps of dimension X.
⚫ Miles per gallon
⚫ Cycles per instruction
⚫ …