Special Topics in Embedded Systems (Embedded System Optimization)
Course Information
[4190.763(001)]
Jihong Kim
Department of Computer Science & Engineering Seoul National University
Instructor
Video meetings by appointment via email
Research areas:
Flash Storage Systems Low-Power Systems
Embedded Systems & Software Computer Architecture
880-8792
jihong@davinci.snu.ac.kr
Today’s Lecture
• Course Overview & Information
Computer Architecture: Chapter 0.
Computers Everywhere
Dramatic progress in terms of
Embedded Systems
• Often focused on a particular application area
• Audio processing
• Video processing
• Graphics
• Network/Communication
• DNN
• …
• Their optimized design/implementation is an important market requirement.
Evaluation Metrics
• Performance
• Power
• Cost
Device & Technology Landscape
• We’re running into Walls
End of Scaling in Technology
SOS from Devices
Device innovations alone are not enough.
That is, when device-level (closed) optimizations reach their limits, vertically integrated cross-layer optimization becomes necessary.
대표적인 사례: Power-Aware Computing
Applications
Algorithms
Operating Systems
Compiler
Computer Organization
Transistor/Circuit/Logic
Main Roles of Optimizations
Role #1: Ensure that the benefits of device innovations are delivered to users without obstruction
Role #2: Prevent device-level problems from being exposed to users
Course Goal
• Study common optimization techniques used in embedded systems.
• Performance optimizations
• Power optimizations
• Memory size optimizations
• Study specific optimization techniques used in
NAND flash-based storage systems.
Course Overview
• Consists of three parts
• Part 1: Performance Basics
• Part 2: Performance/Power/Memory Optimizations
• Part 3: Flash-aware computing
• Prerequisite:
• Computer Architecture
• Operating Systems
Evaluation
• 3 homework assignments: 30%
• 2 exams: 60% (midterm: 30%, final: 30%)
• 1 project: 10%
• Homework:
• 3 labs (Linux environment)
• Development of performance-evaluation tools and hands-on optimization case studies
• Exams:
• Closed book
• In person (planned)
• Project:
• Free topic
• Must include an optimization component
Cheating Policy
• For any type of cheating, a grade of F will be given.
• The college and university, as well as YOUR ADVISOR, will be notified of cheating activities for further disciplinary action.
Performance Observability
Computer Engineering Methodology
A design cycle:
• Evaluate existing systems for bottlenecks — using benchmarks
• Simulate new designs — guided by technology trends
• Implement the next-generation system — constrained by implementation complexity
Starting Point of Optimization
⚫ Make the common case fast!!
⚫ How to find the common case?
⚫ How to make it fast?
⚫ Optimization requires information exchange across abstraction layers
⚫ Layer 2's common case vs. Layer 1's common case
Amdahl's Law

Execution Time After Improvement =
Execution Time Unaffected + (Execution Time Affected / Amount of Improvement)

⚫ Example:
"Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?"
⚫ How about making it 5 times faster?
Corollary: Make the common case fast
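The worked example can be checked numerically; a small Python sketch of Amdahl's Law as stated above:

```python
def time_after_improvement(unaffected, affected, speedup):
    """Amdahl's Law: execution time after speeding up the affected part."""
    return unaffected + affected / speedup

# Program: 100 s total, 80 s in multiply (affected), 20 s unaffected.
# Target: 4x faster -> 25 s total, so multiply may take 25 - 20 = 5 s.
# Required speedup of multiply: 80 / 5 = 16x.
required = 80 / (100 / 4 - 20)
print(required)  # 16.0

# 5x faster -> 20 s total, but the unaffected part alone takes 20 s:
# multiply would have to take 0 s, so no finite speedup suffices.
```

This is exactly why the common case matters: the unaffected portion bounds the achievable speedup.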
Source of Performance Issues
⚫ Load vs. Architecture?
⚫ Can perform badly because of too much load applied or
⚫ Can perform badly because of an issue with SW or HW configurations
Observability Tools
⚫ System-wide vs. Per-process
⚫ Counters vs. Tracing
⚫ Hardware events and software events
⚫ # of instructions retired, # of CPU migrations
⚫ Static vs. dynamic tracepoints
Perf Example
Linux Performance Observability Tools
Software Performance Analysis Tools and Techniques
⚫ Static Analysis Tool
⚫ Dynamic Analysis Tool
⚫ Static vs. Dynamic Analysis Tool?
⚫ Non-binary modifications vs. binary-level alterations
Reference: http://www.cse.wustl.edu/~jain/cse567-06/ftp/sw_monitors1/index.html
Static Analysis Tool
Static Analysis Tools
  Features:
    1. Source code modifications or sampling
    2. Many can obtain kernel-, library-, and HW-level information
    3. Simple to configure
  Shortcomings:
    1. Sampling: results are approximate
    2. Limited types of events and values can be monitored
  Example tools: gprof, qprof, OProfile, PerfSuite, VTune
Compile-time Instrumentation Tools
⚫ Compile-time Instrumentation tools (CITs)
⚫ Oldest, stable
⚫ First developed during early 1980’s
⚫ Still used widely
⚫ Requires the entire source of the application
⚫ The source is processed by a specialized compiler to instrument it
⚫ Counters or monitor-function calls are inserted into the existing function-call structure
⚫ While executing the instrumented binary, statistical data triggered by the inserted statements is recorded
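As a loose Python analogy (hypothetical, not an actual CIT, which rewrites compiled code), inserting monitor-function calls into the existing call structure can be sketched with a decorator that records per-function call counts:

```python
import functools

call_counts = {}  # statistical data recorded by the inserted monitor code

def instrument(func):
    """Wrap a function with a monitoring call, much as a CIT inserts
    counters into the existing function-call structure at compile time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_counts[func.__name__] = call_counts.get(func.__name__, 0) + 1
        return func(*args, **kwargs)
    return wrapper

@instrument
def multiply(a, b):
    return a * b

for i in range(10):
    multiply(i, i)
print(call_counts)  # {'multiply': 10}
```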
Compile-time Instrumentation Tools
Operational Description of a Typical CIT
CITs - Gprof
⚫ Gprof
⚫ CIT and sampling tool
⚫ CIT function of Gprof
⚫ Augmenting compiler
⚫ Insert a monitoring function at compile time
⚫ Data analysis wrapper
⚫ Instrumented binary is executed via the wrapper
Sampling Tools
⚫ No need to have source code
⚫ Timer based sampling
⚫ Pauses execution at specific points
⚫ Record system state
Sampling Tools
⚫ Timer-based sampling
⚫ Relies on sampling period
⚫ Statistical accuracy vs. execution time
⚫ Too coarse-grained → missing program state
⚫ Too often → program runs very slowly
⚫ A variety of sampling rates can be tuned during an analysis
⚫ Software-based triggering mechanism
⚫ Activation can often be delayed due to other interactions
⚫ User-level tools cannot monitor kernel level performance issues
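The timer-based sampling idea can be sketched in Python (a toy, hypothetical profiler that uses a background thread instead of a real timer interrupt): the sampler periodically records which function the target thread is executing, and the period sets the accuracy/overhead trade-off:

```python
import sys
import threading
import time
from collections import Counter

samples = Counter()
stop = threading.Event()

def sampler(target_thread_id, period):
    # Timer-based sampling: periodically record the program state
    # (here, the function at the top of the target thread's stack).
    # A shorter period improves accuracy but adds overhead.
    while not stop.is_set():
        frame = sys._current_frames().get(target_thread_id)
        if frame is not None:
            samples[frame.f_code.co_name] += 1
        time.sleep(period)

def busy_work():
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total

main_id = threading.get_ident()
t = threading.Thread(target=sampler, args=(main_id, 0.001))
t.start()
busy_work()
stop.set()
t.join()
print(samples.most_common(1))  # most samples should land in busy_work
```

Note that, like the user-level tools above, this cannot observe kernel-level activity.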
Sampling Tools
⚫ qprof
⚫ http://www.hpl.hp.com/research/linux/qprof/
⚫ User-level sampling tool
⚫ Linux based
⚫ To run
⚫ Setup some environment variables
⚫ Issue qprof_start
⚫ To stop, issue qprof_stop
$ source alias.csh
$ setenv QPROF_COLOR green
$ qprof_start
$ ./a.out
qprof: /home/hboehm/qprof/a.out: 118 samples, 118 counts
main:dumb_test.c:34   47 ( 40%)
main:dumb_test.c:35   47 ( 40%)
main:dumb_test.c:36   24 ( 20%)
$ qprof_stop
Hardware Counter Tools
⚫ Recent processors have a number of on-chip event counters
⚫ Exploit the ability to program these counters – developers can easily select the types of events
⚫ After that, the HCT acts as a wrapper around the chosen application
⚫ Virtual counter support
⚫ Counter overflow support
⚫ Multiplexing – a virtually unlimited number of counters
⚫ Statistical sampling
⚫ Multiple runs of the target application with different subsets of events
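When more events are requested than there are physical counters, the raw count from each time slice is commonly scaled up by the fraction of time the event was actually scheduled on the hardware (Linux perf exposes this as time_enabled vs. time_running). A sketch with hypothetical numbers:

```python
def scale_multiplexed(raw_count, time_enabled, time_running):
    """Estimate the true event count when the counter was only
    scheduled on the hardware for part of the measurement window."""
    if time_running == 0:
        return 0  # counter never ran; no estimate possible
    return round(raw_count * time_enabled / time_running)

# Hypothetical: the counter was live for 25 ms of a 100 ms window
# and observed 1,000,000 cache misses while scheduled.
estimate = scale_multiplexed(1_000_000, 100_000_000, 25_000_000)
print(estimate)  # 4000000
```

The estimate assumes the event rate is uniform over the run, which is why multiplexed counts are statistical rather than exact.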
Dynamic Analysis Tool
Dynamic Analysis Tools
  Features:
    1. Binary-level instrumentation
    2. Some can obtain both library- and kernel-level information
    3. Wide range of values with customizable instrumentation routines
  Shortcomings:
    1. Static probing: static probes are limited
    2. Relatively high overhead
    3. Instrumentation routines are not compatible with other tools
    4. Often tied to a specific OS
  Example tools: Pin, DTrace
Dynamic analysis tools
⚫ Binary-level alterations
⚫ Runtime change of code to gather statistical data
⚫ Two types of dynamic analysis
⚫ Binary instrumentation
⚫ Probing
⚫ Alters binaries online
⚫ Slows execution
⚫ Can perturb low-level statistics such as instruction pipelining and cache contents
Binary Instrumentation Tools
⚫ Binary modification techniques
⚫ Inject analysis routines into an application while it is executing
⚫ How?
⚫ The BIT first gets the application's instruction stream
⚫ Via OS support, by hooking the application's memory space
⚫ Injects the analysis routines
⚫ Resumes the application with the modified code
Operational Example of BIT
Dynamic Tracing
⚫ Overcome the shortcomings of static probes
⚫ Candles at fixed locations vs. a flashlight you can carry
⚫ Example: DTrace
⚫ Support both static and dynamic probing of user- and kernel-level SW in real time
⚫ When probes are hit, arbitrary actions may be performed in its D language
Static vs. Dynamic Tracing in DTrace
Dynamic tracing to biodone()
BITs: Pin
⚫ Dynamic binary instrumentation
⚫ Do not need source code, recompilation, post-linking
⚫ Instrumentation is performed by a JIT compiler
⚫ Pin intercepts the execution of the executable, generates new instrumented code, and runs it instead.
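As a loose Python analogy (Pin itself is a C/C++ JIT-based framework), attaching an analysis routine to running code can be sketched with sys.settrace, which invokes a callback on each traced event, much as a Pin analysis routine runs at each instrumented point:

```python
import sys

executed_lines = 0  # analysis state, like inscount's instruction counter

def analysis(frame, event, arg):
    """Analysis routine: called by the runtime at every traced event."""
    global executed_lines
    if event == "line":
        executed_lines += 1
    return analysis

def workload():
    total = 0
    for i in range(5):
        total += i
    return total

sys.settrace(analysis)   # attach the instrumentation
result = workload()
sys.settrace(None)       # detach
print(executed_lines, result)
```

No source changes or recompilation of `workload` are needed, mirroring the point above about dynamic binary instrumentation.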
Instrumentation routine
Analysis routine
Pin JIT execution loop:
1. Starting at the first application IP, read a trace from the application code
2. JIT it, adding instrumentation code from inscount.dll
3. Encode the trace into the code cache
4. Execute the jitted code until execution of the trace ends
5. Call into PINVM.DLL to JIT the next trace, passing in the application IP of the trace's target
6. A source trace's exit branch is modified to branch directly to the destination trace
Pin Invocation

pin.exe -t inscount.dll -- gzip.exe input.txt
Count 258743109

A PinTool that counts the application instructions executed and prints Count at the end.

[Figure: The launcher process (PIN.EXE) starts the application via CreateProcess(gzip.exe, input.txt, suspended), injects the Pin boot routine and its data (firstAppIp, "inscount.dll") with WriteProcessMemory(BootRoutine, BootData), and redirects execution with GetContext(&firstAppIp)/SetContext(BootRoutineIp). Resuming at the boot routine, the application process loads PINVM.DLL and starts it running with (firstAppIp, "inscount.dll"); PINVM.DLL loads inscount.dll and runs its main(). PINVM.DLL (decoder/encoder, code cache, system call/event/thread dispatchers, PIN.LIB, sitting above NTDLL.DLL and the Windows kernel) then reads traces from the application code starting at the first application IP, JITs them with instrumentation code from inscount.dll, and encodes the jitted traces into the code cache.]
Overview of
Performance Monitoring Units
(& Hardware Performance Counters)
PMU Overview
⚫ What is Performance Monitoring Unit (PMU)?
⚫ A special set of registers which count a number of different micro-architectural events
⚫ Explains run-time behavior of a program at micro-architectural level
⚫ All modern processors have PMUs
⚫ X86-family: IA-32, IA-64, AMD64
⚫ ARM: XScale, Cortex
Performance Monitoring Registers
⚫ Clock counter
⚫ Performance counter
⚫ Performance monitor control register
Event Types
⚫ Occurrence events:
⚫ whenever the event happens, counter++
⚫ Duration events:
⚫ while (ConditionIsTrue) counter++
⚫ Performance monitoring examples
⚫ Clock counter → total number of cycles during monitoring
⚫ Performance counter → duration-event counting
⚫ % of event duration = (performance counter / clock counter) * 100
Extending Count Duration > 32 bits
⚫ Overflow interrupts whenever Counter reaches the max value (if enabled)
⚫ Counting continues after an overflow until it is disabled by the ISR
⚫ A typical ISR will accumulate the number of overflows in a memory location.
⚫ ISR overhead is negligible:
⚫ Overhead = tens of cycles << 2^32
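The software-extended counter can be sketched as follows (hypothetical values; on real hardware the ISR would also read and clear the overflow flags in PMNC):

```python
COUNTER_BITS = 32
COUNTER_MAX = 1 << COUNTER_BITS  # hardware counter wraps at 2**32

overflows = 0  # memory location updated by the overflow ISR

def overflow_isr():
    """Overflow ISR: accumulate the number of overflows."""
    global overflows
    overflows += 1

def extended_count(hw_counter):
    """Reconstruct the full event count from the 32-bit HW counter
    plus the overflow count recorded by the ISR."""
    return overflows * COUNTER_MAX + hw_counter

# Simulate 3 overflows, then read the HW counter at 1000.
for _ in range(3):
    overflow_isr()
print(extended_count(1000))  # 12884902888  (3 * 2**32 + 1000)
```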
PMU Support in ARM Processors
⚫ Different in the number of H/W Performance Counters
⚫ XScale core (2 counters)
⚫ ARM11 core (2 counters)
⚫ Cortex-R4 (3 counters)
⚫ Cortex-A8 (4 counters)
⚫ Different in the types of events collected
⚫ Architecture specific
⚫ L1 cache, L2 cache
⚫ In-order, out-of-order
PMU Support in XScale Processor
⚫ Overview
⚫ Two Hardware Performance Counters
⚫ 14 different PMU events
⚫ PMU Registers (accessible through CP14)
⚫ CCNT (Clock Counter)
⚫ PMNC (Performance Monitor Control Register)
⚫ PMN0, PMN1 (Performance Count Registers)
[Intel XScale Core Developers Manual]
PMU Events in XScale Processor
PMU event                Description                                       Value
ICACHE_MISS              instruction cache miss                            0x0
ICACHE_CANNOT_DELIVER    instruction cache cannot deliver an instruction   0x1
DATA_STALL               stall due to a data dependency                    0x2
ITLB_MISS                instruction TLB miss                              0x3
DTLB_MISS                data TLB miss                                     0x4
BRANCH_INST_EXEC         branch instruction executed                       0x5
BRANCH_MISPREDICTED      branch mispredicted                               0x6
INST_COUNT               instruction executed                              0x7
DBUFFER_STALL_DURATION   stall because the data cache buffers are full     0x8
DBUFFER_STALL            stall because the data cache buffers are full     0x9
DCACHE_ACCESS            data cache access                                 0xA
DCACHE_MISS              data cache miss                                   0xB
DCACHE_WRITEBACK         data cache write-back                             0xC
PC_CHANGED               software changed the PC                           0xD
Clock Counter (CCNT)
⚫ Counts core clock cycles.
⚫ If the max value (0xFFFFFFFF) is reached,
⚫ Reset to 0
⚫ Overflow bit (PMNC bit 10) set to 1
⚫ IRQ/FIQ
⚫ If Enabled (PMNC bit 6)
Performance Monitor Control Reg. (PMNC)
⚫ 19:12/27:20 Event Type for PMN0/1
⚫ 10:8 Overflow/Interrupt flag
⚫ Identifies which counter overflowed
⚫ bit 10: CCNT, bit 8/9: PMN0/1
⚫ 6:4 Interrupt Enable
⚫ Bit 6: CCNT, bit 4/5: PMN0/1
⚫ 1 = enable , 0 = disable
⚫ 3(D) clock counter divider
⚫ 0 = count every cycle, 1 = count every 64 cycles
⚫ 2(C) clock counter reset
⚫ 0 = no action, 1 = reset CCNT to 0
⚫ 1(P) performance counter reset
⚫ 0 = no action, 1 = reset both PMN0/1 to 0
⚫ 0(E) enable
⚫ All three counters disable (0)/enable (1)
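Following the bit layout above, a PMNC value can be composed in software; a sketch (the actual register access would use CP14 coprocessor instructions, omitted here):

```python
def make_pmnc(evt0, evt1, enable=True, reset_perf=True, reset_clock=True,
              irq_ccnt=False, irq_pmn0=False, irq_pmn1=False, div64=False):
    """Compose a PMNC value per the XScale layout above:
    bits 19:12 = event type for PMN0, bits 27:20 = event type for PMN1,
    bits 6/5/4 = interrupt enables (CCNT/PMN1/PMN0), bit 3 = clock divider,
    bit 2 = CCNT reset, bit 1 = PMN0/1 reset, bit 0 = enable."""
    value = 0
    value |= (evt1 & 0xFF) << 20
    value |= (evt0 & 0xFF) << 12
    value |= int(irq_ccnt) << 6
    value |= int(irq_pmn1) << 5
    value |= int(irq_pmn0) << 4
    value |= int(div64) << 3
    value |= int(reset_clock) << 2
    value |= int(reset_perf) << 1
    value |= int(enable)
    return value

# CPI-monitoring setup: PMN0 counts instructions (0x7), PMN1 counts
# I-cache misses (0x0); reset and enable all counters.
pmnc = make_pmnc(evt0=0x7, evt1=0x0)
print(hex(pmnc))  # 0x7007
```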
Performance Monitoring Events
⚫ 0x0: I-cache miss (O)
⚫ 0x1: I-cache miss or I-TLB miss (D)
⚫ 0x2: stall due to a data dependency (D)
⚫ 0x3: I-TLB miss (O), 0x4: D-TLB miss (O)
⚫ 0x5: branch executed (O), 0x6: branch mispredicted (O)
⚫ 0x7: instruction executed (O)
⚫ 0x8/0x9: stall because the data cache buffers are full (D/O)
⚫ 0xA: D-cache access (O), 0xB: D-cache miss (O)
⚫ 0xC: D-cache write-back (O)
⚫ 0xD: S/W changed the PC (O)
Performance Monitoring Examples
Example: I-Cache Efficiency & CPI Monitoring
⚫ PMN0 Instruction Count
⚫ PMNC.evtCount0 = 0x7
⚫ PMN1 # of I-cache misses
⚫ PMNC.evtCount1 = 0x0
⚫ CCNT Total number of cycles
⚫ I-cache miss rate = PMN1/PMN0
⚫ CPI (cycles per ins.) = CCNT/PMN0
Example: D-Cache Efficiency
⚫ PMN0 # of D-cache accesses
⚫ PMNC.evtCount0 = 0xA
⚫ PMN1 # of D-cache misses
⚫ PMNC.evtCount1 = 0xB
⚫ D-cache miss rate = PMN1/PMN0
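Deriving the metrics in the two examples above from raw counter readings (hypothetical values for illustration):

```python
def miss_rate(misses, accesses):
    """Cache miss rate = misses / accesses (e.g., PMN1 / PMN0)."""
    return misses / accesses

def cpi(cycles, instructions):
    """Cycles per instruction = CCNT / instruction count."""
    return cycles / instructions

# Hypothetical readings after a monitoring run:
ccnt = 2_000_000   # total cycles
pmn0 = 1_000_000   # instructions executed (event 0x7)
pmn1 = 50_000      # I-cache misses (event 0x0)
print(cpi(ccnt, pmn0))        # 2.0
print(miss_rate(pmn1, pmn0))  # 0.05
```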
PMU Control Scenario

⚫ Procedure (over time: program start → monitoring start → monitoring end → program end):
1. Configure PMU (configuring PMNC)
   -- PMU events
   -- Overflow flags
   - …
2. Enable PMU
   -- PMNC.enable = 1
3. Interrupt handling (when an overflow occurs)
   -- Handle overflow
4. Disable PMU and compute results
   -- Read PMU registers
   -- PMNC.enable = 0
BPF Performance Tools
Reference:
Brendan Gregg, Chap. 2 of BPF Performance Tools, Addison-Wesley Professional, 2019
Berkeley Packet Filter (BPF)
⚫ Originated as a technology for optimizing packet filters
Extended Berkeley Packet Filter (eBPF)
⚫ From kernel 4.1, enhancements to BPF allow it to do much more than just filter packets (the emergence of eBPF)
⚫ Enables the kernel to run mini programs on system and application events (e.g., disk I/O, exec() call,…)
⚫ Makes the kernel programmable by users (not kernel developers)
⚫ These enhancements allow custom analysis programs to be executed on Linux dynamic tracing, static tracing,
and profiling events
⚫ These days, BPF and eBPF are used interchangeably.
Support for In-Kernel Processing
⚫ Example Tracing Tool: bitehist
Before BPF
Using BPF
BPF Illustration
BPF In-Kernel Verifier
⚫ Risks in allowing user-space code to run inside the kernel.
⚫ BPF programs are checked before being loaded:
⚫ Must terminate (no unbounded loops)
⚫ No out-of-bounds jumps / no out-of-range data accesses
⚫ Type, alignment, and bounds rules enforced for pointer arithmetic
⚫ No reads from uninitialized registers
⚫ Callable kernel functions are limited by the BPF program type
⚫ E.g., BPF_PROG_TYPE_KPROBE /* kprobe */
Classic BPF vs. Extended BPF
⚫ Both are limited virtual machines
                 Classic BPF            Extended BPF
# Registers      2                      10 + one frame pointer
Register Width   32 bits                64 bits
Storage          M[0–15]                512-byte stack + infinite "map" storage
Event Targets    Packets, seccomp-BPF   Packets, kernel functions, user functions,
                                        tracepoints, user markers, PMCs
Kernel Calls     Very limited           Yes (restricted)
BPF Runtime Internals
Writing BPF Programs
⚫ BPF can be programmed in a high-level language.
⚫ BCC
⚫ Kernel instrumentation in C
⚫ Front-ends in Python and Lua
⚫ bpftrace
⚫ Uses LLVM as a backend to compile scripts to BPF bytecode
⚫ Uses BCC for interacting with the Linux BPF system as well as Linux tracing capabilities
BCC
from bcc import BPF
from time import sleep

# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

BPF_HISTOGRAM(dist);
BPF_HISTOGRAM(dist_linear);

int kprobe__blk_account_io_done(struct pt_regs *ctx, struct request *req)
{
    dist.increment(bpf_log2l(req->__data_len / 1024));
    dist_linear.increment(req->__data_len / 1024);
    return 0;
}
""")

# header
print("Tracing... Hit Ctrl-C to end.")

# trace until Ctrl-C
try:
    sleep(99999999)
except KeyboardInterrupt:
    print()

# output
print("log2 histogram")
…
print("\nlinear histogram")
bpftrace
# Files opened by process
bpftrace -e 'tracepoint:syscalls:sys_enter_open { printf("%s %s\n", comm, str(args->filename)); }'
# Syscall count by program
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
# Read bytes by process:
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret/ { @[comm] = sum(args->ret); }'
bpftrace
BEGIN
{
    printf("Tracing block device I/O... Hit Ctrl-C to end.\n");
}

kprobe:blk_account_io_start
{
    @start[arg0] = nsecs;
}

kprobe:blk_account_io_done
/@start[arg0]/
{
    @usecs = hist((nsecs - @start[arg0]) / 1000);
    delete(@start[arg0]);
}

END
{
    clear(@start);
}