Special Topics in Embedded Systems (Embedded System Optimization)
Course Information
[4190.763(001)]
Jihong Kim
Department of Computer Science & Engineering Seoul National University
Instructor
Video meetings by appointment via email
Research areas:
Flash Storage Systems Low-Power Systems
Embedded Systems & Software Computer Architecture
880-8792
jihong@davinci.snu.ac.kr
Today’s Lecture
• Course Overview & Information
Computer Architecture: Chapter 0.
Computers Everywhere
Dramatic progress in terms of
Embedded Systems
• Often focused on a particular application area
• Audio processing
• Video processing
• Graphics
• Network/Communication
• DNN
• …
• Their optimized design/implementation is an important market requirement.
Evaluation Metrics
• Performance
• Power
• Cost
Device & Technology Landscape
• We’re running into Walls
End of Scaling in Technology
SOS from Devices
Device innovations alone are not enough.
That is, when device-level (closed) optimizations reach their limits, vertically integrated cross-layer optimization becomes necessary.
대표적인 사례: Power-Aware Computing
Applications
Algorithms
Operating Systems
Compiler
Computer Organization
Transistor/Circuit/Logic
Main Roles of Optimizations
Role #1: Ensure that the benefits of device innovations are delivered to users without obstruction
Role #2: Prevent device-level problems from being exposed to users
Course Goal
• Study common optimization techniques used in embedded systems.
• Performance optimizations
• Power optimizations
• Memory size optimizations
• Study specific optimization techniques used in
NAND flash-based storage systems.
Course Overview
• Consists of three parts
• Part 1: Performance Basics
• Part 2: Performance/Power/Memory Optimizations
• Part 3: Flash-aware computing
• Prerequisite:
• Computer Architecture
• Operating Systems
Evaluation
• 3 homework assignments: 30%
• 2 exams: 60% (midterm: 30%, final: 30%)
• 1 project: 10%
• Homework:
• 3 labs (Linux environment)
• Development of performance-evaluation tools and hands-on optimization case studies
• Exams:
• Closed book
• In person (planned)
• Project:
• Free topic
• Must include an optimization component
Cheating Policy
• For any type of cheating, a grade of F will be given.
• The college and university, as well as YOUR ADVISOR, will be notified of cheating activities for further disciplinary action.
Performance Observability
Computer Engineering Methodology
A design cycle:
• Evaluate existing systems for bottlenecks — using benchmarks
• Simulate new designs — guided by technology trends
• Implement the next-generation system — constrained by implementation complexity
Starting Point of Optimization
⚫ Make the common case fast!!
⚫ How to find the common case?
⚫ How to make it fast?
⚫ Optimization requires information exchange across abstraction layers
⚫ Layer 2's common case vs. Layer 1's common case
Amdahl's Law

Execution Time After Improvement =
Execution Time Unaffected + (Execution Time Affected / Amount of Improvement)

⚫ Example:
"Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?"
⚫ How about making it 5 times faster?
Corollary: Make the common case fast
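The worked example can be checked numerically; a small Python sketch of Amdahl's Law as stated above:

```python
def time_after_improvement(unaffected, affected, speedup):
    """Amdahl's Law: execution time after speeding up the affected part."""
    return unaffected + affected / speedup

# Program: 100 s total, 80 s in multiply (affected), 20 s unaffected.
# Target: 4x faster -> 25 s total, so multiply may take 25 - 20 = 5 s.
# Required speedup of multiply: 80 / 5 = 16x.
required = 80 / (100 / 4 - 20)
print(required)  # 16.0

# 5x faster -> 20 s total, but the unaffected part alone takes 20 s:
# multiply would have to take 0 s, so no finite speedup suffices.
```

This is exactly why the common case matters: the unaffected portion bounds the achievable speedup.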
Source of Performance Issues
⚫ Load vs. Architecture?
⚫ Can perform badly because of too much load applied or
⚫ Can perform badly because of an issue with SW or HW configurations
Observability Tools
⚫ System-wide vs. Per-process
⚫ Counters vs. Tracing
⚫ Hardware events and software events
⚫ # of instructions retired, # of CPU migrations
⚫ Static vs. dynamic tracepoints
Perf Example
Linux Performance Observability Tools
Software Performance Analysis Tools and Techniques
⚫ Static Analysis Tool
⚫ Dynamic Analysis Tool
⚫ Static vs. Dynamic Analysis Tool?
⚫ Non-binary modifications vs. binary-level alterations
Reference: http://www.cse.wustl.edu/~jain/cse567-06/ftp/sw_monitors1/index.html
Static Analysis Tool
Static Analysis Tools
  Features:
    1. Source code modifications or sampling
    2. Many can obtain kernel-, library-, and HW-level information
    3. Simple to configure
  Shortcomings:
    1. Sampling: results are approximate
    2. Limited types of events and values can be monitored
  Example tools: gprof, qprof, OProfile, PerfSuite, VTune
Compile-time Instrumentation Tools
⚫ Compile-time Instrumentation tools (CITs)
⚫ Oldest, stable
⚫ First developed during early 1980’s
⚫ Still used widely
⚫ Requires the entire source of the application
⚫ The source is processed by a specialized compiler to instrument it
⚫ Counters or monitor-function calls are inserted into the existing function-call structure
⚫ While executing the instrumented binary, statistical data triggered by the inserted statements is recorded
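As a loose Python analogy (hypothetical, not an actual CIT, which rewrites compiled code), inserting monitor-function calls into the existing call structure can be sketched with a decorator that records per-function call counts:

```python
import functools

call_counts = {}  # statistical data recorded by the inserted monitor code

def instrument(func):
    """Wrap a function with a monitoring call, much as a CIT inserts
    counters into the existing function-call structure at compile time."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        call_counts[func.__name__] = call_counts.get(func.__name__, 0) + 1
        return func(*args, **kwargs)
    return wrapper

@instrument
def multiply(a, b):
    return a * b

for i in range(10):
    multiply(i, i)
print(call_counts)  # {'multiply': 10}
```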
Compile-time Instrumentation Tools
Operational Description of a Typical CIT
CITs - Gprof
⚫ Gprof
⚫ CIT and sampling tool
⚫ CIT function of Gprof
⚫ Augmenting compiler
⚫ Insert a monitoring function at compile time
⚫ Data analysis wrapper
⚫ Instrumented binary is executed via the wrapper
Sampling Tools
⚫ No need to have source code
⚫ Timer based sampling
⚫ Pauses execution at specific points
⚫ Record system state
Sampling Tools
⚫ Timer-based sampling
⚫ Relies on sampling period
⚫ Statistical accuracy vs. execution time
⚫ Too coarse-grained → missing program state
⚫ Too often → program runs very slowly
⚫ A variety of sampling rates can be tuned during an analysis
⚫ Software-based triggering mechanism
⚫ Activation can often be delayed due to other interactions
⚫ User-level tools cannot monitor kernel level performance issues
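The timer-based sampling idea can be sketched in Python (a toy, hypothetical profiler that uses a background thread instead of a real timer interrupt): the sampler periodically records which function the target thread is executing, and the period sets the accuracy/overhead trade-off:

```python
import sys
import threading
import time
from collections import Counter

samples = Counter()
stop = threading.Event()

def sampler(target_thread_id, period):
    # Timer-based sampling: periodically record the program state
    # (here, the function at the top of the target thread's stack).
    # A shorter period improves accuracy but adds overhead.
    while not stop.is_set():
        frame = sys._current_frames().get(target_thread_id)
        if frame is not None:
            samples[frame.f_code.co_name] += 1
        time.sleep(period)

def busy_work():
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total

main_id = threading.get_ident()
t = threading.Thread(target=sampler, args=(main_id, 0.001))
t.start()
busy_work()
stop.set()
t.join()
print(samples.most_common(1))  # most samples should land in busy_work
```

Note that, like the user-level tools above, this cannot observe kernel-level activity.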
Sampling Tools
⚫ qprof
⚫ http://www.hpl.hp.com/research/linux/qprof/
⚫ User-level sampling tool
⚫ Linux based
⚫ To run
⚫ Setup some environment variables
⚫ Issue qprof_start
⚫ To stop, issue qprof_stop
$ source alias.csh
$ setenv QPROF_COLOR green
$ qprof_start
$ ./a.out
qprof: /home/hboehm/qprof/a.out: 118 samples, 118 counts
main:dumb_test.c:34   47 ( 40%)
main:dumb_test.c:35   47 ( 40%)
main:dumb_test.c:36   24 ( 20%)
$ qprof_stop
Hardware Counter Tools
⚫ Recent processors have a number of on-chip event counters
⚫ Exploit the ability to program these counters – developers can easily select the types of events
⚫ After that, the HCT acts as a wrapper around the chosen application
⚫ Virtual counter support
⚫ Counter overflow support
⚫ Multiplexing – a virtually unlimited number of counters
⚫ Statistical sampling
⚫ Multiple runs of the target application with different subsets of events
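When more events are requested than there are physical counters, the raw count from each time slice is commonly scaled up by the fraction of time the event was actually scheduled on the hardware (Linux perf exposes this as time_enabled vs. time_running). A sketch with hypothetical numbers:

```python
def scale_multiplexed(raw_count, time_enabled, time_running):
    """Estimate the true event count when the counter was only
    scheduled on the hardware for part of the measurement window."""
    if time_running == 0:
        return 0  # counter never ran; no estimate possible
    return round(raw_count * time_enabled / time_running)

# Hypothetical: the counter was live for 25 ms of a 100 ms window
# and observed 1,000,000 cache misses while scheduled.
estimate = scale_multiplexed(1_000_000, 100_000_000, 25_000_000)
print(estimate)  # 4000000
```

The estimate assumes the event rate is uniform over the run, which is why multiplexed counts are statistical rather than exact.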
Dynamic Analysis Tool
Dynamic Analysis Tools
  Features:
    1. Binary-level instrumentation
    2. Some can obtain both library- and kernel-level information
    3. Wide range of values with customizable instrumentation routines
  Shortcomings:
    1. Static probing: static probes are limited
    2. Relatively high overhead
    3. Instrumentation routines are not compatible with other tools
    4. Often tied to a specific OS
  Example tools: Pin, DTrace
Dynamic analysis tools
⚫ Binary-level alterations
⚫ Runtime change of code to gather statistical data
⚫ Two types of dynamic analysis
⚫ Binary instrumentation
⚫ Probing
⚫ Alters binaries online
⚫ Slows execution
⚫ Can perturb low-level statistics such as instruction pipelining and cache contents
Binary Instrumentation Tools
⚫ Binary modification techniques
⚫ Inject analysis routines into an application while it is executing
⚫ How?
⚫ The BIT first gets the application's instruction stream
⚫ Via OS support, by hooking the application's memory space
⚫ Injects the analysis routines
⚫ Resumes the application with the modified code
Operational Example of BIT
Dynamic Tracing
⚫ Overcome the shortcomings of static probes
⚫ Candles at fixed locations vs. a flashlight you can carry
⚫ Example: DTrace
⚫ Support both static and dynamic probing of user- and kernel-level SW in real time
⚫ When probes are hit, arbitrary actions may be performed in its D language
Static vs. Dynamic Tracing in DTrace
Dynamic tracing to biodone()
BITs: Pin
⚫ Dynamic binary instrumentation
⚫ Do not need source code, recompilation, post-linking
⚫ Instrumentation is performed by a JIT compiler
⚫ Pin intercepts the execution of the executable, generates new instrumented code, and runs it instead.
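As a loose Python analogy (Pin itself is a C/C++ JIT-based framework), attaching an analysis routine to running code can be sketched with sys.settrace, which invokes a callback on each traced event, much as a Pin analysis routine runs at each instrumented point:

```python
import sys

executed_lines = 0  # analysis state, like inscount's instruction counter

def analysis(frame, event, arg):
    """Analysis routine: called by the runtime at every traced event."""
    global executed_lines
    if event == "line":
        executed_lines += 1
    return analysis

def workload():
    total = 0
    for i in range(5):
        total += i
    return total

sys.settrace(analysis)   # attach the instrumentation
result = workload()
sys.settrace(None)       # detach
print(executed_lines, result)
```

No source changes or recompilation of `workload` are needed, mirroring the point above about dynamic binary instrumentation.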
Instrumentation routine
Analysis routine
Pin JIT execution loop:
1. Starting at the first application IP, read a trace from the application code
2. JIT it, adding instrumentation code from inscount.dll
3. Encode the trace into the code cache
4. Execute the jitted code until execution of the trace ends
5. Call into PINVM.DLL to JIT the next trace, passing in the application IP of the trace's target
6. A source trace's exit branch is modified to branch directly to the destination trace
Pin Invocation

pin.exe -t inscount.dll -- gzip.exe input.txt
Count 258743109

A PinTool that counts the application instructions executed and prints Count at the end.

[Figure: The launcher process (PIN.EXE) starts the application via CreateProcess(gzip.exe, input.txt, suspended), injects the Pin boot routine and its data (firstAppIp, "inscount.dll") with WriteProcessMemory(BootRoutine, BootData), and redirects execution with GetContext(&firstAppIp)/SetContext(BootRoutineIp). Resuming at the boot routine, the application process loads PINVM.DLL and starts it running with (firstAppIp, "inscount.dll"); PINVM.DLL loads inscount.dll and runs its main(). PINVM.DLL (decoder/encoder, code cache, system call/event/thread dispatchers, PIN.LIB, sitting above NTDLL.DLL and the Windows kernel) then reads traces from the application code starting at the first application IP, JITs them with instrumentation code from inscount.dll, and encodes the jitted traces into the code cache.]
Overview of
Performance Monitoring Units
(& Hardware Performance Counters)
PMU Overview
⚫ What is Performance Monitoring Unit (PMU)?
⚫ A special set of registers which count a number of different micro-architectural events
⚫ Explains run-time behavior of a program at micro-architectural level
⚫ All modern processors have PMUs
⚫ X86-family: IA-32, IA-64, AMD64
⚫ ARM: XScale, Cortex
Performance Monitoring Registers
⚫ Clock counter
⚫ Performance counter
⚫ Performance monitor control register
Event Types
⚫ Occurrence events:
⚫ whenever the event happens, counter++
⚫ Duration events:
⚫ while (ConditionIsTrue) counter++
⚫ Performance monitoring examples
⚫ Clock counter → total number of cycles during monitoring
⚫ Performance counter → duration-event counting
⚫ % of event duration = (performance counter / clock counter) * 100
Extending Count Duration > 32 bits
⚫ Overflow interrupts whenever Counter reaches the max value (if enabled)
⚫ Counting continues after an overflow until it is disabled by the ISR
⚫ A typical ISR will accumulate the number of overflows in a memory location.
⚫ ISR overhead is negligible:
⚫ Overhead = tens of cycles << 2^32
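The software-extended counter can be sketched as follows (hypothetical values; on real hardware the ISR would also read and clear the overflow flags in PMNC):

```python
COUNTER_BITS = 32
COUNTER_MAX = 1 << COUNTER_BITS  # hardware counter wraps at 2**32

overflows = 0  # memory location updated by the overflow ISR

def overflow_isr():
    """Overflow ISR: accumulate the number of overflows."""
    global overflows
    overflows += 1

def extended_count(hw_counter):
    """Reconstruct the full event count from the 32-bit HW counter
    plus the overflow count recorded by the ISR."""
    return overflows * COUNTER_MAX + hw_counter

# Simulate 3 overflows, then read the HW counter at 1000.
for _ in range(3):
    overflow_isr()
print(extended_count(1000))  # 12884902888  (3 * 2**32 + 1000)
```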
PMU Support in ARM Processors
⚫ Different in the number of H/W Performance Counters
⚫ XScale core (2 counters)
⚫ ARM11 core (2 counters)
⚫ Cortex-R4 (3 counters)
⚫ Cortex-A8 (4 counters)
⚫ Different in the types of events collected
⚫ Architecture specific
⚫ L1 cache, L2 cache
⚫ In-order, out-of-order
PMU Support in XScale Processor
⚫ Overview
⚫ Two Hardware Performance Counters
⚫ 14 different PMU events
⚫ PMU Registers (accessible through CP14)
⚫ CCNT (Clock Counter)
⚫ PMNC (Performance Monitor Control Register)
⚫ PMN0, PMN1 (Performance Count Registers)
[Intel XScale Core Developers Manual]
PMU Events in XScale Processor
PMU event                Description                                       Value
ICACHE_MISS              instruction cache miss                            0x0
ICACHE_CANNOT_DELIVER    instruction cache cannot deliver an instruction   0x1
DATA_STALL               stall due to a data dependency                    0x2
ITLB_MISS                instruction TLB miss                              0x3
DTLB_MISS                data TLB miss                                     0x4
BRANCH_INST_EXEC         branch instruction executed                       0x5
BRANCH_MISPREDICTED      branch mispredicted                               0x6
INST_COUNT               instruction executed                              0x7
DBUFFER_STALL_DURATION   stall because the data cache buffers are full     0x8
DBUFFER_STALL            stall because the data cache buffers are full     0x9
DCACHE_ACCESS            data cache access                                 0xA
DCACHE_MISS              data cache miss                                   0xB
DCACHE_WRITEBACK         data cache write-back                             0xC
PC_CHANGED               software changed the PC                           0xD
Clock Counter (CCNT)
⚫ Counts core clock cycles.
⚫ If the max value (0xFFFFFFFF) is reached,
⚫ Reset to 0
⚫ Overflow bit (PMNC bit 10) set to 1
⚫ IRQ/FIQ
⚫ If Enabled (PMNC bit 6)
Performance Monitor Control Reg. (PMNC)
⚫ 19:12/27:20 Event Type for PMN0/1
⚫ 10:8 Overflow/Interrupt flag
⚫ Identifies which counter overflowed
⚫ bit 10: CCNT, bit 8/9: PMN0/1
⚫ 6:4 Interrupt Enable
⚫ Bit 6: CCNT, bit 4/5: PMN0/1
⚫ 1 = enable , 0 = disable
⚫ 3(D) clock counter divider
⚫ 0 = count every cycle, 1 = count every 64 cycles
⚫ 2(C) clock counter reset
⚫ 0 = no action, 1 = reset CCNT to 0
⚫ 1(P) performance counter reset
⚫ 0 = no action, 1 = reset both PMN0/1 to 0
⚫ 0(E) enable
⚫ All three counters disable (0)/enable (1)
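Following the bit layout above, a PMNC value can be composed in software; a sketch (the actual register access would use CP14 coprocessor instructions, omitted here):

```python
def make_pmnc(evt0, evt1, enable=True, reset_perf=True, reset_clock=True,
              irq_ccnt=False, irq_pmn0=False, irq_pmn1=False, div64=False):
    """Compose a PMNC value per the XScale layout above:
    bits 19:12 = event type for PMN0, bits 27:20 = event type for PMN1,
    bits 6/5/4 = interrupt enables (CCNT/PMN1/PMN0), bit 3 = clock divider,
    bit 2 = CCNT reset, bit 1 = PMN0/1 reset, bit 0 = enable."""
    value = 0
    value |= (evt1 & 0xFF) << 20
    value |= (evt0 & 0xFF) << 12
    value |= int(irq_ccnt) << 6
    value |= int(irq_pmn1) << 5
    value |= int(irq_pmn0) << 4
    value |= int(div64) << 3
    value |= int(reset_clock) << 2
    value |= int(reset_perf) << 1
    value |= int(enable)
    return value

# CPI-monitoring setup: PMN0 counts instructions (0x7), PMN1 counts
# I-cache misses (0x0); reset and enable all counters.
pmnc = make_pmnc(evt0=0x7, evt1=0x0)
print(hex(pmnc))  # 0x7007
```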
Performance Monitoring Events
⚫ 0x0: I-cache miss (O)
⚫ 0x1: I-cache miss or I-TLB miss (D)
⚫ 0x2: stall due to a data dependency (D)
⚫ 0x3: I-TLB miss (O), 0x4: D-TLB miss (O)
⚫ 0x5: branch executed (O), 0x6: branch mispredicted (O)
⚫ 0x7: instruction executed (O)
⚫ 0x8/0x9: stall because the data cache buffers are full (D/O)
⚫ 0xA: D-cache access (O), 0xB: D-cache miss (O)
⚫ 0xC: D-cache write-back (O)
⚫ 0xD: S/W changed the PC (O)
Performance Monitoring Examples
Example: I-Cache Efficiency & CPI Monitoring
⚫ PMN0 Instruction Count
⚫ PMNC.evtCount0 = 0x7
⚫ PMN1 # of I-cache misses
⚫ PMNC.evtCount1 = 0x0
⚫ CCNT Total number of cycles
⚫ I-cache miss rate = PMN1/PMN0
⚫ CPI (cycles per ins.) = CCNT/PMN0
Example: D-Cache Efficiency
⚫ PMN0 # of D-cache accesses
⚫ PMNC.evtCount0 = 0xA
⚫ PMN1 # of D-cache misses
⚫ PMNC.evtCount1 = 0xB
⚫ D-cache miss rate = PMN1/PMN0
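Deriving the metrics in the two examples above from raw counter readings (hypothetical values for illustration):

```python
def miss_rate(misses, accesses):
    """Cache miss rate = misses / accesses (e.g., PMN1 / PMN0)."""
    return misses / accesses

def cpi(cycles, instructions):
    """Cycles per instruction = CCNT / instruction count."""
    return cycles / instructions

# Hypothetical readings after a monitoring run:
ccnt = 2_000_000   # total cycles
pmn0 = 1_000_000   # instructions executed (event 0x7)
pmn1 = 50_000      # I-cache misses (event 0x0)
print(cpi(ccnt, pmn0))        # 2.0
print(miss_rate(pmn1, pmn0))  # 0.05
```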
PMU Control Scenario

⚫ Procedure (over time: program start → monitoring start → monitoring end → program end):
1. Configure PMU (configuring PMNC)
   -- PMU events
   -- Overflow flags
   - …
2. Enable PMU
   -- PMNC.enable = 1
3. Interrupt handling (when an overflow occurs)
   -- Handle overflow
4. Disable PMU and compute results
   -- Read PMU registers
   -- PMNC.enable = 0
BPF Performance Tools
Reference:
Brendan Gregg, Chap. 2 of BPF Performance Tools, Addison-Wesley Professional, 2019
Berkeley Packet Filter (BPF)
⚫ Originated as a technology for optimizing packet filters
Extended Berkeley Packet Filter (eBPF)
⚫ From kernel 4.1, enhancements to BPF allow it to do much more than just filter packets (the emergence of eBPF)
⚫ Enables the kernel to run mini programs on system and application events (e.g., disk I/O, exec() call,…)
⚫ Makes the kernel programmable by users (not kernel developers)
⚫ These enhancements allow custom analysis programs to be executed on Linux dynamic tracing, static tracing,
and profiling events
⚫ These days, BPF and eBPF are used interchangeably.
Support for In-Kernel Processing
⚫ Example Tracing Tool: bitehist
Before BPF
Using BPF
BPF Illustration
BPF In-Kernel Verifier
⚫ Risks in allowing user-space code to run inside the kernel.
⚫ BPF programs are checked before being loaded:
⚫ Must terminate (no unbounded loops)
⚫ No out-of-bounds jumps / no out-of-range data accesses
⚫ Type, alignment, and bounds rules enforced for pointer arithmetic
⚫ No reads from uninitialized registers
⚫ Callable kernel functions are limited by the BPF program type
⚫ E.g., BPF_PROG_TYPE_KPROBE /* kprobe */
Classic BPF vs. Extended BPF
⚫ Both are limited virtual machines
                 Classic BPF            Extended BPF
# Registers      2                      10 + one frame pointer
Register Width   32 bits                64 bits
Storage          M[0–15]                512-byte stack + infinite "map" storage
Event Targets    Packets, seccomp-BPF   Packets, kernel functions, user functions,
                                        tracepoints, user markers, PMCs
Kernel Calls     Very limited           Yes (restricted)
BPF Runtime Internals
Writing BPF Programs
⚫ BPF can be programmed in a high-level language.
⚫ BCC
⚫ Kernel instrumentation in C
⚫ Front-ends in Python and Lua
⚫ bpftrace
⚫ Uses LLVM as a backend to compile scripts to BPF bytecode
⚫ Uses BCC for interacting with the Linux BPF system as well as Linux tracing capabilities
BCC
from bcc import BPF
from time import sleep

# load BPF program
b = BPF(text="""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

BPF_HISTOGRAM(dist);
BPF_HISTOGRAM(dist_linear);

int kprobe__blk_account_io_done(struct pt_regs *ctx, struct request *req)
{
    dist.increment(bpf_log2l(req->__data_len / 1024));
    dist_linear.increment(req->__data_len / 1024);
    return 0;
}
""")

# header
print("Tracing... Hit Ctrl-C to end.")

# trace until Ctrl-C
try:
    sleep(99999999)
except KeyboardInterrupt:
    print()

# output
print("log2 histogram")
…
print("\nlinear histogram")
bpftrace
# Files opened by process
bpftrace -e 'tracepoint:syscalls:sys_enter_open { printf("%s %s\n", comm, str(args->filename)); }'
# Syscall count by program
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
# Read bytes by process:
bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret/ { @[comm] = sum(args->ret); }'
bpftrace
BEGIN
{
    printf("Tracing block device I/O... Hit Ctrl-C to end.\n");
}

kprobe:blk_account_io_start
{
    @start[arg0] = nsecs;
}

kprobe:blk_account_io_done
/@start[arg0]/
{
    @usecs = hist((nsecs - @start[arg0]) / 1000);
    delete(@start[arg0]);
}

END
{
    clear(@start);
}