• 검색 결과가 없습니다.

Course Information

N/A
N/A
Protected

Academic year: 2022

Share "Course Information"

Copied!
72
0
0

로드 중.... (전체 텍스트 보기)

전체 글

(1)

내장형시스템특강 (임베디드시스템최적화)

Course Information

[4190.763(001)]

Jihong Kim

Department of Computer Science & Engineering Seoul National University

(2)

Seoul National University

Instructor

이메일 예약후 화상면담

연구 분야:

Flash Storage Systems Low-Power Systems

Embedded Systems & Software Computer Architecture

880-8792

jihong@davinci.snu.ac.kr

(3)

Seoul National University

Today’s Lecture

• Course Overview & Information

Jihong Kim 3

Computer Architecture: Chapter 0.

(4)

Seoul National University

Computers Everywhere

Dramatic progress in terms of

(5)

Seoul National University

Embedded Systems

• Often focused on a particular application area

• Audio processing

• Video processing

• Graphics

• Network/Communication

• DNN

• …

• Their optimized design/implementation is an important requirement in market.

Jihong Kim 5

Cou.

(6)

Seoul National University

Evaluation Metrics

Power

Performance Cost

(7)

Device & Technology Landscape

• We’re running into Walls

End of Scaling in Technology

Jihong Kim 7

(8)

SOS from Devices

Device Innovations 만으로는 충분하지 않다.

즉, Device 수준의 최적화 (closed optimizations) 가 한계에 도달하면 수직 통합된 Cross-layered 최 적화가 필요

대표적인 사례: Power-Aware Computing

Computer Organization Algorithms

Applications Operating Systems

Compiler

Transistor/Circuit/Logic

(9)

9 Jihong Kim

Main Roles of Optimizations

Role #1: Device Innovation의 효과가 사용자에 게 막힘 없이 전달되도록

Role #2: Device 수준의 문제점이 사용자에게 전

달되지 않도록

(10)

Seoul National University

Course Goal

• Study common optimization techniques used in embedded systems.

• Performance optimizations

• Power optimizations

• Memory size optimizations

• Study specific optimization techniques used in

NAND flash-based storage systems.

(11)

Seoul National University

Course Overview

• Consists of three parts

• Part 1: Performance Basics

• Part 2: Performance/Power/Memory Optimizations

• Part 3: Flash-aware computing

• Prerequisite:

• Computer Architecture

• Operating Systems

Jihong Kim 11

Cou.

(12)

Seoul National University

Evaluation

• 3번의 숙제: 30%

2번의 시험: 60% (중간: 30%, 기말: 30%) 1번의 프로젝트: 10%

숙제:

3개의 Labs (Linux 환경)

성능 평가 도구 개발 및 최적화 사례 실습

시험:

Closed Book

대면 시험 (예정)

프로젝트:

자유 주제

최적화의 요소를 반드시 포함하여야 함.

(13)

Seoul National University

Cheating Policy

• For Any Type of Cheating, a grade of F will be given.

• The College & University as well as YOUR ADVISOR will be notified of cheating activities for further

disciplinary actions.

Jihong Kim 13

Computer Architecture: Chapter 0.

(14)

Seoul National University

(15)

Performance Observability.1 Jihong Kim

Performance Observability

(16)

Computer Engineering Methodology

Evaluate Existing

Systems for Bottlenecks

Simulate New Designs and Implement

Next

Generation System

Technology Trends

Benchmarks Implementation

Complexity

(17)

Performance Observability.3 Jihong Kim

최적화의 출발점

Make the common case fast!!

How to find the common case?

How to make it fast?

최적화를 위해서는 abstraction layer들간의 정보 교환이 중요

Layer 2의 common case

Layer 1의

common case

(18)

Execution Time After Improvement = Execution Time Unaffected +

(Execution Time Affected / Amount of Improvement )

Example:

"Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?"

Amdahl's Law

(19)

Performance Observability.5 Jihong Kim

How about making it 5 times faster?

Corollary: Make the common case fast

(20)

Source of Performance Issues

Load vs. Architecture?

Can perform badly because of too much load applied or

Can perform badly because of an issue with SW or HW configurations

(21)

Performance Observability.7 Jihong Kim

Observability Tools

System-wide vs. Per-process

Counters vs. Tracing

Hardware events and software events

# of instructions retired, # of CPU migrations

Static vs. dynamic tracepoints

(22)

Perf Example

(23)

Performance Observability.9 Jihong Kim

Linux Performance Observability Tools

(24)

Software Performance Analysis Tools and Techniques

Static Analysis Tool

Dynamic Analysis Tool

Static vs. Dynamic Analysis Tool?

Non-binary modifications vs. Binary-level alternations

Reference: http://www.cse.wustl.edu/~jain/cse567-06/ftp/sw_monitors1/index.html

(25)

Performance Observability.11 Jihong Kim

Static Analysis Tool

Type Features Shortcomings Example tools

Static Analysis Tools

1. Source code modifications or sampling

1. Many can obtain kernel, library, HW levels

2. Simple to configure

1. sampling: results are approximate

2. Limited types of events and values can be monitored

gprof qprof Oprofile Perfsuite vTune

(26)

Compile-time Instrumentation Tools

Compile-time Instrumentation tools (CITs)

Oldest, stable

First developed during early 1980’s

Still used widely

Require entire source of the application

Processed by specialized compiler to instrument

Counters or monitor function calls are inserted existing function call structure

While executing instrumented binary, statistical data can be recorded triggered by inserted statements

(27)

Performance Observability.13 Jihong Kim

Compile-time Instrumentation Tools

Operational Description of a Typical CIT

(28)

CITs - Gprof

Gprof

CIT and sampling tool

CIT function of Gprof

Augmenting compiler

Insert a monitoring function at compile time

Data analysis wrapper

Instrumented binary is executed via the wrapper

(29)

Performance Observability.15 Jihong Kim

Sampling Tools

No need to have source code

Timer based sampling

Pauses execution at specific points

Record system state

(30)

Sampling Tools

Timer-based sampling

Relies on sampling period

Statistical accuracy vs. execution time

Too coarse grained → missing program state

Too often→ program runs very slowly

Variety of sampling rates can be tuned in an analysis

Software-based triggering mechanism

Activation can often be delayed due to other interactions

User-level tools cannot monitor kernel level performance issues

(31)

Performance Observability.17 Jihong Kim

Sampling Tools

qprof

http://www.hpl.hp.com/research/linux/qprof/

User-level sampling tool

Linux based

To run

Setup some environment variables

Issue qprof_start

To stop, issue qprof_stop

(32)

$ source alias.csh

$ setenv QPROF_COLOR green

$ qprof_start

$ ./a.out

qprof: /home/hboehm/qprof/a.out: 118 samples, 118 counts main:dumb_test.c:34 47 ( 40%) main:dumb_test.c:35 47 ( 40%) main:dumb_test.c:36 24 ( 20%)

$ qprof_stop

(33)

Performance Observability.19 Jihong Kim

Hardware Counter Tools

Recent processors have a number of on-chip event counters

Exploit the ability to program these counters – developer can easily select the types of events

After that, HCT acts as a wrapper around the chosen application

Virtual counter support

Counter overflow support

Multiplexing – virtually many number of counters

Statistical sampling

Multiple run of target application with different subset of events

(34)

Dynamic Analysis Tool

Type Features Shortcomings Example tools

Dynamic Analysis Tools

1. Binary level instrumentation

2. Some can obtain both library and kernel level information

3. Wide range of values with customizable

instrumentation routines

1. Static Probing: static probes are limited

2. Relatively high overhead

3. Instrumentation routines are not

compatible with other tools

4. Often tied to a specific OS

Pin Dtrace

(35)

Performance Observability.21 Jihong Kim

Dynamic analysis tools

Binary-level alterations

Runtime change of code to gather statistical data

Two types of dynamic analysis

Binary instrumentation

Probing

Binary altering online

Slow execution time

Can affect low level statistics such as the flow of instruction pipelining and cache contents

(36)

Binary Instrumentation Tools

Binary modification techniques

Inject analysis routines into an application while it is executing

How?

BIT first gets application’s instruction stream

By OS support hooking application memory space

Inject analysis routine

Resume the application with modified code

(37)

Performance Observability.23 Jihong Kim

Operational Example of BIT

(38)

Dynamic Tracing

Overcome the shortcomings of static probes

Candles at fixed locations vs. a flashlight you can carry

Example: DTrace

Support both static and dynamic probing of user- and kernel- level SW in real time

When probes are hit, arbitrary actions may be performed in its D language

(39)

Performance Observability.25 Jihong Kim

Static vs. Dynamic Tracing in DTrace

Dynamic tracing to biodone()

(40)

BITs: Pin

Dynamic binary instrumentation

Do not need source code, recompilation, post-linking

Instrumentation is performed by a JIT compiler

Pin intercepts the execution of the

executable and generates new code for the code and runs it instead.

Instrumentation routine

Analysis routine

(41)

Software & Services Group

27

Starting at first application IP Read a Trace from Application Code

Jit it, adding instrumentation code from inscount.dll

Encode the trace into the Code Cache

Execute Jitted code Execution of Trace ends

Call into PINVM.DLL to Jit next trace

Pass in app IP of Trace’s target Source Trace exit branch is modified to directly branch to Destination Trace

Pin Invocationgzip.exe input.txt

Application Code and Data

Application Process

System Call Dispatcher

Event

Dispatcher

Thread Dispatcher

PINVM.DLL

inscount.dll PIN.LIB

Code Cache

NTDLL.DLL

Windows kernel

CreateProcess (gzip.exe, input.txt, suspended)

Launcher PIN.EXE

Launcher Process

Boot Routine + Data:

firstAppIp,

“Inscount.dll”

Load PINVM.DLL

Inject Pin BootRoutine and Data into application

Load

inscount.dll and run its main() Start PINVM.DLL running

(firstAppIp,

“inscount.dll”)

pin.exe –t inscount.dll – gzip.exe input.txtCount 258743109

PinTool that counts application instructions executed, prints Count at end

Resume at BootRoutine

First app IP

app Ip of Trace’s

target

Read a Trace from Application Code

Jit it, adding instrumentation code from inscount.dll

Encode the jitted trace into the Code Cache

GetContext(&firstAppIp) SetContext(BootRoutineIp)

WriteProcessMemory(BootRoutine, BootData)

Decoder Encoder

(42)

Overview of

Performance Monitoring Units

(& Hardware Performance Counters)

(43)

Performance Observability.29 Jihong Kim

PMU Overview

What is Performance Monitoring Unit (PMU)?

A special set of registers which count a number of different micro- architectural events

Explains run-time behavior of a program at micro-architectural level

All modern processors have PMUs

X86-family: IA-32, IA-64, AMD64

ARM: XScale, Cortex

(44)

Performance Monitoring Registers

Clock counter

Performance counter

Performance monitor control register

(45)

Performance Observability.31 Jihong Kim

Event Types

Occurrence events:

whenever the event happens, counter++

Duration events:

while (ConditionIsTrue) counter++

Performance monitoring Examples

Clock counter  total number of cycles during monitoring

Performance counter  Duration events counting

% of Event duration =

(performance counter / clock counter) * 100

(46)

Extending Count Duration > 32 bits

Overflow interrupts whenever Counter reaches the max value (if enabled)

Counting continues under Overflow until it is disabled by ISR

A typical ISR will do accumulate the number of overflows in a memory location.

ISR overhead is negligible:

Overhead = tens of cycles << 232

(47)

Performance Observability.33 Jihong Kim

PMU Support in ARM Processors

Different in the number of H/W Performance Counters

XScale core (2 counters)

ARM11 core (2 counters)

Cortex-R4 (3 counters)

Cortex-A8 (4 counters)

Different in the types of events collected

Architecture specific

L1 cache, L2 cache

In-order, out-of-order

(48)

PMU Support in XScale Processor

Overview

Two Hardware Performance Counters

14 different PMU events

PMU Registers (accessible through CP14)

CCNT (Clock Counter)

PMNC (Performance Monitor Counter Register)

PMN0, PMN1 (Performance Count Registers)

[Intel XScale Core Developers Manual]

(49)

Performance Observability.35 Jihong Kim

PMU Events in XScale Processor

PMU event Description Value

ICACHE_MISS instruction cache miss 0x0

ICACHE_CANNOT_DELIVER instruction cache cannot deliver an instruction 0x1 DATA_STALL stall due to a data dependency 0x2

ITLB_MISS instruction TLB miss 0x3

DTLB_MISS data TLB miss 0x4

BRANCH_INST_EXEC branch instruction executed 0x5

BRANCH_MISPREDICTED branch mispredicted 0x6

INST_COUNT instruction executed 0x7

DBUFFER_STALL_DURATION stall because the data cache buffer are full 0x8 DBUFFER_STALL stall because the data cache buffer are full 0x9

DCACHE_ACCESS data cache access 0xA

DCACHE_MISS data cache miss 0xB

DCACHE_WRITEBACK data cache write-back 0xC

PC_CHANGED software changed the PC 0xD

(50)

Clock Counter (CCNT)

Counts core clock cycles.

If the max value (0xFFFFFFFF) is reached,

Reset to 0

Overflow bit (PMNC bit 10) set to 1

IRQ/FIQ

If Enabled (PMNC bit 6)

(51)

Performance Observability.37 Jihong Kim

Performance Monitor Control Reg. (PMNC)

19:12/27:20 Event Type for PMN0/1

10:8 Overflow/Interrupt flag

Identifies which counter overflowed

bit 10: CCNT, bit 8/9: PMN0/1

6:4 Interrupt Enable

Bit 6: CCNT, bit 4/5: PMN0/1

1 = enable , 0 = disable

(52)

3(D) clock counter divider

0 = count every cycle, 1 = count every 64 cycles

2(C) clock counter reset

0 = no action, 1 = reset CCNT to 0

1(P) performance counter reset

0 = no action, 1 = reset both PMN0/1 to 0

0(E) enable

All three counters disable (0)/enable (1)

(53)

Performance Observability.39 Jihong Kim

Performance Monitoring Events

0x0: I-cache miss (O)

0x1: I-cache miss or I-TLB miss (D)

0x2: stall due to a data dependency (D)

0x3: I-TLB miss (O), 0x4: D-TLB miss (O)

0x5: branch executed (O), 0x6: branch mispredicted (O)

0x7: instruction executed (O)

0x8/0x9: stall because the data cache buffers are full (D/O)

0xA: D-cache access (O), 0xB: D-cache miss (O)

0xC: D-cache write-back (O)

0xD: S/W changed the PC (O)

(54)

Performance Monitoring Examples

(55)

Performance Observability.41 Jihong Kim

Example: I-Cache Efficiency & CPI Monitoring

PMN0  Instruction Count

PMNC.evtCount0 = 0x7

PMN1  # of I-cache misses

PMNC.evtCount1 = 0x0

CCNT  Total number of cycles

I-cache miss rate = PMN1/PMN0

CPI (cycles per ins.) = CCNT/PMN0

(56)

Example: D-Cache Efficiency

PMN0  # of D-cache accesses

PMNC.evtCount0 = 0xA

PMN1  # of D-cache misses

PMNC.evtCount1 = 0xB

D-cache miss rate = PMN1/PMN0

(57)

Performance Observability.43 Jihong Kim

PMU Control Scenario

Procedures

Time

Program start Monitoring start Monitoring end

2. Enable PMU - PMNC.enable = 1

4. Disable PMU

-- Read PMU registers -- PMNC.enable = 0

Overflow occurs!

3. Interrupt handling

- Handle overflow

Program end

1. Configure PMU -- PMU events -- Overflow flags -…

(58)

Configuring PMNC

Interrupt handling

Computing results

(59)

Performance Observability.45 Jihong Kim

BPF Performance Tools

Reference:

Gregg Brendan, Chap. 2 of BPF Performance Tools, Addison-Wesley Professional, 2019

(60)

Berkeley Packet Filter (BPF)

Originated as a technology for optimizing packet filters

(61)

Performance Observability.47 Jihong Kim

Extended Berkeley Packet Filter (eBPF)

From kernel 4.1, enhancement of BPF allows BPF to do much more than just filtering packets (The emergence of eBPF)

Enables the kernel to run mini programs on system and application events (e.g., disk I/O, exec() call,…)

Makes the kernel programmable by users (not kernel developers)

These enhancements allow custom analysis programs to be executed on Linux dynamic tracing, static tracing,

and profiling events

BPF and eBPF were interchangeably used these days.

(62)

Support for In-Kernel Processing

Example Tracing Tool: bitehist

(63)

Performance Observability.49 Jihong Kim

Before BPF

(64)

Using BPF

(65)

Performance Observability.51 Jihong Kim

BPF Illustration

(66)

BPF In-Kernel Verifier

Risks in allowing user-space code to run inside the kernel.

BPF programs checked before loaded:

Must terminates without any loops

No out-of-bounds jumps/ No out-of-range data access

Type/alignment/no bounds violations for pointer arithmetic

No reads to uninitialized registers

Kernel functions limited over the BPF program type

E.g., BPF_PROG_TYPE_KPROBE /* kprobe */

(67)

Performance Observability.53 Jihong Kim

Classic BPF vs. Extended BPF

Both are limited virtual machines

Classic BPF Extended BPF

#Registers 2 10 + one frame

pointer

Register Width 32 bits 64 bits

Storage M[0 – 15] 512 byte stack +

infinite “map”

storage

Event Targets Packets, seccomp- BPF

Packets, kernel functions, user

functions, tracepoints, user

markers, PMCs Restricted

Kernel Calls Very limited Yes

(68)

BPF Runtime Internals

(69)

Performance Observability.55 Jihong Kim

Writing BPF Programs

BPF can be programmed in a high-level language.

BCC

Kernel instrumentation in C

Front-ends in Python and lua

Bpftrace

Uses LLVM as a backend to compile scripts to BPF bytecodes

Uses BCC for interacting with the Linux BPF system as well as Linux tracing capabilities.

(70)

BCC

# load BPF program b = BPF(text="""

#include <uapi/linux/ptrace.h>

#include <linux/blkdev.h>

BPF_HISTOGRAM(dist);

BPF_HISTOGRAM(dist_linear);

int kprobe__blk_account_io_done(struct pt_regs *ctx, struct request *req) {

dist.increment(bpf_log2l(req->__data_len / 1024));

dist_linear.increment(req->__data_len / 1024);

return 0;

}

""")

# header

print("Tracing... Hit Ctrl-C to end.")

# trace until Ctrl-C try:

sleep(99999999) except KeyboardInterrupt:

print()

# output

print("log2 histogram") … print("\nlinear histogram")

(71)

Performance Observability.57 Jihong Kim

bpftrace

# Files opened by process

bpftrace -e 'tracepoint:syscalls:sys_enter_open { printf("%s %s\n", comm, str(args->filename)); }'

# Syscall count by program

bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

# Read bytes by process:

bpftrace -e 'tracepoint:syscalls:sys_exit_read /args->ret/ { @[comm] = sum(args->ret); }'

(72)

bpftrace

BEGIN {

printf("Tracing block device I/O... Hit Ctrl-C to end.\n");

}

kprobe:blk_account_io_start {

@start[arg0] = nsecs;

}

kprobe:blk_account_io_done /@start[arg0]/

{

@usecs = hist((nsecs - @start[arg0]) / 1000);

delete(@start[arg0]);

} END {

clear(@start);

}

참조

관련 문서

The author, therefore, propose a fuzzy-PID controller which combines PID control and fuzzy technique to obtain the good performance of water level control

“With the MySQL Query Analyzer, we were able to identify and analyze problematic SQL code, and triple our database performance. More importantly, we were able to accomplish

○ In the future, the goals of Korea’s international agricultural development cooperation projects should be connected with SDGs, and it is important to

 Construction procedure greatly affects performance.. ② Excavate to first strut level with the margin of working area. ③ Place wales and struts. ④ Excavate to second

This research further enriches the previous research on the influencing factors of employee organizational commitment and OCB, which help Chinese companies

Understanding assembly code and how it relates to the original C code is a key step in understanding how computers execute

– Gate-level design: technology mapping, P&amp;R – RT-level design: Logic synthesis tool. – Behavioral-level design:

Performance evaluation of Volume PTV based on Affine Performance evaluation of Volume PTV based on Affine Performance evaluation of Volume PTV based on Affine Performance