• 검색 결과가 없습니다.

Causal Profiling

N/A
N/A
Protected

Academic year: 2022

Share "Causal Profiling"

Copied!
57
0
0

로드 중.... (전체 텍스트 보기)

전체 글

(1)

Causal Profiling

Ref:

Coz: Finding Code that Counts with Causal Profiling, CACM, June 2018

(2)

A Simple Multithreaded Program

(3)

Conventional Profiler: gprof

• Optimizing a() will be useless although it takes up 55% of execution time.

(4)

Causal Profiler

(5)

• A causal profiler

– Indicates where optimizations should be focused &

– Quantifies the potential impact of optimized code

• HOW:

– runs performance experiments with optimized code.

• Trick:

– Since we cannot speed up the target code during run time…

– Slow down all other code running concurrently.

Coz: Finding Code that Counts with Causal Profiling

[SOSP ‘15]

(6)
(7)

Virtual Speedup

(8)

Case Study: SQLite

[under write-intensive parallel workload]

(9)

Perf Profile Result

(10)

ARM Microprocessor

(11)

ARM Processors

ARM 1 in 1985

By 2001, more than 1 billion ARM processors shipped

Widely used in many successful

embedded systems

(12)

ARM Design Philosophy

Low power

High code density Low cost

Small die size (i.e., more for peripherals)

Hardware debug for the time to market

(13)

ARM ISA

Load-store architecture Fixed 32-bit instructions Pipelining

A large number of registers Not a pure RISC!

Variable cycle instructions

Complex instructions using inline barrel shifter Thumb instruction set

Conditional execution

(14)

ARM-based Embedded Device

ARM Proc.

Interrupt Handler

AHB arbiter

AHB-APB bridge

Memory controller

AHB-external bridge

Real-time clk Serial UARTs

Ethernet Counter/Timers

Ethernet Physical Driver

ROM SRAM Flash DRAM

Console

(15)

ARM Bus

An on-chip bus technology necessary (vs. off-chip bus)

Advanced Microcontroller Bus Architecture introduced in 1996.

ARM System Bus (ASB)

ARM Peripheral Bus (APB)

ARM High Performance Bus (AHB)

Plug-and-Play interface

(16)

ARM Peripherals

Memory mapped I/O Memory controller

Interrupt controller

Standard interrupt controller

Vector interrupt controller

(17)

Embedded System Software

Initialization (Boot) Code

Take the processor from the reset state to a state where OS can run

Three steps:

Initial H/W configuration

Memory remapping

Diagnostics Booting

(18)

Memory Remapping

Boot ROM I/O Regs FAST SRAM

DRAM

I/O Regs Boot ROM

DRAM

FAST SRAM Exception Vector

table Initialization

Code from ROM

(19)

ARM Architecture History

(20)
(21)

ARM Instruction Set

ARM V4T (Thumb)

ARM7TDMI, ARM920T

ARM V5TE (DSP)

ARM966E-S, ARM1020E, XScale

ARM V5TEJ (Jazelle)

ARM926EJ-S, ARM1026EJ-S

ARM V6 (Media)

ARM11 Family

ARM V4

StrongARM

(22)

Jazelle: Java acceleration, TrustZone: security, NEON: multimedia VFP: floating-point operations

(23)
(24)

ARMv7

Cortex-A

Application Profile (MMU, NEON, FP)

Targeting high-end consumer electronics, networking and mobile internet devices

Cortex-R

Embedded Profile (MPU)

Targeting high-performance real-time control systems (e.g., automotive, mass storage devices)

Cortex-M

Microcontroller Profile

Targeting cost-sensitive devices

(25)

ARM Cortex-A9 Core

8-stage pipeline Superscalar Decoding two instructions/cycle Speculative exec Variable-length out-of-order pipeline

Dynamic branch

(26)
(27)

big.Little Processing

Two architecturally identical processors

One for high performance

The other for power efficiency e.g., Cortex-A15:

Out-of-order, triple-issue, 15-24 pipeline stages

Cortex-A7:

In-order, dual-issue, 8-10 pipeline stages

(28)
(29)

ARM Processor Organization

ARM core

Cache Memory

Management

(30)

ARM9TDMI Pipeline

1. Fetch 2. Decode 3. Execute

ARM7

1. Fetch 2. Decode 3. Execute 4. Data/Buffer 5. Write Back

ARM9

I-cache

rot/sgn ex +4

byte repl.

ALU

I decode

register read

D-cache

fetch

instruction decode

execute

buffer/

data

write-back

forwarding paths

immediate fields next

pc

reg shift

load/store address

LDR pc SUBS pc

post - index

pre-index LDM/

S TM

register write

r15 pc + 8

pc + 4

+4

mux

shi ft mul

B, B L MOV pc

(31)

Intel XScale Pipeline

Cortex-A8: 13-stage Pipeline

(32)

ARM9TDMI

ARM9TDMI Core

5-stage pipeline

Separate instruction/data memory ports 32-bit ARM & 16-bit THUMB instruction set support

Coprocessor interfaces for FP and DSP

On-chip debug support

(33)

ARM9TDMI Pipeline

instruction fetch

instr uction fetch

Thumb decompress

ARM decode

reg read

reg write shift/ALU

reg write shift/ALU

r. read decode

data memor y access

Fetch Decode Execute

Memory Write Fetch Decode Execute

ARM9TDMI:

ARM7TDMI:

(34)

ARM 920T

ARM920T

ARM9TDMI core based

Instruction/Data Cache

MMU, write buffer

(35)

ARM920T Functional Block

Diagram

(36)

ARM and Thumb

32bit ARM Instruction Sets 16 bit Thumb Instruction Set

Small code size

Low instruction cache energy

All instructions are executed as ARM

Decompressor converts

Thumb to equivalent ARM instruction

Decompressor present in the decode stage of the pipeline.

(37)

Jazelle

• Similar to Thumb extension!

• Java mode like Thumb mode

(38)

Data Types Supported

8-bit signed and unsigned bytes

16-bit signed and unsigned half-words 32-bit signed and unsigned words

Alignment Required

(39)

Memory organization

Default: little-endian

half-word4 word16

0 1

2 3

4 5

6 7

8 9

10 11

byte0

byte

12 13

14 15

16 17

18 19

20 21

22 23

byte1 byte2

half-word14

byte3

byte6 half-word6

word16

3 2

1 0

7 6

5 4

11 10

9 8

byte3

byte

15 14

13 12

19 18

17 16

23 22

21 20

byte2 byte1

half-word12

byte0

byte5 address

address

bit 31 bit 0 bit 31 bit 0

half-word12 half-word14

word8 word8

(40)

Banked registers

20 registers hidden

Available only in a particular mode

Registers

31 32-bit registers User mode:

R0~R15 (16 reg.) FIQ mode:

R8_fiq~R14_fiq (7) 4 other modes : 2 each,

R13_xxx~R14_xxx (8)

6 Program Status Register

PSR 또는 CPSR

(41)

Special Register Usage

Program Counter (r15)

Can be used as an operand

‘pc’ keyword

Link Register (r14)

Used in BL (branch with link)

Move current pc to lr before jump RET == mov pc, lr

Stack Pointer (r13)

(42)

ARM C urrent P rogram S tatus R egister

(43)

ARM v7: CPSR

Execution state bits:

(44)

Processor Modes

ARM processor modes in version 4

(45)

ARM Exceptions

IRQ: interrupt request

FIQ: fast interrupt request

Conceptually same as IRQ

Support a faster dispatch of ISR

Software Interrupt

Via SWI instruction

Abort

When there is a failed attempt to access memory Prefetch Abort & Data Abort

(46)

Exceptions & Modes

Source: ARM System Developer’s Guide

(47)

ARM Exception Processing

Vector Table at 0x00

Jump to Exception handler in the vector

table entry

(48)

Exception Vector Table

(49)

Exception Vector Table (head.s)

0x00: b reset_handler

0x04: b vector_undefinstr

0x08: b vector_swi

0x0c: b vector_prefetch

0x10: b vector_data

0x14: b vector_addrexcptn

0x18: b vector_IRQ

0x1c: b _unexp_fiq

0x20: reset_handler: mov r0, #(MODE_IRQ)

0x24: msr cpsr, r0

0x28: ldr r13, =IRQ_STACK

0x2c: mov r0, #(MODE_SVC|I_BIT|F_BIT)

0x30: msr cpsr, r0

(50)

Processor Mode vs. Registers

Registers are remapped depending on the processor mode

Reducing the overhead of saving and

restoring registers

(51)

Registers

31 32-bit registers User mode:

R0~R15 (16 reg.) FIQ mode:

R8_fiq~R14_fiq (7) 4 other modes : 2 each,

R13_xxx~R14_xxx (8)

6 Program Status

(52)

Example: IRQ Mode

In User mode, r0~r15 used

If there is an interrupt request,

Processor mode = IRQ

r13_irq & r14_irq available (instead of r13 &

r14)

A separate stack pointer

(Last executed instruction + 8) in r14(r14_irq)

Return address from IRQ = r14_irq - 4

If necessary, r0~r12 must be saved before used

(53)

Processor Mode vs. PSR

CPSR is saved into a corresponding SPSR

Example: SPSR_irq CPSR when IRQ

Both CPSR & SPSR_irq accessible in IRQ

MRS r1, cpsr

(54)

Exception Priority

1) Reset (The Highest)

2) Data abort 3) FIQ

4) IRQ

5) Prefetch abort

6) Undefined Instruction, Software

Interrupt

(55)

ARM Instruction Set: Overview

(56)

ARM Instruction Set Overview

- Condition Codes

(57)

ARM Instruction Set Overview

- Data processing Instructions

참조

관련 문서

The PWM waveform is generated by setting (or clearing) the OC2x Register at the compare match between OCR2x and TCNT2, and clearing (or setting) the OC2x Register at the

Input Serial Port to Sequencer Sync (Bits 3:2) Normally, the internal sequencer is synchronized to the incoming audio frame rate by comparing the internal program counter

Bit(s) Field Name Description Type Reset 31:3 Reserved, write zero, read as don’t care.. 2 SPI2 enable If set the SPI 2

The purpose of this study is to examine industrial specialized high school students’ recognition on career and the status of career instruction, and analyze

Incheon Support Center for Foreign Workers (the IFC) had the goodbye party for the year a little bit early, on November 29, at Welfare Hall of the Incheon

proficiency level, they enjoyed English flipped learning. Based on the study findings teaching implications in order to improve English flipped instruction for middle

When it comes to Arbitration which is a useful way to resolve the international investment disputes, the Civil procedure for Recognition and Enforcement of Foreign Judgement

Write on Common Data Bus to all awaiting RS units, FP registers, and store buffers; mark its reservation station available.  Normal data bus: data + destination