• 검색 결과가 없습니다.

Introduction to Instruction Scheduling

N/A
N/A
Protected

Academic year: 2024

Share "Introduction to Instruction Scheduling"

Copied!
29
0
0

로드 중.... (전체 텍스트 보기)

전체 글

(1)

Introduction to Instruction Scheduling

Å

Scheduling Constraints

Å

Basic Block Scheduling (list scheduling)

(2)

Instruction Scheduling

Å

Scheduling or reordering of instructions to improve CPU performance

¿ Important impact to performance

¿ Hottest issues in 90’s

¿ CPU optimization is less important than memory optimization these days

¿ Optimization for multi-threaded CPUs is hotter than multi-issue CPUs these days

(3)

History of Instruction Scheduling

Å

Early pipelined machines before RISC

¿ Overlapping the execution of instructions requires microcode compaction

Å

Early single-issue RISC machines

¿ Filling branch/load delay slot requires reordering

Å

Superscalar (multi-issue) RISC, VLIW, CISC

¿ Elaborate code scheduling to fill multiple issue slots for parallel execution

(4)

Instruction Scheduling Techniques

Categories of instruction scheduling

Å

Local, basic block scheduling

Å

Global, cross-block scheduling

Å

Software pipelining

Instruction scheduling is done via code motion or code placement

Instruction scheduling is constrained by

resource and dependence constraints

(5)

Scheduling Constraints

Å

Resource Constraints

¿ Caused by mutual exclusion of functional units

¿ Machine constraints of real CPUs are complex

Å

Program Constraints

¿ Caused by precedence dependences in program

¿ Control dependences

¿ Data dependences

¿ Techniques to overcome these constraints

(6)

Control Dependences

Å An example

if ( a > t ) then { b = a*a;

}

c=a+d;

Å Control Dependence

¿ b=a*a is control dependent on a>t

Å Cannot move b=a*a ahead of the branch (a>t)

¿ c=a+d is not control dependent on a>t although it is located after the branch (a>t)

Å Can move c=a+d before the branch with no problem

(7)

Overcoming Control Dependences

Speculative

code motion

¿ Move control-dependent instruction before conditional branch so that they are executed speculatively

b = a*a;

if ( a > t ) then { }

c=a+d;

Å Effectiveness of speculative execution

¿ If the original path is taken: successful

¿ If the original path is not taken: nothing to lose

¿ Speculative code motion should be made to use otherwise idle resources

(8)

Speculation

Å

Correctness of speculative code motion

¿ Should not affect the correctness

/* if b is live after c=a+d */

b’ = a*a;

if ( a > t ) then { b = b’;

}

c=a+d;

¿ Should not cause an exception

Å Speculative loads

¿ Should not cause a permanent update

Å No speculative stores

(9)

Useful code motion

Non-speculative code motion

c=a+d;

if ( a > t ) then { b = a*a;

}

Should perform useful code motion as many as possible

Å Non-speculative code motion

Å Partially-speculative code motion

Å Unification

Å Speculative code motion with higher hit ratio

Å Based on profiling

(10)

Unification

A

x=load()

y = x + 1 B

z = x + 1

y = x + 1 A

x=load() y = x + 1

B

z = y

Simplest form: moving an instruction below a hammock to above the hammock

More sophisticated form of unification

(11)

Data Dependences

Must maintain the order of accesses to the potentially same locations

¿ True dependence: write → read

¿ Anti dependence: read → write

¿ Output dependence: write → write

Analysis on register operands is easy

(12)

Overcoming Data Dependences

Anti and output data dependences are non- true dependences

¿ Caused by conserving registers

¿ Speculative code motion can cause violation of data dependences

¿ Can be overcome by (partial) renaming

True data dependences on copies can also be

overcome by forward substitution

(13)

Renaming

add r1,2->r2 add r2,2->r3

add r1,2->r2’

mov r2’->r2 add r2,2,r3

...

add r1, 1 -> r2 ...

sub r3, 2 -> r1 ...

...

sub r3, 2 -> r1’

...

add r1,1 -> r2 ...

mov r1’ -> r1 ...

(14)

Forward-substitution

...

mov r1-> r2 ...

add r2,1->r3 ...

...

add r1,1->r3 ...

mov r1-> r2 ...

(15)

Parallelism and Registers

Å

Register Allocation: more register < - >

more parallelism

r1 = r2 + r3 r1 = r2 + r3

a = r1 a = r1

r1 = r2 + r5 r4 = r2 + r5

b = r1 b = r4

(16)

Analysis of Memory Data Dependences

Analysis on Memory Variables

¿ Simple: base+offset1 = base+offset2 ?

¿ Data dependence analysis: A[2i] = A[2i+1] ?

¿ Interprocedural analysis: global = parameter ?

¿ Pointer analysis: p1 = p2 ?

Requires memory disambiguation

¿ Front-ends may help

¿ Back-ends can also analyze by itself by following the roots of memory addresses

(17)

Memory Disambiguation

Memory disambiguation results

¿ Yes: store-load can be replaced by store-copy

¿ Partially yes

¿ No: no data dependence

¿ Maybe: speculative code motion

Å Exploit unsafe load and address compare buffer

(18)

Overview of Scheduling

Å

Local, basic block (BB) scheduling

¿ List scheduling

¿ Interaction between register allocation and scheduling

Å

Global Scheduling

¿ Cross-Block code scheduling Å

Software Pipelining

(19)

Machine Model

Å Machine model used in our scheduling examples

Å Parallel operations

¿ One integer operation:

Å ALUop dest = src1, src2 (1 cycle)

¿ One memory operation:

Å LD dest = addr (2 cycle), ST src, addr (1 cycle)

Å All registers are read at the beginning of a cycle, and are written at the end of a cycle

¿ LD r2=0(r1) followed by ALU r1=r3, r4 can execute in parallel

¿ load uses the old value of r1

¿ Anti-dependence has a delay of zero cycle

(20)

Precedence Constraints

Å

Data dependences among instructions in a BB form a directed acyclic graph (DAG)

¿ Nodes: instructions

¿ Edges: data dependence constraints, labeled by its delay cycles

¿ An example BB

i1 : LD r2 = 9(r1) i2 : ST r4, 0(r3) i3 : ADD r4 = r4, r3 i4 : ADD r1 = r5, r4 i5 : ADD r6 = r2, r4

(21)

Scheduling without Resource Constraints

Å

Topological Sort

Ready = nodes with zero predecessors Loop until READY is empty

Schedule each node in READY as early as possible Update predecessor count of successor nodes

Update READY

Å

Length of the schedule = critical path length

(22)

Scheduling with Finite Resources

Å Optimal scheduling is NP-complete in general

Å Need a heuristic solution

Å List scheduling

READY = nodes with zero predecessors Loop until READY is empty

Let n be the node in READY with highest priority Schedule n in the earliest slot

that satisfies precedence + resource constraints Update predecessor count of n’s successor nodes Update READY

(23)

List Scheduling

Å

Scope: DAGs

¿ Schedule operations in topological order

¿ Never backtracks

Å

Consider “best” among those in the critical

path, branch-and-bound for microinstruction

scheduling [Fisher]

(24)

Variations of List Scheduling

Priority function for node n

¿ delay: max delay slots from

n

to any node

¿ critical path: max cycles from

n

to any node

¿ resource requirements

¿ source order

Direction: top-down vs. bottom-up

Operation vs. cycle scheduling

(25)

Direction: Top-down vs. Bottom-up

¿ Bottom-up (ALAP) shortens register lifetimes for expression trees than top-down (AEAP)

Bottom-Up Top-Down

(26)

Cycle vs. Operation Scheduling

Å

Operation scheduling

¿ Schedule critical operations first

Å

Cycle Scheduling

¿ Concept of a current cycle

¿ Only instructions ready in the current cycle are considered

(27)

Cycle vs. Operation Scheduling

Å

Advantages of operation scheduling

¿ highest priority operation gets highest priority Å

Advantages of cycle scheduling

¿ Simpler to implement

(no need to keep history of resource usage)

¿ Easier to keep track of register lifetimes

(28)

Phase Ordering of RA and IS

Å

Allocation and assign registers → Schedule

¿ round robin, partial renaming, forward substitution Å

Allocate to infinite registers → Schedule →

Assign registers

¿ Add spill code later

Å

Allocate to infinite registers → Schedule → Assign registers → Schedule (spill code)

¿ Current popular wisdom

(29)

Integrated Solution

Å

Integrated register allocation & scheduling [Goodman & Hsu]

¿ List schedule (keep track of liveness of registers)

Å Priority 1: maximize parallelism

Å Priority 2: reduces registers

¿ When remaining registers > threshold, use priority 1

¿ Otherwise, use priority 2

참조

관련 문서

Because luminal types have higher incidence of lymph node metastasis with a larger HR of Np/T on survival compared with non-luminal subtypes, surveillance for axillary lymph

We show that when the number of agents is fixed, the single machine case with the first objective is NP-hard in the ordinary sense, and present the poly- nomial-time algorithm for

The scheme is a greedy algorithm that iteratively chooses nodes with high degree as a reference node until the chosen local maps are successfully merged with a sufficient number

In this paper, a hybrid genetic algorithm (HGA) approach is proposed for an n-job, m-machine flow shop scheduling problem with limited capacity buffers with blocking in

This paper considers a parallel-machine scheduling problem with job release times and sequence-dependent setup times. The objective of this problem is to determine

In this research, the single machine capacitated lot-sizing and scheduling problem with sequence- dependent setup costs and setup times (CLSPSD) is considered.. This problem

In general, for parallel machine scheduling problem with consecutive eligibility, the performance ratio of MFFH is the same as that of the LP-based algorithm (Lenstra et

The proposed scheduling policy is capable of enhancing real-time performance of the entire system by maximizing the number of events with higher priority completed