
CAP Laboratory, SNU 1

Schedule

1. Introduction
2. System Modeling Language: SystemC *
3. HW/SW Cosimulation *
4. C-based Design *
5. Data-flow Model and SW Synthesis
6. HW and Interface Synthesis (Midterm)
7. Models of Computation
8. Model-based Design of Embedded SW
9. Design Space Exploration (Final Exam)
(Term Project)

Reference

• PeaCE Approach
  - Hyunuk Jung, Kangnyoung Lee, and Soonhoi Ha, "Efficient Hardware Controller Synthesis for Synchronous Data Flow in System Level Design," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 10, pp. 423-428, August 2002.
  - Hyunuk Jung, Hoeseok Yang, and Soonhoi Ha, "Optimized RTL Code Generation from Coarse-Grain Dataflow Specification for Fast HW/SW Cosynthesis," Journal of VLSI Signal Processing, published online 10 June 2007.

Contents

• Introduction
  - Higher-level design from dataflow model
  - PeaCE design flow & cosynthesis
• Previous works
• Block Definition
• Controller Synthesis
• Schedule-Based Design
• FRDF Specification for more efficient HW implementation
• Conclusions

Sources: IEEE Transactions on VLSI Systems, vol. 10, August 2002; J. VLSI Signal Processing, 2007

Hardware Synthesis Problem

• Automatic hardware synthesis from a coarse-grained dataflow specification in system-level design
  - A node represents a coarse-grained computation block such as an FIR filter or a DCT.
  - A node has complex properties such as data sample rates, I/O timings, data types, and internal state.
  - A central controller should be generated automatically to control these complex coarse-grained HW library blocks and registers.

[Figure: a DFG with nodes B, C, and D, together with the block libraries for B, C, and D, is synthesized into hardware in which a controller drives blocks B, C, and D under clk and rst.]

Design Size & Abstraction Level

• 1970s, several hundred transistors
  - Transistor and gate level design
• 1980s
  - Register Transfer Level (RTL) design
  - Hardware Description Language (HDL)
• 1990s
  - High-level synthesis (or behavioral synthesis)
  - Behavioral HDL, imperative programming languages (C, C++)
• Today, several million gates
  - Exponential increase in transistor density
  - HW/SW Co-design (System Level Design)
  - We need higher-level tools for system design and hardware description

Conventional High-level Synthesis

• Hardware synthesis
  - Conventional architecture and logic synthesis techniques are used

[Figure: design flow with the stages Functional Spec, HDL Coding, Simulation, Architecture Synthesis, Logic Synthesis, Layout Synthesis, Back Annotation, and Layout, spanning the behavioral, register-transfer, and gate levels. A "Focus!" callout marks the part of the flow covered here.]

Architecture Synthesis Problem

[Figure: architecture synthesis takes a behavioral model (CDFG), constraints (timing, area, performance, resource binding), and resources (library + module generator; primitives with given area and delay) and produces an RTL description with estimated area and delay. Its steps are scheduling, resource binding, data path synthesis, and control logic design.]

New Trend: C-based design

• From C/C++ to Hardware
  - Mentor Graphics (www.mentor.com): "Catapult C Synthesis"
  - Forte Design Systems (www.forteds.com): "Cynthesizer"
  - Synfora (www.synfora.com): "PICO Express"
  - Y Explorations Inc. (www.yxi.com): "eXCite"
  - Celoxica (www.celoxica.com): "Agility Compiler"

RTL-level compiler

System-level HW Synthesis

• Higher-level hardware synthesis
  - Increasing need for a design methodology at a higher abstraction level
  - Growing complexity and the demand for fast design turn-around time
  - Designs should be easy to modify and maintain
• Automatic code generation from a dataflow graph
  - SDF semantics should be preserved ("refinement").
  - The kernel code of each block is already optimized in the library.
  - Determine the schedule and the resource allocation.
  - The controller is generated according to the scheduled sequence and the resource mapping.
• Fundamental question
  - Can we generate HDL code whose synthesized area and performance are comparable to those of manually optimized code?

Hardware Synthesis Strategies

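[Figure: synthesis paths starting from a partitioned graph: behavioral-level HDL followed by behavioral-level synthesis (high-level synthesis); RT-level HDL followed by logic synthesis; synthesizable C through C-to-HDL/HW; and cycle-based C through C-to-HW. The paths are annotated with the labels "Current PeaCE" and "Future".]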

System Design Flow in PeaCE

[Figure: PeaCE design flow. The Dataflow Specification and the Architecture Specification, together with the Node-PE Performance DB, feed Partitioning/Scheduling, which yields a SW subgraph with its SW schedule and a HW subgraph with its HW schedule information. C code generation (SW) and VHDL code generation (HW) then produce the C code and VHDL code used for Cosimulation/Cosynthesis.]

HW/SW Interface

• Hardware synthesis for HW/SW cosynthesis
  - After HW/SW partitioning of the initial dataflow specification, hardware is generated automatically for the subgraph mapped to hardware.
  - A partitioned subgraph has interfacing blocks such as SND (send) and RCV (receive) blocks for communication. These interfacing blocks contain internal buffers and shared-memory access logic.

[Figure: an initial dataflow graph with nodes A to E and multi-rate edges (rates 1 and 4). The part containing B and C is mapped to hardware, where it is bounded by RCV1 on the input side and SND1 on the output side.]

HW/SW Cosynthesis

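[Figure: target architecture. A uProcessor and a Memory share an AHB bus with the synthesized HW (VHDL domain), which is attached through a wrapper. SW tasks run on top of the OS/device driver, and send (S) and receive (R) blocks connect the HW and SW parts.]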

Contents

• Introduction
• Previous works
  - GRAPE, Meyr's work, Ptolemy, and PeaCE
• Block Definition
• Controller Synthesis
• Schedule-Based Design
• FRDF Specification for more efficient HW implementation
• Conclusions & Future Directions

GRAPE Approach

• GRAPE (Graphical Rapid Prototyping Environment) is a HW/SW codesign environment for the functional emulation of DSP systems.
• Uses a cyclo-static dataflow specification
  - Sample rates change periodically
• Distributed control
  - Handshaking protocol between blocks
  - FIFO buffers
  - No central controller
• The generated hardware implementation has a one-to-one correspondence to the dataflow specification
  - Simple architecture generation

GRAPE Standard Interface

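[Figure: GRAPE standard interfaces. A receive node (R) has data, strb, and rdy on its input side and data, wr, and wr_ok on its output side. A buffer (B) has data, wr, and wr_ok on its write side and data, rd_ok, and rd on its read side. A user task node (U) reads with data, rd_ok, and rd and writes with data, wr, and wr_ok. An example network connects receive nodes R1 and R2, send node S1, buffers B1 to B6, and user task nodes U1 to U4.]

• Asynchronous communication between all blocks using a handshaking protocol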

Meyr’s Approach

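• Uses Synchronous Dataflow
• Serialized I/O of multi-rate ports
• Fully static I/O timing analysis

[Figure: a synchronous dataflow graph IN, M1, M2, M3, OUT with signals SIG1, SIG2, and SIG3 is mapped to an RTL target architecture in which M1, M2, and M3 are connected through interface blocks IF1, IF2, and IF3. The interfaces provide pattern adjustment, initial values, and shimming registers; a stall generator and a reset generator are driven by the clock.]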

Ptolemy Approach: VHDL domain

• Sequential VHDL code generation
  - For simulation
  - The entire application is described in a single process using only variables.
• Structural VHDL code generation
  - For synthesis
  - Individual firings (or invocations) of a node are instantiated in separate hardware resources
  - Fully parallel HW architecture for a multi-rate specification

[Figure: for an edge on which A produces 4 samples and B consumes 1, the four firings of B are instantiated as four separate B blocks.]

Limitations of Previous Approaches

• GRAPE
  - Distributed control using handshaking protocol
  - Synthesis problem is simple
  - Only for rapid prototyping
• Meyr's
  - Supports only static I/O timings
  - Resource sharing is not considered
• Ptolemy
  - Impractically large area overhead in case of multi-rate specification

Comparison among Approaches

Approaches                         | Ptolemy                      | Meyr's                     | GRAPE                      | PeaCE
Implementation of multi-rate spec. | Parallel implementation      | Sequential implementation  | Sequential implementation  | Parallel/Sequential/Hybrid implementation
Resource allocation                | Multiple-resource allocation | Single-resource allocation | Single-resource allocation | Multiple/Single/Shared-resource allocation
Inter-block communication          | Synchronous                  | Synchronous                | Asynchronous (FIFO)        | Synchronous
Block control                      | Centralized control          | Centralized control        | Distributed control        | Centralized control
Block exec. time                   | Fixed                        | Fixed                      | Variable                   | Variable

Contents

• Introduction
• Previous works and our contributions
• Block Definition
  - Block I/O Model & Block Types
• Controller Synthesis
• Schedule-Based Design
• FRDF Specification for more efficient HW implementation
• Conclusions & Future Directions

Block Libraries

• Types of block implementations
  - A: Combinational logic
  - B: Single-cycle sequential logic
  - C: Multi-cycle sequential logic with fixed execution time
  - D: Multi-cycle sequential logic with variable execution time
• Timing model of HW block
  - Execution time of block
  - I/O of multi-rate block

Block Types & Control Signals

• Type A: combinational logic
• Type B: single-cycle sequential logic
• Type C: multi-cycle sequential logic with fixed execution time
• Type D: multi-cycle sequential logic with variable execution time

[Figure: a chain RCV, A, B, C, D, SND driven by clock and reset. Control signals shown: state_update signal (Type B), start signal (Type C), and start and done signals (Type D). The controller enables the blocks through en_a, en_b, en_c, and en_d.]

Type A: Combinational logic

Execution time = propagation delay / clock period (cycles)

[Figure: an adder with inputs A and B and output C; the execution time spans from valid inputs to valid output.]

VHDL Star with States

Type B: Single-cycle sequential logic

The logic is separated into combinational logic and a state register and is implemented as a Mealy machine.

Execution time = propagation delay / clock period (cycles)

[Figure: an accumulator between A and C, built as a Mealy-type state machine from an adder and a state register that is updated by the state update signal.]

Type C: Multi-cycle sequential logic with fixed execution time

• The number of cycles is fixed.
• Clock and reset signals are needed.
• The controller should provide the start and output update signals.
• Execution time = specified number of cycles

[Figure: an FIR filter between A and C with clock, reset, and start inputs; its output register is latched by the output update signal.]

Type D: Multi-cycle sequential logic with variable execution time

• The number of cycles varies at run time.
• Clock and reset signals are needed.
• The controller should provide the start and output update signals.
• The library block should generate a done signal, which the controller uses to determine when the block has finished.
• Execution time = number of cycles until the done signal is asserted

[Figure: a divider between A and C with clock, reset, and start inputs; it raises a done signal, and its output register is latched by the output update signal.]

Timing Model of HW Library Block

[Figure: a HW block with inputs and outputs; the execution time spans from the inputs to the outputs.]

• Strict execution
  - A block can start its execution only after all of its inputs are valid, and its execution is finished only when all of its outputs are valid.

Timing Model of HW Library Block

• Start time = 3, End time = 8
• Execution time = End time - Start time + 1 = 6 (cycles)

[Figure: timing diagram over clock cycles 2 to 10 showing the clock, the start signal, the counter, the input valid timing, the output latch signal, and the output valid timing; the execution time covers cycles 3 to 8.]

Timing Model of Multi-rate Block

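• Only parallel I/O of a multi-rate block is supported.
• An FRDF implementation makes it possible to serialize the I/O operations.

[Figure: serial I/O versus parallel I/O for a block A that consumes 1 sample and produces 2. With serial I/O the two outputs appear one after another on a single port O; with parallel I/O they appear at the same time on separate ports O1 and O2. A third version of A is annotated with the fractional rate 1/2.]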

Contents

• Introduction
• Previous works and our contributions
• Block Definition
• Controller Synthesis
  - Interface Problem
  - Cascaded counter controller
  - Looping control & buffer management
• Schedule-Based Design
• FRDF Specification for more efficient HW implementation
• Conclusions & Future Directions

Controller Synthesis

• Issue 1: Handling non-deterministic I/O timing
  - Communication with the outside of the hardware module: the HW/SW interface
  - Communication between blocks inside the hardware module: blocks with variable execution time
• Issue 2: Supporting various schedules
  - Looped scheduling
  - Resource sharing
  - Buffer management

Communication between Modules

• Types of communication schemes
  - Synchronous communication
      Communication timing is predetermined.
      Drawback: tasks must be scheduled assuming the worst-case execution time.
  - Batch communication
      Synchronous communication can be emulated with buffers on an asynchronous interface.
      Drawback: it cannot be applied to a DFG with global feedback.
  - Asynchronous communication
      Communication timing varies at run time.
  - In many cases the asynchronous communication scheme is an efficient solution, or the only one.
  - Except for GRAPE, the previous approaches do not consider asynchronous communication with the outside.

Basic Idea

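• Counter-based solution
  - Simple and intuitive

[Figure: a graph with RCV, blocks A (combinational logic), B, C, and D, a state register, and SND; the block execution times shown are 10, 30, 20, and 20. A counter with count 60 is enabled by the valid signal (v) of the receive node, and its zero signal (z) provides the enable signal of the send buffer and the state register.]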

Multiple RCV nodes

• Main goal
  - Obtain the earliest time at which the output is ready, regardless of the order in which the inputs arrive
• Valid timing equation of the send node

$$VT = \max_i \,(RT_i + D_i)$$

  - VT: valid timing
  - RT_i: receive timing of the i-th receive node
  - D_i: critical path length from the i-th receive node

[Figure: a graph with receive nodes RCV1, RCV2, and RCV3, blocks A, B, C, and D with execution times 30, 10, 20, and 20, and one send node SND; D1 = 60, D2 = 40, D3 = 50.]
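The valid timing equation can be checked with a few lines of Python. This is only a sketch: the critical path lengths D1 = 60, D2 = 40, and D3 = 50 come from the slide, while the receive timings in `RT` are hypothetical values chosen for illustration, and the function name `valid_time` is not part of PeaCE.

```python
def valid_time(receive_times, path_lengths):
    """Earliest cycle at which the send node's data is valid.

    receive_times[i] is RT_i and path_lengths[i] is D_i.
    """
    if len(receive_times) != len(path_lengths):
        raise ValueError("need one critical path length per receive node")
    return max(rt + d for rt, d in zip(receive_times, path_lengths))


# D1 = 60, D2 = 40, D3 = 50 are taken from the slide.
D = [60, 40, 50]
# Hypothetical arrival cycles for RCV1, RCV2, RCV3 (assumption).
RT = [5, 30, 0]

print(valid_time(RT, D))  # 70, determined by RCV2: 30 + 40
```

Although RCV1 has the longest critical path, the late arrival at RCV2 determines the valid timing in this example, which is why the arrival order cannot be fixed in advance.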

Cascaded Counter: Idea

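[Figure: the graph with RCV1, RCV2, RCV3, blocks A, B, C, and D, and SND, with D1 = 60, D2 = 40, D3 = 50. The critical path length computation places RCV1, RCV3, and RCV2 on a time axis at 60, 50, and 40 before SND (at 0). The cascaded counter is a chain of three counters with counts 10, 10, and 40. The stages receive the valid signals (v) of RCV1, RCV3, and RCV2 in that order, each stage passes a zero signal (z) to the next, and the final zero signal is the enable signal for the send buffer and delay register update and the clear signal for the valid and zero registers.]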
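One way to read the cascaded counter is sketched below in Python. The start rule is an assumption made for this sketch, not something the slide states explicitly: a stage begins counting once the previous stage has reached zero and its own receive node has signalled valid. Under that assumption the chain finishes at max_i(RT_i + D_i). The function name and the receive timings are hypothetical.

```python
def cascaded_valid_time(receive_times, path_lengths):
    """Stage counts and finish time of a cascaded counter chain.

    Stages are ordered by decreasing critical path length D_i. A stage
    counts the difference to the next shorter path (the last stage counts
    its full D_i). Assumed rule: a stage starts when the previous stage
    is at zero AND its own receive node is valid.
    """
    order = sorted(range(len(path_lengths)), key=lambda i: -path_lengths[i])
    counts = []
    for pos, i in enumerate(order):
        nxt = path_lengths[order[pos + 1]] if pos + 1 < len(order) else 0
        counts.append(path_lengths[i] - nxt)

    t = 0  # cycle at which the previous stage reached zero
    for i, count in zip(order, counts):
        start = max(t, receive_times[i])
        t = start + count
    return counts, t


D = [60, 40, 50]   # D1, D2, D3 from the slide
RT = [5, 30, 0]    # hypothetical receive timings (assumption)

counts, vt = cascaded_valid_time(RT, D)
print(counts)  # [10, 10, 40], the counts shown in the figure
print(vt)      # 70
print(vt == max(rt + d for rt, d in zip(RT, D)))  # True
```

The stage counts come out as 10, 10, and 40 in the order RCV1, RCV3, RCV2, matching the figure, and no stage needs to know the arrival order in advance.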

With Multiple Send Nodes

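[Figure: the graph with RCV1, RCV2, RCV3, blocks A, B, C, and D (execution times 30, 10, 20, 20), and two send nodes SND1 and SND2. Critical path lengths: D1,1 = 30, D1,2 = 60, D2,1 = 10, D2,2 = 40, D3,2 = 50. On the time axis, RCV1, RCV3, RCV2, and SND1 lie at 60, 50, 40, and 30 before SND2 (at 0). The cascaded counter (counts 10, 10, 40, with the valid signals v of RCV1, RCV3, and RCV2 and zero signals z) generates the SND2 signal, and a comparator (compare: 30) generates the SND1 signal. The outputs are the enable signal for the send buffer and delay register update and the clear signal for the valid and zero registers.]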

Delay elements

Delay elements may exist in the graph; they correspond to data registers in the hardware implementation.

[Figure: a graph with blocks A, B, C, and D in which the delay elements are implemented as delay registers. The central controller, driven by clock and reset, provides the enable signal of each delay register.]

With Delay Registers

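[Figure: a graph RCV1, A, B, C, SND1 with execution times 10, 20, and 30 and a delay element D1. On the time axis, RCV1 lies at 40 and D1 at 20 before SND1 (at 0). A counter with count 40 is started by the valid signal (v) of RCV1; its zero signal (z) generates SND1, and a comparator (compare: 20) generates the enable signal of delay register D1.]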

Nodes with Variable Execution Time

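• The cascaded counter controller provides a clean solution for a node with variable execution time.

[Figure: in the graph A, B, C, node B is an asynchronous node whose execution takes a non-deterministic number of time units. The graph is modified to A, SND, B, RCV, C, so that B is treated like an external module; B is driven by start and clock and reports completion with done.]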

Equivalent FSM Controller Implementation

[Figure: the central controller is an FSM with inputs rst, clk, and the done or rcv signals, which act as check points; its outputs are CounterValue, IterationBound, the start signals, and the RegisterEnable signals.]

Currently, an FSM controller equivalent to the cascaded counter is implemented in the VHDL domain of PeaCE.

This implementation uses only one incrementing counter with multiple check logics.

Equivalent FSM Controller Implementation

Example code:

process (rst, clk)
begin
  if rst = '1' then
    Counter <= 0;
  elsif rising_edge(clk) then
    if Counter = CheckValue0 and CheckSig(0) = '0' then
      Counter <= Counter;      -- hold value until check signal 0 arrives
    elsif Counter = CheckValue1 and CheckSig(1) = '0' then
      Counter <= Counter;      -- hold value until check signal 1 arrives
    elsif Counter = LastValue then
      Counter <= 0;            -- initialize for the next iteration
    else
      Counter <= Counter + 1;  -- counting
    end if;
  end if;
end process;

IterationBound <= '1' when Counter = LastValue else '0';
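The hold-at-check-point behavior of the single counter can be traced with a small cycle-by-cycle model in Python. This is a simplified sketch, not the generated controller: the function name `run_counter` and the parameter `check_arrivals` are hypothetical, and each check signal is assumed to stay at '1' from its arrival cycle onward.

```python
def run_counter(check_values, check_arrivals, last_value, cycles):
    """Cycle-by-cycle model of one up-counter with hold check points.

    check_values[k]   : counter value at which check signal k is required
    check_arrivals[k] : clock cycle from which check signal k reads '1'
                        (assumed to stay high afterwards)
    Returns the counter value observed in each simulated clock cycle.
    """
    counter, trace = 0, []
    for cycle in range(cycles):
        trace.append(counter)
        held = any(counter == value and cycle < arrival
                   for value, arrival in zip(check_values, check_arrivals))
        if held:
            continue          # hold value: the awaited signal is still '0'
        if counter == last_value:
            counter = 0       # iteration bound reached, start over
        else:
            counter += 1      # counting
    return trace


# One check point at counter value 2; its signal arrives in cycle 5.
print(run_counter([2], [5], last_value=4, cycles=9))
# [0, 1, 2, 2, 2, 2, 3, 4, 0]
```

The counter stalls at the check value until the done or rcv signal arrives, which is how a single counter can absorb a non-deterministic delay without a separate counter per receive node.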

Looping Control

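• PeaCE supports controller generation for looped schedules.
• Looped schedules can be structured hierarchically.

[Figure: A produces 4 samples per firing and B consumes 1, so B fires four times for each firing of A. Instead of four separate B blocks, the four firings are folded into LOOP 4 and executed one after another on a single B block whose input is selected by a MUX.]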

Looping Control

[Figure: A is followed by four sequential firings of B on one B block with a MUX on its input. Control flows from the top-level counter (CounterValue) to the counter of Loop1 (Loop1_Counter).]

A_start <= '1' when CounterValue = 0 else '0';
B_start <= '1' when Loop1_Counter = 0 and Loop1_busy = '1' else '0';
Loop1_start <= '1' when CounterValue = 20 else '0';

Buffer Management

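• Multi-rate buffering
• Data types
  - Int, Macroblock (16x16), Frame (176x144)
• Register: small data types
  - I/O timing control
  - Buffer allocation
• Memory: large data types
  - Memory access logic
  - Synchronization
  - Memory allocation

[Figure: a multi-rate edge between A and B with rates 2 and 3, buffered in registers that are selected through a MUX, and a memory-based connection between A and B with memory synchronization.]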

Resource Management

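• Resource sharing or multiple instantiation
• Input multiplexing and output buffer access

[Figure: several multiplier nodes (X) of the graph are mapped onto one shared multiplier. MUXes select its inputs under the mux select signal, and the output buffer latch signal stores each result in the corresponding output buffer.]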

Contents

• Introduction
• Previous works and our contributions
• Block Definition
• Controller Synthesis
• Schedule-Based Design
  - Motivation
  - Schedule information & controller generation
  - Experiments
• FRDF Specification for more efficient HW implementation
• Conclusions & Future Directions

Schedule Information

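Node : start_time  exe_time
B    : 0           10
C    : 0           8
D    : 10          5

[Figure: schedule-based HW synthesis. The DFG with nodes B, C, and D, the schedule information above, and the block libraries for B, C, and D are combined into hardware in which a controller drives B, C, and D under clk and rst.]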

Previous Works

• 2-dimensional DCT algorithm

[Figure: dataflow graph built from two Transpose 8x8 blocks and two DCT1D blocks; the edge sample rates are 64 and 8.]

• Hardware architecture generated by Ptolemy

[Figure: two banks of 8 one-dimensional DCT blocks and two 8x8 matrix transposes; the signal widths shown are 64 16-bit inputs and 8 16-bit signals.]

Previous Works

• Hardware architecture generated by GRAPE

[Figure: two DCT 1D blocks, each with its own control logic (ctrl), connected through a FIFO with 64 buffers using the wr, wr_ok, rd, and rd_ok handshake signals.]

• Hardware architecture generated by Meyr's approach

[Figure: two DCT 1D blocks with MUXes on their inputs, driven by one controller.]

Motivation

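• Previous works assume a single execution schedule for the HW implementation.
• The proposed approach lets the designer provide the execution schedule, so a multi-rate dataflow graph can be implemented as many different hardware architectures.

[Figure: a multi-rate dataflow graph in which B produces 4 samples and C consumes 1, so C fires four times per firing of B. Three implementations are shown on axes of time and hardware resources: fully sequential (one C block used four times in sequence), fully parallel (four C blocks used at the same time), and hybrid (two C blocks, each used twice).]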

Motivation

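• Sharing

[Figure: one DCT 1D block with a MUX between input and output, shared by both DCT1D nodes under a controller.]

• Multiple instantiation

[Figure: four DCT 1D blocks with MUXes on their inputs and outputs.]

[Figure: the source dataflow graph with two Transpose 8x8 blocks and two DCT1D blocks; the edge sample rates are 64 and 8.]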

Schedule Information 1

# resource allocation table
Transpose 2
DCT1D 2

# resource mapping & schedule information
# (instance name, resource number, start, duration)
# loop (loop count, start, loop period)
Transpose_0 0 0 1
Loop 8 1 2 {
  DCT1D_0 0 0 2
}
Transpose_1 1 17 1
Loop 8 18 2 {
  DCT1D_1 1 0 2
}

[Figure: the dataflow graph (Transpose 8x8, DCT1D, Transpose 8x8, DCT1D; edge rates 64 and 8) and the resulting hardware with two DCT 1D blocks, MUXes, and a controller. DCT1D_0 is mapped to resource 0 and DCT1D_1 to resource 1.]

1-to-1 mapping of graph node ↔ HW resource
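To see what such a schedule means in terms of individual firings, it can be unrolled in Python. The tuple encoding and the function name `expand` are hypothetical and only mirror the fields named in the comments of the schedule (instance name, resource number, start, duration; loop count, start, loop period). The sketch assumes that start times inside a loop body are relative to the start of each iteration.

```python
def expand(entries):
    """Flatten a schedule into (instance, resource, start, end) firings.

    An entry is either ("node", name, resource, start, duration) or
    ("loop", count, start, period, body), where body is a list of entries
    whose start times are relative to the current loop iteration.
    """
    firings = []
    for entry in entries:
        if entry[0] == "node":
            _, name, res, start, dur = entry
            firings.append((name, res, start, start + dur))
        else:
            _, count, loop_start, period, body = entry
            for k in range(count):
                base = loop_start + k * period
                for name, res, start, end in expand(body):
                    firings.append((name, res, base + start, base + end))
    return firings


# Schedule with one DCT1D resource per graph node (1-to-1 mapping).
schedule_1 = [
    ("node", "Transpose_0", 0, 0, 1),
    ("loop", 8, 1, 2, [("node", "DCT1D_0", 0, 0, 2)]),
    ("node", "Transpose_1", 1, 17, 1),
    ("loop", 8, 18, 2, [("node", "DCT1D_1", 1, 0, 2)]),
]

firings = expand(schedule_1)
for firing in firings[:3]:
    print(firing)
print(len(firings), "firings, last one ends at cycle", firings[-1][3])
```

With these numbers the eight DCT1D_0 firings occupy cycles 1 to 17, which is consistent with Transpose_1 starting at cycle 17.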

Schedule Information 2 : Sharing

# resource allocation table
Transpose 2
DCT1D 1

# resource mapping & schedule information
# (instance name, resource number, start, duration)
# loop (loop count, start, loop period)
Transpose_0 0 0 1
Loop 8 1 2 {
  DCT1D_0 0 0 2
}
Transpose_1 1 17 1
Loop 8 18 2 {
  DCT1D_1 0 0 2
}

[Figure: the dataflow graph (Transpose 8x8, DCT1D, Transpose 8x8, DCT1D; edge rates 64 and 8) and the resulting hardware with a single DCT 1D block and a MUX between input and output. Both DCT1D_0 and DCT1D_1 are mapped to resource 0.]

N-to-1 mapping of graph node ↔ HW resource
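Sharing is only valid if two firings never occupy the same resource at the same time. The Python sketch below unrolls the sharing schedule by hand and checks this. The function name `conflicts` and the tuple layout are hypothetical, and resources are assumed to be numbered separately per block type, as the resource allocation table suggests.

```python
from collections import defaultdict

# Firings of the sharing schedule as (block type, resource, start, end),
# with end = start + duration. "Loop 8 1 2" and "Loop 8 18 2" are unrolled.
firings = [("Transpose", 0, 0, 1), ("Transpose", 1, 17, 18)]
firings += [("DCT1D", 0, 1 + 2 * k, 3 + 2 * k) for k in range(8)]    # DCT1D_0
firings += [("DCT1D", 0, 18 + 2 * k, 20 + 2 * k) for k in range(8)]  # DCT1D_1


def conflicts(firings):
    """Return firings that start while their resource is still busy."""
    by_resource = defaultdict(list)
    for kind, res, start, end in firings:
        by_resource[(kind, res)].append((start, end))

    clashes = []
    for key, spans in by_resource.items():
        busy_until = None
        for start, end in sorted(spans):
            if busy_until is not None and start < busy_until:
                clashes.append((key, start, end))
            busy_until = end if busy_until is None else max(busy_until, end)
    return clashes


print(conflicts(firings))  # [] : the single DCT1D resource is never double-booked
```

Because the second loop starts at cycle 18, after the first loop has released the DCT1D block at cycle 17, one resource can serve both graph nodes.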

Schedule Information 3 : Multiple Instantiation

# resource allocation table
Transpose 2
DCT1D 4

# resource mapping & schedule information
Transpose_0 0 0 1
Loop 4 1 2 {
  DCT1D_0 0 0 2
  DCT1D_0 1 0 2
}
Transpose_1 1 9 1
Loop 4 10 2 {
  DCT1D_1 2 0 2
  DCT1D_1 3 0 2
}

[Figure: the dataflow graph (Transpose 8x8, DCT1D, Transpose 8x8, DCT1D; edge rates 64 and 8) and the resulting hardware with four DCT 1D blocks, MUXes, and a controller. DCT1D_0 is mapped to resources 0 and 1, and DCT1D_1 to resources 2 and 3.]

1-to-N mapping of graph node ↔ HW resource

HW Controller Generation

• Counter-based controller
  - Buffer control, MUX control, and the start and done signals of each block

[Figure: the Loop1 counter and the Loop1 iteration number feed a buffer controller and a MUX controller. The MUX controller drives DCT1D_res0_sel, which selects DCT1D_res0_input for the DCT 1D block; the buffer controller drives DCT1D_0_output_0_en and DCT1D_0_output_1_en.]

Experiment 1: 2-dimensional DCT Algorithm

[Figure: area (gates, 0 to 300000) versus 1/throughput (ns/sample, 0 to 800) for the Ptolemy, GRAPE, Auto, and Manual designs. Labeled design points: 16 IDCT resources, 4 IDCT resources, 2 IDCT resources, and 1 IDCT resource (sharing).]

Contents

• Introduction
• Previous works and our contributions
• Block Definition
• Controller Synthesis
• Schedule-Based Design
• FRDF Specification for more efficient HW implementation
  - FRDF model
  - Examples & Experiments
• Conclusions & Future Directions

Fractional Rate Dataflow Specification

• A gap between automatic and manual design still exists.
  - The automatic design cannot be optimized further because of the dataflow semantics.
  - Dataflow semantics impose stricter firing rules, which require more buffers to satisfy the firing condition.
  - A real design has more freedom of implementation and can therefore be more efficient.
• Buffer requirements must be reduced for a practical, efficient design.
  - We choose FRDF, in which a fractional number of data samples can be produced or consumed.
  - FRDF brings the automatic design a little closer to the manual design.

Fractional Rate Dataflow Specification

• Every block with a multi-rate specification has an equivalent block with an FRDF specification.
  - The two blocks are functionally equivalent.
  - The internal algorithm and its schedule can be different.

Add4 with rates 4 and 1 (multi-rate specification):
  - Consumes 4 input data samples in one invocation (or firing)
  - Produces 1 output data sample
  - Requires 4 input buffers

Add4 with rates 1 and 1/4 (FRDF specification):
  - Consumes 1 input data sample in one invocation
  - Produces only 1 output data sample during 4 invocations
  - Requires only 1 input buffer
  - Requires 4 invocations to perform the entire function

Fractional Rate Dataflow(FRDF)

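• Non-FRDF implementation
  - The "Add4" block is invoked after all of its inputs are valid.
  - Parallel I/O

[Figure: graph Ramp, Add4, Sink, in which Ramp produces 1 sample, Add4 consumes 4 and produces 1, and Sink consumes 1. Ramp runs in LOOP 4, so the schedule over time is Ramp, Ramp, Ramp, Ramp, Add4, Sink. Add4 is combinational logic.]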

Fractional Rate Dataflow(FRDF)

• FRDF implementation
  - The execution of the "Add4" block is divided into 4 phases.
  - Serial I/O at each phase

[Figure: graph Ramp, Add4, Sink, in which Ramp produces 1 sample, Add4 consumes 1 and produces 1/4, and Sink consumes 1. Ramp and Add4 run together in LOOP 4, so the schedule over time is Ramp, Add4 (phase0), Ramp, Add4 (phase1), Ramp, Add4 (phase2), Ramp, Add4 (phase3), Sink. Add4 is sequential logic with internal state: sum and phase.]
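The difference between the two Add4 variants can be illustrated with a short Python sketch. The names `add4_sdf` and `Add4FRDF` are hypothetical, and the code models only the firing behavior (samples consumed and produced per invocation), not the generated hardware.

```python
def add4_sdf(samples):
    """Multi-rate firing: consumes 4 samples at once, produces 1 output."""
    if len(samples) != 4:
        raise ValueError("Add4 needs exactly 4 buffered input samples")
    return sum(samples)


class Add4FRDF:
    """FRDF-style firing: consumes 1 sample per invocation.

    The internal state (sum and phase) replaces the 4-entry input buffer.
    An output appears only on every 4th invocation (output rate 1/4).
    """

    def __init__(self):
        self.total = 0
        self.phase = 0

    def fire(self, sample):
        self.total += sample
        self.phase += 1
        if self.phase < 4:
            return None                      # phases 0 to 2: no output yet
        result, self.total, self.phase = self.total, 0, 0
        return result                        # phase 3: emit the sum, reset


ramp = [0, 1, 2, 3, 4, 5, 6, 7]
block = Add4FRDF()
frdf_out = [y for y in map(block.fire, ramp) if y is not None]
sdf_out = [add4_sdf(ramp[i:i + 4]) for i in range(0, len(ramp), 4)]
print(frdf_out, sdf_out)  # [6, 22] [6, 22]
```

Both variants produce the same outputs, but the FRDF variant holds one running sum instead of four buffered inputs, which is the buffer saving the slides refer to.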

Experiment 2: Parts of H.263 Decoder

No Resource Sharing

[Figure: H.263 decoder subgraph with separate YBlock, UBlock, and VBlock paths, each consisting of DeQ, IZ, IDCT, Skip, and Mux, followed by Motion Compensation with inputs dx, dy, mode, and QP (through Repeat blocks) and the previous Y, U, V frame. Edge data types: 16-bit 8x8 block, 8-bit integer, and FRAME. Most edge rates are 1 or 4, and several edges carry the fractional rate 1/99.]

Core: 282380, Buffer: 172032, Glue logic: 52575
Total Area: 506,987 gates

Experiment 2: Parts of H.263 Decoder

Maximum Resource Sharing

[Figure: the same H.263 decoder subgraph with separate YBlock, UBlock, and VBlock paths (DeQ, IZ, IDCT, Skip, Mux), Motion Compensation with inputs dx, dy, mode, and QP (through Repeat blocks), and the previous Y, U, V frame. Edge data types: 16-bit 8x8 block, 8-bit integer, and FRAME; several edges carry the fractional rate 1/99.]

Core: 161164, Buffer: 172032, Glue logic: 66304

Experiment 2: Parts of H.263 Decoder

• More fractional rate specification
• Separate Y, U, and V data paths are merged
• MC block is divided into several small blocks for FRDF

[Figure: merged data path with DeQ, IZ, IDCT, Skip, Mux, Read Prev Block & Half Pixel, Truncation & ADD, Saturation, Mux, and WriteBlock, plus an SRAM; the inputs are dx, dy, mode, and CBP. Edge data types: 16-bit 8x8 block, 8-bit 8x8 block, 8-bit integer, and FRAME. Fractional rates 1/6 and 1/(6x99) appear on the control inputs.]

Core: 89033, Buffer: 65536, Glue logic: 22574
Total Area: 177,143 gates

Experiment 2: Parts of H.263 Decoder

Area in gates:

Configuration    | Core   | Buffer | Glue logic | Total
original         | 282380 | 172032 | 52575      | 506987
original_shared  | 161164 | 172032 | 66304      | 399500
advanced         | 89033  | 65536  | 22574      | 177143
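The totals in the chart can be cross-checked, and the savings relative to the original design computed, with a few lines of Python. The dictionary layout and variable names are only for illustration; the gate counts are the ones read from the chart.

```python
# Gate counts (core, buffer, glue logic) as read from the chart.
areas = {
    "original":        (282380, 172032, 52575),
    "original_shared": (161164, 172032, 66304),
    "advanced":        (89033, 65536, 22574),
}

totals = {name: sum(parts) for name, parts in areas.items()}
baseline = totals["original"]

for name, total in totals.items():
    saving = 100.0 * (baseline - total) / baseline
    print(f"{name:16s} {total:7d} gates  {saving:5.1f}% smaller than original")
```

The computed totals are 506987, 399500, and 177143 gates. Resource sharing alone reduces only the core area, while the FRDF-based design also shrinks the buffers, which is where most of the additional saving comes from.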

Contents

• Introduction
• Previous works and our contributions
• Block Definition
• Controller Synthesis
• Schedule-Based Design
• FRDF Specification for more efficient HW implementation
• Conclusions & Future Directions

Conclusions

• Synthesize efficient hardware from a dataflow specification.
  - We use SDF and its extension, FRDF.
• The main goals of our research
  - Overcoming the limitations of previous approaches
      Handling non-deterministic I/O timing
      Schedule-based design: resource sharing, looping control
  - Efficient hardware synthesis applicable to practical HW design
      Supporting the FRDF specification
• All of these techniques are implemented in the VHDL domain of the PeaCE codesign environment and verified with two examples: DCT and an H.263 decoder.

Future work

• Extension of expressiveness
  - Piggybacked dataflow
  - Dynamic constructs: for (data-dependent iteration), case (if-then-else, conditional execution)
• Support of legacy HW platforms
  - Support legacy HW IP ← SW code generation
  - Support various types of bus and memory interfaces: local SRAM, dual-port memory, shared memory
  - Current: shared memory through the AMBA interface
• Optimization issues
  - Buffer elimination
  - Buffer sharing
  - FRDF
