CAP Laboratory, SNU 1
Schedule
1. Introduction
2. System Modeling Language: System C *
3. HW/SW Cosimulation *
4. C-based Design *
5. Data-flow Model and SW Synthesis
6. HW and Interface Synthesis (Midterm)
7. Models of Computation
8. Model based Design of Embedded SW
9. Design Space Exploration (Final Exam)
(Term Project)
Reference
PeaCE Approach
• Hyunuk Jung, Kangnyoung Lee, and Soonhoi Ha, "Efficient Hardware Controller Synthesis for Synchronous Data Flow in System Level Design," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 10, pp. 423-428, August 2002.
• Hyunuk Jung, Hoeseok Yang, and Soonhoi Ha, "Optimized RTL Code Generation from Coarse-Grain Dataflow Specification for Fast HW/SW Cosynthesis," Journal of VLSI Signal Processing (published online), 10 June 2007.
Contents
Introduction
• Higher-level design from dataflow model
• PeaCE design flow & cosynthesis
Previous works
Block Definition
Controller Synthesis
Schedule-Based Design
FRDF Specification for more efficient HW implementation
Conclusions
IEEE Transaction on VLSI Systems Vol.10, August 2002
J. VLSI Signal Processing, 2007
Hardware Synthesis Problem
Automatic Hardware Synthesis from Coarse-Grained Dataflow Specification in System Level Design
• A node represents a coarse-grain computation block such as an FIR filter or a DCT.
• A node has complex properties such as data sample rates, I/O timings, data types, and internal states.
• A central controller should be generated automatically to control these complex coarse-grain HW library blocks and registers.
[Figure: a dataflow graph (DFG) with nodes B, C, D and the block libraries for B, C, D are combined into a hardware implementation in which blocks B, C, D are driven by a central controller with clk and rst.]
Design Size & Abstraction Level
1970s, several hundred transistors
• Transistor and gate level design
1980s
• Register Transfer Level (RTL) design
• Hardware Description Language (HDL)
1990s
• High-level synthesis (or behavioral synthesis)
• Behavioral HDL, imperative programming languages (C, C++)
Today, several million gates
• Exponential increase in transistor density
• HW/SW Co-design (System Level Design)
• We need higher-level tools for system design and hardware description
Conventional High-level Synthesis
Hardware synthesis
• Conventional architecture and logic synthesis techniques are used
[Figure: design flow with the stages functional spec, HDL coding, simulation, architecture synthesis, logic synthesis, layout synthesis, back annotation, and layout, spanning the behavioral, register-transfer, and gate levels. One stage is marked "Focus!".]
Architecture Synthesis Problem
Subproblems: scheduling, resource binding, data path synthesis, control logic design
[Figure: architecture synthesis takes a behavioral model (CDFG), constraints (timing, area, performance, resource binding), and resources (library + module generator; primitives with given area and delay) and produces an RTL description with estimated area and delay.]
New Trend: C-based design
From C/C++ to Hardware
• Mentor Graphics (www.mentor.com): "Catapult C Synthesis"
• Forte Design Systems (www.forteds.com): "Cynthesizer"
• Synfora (www.synfora.com): "PICO Express"
• Y Explorations Inc. (www.yxi.com): "eXCite"
• Celoxica (www.celoxica.com): "Agility Compiler"
− RTL level compiler
System-level HW Synthesis
Higher Level Hardware Synthesis
• Increasing need for a design methodology at a higher abstraction level
• Growing complexity, fast design turn-around time
• Easy to modify and maintain
Automatic code generation from data flow graph
• SDF semantics should be preserved - "refinement"
• The kernel code of a block is already optimized in the library.
• Determine the schedule and resource allocation.
• The controller is generated according to the scheduled sequence and the resource mapping.
Fundamental Question
• Can we generate HDL code whose synthesized area and performance are comparable to those of manually optimized code?
Hardware Synthesis Strategies
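[Figure: synthesis paths starting from the partitioned graph: behavioral-level HDL followed by behavioral-level (high-level) synthesis; RT-level HDL followed by logic synthesis; synthesizable C followed by C-to-HDL/HW; cycle-based C followed by C-to-HW. Paths are labeled "Current PeaCE" and "Future".]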
System Design Flow in PeaCE
[Figure: PeaCE design flow. The dataflow specification, the architecture specification, and the node-PE performance DB feed partitioning/scheduling, which yields the SW subgraph and SW schedule, the HW subgraph and HW schedule info, and the HW/SW interface. C code generation (SW) and VHDL code generation (HW) then produce the C code and VHDL code used for cosimulation/cosynthesis.]
Hardware Synthesis for HW/SW Cosynthesis
• After HW/SW partitioning of an initial dataflow specification, a partitioned subgraph mapped to hardware is automatically generated.
• A partitioned subgraph has interfacing blocks such as SND (send) and RCV (receive) blocks for communication.
− These interfacing blocks have internal buffers and shared memory access logic.
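[Figure: an initial dataflow graph with nodes A, B, C, D, E (multi-rate edges with rates 1 and 4). The subgraph B, C that is mapped to hardware is shown on the hardware side wrapped with the interface blocks RCV1 and SND1.]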
HW/SW Cosynthesis
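[Figure: cosynthesis target. A microprocessor, memory, and the synthesized HW (VHDL domain, behind a wrapper) are attached to an AHB bus. Send (S) and receive (R) blocks connect HW and SW parts, with an OS/device driver on the SW side.]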
Contents
Introduction
Previous works
• GRAPE, Meyr's work, Ptolemy, and PeaCE
Block Definition
Controller Synthesis
Schedule-Based Design
FRDF Specification for more efficient HW implementation
Conclusions & Future Directions
GRAPE Approach
GRAPE (Graphical Rapid Prototyping Environment) is a HW/SW codesign environment for the functional emulation of DSP systems.
Using cyclo-static dataflow specification
• Sample rates are changed periodically
Distributed controls
• Using hand-shaking protocol between blocks
• FIFO buffers
• No central controller
Generated hardware implementation has one-to-one correspondence to dataflow specification
• Simple architecture generation
GRAPE Standard Interface
[Figure: GRAPE standard interface. Receive nodes (R), buffers (B), and user task nodes (U) are connected through data/strb/rdy, data/wr/wr_ok, and data/rd/rd_ok signals. Example network with R1, R2, S1, B1 to B6, and U1 to U4.]
Asynchronous communication between all blocks uses a hand-shaking protocol.
Meyr’s Approach
Using Synchronous Dataflow
Serialized I/O of multi-rate port
Fully static I/O timing analysis
[Figure: a synchronous dataflow graph with blocks M1, M2, M3 between IN and OUT (signals SIG1, SIG2, SIG3) and its RTL target architecture with interfaces IF1, IF2, IF3, pattern adjustment, initial values, shimming registers, a stall generator, a reset generator, and a clock.]
Ptolemy Approach: VHDL domain
Sequential VHDL code generation
• For simulation
• Entire application is described in a single process using only variables.
Structural VHDL code generation
• For synthesis
• Individual firings (or invocations) of a node are instantiated in separate hardware resources
− Fully parallel HW architecture for multi-rate specification
[Figure: A produces 4 samples per firing and B consumes 1; the structural implementation instantiates a separate B block for each firing of B.]
Limitations of Previous Approaches
GRAPE
• Distributed control using handshaking protocol
• Synthesis problem is simple
• Only for rapid prototyping
Meyr's
• Supports only static I/O timings
• Resource sharing is not considered
Ptolemy
• Impractically large area overhead in case of multi-rate specification
Comparison among Approaches
Approach                            Ptolemy            Meyr's           GRAPE                PeaCE
Implementation of multi-rate spec.  Parallel           Sequential       Sequential           Parallel/Sequential/Hybrid
Resource allocation                 Multiple-resource  Single-resource  Single-resource      Multiple/Single/Shared-resource
Inter-block communication           Synchronous        Synchronous      Asynchronous (FIFO)  Synchronous
Block control                       Centralized        Centralized      Distributed          Centralized
Block exec. time                    Fixed              Fixed            Variable             Variable
Contents
Introduction
Previous works and our contributions
Block Definition
• Block I/O Model & Block Types
Controller Synthesis
Schedule-Based Design
FRDF Specification for more efficient HW implementation
Conclusions & Future Directions
Block Libraries
Types of block implementations
• A : Combinational logic
• B : Single-cycle sequential logic
• C : Multi-cycle sequential logic with fixed execution time
• D : Multi-cycle sequential logic with variable execution time
Timing model of HW block
• Execution time of block
• I/O of multi-rate block
Block Types & Control Signals
Type A : combinational logic
Type B : single-cycle sequential logic
Type C : multi-cycle sequential logic with fixed execution time
Type D : multi-cycle sequential logic with variable execution time
[Figure: a chain RCV, A, B, C, D, SND with one block of each type. Control signals shown: state_update, start, done, clock, reset, and the enables en_a, en_b, en_c, en_d.]
Type A : Combinational logic
[Figure: an adder block with inputs A and B and output C; the execution time runs from the inputs to the output.]
Execution Time = propagation delay / clock period (cycles)
VHDL Star with States
Type B : Single-cycle sequential logic
This logic is separated into combinational logic and state, and is implemented as a Mealy machine.
[Figure: an accumulator (input A, output C) built as a Mealy-type state machine: an adder plus a state register driven by the state update signal.]
Execution Time = propagation delay / clock period (cycles)
Type C : Multi-cycle sequential logic with fixed execution time
[Figure: an FIR filter (input A, output C) as multi-cycle logic with clock, reset, and start inputs, and an output register latched by the output update signal.]
The number of cycles is fixed.
Clock and reset signals are needed.
The controller should provide the start and output update signals.
Execution time = specified number of cycles
Type D : Multi-cycle sequential logic with variable execution time
[Figure: a divider (input A, output C) as multi-cycle logic with clock, reset, and start inputs, a done output, and an output register latched by the output update signal.]
The number of cycles varies at run time.
Clock and reset signals are needed.
The controller should provide the start and output update signals.
The library block should generate a done signal, which the controller uses to determine when the block finishes.
Execution time = specified number of cycles
Timing Model of HW Library Block
[Figure: a HW block with inputs and outputs; the execution time spans from the inputs to the outputs.]
Strict Execution
• A block can start its execution after all its inputs are valid and finish its execution after all its outputs are valid.
Timing Model of HW Library Block
Start time = 3, End time = 8
Execution time = End time – Start time + 1 = 6(cycles)
[Figure: timing diagram over clock cycles 2 to 10 showing the clock, start signal, counter, input valid timing, output latch signal, output valid timing, and the execution time interval.]
Timing Model of Multi-rate Block
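Only parallel I/O of a multi-rate block is supported.
An FRDF implementation makes it possible to serialize the I/O operations.
[Figure: block A with input rate 1 and output rate 2. Serial I/O: input I and output O alternate over time. Parallel I/O: one input I, then outputs O1 and O2 produced together. A second block A is annotated with rates 1/2 and 1.]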
Contents
Introduction
Previous works and our contributions
Block Definition
Controller Synthesis
• Interface Problem
• Cascaded counter controller
• Looping control & buffer management
Schedule-Based Design
FRDF Specification for more efficient HW implementation
Conclusions & Future Directions
Controller Synthesis
Issue 1: Solving non-deterministic timing of I/O
• Communication with the outside of the hardware module
− HW/SW interface
• Communication between blocks inside the hardware module
− Blocks with variable execution time
Issue 2: Supporting various schedules
• Looped scheduling
• Resource sharing
• Buffer management
Communication between Modules
The types of communication schemes
• Synchronous communication
− Communication timing is predetermined.
− Drawback: tasks must be scheduled assuming the worst-case execution time.
• Batch communication
− Synchronous communication can be emulated with buffers in an asynchronous interface.
− Drawback: it cannot be applied to a DFG with global feedback.
• Asynchronous communication
− Communication timing varies at run time.
• In many cases, asynchronous communication is an efficient solution or the only one.
• Previous approaches, except GRAPE, do not consider asynchronous communication with the outside.
Basic Idea
Counter-based solution
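• Simple and intuitive
[Figure: a receive node RCV feeds blocks A, B, C, D (delays 10, 30, 20, 20; A is combinational logic) and a send node SND. A counter (count: 60) is enabled by the valid (v) signal of the receive node; its zero (z) output is the enable signal of the send buffer and the state register.]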
Multiple RCV nodes
Main goal
• Obtain the earliest time at which the output is ready, regardless of the order in which the inputs arrive
Valid timing equation of send node
[Figure: receive nodes RCV1, RCV2, RCV3 feed blocks A, B, C, D (delays 30, 10, 20, 20) and a send node SND, with D1 = 60, D2 = 40, D3 = 50.]
$$VT = \max_i \left( RT_i + D_i \right)$$
VT : valid timing
RT_i : receive timing of the i-th receive node
D_i : critical path length from the i-th receive node
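The valid timing equation can be checked numerically. The following is a minimal sketch in Python (a behavioral model, not the VHDL generated by PeaCE); the function name `valid_time` and the example arrival times are assumptions made for illustration, while the path lengths D1 = 60, D2 = 40, D3 = 50 come from the slide.

```python
def valid_time(receive_times, path_delays):
    """Return VT = max_i (RT_i + D_i) for one send node.

    receive_times[i] is RT_i, the cycle at which the i-th receive node
    becomes valid; path_delays[i] is D_i, the critical path length from
    that receive node to the send node.
    """
    if len(receive_times) != len(path_delays):
        raise ValueError("one delay is required per receive node")
    return max(rt + d for rt, d in zip(receive_times, path_delays))


D = [60, 40, 50]  # D1, D2, D3 from the slide

# All inputs arrive at cycle 0: the longest path (D1) dominates.
print(valid_time([0, 0, 0], D))   # 60

# Hypothetical arrival times: RCV2 is late, so its path now dominates.
print(valid_time([0, 30, 5], D))  # 70
```

The second call shows why a fixed worst-case schedule is not enough: the dominating path depends on the arrival order, which is only known at run time.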
Cascaded Counter: Idea
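[Figure: the same graph (RCV1, RCV2, RCV3, blocks A, B, C, D with delays 30, 10, 20, 20, and SND). Critical path length computation: D1 = 60, D2 = 40, D3 = 50. The time axis orders the nodes RCV1, RCV3, RCV2, SND at remaining path lengths 60, 50, 40, 0. The cascaded counter chains three counters (count: 10, count: 10, count: 40), each gated by the valid (v) signal of RCV1, RCV3, and RCV2 respectively and by the zero (z) signal of the previous counter. The final zero signal is the enable signal for the send buffer and delay register update, and the clear signal for the valid and zero registers.]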
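The cascaded counter idea can be modeled in a few lines. This is a hedged Python sketch, assuming each stage starts counting once the previous stage has reached zero and its own receive node is valid; the function name `cascaded_counter_finish` and the second set of arrival times are hypothetical. The stage counts it derives (10, 10, 40) match the counters shown on the slide.

```python
def cascaded_counter_finish(receive_times, path_delays):
    """Model a cascaded counter for one send node.

    Receive nodes are ordered by decreasing critical path length.
    Each stage counts the difference to the next shorter path, and the
    last stage counts the shortest path itself.
    Returns (stage_counts, time_at_which_the_send_node_is_enabled).
    """
    order = sorted(range(len(path_delays)), key=lambda i: -path_delays[i])
    delays = [path_delays[i] for i in order]
    counts = [delays[k] - delays[k + 1] for k in range(len(delays) - 1)]
    counts.append(delays[-1])

    t = 0  # time at which the previous stage reached zero
    for node, count in zip(order, counts):
        start = max(t, receive_times[node])  # wait for zero AND valid
        t = start + count
    return counts, t


counts, t = cascaded_counter_finish([0, 0, 0], [60, 40, 50])
print(counts, t)  # [10, 10, 40] 60

# With a late RCV2 the result still equals max_i (RT_i + D_i) = 70.
print(cascaded_counter_finish([0, 30, 5], [60, 40, 50])[1])  # 70
```

Under these assumptions the finish time of the last stage equals the valid timing max over i of (RT_i + D_i), whatever the arrival order, which is the property the controller needs.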
With Multiple Send Nodes
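[Figure: the same graph with two send nodes, SND1 and SND2. Critical path lengths: D1,1 = 30, D1,2 = 60, D2,1 = 10, D2,2 = 40, D3,2 = 50. The time axis orders RCV1, RCV3, RCV2, SND1, SND2 with marks at 60, 50, 40, 30, 0. The cascaded counter (count: 10, count: 10, count: 40), gated by the valid (v) and zero (z) signals of RCV1, RCV3, and RCV2, produces the SND2 enable; a comparator (compare: 30) produces the SND1 enable. The enable signal drives the send buffer and delay register update, and clears the valid and zero registers.]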
Delay elements
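Delay elements may exist; they correspond to data registers in the hardware implementation.
[Figure: a dataflow graph with blocks A, B, C, D and a delay element, and its implementation with a delay register whose enable signal comes from the central controller (clock, reset).]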
With Delay Registers
With Delay Registers
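[Figure: a chain RCV1, A, B, C, SND1 (delays 10, 20, 30) with a delay element D1. The time axis shows RCV1, SND1, and D1 with marks at 40, 20, 0. A counter (count: 40) is gated by the valid (v) and zero (z) signals of RCV1, and a comparator (compare: 20) is attached to it; the outputs are the SND1 enable and the enable signal of delay register D1.]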
Nodes with Variable Execution Time
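The cascaded counter controller provides a clean solution for this kind of node.
[Figure: in the chain A, B, C, B is an asynchronous node that takes a non-deterministic number of time units to execute. The graph is modified around B using send and receive nodes (labeled SNDB and RCVC), with start, done, and clock signals.]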
Equivalent FSM Controller Implementation
[Figure: the central controller implemented as an FSM. Inputs: rst, clk, and done or rcv signals (check points). Internal values: CounterValue, IterationBound. Outputs: start signals and RegisterEnable signals.]
Currently, we implement an FSM controller equivalent to the cascaded counter in the VHDL domain of PeaCE.
This implementation uses only one incrementing counter with multiple check logic blocks.
Equivalent FSM Controller Implementation
if rst = '1' then
  Counter <= 0;
elsif rising_edge(clk) then
  if Counter = CheckValue0 and CheckSig(0) = '0' then
    Counter <= Counter;      -- hold value until check signal 0 is set
  elsif Counter = CheckValue1 and CheckSig(1) = '0' then
    Counter <= Counter;      -- hold value until check signal 1 is set
  elsif Counter = LastValue then
    Counter <= 0;            -- initialize
  else
    Counter <= Counter + 1;  -- counting
  end if;
end if;

if Counter = LastValue then
  IterationBound <= '1';
else
  IterationBound <= '0';
end if;
Example Code
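To see how the single counter with check points behaves over time, the VHDL fragment can be mirrored by a small Python model. This is only an illustrative sketch: `step`, `run`, the check value 2, and the last value 4 are hypothetical, and the model ignores reset.

```python
def step(counter, check_values, check_sigs, last_value):
    """One clock edge of the counter, mirroring the VHDL priority order."""
    for value, sig in zip(check_values, check_sigs):
        if counter == value and not sig:
            return counter          # hold at the check point
    if counter == last_value:
        return 0                    # wrap around: next iteration
    return counter + 1              # normal counting


def run(check_values, last_value, sig_trace):
    """Return the counter value observed in each cycle of sig_trace."""
    counter = 0
    history = []
    for sigs in sig_trace:
        history.append(counter)
        counter = step(counter, check_values, sigs, last_value)
    return history


# One check point at counter value 2; the awaited signal (for example a
# done or rcv signal) is low for four cycles and high afterwards.
trace = [(0,)] * 4 + [(1,)] * 5
print(run([2], 4, trace))  # [0, 1, 2, 2, 2, 3, 4, 0, 1]
```

The counter stalls at the check point until the awaited signal arrives, then resumes and wraps at the last value, which is how one counter can stand in for the chain of cascaded counters.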
Looping Control
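PeaCE supports controller generation for looped schedules.
A looped schedule can be structured hierarchically.
[Figure: A produces 4 samples and B consumes 1, so B fires 4 times per firing of A. The schedule A, LOOP 4 { B } is shown over time and implemented with one A block, one B block, and a MUX.]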
Looping Control
[Figure: the schedule A followed by four firings of B over time, implemented with blocks A and B and a MUX. The top-level counter (CounterValue) and the loop counter Loop1 (Loop1_Counter) form the control flow.]
A_start <= '1' when CounterValue = 0 else '0';
B_start <= '1' when Loop1_Counter = 0 and Loop1_busy = '1' else '0';
Loop1_start <= '1' when CounterValue = 20 else '0';
Buffer Management
Multi-rate buffering
Data types
• Int, Macroblock (16x16), Frame (176x144)
Register : small data type
• I/O timing control
• Buffer allocation
Memory : large data type
• Memory access logic
• Synchronization
• Memory allocation
[Figure: multi-rate buffering between A (rate 2) and B (rate 3), with a MUX selecting among buffered samples and memory synchronization.]
Resource Management
Resource sharing or Multiple instantiation
Input multiplexing and output buffer access
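[Figure: several multiplication nodes (X) mapped onto one multiplier. Input MUXes are driven by a mux select signal, and the output buffers are driven by an output buffer latch signal.]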
Contents
Introduction
Previous works and our contributions
Block Definition
Controller Synthesis
Schedule-Based Design
• Motivation
• Schedule information & controller generation
• Experiments
FRDF Specification for more efficient HW implementation
Conclusions & Future Directions
Schedule Information
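Node  start_time  exe_time
B     0           10
C     0           8
D     10          5
[Figure: the DFG (nodes B, C, D), the block libraries (B, C, D), and the schedule information above are combined to generate blocks B, C, D and a controller with clk and rst.]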
Schedule-based HW Synthesis
Previous Works
2-dimensional DCT algorithm
Generated Hardware Architecture from Ptolemy
[Figure: the 2-D DCT dataflow graph is Transpose 8x8, DCT1D, Transpose 8x8, DCT1D, with rates 64 and 8 on the edges. The hardware generated by Ptolemy uses two stages of 8 one-dimensional DCT blocks, each followed or preceded by an 8x8 matrix transpose, with 64 16-bit inputs and groups of 8 16-bit signals.]
Previous Works
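Generated Hardware Architecture from Meyr's work
[Figure: two DCT 1D blocks with MUXes and a central controller.]
Generated Hardware Architecture from GRAPE
[Figure: DCT 1D blocks, each with its own control logic (ctrl), connected through a FIFO with 64 buffers using wr/wr_ok and rd/rd_ok handshaking.]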
Motivation
In previous work, a single execution schedule is assumed for the HW implementation.
The proposed approach instead lets the designer provide the execution schedule:
a multi-rate dataflow graph can be implemented as many different hardware architectures.
[Figure: a multi-rate dataflow graph in which B produces 4 samples and C consumes 1, so C fires 4 times per firing of B. Three layouts of time versus hardware resources are shown: fully sequential (one C reused 4 times), fully parallel (four C instances), and hybrid.]
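How many firings must be placed in time or in space follows from the SDF balance equation on each edge. The sketch below is a Python illustration under the standard SDF assumption that production times source firings equals consumption times sink firings per iteration; the function name `repetitions` is hypothetical.

```python
from math import gcd


def repetitions(produce, consume):
    """Smallest firing counts (source, sink) that balance one SDF edge.

    Solves produce * r_source == consume * r_sink for the smallest
    positive integers.
    """
    g = gcd(produce, consume)
    return consume // g, produce // g


# Edge B -> C from the slide: B produces 4 samples, C consumes 1.
print(repetitions(4, 1))   # (1, 4): C fires 4 times per firing of B

# Edge Transpose -> DCT1D in the 2-D DCT graph: 64 produced, 8 consumed.
print(repetitions(64, 8))  # (1, 8)
```

The four firings of C can then be executed by one instance four times (fully sequential), by four instances at once (fully parallel), or by any mix in between (hybrid).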
Motivation
[Figure: two implementations of the 2-D DCT graph (Transpose 8x8, DCT1D, Transpose 8x8, DCT1D; rates 64 and 8). Sharing: a single DCT 1D block between input and output, with a MUX and a controller. Multiple instantiation: four DCT 1D blocks with MUXes.]
Schedule Information 1
# resource allocation table
Transpose 2
DCT1D 2
# resource mapping & schedule information
# (instance name, resource number, start, duration)
# loop ( loop count, start, loop period)
Transpose_0 0 0 1
Loop 8 1 2 {
  DCT1D_0 0 0 2
}
Transpose_1 1 17 1
Loop 8 18 2 {
  DCT1D_1 1 0 2
}
[Figure: the 2-D DCT graph (Transpose 8x8, DCT1D, Transpose 8x8, DCT1D; rates 64 and 8) implemented with two DCT 1D blocks, MUXes, and a controller. DCT1D_0 maps to resource 0 and DCT1D_1 maps to resource 1.]
1-to-1 mapping of graph node ↔ HW resource
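The schedule file can be read as a compact description of individual firings. The Python sketch below expands it into absolute start times. It assumes that a start time inside a loop body is relative to the start of the current iteration and that iteration k begins at loop start + k × loop period; this interpretation, the tuple encoding, and the name `expand` are assumptions for illustration, not the PeaCE file parser.

```python
def expand(entries, offset=0):
    """Flatten schedule entries into (instance, resource, start, duration).

    A plain entry is (name, resource, start, duration).
    A loop entry is ("loop", count, start, period, body).
    """
    firings = []
    for entry in entries:
        if entry[0] == "loop":
            _, count, start, period, body = entry
            for k in range(count):
                firings += expand(body, offset + start + k * period)
        else:
            name, resource, start, duration = entry
            firings.append((name, resource, offset + start, duration))
    return firings


# Schedule Information 1 from the slide, encoded as tuples.
schedule_1 = [
    ("Transpose_0", 0, 0, 1),
    ("loop", 8, 1, 2, [("DCT1D_0", 0, 0, 2)]),
    ("Transpose_1", 1, 17, 1),
    ("loop", 8, 18, 2, [("DCT1D_1", 1, 0, 2)]),
]

firings = expand(schedule_1)
print(len(firings))                              # 18
print(firings[1], firings[8])                    # first and last DCT1D_0 firing
print(max(s + d for _, _, s, d in firings))      # 34
```

Under this reading, the eight DCT1D_0 firings occupy cycles 1 to 16, Transpose_1 starts at cycle 17, and the second loop ends at cycle 34, which is consistent with the start values 17 and 18 in the file.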
Schedule Information 2 : Sharing
# resource allocation table
Transpose 2
DCT1D 1
# resource mapping & schedule information
# (instance name, resource number, start, duration)
# loop ( loop count, start, loop period)
Transpose_0 0 0 1
Loop 8 1 2 {
  DCT1D_0 0 0 2
}
Transpose_1 1 17 1
Loop 8 18 2 {
  DCT1D_1 0 0 2
}
[Figure: the 2-D DCT graph (Transpose 8x8, DCT1D, Transpose 8x8, DCT1D; rates 64 and 8) implemented with a single DCT 1D block and a MUX between input and output. Both DCT1D_0 and DCT1D_1 map to resource 0.]
N-to-1 mapping of graph node ↔ HW resource
Schedule Information 3 : Multiple Instantiation
# resource allocation table
Transpose 2
DCT1D 4
# resource mapping & schedule information
Transpose_0 0 0 1
Loop 4 1 2 {
  DCT1D_0 0 0 2
  DCT1D_0 1 0 2
}
Transpose_1 1 9 1
Loop 4 10 2 {
  DCT1D_1 2 0 2
  DCT1D_1 3 0 2
}
[Figure: the 2-D DCT graph (Transpose 8x8, DCT1D, Transpose 8x8, DCT1D; rates 64 and 8) implemented with four DCT 1D blocks, MUXes, and a controller. DCT1D_0 maps to resources 0 and 1; DCT1D_1 maps to resources 2 and 3.]
1-to-N mapping of graph node ↔ HW resource
HW Controller Generation
Counter-based Controller
• Buffer control, MUX control, and start and done signals of blocks
[Figure: a DCT 1D resource with an input MUX. The Loop1 counter and Loop1 IterNum drive a buffer controller and a MUX controller, which generate DCT1D_res0_sel, DCT1D_res0_input, DCT1D_0_output_0_en, and DCT1D_0_output_1_en.]
Experiment 1 :
2-dimensional DCT Algorithm
[Figure: area (gates, axis 0 to 300,000) versus 1/throughput (ns/sample, axis 0 to 800) for the Ptolemy, GRAPE, Auto, and Manual designs. Points are labeled 16 IDCT resources, 4 IDCT resources, 2 IDCT resources, and 1 IDCT resource (sharing).]
Contents
Introduction
Previous works and our contributions
Block Definition
Controller Synthesis
Schedule-Based Design
FRDF Specification for more efficient HW implementation
• FRDF model
• Examples & Experiments
Conclusions & Future Directions
Fractional Rate Dataflow Specification
A gap between automatic and manual design still exists.
• We cannot optimize the automatic design further because of dataflow semantics.
• Dataflow semantics impose stricter rules for firing.
− This requires more buffers to satisfy the firing condition.
• A real design has more implementation freedom for efficiency.
It is necessary to reduce the buffer requirements for a practical, efficient design.
• We choose FRDF, in which a fractional number of data samples can be produced or consumed.
• FRDF brings the automatic design a little closer to the manual design.
Fractional Rate Dataflow Specification
Every block with a multi-rate specification has an equivalent block with an FRDF specification.
• Functionally equivalent
• Internal algorithm and its schedule can be different.
Add4 with multi-rate specification (input rate 4, output rate 1), in one invocation (or firing):
• Consumes 4 input data samples
• Produces 1 output data sample
• Requires 4 input buffers
Add4 with FRDF specification (input rate 1, output rate 1/4):
• Consumes 1 input data sample in one invocation
• Produces only 1 output data sample during 4 invocations
• Requires only 1 input buffer
• Requires 4 invocations to perform the entire function
Fractional Rate Dataflow(FRDF)
Non-FRDF implementation
• The "Add4" block is invoked after all its inputs are valid.
• Parallel I/O
[Figure: Ramp (output rate 1), Add4 (input rate 4, output rate 1), Sink (input rate 1). Schedule over time: LOOP 4 { Ramp }, then Add4, then Sink. Add4 is combinational logic.]
Fractional Rate Dataflow(FRDF)
FRDF implementation
• The execution of block "Add4" is divided into 4 phases.
• Serial I/O at each phase
[Figure: Ramp (output rate 1), Add4 (input rate 1, output rate 1/4), Sink (input rate 1). Schedule over time: LOOP 4 { Ramp, Add4 }, then Sink, with Add4 executing phase0, phase1, phase2, phase3. Add4 is sequential logic with internal state: sum and phase.]
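The difference between the two Add4 blocks can be made concrete with a small behavioral model. This is a Python sketch for illustration only (the library blocks themselves are HW blocks); the class names `Add4SDF` and `Add4FRDF` are hypothetical, and the internal state (sum and phase) follows the slide.

```python
class Add4SDF:
    """Multi-rate Add4: consumes 4 samples and produces 1 per firing."""

    def fire(self, samples):
        if len(samples) != 4:
            raise ValueError("Add4 needs all 4 inputs before it can fire")
        return sum(samples)


class Add4FRDF:
    """FRDF Add4: consumes 1 sample per firing, output rate 1/4."""

    def __init__(self):
        self.total = 0   # running sum (internal state)
        self.phase = 0   # current phase, 0 to 3

    def fire(self, sample):
        self.total += sample
        self.phase += 1
        if self.phase == 4:            # last phase: emit and reset state
            result = self.total
            self.total, self.phase = 0, 0
            return result
        return None                    # no output in phases 0 to 2


ramp = list(range(8))                  # Ramp source: 0, 1, 2, ...

sdf = Add4SDF()
print([sdf.fire(ramp[i:i + 4]) for i in (0, 4)])   # [6, 22]

frdf = Add4FRDF()
print([frdf.fire(x) for x in ramp])
# [None, None, None, 6, None, None, None, 22]
```

Both versions compute the same sums, but the FRDF version never needs more than one input sample at a time, which is why it requires only one input buffer instead of four.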
[Figure: dataflow graph of part of an H.263 decoder. Three parallel paths for YBlock, UBlock, and VBlock, each consisting of DeQ, IZ, IDCT, Skip, and Mux, feed Motion Compensation together with dx, dy, mode, QP (via Repeat blocks), and the previous Y, U, V frame. Edges carry 16-bit 8x8 blocks, 8-bit integers, and frames; most rates are 1 or 4, with fractional rates of 1/99 on some edges.]
Experiment 2 :
Parts of H.263 Decoder
No Resource Sharing
Core: 282380, Buffer: 172032, Glue logic: 52575
Total Area: 506,987 gates
[Figure: the same H.263 decoder dataflow graph (three DeQ, IZ, IDCT, Skip, Mux paths for YBlock, UBlock, VBlock feeding Motion Compensation, with fractional rates of 1/99 on some edges).]
Experiment 2 :
Parts of H.263 Decoder
Maximum Resource Sharing
Core: 161164, Buffer: 172032, Glue logic: 66304
Total Area: 399,500 gates
Experiment 2 :
Parts of H.263 Decoder
[Figure: refined FRDF graph with a single data path: DeQ, IZ, IDCT, Skip, Mux, followed by Read Prev Block & Half Pixel, Truncation, ADD & Saturation, Mux, and WriteBlock. Inputs include dx, dy, mode, and CBP; frame data is held in SRAM. Edges carry 16-bit 8x8 blocks, 8-bit 8x8 blocks, 8-bit integers, and frames, with fractional rates 1/6 and 1/(6x99).]
• More fractional rate specification
• Separate Y, U, and V data paths are merged
• MC block is divided into several small blocks for FRDF
Core: 89033, Buffer: 65536, Glue logic: 22574
Total Area: 177,143 gates
Experiment 2 :
Parts of H.263 Decoder
Area (gates)
Design           Core    Buffer  Glue logic  Total
original         282380  172032  52575       506987
original_shared  161164  172032  66304       399500
advanced         89033   65536   22574       177143
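The relative savings can be read off the chart values directly. The following Python sketch only re-derives the totals and ratios from the numbers above; the dictionary layout and key names are chosen for illustration.

```python
# Gate counts from the chart (core, buffer, glue logic).
designs = {
    "original":        {"core": 282380, "buffer": 172032, "glue": 52575},
    "original_shared": {"core": 161164, "buffer": 172032, "glue": 66304},
    "advanced":        {"core": 89033,  "buffer": 65536,  "glue": 22574},
}

totals = {name: sum(parts.values()) for name, parts in designs.items()}
print(totals)
# {'original': 506987, 'original_shared': 399500, 'advanced': 177143}

# Area of each design relative to the original, as a percentage.
for name, total in totals.items():
    print(f"{name}: {100 * total / totals['original']:.1f}%")
```

Resource sharing alone reduces the core area but leaves the buffers unchanged and adds glue logic, so the total drops to about 79% of the original; the FRDF-based design also shrinks the buffers and reaches about 35%.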
Contents
Introduction
Previous works and our contributions
Block Definition
Controller Synthesis
Schedule-Based Design
FRDF Specification for more efficient HW implementation
Conclusions & Future Directions
Conclusions
Synthesize efficient hardware from a dataflow specification.
• We use SDF and its extension, FRDF.
The main goals of our research
• Overcoming the limitations of previous approaches
− Solving non-deterministic timing of I/O
− Schedule-based design: resource sharing, looping control
• Efficient hardware synthesis applicable to practical HW design
− Supporting FRDF specification
All of these techniques are implemented in the VHDL domain of the PeaCE codesign environment and verified with examples: DCT and an H.263 decoder.
Future work
Extension of expressiveness
• Piggybacked dataflow
• Dynamic constructs: for (data-dependent iteration), case (if-then-else, conditional execution)
Support of legacy HW platforms
• Support legacy HW IP ← SW code generation
• Support various types of bus & memory interface
− Local SRAM, dual-port memory, shared memory
• Current: shared memory through AMBA interface
Optimization issues
• Buffer elimination
• Buffer sharing
• FRDF