850 Sung-Woo Choi et al. ETRI Journal, Volume 30, Number 6, December 2008
ABSTRACT⎯This letter presents a power efficient 64-state Viterbi decoder (VD) employing a two-stage radix-4 add- compare-select architecture. A class of VD architectures is implemented, and their hardware complexity, maximum operating speed, and power consumption are compared.
Implementation results show that the proposed VD architecture is suitable for multiband orthogonal frequency-division multiplexing (MB-OFDM) ultra-wideband (UWB) systems, which can support the data rate of 480 Mbps even when implemented using 0.18-µm CMOS technology.
Keywords⎯ Viterbi decoder, radix-4, UWB, MB-OFDM.
I. Introduction
Ultra-wideband (UWB) systems have been receiving much attention in recent years, mainly due to their high data rate capability with low transmit power [1]-[3]. Specifically, a multiband orthogonal frequency-division multiplexing (MB- OFDM) based UWB system supporting the maximum data rate of 480 Mbps is widely considered [1]. In the MB-OFDM UWB system, a rate 1/3 convolutional code is utilized. The Viterbi algorithm is generally used to achieve near optimal decoding performance.
Recently, there have been many studies on high-speed Viterbi decoders (VDs). A systolic solution of the M-step parallel processing architecture was proposed in [4]. Because a k-fold increase in throughput requires a 2k-fold increase in hardware complexity, M is usually limited to less than 3. A bit- level pipelined structure was studied in [5]. It solves the timing
Manuscript received July 1, 2008; revised Aug. 14, 2008; accepted Aug. 22, 2008.
This work was supported by the IT R&D program of MKE/IITA [2006-S-071-02, Development of UWB Solution for High Speed Multimedia Transmission], Rep. of Korea.
Sung-Woo Choi (phone: + 82 42 860 6113, email: csw9908@etri.re.kr), Kyu-Min Kang (phone: + 82 42 860 6703, email: kmkang@etri.re.kr), and Sang-Sung Choi (email:
sschoi@etri.re.kr) are with the IT Convergence Technology Research Laboratory, ETRI, Daejeon, Rep. of Korea.
constraints by retiming the critical path; however, both power consumption and hardware complexity increase to support data rates of hundreds of Mbps. A minimized method and sliding block methods have also been studied for high-speed processing [6]. To greatly increase the data rate, a systolic design has been employed. Because the above methods have their own speed limits and/or hardware complexity, an appropriate architecture should be implemented.
II. Two-Stage Radix-4 Viterbi Decoder
In this letter, we present a power efficient two-stage radix-4 VD architecture, which can support the maximum data rate of 480 Mbps in the MB-OFDM UWB system. Figure 1 shows a block diagram of the proposed two-stage radix-4 VD. Input data is composed of 12 input signals (4 input symbol vectors of 3 signals each). The upper 6 input signals are fed into the BM0 module, while the lower 6 input signals are fed into the BM1 module.
Fig. 1. Two-stage radix-4 Viterbi decoder.
TB LIFO
Input symbols
64-state metrics
BM0
BM1
4 4
64-state radix-4
ACS0
64-state radix-4
ACS1 yn-3
3 yn-2
3
yn-1
3 yn 3
BM: branch metric unit
TB: trace-back unit ACS: add-compare-select unit LIFO: last-in first-out buffer
4-bit decoded
output Γn-2
64 λn-2
64
Γn
64 λn
64
Γn+2
64
64×4 bit decisions dn-3 dn-2 dn-1 dn
64×2 64×2 D
A Two-Stage Radix-4 Viterbi Decoder for Multiband OFDM UWB Systems
Sung-Woo Choi, Kyu-Min Kang, and Sang-Sung Choi
ETRI Journal, Volume 30, Number 6, December 2008 Sung-Woo Choi et al. 851 Fig. 2. Two-stage 64-state radix-4 trellis diagram.
0 2
Γn−
1 2
Γn−
2 2
Γn−
3 n−2
Γ
16 n−2
Γ
32 n−2
Γ
48 n−2
Γ
63 n−2
Γ
63
Γn
63 n+2
Γ
48 n+2
Γ
32 n+2
Γ
16 n+2
Γ
3 n+2
Γ
2 2
Γn+
1 n+2
Γ
0 n+2
Γ
0
Γn ACS_State0
Fig. 3. Four-way ACS unit for state 0.
2 Comparator
Comparator
Comparator
Comparator Comparator Comparator
Decision
0
Γn
16
Γn
32
Γn
48
Γn 0,0
λn
16,0
λn
32,0
λn
48,0
λn
0 n+2
Γ
0 0
1
n n
d d−
1. Two-stage 64-State Radix-4 Trellis
Figure 2 shows a two-stage 64-state radix-4 trellis diagram for the rate 1/3 convolutional code with the constraint length of 7. Each of the ACS0 and ACS1 modules consists of 64 four- way add-compare-select (ACS) units as shown in Fig. 3.
2. BM Unit
Branch metrics (BMs) for the radix-4 trellis are generated by combining branch metrics of successive iterations of the underlying radix-2 trellis. Three 4-bit soft decision input signals are used to calculate eight sub-branch metrics corresponding to the eight possible encoder outputs in the radix-2 trellis. Because the radix-4 VD utilizes six 4-bit soft decision inputs for two decoded outputs, the maximum branch metric is 90. The branch metrics are calculated using a uniform distance measure equal to the symbol itself when compared to logic-0 and equal to its one’s complement when compared to logic-1 [6].
3. ACS Unit
An example of a four-way ACS unit for state 0 is given in
Fig. 4. Two-way add-compare logic.
Bit inverter
MSB MSB
MSB XOR
Comparator
A[8:0] S[9:0]
B[8:0]
~
B[8:0]
B[9]
1
1,0 i
λn
i1
Γn
2,0 i
λn
i2
Γn B[9:0]
A[9:0]
A[9]
MSB: most significant bit
Fig. 5. Memory bank and block decoding process.
Time Write operation of survivor bits for m-th block time interval First stage trace-back operation for m-th block time interval
Decoding operation for m-th block time interval Save decoded bits in the LIFO buffer
Output decoded bits from the LIFO buffer trace-back Second stage trace-back operation for m-th block time interval WT0TB10 TB20 DE0 WT6 TB16 TB26 DE6
DE-5 WT1TB11 TB21
TB22
TB23
TB24
TB25
DE-4
DE-3
DE-2
DE-1
WT2
WT3
WT4
WT5
TB12
TB13 TB14
TB15
TB2-3
TB2-2
TB2-1
TB1-1
DE1
DE2
DE3
DE4
DE5
WT7
WT8 WT9
WT10
WT11
TB17 TB18
TB19
TB110 TB28
TB27
BI-5 BI-3 BI-1 BI1 BI3 BI5
BI-4 BI-2 BI0 BI2 BI4 BI6
BO-6 BO-4 BO-2 BO0 BO2 BO4
BO-5 BO-3 BO-1 BO1 BO3 BO5
#0
#1
#2
#3
#4
#5
#0
#1 Memory
bank
LIFO buffer
WTm TB1m
TB2m
DEm
BIm
BOm
Fig. 3. Four adders and six comparators are required for the implementation of a four-way ACS unit. The four-way ACS unit updates a new state metric and two survivor bits by using both 4 state metrics and 4 branch metrics. The state metric
2 sn+
Γ is updated recursively as
{
,}
2 min , 0,1, ,63
s i i s
n+ i n λn s
Γ = Γ + = " , (1) where i is a predecessor state of s, and λ denotes the ni s, branch metric on the transition from state i to state s at time n.
The two-way add-compare circuit of the ACS unit is described in detail in Fig. 4. By employing the modulo normalization algorithm from [7], we can avoid errors due to overflow during the updating of the state metrics and simplify the comparator circuit of the ACS unit as in Fig. 4. Note that the output of the two-way add-compare logic is 0 if A is larger than B; otherwise, the output is 1.
4. TB Unit
The proposed VD architecture employs the 3-pointer even algorithm for trace-back recursion [8], which is more hardware-efficient than the register exchange algorithm.
Figure 5 shows the memory banks, the last-in first-out (LIFO) buffers, and the block decoding method of the proposed VD
852 Sung-Woo Choi et al. ETRI Journal, Volume 30, Number 6, December 2008 Table 1. Hardware complexity, maximum operating speed, and power consumption in the ASIC implementation of several ACS architectures.
Radix-2 Pipelined radix-4 Two-stage radix-2 Radix-4 Two-stage radix-4 Chip area (μm2) 318,422 171,365 2,409,272 1,114,845 532,463 254,954 835,348 334,875 1,255,841 525,579 No. of gates1) 31,906 33,666 241,410 219,026 53,353 50,089 83,702 65,790 125,835 103,257
Operating speed (MHz) 528 264 132
Timing pass No No Yes Yes No No No Yes Yes Yes
Maximum speed (MHz) 277 364 292 329 164 202 244 302 138 167
Power (mW)2) 106.8 38.0 307.2 95.8 94.5 28.3 172.8 41.1 56.4 27.6
CMOS technology (μm) 0.18 0.13 0.18 0.13 0.18 0.13 0.18 0.13 0.18 0.13 1) Based on 2×1 NAND gate, 2) Power consumption is estimated by Synopsys’ Power Compiler.
Table 2. Hardware complexity of two-stage radix-4 VD.
VD (with memory) VD (without memory) ACS only 143,295 gates 131,415 gates 103,257 gates
architecture. To conduct the 3-pointer even algorithm, six memory banks and two LIFO buffers are required. The width of each memory bank is 256 bits and the depth is 5. The overall trace-back (TB) length is 40. Note that the 64×2 survivor bits of ACS0 are mapped to the lower half of the memory bank (WTm), while the 64×2 survivor bits of ACS1 are mapped to the upper half of the memory bank (WTm) at each iteration.
III. Implementation Results
Table 1 compares the hardware complexity, the maximum operating speed, and the power consumption in the ASIC implementation of a class of ACS architectures. The ACS architectures are implemented and tested by utilizing the TSMC 0.13-µm and 0.18-µm CMOS libraries with the operation condition of slow mode. Because the sampling frequency of the MB-OFDM UWB system is 528 MHz [1], the radix-2, radix-4 (or two-stage radix-2), and two-stage radix- 4 ACS architectures should be operated at the clock speeds of 528 MHz, 264 MHz, and 132 MHz, respectively [6]. Although the pipelined radix-4 architecture satisfies the timing constraints, it requires much greater hardware complexity and power consumption than the radix-4 or two-stage radix-4 architecture [5]. In the radix-4 architecture, only 0.13-µm technology can support the required operation speed. To make matters worse, the power consumption of the radix-4 architecture is approximately 49% higher than that of the two- stage radix-4 architecture. Table 1 indicates that the proposed two-stage radix-4 VD is the most power efficient architecture for the MB-OFDM UWB systems. We summarized the hardware complexity of the two-stage radix-4 VD implemented using
0.13-µm CMOS technology in Table 2.
IV. Conclusion
We have proposed a power efficient two-stage 64-state radix-4 VD architecture. Implementation results showed that the proposed VD with relatively low hardware complexity and power consumption can support various data rates for MB- OFDM UWB transmission. As ASIC technology evolves, the proposed VD architecture is expected to support data rates of more than 1 Gbps and accordingly is suitable for use in next generation high-speed communication systems.
References
[1] W. Abbott et al., “Multiband OFDM Physical Layer Specification, Version 1.2 (draft),” WiMedia Alliance, Feb. 2007.
[2] C.W. Kim, S.S. Choi, and S.G. Lee, “A CMOS Frequency Synthesizer Block for MB-OFDM UWB Systems,” ETRI Journal, vol. 29, no. 4, Aug. 2007, pp. 437-444.
[3] K.M. Kang and S.S. Choi, “Initial Timing Acquisition for Binary Phase-Shift Keying Direct Sequence Ultra-Wideband Transmission,” ETRI Journal, vol. 30, no. 4, Aug. 2008, pp. 495-505.
[4] G. Fettweis and H. Meyr, “High-Rate Viterbi Processor: A Systolic Array Solution,” IEEE J. Sel. Areas Comm., vol. 8, no. 8, Oct. 1990, pp. 1520-1534.
[5] K.K. Parhi, “An Improved Pipelined MSB-First Add-Compare Select Unit Structure for Viterbi Decoders,” IEEE Trans. Circuits and Systems I, vol. 51, no. 3, Mar. 2004, pp. 504-511.
[6] P.J. Black and T.H.-Y. Meng, “A 1-Gb/s, Four-State, Sliding Block Viterbi Decoder,” IEEE J. Solid-State Circuits, vol. 32, no. 6, June 1997, pp. 797-805.
[7] C.B. Shung et al., “VLSI Architectures for Metric Normalization in the Viterbi Algorithm,” Proc. ICC, vol. 4, Apr. 1990, pp. 1723-1728.
[8] G. Feygin and P.G. Gulak, “Architectural Tradeoffs for Survivor Sequence Memory Management in Viterbi Decoders,” IEEE Trans.
Comm., vol. 41, no. 3, Mar. 1993, pp. 425-429.