A Two-Stage Radix-4 Viterbi Decoder for Multiband OFDM UWB Systems

(1)

850 Sung-Woo Choi et al. ETRI Journal, Volume 30, Number 6, December 2008

ABSTRACT⎯This letter presents a power efficient 64-state Viterbi decoder (VD) employing a two-stage radix-4 add- compare-select architecture. A class of VD architectures is implemented, and their hardware complexity, maximum operating speed, and power consumption are compared.

Implementation results show that the proposed VD architecture is suitable for multiband orthogonal frequency-division multiplexing (MB-OFDM) ultra-wideband (UWB) systems, which can support the data rate of 480 Mbps even when implemented using 0.18-µm CMOS technology.

Keywords⎯ Viterbi decoder, radix-4, UWB, MB-OFDM.

I. Introduction

Ultra-wideband (UWB) systems have been receiving much attention in recent years, mainly due to their high data rate capability with low transmit power [1]-[3]. Specifically, a multiband orthogonal frequency-division multiplexing (MB- OFDM) based UWB system supporting the maximum data rate of 480 Mbps is widely considered [1]. In the MB-OFDM UWB system, a rate 1/3 convolutional code is utilized. The Viterbi algorithm is generally used to achieve near optimal decoding performance.

Recently, there have been many studies on high-speed Viterbi decoders (VDs). A systolic solution of the M-step parallel processing architecture was proposed in [4]. Because a k-fold increase in throughput requires a 2^k-fold increase in hardware complexity, M is usually limited to less than 3. A bit- level pipelined structure was studied in [5]. It solves the timing

Manuscript received July 1, 2008; revised Aug. 14, 2008; accepted Aug. 22, 2008.

This work was supported by the IT R&D program of MKE/IITA [2006-S-071-02, Development of UWB Solution for High Speed Multimedia Transmission], Rep. of Korea.

Sung-Woo Choi (phone: + 82 42 860 6113, email: csw9908@etri.re.kr), Kyu-Min Kang (phone: + 82 42 860 6703, email: kmkang@etri.re.kr), and Sang-Sung Choi (email:

sschoi@etri.re.kr) are with the IT Convergence Technology Research Laboratory, ETRI, Daejeon, Rep. of Korea.

constraints by retiming the critical path; however, both power consumption and hardware complexity increase to support data rates of hundreds of Mbps. A minimized method and sliding block methods have also been studied for high-speed processing [6]. To greatly increase the data rate, a systolic design has been employed. Because the above methods have their own speed limits and/or hardware complexity, an appropriate architecture should be implemented.

II. Two-Stage Radix-4 Viterbi Decoder

In this letter, we present a power efficient two-stage radix-4 VD architecture, which can support the maximum data rate of 480 Mbps in the MB-OFDM UWB system. Figure 1 shows a block diagram of the proposed two-stage radix-4 VD. Input data is composed of 12 input signals (4 input symbol vectors of 3 signals each). The upper 6 input signals are fed into the BM0 module, while the lower 6 input signals are fed into the BM1 module.

Fig. 1. Two-stage radix-4 Viterbi decoder.

TB LIFO

Input symbols

64-state metrics

BM0

BM1

4 4

64-state radix-4

ACS0

64-state radix-4

ACS1 y_n-3

3 yn-2

3

yn-1

3 y_n 3

BM: branch metric unit

TB: trace-back unit ACS: add-compare-select unit LIFO: last-in first-out buffer

4-bit decoded

output Γn-2

64 λn-2

64

Γn

64 λn

64

Γn+2

64

64×4 bit decisions dn-3 dn-2 dn-1 dn

64×2 64×2 D

A Two-Stage Radix-4 Viterbi Decoder for Multiband OFDM UWB Systems

Sung-Woo Choi, Kyu-Min Kang, and Sang-Sung Choi

(2)

ETRI Journal, Volume 30, Number 6, December 2008 Sung-Woo Choi et al. 851 Fig. 2. Two-stage 64-state radix-4 trellis diagram.

0 2

Γn−

1 2

Γn−

2 2

Γn−

3 n−2

Γ

16 n−2

Γ

32 n−2

Γ

48 n−2

Γ

63 n−2

Γ

63

Γn

63 n+2

Γ

48 n+2

Γ

32 n+2

Γ

16 n+2

Γ

3 n+2

Γ

2 2

Γn+

1 n+2

Γ

0 n+2

Γ

0

Γn ACS_State0

Fig. 3. Four-way ACS unit for state 0.

2 Comparator

Comparator

Comparator Comparator Comparator

Decision

0

Γn

16

Γn

32

Γn

48

Γn 0,0

λn

16,0

λn

32,0

λn

48,0

λn

0 n+2

Γ

0 0

1

n n

d d−

1. Two-stage 64-State Radix-4 Trellis

Figure 2 shows a two-stage 64-state radix-4 trellis diagram for the rate 1/3 convolutional code with the constraint length of 7. Each of the ACS0 and ACS1 modules consists of 64 four- way add-compare-select (ACS) units as shown in Fig. 3.

2. BM Unit

Branch metrics (BMs) for the radix-4 trellis are generated by combining branch metrics of successive iterations of the underlying radix-2 trellis. Three 4-bit soft decision input signals are used to calculate eight sub-branch metrics corresponding to the eight possible encoder outputs in the radix-2 trellis. Because the radix-4 VD utilizes six 4-bit soft decision inputs for two decoded outputs, the maximum branch metric is 90. The branch metrics are calculated using a uniform distance measure equal to the symbol itself when compared to logic-0 and equal to its one’s complement when compared to logic-1 [6].

3. ACS Unit

An example of a four-way ACS unit for state 0 is given in

Fig. 4. Two-way add-compare logic.

Bit inverter

MSB MSB

MSB XOR

Comparator

A[8:0] S[9:0]

B[8:0]

~

B[8:0]

B[9]

1

1,0 i

λn

i1

Γn

2,0 i

λn

i2

Γn ^B[9:0]

A[9:0]

A[9]

MSB: most significant bit

Fig. 5. Memory bank and block decoding process.

Time Write operation of survivor bits for m-th block time interval First stage trace-back operation for m-th block time interval

Decoding operation for m-th block time interval Save decoded bits in the LIFO buffer

Output decoded bits from the LIFO buffer trace-back Second stage trace-back operation for m-th block time interval WT0TB10 TB20 DE0 WT6 TB16 TB26 DE6

DE-5 WT1TB11 TB21

TB22

TB23

TB24

TB25

DE-4

DE-3

DE-2

DE-1

WT2

WT3

WT4

WT5

TB12

TB1₃ TB1₄

TB15

TB2-3

TB2-2

TB2-1

TB1-1

DE1

DE2

DE3

DE4

DE5

WT7

WT₈ WT₉

WT10

WT11

TB1₇ TB18

TB19

TB1₁₀ TB28

TB27

BI-5 BI-3 BI-1 BI1 BI3 BI5

BI-4 BI-2 BI0 BI2 BI4 BI6

BO-6 BO-4 BO-2 BO0 BO2 BO4

BO-5 BO-3 BO-1 BO1 BO3 BO5

#0

#1

#2

#3

#4

#5

#0

#1 Memory

bank

LIFO buffer

WT_m TB1m

TB2m

DEm

BIm

BO_m

Fig. 3. Four adders and six comparators are required for the implementation of a four-way ACS unit. The four-way ACS unit updates a new state metric and two survivor bits by using both 4 state metrics and 4 branch metrics. The state metric

2 sn+

Γ is updated recursively as

{

^,

}

2 min , 0,1, ,63

s i i s

n₊ i n λn s

Γ = Γ + = " , (1) where i is a predecessor state of s, and λ denotes the _n^{i s}^, branch metric on the transition from state i to state s at time n.

The two-way add-compare circuit of the ACS unit is described in detail in Fig. 4. By employing the modulo normalization algorithm from [7], we can avoid errors due to overflow during the updating of the state metrics and simplify the comparator circuit of the ACS unit as in Fig. 4. Note that the output of the two-way add-compare logic is 0 if A is larger than B; otherwise, the output is 1.

4. TB Unit

The proposed VD architecture employs the 3-pointer even algorithm for trace-back recursion [8], which is more hardware-efficient than the register exchange algorithm.

Figure 5 shows the memory banks, the last-in first-out (LIFO) buffers, and the block decoding method of the proposed VD

(3)

852 Sung-Woo Choi et al. ETRI Journal, Volume 30, Number 6, December 2008 Table 1. Hardware complexity, maximum operating speed, and power consumption in the ASIC implementation of several ACS architectures.

Radix-2 Pipelined radix-4 Two-stage radix-2 Radix-4 Two-stage radix-4 Chip area (μm²) 318,422 171,365 2,409,272 1,114,845 532,463 254,954 835,348 334,875 1,255,841 525,579 No. of gates¹⁾ 31,906 33,666 241,410 219,026 53,353 50,089 83,702 65,790 125,835 103,257

Operating speed (MHz) 528 264 132

Timing pass No No Yes Yes No No No Yes Yes Yes

Maximum speed (MHz) 277 364 292 329 164 202 244 302 138 167

Power (mW)²⁾ 106.8 38.0 307.2 95.8 94.5 28.3 172.8 41.1 56.4 27.6

CMOS technology (μm) 0.18 0.13 0.18 0.13 0.18 0.13 0.18 0.13 0.18 0.13 1) Based on 2×1 NAND gate, 2) Power consumption is estimated by Synopsys’ Power Compiler.

Table 2. Hardware complexity of two-stage radix-4 VD.

VD (with memory) VD (without memory) ACS only 143,295 gates 131,415 gates 103,257 gates

architecture. To conduct the 3-pointer even algorithm, six memory banks and two LIFO buffers are required. The width of each memory bank is 256 bits and the depth is 5. The overall trace-back (TB) length is 40. Note that the 64×2 survivor bits of ACS0 are mapped to the lower half of the memory bank (WTm), while the 64×2 survivor bits of ACS1 are mapped to the upper half of the memory bank (WTm) at each iteration.

III. Implementation Results

Table 1 compares the hardware complexity, the maximum operating speed, and the power consumption in the ASIC implementation of a class of ACS architectures. The ACS architectures are implemented and tested by utilizing the TSMC 0.13-µm and 0.18-µm CMOS libraries with the operation condition of slow mode. Because the sampling frequency of the MB-OFDM UWB system is 528 MHz [1], the radix-2, radix-4 (or two-stage radix-2), and two-stage radix- 4 ACS architectures should be operated at the clock speeds of 528 MHz, 264 MHz, and 132 MHz, respectively [6]. Although the pipelined radix-4 architecture satisfies the timing constraints, it requires much greater hardware complexity and power consumption than the radix-4 or two-stage radix-4 architecture [5]. In the radix-4 architecture, only 0.13-µm technology can support the required operation speed. To make matters worse, the power consumption of the radix-4 architecture is approximately 49% higher than that of the two- stage radix-4 architecture. Table 1 indicates that the proposed two-stage radix-4 VD is the most power efficient architecture for the MB-OFDM UWB systems. We summarized the hardware complexity of the two-stage radix-4 VD implemented using

0.13-µm CMOS technology in Table 2.

IV. Conclusion

We have proposed a power efficient two-stage 64-state radix-4 VD architecture. Implementation results showed that the proposed VD with relatively low hardware complexity and power consumption can support various data rates for MB- OFDM UWB transmission. As ASIC technology evolves, the proposed VD architecture is expected to support data rates of more than 1 Gbps and accordingly is suitable for use in next generation high-speed communication systems.

References

[1] W. Abbott et al., “Multiband OFDM Physical Layer Specification, Version 1.2 (draft),” WiMedia Alliance, Feb. 2007.

[2] C.W. Kim, S.S. Choi, and S.G. Lee, “A CMOS Frequency Synthesizer Block for MB-OFDM UWB Systems,” ETRI Journal, vol. 29, no. 4, Aug. 2007, pp. 437-444.

[3] K.M. Kang and S.S. Choi, “Initial Timing Acquisition for Binary Phase-Shift Keying Direct Sequence Ultra-Wideband Transmission,” ETRI Journal, vol. 30, no. 4, Aug. 2008, pp. 495-505.

[4] G. Fettweis and H. Meyr, “High-Rate Viterbi Processor: A Systolic Array Solution,” IEEE J. Sel. Areas Comm., vol. 8, no. 8, Oct. 1990, pp. 1520-1534.

[5] K.K. Parhi, “An Improved Pipelined MSB-First Add-Compare Select Unit Structure for Viterbi Decoders,” IEEE Trans. Circuits and Systems I, vol. 51, no. 3, Mar. 2004, pp. 504-511.

[6] P.J. Black and T.H.-Y. Meng, “A 1-Gb/s, Four-State, Sliding Block Viterbi Decoder,” IEEE J. Solid-State Circuits, vol. 32, no. 6, June 1997, pp. 797-805.

[7] C.B. Shung et al., “VLSI Architectures for Metric Normalization in the Viterbi Algorithm,” Proc. ICC, vol. 4, Apr. 1990, pp. 1723-1728.

[8] G. Feygin and P.G. Gulak, “Architectural Tradeoffs for Survivor Sequence Memory Management in Viterbi Decoders,” IEEE Trans.

Comm., vol. 41, no. 3, Mar. 1993, pp. 425-429.