High Throughput Parallel Design of 2-D $8{\times}8$ Integer Transforms for H.264/AVC

(1)

논문 2012-49SD-8-5

H.264/AVC 를 위한 높은 처리량의 2-D 8 × 8 integer

transforms 병렬 구조 설계

( High Throughput Parallel Design of 2-D 8 × 8 Integer Transforms

for H.264/AVC )

미투라니 사르마

*

, 하니 티와리

*

, 조 용 범

***

( Meeturani Sharma, Honey Tiwari, and Yong Beom Cho ) 요 약 본 논문에서 H.264표준을 위해 2차원 8 × 8 순방향/역방향 정수 DCT 변환을 빠르고 효율적으로 계산할 수 있는 알고리즘 을 제안한다. 순방향/역방향 변환은 간단한 시프트와 덧셈 동작을 사용하여 계산 복잡도를 줄였으며, DCT 연산에 메모리를 사 용하지 않으므로 해서 불필요한 자원소모를 줄였다. 제안된 파이프라인 아키텍처의 최대 동작 주파수는 1.184GHz이며, 합성결 과는 44864 게이트가 사용되어 25.27Gpixels/sec의 스루풋을 보여준다. 면적 비율에 비해 높은 스루풋으로인해, 제안된 설계는 H.264/AVC 고해상도 비디오기술의 실시간 처리에 효율적으로 사용할 수 있다. Abstract

In this paper, the implementation of high throughput two-dimensional (2-D) 8 × 8 forward and inverse integer DCT transform for H.264 is presented. The forward and inverse transforms are represented using simple shift and addition operations. Matrix decomposition and matrix operation such as the Kronecker product and direct sum are used to reduce the computation complexity. The proposed design uses integer computations and does not use transpose memory and hence, the resource consumption is also reduced. The maximum operating frequency of the proposed pipelined architecture is 1.184 GHz, which achieves 25.27 Gpixels/sec throughput rate with the hardware cost of 44864 gates. High throughput and low hardware makes the proposed design useful for real time H.264/AVC high definition processing.

Keywords: DCT; dual clock architecture; H.264/AVC; IDCT; integer transform.

Ⅰ. Introduction

H.264/Advanced Video Coding (AVC) standard uses discrete cosine transform (DCT) and its inverse (IDCT) as the core transforms[1]. The computation

*

학생회원 ** 정회원, 건국대학교 전자공학부

(Department of Electronic Engineering, Konkuk University)

※ 본 연구는 건국대학교와 ETRI SW-SoC R&DBD 융합센터 지원으로 수행되었음.

접수일자: 2012년5월31일, 수정완료일: 2012년7월25일

complexity of forward and reverse transform is reduced by using Integer DCT/IDCT structures[2～4]. In conventional indirect computation method the 2-D transform is computed by performing 1-D transform in row direction followed by 1-D transform in column direction[5～7]. The transpose memory required in indirect computation consumes significant area and leads to irregular architectures, making it difficult to realize a pipelined structure. The direct method of computing 2-D 8×8 DCT requires the decomposition of the DCT coefficient matrix[8～9]. The factorized

(2)

matrices of DCT are not regular, and hence, these fast DCT algorithms achieve only moderate speed-up. The core transform in 2-D DCT can be represented using Kronecker product and direct sum operation which reduces the implementation expenses[10～11].

Similarly, different architectures to support multi-standard video applications with the adaptive block-size transform (8×8 and 4×4) are proposed in [11] and [12]. The throughput values of these architectures are not sufficient to satisfy the real-time requirement of unified transform in the HD 2160p system.

In this paper, the dual clock pipelined architecture for achieving high throughput computation of the 2-D 8×8 integer transform is presented. The matrix decomposition proposed in [10] requires n-bit shift operations for division by 2n.The approximation errors introduced by division operation are accumulated at each stage, and the total error becomes significant. In the proposed design, it is shown that the shift operations in the intermediate stage can be shifted to a single shift operation in the last stage. The approximation error due to integer implementation of the proposed scheme is less than 0.01%. The forthcoming sections of the paper are organized as follows. In Section II, the matrix decomposition of 2-D 8×8 integer transform for H.264/AVC is presented. In Section III, the matrix decomposition is presented. In Section IV, the dual clock pipelined architecture for computing the 2-D 8×8 integer transform for H.264/AVC is introduced. Thereafter, the comparison of implementation with existing architectures followed by the conclusion is given.

Ⅱ. Matrix operations used in proposed design

The Kronecker product of A = [aij]∈Mm,n(set of

real or complex numbers, ℝ/ℂ) and B = [bij] ∈

Mp,q(ℝ/ℂ) is denoted by A⊗B and is defined to be

the block matrix [13]

⊗   



 



__{ … }_ ⋮ ⋱ ⋮   …   ∈_{  }ℝℂ  (1)

The associative and transpose property of the Kronecker product are defined as [13]

 ⊗  ⊗     ⊗   ⊗ (2) ⊗  ⊗ (3) The direct sum of matrices A and B is denoted by

A⊕ B and is defined as [13]

⊕ _ _ _

 ∈      ℝℂ  (4)

The linear transformation of any matrix ⏀∈

Mn,p(ℝ/ℂ) to Ψ ∈ Mm,q(ℝ/ℂ) given by A⏀B = Ψ

can be expressed as [13]

⊗   (5)

where, ⏀v and Ψv are the column-wise stacking

vector of ⏀ and Ψ respectively, and “.” denotes the basic matrix multiplication.

Ⅲ. Fast 2-D forward and inverse integer transform

In the H.264/AVC FRExt proposal, the 2-D forward integer transform can be indirectly computed as a 1-D horizontal (row) transform followed by a 1-D vertical (column) transform, where the corresponding 1-D 8×8 forward transform is shown by the following C matrix [4]

_{ }    



              



_                                                                        (6)

(3)

forward transform Y, is calculated as [2]

   ⊙  (7)

where “⊙” shows that each element of (CXCT) is multiplied by the scaling factor in the same position of scaling matrix Ef, and “T” denotes the matrix

transposition. Similarly, the 2-D 8 × 8 inverse integer transform is described as

  _⊙____

  (8)

where Ci is the 1-D inverse transform matrix and

is defined as Ci = CT [4]. The scaling factor Ef , in

(7) and (8) is absorbed in the quantization process, and the core forward and inverse transform (CXCT) and (CiYCiT) are used as the forward and inverse

integer transforms [10].

Using (4), the core forward and inverse transforms in (7) and (8) are expressed as

⊗  (9)

⊗  ⊗ ⊗

(10)

In [10], the core forward and inverse transforms were further decomposed as

⊗ ∙ __{ }        _{ }      _{ }    ⊗ ⊕⊕⊕    _{ }    ⊗ ⊕⊕    _{ }      _{ }    ⊕⊗⊕     _{ }   ⊗    _{ }      _{ }    ⊗ ⊗⊕⊕    _{ }    ⊗ ⊗⊕    _{ }      _{ }    _⊗__⊗_⊕     _{ }   _⊗_   _{ }   ⊗ . _{ }   ⊗⊗ ._{ }     ₍₁₁₎ ⊗∙       _{ }   ⊗⊗     _{ }  ⊗ .    _{ }      _{ }    ⊗⊗⊕     _{ }   ⊗ .    _{ }      _{ }    ⊗ ⊗⊕⊕    _{ }    ⊗ ⊗⊕ .    _{ }      _{ }    ⊕⊗⊕     _{ }   ⊗ .    _{ }      _{ }    ⊗ ⊕⊕⊕    _{ }    ⊗ ⊕⊕ . _      ₍₁₂₎

where,  ⊗ and  ⊗ , while In

is n×n identity matrix and the value of other matrices are given as follows.

  



 



                                                                  



 



                                                           



 



                                            



 



                      



 



                 

IV. Proposed Integer implementation 1. Integer division

(4)

A real value division by any nth power of 2, 2n, can be performed by n bit right shift operation. The shifting process discards n LSB bits and the resultant is equal to nearest integer after division. The rounding of values results in an approximation error. In an integer implementation, the approximation error introduced at various stages is accumulated. To reduce this approximation error, the fraction based computation must be shifted to the final stage and the number of division operations must be minimized. This will also reduce the number of shift registers required for implementation.

2. Modified Integer Transform

From (11) and (12) it can be seen that in a true integer implementation, the approximation error from stages S4,1, S4,2, S5,1, S5,2, S6,1, S6,2, S7,1, and S7,2

accumulate and are propagated to the output. This approx. error can be reduced by changing the stages to true integer butterfly structure. To achieve this matrix Q1, Q2, and Q3 can be further factorized as

__{ }               _   



 



                      _   



 



                   (13) Or, Q1 = 1/2‧ Q'1, Q2 = 1/2‧ Q'2 and Q3 = 1/4‧ Q'3, where ′            ′  



 



                     ′  



 



                   (14)

Using (13) and (14), Stages 4, 5, 6 and 7 of the forward transform in (11) can be given by

      _{ }    ⊗⊗⊕   _{ }    _ ′⊗       _{ }    ⊗⊗⊕     _{ }    ′⊗ ₍₁₅₎    _. _{ }    ⊗ ⊗⊕′ ⊕__{ }_ _    ⊗ ⊗⊕′ (16)    _    _{ }    ⊕′⊗⊕     _{ }   ′⊗₍₁₇₎    _  _{ }    ⊗ ⊕′⊕′ ⊕__{ }_ _    ⊗ ⊕ ′ ⊕ ′_ (18) Shifting the constants of each stage to the output and integrating (15) - (18) in the forward transform, (11) can be given by ⊗ ∙          _{ }           ⊗⊕′⊕′⊕         ⊗⊕′⊕′    _{ }      _{ }    ⊕′⊗⊕     _{ }   ′⊗ __{ }_ _      _{ }    _⊗_⊗__⊕′_⊕    _{ }    _⊗_⊗__⊕′_    _{ }      _{ }    ⊗⊗⊕     _{ }   ′⊗   _{ }   ⊗ . _{ }   ⊗⊗ ._{ }     ₍₁₉₎

Similarly, the inverse transform, (12) can be given by ⊗        _{ }   ⊗⊗     _{ }  ⊗ 

(5)

   _{ }      _{ }     ⊗⊗⊕     _{ }   ′⊗    _{ }      _{ }    _⊗ __⊗_⊕′__⊕    _{ }    _⊗ __⊗_⊕′__    _{ }      _{ }      ⊕′⊗⊕     _{ }   ′⊗                ⊗⊕′⊕′⊕         ⊗⊕′⊕′ _      ₍₂₀₎

The intermediate stage in (19) and (20) does not contain any matrices with fractional data, and hence, the approximation error in the intermediate stages is reduced to zero. For the integer input patterns, the butterfly structure of (19) and (20) is a true integer implementation and does not require any floating point arithmetic operations.

3. Hardware Implementation

Figure 1 shows the butterfly structure for the forward integer DCT computation. The output of each stage is passed on to next every clock cycle and hence, 64 new input values can be processed

b a a b− a b+ vec In Invec1 b a 2 a−b 2a b+ b a c d 4b c− 4a d+ 4d a− 4c b− vec In Invec6 vec V vec U vec vec U−V vec vec U+V 1 input data (a c d− −)1−c ⎡ ⎤ ⎣ ⎦ (a b c+ +) 1+a ⎡ ⎤ ⎣ ⎦ (a b d− +) 1−b ⎡ ⎤ ⎣ ⎦ (b c d− +) 1+d ⎡ ⎤ ⎣ ⎦ 8 input data vec

In n means left shift input vector by n bits

1 , Clock C 2 , Clock C 1

Stage Stage 2 Stage 3 Stage 4, 2 4,1 Stage 1 8 X X− 9 16 X−X 17 24 X−X 25 32 X−X 33 40 X−X 41 48 X−X 49 56 X−X 57 64 X−X 1 , Clock C 2 , Clock C b a c d 5,2

Stage Stage 6,2 Stage 7,2 Stage 8 1₆₄ 5,1

Stage Stage 6,1 Stage 7,1

1 8 Y−Y 9 16 Y−Y 17 24 Y−Y 25 32 Y−Y 33 40 Y−Y 41 48 Y−Y 49 56 Y−Y 57 64 Y−Y 그림 1. 빠른 2 차원 정수 DCT 변화의 블록도

Fig. 1. Symbolic block diagram of fast 2-D forward integer.

every clock. The use of dual clocks reduces the required clock cycles to half.

4. Increasing implementation throughput Multiplication with  in Stage 1 of forward

integer transform is simply data alignment and can be coupled with Stage 2. Similarly, the division by 64 can be combined with data alignment in Stage 8. The post multiplication of a vector Vvec with matrices H2,

Q2 and Q3 requires a single addition for output

generation. Hence, the output generation of Stage 1 and 2 combined requires 1 addition for each output. Similarly, the output generation from Stage 2, 3, 4 and 5 requires 1 addition for each output. For Stage 6 and 7 a maximum of 2 addition operations are required for each output. Finally, to further decrease the clocked stages, the vector re-alignment operation and division by 64 can be merged with the computations of Stage 7. For faster data buffering each of the alternate stages are given anti-phase clocks.

Consider a continuous stream of 64 input data pattern be given by Dt where t denotes the input

time instance. Let the output of each stage be given by Us,t' where s gives the stage number and t'

denotes that the output was generated due to vector

그림 2. 순방향 정수 DCT변화의 타이밍도

Fig. 2. Timing diagram of forward integer DCT transform.

(6)

input at tth time stamp. The outputs from the different pipelined stages at different time stamps are shown in Figure 2. It can be seen that the maximum latency of the forward integer transform is 3 clock cycles.

V. Comparison of architecture and implementation

Let the output obtained by the theoretical, the hardware implementation of the architecture given in [10], and the proposed architecture be given by YTh,

YOr, and YPr, respectively. The root mean difference

square between the theoretical value and the values obtained from integer hardware implementation can be defined as _



 _



   







 _; _



 _



   



  





Using 10,000 sets of 64 input integer values, it was found that for the hardware implementation of the architecture given in [10], if the output of each stage is approximated towards the smaller integer, then the root mean difference square between the theoretical value and the values from hardware is 0.99. In case of the proposed implementation, the approximation error is reduced to 0.2456. If only the final stage is implemented using the fixed point architecture and 6 bits are used for exponential part, then the error is reduced to 0.98 for the hardware implementation of the architecture given in [10], while that of the proposed architecture is reduced to 0.0001.

An integer implementation of the proposed architecture was carried out using 0.18µm CMOS technology. The synthesis results show that both clocks can be triggered at 1.184 GHz in the dual clock architecture. Table 1 shows the comparison of

I II III IV V VI [14] 0.18 125 21737 8 500M 46K [15] 0.18 125 18500 4 1000M 27K [16] 0.18 200 19147 4 800M 41.8K [17] 0.18 77 55728 32 2464M 44.2K [18] 0.18 50 39800 8 400M 10.1K [12] 0.18 200 63618 32 3200M 100.6K Pro. 0.18 1184 44864 64 25267M 535.03K I. Technology (μm)

II. Operating Frequency (MHz) III. Area (gate count)

IV. Data Processing Rate V. Throughput (pixels/sec)

VI. Throughput/ Area (pixels/sec/gate) Pro.: Proposed design

표 1. 제안된 8×8 순방향 정수변화와 기존 아키텍처

성능비교

Table 1. Comparison of proposed design of 8×8 forward integer transform with existing architecture.

the proposed architecture with the existing architectures.

Some of these architectures support multi-transforms [19]. Although, the proposed design does not support multi-transform, the throughput and the throughput/area ratio of the proposed design is more than 5 times that of other architectures.

Ⅵ. Conclusion

In this paper, a fast and cost-effective algorithm and the implementation of two-dimensional (2-D) 8×8 forward and inverse integer discrete cosine transform (DCT) for H.264 are presented. The Kronecker product and direct sum operations were used for proposing the forward and inverse integer transform using simple addition operations. The matrix decomposition is structured to reduce the approximation error in integer hardware implementation. The dual clock pipelined architecture was proposed based on the integer matrix decomposition of forward and inverse transform. The synthesis result shows that the maximum operating frequency of the proposed architecture is 1.184 GHz,

(7)

achieving a throughput of 25.27 Gpixels/sec, requiring 44864 gates. The proposed architecture achieves throughput of more than 5 times that of the other existing architectures. Due to the high throughput to area ratio, the proposed design can effectively be integrated into the real-time processing of H.264/AVC high definition video.

References

[1] T. Wiegand and G. Sullivan, Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T rec. H.264/ISO/IEC 14496-10 AVC, presented at Joint Video Team (JVC) of ISO/IEC MPEG and ITU-T VCEG), 2003.

[2] I.E.G. Richardson, H.264 and MPEG-4 Video Compression – Video Coding for Next-generation Multimedia, John Wiley & Sons Ltd., 2003.

[3] H.S. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, “Low complexity transform and quantization in H.264/AVC,” IEEE Trans. Circuits Syst. Video Technol., vol.13, no.7, pp.598 –603, July 2003.

[4] S. Gordon, D. Marple, and T. Wiegand, “Simplified use of 8×8 transforms—Updated proposal and results,” in Proc. JVT-K028, 11th Meeting, Munich, Germany, Mar. 2004.

[5] A. Madisetti and A. N. Willson, “A 100 MHz 2-D 8×8 DCT/IDCT processor for HDTV applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 5, pp. 158–165, Apr. 1995.

[6] S. Uramoto et al., “A 100-MHz 2-D discrete cosine transform core processor,” IEEE J. Solid-State Circuits, vol. 27, pp. 492–498, Apr. 1992.

[7] Guo-An Su and Chih-Peng Fan, “Low-Cost Hardware-Sharing Architecture of Fast 1-D Inverse Transforms for H.264/AVC and AVS Applications,” IEEE Trans. on Circuits and Systems-II: EXPRESS BRIEFS, vol. 55, NO. 12, pp. 1249-1253, December, 2008.

[8] Y. P. Lee, T. H. Chen, L. G. Chen, M. J. Chen, and C. W. Ku, “A cost effective architecture for 8×8 two-dimensional DCT/IDCT using direct method,” IEEE Trans. Circuits Syst. Video Technol., vol. 7, pp. 459–467, June 1997.

[9] Y.-M. Huang and J.-L.Wu, “A refined fast 2-D discrete cosine transform algorithm,” IEEE Trans. Signal Processing, vol. 47, pp. 904–907, Mar. 1999.

[10] C. P. Fan, “Fast 2-dimensional 8×8 integer transform algorithm design for H.264/AVC fidelity range extensions,” IEICE Trans. Inf. Syst., vol. E89-D, pp. 3006–3011, Dec. 2006. [11] C. P. Fan and Y. L. Lin, “Implementations of

low-cost hardware sharing architectures for fast 8 ×8 and 4×4 integer transforms in H.264/AVC,” IEICE Trans. Fundamentals, vol. E90-A, no. 2, Feb. 2007.

[12] Woong Hwangbo and Chong-Min Kyung, “A Multi-Transform Architecture for H.264/AVC High-Profile Coders”, IEEE Trans. on Multimedia, Vol.12, No.3, Apr. 2010.

[13] R.A. Horn and C.R. Johnson, Topics in Matrix Analysis, pp.239–267, Cambridge Univ. Press, New York, 1991.

[14] C. P. Fan, “Cost-effective hardware sharing architectures of fast 8×8 and 4×4 integer transforms for H.264/AVC,” in Proc. IEEE Asia Pacific Conf. Circuits and Systems, Dec. 2006, pp. 776–779.

[15] Y. C. Chao, H. H. Tsai, Y. H. Lin, J. F. Yang, and B. D. Liu, “A novel design for computation of all transforms in H.264/AVC decoders,” in Proc. IEEE Int. Conf. Multimedia and Expo, Jul. 2007, pp. 1914–1917.

[16] Y. Li, Y. He, and S. Mei, “A highly parallel joint VLSI architecture for transforms in H.264/AVC,” J. Signal Process. Syst., vol. 50, pp. 19–32, Oct. 2007.

[17] G. Pastuszak, “Transforms and quantization in the high-throughput H.264/AVC encoder based on advanced mode selection,” in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, Apr. 2008, pp. 203–208.

[18] C. Y. Huang, L. F. Chen, and Y. K. Lai, “A high-speed 2-D transform architecture with unique kernel for multi-standard video applications,” in Proc. IEEE Int. Symp. Circuits and Systems, May 2008, pp. 21–24.

[19] Meeturani Sharma, Honey Durga Tiwari, Yong Beom Cho, “High throughput parallel design of 2-D 8 × 8 integer transforms for H.264/AVC”, SoC Conference, Seoul, April, 2012.

(8)

저 자 소 개 미투라니 사르마(학생회원)

2003년 Nagpur University, Nagpur, India 전자공학과 학사 졸업. 2012년 건국대학교 전자공학부 석사 졸업.

<주관심분야: System-on-chip (SoC) design, embedded software, multi-core processor systems, and error control coding.>

하니 티와리(학생회원) 2003년 Nagpur University,

Nagpur, India 전자공학과 학사 졸업.

2006년 VNIT, Nagpur, India, 전자공학부 석사 졸업. 2012년 건국대학교 전자공학부

박사 졸업.

<주관심분야 : Error control coding, Image processing, SoC design, embedded software> 조 용 범(평생회원) 1981년 경북대학교 전자공학과 학사 졸업. 1988년 Univ. of S. Carolina, United States 전자공학과 석사 졸업. 1992년 Case Western Reserve University, OH,

United States 전자공학과 박사 졸업. 현재 건국대학교 전자공학과 교수

<주관심분야 : Embedded system design, ASIC design, multi-core processor systems, and error control coding.>