Full Search Equivalent Motion Estimation Algorithm for General-Purpose Multi-Core Architectures

(1)

Full Search Equivalent Motion Estimation Algorithm for General-Purpose Multi-Core Architectures

Chun-Su Park

^†

†

Dep. of Info. and Telecom. Eng. Sangmyung University

ABSTRACT

Motion estimation is a key technique of modern video processing that significantly improves the coding efficiency significantly by exploiting the temporal redundancy between successive frames. Thread-level parallelism is a promising method to accelerate the motion estimation process for multithreading general-purpose processors. In this paper, we propose a parallel motion estimation algorithm which parallelizes the motion search process of the current H.264/AVC encoder. The proposed algorithm is implemented using the OpenMP application programming interface (API) and can be easily integrated into the current encoder. The experimental results show that the proposed parallel algorithm can reduce the processing time of the motion estimation up to 65.08% without any penalty in the rate-distortion (RD) performance.

Key Words : Motion estimation, H.264/AVC, OpenMP, parallel processing English

1. Introduction

In recent years, multi-core processors are being widely promoted by the industry and have rapidly grown after single-core processors reached physical bottlenecks (power and thermal problems) [1, 2]. The multi-core central processing unit (CPU) is a single CPU package that consists of two or more independent processing cores, and multiple program threads are executed in parallel on distinct CPU cores such that the overall processing time can be decreased effectively. This multi-core processor architecture entails new challenges such as parallel algorithm designing, concurrent programming, and debugging to software developers because the performance of the multi-core processor is strongly dependent on software algorithms and implementation methods [3].

However, most of the current software including video codecs is still being developed on the basis of the single-core processing architecture, instead of the multi-core processing architecture [4]. That is, no data parallelism or function parallelism is considered

in most software development processes.

In modern video processing, block-based motion estimation is a key technique to reduce the temporal redundancy existing between adjacent frames [5, 6].

For each block in the current frame, the motion estimation process is to find the best prediction, called the reference block, which minimizes the matching difference between the two blocks within the predetermined search window. The motion vector (MV) is used for specifying the spatial displacement between the current and reference blocks. The motion estimation process requires intensive computation and memory communication for the block-matching

†E-mail : [email protected]

Fig. 1. Example of a general-purpose multi-core processor.

(2)

task. Therefore, it is well-known that motion estimation is the most time-consuming process of video encoders.

The simplest motion estimation algorithm is the full search (FS) algorithm [7]. This algorithm provides the optimal MV by exhaustively searching all possible positions within the search window. However, the FS algorithm is not a feasible solution for practical applications because of its heavy computational load.

Various fast search algorithms such as partial distortion elimination (PDE) and successive elimination algorithm (SEA) have been proposed to reduce the computational complexity of the motion estimation process without sacrificing the coding efficiency [8-10]. These algorithms reduce the amount of computation drastically by examining only a limited number of pixels in the search windows. In the current implementation, the H.264/AVC encoder adopts an FS-equivalent spiral search method to accelerate the motion estimation process [11]. The present study focuses on the spiral search and proposes a fast motion estimation algorithm based on thread-level parallelization for the H.264/

AVC encoder.

The rest of this paper is organized as follows.

Section 2 introduces the spiral search and analyzes the current implementation of the H.264/AVC encoder. Based on the analysis, a parallel motion estimation algorithm is proposed. The experimental results of the proposed algorithm are presented in Section 3. Finally, conclusions are drawn in Section 4.

2. Proposed parallel motion estimation algorithm

The H.264/AVC encoder uses the Lagrangian optimization technique to find the optimal MV for the inter-coded block [12]. It yields high performance in providing the optimal bit allocation to the MV. Let s and c(m) be, respectively, the original block and its reconstruction obtained by using the MV m. For the inter-coded block, rate-constrained motion estimation is performed to find the optimal MV by minimizing the following Lagrangian functional

(1)

where m

p

is the prediction for the MV and λ is the Lagrange multiplier [13]. The distortion D(s, c(m)) is the sum of absolute differences (SAD) or the sum of squared differences (SSD) between the original block s and its reconstruction c(m). The rate R(m − m

p

) specifies the number of bits required for encoding the difference between the original MV and the predicted one.

In general, the MV distribution is highly center- biased. Cheung et al. showed that most of the MVs concentrate on the search center, i.e., approximately 81.80% of the MVs are located in the central 5 × 5 square area [14]. In order to exploit the characteristics, the spiral search adopted in the H.264/AVC encoder starts the search process at the center of the search window and then moves the search position pixel by pixel in a spiral structure around the search center.

Note that, in the current implementation, the MV prediction m

p

is used as an initial search center. The spiral search produces the best performance for the video signal that has the MV distribution gathered at the zero position (0,0). This is because the probability of finding the minimum matching error at the beginning of the search process will increase. It is well-known that the spiral search achieves better performance than the FS algorithm that uses raster scan order.

The spiral search establishes an inequality relation to accelerate the search process while preserving the rate-distortion (RD) optimized solution for the MV, given the Lagrange multiplier λ. Let Ψ be a search window of size (2w + 1) · (2w + 1) and , , be an MV indicating a possible pixel position in a spiral order. Then, Ψ is composed of pixel positions within the search window as follows

(2)

where W = (2w + 1) · (2w + 1). In the motion estimation process, after examining the sub-window , the encoder can obtain the current minimum RD cost as follows

J m s,m (

_p

) = D s,c m ( ( ) ) λ R m−m + ⋅ (

_p

) (3)

m

_k

∈ Ψ 0 k < ≤ ( 2w 1 + ) 2w 1 ⋅ ( + )

Ψ = { m

₁

, m

₂

, … m ,

_W

}

Ψ

_k

= { m

₁

, … m ,

_k

}

J

_k^min

J

_k^min

min J m { (

_i

s,m

_p

) }

mi Ψk∈

=

(3)

where .

The Lagrangian cost function consists of two parts:

distortion and rate. From (1), the inequality relation between the cost and rate can be simply derived by eliminating the distortion D(s,c(m)) as follows

(4)

This yields the result that, after examining the sub- window , the spiral search needs to evaluate the cost function only at the pixel positions that satisfy the following criterion

(5)

where . When this inequality is satisfied at a search position, the cost function is evaluated at that particular position. And if its RD cost is less than the current minimum RD cost, the current RD cost is updated.

In the spiral search, there are many potential opportunities to exploit parallelism. The search window can be divided into several regions using different scan patterns. For example, the video encoder can simply use run lengths which are applied consecutively to Ψ until the search window is completed (see Fig. 2(a)).

However, as indicated in (5), the amount of complexity reduction depends on how fast the encoder finds the pixel position with a low RD cost.

Therefore, the parallel search process can be accelerated further by assigning the search window to available threads such that each thread can find pixels with low RD costs at the beginning of the search process.

It is well-known that, since natural images are dominated by low-frequency components, similar pixel-matching errors tend to appear together in clusters. This characteristic is demonstrated in [15].

Accordingly, it can be found out that the RD costs of the adjacent pixels are strongly correlated to each other. Therefore, if the search window is partitioned consecutively and each thread evaluates a partitioned region independently, a single thread may take charge of the entire region with low RD costs. This results in that the other threads should evaluate many candidate positions because their current minimum RD costs are relatively high. Based on this observation, this paper proposes a parallel motion estimation strategy to parallelize the spiral search on multithreaded hardware architecture. Let N be the number of threads which the encoder can utilize in the motion search process. Then, the motion search window is first partitioned into N scattered regions L

n

’s as follows

(6)

where and % means a modulus operator.

The proposed method assigns ( W/N) pixel posi- tions to each thread and a thread evaluates the assigned pixel positions independently. Fig. 2(b) shows an example of the proposed partitioning method using (6).

In the proposed implementation, a master thread creates a team of threads at the beginning of the motion search process. A thread in the team is assigned its search region based on (6) and each thread searches for the best pixel position inside its search region. Note that, according to (5), a thread examines the pixels within its search region in the spiral order. In this case, the processing times of threads are different from each other because the numbers of pixels to be examined are not the same among threads. Therefore, after completing the parallel motion estimation, all threads in the team synchronize at the end of the searching process. Only the master thread continues execution. Let be the minimum RD cost of the region L

n

. The master thread finds the global minimum RD cost among

’s as follows 0 i k < ≤

R m−m (

_p

) J λ ---

≤

Ψ

_k

R m (

_j

−m

_p

) J

_k^min

--- λ

≤ k j W < ≤

L

_n

= { m

_k

k%N n = } 0 n N ≤ <

J

_L^min_n

J

^min

J

_L^min_n

Fig. 2. Search space partitioning methods based on (a) run

lengths and (b) scattered regions.

(4)

. (7)

Finally, the optimal search position is coded using the MV corresponding to . Fig. 3 shows the overall process of the proposed parallel motion estimation algorithm where N is equal to 4.

3. Experimental results

The performance of the proposed parallel motion estimation algorithm was measured using a 3.40 GHz quad-core processor with 8GB RAM. The proposed algorithm was implemented using the OpenMP application programming interface (API) which allows a compiler to exploit thread-level parallelism and optimize the performance of the application by adding a few pragmas [16]. Therefore, the proposed multi-thread source code is very similar to the single- thread code and can be easily integrated into the current encoder.

In the experiments, we used the JM18.3 reference software and evaluated the performance of the proposed method by using several video sequences with CIF@30fps format [17]. We measured the processing time of the motion estimation process of the encoder. In all sequences, only the first frame was encoded in intra mode. The MV search range was set

to 64 pixel × 64 pixel and the number of reference frames was set to 1.

Table 1 shows the motion estimation times of the CIF sequences with variation of the number of threads ( N) for the given configuration. In the simulation, N ranges from 1 to 4 and, in all simulations, the bitrate and PSNR of the proposed multithreaded encoder are exactly the same as those of the original encoder. We first show the influence of N on the motion estimation time. Table 1 shows that the average motion estimation time (AMET) is reduced significantly as N increases. In our simula- tions, the reductions of AMET are 33.24%, 49.95%, and 58.53% for cases N = 2, N = 3, and N = 4, respectively. The results indicate that, as N increases, the performance of the proposed method is improved but the amount of the performance improvement becomes smaller. The proposed method can reduce the AMET by up to 63.86% for the FOOTBALL sequence with N = 4. Table 1 also shows that the amount of the AMET reduction becomes larger as the bitrate increases. For example, when the quantization parameter (QP) is set to 28, the AMET reduction with N = 4 is 56.71% for all CIF sequences. And, it increases to 66.95% when the QP is set to 16. This result indicates that the proposed method tends to show better performance at a high bitrate.

4. Conclusion

In this paper, a parallel motion estimation algorithm was proposed for multithreading general- purpose processors. The proposed algorithm divides the motion search window into several scattered regions, and each thread searches for the best position inside its search region. After completing the parallel motion estimation, the master thread finds the best search position among the local optimal positions obtained by all threads. Simulation results show that the proposed method affords significant time saving as compared to the reference software. Since the proposed algorithm was implemented by using the OpenMP API, it can be easily integrated into the reference software.

J

^min

= min J {

_L^min₀

, …, J

_L^min₍_N−1₎

}

J

^min

Fig. 3. The proposed parallel motion estimation process

where ME( L

n

) indicates the motion estimation

using the pixels of L

n

.

(5)

Acknowledgements

This work was supported by Research Funds of Sangmyung University in 2012.

References

1. J. Parkhurst, J. Darringer, B. Grundmann, From single core to multicore: preparing for a new exponential, Proc. IEEE/ACM International Conference on Com- puter-Aided Design, Nov. 2006, pp. 67-72

2. C. M. Huang, C. W. Lin, W. P. Tsai, A Multi-Core Based Parallel Streaming Mechanism for Concurrent Video-on-Demand Applications, IEEE Communica- tions Letters, vol. 13, no. 4, Apr. 2009, pp. 286-288.

3. J. H. Hyun, Fast Mode Decision Algorithm Based on Thread-level Parallelization and Thread Slipstream- ing in H.264 Video Coding, Proc. IEEE International Conference on Multimedia and Expo (ICME), Jul.

2010, pp.655-660.

4. T. P. Chen, D. Budnikov, C. J. Hughes, Y. K. Chen, Computer vision on multi-core processors: articulated body tracking, Proc. IEEE International Conference on Multimedia and Expo (ICME) , July 2007, pp.

1862-1865.

5. M. Z. Coban, R. M. Mersereau, A Fast Exhaustive Search Algorithm for Rate-Constrained Motion Esti- mation, IEEE Transactions on Image Processing, vol.

7, no. 5, May 1998, pp. 769-773.

6. W. Ouyang, F. Tombari, S. Mattoccia, L. D. Stefano, W. K. Cham, Performance Evaluation of Full Search Equivalent Pattern Matching Algorithms, IEEE Transactions on Pattern Analysis and Machine Intel- ligence, vol. 34, no. 1, Jan. 2012, pp. 127-143.

7. A. V. Paramkusam, V. S. K. Reddy, An Optimal Fast Full Search Motion Estimation Algorithm In Video Coding, Proc. International Conference on Signal Processing, Communication, Computing and Net- working Technologies (ICSCCN) , Jul. 2011, pp. 598- 603.

8. X. Q. Gao, C. J. Duanmu, C. R. Zou, A Multilevel Successive Elimination Algorithm for Block Match- Table 1. Performance of the Proposed Parallel Motion Estimation Algorithm for CIF Sequences

Sequences QP

JM (N = 1) N = 2 N = 3 N = 4

Bitrate

(kbits/s) PSNR (dB) Time (s)

Time Time Time

(s) (%) (s) (%) (s) (%)

FOREMAN

16 3686.08 46.15 2592.55 1641.98 63.34 1207.94 46.59 1006.04 38.81 22 1192.63 41.07 2021.14 1346.42 66.62 1122.81 55.55 850.42 42.08 28 405.96 36.93 1574.63 1061.46 67.41 791.08 50.24 680.62 43.22

FOOTBALL

16 5925.31 46.98 3673.82 2362.64 64.31 1581.89 43.06 1327.58 36.14 22 3252.49 42.20 3159.84 2032.09 64.31 1438.30 45.52 1197.06 37.88 28 1684.91 37.69 2588.74 1723.25 66.57 1192.05 46.05 1020.83 39.43

CITY

16 4113.04 46.02 3232.44 2211.74 68.42 1662.60 51.43 1355.45 41.93 22 1450.00 40.28 2765.22 1904.58 68.88 1552.86 56.16 1260.02 45.57 28 441.57 35.06 2262.91 1580.10 69.83 1281.65 56.64 1075.84 47.54

CREW

16 4283.25 46.70 3122.60 2014.35 64.51 1369.83 43.87 1143.71 36.63 22 1798.10 42.18 2557.07 1647.18 64.42 1164.23 45.53 1003.98 39.26 28 724.91 37.84 1997.86 1296.16 64.88 931.11 46.61 810.96 40.59

HARBOUR

16 6954.67 46.45 3627.99 2197.41 60.57 1639.50 45.19 1331.42 36.70 22 3906.14 40.95 3341.05 2092.10 62.62 1582.67 47.37 1283.32 38.41 28 1749.02 35.42 2771.78 1842.77 66.48 1416.15 51.09 1174.96 42.39

SOCCER

16 3993.53 46.04 3123.69 2248.26 71.97 1691.83 54.16 1381.96 44.24

22 1621.81 40.64 2532.59 1909.57 75.40 1524.06 60.18 1244.67 49.15

28 669.64 35.78 2181.08 1552.08 71.16 1213.86 55.65 1015.27 46.55

(6)

ing Motion Estimation, IEEE Transactions on Image Processing, vol. 9, no. 3, May 1998, pp. 501-504.

9. J. Y. Lu, K. S. Wu, J. C. Lin, Fast Full Search in Motion Estimation by Hierarchical Use of Minkowski's Inequality, Pattern Recognition, vol. 31, no. 7, 1998, pp. 945-952.

10. C. D. Bei, R. M. Gray, An Improvement of the Min- imum Distortion Encoding Algorithm for Vector Quantization, IEEE Trans. Comm., vol. 33, no. 10, Oct. 1985, pp. 1132-1133.

11. K. P. Lim, G. Sullivan, T. Wiegand, Text Description of Joint Model Reference Encoding Methods and Decoding Concealment Methods, Joint Video Team, Doc. JVT-N046, Jan. 2005.

12. C. S. Park, S. W. Jung, K. S. Choi, S. J. Ko, A fast encoding algorithm avoiding repetition of motion estimation in scalable video coding, Electronics Let- ters, vol. 46, issue 4, Feb. 2010, pp. 280-282.

13. G. Sullivan, T. Wiegand, Rate-distortion optimization for video compression,” IEEE Signal Processing

Magazine, vol. 15, 1998, pp. 74-90.

14. C. H. Cheung, L. M. Po, A novel cross-diamond search algorithm for fast block motion estimation, IEEE Trans. Circuits Syst. Video Technol., vol. 12, Dec. 2002, pp. 1168-1177.

15. K. C. Hui, W. C. Siu, Y. L. Chan, New Adaptive Par- tial Distortion Search Using Clustered Pixel Match- ing Error Characteristic, IEEE Transactions on Image Processing, vol. 14, no. 5, May 2005, pp. 597-607.

16. M. Sato, OpenMP: parallel programming API for shared memory multiprocessors and on-chip multi- processors, International Symposium on System Syn- thesis, Oct. 2002, pp. 109-111.