Eduarda Monteiro, Bruno Vizzotto, Cláudio Diniz, Bruno Zatt, Sergio Bampi
Informatics Institute - PPGC - PGMICRO
Federal University of Rio Grande do Sul (UFRGS)
Porto Alegre, Brazil
{ermonteiro, bbvizzotto, cmdiniz, bzatt, bampi}@inf.ufrgs.br
Abstract— This work presents a parallel GPU-based solution for the Motion Estimation (ME) process in a video encoding system. We propose a way to partition the steps of the Full Search block matching algorithm on the CUDA architecture. A comparison between the performance achieved by this solution, a theoretical model, and two other implementations (sequential and parallel using the OpenMP library) is made as well. We obtained an O(n²/log²n) speed-up, which fits the proposed theoretical model considering different search areas. It represents up to a 600x gain compared to the serial implementation, and 66x compared to the parallel OpenMP implementation.

Keywords- Motion Estimation; H.264/AVC; GPU; CUDA.

I. INTRODUCTION

In the past decade, the demand for high-quality digital video applications has drawn the attention of industry and academia, driving the development of advanced video coding techniques and standards. This effort resulted in the publication of H.264/AVC [1], the state-of-the-art video coding standard, which provides higher coding efficiency at the cost of increased computational complexity compared with previous standards (MPEG-2, MPEG-4, H.263). Boosted by the evolution of codecs and devices, digital video coding has become present in a wide range of applications, including TV broadcasting, video conferencing, video surveillance, and portable multimedia devices, to list a few.

Among all the innovative tools featured by the latest video coding standards, Motion Estimation (ME) is the most important for obtaining expressive coding gains. In H.264/AVC the ME provides even more coding efficiency due to the insertion of bi-prediction and variable block-size capabilities [1]. These new tools represent a significant complexity increase in comparison to predecessor standards, posing a big challenge to real-time video encoding at high definition.

Motion Estimation explores the temporal redundancy of a video by searching reference frames (previously encoded frames) for the region most similar to each region of the current frame. It is performed on a block-by-block basis, for each block of a video frame, using a block matching algorithm to find the best 'match' in a reference frame. The best 'match' is defined by using a similarity criterion, e.g. the Sum of Absolute Differences (SAD), the one most commonly used in commercial and research implementations. Once the best matching candidate is selected, a motion vector (MV) pointing to that position is calculated. The MV indicates to the video decoder the location of the most similar block, in the reference frame, that must be used to 'predict' the current block. In this way, only the motion vector and the residues, i.e. the pixel-by-pixel difference between the current block and the best matching block, are transmitted to the decoder.

The block matching algorithm and the matching similarity criteria are important encoder issues not standardized by H.264/AVC. The block matching task requires intensive computation and memory communication, representing 80% of the total computational complexity of current video encoders [3]. Some block matching algorithms for ME have great potential for parallelism. Full Search (FS) [2], the optimal solution, performs the search for the best match exhaustively inside a search area by calculating the similarity for every candidate block position (with a pixel-by-pixel offset). In total, 15625 candidate blocks have to be evaluated considering a search area of 128x128 pixels and a block size of 4x4 in the full search algorithm. However, the SAD calculation for one block has no data dependencies with other blocks, enabling simultaneous parallel processing. In other words, ME using the FS algorithm has promising potential for efficient implementation on massively parallel architectures.

Recently, the academic and industrial parallel processing communities have turned their attention to Graphics Processing Units (GPUs). GPUs were originally developed for graphics processing and 3D rendering but, due to their great potential for parallelization, their capabilities and flexibility were extended to target general-purpose processing. These devices are referred to as GPGPUs (General-Purpose GPUs). The FERMI architecture [3], first proposed by NVIDIA in 2009, is the most popular and prominent GPGPU solution available nowadays. It enables a high computing performance increase over previous architectures. Its main innovations are the inclusion of two levels of cache (L1 and L2), the addition of cores and special units for data handling, and double-precision floating-point units. The CUDA architecture (Compute Unified Device Architecture) [4] was proposed by NVIDIA in 2007 [3] with the objective of exploiting the high degree of parallelism inherent to their graphical devices. The great computational power offered by this type of technology has given this architecture great prominence in diverse areas, especially in the scientific community.
frame. Thus, as a product of this process, only the motion vector and the residual data are transmitted instead of the original information regarding the entire frames.

Figure 2. Optimum Motion Estimation Algorithm Diagram.

A. Full Search Algorithm

Considering that any block can be chosen when the ME is performed, all the blocks that constitute the search area can be classified as candidate blocks. This way, a search algorithm determines how the candidate block search moves inside the search area in order to find the best match.

Among the different search algorithm approaches for Motion Estimation, the Full Search is considered in this work. The Full Search algorithm aims to find the 'best match' between the block of the current frame and all possible positions inside the search area set in the reference frame. Thus, this algorithm is computationally more expensive than the other algorithms in the literature. Despite this, the algorithm is considered optimal because it is capable of generating improved motion vectors (best matches), resulting in the best video quality and best encoding efficiency among all other fast motion estimation algorithms [2].

B. Similarity Criteria

To find the best match, a metric is required to evaluate the differences between the current and candidate blocks. In this context, a similarity criterion must be used. This criterion is also known as a distortion criterion. The distortion (or difference) is inversely proportional to the degree of similarity between the blocks.

Different similarity criteria are often used in video coding: (i) Mean Square Error (MSE); (ii) Sum of Absolute Transformed Differences (SATD); (iii) Sum of Absolute Differences (SAD). Considering its simplicity, the SAD is the most used similarity criterion for ME and it is also the one adopted in this work. SAD calculates the distortion between the current block and each candidate block in the search area by adding the pixel-by-pixel absolute differences:

SAD(x,y) = Σ_{i=0..w−1} Σ_{j=0..h−1} |Current(i,j) − Candidate(i,j)|    (1)

The SAD definition is presented in (1), where w is the width and h is the height of both the candidate and the current block. The candidate block is chosen when it presents the lowest SAD value, i.e. the lowest distortion in relation to the current block. The position (x,y) of the best candidate block is represented by the motion vector.

C. Theoretical Model

A theoretical model was used to analyze the Full Search algorithm behavior in advance. The PRAM (Parallel Random-Access Machine) model was considered to calculate the parallel complexity. However, PRAM alone is not enough to model the parallel complexity of the GPU architecture, since this paradigm assumes shared memory and neglects practical issues such as communication time. Therefore, we improved the PRAM model by adding the concept of communication between CPU and GPU.

The main goal of this analysis is to establish a comparative basis for the experimental results. The complexity calculation was based on the following variables: (i) size of the current block: 4x4; (ii) size of the search area: n x n pixels, where n refers to the search area width and height (square search area); (iii) frame resolution: M x N, where M and N are the frame width and height in pixels, respectively; (iv) similarity criterion: SAD.

1) Sequential Complexity: In order to calculate the sequential complexity of the ME FS algorithm we considered the total number of subtractions, absolute-value calculations, SAD addition operations, and SAD comparisons for a 4x4 block. The number of blocks in the search area and the frame resolution were also taken into account. In conclusion, the algorithm presents a complexity of O(n²):

T_s = [16(n−3)² + 16(n−3)² + 15(n−3)² + ((n−3)² − 1)] × [M × N]    (2)

2) Parallel Complexity (PRAM): To calculate the parallel complexity of the ME FS algorithm we first used the PRAM model. A granularity of log(n) x log(n) was considered for each sub-problem inside the n x n search area. Thus, the obtained parallel complexity is O(log²n), based on:

• Number of processors P(n): considering that each thread is responsible for a region of log(n) x log(n) pixels:

P(n) = (n / log n)² = O(n² / log²n)    (3)

• Execution time Tp(n): the execution considers the number of subtractions and absolute-value calculations, additions, and SAD comparisons between regions of size log²n:
Tp(n) = 2·log²n + (15/16)·log²n + 2×(log²n / 16) = O(log²n)    (4)

• Total cost C(n): the cost to implement a parallel program (P(n) × Tp(n)):

C(n) = (n² / log²n) × log²n = n² = O(n²)    (5)

3) Communication: The hardware architecture considered in this work is based on the GPU as a co-processor attached to the CPU; consequently, the CPU-GPU communication time is of key importance for the application performance. In addition to the complexity obtained by the PRAM model, we included a concept of communication based on latency (l) and throughput (d). This way, we obtained a more realistic model which takes into consideration two data transfers (the input data from CPU to GPU and the calculation output from GPU to CPU) and the execution of one function f(n):

T(n) = f(n) + 2·(l + n²/d)    (6)

For the parallel FS step, with f(n) = log²n, this gives:

Tp(n) = log²n + 2·(l + n²/d)    (7)

... shown in Fig. 3. The ME under CUDA proposed in this paper is composed of two steps: (i) SAD value calculation for all candidate blocks inside the search area; (ii) comparison of the SAD values of all candidate blocks to find the best matches (lowest SAD). Finally, the ME results were stored in a text file with all the motion vectors generated for the current frame and transferred to CPU memory.

A library called Thrust [13] was used for data manipulation between CPU and GPU. This library was developed for the CUDA architecture in order to facilitate the creation of parallel applications. Using this library, a CPU-GPU transfer requires only a simple assignment (by overloading the assignment operator).

The execution of this application on the GPU follows the processing hierarchy of the CUDA architecture. In this hierarchy there are three important concepts: (i) thread: the basic unit of processing; (ii) block: a set of threads; (iii) grid: composed of many blocks. This way, we defined a programming model for this algorithm as shown in Fig. 4.
The block size in this application varies according to the size of the search area; (iv) Grid: the blocks are organized in a grid. In this work, the grid size is related to the video resolution obtained as an input parameter.

In order to establish a comparative basis for our CUDA-based algorithm, we also implemented a sequential and a parallel OpenMP-based [16] version of the ME FS algorithm. Our OpenMP version considered a maximum of four execution threads in order to fit a 4-core processor.

The search area sizes ranged from 12 to 128, where n*n is the number of pixels inside a search area. The data obtained for the CUDA version are illustrated in the graph shown in Fig. 5. It can be observed that the execution time (for one frame) is constant and independent of the video sequence, since the same number of candidates is tested in all cases.

Figure 5. Execution time of the FS algorithm in CUDA for three CIF video sequences.
The OpenMP version results (in green, Fig. 6) show a higher performance than the sequential version due to the use of four parallel threads. However, the speed-up is theoretically limited to a factor of 4x. Finally, the GPU CUDA implementation (in purple, Fig. 6) shows the best execution time results among all the platforms. In the GPU version the results are kept constant, independent of the size of the search area considered, due to the high-level parallelism available in this architecture. For 128x128 pixel ... results especially for higher search areas, which increase the resulting video quality and coding efficiency.

C. Results for HD1080p Resolution

In this section, we present the results for FullHD resolution (1080p). Fig. 8 shows the results generated for the sequential and OpenMP versions, also for high-resolution video. Considering a 48x48 search area, the CUDA version surpasses a three-order-of-magnitude gain in execution time compared with the sequential version, which is an impressive result. Also, the CUDA version results remain independent of the increase in search area.

Figure 9. Total execution time vs. communication time for the CPU to GPU data transfer, considering FullHD 1080p resolution.
D. Speed-up Results

Fig. 10 presents the speed-up achieved by the Motion Estimation CUDA version in relation to the sequential version, compared with the following speed-up values: i) the OpenMP implementation; ii) the expected speed-up calculated by the theoretical model presented in Section III.

Figure 10. Speed-up – CUDA vs. expected by the proposed theoretical model vs. OpenMP.

The speed-up obtained by the CUDA version presented a quadratic growth in relation to the size of the search area, while for the OpenMP version the growth is kept constant. The obtained speed-up represents up to a 600x gain compared to the sequential version, and 66x compared to the OpenMP version. Furthermore, Fig. 10 shows that the CUDA speed-up results were also consistent with the theoretical model O(n²/log²n) considering the increase in search area.

In addition, this work is also compared to related works [7]-[9]. All these works implemented ME considering the Full Search algorithm with integer pixel accuracy on NVIDIA boards using the Stefan video sequence (CIF resolution). The speed-up results presented in Tab. I include 16x16 pixel and 32x32 pixel search areas.

TABLE I. SPEED-UP COMPARISON WITH RELATED WORK

... due to our efficient mapping of parts of the FS algorithm to the threads of the CUDA architecture. Also, the device we used in our experiments (GTX 480) is faster and has more computation cores than those used in the related works. This shows that our FS algorithm mapping is efficiently scalable and benefits from the increase of computation cores in GPU architectures.

VI. CONCLUSIONS

This paper presented a detailed analysis and implementation of the motion estimation module on a GPU architecture considering the Full Search algorithm. We have shown that the use of a GPU is a good alternative for accelerating the motion estimation process in video coding. First, sequential and parallel theoretical models considering the communication cost between CPU and GPU were developed in order to analyze the execution time and speed-up potential before the implementation. Then, a parallel version of the FS block matching algorithm was developed and mapped onto the NVIDIA CUDA architecture using the SAD similarity criterion. OpenMP and sequential versions were also implemented for comparison. A variety of tests were performed for different sizes of the search area (12x12 to 128x128) and video resolutions (CIF, HD720p and HD1080p). The obtained results were compared to the other versions of the algorithm and to the theoretical model. The speed-up obtained by the CUDA version represents up to a 600x gain compared to the sequential version, and 66x compared to the OpenMP version. Compared with the related work proposed in [8] (which provides the best results in the literature), we achieved a 27% speed-up increase for a 16x16 search area. This gain was achieved through an optimized mapping of FS algorithm parts to threads in the CUDA architecture. Future work could focus on deeper exploration of data sharing between CPU and GPU, GPU memory organization, and also the implementation of sub-optimal (fast) block matching algorithms targeting real-time video encoding.

ACKNOWLEDGMENT

The authors thank the partial support provided by the CNPq and CAPES Brazilian funding agencies.
[6] VideoLAN, http://www.videolan.org/developers/x264.html. Accessed: Jun. 2011.
[7] W-N. Chen, H-M. Hang, "H.264/AVC motion estimation implementation on Compute Unified Device Architecture (CUDA)," in IEEE International Conference on Multimedia and Expo (ICME), pp. 697-700, 2008.
[8] Y-C. Lin, P-L. Li, C-H. Chang, C-L. Wu, Y-M. Tsao, S-Y. Chien, "Multi-pass algorithm of motion estimation in video encoding for generic GPU," in IEEE International Symposium on Circuits and Systems (ISCAS), 2006.
[9] C-Y. Lee, Y-C. Lin, C-L. Wu, C-H. Chang, Y-M. Tsao, S-Y. Chien, "Multi-Pass and Frame Parallel Algorithms of Motion Estimation in H.264/AVC for Generic GPU," in IEEE International Conference on Multimedia and Expo (ICME), pp. 1603-1606, 2007.
[10] M. Garland, S-L. Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, V. Volkov, "Parallel computing experiences with CUDA," IEEE Micro, vol. 28, no. 4, pp. 13-27, Jul.-Aug. 2008.
[11] B. Pieters, D. V. Rijsselbergen, W. D. Neve, R. V. Walle, "Performance evaluation of H.264/AVC decoding and visualization using the GPU," in Proc. SPIE, vol. 6696, A. G. Tescher, Ed., SPIE, 2007.
[12] I. Richardson, H.264/AVC and MPEG-4 Video Compression - Video Coding for Next-Generation Multimedia. Chichester: John Wiley and Sons, 2003.
[13] Thrust, "Thrust - Code at the speed of light," http://code.google.com/p/thrust/wiki/QuickStartGuide. Accessed: Jun. 2011.
[14] The OpenMP API specification for parallel programming. Available at: http://openmp.org/wp/