
2011 23rd International Symposium on Computer Architecture and High Performance Computing

Applying CUDA Architecture to Accelerate Full Search Block Matching Algorithm for High Performance Motion Estimation in Video Encoding

Eduarda Monteiro, Bruno Vizzotto, Cláudio Diniz, Bruno Zatt, Sergio Bampi
Informatics Institute - PPGC - PGMICRO
Federal University of Rio Grande do Sul (UFRGS)
Porto Alegre, Brazil
{ermonteiro, bbvizzotto, cmdiniz, bzatt, bampi}@inf.ufrgs.br

Abstract— This work presents a parallel GPU-based solution for the Motion Estimation (ME) process in a video encoding system. We propose a way to partition the steps of the Full Search block matching algorithm in the CUDA architecture. A comparison of the performance achieved by this solution with a theoretical model and with two other implementations (sequential and parallel using the OpenMP library) is made as well. We obtained an O(n²/log²n) speed-up, which fits the proposed theoretical model for different search areas. It represents up to a 600x gain compared to the serial implementation, and 66x compared to the parallel OpenMP implementation.

Keywords- Motion Estimation; H.264/AVC; GPU; CUDA.

I. INTRODUCTION

In the past decade, the demand for high quality digital video applications has drawn the attention of industry and academia, driving the development of advanced video coding techniques and standards. It resulted in the publication of H.264/AVC [1], the state-of-the-art video coding standard, which provides higher coding efficiency at the cost of increased computational complexity compared with previous standards (MPEG-2, MPEG-4, H.263). Boosted by the evolution of codecs and devices, digital video coding has become present in a wide range of applications including TV broadcasting, video conferencing, video surveillance, and portable multimedia devices, to list a few.

Among all the innovative tools featured by the latest video coding standards, Motion Estimation (ME) is the most important for obtaining expressive coding gains. In H.264/AVC the ME provides even more coding efficiency due to the introduction of bi-prediction and variable block-size capabilities [1]. The new tools represent a significant complexity increase in comparison to predecessor standards, posing a big challenge to real-time video encoding at high definition.

Motion Estimation exploits the temporal redundancy of a video by searching reference frames (previously encoded frames) for the region most similar to each region of the current frame. It is performed on a block-by-block basis, for each block of a video frame, using a block matching algorithm to find the best 'match' in a reference frame. The best 'match' is defined by using a similarity criterion, e.g. the Sum of Absolute Differences (SAD), the most commonly used in commercial and research implementations. Once the best matching candidate is selected, a motion vector (MV) pointing to that position is calculated. The MV indicates to the video decoder the location of the most similar block, in the reference frame, that must be used to 'predict' the current block. In this way, only the motion vector and the residues, i.e. the pixel-by-pixel difference between the current block and the best matching block, are transmitted to the decoder.

The block matching algorithm and the matching similarity criteria are important encoder issues not standardized by H.264/AVC. The block matching task requires intensive computation and memory communication, representing 80% of the total computational complexity of current video encoders [3]. Some block matching algorithms for ME have a great potential for parallelism. Full Search (FS) [2], the optimal solution, performs the search for the best match exhaustively inside a search area by calculating the similarity for every candidate block position (with a pixel-by-pixel offset). In total, 15625 candidate blocks have to be evaluated for a search area of 128x128 pixels and a block size of 4x4 in the full search algorithm. However, the SAD calculation for one block has no data dependencies with other blocks, enabling simultaneous parallel processing. In other words, ME using the FS algorithm has promising potential for efficient implementation on massively parallel architectures.

Recently, the academic and industrial parallel processing communities have turned their attention to Graphics Processing Units (GPU). GPUs were originally developed for graphics processing and 3D rendering but, due to their great potential for parallelization, GPU capabilities and flexibility were extended to target general purpose processing. These devices are referred to as GPGPU – General Purpose GPU. The FERMI architecture [3], first proposed by NVIDIA in 2009, is the most popular and prominent GPGPU solution available nowadays. It enables a high computing performance increase over previous architectures. Its main innovations are the inclusion of two levels of cache (L1 and L2), the addition of cores and special units for data handling, and double-precision floating point units. The CUDA architecture (Compute Unified Device Architecture) [4] was proposed by NVIDIA in 2007 [3] with the objective of exploiting the high degree of parallelism inherent in its graphics devices. The great computational power offered by this type of technology has given this architecture great prominence in diverse areas, especially in the scientific community.

1550-6533/11 $26.00 © 2011 IEEE


DOI 10.1109/SBAC-PAD.2011.19
By exploring the inherent parallelism of the ME FS algorithm and the large parallel processing power of recent GPUs, this work presents a parallel GPU-based solution for the FS block matching algorithm implemented on CUDA. We propose an efficient mapping of the FS algorithm to the CUDA programming model. Further, the performance of our solution running motion estimation on real video sequences is compared to the state-of-the-art and to in-house developed serial and OpenMP [14] implementations. It is also compared to the theoretical complexity calculated in terms of computation and communication.

This paper is organized as follows. Section II presents the state-of-the-art related work. Motion Estimation basic concepts are presented in Section III. Section IV describes our Motion Estimation Full Search block matching algorithm developed in CUDA. Section V shows the results, analysis, and comparisons with the state-of-the-art. Section VI concludes the work.

II. RELATED WORK

The H.264/AVC standard is considered the state-of-the-art in video coding, achieving better results than preceding standards (e.g. MPEG-2). Different video encoding software solutions for H.264/AVC have been developed since its standardization, e.g. the JM reference software [5] and the x264 free software library [6]. However, these encoders have no GPU acceleration libraries, or only proprietary ones (the case of x264).

Some works aiming to accelerate H.264/AVC video encoding using GPUs can be found in the literature. The works described in [7]-[9] focus specifically on the implementation of ME algorithms on GPU cards, and are therefore directly related to the scope of our work.

Chen et al. [7] presented a Motion Estimation algorithm for the GPU and the CUDA architecture considering the Full Search algorithm. That work divides the ME into different steps to achieve high parallelism with low data transfer between the CPU and GPU memories. These steps are: (i) SAD calculation for fixed-size blocks of 4x4 pixels; (ii) SAD calculation for variable block sizes; (iii) SAD comparisons for integer accuracy; (iv) SAD comparisons for fractional accuracy; (v) refinement of the ME for fractional accuracy. Since it considers variable block sizes, a high number of candidate blocks has to be evaluated, so its performance was not efficient. Moreover, the ME with fractional accuracy needed a refinement step which also impacted the presented results. The device considered in that paper is the NVIDIA GeForce 8800GTX with CUDA architecture.

Lin et al. [8] focus on efficient parallelization of Motion Estimation. That paper does not consider the CUDA architecture, since it had not yet been proposed at the time of that work. Thus, it uses the texture memory as the main source of data management. The proposed algorithm is based on a technique called multi-pass encoding, considering the Full Search algorithm. The main drawback of this approach is the performance limitation imposed by the multiple iteration steps for SAD calculation and SAD value comparisons. The device considered in that paper is the NVIDIA GeForce 6800GT.

Lee et al. [9] present three alternatives of ME on GPU based on the Full Search algorithm: integer accuracy, fractional accuracy, and integer accuracy considering three reference frames in parallel. The best performance results were achieved with integer accuracy considering three reference frames. The solution with fractional accuracy enables further refinement of the matching algorithm (increasing quality), but it is performance-wise inefficient on GPUs due to data dependencies. In that work the considered device was an NVIDIA GeForce 7800GT.

III. MOTION ESTIMATION

Fig. 1 shows the block diagram of a generic video encoder. The blocks illustrated in Fig. 1 share the common goal of reducing the types of redundancy existing in digital videos: temporal, spatial, and entropic. In summary, the main blocks are: (i) inter-frame prediction: aims to reduce the temporal redundancy between frames of a video, i.e. it exploits the correlation between temporally neighboring frames; (ii) intra-frame prediction: reduces spatial redundancy in the current frame (the frame being encoded), i.e. it exploits the correlation between the pixels distributed within the same frame; (iii) transforms and quantization: responsible for reducing information irrelevant to the human visual system in order to achieve a higher coding gain; (iv) entropy coding: reduces entropic redundancy, i.e. it is related to the representation of the coded symbols, associating smaller codes with the most frequent symbols.

Figure 1. H.264/AVC Encoder System Diagram.

The ME process (part of inter-frame prediction, see Fig. 1) is detailed in Fig. 2. Basically, the ME identifies motion between neighboring frames in a scene and provides a map of displacements using motion vectors. Firstly, one (or more) reference frames are selected within the group of pictures (GOP). Next, for each block inside the current frame a search for the most similar block is performed. The search is bounded by a region called the search area (filled area in Fig. 2). The search area is a region in the reference frame, typically centered at the same relative position as the current block. For each block, a motion vector (represented as a tuple of x and y coordinates) is calculated, pointing to the position of the block with the highest similarity in the reference

frame. Thus, as a product of this process, only the motion vector and the residual data are transmitted, instead of the original information of the entire frames.

Figure 2. Optimum Motion Estimation Algorithm Diagram.

A. Full Search Algorithm

Considering that any block can be chosen when the ME is performed, all the blocks that constitute the search area can be classified as candidate blocks. A search algorithm thus determines how the candidate block search moves inside the search area in order to find the best match. Among the different search algorithm approaches for Motion Estimation, the Full Search is considered in this work. The Full Search algorithm aims to find the 'best match' between the block of the current frame and all possible positions inside the search area set in the reference frame. Thus, this algorithm is computationally more expensive than the other algorithms in the literature. Despite this, the algorithm is considered optimal because it is capable of generating the best motion vectors (best matches), resulting in the best video quality and best encoding efficiency among all fast motion estimation algorithms [2].

B. Similarity Criteria

To find the best match, a metric is required to evaluate the differences between the current and candidate blocks. In this context, a similarity criterion must be used. This criterion is also known as a distortion criterion. The distortion (or difference) is inversely proportional to the degree of similarity between the blocks. Different similarity criteria are often used in video coding: (i) Mean Square Error (MSE); (ii) Sum of Absolute Transformed Differences (SATD); (iii) Sum of Absolute Differences (SAD). Considering its simplicity, the SAD is the most used similarity criterion for ME and it is also the one adopted in this work. SAD calculates the distortion between the current block and each candidate block in the search area by adding the pixel-by-pixel absolute differences:

SAD(x,y) = \sum_{i=0}^{w-1} \sum_{j=0}^{h-1} |Current(i,j) - Candidate(i,j)|    (1)

The SAD definition is presented in (1), where w is the width and h is the height of both the candidate and the current block. The candidate block is chosen when it presents the lowest SAD value, i.e. the lowest distortion in relation to the current block. The position (x,y) of the best candidate block is represented by the motion vector.

C. Theoretical Model

A theoretical model was used to analyze the Full Search algorithm behavior prior to implementation. The PRAM (Parallel Random-Access Machine) model was considered to calculate the parallel complexity. However, PRAM alone is not enough to model the parallel complexity of the GPU architecture, since this paradigm assumes shared memory and thus neglects practical issues such as communication time. We therefore extended the PRAM model with the concept of communication between CPU and GPU.

This analysis has the main goal of establishing a comparative basis for the experimental results. The complexity calculation was based on the following variables: (i) size of the current block: 4x4; (ii) size of the search area: n x n pixels, where n is the search area width and height (square search area); (iii) frame resolution: M x N, where M and N are the frame width and height in pixels, respectively; (iv) similarity criterion: SAD.

1) Sequential Complexity: To calculate the sequential complexity of the ME FS algorithm we considered the total number of subtractions, absolute-value calculations, SAD additions, and SAD comparisons for a 4x4 block. The number of candidate blocks in the search area and the frame resolution were also taken into account. In conclusion, the algorithm presents a complexity of O(n²):

T_{seq} = [16(n-3)^2 + 16(n-3)^2 + 15(n-3)^2 + (n-3)^2 - 1] \times [M \times N]    (2)

2) Parallel Complexity (PRAM): To calculate the parallel complexity of the ME FS algorithm we first used the PRAM model. A granularity of log(n) x log(n) was considered for each sub-problem inside the n x n search area. Thus, the obtained parallel complexity is O(log²n), based on:

• Number of processors P(n): considering that each thread is responsible for a region of log(n) x log(n):

P(n) = (n / \log n)^2 = O(n^2 / \log^2 n)    (3)

• Execution time Tp(n): the execution considers the number of subtractions and absolute-value calculations, additions, and SAD comparisons between regions of size log²n:

T_p(n) = \log^2 n + (15/16)\log^2 n + 2\log(n / \log n) = O(\log^2 n)    (4)

• Total cost C(n): the cost to implement a parallel program (P(n) × Tp(n)):

C(n) = (n^2 / \log^2 n) \times \log^2 n = n^2    (5)

3) Communication: The hardware architecture considered in this work is based on the GPU as a co-processor attached to the CPU and, consequently, the CPU-GPU communication time is of key importance for the application performance. In addition to the complexity obtained with the PRAM model, we included a communication term based on latency (l) and throughput (d). This way, we obtained a more realistic model which takes into consideration two data transfers (the input data from CPU to GPU and the calculated output from GPU to CPU) and the execution of one function f(n):

1 \cdot f(n) + 2(l + n/d) = (\log^2 n)/p + 2(l + n/d)    (6)

It can be noted that the communication time is negligible when compared with the computation time. The expected speed-up is then given in (7):

Speed\text{-}up = p \cdot n^2 / \log^2 n    (7)

IV. MOTION ESTIMATION IMPLEMENTATION FOR THE CUDA ARCHITECTURE

Noticing the computational complexity introduced by the ME and trying to take advantage of the significant potential for parallelism inherent in the Full Search algorithm, we propose here a highly parallelizable solution for this algorithm on graphics cards (GPU). Among the different architectural options for general purpose GPU programming, the CUDA architecture [4] was used. CUDA, proposed in 2007 by NVIDIA, provides a programming interface for graphics processors. This architecture, based on the SIMD (Single Instruction, Multiple Data) approach, is adequate for block matching algorithms as it provides massive data-level parallelism.

In this work, the Full Search algorithm was implemented in C++ using CUDA functions. SAD was used as the similarity criterion to choose the best match among the candidate blocks. Only integer-accuracy ME was considered. In our experiments, videos of different resolutions were used as input, such as CIF (352x288), HD720p (1280x720) and HD1080p (1920x1080).

The hardware platform used in this work is composed of a CPU and a GPU. The algorithm proposed in this paper is presented in Fig. 3. Initially, the CPU divides the video into frames, from which the reference and current frames are selected. These frames are transferred to the GPU memory, as shown in Fig. 3. The ME under CUDA proposed in this paper is composed of two steps: (i) SAD calculation for all candidate blocks inside the search area; (ii) comparison of the SAD values of all candidate blocks to find the best matches (lowest SAD). Finally, the ME results are stored in a text file with all motion vectors generated for the current frame and transferred back to the CPU memory.

A library called Thrust [13] was used for data manipulation between CPU and GPU. This library was developed for the CUDA architecture in order to facilitate the creation of parallel applications. Using this library, a CPU-GPU transfer requires only a simple assignment (the assignment operator is overloaded).

The execution of this application on the GPU follows the processing hierarchy of the CUDA architecture. In this hierarchy there are three important concepts: (i) thread: the basic unit of processing; (ii) block: a set of threads; (iii) grid: composed of many blocks. This way, we defined a programming model for this algorithm as shown in Fig. 4.

Figure 3. Proposed Algorithm Flow.

Based on the concepts presented in Fig. 4, the ME parallelization on the GPU was performed considering the following entities: (i) Kernel: the procedure to be executed in parallel on the GPU, started by the CPU. This implementation is based on only one kernel, which is responsible for the execution of the ME under CUDA, through SAD calculation and SAD comparison, searching for the lowest SAD value (ME under CUDA, Fig. 3); (ii) Thread: each thread is responsible for the computation of one 4x4 video block. We used the 4x4 video block as the basis for block-by-block comparison to achieve finer motion granularity than the macroblock (16x16 pixels); (iii) Block: the threads that execute the kernel are organized in blocks.

The block size in this application is variable, according to the size of the search area; (iv) Grid: the blocks are organized in a grid. In this work, the grid size is related to the video resolution given as an input parameter.

In order to establish a comparative basis for our CUDA-based algorithm, we also implemented sequential and parallel OpenMP-based [14] versions of the ME FS algorithm. Our OpenMP version considered a maximum of four execution threads in order to fit a 4-core processor.

Figure 4. CUDA Programming Model – Algorithm Allocation.

V. EXPERIMENTAL SETUP AND RESULTS

In this section we present the results obtained with the execution of our ME FS algorithm in CUDA. The tests were performed using three video resolutions: CIF, HD720p and HD1080p. QCIF resolution is not used because it is too small and has little practical application in the market (even cell phones already use higher resolutions). The results obtained in CUDA were compared with: (i) a serial version; (ii) a parallel version in OpenMP; (iii) the theoretical complexity presented in Section III. As the execution time of the FS algorithm is proportional to the size of the search area, most of the results are shown for search areas ranging from 12x12 to 128x128 pixels, considering a 4x4-pixel block size as the basic unit for block matching. Also, some results are compared with the related works [7]-[9].

For the development and execution of the CUDA application, the NVIDIA GTX480 @1.4GHz video card, composed of 480 functional units, was considered. This video card is connected via a PCI-Express interface to a Core 2 Quad Q9550 @2.82GHz CPU. The results obtained for the CUDA version considered this platform. For the sequential and OpenMP execution results, only the Core 2 Quad Q9550 @2.82GHz processor was used.

A. Results for CIF Resolution

The tests performed with CIF resolution (352x288) considered three different video sequences: Foreman, Flower and Bus. The experimental data fitting considers n ranging from 12 to 128, where n*n is the number of pixels inside a search area. The data obtained for the CUDA version are illustrated in the graph shown in Fig. 5. It can be observed that the execution time (one frame) is constant regardless of the video sequence, since the same number of candidates is tested in all cases.

Figure 5. Execution time of the FS algorithm in CUDA for three CIF video sequences.

Considering the equivalent behavior presented by the video sequences in Fig. 5, we simplified our analysis by obtaining the remaining CIF results using just the Flower sequence. Fig. 6 shows a plot comparing the execution times of the implementations in CUDA (purple), OpenMP (green), and sequential (red), and the sequential complexity calculated from the theoretical model presented in Section III (blue).

We can observe in Fig. 6 that the execution time of the sequential implementation fits the theoretical complexity results previously calculated (Section III). Also, this result shows a typical O(n²) behavior (always assuming a square search area).

Figure 6. Sequential complexity expected (in blue) versus obtained results for CIF videos.

The OpenMP version (in green, Fig. 6) has higher performance than the sequential version due to the use of four parallel threads. However, its speed-up is theoretically limited to a factor of 4x. Finally, the GPU CUDA implementation (in purple, Fig. 6) shows the best execution times among all platforms. In the GPU version the results remain constant, independent of the size of the search area, thanks to the high level of parallelism available in this architecture. For a 128x128-pixel search area, the CUDA implementation achieved more than a 600x time reduction compared to the sequential solution, but it still does not provide real-time performance.

B. Results for HD720p Resolution

In this section, a more detailed comparison among the sequential, OpenMP and CUDA implementations is presented. These results were extracted using the HD720p-resolution Mobcal video sequence. Fig. 7 shows the execution times of the implementation on the three platforms: sequential (blue), OpenMP (green) and CUDA (red).

Figure 7. HD720p Results – CUDA vs. OpenMP vs. Sequential.

In Fig. 7 it is possible to notice a relevant gain for our CUDA implementation when compared with the other versions (sequential and OpenMP). Considering a 64x64 search area, this gain is of one order of magnitude over the OpenMP version and two orders of magnitude over the sequential version. Moreover, it can also be noted that the results obtained in CUDA are independent of the increase of the search area, unlike the other versions. These are very promising results, especially for larger search areas, which increase the resulting video quality and coding efficiency.

C. Results for HD1080p Resolution

In this section, we present the results for FullHD resolution (1080p). Fig. 8 shows the results generated for the BlueSky video sequence. Considering the results illustrated in Fig. 8, it can be seen that the CUDA implementation remains better than the sequential and OpenMP versions also for high resolution video. Considering a 48x48 search area, the CUDA version surpasses a three-order-of-magnitude gain in execution time compared with the sequential version, which is an impressive result. Also, the CUDA results remain independent of the increase in search area.

Figure 8. HD1080p Results – CUDA vs. OpenMP vs. Sequential.

The communication time between the CPU and GPU memories was also analyzed. Considering that a high definition video has more data to be manipulated, a 1080p video sequence was used for this experiment. The graph shown in Fig. 9 compares the communication time (CPU to GPU, and back to CPU) with the total execution time of the CUDA version. The total execution time was measured using a timer function from the cutil.h C/C++ library: the timer is started before the GPU kernel call and stopped after the execution completes. The communication time uses a timer function from the ctime.h C/C++ library: the counter is started when data is transferred from CPU to GPU, and the same is done for the GPU-to-CPU data transfer. It can be noted in Fig. 9 that the communication time for moving data between CPU and GPU (and vice-versa) is negligible when compared with the total execution time, in accordance with the calculation presented in the theoretical model (see Section III).

Figure 9. Total execution time vs. communication time for CPU-GPU data transfers considering FullHD 1080p resolution.

D. Speed-up Results

Fig. 10 presents the speed-up achieved by the CUDA version of the Motion Estimation in relation to the sequential version, compared with the following speed-up values: (i) the OpenMP implementation; (ii) the expected speed-up calculated with the theoretical model presented in Section III.

Figure 10. Speed-up – CUDA vs. Expected by Proposed Theoretical Model vs. OpenMP.

The speed-up obtained by the CUDA version presents quadratic growth with respect to the size of the search area, while for the OpenMP version the speed-up remains constant. The obtained speed-up represents up to a 600x gain compared to the sequential version, and 66x compared to the OpenMP version. Furthermore, Fig. 10 shows that the CUDA speed-up results are also consistent with the theoretical model O(n²/log²n) as the search area increases.

In addition, this work is compared with the related works [7]-[9]. All of these works implemented ME considering the Full Search algorithm with integer pixel accuracy on NVIDIA boards, using the Stefan video sequence (CIF resolution). The speed-up results presented in Tab. I include 16x16-pixel and 32x32-pixel search areas.

TABLE I. SPEED-UP COMPARISON WITH RELATED WORK

                 Search Area Size
               16x16        32x32
  [9]           1.79         2.18
  [8]          12.08        26.76
  [7]           n.a.        10.38
  Our Work     15.46        57.85

The results show that our algorithm achieves the highest speed-up for both search areas. Among the works available in the literature, [8] achieved the highest speed-up. Our CUDA implementation achieved 27% and 16% speed-up increases compared with [8] for the 16x16 and 32x32 search areas, respectively. These results were achieved due to our efficient mapping of the parts of the FS algorithm to the threads of the CUDA architecture. Also, the device we used in our experiments (GTX 480) is faster and has more computation cores than those used in the related works. This shows that our FS algorithm mapping is efficiently scalable and benefits from the increase of computation cores in GPU architectures.

VI. CONCLUSIONS

This paper presented a detailed analysis and implementation of the motion estimation module on a GPU architecture considering the Full Search algorithm. We have shown that the GPU is a good alternative for accelerating the motion estimation process in video coding. First, sequential and parallel theoretical models considering the communication cost between CPU and GPU were developed in order to analyze the execution time and speed-up potential before the implementation. Then, a parallel version of the FS block matching algorithm was developed and mapped to the NVIDIA CUDA architecture using the SAD similarity criterion. OpenMP and sequential versions were also implemented for comparison. A variety of tests were performed for different search area sizes (12x12 to 128x128) and video resolutions (CIF, HD720p and HD1080p). The obtained results were compared with the other versions of the algorithm and with the theoretical model. The speed-up obtained by the CUDA version represents up to a 600x gain compared to the sequential version, and 66x compared to the OpenMP version. Compared with the related work proposed in [8] (which provides the best results in the literature), we achieved a 27% speed-up increase for the 16x16 search area. This gain was achieved through an optimized mapping of the parts of the FS algorithm to threads in the CUDA architecture. Future work could focus on a deeper exploration of data sharing between CPU and GPU, GPU memory organization, and also the implementation of sub-optimal (fast) block matching algorithms targeting real-time video encoding.

ACKNOWLEDGMENT

The authors thank the partial support provided by the CNPq and CAPES Brazilian funding agencies.

REFERENCES

[1] ITU-T Recommendation H.264 (03/10): advanced video coding for generic audiovisual services, 2010.
[2] Y-W. Huang, C-Y. Chen, C-H. Tsai, C-F. Shen, L-G. Chen. "Survey on Block Matching Motion Estimation Algorithms and Architectures with New Results". The Journal of VLSI Signal Processing, v. 42, n. 3, p. 297-320, 2006.
[3] Nvidia (2011) "NVIDIA Corporation", http://www.nvidia.com, Available: Jun. 2011.
[4] Cuda. "NVIDIA CUDA Programming Guide", http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_CUDA_ProgrammingGuide.pdf, 2011. Available: Jun. 2011.
[5] K. Suhring, JM H.264/AVC Reference Software version 14.2. http://iphome.hhi.de/suehring/tml/download/ Available: Jun. 2011.

[6] VideoLAN, http://www.videolan.org/developers/x264.html, Available: Jun. 2011.
[7] W-N. Chen, H-M. Hang. "H.264/AVC motion estimation implementation on Compute Unified Device Architecture (CUDA)." In IEEE International Conference on Multimedia and Expo (ICME), p. 697-700, 2008.
[8] Y-C. Lin, P-L. Li, C-H. Chang, C-L. Wu, Y-M. Tsao, S-Y. Chien. "Multi-pass algorithm of motion estimation in video encoding for generic GPU". In IEEE International Symposium on Circuits and Systems (ISCAS), 2006.
[9] C-Y. Lee, Y-C. Lin, C-L. Wu, C-H. Chang, Y-M. Tsao, S-Y. Chien. "Multi-Pass and Frame Parallel Algorithms of Motion Estimation in H.264/AVC for Generic GPU." In IEEE International Conference on Multimedia and Expo (ICME), p. 1603-1606, 2007.
[10] M. Garland, S-L. Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, V. Volkov. "Parallel computing experiences with CUDA," IEEE Micro, vol. 28, no. 4, pp. 13-27, Jul.-Aug. 2008.
[11] B. Pieters, D. V. Rijsselbergen, W. D. Neve, R. V. Walle. "Performance evaluation of H.264/AVC decoding and visualization using the GPU". In Proc. SPIE, Vol. 6696; Tescher, A. G., Ed.; SPIE, 2007.
[12] I. Richardson. "H.264/AVC and MPEG-4 Video Compression – Video Coding for Next-Generation Multimedia". Chichester: John Wiley and Sons, 2003.
[13] Thrust. "Thrust – Code at the speed of light", http://code.google.com/p/thrust/wiki/QuickStartGuide. Available: Jun. 2011.
[14] The OpenMP API specification for parallel programming. Available at: http://openmp.org/wp/
