David Oro1,3, Carles Fernández3, Javier Rodríguez Saeta3, Xavier Martorell1, Javier Hernando2
1 Computer Architecture Department 2 Signal Theory & Communications Department
{doro, xavim}@ac.upc.edu
3 Herta Security
javier@tsc.upc.edu
Outline
1. Introduction
2. Parallel Integral Image + Results
3. Parallel Haar Filtering + Results
4. Face Detection in Videos + Results
5. Conclusions and Future Work
Introduction
Face detection algorithms must be designed from scratch with parallelism in mind (i.e. multithreading). Parallel stream-based architectures such as GPUs exploit DLP (data-level parallelism) by devoting more transistors to data processing (INT/FP ALUs).
CPU: Optimized for Loops and ILP
- 2-4-8 highly complex OoO cores
- 4-8 INT/FP functional units
- Huge caches needed for hiding latency
- Example: Intel Core i7 (Sandy Bridge): 4 cores, 1 billion transistors, 256 KB L2 cache, 8 MB L3 cache

GPU: Optimized for Stream Computations
- 256-512-1024 simple cores
- 512-1024 INT/FP functional units
- Memory coalescing hides latency
- Example: NVIDIA GTX 590: 1024 cores, 3 billion transistors, 768 KB L2 cache
Introduction (II)
The classic V&J algorithm1 relies on the following ideas for speeding up object detection on a CPU:
- Integral images, for saving memory accesses and computations
- A boosted cascade of classifiers, for early discarding of image regions
- Simple features (Haar filters)
[Figure: Attentional cascade: each window passes through Stage 0 ... Stage N; at stage i, if the accumulated feature response is < Threshold i the window is rejected ("No"), otherwise it proceeds to the next stage ("Yes").]

The integral image accumulates all pixels above and to the left of (x, y):

ii(x, y) = Σ_{x'=0..x} Σ_{y'=0..y} i(x', y')

With the four corners A (top-left), B (top-right), C (bottom-left), D (bottom-right), the sum of any rectangle takes only four lookups:

Sum(rect) = ii(D) - ii(B) - ii(C) + ii(A)
1 P. Viola and M. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features", CVPR 2001
[Figure: Serial vs. concurrent kernel execution: kernels A, B, and C time-share GPU resources; concurrent execution keeps GPU occupancy high over time.]

[Figure: Detection pipeline of CUDA kernels: H.264 decoding → image scaling → bilinear filtering → integral image → Haar filtering (N kernel instances) → face bounding → OpenGL texture output.]

[Figure: Row-wise parallel prefix sums: Row 0 ... Row N are scanned concurrently over time to build the integral image while keeping GPU occupancy at its maximum.]
Scan (exclusive prefix sum): given a = [a_0, a_1, ..., a_{n-1}],

scan(a) = [0, a_0, a_0 + a_1, a_0 + a_1 + a_2, ..., a_0 + a_1 + ... + a_{n-2}]

Multiscan: one independent scan per image row, so all rows are processed in parallel.
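A sequential sketch of the scan and multiscan primitives above (the GPU version performs each row scan hierarchically with warp/block/grid scans; here `std::exclusive_scan` stands in for the parallel kernel):

```cpp
#include <numeric>
#include <vector>

// Exclusive scan: out[i] = a[0] + ... + a[i-1], with out[0] = 0.
std::vector<int> exclusive_scan(const std::vector<int>& a) {
    std::vector<int> out(a.size());
    std::exclusive_scan(a.begin(), a.end(), out.begin(), 0);
    return out;
}

// Multiscan: one independent exclusive scan per row, the building block
// for computing the rows of an integral image in parallel.
std::vector<std::vector<int>>
multiscan(const std::vector<std::vector<int>>& rows) {
    std::vector<std::vector<int>> out;
    for (const auto& r : rows) out.push_back(exclusive_scan(r));
    return out;
}
```

On the GPU each row (or several rows) maps to a thread block, which is why the multiscan keeps all SMs busy.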
[Figure: Hierarchical scan across SM1 ... SMN: warp-level scans are combined into block-level scans, and block results are combined by a grid-level scan.]
The parallel GPU algorithm is more than 2.5x faster than the optimized O(nm) CPU algorithm.
[Figure: Parallel Haar filtering: a fixed-size sliding window is mapped so that each Thread(x,y) of Block(i,j) evaluates one window position; thread warps are scheduled across each streaming multiprocessor (SM) to reach maximum GPU occupancy. Reported execution time: 22.23 ms.]
Future work:
- Techniques for reducing branch divergence (e.g. popcount)
- OpenCL / NVIDIA Denver (ARM OoO core + GPU) / Intel MIC architecture
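As an illustration of the popcount idea above, a hypothetical CPU sketch of branchless stream compaction (on the GPU the mask would come from the warp vote intrinsic `__ballot()` and the count from `__popc()`):

```cpp
#include <cstdint>

// Branchless compaction sketch: each of 32 "lanes" sets one bit of the
// mask if its detection window survived the filter. The popcount of the
// bits below lane i gives lane i's output slot, so survivors are written
// contiguously without any divergent branching.
int compaction_offset(uint32_t survivor_mask, int lane) {
    uint32_t below = survivor_mask & ((1u << lane) - 1u);
    return __builtin_popcount(below);  // GCC/Clang intrinsic; __popc() in CUDA
}
```

Because every lane executes the same instructions regardless of its result, the warp never diverges during compaction.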
Questions?
Data never returns to the CPU memory address space. H.264 video decoding is performed in HW through the NVCUVID API. Real-time deadline: 40 ms per frame @ 25 fps.

OpenGL screen output: the input video frame is allocated and transferred to GPU DRAM memory using an OpenGL Pixel Buffer Object, and then mapped to CUDA for GPGPU computations.
Integral images are computed in parallel for 32 different scales. Image scaling and filtering are implemented in HW and transparently computed when texels are fetched from the texture cache/memory.
[Figure: Pipeline detail: the integral image, Haar filtering (N kernel instances), and face bounding stages fetch texels at floating-point coordinates, e.g. texel (857.8, 654.2), so the texture unit performs the filtering in HW; results are written to an OpenGL texture.]
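What the texture unit computes in HW when a texel is fetched at floating-point coordinates can be modelled in SW as bilinear interpolation of the four nearest texels (a sketch; the function name and layout are ours):

```cpp
#include <cmath>
#include <vector>

// Bilinear filtering: a fetch at FP coordinates (x, y), e.g. (857.8, 654.2),
// returns the weighted average of the four surrounding texels. The GPU
// texture unit performs this for free on every fetch.
float bilinear(const std::vector<std::vector<float>>& tex, float x, float y) {
    int x0 = static_cast<int>(std::floor(x));
    int y0 = static_cast<int>(std::floor(y));
    float fx = x - x0, fy = y - y0;  // fractional weights
    float t00 = tex[y0][x0],     t10 = tex[y0][x0 + 1];
    float t01 = tex[y0 + 1][x0], t11 = tex[y0 + 1][x0 + 1];
    return (1 - fy) * ((1 - fx) * t00 + fx * t10)
         + fy       * ((1 - fx) * t01 + fx * t11);
}
```

This is why image scaling costs essentially nothing in the pipeline: sampling the source image at scaled FP coordinates yields already-filtered values.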
[Figure: Thread1 ... ThreadN each evaluate the cascade filter Fn for their own detection window in parallel.]
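The per-thread cascade evaluation can be sketched as follows (the `Stage` struct and precomputed feature responses are hypothetical simplifications; in the real kernel each response comes from integral-image lookups):

```cpp
#include <vector>

// Hypothetical cascade stage: the window is rejected as soon as its
// accumulated feature response falls below the stage threshold, which is
// what lets the cascade discard most windows after only a few features.
struct Stage {
    float threshold;
    std::vector<float> feature_responses;  // one response per Haar filter
};

// Returns true only if the window passes every stage of the cascade.
bool evaluate_cascade(const std::vector<Stage>& cascade) {
    for (const auto& s : cascade) {
        float sum = 0.0f;
        for (float f : s.feature_responses) sum += f;
        if (sum < s.threshold) return false;  // "No" branch: early reject
    }
    return true;  // survived Stage 0 ... Stage N: candidate face
}
```

On the GPU, this early-exit pattern is the main source of the branch divergence that the future-work popcount techniques aim to reduce: threads in the same warp reject their windows at different stages.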