
Real-time GPU-based Face Detection in HD Video Sequences

David Oro 1,3, Carles Fernández 3, Javier Rodríguez Saeta 3, Xavier Martorell 1, Javier Hernando 2

1 Computer Architecture Department, Universitat Politècnica de Catalunya
{doro, xavim}@ac.upc.edu

2 Signal Theory & Communications Department, Universitat Politècnica de Catalunya
javier@tsc.upc.edu

3 Herta Security
{david.oro, carles.fernandez, javier.rodriguez}@hertasecurity.com

Outline
1. Introduction
2. Parallel Integral Image + Results
3. Parallel Haar Filtering + Results
4. Face Detection in Videos + Results
5. Conclusions and Future Work

Introduction
Face detection algorithms must be designed from scratch with parallelism in mind (i.e. multithreading).
Parallel stream-based architectures such as GPUs exploit DLP (data-level parallelism) by devoting more transistors to data processing (INT/FP ALUs).
CPU: optimized for loops and ILP
- 2-4-8 highly complex OoO cores
- 4-8 INT/FP functional units
- Huge caches needed for hiding latency
- e.g. Intel Core i7 (Sandy Bridge): 4 cores, 1 billion transistors, 256 KB L2 cache, 8 MB L3 cache

GPU: optimized for stream computations
- 256-512-1024 simple cores
- 512-1024 INT/FP functional units
- Memory coalescing hides latency
- e.g. NVIDIA GTX 590: 1024 cores, 3 billion transistors, 768 KB L2 cache

Introduction (II)
The classic V&J algorithm¹ relies on the following ideas for speeding up object detection on a CPU:
- Integral images for saving memory accesses and computations
- A boosted cascade of classifiers for early discarding of image regions
- Simple features (Haar filters)
[Figure: boosted cascade of classifiers. At each stage 0 .. N, the accumulated filter response is compared against that stage's threshold; windows scoring below it are rejected (No) and the rest advance to the next stage (Yes).]

The integral image ii and the pixel sum S over a rectangle with corners A (top-left), B (top-right), C (bottom-left) and D (bottom-right) are:

ii(x, y) = \sum_{x'=0}^{x} \sum_{y'=0}^{y} i(x', y')

S(A, B, C, D) = ii(D) - ii(B) - ii(C) + ii(A)
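As an illustration, a minimal sketch of the four-corner lookup (the function name rect_sum and the row-major int layout are assumptions, not taken from the slides):

    // Minimal sketch: pixel sum of a rectangle from a precomputed integral
    // image ii (row-major, width w). Corners: A = (x0,y0) top-left,
    // B = (x1,y0) top-right, C = (x0,y1) bottom-left, D = (x1,y1) bottom-right.
    __device__ int rect_sum(const int* ii, int w,
                            int x0, int y0, int x1, int y1)
    {
        int A = ii[y0 * w + x0];
        int B = ii[y0 * w + x1];
        int C = ii[y1 * w + x0];
        int D = ii[y1 * w + x1];
        return D - B - C + A;   // 4 reads per rectangle, independent of size
    }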

¹ P. Viola and M. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features", CVPR 2001

Parallel Face Detection Pipeline


[Figure: processing pipeline. Input video → H.264 decoding (GPU fixed-function logic) → image scaling → bilinear filtering → integral image → Haar filtering → face bounding → OpenGL texture. The last five stages run as CUDA kernels, once per scale (1 .. N). Inset: serial vs. concurrent execution of kernels A, B and C over time and GPU resources.]

The proposed pipeline exploits both coarse-grain and fine-grain parallelism.

Parallel Integral Image


The Integral image is generated using simple matrix operations: Parallel Prefix Sum (Scan) + Parallel Matrix Transposition

[Figure: scan vs. multiscan. A plain scan processes Row 0 .. Row N one after another, leaving GPU occupancy low; a multiscan launches the per-row scans concurrently, keeping GPU occupancy high over time.]

Given an input sequence $A = [a_0, a_1, \ldots, a_{n-1}]$, the (exclusive) scan produces

$\mathrm{scan}(A) = [0,\ a_0,\ a_0 + a_1,\ \ldots,\ a_0 + a_1 + \cdots + a_{n-2}]$
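For reference, a minimal sequential sketch of the exclusive scan defined above (illustrative code, not the authors' kernel); the GPU multiscan applies this to every row in parallel:

    // Sequential exclusive scan: out[i] = a[0] + ... + a[i-1], out[0] = 0.
    // One common composition for the integral image is
    //   II = transpose(multiscan(transpose(multiscan(I))))
    void exclusive_scan(const int* a, int* out, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            out[i] = sum;
            sum += a[i];
        }
    }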

Parallel Integral Image (II)


The parallel prefix sum operation (scan) is computed on the GPU using a multi-level divide-and-conquer approach
[Figure: hierarchical scan. Input and partial sums live in GPU DRAM; each streaming multiprocessor (SM1 .. SMN) works out of its on-die cache. Warp scans are combined into block scans, and block scans into a grid scan.]
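A minimal sketch of the lowest level of that hierarchy, a warp scan in shared memory (the warp-synchronous idiom used on Fermi-era GPUs such as the GTX 470; warp_scan and its layout are illustrative):

    // Inclusive Kogge-Stone scan across one 32-thread warp. s points to this
    // warp's 32-entry slice of shared memory; lane = threadIdx.x % 32. The
    // volatile qualifier and lockstep warp execution make this safe without
    // __syncthreads() on Fermi-class hardware.
    __device__ int warp_scan(volatile int* s, int lane)
    {
        if (lane >= 1)  s[lane] += s[lane - 1];
        if (lane >= 2)  s[lane] += s[lane - 2];
        if (lane >= 4)  s[lane] += s[lane - 4];
        if (lane >= 8)  s[lane] += s[lane - 8];
        if (lane >= 16) s[lane] += s[lane - 16];
        return s[lane];   // prefix sum of lanes 0 .. lane
    }

Block scans then combine the 32 per-warp totals with the same primitive, and a final pass adds per-block sums to produce the grid-level scan.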

Parallel Integral Image (III)


Experimental framework:
- Intel Core i5 760 @ 2.8 GHz [4x256 KB L2 / 8 MB L3]
- NVIDIA GTX 470 @ 1.26 GHz [448 CUDA cores / 1280 MB DRAM]
- Linux kernel 2.6.35 [x86-64] / CUDA Toolkit 4.0

CPU baseline algorithm for integral image computation:


    // Row-by-row accumulation: each output row adds the running sum of the
    // previous input row to the integral of the row above it.
    for (j = 1; j < height; j++) {
        output[j * width] = 0;   // first column of the integral image is 0
        sum = 0;                 // running sum along the previous input row
        for (i = 0; i < width - 1; i++) {
            sum += input[i + (j - 1) * width];
            output[i + j * width + 1] = sum + output[i + (j - 1) * width + 1];
        }
    }

O(nm) time complexity for an n x m matrix

Parallel Integral Image: Results


[Chart: multiscan operation performance.]

Parallel Integral Image: Results (II)


[Chart: multiscan operation. T_GPU < T_CPU once the input is large enough to amortize the kernel launch overhead.]

Parallel Integral Image: Results (III)


GPU parallel matrix transposition:

Best performance is obtained by bringing 16x16 tiles of data into shared memory.
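A sketch of the classic tiled transposition pattern referenced here (kernel name and float element type are assumptions):

    // 16x16 tiles staged through shared memory so that both the load from
    // and the store to DRAM are coalesced; the +1 column of padding avoids
    // shared-memory bank conflicts.
    #define TILE 16
    __global__ void transpose16(float* out, const float* in,
                                int width, int height)
    {
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];

        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;   // transposed block offset
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }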

Parallel Integral Image: Results (IV)


[Chart: integral image computation time, GPU vs. CPU.]

The parallel GPU algorithm is >2.5X faster than the optimized CPU O(nm) algorithm

Effective use of on-die caches:
- The input image is cached in the texture cache
- It is then split into pieces and stored in the on-die shared memory of each streaming multiprocessor (SM)
- Haar filters are stored in the uniform cache

[Figure: streaming multiprocessor with its on-die caches.]
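A compact sketch of how the first two storage classes appear in code (tex_img and stage_tile are illustrative names; texture reference API of the CUDA 4.0 era):

    // Input image served by the texture cache.
    texture<unsigned char, cudaTextureType2D, cudaReadModeElementType> tex_img;
    // Host side, once: cudaBindTexture2D(0, tex_img, d_img, desc, w, h, pitch);

    __global__ void stage_tile(int* d_out, int width)
    {
        __shared__ int tile[16][16];    // on-die shared memory of the SM
        int x = blockIdx.x * 16 + threadIdx.x;
        int y = blockIdx.y * 16 + threadIdx.y;

        tile[threadIdx.y][threadIdx.x] = tex2D(tex_img, x, y);  // cached fetch
        __syncthreads();
        // All threads of the block now reuse the tile from shared memory.
        d_out[y * width + x] = tile[threadIdx.y][threadIdx.x];
    }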

Parallel Haar Filtering


Filter scaling vs. image scaling:

[Figure: with image scaling, every thread evaluates a fixed-sized sliding window, so resource usage is uniform across thread warps and maximum occupancy can be reached; with filter scaling, threads evaluate variable-sized sliding windows, so occupancy varies with the window size.]

GPU occupancy = f(regs_SM, smem_SM, warps_SM): the registers used per thread, the shared memory per block and the resident warps determine the occupancy of each SM.

Parallel Haar Filtering (II)


- Each n x m CUDA thread block brings its adjacent n x m memory chunks into shared memory (in parallel on each SM)
- Filter evaluations are performed in parallel for each pixel of the input integral image (fine-grain parallelism)
- The kernel is launched and executed in parallel for the 32 image scales (coarse-grain parallelism); see the sketch after the figure below
[Figure: a 2x2 neighborhood of thread blocks, Block(i,j) .. Block(i+1,j+1); thread (x,y) of Block(i,j) also loads pixels from the adjacent n x m chunks so that windows crossing block borders can be evaluated.]
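As a rough illustration of the coarse-grain level, a sketch of per-scale kernel launches on independent CUDA streams (haar_kernel, d_ii, d_hits, scale_w/scale_h and NSCALES are assumed names, not the authors' code):

    // One CUDA stream per image scale; on Fermi GPUs kernels launched on
    // different streams may execute concurrently, filling the GPU when a
    // single small scale cannot.
    #define NSCALES 32
    cudaStream_t streams[NSCALES];
    for (int s = 0; s < NSCALES; s++)
        cudaStreamCreate(&streams[s]);

    for (int s = 0; s < NSCALES; s++) {
        dim3 block(16, 16);
        dim3 grid((scale_w[s] + block.x - 1) / block.x,
                  (scale_h[s] + block.y - 1) / block.y);
        // Evaluate the cascade over every pixel of scale s.
        haar_kernel<<<grid, block, 0, streams[s]>>>(d_ii[s], d_hits[s],
                                                    scale_w[s], scale_h[s]);
    }
    cudaDeviceSynchronize();   // all scales done before face bounding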

Parallel Haar Filtering: Results

100% GPU Occupancy for windows of 28x28 pixels

Face Detection: Results


Benchmarks:
- 1920x1080p H.264 videos [bitrate 3600 Kbps]
- Shakira, "Waka Waka": http://www.youtube.com/watch?v=pRpeEdMmmQ0
- Andreea Balan, "Trippin": http://www.youtube.com/watch?v=aRzn8NBOspg

- NVIDIA GTX 470 card
- CUDA Toolkit 4.0 / Linux 2.6.35 [x86-64]
- OpenGL 3.x / GLEW / GLUT
- Video rendering embedded in a 3D triangle surface

Face Detection: Results (II)

[Chart: average processing time of 22.23 ms per frame.]

Face Detection: Results (III)

Face Detection: Results (IV)

Conclusions And Future Work


Contributions:
- Parallel GPU integral image computation based on row multiscan
- Face detection implementation with a fixed-size sliding window and image scaling based on texture fetches
- Face detection implementation that works in real time (35 fps) on 1080p video streams without discarding image pixels/regions

Future work:
- Techniques for reducing branch divergence (e.g. popcount)
- OpenCL / NVIDIA Denver [ARM OoO + GPU] / Intel MIC architecture

Thank you for your attention!

Questions?

Face Detection: Putting It All Together


H.264 video frame → OpenGL Pixel Buffer Object → OpenGL texture binding → CUDA/PBO map → parallel integral image → parallel Haar filter evaluation → parallel face bounding → CUDA/PBO unmap → display

- Data never returns to the CPU memory address space
- H.264 HW video decoding through the NVCUVID API
- Deadline: 40 ms per frame @ 25 fps
- OpenGL screen output
- The input video frame is allocated and transferred to GPU DRAM using an OpenGL Pixel Buffer Object; it is then mapped to CUDA for GPGPU computations (see the sketch below)
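A minimal sketch of that map/unmap step using the CUDA graphics interop API (pbo is the GLuint pixel buffer object holding the decoded frame; variable names are illustrative):

    cudaGraphicsResource_t pbo_res;
    cudaGraphicsGLRegisterBuffer(&pbo_res, pbo, cudaGraphicsRegisterFlagsNone);

    cudaGraphicsMapResources(1, &pbo_res, 0);
    unsigned char* d_frame;
    size_t nbytes;
    cudaGraphicsResourceGetMappedPointer((void**)&d_frame, &nbytes, pbo_res);

    // ... run integral image, Haar filtering and face bounding on d_frame ...

    cudaGraphicsUnmapResources(1, &pbo_res, 0);  // hand the buffer back to OpenGL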

Parallel Haar Filtering


[Figure: same pipeline as before; the image scaling and bilinear filtering stages run inside the texture hardware.]

- Integral images are computed in parallel for 32 different scales
- Image scaling and filtering are implemented in HW and transparently computed when texels are fetched from the texture cache/memory

[Figure: a texel fetch at floating-point coordinates, e.g. (857.8, 654.2), goes through the texture cache/sampler to texture memory (DRAM); the sampler interpolates in hardware before returning the value to the CUDA kernels.]
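A minimal sketch of how such a hardware-filtered fetch could look (texture reference API of that CUDA generation; tex_frame, downscale and inv_scale are assumed names):

    // Hardware bilinear scaling via the texture sampler.
    texture<float, cudaTextureType2D, cudaReadModeElementType> tex_frame;
    // Host side, once: tex_frame.filterMode = cudaFilterModeLinear;
    //                  cudaBindTextureToArray(tex_frame, cu_array);

    __global__ void downscale(float* out, int out_w, int out_h, float inv_scale)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= out_w || y >= out_h) return;

        // Fetching at fractional source coordinates, e.g. (857.8, 654.2);
        // the sampler interpolates the 4 neighbouring texels in hardware.
        float u = (x + 0.5f) * inv_scale;
        float v = (y + 0.5f) * inv_scale;
        out[y * out_w + x] = tex2D(tex_frame, u, v);
    }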

Parallel Haar Filtering (II)


Selected cascade:
- Trained with 24x24 faces
- 2913 filters, 25 stages
- Size: 106 KB
- Stored (compressed) in the constant memory address space

[Figure: cascade stages 0 .. N, each comparing the stage response against Threshold 0 .. Threshold N.]


- Cached in the on-die uniform cache
- HW optimized for broadcasting values to multiple threads (see the sketch after the figure below)
[Figure: filters F1 .. Fn broadcast from the uniform cache to Thread1 .. ThreadN.]
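A minimal sketch of the broadcast pattern (HaarFilter, d_cascade and eval_stage are illustrative names; the slides do not give the actual data layout):

    // The cascade lives in the constant address space; when every thread of
    // a warp reads the same d_cascade[f], the uniform cache serves it with a
    // single broadcast transaction.
    struct HaarFilter { float weight, threshold; };   // simplified payload
    __constant__ HaarFilter d_cascade[2913];          // ~23 KB at this size

    __global__ void eval_stage(float* score, int n, int f0, int f1)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;
        float s = 0.0f;
        for (int f = f0; f < f1; f++)   // f is uniform across the warp:
            s += d_cascade[f].weight;   // one broadcast read per filter
        score[tid] = s;
    }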
