
Real-time GPU-based Face Detection in HD Video Sequences

David Oro 1,3, Carles Fernández 3, Javier Rodríguez Saeta 3, Xavier Martorell 1, Javier Hernando 2

1 Computer Architecture Department, Universitat Politècnica de Catalunya
{doro, xavim}@ac.upc.edu

2 Signal Theory & Communications Department, Universitat Politècnica de Catalunya
javier@tsc.upc.edu

3 Herta Security
{david.oro, carles.fernandez, javier.rodriguez}@hertasecurity.com

Outline
1. Introduction
2. Parallel Integral Image + Results
3. Parallel Haar Filtering + Results
4. Face Detection in Videos + Results
5. Conclusions and Future Work

Introduction
Face detection algorithms must be designed from scratch with parallelism in mind (i.e. multithreading).
Parallel stream-based architectures such as GPUs exploit DLP (data-level parallelism) by devoting more transistors to data processing (INT/FP ALUs).
CPU: optimized for loops and ILP
- 2-4-8 highly complex OoO cores
- 4-8 INT/FP functional units
- Huge caches needed for hiding latency
- e.g. Intel Core i7 (Sandy Bridge): 4 cores, 1 billion transistors, 256 KB L2 cache, 8 MB L3 cache

GPU: optimized for stream computations
- 256-512-1024 simple cores
- 512-1024 INT/FP functional units
- Memory coalescing hides latency
- e.g. NVIDIA GTX 590: 1024 cores, 3 billion transistors, 768 KB L2 cache

Introduction (II)
The classic V&J algorithm¹ relies on the following ideas for speeding up object detection on a CPU:
- Integral images for saving memory accesses and computations
- A boosted cascade of classifiers for early discarding of image regions
- Simple features (Haar filters)
[Figure: boosted cascade of classifiers. At each stage 0 .. N, the accumulated filter response is compared against that stage's threshold; windows scoring below it are rejected (No) and the rest advance to the next stage (Yes).]

The integral image ii and the pixel sum S over a rectangle with corners A (top-left), B (top-right), C (bottom-left) and D (bottom-right) are:

ii(x, y) = \sum_{x'=0}^{x} \sum_{y'=0}^{y} i(x', y')

S(A, B, C, D) = ii(D) - ii(B) - ii(C) + ii(A)
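As an illustration, a minimal sketch of the four-corner lookup (the function name rect_sum and the row-major int layout are assumptions, not taken from the slides):

    // Minimal sketch: pixel sum of a rectangle from a precomputed integral
    // image ii (row-major, width w). Corners: A = (x0,y0) top-left,
    // B = (x1,y0) top-right, C = (x0,y1) bottom-left, D = (x1,y1) bottom-right.
    __device__ int rect_sum(const int* ii, int w,
                            int x0, int y0, int x1, int y1)
    {
        int A = ii[y0 * w + x0];
        int B = ii[y0 * w + x1];
        int C = ii[y1 * w + x0];
        int D = ii[y1 * w + x1];
        return D - B - C + A;   // 4 reads per rectangle, independent of size
    }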

¹ P. Viola and M. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features", CVPR 2001

Parallel Face Detection Pipeline


[Figure: processing pipeline. Input video → H.264 decoding (GPU fixed-function logic) → image scaling → bilinear filtering → integral image → Haar filtering → face bounding → OpenGL texture. The last five stages run as CUDA kernels, once per scale (1 .. N). Inset: serial vs. concurrent execution of kernels A, B and C over time and GPU resources.]

The proposed pipeline exploits both coarse-grain and fine-grain parallelism.

Parallel Integral Image


The Integral image is generated using simple matrix operations: Parallel Prefix Sum (Scan) + Parallel Matrix Transposition

[Figure: scan vs. multiscan. A plain scan processes Row 0 .. Row N one after another, leaving GPU occupancy low; a multiscan launches the per-row scans concurrently, keeping GPU occupancy high over time.]

Given an input sequence $A = [a_0, a_1, \ldots, a_{n-1}]$, the (exclusive) scan produces

$\mathrm{scan}(A) = [0,\ a_0,\ a_0 + a_1,\ \ldots,\ a_0 + a_1 + \cdots + a_{n-2}]$
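For reference, a minimal sequential sketch of the exclusive scan defined above (illustrative code, not the authors' kernel); the GPU multiscan applies this to every row in parallel:

    // Sequential exclusive scan: out[i] = a[0] + ... + a[i-1], out[0] = 0.
    // One common composition for the integral image is
    //   II = transpose(multiscan(transpose(multiscan(I))))
    void exclusive_scan(const int* a, int* out, int n)
    {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            out[i] = sum;
            sum += a[i];
        }
    }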

Parallel Integral Image (II)


The parallel prefix sum operation (scan) is computed on the GPU using a multi-level divide-and-conquer approach
[Figure: hierarchical scan. Input and partial sums live in GPU DRAM; each streaming multiprocessor (SM1 .. SMN) works out of its on-die cache. Warp scans are combined into block scans, and block scans into a grid scan.]
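A minimal sketch of the lowest level of that hierarchy, a warp scan in shared memory (the warp-synchronous idiom used on Fermi-era GPUs such as the GTX 470; warp_scan and its layout are illustrative):

    // Inclusive Kogge-Stone scan across one 32-thread warp. s points to this
    // warp's 32-entry slice of shared memory; lane = threadIdx.x % 32. The
    // volatile qualifier and lockstep warp execution make this safe without
    // __syncthreads() on Fermi-class hardware.
    __device__ int warp_scan(volatile int* s, int lane)
    {
        if (lane >= 1)  s[lane] += s[lane - 1];
        if (lane >= 2)  s[lane] += s[lane - 2];
        if (lane >= 4)  s[lane] += s[lane - 4];
        if (lane >= 8)  s[lane] += s[lane - 8];
        if (lane >= 16) s[lane] += s[lane - 16];
        return s[lane];   // prefix sum of lanes 0 .. lane
    }

Block scans then combine the 32 per-warp totals with the same primitive, and a final pass adds per-block sums to produce the grid-level scan.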

Parallel Integral Image (III)


Experimental framework:
- Intel Core i5 760 @ 2.8 GHz [4x256 KB L2 / 8 MB L3]
- NVIDIA GTX 470 @ 1.26 GHz [448 CUDA cores / 1280 MB DRAM]
- Linux kernel 2.6.35 [x86-64] / CUDA Toolkit 4.0

CPU baseline algorithm for integral image computation:


    // Row-by-row accumulation: each output row adds the running sum of the
    // previous input row to the integral of the row above it.
    for (j = 1; j < height; j++) {
        output[j * width] = 0;   // first column of the integral image is 0
        sum = 0;                 // running sum along the previous input row
        for (i = 0; i < width - 1; i++) {
            sum += input[i + (j - 1) * width];
            output[i + j * width + 1] = sum + output[i + (j - 1) * width + 1];
        }
    }

O(nm) time complexity for an n x m matrix

Parallel Integral Image: Results


[Chart: multiscan operation performance.]

Parallel Integral Image: Results (II)


[Chart: multiscan operation. T_GPU < T_CPU once the input is large enough to amortize the kernel launch overhead.]

Parallel Integral Image: Results (III)


GPU parallel matrix transposition:

Best performance is obtained by bringing 16x16 tiles of data into shared memory.
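A sketch of the classic tiled transposition pattern referenced here (kernel name and float element type are assumptions):

    // 16x16 tiles staged through shared memory so that both the load from
    // and the store to DRAM are coalesced; the +1 column of padding avoids
    // shared-memory bank conflicts.
    #define TILE 16
    __global__ void transpose16(float* out, const float* in,
                                int width, int height)
    {
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];

        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;   // transposed block offset
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }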

Parallel Integral Image: Results (IV)


[Chart: integral image computation time, GPU vs. CPU.]

The parallel GPU algorithm is >2.5X faster than the optimized CPU O(nm) algorithm

Effective use of on-die caches:
- The input image is cached in the texture cache
- It is then split into pieces and stored in the on-die shared memory of each streaming multiprocessor (SM)
- Haar filters are stored in the uniform cache

[Figure: streaming multiprocessor with its on-die caches.]
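A compact sketch of how the first two storage classes appear in code (tex_img and stage_tile are illustrative names; texture reference API of the CUDA 4.0 era):

    // Input image served by the texture cache.
    texture<unsigned char, cudaTextureType2D, cudaReadModeElementType> tex_img;
    // Host side, once: cudaBindTexture2D(0, tex_img, d_img, desc, w, h, pitch);

    __global__ void stage_tile(int* d_out, int width)
    {
        __shared__ int tile[16][16];    // on-die shared memory of the SM
        int x = blockIdx.x * 16 + threadIdx.x;
        int y = blockIdx.y * 16 + threadIdx.y;

        tile[threadIdx.y][threadIdx.x] = tex2D(tex_img, x, y);  // cached fetch
        __syncthreads();
        // All threads of the block now reuse the tile from shared memory.
        d_out[y * width + x] = tile[threadIdx.y][threadIdx.x];
    }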

Parallel Haar Filtering


Filter scaling vs. image scaling:

[Figure: with image scaling, every thread evaluates a fixed-sized sliding window, so resource usage is uniform across thread warps and maximum occupancy can be reached; with filter scaling, threads evaluate variable-sized sliding windows, so occupancy varies with the window size.]

GPU occupancy = f(regs_SM, smem_SM, warps_SM): the registers used per thread, the shared memory per block and the resident warps determine the occupancy of each SM.

Parallel Haar Filtering (II)


- Each n x m CUDA thread block brings its adjacent n x m memory chunks into shared memory (in parallel on each SM)
- Filter evaluations are performed in parallel for each pixel of the input integral image (fine-grain parallelism)
- The kernel is launched and executed in parallel for the 32 image scales (coarse-grain parallelism); see the sketch after the figure below
[Figure: a 2x2 neighborhood of thread blocks, Block(i,j) .. Block(i+1,j+1); thread (x,y) of Block(i,j) also loads pixels from the adjacent n x m chunks so that windows crossing block borders can be evaluated.]
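As a rough illustration of the coarse-grain level, a sketch of per-scale kernel launches on independent CUDA streams (haar_kernel, d_ii, d_hits, scale_w/scale_h and NSCALES are assumed names, not the authors' code):

    // One CUDA stream per image scale; on Fermi GPUs kernels launched on
    // different streams may execute concurrently, filling the GPU when a
    // single small scale cannot.
    #define NSCALES 32
    cudaStream_t streams[NSCALES];
    for (int s = 0; s < NSCALES; s++)
        cudaStreamCreate(&streams[s]);

    for (int s = 0; s < NSCALES; s++) {
        dim3 block(16, 16);
        dim3 grid((scale_w[s] + block.x - 1) / block.x,
                  (scale_h[s] + block.y - 1) / block.y);
        // Evaluate the cascade over every pixel of scale s.
        haar_kernel<<<grid, block, 0, streams[s]>>>(d_ii[s], d_hits[s],
                                                    scale_w[s], scale_h[s]);
    }
    cudaDeviceSynchronize();   // all scales done before face bounding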

Parallel Haar Filtering: Results

100% GPU Occupancy for windows of 28x28 pixels

Face Detection: Results


Benchmarks:
- 1920x1080p H.264 videos [bitrate 3600 Kbps]
- Shakira, "Waka Waka": http://www.youtube.com/watch?v=pRpeEdMmmQ0
- Andreea Balan, "Trippin": http://www.youtube.com/watch?v=aRzn8NBOspg

- NVIDIA GTX 470 card
- CUDA Toolkit 4.0 / Linux 2.6.35 [x86-64]
- OpenGL 3.x / GLEW / GLUT
- Video rendering embedded in a 3D triangle surface

Face Detection: Results (II)

[Chart: average processing time of 22.23 ms per frame.]

Face Detection: Results (III)

Face Detection: Results (IV)

Conclusions And Future Work


Contributions:
- Parallel GPU integral image computation based on row multiscan
- Face detection implementation with a fixed-size sliding window and image scaling based on texture fetches
- Face detection implementation that works in real time (35 fps) on 1080p video streams without discarding image pixels/regions

Future work:
- Techniques for reducing branch divergence (e.g. popcount)
- OpenCL / NVIDIA Denver [ARM OoO + GPU] / Intel MIC architecture

Thank you for your attention!

Questions?

Face Detection: Putting It All Together


H.264 video frame → OpenGL Pixel Buffer Object → OpenGL texture binding → CUDA/PBO map → parallel integral image → parallel Haar filter evaluation → parallel face bounding → CUDA/PBO unmap → display

- Data never returns to the CPU memory address space
- H.264 HW video decoding through the NVCUVID API
- Deadline: 40 ms per frame @ 25 fps
- OpenGL screen output
- The input video frame is allocated and transferred to GPU DRAM using an OpenGL Pixel Buffer Object; it is then mapped to CUDA for GPGPU computations (see the sketch below)
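A minimal sketch of that map/unmap step using the CUDA graphics interop API (pbo is the GLuint pixel buffer object holding the decoded frame; variable names are illustrative):

    cudaGraphicsResource_t pbo_res;
    cudaGraphicsGLRegisterBuffer(&pbo_res, pbo, cudaGraphicsRegisterFlagsNone);

    cudaGraphicsMapResources(1, &pbo_res, 0);
    unsigned char* d_frame;
    size_t nbytes;
    cudaGraphicsResourceGetMappedPointer((void**)&d_frame, &nbytes, pbo_res);

    // ... run integral image, Haar filtering and face bounding on d_frame ...

    cudaGraphicsUnmapResources(1, &pbo_res, 0);  // hand the buffer back to OpenGL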

Parallel Haar Filtering


[Figure: same pipeline as before; the image scaling and bilinear filtering stages run inside the texture hardware.]

- Integral images are computed in parallel for 32 different scales
- Image scaling and filtering are implemented in HW and transparently computed when texels are fetched from the texture cache/memory

[Figure: a texel fetch at floating-point coordinates, e.g. (857.8, 654.2), goes through the texture cache/sampler to texture memory (DRAM); the sampler interpolates in hardware before returning the value to the CUDA kernels.]
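A minimal sketch of how such a hardware-filtered fetch could look (texture reference API of that CUDA generation; tex_frame, downscale and inv_scale are assumed names):

    // Hardware bilinear scaling via the texture sampler.
    texture<float, cudaTextureType2D, cudaReadModeElementType> tex_frame;
    // Host side, once: tex_frame.filterMode = cudaFilterModeLinear;
    //                  cudaBindTextureToArray(tex_frame, cu_array);

    __global__ void downscale(float* out, int out_w, int out_h, float inv_scale)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= out_w || y >= out_h) return;

        // Fetching at fractional source coordinates, e.g. (857.8, 654.2);
        // the sampler interpolates the 4 neighbouring texels in hardware.
        float u = (x + 0.5f) * inv_scale;
        float v = (y + 0.5f) * inv_scale;
        out[y * out_w + x] = tex2D(tex_frame, u, v);
    }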

Parallel Haar Filtering (II)


Selected cascade:
- Trained with 24x24 faces
- 2913 filters, 25 stages
- Size: 106 KB
- Stored (compressed) in the constant memory address space

[Figure: cascade stages 0 .. N, each comparing the stage response against Threshold 0 .. Threshold N.]


- Cached in the on-die uniform cache
- HW optimized for broadcasting values to multiple threads (see the sketch after the figure below)
[Figure: filters F1 .. Fn broadcast from the uniform cache to Thread1 .. ThreadN.]
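A minimal sketch of the broadcast pattern (HaarFilter, d_cascade and eval_stage are illustrative names; the slides do not give the actual data layout):

    // The cascade lives in the constant address space; when every thread of
    // a warp reads the same d_cascade[f], the uniform cache serves it with a
    // single broadcast transaction.
    struct HaarFilter { float weight, threshold; };   // simplified payload
    __constant__ HaarFilter d_cascade[2913];          // ~23 KB at this size

    __global__ void eval_stage(float* score, int n, int f0, int f1)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;
        float s = 0.0f;
        for (int f = f0; f < f1; f++)   // f is uniform across the warp:
            s += d_cascade[f].weight;   // one broadcast read per filter
        score[tid] = s;
    }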
