Department of Electrical Engineering, Hopeman 413, University of Rochester, Rochester, New York 14627. Ph: (716) 275-3774, FAX: (716) 473-0486, E-mail: tekalp@ee.rochester.edu

The fundamentals of digital video representation, filtering, and compression, including popular algorithms for 2-D and 3-D motion estimation, object tracking, frame rate conversion, deinterlacing, image enhancement, and the emerging international standards for image and video compression, with such applications as digital TV, web-based multimedia, videoconferencing, videophone, and mobile image communications. Also included are more advanced image compression techniques such as entropy coding, subband coding, and object-based coding.
PART 1: REPRESENTATION
Lecture 1 Introduction to Analog and Digital Video
Lecture 2 Time-Varying Image Formation Models
Lecture 3 Spatio-Temporal Sampling
Lecture 4 Sampling Structure Conversion

PART 2: MOTION ESTIMATION
Lecture 5 Optical Flow Methods
Lecture 6 Block-Based Methods
Lecture 7 Pel Recursive Methods
Lecture 8 Bayesian Methods
Lecture 9 Parametric Modeling and Motion Segmentation
Lecture 10 2-D Motion Tracking
Lecture 11 3-D Motion and Structure Estimation
Lecture 12 Stereo Video

PART 3: FILTERING
Lecture 13 Motion-Compensated Filtering
Lecture 14 Standards Conversion
Lecture 15 Noise Filtering
Lecture 16 Restoration
Lecture 17 Superresolution

PART 4: IMAGE COMPRESSION
Lecture 18 Fundamentals and Lossless Coding
Lecture 19 DPCM and Transform Coding
Lecture 20 Still Image Compression Standards
Lecture 21 Subband/Wavelet Coding and Vector Quantization

PART 5: VIDEO COMPRESSION
Lecture 22 Interframe Compression Methods
Lecture 23 Frame-Based Video Compression Standards
Lecture 24 Object-Based Coding and MPEG-4
Lecture 25 Digital Video Communication
Textbook:
Digital Video Processing, by A. Murat Tekalp, Prentice-Hall, 1995.
Supplementary Reading:
Video Engineering, by Inglis and Luther, Second Ed., McGraw-Hill, 1996. (covers fundamentals of analog and digital video systems, including HDTV, CATV, terrestrial and satellite video broadcast technologies.)
Video Dialtone Technology, by Minoli, McGraw-Hill, 1995. (covers digital video over ADSL, HFC, FTTC and ATM technologies, including interactive TV and video-on-demand.)
Grading:
Homeworks: 25%
Midterm Project: 25% (written report due Mar. 6)
Final Project: 50% (to be presented May 6-8; written report due May 12)
Prerequisites:
EE 446 and EE 447 or EE 241 and permission of the instructor.
1. Analog Video
2. Digital Video
3. Digital Video Standards
4. Digital Video Applications: Digital TV; PC Multimedia; Real-Time Communications
c 1995-97 This material is the property of A. M. Tekalp. It is intended for use only as a teaching aid when teaching a regular semester or quarter based course at an academic institution using the textbook "Digital Video Processing" (ISBN 0-13-190075-7) by A. M. Tekalp. Any other use of this material is strictly prohibited.
ANALOG VIDEO
One or more analog signals that contain a time-varying 2-D intensity (monochrome or color) pattern and the timing information to align the pictures.
Component Analog Video (CAV): RGB; YCrCb (YIQ or YUV)
Composite Video: NTSC (National Television System Committee); PAL (Phase Alternating Line); SECAM (SEquential Color And Memory)
S-Video (Y/C video): NTSC; PAL; SECAM
Frame rate and flicker: Each complete picture is called a frame (temporal sampling). The minimum frame rate required for flicker-free viewing is 50 Hz.
Progressive scan: Each frame is made up of lines (vertical sampling).
Interlaced scan, where each frame is split into two fields, provides a tradeoff between temporal and vertical resolution.
NTSC (USA, Japan, Canada, Mexico)
PAL (Great Britain)
PAL (Germany, Austria, Italy)
PAL (China)
SECAM (France, Russia)
Synchronization
Scanning at the display device must be synchronized with that at the source.
(Figure: horizontal sync pulse on the blanking interval, showing the sync, black, and white signal levels versus time, and the horizontal retrace.)
Blanking pulses are inserted during the retrace intervals to blank out retrace lines on the receiving CRT. Sync pulses are added on top of the blanking pulses to synchronize the receiver's horizontal and vertical sweep circuits. The timing of the sync pulses is different for interlaced and non-interlaced video.
Example: NTSC signal
Video bandwidth = 4.2 MHz
Line rate = (FR)(NL) = 29.97 x 525 = 15,734 lines/s
Fraction of the line time that is active = 53.5/63.5 = 0.84
HR = (2 x 4.2 x 10^6 / 15,734) x 0.84 = 448 pixels
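The arithmetic above can be checked with a short script (a sketch; 0.84 is the rounded active-line fraction used on the slide):

```python
# Horizontal resolution of NTSC from its line rate and video bandwidth.
frame_rate = 29.97               # frames/s
lines_per_frame = 525
video_bw = 4.2e6                 # Hz
active_fraction = 0.84           # 53.5 us active / 63.5 us total, rounded

line_rate = frame_rate * lines_per_frame            # lines/s
# Two pixels per cycle of the highest video frequency, over the active line time
h_resolution = 2 * video_bw / line_rate * active_fraction

print(round(line_rate))          # about 15,734 lines/s
print(round(h_resolution))       # about 448 pixels
```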
Electronic (CCD) video cameras: ITU-R standards 625/25 or 525/30; recorded on video tape.
Motion picture cameras: 24 frames/s; recorded on motion picture film.
Synthetic content: computer animation, graphics, etc.; formed by sequential ordering of a set of still-frame images.
' &
DIGITAL REVOLUTION
Digital data communications (e.g., computer networks, e-mail) and Digital audio (e.g., CD players, digital telephony)
What is next?
Digital video - as a form of computer data. Products such as digital TV/HDTV, videophone, and multimedia PCs will be in the marketplace soon.
[1] "Digital video," IEEE Spectrum, pp. 24-30, Mar. 1992.
' &
Let's look at the raw data rates for digital audio and video: CD-quality digital audio and high-definition video.
1280 pels x 720 lines (luma) + 2 x (640 pels x 360 lines) (chroma), at 60 frames/s x 8 bits/pel/channel: approximately 663.5 Mbps (from the GA-HDTV proposal)
(Table: digital video formats - number of active pels/line and active lines/picture for luma (Y) and chroma (U, V), interlacing, temporal rate, aspect ratio, and raw data rate in Mbps.)
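As a rough check of the raw-rate figures, the sketch below assumes a 1280 x 720 luma raster with two 640 x 360 chroma channels at 60 frames/s and 8 bits per sample - a geometry consistent with the quoted 663.5 Mbps total, though the exact format parameters are not spelled out here:

```python
# Raw (uncompressed) data rates for CD audio and high-definition video.
cd_audio = 44100 * 16 * 2                 # samples/s * bits/sample * channels
luma   = 1280 * 720 * 60 * 8              # pels * lines * frames/s * bits/pel
chroma = 2 * (640 * 360 * 60 * 8)         # two half-resolution chroma channels
hd_video = luma + chroma

print(cd_audio / 1e6)    # about 1.41 Mbps
print(hd_video / 1e6)    # about 663.55 Mbps
```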
The boom in the FAX market followed binary image compression standards.
DVI (Digital Video Interactive), Indeo, QuickTime, CD-I (Compact Disc Interactive), PhotoCD
A committee under the Society of Motion Picture and Television Engineers (SMPTE) is working to develop a universal header/descriptor that would make any digital video stream recognizable by any device. There are also digital recording standards, e.g., D1 (component video), D2 (composite video), etc.
Consumer/Commercial:
- All-digital HDTV at 20 Mbits/s over 6 MHz taboo channels
- Digital TV at 4-6 Mbits/s
- Multimedia, desktop video at 1.5 Mbits/s (CD-ROM or hard-disk storage)
- Videoconferencing at 384 kbits/s using p x 64 kbits/s ISDN channels
- Videophone and mobile image communications at 16 kbits/s using the copper network (POTS)
Other:
- Surveillance imaging (military or law enforcement)
- Intelligent vehicle highway systems and harbor traffic control
- Medical imaging (cine imaging)
- Education and scientific research
Digital TV
Choices for ATV broadcast channels: terrestrial broadcast; direct satellite broadcast; optical fiber cable broadcast.
Terrestrial broadcast channels are 6 MHz in the US and 8 MHz in Europe. A 6 MHz channel can support about a 20-30 Mbps data rate using sophisticated modulation techniques (e.g., QAM or VSB).
663.5 : 20 is approximately 33 : 1 compression. A single 6-MHz TV channel can support 4 or 5 standard-resolution digital TV programs (at 4-6 Mbits/s each).
PC Multimedia
Early technologies
CD-based interactive full-screen, full-motion video
- Digital Video Interactive (DVI) technology: hardware to handle full-motion video in PCs at about 1.5 Mbit/s
- VideoCD and Digital Video Disk (DVD)
Networked Multimedia / Video-on-Demand
[1] "Special report: Interactive multimedia," IEEE Spectrum, pp. 22-39, Mar. 1993.
[2] J. van der Meer, "The full motion system for CD-I," IEEE Trans. Cons. Electronics, vol. 38, no. 4, pp. 910-920, Nov. 1992.
[3] J. Sutherland and L. Litteral, "Residential video services," IEEE Comm. Mag., pp. 37-41, July 1992.
Real-Time Communications
Digital audio: The audio signal is sampled at 8 kHz and quantized with 8-12 bits/sample. Most telephone networks can carry a load of 14 kbps to 56 kbps; bit-rate reduction is achieved by coarser quantization.
Videoconferencing/videophone over ISDN: up to 2 Mbits/s using H.261 or H.263 compression.
Videophone over existing phone lines: 8-32 kbits/s using H.263 or H.263+ compression.
Video communications over future broadband ATM/access networks:
- Constant Bit Rate (CBR) channel - switched network
- Variable Bit Rate (VBR) channel - quality-of-service contract
- Available Bit Rate (ABR) channel - no guarantees, just like the Internet
Packet Video
The video bitstream is divided into elementary blocks (fixed or variable size), each containing a header and payload (data bits), e.g., MPEG-2 packets. Packet video allows
- interleaving video, audio, and data packets, and multiple programs in a single bitstream
- better error protection and resilience, and low delay
Network infrastructures: telephone networks; cable-TV networks; Internet (network of networks)
Modes of transmission: point-to-point transmission; multi-casting and broadcasting
Access Networks
Fiber-to-Home; Hybrid-Fiber-Coax (Cable Modem); Fiber-to-Curb (ADSL to home)
Some Access Network Bit-Rate Regimes:
- Conventional telephone modem: 28.8 kbps
- ISDN (Integrated Services Digital Network): 64-144 kbps (p x 64)
- T-1: 1.5 Mbps
- ADSL (Asymmetric Digital Subscriber Line): 1.5-6 Mbps downstream
- Cable modem: 30 Mbps downstream
- Ethernet (packet-based LAN): 10 Mbps
- Fiber B-ISDN/ATM: 55-200 Mbps
(i) Motion Analysis: 2-D motion/optical-flow estimation and segmentation; 3-D motion, structure estimation and segmentation; object tracking, occlusion, deformations.
(ii) Filtering and Standards Conversion: deblurring, noise filtering, edge sharpening; frame rate conversion and deinterlacing; resolution enhancement.
(iii) Compression: JPEG, H.261/H.263, MPEG 1-2; subband/wavelet and model-based coding.
TOKEN TRACKING
2-D Trajectory Model: Describe temporal evolution of selected feature points, e.g.,
$$x_1(k+1) = x_1(k)\cos\theta(k) - x_2(k)\sin\theta(k) + t_1(k)$$
$$x_2(k+1) = x_1(k)\sin\theta(k) + x_2(k)\cos\theta(k) + t_2(k)$$
with a 2-D rotation by the angle $\theta(k)$ and translation by $t_1(k)$ and $t_2(k)$.
Observation Model: Determine a number of feature correspondences over multiple frames, e.g., by block matching.
Batch or Recursive Estimation: Find the best motion parameters consistent with the model and observations. Batch estimators, e.g., the nonlinear least-squares estimator, process the entire data record at once after all data is collected. Recursive estimators, e.g., Kalman filters, process each observation as it becomes available to update the motion parameters.
Each line segment is represented by a 4-D feature vector $\mathbf{p} = [\mathbf{p}_1\ \mathbf{p}_2]^T$ consisting of the two end points, $\mathbf{p}_1$ and $\mathbf{p}_2$. The 2-D trajectory of the endpoints is modeled by the constant-acceleration model
$$x(k) = x(k-1) + v(k-1)\,\Delta t + \tfrac{1}{2}a(k-1)(\Delta t)^2$$
$$v(k) = v(k-1) + a(k-1)\,\Delta t$$
$$a(k) = a(k-1)$$
where $x(k)$, $v(k)$, and $a(k)$ denote the position, velocity, and acceleration of the pixel at time $k$, respectively. To perform tracking by a Kalman filter, we define the 12-dimensional state of the line segment as
$$\mathbf{z}(k) = [\,\mathbf{p}(k)\ \ \dot{\mathbf{p}}(k)\ \ \ddot{\mathbf{p}}(k)\,]^T$$
where $\dot{\mathbf{p}}(k)$ and $\ddot{\mathbf{p}}(k)$ denote the velocity and the acceleration of the coordinates, respectively.
Example: (cont'd)
The state propagation equation is
$$\mathbf{z}(k) = \Phi(k, k-1)\,\mathbf{z}(k-1) + \mathbf{w}(k), \qquad k = 1, \ldots, N$$
where
$$\Phi(k, k-1) = \begin{bmatrix} \mathbf{I}_4 & \mathbf{I}_4\,\Delta t & \tfrac{1}{2}\mathbf{I}_4(\Delta t)^2 \\ \mathbf{0}_4 & \mathbf{I}_4 & \mathbf{I}_4\,\Delta t \\ \mathbf{0}_4 & \mathbf{0}_4 & \mathbf{I}_4 \end{bmatrix}$$
$\mathbf{I}_4$ and $\mathbf{0}_4$ are 4 x 4 identity and zero matrices, respectively, and $\mathbf{w}(k)$ is a zero-mean, white process with covariance matrix $\mathbf{Q}(k)$. The observation equation is
$$\mathbf{y}(k) = \mathbf{p}(k) + \mathbf{v}(k), \qquad k = 1, \ldots, N$$
It is assumed that the noisy observations can be estimated from pairs of frames using some token-matching algorithm.
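As an illustration (a reduced sketch, not the 12-dimensional line-segment tracker above), here is a minimal constant-acceleration Kalman filter for a single coordinate; the noise covariances and measurements are made up:

```python
# Kalman predict/update cycle with state [position, velocity, acceleration].
import numpy as np

dt = 1.0
Phi = np.array([[1, dt, 0.5 * dt**2],     # constant-acceleration transition
                [0, 1,  dt],
                [0, 0,  1]])
H = np.array([[1.0, 0.0, 0.0]])           # we observe position only
Q = 1e-4 * np.eye(3)                      # process-noise covariance (assumed)
R = np.array([[1e-2]])                    # observation-noise covariance (assumed)

z = np.zeros(3)                           # state estimate
P = np.eye(3)                             # error covariance

for y in [0.5, 2.1, 4.4, 8.2, 12.4]:      # noisy position observations y(k)
    # predict
    z = Phi @ z
    P = Phi @ P @ Phi.T + Q
    # update with observation y(k) = p(k) + v(k)
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)    # Kalman gain
    z = z + (K @ (np.array([y]) - H @ z)).ravel()
    P = (np.eye(3) - K @ H) @ P

print(z[0])   # filtered position after the last observation
```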
BOUNDARY TRACKING
Polygon tracking (by tracking corners)
Splines and active contours:
- Propagate joint points by their motion vectors
- Define various energy functions to snap the propagated snake to the contour in the next frame
OBJECT TRACKING
Object-Based Editing -> Synthetic Transfiguration
Object-Based Coding -> MPEG-4
Content-Based Retrieval -> Digital Libraries
3-D Object Modeling -> Virtual Reality
Triangle-Based Affine MC
Standard translational block matching cannot handle rotation and zooming. Neighboring relationships in the reference frame are preserved in the target frame. (Mesh elements do not overlap each other.)
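To make the idea concrete, the following sketch solves for the six affine parameters of one mesh triangle from its three vertex correspondences (the coordinates are made up for illustration):

```python
# Affine parameters of a triangle from three vertex correspondences.
import numpy as np

src = np.array([(0.0, 0.0), (8.0, 0.0), (0.0, 8.0)])   # triangle in frame k-1
dst = np.array([(1.0, 1.0), (9.0, 2.0), (0.0, 9.0)])   # same triangle in frame k

# x' = a1*x + a2*y + a3,  y' = a4*x + a5*y + a6  at each vertex
A = np.zeros((6, 6))
b = np.zeros(6)
for i, ((x, y), (xp, yp)) in enumerate(zip(src, dst)):
    A[2*i]     = [x, y, 1, 0, 0, 0]
    A[2*i + 1] = [0, 0, 0, x, y, 1]
    b[2*i], b[2*i + 1] = xp, yp

a = np.linalg.solve(A, b)

# The mapping reproduces the vertices exactly; interior points are
# interpolated, so neighboring triangles stay connected (no overlap).
warp = lambda x, y: (a[0]*x + a[1]*y + a[2], a[3]*x + a[4]*y + a[5])
print(warp(8.0, 0.0))
```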
(Figure: texture mapping of triangular mesh elements from frame k-1 to frame k.)
(Figure: mesh with marked and unmarked pixels over regions of high and low temporal activity.)
Node-Point Selection
1. Estimate the 2-D forward dense motion field; find and polygonize the BTBC (background-to-be-covered) region.
2. Label all pixels within the BTBC polygon "marked," and include its corners in the list of node points.
3. Compute the average DFD over the unmarked region.
4. Compute a cost function C(x, y) over the unmarked region.
5. Select the unmarked pixel with the highest C(x, y) which is not closer to any of the existing node points than a prespecified distance as the next node point.
6. Grow a region about this node point until the sum of the absolute DFD reaches a threshold. Label all points within this region as "marked."
7. Continue until the maximum number of node points is reached, or all pixels are "marked."
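A toy version of the greedy node-point selection loop, using a synthetic cost array in place of the DFD-based cost C(x, y) and a plain minimum-distance test:

```python
# Greedy selection of node points: pick the highest-cost pixel that is at
# least `min_dist` away from all previously selected nodes.
import numpy as np

rng = np.random.default_rng(0)
C = rng.random((32, 32))          # stand-in for the DFD-based cost function
min_dist, max_nodes = 6, 8

nodes = []
# visit pixel coordinates in order of decreasing cost
order = np.dstack(np.unravel_index(np.argsort(C, axis=None)[::-1], C.shape))[0]
for y, x in order:
    if all((y - ny)**2 + (x - nx)**2 >= min_dist**2 for ny, nx in nodes):
        nodes.append((y, x))
        if len(nodes) == max_nodes:
            break

print(len(nodes), nodes[0])
```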
- Sampling from the dense motion field
- Logarithmic/hexagonal search (hierarchical)
- Closed-form connectivity-preserving solutions
Select a polygon enclosing the region of interest.
Overlay a 2-D mesh (e.g., a uniform triangular mesh).
(Figure: reference frame, previous frame, and current frame in mesh tracking.)
Assumption: mild deformations. Define a cost polygon about each boundary node. Estimate the motion vector using deformable block matching.
(Figure: previous and current polygons with mesh patches A1, A2.)
Propagate each node using the affine mapping of the corresponding patch.
Use hexagonal matching to refine the location of each node.
Each node point is assigned a pair of parameters: an intensity scale factor and an offset c. Their values at any x are bilinearly interpolated.
(Figure: mesh fitting and the synthesized video.)
(Figure: BTBC region between frame k and frame k+1.)
- No node points within the BTBC region
- Mesh propagation with node-point motion vectors
- Model-failure detection (ideally, MF region = UB region)
- Mesh refinement within the MF region
(Figure: BTBC and UB (uncovered background) regions between frame k and frame k+1.)
- Use mesh elements from one object at a time only
- More than one motion vector for some nodes on the boundary
- BTBC regions should map onto a curve segment in the next frame
Each object is tracked independently. Uncovered areas are either assigned to one of the existing objects, or to a new object. Object mosaicing.
LECTURE 2
(Figure: a video sequence partitioned into shots 1 through N.)
A video source is a collection of shots. A shot is a video clip recorded by an uninterrupted motion of a single camera. Shot boundaries can be clean (as in a camera break) or blurred over a few frames, as in special effects such as dissolves, wipes, fade-ins, and fade-outs.
Image Formation
Spatio-Temporal Sampling
The variation in the intensity of the images from frame to frame is due to:
- 3-D camera motion, e.g., zoom and pan
- 3-D object motion, e.g., local translation and rotation
- photometric effects of 3-D motion
- changes in the scene illumination
We neglect deformable body motion at this time.
3-D displacement of a point on a rigid object:
- in Cartesian coordinates, $(X_1, X_2, X_3)$: an affine transformation;
- in homogeneous coordinates: a linear transformation.

3-D rotation, translation, and scaling (zooming) of a rigid body can be represented by an affine transformation
$$\mathbf{X}' = \mathbf{S}\,\mathbf{R}\,\mathbf{X} + \mathbf{T}$$
where
$$\mathbf{X}' = \begin{bmatrix} X_1' \\ X_2' \\ X_3' \end{bmatrix} \quad\text{and}\quad \mathbf{X} = \begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix}$$
denote the coordinates of a point at time instants $t_{k+1}$ and $t_k$, respectively, and
$$\mathbf{T} = \begin{bmatrix} T_1 \\ T_2 \\ T_3 \end{bmatrix}, \qquad \mathbf{S} = \begin{bmatrix} S_1 & 0 & 0 \\ 0 & S_2 & 0 \\ 0 & 0 & S_3 \end{bmatrix}$$
Rotation - Eulerian angles in Cartesian coordinates: An arbitrary rotation in 3-D space can be represented by the Eulerian angles $\theta$, $\phi$, and $\psi$ of rotation about the $X_1$, $X_2$, and $X_3$ axes, respectively.
(Figure: 90-degree rotations about each of the three coordinate axes.)
The matrices that describe clockwise rotations about the individual axes are given by
$$\mathbf{R}_\theta = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & -\sin\theta \\ 0 & \sin\theta & \cos\theta \end{bmatrix}, \quad \mathbf{R}_\phi = \begin{bmatrix} \cos\phi & 0 & \sin\phi \\ 0 & 1 & 0 \\ -\sin\phi & 0 & \cos\phi \end{bmatrix}, \quad \mathbf{R}_\psi = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
An example: consider rotation about the $X_1$ axis by 90 degrees:
$$\begin{bmatrix} X_1' \\ X_2' \\ X_3' \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\frac{\pi}{2} & -\sin\frac{\pi}{2} \\ 0 & \sin\frac{\pi}{2} & \cos\frac{\pi}{2} \end{bmatrix}\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$
Recall that matrix multiplication is not commutative; thus, in composite rotations, the order of specifying the rotations is important.
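A quick numerical check of the rotation matrices above and of their non-commutativity:

```python
# Rotations about X1 and X3 applied in the two possible orders.
import numpy as np

def rot_x1(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_x3(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

a = np.pi / 2
p = np.array([0.0, 1.0, 0.0])

print(rot_x1(a) @ p)                # (0, 1, 0) maps to (0, 0, 1)
print(rot_x3(a) @ rot_x1(a) @ p)    # rotate about X1, then X3
print(rot_x1(a) @ rot_x3(a) @ p)    # reversed order gives a different point
```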
Assuming infinitesimal rotation from frame to frame, i.e., $\theta = \Delta\theta$, etc., and approximating $\cos\Delta\theta \approx 1$ and $\sin\Delta\theta \approx \Delta\theta$, etc., these matrices simplify to
$$\mathbf{R}_\theta \approx \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & -\Delta\theta \\ 0 & \Delta\theta & 1 \end{bmatrix}, \quad \mathbf{R}_\phi \approx \begin{bmatrix} 1 & 0 & \Delta\phi \\ 0 & 1 & 0 \\ -\Delta\phi & 0 & 1 \end{bmatrix}, \quad \mathbf{R}_\psi \approx \begin{bmatrix} 1 & -\Delta\psi & 0 \\ \Delta\psi & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
Rotation about an arbitrary axis in Cartesian coordinates: A 3-D rotation can be represented by an angle $\alpha$ about an axis through the origin, described by the directional cosines $n_1$, $n_2$, and $n_3$.
(Figure: rotation axis through the origin with direction $(n_1, n_2, n_3)$.)
Then,
$$\mathbf{R} = \begin{bmatrix} n_1^2 + (1 - n_1^2)\cos\alpha & n_1 n_2 (1-\cos\alpha) - n_3\sin\alpha & n_1 n_3 (1-\cos\alpha) + n_2\sin\alpha \\ n_1 n_2 (1-\cos\alpha) + n_3\sin\alpha & n_2^2 + (1 - n_2^2)\cos\alpha & n_2 n_3 (1-\cos\alpha) - n_1\sin\alpha \\ n_1 n_3 (1-\cos\alpha) - n_2\sin\alpha & n_2 n_3 (1-\cos\alpha) + n_1\sin\alpha & n_3^2 + (1 - n_3^2)\cos\alpha \end{bmatrix}$$
For an infinitesimal rotation $\alpha = \Delta\alpha$, $\mathbf{R}$ reduces to
$$\mathbf{R} \approx \begin{bmatrix} 1 & -n_3\Delta\alpha & n_2\Delta\alpha \\ n_3\Delta\alpha & 1 & -n_1\Delta\alpha \\ -n_2\Delta\alpha & n_1\Delta\alpha & 1 \end{bmatrix}$$
with
$$\Delta\theta = n_1\Delta\alpha, \qquad \Delta\phi = n_2\Delta\alpha, \qquad \Delta\psi = n_3\Delta\alpha.$$
Start with the 3-D displacement model for rotation and translation only,
$$\begin{bmatrix} X_1' \\ X_2' \\ X_3' \end{bmatrix} = \begin{bmatrix} 1 & -\Delta\psi & \Delta\phi \\ \Delta\psi & 1 & -\Delta\theta \\ -\Delta\phi & \Delta\theta & 1 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} + \begin{bmatrix} T_1 \\ T_2 \\ T_3 \end{bmatrix}$$
Taking the limit as $\Delta t \to 0$,
$$\lim_{\Delta t\to 0}\begin{bmatrix} (X_1' - X_1)/\Delta t \\ (X_2' - X_2)/\Delta t \\ (X_3' - X_3)/\Delta t \end{bmatrix} = \lim_{\Delta t\to 0}\begin{bmatrix} 0 & -\Delta\psi/\Delta t & \Delta\phi/\Delta t \\ \Delta\psi/\Delta t & 0 & -\Delta\theta/\Delta t \\ -\Delta\phi/\Delta t & \Delta\theta/\Delta t & 0 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} + \lim_{\Delta t\to 0}\begin{bmatrix} T_1/\Delta t \\ T_2/\Delta t \\ T_3/\Delta t \end{bmatrix}$$
yields
$$\begin{bmatrix} \dot{X}_1 \\ \dot{X}_2 \\ \dot{X}_3 \end{bmatrix} = \begin{bmatrix} 0 & -\Omega_3 & \Omega_2 \\ \Omega_3 & 0 & -\Omega_1 \\ -\Omega_2 & \Omega_1 & 0 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix} + \begin{bmatrix} V_1 \\ V_2 \\ V_3 \end{bmatrix}$$
where $\Omega_i$ and $V_i$ denote the angular and translational velocities, respectively, for $i = 1, 2, 3$.
HOMOGENEOUS COORDINATES
Define the vectors $\mathbf{X}_h$ and $\mathbf{X}_h'$ in the homogeneous coordinates as
$$\mathbf{X}_h = \begin{bmatrix} kX_1 \\ kX_2 \\ kX_3 \\ k \end{bmatrix} \quad\text{and}\quad \mathbf{X}_h' = \begin{bmatrix} kX_1' \\ kX_2' \\ kX_3' \\ k \end{bmatrix}$$
Then the affine transformation in the Cartesian coordinates, $\mathbf{X}' = \mathbf{A}\mathbf{X} + \mathbf{T}$, can be expressed as a linear transformation in the homogeneous coordinates,
$$\mathbf{X}_h' = \tilde{\mathbf{A}}\,\mathbf{X}_h, \qquad \tilde{\mathbf{A}} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & T_1 \\ a_{21} & a_{22} & a_{23} & T_2 \\ a_{31} & a_{32} & a_{33} & T_3 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
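A small sketch of the homogeneous-coordinate form, with made-up values for A and T:

```python
# Affine transform X' = A X + T packed into a single 4x4 matrix.
import numpy as np

A = 2.0 * np.eye(3)                 # e.g., isotropic scaling (R = I, S = 2I)
T = np.array([1.0, -2.0, 0.5])      # translation

A_tilde = np.eye(4)
A_tilde[:3, :3] = A
A_tilde[:3, 3] = T                  # last row stays [0, 0, 0, 1]

X = np.array([3.0, 4.0, 5.0])
Xh = np.append(X, 1.0)              # homogeneous coordinates (k = 1)

print(A_tilde @ Xh)                 # equals A @ X + T, with a trailing 1
```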
Translation:
$$\mathbf{X}_h' = \tilde{\mathbf{T}}\,\mathbf{X}_h, \qquad \tilde{\mathbf{T}} = \begin{bmatrix} 1 & 0 & 0 & T_1 \\ 0 & 1 & 0 & T_2 \\ 0 & 0 & 1 & T_3 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
Scaling (zooming):
$$\mathbf{X}_h' = \tilde{\mathbf{S}}\,\mathbf{X}_h, \qquad \tilde{\mathbf{S}} = \begin{bmatrix} S_1 & 0 & 0 & 0 \\ 0 & S_2 & 0 & 0 \\ 0 & 0 & S_3 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
Rotation:
$$\mathbf{X}_h' = \tilde{\mathbf{R}}\,\mathbf{X}_h, \qquad \tilde{\mathbf{R}} = \begin{bmatrix} r_{11} & r_{12} & r_{13} & 0 \\ r_{21} & r_{22} & r_{23} & 0 \\ r_{31} & r_{32} & r_{33} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
where $r_{ij}$ are the elements of the 3 x 3 rotation matrix $\mathbf{R}$.
Projection maps the 3-D world coordinates into the 2-D image-plane coordinates: $(X_1, X_2, X_3, t) \to (x_1, x_2, t)$.
- Projective camera -> perspective (central) projection
- Affine camera -> weak-perspective and orthographic projection
Projective Camera
There are three coordinate systems: camera, image, and world.
1. Camera Coordinate System: Perspective Projection
(Figure: perspective projection of the point $(X_c, Y_c, Z_c)$ onto the image point $(x_c, y_c)$.)
The center of projection coincides with the origin of the camera coordinates:
$$\frac{x_c}{f} = \frac{X_c}{Z_c} \quad\text{and}\quad \frac{y_c}{f} = \frac{Y_c}{Z_c}$$
Perspective projection is nonlinear in the Cartesian coordinates; however, it can be expressed as a linear operation in the homogeneous coordinates:
$$\begin{bmatrix} x_c \\ y_c \\ f \end{bmatrix} = \frac{f}{Z_c}\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}\begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix}$$
2. Image Coordinate System
(Figure: image coordinates $(x_i, y_i)$ and camera coordinates $(x_c, y_c)$, with principal point $(x_0, y_0)$.)
$$\begin{bmatrix} x_i - x_0 \\ y_i - y_0 \end{bmatrix} = \mathbf{C}\begin{bmatrix} x_c \\ y_c \end{bmatrix}$$
where $\mathbf{C}$ is called the camera calibration matrix, and the principal point $(x_0, y_0)$ is where the optic axis intersects the image plane.
3. World Coordinate System
(Figure: camera axes $(X_c, Y_c, Z_c)$ and world axes $(X_w, Y_w, Z_w)$ related by a rotation $\mathbf{R}$ and a translation $\mathbf{t}$.)
$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^T & 1 \end{bmatrix}\begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$$
General pin-hole camera equation:
$$\begin{bmatrix} x_i \\ y_i \end{bmatrix} = f\begin{bmatrix} (\mathbf{R}_1\mathbf{X}_w + t_x)/(\mathbf{R}_3\mathbf{X}_w + t_z) \\ (\mathbf{R}_2\mathbf{X}_w + t_y)/(\mathbf{R}_3\mathbf{X}_w + t_z) \end{bmatrix} + \begin{bmatrix} x_0 \\ y_0 \end{bmatrix}$$
where $\mathbf{R}_i$ denotes the $i$th row of $\mathbf{R}$.
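The pin-hole equation can be sketched as a function; R, t, f, and the test point below are arbitrary illustration values:

```python
# World point -> camera frame -> perspective division -> image coordinates.
import numpy as np

def project(Xw, R, t, f, x0=0.0, y0=0.0):
    Xc = R @ Xw + t                     # world -> camera coordinates
    xi = f * Xc[0] / Xc[2] + x0         # divide by depth (R3 Xw + tz)
    yi = f * Xc[1] / Xc[2] + y0
    return xi, yi

R = np.eye(3)                           # camera aligned with the world
t = np.array([0.0, 0.0, 10.0])          # scene pushed 10 units along the axis
f = 2.0

print(project(np.array([1.0, 2.0, 0.0]), R, t, f))
```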
(Figure: image plane and lens center with focal length f; a 3-D point $(X_1, X_2, X_3)$ projects to the image point $(x_1, x_2)$.)
The camera coordinate system is aligned with the world coordinate system.
Weak-Perspective Projection
Let $Z_i = \mathbf{R}_3\mathbf{X} + D_z$; then the perspective projection is given by
$$\mathbf{x}^c = f\begin{bmatrix} (\mathbf{R}_1\mathbf{X} + D_x)/Z_i \\ (\mathbf{R}_2\mathbf{X} + D_y)/Z_i \end{bmatrix} + \begin{bmatrix} o_x \\ o_y \end{bmatrix}$$
If the average depth $Z_{ave} = \mathbf{R}_3\bar{\mathbf{X}} + D_z$ is such that $|Z_i - Z_{ave}| \ll Z_{ave}$, then
$$\mathbf{x}^c \approx \frac{f}{Z_{ave}}\begin{bmatrix} \mathbf{R}_1\mathbf{X} \\ \mathbf{R}_2\mathbf{X} \end{bmatrix} + \frac{f}{Z_{ave}}\begin{bmatrix} D_x \\ D_y \end{bmatrix} + \begin{bmatrix} o_x \\ o_y \end{bmatrix}$$
Affine Camera
An uncalibrated weak-perspective projection:
$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} T_{11} & T_{12} & T_{13} & T_{14} \\ T_{21} & T_{22} & T_{23} & T_{24} \\ 0 & 0 & 0 & T_{34} \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{bmatrix}$$
In Cartesian coordinates,
$$\mathbf{x} = \mathbf{M}\mathbf{X} + \mathbf{t}$$
where $\mathbf{M}$ is a 2 x 3 matrix with elements $M_{ij} = T_{ij}/T_{34}$ and $\mathbf{t} = [T_{14}/T_{34}\ \ T_{24}/T_{34}]^T$.
Orthographic Projection
Let the image plane be parallel to the $X_1$-$X_2$ plane of the world coordinate system. Then, in Cartesian coordinates, $x_1 = X_1$ and $x_2 = X_2$; or, in vector-matrix notation,
$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}\begin{bmatrix} X_1 \\ X_2 \\ X_3 \end{bmatrix}$$
All rays from the 3-D object (scene) to the image plane are parallel to each other.
Under a Lambertian surface model, the image intensity is proportional to $\mathbf{L}^T\mathbf{N}$, where $\mathbf{L} = (L_1, L_2, L_3)$ is the unit vector in the mean illuminant direction and $\mathbf{N}$ is the unit surface normal of the scene at position $(X_1, X_2, X_3(X_1, X_2))$, given by
$$\mathbf{N} = \frac{(-p, -q, 1)}{(p^2 + q^2 + 1)^{1/2}}$$
in which $p = \partial X_3/\partial x_1$ and $q = \partial X_3/\partial x_2$ are the partial derivatives of depth with respect to the image coordinates $x_1$ and $x_2$, respectively.
(Figure: photometric model.)
Note that the illuminant direction can also be expressed in terms of the tilt and slant angles as
$$\mathbf{L} = (\cos\tau\sin\sigma,\ \sin\tau\sin\sigma,\ \cos\sigma)$$
where $\tau$, the tilt angle of the illuminant, is the angle between the projection of $\mathbf{L}$ onto the image plane and the $x_1$ axis, and $\sigma$, the slant angle, is the angle between $\mathbf{L}$ and the positive $X_3$ axis.
Assuming that the mean illuminant direction $\mathbf{L}$ remains constant, we can express the change in intensity due to the photometric effects of the motion at the point $(X_1, X_2, X_3)$ as
$$\frac{d s_c(x_1, x_2, t)}{dt} = \mathbf{L}^T\,\frac{d\mathbf{N}}{dt}$$
Approximate $d\mathbf{N}/dt$ as
$$\frac{d\mathbf{N}}{dt} \approx \mathbf{N}(X_1', X_2', X_3') - \mathbf{N}(X_1, X_2, X_3) = \frac{(-p', -q', 1)}{(p'^2 + q'^2 + 1)^{1/2}} - \frac{(-p, -q, 1)}{(p^2 + q^2 + 1)^{1/2}}$$
where $p'$ and $q'$ denote the depth gradients after the motion; for an infinitesimal rotation they can be expressed in terms of $p$, $q$, and the incremental rotation angles.
1. Spatio-Temporal Sampling: 2-D sampling structures for analog video; 3-D sampling structures for digital video; analog-to-digital conversion
2. Spectral Characterization of Sampled Video: 2-D sampling on a rectangular grid; 2-D/3-D sampling on a lattice
3. Reconstruction of Continuous Video from Samples: digital-to-analog conversion
Spatio-Temporal Sampling
(Block diagram: source RGB -> RGB-to-YUV -> NTSC encoder -> composite signal -> NTSC decoder -> YUV-to-RGB -> display RGB.)
Consider the image-plane intensity distribution $s_c(x_1, x_2, t)$ as a function of three continuous variables. Then, video is sampled
- in the vertical and temporal dimensions (usually $x_2$ and $t$) by means of the scanning process, and
- in all three dimensions for digital processing, storage and transmission.
(Figure: interlaced vertical-temporal sampling lattice; each dot indicates a continuous line of video perpendicular to the plane of the page.)
Progressive Sampling
(Each dot indicates a pixel location, the numbers indicate the time of sampling.)
Field-Quincunx Sampling and Line-Quincunx Sampling
(Figures: field-quincunx and line-quincunx sampling patterns; samples labeled 1 and 2 by field, with the corresponding sampling matrices V in terms of $\Delta x_1$, $\Delta x_2$, and $\Delta t$.)
[1] E. Dubois, "The sampling and reconstruction of time-varying imagery with application in video systems," Proc. IEEE, vol. 73, no. 4, pp. 502-522, Apr. 1985.
Analog-to-Digital Conversion
- The minimum sampling frequency is 2 x 4.2 = 8.4 MHz (the Nyquist rate).
- The sampling rate should be an integral multiple of the line rate, so that samples in successive lines are aligned.
- For sampling the composite signal, the sampling frequency must be an integral multiple of the subcarrier frequency; this simplifies decoding (composite to RGB) of the sampled signal.
- For sampling component signals, there should be a single rate for the 525/30 and 625/50 systems; i.e., the sampling rate should be an integral multiple of both line rates, 29.97 x 525 = 15,734 Hz and 25 x 625 = 15,625 Hz.
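As a check of the last requirement, the common component sampling rate of 13.5 MHz (the ITU-R BT.601 choice) is an integer multiple of both line rates:

```python
# Samples per line at a 13.5 MHz sampling rate for both scanning systems.
fs = 13.5e6
line_rate_525 = 525 * 30000 / 1001      # exact NTSC line rate, ~15,734 Hz
line_rate_625 = 625 * 25                # 15,625 Hz

print(fs / line_rate_525)   # 858 samples per total line (525/30 system)
print(fs / line_rate_625)   # 864 samples per total line (625/50 system)
```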
(Figure: luma and chroma (U, V) sampling positions for the 4:4:4, 4:2:2, and 4:2:0 formats.)
2-D Sampling on a Rectangular Grid
$$s(n_1, n_2) = s_c(n_1\Delta x_1,\ n_2\Delta x_2), \qquad n_1, n_2 \in Z$$
The continuous-space Fourier transform pair is
$$S_c(F_1, F_2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} s_c(x_1, x_2)\,e^{-j2\pi(F_1 x_1 + F_2 x_2)}\,dx_1\,dx_2$$
$$s_c(x_1, x_2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} S_c(F_1, F_2)\,e^{\,j2\pi(F_1 x_1 + F_2 x_2)}\,dF_1\,dF_2$$
and the discrete-space pair, in normalized frequencies $f_1$, $f_2$, is
$$S(f_1, f_2) = \sum_{n_1=-\infty}^{\infty}\sum_{n_2=-\infty}^{\infty} s(n_1, n_2)\,e^{-j2\pi(f_1 n_1 + f_2 n_2)}$$
$$s(n_1, n_2) = \int_{-1/2}^{1/2}\int_{-1/2}^{1/2} S(f_1, f_2)\,e^{\,j2\pi(f_1 n_1 + f_2 n_2)}\,df_1\,df_2$$
Evaluating the samples from the continuous spectrum,
$$s(n_1, n_2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} S_c(F_1, F_2)\,e^{\,j2\pi(F_1 n_1\Delta x_1 + F_2 n_2\Delta x_2)}\,dF_1\,dF_2$$
Define the normalized frequencies $f_1 = F_1\Delta x_1$ and $f_2 = F_2\Delta x_2$; then
$$s(n_1, n_2) = \frac{1}{\Delta x_1\Delta x_2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} S_c\!\left(\frac{f_1}{\Delta x_1}, \frac{f_2}{\Delta x_2}\right)e^{\,j2\pi(f_1 n_1 + f_2 n_2)}\,df_1\,df_2 = \sum_{k_1}\sum_{k_2} S_Q(k_1, k_2)$$
where $S_Q(k_1, k_2)$ is defined as the same integral restricted to the unit square $-\tfrac{1}{2} + k_1 \le f_1 < \tfrac{1}{2} + k_1$ and $-\tfrac{1}{2} + k_2 \le f_2 < \tfrac{1}{2} + k_2$.
A change of variables $f_1' = f_1 - k_1$ and $f_2' = f_2 - k_2$ maps every square down to $(-\tfrac{1}{2}, \tfrac{1}{2}] \times (-\tfrac{1}{2}, \tfrac{1}{2}]$, and since $\exp\{-j2\pi(k_1 n_1 + k_2 n_2)\} = 1$,
$$s(n_1, n_2) = \int_{-1/2}^{1/2}\int_{-1/2}^{1/2}\left[\frac{1}{\Delta x_1\Delta x_2}\sum_{k_1}\sum_{k_2} S_c\!\left(\frac{f_1 - k_1}{\Delta x_1}, \frac{f_2 - k_2}{\Delta x_2}\right)\right]e^{\,j2\pi(f_1 n_1 + f_2 n_2)}\,df_1\,df_2$$
Comparing with the inverse discrete transform, we conclude that
$$S(f_1, f_2) = \frac{1}{\Delta x_1\Delta x_2}\sum_{k_1}\sum_{k_2} S_c\!\left(\frac{f_1 - k_1}{\Delta x_1}, \frac{f_2 - k_2}{\Delta x_2}\right), \qquad -\tfrac{1}{2} \le f_1, f_2 < \tfrac{1}{2}$$
(Figure: (a) analog spectrum $S_c(F_1, F_2)$ with bandwidth B; (b) rectangular sampling grid with spacings $\Delta x_1$, $\Delta x_2$; (c) spectrum of the sampled signal, replicated at multiples of $1/\Delta x_1$ and $1/\Delta x_2$.)
An arbitrary periodic sampling geometry can be defined by the vectors $\mathbf{v}_1 = (v_{11}, v_{21})^T$ and $\mathbf{v}_2 = (v_{12}, v_{22})^T$, such that
$$x_1 = v_{11} n_1 + v_{12} n_2, \qquad x_2 = v_{21} n_1 + v_{22} n_2$$
(Figure: sampling lattice generated by $\mathbf{v}_1$ and $\mathbf{v}_2$.)
In vector-matrix form,
$$\mathbf{x} = \mathbf{V}\mathbf{n}, \qquad \mathbf{x} = (x_1, x_2)^T, \quad \mathbf{n} = (n_1, n_2)^T, \quad \mathbf{V} = [\mathbf{v}_1 | \mathbf{v}_2]$$
and the sampled signal is $s(\mathbf{n}) = s_c(\mathbf{V}\mathbf{n})$.
Notes:
1) The sampling matrix $\mathbf{V}$ for a given grid is not unique: $\hat{\mathbf{V}} = \mathbf{V}\mathbf{E}$, where $\mathbf{E}$ is an integer matrix with $|\det\mathbf{E}| = 1$, is also a sampling matrix for that grid.
2) The quantity $|\det\mathbf{V}|$ is unique and denotes the reciprocal of the sampling density.
For sampling on a lattice, the Fourier transform pair of the sampled signal is
$$S(\mathbf{f}) = \sum_{\mathbf{n}} s(\mathbf{n})\,e^{-j2\pi\mathbf{f}^T\mathbf{n}}, \qquad s(\mathbf{n}) = \int_{-1/2}^{1/2}\int_{-1/2}^{1/2} S(\mathbf{f})\,e^{\,j2\pi\mathbf{f}^T\mathbf{n}}\,d\mathbf{f}$$
where $\mathbf{f} = (f_1, f_2)^T$ and $\mathbf{F} = (F_1, F_2)^T$ denote the normalized and unnormalized frequency vectors. The integrations and summations in these relations are double integrations and summations.
Substituting $s(\mathbf{n}) = s_c(\mathbf{V}\mathbf{n})$, making the change of variables $\mathbf{f} = \mathbf{V}^T\mathbf{F}$ (with Jacobian $|\det\mathbf{V}|$), expressing the integration over the f-plane as a sum of integrations over the unit squares $(-\tfrac{1}{2}, \tfrac{1}{2}] \times (-\tfrac{1}{2}, \tfrac{1}{2}]$, and using $\exp\{-j2\pi\mathbf{k}^T\mathbf{n}\} = 1$, we conclude that
$$S(\mathbf{f}) = \frac{1}{|\det\mathbf{V}|}\sum_{\mathbf{k}} S_c\!\left(\mathbf{V}^{-T}(\mathbf{f} - \mathbf{k})\right)$$
or, equivalently, in unnormalized frequencies,
$$S_p(\mathbf{F}) = \frac{1}{|\det\mathbf{V}|}\sum_{\mathbf{k}} S_c(\mathbf{F} - \mathbf{U}\mathbf{k}), \qquad \mathbf{U}^T\mathbf{V} = \mathbf{I}$$
where $\mathbf{U}$ is the periodicity matrix and $\mathbf{I}$ is the identity matrix. The periodicity matrix can be expressed as $\mathbf{U} = [\mathbf{u}_1 | \mathbf{u}_2]$, where $\mathbf{u}_1$ and $\mathbf{u}_2$ are the periodicity vectors. Note that the above formulation is also valid for rectangular sampling, with the matrices $\mathbf{V}$ and $\mathbf{U}$ diagonal.
(Figure: (a) analog spectrum $S_c(F_1, F_2)$ with bandwidth B; (b) sampling lattice generated by $\mathbf{v}_1$, $\mathbf{v}_2$; (c) replicated spectrum centered on the reciprocal lattice generated by $\mathbf{u}_1$, $\mathbf{u}_2$.)
3-D Sampling on a Lattice
Let $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3$ be linearly independent vectors in the 3-D Euclidean space $R^3$. A lattice $\Lambda$ in $R^3$ is the set of all linear combinations of $\mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3$ with integer coefficients:
$$\Lambda = \{\, n_1\mathbf{v}_1 + n_2\mathbf{v}_2 + k\,\mathbf{v}_3 \;|\; n_1, n_2, k \in Z \,\}$$
With the sampling matrix $\mathbf{V} = [\mathbf{v}_1 | \mathbf{v}_2 | \mathbf{v}_3]$, the sampled video is
$$s(n_1, n_2, k) = s_c(\mathbf{V}\,[n_1\ n_2\ k]^T), \qquad (n_1, n_2, k) \in Z^3$$
Observe that $d(\Lambda) = |\det\mathbf{V}|$ denotes the reciprocal of the sampling density, and $\mathbf{V}$ is not unique.
Reciprocal lattice $\Lambda^*$: the lattice generated by the vectors $\mathbf{u}_1, \mathbf{u}_2, \mathbf{u}_3$ satisfying
$$\mathbf{u}_i^T\mathbf{v}_j = \delta_{ij}, \qquad i, j = 1, 2, 3$$
or, equivalently, $\mathbf{U}^T\mathbf{V} = \mathbf{I}$.
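A small sketch computing the periodicity matrix U = V^{-T} and the sampling density for a made-up 2-D vertical-temporal sampling matrix:

```python
# Reciprocal-lattice (periodicity) matrix and sampling density from V.
import numpy as np

dx2, dt = 1.0, 1.0 / 60.0                 # line spacing, field period (assumed)
V = np.array([[2 * dx2, dx2],             # columns are the lattice vectors
              [0.0,     dt ]])

U = np.linalg.inv(V).T                    # satisfies U^T V = I

print(np.abs(np.linalg.det(V)))           # reciprocal of the sampling density
print(U.T @ V)                            # identity, up to round-off
```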
Unit Cell (Voronoi cell): the set of points that are closer to the origin than to any other sample point.
(Figure: Voronoi cell of a 2-D lattice.)
'
s
Let (n ) = (V
S
] )
T
n1 n 2 k
(f ) =
n (n ) exp :; 2 f 4 (n )2Z3
X
k
8 <
)2
, then
k
39 = 5
f 2 R3
and
s
(n ) =
k
T
;1 2
1 2
n (f ) exp : 2 f 4
j
T
8 <
39 = 5 df
(n ) 2
k
&
where f = V F is the normalized frequency. The Fourier transform of a signal sampled on a lattice is periodic with the replications centered at the sites of the reciprocal lattice . Note that 1 1 1 1 1] (; 1 ] 2 P , where P f 2 (; 2 2 2 2 ] (; 2 2 ] implies that F = 1 2 denotes the unit cell of the reciprocal lattice .
F F Ft
T
74
The continuous spatio-temporal Fourier transform pair is
$$S_c(\mathbf{F}) = \int_{R^3} s_c(\mathbf{x}, t)\exp\left\{-j2\pi\,\mathbf{F}^T\begin{bmatrix}\mathbf{x}\\ t\end{bmatrix}\right\}d\mathbf{x}\,dt, \qquad \mathbf{F}\in R^3$$
$$s_c(\mathbf{x}, t) = \int_{R^3} S_c(\mathbf{F})\exp\left\{j2\pi\,\mathbf{F}^T\begin{bmatrix}\mathbf{x}\\ t\end{bmatrix}\right\}d\mathbf{F}, \qquad (\mathbf{x}, t)\in R^3$$
The Fourier transform of the sampled signal is equal to an infinite sum of copies of the analog spectrum, shifted according to the reciprocal lattice:
$$S_p(\mathbf{F}) = \frac{1}{d(\Lambda)}\sum_{\mathbf{k}\in Z^3} S_c(\mathbf{F} + \mathbf{U}\mathbf{k}), \qquad \mathbf{U}^T\mathbf{V} = \mathbf{I}$$
(Figures: (a) progressive and (b) interlaced spatio-temporal sampling lattices and their reciprocal lattices.)
The periodicity matrices follow from $\mathbf{U} = \mathbf{V}^{-T}$: diagonal (with entries $1/\Delta x_1$, $1/\Delta x_2$, $1/\Delta t$) for progressive sampling, and containing an additional vertical-temporal offset term for interlaced sampling.
Sublattices: Let $\Lambda$ and $\Gamma$ be lattices. $\Lambda$ is a sublattice of $\Gamma$ if every point in $\Lambda$ is also a point of $\Gamma$. Then $d(\Lambda)$ is an integer multiple of $d(\Gamma)$. The quotient $d(\Lambda)/d(\Gamma)$ is called the index of $\Lambda$ in $\Gamma$ and is denoted by $(\Lambda : \Gamma)$. If $\Lambda$ is a sublattice of $\Gamma$, then $\Gamma^*$ is a sublattice of $\Lambda^*$.
A coset of $\Lambda$ in $\Gamma$ is a shifted copy of $\Lambda$:
$$\mathbf{c} + \Lambda = \left\{\mathbf{c} + \begin{bmatrix}\mathbf{x}\\ t\end{bmatrix} \;\middle|\; \begin{bmatrix}\mathbf{x}\\ t\end{bmatrix}\in\Lambda\right\}, \qquad \mathbf{c}\in\Gamma$$
The most general form of the sampling structure that we will study is the union $\Psi$ of certain cosets of a sublattice $\Lambda$ in a lattice $\Gamma$:
$$\Psi = \bigcup_{i=1}^{P}(\mathbf{c}_i + \Lambda), \qquad \mathbf{c}_1, \ldots, \mathbf{c}_P\in\Gamma$$
Note that $\mathbf{c}_i - \mathbf{c}_j\notin\Lambda$ for $i\neq j$.
(Figure: a union-of-cosets sampling structure with lattice vectors $\mathbf{v}_1$, $\mathbf{v}_2$ and offset $\mathbf{c}$.)
The function
$$\Phi(\mathbf{k}) = \sum_{i=1}^{P}\exp\{j2\pi\,\mathbf{k}^T\mathbf{U}^T\mathbf{c}_i\}$$
is constant over cosets of $\Gamma^*$ in $\Lambda^*$, and may be zero for some of these cosets, so the corresponding shifted versions of the analog spectrum are not present.
(Figure: reciprocal lattice with some spectral replications suppressed.)
Band-limited reconstruction of the analog video requires ideal low-pass filtering:
$$H(F_1, F_2) = \begin{cases}\Delta x_1\,\Delta x_2, & |F_1|\le\frac{1}{2\Delta x_1}\ \text{and}\ |F_2|\le\frac{1}{2\Delta x_2}\\ 0, & \text{otherwise}\end{cases}$$
(Figure: passband of the reconstruction filter, $|F_1|\le 1/(2\Delta x_1)$, $|F_2|\le 1/(2\Delta x_2)$.)
The reconstructed signal is
$$s_r(x_1, x_2) = \int_{-\frac{1}{2\Delta x_1}}^{\frac{1}{2\Delta x_1}}\int_{-\frac{1}{2\Delta x_2}}^{\frac{1}{2\Delta x_2}}\Delta x_1\,\Delta x_2\,S(F_1\Delta x_1, F_2\Delta x_2)\,e^{\,j2\pi(F_1 x_1 + F_2 x_2)}\,dF_1\,dF_2$$
Substituting the discrete spectrum,
$$s_r(x_1, x_2) = \sum_{n_1}\sum_{n_2} s(n_1, n_2)\,\Delta x_1\,\Delta x_2\int_{-\frac{1}{2\Delta x_1}}^{\frac{1}{2\Delta x_1}}\int_{-\frac{1}{2\Delta x_2}}^{\frac{1}{2\Delta x_2}} e^{-j2\pi(F_1 n_1\Delta x_1 + F_2 n_2\Delta x_2)}\,e^{\,j2\pi(F_1 x_1 + F_2 x_2)}\,dF_1\,dF_2$$
which evaluates to the separable sinc interpolation
$$s_r(x_1, x_2) = \sum_{n_1}\sum_{n_2} s(n_1, n_2)\,\frac{\sin\frac{\pi}{\Delta x_1}(x_1 - n_1\Delta x_1)}{\frac{\pi}{\Delta x_1}(x_1 - n_1\Delta x_1)}\cdot\frac{\sin\frac{\pi}{\Delta x_2}(x_2 - n_2\Delta x_2)}{\frac{\pi}{\Delta x_2}(x_2 - n_2\Delta x_2)}$$
Exact reconstruction of a continuous signal from its samples on a lattice is possible via ideal low-pass filtering over a unit cell $\mathcal{P}$ of $\Lambda^*$, provided that the original continuous image spectrum was confined to this unit cell. The ideal low-pass filtering can be expressed as
$$S_r(\mathbf{F}) = \begin{cases}|\det\mathbf{V}|\;S(\mathbf{V}^T\mathbf{F}), & \mathbf{F}\in\mathcal{P}\\ 0, & \text{otherwise}\end{cases}$$
so that
$$s_r(\mathbf{x}, t) = \sum_{(\mathbf{n}, k)\in Z^3} s(\mathbf{n}, k)\;h\!\left(\begin{bmatrix}\mathbf{x}\\ t\end{bmatrix} - \mathbf{V}\begin{bmatrix}\mathbf{n}\\ k\end{bmatrix}\right)$$
where
$$h(\mathbf{x}) = |\det\mathbf{V}|\int_{\mathcal{P}}\exp\left\{j2\pi\,\mathbf{F}^T\begin{bmatrix}\mathbf{x}\\ t\end{bmatrix}\right\}d\mathbf{F}$$
Here $h(\mathbf{x})$ is the ideal interpolation function for the particular lattice geometry.
Applications
- Frame-rate conversion
- Deinterlacing (interlaced -> progressive)
- Interlacing
- NTSC-to-PAL transcoding, or vice versa
- Data compression (U, V subsampling)
Fundamentals of Decimation/Interpolation

(Block diagram: u(n) → Upsample 1:L → s(n) → lowpass filter → w(n) → Downsample M:1 → y(n); overall rate conversion by the rational factor L/M.)

Interpolation: upsampling by 1:L inserts L−1 zeros between the input samples,

    s(n) = u(n/L)  for n = 0, ±L, ±2L, ...
    s(n) = 0       otherwise.

(Figure: upsampling by L = 3.)
In the frequency domain,

    U(f) = Σ_{n=−∞}^{∞} u(n) e^{−j2π f n}

and

    S(f) = Σ_{n=−∞}^{∞} s(n) e^{−j2π f n} = Σ_{n=−∞}^{∞} u(n) e^{−j2π f L n} = U(fL)

(Figure: spectrum of the signal upsampled by L = 3; the baseband spectrum is compressed by L and spectral images appear at multiples of 1/L.)
(Figure: interpolation by L = 3; the interpolation filter with cutoff 1/(2L) removes the spectral images between −1/2 and 1/2.)

The impulse response of the ideal interpolation filter is a sinc function. Because of its zero-crossings, it will not alter the existing signal samples, while assigning values to the zero samples in the upsampled signal.
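The zero-insertion step and the spectrum relation S(f) = U(fL) can be verified numerically; a minimal sketch (the helper names are ours, not from the text):

```python
import numpy as np

def upsample(u, L):
    """Insert L-1 zeros between samples: s(n) = u(n/L) for n = 0, +-L, ..."""
    s = np.zeros(len(u) * L)
    s[::L] = u
    return s

def dtft(x, f):
    """Discrete-time Fourier transform X(f) = sum_n x(n) e^{-j 2 pi f n}."""
    n = np.arange(len(x))
    return np.sum(x * np.exp(-2j * np.pi * f * n))

u = np.array([1.0, 2.0, 3.0, 4.0])
s = upsample(u, 3)

# Zero insertion compresses the frequency axis: S(f) = U(fL).
f = 0.1
assert np.isclose(dtft(s, f), dtft(u, 3 * f))
```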
Linear interpolation

(Figure: linear interpolation for L = 3 as convolution of the upsampled signal u(k) with a triangular kernel h(n − k) taking the values 1/3, 2/3, 1, 2/3, 1/3.)
Decimation

The band-limited signal w(n) is downsampled by retaining every Mth sample,

    y(n) = w(Mn)

(Figure: (a)-(c) the input, filtered, and decimated signals for decimation by M = 2.)
In the frequency domain,

    Y(f) = Σ_{n=−∞}^{∞} w(Mn) e^{−j2π f n} = (1/M) Σ_{k=0}^{M−1} W( (f − k)/M )

(Figure: spectrum after decimation by M = 2, over the period [−1/2, 1/2].)
Decimation Filters

(Figure: spectra S(f), W(f), and Y(f) over [−1/2, 1/2]; the decimation filter band-limits the signal to |f| < 1/(2M) before downsampling.)

Box filters are generally used instead of ideal lowpass filters for simplicity.
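The aliasing relation Y(f) = (1/M) Σ_{k} W((f − k)/M) can likewise be checked on a short random signal; a sketch with illustrative helper names:

```python
import numpy as np

def dtft(x, f):
    """X(f) = sum_n x(n) e^{-j 2 pi f n}."""
    n = np.arange(len(x))
    return np.sum(x * np.exp(-2j * np.pi * f * n))

def decimate(w, M):
    """Keep every Mth sample: y(n) = w(Mn)."""
    return w[::M]

rng = np.random.default_rng(0)
w = rng.standard_normal(8)
M = 2
y = decimate(w, M)

# Aliasing relation: Y(f) = (1/M) sum_{k=0}^{M-1} W((f - k)/M)
f = 0.13
lhs = dtft(y, f)
rhs = sum(dtft(w, (f - k) / M) for k in range(M)) / M
assert np.isclose(lhs, rhs)
```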
A single lowpass filter with cutoff frequency f_c = min{ 1/(2M), 1/(2L) } is sufficient. When L > M, the requirement to preserve the values of the existing samples must be incorporated into the filter design.
Practical Method

(Figure: 3:4 sampling-rate conversion, and 525-line to 625-line conversion.)
Sum of lattices:

    Λ1 + Λ2 = { x1 + x2 : x1 ∈ Λ1 and x2 ∈ Λ2 }

Intersection of lattices:

    Λ1 ∩ Λ2 = { x : x ∈ Λ1 and x ∈ Λ2 }

The intersection Λ1 ∩ Λ2 is the largest lattice which is a sublattice of both Λ1 and Λ2, while the sum Λ1 + Λ2 is the smallest lattice which contains both Λ1 and Λ2 as sublattices.
Conversion from the input lattice Λ1 to the output lattice Λ2 consists of upconversion (U), filtering, and downconversion (D).

Upconvert U: define

    w(x, t) = s(x, t)  for (x, t) ∈ Λ1
    w(x, t) = 0        for (x, t) ∈ (Λ1 + Λ2), (x, t) ∉ Λ1

Downconvert D:

    p(x, t) = D p_w(x, t) = p_w(x, t)  for (x, t) ∈ Λ2

Condition for the shift invariance of the filter: if the input is shifted by q, the output should also be shifted by q. We need q ∈ Λ1 ∩ Λ2. Thus, we assume that V1^{−1} V2 is a matrix of integers, i.e., Λ1 ∩ Λ2 is a lattice.
The Filter

The filtered signal on the sum lattice is

    p_w(x, t) = Σ_{(q,r) ∈ Λ1 + Λ2} w(q, r) h( [x; t] − [q; r] ),  (x, t) ∈ Λ1 + Λ2

Since w vanishes off Λ1, this reduces to

    p_w(x, t) = Σ_{(q,r) ∈ Λ1} s(q, r) h( [x; t] − [q; r] ),  (x, t) ∈ Λ1 + Λ2

and the output, after downconversion, is

    p(x, t) = Σ_{(q,r) ∈ Λ1} s(q, r) h( [x; t] − [q; r] ),  (x, t) ∈ Λ2

One period of the filter frequency response is given by the unit cell of (Λ1 + Λ2)*. In order to avoid aliasing, the passband of the lowpass filter is restricted to the smaller of the Voronoi cells of Λ1* and Λ2*.
(Figure: example input and output lattices with their sampling matrices V1 and V2, the sum and intersection lattices, and the reciprocal lattices Λ1* and Λ2* with the filter passband on the (F1, F2) plane.)

One period of the filter frequency response is given by the unit cell of (Λ1 + Λ2)*. In order to avoid aliasing, the passband of the lowpass filter is restricted to the Voronoi cell of Λ2*.
Example: Deinterlacing

(Figure: (a) interlaced input lattice, (b) progressive output lattice on the (x2, t) plane.)

The sampling matrices for the input and output grids are

    Vin  = [ Δx1    0     0
             0     2Δx2  Δx2
             0      0    Δt  ]

and

    Vout = [ Δx1    0     0
             0     Δx2    0
             0      0    Δt  ]
1. Projected Motion vs. Optical Flow
2. Occlusion and Aperture Problems
3. Optical Flow Equation
4. 2-D Motion Field Models, Nonparametric vs. Parametric
5. Lucas-Kanade Method
6. Smoothness Constraint, Horn-Schunck Method
7. Adaptive Methods
2-D Motion Estimation
  Correspondence estimation
  Optical flow estimation
  - Motion-compensated image filtering.
  - Motion-compensated image compression.
3-D Motion and Structure Estimation
  Based on point correspondences
  Optical flow-based or direct methods
  From stereo video
  - Virtual Reality, Synthetic-Natural Hybrid Imaging
  - Passive Navigation: A camera moves with respect to a fixed environment. Determine the 3-D structure of the environment and the motion parameters of the camera.
Two-D Motion

(Figure: perspective projection; a 3-D point P projects through the center of projection O to the image-plane point p, with world axes X1, X2, X3 and image coordinates x1, x2.)

There is 3-D motion between the objects in the scene and the camera.

(Figure: as P moves in 3-D over time, its projection p traces a 2-D trajectory in the image plane.)
The 2-D displacement field is a vector field consisting of the x1 and x2 components of the frame-to-frame "projected" displacement vectors d = [d1 d2]^T at each pixel.

(Figure: the projected position P = (x1, x2) of a scene point at times t − Δt, t, and t + Δt, with the displacement vector d between frames.)

The 2-D velocity field is a vector field consisting of the x1 and x2 components of the instantaneous velocity vectors at each pixel.
There must be sufficient gray level variations within the moving objects.
Optical flow estimation: determination of the apparent velocity v(x1, x2, t) of pixels from a pair of time-sequential 2-D images. The flow vectors may vary with the coordinates (space-varying flow) due to 3-D rotation, zoom, etc.

Correspondence Problem: finding the apparent displacement vectors d(x1, x2, t) between a pair of frames t and t' = t + Δt. Dense or feature correspondence estimation. (May also appear in the context of stereo disparity estimation.)

Global shift estimation: given two frames that are globally shifted with respect to each other, estimate the shift. There is one displacement vector for a pair of frames.
Theoretically, we can determine only the component of motion that is in the direction of the spatial image gradient (i.e., orthogonal to the edges), called the normal flow, at any pixel (the aperture problem).
Occlusion refers to the covering/uncovering of a surface due to the motion of an object.

E.g. 1: when an object translates:

(Figure, frames k and k+1: "background to be covered" — no region in the next frame matches this region; "uncovered background" — no motion vector points into this region.)

E.g. 2: when an object rotates about an axis parallel to the imaging plane.
Basic Idea: We can only observe and determine displacement that is orthogonal to the edges (in the direction of the intensity gradient).
If the intensity s_c(x1, x2, t) remains constant along the motion trajectory,

    d s_c(x1, x2, t) / dt = 0

where x1 and x2 vary with t according to the motion trajectory. Using the chain rule of differentiation,

    (∂s_c/∂x1)(x, t) v1(x) + (∂s_c/∂x2)(x, t) v2(x) + (∂s_c/∂t)(x, t) = 0

This is known as the optical flow equation or the optical flow constraint. It can alternatively be expressed as

    < ∇s_c(x, t), v(x) > + (∂s_c/∂t)(x, t) = 0

where ∇s_c(x, t) = [ (∂s_c/∂x1)(x, t)  (∂s_c/∂x2)(x, t) ]^T and < , > denotes the dot product.
Normal Flow

Is the OFE sufficient to uniquely specify the motion field? The OFE yields one scalar equation in two unknowns at each pixel.

(Figure: the loci of v satisfying the optical flow equation form a line in the (v1, v2) plane.)

The OFE determines, at each pixel, only the component of the flow vector that is in the direction of the spatial image intensity gradient, ∇s_c(x, t)/||∇s_c(x, t)||, because the component that is orthogonal to the spatial image gradient disappears under the dot product. This normal component is

    v⊥(x, t) = − [ ∂s_c(x, t)/∂t ] / ||∇s_c(x, t)||
Motion Models

Because of the ill-posed nature of the problem, motion estimation algorithms use additional assumptions (models) about the structure of the 2-D motion field.

Non-parametric models: some sort of smoothness or uniformity constraint on the 2-D motion field.

Quasi-parametric models: in 3-D rigid motion, six egomotion parameters constrain the local flow vector to lie along a specific line, while the local depth value is required to determine its exact value.

Parametric models: 3-D rigid motion of the image of a planar surface under orthographic projection can be described by a 6-parameter affine model, while under perspective projection it can be described by an 8-parameter nonlinear model. There exist more complicated models for quadratic surfaces.
Methods Based on the OFE: Constant intensity along the motion trajectory yields an equation in terms of spatio-temporal intensity gradients. Used in conjunction with appropriate spatio-temporal smoothness constraints.

Phase-Correlation Method: The linear term of the Fourier phase difference between the consecutive frames determines the motion estimates.

Block Matching Method: Matching fixed-size blocks between two frames based on a distance criterion. Extension to feature matching (e.g., edges, corners).

Pel-Recursive Methods: Gradient-based minimization of the displaced frame difference. Implicit use of a smoothness constraint. Extension to Wiener-type motion estimation.

Bayesian Methods: Probabilistic smoothness constraint in the form of Gibbs random fields.
Second-order methods additionally assume that the spatial image gradient remains constant along the motion trajectory,

    d ∇s_c(x, t) / dt = 0

which yields two more equations per pixel,

    [ ∂²s_c/∂x1²     ∂²s_c/∂x1∂x2 ] [ v1 ]      [ ∂²s_c(x,t)/∂t∂x1 ]
    [ ∂²s_c/∂x1∂x2   ∂²s_c/∂x2²   ] [ v2 ]  = − [ ∂²s_c(x,t)/∂t∂x2 ]
Lucas-Kanade Method

Assume that the flow is constant over a block B,

    v(x, t) = v(t) = [ v1(t)  v2(t) ]^T  for x ∈ B

and minimize the total squared error in the OFE over the block,

    E(v(t)) = Σ_{x ∈ B} [ (∂s_c/∂x1) v1(t) + (∂s_c/∂x2) v2(t) + ∂s_c/∂t ]²

Minimization of E with respect to v1(t) and v2(t) yields

    [ v̂1(t) ]   [ Σ_B (∂s_c/∂x1)(∂s_c/∂x1)   Σ_B (∂s_c/∂x1)(∂s_c/∂x2) ]^{-1} [ −Σ_B (∂s_c/∂x1)(∂s_c/∂t) ]
    [ v̂2(t) ] = [ Σ_B (∂s_c/∂x1)(∂s_c/∂x2)   Σ_B (∂s_c/∂x2)(∂s_c/∂x2) ]      [ −Σ_B (∂s_c/∂x2)(∂s_c/∂t) ]

where all partials are evaluated at (x, t) and the sums are over x ∈ B.
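The 2 × 2 normal equations above are easy to exercise on synthetic gradients; the sketch below (the function name is ours) recovers a known velocity when the OFE holds exactly over the block:

```python
import numpy as np

def lucas_kanade_block(Ix1, Ix2, It):
    """Solve the 2x2 normal equations of the OFE over a block.

    Ix1, Ix2, It are arrays of the spatial and temporal intensity
    partials sampled over the block B.
    """
    A = np.array([[np.sum(Ix1 * Ix1), np.sum(Ix1 * Ix2)],
                  [np.sum(Ix1 * Ix2), np.sum(Ix2 * Ix2)]])
    b = -np.array([np.sum(Ix1 * It), np.sum(Ix2 * It)])
    return np.linalg.solve(A, b)  # [v1, v2]

# Synthetic check: gradients generated so that the OFE holds exactly
# for the true velocity v = (0.5, -0.25).
rng = np.random.default_rng(1)
Ix1 = rng.standard_normal(25)
Ix2 = rng.standard_normal(25)
v_true = np.array([0.5, -0.25])
It = -(Ix1 * v_true[0] + Ix2 * v_true[1])  # OFE: Ix1 v1 + Ix2 v2 + It = 0
v = lucas_kanade_block(Ix1, Ix2, It)
assert np.allclose(v, v_true)
```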
Horn-Schunck Method

Minimize a weighted sum of the error in the OFE and a measure of departure from smoothness in the motion field,

    min_v E = ∫ ( E_of²(v) + α² E_s²(v) ) dx

to estimate the velocity vector at each pixel, where the integral is over the image support,

    E_of(v(x)) = < ∇s_c(x, t), v(x) > + ∂s_c(x, t)/∂t

and

    E_s²(v(x)) = ||∇v1(x)||² + ||∇v2(x)||² = (∂v1/∂x1)² + (∂v1/∂x2)² + (∂v2/∂x1)² + (∂v2/∂x2)²

The parameter α² (chosen heuristically) is a weight that controls the strength of the smoothness constraint. Larger values of α² increase the strength of the constraint, whereas smaller values relax the constraint.
The minimization of the functional E, using the calculus of variations, and approximation of the Laplacian of the velocity components by linear highpass filters, yields the following iterations:

    v1^{(n+1)}(x, t) = v̄1^{(n)}(x, t) − (∂s_c/∂x1) [ (∂s_c/∂x1) v̄1^{(n)} + (∂s_c/∂x2) v̄2^{(n)} + ∂s_c/∂t ] / [ α² + (∂s_c/∂x1)² + (∂s_c/∂x2)² ]

    v2^{(n+1)}(x, t) = v̄2^{(n)}(x, t) − (∂s_c/∂x2) [ (∂s_c/∂x1) v̄1^{(n)} + (∂s_c/∂x2) v̄2^{(n)} + ∂s_c/∂t ] / [ α² + (∂s_c/∂x1)² + (∂s_c/∂x2)² ]

where v̄ denotes a local average of the velocity field and all partials are evaluated at the point (x, t). The initial estimates of the velocities v1^{(0)}(x, t) and v2^{(0)}(x, t) can be obtained by the block matching technique. In the digital implementation of the algorithm, the derivatives are numerically estimated.
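One iteration of this scheme can be sketched as follows; the 4-neighbor averaging kernel stands in for the Laplacian approximation, and all names are illustrative:

```python
import numpy as np

def horn_schunck_step(v1, v2, Ix1, Ix2, It, alpha2):
    """One update of the velocity fields.

    v1, v2 are the current velocity component fields; Ix1, Ix2, It are
    the intensity partials; alpha2 is the smoothness weight.
    """
    # local average of each velocity component (simple 4-neighbor kernel)
    kernel = np.array([[0.0, 0.25, 0.0],
                       [0.25, 0.0, 0.25],
                       [0.0, 0.25, 0.0]])
    def avg(v):
        padded = np.pad(v, 1, mode="edge")
        out = np.zeros_like(v)
        for i in range(v.shape[0]):
            for j in range(v.shape[1]):
                out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
        return out
    v1b, v2b = avg(v1), avg(v2)
    common = (Ix1 * v1b + Ix2 * v2b + It) / (alpha2 + Ix1**2 + Ix2**2)
    return v1b - Ix1 * common, v2b - Ix2 * common
```

Applied repeatedly from a zero (or block-matching) initialization, the fields relax toward a solution that balances the OFE against smoothness.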
The partials ∂s_c/∂x1, ∂s_c/∂x2, and ∂s_c/∂t can be estimated by a forward difference, a backward difference, an average difference, or a local average of the average differences. Horn and Schunck proposed averaging four finite differences:

    ∂s_c/∂x1 = (1/4) { s_c(x1+1, x2, t) − s_c(x1, x2, t) + s_c(x1+1, x2+1, t) − s_c(x1, x2+1, t)
                       + s_c(x1+1, x2, t+1) − s_c(x1, x2, t+1) + s_c(x1+1, x2+1, t+1) − s_c(x1, x2+1, t+1) }

    ∂s_c/∂x2 = (1/4) { s_c(x1, x2+1, t) − s_c(x1, x2, t) + s_c(x1+1, x2+1, t) − s_c(x1+1, x2, t)
                       + s_c(x1, x2+1, t+1) − s_c(x1, x2, t+1) + s_c(x1+1, x2+1, t+1) − s_c(x1+1, x2, t+1) }

    ∂s_c/∂t  = (1/4) { s_c(x1, x2, t+1) − s_c(x1, x2, t) + s_c(x1+1, x2, t+1) − s_c(x1+1, x2, t)
                       + s_c(x1, x2+1, t+1) − s_c(x1, x2+1, t) + s_c(x1+1, x2+1, t+1) − s_c(x1+1, x2+1, t) }
Approximate s_c(x1, x2, t) by a linear combination of N basis polynomials in x1, x2, and t,

    ŝ_c(x1, x2, t) = Σ_{i=0}^{N−1} a_i φ_i(x1, x2, t)

where N is the number of the basis polynomials, a_i are the coefficients of the linear superposition, and φ_i(x1, x2, t) are the basis polynomials. Set N = 9, with the following basis functions:

    1, x1, x2, t, x1², x2², x1 x2, x1 t, x2 t

Then,

    ŝ_c(x1, x2, t) = a0 + a1 x1 + a2 x2 + a3 t + a4 x1² + a5 x2² + a6 x1 x2 + a7 x1 t + a8 x2 t.
The coefficients a_i, i = 0, ..., 8, are estimated by using the least squares method, which minimizes the error function

    e² = Σ_{n1} Σ_{n2} Σ_{n3} ( s_c(x1, x2, t) − Σ_{i=0}^{N−1} a_i φ_i(x1, x2, t) )² | x1 = n1 Δx, x2 = n2 Δx, t = n3 Δt

with respect to these coefficients. The summation is over a local neighborhood of the pixel. A typical case involves 50 pixels: 5x5 spatial windows in two consecutive frames. Once the coefficients a_i are estimated, image gradients can be found by simple differentiation,

    ∂ŝ_c(x1, x2, t)/∂x1 = a1 + 2 a4 x1 + a6 x2 + a7 t,  which at x1 = x2 = t = 0 equals a1
    ∂ŝ_c(x1, x2, t)/∂x2 = a2 + 2 a5 x2 + a6 x1 + a8 t,  which at x1 = x2 = t = 0 equals a2
    ∂ŝ_c(x1, x2, t)/∂t  = a3 + a7 x1 + a8 x2,           which at x1 = x2 = t = 0 equals a3

Estimating the coefficients a1, a2, and a3 of the first three non-constant basis polynomials is sufficient to estimate the gradients.
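The least-squares fit and the gradient read-off can be sketched as below; the window layout, the synthetic signal, and the function name are our own choices:

```python
import numpy as np

def gradient_by_polyfit(window):
    """Fit s_c over a local (x1, x2, t) window with the 9 basis
    polynomials 1, x1, x2, t, x1^2, x2^2, x1*x2, x1*t, x2*t and read
    the gradients at the window center from a1, a2, a3.

    `window` maps (x1, x2, t) offsets (centered at 0) to intensities.
    """
    coords = np.array(list(window.keys()), dtype=float)
    vals = np.array(list(window.values()), dtype=float)
    x1, x2, t = coords[:, 0], coords[:, 1], coords[:, 2]
    Phi = np.stack([np.ones_like(x1), x1, x2, t,
                    x1**2, x2**2, x1 * x2, x1 * t, x2 * t], axis=1)
    a, *_ = np.linalg.lstsq(Phi, vals, rcond=None)
    return a[1], a[2], a[3]  # ds/dx1, ds/dx2, ds/dt at the center

# 5x5 windows in two consecutive frames (50 samples), sampled from a
# signal whose true gradients at the origin are (2, -1, 3).
window = {}
for x1 in range(-2, 3):
    for x2 in range(-2, 3):
        for t in (0, 1):
            window[(x1, x2, t)] = 2*x1 - x2 + 3*t + 0.1*x1**2 + 0.05*x1*x2
assert np.allclose(gradient_by_polyfit(window), (2.0, -1.0, 3.0))
```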
Adaptive Methods

The Horn-Schunck algorithm imposes the optical flow and smoothness constraints globally on the entire image (or over the motion estimation window).

(Figure, frames k and k+1: "background to be covered" — no region in the next frame matches this region; "uncovered background" — no motion vector points into this region.)

The smoothness constraint does not hold in the direction perpendicular to an occlusion boundary. Several researchers proposed to impose the smoothness constraint along the boundaries but not perpendicular to the occlusion boundaries. These methods require the detection of moving object (occlusion) boundaries.
The block-translation model assumes that each block of pixels moves with a single displacement (d1, d2),

    s(n1, n2, k+1) = s(n1 + d1, n2 + d2, k)

1) To overcome the aperture problem, there must be sufficient gray level variation within the block.
2) This model is used in many practical applications, including world standards for video compression such as H.261 and MPEG, motion-compensated filtering in standards conversion, etc.
The cross-correlation between the frames k and k+1 is given by

    c_{k,k+1}(n1, n2) = s(n1, n2, k+1) ** s(−n1, −n2, k)

where ** denotes 2-D convolution. Taking the Fourier transform of both sides,

    C_{k,k+1}(f1, f2) = S_{k+1}(f1, f2) S_k*(f1, f2)

Normalizing C_{k,k+1}(f1, f2) by its magnitude,

    C̃_{k,k+1}(f1, f2) = S_{k+1}(f1, f2) S_k*(f1, f2) / | S_{k+1}(f1, f2) S_k*(f1, f2) |

Given the motion model

    S_{k+1}(f1, f2) = S_k(f1, f2) e^{−j2π (f1 d1 + f2 d2)}

we have

    C̃_{k,k+1}(f1, f2) = e^{−j2π (f1 d1 + f2 d2)}

so that the inverse transform c̃_{k,k+1}(n1, n2) is an impulse whose location gives the displacement (d1, d2).
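The derivation maps directly onto FFTs; a sketch assuming cyclic shifts (the boundary caveat below applies to real frames), with function names of our own:

```python
import numpy as np

def phase_correlation(frame_k, frame_k1):
    """Locate the peak of the inverse of the normalized cross-power
    spectrum; its position gives the (cyclic) displacement."""
    Sk = np.fft.fft2(frame_k)
    Sk1 = np.fft.fft2(frame_k1)
    C = Sk1 * np.conj(Sk)
    C /= np.maximum(np.abs(C), 1e-12)  # keep only the phase
    surface = np.real(np.fft.ifft2(C))
    peak = np.unravel_index(np.argmax(surface), surface.shape)
    # map peaks in the upper half of the DFT range to negative shifts
    d = [p - N if p > N // 2 else p for p, N in zip(peak, surface.shape)]
    return tuple(d)

rng = np.random.default_rng(2)
frame_k = rng.standard_normal((64, 64))
frame_k1 = np.roll(frame_k, shift=(5, -3), axis=(0, 1))  # cyclic shift
print(phase_correlation(frame_k, frame_k1))  # -> (5, -3)
```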
Implementation Issues

Range of Displacement Estimates/Block Size: Since the DFT is periodic with the block size (N1, N2),

    d̂_i = d_i        if |d_i| ≤ N_i/2 (N_i even) or |d_i| ≤ (N_i − 1)/2 (N_i odd)
    d̂_i = d_i − N_i  otherwise.

The range of estimates is [−N_i/2 + 1, N_i/2] for N_i even. For example, to estimate displacements within a range [−31, 32], the block size should be at least 64 × 64.

Boundary Effects: To obtain a perfect impulse with the DFT, the shift must be cyclic. Since things disappearing at one end generally do not reappear at the other end, the impulses degenerate into peaks.
The displacement at the center of an N1 × N2 block in frame k is determined by searching for the location of the best matching block of the same size in the frame k+1. The search is limited to a search window.

(Figure: a block in frame k and its search window in frame k+1.)

Block matching algorithms differ in
- Matching criteria (maximum cross-correlation, minimum error)
- Search strategy
- Determination of block size (hierarchical, adaptive)
Matching Criteria

    MSE(d1, d2) = (1/(N1 N2)) Σ_{(n1,n2) ∈ B} [ s(n1 + d1, n2 + d2, k+1) − s(n1, n2, k) ]²

    MAD(d1, d2) = (1/(N1 N2)) Σ_{(n1,n2) ∈ B} | s(n1 + d1, n2 + d2, k+1) − s(n1, n2, k) |

The displacement estimate is [d̂1 d̂2]^T, the (d1, d2) which minimizes the MSE or MAD criterion.
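A full-search MAD matcher following these criteria can be sketched as below; the names and the boundary-handling choice are ours:

```python
import numpy as np

def full_search(frame_k, frame_k1, top, left, N, M):
    """MAD full search: best displacement in [-M, M]^2 for the N x N
    block of frame_k whose top-left corner is (top, left)."""
    block = frame_k[top:top + N, left:left + N]
    best, best_mad = (0, 0), np.inf
    for d1 in range(-M, M + 1):
        for d2 in range(-M, M + 1):
            r, c = top + d1, left + d2
            if (r < 0 or c < 0 or r + N > frame_k1.shape[0]
                    or c + N > frame_k1.shape[1]):
                continue  # skip candidates outside the frame
            cand = frame_k1[r:r + N, c:c + N]
            mad = np.mean(np.abs(cand - block))
            if mad < best_mad:
                best_mad, best = mad, (d1, d2)
    return best

rng = np.random.default_rng(3)
frame_k = rng.standard_normal((32, 32))
frame_k1 = np.roll(frame_k, shift=(2, -4), axis=(0, 1))
print(full_search(frame_k, frame_k1, top=12, left=12, N=8, M=7))  # -> (2, -4)
```

The nested loop makes the (2M+1)² cost of full search explicit, which motivates the fast search strategies on the following slides.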
Search Procedures

Usually the search area is limited to

    −M1 ≤ d1 ≤ M1  and  −M2 ≤ d2 ≤ M2

Full Search: calls for the evaluation of the matching criterion at (2M1 + 1)(2M2 + 1) distinct points for each block.

Three-Step Search

Cross-Search

(Figure: three-step search, illustrated for M1 = M2 = 7.) The number of steps depends on the maximum displacement vector allowed and the accuracy of estimation; e.g., a range of ±32 pixels with 0.5 pixel accuracy would require 6 steps (16, 8, 4, 2, 1, 0.5 pixels).
Cross-Search

(Figure: cross-search pattern; the numbers mark the step at which each point is evaluated.)

The distance between the search points is reduced if the best match is at the center of the cross or at the boundary of the search window.
Minimizing the MSE or MAD criteria can be viewed as imposing the optical flow constraint on the entire block. It is assumed that all pixels belonging to a block have a single translation vector, which is a special case of the local smoothness constraint (the same as in the Lucas-Kanade method).

Block size selection: There are conflicting requirements on the size of the blocks.
- The block size should be sufficiently large; otherwise, a match may be established between blocks containing similar gray-level patterns which are unrelated in the motion sense.
- The block size should be sufficiently small; if the motion vector varies within a block, block matching cannot provide accurate estimates.
(Figure: hierarchical block matching; the image is represented at multiple resolution levels, and estimates at the coarsest level (Level 1) are refined at successively finer levels.)
Hierarchical BM - An Example

The center of the search area in the second level (denoted by "0") denotes the estimate from the first level.

(Figure: search points visited in the first and second levels.)

The estimates in the 1st and 2nd levels are [7, 1]^T and [3, 1]^T, respectively, resulting in an estimate of [10, 2]^T.
(Figure: frames k and k+1 related by a purely translational block displacement.)

Limitations of translational block matching:
- It cannot handle rotation or zooming; accuracy is essential in motion-compensated filtering.
- The motion field is discontinuous at block boundaries, causing blocking artifacts in motion-compensated compression.
Spatial Transformations

Consider block-based image warping by
- Affine motion model (6-parameter).
- Perspective or bilinear motion model (8-parameter).

(Figure: a square block warped by affine, perspective, and bilinear transformations.)
Generalized block matching:
- Search method (Seferidis and Ghanbari)
- Algebraic method (extension of the Lucas-Kanade method)

2-D mesh modeling (motion continuity across block boundaries):
- Hexagonal search (Nakaya et al.)
- Constrained linear estimation (Altunbasak and Tekalp)
(Figure: generalized block matching between frame k−1 (reference frame) and frame k (current frame); the corners of each patch are searched over all combinations of their coordinates to minimize the SAD.)
Algebraic Method

Extension of the Lucas-Kanade method to parametric motion models. Affine Motion Model:

    d1(x) = a1 + a2 x1 + a3 x2
    d2(x) = a4 + a5 x1 + a6 x2,   (x1, x2) ∈ B

Minimize the OFE error over the block,

    E = Σ_{x ∈ B} ( I_{x1} d1(x) + I_{x2} d2(x) + I_t )²

where I_{x1}, I_{x2}, and I_t denote the spatial and temporal intensity partials. Differentiate E with respect to a1, ..., a6 and set the results equal to zero to obtain six linear equations in six unknowns,

    [ Σ φ(x) φ(x)^T ] [ â1 ... â6 ]^T = − Σ I_t φ(x),   φ(x) = [ I_{x1}, x1 I_{x1}, x2 I_{x1}, I_{x2}, x1 I_{x2}, x2 I_{x2} ]^T

whose coefficients are sums, over the block, of products such as Σ I_{x1}², Σ x1 I_{x1} I_{x2}, Σ x2² I_{x2}², and Σ I_{x1} I_t.
Affine motion with triangular patches:
- Hexagonal matching (search).
- Constrained linear estimation: all constraints are linear.

(Figure: deformation of a triangular mesh between frame k−1 and frame k.)

Hexagonal Matching

There are six lines intersecting at each node in the case of a uniform triangular mesh. The boundaries of these six triangles define a hexagon. Perturb each node point to yield the smallest SAD within its hexagon.
Let

    d(x) = [ d1(x)  d2(x) ]^T

denote the displacement field at x = [x1 x2]^T between frames t and t + Δt. The DFD function between these two frames is defined as

    dfd(x, d̂) = s_c(x + d̂(x), t + Δt) − s_c(x, t)

where s_c(·) denotes the spatio-temporal image intensity distribution. If the components of d̂ take noninteger values, interpolation is required to compute dfd at each pixel location. If d̂ is equal to the true displacement vector and there are no interpolation errors, dfd attains the value of zero at that pixel location.
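Computing the dfd at non-integer displacements with bilinear interpolation can be sketched as follows; the helper names and the ramp-image example are ours:

```python
import numpy as np

def bilinear(img, x1, x2):
    """Bilinearly interpolate img at non-integer coordinates (x1, x2)."""
    i, j = int(np.floor(x1)), int(np.floor(x2))
    a, b = x1 - i, x2 - j
    return ((1 - a) * (1 - b) * img[i, j] + a * (1 - b) * img[i + 1, j]
            + (1 - a) * b * img[i, j + 1] + a * b * img[i + 1, j + 1])

def dfd(frame_t, frame_t1, x, d):
    """dfd(x, d) = s_c(x + d, t + dt) - s_c(x, t); interpolation is
    needed when d has noninteger components."""
    return bilinear(frame_t1, x[0] + d[0], x[1] + d[1]) - frame_t[x[0], x[1]]

# On a linear ramp, a half-pel displacement is reproduced exactly by
# bilinear interpolation, so the dfd of the true displacement is zero.
n1, n2 = np.mgrid[0:16, 0:16]
frame_t = 3.0 * n1 + 2.0 * n2
true_d = (0.5, 1.5)
frame_t1 = 3.0 * (n1 - true_d[0]) + 2.0 * (n2 - true_d[1])  # shifted ramp
assert abs(dfd(frame_t, frame_t1, (7, 7), true_d)) < 1e-9
```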
Expanding s_c(x1 + d1(x), x2 + d2(x), t + Δt) into a Taylor series about (x, t) for small Δt, neglecting the higher-order terms (h.o.t.), and setting dfd(x, d̂) = 0, we obtain

    (∂s_c/∂x1) d1(x) + (∂s_c/∂x2) d2(x) + (∂s_c/∂t) Δt = 0

Dividing by Δt and letting Δt → 0 recovers the optical flow equation.
Comments

In the case of constant-velocity motion, where d1(x) = v1(x) Δt and d2(x) = v2(x) Δt, the optical flow equation is satisfied when the displaced frame difference function attains the value of zero.

In practice, neither the dfd nor the error in the OFE is exactly zero, because there is observation noise, scene illumination may vary with time, there are occlusion regions, and there are interpolation errors.

Therefore, one aims to minimize the absolute value or the square of the dfd, or of the LHS of the OFE, to obtain an estimate of the frame-to-frame motion field.
PEL-RECURSIVE ALGORITHMS

Pel-recursive algorithms are of the general form

    d^{i+1}(x) = d^i(x) + u^i(x)

where d^i(x) is the estimated motion vector at the pel location x in the ith step, u^i(x) is the update term in the ith step, and d^{i+1}(x) is the new estimate.

The update term u^i(x) is estimated, at each pel x, to minimize a positive-definite function E of the dfd with respect to d. The iterations may be executed at a single pel (pixel) position, at consecutive pel positions, or a combination of both.

The motion estimate at the previous pel is taken as the initial estimate at the next pel; hence, pel-recursive.
Setting the gradient of the criterion to zero,

    ∇_d E(x, d) = 0,  i.e.,  ∂E(x, d)/∂d1 = 0  and  ∂E(x, d)/∂d2 = 0

Since an analytical solution to these equations cannot be found in general, we resort to iterative methods.
The gradient vector points AWAY from the minimum. That is, in one dimension, its sign will be positive on an "uphill" slope. Thus, to get closer to the minimum, we can update our current vector as

    d^{(k+1)}(x) = d^{(k)}(x) − ε ∇_d E(x, d) |_{d^{(k)}(x)}

(Figure: if ε is too large, the iterate overshoots the minimum.)

If ε is too small, the iteration will take too long to converge; if it is too large, the algorithm will become unstable and start oscillating about the minimum.
Newton-Raphson Method

We can estimate a good value for ε using the well-known Newton-Raphson method for root finding,

    d^{(k+1)}(x) = d^{(k)}(x) − H^{−1} ∇_d E(x, d) |_{d^{(k)}(x)}

where H is the Hessian matrix,

    H_{ij} = ∂²E(x, d) / ∂d_i ∂d_j

In one dimension, we would like to find a root of E′(d). Expanding E′(d) in a Taylor series about the point d^{(k)},

    E′(d^{(k+1)}) = E′(d^{(k)}) + (d^{(k+1)} − d^{(k)}) E″(d^{(k)})

Since we want d^{(k+1)} to be a zero of E′, we set

    E′(d^{(k)}) + (d^{(k+1)} − d^{(k)}) E″(d^{(k)}) = 0

Thus,

    d^{(k+1)} − d^{(k)} = − E′(d^{(k)}) / E″(d^{(k)})
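The one-dimensional Newton step −E′/E″ can be sketched directly; the function names and the quartic test energy are illustrative:

```python
def newton_minimize(dE, d2E, d0, iters=20):
    """1-D Newton-Raphson on the derivative: each step moves by
    -E'(d)/E''(d), the step the Hessian rule prescribes."""
    d = d0
    for _ in range(iters):
        d = d - dE(d) / d2E(d)
    return d

# E(d) = (d - 2)^4 has its minimum at d = 2.
dE = lambda d: 4 * (d - 2) ** 3
d2E = lambda d: 12 * (d - 2) ** 2
print(round(newton_minimize(dE, d2E, d0=5.0, iters=60), 3))  # -> 2.0
```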
Netravali-Robbins Algorithm

The Netravali-Robbins algorithm finds an estimate of the displacement vector at each pixel to minimize

    E(x, d) = [ dfd(x, d) ]²

The gradient of the dfd with respect to d is

    ∇_d dfd(x, d^i) = ∇_x s_c(x − d^i, t − Δt)

so that the gradient-based update becomes

    d^{i+1}(x) = d^i(x) − ε dfd(x, d^i) ∇_x s_c(x − d^i, t − Δt)
Walker and Rao suggested the step size

    ε = 1 / || ∇_x s_c(x − d^i, t − Δt) ||²

Cafforio and Rocca have added a bias term η² to avoid division by zero in the areas of constant intensity,

    ε = 1 / ( || ∇_x s_c(x − d^i, t − Δt) ||² + η² )
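A 1-D toy version of the pel-recursive update with the Cafforio-Rocca step size; all names and the synthetic signal are ours, and the spatial gradient is estimated by a central difference:

```python
import numpy as np

def pel_recursive_1d(prev, curr, x, d0=0.0, eta2=1e-3, iters=50):
    """1-D sketch of the gradient-based update with the Cafforio-Rocca
    step size. prev/curr are callables giving the intensity of the
    previous and current frames."""
    d = d0
    for _ in range(iters):
        dfd = curr(x) - prev(x - d)
        grad = prev(x - d + 0.5) - prev(x - d - 0.5)  # central difference
        step = 1.0 / (grad**2 + eta2)
        d = d - step * dfd * grad
    return d

# A smooth 1-D intensity profile translated by 0.3 pels.
prev = lambda x: np.sin(0.4 * x)
true_d = 0.3
curr = lambda x: np.sin(0.4 * (x - true_d))
d_hat = pel_recursive_1d(prev, curr, x=2.0)
assert abs(d_hat - true_d) < 1e-3
```

The sign conventions here follow the 1-D toy setup (current frame shifted by +true_d); a real implementation must match the dfd definition used for the 2-D frames.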
The update can also be computed over a support M (a set of pels) by minimizing

    E_M(d_M) = Σ_{x ∈ M} [ dfd(x, d_M) ]²

so that

    d^{i+1}_M = d^i_M − ε ∇_d Σ_{x ∈ M} [ dfd(x, d^i) ]²

where d^{i+1}_M denotes the new displacement estimate over the entire support M.
Linear minimum mean square error (LMMSE) estimation of the update term u^i based on a neighborhood of a pel (an extension of the multiple-pel version of the Netravali-Robbins algorithm). Linearization of the dfd at the N pels of the support:

    dfd(x^{(1)}, d^i) = −∇^T s_c(x^{(1)} − d^i, t − Δt) u^i + v(x^{(1)}, d^i)
    dfd(x^{(2)}, d^i) = −∇^T s_c(x^{(2)} − d^i, t − Δt) u^i + v(x^{(2)}, d^i)
    ...
    dfd(x^{(N)}, d^i) = −∇^T s_c(x^{(N)} − d^i, t − Δt) u^i + v(x^{(N)}, d^i)

Expressing this set of equations as z = Φ u_M + v, the LMMSE estimate of the update term is given by

    û_M = [ Φ^T R_v^{−1} Φ + R_u^{−1} ]^{−1} Φ^T R_v^{−1} z

where û_M denotes the update term for the entire support. The solution requires the knowledge of the covariance matrices of both the update, R_u, and the linearization error, R_v. Assuming that R_u = σ_u² I and R_v = σ_v² I,

    û_M = [ Φ^T Φ + (σ_v²/σ_u²) I ]^{−1} Φ^T z

and

    d^{i+1}_M = d^i_M + [ Φ^T Φ + (σ_v²/σ_u²) I ]^{−1} Φ^T z

Note that the assumptions used to arrive at the simplified estimator are not in general true; e.g., the linearization error is not uncorrelated with the update term, and the updates and the linearization errors at each pixel are not uncorrelated with each other. However, experimental results indicate better performance than other pel-recursive estimators.
1. Introduction to Markov Random Fields and the Gibbs Distribution
2. Optimization Methods
   - Simulated Annealing (SA): Metropolis algorithm and Gibbs sampler
   - Iterated Conditional Modes (ICM)
   - Mean Field Annealing (MFA)
3. MAP Motion Estimation
   - Basic Formulation
   - Discontinuity Models
   - Estimation Algorithms
Definitions

Let a random field z = {z(x), x ∈ Λ} be specified over a lattice Λ, and ω ∈ Ω denote a realization of the random field z. The random field z(x) can be continuous- or discrete-valued; that is, ω(x) ∈ R or ω(x) ∈ Γ = {0, 1, ..., L−1}, for all x ∈ Λ, respectively.

A neighborhood system on Λ: the set N_x denotes the neighborhood of the site x, and has the properties: (i) x ∉ N_x, and (ii) x_j ∈ N_{x_i} if and only if x_i ∈ N_{x_j}, where x_i and x_j denote arbitrary sites in the lattice. (In words, x does not belong to its own set of neighbors, and if x_j is a neighbor of x_i, then x_i is a neighbor of x_j, and vice versa.) The neighborhood system over Λ is then defined as N = {N_x, x ∈ Λ}.
(Figure: (a) 4-point and (b) 8-point neighborhood systems.)

A clique C is defined as a subset of sites such that all pairs of sites in C are neighbors. Further, C denotes the set of all cliques.
(In words, the first condition implies all realizations have non-zero pdf, while the second states that the conditional pdf at a particular site depends only on its neighborhood.)

Difficulties with MRF models: (i) the joint pdf p(z) cannot be easily related to local properties, and (ii) it is hard to determine when a set of functions p(z(x_i) | z(x_j), x_j ∈ N_{x_i}), x_i ∈ Λ, are valid conditional pdfs [Geman and Geman].
A GRF with a neighborhood system N and the associated set of cliques C is characterized by the joint pdf

    p(z = ω) = (1/Q) e^{−U(z=ω)/T}

where, in the discrete-valued case,

    Q = Σ_ω e^{−U(z=ω)/T}

and, in the continuous-valued case,

    Q = ∫ e^{−U(z)/T} dz

and U(z), the Gibbs potential (Gibbs energy), is defined by

    U(z) = Σ_{C ∈ C} V_C( z(x) | x ∈ C ).
Example: Spatial smoothness constraint using a GRF. Let us use a 4-point neighborhood system and the 2-pixel cliques. Over a 4 × 4 lattice, there are a total of 24 such cliques. Let the 2-pixel clique potential be defined so that equal neighboring labels are favored.

(Figure: (b) a smooth realization with total potential V = −24, (c) a checkerboard realization with V = 24.)

Hammersley-Clifford (H-C) Theorem: Let N be a neighborhood system. Then z(x) is an MRF with respect to N if and only if p(z) is Gibbsian with the cliques induced by N.
The local conditional pdf, e.g., as used in the Gibbs sampler method for optimization, is defined as

    p( z(x_i) | z(x_j), x_j ∈ N_{x_i} ) = (1/Q_{x_i}) exp{ −(1/T) Σ_{C | x_i ∈ C} V_C( z(x) | x ∈ C ) }

where

    Q_{x_i} = Σ_{z(x_i) ∈ Γ} exp{ −(1/T) Σ_{C | x_i ∈ C} V_C( z(x) | x ∈ C ) }
OPTIMIZATION METHODS

Many estimation/segmentation problems require the minimization of an energy function E(d). We state the problem as

    Ê = min_d E(d)

where d is some N-dimensional parameter vector. The value of d that results in the minimal E is denoted by

    d̂ = arg min_d E(d)

This minimization is exceedingly difficult for image processing applications due to both the dimensions of the vectors involved and the occurrence of local minima, because E(d) is usually nonconvex.
Gradient descent suffers from a serious problem: its solution is strongly dependent on the starting point. If we start in a "valley", it will be stuck at the bottom of that valley. We have no way of getting out of that local minimum to reach the "global" minimum. Here we look at several optimization methods that are capable of finding the global optimum.

A. Simulated annealing (stochastic relaxation)
B. Iterated conditional modes (ICM) (by Besag)
C. Mean field annealing (MFA) (by Bilbro et al.)
Simulated Annealing

Simulated annealing, sometimes referred to as stochastic relaxation, belongs to the class of Monte Carlo methods. It enables us to find the global optimum of a nonconvex cost function of many variables. Here we describe two implementations:
- the original formulation of Metropolis, and
- the Gibbs sampler proposed by Geman and Geman.

The computational load of simulated annealing is usually significant, especially when the number of elements in the unknown vector d and the number of values in the set Γ are large.
We start at an arbitrary initial vector d. At each iteration cycle, all components of d are perturbed one by one by assigning each another value in the set Γ randomly. Note that the order in which the components are perturbed is not important, as long as all components are perturbed in each iteration cycle. The change in the total energy, ΔE, due to the perturbation is computed after each perturbation to determine whether this perturbation is accepted. A perturbation is accepted with probability P given by

    P = 1              if ΔE ≤ 0
    P = exp(−ΔE/T)     if ΔE > 0

where T is the temperature parameter that controls the probability of accepting positive changes in the energy. We always accept perturbations that lower the energy. The rationale behind accepting perturbations that increase the energy is to prevent the solution from settling in a local minimum.
If T is relatively big, the probability of accepting a positive energy change is higher than when T is small, given the same ΔE. In the next iteration cycle, the temperature is lowered, and the components are revisited. The process continues until the temperature has been lowered to near zero. A temperature "schedule", expressing temperature as a function of the iteration number, is therefore an important component of the stochastic relaxation process. Geman and Geman proposed the following schedule:

    T = C / ln(k + 1)

where C is a constant and k is the iteration cycle. This schedule is viewed as overconservative but guarantees a global minimum solution. Schedules that lower the temperature at a faster rate have been shown to work.
The Algorithm
1. Choose an initial value for d = d^(0). Set i = 0 and j = 1.
2. Perturb the jth component of d^(i) to generate the vector d^(i+1).
3. Compute ΔE = E(d^(i+1)) − E(d^(i)).
4. Compute the acceptance probability P.
5. If P < 1, then draw a random number that is uniformly distributed between 0 and 1. If the number drawn is less than P, accept the perturbation.
6. Set j = j + 1. If j ≤ N, go to 2. (N is the number of components of d.)
7. Set i = i + 1 and j = 1. Reduce T according to the temperature schedule. If T > Tmin, go to 2. Otherwise, terminate.
In Gibbs sampling, instead of making random perturbations and then deciding whether to accept or reject each perturbation, the new value is "drawn from" the local conditional distribution and is always accepted. First compute the conditional probability of the component d(x_i) to take each of the values α in the set Γ, given the present values of its neighbors, using

    P( d(x_i) = α | d(x_j), x_j ∈ N_{x_i} ) = (1/Q_{x_i}) exp{ −(1/T) Σ_{C | x_i ∈ C} V_C( d(x) | x ∈ C ) }

where

    Q_{x_i} = Σ_{α ∈ Γ} exp{ −(1/T) Σ_{C | x_i ∈ C} V_C( d(x) | x ∈ C ) }

Then, the new value of the component d(x_i) is drawn from this conditional probability distribution.
To clarify the meaning of "drawn from", suppose that the sample space is Γ = {0, 1, 2, 3}, and it was found that

    P(d(x_i) = 0) = 0.2,  P(d(x_i) = 1) = 0.1,  P(d(x_i) = 2) = 0.4,  P(d(x_i) = 3) = 0.3

A uniform random number, R, between 0 and 1 is generated. If 0 ≤ R ≤ 0.2 then d(x_i) = 0; if 0.2 < R ≤ 0.3 then d(x_i) = 1; if 0.3 < R ≤ 0.7 then d(x_i) = 2; and if 0.7 < R ≤ 1 then d(x_i) = 3.

Properties of perturbations through Gibbs sampling: (i) for any initial estimate, updating using the Gibbs sampler yields an asymptotically Gibbsian distribution. This result can be used to simulate a Gibbs random field with specified parameters. (ii) for a specified temperature schedule, the maximum of the Gibbs distribution will be reached. Although this property is significant for MAP estimation, the specified temperature schedule may be too slow for use in practice.
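The "drawn from" operation is inverse-CDF sampling; a sketch using the probabilities of the example (the function name is ours):

```python
import random

def draw_from(probs):
    """Draw an index from a discrete distribution by inverting its CDF
    with a uniform random number, as in the Gibbs sampler."""
    r = random.random()
    cumulative = 0.0
    for value, p in enumerate(probs):
        cumulative += p
        if r <= cumulative:
            return value
    return len(probs) - 1  # guard against round-off

random.seed(0)
probs = [0.2, 0.1, 0.4, 0.3]
samples = [draw_from(probs) for _ in range(20000)]
freq = [samples.count(v) / len(samples) for v in range(4)]
print([round(f, 2) for f in freq])  # close to [0.2, 0.1, 0.4, 0.3]
```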
Iterated Conditional Modes (ICM)

It can be shown that ICM converges to the solution that maximizes the local conditional probabilities

    P( d(x_i) = α | d(x_j), x_j ∈ N_{x_i} ) = (1/Q_{x_i}) exp{ −(1/T) Σ_{C | x_i ∈ C} V_C( d(x) | x ∈ C ) }

where

    Q_{x_i} = Σ_{α ∈ Γ} exp{ −(1/T) Σ_{C | x_i ∈ C} V_C( d(x) | x ∈ C ) }

at each site. Thus, ICM is usually implemented as in Gibbs sampling, but by choosing the value at each site that gives the maximum local conditional probability. ICM provides much faster convergence than SA. Also, when the initial solution is a reasonable estimate from other means rather than completely random, ICM reaches an acceptable solution in relatively few iterations. ICM produces good results for several applications that include image restoration [see Besag] and image segmentation [see Pappas].
'
Let
1
sk = fsk (x)g x 2 , denote the kth frame of video, d(x) = d (x) d (x)]T denote the displacement vector at site x, and d = fd (x)g and d = fd (x)g for x 2 , denote the lexicographic ordering of
1 2 1 2 2
&
the x1 and x2 components of the displacement eld from frame k ; 1 to k, respectively i.e., sk (x) = sk;1 (x ; d(x)): Then, the problem of motion estimation can be formulated as: given sk and sk;1 , nd an estimate of d1 and d2. The maximum a posteriori probability (MAP) estimates of d1 and d2 are given by: ^1 d ^ 2 ) = arg maxd1 d2 p(d1 d2jsk sk;1) (d 184
or

(d̂_1, d̂_2) = arg max_{d_1, d_2} p(s_k | d_1, d_2, s_{k-1}) p(d_1, d_2 | s_{k-1})
(d̂_1, d̂_2) = arg max_{d_1, d_2} p(s_{k-1} | d_1, d_2, s_k) p(d_1, d_2 | s_k)

The term p(s_k | d_1, d_2, s_{k-1}) is the conditional pdf, or the "consistency (likelihood) measure," which measures how well the estimates of d_1, d_2 explain the observations s_k given s_{k-1}. The term p(d_1, d_2 | s_{k-1}) is the a priori probability density, which is modeled by a GRF by specifying the clique potential functions according to the desired local properties of (d̂_1, d̂_2).
Discontinuity Models
Let us introduce two auxiliary fields, the occlusion field o and the line field l, to model the occlusion/uncovered areas and the optical flow boundaries, respectively, in order to improve the motion estimation results. The occlusion field is o = {o(x)}, x ∈ Λ.
The line field l(x_i, x_j) models the horizontal and vertical discontinuities in the motion field (optical flow) between the sites x_i and x_j. The line process l conceptually occupies the dual lattice, which has sites for lines between every pair of pixel sites. The state of each line site can be either ON (l = 1) or OFF (l = 0), expressing the presence and absence of a discontinuity, respectively. Nonnegative potentials are assigned to each rotation-invariant line-clique configuration to penalize excessive use of the "ON" state.
[Figure: rotation-invariant line-clique configurations a)-c) and their potentials: V = 0.0, V = 0.9, V = 1.8, V = 1.8, V = 2.7, V = 2.7]
Example: Prior probabilities with and without the line field. The prior potentials slightly penalize straight lines (V = 0.9), penalize corners (V = 1.8) and "T" junctions (V = 1.8), and heavily penalize the end of a line (V = 2.7) and "crosses" (V = 2.7).
The likelihood potential function puts no penalty on dissimilar pixel pairs if the line site in between is ON, and puts different amounts of penalty on different line configurations, reflecting our a priori expectation of their occurrence.
With the introduction of the auxiliary fields, the MAP estimate of {d_1, d_2, o, l} is given by:

(d̂_1, d̂_2, ô, l̂) = arg max_{d_1, d_2, o, l} p(d_1, d_2, o, l | s_k, s_{k-1})

Next, we discuss the likelihood (consistency) and the a priori probability models.
The prior model incorporates the location of the optical flow boundaries and the occlusion/uncovered areas while dictating that the flow vectors vary smoothly within each optical flow boundary. The a priori model can be expressed as

p(d_1, d_2, o, l | s_k) = exp{ -U(d_1, d_2, o, l | s_k) }

U(d_1, d_2, o, l | s_k) = λ_d U_d(d_1, d_2 | l) + λ_o U_o(o | l) + λ_l U_l(l | s_k)
  = λ_d Σ_{c ∈ C_d} V_c(d_1, d_2 | l) + λ_o Σ_{c ∈ C_o} V_c(o | l) + λ_l Σ_{c ∈ C_l} V_c(l | s_k)
Here C_d, C_o and C_l denote the sets of all cliques for the displacement, occlusion and line fields, respectively, V_c(·) represents the corresponding clique potential function, and λ_d, λ_o and λ_l are positive constants.
ESTIMATION ALGORITHMS
The minimization of the overall potential is an exceedingly difficult problem: there are several hundred thousand unknowns for a reasonably sized image, and the criterion function is nonconvex. For example, for a 256 × 256 image, there are 65,536 motion vectors (131,072 components), 65,536 occlusion labels, and 131,072 line field labels, for a total of 327,680 unknowns. An additional complication is that the motion vector components are continuous-valued, while the occlusion and line field labels are discrete-valued.
Three-step iteration of Dubois and Konrad:

1. Given the best estimates ô and l̂ of the auxiliary fields, update the motion field by minimizing
   min_{d_1, d_2} U(g_k | d_1, d_2, ô, g_{k-1}) + λ_d U_d(d_1, d_2 | l̂, g_k)

2. Given the best estimates d̂_1, d̂_2 and l̂, update o by minimizing
   min_o U(g_k | d̂_1, d̂_2, o, g_{k-1}) + λ_o U_o(o | l̂, g_k)
   An exhaustive search or the ICM method can be employed to solve this step.

3. Finally, given the best estimates d̂_1, d̂_2 and ô, update l by minimizing
   min_l U(g_k | d̂_1, d̂_2, l, g_{k-1}) + λ_o U_o(ô | l, g_k) + λ_l U_l(l | g_k)

Once all three fields are updated, the process is repeated until a suitable criterion of convergence is satisfied. This procedure has been reported to give good results.
Application of standard image segmentation methods directly to optical flow segmentation (i.e., using the velocity vector as a feature) may not be useful, since 3-D motion usually generates spatially varying optical flow fields. For example, within a purely rotating object there is no flow at the center of rotation, and the magnitude of the flow vectors increases as the distance of the points from the center of rotation increases.

Thus, optical flow segmentation needs to be based on some parametric description of the motion field.
[Figure: example scene containing two independently moving objects, a ball and a train]
Assume that the object surface is composed of planar patches:

a X_1 + b X_2 + c X_3 = 1

The 3-D rigid motion of the object is modeled as

[X_1', X_2', X_3']^T = R [X_1, X_2, X_3]^T + T

Then,

[X_1', X_2', X_3']^T = [a_1 a_2 a_3; a_4 a_5 a_6; a_7 a_8 a_9] [X_1, X_2, X_3]^T

where

A = R + T [a b c]
Scene Segmentation
Orthographic projection of the object coordinates onto the image plane yields

x_1' = a_1 x_1 + a_2 x_2 + a_3
x_2' = a_4 x_1 + a_5 x_2 + a_6

Perspective projection of the object coordinates onto the image plane yields

x_1' = (a_1 x_1 + a_2 x_2 + a_3) / (a_7 x_1 + a_8 x_2 + 1)
x_2' = (a_4 x_1 + a_5 x_2 + a_6) / (a_7 x_1 + a_8 x_2 + 1)
Assuming the scene is represented by a 3-D mesh (wireframe) model with planar patches, different parametric models are needed for:
- Different moving objects, which have different sets of 3-D rigid motion parameters.
- Different planar patches, which have different normal vectors.
Thresholding
Consider a bimodal histogram h(s) of an image s(x_1, x_2) composed of a light object on a dark background.

[Figure: bimodal histogram h(s), with threshold T between the two modes, between s_min and s_max]

To extract the object from the background, select a threshold T that separates these two dominant modes (peaks). Then

z(x_1, x_2) = 1 if s(x_1, x_2) > T, 0 otherwise

indicates the object and background pixels.
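The thresholding rule above can be sketched in a few lines: a minimal example in which the gray levels and the threshold T = 128 are assumed values chosen for illustration (in practice T is read off the valley between the histogram modes).

```python
import numpy as np

def threshold(image, T):
    """z = 1 where the gray level exceeds T (object), 0 otherwise (background)."""
    return (image > T).astype(np.uint8)

image = np.array([[200, 50],
                  [30, 220]])       # a tiny 2x2 "image" with two bright pixels
mask = threshold(image, 128)
```

The same one-liner generalizes to multilevel thresholding by comparing against a sorted list of M-1 thresholds instead of a single T.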
Multilevel Thresholding: If the histogram has M significant modes (peaks), where M > 2, then we need M - 1 thresholds to separate the image into M segments. Of course, reliable determination of the thresholds becomes more difficult as the number of modes increases.

Global/Local/Dynamic Thresholding: In general, the threshold T is a function of …
Suppose we wish to segment an image into K regions based on the gray values of the pixels. Let x = (x_1, x_2) denote the coordinates of a pixel, and s(x) denote its gray level.
[Figure: gray-level feature space with K = 2 cluster centers μ_1, μ_2]

The K-means method of clustering minimizes the performance index

J = Σ_{k=1}^{K} Σ_{x ∈ Γ_k^(i)} || s(x) - μ_k^(i+1) ||²
2. Assign each sample to the nearest cluster: s(x) ∈ Γ_j^(i) if ||s(x) - μ_j^(i)|| ≤ ||s(x) - μ_k^(i)|| for all k = 1, 2, ..., K, k ≠ j, where Γ_k^(i) denotes the set of samples whose cluster center is μ_k^(i).

3. Compute the new cluster centers μ_k^(i+1), k = 1, 2, ..., K, as the sample mean of all samples in Γ_k^(i):

μ_k^(i+1) = (1/N_k) Σ_{x ∈ Γ_k^(i)} s(x),   k = 1, 2, ..., K

where N_k is the number of samples in Γ_k^(i).

4. If μ_k^(i+1) = μ_k^(i) for all k = 1, 2, ..., K, the algorithm has converged, and the procedure is terminated. Otherwise, go to step 2.
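The K-means iteration above can be sketched for scalar gray levels: a minimal example with K = 2, where the sample gray levels and the initial centers are assumed values chosen for illustration.

```python
import numpy as np

def kmeans_1d(samples, centers, max_iter=100):
    """K-means on scalar samples: assign to nearest center, update means, repeat."""
    samples = np.asarray(samples, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.zeros(len(samples), dtype=int)
    for _ in range(max_iter):
        # step 2: assign each sample to the nearest cluster center
        labels = np.argmin(np.abs(samples[:, None] - centers[None, :]), axis=1)
        # step 3: recompute each center as the sample mean of its cluster
        new = np.array([samples[labels == k].mean() for k in range(len(centers))])
        if np.array_equal(new, centers):  # step 4: converged
            break
        centers = new
    return centers, labels

# two well-separated gray-level populations, initial centers at the extremes
centers, labels = kmeans_1d([10, 12, 11, 200, 205, 198], [0, 255])
```

Note this sketch assumes every cluster keeps at least one sample; production code would guard against empty clusters.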
MAP Segmentation
Clustering with Spatial Smoothness Constraints: Let z(x) denote the segmentation label at the pixel x, i.e., 1 ≤ z(x) ≤ K, and s(x) denote the gray level of the pixel. Define z and s to denote the lexicographic orderings of the segmentation label field and the gray level field, respectively. The maximum a posteriori probability (MAP) estimate of the segmentation label field maximizes the a posteriori probability of the segmentation labels given the pixel gray levels,

p(z | s) ∝ p(s | z) p(z)

where p(s | z) is the conditional probability density of the image gray levels given the pixel labels and p(z) is the prior density of the segmentation labels.
The prior pdf of the segmentation labels is modeled by a GRF

p(z) = (1/Q) Σ_ω exp{ -Σ_C V_C(ω) } δ(z - ω)

where Q is the partition function (normalizing constant) and the summation is over all cliques C. We consider only one- and two-point cliques. The single-pixel clique potentials are defined as

V_C(z(x)) = α_i   if z(x) = i and x ∈ C,   all i

They reflect our a priori knowledge of the probabilities of different region types: the smaller α_i, the higher the likelihood of region i. The two-point clique potentials are defined as

V_C(z(x_1), z(x_2)) = -β   if z(x_1) = z(x_2) and x_1, x_2 ∈ C
                       β   if z(x_1) ≠ z(x_2) and x_1, x_2 ∈ C

where β is a positive parameter, so that two neighboring pixels are more likely to belong to the same class than to different classes. The larger the value of β, the stronger the smoothness constraint.
The conditional density for region k is modeled as a white Gaussian process with mean μ_k and variance σ². Thus, the a posteriori density has the form

p(z | s) ∝ exp{ -(1/(2σ²)) Σ_x (s(x) - μ_{z(x)})² - Σ_C V_C(z) }

Maximization of this a posteriori density function with respect to z can be performed by simulated annealing. Observe that if we turn off the spatial smoothness constraints, the result is identical to the K-means algorithm.
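The local posterior energy above can be sketched directly: a minimal example with a Gaussian data term plus Potts-type two-point clique potentials, in which the class means, σ, and β are assumed values chosen for illustration. With β = 0 the minimizing label is simply the one with the nearest class mean (the K-means assignment); with a strong β a pixel can be pulled to its neighbors' class.

```python
def local_energy(s, label, neighbor_labels, means, sigma=10.0, beta=0.5):
    """Gaussian data term plus Potts clique potentials at one site."""
    data = (s - means[label]) ** 2 / (2 * sigma ** 2)
    clique = sum(-beta if nl == label else beta for nl in neighbor_labels)
    return data + clique

means = [50.0, 200.0]
# beta = 0: a pixel of gray level 60 picks class 0 (nearest mean), as in K-means
e0 = local_energy(60, 0, [1, 1, 1, 1], means, beta=0.0)
e1 = local_energy(60, 1, [1, 1, 1, 1], means, beta=0.0)
# strong beta: the same pixel surrounded by class-1 neighbors flips to class 1
f0 = local_energy(60, 0, [1, 1, 1, 1], means, beta=30.0)
f1 = local_energy(60, 1, [1, 1, 1, 1], means, beta=30.0)
```

Minimizing this energy site by site is exactly the ICM update; simulated annealing instead samples labels with probability proportional to exp(-energy/T).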
The MAP method can be made adaptive by letting the cluster means vary slowly with the pixel location x. Then

p(s | z) ∝ exp{ -Σ_x (s(x) - μ_{z(x)}(x))² / (2σ²) }

[Figure: segmentation labels (K = 2) and a local window about a pixel]

The quantities μ_k(x) are estimated at each site x for all k = 1, ..., K, as the sample mean of those pixels with label k within a local window about the pixel x.
Computational Issues

To reduce the computational load to a reasonable level, 1) the space-varying mean estimates are computed on a sparse grid and then interpolated, and 2) the optimization is performed via the ICM method. The algorithm starts with a window size equal to the image size and reduces the size of the window by 4 after each ICM optimization cycle. ICM is equivalent to maximizing the local a posteriori pdf

p(z(x_i) | s(x_i), z(x_j), all x_j ∈ N_{x_i}) ∝ exp{ -(1/(2σ²)) (s(x_i) - μ_{z(x_i)}(x_i))² - Σ_{C: x_i ∈ C} V_C(z) }
Ref: T. N. Pappas, "An Adaptive Clustering Algorithm for Image Segmentation," IEEE Trans. on Signal Processing, vol. SP-40, pp. 901-914, April 1992.
Multi-Channel Segmentation
Let y(x) = (v_1(x), v_2(x), s(x)). Assign a single label z(x) to each element of y(x) to maximize

p(z | y) ∝ p(y | z) p(z)

Assuming v_1, v_2, and s are conditionally independent given z, this results in

p(v_1, v_2, s | z) = exp{ -Σ_x [ (1/(2σ_1²)) (v_1(x) - μ^{v_1}_{z(x)}(x))² + (1/(2σ_2²)) (v_2(x) - μ^{v_2}_{z(x)}(x))² + (1/(2σ_3²)) (s(x) - μ^{s}_{z(x)}(x))² ] }

The prior pdf for z is a Gibbs distribution with a 4-pixel neighborhood system and 2-pixel cliques.
CHANGE DETECTION
Compare two images pixel by pixel by forming a difference image

FD_{k,k-1}(x_1, x_2) = s(x_1, x_2, k) - s(x_1, x_2, k-1)

Segment the scene into moving vs. stationary parts by thresholding the difference image:

z(x_1, x_2) = 1 if |FD_{k,k-1}(x_1, x_2)| > T, 0 otherwise

where T is an appropriate threshold. This approach assumes that the illumination remains more or less constant from frame to frame. This method may result in isolated 1s in the segmentation mask z(x_1, x_2) due to noise in the images.
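The change-detection rule above can be sketched as follows: a minimal example in which the two frames and the threshold T = 20 are assumed values chosen for illustration.

```python
import numpy as np

def change_mask(frame_k, frame_k1, T):
    """1 where |frame difference| exceeds T (moving), 0 elsewhere (stationary)."""
    fd = frame_k.astype(int) - frame_k1.astype(int)   # signed difference image
    return (np.abs(fd) > T).astype(np.uint8)

prev = np.array([[10, 10],
                 [10, 10]])
curr = np.array([[10, 90],
                 [12, 10]])   # one truly moving pixel, one small noise fluctuation
mask = change_mask(curr, prev, T=20)
```

Note the cast to a signed type before subtracting: differencing unsigned image arrays directly would wrap around instead of going negative.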
Accumulative Differences: To eliminate sporadic "1"s in the segmentation mask, we may add memory to the motion detection process by forming accumulative difference images. Let s(x_1, x_2, k), s(x_1, x_2, k-1), ..., s(x_1, x_2, k-n) be a sequence of images, and let s(x_1, x_2, k) be the reference frame. An accumulative difference image is formed by comparing this reference image with every subsequent image in the sequence. A counter for each pixel location in the accumulative image is incremented every time the difference between the reference image and the next image in the sequence at that pixel location is bigger than the threshold.
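The accumulative difference image above can be sketched as follows: a minimal example in which the reference frame, the two subsequent frames, and the threshold T = 20 are assumed values chosen for illustration.

```python
import numpy as np

def accumulative_difference(reference, frames, T):
    """Per-pixel counter of how often each frame differs from the reference by > T."""
    counter = np.zeros(reference.shape, dtype=int)
    ref = reference.astype(int)
    for f in frames:
        counter += (np.abs(f.astype(int) - ref) > T).astype(int)
    return counter

ref = np.zeros((2, 2), dtype=np.uint8)
seq = [np.array([[0, 50], [0, 0]], dtype=np.uint8),
       np.array([[0, 60], [0, 50]], dtype=np.uint8)]
acc = accumulative_difference(ref, seq, T=20)
```

Pixels with a high count changed persistently (true motion); a count of 1 over a long sequence is the sporadic noise the method is designed to suppress.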
A Direct Method
i) Parametric modeling of the 2-D motion field: Define a transform with a set of parameters that maps pixels from frame k to frame k+1, and estimate the parameters of this transform in the image domain. ii) Segmentation: Regions undergoing the same 3-D motion have the same set of mapping parameters; thus, assign flow vectors having the same mapping parameters to the same class. The process iterates between parameter estimation and segmentation until a satisfactory result is obtained.
Let

s_{k+1}(x') = α s_k(x) + β + n_k(x)

where α and β describe global illumination changes, and n_k(x) denotes the noise. Assuming no occlusion effects,

s_{k+1}(x') = s_k(x),   x' = h(x, a)

where the transform h depends on: 1) The 3-D motion of the object. 2) The projection model from the 3-D space onto the camera plane. 3) The model of the object surface (planar, quadratic, etc.)
1) Planar surface, perspective projection: Let x and x' denote image plane coordinates under the perspective projection. Assume that the surface of the moving object is planar, X_3 = a X_1 + b X_2 + c. Then the transformation is given by

x_1' = (a_1 x_1 + a_2 x_2 + a_3) / (a_7 x_1 + a_8 x_2 + 1)
x_2' = (a_4 x_1 + a_5 x_2 + a_6) / (a_7 x_1 + a_8 x_2 + 1)

where a = (a_1, ..., a_8) is the vector of mapping parameters.

2) Planar surface, orthographic projection: In the case of parallel (orthographic) projection, we have the affine transform

where c = (c_1, …
Let

x_1 = m X_1,   x_2 = m X_2
x_1' = m X_1',  x_2' = m X_2'

describe the parallel projection. Substituting these into the 3-D displacement model and grouping terms with the same exponent, we arrive at the 12-parameter quadratic transform

x_1' = a_1 x_1² + a_2 x_2² + a_3 x_1 x_2 + a_4 x_1 + a_5 x_2 + a_6
x_2' = b_1 x_1² + b_2 x_2² + b_3 x_1 x_2 + b_4 x_1 + b_5 x_2 + b_6
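The two parametric mappings above can be evaluated directly: a minimal sketch in which all parameter values are assumed for illustration. With a = (1, 0, 0, 0, 1, 0, 0, 0) the perspective map is the identity; with only the linear terms set, the 12-parameter quadratic map reduces to an affine transform.

```python
def perspective_map(x1, x2, a):
    """8-parameter planar-surface perspective mapping."""
    a1, a2, a3, a4, a5, a6, a7, a8 = a
    den = a7 * x1 + a8 * x2 + 1.0
    return ((a1 * x1 + a2 * x2 + a3) / den,
            (a4 * x1 + a5 * x2 + a6) / den)

def quadratic_map(x1, x2, a, b):
    """12-parameter quadratic mapping (two sets of six coefficients)."""
    a1, a2, a3, a4, a5, a6 = a
    b1, b2, b3, b4, b5, b6 = b
    return (a1 * x1**2 + a2 * x2**2 + a3 * x1 * x2 + a4 * x1 + a5 * x2 + a6,
            b1 * x1**2 + b2 * x2**2 + b3 * x1 * x2 + b4 * x1 + b5 * x2 + b6)

p = perspective_map(3.0, 4.0, (1, 0, 0, 0, 1, 0, 0, 0))        # identity
q = quadratic_map(3.0, 4.0, (0, 0, 0, 1, 0, 0.5),               # pure translation
                  (0, 0, 0, 0, 1, -1.0))
```

Warping a whole frame with such a map would additionally require interpolating gray levels at the non-integer target coordinates.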
Remarks: The quadratic transform is generally used in optical flow segmentation and object-oriented description, because it provides a good approximation for many real-life images. It is not always possible to completely determine the 3-D motion of the object and the explicit surface structure using only the mapping parameters of the transform h(x, a). But for image coding applications this does not pose a serious problem, since the main interest is the prediction of the next frame from the current frame. The mapping approach presented here is not capable of handling occlusion effects.
Linear algorithms exist to find the mapping parameters given spatio-temporal intensity gradients. The contents of the images s_k(x) and s_{k+1}(x) must be sufficiently similar for estimation to be successful.

We estimate the mapping parameters to minimize the error function

J(â) = (1/2) E{ (s̃_{k+1}(x; â) - s_{k+1}(x))² }

where s̃_{k+1}(x; â) denotes the prediction of frame k+1 from frame k.
Each object is characterized by a specific mapping vector a. Thus, segmentation and motion estimation are treated as a combined problem.
- In the first step, the regions which have changed between s_k(x) and s_{k+1}(x) are determined (change detection).
- All isolated connected regions of the resulting segmentation are defined as objects of hierarchy level one. For each of these objects, a parameter vector a of a transform h(x, a) which relates the two images is estimated.
- Next, those regions of each object where the vector a is not valid are removed. These regions are defined as objects of the second hierarchy level.
- For the objects of level two and the remaining parts of level one, the parameter vectors are estimated.
- The procedure is repeated until the parameter vector for each region is consistent with the region.
Dense motion estimation (hierarchical, 3-step Lucas-Kanade). Start with randomly selected seed blocks (initial regions) and estimate affine parameters over each block. Merge regions with "similar" affine parameters to reduce the number of classes. Update regions by classifying each pixel into one of the motion classes, based on the similarity of the dense and the corresponding affine motion vectors, wherever a "good" match can be found. Re-estimate affine parameters over the updated regions, and iterate until convergence. Classify all "unassigned" pixels based on a DFD criterion.
CLUSTERING METHODS
1. Estimate the optical flow field.
2. Divide the motion field into rectangular blocks.
3. For each block, estimate the affine parameters by the method of linear least squares.
4. Threshold the motion residual by T_stage to determine reliable blocks.
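Step 3 above can be sketched directly: a minimal linear least-squares fit of the 6-parameter affine model v_1 = a_1 x_1 + a_2 x_2 + a_3, v_2 = a_4 x_1 + a_5 x_2 + a_6 to the flow vectors of one block. The flow samples are synthetic and assumed for illustration (generated exactly from an affine model, so the fit recovers the parameters).

```python
import numpy as np

def fit_affine(points, flows):
    """Least-squares affine motion parameters (a1..a6) from sparse flow samples."""
    pts = np.asarray(points, dtype=float)
    A = np.column_stack([pts[:, 0], pts[:, 1], np.ones(len(pts))])
    v = np.asarray(flows, dtype=float)
    a123, *_ = np.linalg.lstsq(A, v[:, 0], rcond=None)  # fit v1 component
    a456, *_ = np.linalg.lstsq(A, v[:, 1], rcond=None)  # fit v2 component
    return np.concatenate([a123, a456])

pts = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
true = np.array([0.5, -0.2, 1.0, 0.1, 0.3, -0.5])       # assumed "ground truth"
flows = [(true[0]*x + true[1]*y + true[2],
          true[3]*x + true[4]*y + true[5]) for x, y in pts]
params = fit_affine(pts, flows)
```

With noisy real flow, the residual of this fit is exactly the quantity thresholded by T_stage in step 4 to decide whether a block is reliable.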
5. … assignment.
6. Find the pixels that fall into the computed cluster using the velocity-checking criterion.
7. Delete all the assigned pixels from the image so that they will not be used in the next stage.
8. Eliminate small regions from the map obtained in step 7.
9. If all the pixels are assigned, then stop; otherwise go to step 4.
MAP SEGMENTATION
The MAP segmentation maximizes

p(z | v_1, v_2) = p(v_1, v_2 | z) p(z) / p(v_1, v_2)

given the optical flow data, where p(v_1, v_2 | z) is the conditional pdf of the optical flow data given the segmentation, and p(z) is the prior probability of the segmentation. 1) The segmentation field is modeled by a spatio-temporal Markov random field (MRF) to impose continuity (smoothness) of labels. 2) The conditional pdf models how well we can predict the measured (estimated) optical flow field.

Ref: Murray and Buxton.
The Conditional Probability: In the presence of noise n, the joint probability of the data given the segmentation labels is related to the noise distribution P_n(n) by

p(v_1, v_2 | z) = P_n(n)

Assuming that the noise is white and Gaussian, with zero mean and variance σ²,

P_n(n) = (1/(2πσ²)^{N/2}) exp{ -(1/(2σ²)) Σ_x ε²(x) }

where N is the number of data points, which depends on the way the optical flow data are distributed among the various scene facets.
The prior probability of the interpretation is modeled by an MRF with respect to some local neighborhood. Thus, it is given by a Gibbs distribution, which effectively introduces local constraints on the interpretation:

p(z) = (1/Q) Σ_ω exp{ -U(ω) } δ(z - ω)

where U(ω) is the sum of local potentials. Taking the logarithm of the MAP criterion, the maximization of the a posteriori probability distribution becomes minimization of the cost function

C = (1/(2σ²)) Σ_x ε²(x) + U(z)
The Algorithm:
1. Start with an initial labeling z of the optical flow vectors. Calculate the mapping parameters a = [a_1, ..., a_8]^T for each region using least squares fitting. Set the initial temperature for SA.
2. Scan the pixel sites according to a predefined convention. At each site x_i:
(a) Perturb the label z_i randomly.
(b) Decide whether to accept or reject this perturbation, based on the change in the cost function

ΔC = (1/(2σ²)) ε²(x_i) + Σ_{x_j ∈ N_{x_i}} V_C(z(x_i), z(x_j))

3. After all pixel sites are visited once, re-estimate the mapping parameters for each region in the least squares sense, based on the new segmentation label configuration.
4. Exit if a stopping criterion is satisfied. Otherwise, lower the temperature according to the schedule, and go to step 2.
The spatial and temporal continuity of the segmentation labels can be enforced by means of spatial and temporal Gibbs potential functions, where

U = Σ_{x_i} Σ_{x_j ∈ N_{x_i}} V_2s(z(x_i), z(x_j), L_ij) + Σ V_Γ(L) + Σ_{x_i} Σ_{x_k ∈ N_{x_i}} V_2t(z(x_i), z(x_k))

with

V_2s(z(x_i), z(x_j), L_ij) = -α_s  if z(x_i) = z(x_j) and L_ij is OFF
                               α_s  if z(x_i) ≠ z(x_j) and L_ij is OFF
                               0    if L_ij is ON

and

V_2t(z(x_i), z(x_k)) = -α_t  if z(x_i) = z(x_k)
                         α_t  otherwise

Here α_s and α_t are positive parameters which control the strength of the spatial and temporal continuity constraints, respectively.
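The two-point clique potentials above translate directly into code: a minimal sketch in which the values of α_s and α_t (and the label/line-field arguments) are assumed for illustration.

```python
def v2s(zi, zj, line_on, alpha_s=1.0):
    """Spatial two-point clique potential with a line field."""
    if line_on:                     # a discontinuity between the sites: no penalty
        return 0.0
    return -alpha_s if zi == zj else alpha_s

def v2t(zi, zk, alpha_t=0.5):
    """Temporal two-point clique potential (no line field across time)."""
    return -alpha_t if zi == zk else alpha_t
```

Summing these over all neighbor pairs reproduces the spatial and temporal terms of U; the line-field term V_Γ(L) would be added separately to penalize overuse of ON line sites.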
p(v_1, v_2, z | g_k, g_{k+1}) = p(g_{k+1} | g_k, v_1, v_2, z) p(v_1, v_2 | z, g_k) p(z | g_k) / p(g_{k+1} | g_k)

p(g_{k+1} | g_k, v_1, v_2, z) is characterized by the DFD, modeled by a Gaussian distribution. p(z | g_k) is modeled as Gibbsian for connected regions.
p(v_1, v_2 | z, g_k) relates the 2-D motion estimates to the 3-D scene:

p(v_1, v_2 | z, g_k) = p(v_1, v_2 | z) = (1/Q) exp{ -U(v_1, v_2 | z) }

where

U(v_1, v_2 | z) = Σ_x || v(x) - ṽ(x) ||²

The minimization is performed in two steps, alternating between estimation of the optical flow, estimation of the model parameters, and update of the segmentation labels.
1. Estimate the optical flow field (v_1, v_2), assuming that the segmentation field z is given. This step involves the minimization of a modified cost function

C_1 = Σ_x ε²_{v_1,v_2}(x) + α Σ_x || v(x) - ṽ(x) ||² + β Σ_{x_i} Σ_{x_j ∈ N_{x_i}} || v(x_i) - v(x_j) ||² δ(z(x_i) - z(x_j))

which is composed of all the terms in C that contain (v_1, v_2). While the first term indicates how well v explains our observations, the second and third terms impose prior constraints on the motion estimates: they should conform to the parametric flow model, and they should vary smoothly within each region. The algorithm is initialized with an optical flow field estimated using a global smoothness constraint. Given this estimate, we initialize the segmentation labels using a procedure similar to that of Wang and Adelson.
2. Estimate the segmentation field z, assuming the optical flow vectors (v_1, v_2) are given. This step involves the minimization of all terms in C that contain z, as well as (v_1', v_2'), the projection of the 3-D motion. The modified cost function is given by

C_2 = Σ_x ε²_{v_1',v_2'}(x) + α Σ_x || v(x) - v'(x) ||² + Σ_{x_i} Σ_{x_j ∈ N_{x_i}} V_2(z(x_i), z(x_j))

The first term quantifies how well the projected motion (v_1', v_2'), which depends on z and a, compensates for the motion. The second term measures the consistency of (v_1', v_2') with (v_1, v_2). The third term is related to the prior probability of the present configuration of the segmentation labels. This step includes the least squares estimation of the mapping parameters a. A hierarchical implementation of this algorithm is also possible, by forming successive low-pass filtered versions of g_k and g_{k+1}.
[Flowchart: input video → 2-D dense motion estimation (e.g., Lucas-Kanade) → iterative segmentation and parameter updates based on the MAP criterion using Gibbsian priors → go to next frame]
Perform pixel-based motion segmentation (dotted line) to determine the number of motion classes and the parametric model for each class. Perform color segmentation to define regions bounded by edges (solid lines). Assign each color region to one of the motion classes, based on the motion criterion, the DFD criterion, or a combination of them.