
Camera Models and Affine Multiple Views Geometry

Subhashis Banerjee
Dept. of Computer Science and Engineering, IIT Delhi
email: suban@cse.iitd.ac.in

May 29, 2001

1  Camera Models

A camera transforms a 3D scene point X = (X, Y, Z)^T into an image point x = (x, y)^T.

1.1  The Projective Camera

The most general mapping from P^3 to P^2 is

\[
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} =
\begin{pmatrix}
T_{11} & T_{12} & T_{13} & T_{14} \\
T_{21} & T_{22} & T_{23} & T_{24} \\
T_{31} & T_{32} & T_{33} & T_{34}
\end{pmatrix}
\begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{pmatrix}
\]

where (x_1, x_2, x_3)^T and (X_1, X_2, X_3, X_4)^T are homogeneous coordinates related to x and X by

\[
(x, y) = (x_1/x_3,\; x_2/x_3), \qquad
(X, Y, Z) = (X_1/X_4,\; X_2/X_4,\; X_3/X_4)
\]

The transformation matrix T = [T_ij] has 11 degrees of freedom, since only the ratios of the elements T_ij are important (see Zisserman and Mundy).
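The projective mapping can be sketched numerically as follows; numpy and the particular entries of T are illustrative assumptions, not values from the text.

```python
import numpy as np

# A hypothetical 3x4 projection matrix T; the entries are illustrative only.
T = np.array([[800.0, 0.0, 320.0, 10.0],
              [0.0, 800.0, 240.0, 20.0],
              [0.0, 0.0, 1.0, 0.0]])

def project(T, X):
    """Map a 3D point (X, Y, Z) to an image point (x, y) via homogeneous coordinates."""
    Xh = np.append(X, 1.0)          # homogeneous scene point (X1, X2, X3, X4), X4 = 1
    x1, x2, x3 = T @ Xh             # homogeneous image coordinates
    return np.array([x1 / x3, x2 / x3])

x = project(T, np.array([1.0, 2.0, 5.0]))
# Only the ratios of the T_ij matter: scaling T leaves the image point unchanged.
```

Since only ratios matter, `project(2 * T, X)` returns the same image point; this is the 11-degrees-of-freedom statement in action.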

1.2  The Perspective Camera

A special case of the projective camera is the perspective (or central) projection, which reduces to the familiar pin-hole camera when the leftmost 3 × 3 sub-matrix of T is a rotation matrix with its third row scaled by the inverse focal length 1/f. The simplest form is

\[
T_p =
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1/f & 0
\end{pmatrix}
\]

which gives the familiar equations

\[
\begin{pmatrix} x \\ y \end{pmatrix} = \frac{f}{Z} \begin{pmatrix} X \\ Y \end{pmatrix}
\]

Each point is scaled by its individual depth, and all projection rays converge to the optic center.
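A minimal numerical sketch of the pin-hole model; the focal length and the test point are illustrative assumptions.

```python
import numpy as np

f = 2.0  # illustrative focal length

# T_p as in the text: identity upper rows, third row scaled by 1/f.
Tp = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0 / f, 0.0]])

def perspective(X):
    """Pin-hole projection of a 3D point via the homogeneous matrix T_p."""
    x1, x2, x3 = Tp @ np.append(X, 1.0)
    return np.array([x1 / x3, x2 / x3])

# The matrix form agrees with the closed form (x, y) = (f / Z) (X, Y).
X = np.array([1.0, 2.0, 4.0])
```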

1.3  The Affine Camera

The affine camera is a special case of the projective camera, obtained by constraining the matrix T so that T_31 = T_32 = T_33 = 0, thereby reducing the degrees of freedom from 11 to 8:

\[
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} =
\begin{pmatrix}
T_{11} & T_{12} & T_{13} & T_{14} \\
T_{21} & T_{22} & T_{23} & T_{24} \\
0 & 0 & 0 & T_{34}
\end{pmatrix}
\begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ X_4 \end{pmatrix}
\]

In terms of image and scene coordinates the mapping takes the form

\[
x = M X + t
\]

where M is a general 2 × 3 matrix with elements M_ij = T_ij / T_34, and t is a general 2-vector representing the image center. The affine camera preserves parallelism.
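The parallelism-preserving property is easy to verify numerically; M, t and the segment endpoints below are arbitrary illustrative values, assuming numpy.

```python
import numpy as np

# Illustrative affine camera: an arbitrary 2x3 matrix M and 2-vector t.
rng = np.random.default_rng(0)
M = rng.normal(size=(2, 3))
t = rng.normal(size=2)

def affine_project(X):
    """Affine projection x = M X + t of a 3D point."""
    return M @ X + t

# Two parallel 3D segments (same direction d, different base points)
# project to parallel 2D segments: both image directions equal M d.
d = np.array([1.0, 2.0, 3.0])
a, b = rng.normal(size=3), rng.normal(size=3)
d1 = affine_project(a + d) - affine_project(a)   # image direction of segment 1
d2 = affine_project(b + d) - affine_project(b)   # image direction of segment 2
```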

1.4  The Weak-Perspective Camera

The affine camera becomes a weak-perspective camera when the rows of M form a uniformly scaled rotation matrix. The simplest form is

\[
T_{wp} =
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & Z_{ave}/f
\end{pmatrix}
\]

yielding

\[
M_{wp} = \frac{f}{Z_{ave}} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}
\qquad\text{and}\qquad
\begin{pmatrix} x \\ y \end{pmatrix} = \frac{f}{Z_{ave}} \begin{pmatrix} X \\ Y \end{pmatrix}
\]
This is simply the perspective equation with the individual point depths Z_i replaced by an average constant depth Z_ave. The weak-perspective model is valid when the variation ΔZ of the depth of the object along the line of sight is small compared to Z_ave and the field of view is small. We see this as follows. Writing the depth of a point as Z = Z_ave + ΔZ and expanding the perspective projection equation in a Taylor series, we obtain

\[
\begin{pmatrix} x \\ y \end{pmatrix}
= \frac{f}{Z_{ave} + \Delta Z} \begin{pmatrix} X \\ Y \end{pmatrix}
= \frac{f}{Z_{ave}} \left( 1 - \frac{\Delta Z}{Z_{ave}} + \left(\frac{\Delta Z}{Z_{ave}}\right)^{2} - \cdots \right) \begin{pmatrix} X \\ Y \end{pmatrix}
\]
When |ΔZ| ≪ Z_ave only the zero-order term remains, giving the weak-perspective projection. The error in image position is then x_err = x_p − x_wp:

\[
x_{err} = -\frac{f}{Z_{ave}} \, \frac{\Delta Z}{Z_{ave} + \Delta Z} \begin{pmatrix} X \\ Y \end{pmatrix}
\]

showing that a small focal length f, a small field of view (X/Z_ave and Y/Z_ave), and a small depth variation ΔZ all contribute to the validity of the model.
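The error x_err = x_p − x_wp can be checked against the exact perspective and weak-perspective projections; the numerical values below are illustrative, assuming numpy.

```python
import numpy as np

f, Zave = 2.0, 10.0                  # illustrative focal length and average depth
X, Y, dZ = 1.0, 0.5, 0.4             # a point at depth Z = Zave + dZ

xp = (f / (Zave + dZ)) * np.array([X, Y])    # perspective projection
xwp = (f / Zave) * np.array([X, Y])          # weak-perspective projection

# Error predicted by the analysis: -(f/Zave) * (dZ/(Zave+dZ)) * (X, Y)^T.
xerr_pred = -(f / Zave) * (dZ / (Zave + dZ)) * np.array([X, Y])
```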

1.5  The Orthographic Camera

The affine camera reduces to the case of orthographic (parallel) projection when M represents the first two rows of a rotation matrix. The simplest form is

\[
T_{orth} =
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
\]

yielding

\[
M_{orth} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}
\qquad\text{and}\qquad
\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} X \\ Y \end{pmatrix}
\]
2  Affine Multiple Views Geometry

- Epipolar geometry
- Structure determination:
  - Affine structure
  - Euclidean structure

2.1  Affine Epipolar Geometry

When the perspective effects are small, the problem of locating perspective epipolar lines becomes ill-conditioned. In such cases it is convenient to assume the parallel projection model of the affine camera, which explicitly models the ambiguities.

The affine epipolar constraint can be described in terms of the affine fundamental matrix F_A as p'^T F_A p = 0, i.e.,

\[
\begin{pmatrix} x'_i & y'_i & 1 \end{pmatrix}
\begin{pmatrix} 0 & 0 & a \\ 0 & 0 & b \\ c & d & e \end{pmatrix}
\begin{pmatrix} x_i \\ y_i \\ 1 \end{pmatrix} = 0
\]

where p' = (x', y', 1)^T and p = (x, y, 1)^T are homogeneous 3-vectors representing corresponding image points in the two views (see Shapiro, Zisserman and Brady).

[Figure 1: 1D image formation with image plane at Z = f. X_p, X_wp and X_orth are the perspective, weak-perspective and orthographic projections of a scene point X respectively; Z_ave is the depth of the average depth plane.]
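A small sketch of evaluating the constraint; the entries a, ..., e are illustrative, and `epipolar_residual` is a hypothetical helper name, assuming numpy.

```python
import numpy as np

# Affine fundamental matrix: only the five entries (a, b, c, d, e) are nonzero.
# The values here are illustrative, not derived from real cameras.
a, b, c, d, e = 1.0, -2.0, 0.5, 1.5, -3.0
FA = np.array([[0.0, 0.0, a],
               [0.0, 0.0, b],
               [c, d, e]])

def epipolar_residual(xp, yp, x, y):
    """Evaluate p'^T F_A p; this is zero for a genuinely corresponding pair."""
    p_prime = np.array([xp, yp, 1.0])
    p = np.array([x, y, 1.0])
    return p_prime @ FA @ p
```

Expanding the product shows the residual is exactly a x' + b y' + c x + d y + e, the linear form derived below.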
To derive the above, we write M as (B | b), where B is a general (non-singular) 2 × 2 matrix and b is a 2-vector. The projection equation then gives

\[
x_i = B \begin{pmatrix} X_i \\ Y_i \end{pmatrix} + Z_i\, b + t
\]

Similarly, writing M'A as (B' | b'), where X'_i = A X_i + D describes the motion of the scene between the views, we have

\[
x'_i = B' \begin{pmatrix} X_i \\ Y_i \end{pmatrix} + Z_i\, b' + M'D + t'
\]

Eliminating the scene coordinates (X_i, Y_i) gives

\[
x'_i = \Psi\, x_i + Z_i\, d + \epsilon
\]

where \Psi = B'B^{-1}, d = b' - B'B^{-1}b and \epsilon = t' - \Psi t + M'D.

[Figure 2: Affine and perspective epipolar geometries.]


and d are functions only of camera parameters {M, M0 } and the motion transformation A, while explains the motion of the reference point (centroid) and depend
on the translation of the object D and the camera origins t and t0 .
This equation shows that x0i associated with xi lies on a line (epipolar) on the
second image with offset xi + and direction d. The unknown depth Zi determines
how far along this line does x0i lie. Inverting the equation we obtain
xi = 1 x0i Zi 1 d 1
The translation invariant versions of these equations are
x0i =
xi + Zi d
1
xi = x0i Zi 1 d
We can eliminate Zi from the above equations and obtain a single equation in terms
of image measurables:
(x0i xi ).d = 0
where, d = (dx , dy ) and its perpendicular d = (dy , dx ). This equation can be
written as
ax0i + byi0 + cxi + dyi + e = 0
where (a, b)T = d, (c, d)T = T d and e = T d . This gives us

x0i yi0

0 0 a
xi

0
0
b
1

yi = 0
1
c d e
i

2.1.1  Computation of Affine Epipolar Geometry

Given n point correspondences in the two views, the affine fundamental matrix can be computed using orthogonal regression, by minimizing

\[
\frac{1}{|n|^2} \sum_{i=0}^{n-1} (r_i \cdot n + e)^2
\]

Here r_i = (x'_i, y'_i, x_i, y_i)^T and n = (a, b, c, d)^T. The minimization finds the hyper-plane that globally minimizes the sum of squared perpendicular distances between the points r_i and the hyper-plane. Defining

\[
v_i = r_i - \bar{r}
\]

where \bar{r} is the centroid of the r_i, and

\[
W = \sum_{i=0}^{n-1} v_i v_i^T
\]

it can be shown that the solution satisfies the eigenvector equation

\[
W n = \lambda n, \qquad |n|^2 = 1
\]

with n the eigenvector of W corresponding to the smallest eigenvalue \lambda.
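The orthogonal-regression solution can be sketched as follows; `affine_epipolar_fit` is a hypothetical helper name, assuming numpy. Centring the data folds the optimal offset e into the centroid.

```python
import numpy as np

def affine_epipolar_fit(x, y, xp, yp):
    """Fit n = (a, b, c, d) and offset e by orthogonal regression.

    Each correspondence contributes r_i = (x'_i, y'_i, x_i, y_i); the solution
    is the unit eigenvector of W = sum_i v_i v_i^T (v_i = r_i - rbar) with the
    smallest eigenvalue, and e = -rbar . n.
    """
    R = np.column_stack([xp, yp, x, y])      # rows are the r_i
    rbar = R.mean(axis=0)
    V = R - rbar                             # centred measurements v_i
    W = V.T @ V
    evals, evecs = np.linalg.eigh(W)         # eigenvalues in ascending order
    n = evecs[:, 0]                          # smallest-eigenvalue eigenvector
    e = -rbar @ n
    return n, e
```

On noise-free synthetic correspondences lying exactly on a hyper-plane, the fitted (n, e) reproduces a zero residual for every point.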

2.2  Affine Structure

Consider a set of n 3D world points X_i (i = 0, ..., n − 1) in affine (non-rigid) motion described by

\[
X'_i = A X_i + D
\]

where X'_i is the new 3D position of the i-th point, A is an arbitrary 3 × 3 matrix and D is a 3-vector representing translation. We remove the effects of translation by registering the points with respect to a reference point X_0, obtaining

\[
\Delta X = X - X_0
\qquad\text{and}\qquad
\Delta X' = X' - X'_0 = A\, \Delta X
\]

Affine projections

If the affine camera models for the two views are given by the parameters {M, t} and {M', t'} respectively, then for the registered coordinates

\[
\Delta x = M\, \Delta X
\qquad\text{and}\qquad
\Delta x' = M'\, \Delta X' = M'A\, \Delta X
\]

Basis and affine structure

Now consider four non-coplanar scene points X_0, ..., X_3, with X_0 as the origin. Define three axis vectors E_j = X_j − X_0 for j = 1, ..., 3. Then {E_1, E_2, E_3} forms a basis for the 3D affine space, and any of the n vectors can be represented in this basis as

\[
X_i - X_0 = \alpha_i E_1 + \beta_i E_2 + \gamma_i E_3
\qquad\text{for } i = 1, \ldots, n-1
\]

where (\alpha_i, \beta_i, \gamma_i) are the 3D affine coordinates of X_i. We call these 3D affine coordinates the affine structure of the point X_i. It can be shown that the affine structure remains invariant to affine motion with respect to the transformed basis, that is,

\[
\Delta X_i = \alpha_i E_1 + \beta_i E_2 + \gamma_i E_3
\qquad\text{and}\qquad
\Delta X'_i = \alpha_i E'_1 + \beta_i E'_2 + \gamma_i E'_3
\tag{1}
\]

where E'_j = A E_j.
Computation of affine structure

From the above we obtain

\[
\Delta x_i = x_i - x_0 = \alpha_i e_1 + \beta_i e_2 + \gamma_i e_3, \qquad
\Delta x'_i = x'_i - x'_0 = \alpha_i e'_1 + \beta_i e'_2 + \gamma_i e'_3
\tag{2}
\]

where e_j = M E_j and e'_j = M' E'_j.


Thus, to compute the affine structure, we require two images with at least four
points in correspondence, i.e.,
{x0 , x1 , x2 , x3 }

and

{x00 , x01 , x02 , x03 }

These correspondences establish the bases {e1 , e2 , e3 } and {e01 , e02 , e03 } provided
no two axes, in either images, are collinear. Each additional point gives four
equations in 3 unknowns
"

xi
x0i

"

e1 e2 e3
e01 e02 e03

i
i

and the affine structure can be computed. The redundancy in the system enables
us to verify whether the affine projection model is valid.
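A least-squares sketch of this computation; `affine_structure` is a hypothetical helper name, assuming numpy. Points 0 to 3 supply the bases and each further point is solved from its four equations.

```python
import numpy as np

def affine_structure(x, xp):
    """Recover the affine coordinates (alpha, beta, gamma) of each extra point.

    x, xp: (n, 2) arrays of corresponding image points in the two views;
    rows 0..3 are the four basis points (row 0 is the origin).
    """
    e = (x[1:4] - x[0]).T               # 2x3: basis vectors e_1, e_2, e_3
    ep = (xp[1:4] - xp[0]).T            # 2x3: basis vectors e'_1, e'_2, e'_3
    A = np.vstack([e, ep])              # 4x3 system matrix for every point
    coords = []
    for i in range(4, len(x)):
        b = np.concatenate([x[i] - x[0], xp[i] - xp[0]])   # the four equations
        abc, *_ = np.linalg.lstsq(A, b, rcond=None)        # 3 unknowns
        coords.append(abc)
    return np.array(coords)
```

On synthetic data generated from two random affine cameras and a random affine motion, the recovered coordinates match the true 3D affine coordinates.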

2.2.1  Tomasi and Kanade factorization

When n point correspondences (n ≥ 4) over k views (k ≥ 2) are available, we use the factorization procedure of Tomasi and Kanade to obtain the bases and the structure. Their formulation can be written as an extension of the above equation:

\[
\begin{pmatrix}
\Delta x_1 & \Delta x_2 & \cdots & \Delta x_{n-1} \\
\Delta x'_1 & \Delta x'_2 & \cdots & \Delta x'_{n-1} \\
\Delta x''_1 & \Delta x''_2 & \cdots & \Delta x''_{n-1} \\
\vdots & & & \vdots
\end{pmatrix}
=
\begin{pmatrix}
e_1 & e_2 & e_3 \\
e'_1 & e'_2 & e'_3 \\
e''_1 & e''_2 & e''_3 \\
\vdots & &
\end{pmatrix}
\begin{pmatrix}
\alpha_1 & \alpha_2 & \cdots & \alpha_{n-1} \\
\beta_1 & \beta_2 & \cdots & \beta_{n-1} \\
\gamma_1 & \gamma_2 & \cdots & \gamma_{n-1}
\end{pmatrix}
\]

The measurement matrix W on the left represents the n point correspondences in the k views and has dimensions 2k × (n − 1). The matrices on the right, M̂ (2k × 3) and Ŝ (3 × (n − 1)), are called the motion and structure matrices respectively. The matrix Ŝ gives the invariant affine structure of the n points in motion, and the i-th pair of rows of M̂, denoted M̂(i), along with the corresponding image center x_0(i), gives the projection parameters {M̂(i), x_0(i)} for the i-th view.

Clearly, in the absence of noise, W must have rank at most 3. Tomasi and Kanade perform a singular value decomposition of W and use the 3 largest singular values (and the associated singular vectors) to construct M̂ and Ŝ. If the SVD returns a rank greater than 3, the affine projection model is invalid, and we use this as a check. The rank-2 case signifies either a planar object (which is not possible for facial images!) or degenerate motion. In such a case the 3D affine structure cannot be determined and the views are related by 2D affine transformations; the 2D affine structure can then be recovered, in only two axes, using the same formalism.
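The factorization step can be sketched with an SVD, assuming numpy; W here is the already centroid-registered measurement matrix, and `factorize` is a hypothetical helper name.

```python
import numpy as np

def factorize(W):
    """Rank-3 factorization W = Mhat @ Shat of the measurement matrix.

    W is 2k x (n-1): stacked, centroid-registered image coordinates over k views.
    Keeping the 3 largest singular values gives the motion and structure matrices.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    Mhat = U[:, :3] * s[:3]           # 2k x 3 motion matrix (columns scaled)
    Shat = Vt[:3]                     # 3 x (n-1) affine structure matrix
    return Mhat, Shat
```

The factorization is only determined up to an affine change of basis: for any invertible 3 × 3 Q, (M̂ Q, Q⁻¹ Ŝ) is an equally valid decomposition.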
2.2.2  Image transfer and linear combination of views

Once the affine structure has been computed, it can be used to generate a new view of the object (transfer) by simply selecting a new spanning set {e''_1, e''_2, e''_3}; no camera calibration is needed. Note that this is the same as choosing a new projection matrix M'':

\[
x''_i = x''_0 + \alpha_i e''_1 + \beta_i e''_2 + \gamma_i e''_3
\]

If the affine structure itself is not of interest (as in graphics applications), it is possible to bypass the affine coordinates and express the new image coordinates x'' directly in terms of the first two sets of image coordinates x and x'. One can write the projection equations in the first two views as

\[
\Delta x = G\, \Delta X, \qquad \Delta x' = G'\, \Delta X
\]

where G and G' are 2 × 3 matrices with rows {G_1, G_2} and {G'_1, G'_2} respectively. The new view can similarly be written as

\[
\Delta x'' = G''\, \Delta X
\]

where G'' has rows {G''_1, G''_2}.

Now, any three rows of {G_1, G_2, G'_1, G'_2} generically define a linearly independent spanning set for the 3D affine space, say {G_1, G_2, G'_1}. So there exist scalars a_1, a_2, a_3 and b_1, b_2, b_3 such that

\[
G'' =
\begin{pmatrix} a_1 & a_2 \\ b_1 & b_2 \end{pmatrix} G +
\begin{pmatrix} a_3 & 0 \\ b_3 & 0 \end{pmatrix} G'
\]

Then \Delta x'' = G''\Delta X gives

\[
\Delta x'' =
\begin{pmatrix} a_1 & a_2 \\ b_1 & b_2 \end{pmatrix} \Delta x +
\begin{pmatrix} a_3 & 0 \\ b_3 & 0 \end{pmatrix} \Delta x'
=
\begin{pmatrix} a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \end{pmatrix}
\begin{pmatrix} \Delta x \\ \Delta y \\ \Delta x' \end{pmatrix}
\]

Thus, if images of an object are obtained using affine cameras, a novel view can be expressed as a linear combination of the given views (this is useful for object recognition).
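A numerical check of the linear-combination property under illustrative random cameras; registered coordinates are assumed throughout, so translations are ignored, and all values below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
G = rng.normal(size=(2, 3))      # view-1 camera (rows G_1, G_2)
Gp = rng.normal(size=(2, 3))     # view-2 camera (rows G'_1, G'_2)
Gpp = rng.normal(size=(2, 3))    # novel view to synthesise (rows G''_1, G''_2)

# {G_1, G_2, G'_1} generically spans the 3D space; express each row of G''
# in that basis.  C holds the six coefficients a_1..a_3, b_1..b_3.
B = np.vstack([G, Gp[0]])        # 3x3 basis matrix
C = Gpp @ np.linalg.inv(B)       # so that Gpp = C @ B

X = rng.normal(size=3)           # an arbitrary (registered) scene point
x, xp, xpp = G @ X, Gp @ X, Gpp @ X

# The novel image point is a linear combination of (x, y, x') alone:
xpp_pred = C @ np.array([x[0], x[1], xp[0]])
```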
2.2.3  Change of basis

Given the current spanning sets {e_1, e_2, e_3} and {e'_1, e'_2, e'_3} in the two images, we have

\[
\begin{pmatrix} \Delta x_i \\ \Delta x'_i \end{pmatrix}
=
\begin{pmatrix} e_1 & e_2 & e_3 \\ e'_1 & e'_2 & e'_3 \end{pmatrix}
\begin{pmatrix} \alpha_i \\ \beta_i \\ \gamma_i \end{pmatrix}
\]

Suppose that we now wish to express the same set of points using alternative spanning sets {h_1, h_2, h_3} and {h'_1, h'_2, h'_3}. The new affine coordinates (\tilde\alpha_i, \tilde\beta_i, \tilde\gamma_i) must obey

\[
\begin{pmatrix} \Delta x_i \\ \Delta x'_i \end{pmatrix}
=
\begin{pmatrix} h_1 & h_2 & h_3 \\ h'_1 & h'_2 & h'_3 \end{pmatrix}
\begin{pmatrix} \tilde\alpha_i \\ \tilde\beta_i \\ \tilde\gamma_i \end{pmatrix}
\]

2.2.4  Koenderink and Van Doorn

Instead of choosing E_3 = X_3 − X_0, Koenderink and Van Doorn choose E_3 = k̂, the direction of viewing in the first frame. Since e_3 = M E_3 = 0, the projection of E_3 in the first image is degenerate, reducing it to a single point. Thus only two basis vectors are chosen in the first image:

\[
\Delta x_i = \alpha_i e_1 + \beta_i e_2
\]

[Figure 3: Affine structure from motion, showing the reference plane and the two viewing directions V_1 and V_2.]


In the second image the third axis vector is no longer degenerate; it is given by e'_3 = M'E'_3 = M'A E_3. In fact, e'_3 is an epipolar direction. If we use e'_1 and e'_2 to predict the position x̃'_i where each point would appear in image 2 if it lay on the plane spanned by {E_1, E_2}, we get

\[
\tilde{x}'_i = x'_0 + \alpha_i e'_1 + \beta_i e'_2
\]

The disparity between this predicted position and the observed position

\[
x'_i = x'_0 + \alpha_i e'_1 + \beta_i e'_2 + \gamma_i e'_3
\]

is due solely to the \gamma_i component:

\[
x'_i - \tilde{x}'_i = \gamma_i e'_3
\]

2.3  Rigid Reconstruction

2.3.1  Assumptions

- Rigid transformation (isometry)
- Affine projection
- Metric constructions

2.3.2  Procedure

[Figure 4: Euclidean reconstruction, showing the image plane and the fronto-parallel plane.]

1. Translation in the fronto-parallel plane merely produces a shift in the projections. This can be factored out by bringing the two projections of the reference point O into coincidence.

2. Rotation can be decomposed into (i) a rotation in the image plane (cyclo-rotation) and (ii) a rotation about an axis in the fronto-parallel plane. The projection of the third affine frame vector is the projection of a plane perpendicular to the axis of rotation in the fronto-parallel plane. One can reconstruct this projection in the first view (an affine construction only) and factor out the relative rotation in the two images. This yields the cyclo-rotation.

3. Since the axis of rotation is known in both views, one can find the overall scale difference due to translation in depth. Points on the axis of rotation do not rotate. Consider the projection of all image points onto this axis. If these projections differ in the two views, they must differ by only a constant scale factor; otherwise the rigidity assumption is falsified.
4. Now the two views differ only by a rotation about an axis in the fronto-parallel plane. Define a Euclidean frame (e_1, e_2, e_3) such that e_1, e_2, e_3 are unit vectors with e_1 along the axis of rotation and e_3 along the line of sight.

   Let G_1 e_1 + G_2 e_2 denote the depth gradient of a plane in the object; that is, the depth of a point ξ e_1 + η e_2 in the image, with respect to the fronto-parallel plane, is G_1 ξ + G_2 η. Note that

   G_1 = tan σ cos τ
   G_2 = tan σ sin τ

   where σ is the slant and τ is the tilt of the plane.

   Consider any triangle OXY in the plane, and let the coordinates of X and Y be (X_1, X_2) and (Y_1, Y_2) respectively. Then the third coordinates must be

   X_3 = G_1 X_1 + G_2 X_2
   Y_3 = G_1 Y_1 + G_2 Y_2

   For a given turn θ the rotation can be represented by

   \[
   \begin{pmatrix}
   1 & 0 & 0 \\
   0 & \cos\theta & -\sin\theta \\
   0 & \sin\theta & \cos\theta
   \end{pmatrix}
   \]

   Of the three transformed coordinates, the first is trivially unchanged and the third is not observable. The second coordinate is observable, and the equations are

   X_2^1 = X_2^0 cos θ − sin θ (X_1^0 G_1 + X_2^0 G_2)
   Y_2^1 = Y_2^0 cos θ − sin θ (Y_1^0 G_1 + Y_2^0 G_2)

   where the upper indices label the views and the lower indices label the components. Because the turn θ is unknown, we eliminate it from these equations to obtain a single equation in (G_1, G_2). This equation represents a one-parameter family of solutions for the two-view case, the parameter being the unknown turn θ. The equation is quadratic in (G_1, G_2) with the linear term absent, and represents a hyperbola in the (G_1, G_2) space (please derive it).

5. Repeating the steps above between the second and a third view, we obtain a pair of two-view solutions, each representing a one-parameter family. The one-parameter families for the 0–1 transition and the 1–2 transition are represented by hyperbolic loci in the gradient space. The pair of hyperbolae has either two or four intersections. The case of no intersection occurs only in the non-rigid case: if the motion is rigid, there has to be at least one solution, and hence a pair of them. The intersections represent either one or two pairs of solutions that are related through a reflection in the fronto-parallel plane.
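The observable-coordinate equations in step 4 can be verified numerically; the slant, tilt, turn and test point below are illustrative values, assuming numpy.

```python
import numpy as np

# Illustrative plane orientation (slant sigma, tilt tau) and turn theta.
sigma, tau, theta = 0.4, 1.1, 0.3
G1 = np.tan(sigma) * np.cos(tau)   # depth gradient components of the plane
G2 = np.tan(sigma) * np.sin(tau)

X1, X2 = 0.7, -0.2                 # image coordinates of a point in the plane
X3 = G1 * X1 + G2 * X2             # its depth relative to the fronto-parallel plane

# Rotation by theta about the axis e_1 (the axis of rotation).
R = np.array([[1.0, 0.0, 0.0],
              [0.0, np.cos(theta), -np.sin(theta)],
              [0.0, np.sin(theta), np.cos(theta)]])
Xrot = R @ np.array([X1, X2, X3])

# The observable second coordinate predicted by the step-4 equation:
X2_observed = X2 * np.cos(theta) - np.sin(theta) * (X1 * G1 + X2 * G2)
```

The first coordinate is unchanged by the rotation and the second matches the step-4 formula, as the equations state.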