
Int J Comput Vis (2011) 92: 1-31

DOI 10.1007/s11263-010-0390-2
A Database and Evaluation Methodology for Optical Flow
Simon Baker · Daniel Scharstein · J.P. Lewis · Stefan Roth · Michael J. Black · Richard Szeliski
Received: 18 December 2009 / Accepted: 20 September 2010 / Published online: 30 November 2010
© The Author(s) 2010. This article is published with open access at Springerlink.com
Abstract The quantitative evaluation of optical flow algorithms by Barron et al. (1994) led to significant advances in performance. The challenges for optical flow algorithms today go beyond the datasets and evaluation methods proposed in that paper. Instead, they center on problems associated with complex natural scenes, including nonrigid motion, real sensor noise, and motion discontinuities. We propose a new set of benchmarks and evaluation methods for the next generation of optical flow algorithms. To that end, we contribute four types of data to test different aspects of optical flow algorithms: (1) sequences with nonrigid motion where the ground-truth flow is determined by
A preliminary version of this paper appeared in the IEEE International
Conference on Computer Vision (Baker et al. 2007).
S. Baker · R. Szeliski
Microsoft Research, Redmond, WA, USA
S. Baker
e-mail: sbaker@microsoft.com
R. Szeliski
e-mail: szeliski@microsoft.com
D. Scharstein (✉)
Middlebury College, Middlebury, VT, USA
e-mail: schar@middlebury.edu
J.P. Lewis
Weta Digital, Wellington, New Zealand
e-mail: zilla@computer.org
S. Roth
TU Darmstadt, Darmstadt, Germany
e-mail: sroth@cs.tu-darmstadt.de
M.J. Black
Brown University, Providence, RI, USA
e-mail: black@cs.brown.edu
tracking hidden fluorescent texture, (2) realistic synthetic sequences, (3) high frame-rate video used to study interpolation error, and (4) modified stereo sequences of static scenes. In addition to the average angular error used by Barron et al., we compute the absolute flow endpoint error, measures for frame interpolation error, improved statistics, and results at motion discontinuities and in textureless regions. In October 2007, we published the performance of several well-known methods on a preliminary version of our data to establish the current state of the art. We also made the data freely available on the web at http://vision.middlebury.edu/flow/. Subsequently a number of researchers have uploaded their results to our website and published papers using the data. A significant improvement in performance has already been achieved. In this paper we analyze the results obtained to date and draw a large number of conclusions from them.
Keywords Optical flow · Survey · Algorithms · Database · Benchmarks · Evaluation · Metrics
1 Introduction
As a subfield of computer vision matures, datasets for quantitatively evaluating algorithms are essential to ensure continued progress. Many areas of computer vision, such as stereo (Scharstein and Szeliski 2002), face recognition (Philips et al. 2005; Sim et al. 2003; Gross et al. 2008; Georghiades et al. 2001), and object recognition (Fei-Fei et al. 2006; Everingham et al. 2009), have challenging datasets to track the progress made by leading algorithms and to stimulate new ideas. Optical flow was actually one of the first areas to have such a benchmark, introduced by Barron et al. (1994). The field benefited greatly from this
study, which led to rapid and measurable progress. To continue the rapid progress, new and more challenging datasets are needed to push the limits of current technology, reveal where current algorithms fail, and evaluate the next generation of optical flow algorithms. Such an evaluation dataset for optical flow should ideally consist of complex real scenes with all the artifacts of real sensors (noise, motion blur, etc.). It should also contain substantial motion discontinuities and nonrigid motion. Of course, the image data should be paired with dense, subpixel-accurate, ground-truth flow fields.
The presence of nonrigid or independent motion makes collecting a ground-truth dataset for optical flow far harder than for stereo, say, where structured light (Scharstein and Szeliski 2002) or range scanning (Seitz et al. 2006) can be used to obtain ground truth. Our solution is to collect four different datasets, each satisfying a different subset of the desirable properties above. The combination of these datasets provides a basis for a thorough evaluation of current optical flow algorithms. Moreover, the relative performance of algorithms on the different datatypes may stimulate further research. In particular, we collected the following four types of data:
Real Imagery of Nonrigidly Moving Scenes: Dense ground-truth flow is obtained using hidden fluorescent texture painted on the scene. We slowly move the scene, at each point capturing separate test images (in visible light) and ground-truth images with trackable texture (in UV light). Note that a related technique is being used commercially for motion capture (Mova LLC 2004) and Tappen et al. (2006) recently used certain wavelengths to hide ground truth in intrinsic images. Another form of hidden markers was also used in Ramnath et al. (2008) to provide a sparse ground-truth alignment (or flow) of face images. Finally, Liu et al. recently proposed a method to obtain ground truth using human annotation (Liu et al. 2008).
Realistic Synthetic Imagery: We address the limitations of simple synthetic sequences such as Yosemite (Barron et al. 1994) by rendering more complex scenes with larger motion ranges, more realistic texture, independent motion, and more complex occlusions.
Imagery for Frame Interpolation: Intermediate frames are withheld and used as ground truth. In a wide class of applications, such as video re-timing, novel-view generation, and motion-compensated compression, what is important is not how well the flow matches the ground-truth motion, but how well intermediate frames can be predicted using the flow (Szeliski 1999).
Real Stereo Imagery of Rigid Scenes: Dense ground truth is captured using structured light (Scharstein and Szeliski 2003). The data is then adapted to be more appropriate for optical flow by cropping to make the disparity range roughly symmetric.
We collected enough data to be able to split our collection into a training set (12 datasets) and a final evaluation set (12 datasets). The training set includes the ground truth and is meant to be used for debugging, parameter estimation, and possibly even learning (Sun et al. 2008; Li and Huttenlocher 2008). The ground truth for the final evaluation set is not publicly available (with the exception of the Yosemite sequence, which is included in the test set to allow some comparison with algorithms published prior to the release of our data).
We also extend the set of performance measures and the evaluation methodology of Barron et al. (1994) to focus attention on current algorithmic problems:
Error Metrics: We report both average angular error (Barron et al. 1994) and flow endpoint error (pixel distance) (Otte and Nagel 1994). For image interpolation, we compute the residual RMS error between the interpolated image and the ground-truth image. We also report a gradient-normalized RMS error (Szeliski 1999). (A small sketch of the two flow metrics follows this list.)
Statistics: In addition to computing averages and standard
deviations as in Barron et al. (1994), we also compute
robustness measures (Scharstein and Szeliski 2002) and
percentile-based accuracy measures (Seitz et al. 2006).
Region Masks: Following Scharstein and Szeliski (2002),
we compute the error measures and their statistics over
certain masked regions of research interest. In particular,
we compute the statistics near motion discontinuities and
in textureless regions.
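As a concrete illustration of the two flow error metrics, the following is a minimal NumPy sketch (not the evaluation code used on the website) that computes the per-pixel endpoint error and the angular error of Barron et al.; the array names are illustrative.

```python
import numpy as np

def flow_errors(u, v, gt_u, gt_v):
    """Per-pixel endpoint error and angular error between an estimated
    flow (u, v) and the ground-truth flow (gt_u, gt_v).
    All inputs are 2D arrays of the same shape."""
    # Endpoint error: Euclidean distance between the two flow vectors.
    ee = np.sqrt((u - gt_u) ** 2 + (v - gt_v) ** 2)

    # Angular error (Barron et al. 1994): angle between the space-time
    # direction vectors (u, v, 1) and (gt_u, gt_v, 1), in degrees.
    num = 1.0 + u * gt_u + v * gt_v
    den = np.sqrt(1.0 + u ** 2 + v ** 2) * np.sqrt(1.0 + gt_u ** 2 + gt_v ** 2)
    ae = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
    return ee, ae
```

The statistics and region-mask measures described above would then be computed over these per-pixel error maps, restricted to the relevant masks.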
Note that we require flow algorithms to estimate a dense flow field. An alternate approach might be to allow algorithms to provide a confidence map, or even to return a sparse or incomplete flow field. Scoring such outputs is problematic, however. Instead, we expect algorithms to generate a flow estimate everywhere (for instance, using internal confidence measures to fill in areas with uncertain flow estimates due to lack of texture).
In October 2007 we published the performance of several well-known algorithms on a preliminary version of our data to establish the current state of the art (Baker et al. 2007). We also made the data freely available on the web at http://vision.middlebury.edu/flow/. Subsequently a large number of researchers have uploaded their results to our website and published papers using the data. A significant improvement in performance has already been achieved. In this paper we present both results obtained by classic algorithms and results obtained since the publication of our preliminary data. In addition to summarizing the overall conclusions of the currently uploaded results, we also examine how the results vary: (1) across the metrics, statistics, and region masks, (2) across the various datatypes and datasets, (3) from flow estimation to interpolation, and (4) depending on the components of the algorithms.
The remainder of this paper is organized as follows. We begin in Sect. 2 with a survey of existing optical flow algorithms, benchmark databases, and evaluations. In Sect. 3 we describe the design and collection of our database, and briefly discuss the pros and cons of each dataset. In Sect. 4 we describe the evaluation metrics. In Sect. 5 we present the experimental results and discuss the major conclusions that can be drawn from them.
2 Related Work and Taxonomy of Optical Flow
Algorithms
Optical flow estimation is an extensive field. A fully comprehensive survey is beyond the scope of this paper. In this related work section, our goals are: (1) to present a taxonomy of the main components in the majority of existing optical flow algorithms, and (2) to focus primarily on recent work and place the contributions of this work in the context of our taxonomy. Note that our taxonomy is similar to those of Stiller and Konrad (1999) for optical flow and Scharstein and Szeliski (2002) for stereo. For more extensive coverage of older work, the reader is referred to previous surveys such as those by Aggarwal and Nandhakumar (1988), Barron et al. (1994), Otte and Nagel (1994), Mitiche and Bouthemy (1996), and Stiller and Konrad (1999).
We first define what we mean by optical flow. Following Horn's (1986) taxonomy, the motion field is the 2D projection of the 3D motion of surfaces in the world, whereas the optical flow is the apparent motion of the brightness patterns in the image. These two motions are not always the same and, in practice, the goal of 2D motion estimation is application dependent. In frame interpolation, it is preferable to estimate apparent motion so that, for example, specular highlights move in a realistic way. On the other hand, in applications where the motion is used to interpret or reconstruct the 3D world, the motion field is what is desired.
In this paper, we consider both motion field estimation and apparent motion estimation, referring to them collectively as optical flow. The ground truth for most of our datasets is the true motion field, and hence this is how we define and evaluate optical flow accuracy. For our interpolation datasets, the ground truth consists of images captured at an intermediate time instant. For this data, our definition of optical flow is really the apparent motion.
We do, however, restrict attention to optical flow algorithms that estimate a separate 2D motion vector for each pixel in one frame of a sequence or video containing two or more frames. We exclude transparency, which requires multiple motions per pixel. We also exclude more global representations of the motion such as parametric motion estimates (Bergen et al. 1992).
Most existing optical flow algorithms pose the problem as the optimization of a global energy function that is the weighted sum of two terms:

E_{\mathrm{Global}} = E_{\mathrm{Data}} + \lambda E_{\mathrm{Prior}}.   (1)

The first term E_Data is the Data Term, which measures how consistent the optical flow is with the input images. We consider the choice of the data term in Sect. 2.1. The second term E_Prior is the Prior Term, which favors certain flow fields over others (for example, E_Prior often favors smoothly varying flow fields). We consider the choice of the prior term in Sect. 2.2. The optical flow is then computed by optimizing the global energy E_Global. We consider the choice of the optimization algorithm in Sects. 2.3 and 2.4. In Sect. 2.5 we consider a number of miscellaneous issues. Finally, in Sect. 2.6 we survey previous databases and evaluations.
2.1 Data Term
2.1.1 Brightness Constancy
The basis of the data term used by most algorithms is Brightness Constancy, the assumption that when a pixel flows from one image to another, its intensity or color does not change. This assumption combines a number of assumptions about the reflectance properties of the scene (e.g., that it is Lambertian), the illumination in the scene (e.g., that it is uniform; see Vedula et al. 2005), and about the image formation process in the camera (e.g., that there is no vignetting). If I(x, y, t) is the intensity of a pixel (x, y) at time t and the flow is (u(x, y, t), v(x, y, t)), Brightness Constancy can be written as:

I(x, y, t) = I(x + u, y + v, t + 1).   (2)
Linearizing (2) by applying a first-order Taylor expansion to the right-hand side yields the approximation:

I(x, y, t) \approx I(x, y, t) + u \frac{\partial I}{\partial x} + v \frac{\partial I}{\partial y} + 1 \cdot \frac{\partial I}{\partial t},   (3)

which simplifies to the Optical Flow Constraint equation:

u \frac{\partial I}{\partial x} + v \frac{\partial I}{\partial y} + \frac{\partial I}{\partial t} = 0.   (4)
Both Brightness Constancy and the Optical Flow Constraint equation provide just one constraint on the two unknowns at each pixel. This is the origin of the Aperture Problem and the reason that optical flow is ill-posed and must be regularized with a prior term (see Sect. 2.2).
The data term E_Data can be based on either Brightness Constancy in (2) or on the Optical Flow Constraint in (4). In either case, the equation is turned into an error per pixel,
the set of which is then aggregated over the image in some manner (see Sect. 2.1.2). If Brightness Constancy is used, it is generally converted to the Optical Flow Constraint during the derivation of most continuous optimization algorithms (see Sect. 2.3), which often involves the use of a Taylor expansion to linearize the energies. The two constraints are therefore essentially equivalent in practical algorithms (Brox et al. 2004).
An alternative to the assumption of constancy is that the signals (images) at times t and t + 1 are highly correlated (Pratt 1974; Burt et al. 1982). Various correlation constraints can be used for computing dense flow, including normalized cross correlation and Laplacian correlation (Burt et al. 1983; Glazer et al. 1983; Sun 1999).
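As an illustration of how the linearized constraint (4) is evaluated in practice, the following sketch computes a per-pixel residual from finite-difference image derivatives. The derivative filters here are one simple choice among many and are not those of any particular published method.

```python
import numpy as np

def ofc_residual(I0, I1, u, v):
    """Per-pixel residual of the Optical Flow Constraint (4):
    u * Ix + v * Iy + It, using simple finite differences."""
    # Spatial derivatives averaged over the two frames; temporal
    # derivative as a plain frame difference (one choice of filters).
    Ix = 0.5 * (np.gradient(I0, axis=1) + np.gradient(I1, axis=1))
    Iy = 0.5 * (np.gradient(I0, axis=0) + np.gradient(I1, axis=0))
    It = I1 - I0
    return u * Ix + v * Iy + It
```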
2.1.2 Choice of the Penalty Function
Equations (2) and (4) both provide one error per pixel, which leads to the question of how these errors are aggregated over the image. A baseline approach is to use an L2 norm as in the Horn and Schunck algorithm (Horn and Schunck 1981):

E_{\mathrm{Data}} = \sum_{x,y} \left( u \frac{\partial I}{\partial x} + v \frac{\partial I}{\partial y} + \frac{\partial I}{\partial t} \right)^2.   (5)
If (5) is interpreted probabilistically, the use of the L2 norm means that the errors in the Optical Flow Constraint are assumed to be Gaussian and IID. This assumption is rarely true in practice, particularly near occlusion boundaries where pixels at time t may not be visible at time t + 1. Black and Anandan (1996) present an algorithm that can use an arbitrary robust penalty function, illustrating their approach with the specific choice of a Lorentzian penalty function. A common choice by a number of recent algorithms (Brox et al. 2004; Wedel et al. 2008) is the L1 norm, which is sometimes approximated with a differentiable version:

E_{L_1} = \|E\|_1 = \sum_{x,y} |E_{x,y}| \approx \sum_{x,y} \sqrt{E_{x,y}^2 + \varepsilon^2},   (6)

where E is a vector of errors E_{x,y}, \|\cdot\|_1 denotes the L1 norm, and ε is a small positive constant. A variety of other penalty functions have been used.
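For concreteness, the following sketch shows the differentiable L1 approximation in (6) (often called a Charbonnier penalty) alongside the Lorentzian used by Black and Anandan (1996); the parameter values are illustrative.

```python
import numpy as np

def charbonnier(e, eps=1e-3):
    # Differentiable approximation to |e| from (6).
    return np.sqrt(e ** 2 + eps ** 2)

def lorentzian(e, sigma=1.0):
    # Robust penalty used by Black and Anandan (1996); sigma controls
    # where the influence of large errors begins to fall off.
    return np.log(1.0 + 0.5 * (e / sigma) ** 2)

def aggregate(residual, penalty=charbonnier):
    # Aggregate the per-pixel errors over the image, cf. (5) and (6).
    return np.sum(penalty(residual))
```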
2.1.3 Photometrically Invariant Features
Instead of using the raw intensity or color values in the images, it is also possible to use features computed from those images. In fact, some of the earliest optical flow algorithms used filtered images to reduce the effects of shadows (Burt et al. 1983; Anandan 1989). One recently popular choice (for example, used in Brox et al. 2004, among others) is to augment or replace (2) with a similar term based on the gradient of the image:

\nabla I(x, y, t) = \nabla I(x + u, y + v, t + 1).   (7)

Empirically the gradient is often more robust to (approximately additive) illumination changes than the raw intensities. Note, however, that (7) makes the additional assumption that the flow is locally translational; e.g., local scale changes, rotations, etc., can violate (7) even when (2) holds. It is also possible to use more complicated features than the gradient. For example, a Fields-of-Experts formulation is used in Sun et al. (2008) and SIFT features are used in Liu et al. (2008).
2.1.4 Modeling Illumination, Blur, and Other Appearance
Changes
The motivation for using features is to increase robustness to illumination and other appearance changes. Another approach is to estimate the change explicitly. For example, suppose g(x, y) denotes a multiplicative scale factor and b(x, y) an additive term that together model the illumination change between I(x, y, t) and I(x, y, t + 1). Brightness Constancy in (2) can be generalized to:

g(x, y)\, I(x, y, t) = I(x + u, y + v, t + 1) + b(x, y).   (8)

Note that putting g(x, y) on the left-hand side is preferable to putting it on the right-hand side as it can make optimization easier (Seitz and Baker 2009). Equation (8) is even more under-constrained than (2), with four unknowns per pixel rather than two. It can, however, be solved by putting an appropriate prior on the two components of the illumination change model g(x, y) and b(x, y) (Negahdaripour 1998; Seitz and Baker 2009). Explicit illumination modeling can be generalized in several ways, for example to model the changes physically over a longer time interval (Haussecker and Fleet 2000) or to model blur (Seitz and Baker 2009).
2.1.5 Color and Multi-Band Images
Another issue, addressed by a number of authors (Ohta 1989; Markandey and Flinchbaugh 1990; Golland and Bruckstein 1997), is how to modify the data term for color or multi-band images. The simplest approach is to add a data term for each band, for example performing the summation in (5) over the color bands as well as the pixel coordinates x, y. More sophisticated approaches include using the HSV color space and treating the bands differently (e.g., by using different weights or norms) (Zimmer et al. 2009).
2.2 Prior Term
The data term alone is ill-posed with fewer constraints than
unknowns. It is therefore necessary to add a prior to fa-
vor one possible solution over another. Generally speaking,
while most priors are smoothness priors, a wide variety of
choices are possible.
2.2.1 First Order
Arguably the simplest prior is to favor small first-order derivatives (gradients) of the flow field. If we use an L2 norm, then we might, for example, define:

E_{\mathrm{Prior}} = \sum_{x,y} \left[ \left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2 + \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2 \right].   (9)
The combination of (5) and (9) defines the energy used by Horn and Schunck (1981). Given more than two frames in the video, it is also possible to add temporal smoothness terms ∂u/∂t and ∂v/∂t to (9) (Murray and Buxton 1987; Black and Anandan 1991; Brox et al. 2004). Note, however, that the temporal terms need to be weighted differently from the spatial ones.
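The following is a minimal sketch of the discrete first-order prior (9) on a sampled flow field, using central differences via np.gradient; the discretization is one choice among several.

```python
import numpy as np

def first_order_prior(u, v):
    """L2 first-order smoothness energy (9) on a discrete flow field,
    with derivatives approximated by central differences."""
    ux, uy = np.gradient(u, axis=1), np.gradient(u, axis=0)
    vx, vy = np.gradient(v, axis=1), np.gradient(v, axis=0)
    return np.sum(ux ** 2 + uy ** 2 + vx ** 2 + vy ** 2)
```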
2.2.2 Choice of the Penalty Function
As for the data term in Sect. 2.1.2, under a probabilistic interpretation, the use of an L2 norm assumes that the gradients of the flow field are Gaussian and IID. Again, this assumption is violated in practice and so a wide variety of other penalty functions have been used. The algorithm by Black and Anandan (1996) also uses a first-order prior, but can use an arbitrary robust penalty function on the prior term rather than the L2 norm in (9). While Black and Anandan (1996) use the same Lorentzian penalty function for both the data and spatial term, there is no need for them to be the same. The L1 norm is also a popular choice of penalty function (Brox et al. 2004; Wedel et al. 2008). When the L1 norm is used to penalize the gradients of the flow field, the formulation falls in the class of Total Variation (TV) methods.
There are two common ways such robust penalty functions are used. One approach is to apply the penalty function separately to each derivative and then to sum up the results. The other approach is to first sum up the squares (or absolute values) of the gradients and then apply a single robust penalty function. Some algorithms use the first approach (Black and Anandan 1996), while others use the second (Bruhn et al. 2005; Brox et al. 2004; Wedel et al. 2008).
Note that some penalty (log probability) functions have probabilistic interpretations related to the distribution of flow derivatives (Roth and Black 2007).
2.2.3 Spatial Weighting
One popular refinement for the prior term is to weight the penalty function with a spatially varying function. One particular example is to vary the weight depending on the gradient of the image:

E_{\mathrm{Prior}} = \sum_{x,y} w(\nabla I) \left[ \left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2 + \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2 \right].   (10)
Equation (10) could be used to reduce the weight of the prior at edges (high |∇I|) because there is a greater likelihood of a flow discontinuity at an intensity edge than inside a smooth region. The weight can also be a function of an over-segmentation of the image, rather than the gradient, for example down-weighting the prior between different segments (Seitz and Baker 2009).
2.2.4 Anisotropic Smoothness
In (10) the weighting function is isotropic, treating all directions equally. A variety of approaches weight the smoothness prior anisotropically. For example, Nagel and Enkelmann (1986) and Werlberger et al. (2009) weight the direction along the image gradient less than the direction orthogonal to it, and Sun et al. (2008) learn a Steerable Random Field to define the weighting. Zimmer et al. (2009) perform a similar anisotropic weighting, but the directions are defined by the data constraint rather than the image gradient.
2.2.5 Higher-Order Priors
The first-order priors in Sect. 2.2.1 can be replaced with priors that encourage the second-order derivatives (∂²u/∂x², ∂²u/∂y², ∂²u/∂x∂y, ∂²v/∂x², ∂²v/∂y², ∂²v/∂x∂y) to be small (Anandan and Weiss 1985; Trobin et al. 2008).
A related approach is to use an affine prior (Ju et al. 1996; Ju 1998; Nir et al. 2008; Seitz and Baker 2009). One approach is to over-parameterize the flow (Nir et al. 2008). Instead of solving for two flow vectors (u(x, y, t), v(x, y, t)) at each pixel, the algorithm in Nir et al. (2008) solves for six affine parameters a_i(x, y, t), i = 1, ..., 6, where the flow is given by:

u(x, y, t) = a_1(x, y, t) + \frac{x - x_0}{x_0} a_3(x, y, t) + \frac{y - y_0}{y_0} a_5(x, y, t),   (11)

v(x, y, t) = a_2(x, y, t) + \frac{x - x_0}{x_0} a_4(x, y, t) + \frac{y - y_0}{y_0} a_6(x, y, t),   (12)

where (x_0, y_0) is the middle of the image. Equations (11) and (12) are then substituted into any of the data terms
above. Ju et al. formulate the prior so that neighboring affine parameters should be similar (Ju et al. 1996). As above, a robust penalty may be used and, further, may vary depending on the affine parameter (for example, weighting a_1 and a_2 differently from a_3-a_6).
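The following sketch illustrates the over-parameterization in (11) and (12): six affine parameter maps are converted back into a flow field. The array layout and normalization are assumptions made for illustration, not the implementation of Nir et al. (2008).

```python
import numpy as np

def affine_to_flow(a, x0=None, y0=None):
    """Convert six affine parameter maps into a flow field per (11)-(12).
    a: array of shape (6, H, W) holding a_1 .. a_6 at each pixel."""
    _, H, W = a.shape
    if x0 is None:
        x0, y0 = W / 2.0, H / 2.0        # middle of the image
    y, x = np.mgrid[0:H, 0:W].astype(float)
    xn = (x - x0) / x0                   # normalized coordinates
    yn = (y - y0) / y0
    u = a[0] + xn * a[2] + yn * a[4]     # (11)
    v = a[1] + xn * a[3] + yn * a[5]     # (12)
    return u, v
```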
2.2.6 Rigidity Priors
A number of authors have explored rigidity or fundamental matrix priors which, in the absence of other evidence, favor flows that are aligned with epipolar lines. These constraints have both been strictly enforced (Adiv 1985; Hanna 1991; Nir et al. 2008) and added as a soft prior (Wedel et al. 2008; Wedel et al. 2009; Valgaerts et al. 2008).
2.3 Continuous Optimization Algorithms
The two most commonly used continuous optimization techniques in optical flow are: (1) gradient descent algorithms (Sect. 2.3.1) and (2) extremal or variational approaches (Sect. 2.3.2). In Sect. 2.3.3 we describe a small number of other approaches.
2.3.1 Gradient Descent Algorithms
Let f be a vector resulting from concatenating the horizontal and vertical components of the flow at every pixel. The goal is then to optimize E_Global with respect to f. The simplest gradient descent algorithm is steepest descent (Baker and Matthews 2004), which takes steps in the direction of the negative gradient −∂E_Global/∂f. An important question with steepest descent is how big the step size should be. One approach is to adjust the step size iteratively, increasing it if the algorithm makes a step that reduces the energy and decreasing it if the algorithm tries to make a step that increases the error. Another approach, used in Black and Anandan (1996), is to set the step size to be:

w \frac{1}{T} \frac{\partial E_{\mathrm{Global}}}{\partial f}.   (13)
In this expression, T is an upper bound on the second derivatives of the energy; T ≥ ∂²E_Global/∂f_i² for all components f_i in the vector f. The parameter 0 < w < 2 is an over-relaxation parameter. Without it, (13) tends to take steps that are too small because: (1) T is an upper bound, and (2) the equation does not model the off-diagonal elements in the Hessian. It can be shown that if E_Global is a quadratic energy function (i.e., the problem is equivalent to solving a large linear system), convergence to the global minimum can be guaranteed (albeit possibly slowly) for any 0 < w < 2. In general E_Global is nonlinear and so there is no such guarantee. However, based on the theoretical result in the linear case, a value around w ≈ 1.95 is generally used. Also note that many non-quadratic (e.g., robust) formulations can be solved with iteratively reweighted least squares (IRLS); i.e., they are posed as a sequence of quadratic optimization problems with a data-dependent weighting function that varies from iteration to iteration. The weighted quadratic is iteratively solved and the weights re-estimated.
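To illustrate IRLS in isolation, the sketch below robustly estimates the mean of a set of samples under a Charbonnier penalty; in flow estimation, the weighted least-squares step would instead be the large sparse linear system over all pixels. All names and constants are illustrative.

```python
import numpy as np

def irls_mean(x, eps=1e-3, iters=20):
    """Robust (Charbonnier) estimate of the mean of the samples x via
    iteratively reweighted least squares."""
    m = np.mean(x)                             # least-squares initialization
    for _ in range(iters):
        r = x - m
        w = 1.0 / np.sqrt(r ** 2 + eps ** 2)   # weights from current residuals
        m = np.sum(w * x) / np.sum(w)          # weighted least-squares solve
        # the weights are then re-estimated on the next iteration
    return m
```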
In general, steepest descent algorithms are relatively weak optimizers requiring a large number of iterations because they fail to model the coupling between the unknowns. A second-order model of this coupling is contained in the Hessian matrix ∂²E_Global/∂f_i∂f_j. Algorithms that use the Hessian matrix or approximations to it, such as the Newton method, Quasi-Newton methods, the Gauss-Newton method, and the Levenberg-Marquardt algorithm (Baker and Matthews 2004), all converge far faster. These algorithms are, however, inapplicable to the general optical flow problem because they require estimating and inverting the Hessian, a 2n × 2n matrix where there are n pixels in the image. These algorithms are applicable to problems with fewer parameters such as the Lucas-Kanade algorithm (Lucas and Kanade 1981) and variants (Le Besnerais and Champagnat 2005), which solve for a single flow vector (2 unknowns) independently for each block of pixels. Another set of examples are parametric motion algorithms (Bergen et al. 1992), which also solve for just a small number of unknowns.
2.3.2 Variational and Other Extremal Approaches
The second class of algorithms assumes that the global energy function can be written in the form:

E_{\mathrm{Global}} = \iint E(u(x, y), v(x, y), x, y, u_x, u_y, v_x, v_y)\, dx\, dy,   (14)

where u_x = ∂u/∂x, u_y = ∂u/∂y, v_x = ∂v/∂x, and v_y = ∂v/∂y. At this stage, u = u(x, y) and v = v(x, y) are treated as unknown 2D functions rather than the set of unknown parameters (the flows at each pixel). The parameterization of these functions occurs later. Note that (14) imposes limitations on the functional form of the energy, i.e., that it is just a function of the flow u, v, the spatial coordinates x, y, and the gradients of the flow u_x, u_y, v_x, and v_y. A wide variety of energy functions do satisfy this requirement, including those of Horn and Schunck (1981), Bruhn et al. (2005), Brox et al. (2004), Nir et al. (2008), and Zimmer et al. (2009).
Equation (14) is then treated as a calculus of variations problem leading to the Euler-Lagrange equations:

\frac{\partial E_{\mathrm{Global}}}{\partial u} - \frac{\partial}{\partial x} \frac{\partial E_{\mathrm{Global}}}{\partial u_x} - \frac{\partial}{\partial y} \frac{\partial E_{\mathrm{Global}}}{\partial u_y} = 0,   (15)

\frac{\partial E_{\mathrm{Global}}}{\partial v} - \frac{\partial}{\partial x} \frac{\partial E_{\mathrm{Global}}}{\partial v_x} - \frac{\partial}{\partial y} \frac{\partial E_{\mathrm{Global}}}{\partial v_y} = 0.   (16)
Because they use the calculus of variations, such algorithms are generally referred to as variational. In the special case of the Horn-Schunck algorithm (Horn 1986), the Euler-Lagrange equations are linear in the unknown functions u and v. These equations are then parameterized with two unknown parameters per pixel and can be solved as a sparse linear system. A variety of options are possible, including the Jacobi method, the Gauss-Seidel method, Successive Over-Relaxation, and the Conjugate Gradient algorithm.
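As an illustration of the linear case, the following sketch implements the textbook Jacobi-style iteration for the Horn-Schunck Euler-Lagrange equations, with precomputed derivatives Ix, Iy, It and a simple box filter standing in for the local flow average; it is a schematic example, not the implementation evaluated in this paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def horn_schunck(Ix, Iy, It, alpha=1.0, iters=100):
    """Jacobi-style iterations for the linear Horn-Schunck Euler-Lagrange
    equations; Ix, Iy, It are precomputed image derivatives and alpha is
    the smoothness weight."""
    u = np.zeros_like(Ix)
    v = np.zeros_like(Ix)
    for _ in range(iters):
        u_bar = uniform_filter(u, size=3)   # local flow averages
        v_bar = uniform_filter(v, size=3)
        t = (Ix * u_bar + Iy * v_bar + It) / (alpha ** 2 + Ix ** 2 + Iy ** 2)
        u = u_bar - Ix * t
        v = v_bar - Iy * t
    return u, v
```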
For more general energy functions, the Euler-Lagrange equations are nonlinear and are typically solved using an iterative method (analogous to gradient descent). For example, the flows can be parameterized by u + du and v + dv, where u, v are treated as known (from the previous iteration or the initialization) and du, dv as unknowns. These expressions are substituted into the Euler-Lagrange equations, which are then linearized through the use of Taylor expansions. The resulting equations are linear in du and dv and solved using a sparse linear solver. The estimates of u and v are then updated appropriately and the next iteration applied.
One disadvantage of variational algorithms is that the discretization of the Euler-Lagrange equations is not always exact with respect to the original energy (Pock et al. 2007). Another extremal approach (Sun et al. 2008), closely related to the variational algorithms, is to use:

\frac{\partial E_{\mathrm{Global}}}{\partial f} = 0   (17)

rather than the Euler-Lagrange equations. Otherwise, the approach is similar. Equation (17) can be linearized and solved using a sparse linear system. The key difference between this approach and the variational one is just whether the parameterization of the flow functions into a set of flows per pixel occurs before or after the derivation of the extremal constraint equation ((17) or the Euler-Lagrange equations). One advantage of the early parameterization and the subsequent use of (17) is that it reduces the restrictions on the functional form of E_Global, which is important in learning-based approaches (Sun et al. 2008).
2.3.3 Other Continuous Algorithms
Another approach (Trobin et al. 2008; Wedel et al. 2008) is to decouple the data and prior terms through the introduction of two sets of flow parameters, say (u_data, v_data) for the data term and (u_prior, v_prior) for the prior:

E_{\mathrm{Global}} = E_{\mathrm{Data}}(u_{\mathrm{data}}, v_{\mathrm{data}}) + E_{\mathrm{Prior}}(u_{\mathrm{prior}}, v_{\mathrm{prior}}) + \gamma \left( \|u_{\mathrm{data}} - u_{\mathrm{prior}}\|^2 + \|v_{\mathrm{data}} - v_{\mathrm{prior}}\|^2 \right).   (18)
The final term in (18) encourages the two sets of flow parameters to be roughly the same. For a sufficiently large value of γ the theoretical optimal solution will be unchanged and (u_data, v_data) will exactly equal (u_prior, v_prior). Practical optimization with too large a value of γ is problematic, however. In practice either a lower value is used or γ is steadily increased. The two sets of parameters allow the optimization to be broken into two steps. In the first step, the sum of the data term and the third term in (18) is optimized over the data flows (u_data, v_data), assuming the prior flows (u_prior, v_prior) are constant. In the second step, the sum of the prior term and the third term in (18) is optimized over the prior flows (u_prior, v_prior), assuming the data flows (u_data, v_data) are constant. The result is two much simpler optimizations. The first optimization can be performed independently at each pixel. The second optimization is often simpler because it does not depend directly on the nonlinear data term (Trobin et al. 2008; Wedel et al. 2008).
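The alternation implied by (18) can be summarized by the following control-flow sketch; the two per-step solvers are passed in as placeholders, and the increasing schedule for the coupling weight is an assumption for illustration.

```python
import numpy as np

def decoupled_flow(step_data, step_prior, shape, gammas):
    """Alternating optimization of the decoupled energy (18).
    step_data(u_p, v_p, gamma)  -> (u_d, v_d): pointwise optimization of
        the data term plus the coupling term, prior flows held fixed.
    step_prior(u_d, v_d, gamma) -> (u_p, v_p): optimization of the prior
        term plus the coupling term, data flows held fixed.
    gammas: increasing schedule of coupling weights."""
    u_p, v_p = np.zeros(shape), np.zeros(shape)
    for gamma in gammas:
        u_d, v_d = step_data(u_p, v_p, gamma)
        u_p, v_p = step_prior(u_d, v_d, gamma)
    return u_p, v_p
```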
Finally, in recent work, continuous convex optimization algorithms such as Linear Programming have also been used to compute optical flow (Seitz and Baker 2009).
2.3.4 Coarse-to-Fine and Other Heuristics
All of the above algorithms solve the problem as huge nonlinear optimizations. Even the Horn-Schunck algorithm, which results in linear Euler-Lagrange equations, is nonlinear through the linearization of the Brightness Constancy constraint to give the Optical Flow Constraint. A variety of approaches have been used to improve the convergence rate and reduce the likelihood of falling into a local minimum.
One component in many algorithms is a coarse-to-fine strategy. The most common approach is to build image pyramids by repeated blurring and downsampling (Lucas and Kanade 1981; Glazer et al. 1983; Burt et al. 1983; Enkelman 1986; Anandan 1989; Black and Anandan 1996; Battiti et al. 1991; Bruhn et al. 2005). Optical flow is first computed on the top level (fewest pixels) and then upsampled and used to initialize the estimate at the next level. Computation at the higher levels in the pyramid involves far fewer unknowns and so is far faster. The initialization at each level from the previous level also means that far fewer iterations are required at each level. For this reason, pyramid algorithms tend to be significantly faster than a single solution at the bottom level. The images at the higher levels also contain fewer high-frequency components, reducing the number of local minima in the data term. A related approach is to use a multigrid algorithm (Bruhn et al. 2006) where estimates of the flow are passed both up and down the hierarchy of approximations. A limitation of many coarse-to-fine algorithms, however, is the tendency to over-smooth fine structure and to fail to capture small fast-moving objects.
The main purpose of coarse-to-fine strategies is to deal with nonlinearities caused by the data term (and the subsequent difficulty in dealing with long-range motion). At the
coarsest pyramid level, the flow magnitude is likely to be small, making the linearization of the brightness constancy assumption reasonable. Incremental warping of the flow between pyramid levels (Bergen et al. 1992) helps keep the flow update at any given level small (i.e., under one pixel). When combined with incremental warping and updating within a level, this method is effective for optimization with a linearized brightness constancy assumption.
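The coarse-to-fine control flow common to many of the above algorithms can be sketched as follows; the per-level solver is a placeholder, and the blur, downsampling factor, and level count are illustrative choices rather than those of any specific method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def coarse_to_fine(I0, I1, solve_level, n_levels=4, factor=0.5):
    """Generic coarse-to-fine driver. solve_level(I0, I1, u, v) refines an
    initial flow (u, v) at a single pyramid level and returns the result."""
    # Build blurred and downsampled pyramids (finest level first).
    pyr0, pyr1 = [I0], [I1]
    for _ in range(n_levels - 1):
        pyr0.append(zoom(gaussian_filter(pyr0[-1], 1.0), factor))
        pyr1.append(zoom(gaussian_filter(pyr1[-1], 1.0), factor))

    u = np.zeros_like(pyr0[-1])
    v = np.zeros_like(pyr0[-1])
    for level in reversed(range(n_levels)):        # coarsest to finest
        u, v = solve_level(pyr0[level], pyr1[level], u, v)
        if level > 0:
            # Upsample the flow to the next finer level and rescale its
            # values, since one pixel here corresponds to 1/factor pixels there.
            ty, tx = pyr0[level - 1].shape
            sy, sx = ty / float(u.shape[0]), tx / float(u.shape[1])
            u = zoom(u, (sy, sx)) * sx
            v = zoom(v, (sy, sx)) * sy
    return u, v
```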
Another common cause of nonlinearity is the use of a robust penalty function (see Sects. 2.1.2 and 2.2.2). A common approach to improve robustness in this case is Graduated Non-Convexity (GNC) (Blake and Zisserman 1987; Black and Anandan 1996). During GNC, the problem is first converted into a convex approximation that is more easily solved. The energy function is then made incrementally more non-convex and the solution is refined, until the original desired energy function is reached.
2.4 Discrete Optimization Algorithms
A number of recent approaches use discrete optimization algorithms, similar to those employed in stereo matching, such as graph cuts (Boykov et al. 2001) and belief propagation (Sun et al. 2003). Discrete optimization methods approximate the continuous space of solutions with a simplified problem. The hope is that this will enable a more thorough and complete search of the state space. The trade-off in moving from continuous to discrete optimization is one of search efficiency for fidelity. Note that, in contrast to discrete stereo optimization methods, the 2D flow field makes discrete optimization of optical flow significantly more challenging. Approximations are usually made, which can limit the power of the discrete algorithms to avoid local minima. The few methods proposed to date can be divided into two main approaches, described below.
2.4.1 Fusion Approaches
Algorithms such as Jung et al. (2008), Lempitsky et al. (2008), and Trobin et al. (2008) assume that a number of candidate flow fields have been generated by running standard algorithms such as Lucas and Kanade (1981) and Horn and Schunck (1981), possibly multiple times with a number of different parameters. Computing the flow is then posed as choosing which of the set of possible candidates is best at each pixel. Fusion Flow (Lempitsky et al. 2008) uses a sequence of binary graph-cut optimizations to refine the current flow estimate by selectively replacing portions with one of the candidate solutions. Trobin et al. (2008) perform a similar sequence of fusion steps, at each step solving a continuous [0, 1] optimization problem and then thresholding the results.
2.4.2 Dynamically Reparameterizing Sparse State-Spaces
Any fixed 2D discretization of the continuous space of 2D flow fields is likely to be a crude approximation to the continuous field. A number of algorithms take the approach of first approximating this state space sparsely (both spatially and in terms of the possible flows at each pixel) and then refining the state space based on the result. An early use of this idea for flow estimation employed simulated annealing with a state space that adapted based on the local shape of the objective function (Black and Anandan 1991). More recently, Glocker et al. (2008) initially use a sparse sampling of possible motions on a coarse version of the problem. As the algorithm runs from coarse to fine, the spatial density of motion states (which are interpolated with a spline) and the density of possible flows at any given control point are chosen based on the uncertainty in the solution from the previous iteration. The algorithm of Lei and Yang (2009) also sparsely allocates states across space and for the possible flows at each spatial location. The spatial allocation uses a hierarchy of segmentations, with a single possible flow for each segment at each level. Within any level of the segmentation hierarchy, first a sparse sampling of the possible flows is used, followed by a denser sampling with a reduced range around the solution from the previous iteration. The algorithm in Cooke (2008) iteratively alternates between two steps. In the first step, all the states are allocated to the horizontal motion, which is estimated similarly to stereo, assuming the vertical motion is zero. In the second step, all the states are allocated to the vertical motion, treating the estimate of the horizontal motion from the previous iteration as constant.
2.4.3 Continuous Refinement
An optional step after a discrete algorithm is to use a continuous optimization to refine the results. Any of the approaches in Sect. 2.3 are possible.
2.5 Miscellaneous Issues
2.5.1 Learning
The design of a global energy function E_Global involves a variety of choices, each with a number of free parameters. Rather than manually making these decisions and tuning parameters, learning algorithms have been used to choose the data and prior terms and optimize their parameters by maximizing performance on a set of training data (Roth and Black 2007; Sun et al. 2008; Li and Huttenlocher 2008).
2.5.2 Region-Based Techniques
If the image can be segmented into coherently moving regions, many of the methods above can be used to
accurately estimate the flow within the regions. Further, if the flow were accurately known, segmenting it into coherent regions would be feasible. One of the reasons optical flow has proven challenging to compute is that the flow and its segmentation must be computed together.
Several methods first segment the scene using non-motion cues and then estimate the flow in these regions (Black and Jepson 1996; Xu et al. 2008; Fuh and Maragos 1989). Within each image segment, Black and Jepson (1996) use a parametric model (e.g., affine) (Bergen et al. 1992), which simplifies the problem by reducing the number of parameters to be estimated. The flow is then refined as suggested above.
2.5.3 Layers
Motion transparency has been extensively studied and is not considered in detail here. Most methods have focused on the use of parametric models that estimate motion in layers (Jepson and Black 1993; Wang and Adelson 1993). The regularization of transparent motion in the framework of global energy minimization, however, has received little attention, with the exception of Ju et al. (1996), Weiss (1997), and Shizawa and Mase (1991).
2.5.4 Sparse-to-Dense Approaches
The coarse-to-fine methods described above have difficulty dealing with long-range motion of small objects. In contrast, there exist many methods to accurately estimate sparse feature correspondences even when the motion is large. Such sparse matching methods can be combined with the continuous energy minimization approaches in a variety of ways (Brox et al. 2009; Liu et al. 2008; Ren 2008; Xu et al. 2008).
2.5.5 Visibility and Occlusion
Occlusions and visibility changes can cause major problems for optical flow algorithms. The most common solution is to model such effects implicitly using a robust penalty function on both the data term and the prior term. Explicit occlusion estimation, for example through cross-checking flows computed forwards and backwards in time, is another approach that can be used to improve robustness to occlusions and visibility changes (Xu et al. 2008; Lei and Yang 2009).
2.6 Databases and Evaluations
Prior to our evaluation (Baker et al. 2007), there were three major attempts to quantitatively evaluate optical flow algorithms, each proposing sequences with ground truth. The work of Barron et al. (1994) has been so influential that until recently, essentially all published methods compared with it. The synthetic sequences used there, however, are too simple to make meaningful comparisons between modern algorithms. Otte and Nagel (1994) introduced ground truth for a real scene consisting of polyhedral objects. While this provided real imagery, the images were extremely simple. More recently, McCane et al. (2001) provided ground truth for real polyhedral scenes as well as simple synthetic scenes. Most recently, Liu et al. (2008) proposed a dataset of real imagery that uses hand segmentation and computed flow estimates within the segmented regions to generate the ground truth. While this has the advantage of using real imagery, the reliance on human judgement for segmentation, and on a particular optical flow algorithm for ground truth, may limit its applicability.
In this paper we go beyond these studies in several important ways. First, we provide ground-truth motion for much more complex real and synthetic scenes. Specifically, we include ground truth for scenes with nonrigid motion. Second, we also provide ground-truth motion boundaries and extend the evaluation methods to these areas where many flow algorithms fail. Finally, we provide a web-based interface, which facilitates the ongoing comparison of methods.
Our goal is to push the limits of current methods and, by exposing where and how they fail, focus attention on the hard problems. As described above, almost all flow algorithms have a specific data term, prior term, and optimization algorithm to compute the flow field. Regardless of the choices made, algorithms must somehow deal with all of the phenomena that make optical flow intrinsically ambiguous and difficult. These include: (1) the aperture problem and textureless regions, which highlight the fact that optical flow is inherently ill-posed, (2) camera noise, nonrigid motion, motion discontinuities, and occlusions, which make choosing appropriate penalty functions for both the data and prior terms important, (3) large motions and small objects, which often cause practical optimization algorithms to fall into local minima, and (4) mixed pixels, changes in illumination, non-Lambertian reflectance, and motion blur, which highlight overly simplified assumptions made by Brightness Constancy (or simple filter constancy). Our goal is to provide ground-truth data containing all of these components and to provide information about the location of motion boundaries and textureless regions. In this way, we hope to be able to evaluate which phenomena pose problems for which algorithms.
3 Database Design
Creating a ground-truth (GT) database for optical flow is difficult.
Fig. 1 (a) The setup for obtaining ground-truth flow using hidden fluorescent texture includes computer-controlled lighting to switch between the UV and visible lights. It also contains motion stages for both the camera and the scene. (b)-(d) The setup under the visible illumination. (e)-(g) The setup under the UV illumination. (c) and (f) show the high-resolution images taken by the digital camera. (d) and (g) show a zoomed portion of (c) and (f). The high-frequency fluorescent texture in the images taken under UV light (g) allows accurate tracking, but is largely invisible in the low-resolution test images
For stereo, structured light (Scharstein and Szeliski 2002) or range scanning (Seitz et al. 2006) can be used to obtain dense, pixel-accurate ground truth. For optical flow, the scene may be moving nonrigidly, making such techniques inapplicable in general. Ideally we would like imagery collected in real-world scenarios with real cameras and substantial nonrigid motion. We would also like dense, subpixel-accurate ground truth. We are not aware of any technique that can simultaneously satisfy all of these goals.
Rather than collecting a single type of data (with its inherent limitations) we instead collected four different types of data, each satisfying a different subset of desirable properties. Having several different types of data has the benefit that the overall evaluation is less likely to be affected by any biases or inaccuracies in any of the data types. It is important to keep in mind that no ground-truth data is perfect. The term itself just means "measured on the ground" and any measurement process may introduce noise or bias. We believe that the combination of our four datasets is sufficient to allow a thorough evaluation of current optical flow algorithms. Moreover, the relative performance of algorithms on the different types of data is itself interesting and can provide insights for future algorithms (see Sect. 5.2.4).
Wherever possible, we collected eight frames, with the ground-truth flow being defined between the middle pair. We collected color imagery, but also make grayscale imagery available for comparison with legacy implementations and existing approaches that only process grayscale. The dataset is divided into 12 training sequences with ground truth, which can be used for parameter estimation or learning, and 12 test sequences, where the ground truth is withheld. In this paper we only describe the test sequences. The datasets, instructions for evaluating results on the test set, and the performance of current algorithms are all available at http://vision.middlebury.edu/flow/. We describe each of the four types of data below.
3.1 Dense GT Using Hidden Fluorescent Texture
We have developed a technique for capturing imagery of nonrigid scenes with ground-truth optical flow. We build a scene that can be moved in very small steps by a computer-controlled motion stage. We apply a fine spatter pattern of fluorescent paint to all surfaces in the scene. The computer repeatedly takes a pair of high-resolution images both under ambient lighting and under UV lighting, and then moves the scene (and possibly the camera) by a small amount.
In our current setup, shown in Fig. 1(a), we use a Canon EOS 20D camera to take images of size 3504×2336, and make sure that no scene point moves by more than 2 pixels from one captured frame to the next. We obtain our test sequence by downsampling every 40th image taken under visible light by a factor of six, yielding images of size 584×388. Because we sample every 40th frame, the motion can be quite large (up to 12 pixels between frames in our evaluation data) even though the motion between each pair of captured frames is small and the frames are subsequently downsampled, i.e., after the downsampling, the motion between any pair of captured frames is at most 1/3 of a pixel.
Since fluorescent paint is available in a variety of colors, the color of the objects in the scene can be closely matched. In addition, it is possible to apply a fine spatter pattern, where individual droplets are about the size of 1-2 pixels in the high-resolution images. This high-frequency texture is therefore far less perceptible in the low-resolution images, while the fluorescent paint is very visible in the high-resolution UV images in Fig. 1(g). Note that fluorescent paint absorbs UV light but emits light in the visible spectrum. Thus, the camera optics affect the hidden texture and the scene colors in exactly the same way, and the hidden texture remains perfectly aligned with the scene.
The ground-truth flow is computed by tracking small windows in the original sequence of high-resolution UV images.
Fig. 2 Hidden Texture Data. Army contains several independently moving objects. Mequon contains nonrigid motion and textureless regions. Schefflera contains thin structures, shadows, and foreground/background transitions with little contrast. Wooden contains rigidly moving objects with little texture in the presence of shadows. In the right-most column, we include a visualization of the color-coding of the optical flow. The ticks on the axes denote a flow unit of one pixel; note that the flow magnitudes are fairly low in Army (<4 pixels), but higher in the other three scenes (up to 10 pixels)
We use a sum-of-squared-difference (SSD) tracker with a window size of 15×15, corresponding to a window radius of less than 1.5 pixels in the downsampled images. We perform a local brute-force search, using each frame to initialize the next. We also crosscheck the results by tracking each pixel both forwards and backwards through the sequence and require perfect correspondence. The chances that this check would yield false positives after tracking for 40 frames are very low. Crosschecking identifies the occluded regions, whose motion we mark as unknown. After the initial integer-based motion tracking and crosschecking, we estimate the subpixel motion of each window using Lucas-Kanade (1981) with a precision of about 1/10 pixel (i.e., 1/60 pixel in the downsampled images). In order to downsample the motion field by a factor of 6, we find the modes among the 36 different motion vectors in each 6×6 window using sequential clustering. We assign the average motion of the dominant cluster as the motion estimate for the resulting pixel in the low-resolution motion field. The test images taken under visible light are downsampled using a binomial filter.
Using the combination of fluorescent paint, downsampling high-resolution images, and sequential tracking of small motions, we are able to obtain dense, subpixel-accurate ground truth for a nonrigid scene.
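A simplified sketch of the forward-backward crosscheck is given below; it checks a single frame pair with a tolerance parameter, whereas the actual procedure tracks through the full high-resolution sequence and requires exact integer correspondence.

```python
import numpy as np

def crosscheck(uf, vf, ub, vb, tol=0.0):
    """Mark pixels whose forward flow (uf, vf) and backward flow (ub, vb)
    are consistent; returns a boolean validity mask."""
    H, W = uf.shape
    y, x = np.mgrid[0:H, 0:W]
    # Where does each pixel land under the forward flow?
    xf = np.clip(np.rint(x + uf).astype(int), 0, W - 1)
    yf = np.clip(np.rint(y + vf).astype(int), 0, H - 1)
    # Following the backward flow from there should lead back to the start.
    dx = uf + ub[yf, xf]
    dy = vf + vb[yf, xf]
    return np.sqrt(dx ** 2 + dy ** 2) <= tol
```

Pixels that fail this check would be marked as occluded and their ground-truth motion left unknown, as described above.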
We include four sequences in the evaluation set (Fig. 2). Army contains several independently moving objects. Mequon contains nonrigid motion and large areas with little texture. Schefflera contains thin structures, shadows, and foreground/background transitions with little contrast.
Fig. 3 Synthetic Data. Grove contains a close up of a tree with thin structures, very complex motion discontinuities, and a large motion range (up to 20 pixels). Urban contains large motion discontinuities and an even larger motion range (up to 35 pixels). Yosemite is included in our evaluation to allow comparison with algorithms published prior to our study
Wooden contains rigidly moving objects with little texture in the presence of shadows. The maximum motion in Army is approximately 4 pixels. The maximum motion in the other three sequences is about 10 pixels. All sequences are significantly more difficult than the Yosemite sequence due to the larger motion ranges, the nonrigid motion, various photometric effects such as shadows and specularities, and the detailed geometric structure.
The main benefit of this dataset is that it contains ground truth on imagery captured with a real camera. Hence, it contains real photometric effects, natural textural properties, etc. The main limitation of this dataset is that the scenes are laboratory scenes, not real-world scenes. There is also no motion blur due to the stop-motion method of capture. One drawback of this data is that the ground truth is not available in areas where cross-checking failed, in particular, in regions occluded in one image. Even though the ground truth is reasonably accurate (on the order of 1/60th of a pixel), the process is not perfect; significant errors, however, are limited to a small fraction of the pixels. The same can be said for any real data where the ground truth is measured, including, for example, the Middlebury stereo dataset (Scharstein and Szeliski 2002). The ground-truth measuring technique may always be prone to errors and biases. Consequently, the following section describes realistic synthetic data where the ground truth is guaranteed to be perfect.
3.2 Realistic Synthetic Imagery
Synthetic scenes generated using computer graphics are often indistinguishable from real ones. For the study of optical flow, synthetic data offers a number of benefits. In particular, it gives full control over the rendering process, including material properties of the objects, while providing precise ground-truth motion and object boundaries.
To go beyond previous synthetic ground truth (e.g., the Yosemite sequence), we generated two types of fairly complex synthetic outdoor scenes. The first is a set of natural scenes (Fig. 3, top) containing significant complex occlusion. These scenes consist of a random number of procedurally generated rocks and trees with randomly chosen ground texture and surface displacement. Additionally, the tree bark has significant 3D texture. The trees have a small amount of independent movement to mimic motion due to wind. The camera motions include camera rotation and 3D translation. A second set of urban scenes (Fig. 3, middle)
contains buildings generated with a random shape grammar. The buildings have randomly selected scanned textures; there are also a few independently moving cars.
These scenes were generated using the 3Delight Renderman-compliant renderer (DNA Research 2008) at a resolution of 640×480 pixels using linear gamma. The images are antialiased, mimicking the effect of sensors with finite area. Frames in these synthetic sequences were generated without motion blur. There are cast shadows, some of which are non-stationary due to the independent motion of the trees and cars. The surfaces are mostly diffuse, but the leaves on the trees have a slight specular component, and the cars are strongly specular. A minority of the surfaces in the urban scenes have a small (5%) reflective component, meaning that the reflection of other objects is faintly visible in these surfaces.
The rendered scenes use the ambient occlusion approximation to global illumination (Landis 2002). This approximation separates illumination into the sum of direct and multiple-bounce components, and then assumes that the multiple-bounce illumination is sufficiently omnidirectional that it can be approximated at each point by a product of the incoming ambient light and a precomputed factor measuring the proportion of rays that are not blocked by other nearby surfaces.
The ground truth was computed using a custom shader that projects the 3D motion of the scene corresponding to a particular image onto the 2D image plane. Since individual pixels can potentially represent more than one object, simply point-sampling the flow at the center of each pixel could result in a flow vector that does not reflect the dominant motion under the pixel. On the other hand, applying antialiasing to the flow would result in an averaged flow vector at each pixel that may not reflect the true motion of any object within that pixel. Instead, we clustered the flow vectors within each pixel and selected a flow vector from the dominant cluster: The flow fields are initially generated at 3× resolution, resulting in nine candidate flow vectors for each pixel. These motion vectors are grouped into two clusters using k-means. The k-means procedure is initialized with the vectors closest to and furthest from the pixel's average flow, as measured using the flow vector end points. The flow vector closest to the mean of the dominant cluster is then chosen to represent the flow for that pixel. The images were also generated at 3× resolution and downsampled using a bicubic filter.
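The per-pixel clustering step can be sketched as follows: the nine candidate vectors from the 3× rendering are split into two clusters, and the candidate closest to the dominant cluster's mean is kept. This is a plain 2-means with the stated initialization; details such as the iteration count are illustrative.

```python
import numpy as np

def dominant_flow(candidates, iters=10):
    """candidates: (9, 2) array of flow vectors sampled within one output
    pixel. Returns the candidate closest to the dominant cluster's mean."""
    mean = candidates.mean(axis=0)
    d = np.linalg.norm(candidates - mean, axis=1)
    # Initialize the two cluster centers with the vectors closest to and
    # furthest from the average flow, measured at the vector end points.
    centers = candidates[[np.argmin(d), np.argmax(d)]].astype(float).copy()
    for _ in range(iters):
        dist = np.linalg.norm(candidates[:, None] - centers[None], axis=2)
        labels = np.argmin(dist, axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = candidates[labels == k].mean(axis=0)
    dominant = np.argmax(np.bincount(labels, minlength=2))
    target = candidates[labels == dominant].mean(axis=0)
    return candidates[np.argmin(np.linalg.norm(candidates - target, axis=1))]
```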
We selected three synthetic sequences to include in the evaluation set (Fig. 3). Grove contains a close-up view of a tree, with substantial parallax and motion discontinuities. Urban contains images of a city, with substantial motion discontinuities, a large motion range, and an independently moving object. We also include the Yosemite sequence to allow some comparison with algorithms published prior to the release of our data.
3.3 Imagery for Frame Interpolation
In a wide class of applications such as video re-timing,
novel view generation, and motion-compensated compres-
sion, what is important is not how well the flow field
matches the ground-truth motion, but how well intermediate
frames can be predicted using the ow. To allow for mea-
sures that predict performance on such tasks, we collected a
variety of data suitable for frame interpolation. The relative
performance of algorithms with respect to frame interpola-
tion and ground-truth motion estimation is interesting in its
own right.
3.3.1 Frame Interpolation Datasets
We used a PointGrey Dragonfly Express camera to capture
the data, acquiring 60 frames per second. We provide every
other frame to the optical flow algorithms and retain the in-
termediate images as frame-interpolation ground truth. This
temporal subsampling means that the input to the flow algorithms is captured at 30 Hz while enabling generation of a 2× slow-motion sequence.
We include four such sequences in the evaluation set
(Fig. 4). The rst two (Backyard and Basketball) include
people, a common focus of many applications, but a subject
matter absent from previous evaluations. Backyard is cap-
tured outdoors with a short shutter (6 ms) and has little mo-
tion blur. Basketball is captured indoors with a longer shutter
(16 ms) and so has more motion blur. The third sequence,
Dumptruck, is an urban scene containing several indepen-
dently moving vehicles, and has substantial specularities and
saturation (2 ms shutter). The final sequence, Evergreen, in-
cludes highly textured vegetation with complex motion dis-
continuities (6 ms shutter).
The main benefit of the interpolation dataset is that the
scenes are real world scenes, captured with a real camera
and containing real sources of noise. The ground truth is
not a flow field, however, but an intermediate image frame. Hence, the definition of flow being used is the apparent motion, not the 2D projection of the motion field.
3.3.2 Frame Interpolation Algorithm
Note that the evaluation of accuracy depends on the inter-
polation algorithm used to construct the intermediate frame.
By default, we generate the intermediate frames from the
flow fields uploaded to the website using our baseline inter-
polation algorithm. Researchers can also upload their own
interpolation results in case they want to use a more sophis-
ticated algorithm.
Fig. 4 High-Speed Data for Interpolation. We collected four sequences using a PointGrey Dragonfly Express running at 60 Hz. We provide every other image to the algorithms and retain the intermediate frame as interpolation ground truth. The first two sequences (Backyard and Basketball) include people, a common focus of many applications. Dumptruck contains several independently moving vehicles, and has substantial specularities and saturation. Evergreen includes highly textured vegetation with complex discontinuities

Our algorithm takes a single flow field u_0 from image I_0 to I_1 and constructs an interpolated frame I_t at time t ∈ (0, 1). We do, however, use both frames to generate the
actual intensity values. In all the experiments in this pa-
per, t = 0.5. Our algorithm is closely related to previous al-
gorithms for depth-based frame interpolation (Shade et al.
1998; Zitnick et al. 2004):
(1) Forward-warp the flow u_0 to time t to give u_t, where:

    $$u_t\bigl(\operatorname{round}(x + t\,u_0(x))\bigr) = u_0(x). \qquad (19)$$

    In order to avoid sampling gaps, we splat the flow vectors with a splatting radius of 0.5 pixels (Levoy 1988) (i.e., each flow vector is followed to a real-valued location in the destination image, and the flow is written into all pixels within a distance of 0.5 of that location). In cases where multiple flow vectors map to the same location, we attempt to resolve the ordering independently for each pixel by checking photoconsistency; i.e., we retain the flow u_0(x) with the lowest color difference |I_0(x) − I_1(x + u_0(x))|.

Fig. 5 Stereo Data. We cropped the stereo dataset Teddy (Scharstein and Szeliski 2003) to convert the asymmetric stereo disparity range into a roughly symmetric flow field. This dataset includes complex geometry as well as significant occlusions and motion discontinuities. One reason for including this dataset is to allow comparison with state-of-the-art stereo algorithms
(2) Fill any holes in u_t using a simple outside-in strategy.
(3) Estimate occlusion masks O_0(x) and O_1(x), where O_i(x) = 1 means pixel x in image I_i is not visible in the respective other image. To compute O_0(x) and O_1(x), we first forward-warp the flow u_0(x) to time t = 1 using the same approach as in Step 1 to give u_1(x). Any pixel x in u_1(x) that is not targeted by this splatting has no corresponding pixel in I_0 and thus we set O_1(x) = 1 for all such pixels. (See Herbst et al. 2009 for a bidirectional algorithm that performs this reasoning at time t.) In order to compute O_0(x), we cross-check the flow vectors, setting O_0(x) = 1 if

    $$|u_0(x) - u_1(x + u_0(x))| > 0.5. \qquad (20)$$
(4) Compute the colors of the interpolated pixels, taking occlusions into consideration. Let x_0 = x − t u_t(x) and x_1 = x + (1 − t) u_t(x) denote the locations of the two source pixels in the two images. If both pixels are visible, i.e., O_0(x_0) = 0 and O_1(x_1) = 0, blend the two images (Beier and Neely 1992):

    $$I_t(x) = (1 - t)\,I_0(x_0) + t\,I_1(x_1). \qquad (21)$$

    Otherwise, only sample the non-occluded image, i.e., set I_t(x) = I_0(x_0) if O_1(x_1) = 1 and vice versa. In order to avoid artifacts near object boundaries, we dilate the occlusion masks O_0, O_1 by a small radius before this operation. We use bilinear interpolation to sample the images.
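A Python sketch of the four steps above, under several simplifying assumptions: flow vectors are splatted to the nearest pixel rather than within a 0.5-pixel radius, holes are filled with the mean flow rather than outside-in, and the occlusion masks are not dilated. Images are assumed to be H×W×C float arrays and all names are illustrative; this is not the code used on the evaluation website.

```python
import numpy as np

def bilinear(img, x, y):
    """Bilinearly sample img (H x W x C, float) at a real-valued location."""
    H, W = img.shape[:2]
    x0 = int(np.clip(np.floor(x), 0, W - 2)); y0 = int(np.clip(np.floor(y), 0, H - 2))
    ax, ay = x - x0, y - y0
    return ((1 - ay) * ((1 - ax) * img[y0, x0] + ax * img[y0, x0 + 1])
            + ay * ((1 - ax) * img[y0 + 1, x0] + ax * img[y0 + 1, x0 + 1]))

def forward_warp(u0, I0, I1, t):
    """Splat u0 forward to time t (nearest-pixel splatting for brevity; the paper
    splats within a 0.5-pixel radius). Collisions keep the most photoconsistent vector."""
    H, W = u0.shape[:2]
    ut = np.full((H, W, 2), np.nan)
    best = np.full((H, W), np.inf)
    for y in range(H):
        for x in range(W):
            fx, fy = u0[y, x]
            xt, yt = int(round(x + t * fx)), int(round(y + t * fy))
            if not (0 <= xt < W and 0 <= yt < H):
                continue
            cost = np.linalg.norm(I0[y, x] - bilinear(I1, x + fx, y + fy))
            if cost < best[yt, xt]:
                best[yt, xt] = cost
                ut[yt, xt] = (fx, fy)
    return ut

def interpolate_frame(u0, I0, I1, t=0.5):
    """Steps 1-4 of the baseline interpolation algorithm (simplified)."""
    H, W = u0.shape[:2]
    ut = forward_warp(u0, I0, I1, t)                      # step 1: forward-warp flow
    hole = np.isnan(ut[..., 0])                           # step 2: fill holes
    ut[hole] = np.nanmean(ut.reshape(-1, 2), axis=0)      # (mean fill, not outside-in)
    u1 = forward_warp(u0, I0, I1, 1.0)                    # step 3: occlusion masks
    O1 = np.isnan(u1[..., 0])                             # pixels of I1 not reached by u0
    u1[O1] = 0.0
    O0 = np.zeros((H, W), bool)
    for y in range(H):
        for x in range(W):
            xx = int(np.clip(round(x + u0[y, x, 0]), 0, W - 1))
            yy = int(np.clip(round(y + u0[y, x, 1]), 0, H - 1))
            O0[y, x] = np.linalg.norm(u0[y, x] - u1[yy, xx]) > 0.5   # Eq. (20)
    It = np.zeros(I0.shape, float)                        # step 4: blend colors
    for y in range(H):
        for x in range(W):
            fx, fy = ut[y, x]
            x0, y0 = x - t * fx, y - t * fy               # source location in I0
            x1, y1 = x + (1 - t) * fx, y + (1 - t) * fy   # source location in I1
            occ0 = O0[int(np.clip(round(y0), 0, H - 1)), int(np.clip(round(x0), 0, W - 1))]
            occ1 = O1[int(np.clip(round(y1), 0, H - 1)), int(np.clip(round(x1), 0, W - 1))]
            if occ1 and not occ0:
                It[y, x] = bilinear(I0, x0, y0)
            elif occ0 and not occ1:
                It[y, x] = bilinear(I1, x1, y1)
            else:
                It[y, x] = (1 - t) * bilinear(I0, x0, y0) + t * bilinear(I1, x1, y1)  # Eq. (21)
    return It
```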
This algorithm, while reasonable, is only meant to serve as a starting point. One area for future research is to develop better frame interpolation algorithms. We hope that our database will be used both by researchers working on optical flow and on frame interpolation (Mahajan et al. 2009;
Herbst et al. 2009).
3.4 Modified Stereo Data for Rigid Scenes
Our final type of data consists of modified stereo data. Specifically, we include the Teddy dataset in the evaluation set, the ground truth for which was obtained using structured lighting (Scharstein and Szeliski 2003) (Fig. 5). Stereo datasets typically have an asymmetric disparity range [0, d_max], which is appropriate for stereo, but not for optical flow. We crop different subregions of the images, thereby introducing a spatial shift, to convert this disparity range to [−d_max/2, d_max/2].
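The sketch below illustrates one such crop, assuming the usual convention that a left-image pixel at column x appears at column x − d in the right image; the helper name and the particular subregions chosen are hypothetical, not the exact crop used for Teddy.

```python
import numpy as np

def symmetrize_stereo_pair(im_left, im_right, disp_gt, d_max):
    """Crop both images so that the disparity range [0, d_max] becomes a
    roughly symmetric horizontal flow range [-d_max/2, d_max/2]."""
    s = d_max // 2
    im0 = im_left[:, s:]        # drop the first s columns of the left image
    im1 = im_right[:, :-s]      # drop the last s columns of the right image (same width)
    u = s - disp_gt[:, s:]      # horizontal flow from im0 to im1, now roughly symmetric
    return im0, im1, u
```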
A key benefit of the modified stereo dataset, like the hidden fluorescent texture dataset, is that it contains ground-truth flow fields on imagery captured with a real camera. An additional benefit is that it allows a comparison between state-of-the-art stereo algorithms and optical flow algorithms (see Sect. 5.6). Shifting the disparity range does not affect the performance of stereo algorithms as long as they are given the new search range. Although optical flow is
a more under-constrained problem, the relative performance
of algorithms may lead to algorithmic insights.
One concern with the modified stereo dataset is that algorithms may take advantage of the knowledge that the motions are all horizontal. Indeed, a number of recent algorithms have considered rigidity priors (Wedel et al. 2008, 2009). However, these algorithms must also perform well on the other types of data and any over-fitting to the rigid data should be visible by comparing results across the 12 images in the evaluation set. Another concern would be that the ground truth is only accurate to 0.25 pixels. (The original stereo data comes with pixel-accurate ground truth but is four times higher resolution; see Scharstein and Szeliski 2003.) The most appropriate performance statistics for this
data, therefore, are the robustness statistics used in the
Middlebury stereo dataset (Scharstein and Szeliski 2002)
(Sect. 4.2).
4 Evaluation Methodology
We refine and extend the evaluation methodology of Barron
et al. (1994) in terms of: (1) the performance measures used,
(2) the statistics computed, and (3) the sub-regions of the
images considered.
4.1 Performance Measures
The most commonly used measure of performance for optical flow is the angular error (AE). The AE between a flow vector (u, v) and the ground-truth flow (u_GT, v_GT) is the angle in 3D space between (u, v, 1.0) and (u_GT, v_GT, 1.0). The AE can be computed by taking the dot product of the vectors, dividing by the product of their lengths, and then taking the inverse cosine:

$$\mathrm{AE} = \cos^{-1}\!\left(\frac{1.0 + u\,u_{GT} + v\,v_{GT}}{\sqrt{1.0 + u^2 + v^2}\,\sqrt{1.0 + u_{GT}^2 + v_{GT}^2}}\right). \qquad (22)$$
The popularity of this measure is based on the seminal sur-
vey by Barron et al. (1994), although the measure itself dates
to prior work by Fleet and Jepson (1990). The goal of the
AE is to provide a relative measure of performance that
avoids the divide-by-zero problem for zero flows. Errors in large flows are penalized less in AE than errors in small flows.
Although the AE is prevalent, it is unclear why errors in a
region of smooth non-zero motion should be penalized less
than errors in regions of zero motion. The AE also contains
an arbitrary scaling constant (1.0) to convert the units from
pixels to degrees. Hence, we also compute an absolute error, the error in flow endpoint (EE) used in Otte and Nagel (1994), defined by:

$$\mathrm{EE} = \sqrt{(u - u_{GT})^2 + (v - v_{GT})^2}. \qquad (23)$$
Although the use of AE is common, the EE measure
is probably more appropriate for most applications (see
Sect. 5.2.1). We report both.
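A minimal sketch of both flow error measures for NumPy arrays of flow components; the clip merely guards against numerical round-off before the arccosine and is not part of the definition.

```python
import numpy as np

def flow_errors(u, v, u_gt, v_gt):
    """Per-pixel angular error in degrees (Eq. 22) and endpoint error in pixels (Eq. 23)."""
    num = 1.0 + u * u_gt + v * v_gt
    den = np.sqrt(1.0 + u**2 + v**2) * np.sqrt(1.0 + u_gt**2 + v_gt**2)
    ae = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
    ee = np.sqrt((u - u_gt)**2 + (v - v_gt)**2)
    return ae, ee
```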
For image interpolation, we define the interpolation error (IE) to be the root-mean-square (RMS) difference between the ground-truth image and the estimated interpolated image:

$$\mathrm{IE} = \left[\frac{1}{N}\sum_{(x,y)}\bigl(I(x,y) - I_{GT}(x,y)\bigr)^2\right]^{1/2}, \qquad (24)$$
where N is the number of pixels. For color images, we take
the L2 norm of the vector of RGB color differences.
We also compute a second measure of interpolation per-
formance, a gradient-normalized RMS error inspired by
Szeliski (1999). The normalized interpolation error (NE) be-
tween an interpolated image I(x, y) and a ground-truth image I_GT(x, y) is given by:

$$\mathrm{NE} = \left[\frac{1}{N}\sum_{(x,y)}\frac{\bigl(I(x,y) - I_{GT}(x,y)\bigr)^2}{\|\nabla I_{GT}(x,y)\|^2 + \epsilon}\right]^{1/2}. \qquad (25)$$

In our experiments the arbitrary scaling constant is set to be ε = 1.0 (graylevels per pixel squared). Again, for color images, we take the L2 norm of the vector of RGB color differences and compute the gradient of each color band separately.
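A sketch of both interpolation measures for H×W×3 color images stored as NumPy arrays; np.gradient stands in for whatever gradient operator and smoothing radius are used on the evaluation website, so the numbers will not match the benchmark exactly.

```python
import numpy as np

def interpolation_errors(I, I_gt, eps=1.0):
    """RMS interpolation error (Eq. 24) and gradient-normalized error (Eq. 25).
    Squared RGB differences are summed per pixel; the gradient is computed
    per color band, as described in the text."""
    diff2 = np.sum((I.astype(float) - I_gt.astype(float))**2, axis=-1)
    ie = np.sqrt(diff2.mean())
    gy, gx = np.gradient(I_gt.astype(float), axis=(0, 1))   # per-band image gradients
    grad2 = np.sum(gx**2 + gy**2, axis=-1)
    ne = np.sqrt((diff2 / (grad2 + eps)).mean())
    return ie, ne
```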
Naturally, an interpolation algorithm is required to gener-
ate the interpolated image from the optical flow field. In this
paper, we use the baseline algorithm outlined in Sect. 3.3.2.
4.2 Statistics
Although the full histograms are available in a technical re-
port, Barron et al. (1994) only reports averages (AV) and
standard deviations (SD). This has led most subsequent re-
searchers to only report these statistics. We also compute the
robustness statistics used in the Middlebury stereo dataset
(Scharstein and Szeliski 2002). In particular RX denotes the
percentage of pixels that have an error measure above X. For
the angle error (AE) we compute R2.5, R5.0, and R10.0 (de-
grees); for the endpoint error (EE) we compute R0.5, R1.0,
and R2.0 (pixels); for the interpolation error (IE) we com-
pute R2.5, R5.0, and R10.0 (graylevels); and for the normal-
ized interpolation error (NE) we compute R0.5, R1.0, and
R2.0 (no units). We also compute robust accuracy measures
similar to those in Seitz et al. (2006): AX denotes the accu-
racy of the error measure at the Xth percentile, after sorting
the errors from low to high. For the ow errors (AE and EE),
we compute A50, A75, and A95. For the interpolation errors
(IE and NE), we compute A90, A95, and A99.
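Both families of statistics are straightforward to compute from a per-pixel error map; the sketch below (with illustrative function names) shows one way to do so.

```python
import numpy as np

def robustness(err, thresholds):
    """RX statistics: percentage of pixels whose error exceeds each threshold X."""
    err = err.ravel()
    return {f"R{x}": 100.0 * np.mean(err > x) for x in thresholds}

def accuracy(err, percentiles):
    """AX statistics: the error value at the Xth percentile of the sorted errors."""
    err = np.sort(err.ravel())
    return {f"A{p}": err[min(int(len(err) * p / 100.0), len(err) - 1)] for p in percentiles}

# e.g., for endpoint errors: robustness(ee, [0.5, 1.0, 2.0]) and accuracy(ee, [50, 75, 95])
```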
4.3 Region Masks
It is easier to compute flow in some parts of an image than in others. For example, computing flow around motion discontinuities is hard. Computing motion in textureless regions
is also hard, although interpolating in those regions should
be easier. Computing statistics over such regions may high-
light areas where existing algorithms are failing and spur
further research in these cases. We follow the procedure in
Scharstein and Szeliski (2002) and compute the error mea-
sure statistics over three types of region masks: everywhere
(All), around motion discontinuities (Disc), and in texture-
less regions (Untext). We illustrate the masks for the Schefflera dataset in Fig. 6.
Fig. 6 Region masks for Schefflera. Statistics are computed over the white pixels. All includes all the pixels where the ground-truth flow can be reliably determined. The Disc mask is computed by taking the gradient of the ground-truth flow (or pixel differencing if the ground-truth flow is unavailable), thresholding and dilating. The Untext regions are computed by taking the gradient of the image, thresholding and dilating
The All masks for flow estimation include all the pixels where the ground-truth flow could be reliably determined. For the new synthetic sequences, this means all of the pixels. For Yosemite, the sky is excluded. For the hidden fluorescent texture data, pixels where cross-checking failed are excluded. Most of these pixels are around the boundary of objects, and around the boundary of the image where the pixel flows outside the second image. Similarly, for the stereo se-
quences, pixels where cross-checking failed are excluded
(Scharstein and Szeliski 2003). Most of these pixels are pix-
els that are occluded in one of the images. The All masks for
the interpolation metrics include all of the pixels. Note that
in some cases (particularly the synthetic data), the All masks
include pixels that are visible in the first image but are occluded
or outside the second image. We did not remove these pixels
because we believe algorithms should be able to extrapolate
into these regions.
The Disc mask is computed by taking the gradient of the ground-truth flow field, thresholding the magnitude, and then dilating the resulting mask with a 9×9 box. If the ground-truth flow is not available, we use frame differencing to get an estimate of fast-moving regions instead. The Untext regions are computed by taking the gradient of the image, thresholding the magnitude, and dilating with a 3×3 box. The pixels excluded from the All masks are also excluded from both Disc and Untext masks.
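The sketch below shows one plausible implementation of the two masks using SciPy's binary dilation; the thresholds, the grayscale conversion, and the exact treatment of the Untext mask are assumptions rather than the values used for the benchmark.

```python
import numpy as np
from scipy import ndimage

def region_masks(flow_gt, image, valid, t_disc=1.0, t_untext=10.0):
    """Return (disc, untext) boolean masks. flow_gt: (H, W, 2) ground-truth flow,
    image: (H, W) or (H, W, 3) array, valid: the All mask of reliable pixels."""
    gy_u, gx_u = np.gradient(flow_gt[..., 0])
    gy_v, gx_v = np.gradient(flow_gt[..., 1])
    flow_grad = np.sqrt(gx_u**2 + gy_u**2 + gx_v**2 + gy_v**2)
    disc = ndimage.binary_dilation(flow_grad > t_disc, np.ones((9, 9), bool))
    gray = image.astype(float).mean(axis=-1) if image.ndim == 3 else image.astype(float)
    gy, gx = np.gradient(gray)
    # one reading of the text: mark low-gradient pixels, then dilate with a 3x3 box
    untext = ndimage.binary_dilation(np.hypot(gx, gy) < t_untext, np.ones((3, 3), bool))
    # pixels excluded from the All mask are also excluded from Disc and Untext
    return disc & valid, untext & valid
```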
5 Experimental Results
We now discuss our empirical findings. We start in Sect. 5.1
by outlining the evolution of our online evaluation since the
publication of our preliminary paper (Baker et al. 2007). In
Sect. 5.2, we analyze the flow errors. In particular, we in-
vestigate the correlation between the various metrics, sta-
tistics, region masks, and datasets. In Sect. 5.3, we analyze
the interpolation errors and in Sect. 5.4, we compare the in-
terpolation error results with the flow error results. Finally,
in Sect. 5.5, we compare the algorithms that have reported
results using our evaluation in terms of which components
of our taxonomy in Sect. 2 they use.
5.1 Online Evaluation
Our online evaluation at http://vision.middlebury.edu/flow/ provides a snapshot of the state-of-the-art in optical flow.
Seeded with the handful of methods that we implemented as
part of our preliminary paper (Baker et al. 2007), the evalu-
ation has quickly grown. At the time of writing (December
2009), the evaluation contains results for 24 published meth-
ods and several unpublished ones. In this paper, we restrict
attention to the published algorithms. Four of these meth-
ods were contributed by us (our implementations of Horn
and Schunck 1981; Lucas-Kanade 1981; Combined Local-Global, Bruhn et al. 2005; and Black and Anandan 1996).
Results for the 20 other methods were submitted by their au-
thors. Of these new algorithms, two were published before
2007, 11 were published in 2008, and 7 were published in
2009.
On the evaluation website, we provide tables comparing
the performance of the algorithms for each of the four er-
ror measures, i.e., endpoint error (EE), angular error (AE),
interpolation error (IE), and normalized interpolation error
(NE), on a set of 8 test sequences. For EE and AE, which
measure flow accuracy, we use the 8 sequences for which we have ground-truth flow: Army, Mequon, Schefflera, Wooden, Grove, Urban, Yosemite, and Teddy. For IE and NE, which measure interpolation accuracy, we use only four of the above datasets (Mequon, Schefflera, Urban, and Teddy) and
replace the other four with the high-speed datasets Back-
yard, Basketball, Dumptruck, and Evergreen. For each mea-
sure, we include a separate page for each of the eight sta-
tistics in Sect. 4.2. Figure 7 shows a screenshot of the first of these 32 pages, the average endpoint error (Avg. EE).

Fig. 7 A screenshot of the default page at http://vision.middlebury.edu/flow/eval/, evaluating the current set of 24 published algorithms (as of December 2009) using the average endpoint error (Avg. EE). This page is one of 32 possible metric/statistic combinations the user can select. By moving the mouse pointer over an underlined performance score, the user can interactively view the corresponding flow and error maps. Clicking on a score toggles between the computed and the ground-truth flows. Next to each score, the corresponding rank in the current column is indicated with a smaller blue number. The minimum (best) score in each column is shown in boldface. The methods are sorted by their average rank, which is computed over all 24 columns (eight sequences times three region masks each). The average rank serves as an approximate measure of performance under the selected metric/statistic

For each measure and statistic, we evaluate all methods on the set of eight test images with three different region masks
(all, disc, and untext; see Sect. 4.3), resulting in a set of 24
scores per method. We sort each table by the average rank
across all 24 scores to provide an ordering that roughly re-
flects the overall performance on the current metric and sta-
tistic.
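A sketch of this ranking procedure, assuming the scores are collected in a (number of methods) × 24 NumPy array; ties are broken arbitrarily here, which the actual tables may handle differently.

```python
import numpy as np

def sort_by_average_rank(scores):
    """scores: (num_methods, 24) array of error scores (8 sequences x 3 masks).
    Rank each column (1 = lowest error), average the ranks per method, and
    return the method indices ordered from best to worst average rank."""
    ranks = scores.argsort(axis=0).argsort(axis=0) + 1   # column-wise ranks, ties broken arbitrarily
    avg_rank = ranks.mean(axis=1)
    return np.argsort(avg_rank), avg_rank
```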
We want to emphasize that we do not aim to provide
an overall ranking among the submitted methods. Authors
sometimes report the rank of their method on one or more of
the 32 tables (often average angular error); however, many
of the other 31 metric/statistic combinations might be better
suited to compare the algorithms, depending on the appli-
cation of interest. Also note that the exact rank within any
of the tables only gives a rough measure of performance,
as there are various other ways that the scores across the
24 columns could be combined.
We also list the runtimes reported by authors on the Ur-
ban sequence on the evaluation website (see Table 1). We
made no attempt to normalize for the programming environ-
ment, CPU speed, number of cores, or other hardware ac-
celeration. These numbers should be treated as a very rough
guideline of the inherent computational complexity of the
algorithms.
Table 1 Reported runtimes on the Urban sequence in seconds. We do not normalize for the programming environment, CPU speed, number of
cores, or other hardware acceleration. These numbers should be treated as a very rough guideline of the inherent computational complexity of the
algorithms
Algorithm Runtime Algorithm Runtime
Adaptive (Wedel et al. 2009) 9.2 Seg OF (Xu et al. 2008) 60
Complementary OF (Zimmer et al. 2009) 44 Learning Flow (Sun et al. 2008) 825
Aniso. Huber-L1 (Werlberger et al. 2009) 2 Filter Flow (Seitz and Baker 2009) 34,000
DPOF (Lei and Yang 2009) 261 Graph Cuts (Cooke 2008) 1,200
TV-L1-improved (Wedel et al. 2008) 2.9 Black & Anandan (Black and Anandan 1996) 328
CBF (Trobin et al. 2008) 69 SPSA-learn (Li and Huttenlocher 2008) 200
Brox et al. (Brox et al. 2004) 18 Group Flow (Ren 2008) 600
Rannacher (Rannacher 2009) 0.12 2D-CLG (Bruhn et al. 2005) 844
F-TV-L1 (Wedel et al. 2008) 8 Horn & Schunck (Horn and Schunck 1981) 49
Second-order prior (Trobin et al. 2008) 14 TI-DOFE (Cassisa et al. 2009) 260
Fusion (Lempitsky et al. 2008) 2,666 FOLKI (Le Besnerais and Champagnat 2005) 1.4
Dynamic MRF (Glocker et al. 2008) 366 Pyramid LK (Lucas and Kanade 1981) 11.9
Table 2 A comparison of the average endpoint error (Avg. EE) results for 2D-CLG (Bruhn et al. 2005) (overall the best-performing algorithm in
our preliminary study, Baker et al. 2007) and the best result uploaded to the evaluation website at the time of writing (Fig. 7)
Army Mequon Schefflera Wooden Grove Urban Yosemite Teddy
Best 0.09 0.18 0.24 0.18 0.74 0.39 0.08 0.50
2D-CLG (Bruhn et al. 2005) 0.28 0.67 1.12 1.07 1.23 1.54 0.10 1.38
Finally, we report on the evaluation website for each
method the number of input frames and whether color in-
formation was utilized. At the time of writing, all of the 24
published methods discussed in this paper use only 2 frames
as input; and 10 of them use color information.
The best-performing algorithm (both in terms of average
endpoint error and average angular error) in our prelimi-
nary study (Baker et al. 2007) was 2D-CLG (Bruhn et al.
2005). In Table 2, we compare the results of 2D-CLG with
the current best result in terms of average endpoint error
(Avg. EE). The rst thing to note is that performance has
dramatically improved, with average EE values of less than
0.2 pixels on four of the datasets (Yosemite, Army, Mequon,
and Wooden). The common elements of the more difficult sequences (Grove, Teddy, Urban, and Schefflera) are the presence of large motions and strong motion discontinuities. The complex discontinuities and fine structures of Grove seem to cause the most problems for current algorithms. A visual inspection of some computed flows (Fig. 8) shows
that oversmoothing motion discontinuities is common even
for the top-performing algorithms. A possible exception is
DPOF (Lei and Yang 2009). On the other hand, the prob-
lems of complex non-rigid motion confounded with illu-
mination changes, moving shadows, and real sensor noise
(Army, Mequon, Wooden) do not appear to present as much
of a problem for current algorithms.
5.2 Analysis of the Flow Errors
We now analyze the correlation between the metrics, statis-
tics, region masks, and datasets for the flow errors. Figure 9
compares the average ranks computed over different subsets
of the 32 pages of results, each of which contains 24 re-
sults for each algorithm. Column (a) contains the average
rank computed over seven of the eight statistics (the stan-
dard deviation is omitted) and the three region masks for the
endpoint error (EE). Column (b) contains the corresponding
average rank for the angular error (AE). Columns (c) contain
the average rank for each of the seven statistics for the end-
point error (EE) computed over the three masks and the eight
datasets. Columns (d) contain the average endpoint error
(Avg. EE) for each of the three masks just computed over the
eight datasets. Columns (e) contain the Avg. EE computed
for each of the datasets, averaged over each of the three
masks. The order of the algorithms is the same as Fig. 7, i.e.,
we order by the average endpoint error (Avg. EE), the high-
lighted, leftmost column in (c). To help visualize the num-
bers, we color-code the average ranks with a color scheme
where green denotes low values, yellow intermediate, and
red large values.
Fig. 8 The results of some of the top-performing methods on three of the more difficult sequences. All three sequences contain strong motion discontinuities. Grove also contains particularly fine structures. The general tendency is to oversmooth motion discontinuities and fine structures. A possible exception is DPOF (Lei and Yang 2009)

We also include the Pearson product-moment coefficient r between various subsets of pairs of columns at the bottom of the figure. The Pearson measure of correlation takes on values between −1.0 and 1.0, with 1.0 indicating perfect
correlation. First, we include the correlation between each
column and column (a). As expected, the correlation of col-
umn (a) with itself is 1.0. We also include the correlation
between all pairs of the statistics, between all pairs of the
masks, and between all pairs of the datasets. The results are
shown in the 7×7, 3×3, and 8×8 (symmetric) matrices at
the bottom of the table. We color-code the correlation results
with a separate scale where 1.0 is dark green and yellow/red
denote lower values (less correlation).
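For reference, the coefficient between two columns of average ranks can be computed as follows (equivalently, np.corrcoef); the function name is illustrative.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson product-moment correlation between two columns of average ranks."""
    a = np.asarray(a, float); b = np.asarray(b, float)
    a = a - a.mean(); b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a**2).sum() * (b**2).sum()))

# equivalently: np.corrcoef(col_a, col_b)[0, 1]
```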
5.2.1 Comparison of the Endpoint Error and the Angular
Error
Columns (a) and (b) in Fig. 9 contain average ranks
for the endpoint error (EE) and angular error (AE). The
rankings generated with these two measures are highly cor-
related (r = 0.989), with only a few ordering reversals.
At first glance, it may seem that the two measures could
be used largely interchangeably. Studying the qualitative re-
sults contained in Fig. 10 for the Complementary OF algo-
rithm (Zimmer et al. 2009) on the Urban sequence leads to
a different conclusion. The Complementary OF algorithm
(which otherwise does very well) fails to correctly estimate
the flow of the building in the bottom left. The average AE
for this result is 4.64 degrees which ranks 6th in the table
at the time of writing. The average EE is 1.78 pixels which
ranks 20th at the time of writing. The huge discrepancy is
due to the fact that the building in the bottom left has a very
large motion, so the AE in that region is downweighted.
Based on this example, we argue that the endpoint error (EE)
should become the preferred measure of flow accuracy.
5.2.2 Comparison of the Statistics
Fig. 9 A comparison of the various different metrics, statistics, region masks, and datasets for flow errors. Each column contains the average rank computed over a different subset of the 32 pages of results, each of which contains 24 different results for each algorithm. See the main body of the text for a description of exactly how each column is computed. To help visualize the numbers, we color-code the average ranks with a color scheme where green denotes low values, yellow intermediate, and red large values. The order of the algorithms is the same as Fig. 7, i.e., we order by the average endpoint error (Avg. EE), the leftmost column in (c), which is highlighted in the table. At the bottom of the table, we include correlations between various subsets of pairs of the columns. Specifically, we compute the Pearson product-moment coefficient r. We separately color-code the correlations with a scale where dark green is 1.0 and yellow/red denote lower values

Fig. 10 Results of the Complementary OF algorithm (Zimmer et al. 2009) on the Urban sequence. The average AE is 4.64 degrees which ranks 6th in the table at the time of writing. The average EE is 1.78 pixels which ranks 20th at the time of writing. The huge discrepancy is due to the fact that the building in the bottom left has a very large motion, so the AE in that region is downweighted. Based on this example, we argue that the endpoint error (EE) should become the preferred measure of flow accuracy

Columns (c) in Fig. 9 contain a comparison of the various statistics, the average (Avg), the robustness mea-
sures (R0.5, R1.0, and R2.0), and the accuracy measures
(A50, A75, and A95). The first thing to note is that again
these measures are all highly correlated with the average
over all the statistics in column (a) and with themselves.
The outliers and variation in the measures for any one
algorithm can be very informative. For example, the per-
formance of DPOF (Lei and Yang 2009) improves dramat-
ically from R0.5 to R2.0 and similarly from A50 to A95.
This trend indicates that DPOF is good at avoiding gross
outliers but is relatively weak at obtaining high accuracy.
DPOF (Lei and Yang 2009) is a segmentation-based dis-
crete optimization algorithm, followed by a continuous refinement (Sect. 2.4.2). The variation of the results across the measures indicates that the combination of segmentation and discrete optimization is beneficial in terms of avoiding outliers, but that perhaps the continuous refinement is not as sophisticated as recent purely continuous algorithms. The qualitative results obtained by DPOF on the Schefflera
and Grove sequences in Fig. 8 show relatively good results
around motion boundaries, supporting this conclusion.
5.2.3 Comparison of the Region Masks
Columns (d) in Fig. 9 contain a comparison of the region
masks, All, Disc, and Untext. Overall, the measures are
highly correlated by rank, particularly for the All and Un-
text masks. When comparing the actual error scores in the
individual tables (e.g., Fig. 7), however, the errors are much
higher throughout in the Disc regions than in the All regions,
while the errors in the Untext regions are typically the low-
est. As expected, the Disc regions thus capture what is still
the hardest task for optical ow algorithms: to accurately
recover motion boundaries. Methods that strongly smooth
across motion discontinuities (such as the Horn and Schunck
algorithm 1981, which uses a simple L2 prior) also show a
worse performance for Disc in the rankings (columns (d) in
Fig. 9). Textureless regions, on the other hand, seem to be
no problem for today's methods, essentially all of which op-
timize a global energy.
5.2.4 Comparison of the Datasets
Columns (e) in Fig. 9 contain a comparison across the
datasets. The first thing to note is that the results are less
strongly correlated than across statistics or region masks.
The results on the Yosemite sequence, in particular, are either
poorly or negatively correlated with all of the others. (The
main reason is that the Yosemite flow contains few discon-
tinuities and consequently methods do well here that over-
smooth other sequences with more motion boundaries.) The
most correlated subset of results appear to be the four hidden
texture sequences Army, Mequon, Schefflera, and Wooden.
These results show how performance on any one sequence
can be a poor predictor of performance on other sequences
and how a good benchmark needs to contain as diverse a set
of data as possible. Conversely, any algorithm that performs
consistently well across a diverse collection of datasets can
probably be expected to perform well on most inputs.
Studying the results in detail, a number of interesting
conclusions can be noted. Complementary OF (Zimmer
et al. 2009) does well on the hidden texture data (Army,
Mequon, Schefera, Wooden) presumably due to the use of
a relatively sophisticated data term, including the use of a
different robust penalization function for each channel in
HSV color space (the hidden texture data contains a number
of moving shadows and other illumination-related effects),
but not as well on the sequences with large motion (Urban)
and complex discontinuities (Grove). DPOF (Lei and Yang
2009), which involves segmentation and performs best on
Grove, does particularly poorly on Yosemite, presumably because segmenting the grayscale Yosemite sequence is difficult. F-TV-L1 (Wedel et al. 2008) does well on the largely
rigid sequences (Grove, Urban, Yosemite, and Teddy), but
poorly on the non-rigid sequences (Army, Mequon, Schefflera, and Wooden). F-TV-L1 uses a rigidity prior and so it
seems that this component is being used too aggressively.
Note, however, that a later algorithm by the same group
of researchers (Adaptive, Wedel et al. 2009, which also uses a rigidity prior) appears to have addressed this problem. The flow fields for Dynamic MRF (Glocker et al. 2008)
all appear to be over-smoothed; however, quantitatively, the
performance degradation is only apparent on the sequences
with strong discontinuities (Grove, Urban, and Teddy). In
summary, the relative performance of an algorithm across
the various datatypes in our benchmark can lead to insights
into which of its components work well and which are lim-
iting performance.
5.3 Analysis of the Interpolation Errors
We now analyze the correlation between the metrics, statis-
tics, region masks, and datasets for the interpolation errors.
In Fig. 11, we include results for the interpolation errors that
are analogous to the ow error results in Fig. 9, described
in Sect. 5.2. Note that we are now comparing interpolated
frames (generated from the submitted flow fields using the
interpolation algorithm from Sect. 3.3.2) with the true in-
termediate frames. Also, recall that we use a different set of
test sequences for the interpolation evaluation: the four high-
speed datasets Backyard, Basketball, Dumptruck, and Ever-
green, in addition to Mequon, Schefflera, Urban, and Teddy,
as representatives of the three other types of datasets. We
sort the algorithms by the average interpolation error per-
formance (Avg. IE), the leftmost column in Fig. 11(c). The
ordering of the algorithms in Fig. 11 is therefore different
from that in Fig. 9.
5.3.1 Comparison of the Interpolation and Normalized
Interpolation Errors
Columns (a) and (b) in Fig. 11 contain average ranks for the
interpolation error (IE) and the normalized interpolation er-
ror (NE).

Fig. 11 A comparison of the various different metrics, statistics, region masks, and datasets for interpolation errors. These results are analogous to those in Fig. 9, except the results here are for interpolation errors rather than flow errors. See Sect. 5.2 for a description of how this table was generated. We sort the algorithms by the average interpolation error performance (Avg. IE), the first column in (c). The ordering of the algorithms is therefore different from that in Fig. 9

The rankings generated with these two measures are highly correlated (r = 0.981), with only a few ordering
reversals. Most of the differences between the two measures
can be explained by the relative weight given to the discon-
tinuity and textureless regions. The rankings in columns (a)
and (b) are computed by averaging the ranking over the
three masks. The normalized interpolation error (NE) gener-
ally gives additional weight to textureless regions, and less
weight to discontinuity regions (which often also exhibit an
intensity gradient). For example, CBF (Trobin et al. 2008)
performs better on the All and Disc regions than it does on
the Untext regions, which explains why the NE rank for this
algorithm is slightly higher than the IE rank.
5.3.2 Comparison of the Statistics
Columns (c) in Fig. 11 contain a comparison of the vari-
ous statistics, the average (Avg), the robustness measures
(R2.5, R5.0, and R10.0), and the accuracy measures (A90,
A95, and A99). Overall the results are highly correlated.
The most obvious exception is R2.5, which measures the
percentage of pixels that are predicted very precisely (within
2.5 graylevels). In regions with some texture, very accu-
rate flow is needed to obtain the highest possible precision.
Algorithms such as CBF (Trobin et al. 2008) and DPOF (Lei
and Yang 2009), which are relatively robust but not so accu-
rate (compare the performance of these algorithms for R0.5
and R2.0 in Fig. 9), therefore perform worse in terms of R2.5
than they do in terms of R5.0 and R10.0.
5.3.3 Comparison of the Region Masks
Columns (d) in Fig. 11 contain a comparison of the region
masks, All, Disc, and Untext. The All and Disc results are
highly correlated, whereas the Untext results are less corre-
lated with the other two masks. Studying the detailed results
on the webpage for the outliers in columns (d), there does
not appear to be any obvious trend. The rankings in the Un-
text regions just appear to be somewhat more noisy due to
the fact that for some datasets there are relatively few Untext
pixels and all algorithms have relatively low interpolation er-
rors in those regions. The actual error values (as opposed to
their rankings) are quite different between the three regions
masks. Like the ow accuracy errors (Sect. 5.2.3), the IE
values are highest in the Disc regions since ow errors near
object boundaries usually cause interpolation errors as well.
5.3.4 Comparison of the Datasets
Columns (e) in Fig. 11 contain a comparison across the
datasets. The results are relatively uncorrelated, just like the
flow errors in Fig. 9. The most notable outlier for interpolation is Schefflera. Studying the results in detail on the website, the primary cause appears to be the right-hand side of the images, where the plant leaves move over the textured cloth. This region is difficult for many flow algorithms because the difference in motions is small and the color difference is not great either. Only a few algorithms (e.g., DPOF, Lei and Yang 2009; Fusion, Lempitsky et al. 2008; and Dynamic MRF, Glocker et al. 2008) perform well in this region.
Getting this region correct is more important in the inter-
polation study than in the flow error study because: (1) the
background is quite highly textured, so a small flow error leads to a large interpolation error (see the error maps on the webpage) and (2) the difference between the foreground and background flows is small, so oversmoothing the foreground flow is not penalized by a huge amount in the flow errors.
The algorithms that perform well in this region do not per-
form particularly well on the other sequences, as none of the
other seven interpolation datasets contain regions with sim-
ilar causes of difficulty, leading to the results being fairly
uncorrelated.
5.4 Comparison of the Flow and Interpolation Errors
In Fig. 12, we compare the flow errors with the interpolation errors. In the left half of the figure, we include the average rank scores, computed over all statistics (except the standard deviation) and all three masks. We compare flow
endpoint errors (EE), interpolation errors (IE), and normal-
ized interpolation errors (NE), and include two columns for
each, Avg and Avg4. The first column, Avg EE, is computed
over all eight flow error datasets, and corresponds exactly to
column (a) in Fig. 9. Similarly, the third and fth columns,
Avg IE and Avg NE, are computed over all eight interpo-
lation error datasets, and correspond exactly to columns (a)
and (b) in Fig. 11. To remove any dependency on the differ-
ent datasets, we provide the Avg4 columns, which are com-
puted over the four sequences that are common to the flow and interpolation studies: Mequon, Schefflera, Urban, and
Teddy.
The right half of Fig. 12 shows the 6×6 matrix of the
column correlations. It can be seen that the correlation be-
tween the results for Avg4 EE and Avg4 IE is only 0.763.
Fig. 12 A comparison of the flow errors, the interpolation errors, and the normalized interpolation errors. We include two columns for the average endpoint error. The leftmost (Avg EE) is computed over all eight flow error datasets. The other column (Avg4 EE) is computed over the four sequences that are common to the flow and interpolation studies (Mequon, Schefflera, Urban, and Teddy). We also include two columns each for the average interpolation error and the average normalized interpolation error. The leftmost of each pair (Avg IE and Avg NE) are computed over all eight interpolation datasets. The other columns (Avg4 IE and Avg4 NE) are computed over the four sequences that are common to the flow and interpolation studies (Mequon, Schefflera, Urban, and Teddy). On the right, we include the 6×6 matrix of the correlations of the six columns on the left. As in previous figures, we separately color-code the average rank columns and the 6×6 correlation matrix

Fig. 13 A comparison of the flow and interpolation results for DPOF (Lei and Yang 2009) and CBF (Trobin et al. 2008) on the Teddy sequence to illustrate the differences between the two measures of performance. DPOF obtains the best flow results with an Avg. EE of 0.5 pixels, whereas CBF is ranked 9th with an Avg. EE of 0.76 pixels. CBF obtains the best interpolation error results with an Avg. IE of 5.21 graylevels, whereas DPOF is ranked 6th with an Avg. IE of 5.58 graylevels

The comparison here uses the same datasets, statistics, and masks; the only difference is the error metric, flow endpoint error (EE) vs. interpolation error (IE). Part of the reason these measures are relatively uncorrelated is that the
interpolation errors are themselves a little noisy internally.
As discussed above, the R2.5 and Untext mask results are
relatively uncorrelated with the results for the other mea-
sures and masks. The main reason, however, is that the in-
terpolation penalizes small flow errors in textured regions a lot, and larger flow errors in untextured regions far less. An illustration of this point is included in Fig. 13. We include both flow and interpolation results for DPOF (Lei and Yang 2009) and CBF (Trobin et al. 2008) on the Teddy sequence. DPOF obtains the best flow results with an average
endpoint error of 0.5 pixels, whereas CBF is the 9th best
with an average endpoint error of 0.76 pixels. CBF obtains
the best interpolation error results with an average interpo-
lation error of 5.21 graylevels, whereas DPOF is 6th best
with an average interpolation error of 5.58 graylevels. Al-
though the flow errors for CBF are significantly worse, the main errors occur where the foreground flow is fattened into the relatively textureless background to the left of the birdhouse and the right of the teddy bear. The interpolation errors in these regions are low. On the other hand, DPOF makes flow errors on the boundary between the white cloth
and blue painting that leads to large interpolation errors.
The normalized interpolation error (NE) is meant to com-
pensate for this difference between the flow and interpo-
lation errors. Figure 12 does show that the Avg4 NE and
Avg4 EE measures are more correlated (r = 0.803) than the Avg4 IE and Avg4 EE measures (r = 0.763). The increased
degree of correlation is marginal, however, due to the dif-
ficulty in setting a spatial smoothing radius for the gradi-
ent computation, and the need to regularize the NE measure
by adding ε to the denominator. Therefore, as one might
expect, the performance of a method in the interpolation
evaluation yields only limited information about the accu-
racy of the method in terms of recovering the true motion
field.
5.5 Analysis of the Algorithms
Table 3 contains a summary of most of the algorithms for
which results have been uploaded to our online evaluation.
We omit the unpublished algorithms and a small number of
the algorithms that are harder to characterize in terms of
our taxonomy. We list the algorithms in the same order as
Figs. 7 and 9. Generally speaking, the better algorithms are
at the top, although note that this is just one way to rank the
algorithms. For each algorithm, we mark which elements
of our taxonomy in Sect. 2 it uses. In terms of the data
term, we mark whether the algorithm uses the L1 norm or
a different robust penalty function (Sect. 2.1.2). Neither col-
umn is checked for an algorithm such as Horn and Schunck
(1981), which uses the L2 norm. We note if the algorithm
uses a gradient component in the data term or any other
more sophisticated features (Sect. 2.1.3). We also note if the
algorithm uses an explicit illumination model (Sect. 2.1.4),
normalizes the data term in any way, or uses a sophisticated
color model to reduce the effects of illumination variation
(Sect. 2.1.5).
For the spatial prior term, we also mark whether the algo-
rithm uses the Total Variation (TV) norm or a different ro-
bust penalty function (Sect. 2.2.2). We note if the algorithm
spatially weights the prior (Sect. 2.2.3) or if the weighting
is anisotropic (Sect. 2.2.4).

Table 3 A classification of most of the algorithms for which results have been uploaded to our online evaluation in terms of which elements of our taxonomy in Sect. 2 they use

We also note if the algorithm
uses a higher-order prior (Sect. 2.2.5) or a rigidity prior
(Sect. 2.2.6).
In terms of the optimization algorithm, we mark if the
algorithm uses a gradient-descent based continuous opti-
mization (Sect. 2.3.1). We also specify which algorithms are
variational or use other extremal approaches (Sect. 2.3.2).
Other approaches (Sect. 2.3.3), such as the dual variable ap-
proach and the use of Linear Programming, are grouped to-
gether. In terms of discrete optimization, we distinguish fu-
sion based algorithms (Sect. 2.4.1) from reparameterization
based algorithms (Sect. 2.4.1) and note which approaches
also use a continuous optimization phase to refine the results
(Sect. 2.4.3).
Finally, we also denote which algorithms use learning
(Sect. 2.5.1) to optimize the parameters and which algo-
rithms perform explicit visibility or occlusion reasoning
(Sect. 2.5.5). In the last column we mark whether the al-
gorithm uses color images.
Based on Table 3, we note the following:
Degree of Sophistication: The algorithms toward the top
of the table tend to use a lot more of the refinements to the data and prior terms. Spatial weighting, anisotropic weighting, and the addition of robustness to illumination changes through data term normalization or the use of fea-
tures, are all common components in the top-performing
algorithms.
Choice of Penalty Function: The L1 norm is a very pop-
ular choice, particularly for the data term. A couple of
the top-performing algorithms combine an L1 norm on the
data term with a different (more truncated) robust penalty
function on the prior term.
Rigidity: As discussed in Sect. 5.2.4, one algorithm that
uses rigidity (F-TV-L1, Wedel et al. 2008) does poorly on the non-rigid scenes; however, Adaptive (Wedel et al. 2009), a subsequent algorithm by the same researchers,
does well on all sequences.
Continuous Optimization: The gradient descent algo-
rithms (discounting the ones that first perform a discrete
optimization) all appear at the bottom of the table. On the
other hand, the variational approaches appear through-
out the table. Note that there is a correlation between the
use of variational methods and more sophisticated energy
functions that is not intrinsic to the variational approach.
A direct comparison of different optimization methods
with the same objective functions needs to be carried out.
The dual-variable approach is competitive with the best
algorithms, and may offer a speed advantage.
Discrete Optimization: The discrete optimization algo-
rithms do not perform particularly well. Note, however,
that the energy functions used in these methods are gen-
erally relatively simple and might be extended in the fu-
ture to incorporate some of the more sophisticated ele-
ments. It does, however, appear that refining the results
with a continuous optimization is required to obtain good
results (if accuracy is measured using average endpoint
error).
Miscellaneous: There are few algorithms that employ
learning in the table, making it difficult to draw conclu-
sions in terms of performance. This is likely to change
in the future, as learning techniques are maturing and
more labeled training data is becoming available. Simi-
larly, few algorithms incorporate explicit visibility or oc-
clusion reasoning, making it difficult to assess how im-
portant this could be. Notably, all 24 algorithms consid-
ered here utilize only 2 input frames, despite the fact
that we make 8-frame sequences available. In contrast,
on previous evaluation sets (particularly Yosemite) multi-
frame methods relying on temporal smoothing were quite
common. This raises the question of whether temporal
smoothing, at least as applied so far, is less suited for
the more challenging sequences considered here. A definitive answer to this point cannot be given in this paper,
but should be subject of future work. Finally, less than
half of the algorithms utilize color information, and there
is no obvious correlation with performance. The utility of
color for image matching clearly deserves further study
as well; see Bleyer and Chambon (2010) for some re-
cent insights on this issue in the context of stereo match-
ing.
5.6 Comparison with State-of-the-Art Stereo Methods
As mentioned in Sect. 3.4, evaluating the flow algorithms on the modified Teddy stereo dataset allows a comparison
with current stereo methods from the online Middlebury
stereo evaluation at http://vision.middlebury.edu/stereo/
(Scharstein and Szeliski 2002). To compare the state of the
art, we select the best-performing flow and stereo methods from the two evaluations and compute the median of the lowest five R0.5 and R1.0 endpoint error scores on the
Teddy dataset. Recall that the RX endpoint error score mea-
sures the percentage of pixels whose endpoint (or dispar-
ity) error is greater than X pixels. We compute these scores
for both All and Disc region masks. While there are slight
differences in the definition of the Disc region masks between flow and stereo evaluations, the comparison provides
a good sense of the relative accuracy of the two classes of
methods.
When comparing these scores, it becomes clear that the
current top stereo methods significantly outperform the top flow methods. In particular, the median of the lowest five R1.0 error rates in All regions is 9.9 for flow, but only 6.5 for stereo (a reduction by 34%). In the Disc regions, the errors are much higher and the difference is even more pronounced, with a median error of 27.6 for flow and 10.0 for
stereo (a reduction by 64%). Of course, stereo methods solve
an easier problem, since correspondences are restricted to lie
on epipolar lines, which may be one reason for the perfor-
mance difference (though, as mentioned earlier, some flow
methods employ rigidity priors that aid in the recovery of
static scenes). Another signicant difference is that many
current stereo methods employ either discrete label sets to
model disparities, or piecewise planar surface models. In
contrast, current flow methods typically perform a continu-
ous optimization. This explains why current stereo methods
are able to recover much sharper depth discontinuities than
most current flow methods, which is apparent both quantitatively from the Disc scores and qualitatively from examining the recovered disparity maps and flow fields.
When comparing the R0.5 scores, which reflect subpixel accuracy, the errors are higher overall, but the difference between the top stereo and flow methods is slightly less pronounced: the median of the lowest five scores in All regions is now 16.6 for flow, and 13.8 for stereo (a reduction by 17%); in the Disc regions the median is now 38.0 for flow, and 22.5 for stereo (a reduction by 41%). A possible explanation for the smaller performance difference when using the R0.5 scores is that the continuous approaches used in optical flow techniques are better able to achieve subpixel
precision.
In summary, current flow algorithms, when run on a
stereo pair, cannot quite match the performance of state-of-
the-art stereo methods, particularly near depth discontinu-
ities. Conversely, most current stereo methods use discrete
label sets or simplified surface models that cannot be easily adapted to the problem of recovering continuous and smoothly varying 2D motion fields. It is likely that stereo and flow algorithms will become more similar in the future,
mization techniques (Lempitsky et al. 2008; Bleyer et al.
2010).
6 Conclusion
We have presented a collection of datasets for the evalu-
ation of optical flow algorithms. These datasets are significantly more challenging and comprehensive than pre-
vious ones. We have also extended the set of evaluation
measures and improved the evaluation methodology of
Barron et al. (1994). The data and results are available at
http://vision.middlebury.edu/flow/. Since the publication of
our preliminary paper (Baker et al. 2007), a large number
of authors have uploaded results to our online evaluation.
The best results are a huge improvement over the algo-
rithms in Baker et al. (2007) (Table 2). Our data and metrics
are diverse, offering a number of insights into the choice
of the most appropriate metrics and statistics (Sect. 5.2),
the effect of the datatype on the performance of algorithms
and the difficulty of the various forms of data (Sect. 5.2.4), the differences between flow errors and interpolation errors
(Sect. 5.3), and the importance of the various components
in an algorithm (Sect. 5.5). Of course, as newer papers con-
tinue to be published, e.g., Sun et al. (2010), which as we go
to press (June 2010) is now the leading algorithm, our un-
derstanding of which factors contribute to good performance
will continue to evolve.
Progress on our data has been so rapid that the per-
formance on some of the sequences is already very good
(Table 2). The main exceptions are Grove, Teddy, Urban,
and perhaps Schefflera. As our statistical analysis shows, however, the correlation in performance across datasets is relatively low. This suggests that no single method is yet
able to achieve strong performance across a wide variety of
datatypes. We believe that such generality is a requirement
for robust optical flow algorithms suited for real-world ap-
plications.
Any such dataset and evaluation has a limited lifespan
and new and more challenging sequences should be col-
lected. A natural question, then, is how such data is best
collected. Of the various possible techniques, namely synthetic data (Barron et al. 1994; McCane et al. 2001), some form of hidden markers (Mova LLC 2004; Tappen et al. 2006; Ramnath et al. 2008), human annotation (Liu et al. 2008), interpolation data (Szeliski 1999), and modified stereo data (Scharstein and Szeliski 2003), the authors believe that
synthetic data is probably the best approach (although gen-
erating high-quality synthetic data is not as easy as it might
seem). Large motion discontinuities and fast motion of com-
plex, fine structures appear to be more of a problem for current optical flow algorithms than non-rigid motion, complex illumination changes, and sensor noise. The level of difficulty is easier to control using synthetic data. Degradations
such as sensor noise, etc., can also easily be added. The re-
alism of synthetic sequences could also be improved further
beyond the data in our evaluation.
Future datasets should also consider more challenging
types of materials, illumination change, atmospheric effects,
and transparency. Highly specular and transparent materials
present not just a challenge for current algorithms, but also
for quantitative evaluation. Dening the ground-truth ow
and error metrics for these situations will require some care.
With any synthetic dataset, it is important to understand
how representative it is of real data. Hence, the use of mul-
tiple types of data and an analysis of the correlation across
them is critical. A diverse set of datatypes also reduces overfitting to any one type, while offering insights into the rel-
ative performance of the algorithms in different scenarios.
On balance, however, we would recommend that any future
studies contain a higher proportion of challenging, realistic
synthetic data. Future studies should also extend the data to
longer sequences than the 8-frame sequences that we col-
lected.
Acknowledgements Many thanks to Brad Hiebert-Treuer and Alan
Lim for their help in creating the fluorescent texture data sets. Michael
Black and Stefan Roth were supported by NSF grants IIS-0535075 and
IIS-0534858, and a gift from Intel Corporation. Daniel Scharstein was
supported by NSF grant IIS-0413169. Aghiles Kheffache generously
donated a software license for the 3Delight renderer for use on this
project. Michael Black and JP Lewis thank Lance Williams for early
discussions on synthetic flow databases and Doug Creel and Luca Fas-
cione for discussions of rendering issues. Thanks to Sing Bing Kang,
Simon Winder, and Larry Zitnick for providing implementations of
various algorithms. Finally, thanks to all the authors who have used
our data and uploaded results to our website.
Open Access This article is distributed under the terms of the Cre-
ative Commons Attribution Noncommercial License which permits
any noncommercial use, distribution, and reproduction in any medium,
provided the original author(s) and source are credited.
References
Adiv, G. (1985). Determining three-dimensional motion and structure from optical flow generated by several moving objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(4), 384–401.
Aggarwal, J., & Nandhakumar, N. (1988). On the computation of motion from sequences of images – a review. Proceedings of the IEEE, 76(8), 917–935.
Anandan, P. (1989). A computational framework and an algorithm for the measurement of visual motion. International Journal of Computer Vision, 2(3), 283–310.
Anandan, P., & Weiss, R. (1985). Introducing smoothness constraint in a matching approach for the computation of displacement fields. In Proceedings of the DARPA image understanding workshop (pp. 186–196).
Baker, S., & Matthews, I. (2004). Lucas-Kanade 20 years on: a unifying framework. International Journal of Computer Vision, 56(3), 221–255.
Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M., & Szeliski, R. (2007). A database and evaluation methodology for optical flow. In Proceedings of the IEEE international conference on computer vision.
Barron, J., Fleet, D., & Beauchemin, S. (1994). Performance of optical flow techniques. International Journal of Computer Vision, 12(1), 43–77.
Battiti, R., Amaldi, E., & Koch, C. (1991). Computing optical flow across multiple scales: an adaptive coarse-to-fine strategy. International Journal of Computer Vision, 6(2), 133–145.
Beier, T., & Neely, S. (1992). Feature-based image metamorphosis. In Annual conference series: Vol. 26(2). ACM computer graphics, SIGGRAPH (pp. 35–42).
Bergen, J., Anandan, P., Hanna, K., & Hingorani, R. (1992). Hierarchical model-based motion estimation. In Proceedings of the European conference on computer vision (pp. 237–252).
Black, M., & Anandan, P. (1991). Robust dynamic motion estimation over time. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 296–302).
Black, M., & Anandan, P. (1996). The robust estimation of multiple motions: parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding, 63(1), 75–104.
Black, M., & Jepson, A. (1996). Estimating optical flow in segmented images using variable-order parametric models with local deformations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(10), 972–986.
Blake, A., & Zisserman, A. (1987). Visual reconstruction. Cambridge: MIT Press.
Bleyer, M., & Chambon, S. (2010). Does color really help in dense stereo matching? In Proceedings of the international symposium on 3D data processing, visualization and transmission.
Bleyer, M., Rother, C., & Kohli, P. (2010). Surface stereo with soft segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Boykov, Y., Veksler, O., & Zabih, R. (2001). Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11), 1222–1239.
Brox, T., Bregler, C., & Malik, J. (2009). Large displacement optical flow. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Brox, T., Bruhn, A., Papenberg, N., & Weickert, J. (2004). High accuracy optical flow estimation based on a theory for warping. In Proceedings of the European conference on computer vision (Vol. 4, pp. 25–36).
Bruhn, A., Weickert, J., & Schnörr, C. (2005). Lucas/Kanade meets Horn/Schunck: combining local and global optic flow methods. International Journal of Computer Vision, 61(3), 211–231.
Bruhn, A., Weickert, J., Kohlberger, T., & Schnörr, C. (2006). A multigrid platform for real-time motion computation with discontinuity-preserving variational methods. International Journal of Computer Vision, 70(3), 257–277.
Burt, P., Yen, C., & Xu, X. (1982). Local correlation measures for motion analysis: a comparative study. In Proceedings of the IEEE conference on pattern recognition and image processing (pp. 269–274).
Burt, P., Yen, C., & Xu, X. (1983). Multi-resolution flow-through motion analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 246–252).
Cassisa, C., Simoens, S., & Prinet, V. (2009). Two-frame optical flow formulation in an unwarped multiresolution scheme. In Proceedings of the Iberoamerican congress on pattern recognition (pp. 790–797).
Cooke, T. (2008). Two applications of graph-cuts to image processing. In Proceedings of digital image computing: techniques and applications (pp. 498–504).
DNA Research (2008). 3Delight rendering software. http://www.3delight.com/.
Enkelmann, W. (1986). Investigations of multigrid algorithms for the estimation of optical flow fields in image sequences. In Proceedings of the workshop on motion: representations and analysis (pp. 81–87).
Everingham, M., Van Gool, L., Williams, C., Winn, J., & Zisserman, A. (2009). The PASCAL visual object classes challenge 2009. http://www.pascal-network.org/challenges/VOC/voc2009/workshop/index.html
Fei-Fei, L., Fergus, R., & Perona, P. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611.
Fleet, D., & Jepson, A. (1990). Computation of component image velocity from local phase information. International Journal of Computer Vision, 5(1), 77–104.
Fuh, C., & Maragos, P. (1989). Region-based optical flow estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 130–135).
Georghiades, A., Belhumeur, P., & Kriegman, D. (2001). From few to many: illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6), 643–660.
Glazer, F., Reynolds, G., & Anandan, P. (1983). Scene matching by hierarchical correlation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 432–441).
Glocker, B., Paragios, N., Komodakis, N., Tziritas, G., & Navab, N. (2008). Optical flow estimation with uncertainties through dynamic MRFs. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Golland, P., & Bruckstein, A. (1997). Motion from color. Computer Vision and Image Understanding, 68(3), 346–362.
Gross, R., Matthews, I., Cohn, J., Kanade, T., & Baker, S. (2008). Multi-PIE. In Proceedings of the international conference on automatic face and gesture recognition.
Hanna, K. (1991). Direct multi-resolution estimation of ego-motion and structure from motion. In Proceedings of the IEEE workshop on visual motion (pp. 156–162).
Haussecker, H., & Fleet, D. (2000). Computing optical flow with physical models of brightness variation. In Proceedings of the IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 760–767).
Herbst, E., Seitz, S., & Baker, S. (2009). Occlusion reasoning for temporal interpolation using optical flow. Technical report UW-CSE-09-08-01, Department of Computer Science and Engineering, University of Washington.
Horn, B. (1986). Robot vision. Cambridge: MIT Press.
Horn, B., & Schunck, B. (1981). Determining optical flow. Artificial Intelligence, 17, 185–203.
Jepson, A., & Black, M. (1993). Mixture models for optical flow computation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 760–761).
Ju, S. (1998). Estimating image motion in layers: the skin and bones model. PhD thesis, Department of Computer Science, University of Toronto.
Ju, S., Black, M., & Jepson, A. (1996). Skin and bones: multi-layer, locally affine, optical flow and regularization of transparency. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 307–314).
Jung, H., Lee, K., & Lee, S. (2008). Toward global minimum through combined local minima. In Proceedings of the European conference on computer vision (Vol. 4, pp. 298–311).
Landis, H. (2002). Production-ready global illumination. In L. Gritz (Ed.), RenderMan in production: SIGGRAPH 2002 course 16 (pp. 87–100). New York: ACM.
Le Besnerais, G., & Champagnat, F. (2005). Dense optical flow by iterative local window registration. In Proceedings of the international conference on image processing (Vol. 1, pp. 137–140).
Lei, C., & Yang, Y. (2009). Optical flow estimation on coarse-to-fine region-trees using discrete optimization. In Proceedings of the IEEE international conference on computer vision.
Lempitsky, V., Roth, S., & Rother, C. (2008). FusionFlow: discrete-continuous optimization for optical flow estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Levoy, M. (1988). Display of surfaces from volume data. IEEE Computer Graphics and Applications, 8(3), 29–37.
Li, Y., & Huttenlocher, D. (2008). Learning for optical flow using stochastic optimization. In Proceedings of the European conference on computer vision (Vol. 2, pp. 373–391).
Liu, C., Freeman, W., Adelson, E., & Weiss, Y. (2008). Human-assisted motion annotation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Liu, C., Yuen, J., Torralba, A., Sivic, J., & Freeman, W. (2008). SIFT flow: dense correspondence across different scenes. In Proceedings of the European conference on computer vision (Vol. 3, pp. 28–42).
Lucas, B., & Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In Proceedings of the international joint conference on artificial intelligence (pp. 674–679).
Mahajan, D., Huang, F., Matusik, W., Ramamoorthi, R., & Belhumeur, P. (2009). Moving gradients: a path-based method for plausible image interpolation. In Annual conference series. ACM computer graphics, SIGGRAPH.
Markandey, V., & Flinchbaugh, B. (1990). Multispectral constraints for optical flow computation. In Proceedings of the IEEE international conference on computer vision (pp. 38–41).
McCane, B., Novins, K., Crannitch, D., & Galvin, B. (2001). On benchmarking optical flow. Computer Vision and Image Understanding, 84(1), 126–143.
Mitiche, A., & Bouthemy, P. (1996). Computation and analysis of image motion: a synopsis of current problems and methods. International Journal of Computer Vision, 19(1), 29–55.
Mova LLC (2004). Contour reality capture. http://www.mova.com/.
Murray, D., & Buxton, B. (1987). Scene segmentation from visual motion using global optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(2), 220–228.
Nagel, H.-H., & Enkelmann, W. (1986). An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(5), 565–593.
Negahdaripour, S. (1998). Revised definition of optical flow: integration of radiometric and geometric cues for dynamic scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(9), 961–979.
Nir, T., Bruckstein, A., & Kimmel, R. (2008). Over-parameterized variational optical flow. International Journal of Computer Vision, 76(2), 205–216.
Ohta, N. (1989). Optical flow detection by color images. In International conference on image processing (pp. 801–805).
Otte, M., & Nagel, H.-H. (1994). Optical flow estimation: advances and comparisons. In Proceedings of the European conference on computer vision (pp. 51–60).
Philips, P., Scruggs, W., O'Toole, A., Flynn, P., Bowyer, K., Schott, C., & Sharpe, M. (2005). Overview of the face recognition grand challenge. In Proceedings of the IEEE conference on computer vision and pattern recognition (Vol. 1, pp. 947–954).
Pock, T., Pock, M., & Bischof, H. (2007). Algorithmic differentiation: application to variational problems in computer vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(7), 1180–1193.
Pratt, W. (1974). Correlation techniques of image registration. IEEE Transactions on Aerospace and Electronic Systems, AES-10, 353–358.
Ramnath, K., Baker, S., Matthews, I., & Ramanan, D. (2008). Increasing the density of active appearance models. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Rannacher, J. (2009). Realtime 3D motion estimation on graphics hardware. Undergraduate thesis, Heidelberg University.
Ren, X. (2008). Local grouping for optical flow. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Roth, S., & Black, M. (2007). On the spatial statistics of optical flow. International Journal of Computer Vision, 74(1), 33–50.
Scharstein, D., & Szeliski, R. (2002). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1–3), 7–42.
Scharstein, D., & Szeliski, R. (2003). High-accuracy stereo depth maps using structured light. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 195–202).
Seitz, S., & Baker, S. (2009). Filter flow. In Proceedings of the IEEE international conference on computer vision.
Seitz, S., Curless, B., Diebel, J., Scharstein, D., & Szeliski, R. (2006). A comparison and evaluation of multi-view stereo reconstruction algorithms. In Proceedings of the IEEE conference on computer vision and pattern recognition (Vol. 1, pp. 519–526).
Shade, J., Gortler, S., He, L.-W., & Szeliski, R. (1998). Layered depth images. In Annual conference series. ACM computer graphics, SIGGRAPH (pp. 231–242).
Shizawa, M., & Mase, K. (1991). A unified computational theory for motion transparency and motion boundaries based on eigenenergy analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 289–295).
Sim, T., Baker, S., & Bsat, M. (2003). The CMU pose, illumination, and expression database. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12), 1615–1618.
Stiller, C., & Konrad, J. (1999). Estimating motion in image sequences: a tutorial on modeling and computation of 2D motion. IEEE Signal Processing Magazine, 16(4), 70–91.
Sun, C. (1999). Fast optical flow using cross correlation and shortest-path techniques. In Proceedings of digital image computing: techniques and applications (pp. 143–148).
Sun, J., Shum, H.-Y., & Zheng, N. (2003). Stereo matching using belief propagation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(7), 787–800.
Sun, D., Roth, S., Lewis, J., & Black, M. (2008). Learning optical flow. In Proceedings of the European conference on computer vision (Vol. 3, pp. 83–97).
Sun, D., Roth, S., & Black, M. (2010). Secrets of optical flow estimation and their principles. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Szeliski, R. (1999). Prediction error as a quality metric for motion and stereo. In Proceedings of the IEEE international conference on computer vision (pp. 781–788).
Tappen, M., Adelson, E., & Freeman, W. (2006). Estimating intrinsic component images using non-linear regression. In Proceedings of the IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 1992–1999).
Trobin, W., Pock, T., Cremers, D., & Bischof, H. (2008). Continuous energy minimization via repeated binary fusion. In Proceedings of the European conference on computer vision (Vol. 4, pp. 677–690).
Trobin, W., Pock, T., Cremers, D., & Bischof, H. (2008). An unbiased second-order prior for high-accuracy motion estimation. In Proceedings of pattern recognition, DAGM (pp. 396–405).
Valgaerts, L., Bruhn, A., & Weickert, J. (2008). A variational model for the joint recovery of the fundamental matrix and the optical flow. In Proceedings of pattern recognition, DAGM (pp. 314–324).
Vedula, S., Baker, S., Rander, P., Collins, R., & Kanade, T. (2005). Three-dimensional scene flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3), 475–480.
Wang, J., & Adelson, E. (1993). Layered representation for motion analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 361–366).
Wedel, A., Pock, T., Braun, J., Franke, U., & Cremers, D. (2008). Duality TV-L1 flow with fundamental matrix prior. In Proceedings of image and vision computing, New Zealand.
Wedel, A., Pock, T., Zach, C., Cremers, D., & Bischof, H. (2008). An improved algorithm for TV-L1 optical flow. In Proceedings of the Dagstuhl motion workshop.
Wedel, A., Cremers, D., Pock, T., & Bischof, H. (2009). Structure- and motion-adaptive regularization for high accuracy optic flow. In Proceedings of the IEEE international conference on computer vision.
Weiss, Y. (1997). Smoothness in layers: motion segmentation using nonparametric mixture estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 520–526).
Werlberger, M., Trobin, W., Pock, T., Bischof, H., Wedel, A., & Cremers, D. (2009). Anisotropic Huber-L1 optical flow. In Proceedings of the British machine vision conference.
Xu, L., Chen, J., & Jia, J. (2008). A segmentation based variational model for accurate optical flow estimation. In Proceedings of the European conference on computer vision (Vol. 1, pp. 671–684).
Zimmer, H., Bruhn, A., Weickert, J., Valgaerts, L., Salgado, A., Rosenhahn, B., & Seidel, H.-P. (2009). Complementary optic flow. In Proceedings of the seventh international workshop on energy minimization methods in computer vision and pattern recognition.
Zitnick, C., Kang, S., Uyttendaele, M., Winder, S., & Szeliski, R. (2004). High-quality video view interpolation using a layered representation. In Annual conference series: Vol. 23(2). ACM computer graphics, SIGGRAPH (pp. 600–608).