
A Survey of Methods for Face Detection

Andrew King 992 550 627 March 3, 2003

Contents

1 Introduction
  1.1 The Problem of Face Detection
  1.2 Current Work
2 Mathematical Models and Approaches
3 Classifiers
4 Results and Conclusions
  4.1 Results
  4.2 Conclusions and Future Work

Chapter 1 Introduction

1.1 The Problem of Face Detection

In this paper we focus specifically on the problem of face detection in still images. Obviously the most straightforward variety of this problem is the detection of a single face at a known scale and orientation. Even this, it turns out, is a nontrivial problem. The most immediate application that comes to mind for face detection is as the first step in an automated face recognizer [12]. Thought of in this sense, face detection can be applied to systems for things such as automated surveillance and human traffic census. In and of itself, however, face detection is a fascinating problem. Efficient

face detection at framerate is an impressive goal; it is an analogue to face tracking (on which the literature, due to the subject's obvious application to human-computer interaction [6], is extensive) that requires no knowledge of previous frames. As such, it is obviously a more challenging problem (particularly as many face tracking approaches are designed specifically for human-computer interaction schemes). Furthermore, fast face detection has an apparent application to practical face tracking in the sense that it can be used to initialize tracking, e.g. when an interaction subject enters the frame or appears from an occluded position. Another reason that face detection is an important research problem is its role as a challenging case of a more general problem, i.e. object detection, for which the applications, once not restricted to faces, are manifold. Face detection is a beautiful paradigm for the general problem for several reasons. A face is naturally recognizable to a human being despite its many points of variation (e.g. skin tone, hairstyle, facial hair, glasses, etc.). Obviously a human being is able to detect a face in the context of an entire person, but we want a simple, context-free approach to detection. Another source of difficulty for faces is the complex 3-dimensional shape of one's face, and the resulting difference in the appearance of a given face under different lighting

conditions, even in an otherwise identical environment [12]. There may be object detection methods that work well for more easily identifiable objects such as blocks, but a method that works well for faces can generally be trusted with the task of detection for a wide range of complex object structures. The generality of detecting faces in a single greyscale image is a major challenge. We have no standard method for determining illumination data, scene structure, or context of sub-images without performing extensive operations on the image before even considering faces. Hence a successful strategy for face detection will be able to dodge environmental tricks and traps, but cannot ever be expected to perform perfectly.

1.2 Current Work

There are various solutions to this problem, most of which deal with faces at arbitrary (at least, within a reasonable range) scales, though most assume an upright face (the method to be used for rotated faces is an obvious exhaustive analogue to any detection method for upright faces). Most of the methods discussed in this paper are concerned only with detecting forward-facing faces. Of these methods, only Schneiderman and Kanade's statistical

method considers profile detection [11]. However, their method considers only three face orientations, and practically speaking, each orientation is treated as a different object. The effects of this approach for detecting faces at various orientations are discussed in Chapter 4. Schneiderman and Kanade apply statistical likelihood tests, using feature output histograms to create their detector scheme in [11]. Rowley et al. use neural network-based filters in [10], obtaining good early results in what has apparently become a benchmark of sorts for face detection schemes. In another early work, Papageorgiou et al. propose a general object detection scheme which uses a wavelet representation and statistical learning techniques [8]. Osuna et al. apply Vapnik's support vector machine technique to face detection in [7], and Romdhani et al. improve on that work by creating reduced training vector sets for their classifier in [9]. Fleuret and Geman attempt a coarse-to-fine approach to face detection, focusing on minimizing computation in their approach [3]. In perhaps the most impressive paper, Viola and Jones use the concept of an integral image, along with a rectangular feature representation and a boosting algorithm as its learning method, to detect faces at 15 frames per second [13]. This represents an improvement in computation time of an order of magnitude over previous

implementations of face detection algorithms. In Chapter 2, we describe the various mathematical models used for these methods. In Chapter 3, we specifically discuss the classifier for each approach. In Chapter 4, the results of these approaches are analyzed and compared.

Chapter 2 Mathematical Models and Approaches


Every method addressed in this paper uses a learning algorithm on a training set to begin the detection process. The training stage is extensive for some methods, and relatively small for others. This common training gives us an advantage when we consider the problem for the first time: we can assume that we have available to us data about a general face, and we can infer certain information regarding faces in general. The most intuitive solution to the problem of modeling faces is the geometric formulation which allows the detector to project a tested image onto

a learned subspace and determine whether or not it is close to that subspace. The natural thing to do with a training set, then, is to compute a manifold in R^n (from training images containing n pixels) from the most significant components of the general face. This is a very basic scheme, and is computationally burdensome.

Sung and Poggio use an adaptation of this scheme to create a detection scheme using Gaussian clusters in R^n. The basic idea of their detection model is using a multiple-mean Gaussian mixture model for both objects (the general case, as opposed to faces) and non-objects. Obviously the space with low object probability is the non-object space, so it is more accurate to say that among the Gaussian object clusters, negatively weighted clusters are placed so as to improve the definition of the object space. In terms of the detection problem, these negatively weighted clusters will be centred at images which can be mistaken for faces, but are not. Their implementation uses six each of the face and non-face clusters. Their learning method is appropriate for their means: a large focus of the detector is on discerning between faces and face-like non-faces. They use a bootstrapping strategy for creating a non-face training set consisting of only the most meaningful non-faces (as

opposed to a general non-face training set, which would contain many images which are so obviously not faces that they hold little weight in the detector). Sung and Poggio construct their face set in a very straightforward manner (enlarging their data set with rotations and reflections). The bootstrapping scheme for non-face generation begins with a small set of non-face samples. Their detector is then run, and false positives are added to their non-face set. This method can be iterated until a satisfactory data set has been reached. This makes for a very time-consuming construction, but the resulting set, using their negatively weighted cluster scheme for non-faces, is well suited to their demands. If necessary, the face data set can be bootstrapped in a similar manner. Sung and Poggio claim that their system can be made arbitrarily robust in this manner: "Both false positive and false negative detection errors can be easily corrected by further training with the wrongly classified patterns" [12]. This is in reference to error rates in training sets, and does not necessarily suggest that given time, the scheme can be improved arbitrarily for unseen data.
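The bootstrapping loop just described can be sketched in a few lines. Everything here (the toy norm-threshold training rule and the data) is invented for illustration and stands in for Sung and Poggio's actual mixture-model training:

```python
import numpy as np

def bootstrap(train_fn, faces, nonfaces, candidates, rounds=3):
    """Sketch of bootstrapping: after each training round, false
    positives drawn from a pool of non-face windows are added to the
    non-face set, so the detector retrains on the most face-like
    non-faces.  `train_fn` returns a classifier (1 = face)."""
    for _ in range(rounds):
        clf = train_fn(faces, nonfaces)
        false_pos = [w for w in candidates if clf(w) == 1]
        if not false_pos:
            break  # no remaining confusable non-faces in the pool
        nonfaces = nonfaces + false_pos
    return clf, nonfaces

# Toy stand-in detector: "faces" are vectors with large norm, and the
# decision threshold is refit each round from the current training sets.
def train_fn(faces, nonfaces):
    t = (min(np.linalg.norm(f) for f in faces)
         + max(np.linalg.norm(n) for n in nonfaces)) / 2
    return lambda w: int(np.linalg.norm(w) > t)

faces = [np.array([3.0, 4.0]), np.array([4.0, 3.0])]
nonfaces = [np.array([0.1, 0.1])]
pool = [np.array([2.0, 2.0]), np.array([1.0, 1.0])]  # face-like non-faces
clf, grown = bootstrap(train_fn, faces, nonfaces, pool)
```

After one round the face-like pool element is absorbed into the non-face set and the retrained threshold correctly rejects it, which is exactly the effect bootstrapping is after.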

A different approach to separating faces and non-faces in image space is

used by Osuna et al., and followed up with work by Romdhani et al. in [7] and [9], respectively. Both are based on support vector machines, a classification method developed by V. Vapnik and others at AT&T Bell Labs, notably presented in [2]. The key to the model for a support vector machine is the choice of a manifold that separates the face set from the non-face set. In [7], a hyperplane is chosen, specifically the hyperplane which maximizes the minimum distance on either side. A support vector set is, roughly speaking, a set of vectors (images) which are close to this hyperplane, and can therefore be used on their own to reacquire the hyperplane. In [9], an attempt to improve the performance of a similar system is made via reduced set vector machines. The methods in both papers use quadratic programming heavily, and exploit properties of their models' kernel functions. This is more closely related to their classifiers, and will therefore be left mostly to Chapter 3. In terms of training sets, Osuna et al. exploit their model and the fact that most vectors will be ignored, or at least meaningless, in their quadratic programming formulation. Because the hardware requirements for training a support vector machine in a natural way are prohibitive, training data must be chosen in a nontrivial manner. First, a set of optimality conditions,

10

specifically the Kuhn-Tucker conditions, are considered. Only those vectors which are relevant to the training, i.e. support vectors, are used. Memory requirements are quadratic in the size of this working vector set, so minimizing it is obviously key. The proposed solution is to decompose the problem into smaller sub-problems, a standard solution when such a decomposition is possible. Romdhani et al. [9] work further on reducing this vector set in order to improve performance. In [9], it is argued that the support vector set in detectors like the detector in [7] forms a proportion of the entire training set that stands to be reduced significantly. There has been a decent amount of research done on improving the performance of support vector machines since their development less than 10 years ago. Romdhani et al. apply one such method to improve the performance of an SVM-based face detector. Basically speaking, given a vector Ψ in the model's feature space (expressible, thanks to the model, as a sum over the support vector set), there is a good approximation Ψ′ to Ψ. Specifically, there is a good approximation which is expressible as a sum over a reduced vector set which is much smaller than the support vector set. Given a reduced set, the problem remains to minimize the norm of Ψ − Ψ′. This can be done in terms of the model's associated kernels.
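To make the role of the support vector set concrete, here is a minimal sketch of how an SVM detector scores a window as a kernel sum over its support vectors. Run time is proportional to the size of that set, which is precisely what the reduced-set method shrinks. The kernel, vectors, and coefficients below are toy values, not those of [7] or [9]:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Gaussian (RBF) kernel, a common choice for SVM detectors
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_decision(x, support_vecs, coeffs, bias, kernel=rbf_kernel):
    """f(x) = sum_i (alpha_i * y_i) * K(s_i, x) + b; face iff f(x) > 0.
    `coeffs` holds the products alpha_i * y_i for each support vector."""
    return sum(c * kernel(s, x) for s, c in zip(support_vecs, coeffs)) + bias

# Toy machine: one positive and one negative support vector.
support_vecs = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
coeffs = [1.0, -1.0]
x = np.array([0.9, 1.1])                      # near the positive vector
is_face = svm_decision(x, support_vecs, coeffs, bias=0.0) > 0
```

The point of the sketch is the loop bound: every classification costs one kernel evaluation per support vector, so halving the vector set (as in [9]) roughly halves run time.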

11

Akin to Sung and Poggio's bootstrap approach is the retraining performed in [9]. The positive results of such retraining are demonstrated in the context of a neural network-based detector in [10].

Schneiderman and Kanade propose a statistical model in [11]. To apply statistical methods to the problem, they represent visual attributes with wavelet coefficients. This method suits their needs because unlike other methods, with wavelets an image can be perfectly reconstructed from its transform with a coefficient set that has the same size as the image itself. Specifically speaking, their method uses three filter levels, giving 10 image sub-bands. This representation allows them to jointly model image data which is localized in space, frequency, and orientation. From this information, then, they are able to construct a histogram-based face detector. This method requires that initial histograms be constructed. Schneiderman and Kanade's approach to this is similar to that of Sung and Poggio, and is in fact, loosely speaking, a statistical analogue of the bootstrapping method described previously [11, 12]. Rather than giving every training example acquired through bootstrapping equal weight, they use an approach for faces that explicitly minimizes error in the training data. This


is done using AdaBoost, an algorithm for converting a weak learning method into one with high accuracy [4] (Viola and Jones' detector uses a boosting algorithm which is based on AdaBoost). In their training method for faces, Schneiderman and Kanade begin with a bootstrapping basis which is evenly weighted, then give more weight to training images which are identified as false positives. Just like Sung and Poggio's bootstrapping, this training can be iterated to improve robustness.
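A single round of AdaBoost-style reweighting, as described above, can be sketched as follows. The weights and labels are toy values, not the actual training data of [11]:

```python
import numpy as np

def adaboost_reweight(weights, correct):
    """One AdaBoost round: examples the current weak classifier got
    wrong are multiplied by exp(+alpha), correct ones by exp(-alpha),
    where alpha is derived from the weighted error.  The next round
    therefore concentrates on the misclassified examples."""
    weights = np.asarray(weights, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    err = weights[~correct].sum() / weights.sum()   # weighted error rate
    alpha = 0.5 * np.log((1 - err) / err)
    weights = weights * np.exp(np.where(correct, -alpha, alpha))
    return weights / weights.sum()                  # renormalize

# Four equally weighted examples; the last one was misclassified.
w = adaboost_reweight([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
# the single misclassified example now carries half of the total weight
```

This rebalancing is the sense in which bootstrapped false positives are given "more weight" rather than simply being appended to the training set.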

In [10], Rowley et al. present a face detection system based on artificial neural networks. This paper seems to have become an early standard in face detection, against which many researchers compare their results. Of course, this status may owe something to the fact that Rowley provided several authors with test data [9, 13]. The neural network is the most novel part of the paper, as the general method for detection is fairly standard, in terms of scanning over every pixel at various scales. The neural network contains three types of hidden units: one set of units for quadrants of the 20x20 image, one set for quadrants of the quadrants, and one set for looking at overlapping horizontal strips of the image. The idea is clear: certain hidden units will help detect certain facial characteristics. For example, since an oval binary mask is applied to the image in preprocessing, dark corner pixels will likely be removed in the case of a face. In this situation the quadrant hidden units are likely to sense the presence of eyes in the upper two quadrants. In order to train the neural network on a face data set, a large number of face images were used, in which feature points were labeled manually [10]. The locations of these feature points are averaged over the training set, then warped to coincide with predetermined points. Each face training image can then be aligned to the mean as the optimal solution to an overdetermined system. Iterating this method results in a suitably warped data set. This set is artificially enlarged as in other methods through rescaling, rotation, reflection, and translation. The result of this data set enlargement is that the neural network, as a filter, becomes invariant to these transformations within a range. Sung's bootstrapping method is used to determine a non-face data set. Rowley et al. provide interesting classification methods, which will be discussed in Chapter 3.
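The alignment step above, fitting a 2-D similarity transform that maps labeled feature points onto the mean points as an overdetermined least-squares system, can be sketched directly. The point coordinates are invented for illustration:

```python
import numpy as np

def align(points, targets):
    """Solve for the similarity transform (a, b, tx, ty) that best maps
    each point (x, y) to (a*x - b*y + tx, b*x + a*y + ty).  Stacking
    two linear equations per point pair gives an overdetermined system
    A @ [a, b, tx, ty] = t, solved by least squares."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(points, targets):
        rows.append([x, -y, 1, 0]); rhs.append(u)
        rows.append([y,  x, 0, 1]); rhs.append(v)
    params, *_ = np.linalg.lstsq(np.array(rows, float),
                                 np.array(rhs, float), rcond=None)
    return params  # a, b, tx, ty

# A pure translation by (1, 2) should be recovered exactly:
params = align([(0, 0), (1, 0), (0, 1)], [(1, 2), (2, 2), (1, 3)])
```

With more feature points than the four unknowns, the system is overdetermined and the least-squares solution is the "optimal" alignment in exactly the sense the paragraph describes.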

Papageorgiou, Oren, and Poggio, in what can be considered a conceptual precursor to the work of Viola and Jones, use Haar wavelets to create


an overcomplete representation of the face class [8]. The focus of their paper is on the development of their wavelet model. They provide a simple application of this model (for objects in general) to face and pedestrian detection. They use an extension of two-dimensional Haar wavelets called the quadruple density transform to create their redundant representative set. This initial set consists of 1734 coefficients for vertical, horizontal, and diagonal wavelets at scales of 2x2 pixels and 4x4 pixels. To avoid prohibitive computational costs in training the classifier, this set is reduced to a set of 37 significant coefficients through statistical analysis. Again, training is done using bootstrapping methods, as in [12, 10, 7, 9, 11]. In this case, Papageorgiou et al. train their system using a variety of penalties for misclassification [8]. Their results show a marginal improvement when the penalty for missed positives is an order of magnitude greater than the penalty for false detections. In practice, however, it seems there would be very little difference between the system under the various training schemes.

Following on the work of Papageorgiou et al., Viola and Jones present a much faster detector than any of their contemporaries [13]. The performance


can be attributed to the use of an attentional cascade, using low-feature-number detectors based on a natural extension of Haar wavelets [5]. The cascade itself has more to do with their classifier than with their model, so it will be discussed in the next chapter. Each detector in their cascade fits objects to simple rectangular masks, basically speaking. In order to avoid making many computations when moving through their cascade, Viola and Jones introduce a new image representation which they call an integral image, which is just what it sounds like. For each pixel in the original image, there is exactly one pixel in the integral image, whose value is the sum of the original image values above and to the left. The integral image can be computed quickly, and drastically improves computation costs under the rectangular feature model. As explained in [13], the integral image allows rectangular sums to be computed in four array references. This is easy to see when the model is considered; under the conventional representation of an image, the computation time needed would be proportional to the size of the rectangle. At the highest levels of the attentional cascade, where most of the comparisons are made, the rectangular features are very large. As the computation progresses down the cascade, the features can get smaller and smaller, but fewer locations are


tested for faces. Thus the advantage of the integral image representation is clear. The remaining difficulty lies in creating and training the attentional cascade, which also contributes very heavily to the detector's efficiency. Training the attentional cascade is similar to the other training methods seen, obviously adapted to suit the situation. Because of the cascade's nature, a very high detection rate is needed, but the false detection rate can also be very high, as the overall figures decrease exponentially with the figures for each individual cascade level, based on the depth of the cascade [13]. Each level of the cascade needs to reject examples that are closer to faces than the previous level (as each level inherits the previous level's accepted images). Viola and Jones therefore pass a large number of non-face examples to train the first cascade level, then pass those detected by the first level on to the next level, and so on. For face training, each level is trained on the same face set. This method is similar in spirit to the more basic bootstrapping methods adapted from Sung's method, but is geared toward the progressive nature of the attentional cascade.
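The integral image construction and the four-reference rectangular sum described above can be sketched directly:

```python
import numpy as np

def integral_image(img):
    """Each entry holds the sum of all pixels above and to the left
    (inclusive); two cumulative-sum passes build it in linear time."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] using at most four array
    references into the integral image, regardless of rectangle size."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]   # subtracted twice above
    return total

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
assert rect_sum(ii, 1, 1, 2, 2) == img[1:3, 1:3].sum()  # 5+6+9+10 = 30
```

The saving is exactly the one claimed in [13]: a naive sum over a w-by-h rectangle costs w*h reads, while the integral image costs four, independent of the feature's size.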


Chapter 3 Classifiers
Each model requires a classifier to determine whether given data are faces or non-faces. The classifier is, in general, some threshold applied to the data, usually some sort of goodness-of-fit measure. The classifiers for the models in Chapter 2 are discussed in this chapter. Recall that Sung and Poggio model their face likelihood with Gaussian clusters and anti-clusters in R^n (n, in their case, happens to be 283) [12]. In their implementation, six clusters and six anti-clusters are used. Obviously the number of clusters, and to a lesser extent the number of anti-clusters, will have a great effect on the receiver operating characteristic (ROC) curve. The numbers of clusters (i.e. the classifier architecture) were determined


empirically. This detector was tested for a number of different architectures, and the six-and-six architecture provided the best results. The Gaussian clusters used are non-isotropic, that is, the axes of each cluster are not of equal length. Sung and Poggio justify this under the belief that the actual face distribution can be locally more elongated along certain vector space directions than others [12]. This seems like a reasonable generalization, but it leaves us with the problem of choosing a suitable distance function. A natural choice for a model based on these non-isotropic clusters is the normalized Mahalanobis distance. The normalized Mahalanobis distance between an image under consideration x and the centre μ of a Gaussian cluster is

    M_n(x, μ) = (1/2) (n ln 2π + ln|Σ| + (x − μ)^T Σ^{-1} (x − μ)),    (3.1)

where Σ is the covariance matrix of the Gaussian cluster [12]. We can see that if the model contains a single Gaussian cluster, then thresholding at a fixed Mahalanobis distance from μ selects all vectors (images) which are within a fixed probability density in the model. Of several distance metrics tested for this model, a two-value combination yielded the best results. For a given vector, these two values are obtained for each cluster. The first, D1, is the Mahalanobis distance between the vector and the cluster centroid after

both have been projected to the space of the cluster's 75 most significant eigenvectors. The second, D2, is the Euclidean distance between the vector and its projection to this 75-dimensional space, i.e. its out-of-subspace error. For each cluster, then, this vector has a two-value distance. These are combined in a weighted sum and checked against a threshold to determine whether the vector is a face or not. Relative results for this particular classifier and its variants are discussed in Chapter 4. In terms of preprocessing work, Sung and Poggio perform the standard operations: image resizing, illumination gradient correction, and histogram equalization. Further, they mask the 19x19 pixel images, removing some border and especially corner pixels from consideration. Osuna et al. perform identical preprocessing for their support vector machine detector [12, 7]. For the support vector machine detectors, the obvious desire is to have faces on one side of the selected hyperplane and non-faces on the other side. This is the ideal classifier for the model. After training, the system is very similar to that of Sung and Poggio [7]. The simplicity of the classifier's criterion for support vector machines makes the run-time computation of the methods in [7] and [9] extremely simple. Pre-processing of tested images must be performed, but in general, these two methods give impressive


run-time computational savings (run-time complexity for these machines is proportional to the size of the support vector set [9]). As training is the crux of both support vector methods, the classifier is of relatively little interest in contrast to the training algorithms themselves.
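Returning to Sung and Poggio's distance measure, the normalized Mahalanobis distance of Eq. (3.1) is straightforward to compute. This sketch uses a toy 2-D cluster rather than their 283-dimensional image space, and omits the out-of-subspace term D2:

```python
import numpy as np

def normalized_mahalanobis(x, mu, sigma):
    """Eq. (3.1): (1/2)(n ln 2*pi + ln|Sigma| + (x-mu)^T Sigma^-1 (x-mu)).
    Up to an additive constant this is the negative log-density of the
    Gaussian, so thresholding it selects a fixed-density region."""
    n = len(x)
    diff = x - mu
    return 0.5 * (n * np.log(2 * np.pi)
                  + np.log(np.linalg.det(sigma))
                  + diff @ np.linalg.inv(sigma) @ diff)

mu = np.zeros(2)
sigma = np.eye(2)                                   # isotropic toy cluster
d0 = normalized_mahalanobis(mu, mu, sigma)          # at the cluster centre
d1 = normalized_mahalanobis(np.array([2.0, 0.0]), mu, sigma)
# with identity covariance, d1 - d0 is half the squared Euclidean distance
```

For a non-isotropic Σ, the same distance grows more slowly along the cluster's elongated axes, which is exactly why Sung and Poggio prefer it over plain Euclidean distance.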

Schneiderman and Kanade, for their classifier, use 17 statistical image attributes, some of which relate to only one sub-band, and some of which relate to several. Recall that the sub-bands represent different frequencies, orientations, and spaces. This means that sub-bands, when sampled to form an attribute, can interact in a number of different ways. The detector samples each of these 17 attributes over the object. Obviously some attributes will contribute to detecting a face more than others, e.g. the eyes and nose are more significant than the chin [11]. Their classifier thresholds a pattern's likelihood ratio, i.e. a threshold λ is chosen such that faces are exactly those regions for which
    [ ∏_{k=1}^{17} ∏_{x,y ∈ region} P_k(pattern_k(x,y), x, y | object) ]
    / [ ∏_{k=1}^{17} ∏_{x,y ∈ region} P_k(pattern_k(x,y), x, y | non-object) ] > λ,    (3.2)

which is a very natural value to threshold. The run-time calculations needed for this scheme are extensive, so a heuristic coarse-to-fine strategy is used, first thresholding values for level 1 wavelet coefficients, then further thresholding values for level 1 and 2 coefficients for areas not rejected, then applying the final classifier to the remaining regions.

Rowley et al. use arbitration between multiple neural networks to eliminate many of their false positives. However, it is first important to understand the classification criteria for a single neural network as implemented in [10]. Back-propagation with momentum is used as the network's training algorithm, and the training is done iteratively. This results in networks that are self-programmed to classify faces versus non-faces [1]. This is the beautiful part of this detector. What Rowley et al. do to reduce error after passing images through neural networks is twofold. The first heuristic used is based on the observation that false positives have overlapping multiple detections less frequently than do true faces. The merging step of the classifier demands that true faces have a certain number of overlapping detections. These detections are projected over various image scales in the image pyramid and a weighted centroid is computed. The result is a single detection where there once were many (and, subsequently, fewer false positives). The second step of the classifier involves arbitration between multiple networks. Because the networks are trained with random initial weights, there


is nondeterminism among networks trained in the same manner. Several methods were tested for successful arbitration: heuristics involving logical operations, and a separate neural network, itself designed to arbitrate between several networks. All of these arbitration methods work well; extensive result tables are given in [10], and will be discussed in Chapter 4. Preprocessing is explained thoroughly in [10]. Once an oval mask has been applied to the image, a linear best-fit function is calculated and subtracted from the image to correct lighting conditions. Histogram equalization is then performed. This step sets contrast and compensates for camera variations. This detector, like others, implements a coarse-to-fine formulation to improve performance. The combination of the detection and error prevention methods in [10] makes for an impressive detector, as will be outlined more clearly later on. It is no wonder that this paper is regarded as a standard against which new detectors are measured.

Papageorgiou et al. use a support vector machine, as in [7, 9], for their classifier, because such machines allow a small parameter count and minimize generalization error, a concern which arises in [12] in the context of model architecture (this is a global concern, assuming the absence of a


complete data set). In fact, Papageorgiou et al. create a reduced coefficient set which is akin to the support vectors in [7], in the sense that a small subset of training data can be used to accurately represent the relationship between faces and non-faces. In this paper, however, the focus is not on the implementation (i.e. the classifier), but rather on the value of the overcomplete set of wavelet coefficients in representing complex object classes, specifically the classes of faces and pedestrians [8].

Viola and Jones use a classifier that, largely for the sake of computational efficiency, is based on an attentional cascade. The individual weak classifiers are based on a variant of the AdaBoost algorithm, which converts weak classifiers into a strong classifier via boosting. To be detected, an image must be accepted by each level of a series of basic classifiers, each more discriminating than the last. The computational advantage comes from the fact that the initial levels of the cascade can use very simple features for their classifiers, and therefore can reject the vast majority of locations in an image quickly. By the time the cascade levels become more meaningful, they are operating only on a small proportion of the initial image locations. In their implementation, Viola and Jones use a 10-layer cascade, each of which


contains 20 rectangular features. They compare this against their initial, less efficient detector, which uses 200 rectangular features. One effect of the cascade strategy is that each classifier must have an extremely high detection rate, but can get away with false positive rates that would in other circumstances be thought abysmal. The reason for this is not hard to see: the false positive rate F of the entire K-layer cascade is
    F = ∏_{i=1}^{K} f_i,    (3.3)

where f_i is the false positive rate of the i-th classifier. Similarly, the cascade's detection rate is
    D = ∏_{i=1}^{K} d_i,    (3.4)

so the unusual constraints on detection and rejection rates are obviously justified [13].
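The effect of Eqs. (3.3) and (3.4) is easy to verify numerically. The per-level rates below are illustrative, not Viola and Jones' reported figures:

```python
# Overall cascade rates are products of the per-level rates, so a
# seemingly abysmal per-level false positive rate still compounds to
# something tiny, while a very high per-level detection rate is needed
# to keep the overall detection rate acceptable.
def cascade_rates(per_level_d, per_level_f, K):
    """Assumes every level has the same rates, for simplicity."""
    D = per_level_d ** K   # Eq. (3.4)
    F = per_level_f ** K   # Eq. (3.3)
    return D, F

D, F = cascade_rates(per_level_d=0.99, per_level_f=0.30, K=10)
# 99% detection and 30% false positives per level over 10 levels give
# roughly 90% overall detection with an overall F of about 6e-6
```

This is the arithmetic behind the remark that individual cascade levels can tolerate false positive rates that would be unacceptable for a monolithic classifier.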

More detailed analysis and research regarding sequential testing is done by Fleuret and Geman [3]. Their model is a statistical approach, like Schneiderman and Kanade's in [11], but their focus is on theoretical work regarding cascades.


Chapter 4 Results and Conclusions

4.1 Results

Viola and Jones easily present the best results in terms of computation time. In terms of their error rates, they provide impressive ROC curves and numerical figures that rival those of Rowley et al. [13]. Rowley et al. provide the most extensive test data of all the papers addressed, and reach very impressive results through their merging and arbitration methods [10]. With their six-cluster and six-anti-cluster architecture, Sung and Poggio reach fairly good results, while numerically, Schneiderman and Kanade boast what seem to be the best results [11]. Osuna et al. attain slightly better rates than


Sung and Poggio [7], and Romdhani et al. manage to vastly improve the speed of a support vector machine with only marginal loss in classification accuracy [9]. An interesting point is that while Schneiderman and Kanade only train their detector on front and profile face poses, the structure of faces aids the detection of partially averted faces: in some of the detected profiles in [11], this effect is noticeable. The reason for it can be seen if only the selected image window (a pentagon) is viewed. In some cases where the face is almost front-facing, half of the face, when viewed alone, looks very much like a profile. In this case, it doesn't seem like more than three poses are necessary.

4.2 Conclusions and Future Work

Despite the broad range of general approaches, several aspects seem to be particularly effective in face (and, in general, object) detection schemes. Sung's bootstrapping method for training detectors is very effective, and the reasons are clear; it is important to best define those areas on the border between faces and non-faces when classifying a window. Also evident is the fact that


due to the nature of the problem, classifier cascades are necessary in order to attain computationally cheap detection. Since the vast majority of pixels in a given image will represent non-face windows, it is very important that these pixels be rejected with as little computation as possible. Viola and Jones seem to have the best application of this principle, and it is aided by the fact that the wavelet representation of their classifiers is flexible enough to perform meaningful rejection with very little cost. In this sense, the integral image is absolutely key to a fast application of Haar wavelets. While Rowley, Baluja and Kanade's neural networks approach performs well and is beautiful in the sense that neural networks give such a conceptually natural simulation of human detection, the speed of such a scheme must be improved dramatically if it is to compete with the wavelet formulation. It seems unlikely that a detection scheme that does not use an easily decomposable model will move forward in the same way that wavelet applications seem able to. One obvious avenue for future research is model combinations, since weak classifiers can be used to quickly reject large portions of an image with an effective false negative rate of 0. After such rejection, a more powerful classifier could be used to scrutinize remaining areas.


It would be interesting to see the improvement in performance upon applying the integral image formulation to other Haar wavelet-based detectors, such as those of Papageorgiou et al. [8] and Schneiderman and Kanade [11]. In the same vein, it would be interesting to see a more powerful cascading scheme applied to the methods of Viola and Jones [13]. The work of Viola and Jones opens the way for further practical applications of face detection. One immediate such application would be buttressing face tracking methods. Running at 15 frames per second, a detector could add robust backup to a tracker. Furthermore, the integral image is a breakthrough in wavelet classification that can easily be seen to generalize well to other object classes.


Bibliography
[1] C.M. Bishop, Neural networks for pattern recognition, Oxford University Press, Oxford, 1995.
[2] Corinna Cortes and Vladimir Vapnik, Support-vector networks, Machine Learning 20 (1995), no. 3, 273-297.
[3] Francois Fleuret and Donald Geman, Coarse-to-fine face detection, International Journal of Computer Vision 41 (2001), no. 1/2, 85-107.
[4] Yoav Freund and Robert E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, European Conference on Computational Learning Theory, 1995, pp. 23-37.
[5] S. Mallat, A wavelet tour of signal processing, Academic Press, San Diego, 1998.


[6] Yoshio Matsumoto and Alexander Zelinsky, Real-time stereo face tracking system for visual human interfaces, Proc. Int'l Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time (1999), 77-82.
[7] E. Osuna, R. Freund, and F. Girosi, Training support vector machines: An application to face detection, 1997.
[8] C.P. Papageorgiou, M. Oren, and T. Poggio, A general framework for object detection, Proceedings of International Conference on Computer Vision (1998), 555-562.
[9] S. Romdhani, P. Torr, B. Scholkopf, and A. Blake, Computationally efficient face detection, Proc. Int. Conf. on Computer Vision (2001), II:695-700.
[10] Henry A. Rowley, Shumeet Baluja, and Takeo Kanade, Neural network-based face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998), no. 1, 23-38.
[11] H. Schneiderman and T. Kanade, A statistical approach to 3D object detection applied to faces and cars, IEEE Conference on Computer Vision and Pattern Recognition - to appear (2000).

[12] Kah Kay Sung and Tomaso Poggio, Example-based learning for view-based human face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998), no. 1, 39-51.
[13] Paul Viola and Michael Jones, Robust real-time object detection, International Journal of Computer Vision - to appear (2002).

