
Automatic Interpretation of Human Head Movements

M. Bichsel
University of Zurich
Department of Computer Science
Winterthurerstr. 190
8057 Zurich, Switzerland
Email: mbichsel@ifi.unizh.ch

A. P. Pentland
The Media Laboratory
Massachusetts Institute of Technology
20 Ames Street
Cambridge, MA 02139, USA
Email: sandy@media.mit.edu

Abstract

This paper describes a complete face tracking system that interprets human head movements in real time. The system combines motion analysis with reliable and efficient object recognition strategies. It classifies head movements as "yes" (nodding head), "no" (shaking head) or "nothing" (still head). The system's skill allows contactless man-machine interaction, thus giving access to a number of new applications.

This work was supported by the consortium VISAGE and the KWF grant No. 2440.1.

1 Introduction

As industrialization proceeds, the importance of interactions between man and machine increases rapidly. Information flow from machine to man has become comfortable and direct in recent years due to tremendous progress in computer graphics. Information flow from man to machine, on the other hand, is still on a low level: it is restricted to moving mice, pressing buttons, and typing character sequences on a keyboard. Automatic interpretation of gestures and facial expressions could reduce this imbalance and is therefore of central interest for current and future research. A completely passive approach, using a camera as an "electronic eye" and a computer as an interpreter, can provide the required skill and therefore must be the long-term goal. The computational tasks for man-machine interaction include real-time detection and tracking of human faces, individual facial features, hands, and other parts of the human body.

In recent years considerable progress has been made on the problems of face detection and recognition, especially in the processing of "mug shots," i.e. head-on face pictures with controlled illumination and scale. The best results have been obtained for 2-D, view-based techniques based on either template matching (e.g. [1], [2]), combined feature-and-template matching (e.g. [3]), or matching using "Eigenfaces," i.e. template matching using the Karhunen-Loeve transformation of a set of face pictures (e.g. [4], [5], [6]). Recently Ullman [7] and Poggio [8] have both argued that 3-D recognition can be accomplished using linear combinations of only four or five 2-D views.

Progress has also been made in object tracking and motion analysis. Most known approaches are based either on a local analysis of optical flow, e.g. [9], [10], [11], or on local feature tracking, e.g. [12]. In human body tracking the problem is slightly simpler because, usually, only the objects of interest are moving whereas the camera is static. The segmentation of moving parts in a static or stabilized scene has been addressed by a number of authors, in particular for human motion [13], [14].

2 System Overview
This paper addresses the problem of automatically interpreting human head movements in real time, including head detection, tracking, and motion interpretation. In order to solve this complex task a multi-component system approach is chosen (Figure 1). The processing starts with motion detection. If some motion is detected then the moving areas are segmented and a head detection module looks for a face within the moving segments. If no head is found the system returns control to the motion detector. If a head is found, on the other hand, its parameters, such as position and orientation, are determined. The estimated parameters serve as predictors for the subsequent frames. If the system succeeds in tracking a face over several frames then the head motion is evaluated and interpreted by a motion interpreting module. The result of this interpretation, at present, is a classification of the head movements as still head, nodding head, or shaking head. This result can be used by any other program.
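To make the control flow concrete, the sketch below traces Figure 1 as a loop. This is a structural sketch only: every helper function is a hypothetical stand-in for one of the paper's modules, and the placeholder bodies and the `min_track_frames` parameter are illustrative assumptions, not the original implementation.

```python
# Hypothetical stand-ins for the paper's modules (Sections 3-6); a real
# implementation would perform motion segmentation, template matching, etc.
def segment_moving_areas(frame): return []          # moving regions, if any
def detect_head(frame, search_region): return None  # head state or None
def predict_state(history): return history[-1]      # next state + error bounds
def interpret_motion(history): return "still head"  # "yes" / "no" / "still head"

def run_system(frames, min_track_frames=5):
    """Control loop of Figure 1 (a structural sketch, not the original code)."""
    history = []       # head states from successfully tracked frames
    prediction = None  # predicted state and error bounds, or None

    for frame in frames:
        if prediction is None:
            # Initialization phase: motion detection gates the full search.
            segments = segment_moving_areas(frame)
            if not segments:
                continue                              # "Motion? no" branch
            head = detect_head(frame, search_region=segments)
        else:
            # Tracking phase: search only within the predicted error bounds.
            head = detect_head(frame, search_region=prediction)

        if head is None:                              # "Head? no" branch
            prediction, history = None, []
            continue

        history.append(head)                          # position, orientation
        prediction = predict_state(history)           # cf. Section 5

        if len(history) >= min_track_frames:
            yield interpret_motion(history)           # cf. Sections 6 and 8
```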

3 Motion Segmentation
Initial position estimates for head candidates are obtained from a motion segmentation module [15]. In face recognition applications, typically, the camera is fixed and people are moving within a static or slowly changing environment. Therefore, detecting and segmenting moving objects in a static scene is an important task which can enormously simplify the subsequent face recognition steps. Traditional motion segmentation algorithms apply a thresholding operation in an early processing stage.

[Figure 1: flow diagram with the modules Motion Segmentation (Motion? decision), Head Detection (Head? decision), State Parameters (Position, Orientation), State Prediction, Motion Interpreter, Man-Machine Dialog Handler, and Application Specific Program.]

Figure 1: Flow diagram of the head motion interpretation system.

After this thresholding operation connectivity information is typically exploited by connected component labeling (e.g. [16]). Our algorithm, on the other hand, includes connectivity information directly in the probability estimates and applies a thresholding operation only at the very end. Traditional heuristics are replaced by experimentally confirmed properties of local image statistics. In particular, we exploit the fact that applying a gradient magnitude operator to textured regions or contours in a natural image results in an exponential distribution of the resulting image values [15]. As shown in [15], local background probability estimates can be well approximated by

\ln P(\mathrm{background} \mid G_T,\ \text{at least one background neighbour}) \approx
\begin{cases}
  -|\lambda|\,(|G_T| - \theta) & \text{if } |G_T| \ge \theta \\
  0 & \text{if } |G_T| < \theta
\end{cases}
\qquad (1)

where |G_T| is the local gradient magnitude after subtraction of subsequent images and θ is a transition threshold. The slope |λ| of the straight line above the transition threshold is not required to be known, because different slope values only correspond to different overall scalings of the final object-background function and can be included in the final segmentation threshold.

These local estimates are iteratively combined into an object-background probability estimate that includes global information about the object's simple connectedness [15].

Due to its reduced sensitivity to noise and its ability to resolve high-resolution details, our motion segmentation algorithm is especially well suited for segmenting people at varying distances to the camera and, hence, of varying size.
4 Head Detection

A fast head detector module, based on template matching enhanced with a coarse-to-fine search in parameter space [2], then verifies whether a head is present. This module was designed after a thorough analysis of the structure of the face image set in image space [17], where image space denotes the Nx·Ny-dimensional space defined by assigning a coordinate axis to each pixel of an Nx by Ny image, so that any image and, especially, any face image is represented as a single point in image space.

This analysis revealed that the set of all possible face images forms a small number of connected regions in image space and that these regions can be parametrized with the aid of the physical transformations that can act on a human face. As a consequence, a highly efficient face detection strategy emerged that is based on a coarse-to-fine search in the parametrized regions in image space, extending the resolution-pyramid based techniques proposed by Burt [18]. If a head is detected then the head detector determines its location, orientation, and, optionally, some additional parameters.

5 Head Tracking

Based on previously estimated head parameters, a prediction module foretells the parameters (position and orientation) for the next frame, as well as an error bound. In the initialization phase the head detection module must search the whole parameter space. In the tracking phase, on the other hand, the system only has to search a subregion of the parameter space, defined by the error bounds. This results in a considerable reduction of the time required for head verification and head parameter estimation. Due to a fast implementation of the head detector and state predictor, the time intervals between subsequently evaluated images are relatively small (about 0.14 sec). As a consequence, the error bounds remain small, thus allowing a large reduction of the search space. This, in turn, helps to maintain a high system speed.
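The paper does not specify the prediction model. A minimal sketch, assuming a constant-velocity predictor over the 2-D head position (the real module also predicts orientation) with hand-tuned, purely illustrative bound parameters, could look as follows.

```python
import numpy as np

def predict_search_window(history, base_bound=5.0, growth=2.0):
    """Predict the next head position and the error bound that limits the
    tracking-phase search (assumed constant-velocity model; base_bound
    and growth are illustrative values in pixels, not the paper's).
    """
    if len(history) < 2:
        return None  # not enough history: fall back to a full search
    pos = np.asarray(history[-1], dtype=float)
    vel = pos - np.asarray(history[-2], dtype=float)  # pixels per frame
    predicted = pos + vel
    # A short inter-frame interval (about 0.14 s) keeps this bound small.
    bound = base_bound + growth * np.linalg.norm(vel)
    return predicted, bound  # search the square of side 2*bound around it
```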
6 Motion Interpretation

A motion interpreter module evaluates the history of the head parameters and classifies the head movements as "yes", "no", or "still head". This interpretation can be used by any program, provided that it has a man-machine dialog handler (I/O interface) that can make use of this additional source of information.

7 Characterization of the Tracking System

The system runs on a Silicon Graphics 4D/320 VGX workstation with a VisionLab frame grabbing board. In the slower head detection mode, including motion segmentation, it requires 0.5 seconds per frame, whereas in the faster tracking mode it processes 7 frames per second.

The face tracking system has been tested qualitatively, running in real time, in dozens of cases and on five different people. In each case the head was correctly located and reliably tracked. In an effort to quantify the performance of the system, we tested it on a calibrated image sequence.

Figures 2(a), (b), and (c) show frames 1, 50, and 100 from a head motion sequence which was calibrated using a Polhemus magnetic position sensor [19]. In this sequence the translation of the head's center-of-mass is known to within approximately 1% accuracy over a workspace of approximately 1 m³. The path of the head's center-of-gravity is shown in Figure 2(d).

Figure 2: (a-c) Three frames from a calibrated image sequence used to test the algorithm's accuracy. (d) The path of the head's center of mass.

This image sequence was used to test the system's translational tracking accuracy. The system automatically found the face in the first frame, and then produced location estimates for each subsequent frame. The error in these estimates had a standard deviation of 4.1 pixels in the horizontal direction and 2.0 pixels in the vertical direction. This corresponds to a 3-D error of approximately 0.4 and 0.2 inches, respectively. Much of this error may be attributable to the difference between the head's center of mass (as estimated by the Polhemus magnetic sensor) and the center of the face (as estimated by our algorithm).

8 Analysis of Head Movements

Our motion analyzing module exploits the fact that a person's head coordinates show a characteristic pattern depending on whether the person is nodding, shaking his/her head, or keeping the head still. Figure 3 shows our system's response to a person constantly shaking his head (Figure 3(a)) or nodding (Figure 3(b)). Our motion interpreter evaluates the head coordinates (x(f), y(f)), where f is the frame number. A time-localized variance analysis of the signals x(f) and y(f), followed by a thresholding operation, allows a reliable classification as "yes", "no", or "still head".

Figure 3: (a) (x, y) head coordinates of a person continuously shaking his head. (b) (x, y) head coordinates of a nodding person. [Both plots show position against frame number f, with traces x(f) and y(f).]
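A minimal version of this classifier might look like the sketch below; the window length and variance threshold are assumed values, since the paper reports the method but not its parameters.

```python
import numpy as np

def classify_head_motion(x, y, window=15, still_threshold=4.0):
    """Classify head movement from the coordinate traces x(f), y(f).

    window (frames) and still_threshold (pixels^2) are assumed values.
    """
    var_x = float(np.var(x[-window:]))  # time-localized variance of x(f)
    var_y = float(np.var(y[-window:]))  # time-localized variance of y(f)

    if max(var_x, var_y) < still_threshold:
        return "still head"
    # Shaking ("no") is dominated by horizontal motion, nodding ("yes")
    # by vertical motion, as visible in Figure 3.
    return "no" if var_x > var_y else "yes"
```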

9 Practicability Tests

In order to test its suitability for real applications the system is provided with visual and acoustic feedback. It responds with a smiling face together with a short beep if the user nods, a sad face together with a long beep if the user shakes his head, and otherwise with a silent neutral face. Without any instruction, uninitiated subjects were able to communicate with the system within seconds, just by watching an experienced user. This demonstrates that our test system offers a direct and natural way of man-machine communication and opens the way to a large variety of user-friendly applications. We attribute the ease with which subjects learned to communicate with the system to the fact that head movements are one of the most important means of communication among people. Thus, the subjects did not have to learn anything new but simply applied their everyday skills.

In a second test a subject was asked to communicate 50 times a yes followed by a no, i.e. 100 different answers. Without any error, these 100 messages were communicated within 3 minutes. This means 1.8 seconds per message, including user action, evaluation of the action by the system, and auditory feedback.

10 Applications

Many interactions with current computer programs consist of simple yes-no answers or confirmation requests. Examples are: "A copy with the same file name already exists. Do you want to overwrite it?", "Save changes to File?", "File not found". The user then must move his mouse into a specific box and press the mouse button, type "yes" or "no", or type "y" or "n", etc. Some programs still complain if the user writes "yes" instead of "y" or vice versa. If the user has to move his mouse he must leave the keyboard, so that his ten-finger typing is interrupted. In these cases our system could be helpful for future, more user-friendly systems.

By combining multiple yes-no answers we can even answer general multi-valued questions. Figure 4 illustrates how a user would select number 6 out of 16 possibilities by giving 4 yes-no answers. At each step the system marks the remaining options by enclosing them within a thick rectangle. From these options the system forms two groups that are marked differently (dark or white). The user selects either group by answering yes (nodding) or no (shaking the head). In fact, the user can uniquely select one out of any finite number Ns of selections by giving log2(Ns) yes-no answers. Thus, although this possibility would be time-consuming, our system even allows contactless typing by giving 6 yes-no answers per character.

Figure 4: Selecting 6 out of 16 possibilities. Yes selects the grey region, no selects the white region within the thick rectangle. a) Select dark (nod), b) select dark (nod), c) select bright (shake head), d) select bright (shake head).

The fact that a user can give arbitrary commands to a machine simply by nodding or shaking the head could help patients after a stroke. In some cases these patients can neither move their limbs nor speak, but are still able to move their head. Thus, our system could help to ease the difficult time during which these patients must relearn to speak.
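The selection scheme of Figure 4 is simply a binary search over the options. The sketch below shows the log2(Ns) bound in action; `answer_yes` is a hypothetical stand-in for the system's nod/shake recognizer.

```python
def select_by_bisection(options, answer_yes):
    """Select one option with at most ceil(log2(N)) yes-no answers, as in
    Figure 4. answer_yes is a hypothetical stand-in for the nod/shake
    recognizer: it receives the current 'dark' group and returns True
    for a nod ("yes"), False for a head shake ("no").
    """
    remaining = list(options)
    while len(remaining) > 1:
        half = len(remaining) // 2
        dark, bright = remaining[:half], remaining[half:]
        remaining = dark if answer_yes(dark) else bright
    return remaining[0]

# Example: picking option 6 out of 1..16 takes log2(16) = 4 answers.
target = 6
assert select_by_bisection(range(1, 17), lambda dark: target in dark) == target
```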

11 Acknowledgements

We would like to thank our colleagues Martin J. Durst, Michel Hafner, and Rene Sennhauser for carefully looking through this manuscript and for helping to improve it with a number of corrections, suggestions, and good questions. We would also like to thank Prof. P. Stucki for continuously supporting this work and for providing us with a fruitful working environment.

References

[1] R. Brunelli and T. Poggio, "Face Recognition through Geometrical Features," Proceedings ECCV 92, Santa Margherita Ligure, 1992, pp. 792-800.

[2] M. Bichsel and A. Pentland, "Topological Matching for Human Face Recognition," M.I.T. Media Laboratory Vision and Modeling Group Technical Report No. 186, Jan. 1992.

[3] M. Bichsel, "Strategies of Robust Object Recognition for the Automatic Identification of Human Faces," Ph.D. thesis, ETH Zurich, No. 9467, 1991.

[4] M. Turk and A. Pentland, "Face Processing: Models for Recognition," Intelligent Robots and Computer Vision VIII, SPIE, Philadelphia, PA, 1989.

[5] M. Turk and A. Pentland, "Eigenfaces for Recognition," Journal of Cognitive Neuroscience, Vol. 3, No. 1, 1991, pp. 71-86.

[6] M. Kirby and L. Sirovich, "Application of the Karhunen-Loeve Procedure for the Characterization of Human Faces," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, No. 1, pp. 103-108, 1990.

[7] S. Ullman and R. Basri, "Recognition by Linear Combinations of Models," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 13, No. 10, pp. 992-1007, 1991.

[8] T. Poggio and S. Edelman, "A Network that Learns to Recognize Three-Dimensional Objects," Nature, Vol. 343, No. 6255, pp. 263-266, 1990.

[9] K. Kanatani, "Structure and Motion from Optical Flow under Orthographic Projection," Computer Vision, Graphics, and Image Processing, Vol. 35, pp. 181-199, 1986.

[10] P. J. Burt, R. Hingorani, and R. J. Kolczynski, "Mechanisms for Isolating Component Patterns in the Sequential Analysis of Multiple Motion," Proc. of the IEEE Workshop on Visual Motion, Nassau Inn, Princeton, New Jersey, pp. 187-193, 1991.

[11] J. R. Bergen et al., "Computing Two Motions from Three Frames," IEEE Proc. of the Third Int. Conf. on Computer Vision, Osaka, Japan, pp. 27-32, 1990.

[12] C. Tomasi and T. Kanade, "Shape and Motion without Depth: A Factorization Method - Point Features in 3D Motion," Technical Report CMU-CS-91-105, Carnegie Mellon University, Pittsburgh, Pennsylvania, Jan. 1991.

[13] A. Shio and J. Sklansky, "Segmentation of People in Motion," IEEE Proceedings, pp. 325-332, 1991.

[14] M. K. Leung and Y.-H. Yang, "Human Body Motion Segmentation in a Complex Scene," Pattern Recognition, Vol. 20, No. 1, pp. 55-64, 1987.

[15] M. Bichsel, "Segmenting Moving Objects in a Static Scene," Computer Science Department, University of Zurich, Technical Report No. 93.20, April 1993.

[16] R. C. Jain, "Segmentation of Frame Sequences Obtained by a Moving Observer," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 6, No. 5, pp. 624-629, 1984.

[17] M. Bichsel and A. Pentland, "Human Face Recognition and the Face Picture Set's Topology," CVGIP: Image Understanding, to be published.

[18] P. J. Burt, "Fast Filter Transforms for Image Processing," Computer Graphics and Image Processing, Vol. 16, pp. 20-51, 1981.

[19] A. Azarbayejani, T. Starner, B. Horowitz, and A. Pentland, "Interactive Graphics Without The Wires," IEEE Transactions on Pattern Analysis and Machine Intelligence, special issue on Computer Graphics and Computer Vision, to appear, 1993.
