
CHAPTER ONE

INTRODUCTION

1.1 Scope of Research
Driver fatigue and the drowsiness related to it are significant factors in a large number of
vehicle accidents. Recent statistics show that 1,200 deaths and 76,000 injuries can be
attributed annually to fatigue-related crashes. This disturbing trend urgently requires the
development of early warning systems that detect driver drowsiness while at the wheel
(Haro et al., 2000).
The development of technologies for detecting or preventing drowsiness at the wheel
has been a major challenge in the field of accident avoidance systems (Neeta, 2002). Because
of the hazard that drowsiness presents on the road, methods need to be developed for
counteracting its effects. The aim of this project is to improve on the development of
drowsiness detection systems. The focus is placed on designing a system that accurately
monitors the open or closed state of the driver's eyes in real time. By monitoring the eyes, it
is believed that the symptoms of driver fatigue can be detected early enough to avoid a car
accident. Detection of fatigue involves processing a sequence of images of the face and
observing eye movements and blink patterns.
Eye-blink detection plays an important role in human-computer interface (HCI)
systems. It can also be used in driver assistance systems. Studies show that eye-blink
duration is closely related to a subject's drowsiness (Kojima et al., 2001). The openness of
the eyes, as well as the frequency of eye blinks, indicates the level of a person's
consciousness, which has potential applications in monitoring a driver's vigilance level for
additional safety control. Eye blinks can also be used as a method of communication for
people with severe disabilities, in which blink patterns are interpreted as semiotic messages.
This provides an alternative input modality for controlling a computer: communication by
blink pattern. The duration of eye closure determines whether a blink is voluntary or
involuntary. Blink patterns are used by interpreting voluntary long blinks according to a
predefined semiotic dictionary, while ignoring involuntary short blinks (Black et al., 1997).
Eye-blink detection has attracted considerable research interest from the computer
vision community. In the literature, most existing techniques use two separate steps for eye
tracking and blink detection. For eye-blink detection systems, there are three types of
dynamic information involved: the global motion of the eye, the local motion of the eyelids,
and the eye openness/closure. Once the eye locations are estimated by the tracking algorithm,
the differences in image appearance between open eyes and closed eyes can be used to find
the frames in which the subject's eyes are closed, so that eye blinking can be determined.
Template matching is used to track the eyes, and color features are used to determine the
openness of the eyes. Detected blinks are then used together with pose and gaze estimates to
monitor the driver's alertness. Differences in intensity values between the upper eye and
lower eye are used for eye openness/closure classification, so that closed-eye frames can be
detected. The use of low-level features makes a real-time implementation of blink detection
systems feasible. However, for videos with large variations, such as typical videos collected
from in-car cameras, the acquired images are usually noisy and of low resolution. In such
scenarios, simple low-level features like color and image differences are not sufficient, and
temporal information is also used by other researchers for blink detection purposes (Grauman
et al., 2003).
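
As an illustration of the kind of low-level, image-difference approach described above, the
following Python sketch flags candidate blink frames by thresholding the mean frame
difference inside a fixed eye region. The video file name, region coordinates, and threshold
are hypothetical placeholders, and the sketch is not the system developed in this thesis.

    import cv2

    # Minimal sketch: flag candidate blink frames by thresholding the mean
    # absolute frame difference inside a fixed eye region of interest.
    EYE_ROI = (120, 80, 60, 30)           # x, y, width, height (assumed)
    DIFF_THRESHOLD = 15.0                 # assumed mean-change threshold

    cap = cv2.VideoCapture("driver.avi")  # assumed input video
    prev_roi = None
    blink_frames = []
    frame_idx = 0

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        x, y, w, h = EYE_ROI
        roi = gray[y:y + h, x:x + w]
        if prev_roi is not None:
            # A large appearance change in the eye region suggests eyelid motion.
            diff = cv2.absdiff(roi, prev_roi)
            if diff.mean() > DIFF_THRESHOLD:
                blink_frames.append(frame_idx)
        prev_roi = roi
        frame_idx += 1

    cap.release()
    print("candidate blink frames:", blink_frames)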




1.2 Justification of Work
Eye blink is a physiological activity of rapid closing and opening of the eyelids, an
essential function of the eyes that helps spread tears across, and remove irritants from, the
surface of the cornea and conjunctiva (Tsubota, 1998). Although blink speed can vary with
factors such as fatigue, emotional stress, behaviour category, amount of sleep, eye injury,
medication, and disease, researchers report (Karson, 1983) that the spontaneous resting blink
rate of a human being is approximately 15 to 30 eye blinks per minute. That is, a person
blinks roughly once every 2 to 4 seconds, and a blink lasts on average about 250 milliseconds.
Currently a generic camera can easily capture face video at not less than 15 fps (frames per
second), i.e. with a frame interval of not more than 70 milliseconds. Thus, it is easy for a
generic camera to capture two or more frames for each blink when a face looks into the
camera. The advantages of an eye-blink-based approach are that it is non-intrusive, requires
no extra hardware, and can generally be used without user collaboration. Moreover, eye-blink
behaviour is a prominent characteristic distinguishing a live face from a facial photograph
when using a generic camera.
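
The arithmetic above can be checked directly; a short Python calculation, assuming a
250-millisecond blink, gives the approximate number of frames captured per blink at several
common frame rates.

    # Frames captured during one ~250 ms blink at common camera frame rates.
    BLINK_DURATION_MS = 250

    for fps in (15, 25, 30):
        frame_interval_ms = 1000.0 / fps          # e.g. 15 fps -> ~66.7 ms
        frames_per_blink = BLINK_DURATION_MS / frame_interval_ms
        print(f"{fps} fps: interval {frame_interval_ms:.1f} ms, "
              f"~{frames_per_blink:.1f} frames per blink")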

1.3 Requirements
The system for tracking the eyes should be robust, non-intrusive, and inexpensive,
which is quite a challenge in the computer vision field. Nowadays eye tracking receives a
great deal of attention for applications such as facial expression analysis and driver awareness
systems. Today very accurate eye trackers that use external devices can be found. Most
modern eye trackers use contrast to locate the centre of the pupil, use infrared cameras to
create a corneal reflection, and triangulate both to determine the fixation point.

However, eye tracking setups vary greatly; some are head-mounted, some require the
head to be stable (for example, with a chin rest), and some automatically track the head as
well. The eye tracker described in this thesis is characterized by being non-invasive: no
external devices are needed for tracking the eyes besides the web camera, which records the
video stream. Moreover, the efficiency of the eye tracker is very important when working
with real-time communications.

1.4 Objectives of the Research
The specific objectives of the study are to
(a) develop an algorithm to identify and track the location of the driver's eyes;
(b) develop an algorithm for an eye-blink detection system;
(c) design a system that implements (a) and (b); and
(d) evaluate the performance of the system in (c).

1.5 Thesis Organisation
The remaining part of this thesis is organized as follows: Chapter Two discusses the major
image processing techniques used in the design of related systems and surveys related work;
the development of the algorithm is presented in Chapter Three; and the test experiments
conducted and their results are the contents of Chapter Four. Chapter Five concludes the
report and indicates some directions for further work.






CHAPTER TWO

LITERATURE REVIEW

2.1 Human Eye and Its Behavior
A close-up view of a typical open human eye is shown in Fig. 2.1. The most
significant feature of the eye is the iris. It has a ring structure with a large variety of colours.
The ring might not be completely visible even when the eye is in its normal (non-closed or
partly closed) state; visibility depends on individual variations (Uzunova, 2005). Most often,
it is partly occluded above by the upper eyelid. It may also be completely visible or occluded
by both eyelids. The iris changes its position as well, from centered to rolled to one side, or
rolled upwards or downwards. Depending on the speed, when the iris moves from side to
side the motion is called smooth pursuit or a saccade. A saccade is a rapid iris movement,
which happens when fixation jumps from one point to another (Galley et al., 2004).
Inside the iris is the pupil, a smaller dark circle whose size varies depending on the light
conditions. The sclera is the white visible portion of the eyeball; to the unaided eye it is the
brightest part of the eye region, and it directly surrounds the iris. Apart from these features,
the eye has two additional salient features: the upper and lower eyelids. Their Latin names
are palpebra superior (the upper eyelid) and palpebra inferior (the lower). The gap between
them is called the rima palpebrarum.
The eyelids' movements are constrained by their physical attributes. The upper eyelid
is a stretchable skin membrane that can cover the eye. It has great freedom of motion,
ranging from wide open to closed, with small deformations due to eyeball motion. When the
eye is open, the eyelid is a concave arc connecting the two eye corners. As the eye becomes





Fig. 2.1: The Human eye (Uzunova, 2005).









more and more closed, the curvature of the arc decreases; the eyelid takes a nearly line-like
shape when the eye is almost closed and follows the lower eyelid when the eye is closed.
The lower eyelid, on the other hand, is close to a straight line and moves to a smaller
degree. In this thesis the eyelid contours will be referred to as eyelids unless stated otherwise.
The eyelids meet each other at the eye corners (angulus oculi). Here the eye corners are
referred to as inner corners (those closer to the nose) and outer corners. The eye corners are
also called left or right corners according to the way they appear in the image. The
skin-colored growth close to the inner corner is a degenerated third membrane, called the
membrana nictitans.
The eye features, the iris and the eyelids, can be involved in very complex movements
as part of overall human behaviour, expressing different meanings. Here only the eye
features and local movements within the eye region are described. The eye blink is the focus
of this thesis. It is a natural act, which consists of a closing of the eye followed by an
opening, where the upper eyelid performs most of the movement. Similar to blinking is
eyelid fluttering, a quick wavering or flapping motion of the upper eyelid. Here blinking and
eyelid fluttering are not distinguished, but blinking and eye closing are not synonyms,
especially in the context of a driving safety system. To distinguish eye closing from blinking,
time has to be taken into account.
Blinking can be defined as a temporary hiding of the iris due to the touching of both
eyelids within one second, whereas closing takes a longer time. According to researchers
(Thorslund, 2003), blinking frequency is affected by different factors such as mood state, task
demand, etc. In a stress-free state the blink rate is 15-20 times per minute. It drops to about 3
times per minute during reading, and it increases under stress, time pressure, or when close
attention is required. The pattern for detecting drowsiness can be described as follows. In the
awake state, the eyelids are far apart before they close, they remain closed for only a short
interval, and single blinks are repeated rarely. As the person gets tired, the eyelids stay closer
to each other, the time during which the eye is closed increases, and the frequency of blinking
increases as well; in other words, drowsiness is characterized by long, flat blinks (Galley et
al., 2004).

2.2 Image Representation and Acquisition
Any visual scene can be represented by a continuous function (in two dimensions)
of some analogue quantity; typically this is the reflectance function of the scene, i.e. the light
reflected at each visible point in the scene. Such a representation is referred to as an image,
and the value at any point in the image corresponds to the intensity of the reflectance
function at that point.
A continuous analogue representation cannot be conveniently interpreted by a
computer, so an alternative representation, the digital image, must be used. Digital images
also represent the reflectance function of a scene, but they do so in a sampled and quantized
form (David, 1991). Fig. 2.2 shows a block diagram depicting the processing steps in
computer vision. The basic image acquisition equipment used in this study is the camera.
There are two types of semiconductor photosensitive sensor used in cameras: CCD
(charge-coupled device) and CMOS (complementary metal oxide semiconductor). In a CCD
sensor, every pixel's charge is transferred through just one output node to be converted to
voltage, buffered, and sent off-chip as an analogue signal; all of the pixel area can be devoted
to light capture. In a CMOS sensor, each pixel has its own charge-to-voltage conversion and the


Fig. 2.2: Block diagram depicting processing steps in computer vision (Zuechi, 2000).
(Blocks shown in the diagram: image acquisition; segmentation; coding (feature extraction);
enhancement (pre-processing); image analysis; decision making.)


sensor often includes amplifiers, noise correction, and digitization circuits, so that the chip
outputs digital bits (Sonka et al., 2008). These additional functions increase the design
complexity and reduce the area available for light capture, but the chip can be built to require
less off-chip circuitry for basic operation.
The development of semiconductor technology permits the production of matrix-like
sensors based on CMOS technology. Because this technology is used in mass production in
the semiconductor industry, and processors and memories are manufactured with the same
technology, the photosensitive matrix-like element can be integrated on the same chip as the
processor and/or operational memory. This opens the door to 'smart cameras', in which
image capture and basic image processing are performed on the same chip.
The major advantages of CMOS cameras (as opposed to CCD) are a higher range of
sensed intensities (about 4 orders of magnitude), a high read-out speed (about 100 ns), and
random access to individual pixels. The basic CCD element includes a Schottky photodiode
and a field-effect transistor. A photon falling on the junction of the photodiode liberates
electrons from the crystal lattice and creates holes, resulting in an electric charge that
accumulates in a capacitor. The collected charge is directly proportional to the light intensity
and the time during which light falls on the diode.
The sensor elements are arranged in a matrix-like grid of pixels on a CCD chip. The
charges accumulated by the sensor elements are transferred to a horizontal register one row at
a time by a vertical shift register, and are then shifted out in a bucket-brigade fashion to form
the video signal.
There are three inherent problems with CCD chips. First, the blooming effect is the
mutual influence of charge in neighboring pixels; current CCD sensor technology is able to
suppress this problem (anti-blooming) to a great degree. Second, it is impossible to address
individual pixels in a CCD chip directly, because read-out through the shift registers is
needed. Third, individual CCD sensor elements are able to accumulate only approximately
30-200 thousand electrons, while the usual level of inherent noise of a CCD sensor is about
20 electrons.
The signal-to-noise ratio (SNR) in the case of a CCD chip is

$\mathrm{SNR} = 20 \log_{10}\!\left(\dfrac{200\,000}{20}\right) \approx 80\ \mathrm{dB}$        (2.1)

This implies that the SNR is approximately 80 dB at best, so the CCD sensor is able to cope
with four orders of magnitude of intensity in the best case. This range drops to approximately
two orders of magnitude with common uncooled CCD cameras, while the range of incoming
light intensity variations is usually higher.

2.3 Image pre-processing
Pre-processing is the name used for operations on images at the lowest level of
abstraction; both input and output are intensity images (Sonka et al., 2008). These iconic
images are usually of the same kind as the original data captured by the sensor, with an
intensity image usually represented by a matrix or matrices of image function values
(brightness).
Pre-processing does not increase image information content. If information is
measured using entropy, then pre-processing typically decreases image information content.
From the information-theoretic viewpoint it can thus be concluded that the best pre-processing
is no pre-processing, and without question the best way to avoid (elaborate) pre-processing is
to concentrate on high-quality image acquisition. Nevertheless, pre-processing is very useful
in a variety of situations, since it helps to suppress information that is not relevant to the
specific image processing or analysis task. Therefore, the aim of pre-processing is an
improvement of the image data that suppresses undesired distortions or enhances some image
features important for further processing.
A considerable redundancy of information in most images allows image pre-processing
methods to exploit the image data itself to learn image characteristics in a statistical sense.
These characteristics are used either to suppress unintended degradations such as noise or to
enhance the image. Neighboring pixels corresponding to one object in real images have
essentially the same or similar brightness values, so if a distorted pixel can be picked out
from the image, it can usually be restored as an average of neighboring pixels.
Image pre-processing methods are classified into three categories, namely pixel
brightness transformations, geometric transformations, and local pre-processing. These
methods are discussed in detail in the following sections.

2.3.1 Pixel brightness transformations
A brightness transformation modifies pixel brightness; the transformation depends on
the properties of the pixel itself. There are two classes of pixel brightness transformations:
brightness corrections and gray-scale transformations. Brightness correction modifies pixel
brightness taking into account both its original brightness and its position in the image, while
gray-scale transformation changes brightness without regard to position.

2.3.1.1 Position dependent brightness correction
Ideally, the sensitivity of image acquisition and digitization devices should not
depend on position in the image, but this assumption is not valid in many practical cases. The
lens attenuates light more if it passes farther from the optical axis, and the photosensitive part
of the sensor (vacuum-tube camera, CCD camera elements) is not of identical sensitivity
everywhere. Uneven object illumination is also a source of degradation.
If the degradation is of a systematic nature, it can be suppressed by brightness
correction. A multiplicative error coefficient $e(i, j)$ describes the change from the ideal
identity transfer function. Assume that $f(i, j)$ is the original undegraded image (the desired
or true image) and $g(i, j)$ is the image containing the degradation. Then

$g(i, j) = e(i, j)\, f(i, j)$        (2.2)

The error coefficient $e(i, j)$ can be obtained if a reference image with known brightness is
captured, the simplest being an image of constant brightness $c$. The degraded result is the
image $f_c(i, j)$. The systematic brightness errors can then be suppressed as

$f(i, j) = \dfrac{g(i, j)}{e(i, j)} = \dfrac{c\, g(i, j)}{f_c(i, j)}$        (2.3)

This method can be used only if the image degradation process is stable. If we wish to
suppress this kind of error in the image capturing process, we should perhaps re-calibrate the
device (find the error coefficients $e(i, j)$) from time to time.
The brightness correction method implicitly assumes linearity of the transformation,
which is not true in reality because the brightness scale is limited to some interval. The
calculation according to equation 2.3 can overflow, in which case the limits of the brightness
scale are used instead; this implies that the best reference image has a brightness that is far
enough from both limits. If the gray-scale has 256 brightness levels, the ideal reference image
has a constant brightness value of 128.
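
A minimal NumPy sketch of the correction in equation (2.3) is given below; the function and
variable names are illustrative, and a captured reference image f_c of nominal brightness
c = 128 is assumed.

    import numpy as np

    def correct_brightness(g, f_c, c=128.0):
        """Suppress systematic brightness errors as in equation (2.3):
        f(i, j) = c * g(i, j) / f_c(i, j), where f_c is a captured reference
        image of (ideally) constant brightness c."""
        g = g.astype(np.float64)
        f_c = np.maximum(f_c.astype(np.float64), 1e-6)  # avoid division by zero
        f = c * g / f_c
        # The brightness scale is limited, so clip overflowing values (see text).
        return np.clip(f, 0, 255).astype(np.uint8)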




2.3.1.2 Gray-scale transformation
Gray-scale transformations do not depend on the position of the pixel in the image. A
transformation $T$ of the original brightness $p$ from scale $[p_0, p_k]$ into brightness $q$ from a
new scale $[q_0, q_k]$ is given by

$q = T(p)$        (2.4)

The most common gray-scale transformations are shown in Fig. 2.3(a): the piecewise linear
function (a) enhances the image contrast between brightness values $p_1$ and $p_2$; the function
(b) is called brightness thresholding and results in a black-and-white image; and the straight
line (c) denotes the negative transformation. Digital images have a very limited number of
gray-levels, so gray-scale transformations are easy to realize both in hardware and in
software. Often only 256 bytes of memory (called a look-up table) are needed.
The original brightness is the index into the look-up table, and the table content gives the
new brightness. The image signal usually passes through a look-up table in image displays,
enabling simple gray-scale transformations in real time. The same principle can be used for
color displays: a color signal consists of three components (red, green, and blue), so three
look-up tables provide all possible color scale transformations.
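
The look-up-table idea can be sketched in Python with OpenCV as follows; the input file
name and the brightness limits p1 and p2 of the contrast stretch are assumed values.

    import numpy as np
    import cv2

    # 256-entry look-up tables realizing two common gray-scale transformations.
    p = np.arange(256, dtype=np.float64)

    negative_lut = (255 - p).astype(np.uint8)                 # negative transformation
    # Piecewise-linear contrast stretch between two (assumed) brightness values.
    p1, p2 = 60, 180
    stretch_lut = np.clip((p - p1) * 255.0 / (p2 - p1), 0, 255).astype(np.uint8)

    img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)       # assumed input
    negative = cv2.LUT(img, negative_lut)
    stretched = cv2.LUT(img, stretch_lut)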
Gray-scale transformations are used mainly when an image is viewed by a human
observer, and a transformed image may be more easily interpreted if the contrast is enhanced.
For instance, an X-ray image can often be much clearer after transformation. A gray-scale
transformation for contrast enhancement is usually found automatically using the histogram
equalization technique. The aim is to create an image with equally distributed brightness
levels over the whole brightness scale, as in Fig. 2.3(b). Histogram equalization enhances
contrast for brightness values close to histogram maxima and decreases contrast near
minima.




(a): Perspective progression geometric examples

(b): Histogram equalization of images

Fig. 2.3: Perspective progression and histogram equalization of images (Sonka et al., 2008)



Denote the input histogram by $H(p)$ and recall that the input gray-scale is $[p_0, p_k]$.
The intention is to find a monotonic pixel brightness transformation $q = T(p)$ such that the
desired output histogram $G(q)$ is uniform over the whole output brightness scale $[q_0, q_k]$.
The histogram can be treated as a discrete probability density function. The monotonic
property of the transform implies

$\sum_{i=0}^{k} G(q_i) = \sum_{i=0}^{k} H(p_i)$        (2.5)

The sums in equation 2.5 can be interpreted as discrete distribution functions. Assume that
the image has $N$ rows and $N$ columns; then the equalized histogram corresponds to the
uniform probability density function whose function value is a constant:

$G(q) = \dfrac{N^2}{q_k - q_0}$        (2.6)

The value from equation (2.6) replaces the left side of equation (2.5). The equalized
histogram can be obtained precisely only for the idealized continuous probability density, in
which case equation 2.5 becomes

$\dfrac{N^2 (q - q_0)}{q_k - q_0} = \int_{p_0}^{p} H(s)\, ds$        (2.7)

The desired pixel brightness transformation can then be derived as

$q = T(p) = \dfrac{q_k - q_0}{N^2} \int_{p_0}^{p} H(s)\, ds + q_0$        (2.8)

The integral in equation (2.8) is called the cumulative histogram, which is approximated by a
sum in digital images, so the resulting histogram is not equalized ideally. The discrete
approximation of the continuous pixel brightness transformation from equation 2.8 is

$q = T(p) = \dfrac{q_k - q_0}{N^2} \sum_{i=p_0}^{p} H(i) + q_0$        (2.9)
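
A small Python sketch of the discrete transformation in equation (2.9) is shown below,
together with OpenCV's built-in routine for comparison; the input image name is assumed.

    import numpy as np
    import cv2

    def equalize(img, q0=0, qk=255):
        """Discrete histogram equalization following equation (2.9)."""
        hist = np.bincount(img.ravel(), minlength=256)   # H(i)
        cum = np.cumsum(hist)                            # cumulative histogram
        n_pixels = img.size                              # N*N in the text
        lut = (q0 + (qk - q0) * cum / n_pixels).astype(np.uint8)
        return lut[img]

    img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # assumed input
    manual = equalize(img)
    builtin = cv2.equalizeHist(img)                      # reference implementation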
2.3.2 Geometric transformations
Geometric transforms are common in computer graphics and are often used in image
analysis as well. They permit the elimination of the geometric distortion that occurs when an
image is captured, and if one attempts to match two different images of the same object, a
geometric transformation may be needed. We consider geometric transformations only in 2D,
as this is sufficient for most digital images. One example is an attempt to match remotely
sensed images of the same area taken a year apart, when the more recent image was probably
not taken from precisely the same position. To inspect changes over the year, it is necessary
first to execute a geometric transformation and then subtract one image from the other.
A geometric transform is a vector function T that maps the pixel (x, y) to a new
position (x', y'); an illustration of a whole region transformed on a point-to-point basis is
shown in Fig. 2.4. T is defined by its two component equations

$x' = T_x(x, y), \qquad y' = T_y(x, y)$        (2.10)
The transformation equations $T_x$ and $T_y$ are either known in advance (for example, in the
case of rotation, translation, or scaling) or can be determined from the original and
transformed images; several pixels with known correspondences in both images are used to
derive the unknown transformation. A geometric transform consists of two basic steps. The
first is the pixel co-ordinate transformation, which maps the co-ordinates of the input image
pixel to a point in the output image. The output point co-ordinates should be computed as
continuous values (real numbers), as the position does not necessarily match the digital grid
after the transform. The second step is to find the point in the digital raster which matches the
transformed point and to determine its brightness value.






Fig. 2.4: Geometric transform on a plane of images (Sonka et al., 2008).








The brightness is usually computed as an interpolation of the brightnesses of several
points in the neighborhood. This idea enables the classification of geometric transforms
among other pre-processing techniques, the criterion being that only the neighborhood of a
processed pixel is needed for the calculation. Geometric transforms are thus on the boundary
between point and local operations.
2.3.2.1 Pixel co-ordinate transformations
Equation (2.10) shows the general case of finding the co-ordinates of a point in the
output image after a geometric transform. It is usually approximated by a polynomial
equation

$x' = \sum_{r=0}^{m} \sum_{k=0}^{m-r} a_{rk}\, x^{r} y^{k}, \qquad y' = \sum_{r=0}^{m} \sum_{k=0}^{m-r} b_{rk}\, x^{r} y^{k}$        (2.11)

This transform is linear with respect to the coefficients $a_{rk}$, $b_{rk}$, and so if pairs of
corresponding points $(x, y)$, $(x', y')$ in both images are known, it is possible to determine
$a_{rk}$, $b_{rk}$ by solving a set of linear equations. More points than coefficients are usually
used to provide robustness; the mean square method is often used.
In the case where the geometric transform does not change rapidly depending on
position in the image, low-order approximating polynomials, m = 2 or m = 3, are used,
needing at least 6 or 10 pairs of corresponding points. The corresponding points should be
distributed in the image in a way that can express the geometric transformation; usually they
are spread uniformly. In general, the higher the degree of the approximating polynomial, the
more sensitive the geometric transform is to the distribution of the pairs of corresponding
points. In practice, equation (2.10) is approximated by a bilinear transform, for which four
pairs of corresponding points are sufficient to find the transformation coefficients:

$x' = a_0 + a_1 x + a_2 y + a_3 x y, \qquad y' = b_0 + b_1 x + b_2 y + b_3 x y$        (2.12)

Even simpler is the affine transformation, for which three pairs of corresponding points are
sufficient to find the coefficients:

$x' = a_0 + a_1 x + a_2 y, \qquad y' = b_0 + b_1 x + b_2 y$        (2.13)


The affine transformation includes typical geometric transformations such as rotation,
translation, scaling, and skewing. A geometric transform applied to the whole image may
change the co-ordinate system, and a Jacobian J provides information about how the
co-ordinate system changes:

$J = \left| \dfrac{\partial(x', y')}{\partial(x, y)} \right| = \begin{vmatrix} \partial x'/\partial x & \partial x'/\partial y \\ \partial y'/\partial x & \partial y'/\partial y \end{vmatrix}$        (2.14)

If the transformation is singular (has no inverse), then J = 0. If the area of the image is
invariant under the transformation, then J = 1.
The Jacobian for the bilinear transform in equation 2.12 is

$J = a_1 b_2 - a_2 b_1 + (a_1 b_3 - a_3 b_1)\, x + (a_3 b_2 - a_2 b_3)\, y$        (2.15)

and for the affine transformation in equation 2.13 it is

$J = a_1 b_2 - a_2 b_1$        (2.16)
Some important geometric transformations are:
Rotation by the angle $\varphi$ about the origin:

$x' = x \cos\varphi + y \sin\varphi, \qquad y' = -x \sin\varphi + y \cos\varphi, \qquad J = 1$        (2.17)

Change of scale $a$ in the x axis and $b$ in the y axis:

$x' = a x, \qquad y' = b y, \qquad J = ab$        (2.18)

Skewing by the angle $\varphi$:

$x' = x + y \tan\varphi, \qquad y' = y, \qquad J = 1$        (2.19)
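
The rotation, scaling, and skewing above are all special cases of the affine transformation,
which the following OpenCV sketch applies to an image; the file name, angle, scale, and
corresponding points are assumed values, and bilinear resampling is used (brightness
interpolation is discussed in Section 2.3.2.2).

    import cv2
    import numpy as np

    img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # assumed input
    rows, cols = img.shape

    # Rotation by 15 degrees about the image centre combined with scaling by 1.2,
    # i.e. an affine transform of the form of equation (2.13).
    M = cv2.getRotationMatrix2D((cols / 2, rows / 2), 15, 1.2)
    rotated = cv2.warpAffine(img, M, (cols, rows), flags=cv2.INTER_LINEAR)

    # An affine transform can also be estimated from three pairs of
    # corresponding points, as discussed above (coordinates are hypothetical).
    src = np.float32([[0, 0], [cols - 1, 0], [0, rows - 1]])
    dst = np.float32([[10, 20], [cols - 30, 5], [25, rows - 40]])
    A = cv2.getAffineTransform(src, dst)
    warped = cv2.warpAffine(img, A, (cols, rows))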



It is possible to approximate complex geometric transformations (distortions) by
partitioning an image into smaller rectangular sub-images; for each sub-image a simple
geometric transformation, such as the affine, is estimated using pairs of corresponding pixels.
The geometric transformation (distortion) is then repaired separately in each sub-image.
There are some typical geometric distortions which have to be overcome in remote
sensing. Errors may be caused by distortion of the optical systems, by non-linearity in
row-by-row scanning, and by a non-constant sampling period; they appear as wrong position
or orientation, skew, and line non-linearity distortions. Panoramic distortion (Fig. 2.5b)
appears in line scanners with the mirror rotating at constant speed. Line non-linearity
distortion (Fig. 2.5a) is caused by the variable distance of the object from the scanner mirror.
The rotation of the Earth during image capture in a mechanical scanner generates skew
distortion (Fig. 2.5c). Change of distance from the sensor induces change-of-scale distortion
(Fig. 2.5e). Perspective progression causes perspective distortion (Fig. 2.5f).


(a) Line non-linearity distortion; (b) panoramic distortion; (c) skew distortion;
(d) paranormal distortion; (e) change-of-scale distortion; (f) perspective distortion

Fig. 2.5: Geometric distortion types in images (Sonka et al., 2008).



2.3.2.2 Brightness interpolation
Brightness interpolation influences image quality. The simpler the interpolation, the
greater the loss in geometric and photometric accuracy, but the interpolation neighborhood is
often kept reasonably small due to computational load. The three most common interpolation
methods are nearest-neighbor, linear, and bi-cubic.
The brightness interpolation problem is usually expressed in a dual way, by
determining the brightness of the original point in the input image that corresponds to a point
in the output image lying on the discrete raster. Assume that we wish to compute the
brightness value of the pixel $(x', y')$ in the output image, where $x'$ and $y'$ lie on the discrete
raster (integer numbers, illustrated by solid lines in Fig. 2.6). The co-ordinates of the point
$(x, y)$ in the original image can be obtained by inverting the planar transformation in
equation (2.10):

$(x, y) = T^{-1}(x', y')$        (2.20)

In general, the real co-ordinates after the inverse transformation (dashed lines in Fig. 2.6) do
not fit the discrete raster of the input image (solid lines), and so the brightness is not known.
The only information available about the originally continuous image $f(x, y)$ is its sampled
version $g_s(l\,\Delta x, k\,\Delta y)$. The interpolated brightness can be expressed by the
convolution equation

$f_n(x, y) = \sum_{l} \sum_{k} g_s(l\,\Delta x, k\,\Delta y)\, h_n(x - l\,\Delta x,\, y - k\,\Delta y)$        (2.21)

The function $h_n$ is called the interpolation kernel. Usually a small neighborhood is used,
outside which $h_n$ is zero (Sonka et al., 2008).


Nearest-neighborhood interpolation assigns to the point $(x, y)$ the brightness value of
the nearest point $g_s$ in the discrete raster, as shown in Fig. 2.6(a). On the right side of the
figure is the interpolation kernel $h_1$ in the 1D case; the left side shows how the new
brightness is assigned. Dashed lines show how the inverse planar transformation maps the
raster





(a): Nearest neighborhood interpolation



(b): Linear interpolation

Fig. 2.6: Interpolation types in images (Sonka et al., 2008).





of the output image; full lines show the raster of the input image. Nearest-neighborhood
interpolation is given by

$f_1(x, y) = g_s\big(\operatorname{round}(x), \operatorname{round}(y)\big)$        (2.22)

The position error of nearest-neighborhood interpolation is at most half a pixel. This error is
perceptible on objects with straight-line boundaries, which may appear step-like after the
transformation.
Linear interpolation explores the four points neighboring the point $(x, y)$ and assumes
that the brightness function is linear in this neighborhood. Linear interpolation is demonstrated
in Fig. 2.6(b) and is given by

$f_2(x, y) = (1 - a)(1 - b)\, g_s(l, k) + a(1 - b)\, g_s(l + 1, k) + (1 - a)b\, g_s(l, k + 1) + ab\, g_s(l + 1, k + 1)$        (2.23)

where $l = \lfloor x \rfloor$, $k = \lfloor y \rfloor$, $a = x - l$, and $b = y - k$.
Linear interpolation can cause a small decrease in resolution, and blurring due to its
averaging nature; however, the problem of step-like boundaries seen with nearest-neighborhood
interpolation is reduced.
Bi-cubic interpolation improves the model of the brightness function by approximating
it locally by a bi-cubic polynomial surface; 16 neighboring points are used for the
interpolation. The one-dimensional interpolation kernel ('Mexican hat') is shown in Fig. 2.7
and is given by

$h_3(x) = \begin{cases} 1 - 2|x|^2 + |x|^3 & \text{for } 0 \le |x| < 1 \\ 4 - 8|x| + 5|x|^2 - |x|^3 & \text{for } 1 \le |x| < 2 \\ 0 & \text{otherwise} \end{cases}$        (2.24)









Fig. 2.7: Bi-cubic interpolation kernel (Sonka et al., 2008).








Bi-cubic interpolation is often used in raster displays that enable zooming with respect to an
arbitrary point; if the nearest-neighborhood method were used, areas of the same brightness
would simply increase in size. Bi-cubic interpolation preserves fine details in the image very
well.
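
The three interpolation schemes can be compared with a short OpenCV sketch such as the
one below; the input image name and zoom factor are assumed.

    import cv2

    img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)     # assumed input

    # Zoom by a factor of 4 with the three interpolation schemes discussed above.
    nearest = cv2.resize(img, None, fx=4, fy=4, interpolation=cv2.INTER_NEAREST)
    bilinear = cv2.resize(img, None, fx=4, fy=4, interpolation=cv2.INTER_LINEAR)
    bicubic = cv2.resize(img, None, fx=4, fy=4, interpolation=cv2.INTER_CUBIC)
    # Nearest-neighbor shows step-like boundaries, bilinear is slightly blurred,
    # and bi-cubic preserves fine detail best, as noted in the text.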

2.3.3 Local pre-processing
Local pre-processing uses a small neighborhood of a pixel in the input image to
produce a new brightness value in the output image. Such pre-processing operations are also
called filtration (or filtering) in signal processing terminology. Local pre-processing methods
can be divided into two groups according to the goal of the processing: smoothing and edge
detection. Smoothing aims to suppress noise or other small fluctuations in the image; it is
equivalent to the suppression of high frequencies in the Fourier transform domain.
Unfortunately, smoothing also blurs the sharp edges that bear important information about
the image.

2.3.3.1 Image smoothing
Image smoothing is the set of local pre-processing methods whose predominant use is
the suppression of image noise; it exploits the redundancy in the image data. Calculation of
the new value is based on the averaging of brightness values in some neighborhood.
Smoothing poses the problem of blurring the sharp edges in the image, and so attention is
paid here to smoothing methods that are edge preserving. Local image smoothing can
effectively eliminate noise or degradation appearing as thin stripes, but it does not work if the
degradations are large blobs or thick stripes (Sonka et al., 2008).

2.3.3.2 Median filtering
In probability theory, the median divides the higher half of a probability distribution
from the lower half. For a random variable x, the median M is the value for which the
probability of the outcome x < M is 0.5. The median of a finite list of real numbers can be
found by ordering the values and selecting the middle one; lists are often constructed to be
odd in length to secure uniqueness.
Median filtering is a non-linear smoothing method that reduces the blurring of edges:
the idea is to replace the current point in the image by the median of the brightnesses in its
neighborhood. The median of the brightnesses in the neighborhood is not affected by
individual noise spikes, so median smoothing eliminates impulse noise quite well. Further, as
median filtering does not blur edges much, it can be applied iteratively. Clearly, performing a
sort on the pixels within a rectangular window at every pixel position may become very
expensive. A more efficient approach is to notice that, as the window moves across a row by
one column, the only change to its contents is to lose the leftmost column and gain a new
right column; for a median window of m rows and n columns, mn - 2m pixels are unchanged
and do not need re-sorting.
The main disadvantage of median filtering in a rectangular neighborhood is that it
damages thin lines and sharp corners in the image; this can be avoided if a different
neighborhood shape is used. For instance, if horizontal/vertical lines need preserving, a
neighborhood such as that in Fig. 2.8 can be used.
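
A brief OpenCV sketch of median filtering on synthetically corrupted data is given below; the
noise level, window size, and input image are assumed for illustration.

    import numpy as np
    import cv2

    img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)      # assumed input

    # Add synthetic impulse (salt-and-pepper) noise for illustration.
    noisy = img.copy()
    mask = np.random.rand(*img.shape)
    noisy[mask < 0.02] = 0
    noisy[mask > 0.98] = 255

    # A 3x3 median filter removes the impulses while keeping edges fairly sharp;
    # compare with an averaging filter of the same size, which blurs edges.
    median = cv2.medianBlur(noisy, 3)
    average = cv2.blur(noisy, (3, 3))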










Fig. 2.8: Horizontal/vertical line preserving neighborhood for median filtering (Sonka et al.,
2008).












2.3.3.4 Non-linear mean filter
The non-linear mean filter is another generalization of averaging techniques (Pitas and
Venetsanopoulos, 1986); it is defined by

$f(m, n) = u^{-1}\!\left( \dfrac{\sum_{(i,j) \in O} a(i, j)\, u\big(g(i, j)\big)}{\sum_{(i,j) \in O} a(i, j)} \right)$        (2.25)

where $f(m, n)$ is the result of the filtering, $g(i, j)$ is a pixel in the input image, and $O$ is a
local neighborhood of the current pixel $(m, n)$. The function $u$ of one variable has an
inverse function $u^{-1}$; the $a(i, j)$ are weight coefficients. If the weights $a(i, j)$ are constant,
the filter is called homomorphic.
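
A direct (unoptimized) NumPy sketch of equation (2.25) with constant weights and u = log,
i.e. a homomorphic filter computing a local geometric mean, might look as follows; the
window size is an assumed parameter.

    import numpy as np

    def nonlinear_mean_filter(img, size=3, u=np.log, u_inv=np.exp):
        """Sketch of the non-linear mean filter of equation (2.25) with constant
        weights a(i, j) = 1, so the filter is homomorphic. With u = log and
        u_inv = exp it computes a local geometric mean."""
        img = img.astype(np.float64) + 1.0          # avoid log(0)
        pad = size // 2
        padded = np.pad(img, pad, mode="edge")
        out = np.empty_like(img)
        for m in range(img.shape[0]):
            for n in range(img.shape[1]):
                window = padded[m:m + size, n:n + size]
                out[m, n] = u_inv(np.mean(u(window)))
        return np.clip(out - 1.0, 0, 255).astype(np.uint8)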

2.3.3.5 Edge detectors
Edge detectors are a collection of very important local image pre-processing methods
used to locate changes in the intensity function; edges are pixels where this function
(brightness) changes abruptly. Edges are to a certain degree invariant to changes of
illumination and viewpoint. If only edge elements with strong magnitude (edgels) are
considered, such information often suffices for image understanding. The positive effect of
such a process is that it leads to a significant reduction of image data; nevertheless, such a
data reduction does not undermine understanding of the content of the image (interpretation)
in many cases.
An edge is a property attached to an individual pixel and is calculated from the
behavior of the image function in a neighborhood of that pixel. It is a vector variable with
two components, magnitude and direction: the edge magnitude is the magnitude of the
gradient, and the edge direction is rotated with respect to the gradient direction by -90°. The
gradient direction gives the direction of maximum growth of the image function, e.g. from
black to white. This is illustrated in Fig. 2.9(a), in which the closed lines are lines of equal
brightness and the orientation 0° points east.
Edges are often used in image analysis for finding region boundaries. Provided that
the region has homogeneous brightness, its boundary is at the pixels where the image
function varies, and so in the ideal case without noise it consists of pixels with high edge
magnitude. It can be seen that the boundary and its parts (edges) are perpendicular to the
direction of the gradient. The edge profile in the gradient direction (perpendicular to the edge
direction) is typical for edges; Fig. 2.9(b) shows several standard profiles. Roof edges are
typical for objects corresponding to thin lines in the image. Edge detectors are usually tuned
for some type of edge profile.
The gradient magnitude $|\mathrm{grad}\, g(x, y)|$ and gradient direction $\psi$ are continuous
image functions calculated as

$|\mathrm{grad}\, g(x, y)| = \sqrt{\left(\dfrac{\partial g}{\partial x}\right)^{2} + \left(\dfrac{\partial g}{\partial y}\right)^{2}}$        (2.26)

$\psi = \arg\!\left(\dfrac{\partial g}{\partial x}, \dfrac{\partial g}{\partial y}\right)$        (2.27)

where $\arg(x, y)$ is the angle (in radians) from the x axis to the point (x, y). Sometimes we
are interested only in edge magnitudes without regard to their orientations; a linear
differential operator called the Laplacian may then be used. The Laplacian has the same
properties in all directions and is therefore invariant to rotation in the image. It is defined as

$\nabla^{2} g(x, y) = \dfrac{\partial^{2} g(x, y)}{\partial x^{2}} + \dfrac{\partial^{2} g(x, y)}{\partial y^{2}}$        (2.28)
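
Equations (2.26)-(2.28) can be sketched in Python using Sobel approximations of the partial
derivatives; the input image name is assumed.

    import numpy as np
    import cv2

    img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE).astype(np.float64)

    # Approximate the partial derivatives with Sobel operators.
    gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)   # dg/dx
    gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)   # dg/dy

    magnitude = np.sqrt(gx ** 2 + gy ** 2)           # equation (2.26)
    direction = np.arctan2(gy, gx)                   # equation (2.27), radians

    laplacian = cv2.Laplacian(img, cv2.CV_64F)       # equation (2.28)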










(a) Gradient direction and edge direction




(b) Typical edge profile

Fig. 2.9 Diagrams illustrating edge detection (Sonka et al., 2008).





Image sharpening has the objective of making edges steeper; the sharpened image is
intended to be observed by humans. The sharpened image f is obtained from the input image
g as

$f(i, j) = g(i, j) - C\, S(i, j)$        (2.29)

where C is a positive coefficient that gives the strength of the sharpening and $S(i, j)$ is a
measure of the image function sheerness, calculated using a gradient operator; the Laplacian
is often used for this purpose.
Image sharpening can be interpreted in the frequency domain. The result of the
Fourier transform is a combination of harmonic functions. The derivative of the harmonic
function sin(nx) is n cos(nx); thus the higher the frequency, the higher the magnitude of its
derivative. This explains why gradient operators are used to enhance edges.
A similar image sharpening technique to equation (2.29), called unsharp masking, is
often used in printing industry applications: a signal proportional to an unsharp image
(heavily blurred by a smoothing operator) is subtracted from the original image. A digital
image is discrete in nature, and so equations (2.26) and (2.27), containing derivatives, must
be approximated by differences. The first differences of the image g in the vertical direction
(for fixed i) and in the horizontal direction (for fixed j) are given by

$\Delta_i\, g(i, j) = g(i, j) - g(i - n, j), \qquad \Delta_j\, g(i, j) = g(i, j) - g(i, j - n)$        (2.30)

where n is a small integer, usually 1. The value n should be chosen small enough to provide
a good approximation to the derivative, but large enough to neglect unimportant changes in
the image function. Symmetric expressions for the differences,

$\Delta_i\, g(i, j) = g(i + n, j) - g(i - n, j), \qquad \Delta_j\, g(i, j) = g(i, j + n) - g(i, j - n)$

are usually not used because they neglect the impact of the pixel (i, j) itself.
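
A short sketch of sharpening with the Laplacian (equation 2.29) and of unsharp masking is
given below; the strength coefficient C, the blur size, and the input image are assumed values.

    import numpy as np
    import cv2

    img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE).astype(np.float64)

    # Sharpening with the Laplacian as the sheerness measure S (equation 2.29):
    # f = g - C * S, with C an assumed strength coefficient.
    C = 0.5
    laplacian = cv2.Laplacian(img, cv2.CV_64F)
    sharpened = np.clip(img - C * laplacian, 0, 255).astype(np.uint8)

    # Unsharp masking: subtract a signal proportional to a heavily blurred image.
    blurred = cv2.GaussianBlur(img, (9, 9), 0)
    unsharp = np.clip(img + 1.0 * (img - blurred), 0, 255).astype(np.uint8)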

2.4 Segmentation
Segmentation refers to the process of partitioning a digital image into multiple
segments (sets of pixels, also known as superpixels). The goal of segmentation is to simplify
and/or change the representation of an image into something that is more meaningful and
easier to analyze. More precisely, image segmentation is the process of assigning a label to
every pixel in an image such that pixels with the same label share certain visual
characteristics. Segmentation methods are divided into three groups, namely thresholding,
edge-based segmentation, and region-based segmentation; they are discussed in detail below.

2.4.1 Thresholding
Gray-level thresholding is the simplest segmentation process. Many objects or image
regions are characterized by constant reflectivity or light absorption of their surfaces, so a
brightness constant or threshold can be determined to segment objects and background.
Thresholding is computationally inexpensive and fast; it is the oldest segmentation method
and is still widely used in simple applications, and it can easily be done in real time using
specialized hardware (Sonka et al., 2008).
A complete segmentation of an image R is a finite set of regions $R_1, \ldots, R_S$ such that

$R = \bigcup_{i=1}^{S} R_i, \qquad R_i \cap R_j = \emptyset \quad (i \neq j)$

Complete segmentation can result from thresholding in simple scenes. Thresholding is the
transformation of an input image f to an output (segmented) binary image g as follows:

$g(i, j) = 1 \text{ for } f(i, j) \ge T, \qquad g(i, j) = 0 \text{ for } f(i, j) < T$        (2.32)

where T is the threshold, $g(i, j) = 1$ for image elements of objects, and $g(i, j) = 0$ for image
elements of the background (or vice versa).
If objects do not touch each other, and if their gray-levels are clearly distinct from the
background gray-levels, thresholding is a suitable segmentation method. A global threshold is
determined from the whole image f:

$T = T(f)$

On the other hand, local thresholds are position dependent,

$T = T(f, f_c)$

where $f_c$ is that part of the image f in which the threshold is determined. One option is to
divide the image f into sub-images $f_c$ and determine a threshold independently in each
sub-image; if a threshold cannot be determined in some sub-image, it can be interpolated
from thresholds determined in neighboring sub-images. Each sub-image is then processed
with respect to its local threshold.
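
A minimal OpenCV sketch of global thresholding (equation 2.32) is shown below; the
threshold value and input image are assumed, and Otsu's method, which is not described
above, is included only as one common way of choosing a global threshold automatically.

    import cv2

    img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)      # assumed input

    # Global thresholding: g(i, j) = 1 for f(i, j) >= T, 0 otherwise.
    T = 100                                                   # assumed threshold
    _, binary = cv2.threshold(img, T, 255, cv2.THRESH_BINARY)

    # Otsu's method (not discussed above) is one common way of determining a
    # global threshold T = T(f) automatically from the image histogram.
    T_auto, binary_auto = cv2.threshold(img, 0, 255,
                                        cv2.THRESH_BINARY + cv2.THRESH_OTSU)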
Basic thresholding as defined by equation 2.32 has many modifications. One
possibility is to segment an image into regions of pixels with gray-levels from a set D, and
into background otherwise (band thresholding):

$g(i, j) = 1 \text{ for } f(i, j) \in D, \qquad g(i, j) = 0 \text{ otherwise}$        (2.36)

This thresholding can be useful, for instance, in microscopic blood cell segmentation, where
a particular gray-level interval represents cytoplasm, the background is lighter, and the cell
kernel is darker. This thresholding definition can also serve as a border detector, assuming
dark objects on a light background: if the gray-level set D is chosen to contain just the
border gray-levels, thresholding according to equation 2.36 yields the object borders. There
are also many modifications that use multiple thresholds, after which the resulting image is
no longer binary, but rather an image consisting of a very limited set of gray-levels:

$g(i, j) = 1 \text{ for } f(i, j) \in D_1, \quad 2 \text{ for } f(i, j) \in D_2, \quad \ldots, \quad n \text{ for } f(i, j) \in D_n, \quad 0 \text{ otherwise}$        (2.37)

where each $D_i$ is a specified subset of gray-levels.
Another special choice of gray-level subset defines semi-thresholding, which is
sometimes used to make human-assisted analysis easier:

$g(i, j) = f(i, j) \text{ for } f(i, j) \ge T, \qquad g(i, j) = 0 \text{ for } f(i, j) < T$        (2.38)

This process aims to mask out the image background, leaving gray-level information only in
the objects. Thresholding has been presented here relying only on gray-level image
properties; note that this is just one of many possibilities, as thresholding can also be applied
if the values do not represent brightness but rather gradient, a local texture property, or the
value of any other image decomposition criterion.




2.4.2 Edge-based segmentation
Edge-based segmentation represents a large group of methods based on information
about edges in the image; it is one of the earliest segmentation approaches and still remains
very important. Edge-based segmentation relies on the edges found in an image by
edge-detecting operators; these edges mark image locations of discontinuities in gray-level,
color, texture, etc.
There are several edge-based segmentation methods, which differ in the strategies
leading to the final border construction and in the amount of prior information that can be
incorporated into the method. The more prior information that is available to the
segmentation process, the better the segmentation results that can be obtained. Prior
information affects segmentation algorithms: if a large amount of prior information about the
desired result is available, the boundary shape and its relations with other image structures
are specified very strictly and the segmentation must satisfy all these specifications. If little
information about the boundary is known, the segmentation method must take more local
information about the image into consideration and combine it with specific knowledge that
is general for an application area. If little prior information is available, it cannot be used to
evaluate the confidence of the segmentation results, and therefore no basis for feedback
corrections of the segmentation is available (Sonka et al., 2008).
The most common problems of edge-based segmentation, caused by noise or
unsuitable information in an image, are the presence of an edge in locations where there is no
border, and the absence of an edge where a real border exists. Clearly both cases have a
negative influence on segmentation results.

2.4.2.1 Edge image thresholding
Almost no zero-valued pixels are present in an edge image, but small edge values
correspond to non-significant gray-level changes resulting from quantization noise, small
lighting irregularities, etc. Simple thresholding of an edge image can be applied to remove
these values. This approach is based on an image of edge magnitudes processed by an
appropriate threshold. Selection of an appropriate global threshold is often difficult and
sometimes impossible; p-tile thresholding can be applied to define a threshold, and a more
exact approach using orthogonal basis functions, which gives good results if the original data
have good contrast and are not noisy, has also been described.
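
A sketch of p-tile thresholding of an edge-magnitude image is given below; the Sobel edge
operator and the percentage p are assumed choices for illustration.

    import numpy as np
    import cv2

    img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE).astype(np.float64)

    # Edge-magnitude image from Sobel derivatives.
    gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)

    # p-tile thresholding: keep only the strongest p percent of edge pixels
    # (p is an assumed value; in practice it depends on the application).
    p = 5.0
    T = np.percentile(magnitude, 100.0 - p)
    edge_image = (magnitude >= T).astype(np.uint8) * 255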

2.4.2.2 Edge relaxation
Borders resulting from the previous method are strongly affected by image noise,
often with important parts missing. Considering edge properties in the context of their mutual
neighbors can increase the quality of the resulting image. All the image properties, including
those of further edge existence, are iteratively evaluated with more precision until the edge
context is totally clear: based on the strength of the edges in a specified local neighborhood,
the confidence of each edge is either increased or decreased. A weak edge positioned between
two strong edges is an example of context; it is highly probable that this inter-positioned
weak edge should be part of a resulting boundary. If, on the other hand, an edge (even a
strong one) is positioned by itself with no supporting context, it is probably not part of any
border.




2.4.3 Region-based segmentation
Region growing techniques are generally better in noisy images, where borders are
extremely difficult to detect. Homogeneity is an important property of regions and is used as
the main segmentation criterion in region growing, whose basic idea is to divide an image
into zones of maximum homogeneity. The criteria for homogeneity can be based on
gray-level, color, texture, shape model (using semantic information), etc. The properties
chosen to describe regions influence the form, complexity, and amount of prior information
in the specific region-growing segmentation method.
Region growing segmentation must satisfy the following conditions of complete
segmentation:

$H(R_i) = \text{TRUE}, \quad i = 1, 2, \ldots, S$
$H(R_i \cup R_j) = \text{FALSE}, \quad i \neq j, \ R_i \text{ adjacent to } R_j$

where S is the total number of regions in the image and $H(R_i)$ is a binary homogeneity
evaluation of the region $R_i$. The resulting regions of the segmented image must be both
homogeneous and maximal, where by 'maximal' we mean that the homogeneity criterion
would not be true after merging a region with any adjacent region. The homogeneity criterion
may use the average gray-level of the region, its color properties, or an m-dimensional vector
of average gray values for multi-spectral images.
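
A minimal sketch of region growing from a single seed, using the difference from the running
region mean as an assumed homogeneity criterion, might look as follows.

    import numpy as np
    from collections import deque

    def grow_region(img, seed, tolerance=10.0):
        """Minimal region-growing sketch: starting from a seed pixel, add
        4-connected neighbors whose gray-level differs from the running region
        mean by less than `tolerance` (an assumed homogeneity criterion)."""
        rows, cols = img.shape
        region = np.zeros((rows, cols), dtype=bool)
        region[seed] = True
        mean, count = float(img[seed]), 1
        queue = deque([seed])
        while queue:
            r, c = queue.popleft()
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and not region[nr, nc]:
                    if abs(float(img[nr, nc]) - mean) < tolerance:
                        region[nr, nc] = True
                        count += 1
                        mean += (float(img[nr, nc]) - mean) / count
                        queue.append((nr, nc))
        return region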

2.5 Image Analysis/Classification/Interpretation
For some applications the features as extracted from the image are all that is required.
Most of the time, however, one more step must be taken: classification/interpretation.
The most important interpretation method is the conversion of units; rarely will
dimensions in pixels or gray-levels be appropriate for an industrial application. As part of the
software, a calibration procedure defines the conversion factors between vision system units
and real-world units (Nello, 2000).
Reference points and other important quantities are occasionally not visible on the
part, but must be derived from measurable features; for instance, a reference point may be
defined by the axes of the tubes on either side of a bend. Error checking, or image
verification, is a vital process: by closely examining the features found, or by extracting
additional features, the image itself is tested to verify that it is suited to the processing being
done. Since features are being checked, this can be considered a classification or
interpretation step. Without it, features could have incorrect values because the part is
mislocated, upside down, or missing, because a light has burned out, because the lens is dirty,
etc. A philosophy of fail-safe programming should be adopted; that is, any uncertainty about
the validity of the image or the processing should either reject the part or shut down the
process. This is imperative in process control, process verification, and robot guidance,
where safety is at risk. Unfortunately, error-checking procedures are usually specific to a
certain type of image; general procedures are not available.
2.6 Decision Making
Decision making, in conjunction with classification and interpretation, is characterized
as heuristic, decision-theoretic, syntactic, or edge tracking. The most commonly used decision
techniques are discussed below.

2.6.1 Heuristic
In this case, the basis of the machine vision decision emulates how humans might
characterize the image, using representations such as the intensity histogram,
black-white/black-white transition counts, pixel counts, background/foreground pixel maps,
average intensity values, delta or normalized image intensity pixel maps, or a number of data
points, each representing the integration over some area in the picture (row/column totals).
Oftentimes systems are designed to handle decision making within a specific duration
of time. For example, some companies implement these programs in hardware and,
consequently, can handle decision making at rates on the order of 3,000 decisions per minute.
These systems typically operate with a train-by-showing technique. During training
(sometimes called learning), a range of acceptable representatives is shown to the system, and
the representation which is to serve as a standard is established. The representation may be
based on a single object, on the average of the images of many objects, or on a family of
known good samples, each creating a representation standard to reflect the acceptable
variations.
In operating mode, decision making is based on how closely the representation of the
object being examined compares to the original or standard representation(s). A goodness-of-fit
criterion is established during training to reflect the range of acceptable appearances the
system should tolerate. If the difference between the representation established from the
object under test and the standard exceeds the goodness-of-fit criterion, the object is
considered a reject. Significantly, the decision may be based on a combination of criteria
(pixel counts and transition counts, for example); the goodness-of-fit criteria then become
based on a statistical analysis of the combination of the individual fit criteria.
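
A minimal sketch of this train-by-showing, goodness-of-fit idea, using a normalized intensity
histogram as the representation and an assumed distance tolerance, is given below.

    import numpy as np

    def train_standard(images):
        """Average the normalized intensity histograms of known-good samples."""
        hists = [np.bincount(im.ravel(), minlength=256) / im.size for im in images]
        return np.mean(hists, axis=0)

    def accept(test_image, standard_hist, max_distance=0.05):
        """Accept the part if its histogram is close enough to the standard.
        The distance measure and tolerance are assumed for illustration."""
        h = np.bincount(test_image.ravel(), minlength=256) / test_image.size
        distance = 0.5 * np.abs(h - standard_hist).sum()   # total variation
        return distance <= max_distance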
Decision making, in conjunction with these approaches, can be either deterministic or
probabilistic. Deterministic means that, given some state or set of conditions, the outcome of
a function or process is fully determined, with 100% probability of the same outcome.
Probabilistic means that a particular outcome has some probability of occurrence of less than
100%, given some initial set of conditions.


2.6.2 Syntactic Analysis
The ability to make decisions based on pieces of an object is usually based upon
syntactic analysis, unlike the decision-theoretic approach. In this case, the object is
represented as a string, a tree, or a graph of pattern primitive relationships, and decision
making is based on a parsing procedure. Another way to view this is as local feature analysis
(LFA): a collection of local features with specified spatial relationships between their various
combinations. Again, these primitives can be derived from binary or gray-scale images,
thresholded or edge processed.
For example, three types of primitives (curve, angle, and line) can together be used to
describe a region. Image analysis involves decomposing the object into its primitives, and the
relationships between the primitives result in recognition. The primitive decision making can
be performed using decision-theoretic or statistical techniques.

2.6.3 Edge tracking
In addition to geometric feature extraction from boundary images, image analysis can
be conducted by edge tracking: when an edge is detected, it is stored as a linked list of edge
points. Alternatively, line encoding and connectivity analysis can be conducted; that is, the
locations of the detected edge points are stored and line fitting is performed (Zuechi, 2000).
Decision making is then based on comparison of the line segments directly or on
probability theory. Line segment descriptions of objects are called structural descriptions, and
the process of comparing them to models is called structural pattern recognition.




2.7 Related Works
A thorough survey of work related to eye tracking techniques and eye-blink detection
systems is presented below.

2.7.1 Eye tracking technique
Tian et al. (2000a) present a dual-state eye model with two templates, one for the
closed eye and one for the open eye. The template for the open eye consists of a circle and
two parabolic arcs. The circle, described by the three parameters x0, y0, and r (with (x0, y0)
the centre and r the radius), represents the iris. The arcs represent the eyelids; they are
described by three points, one for each eye corner and one at the apex of the eyelid. The
template for a closed eye is a straight line between the eye corners. If the iris is detected, the
eye is open and is modelled with the template for an open eye; otherwise it is closed. They
assume that the eye features are given on the first frame. The inner corners are tracked by
minimizing the squared difference between the intensity values of regions close to the
corners in two subsequent frames. The outer corners are detected as, first, lying on the line
between the two inner corners and, second, staying apart from them at a width w (a value
obtained on the first frame). After both eye corners are fixed, the eyelids have to be localized
to complete the eye tracking; this is done by tracking central points on both eyelids, again by
minimizing the squared difference between intensity values. They tested their method on
over 500 image sequences in which the full-size face occupies 220x300 pixels and each eye
region 60x30 pixels. The method works robustly and accurately across variations in race,
expression, and the presence of make-up.
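
A minimal OpenCV sketch of tracking an eye patch between frames by minimizing the
squared difference of intensity values is shown below; it only illustrates the general idea and
is not the authors' exact method, and the search margin and box format are assumptions.

    import cv2

    # Minimal sketch: track an eye patch between two grayscale frames by
    # minimizing the squared difference of intensity values (cv2.TM_SQDIFF).
    # The initial eye location is assumed to be given on the first frame.
    def track_eye(prev_frame, next_frame, eye_box, search_margin=15):
        x, y, w, h = eye_box
        template = prev_frame[y:y + h, x:x + w]
        # Restrict the search to a window around the previous location.
        x0, y0 = max(0, x - search_margin), max(0, y - search_margin)
        x1 = min(next_frame.shape[1], x + w + search_margin)
        y1 = min(next_frame.shape[0], y + h + search_margin)
        search = next_frame[y0:y1, x0:x1]
        scores = cv2.matchTemplate(search, template, cv2.TM_SQDIFF)
        _, _, min_loc, _ = cv2.minMaxLoc(scores)   # best match = minimum SQDIFF
        return (x0 + min_loc[0], y0 + min_loc[1], w, h)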
Tian et al. (2000b) developed a system for recognizing three action units
(completely closed, narrow-open, and completely open eye) by the use of Gabor wavelets in
nearly frontal image sequences. The feature points are three: the inner corner, the outer
corner, and the middle point between the first two. The eye corners are tracked over the
whole sequence. The most important of them for tracking is the inner corner; the positions of
the others are found relative to the inner corner position. The initial positions of the feature
points are given. Each point is then tracked by minimizing a function over a certain
displacement; the function depends on the intensity (gray-scale) values. The outer corners
are detected by using the size of the eyes, obtained on the first frame. The middle point is the
point midway between the inner and outer corners. For each of these three feature points a
set of multi-scale and multi-orientation Gabor coefficients is calculated: three spatial
frequencies and six orientations from 0 to π, differing by π/6, are used. These 18 coefficients
are fed into a neural network to determine the state of the eye. Unfortunately, only the
success of detecting the eye state (not of the eye-corner tracking) is reported, since this is the
main aim of the paper. The recognition rate when three action units are recognized is 89%,
and when only two are used (the narrow eye treated as equal to closed) it increases to 93%.
Sirohey et al. (2002) present a flow-based method for tracking. Their method for
detection is based on finding the combination of edge segments that best represents the upper
eyelid. First the head motion is detected from the edge segments associated with the
silhouette of the head. Based on this information the head is stabilized: the head motion
vectors are subtracted from the iris and eyelid motion so that only their independent motions
remain. The eyelids are tracked as follows: the edge pixels of the eyelid that have flow
vectors associated with them are followed according to the direction and magnitude of the
flow vector. If edge pixels are found in the close neighborhood of the pointed pixels, they are
labeled as possibly belonging to the eyelid. The candidates are fitted to a third-order
polynomial. With this method the iris is found correctly at each frame and the eyelids in 90%
of the frames (two sequences of 120 frames of a single person, with and without glasses).
The paper does not mention how the lower eyelid is modeled, extracted, and tracked.
Blinking is detected from the height of the apex of the upper eyelid above the iris centre.
Black et al. (1996) explore a template-based approach combined with parametric optical flow, in which they represent rigid and deformable facial motions using piecewise parametric models of image motion. The facial features (face, eye regions, eyebrows and mouth) are given. The face is broken down into parts and the motion of each of them is modeled independently by planar motion models. An affine model is sufficient to describe the eye motion:

u(x, y) = a0 + a1x + a2y
v(x, y) = a3 + a4x + a5y

where u and v are the horizontal and vertical components of the flow at image point p(x, y) and a0, ..., a5 are the affine parameters. The
coordinates are defined with respect to some image point (typically center of the region).
The difference between the image and the prediction obtained from the parameters of the previous frame is minimized by a simple gradient descent scheme. The eye state transition can be described by three parameters: vertical translation, divergence (isotropic expansion) and rapid deformation (squashing and stretching), whose interpretation is given in Table 2.1. The curves of all three are plotted against time and examined for local maxima and minima; the changes in the three functions have to appear at nearly the same time. An eye blink is detected when the translation has a maximum, the divergence a minimum and the deformation a maximum. The reported accuracy, measured over all facial expressions, is 88% for artificial sequences and 73% for TV movies. Unfortunately, the achieved processing time is 2 min/frame, which is not applicable for real-time applications.
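The sketch below shows how such motion cues can be derived from the six affine parameters a0, ..., a5 of the model above. These are the standard first-order quantities of an affine flow field; the exact definitions used by Black et al. (1996) may differ slightly, so the names and formulas here are an illustration rather than their implementation.

/* Derived motion cues for the eye region, given the affine parameters of
   u = a0 + a1*x + a2*y and v = a3 + a4*x + a5*y. */
typedef struct {
    double vertical_translation;   /* a3: up/down motion of the region   */
    double divergence;             /* a1 + a5: isotropic expansion       */
    double deformation;            /* a1 - a5: squashing and stretching  */
} EyeMotionCues;

EyeMotionCues eye_motion_cues(const double a[6])
{
    EyeMotionCues c;
    c.vertical_translation = a[3];
    c.divergence           = a[1] + a[5];
    c.deformation          = a[1] - a[5];
    return c;
}

A blink then shows up as a simultaneous local maximum of the translation and deformation curves and a local minimum of the divergence curve, as stated above.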
*Table 2.1: Parameters describing the movement in the eye region



*source: Black et al. (1996)

Cohn et al. (2004) and Moriyama et al. (2004) present different aspects of the same system, in which a carefully detailed generative eye model (see Fig. 2.10) is used. A template is built using two types of parameters: structure and motion. The structure parameters describe the appearance of the eye region, capturing its racial, individual and age variations. They include the size and color of the iris, the sclera, the dark regions near the left and right corners, the eyelids, the width and boldness of the double-fold eyelid, the width of the bulge below the eye, the width of the illumination reflection on the bulge, and the furrow below the bulge. The motion parameters describe the changes over time: the movement of the iris is described by the 2D position of its center, closing and opening of the eye is captured by the height of the eyelids, and the skew of the upper eyelid is also a motion parameter, used to capture the change of the upper eyelid when the eyeball moves. Unfortunately, the structure parameters are not estimated automatically; the model is individualized by manually adjusting them, and the structural parameters derived from this initialization remain fixed for the entire sequence. The features are then tracked by iterative minimization of the mean square error between the input image and the template generated by the current motion parameters.
In Cohn et al. (2004) the model for tracking the eye features and detecting blinks is part of a system for automatic recognition of smiles of embarrassment; the authors test the hypothesis that there is a correlation between head movement, eye gaze and lip displacement during such smiles, which is probably why the accuracy of the tracking method itself is not measured or reported. Moriyama et al. (2004) report failure in only 2 of 576 image sequences, caused by the head tracker. The database includes a variety of subjects of different ethnic groups, ages and genders, with in-plane and limited out-of-plane motion.
Fig. 2.10: Detailed eye template used by Moriyama et al. (2004) and Cohn et al. (2004)

An active contour technique is applied by Pardas (2000) to track the eyelids. The model for the eye consists of two curves: one for the lower eyelid, with one minimum, and one for the upper eyelid, with one maximum. Tracking of the eyelids is done with an active contour technique in which the motion is embedded in the energy minimization process of the snakes. A closed snake, which tracks the eyelids, is built by selecting a small percentage of the pixels along the contours obtained during initialization or tracked in the previous frame; the eye corners are among these points. Motion compensation errors are computed for each snaxel (x0, y0) within a given range of allowed displacements (dx, dy). The pixels (x0+dx, y0+dy) that produce the smallest motion compensation error are selected as candidates for the snaxel (x0, y0) in the current frame, and a two-step dynamic programming algorithm is run over these candidates.
The paper does not report the running time of the algorithm. The author only mentions that it is stable against blinking, head translation and rotation, up to the extent where the eyes remain visible.

2.7.2 Blink detection systems
Very briefly, I would like to mention the ways in which the authors of the reviewed papers detect blinking.
In Tian et al. (2000a) blinking is detected whenever the iris is not visible. This is not the most reliable approach: even if the iris detection method itself never fails, false alarms can occur when the eye or iris is occluded due to head rotation.
Sirohey et al. (2002) detect a blink from the height of the apex of the upper eyelid above the iris center, which is probably a consequence of the lower eyelid not being tracked.
An extension of Tian's approach (2000a) is the paper by Cohn et al. (2002), which focuses on blink detection rather than on locating the eye features. The eye region is defined on the first frame by manually picking four points: the eye corners, the centre point of the upper eyelid and a point straight under it. It stays the same throughout the whole image sequence, because the face region is stabilized. The eye region is divided into an upper and a lower portion by the line connecting the eye corners. Blink detection relies on the fact that the intensity distributions of the upper and lower parts change as the eye opens and closes. The upper part consists of sclera, pupil, eyelash, iris and skin, of which only the sclera and the skin contribute to a high average intensity. When the upper eyelid closes, the eyelashes move into the lower region and the pupil and iris are replaced by brighter skin, which increases the average intensity of the upper portion and simultaneously decreases that of the lower. The average grayscale intensities of both portions are plotted against time, and the eye is considered closed when the curve of the upper portion reaches a maximum. Blinks are also detected by counting the number of crossings and the number of peaks in order to distinguish between blinking and eyelid flutter: if a blink occurs between two neighboring crossings there is only one peak, otherwise there is more than one.
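As a minimal sketch of the kind of measurement involved (not the authors' implementation), the function below computes the average grayscale intensity of the upper and lower halves of a given eye region using the OpenCV C API used elsewhere in this work; a caller would record the two values for every frame and analyze the resulting curves over time as described above. The function name is an assumption.

#include <opencv/cv.h>

/* Average grayscale intensities of the upper and lower halves of an eye
   region; gray must be a single-channel image. */
void eye_half_intensities(IplImage* gray, CvRect eye, double* upper, double* lower)
{
    CvRect top = cvRect(eye.x, eye.y, eye.width, eye.height / 2);
    CvRect bot = cvRect(eye.x, eye.y + eye.height / 2,
                        eye.width, eye.height - eye.height / 2);

    cvSetImageROI(gray, top);
    *upper = cvAvg(gray, NULL).val[0];   /* mean brightness of the upper half */
    cvResetImageROI(gray);

    cvSetImageROI(gray, bot);
    *lower = cvAvg(gray, NULL).val[0];   /* mean brightness of the lower half */
    cvResetImageROI(gray);
}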
Correlation with a template of the person's eye is used by Grauman et al. (2001) to classify the state of the eye. The difference image accumulated during the first several blinks is used to detect the eye regions. Candidates are discarded based on anthropomorphic measures: the distances between the blobs, and their widths and heights, should maintain certain ratios. The remaining candidate pairs are classified using the Mahalanobis distance between their parameter vector and the mean blink-pair property vector. The bounding box of the detected eye region determines the template. Blinking is then detected by calculating the correlation between this template and the current frame: as the eye closes it looks less and less like the open-eye template, and as it reopens it becomes more and more similar again. A correlation score between 0.85 and 1 classifies the eye as open, a score between 0.55 and 0.8 as closed, and a score below 0.4 means the tracker is lost. Again, the technique is appropriate only for blink detection, not for precise eye feature extraction and tracking. The reported overall detection accuracy is 95.6% at an average of 28 frames per second. This result might degrade for longer image sequences, such as those used for driver drowsiness detection, because a template of a single person becomes outdated over time as the person gets tired.
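The score bands reported above translate directly into a simple classifier, sketched below with assumed names; scores that fall between the bands are left undecided.

typedef enum { EYE_OPEN, EYE_CLOSED, TRACKER_LOST, EYE_UNDECIDED } EyeState;

/* Classify the eye state from the normalized correlation score,
   using the bands reported by Grauman et al. (2001). */
EyeState classify_eye_state(double score)
{
    if (score >= 0.85 && score <= 1.0)  return EYE_OPEN;
    if (score >= 0.55 && score <= 0.8)  return EYE_CLOSED;
    if (score <  0.4)                   return TRACKER_LOST;
    return EYE_UNDECIDED;
}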
Ramadan et al. (2002) used an active deformable model technique to track the iris. A statistical pressure snake, in which the internal forces are eliminated, tracks the pupil: the snake expands and encloses the pupil. When the upper eyelid occludes the pupil, i.e. a blink is in progress, the snake collapses, and the duration of the snake collapse is the measure of a blink. After the eye reopens the snake can expand again, provided the iris position has not changed during the blink; otherwise its position has to be re-initialized manually. Although very high tracking accuracy is reported, the system suffers from several disadvantages. The main problem is the manual initialization and re-initialization. Furthermore, the way of measuring blinking does not seem very reliable: the snake might also collapse during saccades, which would be mistaken for blinking. A third issue is the position of the camera, which is attached to the head; this restricts head movements and also makes the equipment impractical for drivers.
Danisman et al. (2010) presented an automatic drowsy-driver monitoring and accident prevention system based on monitoring changes in the eye blink duration. Their method detects visual changes in the eye locations using a proposed horizontal symmetry feature of the eyes, and detects eye blinks via a standard webcam in real time at 110 fps for a 320x240 resolution. Experimental results on the ZJU eye-blink database showed that the proposed system detects eye blinks with 94% accuracy and a 1% false positive rate.

CHAPTER THREE
ALGORITHM DEVELOPMENT
3.1 System Flowchart
The flowchart in Fig. 3.1 describes the processes involved in the system's drowsiness detection. The image is acquired with the aid of a digital camera and converted to grayscale. The search for the location of the eye is initialized by analyzing the involuntary blinks of the user; this is achieved by a motion analysis technique. An online template of the eye is created and used to update the position of the eye every thirty seconds, to compensate for slight movements of the driver's head. Whenever the tracker is lost, the system re-initializes itself by automatically repeating the process. Once tracking is successful, the system proceeds to extract the visual cue of the driver's eye by detecting the number of blinks produced. This information is used to decide when to trigger the alarm: drowsiness detection keeps track of the number of blinks produced by the user, and when the number of blinks reaches a critical point, which corresponds to detecting a short period of micro-sleep, the alarm is triggered. If the alarm is not triggered within five minutes, the system resets itself automatically.
3.2 Software Development
The algorithm was developed in the C language in the Visual Studio environment, linked with the OpenCV library, which is widely used for image processing and computer vision. The algorithm is broken down into five processes, namely: eye detection, template creation, eye tracking, blink detection and drowsiness detection.



Fig. 3.1: Flowchart of the eye-blink detection system


[Flowchart blocks: Image Acquisition, Eye detection, Eye tracking, Success (YES/NO), Blink detection, Drowsiness detection (YES/NO), Activate alarm, STOP]

A number of significant contributions and advancements have been made to the work of Grauman et al. (2001) in order to improve the accuracy and reliability of the system; these are discussed in detail below.

3.3 Eye-Detection
In this stage the system tries to locate the position of the eye by analyzing the blinking of the user. This is achieved by creating a difference image from the current and previous grayscale frames of the driver; the difference image then undergoes binarization. Binarization is the conversion of a grayscale image into a binary image, and is often used to show regions of significant movement in the scene. A binary image is an image in which each pixel assumes one of two discrete values, in this case 0 and 1, with 0 representing black and 1 representing white after thresholding.
The next phase in this stage is to eliminate noise, which is often caused by naturally occurring jitter due to lighting conditions and camera resolution. We employ functions from the OpenCV library that provide a fast, convenient interface for morphological transformations on images, namely dilation and erosion. They remove noise and produce fewer, larger connected components. A 3x3 cross-shaped (star-shaped) convolution kernel is passed over the binary image in an opening morphological operation. Listing 1 in Appendix A shows the algorithm for the opening morphological operation.
Candidate eye blobs are extracted by labeling the connected components of the resulting binary image. We then determine whether the connected components form an eye pair or not, i.e. the system considers whether each candidate pair is a possible match for the user's eyes. The algorithm for connected component labeling is shown in Listing 2 in Appendix A.
A number of experimentally derived heuristics based on width, height, vertical distance and horizontal distance is applied to pinpoint the pair that most likely represents the driver's eyes (Chau et al., 2005). The system proceeds only if the number of connected components is two; otherwise the process re-initializes itself. The heuristics are a set of defined rules, for example: the widths of the components must be about the same, the heights must be about the same, and the vertical distance between them must be small. These rules are applied as a set of filters, and if a pair of components passes through them, there is a good indication that the driver's eyes have been successfully located (a condensed sketch of these checks is given below). This overall technique is known as motion analysis.
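A condensed sketch of these filter rules is shown below; it mirrors the checks performed by the is_eye_pair routine in Appendix B, and the numeric tolerances are the ones used there. The function name is illustrative.

#include <stdlib.h>
#include <opencv/cv.h>

/* Heuristic test on the bounding boxes of two candidate eye components. */
int looks_like_eye_pair(CvRect r1, CvRect r2)
{
    int dist_ratio;
    if (abs(r1.width  - r2.width)  >= 5) return 0;   /* widths about the same   */
    if (abs(r1.height - r2.height) >= 5) return 0;   /* heights about the same  */
    if (abs(r1.y      - r2.y)      >= 5) return 0;   /* small vertical distance */
    dist_ratio = abs(r1.x - r2.x) / r1.width;        /* horizontal spacing      */
    return (dist_ratio >= 2 && dist_ratio <= 5);     /* reasonable separation   */
}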
Connected component labeling is applied next to obtain the number of connected
components in the difference image. Fig. 3.2 shows the thresholded difference image prior to
erosion.

3.4 Template Creation
After the connected components have successfully passed through the filter, the larger of the two components is chosen for template creation, because the size of the template to be created is directly proportional to the chosen component, and the larger the component the more brightness information it contains. This results in more accurate tracking. The system then obtains the boundary of the selected component, which is used to extract a portion of the current frame as the eye template. Since we need an open-eye template, it would be a mistake to create the template the moment the eye is located. This is because blinking involves closing and opening of the eye and, thus, once the


Fig. 3.2: Transition during the eye detection using the motion analysis technique

eye is located, we set a short delay before creating the template. Since the user's eyes are still closed at the heuristic filtering stage above, the system waits a moment for the user to open his eyes, which guarantees an open-eye template. Listing 4 in Appendix A shows the algorithm used in creating an online template of the eye.

3.5 Eye-Tracking
Eye detection alone is not sufficient to give the highly accurate blink information desired, since there is the possibility of head movement from time to time. A fast tracking procedure is needed to maintain exact knowledge of the eye's appearance. With the eye template and the live video feed from the camera, the system can locate the user's eye in subsequent frames using template matching. The search is limited to a small search window, since searching the whole image would use an extensive amount of CPU resources.
The system utilizes the square difference matching method, which matches the squared difference so that a perfect match will be zero and bad matches will be large (Gray, 2008). The matching measure is given by:

d(x, y) = Σx',y' [(T(x', y') − T_avg) − (I(x + x', y + y') − I_avg)]²          (3.1)

where T(x', y') and I(x + x', y + y') are the brightness values of the pixels in the template and in the source image respectively, T_avg is the average value of the pixels in the template raster and I_avg is the average
value of the pixels in the current search window of the image. Whenever the squared difference exceeds a predefined threshold the tracker is considered lost; in that event it is critical that the tracker declares itself lost and re-initializes by going back to eye detection with the motion analysis technique. Fig. 3.3 shows sample frames of the tracked object. Listing 5 in Appendix A shows the algorithm for locating the eye in subsequent frames; the location of the best match is available in minloc, and it is used to draw a rectangle in the displayed frame to label the object being tracked, as shown in Fig. 3.3.
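As a compact illustration of the lost-tracker test described above, the sketch below (with assumed names) wraps the template matching call and compares the best score against a threshold. In the implementation in Appendix B the threshold TM_THRESHOLD is 0.4, and res must be an image of size (W − w + 1) x (H − h + 1) for a W x H search region and w x h template.

#include <opencv/cv.h>

/* Returns 1 if the tracker should be declared lost for this frame. */
int tracker_lost(IplImage* search, IplImage* tpl, IplImage* res, double threshold)
{
    double minval, maxval;
    CvPoint minloc, maxloc;

    cvMatchTemplate(search, tpl, res, CV_TM_SQDIFF_NORMED);   /* 0 = perfect match        */
    cvMinMaxLoc(res, &minval, &maxval, &minloc, &maxloc, 0);  /* best score is the minimum */

    return minval > threshold;                                /* e.g. 0.4, as in Appendix B */
}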

3.6 Blink Detection
A human being must periodically blink to keep his eyes moist. Blinking is involuntary and fast, and most people do not notice when they blink. However, detecting a blinking pattern in an image sequence is an easy and reliable means of detecting the presence of a face, since blinking provides a space-time signal that is easily detected and unique to faces.
In the algorithms developed in previous works (Grauman et al., 2001; Chau et al., 2005), eye blinks were detected by observing correlation scores, so that both the detection of blinking and the analysis of blink duration are based solely on the correlation scores generated by the tracking step using the online template of the user's eye. In other words, as the user's eye closes during a blink, its similarity to the open-eye template decreases. While the user's eye is in the normal open state, very high correlation scores of about 0.85 to 1.0 are reported; as the user blinks, the scores fall to values of about 0.5 to 0.55.
In theory, when the user blinks, the similarity to the open-eye template decreases. While this is true in most cases, we have found that it is only reliable if the user does not make any significant head movements. If the user moves his head, the correlation score also
Fig. 3.3: Sample frames of the tracked object

decreases even if the user does not blink. In this system we therefore use motion analysis to detect eye blinks, just as in the very first stage above; only this time the detection is limited to a small search window, the same window that is used to locate the user's eye. Listing 6 in Appendix A shows the algorithm for blink detection using motion analysis.
In Listing 6, cvFindContours returns the connected components in comp and the number of connected components in nc. To determine whether a detected motion is an eye blink or not, we apply two rules to the connected components: there must be exactly one connected component, and that component must be located at the centroid of the user's eye (a condensed form of these checks is sketched below).
Note that we require only one connected component, whereas a blink would normally yield two connected components. This is because we perform the motion analysis in a small search window that fits only one eye.
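The two rules above can be expressed compactly as follows (cf. the is_blink routine in Appendix B); the function name and parameter layout are illustrative.

#include <opencv/cv.h>

/* Blink test: exactly one moving component, and it contains the eye centroid. */
int looks_like_blink(int ncomponents, CvRect comp_box, CvPoint eye_centroid)
{
    if (ncomponents != 1)
        return 0;
    return eye_centroid.x > comp_box.x &&
           eye_centroid.x < comp_box.x + comp_box.width  &&
           eye_centroid.y > comp_box.y &&
           eye_centroid.y < comp_box.y + comp_box.height;
}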

3.7 Drowsiness Detection
Driver drowsiness is one specific human error that has been well studied. Studies have shown that immediately prior to an accident the driver's blinking behavior changes (Thorslund, 2003). The basic parameter used here to detect drowsiness is the frequency of blinks: the system detects micro-sleep symptoms in order to diagnose driver fatigue. As driver fatigue increases, the blinks of the driver tend to last longer and drowsiness gradually sets in.
Mathematically:

f = 1/T

where f is the frequency of blinks and T is the blink duration.
In other words, the frequency of blinks is inversely proportional to the blink duration: when a person is highly alert the duration of a blink cycle is relatively long, but as drowsiness sets in this duration drops. Equivalently, when the blink duration is high the frequency of blinks is low, and vice versa.
The system determines the blink rate by counting the number of consecutive frames in which the eye remains closed, and it is designed to trigger a warning signal via an alarm once the early stages of drowsiness are detected (a minimal sketch of this decision logic is given below).
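A minimal sketch of this decision logic is shown below, assuming a blink counter incremented by the blink detector and a timer that calls the check once per monitoring interval; the full version runs in a separate thread in Appendix B, with the interval and threshold entered by the user, and the function and variable names here are assumptions.

#include <stdio.h>

static int blink_count = 0;           /* incremented once per detected blink */

/* Called once per monitoring interval; returns 1 if the alarm was raised. */
static int check_drowsiness(int blink_threshold)
{
    int drowsy = (blink_count >= blink_threshold);
    if (drowsy)
        printf("Attention! The driver appears to be drowsy.\n");  /* stand-in for the voice alarm */
    blink_count = 0;                  /* reset the count for the next interval */
    return drowsy;
}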

3.8 Hardware Consideration
The system was primarily developed and tested on a Windows Vista PC with an AMD Turion 3 GHz processor and 2 GB of RAM. Video was captured with an Averon webcam (color CMOS image sensor, 1.3 megapixel resolution, signal-to-noise ratio of 48 dB, which enhances the accuracy of the system) at 30 frames per second; the video is processed as grayscale images at 320x240 pixels.









CHAPTER FOUR
RESULTS AND DISCUSSION
In order to ascertain the reliability of the system, a performance evaluation was carried out. Furthermore, a compatibility test was carried out, and it was found that the system is compatible with operating systems such as Windows XP and Windows Vista, on which it performed satisfactorily.

4.1 Blink Detection Accuracy
The blink detection accuracy test was conducted using 10 different test subjects, since a more representative measure of the overall accuracy of the system is obtained across a broad range of users. In order to measure the detection accuracy, video sequences were captured of each test subject sitting 60 cm away from the camera. They were asked to blink naturally but frequently and to exhibit mild head movements.
A total of 500 true blinks of 10 test subjects were analyzed, in which each test subject
produced 50 blinks. During this evaluation session, the system encountered two types of
errors: the missed blink error and the false positive blink error. Missed blinks occur when the system fails to detect the subject's blink when there was actually a blink. False positive blinks occur when the system detects a blink when none was produced by the test subject.
Twelve (12) blinks were missed out of 500, resulting in an initial accuracy of 97.6%. Furthermore, 15 false positive blinks were encountered, giving an overall accuracy of 94.6%. The accuracy of the system and the errors encountered are calculated below:

Initial accuracy = (500 − 12) / 500 × 100% = 97.6%
Overall accuracy = (500 − 12 − 15) / 500 × 100% = 94.6%
False positive error = 15 / 500 × 100% = 3%
Table 4.1 shows a summary of the results. From the foregoing, the capture rate of the camera, which is 30 frames/second, was used to produce a blink accuracy of 94.6% with a 3% false positive error. This result is comparable to the work of Danisman et al. (2010), which employed a camera with a capture rate of 110 frames/second in order to obtain an accuracy of 94.8% and a 1% false positive error.

Table 4.1: Summary of results

Total number of blinks analyzed          500
Total missed blinks                       12
Total false positive blinks               15
Initial accuracy of the system            97.6%
Overall accuracy of the system            94.6%

4.2 Eye Tracking Accuracy
This experiment was conducted by placing the test subject at varying distances from the camera. A time constraint of 30 seconds was placed on the system for the automatic initialization of the tracker, which consists of two small bounding boxes that appear on the image. If the tracker does not appear within 30 seconds, the tracker is considered lost. The experiment was conducted at distances of 30 cm, 60 cm, 90 cm, 150 cm and 180 cm. A log of the number of times the tracker appeared within thirty seconds was taken and expressed as a percentage, given by:

Tracking accuracy (%) = (number of times the tracker appeared within 30 s / total number of trials) × 100





Fig. 4.1 shows a plot of the percentage tracking accuracy against distance. From the plot, at a distance of 30 cm the accuracy is 72%, while at a distance of 150 cm the accuracy drops to 10%. The tracking accuracy enables us to ascertain the sensitivity of the system at varying distances: since slight movements of the driver's head occur from time to time, the distance of the driver from the fixed camera location varies. In conclusion, as the distance from the camera increases, the tracking accuracy of the system decreases progressively.










Fig. 4.1: Percentage eye tracking accuracy at varying distances







[Plot axes: Distance (cm) versus Eye tracking accuracy (%); annotation: slight movement of the head]

CHAPTER FIVE
CONCLUSION AND RECOMMENDATIONS
5.1 Conclusion
This work presents a non-invasive method of monitoring the drowsiness of car drivers by continuously monitoring the psycho-physiological status of the driver. A driver does not drop off to sleep instantly; instead there is a period of progressive drop in the alertness level due to fatigue, accompanied by psycho-physiological signs. The developed system is designed to detect this and issue a warning early enough to avoid an accident.
The application uses a standard monitor-mounted webcam to track the user's eye dynamics. The developed algorithm uses motion analysis to detect the eye, initialized automatically by the driver's blinking, and the subsequent location of the eye as the head moves is found using the squared difference method. The frequency of blinks is the primary indicator of drowsiness used in this work; when this parameter reaches a critical level, the system provides an immediate warning signal in the form of a recorded verbal message to alert the driver.
The achieved results demonstrate that blink detection is a suitable technique for initializing the eye location; from this, the eye can be successfully tracked in the succeeding frames of an image sequence. The system works in real time and is robust with respect to variations in scale and lighting conditions, different orientations of the head and the presence of distracting objects on the face (such as glasses). With the high degree of accuracy achieved by this system, a reduction in the rate of accidents caused by drowsiness on the highway would be an achievable target.



5.2 Recommendations
In order to enhance the tracking accuracy of the system, it is recommended that automatic zooming on the eyes be performed once they have been located. This would go a long way towards improving the sensitivity of the system when tracking at varying distances; by doing this, the tracking accuracy of the system would be fairly constant and predictable.
It is also recommended that adaptive binarization be used to eliminate the need for the noise removal function. This is expected to cut down on the computation needed to find the eye and hence enhance the adaptability of the system to changes in ambient light.














REFERENCES
Bhaskar, T.N., Keat, F.T., Ranganath, S. and Venkatesh, Y.V. (2003): Blink Detection and Eye Tracking for Eye Localization. Proceedings of the Conference on Convergent Technologies for Asia-Pacific Region (TENCON 2003), pp. 821-824, Bangalore, India, October 15-17, 2003.
Black, M.J. and Yacoob, Y. (1997): Recognizing Facial Expressions in Image Sequences Using Local Parameterized Models of Image Motion. International Journal of Computer Vision, Vol. 25(1), October 1997, pp. 23-48; available online at http://citeseer.ist.psu.edu/black97recognizing.html
Cohn, J.F., Reed, L.I., Moriyama, T., Xiao, J., Schmidt, K. and Ambadar, Z. (2004): Multimodal Coordination of Facial Action, Head Rotation, and Eye Motion During Spontaneous Smiles. In Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition (FG'04), Seoul, Korea, 2004, pp. 129-138; available online at http://www-2.cs.cmu.edu/~face/Papers/fg2004.pdf
Chau, M. and Betke, M. (2005): Real Time Eye Tracking and Blink Detection with USB Cameras. Boston University Computer Science Technical Report No. 2005-12, May 2005.
Danisman, T., Bilasco, I.M., Djeraba, C. and Ihaddadene, N. (2010): Drowsy Driver Detection System Using Blink Patterns. In Machine and Web Intelligence (ICMWI), 2010 International Conference, 5th Oct. 2010, pp. 230-233.
Dongheng, L. (2006): Low-Cost Eye-Tracking for Human Computer Interaction. M.Sc. Thesis, Iowa State University, Ames, Iowa, pp. 5-10.
Galley, N. and Schleicher, R. (2004): Subjective and Optomotoric Indicators of Driver Drowsiness. 3rd International Conference on Traffic and Transport Psychology, Nottingham, UK, 2004, pp. 1-7.
Grauman, K., Betke, M., Lombardi, J., Gips, J. and Bradski, G. (2003): Communication via Eye-blinks and Eyebrow Raises: Video Based Human-Computer Interfaces. Universal Access in the Information Society, vol. 2, no. 4, pp. 359-373, 2003.
Grauman, K., Betke, M., Gips, J. and Bradski, G.R. (2001): Communication via Eye Blinks - Detection and Duration Analysis in Real Time. Computer Vision and Pattern Recognition (CVPR 2001), Proceedings of the 2001 IEEE Computer Society Conference, vol. 6, pp. 345-358.
Haro, A., Flickner, M. and Essa, I. (2000): Detecting and Tracking Eyes by Using their Physiological Properties, Dynamics, and Appearance. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, Hilton Head Island, SC, Jun. 13-15, 2000, vol. 1, pp. 163-168.
Hjelmas, E. and Low, B.K. (2001): Face Detection: A Survey. Computer Vision and Image Understanding, 83(3):236-274.
Karson, C. (1983): Spontaneous Eye-blink Rates and Dopaminergic Systems. Brain, vol. 106, pp. 643-653, 1983.
Kanade, T., Cohn, J.F., Xiao, J., Ambadar, Z., Gao, J. and Imamura, H. (2002): Automatic Recognition of Eye Blinking in Spontaneously Occurring Behavior. Proceedings of the 16th International Conference on Pattern Recognition (ICPR 2002), Vol. 4, August 2002, pp. 78-81; www-2.cs.cmu.edu/~face/Papers/icpr2002.pdf.
Kojima, N., Kozuka, K., Nakano, T. and Yamamoto, S. (2001): Detection of Consciousness Degradation and Concentration of a Driver for Friendly Information Service. In Proceedings of the IEEE International Vehicle Electronics Conference, pp. 31-36, Tottori, Japan, September 2001.
Moriyama, T., Xiao, J., Cohn, J.F. and Kanade, T. (2004): Meticulously Detailed Eye Model and Its Application to Analysis of Facial Image. Proceedings of the IEEE Conference on Systems, Man, and Cybernetics, 2004, pp. 629-634; available online at http://www.ri.cmu.edu/pubs/pub_4811.html
Pardas, M. (2000): Extraction and Tracking of the Eyelids. International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, pp. 2357-2360, Istanbul, Turkey, June 2000.
Pitas, I. and Venetsanopoulos, A.N. (1986): Non-linear Order Statistic Filters for Image Filtering and Edge Detection. Signal Processing, pp. 573-584.
Ramadan, S., Abd-Almageed, W. and Smith, C.E. (2002): Eye Tracking Using Active Deformable Models. Proc. of the 3rd Indian Conference on Computer Vision, Graphics and Image Processing, pp. 234-255; available online at http://www.eece.unm.edu/faculty/chsmith/Papers/EyeTrackingCameraReady.pdf
Sirohey, S., Rosenfeld, A. and Duric, Z. (2002): A Method of Detecting and Tracking Irises and Eyelids in Video. Pattern Recognition, vol. 35, pp. 1389-1401.
Sonka, M., Hlavac, V. and Boyle, R. (2008): Image Processing, Analysis, and Machine Vision. 3rd Edition, International Student Edition, pp. 113-249.
Tian, Y., Kanade, T. and Cohn, J. (2000a): Dual-state Parametric Eye Tracking. Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition (FG'00), March 2000, pp. 110-115; available online at http://citeseer.ist.psu.edu/tian99dualstate.html
Tian, Y., Kanade, T. and Cohn, J.F. (2000b): Eye-state Action Unit Detection by Gabor Wavelets. In Proceedings of the International Conference on Multi-modal Interfaces (ICMI 2000), October 2000, pp. 56-66; http://citeseer.ist.psu.edu/637068.html.
Thorslund, B. (2003): Electrooculogram Analysis and Development of a System for Defining Stages of Drowsiness. Linköping University, Linköping, 2003.
Tsubota, K. (1998): Tear Dynamics and Dry Eye. Progress in Retinal and Eye Research, Vol. 17, No. 4, pp. 565-596, 1998.
Uzunova, V.I. (2005): An Eyelids and Eye Corners Detection and Tracking Method for Rapid Iris Tracking. M.Sc. Thesis, Otto-von-Guericke University of Magdeburg, ISG, pp. 5-28.
Wierwille, W.W. (1994): Overview of Research on Driver Drowsiness Definition and Driver Drowsiness Detection. 14th International Technical Conference on Enhanced Safety of Vehicles, pp. 23-26.
Zuech, N. (2000): Understanding and Applying Machine Vision. 2nd Edition, Marcel Dekker, Inc., New York, pp. 165-214.












APPENDIX A
ALGORITHMS
Listing 1: Algorithm for Opening Morphological operation
IplConvKernel* kernel;
kernel = cvCreateStructuringElementEx(3, 3, 1, 1, CV_SHAPE_CROSS, NULL);  /* 3x3 cross-shaped kernel */
cvMorphologyEx(diff, diff, NULL, kernel, CV_MOP_OPEN, 1);                 /* opening: erosion then dilation */
Listing 2: Algorithm for connected component labeling
CvSeq* comp;
int nc = cvFindContours(
diff, /* the difference image */
storage, /* created with cvCreateMemStorage() */
&comp, /* output: connected components */
sizeof(CvContour),
CV_RETR_CCOMP,
CV_CHAIN_APPROX_SIMPLE,
cvPoint(0,0)
);
Listing 3: Algorithm for motion analysis to detect eye blinks
cvSub(gray, prev, diff, NULL);                       /* difference of current and previous frame */
cvThreshold(diff, diff, 5, 255, CV_THRESH_BINARY);   /* binarize to highlight motion regions */
Listing 4: Algorithm for online template creation
cvWaitKey(250);                 /* short delay so that the eyes are open again */
cvSetImageROI(gray, rect_eye);  /* restrict processing to the detected eye region */
cvCopy(gray, tpl, NULL);        /* copy the region as the online eye template */
cvResetImageROI(gray);
Listing 5: Algorithm for locating the eye in subsequent frames
/* get the centroid of eye */
point = cvPoint(
rect_eye.x + rect_eye.width / 2,
rect_eye.y + rect_eye.height / 2
);
/* setup search window */
window = cvRect(
point.x - WIN_WIDTH / 2,
point.y - WIN_HEIGHT / 2,
WIN_WIDTH,
WIN_HEIGHT
);
/* locate the eye with template matching */
cvSetImageROI(gray, window);
cvMatchTemplate(gray, tpl, res, CV_TM_SQDIFF_NORMED);
cvMinMaxLoc(res, &minval, &maxval, &minloc, &maxloc, 0);
cvResetImageROI(gray);
Listing 6: Algorithm for blink detection with motion analysis
/* motion analysis
cvSetImageROI has been applied to the images below */
cvSub(gray, prev, diff, NULL);
cvThreshold(diff, diff, 5, 255, CV_THRESH_BINARY);
cvMorphologyEx(diff, diff, NULL, kernel, CV_MOP_OPEN, 1);
/* detect eye blink */
nc = cvFindContours(diff, storage, &comp, sizeof(CvContour),
CV_RETR_CCOMP, CV_CHAIN_APPROX_SIMPLE, cvPoint(0,0));




















APPENDIX B
PROGRAM LISTING



#include "stdafx.h"


using namespace std;

#define MAX_LOADSTRING 100
#define FRAME_WIDTH 240
#define FRAME_HEIGHT 180
#define TPL_WIDTH 16
#define TPL_HEIGHT 12
#define WIN_WIDTH TPL_WIDTH * 2
#define WIN_HEIGHT TPL_HEIGHT * 2
#define TM_THRESHOLD 0.4
#define STAGE_INIT 1
#define STAGE_TRACKING 2
#define POINT_TL(r) cvPoint(r.x, r.y)
#define POINT_BR(r) cvPoint(r.x + r.width, r.y + r.height)
#define POINTS(r) POINT_TL(r), POINT_BR(r)
#define DRAW_RECTS(f, d, rw, ro)                                    \
do {                                                                \
    cvRectangle(f, POINTS(rw), CV_RGB(255, 0, 0), 1, 8, 0);         \
    cvRectangle(f, POINTS(ro), CV_RGB(0, 255, 0), 1, 8, 0);         \
    cvRectangle(d, POINTS(rw), cvScalarAll(255), 1, 8, 0);          \
    cvRectangle(d, POINTS(ro), cvScalarAll(255), 1, 8, 0);          \
} while(0)

#define DRAW_TEXT(f, t, d, use_bg)                                  \
if (d)                                                              \
{                                                                   \
    CvSize _size;                                                   \
    cvGetTextSize(t, &font, &_size, NULL);                          \
    if (use_bg)                                                     \
    {                                                               \
        cvRectangle(f, cvPoint(0, f->height),                       \
                    cvPoint(_size.width + 5,                        \
                            f->height - _size.height * 2),          \
                    CV_RGB(255, 0, 0), CV_FILLED, 8, 0);            \
    }                                                               \
    cvPutText(f, t, cvPoint(2, f->height - _size.height / 2),       \
              &font, CV_RGB(255,255,0));                            \
    d--;                                                            \
}

CvCapture* capture;
IplImage* frame, * gray, * prev, * diff, * tpl;
CvMemStorage* storage;
IplConvKernel* kernel;
CvFont font;
char* wnd_debug = "diff";
HINSTANCE hInst;
TCHAR szTitle[MAX_LOADSTRING];
TCHAR szWindowClass[MAX_LOADSTRING];
char* wnd_name = "video feed";
CHAR lpszz[16];
CHAR lpszz2[16];
HANDLE hThread;
DWORD dwThreadId;
int bc;
int defcount;
long slp;

INT_PTR CALLBACK About(HWND, UINT, WPARAM, LPARAM);
int get_connected_components(IplImage* img, IplImage* prev, CvRect
window, CvSeq** comp);
int is_eye_pair(CvSeq* comp, int num, CvRect* eye);
int locate_eye(IplImage* img, IplImage* tpl, CvRect* window, CvRect*
eye);
int is_blink(CvSeq* comp, int num, CvRect window, CvRect eye);
void delay_frames(int nframes);
void init();
void exit_nicely(char* msg);
int talk(wchar_t *st);

DWORD WINAPI Detector( LPVOID lpParam ) {
// We are only interested in blink counts within the defined interval:
// if the count reaches the threshold, warn the driver; in either case reset the count.
while(true){

Sleep(slp );
if(bc>=defcount){

//for(int i=0;i<3;i++)
//{
talk(L"Attention! Please the driver is sleeping");
bc=0;
//PlaySound("buzzer.wav", NULL, SND_ASYNC);
//}

} else{bc=0;}
}

return 0;
}



int APIENTRY _tWinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance,LPTSTR
lpCmdLine, int nCmdShow)
{
UNREFERENCED_PARAMETER(hPrevInstance);
UNREFERENCED_PARAMETER(lpCmdLine);

hInst=hInstance;


int ret = DialogBox(hInst, MAKEINTRESOURCE(IDD_ABOUTBOX),
NULL, About);

if(ret == 2){
return 0;
}
char lszThreadParam;
hThread =
CreateThread(NULL,0,Detector,&lszThreadParam,0,&dwThreadId);
if(hThread == NULL)
{
DWORD dwError = GetLastError();
cerr<<"Error in Creating thread"<<dwError<<endl ;
return 1;
}

//MessageBox(NULL, "No characters entered.", "Error", MB_OK);
CvSeq* comp = 0;
CvRect window, eye;
int key, nc, found;
int text_delay, stage = STAGE_INIT;

init();

while (key != 'q')
{
frame = cvQueryFrame(capture);
if (!frame)
exit_nicely("cannot query frame!");
frame->origin = 0;

if (stage == STAGE_INIT)
window = cvRect(0, 0, frame->width, frame->height);

cvCvtColor(frame, gray, CV_BGR2GRAY);

nc = get_connected_components(gray, prev, window, &comp);

if (stage == STAGE_INIT && is_eye_pair(comp, nc, &eye))
{
delay_frames(5);

cvSetImageROI(gray, eye);
cvCopy(gray, tpl, NULL);
cvResetImageROI(gray);

stage = STAGE_TRACKING;
text_delay = 10;
}


if (stage == STAGE_TRACKING)
{
found = locate_eye(gray, tpl, &window, &eye);

if (!found || key == 'r'){
stage = STAGE_INIT;
bc=0;
}
if (is_blink(comp, nc, window, eye)){
text_delay = 10;
++bc;
}
DRAW_RECTS(frame, diff, window, eye);
DRAW_TEXT(frame, "blink!", text_delay, 1);
}

cvShowImage(wnd_name, frame);
cvShowImage(wnd_debug, diff);
prev = (IplImage*)cvClone(gray);
key = cvWaitKey(15);
}

exit_nicely(NULL);
}



INT_PTR CALLBACK About(HWND hDlg, UINT message, WPARAM wParam, LPARAM
lParam)
{
UNREFERENCED_PARAMETER(lParam);

switch (message)
{
case WM_INITDIALOG:
return (INT_PTR)TRUE;

case WM_COMMAND:
if (LOWORD(wParam) == IDOK)
{

WORD cch = (WORD) SendDlgItemMessage(hDlg,IDC_EDIT1, EM_LINELENGTH,
(WPARAM) 0,(LPARAM) 0);
WORD cch2 = (WORD) SendDlgItemMessage(hDlg,IDC_EDIT2, EM_LINELENGTH,
(WPARAM) 0,(LPARAM) 0);
if (cch == 0 || cch2==0)
{
MessageBox(hDlg, "Please complete all fields.", "Error",
MB_OK);
return (INT_PTR)FALSE;
} else{

*((LPWORD)lpszz) = cch;

// Get the characters.
SendDlgItemMessage(hDlg, IDC_EDIT1, EM_GETLINE,
(WPARAM) 0,(LPARAM) lpszz);


// Null-terminate the string.
lpszz[cch] = 0;


*((LPWORD)lpszz2) = cch2;

// Get the characters.
SendDlgItemMessage(hDlg, IDC_EDIT2, EM_GETLINE,
(WPARAM) 0,(LPARAM) lpszz2);

// Null-terminate the string.
lpszz2[cch2] = 0;

defcount=atoi(lpszz2);
slp=atol(lpszz);
EndDialog(hDlg, LOWORD(wParam));
return (INT_PTR)TRUE;
}

}
if(LOWORD(wParam) == IDCANCEL){
EndDialog(hDlg, LOWORD(wParam));
return (INT_PTR)TRUE;
}



break;
}
return (INT_PTR)FALSE;
}


void exit_nicely(char* msg)
{
cvDestroyAllWindows();

if (capture)
cvReleaseCapture(&capture);
if (gray)
cvReleaseImage(&gray);
if (prev)
cvReleaseImage(&prev);
if (diff)
cvReleaseImage(&diff);
if (tpl)
cvReleaseImage(&tpl);
if (storage)
cvReleaseMemStorage(&storage);

if (msg != NULL)
{
fprintf(stderr, msg);
fprintf(stderr, "\n");
exit(1);
}


exit(0);
}


void
init()
{


capture = cvCaptureFromCAM(-1);
if (!capture)
exit_nicely("Cannot initialize camera!");

cvSetCaptureProperty(capture, CV_CAP_PROP_FRAME_WIDTH,
FRAME_WIDTH);
cvSetCaptureProperty(capture, CV_CAP_PROP_FRAME_HEIGHT,
FRAME_HEIGHT);

frame = cvQueryFrame(capture);
if (!frame)
exit_nicely("cannot query frame!");

cvInitFont(&font, CV_FONT_HERSHEY_SIMPLEX, 0.4, 0.4, 0, 1, 8);
cvNamedWindow(wnd_name, 1);
HWND hWnd= (HWND)cvGetWindowHandle(wnd_name);
if(hWnd == NULL) {

return ;
}

storage = cvCreateMemStorage(0);
if (!storage)
exit_nicely("cannot allocate memory storage!");

kernel = cvCreateStructuringElementEx(3, 3, 1, 1,
CV_SHAPE_CROSS,NULL);
gray = cvCreateImage(cvGetSize(frame), 8, 1);
prev = cvCreateImage(cvGetSize(frame), 8, 1);
diff = cvCreateImage(cvGetSize(frame), 8, 1);
tpl = cvCreateImage(cvSize(TPL_WIDTH, TPL_HEIGHT), 8, 1);

if (!kernel || !gray || !prev || !diff || !tpl)
exit_nicely("system error.");

gray->origin = frame->origin;
prev->origin = frame->origin;
diff->origin = frame->origin;

cvNamedWindow(wnd_debug, 1);
}


void delay_frames(int nframes)
{
int i;

for (i = 0; i < nframes; i++)
{

frame = cvQueryFrame(capture);
if (!frame)
exit_nicely("cannot query frame");
cvShowImage(wnd_name, frame);
if (diff)
cvShowImage(wnd_debug, diff);
cvWaitKey(30);
}
}



/**
* This is the wrapper function for cvFindContours
*
* @param IplImage* img the current grayscaled frame
* @param IplImage* prev previously saved frame
* @param CvRect window search within this window
* @param CvSeq** comp output parameter, will contain the connected
components
* @return int the number of connected components
*/
int get_connected_components(IplImage* img, IplImage* prev, CvRect window,
CvSeq** comp)
{
IplImage* _diff;

cvZero(diff);

/* apply search window to images */
cvSetImageROI(img, window);
cvSetImageROI(prev, window);
cvSetImageROI(diff, window);

/* motion analysis */
cvSub(img, prev, diff, NULL);
cvThreshold(diff, diff, 5, 255, CV_THRESH_BINARY);
cvMorphologyEx(diff, diff, NULL, kernel, CV_MOP_OPEN, 1);

/* reset search window */
cvResetImageROI(img);
cvResetImageROI(prev);
cvResetImageROI(diff);

_diff = (IplImage*)cvClone(diff);

/* get connected components */
int nc = cvFindContours(_diff, storage, comp, sizeof(CvContour),
CV_RETR_CCOMP,
CV_CHAIN_APPROX_SIMPLE, cvPoint(0,0));

cvClearMemStorage(storage);
cvReleaseImage(&_diff);

return nc;
}

/**

* Experimentally-derived heuristics to determine whether
* the connected components are eye pair or not.
*
* @param CvSeq* comp the connected components
* @param int num the number of connected components
* @param CvRect* eye output parameter, will contain the location of
the
* first component
* @return int '1' if eye pair, '0' otherwise
*/
int is_eye_pair(CvSeq* comp, int num, CvRect* eye)
{
if (comp == 0 || num != 2)
return 0;

CvRect r1 = cvBoundingRect(comp, 1);
comp = comp->h_next;

if (comp == 0)
return 0;

CvRect r2 = cvBoundingRect(comp, 1);

/* the width of the components are about the same */
if (abs(r1.width - r2.width) >= 5)
return 0;

/* the height f the components are about the same */
if (abs(r1.height - r2.height) >= 5)
return 0;

/* vertical distance is small */
if (abs(r1.y - r2.y) >= 5)
return 0;

/* reasonable horizontal distance, based on the components' width */
int dist_ratio = abs(r1.x - r2.x) / r1.width;
if (dist_ratio < 2 || dist_ratio > 5)
return 0;

/* get the centroid of the 1st component */
CvPoint point = cvPoint(
r1.x + (r1.width / 2),
r1.y + (r1.height / 2)
);

/* return eye boundaries */
*eye = cvRect(
point.x - (TPL_WIDTH / 2),
point.y - (TPL_HEIGHT / 2),
TPL_WIDTH,
TPL_HEIGHT
);

return 1;
}

/**

* Locate the user's eye with template matching
*
* @param IplImage* img the source image
* @param IplImage* tpl the eye template
* @param CvRect* window search within this window,
* will be updated with the recent search
window
* @param CvRect* eye output parameter, will contain the current
* location of user's eye
* @return int '1' if found, '0' otherwise
*/
int locate_eye(IplImage* img, IplImage* tpl, CvRect* window, CvRect* eye)
{
IplImage* tm;
CvRect win;
CvPoint minloc, maxloc, point;
double minval, maxval;
int w, h;

/* get the centroid of eye */
point = cvPoint(
(*eye).x + (*eye).width / 2,
(*eye).y + (*eye).height / 2
);

/* setup search window
replace the predefined WIN_WIDTH and WIN_HEIGHT above
for your convenient */
win = cvRect(
point.x - WIN_WIDTH / 2,
point.y - WIN_HEIGHT / 2,
WIN_WIDTH,
WIN_HEIGHT
);

/* make sure that the search window is still within the frame */
if (win.x < 0)
win.x = 0;
if (win.y < 0)
win.y = 0;
if (win.x + win.width > img->width)
win.x = img->width - win.width;
if (win.y + win.height > img->height)
win.y = img->height - win.height;

/* create new image for template matching result where:
width = W - w + 1, and
height = H - h + 1 */
w = win.width - tpl->width + 1;
h = win.height - tpl->height + 1;
tm = cvCreateImage(cvSize(w, h), IPL_DEPTH_32F, 1);

/* apply the search window */
cvSetImageROI(img, win);

/* template matching */
cvMatchTemplate(img, tpl, tm, CV_TM_SQDIFF_NORMED);
cvMinMaxLoc(tm, &minval, &maxval, &minloc, &maxloc, 0);


/* release things */
cvResetImageROI(img);
cvReleaseImage(&tm);

/* only good matches */
if (minval > TM_THRESHOLD)
return 0;

/* return the search window */
*window = win;

/* return eye location */
*eye = cvRect(
win.x + minloc.x,
win.y + minloc.y,
TPL_WIDTH,
TPL_HEIGHT
);

return 1;
}

int is_blink(CvSeq* comp, int num, CvRect window, CvRect eye)
{
if (comp == 0 || num != 1)
return 0;

CvRect r1 = cvBoundingRect(comp, 1);

/* component is within the search window */
if (r1.x < window.x)
return 0;
if (r1.y < window.y)
return 0;
if (r1.x + r1.width > window.x + window.width)
return 0;
if (r1.y + r1.height > window.y + window.height)
return 0;

/* get the centroid of eye */
CvPoint pt = cvPoint(
eye.x + eye.width / 2,
eye.y + eye.height / 2
);

/* component is located at the eye's centroid */
if (pt.x <= r1.x || pt.x >= r1.x + r1.width)
return 0;
if (pt.y <= r1.y || pt.y >= r1.y + r1.height)
return 0;

return 1;
}

int talk(wchar_t *st){
ISpVoice * pVoice = NULL;


if (FAILED(::CoInitialize(NULL)))
return FALSE;

HRESULT hr = CoCreateInstance(CLSID_SpVoice, NULL, CLSCTX_ALL,
IID_ISpVoice, (void **)&pVoice);
if( SUCCEEDED( hr ) )
{
hr = pVoice->Speak(st, 0, NULL);

pVoice->Release();
pVoice = NULL;
}


::CoUninitialize();
return TRUE;

}
















APPENDIX C
RESULTS
(i) Sample frames from sessions for each of the 10 subjects in different locations




(a) (b)
(c) (d)
(e) (f)













(g) (h)
(i)
(j)

(ii) Sample frames from session testing different head rotation of the test subjects


(a) (b)

(c) (d)










(iii) Sample frames from sessions testing varying lighting conditions.















(a)
(b)
(c)
(d)

(iv) Sample frames from sessions testing the blink detection accuracy













(a) (b) (c)
(c)
(d) (f)

(v) Sample frames of test subjects putting on glasses














(a) (b) (c)
