
Master Erasmus Mundus in Color in Informatics and Media Technology (CIMET)

Object Tracking: State of The Art and CAMSHIFT Improvement Using Multi-dominant Color Tracking

Master Thesis Report


Presented by

Priyanto Hidayatullah

and defended at the

University of Jean Monnet, Saint-Etienne, France, 22nd June 2010

Jury Committee:
Prof. Alain Tremeau
Prof. Jon Yngve Hardeberg
Faouzi Alaya Cheikh, Ph.D
Javier Hernández Andrés, Ph.D
Damien Muselet, Ph.D
Eric Dinet, Ph.D

Supervisor:
Hubert Konik, Ph.D

Object Tracking: State of The Art and CAMSHIFT Improvement Using Multi-dominant Color Tracking

Abstract

Object tracking is a wide area, with a lot of available methods and a wide variety of applications. One of these applications is tracking an object in a clickable hypervideo to enrich the interactivity of a video application. In this thesis, some state-of-the-art object tracking methods are reviewed and closely observed, and we select one of them to improve. Our selection is CAMSHIFT, which is very well accepted as one of the most prominent object tracking methods, has real-time speed performance, and is well suited to clickable hypervideo. CAMSHIFT is very good for single-hue object tracking and for conditions where the object's color is different from the background's colors. In this thesis, we try to improve the robustness of CAMSHIFT for multi-hued object tracking and for situations where the object's colors are similar to the background's colors. To improve robustness when the object's colors are similar to the background's, we use object localization, selecting each dominant-color object part using a combination of Mean-Shift segmentation and region growing. Hue-distance, saturation and value color histograms are used to describe the object. We also track the dominant-color object parts separately and combine them together to improve the robustness of tracking on multi-hued objects. Our experiments show that these methods improve CAMSHIFT significantly. We hope this improvement will be useful for object tracking in clickable hypervideo.

Keywords: Object tracking, CAMSHIFT, Segmentation, Mean-Shift, Hypervideo.

Object Tracking: State of The Art and CAMSHIFT Improvement Using Multi-dominant Color Tracking

Table of Contents
Abstract
Table of Contents
Table of Figures
1 Introduction
  1.1 The General Aim of The Master Thesis
2 Previous Work
  2.1 Test Videos
  2.2 Object Tracking Categorization
  2.3 Corner Detector Combined with Optical Flow
    2.3.1 Corner detection
    2.3.2 Optical flow
  2.4 Speeded Up Robust Features (SURF)
  2.5 Mean Shift Tracking
  2.6 CAMSHIFT Tracking
    2.6.1 Color probability distribution and histogram back projection
    2.6.2 Mass center calculation
    2.6.3 CAMSHIFT advantages and disadvantages
  2.7 Local Binary Pattern
  2.8 Beyond Semi-Supervised Online Boosting Tracking
  2.9 Method that We Choose
  2.10 CAMSHIFT/Mean-Shift Improvements in the Literature
    2.10.1 Mean-Shift tracking combined with texture histogram
    2.10.2 CAMSHIFT and Mean-Shift combined with interest points
    2.10.3 CAMSHIFT improvement using new HSV model
    2.10.4 CAMSHIFT improvement using hue-distance and saturation features
    2.10.5 CAMSHIFT with improvement of object localization
    2.10.6 CAMSHIFT improvement using adaptive background (ABCShift)
    2.10.7 CAMSHIFT improvement by background subtraction
    2.10.8 The CAMSHIFT improvement method that we choose
    2.10.9 The more specific aim of the master thesis
3 Proposed Method
  3.1 Object Localization
    3.1.1 Preprocessing
    3.1.2 Image color transformation
    3.1.3 Object Selection
    3.1.4 Minimum and maximum values storing
  3.2 Object Modeling
  3.3 Making Color Mask
  3.4 Segmentation
  3.5 Histogram Back Projection
  3.6 Tracking
4 Implementations and Experiments
  4.1 Implementations
  4.2 Experiments Setting
5 Results and Discussions
  5.1 Results
    5.1.1 First Experiment Results
    5.1.2 Second Experiment Results
    5.1.3 Third Experiment Results
    5.1.4 Fourth Experiment Results
  5.2 Discussion
    5.2.1 Some Advantages
    5.2.2 Some Limitations
6 Conclusions and Future Works
  6.1 Conclusions
  6.2 Future Works
Bibliography


Table of Figures
Figure 2.1 Test Videos.
Figure 2.2 Illustration of optical flow [28].
Figure 2.3 The Shi-Tomasi corner detector also detects background corners inside the object's rectangle.
Figure 2.4 SURF Tracker Result.
Figure 2.5 Intuitive description of Mean-Shift [14].
Figure 2.6 Summary of CAMSHIFT algorithm.
Figure 2.7 LBP and CS-LBP features for a neighborhood of 8 pixels [16].
Figure 2.8 Example of LBP calculation [16].
Figure 2.9 LBP Tracker Result in second video.
Figure 2.10 The core classifier system: detector, recognizer and tracker [20].
Figure 2.11 Comparing LBP image and its back projection image.
Figure 2.12 SURF and CAMSHIFT 1.
Figure 2.13 SURF and CAMSHIFT 2.
Figure 2.14 CAMSHIFT with new HSV model.
Figure 2.15 CAMSHIFT improvement with hue-distance saturation features.
Figure 2.16 Foreground extraction.
Figure 2.17 Sample of elongated object.
Figure 2.18 Background subtraction in static background.
Figure 2.19 Background subtraction in dynamic background.
Figure 3.1 A sample of complex shape object.
Figure 3.2 Object localization using only region growing.
Figure 3.3 More precise object localization with only a single click.
Figure 3.4 Text file configuration to tune the parameters.
Figure 3.5 Color mask illustration.
Figure 3.6 Segmentation for smoothing and noise removal of third test video.
Figure 3.7 Histogram back projection of first test video.
Figure 3.8 Maximum rectangle illustration.
Figure 3.9 The proposed method's schema.
Figure 4.1 Hue histogram of air plane body.
Figure 5.1 First video result with the proposed method.
Figure 5.2 First video result with classic CAMSHIFT at frame 33.
Figure 5.3 Object localization comparison.
Figure 5.4 Second video result with our proposed method.
Figure 5.5 Second video result with classic CAMSHIFT.
Figure 5.6 Third video result with the proposed method.
Figure 5.7 Third video best result with classic CAMSHIFT at frame 300.
Figure 5.8 Object (marked with red rectangle) tracked by the proposed method.
Figure 5.9 Fourth video best result with classic CAMSHIFT at frame 57.
Figure 5.10 Drifting tracker.
Figure 5.11 Multiple object tracking using our proposed method.


1 Introduction

Object tracking has been one of the fastest-emerging areas in computer vision, with a lot of applications. One of them is tracking an object in a clickable hypervideo. Hypervideo is a displayed video stream that contains embedded user-clickable anchors [19]. In this application, the user can interact with the video much like they interact with a website, which enriches the interactivity of the video. This capability brings many advantages: for example, users can monetize their videos by putting a company's links inside the video and, conversely, companies are now able to promote their products in videos.

Another interesting capability is object tracking in hypervideo: the user can select any object in a video and track it along the video sequence. For example, if a user has a favorite football player in a match and wants to track his movement along the match, it becomes possible with this capability. The same holds if a user wants to track his favorite racer in F1 videos, his favorite movie star in a film, etc.

In this thesis, we try to improve an object tracking method that can be used in hypervideo. Some state-of-the-art object tracking methods are reviewed, experimented with and closely observed. We then select one of these state-of-the-art methods and improve it.

1.1 The General Aim of The Master Thesis

The general objective of the master thesis can be summarized into these points:
1) Study some state-of-the-art object tracking methods.
2) Choose one to improve, based on some criteria.
3) Improve the chosen method, with some constraints if needed.


2 Previous Work

Object tracking is a very wide area in computer vision. There are many kinds of methods, which are often suitable only for specific conditions. This part reviews some of the state-of-the-art object tracking methods available now.

2.1 Test Videos

Before we go deeper into the state-of-the-art methods, in this section we present the test videos which we used to examine those methods and which helped us choose one of them. Secondly, these test videos will be used to compare our own proposed method against the chosen method without our improvement.

The first video shows a yellow trunk (Figure 2.1(a)). This is the simplest case: a single-hue object undergoing scaling, rotation and a little deformation in front of a dynamic background whose color is quite different from the object's. The object is yellow while the background is mostly blue. In the middle of the video, partial dynamic occlusion occurs. The dimension of the video is 1280 x 720 pixels in 24-bit color.

Figure 2.1 Test Videos. (a) First video: yellow trunk (b) Second video: air plane (c) Third video: small scaling toy (d) Fourth video: football match


The purpose of this video is to test the robustness of the state-of-the-art object tracking methods, as well as our proposed method, on a partially occluded, scaled, rotated and deformed object.

The second test video shows an air plane flying above the sea, with some small islands below and mild cloud distraction (Figure 2.1(b)). This is a multi-hued object passing in front of a dynamic background, and some background regions have colors similar to some of the object's parts. The dimension of the video is 1280 x 720 pixels in 24-bit color. The purpose of this video is to test the robustness of object tracking methods on a multi-hued object with some distractions.

The third video shows a small toy with several dominant colors moving across a complex background (Figure 2.1(c)); it is available in [33]. This is a multi-hued object in front of a complex background whose colors are very similar to the object's. Further challenges of this video are the scaling and skewing of the object: the object moves away from the camera until its size is very small, and it skews several times. The object also moves very fast, which makes it harder to track. The background is actually static, and usually, for this kind of video, background subtraction is very powerful. But because the object stays in place for quite a long time in the early frames, even background subtraction has a problem: it needs a lot of training data to build a very good background model. Moreover, in the middle of the video there is some background movement that can ruin the background model. Finally, not only the target object moves: the hand and the paper below the object also move, which adds further challenges for background subtraction. The dimension of the video is 640 x 480 pixels in 24-bit color. The purpose of this video is to test the robustness of object tracking methods on a multi-hued object in front of a similarly colored background; the challenging scaling and skewing of the object are also important for testing tracking performance.

The fourth video is a football match video (Figure 2.1(d)), available in [34]. In this video there is almost full occlusion, and there is distraction from similarly colored moving objects. The object is also very small, which is a great challenge for some object tracking methods. The dimension of the video is 544 x 436 pixels in 24-bit color. The purpose of this video is to test the robustness of object tracking methods on a very small object with almost full occlusion and distraction from other similarly colored objects.


2.2 Object Tracking Categorization


In [17], Yilmaz et al. present the results of a survey of object tracking methods. They propose an object tracking categorization, with methods that represent each category; the categories themselves are divided into object detection categories and object tracking categories. The categorization was presented in 2006; we update the category examples with some recent methods in each category so that it is more relevant to our master thesis. The categorization is summarized in Table 2.1 and Table 2.2. In the next sections, we choose some representative methods to study based on these criteria:
1) Acceptability. Methods that are widely used by researchers are preferred.
2) Recentness. We prefer more recent methods over older ones.

2.3 Corner Detector Combined with Optical Flow

In [5], Bradski states that one of the basic methods for object tracking is to select representative point features and track those features using optical flow. This is one of the most intuitive object tracking approaches, and the KLT tracker, as its representative, is a well-known object tracking method. That is why we choose this method to review. It represents the point detectors and kernel tracking categories of Yilmaz et al.'s categorization [17].

Table 2.1 Object Detection Categories [17]
Point detectors: Harris detector [Harris and Stephens 1988]; KLT detector [Shi and Tomasi 1994]; Scale Invariant Feature Transform [Lowe 2004]; Speeded Up Robust Features [Bay 2006]
Segmentation: Active contours [Caselles et al. 1995]; Mean-shift [Comaniciu and Meer 1999]
Texture Descriptor: Gray co-occurrence matrices [C. C. Gotlieb et al. 1990]; Gabor filtering [G. Wouwer et al. 1999]; Local Binary Pattern [Ojala and Pietikainen 2001]
Background Modeling: Mixture of Gaussians [Stauffer and Grimson 2000]; Eigenbackground [Oliver et al. 2000]; Dynamic texture background [Monnet et al. 2003]
Supervised Classifiers: Support Vector Machines [Papageorgiou et al. 1998]; Neural Networks [Rowley et al. 1998]; Adaptive Boosting [Viola et al. 2003]; Beyond Semi-Supervised Online Boosting [Stalder et al. 2009]


Table 2.2 Tracking Categories [17]
Point Tracking
  Deterministic methods: MGE tracker [Salari and Sethi 1990]; GOA tracker [Veenman et al. 2001]
  Statistical methods: Kalman filter [Broida and Chellappa 1986]; JPDAF [Bar-Shalom and Foreman 1988]; PMHT [Streit and Luginbuhl 1994]
Kernel Tracking
  Template and density based appearance models: KLT [Shi and Tomasi 1994]; CAMSHIFT [Bradski 1998]; Layering [Tao et al. 2002]
  Multi-view appearance models: Eigentracking [Black and Jepson 1998]; SVM tracker [Avidan 2001]
Silhouette Tracking
  Contour evolution: State space models [Isard and Blake 1998]; Variational methods [Bertalmio et al. 2000]; Heuristic methods [Ronfard 1994]
  Matching shapes: Hausdorff [Huttenlocher et al. 1993]; Hough transform [Sato and Aggarwal 2004]

2.3.1 Corner detection

Representative features are naturally the features that can most probably still be identified after some change in the next frame. We hope to select unique (or almost unique) points so that they can be tracked more easily. One can take points that have a strong derivative; such points may lie along an edge. But if we take strong derivatives in two orthogonal directions, then we can hope that the points are unique. Those points are called corners. To detect corners, one method that can be used is the KLT Shi-Tomasi corner detector [26]. The implementation is available in OpenCV 2.0 [24] under the function cvGoodFeaturesToTrack(). This function computes the needed second derivatives (using Sobel operators) and, from those, the needed eigenvalues. It then returns a list of points that meet the requirements of good features to track.
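As an illustration, here is a minimal sketch of this corner-selection step. It is a sketch under assumptions: it uses the current OpenCV Python bindings rather than the OpenCV 2.0 C API used in the thesis (cv2.goodFeaturesToTrack is the Python equivalent of cvGoodFeaturesToTrack()), and the file name and parameter values are placeholders:

```python
import cv2

frame = cv2.imread("frame0.png")                 # hypothetical input frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # derivatives need one channel

corners = cv2.goodFeaturesToTrack(
    gray,
    maxCorners=100,      # upper bound on the number of returned points
    qualityLevel=0.01,   # relative threshold on the minimal eigenvalue
    minDistance=10,      # minimum spacing between accepted corners
)
# corners is an (N, 1, 2) float32 array of (x, y) corner positions.
```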

2.3.2 Optical flow

Another approach to tracking a region defined by a primitive shape is to compute its translation using an optical flow method. Optical flow methods generate dense flow fields by computing the flow vector of each pixel [17]. One of the most famous optical flow algorithms is the Lucas-Kanade algorithm. Its most basic equation is stated in [27]:


\epsilon(\mathbf{d}) = \sum_{x=u_x-\omega_x}^{u_x+\omega_x} \; \sum_{y=u_y-\omega_y}^{u_y+\omega_y} \big( I(x,y) - J(x+d_x,\, y+d_y) \big)^2 \qquad (1)

The goal of feature tracking is, for a given point u in image I, to find its corresponding location v = u + d in the next image J such that I(u) and J(v) are similar. The displacement vector d is the image velocity at u, also known as the optical flow at u [27]. The similarity function (1) is measured over an image neighborhood of size (2\omega_x + 1) x (2\omega_y + 1), also called the integration window, where \omega_x and \omega_y are two integers with typical values of 2, 3, 4, 5, 6 or 7 pixels.

The basic idea of the Lucas-Kanade algorithm rests on three assumptions [5]:
1) Brightness constancy. A pixel of an object in an image does not change in appearance as it (possibly) moves from frame to frame. For grayscale images, this means we assume that the brightness of a pixel does not change as it is tracked from frame to frame.
2) Small movements. The image motion of an object changes slowly in time.
3) Spatial coherence. Neighboring points in a scene that belong to the same surface have similar motion.

Figure 2.2 Illustration of optical flow [28].

The disadvantage of using a small local window in Lucas-Kanade is that large motions can move points outside of the local window, making them impossible to track [5]. This led


to the development of the pyramidal LK algorithm, which starts tracking from the highest level of an image pyramid (lowest detail) and works down to lower levels (finer detail). Tracking over image pyramids makes it possible to track large motions with local windows. In 1994, Shi and Tomasi proposed the KLT tracker, which iteratively computes the translation (du, dv) of a region (e.g., a 25 x 25 patch) centered on an interest point [17].

Figure 2.3 The Shi-Tomasi corner detector also detects background corners inside the object's rectangle.

We have tried both methods combined together, using the implementations in OpenCV 2.0. We make a rectangle bounding the object, detect corners using the Shi-Tomasi method, and track the movement of those corners using the pyramidal Lucas-Kanade method. To update the position of the object's rectangle, we average the movement of all corners, based on the spatial coherence assumption. Nevertheless, the result is not satisfying, because the Shi-Tomasi corner detector gives us not only the object's corners but also the background corners inside the object's rectangle (Figure 2.3). For example, if the object moves upward, the background (relatively) moves downward. When we average the movement of all corners inside the object's rectangle (which contains both object and background corners) to compute the rectangle's motion, the result is not satisfying. It is difficult to tune the Shi-Tomasi corner detection parameters so that only the object's corners are detected inside the object's rectangle.
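The rectangle-update scheme just described can be sketched as follows. This is a minimal reconstruction under assumptions (hypothetical video path and initial rectangle, illustrative window size and pyramid depth), not our exact experimental code:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("test_video.avi")   # hypothetical test video
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

x, y, w, h = 100, 100, 80, 60              # user-selected object rectangle
mask = np.zeros_like(prev_gray)
mask[y:y + h, x:x + w] = 255               # detect corners only inside it
pts = cv2.goodFeaturesToTrack(prev_gray, 100, 0.01, 10, mask=mask)

while True:
    ok, frame = cap.read()
    if not ok or pts is None:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Pyramidal Lucas-Kanade: track each corner into the new frame.
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, pts, None, winSize=(15, 15), maxLevel=3)
    good_old = pts[status.ravel() == 1]
    good_new = nxt[status.ravel() == 1]
    if len(good_new) == 0:
        break
    # Spatial-coherence assumption: move the rectangle by the mean
    # displacement of all tracked corners (object AND background ones,
    # which is exactly why this scheme drifts in practice).
    dx, dy = (good_new - good_old).reshape(-1, 2).mean(axis=0)
    x, y = int(x + dx), int(y + dy)
    prev_gray, pts = gray, good_new.reshape(-1, 1, 2)
```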

2.4 Speeded Up Robust Features (SURF)

SURF is an image interest point detector and descriptor. It is very well accepted by researchers as one of the most prominent image interest point detectors and descriptors; that is why we choose this method to review.


This method represents the point detectors category in Yilmaz et al.'s categorization [17].

In [15], Bay et al. describe that the point detection uses a very basic Hessian-matrix approximation. This lends itself to the use of integral images, which reduces the computation time drastically. Interest points need to be found at different scales, not least because the search for correspondences often requires comparing images in which they are seen at different scales. Scale spaces are usually implemented as an image pyramid: the images are repeatedly smoothed with a Gaussian and then subsampled in order to achieve a higher level of the pyramid. In order to localize interest points in the image and over scales, a non-maximum suppression in a 3 x 3 x 3 neighborhood is applied.

For interest point description and matching, they build on the distribution of first-order Haar wavelet responses in the x and y directions rather than the gradient, exploit integral images for speed, and use only 64 dimensions. This reduces the time for feature computation and matching, and has proven to simultaneously increase robustness. Furthermore, they introduced a new indexing step based on the sign of the Laplacian, which increases the robustness of the descriptor and the matching speed. The sign of the Laplacian distinguishes bright blobs on dark backgrounds from the reverse situation; this feature is available at no extra computational cost, as it is already computed during the detection phase. In conclusion, they claim that SURF works great for classification tasks, performing better than previous methods (SIFT, GLOH) while still being faster to compute, and that SURF should be very well suited to tasks in object detection, object recognition or image retrieval.

This method is widely used nowadays, which drove us to try it for object tracking. We tried the code provided by the authors in [25]. We do simple tracking by making a bounding rectangle around the object and finding the interest points inside it; we store these interest points as the object model. We then evaluate the next frame, find the matched points, and calculate their displacement; we move the object rectangle according to the displacement of those matched points. With the steps above, we tested SURF on our test videos (Figure 2.4). In the first video (yellow trunk), SURF failed to detect the interest points. In the second video (air plane), SURF detected the object and moved the rectangle quite nicely. For the third


video, SURF detects the object in the first several frames, but when the object moves too far from the camera and becomes too small, SURF fails. Moreover, the SURF implementation sometimes gives wrongly matched interest points.
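To make the procedure concrete, a sketch of this displacement scheme follows. It is an assumption-laden illustration: it uses OpenCV's contrib SURF bindings (cv2.xfeatures2d) instead of the authors' original code from [25], and brute-force matching with a median displacement is our simplification:

```python
import cv2
import numpy as np

# Assumes an OpenCV build with the contrib xfeatures2d module.
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
matcher = cv2.BFMatcher(cv2.NORM_L2)

def model_from_roi(gray, rect):
    """Detect SURF points inside the object rectangle; keep as model."""
    x, y, w, h = rect
    mask = np.zeros(gray.shape, np.uint8)
    mask[y:y + h, x:x + w] = 255
    return surf.detectAndCompute(gray, mask)

def track_step(kp_model, des_model, gray, rect):
    """Shift the rectangle by the median displacement of matched points."""
    kp, des = surf.detectAndCompute(gray, None)
    if des is None or des_model is None:
        return rect                     # no points: keep the old rectangle
    matches = matcher.match(des_model, des)
    if not matches:
        return rect
    disp = np.median(
        [np.subtract(kp[m.trainIdx].pt, kp_model[m.queryIdx].pt)
         for m in matches], axis=0)
    x, y, w, h = rect
    return (int(x + disp[0]), int(y + disp[1]), w, h)
```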

Figure 2.4 SURF Tracker Result. (a) SURF Tracker in first video at frame 1 (b) SURF Tracker in first video at frame 65 (c) SURF Tracker in second video at frame 1 (d) SURF Tracker in second video at frame 95 (e) SURF Tracker in third video at frame 1 (f) SURF Tracker in third video at frame 148. An object rectangle in the left corner means SURF failed to detect the object's interest points.

2.5 Mean Shift Tracking

Mean-Shift is a robust method for finding the mode in a density distribution of a data set [5]. The method is versatile, since the density is not restricted to color distributions; it can also come from texture, motion, etc. [2,5]. For continuous distributions, the process is easy: it is merely hill climbing applied to a density histogram of the data [5]. This


method is also efficient compared to standard template matching, since it eliminates brute-force search [17]. Those characteristics have made Mean-Shift very well accepted by researchers; that is why we choose this method to review. It represents the segmentation category of object detection in Yilmaz et al.'s categorization [17].

The method can be summarized intuitively as follows [14]:
1) Start by taking an arbitrary position and size for a window (region of interest).
2) Find the mean-shift vector.
3) Move the window along the vector, so that the center of the window is now the end point of the vector (the mean).
4) Recalculate the vector inside the current window position.
5) Return to step 3) until convergence. Convergence here means the window movement falls below a threshold, or the mean-shift procedure has been carried out for a given number of iterations.
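OpenCV exposes exactly this window-climbing loop; a minimal, self-contained usage sketch follows (the synthetic blob stands in for a real probability image such as the back projections discussed in the next section):

```python
import cv2
import numpy as np

# Synthetic probability image: a bright blob whose mode we want to find.
prob = np.zeros((240, 320), np.uint8)
cv2.circle(prob, (200, 120), 30, 255, -1)

window = (150, 80, 60, 60)   # initial window, partially overlapping the blob
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)

# cv2.meanShift repeats steps 2)-4) above: compute the mean inside the
# window, recenter the window on it, and stop after 10 iterations or
# when the window moves less than 1 pixel.
n_iter, window = cv2.meanShift(prob, window, criteria)
print(window)                # the window has climbed onto the blob
```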

Figure 2.5 Intuitive description of Mean-Shift [14].

Mean-Shift was not originally meant to be a tracking algorithm [1]. In [2], Comaniciu et al. applied Mean-Shift to discontinuity-preserving filtering and image segmentation; only later did Comaniciu introduce Mean-Shift to track non-rigid objects [29].

Bradski in [5] states that the mean-shift calculation can be simplified by considering a rectangular kernel: a kernel with no falloff with distance from the center, until a single sharp transition to zero value. This is in contrast to the exponential falloff of a Gaussian kernel and the falloff with the square of distance from


the center in the commonly used Epanechnikov kernel. The simplification reduces the mean-shift vector equation to calculating the center of mass of the image pixel distribution using image moments (a code transcription of equations (2)-(4) is sketched at the end of this subsection):

1) Zeroth moment calculation:

M_{00} = \sum_{x} \sum_{y} I(x,y) \qquad (2)

2) First moment calculation:

M_{10} = \sum_{x} \sum_{y} x\, I(x,y), \qquad M_{01} = \sum_{x} \sum_{y} y\, I(x,y) \qquad (3)

3) Mean search window location calculation:

x_c = \frac{M_{10}}{M_{00}}, \qquad y_c = \frac{M_{01}}{M_{00}} \qquad (4)

Practically, then, the mean-shift tracking algorithm runs as follows [5]:
1) Choose a search window with its characteristics:
   i. Initial location
   ii. Type (uniform, polynomial, exponential, or Gaussian)
   iii. Shape (symmetric, rounded, rectangular)
   iv. Size
2) Compute the window's center of mass using moments.
3) Center the window at the center of mass.
4) Return to step 2) until convergence.

This method is good for a single-hue object on a background which has a different color from the object. The disadvantage is that mean-shift only gives the mean position: it does not give the object's orientation. In [5], Bradski implicitly implies that Mean-Shift does not give the object size, and in [17], Yilmaz notes that Mean-Shift is not rotation invariant.
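The promised transcription of equations (2)-(4), computing the center of mass of the probability pixels inside the search window (a minimal NumPy sketch):

```python
import numpy as np

def window_center_of_mass(prob, window):
    """Equations (2)-(4): centroid of the probability image I(x, y)
    restricted to the search window (x, y, w, h)."""
    x, y, w, h = window
    roi = prob[y:y + h, x:x + w].astype(np.float64)
    m00 = roi.sum()                                   # zeroth moment M00
    if m00 == 0:
        return None                                   # empty window
    ys, xs = np.mgrid[0:h, 0:w]
    m10 = (xs * roi).sum()                            # first moment M10
    m01 = (ys * roi).sum()                            # first moment M01
    # xc = M10/M00, yc = M01/M00, shifted back to image coordinates.
    return x + m10 / m00, y + m01 / m00
```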

2.6 CAMSHIFT Tracking

This is an improvement of the famous Mean-Shift method, obtained by making the color distribution adaptive to the changes in each frame. The heart of CAMSHIFT is Mean-Shift, which gives the center position of the rectangle; CAMSHIFT gives not only the position of the object but also its size and orientation [1,5]. The ability of CAMSHIFT to improve on Mean-Shift by giving the size and orientation of the object is very important in our case; that is why we choose this method to review. This method represents the template and density based appearance models in Yilmaz et al.'s categorization [17].


The original intention of CAMSHIFT was to develop a real-time perceptual user interface, in this case an application tracking human faces [1]. The method is based on the mean-shift method, modified so that it adapts to the dynamically changing color probability distributions across the frame sequence of a video [1]; this is needed because the color distribution changes over time from frame to frame. CAMSHIFT is now used as a computer interface for controlling computer games. For this computer-interface purpose, they developed CAMSHIFT to fulfill some characteristics:
1) Real time.
2) Able to run on inexpensive consumer cameras without lens calibration.
For these purposes, they decided to focus on color-based tracking, using a color histogram to track the colored object. The CAMSHIFT algorithm can be summarized in these steps [1]:
1) Choose the initial region of interest, which contains the object we want to track.
2) Make a color histogram of that region as the object model.
3) Make a probability distribution image of the frame using the color histogram. As a remark, the implementation uses the histogram back projection method.
4) Based on the probability distribution image, find the center of mass of the search window using the mean-shift method.
5) Center the search window on the point from step 4 and iterate step 4 until convergence.
6) Process the next frame with the search window position from step 5.
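Steps 1)-6) map almost one-to-one onto OpenCV calls; a minimal sketch follows (the video path and initial ROI are illustrative assumptions):

```python
import cv2

cap = cv2.VideoCapture("test_video.avi")          # hypothetical test video
ok, frame = cap.read()

# Steps 1)-2): initial ROI and its 1D hue histogram as the object model.
x, y, w, h = 100, 100, 80, 60                     # user-chosen rectangle
hsv_roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)

window = (x, y, w, h)
criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1.0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # Step 3): probability image via histogram back projection.
    prob = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
    # Steps 4)-6): mean-shift iterations to convergence, plus the adaptive
    # window; rot_rect carries (center, (length, width), orientation).
    rot_rect, window = cv2.CamShift(prob, window, criteria)
```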

2.6.1 Color probability distribution and histogram back projection

In order for CAMSHIFT to track a colored object, it needs a probability distribution image. They use the HSV color system, and only the hue component, to build the object's 1D color histogram. This histogram is stored and used to convert subsequent frames into the corresponding object probability. The probability distribution image itself is made by back projecting the 1D hue histogram onto the hue image of each frame; the result is called the back projection image. CAMSHIFT then tracks the object based on this back projection image.

Regarding histogram back projection, it is a technique to find the probability of a histogram in an image: each pixel of the image is evaluated for how much probability it has under the histogram.
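The back projection itself is just a per-pixel histogram lookup; a minimal manual version makes the "probability of each pixel" idea explicit (OpenCV's cv2.calcBackProject does the same job, faster; a 180-bin hue histogram is assumed):

```python
import numpy as np

def back_project_hue(hue_img, hue_hist):
    """Replace every hue value by the (normalized) frequency that the
    object model histogram assigns to its bin: high output means the
    pixel color is common in the object, low output means it is an
    unlikely object pixel."""
    lut = hue_hist.ravel() / max(hue_hist.max(), 1e-9)  # 180-bin model
    bins = np.clip(hue_img, 0, 179)                     # OpenCV hue range
    return (255 * lut[bins]).astype(np.uint8)
```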


2.6.2 Mass center calculation

The mean location of the probability image inside the search window is computed using image moments, where I(x,y) is the intensity of the discrete probability image at (x,y) within the search window:
1) Zeroth moment calculation using formula (2)
2) First moment calculation using formula (3)
3) Mean search window calculation using formula (4)

From this phase, we have the center position of the object in every frame. But this is not enough, since the size of the object can vary over time: for example, if the object moves towards or away from the camera, its size changes. This information can be calculated using the second moments, which give not only the length and width of the object but also its orientation.

Figure 2.6 Summary of CAMSHIFT algorithm. The gray box is the mean-shift algorithm [1]

The second moments are:

M_{20} = \sum_{x} \sum_{y} x^2\, I(x,y), \qquad M_{02} = \sum_{x} \sum_{y} y^2\, I(x,y), \qquad M_{11} = \sum_{x} \sum_{y} x\, y\, I(x,y) \qquad (5)


The orientation is:

\theta = \frac{1}{2} \arctan\!\left( \frac{b}{a - c} \right) \qquad (6)

Then the length l and width w from the distribution centroid are:

l = \sqrt{ \frac{(a + c) + \sqrt{b^2 + (a - c)^2}}{2} } \qquad (7)

w = \sqrt{ \frac{(a + c) - \sqrt{b^2 + (a - c)^2}}{2} } \qquad (8)

with a, b, c being:

a = \frac{M_{20}}{M_{00}} - x_c^2 \qquad (9)

b = 2 \left( \frac{M_{11}}{M_{00}} - x_c\, y_c \right) \qquad (10)

c = \frac{M_{02}}{M_{00}} - y_c^2 \qquad (11)
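Equations (5)-(11) translate directly into code; a minimal sketch operating on the probability image restricted to the converged search window:

```python
import numpy as np

def orientation_and_size(prob_window, xc, yc):
    """Equations (5)-(11): orientation, length and width of the
    probability distribution inside the converged search window.
    xc, yc are the centroid from equation (4), in window coordinates."""
    roi = prob_window.astype(np.float64)
    ys, xs = np.mgrid[0:roi.shape[0], 0:roi.shape[1]]
    m00 = roi.sum()
    m20 = (xs ** 2 * roi).sum()                  # second moments, eq. (5)
    m02 = (ys ** 2 * roi).sum()
    m11 = (xs * ys * roi).sum()
    a = m20 / m00 - xc ** 2                      # eq. (9)
    b = 2 * (m11 / m00 - xc * yc)                # eq. (10)
    c = m02 / m00 - yc ** 2                      # eq. (11)
    theta = 0.5 * np.arctan2(b, a - c)           # eq. (6); arctan2 for safety
    root = np.sqrt(b ** 2 + (a - c) ** 2)
    length = np.sqrt(((a + c) + root) / 2)       # eq. (7)
    width = np.sqrt(((a + c) - root) / 2)        # eq. (8)
    return theta, length, width
```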

2.6.3 CAMSHIFT advantages and disadvantages

With all those characteristics, classic CAMSHIFT has advantages along with disadvantages.

Some advantages:
1) Computationally efficient, with real-time tracking performance (30 fps).
2) Invariant to scaling and rotation.
3) Ignores image distractors as long as they lie outside the search window.
4) Can deal with occlusion as long as the occlusion is not 100%.
5) Insensitive to object deformation [4].

Some disadvantages:
1) Problems with multi-hued objects. If the object has more than one hue, the tracker tends to track the most significant object part and leave the small parts untracked. The problem also occurs in the case of a complex background.


2) Because it only takes the hue component, problems may occur if the object passes a background that has colors similar to the object's.
3) It is sensitive to illumination change.
4) In addition, when the object moves so fast that the target areas in two neighboring frames do not overlap, the tracker often converges to a wrong object [4].

2.7 Local Binary Pattern

Local binary pattern (LBP) is a descriptor which describes each pixel in a region by its texture, calculated from the relative gray levels of its neighboring pixels. LBP is a powerful illumination-invariant texture primitive [16]. The histogram of the binary patterns computed over a region is used for texture description. Figure 2.7 shows how to calculate LBP using the classic LBP and the center-symmetric local binary pattern (CS-LBP). LBP is a quite recent texture descriptor method which has been developed rapidly by a lot of researchers and shows some encouraging results in texture description. Besides the CS-LBP illustrated above, there are also the Rotation Invariant Volume LBP (RIV-LBP) and LBP from Three Orthogonal Planes (LBP-TOP) [22]. That is why we choose this method to review; it represents the texture descriptors in the categorization of Table 2.1.

Based on the shared LBP implementation code in [23], we tried to use it as an object tracker in the following simple way. First we make a rectangle that marks the object. We calculate the LBP histogram using RIV-LBP and store the histogram as the object model. We then build a search region which is

Figure 2.7 LBP and CS-LBP features for a neighborhood of 8 pixels [16].


Figure 2.8 Example of LBP calculation [16].

twice as large as the object's rectangle in the next frame. Then we make a sliding window of the same size as the object rectangle and propagate it inside the search region. Each time the sliding window moves, we calculate its RIV-LBP histogram and compare it with the model RIV-LBP histogram. The window position that gives the best similarity is assumed to be the object, so the last step is to move the object rectangle to that sliding window position. The histogram similarity measure we use is the histogram intersection implemented in OpenCV 2.0 [24].
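For reference, here is a minimal 8-neighbor LBP, the basic operator of Figure 2.8 (the RIV-LBP variant we actually used from [23] extends this idea to volumes with rotation invariance):

```python
import numpy as np

def lbp_8(gray):
    """Basic 8-neighbor LBP: each pixel's code is the 8-bit pattern of
    'neighbor >= center' tests, as illustrated in Figure 2.8."""
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]
    # Neighbor offsets, ordered clockwise from the top-left pixel.
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(shifts):
        nb = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= ((nb >= c).astype(np.int32) << bit)
    return code.astype(np.uint8)  # values 0..255; their histogram is the texture descriptor
```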

Figure 2.9 LBP Tracker Result in second video. (a) Object rectangle at frame 1 (b) Object rectangle at frame 33


The result is not what we expected, even for a simple video with a homogeneous plain background and a rigid, fixed-size object (Figure 2.9). The object rectangle often moves to a position which, visually, is not the best position of the object in the next frame. We therefore do not base our thesis on improving LBP.


2.8 Beyond Semi-Supervised Online Boosting Tracking

This method is based on an online boosting mechanism that assigns a weight to every feature; with this approach, the tracking problem is treated as a classification problem. In [20], Stalder et al. presented a multiple classifier system which splits the tasks of detection (finding the object of interest), recognition (distinguishing similar objects in a scene), and tracking (retrieving the object to be tracked) into separate classifiers. The purpose of this splitting is to simplify each classification task. This method represents the supervised classifiers category in Yilmaz et al.'s categorization [17]. It is one of the most recent methods in this object tracking category and shows a lot of encouraging results; that is why we choose it to study from the supervised classifiers category.

For feature selection, the goal of boosting is to minimize the error by selecting and combining a set of N weak classification algorithms into a strong classifier. In [20], they describe the online variant, where the main idea is to perform online boosting on selectors rather than on the weak classifiers directly. A selector holds a set of M weak classifiers and selects the one with the lowest estimated error (a schematic sketch of such a selector is given below).

For tracking, online boosting builds an initial classifier by taking positive samples from the object and negative samples from the background. The classifier is then evaluated exhaustively on the image at time t+1. The resulting confidence distribution is analyzed, and in the simplest case the local maximum is considered to be the new object position. In order to adapt to appearance changes of the object (e.g., different illumination) or a changed background, the classifier is updated and the loop repeats. The supervised online boosting tracker uses self-labeled data for updates; in semi-supervised online boosting tracking, the updates use unlabeled data.

In beyond semi-supervised tracking, they propose a multiple classifier system (Figure 2.10). For the detector, they use an offline classifier whose purpose is to reliably find the object of interest; the detector classifier is not updated during tracking. Any kind of object detector can be integrated in the system as long as it is generic and can be applied to any kind of scene. For the recognizer, they use a supervised online classifier on which updates are performed: the positive training set consists of tracked samples which are validated by the detector, and the negative training set is composed of hard examples collected in the background image at the time of detection. This allows the system to distinguish similar objects in a scene. For the tracker, they use a semi-supervised online classifier.


The confidence map is analyzed via semi-supervised updates to retrieve a stable maximum.
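The selector idea mentioned above can be sketched abstractly as follows. This is a schematic illustration of online boosting on selectors, not the authors' implementation [20,21], and the weak-classifier predict() interface is an assumption:

```python
import numpy as np

class Selector:
    """Holds a set of M weak classifiers and exposes the one with the
    lowest estimated weighted error, as in online boosting on selectors."""
    def __init__(self, weak_classifiers):
        self.weak = weak_classifiers
        self.wrong = np.zeros(len(weak_classifiers))
        self.total = np.full(len(weak_classifiers), 1e-9)

    def update(self, sample, label, weight):
        # Bookkeeping of each weak classifier's weighted error on the
        # new sample; the weight encodes how hard the sample has been
        # for the selectors seen so far (the boosting part).
        for i, h in enumerate(self.weak):
            if h.predict(sample) != label:
                self.wrong[i] += weight
            self.total[i] += weight

    def best(self):
        # The selector's output hypothesis: the lowest estimated error.
        return self.weak[int(np.argmin(self.wrong / self.total))]
```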

Figure 2.10 The core classifier system: detector, recognizer and tracker [20].

They used Haar-like features, histograms of oriented gradients (HOG), and color histograms, and managed to get 10 fps on a common 3.0 GHz dual-core PC with 2 GB RAM [20]. Some of the advantages of this method are:
1) It is currently one of the state-of-the-art supervised/semi-supervised object tracking methods.
2) It can distinguish between very similar objects. The authors give an example of tracking a Coca-Cola bottle next to another similar bottle; the system distinguishes which one is the tracked object and which one is not.
3) It can track a partially or fully occluded object, whether the occlusion is static or dynamic.
4) It can track an object which has a texture similar to its background.
5) It can do multiple object tracking.
6) Face tracking is also possible, with good results.
7) It is able to re-track the object.

The weakness is that this method works only if the size of the object is unchanged in each frame. If the size changes (e.g., the object moves away from the viewer), then it will not detect the


object. This is a very critical lack in our case, and the speed is also not high enough for a real-time application. The method is actually very promising, and the authors also share the code [21]. Unfortunately, we had some difficulties. Firstly, understanding the code: the implementation of the multiple classifier system, which contains a lot of pattern recognition algorithms, is not easy to understand. Even though this code contains only the Haar-like feature, which means it is the simple version, it is still hard to understand. Secondly, pattern recognition is not our main field. So we decided to skip this method; we do not study it further nor improve it.

2.9 Method that We Choose

After studying, experimenting with and observing some state-of-the-art methods in object tracking, we decided to choose CAMSHIFT, for the following reasons:

1) CAMSHIFT has real-time performance while some others do not. Our experiments show that it reaches 30 fps. This capability is very important for our case (clickable hypervideo), since a hypervideo application needs the tracker to be real time. Semi-Supervised Online Boosting Tracking fails to achieve real-time performance; in our experiments, LBP, KLT and SURF also fail.

2) CAMSHIFT is invariant to scaling and rotation while some others are not. Scaling and rotation are unavoidable in a real video application such as hypervideo, so this is the next most important characteristic that we need. CAMSHIFT is very robust to scaled and rotated objects. SURF is able to detect a rotating object, but unfortunately it fails to detect a highly scaled object: in our case, if the object becomes too small, SURF fails. Semi-Supervised Online Boosting Tracking certainly fails for a scaled object. Mean-Shift uses a fixed kernel size and moves the kernel towards the mean position; its size does not change over time, while CAMSHIFT adapts to the changing color distribution over time, which makes it scaling invariant [5]. Mean-Shift is suitable for translational and scaling motion but not for rotational motion [17], and KLT only handles affine motion, being unsuitable for our translational, rotational and scaling motions [17].

3) On all of our test videos, CAMSHIFT is one of the methods that can track the object with a reasonable result. It tracks the object but starts to fail after some frames: for the first video, the tracker drifts at frame 145, and it drifts at frames 573 and 61 for the third and fourth videos respectively. It succeeds in tracking the object in the second video, but only some object parts (the propellers). With


the procedures described in 2.3 and 2.7, KLT and LBP fail on all of the videos. SURF succeeds in tracking the object in the second video; nevertheless, it immediately fails to detect the object's interest points in the first and fourth videos. It tracks the object in the third video for some frames, but fails when the object becomes too small. We managed to obtain the simple source code of Beyond Semi-Supervised Online Boosting tracking, but not the full version, and tried it on our test videos. This simple source code contains only Haar-like features, without the color histogram and histograms of oriented gradients (HOG) features. With this simple version, it tracks the object in the second video successfully; it certainly fails on the third test video, as stated by the author in [33].

4) CAMSHIFT is insensitive to object deformation [4]; a changing object shape is not a problem for it.

5) We believe that we can improve some critical disadvantages of CAMSHIFT within the master thesis duration. We eliminate Semi-Supervised Online Boosting Tracking since it mostly deals with pattern recognition, while our main field is image analysis and processing with a specialization in color features (Color in Informatics and Media Technology). We believe that we could improve this latter method to meet our requirements, but the time needed to do so is not feasible within the master thesis duration.

6) CAMSHIFT is actually close to Mean-Shift. But since CAMSHIFT is an improved version of Mean-Shift, with the capability of adapting to the color distribution changes over time, we believe that CAMSHIFT is a better starting point than Mean-Shift.

2.10 CAMSHIFT/Mean-Shift Improvements in the Literature

Before we start to improve CAMSHIFT, we study the literature to learn what researchers have done to improve CAMSHIFT/Mean-Shift. After that, we will specify which CAMSHIFT improvement will be done in this thesis.

2.10.1 Mean-Shift tracking combined with texture histogram

Ning et al. in [8] proposed a joint color-texture histogram to represent an object, and then applied it in the mean-shift framework. The purpose is to improve tracking accuracy and efficiency by augmenting the conventional color histogram features with texture features, in this case the local binary pattern (LBP). The idea is to improve the object model by describing every pixel of the object not only with color information but also with a texture value. Ning [8] combines mean-shift tracking with LBP because of LBP's fast computation and rotation invariance. They claimed the proposed method


performs much better than the original color-based method, with fewer iterations, especially when tracking objects that have a color appearance similar to the background. We described LBP in 2.7. We tried LBP once again, with the implementation code in [30], in the following way:
1) We use the code to produce the LBP image of our test video (Figure 2.11).
2) Right after that, we calculate the LBP texture histogram using RIV-LBP [23].
3) We do back projection to get a texture-based probability image.
Unfortunately, the result was against our expectation. Back projection should give high intensity (probability) to pixels that have characteristics similar to the model histogram, but the result seems to be the opposite (Figure 2.11). We therefore decided not to use texture to improve CAMSHIFT.

Figure 2.11 Comparing LBP image and its back projection image. (a) LBP image (b) The back projection image. The small toy's intensity is low.

2.10.2 CAMSHIFT and Mean-Shift combined with interest points

Another way to improve CAMSHIFT is to combine it with interest point features. Interest point features are well known for their invariance to illumination, rotation and scaling. This advantage is very useful for compensating the disadvantage of the color histogram in CAMSHIFT, which is sensitive to illumination change.

One implementation of this idea is by Ganoun et al. in [10]. The aim of their research is to widen the field of application of the CAMSHIFT method so that it can be


applied to gray-level image sequences. They do so by improving the object model, adding feature point information to the color histogram. In principle, they measure the displacement of the search window by calculating the displacement of matched interest points; then they use CAMSHIFT to determine the final object position. The method can be summarized by the following steps [10]:
1) Calculation of the object model using the color histogram and feature points.
2) Calculation of the temporary object displacement by matching feature points between image It of the sequence at instant t and image It+1 at instant t+1.
3) Determination of a reduced search window positioned on the center Ctemp calculated at step 2.
4) Calculation of the probability image in the search window.
5) Application of the Mean-Shift algorithm to determine the new object center.
6) Update of the object model.

Another implementation of this approach is by Qiu Xuena et al. in [9], using SIFT as the interest point descriptor combined with spatial features to create the probability distribution of the tracked object. They use the spatial features to increase robustness when the tracker deals with occlusion. They claim their method can handle object scale, orientation, view and illumination changes, and that it can also deal with camera movement. SIFT was added with the purpose of increasing the robustness of CAMSHIFT when dealing with occlusion and with objects whose colors are similar to the background; meanwhile, the color feature can help SIFT segment the target. In addition, the spatial feature can handle the occluded situation. The probability distribution of the tracked object is represented by a linear weighted combination of the kernel functions of the above three features. The entire algorithm can be summarized as follows [9]:
1) Define a rectangle on the region of interest in the first frame;
2) Compute the color histogram of this region, at the same time extracting SIFT features within this region;
3) In the second frame, let the previous location be the center of the region of interest, whose size is one quarter the size of the


frame. In this region of interest, also let each pixel be the center of a sub-window whose size is the same as the target region;
4) For every sub-window, calculate the probability density;
5) Determine the final tracking window.

Figure 2.12 SURF and CAMSHIFT 1. (a) Yellow trunk is tracked by CAMSHIFT in frame 33 (b) Air plane is successfully tracked by SURF in frame 300

Figure 2.13 SURF and CAMSHIFT 2. (a) Small scaling toy is successfully tracked by SURF at frame 2 (b) SURF sometimes matches wrong points (c) SURF fails completely when the object skews at frame 360

As we know, there is a publicly available interest point detector and descriptor which is widely used and performs very well: SURF. So we tried SURF one more time, with the purpose of improving CAMSHIFT. We use the SURF implementation in OpenCV 2.0 [24] and compare it with the previous experiment in 2.4. We proceed in the following way:
1) Make a rectangle that covers the object in the first frame.
2) Calculate the conservative CAMSHIFT color histogram of the object inside the rectangle; store it as the object's color model.


3) Use SURF inside the rectangle to detect the interest points of the object; store them as the object's interest point model.
4) In each frame, use both models to track the object.

The result is more or less the same as what we obtained in 2.4: good for some cases, but not for others. For the first test video (yellow trunk), SURF simply fails to track the interest points, while CAMSHIFT tracks the object quite perfectly. In our second test video (Figure 2.12), SURF helps very much by giving a quite precise object rectangle, while CAMSHIFT fails to cover the whole object: CAMSHIFT only gives an ellipse that covers both of the air plane's propellers. So for these videos, the two methods complement each other, which is good. For the third video (Figure 2.13), CAMSHIFT succeeds for 275 frames but then drifts and tracks the background, which has colors similar to the object's. SURF manages to track the object for 105 frames; but when the object becomes small or skews, SURF fails. The SURF detector cannot recognize the object anymore, even if we update the interest point object model from the last successfully tracked object. Based on these results, we decided not to choose interest points as the improvement of CAMSHIFT.

2.10.3 CAMSHIFT improvement using new HSV model

In [4], G. Tian et al. propose an improved combined one-dimensional H, S, V color histogram model for CAMSHIFT object tracking. The purpose is to improve CAMSHIFT tracking accuracy when the color distributions of the object and the background are similar, or even in a complex background. The method is based on the Munsell 3D color coordinate system, which has been confirmed to suit the human visual system [4]. Based on the optical theory that each color has a corresponding wavelength, they quantize H, S and V into different ranges. In summary, the processes are:
1) Divide the color scope: based on the human visual ability to distinguish colors, they divide the H color space into 8 parts, the S color space into 3 parts, and the V color space into 3 parts.
2) Quantize the values of H, S and V with different intervals, based on human subjective color perception over the different color scopes.


3) Build a combined one-dimensional feature vector from the 3 color components: G = H*Qs*Qv + S*Qv + V. They choose Qs = 3 and Qv = 3, where Qs and Qv are the weights (quantization levels) of the S and V components. Therefore: G = 9H + 3S + V.
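As an illustration, a minimal sketch of this combined index, assuming H has already been quantized to 8 levels and S and V to 3 levels each (the actual bin boundaries in [4] are non-uniform and perception-based; uniform ones are used here only for illustration):

// h in [0,180), s and v in [0,256), as in OpenCV's HSV images.
int combinedIndex(int h, int s, int v)
{
    int H = h * 8 / 180;          // 8 hue parts (uniform stand-in for [4]'s ranges)
    int S = s * 3 / 256;          // 3 saturation parts
    int V = v * 3 / 256;          // 3 value parts
    return 9 * H + 3 * S + V;     // G = H*Qs*Qv + S*Qv + V with Qs = Qv = 3
}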

They showed some results which are quite encouraging. We have not tried this method, but we put it on our list of candidate algorithms for improving CAMSHIFT.

2.10.4 CAMSHIFT improvement using hue-distance and saturation features

J. A. Corrales et al. in [3] use a different approach. Their purpose is to improve CAMSHIFT when tracking objects in dynamic backgrounds with similar hue values, using the hue and saturation color components but modifying the hue component. Instead of using hue directly, they use hue-distance: a function which represents each hue value H as a distance from a reference hue value Href. The following distance function is used instead of the hue component:

d(H, Href) = min( |H - Href| , Hmax - |H - Href| )   (12)

where Hmax is the maximum hue value (180 in the OpenCV implementation), so the distance wraps around the hue circle.

The hue reference Href is the hue value with the highest frequency in the histogram h(x) obtained from the histogram calculation in the first step of the CAMSHIFT algorithm:

Href = argmax_x h(x)   (13)


Figure 2.14 CAMSHIFT with new HSV model. (a) The result image using CAMSHIFT (b) The back projection image using CAMSHIFT (c) The result image using CAMSHIFT with new HSV model (d) The back projection image using CAMSHIFT with new HSV model [4]

Firstly, a histogram of the hue component of the standard HSV model is calculated in order to obtain the hue reference Href using (13). Afterwards, the histogram is recomputed using the hue distance in (12) and used as the target distribution. All the following images are transformed to the HSV model, but using the hue distance instead of the hue component. Using the hue component alone to obtain the probability distribution image, as in classic CAMSHIFT, is not sufficient when there are elements in the background with hue values similar to the target object: in this case, the CAMSHIFT algorithm may wrongly include background elements in the search window. Two histograms are calculated: one for the hue distance, hd(H,Href)(x), and another for the saturation component, hS(x). In step 3 of the algorithm, the histogram hd(H,Href)(x) is used to obtain the back-projection Bd(H,Href)(x, y) of the hue-distance channel, and the histogram hS(x) is used to obtain the back-projection BS(x, y) of the saturation channel. These two back-projections are combined according to the following equations in order to create a final probability distribution image B(x, y), which is used by the Mean-Shift algorithm to find the center of mass: (14) (15)

Equation (15) removes from the hue-distance back projection those pixels whose saturation channel does not match the saturation values of the tracked object. Most of the background pixels whose hue values are similar to the tracked object can be removed because their saturation values are different; therefore, only the pixels with hue and saturation similar to the object are considered. We have implemented this method according to the procedure above, but with the modification that the V color component is included, as we believe value is also important to take into account in the model. The result is encouraging (Figure 2.15). This method has a similar purpose to the new HSV model method described previously.

2.10.5 CAMSHIFT with improvement of object localization

These methods try to improve the CAMSHIFT object model by improving the object localization method.

Foreground extraction
This method tries to increase the robustness of tracking by giving a very high positive weight to the center of the rectangle and a low negative weight to the parts beyond a given range [12]. The range is circular; an illustration is given in Figure 2.16. The formula is
(16)
where (x, y) is the pixel coordinate and hi is any value chosen by the user to filter out background color clusters. This still causes problems because, in real applications, many objects are not circular. For example, if the object is elongated (Figure 2.17), some background information will be taken into the object model, or some object parts will not be taken into account in the object model.


Weighted and Ratio Histogram
This method is, in some ways, similar to foreground extraction. For pixels inside the object search window, it gives higher weights to the pixels near the center and weight 0 to the pixels far from the center in the model histogram calculation [7]. They also use a ratio histogram, which gives lower weights to the pixels outside the object search window. We decided not to choose these methods since they cannot effectively localize the object. We propose another method, which will be described in 3.1.

2.10.6 CAMSHIFT improvement using adaptive background (ABCShift)

The aim of this method is to track robustly in two situations where CAMSHIFT fails: firstly, with scenery change due to camera motion, and secondly, when the tracked object moves across regions of background with which it shares significant colors [11]. It tries to improve the tracker by modeling the background with a Bayesian probability model. In summary, the algorithm is [11]:
1) Identify an object region in the first image and train the object model.
2) Center the search window on the estimated object centroid and resize it to have an area r times greater than the estimated object size. The centroid position can be calculated using:
(17)
where i is the index of all pixels in the search window and ci is the color of pixel i. Then the position of the centroid is
(18)
where (xi, yi) is the position of pixel i in the search window. At the end of the iteration, the center of the search window is shifted to the new position (xc, yc) and the procedure is repeated until two consecutive center positions are within a small distance of each other.
3) Learn the color distribution, P(C), by building a histogram of the colors of all pixels within the search window.


Figure 2.15 CAMSHIFT improvement with hue-distance saturation features. (a) Tracked image with selected object in rectangle (b) Hue-distance back projection (c) Saturation back projection (d) Hue-distance saturation combination back projection

Figure 2.16 Foreground extraction. FEM applies high positive value to the pixels near the center and applies negative values to the pixels toward the edges of the object region[12].


Figure 2.17 Sample of an elongated object, which is not well handled by the foreground extraction method.

4) Use Bayes' law to assign object probabilities, P(O|C), to every pixel in the search window, creating a 2D distribution of object location:

P(O|C) = P(C|O) P(O) / P(C)   (19)

where P(O|C) denotes the probability that the pixel represents the tracked object given its color, P(C|O) is the color model learned for the tracked object, and P(O) and P(C) are the prior probabilities that the pixel represents the object and has the color C, respectively.
5) Estimate the new object position as the centroid of this distribution and estimate the new object size (in pixels) as the sum of all pixel probabilities within the search window.
6) Repeat steps 2-6 until the object position estimate converges.
7) Return to step 2 for the next image frame.

They show some videos that confirm their claim that ABCShift gives good results even when the object passes through a background with a similar color. They apply this method in robotics. The authors show some encouraging results, but not multiple object tracking.
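For illustration, a minimal sketch of step 4, evaluated per pixel, under the assumption that colors have been quantized to histogram bins beforehand; the function and variable names are ours, not from [11]:

#include <opencv2/opencv.hpp>
#include <vector>
#include <algorithm>

// P(O|C) = P(C|O) P(O) / P(C) for every pixel of the search window.
cv::Mat objectProbability(const cv::Mat& colorIndex,                 // CV_32S bin per pixel
                          const std::vector<double>& pColorGivenObj, // P(C|O)
                          const std::vector<double>& pColor,         // P(C)
                          double pObject)                            // prior P(O)
{
    cv::Mat prob(colorIndex.size(), CV_32F);
    for (int y = 0; y < colorIndex.rows; ++y)
        for (int x = 0; x < colorIndex.cols; ++x) {
            int c = colorIndex.at<int>(y, x);
            double pc = std::max(pColor[c], 1e-9);   // guard against P(C) = 0
            prob.at<float>(y, x) = (float)(pColorGivenObj[c] * pObject / pc);
        }
    return prob;
}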


2.10.7 CAMSHIFT improvement by background subtraction

This method tries to improve CAMSHIFT tracking robustness in complex backgrounds by modeling the background and subtracting it from every frame of the sequence. Background subtraction has been widely used in object tracking in the case of a static background. The basic principle is to model the background and subtract it from the frame sequence; the subtraction result is, more or less, the moving object inside the frame. This method works under the assumption that the object is moving; otherwise the object will be identified as background. Some background subtraction methods use the average, the median, a code book, or a Gaussian mixture model, and implementations are publicly available: average background subtraction code in [31], the code book in OpenCV 1.0 [24] or higher, and Gaussian mixture model code in [32]. For median background subtraction, we developed it by modifying the average background subtraction code. We have tried three of these methods combined with CAMSHIFT; the results at frame 30 of the scaling small toy test video are shown in Figure 2.18, and a minimal sketch of the running-average variant is given after the figure.

Figure 2.18 Background subtraction in a static background. First row: the result image; second row: the foreground image. First column: average; second column: median; third column: Gaussian mixture model.
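For reference, a minimal running-average background subtraction sketch using OpenCV's accumulateWeighted; the file name, learning rate and threshold are illustrative assumptions, not the thesis's exact values:

#include <opencv2/opencv.hpp>

int main()
{
    cv::VideoCapture cap("test3.avi");            // assumed file name
    cv::Mat frame, gray, background, diff, fg;
    while (cap.read(frame)) {
        cv::cvtColor(frame, gray, CV_BGR2GRAY);
        gray.convertTo(gray, CV_32F);
        if (background.empty()) gray.copyTo(background);
        cv::accumulateWeighted(gray, background, 0.02);     // slowly update model
        cv::absdiff(gray, background, diff);                // deviation from model
        cv::threshold(diff, fg, 25, 255, cv::THRESH_BINARY); // foreground mask
    }
    return 0;
}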

In conclusion, background subtraction helps CAMSHIFT very much in static background videos. Moreover, with its help we can extract the object contour. Unfortunately, in dynamic background videos background subtraction is not helpful, because movement in the background is detected as foreground. The result of using background subtraction on the airplane test video at frame 30 can be seen in Figure 2.19. Actually, we tried this method because the previous improvements had difficulties solving the challenging tracking problem in the third video. We realized that this video has a static background, which is very useful information because background subtraction usually works very well in this condition; indeed, trying it confirmed that background subtraction helps CAMSHIFT track the object in the third video. We could therefore use this method if we had a technique to detect whether a background is static or dynamic: when the background is detected as static, background subtraction is used; otherwise it is not. But this would be another research topic requiring a fair amount of time. We do not continue with this method for improving CAMSHIFT, because hypervideo cannot be limited to videos with static backgrounds, and background subtraction is limited in dynamic background videos.

2.10.8 The CAMSHIFT improvement method that we choose

After studying, experimenting, and observing, we found that there are still some problems with the previous improvements. For improving CAMSHIFT using LBP, we tried our simple procedure as stated in 2.10.1; unfortunately, the result fell short of our expectations, so we decided not to use texture to improve CAMSHIFT. Combining interest points with CAMSHIFT also gives unsatisfying results for our case. We do not choose the CAMSHIFT improvement using the new HSV model because of the rigidity of its bin ranges. The CAMSHIFT object model improvements with foreground extraction and weighted histograms either fail to exclude background information from the model or fail to include all object information in it. ABCShift does improve CAMSHIFT significantly, especially in the condition where the object's colors are similar to the background's colors; nevertheless, this method is closer to a Pattern Recognition method, which is not our main field of interest, and we want to try another approach. We do not use background subtraction since it is only effective for videos with static backgrounds. The only method that we adopt is the use of the hue-distance histogram explained in 2.10.4, and we slightly modify it by using not only the hue-distance and saturation histograms but also the value histogram.

2.10.9 The more specific aim of the master thesis

Based on the discussion above, we decided to address some critical disadvantages of CAMSHIFT that are not fully solved by other researchers or do not meet our aim for this thesis. From this conclusion, we define the specific aims of the master thesis as:


1) Improve the robustness of classic CAMSHIFT for multihued object tracking.
2) Improve the robustness of CAMSHIFT for the condition where the object's colors are similar to the background's colors.
3) Improve CAMSHIFT's capability to do multiple object tracking. We did not find any literature on improving CAMSHIFT so that it can do multiple object tracking.
4) Speed and illuminant change are not our main concern. In [35], Colantoni et al. state that, using a Graphical Processing Unit (GPU), the speed of an image processing task (e.g. object tracking) can be increased by up to 10 times. As our main field is image analysis and processing, we focus our work on improving robustness; speed will be improved in our future work or left to experts in the GPU area.

To achieve our aims, we adopt the hue-distance histogram idea as one way to improve CAMSHIFT in the condition where the object's colors are similar to the background's colors. For object localization, we propose another method, described in 3.1. For increasing the robustness of tracking, we also propose methods explained in 3.2 and 3.6. Multiple object tracking becomes easy once the first two problems are solved.

Figure 2.19 Background subtraction in a dynamic background. First row: the result image; second row: the foreground image. First column: average; second column: median; third column: Gaussian mixture model.


Proposed Method

In this thesis, we propose several ways to improve CAMSHIFT. Multi-dominant color object localization and tracking the dominant color object parts separately are the key methods. This section describes the proposed method; implementation details, such as parameter values, are described in the Implementation section.

3.1 Object Localization

An object rectangle is the most common way to do object localization: the user draws a rectangle around the object. This is simple and easy to use; nevertheless, problems may occur because most of the time the object is not exactly rectangular, so some background information is included in the object model. If this happens, drifting often occurs and the tracker is not robust. Another method is using points as the object boundary: the selection continues by connecting those points consecutively with lines. This method can give the exact region of the object and is suitable for simple object shapes. The problem is that it is not practical for objects with complex or irregular shapes (Figure 3.1); even for an object with a simple circular shape, it is not so practical.

Figure 3.1 A sample of a complex-shaped object

Some recent methods [7][12] give imprecise localization of the object; they fail to give exact object information. This drives us to create a more sophisticated object localization method.

3.1.1 Preprocessing

We propose object localization combining mean-shift segmentation and region growing. Mean-shift is a preprocessing step before the object part selection: it is applied to segment each part of the object and make the parts homogeneous enough to be chosen easily. Mean-shift segmentation smoothens the image while preserving discontinuities [2]. This only holds up to a certain level: if, at the localization phase, the object is in front of a background of very similar color, the object will be


merged with the background. We need this preprocessing step because region growing by itself is not enough, even if we increase the color tolerance; Figure 3.2 gives a simple illustration of this case. As a remark, this preprocessing is needed only for the frame used in the object localization step (a one-call sketch is given below).
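This preprocessing amounts to a single OpenCV call; a minimal sketch, using the parameter values given later in the Implementation section (hs = 20, hr = 40, maximum pyramid level 2) and an assumed file name:

#include <opencv2/opencv.hpp>

int main()
{
    cv::Mat first = cv::imread("frame0.png");   // localization frame (assumed name)
    cv::Mat segmented;
    // Spatial range hs = 20, color range hr = 40, maximum pyramid level 2.
    cv::pyrMeanShiftFiltering(first, segmented, 20, 40, 2);
    return 0;
}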

Figure 3.2 Object localization using only region growing. (a) Before selection (b) After selection

Figure 3.3 More precise object localization with only a single click. (a) Before selection (b) After selection

Actually, for the preprocessing step, one alternative is K-Means segmentation: one can segment the object and background by selecting some colors as means, and all pixels are classified to those means. But this method is not practical because we would have to choose not only the object colors but also the background colors. If there are many colors in the background, which is very common in everyday-life video as well as in hypervideo, practicality becomes an issue; the third and fourth test videos are examples. To be more adaptive to the user's needs, we designed the segmentation to be tunable through a text file: the user can change the segmentation parameter values so that all object parts appear selectable. Figure 3.4 shows that we can tune the preprocessing segmentation parameters as well as the color tolerance for region growing, the number of bins, etc.

Figure 3.4 Text file configuration to tune the parameters

3.1.2 Image color transformation

The next step is to transform the image into the HSV color space. We choose HSV because this model is based on human color perception, following the Munsell three-dimensional color coordinate system, which has been confirmed suitable for the human visual system [4].

3.1.3 Object Selection

The next step is the object selection. After mean-shift segmentation, the user can choose the object by clicking each object part. The click positions become a set of seed points with specific properties. From these seed points, the region grows by appending their neighbors which have properties similar to the seeds [18]. We also give some tolerance values so that the seeds can grow further until reaching the edges of the object parts. The selected object parts are considered to contain the dominant colors of the object, which will be tracked in the tracking phase. A sketch of this selection step is given below.

3.1.4 Minimum and maximum values storing

Each time an object part is selected, we store the minimum and maximum values of the object part's color components: hmin, hmax, smin, smax, vmin, vmax. These values are needed to make the color mask described in section 3.3.


One advantage of the proposed method is that no surrounding background information is added to the model, which makes the tracker more robust and avoids drifting. Another advantage is practicality when the object has few parts: for example, the yellow trunk has only one hue, so the selection is just a click on any part of it and it is chosen entirely (Figure 3.3). Even if the shape of the object is complex and irregular, there is no problem as long as its color is homogeneous.

3.2 Object Modeling

The next step is object modeling. We use color histograms to model the object; we try to improve CAMSHIFT using only the color information of the object, with no background information included in the process. We use the HSV color space. In the original implementation of CAMSHIFT, Bradski used only the hue component, which leads to problems when the object passes in front of a background with a hue similar to the object's. Some other methods use only the hue and saturation components; this may be sufficient for some cases, but for others hue and saturation are not enough to distinguish the object from its background. The value component often gives good discrimination between the object and its background; that is why we use all three components and build histograms of them. Even those three components are apparently not enough on their own. Tian et al. in [4] proposed a new HSV color model to describe the object and claim it improves CAMSHIFT tracking. Corrales et al. in [3] proposed to use a hue-distance histogram instead of the hue histogram, as described in 2.10.4, combined with the saturation component. In this thesis, we combine those ideas and use hue-distance, saturation and value histograms to model the object, which gives better discrimination. We quantize each color histogram using a small number of bins; the number of bins for each component is the next important factor. Based on our experiments, 30 bins for the hue-distance component, 9 bins for saturation and 6 bins for value give good results.


3.3 Making Color Mask

A color mask is made for each object part, based on the minimum and maximum values taken in the object localization step (3.1.4). Each pixel in every frame is evaluated against those minimum and maximum values: if a pixel is inside the range, it is given the value 1; otherwise it is given the value 0. A minimal sketch of this test is given below.
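A sketch of this test, assuming cv::inRange (which outputs 255 rather than 1 for pixels inside the range); the function and variable names are ours:

#include <opencv2/opencv.hpp>

// 255 where a pixel lies inside the stored [min, max] ranges, 0 elsewhere.
cv::Mat makeColorMask(const cv::Mat& hsvFrame, const cv::Scalar& minHSV,
                      const cv::Scalar& maxHSV)
{
    cv::Mat mask;
    cv::inRange(hsvFrame, minHSV, maxHSV, mask);  // per-channel range test
    return mask;
}

It would be called with the stored extremes, e.g. makeColorMask(hsv, cv::Scalar(hmin, smin, vmin), cv::Scalar(hmax, smax, vmax)).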

Figure 3.5 Color mask illustration. (a) Original frame (b) Selected object (airplane body) (c) Color mask

3.4 Segmentation

For the following frames, mean-shift segmentation is carried out. This eases differentiating the object from the background. With small spatial and color ranges, mean-shift smoothens the image and removes noise while preserving discontinuities. Mean-shift merges some close-color background areas into one region which, hopefully, has color information outside the object histogram. Figure 3.6 shows some yellow noise in the background close to the object color; this noise is merged with neighboring pixels and assigned color information different from the object's.

Figure 3.6 Segmentation for smoothing and noise removal of the third test video. (a) Original frame (b) Segmented frame


3.5 Histogram Back Projection

Histogram back projection means evaluating each pixel in the frame sequence against the histogram model built in the object modeling phase. Before back projection, we apply the color mask to the frame to pass only pixels that satisfy the object color ranges. We then back-project the hue-distance histogram, the saturation histogram and the value histogram. Each back projection gives a back projection image containing the probability of each pixel in the frame according to that histogram. We then combine all back projection images into a single back projection image using the AND operator; a sketch of this combination is given below. This single back projection image is the final input for the tracking (Figure 3.7).
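A minimal sketch of this combination, assuming the hue-distance, saturation and value planes and their histograms have been prepared as described in 3.2-3.4; the names and ranges are illustrative:

#include <opencv2/opencv.hpp>

// Combine the three back projections and the color mask with AND.
cv::Mat combinedBackProjection(const cv::Mat& hd, const cv::Mat& s, const cv::Mat& v,
                               const cv::Mat& hdHist, const cv::Mat& sHist,
                               const cv::Mat& vHist, const cv::Mat& colorMask)
{
    float hR[] = {0, 180}, sR[] = {0, 256}, vR[] = {0, 256};
    const float* hRanges[] = {hR};
    const float* sRanges[] = {sR};
    const float* vRanges[] = {vR};
    int ch[] = {0};
    cv::Mat bH, bS, bV, B;
    cv::calcBackProject(&hd, 1, ch, hdHist, bH, hRanges);
    cv::calcBackProject(&s,  1, ch, sHist,  bS, sRanges);
    cv::calcBackProject(&v,  1, ch, vHist,  bV, vRanges);
    cv::bitwise_and(bH, bS, B);          // a pixel must match every histogram
    cv::bitwise_and(B,  bV, B);
    cv::bitwise_and(B, colorMask, B);    // ... and lie inside the color mask
    return B;
}

The AND combination means a pixel survives only if every channel supports it, which is what keeps similarly-hued but differently-saturated background out of the probability image.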

Figure 3.7 Histogram Back Projection of first test video

3.6 Tracking

Good localization is the first step towards good tracking, but it does not guarantee 100% accurate tracking. If we choose all the object parts as one part and take an HSV color histogram of it, it will be difficult to track the object when it passes in front of a background whose color falls within the object's color range. One good way to improve the robustness of the tracking is divide and conquer: we split the tracking problem itself so that it becomes easier to solve. To split the problem, we propose to track the object parts separately. The object parts represent the dominant color parts of the whole object; each part is modeled using hue-distance, saturation and value histograms and then tracked. The whole object rectangle is the maximum rectangle of the object part rectangles (Figure 3.8), and the whole object center is the center of that maximum rectangle. The maximum rectangle is defined as the smallest possible rectangle that covers all rectangles inside it [5]. We also propose a mechanism to detect whether a tracker has lost its object part and how this affects the whole object rectangle. If the next tracking rectangle area of an object part is equal to 0, that particular object tracker is declared lost. If an object tracker is lost, its rectangle is not taken into account in the whole object rectangle, which then considers only the surviving tracking rectangles.

Figure 3.8 Maximum rectangle illustration. Maximum rectangle (thick red) of the body, trouser and shoes rectangles (thin blue).

The proposed method can be summarized in the following steps (Figure 3.9), with a skeleton of the per-frame loop sketched after the list:
1) Apply mean-shift segmentation to the first frame so the object parts are easier to choose.
2) Transform the frame into HSV space.
3) Choose each object part by clicking on it and run region growing from the click position.
4) For each object part (red zone in Figure 3.9):
a. Take the minimum and maximum hue, saturation and value of the object part.
b. Calculate the hue-distance, saturation and value histograms.
c. Make a color mask by evaluating each pixel in the frame against the minimum and maximum values from step 4a.
d. Apply mean-shift segmentation to the image to smooth it and reduce noise.
e. Do histogram back projection from the histograms of step 4b and combine all back projection images.
f. Track the object based on the combined back projection image and store the new tracking window information.
g. If the tracker is lost, leave it; otherwise continue.
5) Find the maximum rectangle of the object part rectangles.
6) Return to step 4c using the new tracking window from step 4f.
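A skeleton of the per-frame part of these steps (4c-4g, 5 and 6), assuming a helper backProjectPart() that performs the masked, combined back projection of 3.3-3.5; the structure and names are ours, only the OpenCV calls are stock:

#include <opencv2/opencv.hpp>
#include <vector>

struct PartModel { cv::Rect window; bool lost; /* plus histograms and mask ranges */ };

// Assumed helper: the masked, combined back projection of sections 3.3-3.5.
cv::Mat backProjectPart(const cv::Mat& hsv, const PartModel& part);

void trackFrame(const cv::Mat& frame, std::vector<PartModel>& parts, cv::Rect& whole)
{
    cv::Mat smooth, hsv;
    cv::pyrMeanShiftFiltering(frame, smooth, 5, 20, 2);   // step 4d: smoothing
    cv::cvtColor(smooth, hsv, CV_BGR2HSV);
    bool haveWhole = false;
    for (size_t i = 0; i < parts.size(); ++i) {
        if (parts[i].lost) continue;                      // step 4g: skip lost parts
        cv::Mat prob = backProjectPart(hsv, parts[i]);    // steps 4c and 4e
        cv::CamShift(prob, parts[i].window,               // step 4f: window updated
            cv::TermCriteria(cv::TermCriteria::EPS | cv::TermCriteria::COUNT, 10, 1));
        if (parts[i].window.area() == 0) { parts[i].lost = true; continue; }
        whole = haveWhole ? (whole | parts[i].window)     // step 5: maximum rectangle
                          : parts[i].window;              // (union of cv::Rect)
        haveWhole = true;
    }
}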


Figure 3.9 The proposed method's schema


Implementations and Experiments

In this section, the implementations and experiments are explained thoroughly: the implementation environment, the libraries used, and the parameter values.

4.1 Implementations

In the implementation phase, the OpenCV 2.0 library is used, with Microsoft Visual Studio 2005 as the development tool. For mean-shift segmentation, we use the implementation in the OpenCV library. As proposed by Bradski et al. in [5], we use hs = 20, hr = 40 and maximum pyramid level = 2, which is good for an image of dimension 640 x 480; for images of dimension 1280 x 720 (e.g. the first test video), these parameters also give good segmentation results. For region growing, we use the Flood Fill method, also available in the OpenCV 2.0 library [24]. Flood fill appends each neighboring pixel based on its color characteristics: if a neighboring pixel has color characteristics close to the seed (i.e. within the tolerance ranges), the pixel is appended. We use a tolerance of 20 for the hue and value components and 40 for saturation. As the seed grows, it marks the area with perfect white (H=255, S=255, V=255); this perfect white area becomes a mask used to find the extreme values of the area and to calculate the object part histogram. The extreme values are hmin, hmax, smin, smax, vmin, vmax, corresponding to the minimum and maximum of hue, saturation and value; they are used for making the color mask of each object part. In the histogram calculation, we use 30 bins for the hue and hue-distance components, 9 bins for saturation and 6 bins for value; these parameters come from experiments and observations which give good results. We keep the hue histogram: after the histogram calculation, which takes the hue-distance, saturation and value components, we apply a threshold to the hue histogram. This threshold is important to retain close-hue pixels and remove unwanted far-hue pixels in the back projection. First we apply a threshold of 255 to the histogram; any bin whose value is above the threshold is declared a peak. The number of close hues is 70% of the number of peaks, rounded to an integer. For example, with 10 peaks, the number of close hues is 70% x 10 = 7; this means only hues whose distance to the hue reference is below 7 are taken into the back projection, and all other hues are discarded (Figure 4.1). If no histogram bin exceeds 255, the hue-distance threshold is set to 1. A sketch of this threshold computation follows Figure 4.1.

Figure 4.1 Hue histogram of the airplane body. Number of peaks = 10, hue-distance threshold = 7
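A minimal sketch of this threshold computation, assuming a 30-bin CV_32F hue histogram as produced by cv::calcHist; the function name is ours:

#include <opencv2/opencv.hpp>

int hueDistanceThreshold(const cv::Mat& hueHist)
{
    int peaks = 0;
    for (int b = 0; b < hueHist.rows; ++b)
        if (hueHist.at<float>(b) > 255.0f)   // bins above 255 count as peaks
            ++peaks;
    int thr = (int)(0.7 * peaks + 0.5);      // 70% of the peaks, rounded
    return thr > 0 ? thr : 1;                // fall back to 1 when no peaks
}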

We smooth the frames by mean-shift segmentation starting from the second frame, using spatial range hs = 5, color range hr = 20, and maximum pyramid level = 2. With these parameters, noise is reduced significantly while discontinuities (e.g. edges) are preserved. The color mask implementation simply evaluates each pixel in the current frame against the extreme (minimum and maximum) values taken in the previous step. For histogram back projection, CAMSHIFT tracking, and the maximum rectangle calculation, we use the functions available in the OpenCV 2.0 library [24].

4.2 Experiments Setting

All experiments are carried out on a laptop with an AMD Turion X2 (dual core) 1.6 GHz processor and 2 GB of RAM.


Results and Discussions

This section describes the experiment results, followed by a discussion of them.

5.1 Results

5.1.1 First Experiment Results

In the first video, our proposed object localization method shows its power. With a single click, the object is selected exactly, without any surrounding background included (surrounding background means background spatially close to the object). This helps build a robust histogram and robust tracking. The object is tracked perfectly despite changes in object shape and orientation. A partial occlusion occurs in the middle of the video, but the tracker deals with it without any problem. The average frame rate is 2.4 frames per second (fps). This shows that, for a single-hue object in front of a very distinct color background, our object localization method works very well.

Figure 5.1 First video result with the proposed method. First column is the frame, the second column is the back projection image, the third column is the hue histogram with 30 bins at frame 33.

If we use classic CAMSHIFT, we have to configure the hue, saturation and value parameters manually. These parameters are used to make a color mask: every pixel that fits them is assigned 1, otherwise 0. In the CAMSHIFT implementation example in [24], the hue parameters are set to hmin = 0 and hmax = 180, which takes all possible hues; saturation is set to smin = 30 and smax = 256, and value to vmin = 10 and vmax = 256 (Table 5.1). We tuned these parameters to get the best result from classic CAMSHIFT; the tuning is carried out on the saturation and value parameters. In the first video, we set smin = 50, and in the first experiment for this video we set vmin = 150.


Extreme                       Value
Minimum hue (hmin)            0
Maximum hue (hmax)            180 (maximum hue in OpenCV implementation)
Minimum saturation (smin)     30
Maximum saturation (smax)     256
Minimum value (vmin)          10
Maximum value (vmax)          256

Table 5.1 Default parameters of classic CAMSHIFT in [24]

Figure 5.2 First video result with classic CAMSHIFT at frame 33. (a) The frame (b) Back projection image (c) Hue histogram with 16 bins.

Figure 5.3 Object localization comparison. (a) Object localization using the proposed method (b) Object localization using classic CAMSHIFT (c) Hue histogram using the proposed method (d) Hue histogram using classic CAMSHIFT

The result shows the yellow trunk can be tracked successfully. Unfortunately, it starts drifting from frame 145: from that frame on, the tracker starts tracking the background. This is due to the object localization problem. In classic CAMSHIFT, a rectangle is used to localize the object, and with a rectangle there is a big possibility that background information is taken into the object model. As we can see in Figure 5.3(b), there is background inside the rectangle which is included in the object model (hue histogram); that is why the blue bin is filled (Figure 5.3(d)).

5.1.2 Second Experiment Results

For the second video (Figure 5.4), the localization is done with 3 clicks: the first on the airplane body and wing, the second and third on the propellers. Each object part is tracked separately; the result is the whole object rectangle (search window), the maximum rectangle of the object part rectangles. The experiment shows that the object parts and the whole object are tracked successfully. The whole object rectangle is very stable in bounding the airplane; even when there is cloud distraction, the tracker still tracks the object (Figure 5.4(a)). The average frame rate is 1.15 fps. Meanwhile, if we use classic CAMSHIFT, problems occur. CAMSHIFT takes the whole hue range, which in the OpenCV implementation is 0 to 180, and maximum saturation smax = 256. After selecting the whole airplane, the tracking ellipse becomes very large when we set the maximum value vmax to 150 (Figure 5.5, 1st row). With vmax = 100, the tracking ellipse jumps to the island in the background (Figure 5.5, 2nd row). Finally, with vmax = 70, the tracking ellipse covers only the propellers (Figure 5.5, 3rd row). This can be explained using the hue histogram and the back projection image, which is the projection of the hue histogram onto the hue image of the frame sequence. In the case of vmax = 150, the hue histogram is close to the background color characteristics; the background therefore gives very high intensity in the back projection image while the object gives very little, and the tracking ellipse covers almost the whole background. In the case of vmax = 100, some background parts give more intensity than the object, so when the object passes through that background the tracking ellipse jumps to those parts. With vmax = 70, the tracking ellipse tracks the object robustly, but unfortunately only the propellers, not the whole airplane. We also tried different minimum and maximum saturation values, but this did not help improve the tracking.


Figure 5.4 Second video result with our proposed method. (a) Tracked image passing through cloud at frame 290 (b) Airplane body back projection image at frame 10 (c) Left propeller back projection image at frame 10 (d) Right propeller back projection image at frame 10

Figure 5.5 Second video result with classic CAMSHIFT. First column: the frame; second column: the back projection image; third column: the hue histogram with 16 bins. First row uses vmax=150 at frame 95, second row uses vmax=100 at frame 325, third row uses vmax=70 at frame 29.


Extreme                       Values
Minimum hue (hmin)            0
Maximum hue (hmax)            180 (maximum hue in OpenCV implementation)
Minimum saturation (smin)     50
Maximum saturation (smax)     256
Minimum value (vmin)          10
Maximum value (vmax)          150 / 70

Table 5.2 Tuned classic CAMSHIFT parameters for test videos 1 and 2

Extreme                       Values
Minimum hue (hmin)            0
Maximum hue (hmax)            180 (maximum hue in OpenCV implementation)
Minimum saturation (smin)     30
Maximum saturation (smax)     256
Minimum value (vmin)          10
Maximum value (vmax)          70

Table 5.3 Tuned classic CAMSHIFT parameters for test video 3

Figure 5.6 Third video result with the proposed method. (a) Tracked image when the object skewed at frame 639 (b) Yellow body part back projection image at frame 10 (c) Blue trouser back projection image at frame 10 (d) Orange shoes back projection image at frame 10


This is one of the main problems of the classic CAMSHIFT method: when it deals with a multihued object, it often drifts or tracks only some object parts. We also have to specify the parameters manually for different kinds of videos, which is uncomfortable in practice.

5.1.3 Third Experiment Results

In the third video, a problem occurs: the head part is merged with the background by the mean-shift segmentation, so it cannot be selected. This is the problem when, in the localization phase, the object is in front of a background with a color similar to the object. We select the body, the trousers and the shoes. Apart from this localization shortcoming, our proposed method still tracks the object nicely (Figure 5.6). The shoes are tracked until frame 855, where the tracker loses them because they become too small to detect. The trousers are tracked until frame 105: as the object moves far from the camera, its color gets darker, making the hue hard to detect. The body is tracked successfully until the end of the video. The average frame rate is 1.96 fps. If we use the classic CAMSHIFT method, we find some problems: as mentioned for the second video, the multihued object problem appears again here. We ran the same experiment configuration; the best result is shown in Figure 5.7. We did not change the minimum saturation, since varying it did not influence the result much; we use the default value smin = 30. The result shows that as soon as the object passes through a background of similar color, the tracker drifts and starts tracking the background (Figure 5.7).

5.1.4 Fourth Experiment Results

For the fourth video (Figure 5.8), we select one of the football players. The object can be selected with one click. The result shows the object is tracked successfully by the proposed method. In the middle of the sequence, the object is almost fully occluded by an opponent player; nevertheless, the tracker is still able to track the remaining unoccluded part of the object. There is also a distraction from a teammate running toward the object, but the tracker still tracks the object. The average frame rate is 11.77 fps. When we test with classic CAMSHIFT, we tune the parameters to get the best result (Table 5.4): minimum value vmin = 10, minimum saturation smin = 10, maximum saturation smax = 256, and the whole hue range. We vary the maximum value over 150, 100, and 70; vmax = 70 gives the best tracking result. Nevertheless, when there is occlusion, the tracker drifts and tracks the occluder (Figure 5.10).

Figure 5.7 Third video best result with classic CAMSHIFT at frame 300. (a) The frame (b) Back projection image (c) Hue histogram with 16 bins.

Extreme                       Values
Minimum hue (hmin)            0
Maximum hue (hmax)            180 (maximum hue in OpenCV implementation)
Minimum saturation (smin)     10
Maximum saturation (smax)     256
Minimum value (vmin)          10
Maximum value (vmax)          70

Table 5.4 Tuned classic CAMSHIFT parameters for test video 4

Figure 5.8 Object (marked with red rectangle) tracked by the proposed method. (a) The object is almost fully occluded at frame 57 (b) The corresponding back projection image

Figure 5.9 Fourth video best result with classic CAMSHIFT at frame 57. (a) The frame (b) Back projection image (c) Hue histogram with 16 bins.


Figure 5.10 Drifting tracker. (a) The object is tracked at frame 54 (b) The object almost fully occluded at frame 57 (c) The ellipse covers the object and occluder at frame 61 (d) Tracker drifts: it tracks the occluder at frame 62

Figure 5.11 Multiple object tracking using our proposed method. (a) Frame 2 (b) Frame 45

Classic CAMSHIFT cannot do multiple object tracking; with our proposed method, it can (Figure 5.11). The average frame rate is 4.18 fps.


5.2 Discussion

5.2.1 Some Advantages

Our experimental results show that the proposed method improves CAMSHIFT significantly, thanks to the following improvements. First, the object localization is more precise: it prevents the object model from absorbing surrounding background information, so the tracker drifts less, whereas classic CAMSHIFT uses a rectangle which, in many cases, takes surrounding background information into the object model. Some other improvements [7][12] also fail to model the object precisely. We use the default preprocessing mean-shift segmentation parameters proposed by Bradski [5], which work very well on every test video and reduce the need for manual tuning, but we also allow the preprocessing parameters to be tuned so the method is adaptive to the user's needs. The proposed method detects the extreme values of the object automatically, whereas in classic CAMSHIFT the parameters must be tuned manually. Second, the use of the hue-distance histogram with a threshold increases the robustness of CAMSHIFT when the object passes through a background with a color similar to the object: the automatic threshold ensures that only very similar hue pixels are taken into the hue-distance back projection image. In classic CAMSHIFT, using only the hue histogram makes it difficult to track the object in that situation. Third, splitting the tracking problem into smaller ones by tracking the object parts separately increases the robustness of tracking multihued objects. Classic CAMSHIFT and the current CAMSHIFT improvements track the object as a whole, which makes them drift often; this is one of the main advantages of our method in terms of robustness. Fourth, our proposed method can track multiple objects. We have tried tracking 6 objects simultaneously with very good results: all objects were tracked successfully. We did not find any other CAMSHIFT improvement that supports multiple object tracking.

5.2.2 Some Limitations

Our proposed method increases the robustness of CAMSHIFT tracking; nevertheless, it has some limitations.


First, when the object has many hues, the object localization may not be so practical, and performance may be slower because more trackers must be computed. For textured objects, such as a running cheetah, the localization method also suffers. Second, if the object passes through a background with exactly the same color (hue, saturation and value) as the object, the tracker will most likely fail: since we use only color information, a background with exactly the object's color is considered object as well. Third, the tracker cannot re-track lost object parts. Again, this is because we use only color information; re-tracking with color alone would very likely give wrong results, and other features would be needed to model the object. Fourth, speed: our method does not achieve real-time performance due to the separate trackers and the additional tasks that increase robustness. As stated from the beginning, this issue is not our main concern.


Conclusions and Future Works

6.1 Conclusions

We have developed several ways to improve CAMSHIFT robustness. The proposed object localization method improves the robustness of the object model: with it, we can largely avoid taking surrounding background information into the object model. The use of the hue-distance histogram, tracking the dominant-color object parts separately, and the maximum rectangle combining the object part rectangles also help CAMSHIFT track multihued objects against similarly colored backgrounds. With all the experimental results, we have shown that the proposed method significantly improves CAMSHIFT robustness on challenging videos.

6.2 Future Works

Our future work will improve the method's speed using a graphics processing unit (GPU) or parallel programming on multi-core processors so that it can reach real time. Besides that, we propose to improve the method so it can re-track lost object parts, and to improve its ability to track textured objects and objects with many hues; these are important because many real-world applications need such capabilities. One remaining task is to improve the tracker for the condition where the object has exactly the same color as the background, and to apply this tracker to clickable hypervideo.


Bibliography

[1] Bradski, G. R. 1998. Computer Vision Face Tracking for Use in a Perceptual User Interface. Intel Technology Journal, 2(2), 13-27.
[2] Comaniciu, D. and Meer, P. 2002. Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 603-619.
[3] Corrales, J. A., Gil, P., Candelas, F. A., and Torres, F. 2009. Tracking based on Hue-Saturation Features with a Miniaturized Active Vision System. In Proceedings Book of the 40th International Symposium on Robotics, Asociación Española de Robótica y Automatización Tecnologías de la Producción (AER-ATP), Barcelona, Spain, p. 107.
[4] Tian, G., Hu, R., Wang, Z., and Fu, Y. 2009. Improved Object Tracking Algorithm Based on New HSV Color Probability Model. In Proceedings of the 6th International Symposium on Neural Networks: Advances in Neural Networks - Part II, Wuhan, China.
[5] Bradski, G. and Kaehler, A. 2008. Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly Media, Inc.
[6] Intel Corporation. 2001. Open Source Computer Vision Library Reference Manual, 123456-001.
[7] Allen, J. G., Xu, R. Y. D., and Jin, J. S. 2004. Object tracking using CamShift algorithm and multiple quantized feature spaces. In Proceedings of the Pan-Sydney Area Workshop on Visual Information Processing, ACM International Conference Proceeding Series, vol. 100, Australian Computer Society, Darlinghurst, Australia, pp. 3-7.
[8] Ning, J., Zhang, L., Zhang, D., and Wu, C. 2009. Robust Object Tracking using Joint Color-Texture Histogram. International Journal of Pattern Recognition and Artificial Intelligence, 23(7), World Scientific Publishing Company, 1245-1263.
[9] Qiu, X., Liu, S., and Liu, F. 2009. Kernel-based Target Tracking with Multiple Features Fusion. Joint 48th IEEE Conference on Decision and Control and 28th Chinese Control Conference, Shanghai, P.R. China.
[10] Ganoun, A., Ould-Dris, N., and Canals, R. 2006. Tracking System Using CAMSHIFT and Feature Points. 14th European Signal Processing Conference (EUSIPCO 2006), Florence, Italy.
[11] Stolkin, R., Florescu, I., Baron, M., Harrier, C., and Kocherov, B. 2008. Efficient Visual Servoing with the ABCshift Tracking Algorithm. In IEEE International Conference on Robotics and Automation, Pasadena, California, USA, pp. 3219-3224.
[12] Xu, R. Y. D., Allen, J., and Jin, J. S. 2003. Robust real-time tracking of non-rigid objects. Conferences in Research and Practice in Information Technology, VIP'03, Sydney, Australia.
[13] Fukunaga, K. and Hostetler, L. D. 1975. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21, 32-40.
[14] Collins, R. 2007. Lecture 29: Video Tracking: Mean-Shift. CSE/EE486 Computer Vision I, CSE Department, Penn State University. http://www.cse.psu.edu/~rcollins/CSE486/lecture29.pdf (visited June 2010)
[15] Bay, H., Tuytelaars, T., and Van Gool, L. 2006. SURF: Speeded Up Robust Features. In ECCV (1), pp. 404-417.
[16] Heikkilä, M., Pietikäinen, M., and Schmid, C. 2009. Description of interest regions with local binary patterns. Pattern Recognition, 42(3), 425-436.
[17] Yilmaz, A., Javed, O., and Shah, M. 2006. Object tracking: a survey. ACM Computing Surveys, 38(4), 1-45.
[18] Gonzalez, R. C., Woods, R. E., and Eddins, S. L. 2004. Digital Image Processing Using MATLAB, 1st Edition, Dorling Kindersley, USA.
[19] Smith, J. McC. and Stotts, D. 2002. An Extensible Object Tracking Architecture for Hyperlinking in Real-time and Stored Video Streams. Technical Report TR02-017, Department of Computer Science, University of North Carolina at Chapel Hill, USA.
[20] Stalder, S., Grabner, H., and Van Gool, L. 2009. Beyond Semi-Supervised Tracking: Tracking Should Be as Simple as Detection, but not Simpler than Recognition. In Proceedings of the ICCV'09 Workshop on On-line Learning for Computer Vision.
[21] Stalder, S., Grabner, H., and Van Gool, L. Beyond Semi-Supervised Tracking Code. http://www.vision.ee.ethz.ch/boostingTrackers/download.htm (visited February 2010)
[22] Pietikäinen, M. and Zhao, G. 2009. Local Texture Descriptors in Computer Vision. Tutorial at the IEEE International Conference on Computer Vision (ICCV).
[23] Zhao, G. and Pietikäinen, M. C++ implementation of spatio-temporal LBP. http://www.ee.oulu.fi/research/imag/texture/download/STLBP_VC.zip (visited March 2010)
[24] 2009. OpenCV 2.0 library. Code from web. http://sourceforge.net/projects/opencvlibrary/ (visited February 2010)
[25] Bay, H., Tuytelaars, T., and Van Gool, L. 2006. Code from web. http://www.vision.ee.ethz.ch/~surf/download.html (visited February 2010)
[26] Shi, J. and Tomasi, C. 1994. Good features to track. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 593-600.
[27] Bouguet, J.-Y. Pyramidal Implementation of the Lucas Kanade Feature Tracker: Description of the Algorithm. Intel Corporation Microprocessor Research Labs.
[28] Stavens, D. 2007. The OpenCV Library: Computing Optical Flow. Stanford Artificial Intelligence Lab, USA.
[29] Comaniciu, D., Ramesh, V., and Meer, P. 2000. Real-time Tracking of Non-Rigid Objects Using Mean Shift. CVPR.
[30] Heikkilä, M. and Ahonen, T. 2009. Code from web. http://www.ee.oulu.fi/mvg/page/lbp_matlab (visited May 2010)
[31] Code from web. http://opencv.jp/sample/accumulation_of_background.html (visited May 2010)
[32] Zivkovic, Z. 2004. Improved adaptive Gaussian mixture model for background subtraction. Code from web. http://staff.science.uva.nl/~zivkovic/Publications/CvBSLibGMM.zip (visited May 2010)
[33] Stalder, S., Grabner, H., and Van Gool, L. 2009. Video from web. http://www.vision.ee.ethz.ch/boostingTrackers/contactBoosting.html (visited February 2010)
[34] Valenti, R. and Hageloh, F. Video from web. http://student.science.uva.nl/~rvalenti/uva/MIR/movies/soccer.avi (visited April 2010)
[35] Colantoni, P., Boukala, N., and Da Rugna, J. 2003. Fast and Accurate Color Image Processing Using 3D Graphics Cards. VMV 2003, Munich, Germany.
