Abstract This paper presents a new video surveillance system, called KVSS, using background subtraction based on Type-2 Fuzzy Gaussian Mixture Models (T2 FGMMs). This technique is used to resolve some limitations of Gaussian Mixture Model (GMM) techniques in critical situations such as camera jitter, illumination changes, and objects being introduced into or removed from the scene. In this context, we introduce a description of T2 FGMMs and present an experimental validation using a new evaluation video dataset that exhibits these various problems. Results demonstrate the relevance of the proposed method.
Efficient algorithms for human action recognition in video sequences are in high demand in the video surveillance application area. In this work, a method is proposed to extract key frames from videos using a hybrid robust method based on shape orientation and the discrete wavelet transform (DWT) to detect and recognize human actions. Video surveillance is currently one of the most important research topics in computer vision. In recent years, the number of surveillance cameras installed to monitor private and public spaces has increased greatly. Recently deployed tools for automated analysis focus on human behaviour analysis, such as detecting an intruder in a prohibited zone. The work presented here aims at human activity detection and automated behaviour analysis.
In this paper, we introduce a new method for recognizing human activities from a video sequence using a technique based on the orientation of an ellipse. From the images of the person's silhouette, key frames are extracted based on changes of movement. The ellipse is applied to detect the position and orientation of the person. Then, a Histogram of Oriented Gradients (HOG) descriptor and silhouette orientation are used for the extraction of features from the key frames.
Slim Abdelhedi
REGIM-Lab., ENIS, Route Soukra km 3, BP 1173, Sfax 3038 Tunisia.
Tel.: +216-98-434891
E-mail: slim.abdelhedi.tn@ieee.org
1 Introduction
In video surveillance [xxx], the first objective is to detect and localize moving
object in the scene. The principal objective of this operation, called Back-
ground Subtraction, is to separate moving object (Foreground) from the static
information (Background). For this reason, background subtraction techniques
[xxxx] has received considerable attention from many researchers during the
last decades. In the related work, several background modeling approach have
been developed and the recent surveys can be found in [xxxx]. Therefore,
Gaussian mixture models (GMMs) have been applied to the field of video
surveillance particularly in dynamic object detection [2222]. In this paper, we
propose to model the background by using a Type-2 Fuzzy Gaussians Mix-
ture Model (T2-FGMM) developed by Zeng et al. [xxxxx]. Instead of using
a physical fence, the system uses virtual fences positioned within the camera
image. For the surveillance of a large area, one or more cameras are installed.
Thermal cameras are less influenced by light and weather changes and are
Title Suppressed Due to Excessive Length 3
used for extra robustness. This makes the system fully suitable for usage in
dark and bad weather conditions. A new video dataset is used to evaluate the
robustness of our system using T2 FGMM [xxxx] method against the critical
situations like inserted or moved background objects which have different spa-
tial and temporal characteristics which must be take into account to obtain a
good results.
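As a point of reference for the background subtraction discussed above, the following sketch implements a minimal running-average background model with absolute-difference thresholding. This is a deliberately simple baseline, not the T2-FGMM method of this paper; the array sizes, threshold, and learning rate are illustrative assumptions.

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Running-average background update: bg <- (1-alpha)*bg + alpha*frame."""
    return (1.0 - alpha) * bg + alpha * frame

def foreground_mask(bg, frame, thresh=30.0):
    """Pixels whose absolute difference from the background model exceeds
    the threshold are labelled foreground (True)."""
    return np.abs(frame.astype(float) - bg) > thresh

# Toy 4x4 grey-level scene: static background at 100, one moving pixel at 200.
bg = np.full((4, 4), 100.0)
frame = bg.copy()
frame[1, 2] = 200.0                 # a moving object appears
mask = foreground_mask(bg, frame)   # only (1, 2) is flagged as foreground
bg = update_background(bg, frame)   # background slowly absorbs the change
```

Such a model fails exactly in the critical situations listed above (jitter, illumination changes, inserted objects), which motivates the richer T2-FGMM formulation.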
Vision-based recognition of human actions is the process of labeling image sequences with action labels. Human action recognition has many applications in various fields, including motion capture, medical analysis and biomechanics, ergonomics analysis, human-machine interaction, monitoring, and security [1]. The purpose of a human activity recognition system is to identify simple actions of everyday life (walking, running, jumping, etc.) from videos, each of these actions being carried out by one person in a specific period of time. This area has become very popular, and moreover it is considered a challenge in the field of computer vision [2]. To solve this problem, several solutions have been proposed in the areas already mentioned. A further difficulty is the variety of activities, which makes the different actions hard to recognize.
In this paper, we present a single, simple representation for an approach to human activity recognition based on the orientation of an ellipse: human detection is done with an ellipse that follows the person's movement. Other work may be similar to our approach, but it uses dynamic time warping (DTW), which computes the distance between a template and a key frame [3]. Other authors have proposed extracting key frames by sampling, which is the most trivial method: a selection of key images is made from the images of the original sequence, either randomly or uniformly according to certain time intervals [4]. There are several other methods for extracting key frames, among them a method based on the idea that the greater the movement, the more important the image is for the summarization. Another method is based on an improved histogram-adjustment technique for video segmentation, after which extraction is done using the characteristics of the I, B, and P frames of each sub-shot [5].
In [6], the authors propose an innovative approach to selecting the key frames of a video sequence for video summarization. The selection is made by comparing the difference between two successive frames of the sequence; in fact, the algorithm determines the complexity of the sequence in terms of changes in the visual content. Another paper [7] uses a visual attention model that produces saliency maps. The bright areas of these maps are the regions that attract the human eye. Changes in the images are then detected from these maps, and key frames are selected according to an adaptive threshold. Grouping frames after feature extraction is yet another method of extracting key frames.
In [8], a new key-frame extraction method is proposed that is not only based on this idea but also reduces redundant frames by combining local and global information. Chang et al. [9] proposed an approach that selects key frames such that the distance between a frame and the key frame is below a certain threshold. The rest of the paper is organized as follows. Section 2 details our proposed approach. Section 3 illustrates the feature extraction. The classification is detailed in Section 4, the experiments are presented in Section 5, and we conclude our work in Section 6.
Visual recognition and understanding of human actions have attracted much attention over the last years, e.g. [ref], and remain an important research area of computer vision. In simple terms, the objective of human action recognition is to correctly classify a video into its action category, where the video is trimmed to contain only one instance of human movement. In the general case, the purpose of human action recognition is to perform continuous recognition of every human action appearing in the input video from start to end. The human action recognition task is significant in several applications.
For example, a parking or airport surveillance system can automatically recognize suspicious human activities, such as people suddenly running in panic or a person waving his or her arms with swords in hand. Human activity and behavior analysis is also helpful for the real-time monitoring of patients, children, and elderly persons. A high-performance action recognition system can make the construction of vision-based intelligent environments and gesture-based human-computer interfaces feasible.
Our aim is to find algorithms that can robustly handle the variability of features within the same action class. A thorough human action recognition framework should be able to tolerate variations within one class and distinguish actions of different classes. As the number of action classes increases, this becomes even more challenging, since the overlap between classes grows. Various spatio-temporal descriptors have been described over the past years, most of them using some motion representation, because motion carries the main characteristics that describe the semantic information of video sequences.
The histogram of oriented gradients (HOG) and the histogram of optical flow (HOF) are examples of motion representations. To improve the action recognition rate, optical flow is combined with other features such as HoG (Wang et al., 2011; Laptev et al., 2008). This work is motivated by the possibility of combining the Histogram of Structure Tensor (HoST) descriptor presented in [ref: walha] with other global features.
Our work presents the combination of a model of optical flow vector fields, which gives a consistent global motion descriptor, with the Histogram of Structure Tensor, which represents a spatial descriptor. Our descriptor, ST-HoST, is obtained using the parameters of a polynomial model for each frame of a video. The coefficients are found by projecting the optical flow onto Legendre polynomials, reducing the dimension of the motion estimate per frame. The sequence of coefficients is then combined using orientation tensors.
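The projection of per-frame motion onto Legendre polynomials described above can be sketched in a simplified 1-D form. The function name, the flattening of the flow component, and the toy linear flow are assumptions for illustration, not the actual ST-HoST pipeline.

```python
import numpy as np

def legendre_descriptor(flow_component, degree=4):
    """Project one optical-flow component (flattened to 1-D) onto Legendre
    polynomials, keeping only degree+1 coefficients per frame.
    A simplified 1-D sketch of the polynomial motion model in the text."""
    y = np.asarray(flow_component, dtype=float).ravel()
    x = np.linspace(-1.0, 1.0, y.size)  # Legendre polynomials live on [-1, 1]
    series = np.polynomial.legendre.Legendre.fit(x, y, degree, domain=[-1, 1])
    return series.coef                  # compact per-frame motion code

# Toy 1-D slice of a flow field whose magnitude grows linearly across the
# image: y = x is exactly the first Legendre polynomial P1.
u = np.linspace(-1.0, 1.0, 64)
coef = legendre_descriptor(u, degree=3)
```

Whatever the image resolution, each frame is thus reduced to a fixed-length coefficient vector, which is what makes the subsequent combination with orientation tensors tractable.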
The contributions of this paper can be summarized as follows: 1. A new system architecture for the detection and recognition of human actions in video sequences. 2. A new local spatio-temporal descriptor, called ST-HoST, based on the Histogram of Structure Tensor combined with a polynomial motion model.
2 Related work
The recognition of human activities from video sequences is one of the most important applications of computer vision. In the past three years, this task has attracted the attention of researchers from academia, security agencies, and industry. According to the features used for recognizing human actions, the methods in related work can be mainly classified into two categories: global feature-based methods and local feature-based methods.
Global feature-based methods encode the visual observation as a whole, and the region of interest (ROI) is usually detected through background subtraction or tracking using optical flow. Several global features are obtained from edges, silhouettes [24], or optical flow [22]. These representations are powerful since they encode much of the information. The work of Bobick and Davis [22], [23] is one of the earliest methods utilizing silhouettes. They extracted silhouettes from a single view and accumulated the differences between subsequent frames of a motion clip into a binary motion energy image (MEI) and a motion history image (MHI). Instead of silhouettes, the observation within the ROI can also be described with motion information such as optical flow [23], [25], [26], [27]. Flow methods are usually employed when background subtraction cannot be performed. Efros et al. [23] calculated optical flow in person-centered video frames of sports footage, where the persons in the frame were very small.
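The MHI accumulation of Bobick and Davis mentioned above admits a compact sketch. The decay-by-one rule and the duration parameter tau follow the standard formulation; the toy motion masks are illustrative.

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau=10):
    """Motion History Image update (Bobick & Davis style): pixels that moved
    in this frame are set to tau, all others decay by 1 toward 0."""
    return np.where(motion_mask, float(tau), np.maximum(mhi - 1.0, 0.0))

# Two frames of a tiny 3x3 clip: motion at (0,0), then at (0,1).
mhi = np.zeros((3, 3))
step1 = np.zeros((3, 3), dtype=bool); step1[0, 0] = True
step2 = np.zeros((3, 3), dtype=bool); step2[0, 1] = True
mhi = update_mhi(mhi, step1)   # (0,0) is set to tau
mhi = update_mhi(mhi, step2)   # (0,1) set to tau, (0,0) decays by one
mei = mhi > 0                  # the MEI is simply the support of the MHI
```

The MHI thus encodes both where and how recently motion occurred, while the binary MEI records only where it occurred.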
There are also many works combining flow and shape descriptors, which can overcome the limitations of a single representation. Rectangular grids of silhouettes and flow were employed in [28], and in each cell a circular grid was used to accumulate the responses. Ikizler et al. [29] combined the work of Efros et al. [23] with histograms of oriented line segments.
In addition, the spatio-temporal volume (STV) approach is also a global representation, despite the fact that it shares many similarities with local approaches. Ke et al. [24] constructed the STV of flow and sampled volumetric features from it.
The detected local features preserve some rotation invariance, which is robust for the subsequent action recognition. Dollar et al. [32] presented a detector that obtains abundant interest points by employing a series of spatio-temporal filters. Oikonomopoulos et al. [33] extended the salient point detector [34] by taking advantage of the entropy of space-time regions. After the detection of interest points or salient regions, many methods have been proposed to describe them. Scovanner et al. [35] extended the SIFT descriptor to 3D-SIFT, using spatio-temporal information encoded in a sub-histogram. Laptev et al. [36] applied histograms of gradient orientations (HoG) and histograms of optical flow (HoF) to describe the local motion information in space-time cubes around the detected interest points, and Klaser et al. [37] extended HoG to the 3D case.
After obtaining the local descriptors of interest points and salient regions, a codebook is typically constructed from them.
where:
– N is the number of Gaussian distributions in the mixture,
– \omega_i is the mixing weight, with \sum_{i=1}^{N} \omega_i = 1 and \omega_i \geq 0.

The GMMs are extended to T2 FGMMs with uncertain mean (T2 FGMM-UM) and uncertain variance (T2 FGMM-UV). For the T2 FGMM-UM, the multivariate Gaussian with an uncertain mean vector is:

\eta(\theta; \tilde{\mu}, \Sigma) = \frac{1}{(2\pi)^{3/2}\,|\Sigma|^{1/2}}\; e^{-\frac{1}{2}\left(\frac{\theta_1 - \tilde{\mu}_1}{\sigma_1}\right)^2} \cdots\, e^{-\frac{1}{2}\left(\frac{\theta_n - \tilde{\mu}_n}{\sigma_n}\right)^2} \quad (3)
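For a crisp (type-1) instantiation of the Gaussian in Eq. (3), the factorized diagonal-covariance density can be evaluated as below. This sketch does not model the interval uncertainty of the T2 FGMM-UM mean; that limitation is flagged in the comment.

```python
import math

def gaussian_eta(theta, mu, sigma):
    """Diagonal-covariance Gaussian of Eq. (3) with a crisp (type-1) mean:
    eta = (2*pi)^(-n/2) * |Sigma|^(-1/2) * prod_i exp(-0.5*((theta_i-mu_i)/sigma_i)^2).
    For n = 3 (e.g. RGB pixels) the (2*pi)^(3/2) factor of Eq. (3) is recovered.
    In the T2 FGMM-UM, mu_i would instead range over an interval around the
    mean; here we evaluate a single instantiation of that interval."""
    n = len(theta)
    det = 1.0
    for s in sigma:
        det *= s * s                      # |Sigma| of a diagonal covariance
    norm = (2.0 * math.pi) ** (-n / 2.0) * det ** -0.5
    expo = sum(((t - m) / s) ** 2 for t, m, s in zip(theta, mu, sigma))
    return norm * math.exp(-0.5 * expo)

# A 3-D (e.g. RGB) pixel evaluated exactly at the mixture component's mean:
val = gaussian_eta([100.0, 100.0, 100.0], [100.0, 100.0, 100.0], [5.0, 5.0, 5.0])
```

Evaluating the same pixel at shifted means traces out the envelope that the type-2 membership function bounds.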
4 Proposed methodology
First of all, to detect human actions, we applied a Kalman filter, which provides a fundamental framework for motion estimation; indeed, it is a recursive estimator, meaning that the previous state estimate and the current measurements suffice to estimate the current state. Figure (1) shows how the detection of the individual's movement is performed on a short video: the segmented image of the person's silhouette is obtained by separating the foreground (in white) from the background (in black), so that only the most significant objects of interest are kept.
To extract key frames, our idea is to select, from the images of the person's silhouette, key frames based on the orientation of the ellipse. The ellipse detects the person's movement, and key frames are selected according to this movement. We applied the ellipse to each frame of the person's silhouette to obtain the centroid and the orientation angle of each frame of the video, i.e., to detect its movement, as shown in Figure (2). For example, in an X-Y plane, consider a region M divided into m parts; the centroid position of these parts is given as follows:

(\bar{x}, \bar{y}) = \left(\frac{1}{m}\sum_{i=1}^{m} x_i,\; \frac{1}{m}\sum_{i=1}^{m} y_i\right)

where x_i and y_i are the coordinates of the individual segment. For the selection of key frames, it is assumed that the total number of images in the video is n, and that k(n) and v(n) are respectively the horizontal and vertical lengths of the shape; a ratio s is defined as s = k(n)/v(n). The key images are then selected by taking only the ones that contain the most information, i.e., the maximum s in each sequence: key_frame = argmax(s).
After extracting the key frames from the silhouette frames of the video sequence, preprocessing is applied to extract features. Orientation is the main feature extracted from each key image for a well-defined action.
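A minimal sketch of the key-frame rule above (ratio s of horizontal to vertical extent, then argmax over frames). The bounding-box computation and the toy silhouettes are assumptions for illustration.

```python
def aspect_ratio(mask):
    """Ratio s = k(n)/v(n): horizontal over vertical extent of the
    silhouette's bounding box in a binary mask (list of lists of 0/1)."""
    rows = [r for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    cols = [c for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    k = max(cols) - min(cols) + 1   # horizontal length k(n)
    v = max(rows) - min(rows) + 1   # vertical length v(n)
    return k / v

def key_frame(masks):
    """Index of the frame maximizing s (the argmax(s) rule in the text)."""
    return max(range(len(masks)), key=lambda i: aspect_ratio(masks[i]))

# Three toy binary silhouettes: the second is widest relative to its height.
f0 = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]   # upright:  s = 1/3
f1 = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]   # stretched: s = 3
f2 = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]   # square:    s = 1
idx = key_frame([f0, f1, f2])            # selects the stretched silhouette
```

Intuitively, frames where the silhouette spreads horizontally (arms or legs extended) carry the most information about the action, which is what the maximum-s rule captures.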
6 Experimental results
6.1 Datasets
To compute the optical flow between two images, the following optical flow constraint equation must be solved:

I_x u + I_y v + I_t = 0

where I_x, I_y, and I_t are the spatio-temporal image brightness derivatives.
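One standard way to solve this under-determined constraint is the Lucas-Kanade least-squares formulation over a small window. The sketch below assumes precomputed derivative arrays and is not necessarily the exact estimator used in the experiments.

```python
import numpy as np

def lucas_kanade(Ix, Iy, It):
    """Solve Ix*u + Iy*v + It = 0 in least squares over a window:
    [u, v] = argmin ||A [u, v]^T - b||^2 with A = [Ix Iy], b = -It."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    uv, *_ = np.linalg.lstsq(A, b, rcond=None)
    return uv  # one (u, v) estimate for the whole window

# Toy 2x2 window whose derivatives are consistent with a pure horizontal
# motion of u = 2, v = 0, i.e. It = -(Ix*u + Iy*v).
Ix = np.array([[1.0, 2.0], [0.5, 1.5]])
Iy = np.array([[0.3, -0.2], [0.1, 0.4]])
It = -(Ix * 2.0)
uv = lucas_kanade(Ix, Iy, It)
```

A single pixel gives one equation in two unknowns; pooling the window's equations is what makes (u, v) recoverable, provided the gradients in the window are not all collinear.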
Finding the right topology of an artificial neural network (ANN) for a specific application is a challenging task, and only through several simulations of different ANN topologies, varying the number of neurons per layer and the number of hidden layers, can the optimal one be found.
In our case, after several tries, we obtained good results using a two-layer feed-forward network with a sigmoid activation function on both the hidden and output layers. The training function adopted was trainlm, which updates the weight and bias values of the neural network according to Levenberg-Marquardt optimization.
As the input data was formed from 90 sets of data, 9 vectors for each human action of the video dataset, the number of neurons in the input layer was set to 20. Considering that there are 10 human actions to be recognized, ten neurons were placed on the output layer.
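The inference pass of the 20-10-10 sigmoid network described above can be sketched as follows. The random weights stand in for trained parameters, and the Levenberg-Marquardt (trainlm) training itself is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, b1, W2, b2):
    """Two-layer feed-forward pass: 20 inputs -> 10 sigmoid hidden units
    -> 10 sigmoid output units (one per action class)."""
    h = sigmoid(W1 @ x + b1)
    return sigmoid(W2 @ h + b2)

# Untrained stand-in parameters for the 20-10-10 architecture.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((10, 20)), np.zeros(10)
W2, b2 = rng.standard_normal((10, 10)), np.zeros(10)

x = rng.standard_normal(20)           # one 20-dimensional feature vector
y = forward(x, W1, b1, W2, b2)        # 10 output activations in (0, 1)
predicted_action = int(np.argmax(y))  # index of the recognized action
```

At test time the winning output neuron directly indexes the recognized action, matching the one-to-n output coding used below.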
Different numbers of neurons were tried for the hidden layer. We finally chose a feed-forward backpropagation (FF-BP) architecture with:
- 20 neurons on the input layer
- 10 neurons on the hidden layer
- 10 neurons on the output layer
- Levenberg-Marquardt as the training algorithm
- MSE as the performance function
For testing purposes we chose data from the training set; the obtained recognition rate was 100%.
We also designed and tested a feed-forward backpropagation network with two layers, having 10 neurons on the hidden layer and 5 neurons on the output layer. The recognition rate was 99.96%. The simulation result is presented in Fig. 5.
A new dataset was acquired, which includes the ten human actions presented in Fig. 6. The dataset was split into 3 subsets for training, validation, and testing.
The output layer has 10 neurons because we adopted one-to-n coding for the output neurons to indicate the recognized human action, and there are 10 human actions to be recognized, as follows:
'bend', 'jack', 'jump', 'pjump', 'run', 'side', 'skip', 'walk', 'wave1', 'wave2'
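The one-to-n output coding can be sketched directly from the action list; the helper names are illustrative.

```python
ACTIONS = ['bend', 'jack', 'jump', 'pjump', 'run',
           'side', 'skip', 'walk', 'wave1', 'wave2']

def one_hot(action):
    """One-to-n coding: the target vector has a 1 at the action's index
    and 0 everywhere else."""
    vec = [0] * len(ACTIONS)
    vec[ACTIONS.index(action)] = 1
    return vec

def decode(vec):
    """Recover the action label from the output neuron with the largest value."""
    return ACTIONS[max(range(len(vec)), key=lambda i: vec[i])]

code = one_hot('run')   # target vector for the 'run' class
```

Since the network outputs real values in (0, 1), `decode` applies the same argmax rule to raw activations as to the clean target vectors.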
After several tests we found that 10 neurons in the hidden layer seem to be enough for a good recognition rate. The resulting neural network architecture for activity recognition is shown in Fig. 7. The testing sets were chosen from the training set and were set up from 10 times 90 consecutive vectors representing each activity to be recognized. The recognitions and errors are displayed in Fig. 8. The total number of errors was only 456, meaning a 99.088% recognition rate.
9 Experimental results
10 Conclusion
We tested our method on the well-known Weizmann and KTH datasets; our method shows effective results compared with other methods.
References