Abstract This paper presents a new video surveillance system, called KVSS, using background subtraction based on Type-2 Fuzzy Gaussian Mixture Models (T2 FGMMs). This technique is used to resolve some limitations of Gaussian Mixture Model (GMM) techniques in critical situations such as camera jitter, illumination changes, and objects being introduced into or removed from the scene. In this context, we introduce a description of T2 FGMMs and present an experimental validation using a new evaluation video dataset that exhibits these various problems. Results demonstrate the relevance of the proposed method.
Efficient algorithms for human action recognition in video sequences are in high demand in the video surveillance application area. In this work, a method is proposed to extract key frames from videos using a hybrid robust method based on shape orientation and the discrete wavelet transform (DWT) to detect and recognize human actions. Video surveillance is currently one of the most important research topics in computer vision. In recent years, the number of surveillance cameras installed to monitor private and public spaces has increased greatly. Recently deployed tools for automated analysis focus on human behaviour analysis, such as detecting an intruder in a prohibited zone. The work presented here aims at human activity detection and automated behaviour analysis.
In this paper, we introduce a new method for recognizing human activities from a video sequence using a technique based on the orientation of an ellipse. From the images of the person's silhouette, key frames are extracted based on changes of movement. The ellipse is applied to detect the position and orientation of the person. Then, a Histogram of Oriented Gradients (HOG) descriptor and silhouette orientation are used for the extraction of features from the key frames.
Slim Abdelhedi
REGIM-Lab., ENIS, Route Soukra km 3, BP 1173, Sfax 3038 Tunisia.
Tel.: +216-98-434891
E-mail: slim.abdelhedi.tn@ieee.org
1 Introduction
In video surveillance [xxx], the first objective is to detect and localize moving
object in the scene. The principal objective of this operation, called Back-
ground Subtraction, is to separate moving object (Foreground) from the static
information (Background). For this reason, background subtraction techniques
[xxxx] has received considerable attention from many researchers during the
last decades. In the related work, several background modeling approach have
been developed and the recent surveys can be found in [xxxx]. Therefore,
Gaussian mixture models (GMMs) have been applied to the field of video
surveillance particularly in dynamic object detection [2222]. In this paper, we
propose to model the background by using a Type-2 Fuzzy Gaussians Mix-
ture Model (T2-FGMM) developed by Zeng et al. [xxxxx]. Instead of using
a physical fence, the system uses virtual fences positioned within the camera
image. For the surveillance of a large area, one or more cameras are installed.
Thermal cameras are less influenced by light and weather changes and are
Title Suppressed Due to Excessive Length 3
used for extra robustness. This makes the system fully suitable for usage in
dark and bad weather conditions. A new video dataset is used to evaluate the
robustness of our system using T2 FGMM [xxxx] method against the critical
situations like inserted or moved background objects which have different spa-
tial and temporal characteristics which must be take into account to obtain a
good results.
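As a point of reference for the background subtraction discussed above, the following sketch implements a minimal running-average background model with absolute-difference thresholding. This is a deliberately simple baseline, not the T2-FGMM method of this paper; the array sizes, threshold, and learning rate are illustrative assumptions.

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Running-average background update: bg <- (1-alpha)*bg + alpha*frame."""
    return (1.0 - alpha) * bg + alpha * frame

def foreground_mask(bg, frame, thresh=30.0):
    """Pixels whose absolute difference from the background model exceeds
    the threshold are labelled foreground (True)."""
    return np.abs(frame.astype(float) - bg) > thresh

# Toy 4x4 grey-level scene: static background at 100, one moving pixel at 200.
bg = np.full((4, 4), 100.0)
frame = bg.copy()
frame[1, 2] = 200.0                 # a moving object appears
mask = foreground_mask(bg, frame)   # only (1, 2) is flagged as foreground
bg = update_background(bg, frame)   # background slowly absorbs the change
```

Such a model fails exactly in the critical situations listed above (jitter, illumination changes, inserted objects), which motivates the richer T2-FGMM formulation.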
Vision-based recognition of human actions is the process of labeling image sequences with action labels. Human action recognition has many applications in various fields, including motion capture, medical analysis and biomechanics, ergonomics analysis, human-machine interaction, monitoring, and security [1]. The purpose of a human activity recognition system is to identify simple actions of everyday life (walking, running, jumping, etc.) from videos, each of these actions being carried out by one person in a specific period of time. This area has become very popular, and moreover it is considered a challenge in the field of computer vision [2]. To solve this problem, several solutions have been proposed in the areas already mentioned. A further difficulty is the variety of activities, which makes the different actions hard to recognize.
In this paper, we present a single, simple representation for an approach to human activity recognition based on the orientation of an ellipse: human detection is done with an ellipse that follows the person's movement. Other work may be similar to our approach, but it uses dynamic time warping (DTW), which computes the distance between a template and a key frame [3]. Other authors have proposed extracting key frames by sampling, which is the most trivial method: a selection of key images is made from the images of the original sequence, either randomly or uniformly according to certain time intervals [4]. There are several other methods for extracting key frames, among them a method based on the idea that the greater the movement, the more important the image is for the summarization. Another method is based on an improved histogram-adjustment technique for video segmentation, after which extraction is done using the characteristics of the I, B, and P frames of each sub-shot [5].
In [6], the authors propose an innovative approach to selecting the key frames of a video sequence for video summarization. The selection is made by comparing the difference between two successive frames of the sequence; in fact, the algorithm determines the complexity of the sequence in terms of changes in the visual content. Another paper [7] uses a visual attention model that produces saliency maps. The bright areas of these maps are the regions that attract the human eye. Changes in the images are then detected from these maps, and key frames are selected according to an adaptive threshold. Grouping frames after feature extraction is yet another method of extracting key frames.
In [8], a new key-frame extraction method is proposed that is not only based on this idea but also reduces redundant frames by combining local and global information. Chang et al. [9] proposed an approach that selects key frames such that the distance between a frame and the key frame is below a certain threshold. The rest of the paper is organized as follows. Section 2 details our proposed approach. Section 3 illustrates the feature extraction. The classification is detailed in Section 4, the experiments are presented in Section 5, and we conclude our work in Section 6.
Visual recognition and understanding of human actions have attracted much attention over the last years, e.g. [ref], and remain an important research area of computer vision. In simple terms, the objective of human action recognition is to correctly classify a video into its action category, where the video is trimmed to contain only one instance of human movement. In the general case, the purpose of human action recognition is to perform continuous recognition of every human action appearing in the input video from start to end. The human action recognition task is significant in several applications.
For example, a parking or airport surveillance system can automatically recognize suspicious human activities, such as people suddenly running in panic or a person waving his or her arms with swords in hand. Human activity and behavior analysis is also helpful for the real-time monitoring of patients, children, and elderly persons. A high-performance action recognition system can make the construction of vision-based intelligent environments and gesture-based human-computer interfaces feasible.
Our aim is to find algorithms that can robustly handle the variability of features within the same action class. A thorough human action recognition framework should be able to tolerate variations within one class and distinguish actions of different classes. As the number of action classes increases, this becomes even more challenging, since the overlap between classes grows. Various spatio-temporal descriptors have been described over the past years, most of them using some motion representation, because motion carries the main characteristics that describe the semantic information of video sequences.
The histogram of oriented gradients (HOG) and the histogram of optical flow (HOF) are examples of motion representations. To improve the action recognition rate, optical flow is combined with other features such as HoG (Wang et al., 2011; Laptev et al., 2008). This work is motivated by the possibility of combining the Histogram of Structure Tensor (HoST) descriptor presented in [ref: walha] with other global features.
Our work presents the combination of a model of optical flow vector fields, which gives a consistent global motion descriptor, with the Histogram of Structure Tensor, which represents a spatial descriptor. Our descriptor, ST-HoST, is obtained using the parameters of a polynomial model for each frame of a video. The coefficients are found by projecting the optical flow onto Legendre polynomials, reducing the dimension of the motion estimate per frame. The sequence of coefficients is then combined using orientation tensors.
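The projection of per-frame motion onto Legendre polynomials described above can be sketched in a simplified 1-D form. The function name, the flattening of the flow component, and the toy linear flow are assumptions for illustration, not the actual ST-HoST pipeline.

```python
import numpy as np

def legendre_descriptor(flow_component, degree=4):
    """Project one optical-flow component (flattened to 1-D) onto Legendre
    polynomials, keeping only degree+1 coefficients per frame.
    A simplified 1-D sketch of the polynomial motion model in the text."""
    y = np.asarray(flow_component, dtype=float).ravel()
    x = np.linspace(-1.0, 1.0, y.size)  # Legendre polynomials live on [-1, 1]
    series = np.polynomial.legendre.Legendre.fit(x, y, degree, domain=[-1, 1])
    return series.coef                  # compact per-frame motion code

# Toy 1-D slice of a flow field whose magnitude grows linearly across the
# image: y = x is exactly the first Legendre polynomial P1.
u = np.linspace(-1.0, 1.0, 64)
coef = legendre_descriptor(u, degree=3)
```

Whatever the image resolution, each frame is thus reduced to a fixed-length coefficient vector, which is what makes the subsequent combination with orientation tensors tractable.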
The contributions of this paper can be summarized as follows: 1. A new system architecture for the detection and recognition of human actions in video sequences. 2. A new local spatio-temporal descriptor, called ST-HoST, based on the Histogram of Structure Tensor combined with a polynomial motion model.
2 Related work
The recognition of human activities from video sequences is one of the most important applications of computer vision. In the past three years, this task has attracted the attention of researchers from academia, security agencies, and industry. According to the features used for recognizing human actions, the methods in related work can be mainly classified into two categories: global feature-based methods and local feature-based methods.
Global feature-based methods encode the visual observation as a whole, and the region of interest (ROI) is usually detected through background subtraction or tracking using optical flow. Several global features are obtained from edges, silhouettes [24], or optical flow [22]. These representations are powerful since they encode much of the information. The work of Bobick and Davis [22], [23] is one of the earliest methods utilizing silhouettes. They extracted silhouettes from a single view and accumulated the differences between subsequent frames of a motion clip into a binary motion energy image (MEI) and a motion history image (MHI). Instead of silhouettes, the observation within the ROI can also be described with motion information such as optical flow [23], [25], [26], [27]. Flow methods are usually employed when background subtraction cannot be performed. Efros et al. [23] calculated optical flow in person-centered video frames of sports footage, where the persons in the frame were very small.
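The MHI accumulation of Bobick and Davis mentioned above admits a compact sketch. The decay-by-one rule and the duration parameter tau follow the standard formulation; the toy motion masks are illustrative.

```python
import numpy as np

def update_mhi(mhi, motion_mask, tau=10):
    """Motion History Image update (Bobick & Davis style): pixels that moved
    in this frame are set to tau, all others decay by 1 toward 0."""
    return np.where(motion_mask, float(tau), np.maximum(mhi - 1.0, 0.0))

# Two frames of a tiny 3x3 clip: motion at (0,0), then at (0,1).
mhi = np.zeros((3, 3))
step1 = np.zeros((3, 3), dtype=bool); step1[0, 0] = True
step2 = np.zeros((3, 3), dtype=bool); step2[0, 1] = True
mhi = update_mhi(mhi, step1)   # (0,0) is set to tau
mhi = update_mhi(mhi, step2)   # (0,1) set to tau, (0,0) decays by one
mei = mhi > 0                  # the MEI is simply the support of the MHI
```

The MHI thus encodes both where and how recently motion occurred, while the binary MEI records only where it occurred.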
There are also many works combining flow and shape descriptors, which can overcome the limitations of a single representation. Rectangular grids of silhouettes and flow were employed in [28], and in each cell a circular grid was used to accumulate the responses. Ikizler et al. [29] combined the work of Efros et al. [23] with histograms of oriented line segments.
In addition, the spatio-temporal volume (STV) approach is also a global representation, despite the fact that it shares many similarities with local approaches. Ke et al. [24] constructed the STV of flow and sampled volumetric features from it.
The detected local features preserve some rotation invariance, which is robust for the subsequent action recognition. Dollar et al. [32] presented a detector that obtains abundant interest points by employing a series of spatio-temporal filters. Oikonomopoulos et al. [33] extended the salient point detector [34] by taking advantage of the entropy of space-time regions. After the detection of interest points or salient regions, many methods have been proposed to describe them. Scovanner et al. [35] extended the SIFT descriptor to 3D-SIFT, using spatio-temporal information encoded in a sub-histogram. Laptev et al. [36] applied histograms of gradient orientations (HoG) and histograms of optical flow (HoF) to describe the local motion information in space-time cubes around the detected interest points, and Klaser et al. [37] extended HoG to the 3D case.
After obtaining the local descriptors of interest points and salient regions, a codebook is typically constructed from them.
where:
– N is the number of Gaussian distributions in the mixture,
– \omega_i is the mixing weight, with \sum_{i=1}^{N} \omega_i = 1 and \omega_i \geq 0.

The GMMs are extended to T2 FGMMs with uncertain mean (T2 FGMM-UM) and uncertain variance (T2 FGMM-UV). For the T2 FGMM-UM, the multivariate Gaussian with an uncertain mean vector is:

\eta(\theta; \tilde{\mu}, \Sigma) = \frac{1}{(2\pi)^{3/2}\,|\Sigma|^{1/2}}\; e^{-\frac{1}{2}\left(\frac{\theta_1 - \tilde{\mu}_1}{\sigma_1}\right)^2} \cdots\, e^{-\frac{1}{2}\left(\frac{\theta_n - \tilde{\mu}_n}{\sigma_n}\right)^2} \quad (3)
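For a crisp (type-1) instantiation of the Gaussian in Eq. (3), the factorized diagonal-covariance density can be evaluated as below. This sketch does not model the interval uncertainty of the T2 FGMM-UM mean; that limitation is flagged in the comment.

```python
import math

def gaussian_eta(theta, mu, sigma):
    """Diagonal-covariance Gaussian of Eq. (3) with a crisp (type-1) mean:
    eta = (2*pi)^(-n/2) * |Sigma|^(-1/2) * prod_i exp(-0.5*((theta_i-mu_i)/sigma_i)^2).
    For n = 3 (e.g. RGB pixels) the (2*pi)^(3/2) factor of Eq. (3) is recovered.
    In the T2 FGMM-UM, mu_i would instead range over an interval around the
    mean; here we evaluate a single instantiation of that interval."""
    n = len(theta)
    det = 1.0
    for s in sigma:
        det *= s * s                      # |Sigma| of a diagonal covariance
    norm = (2.0 * math.pi) ** (-n / 2.0) * det ** -0.5
    expo = sum(((t - m) / s) ** 2 for t, m, s in zip(theta, mu, sigma))
    return norm * math.exp(-0.5 * expo)

# A 3-D (e.g. RGB) pixel evaluated exactly at the mixture component's mean:
val = gaussian_eta([100.0, 100.0, 100.0], [100.0, 100.0, 100.0], [5.0, 5.0, 5.0])
```

Evaluating the same pixel at shifted means traces out the envelope that the type-2 membership function bounds.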
4 Proposed methodology
First of all, to detect human actions, we applied a Kalman filter, which provides a fundamental framework for motion estimation; indeed, it is a recursive estimator, meaning that the previous state estimate and the current measurements suffice to estimate the current state. Figure (1) shows how the detection of the individual's movement is performed on a short video: the segmented image of the person's silhouette is obtained by separating the foreground (in white) from the background (in black), so that only the most significant objects of interest are kept.
To extract key frames, our idea is to select, from the images of the person's silhouette, key frames based on the orientation of the ellipse. The ellipse detects the person's movement, and key frames are selected according to this movement. We applied the ellipse to each frame of the person's silhouette to obtain the centroid and the orientation angle of each frame of the video, i.e., to detect its movement, as shown in Figure (2). For example, in an X-Y plane, consider a region M divided into m parts; the centroid position of these parts is given as follows:

(\bar{x}, \bar{y}) = \left(\frac{1}{m}\sum_{i=1}^{m} x_i,\; \frac{1}{m}\sum_{i=1}^{m} y_i\right)

where x_i and y_i are the coordinates of the individual segment. For the selection of key frames, it is assumed that the total number of images in the video is n, and that k(n) and v(n) are respectively the horizontal and vertical lengths of the shape; a ratio s is defined as s = k(n)/v(n). The key images are then selected by taking only the ones that contain the most information, i.e., the maximum s in each sequence: key_frame = argmax(s).
After extracting the key frames from the silhouette frames of the video sequence, preprocessing is applied to extract features. Orientation is the main feature extracted from each key image for a well-defined action.
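A minimal sketch of the key-frame rule above (ratio s of horizontal to vertical extent, then argmax over frames). The bounding-box computation and the toy silhouettes are assumptions for illustration.

```python
def aspect_ratio(mask):
    """Ratio s = k(n)/v(n): horizontal over vertical extent of the
    silhouette's bounding box in a binary mask (list of lists of 0/1)."""
    rows = [r for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    cols = [c for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    k = max(cols) - min(cols) + 1   # horizontal length k(n)
    v = max(rows) - min(rows) + 1   # vertical length v(n)
    return k / v

def key_frame(masks):
    """Index of the frame maximizing s (the argmax(s) rule in the text)."""
    return max(range(len(masks)), key=lambda i: aspect_ratio(masks[i]))

# Three toy binary silhouettes: the second is widest relative to its height.
f0 = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]   # upright:  s = 1/3
f1 = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]   # stretched: s = 3
f2 = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]   # square:    s = 1
idx = key_frame([f0, f1, f2])            # selects the stretched silhouette
```

Intuitively, frames where the silhouette spreads horizontally (arms or legs extended) carry the most information about the action, which is what the maximum-s rule captures.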
6 Experimental results
6.1 Datasets
To compute the optical flow between two images, the following optical flow constraint equation must be solved:

I_x u + I_y v + I_t = 0

where I_x, I_y, and I_t are the spatio-temporal image brightness derivatives.
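One standard way to solve this under-determined constraint is the Lucas-Kanade least-squares formulation over a small window. The sketch below assumes precomputed derivative arrays and is not necessarily the exact estimator used in the experiments.

```python
import numpy as np

def lucas_kanade(Ix, Iy, It):
    """Solve Ix*u + Iy*v + It = 0 in least squares over a window:
    [u, v] = argmin ||A [u, v]^T - b||^2 with A = [Ix Iy], b = -It."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    uv, *_ = np.linalg.lstsq(A, b, rcond=None)
    return uv  # one (u, v) estimate for the whole window

# Toy 2x2 window whose derivatives are consistent with a pure horizontal
# motion of u = 2, v = 0, i.e. It = -(Ix*u + Iy*v).
Ix = np.array([[1.0, 2.0], [0.5, 1.5]])
Iy = np.array([[0.3, -0.2], [0.1, 0.4]])
It = -(Ix * 2.0)
uv = lucas_kanade(Ix, Iy, It)
```

A single pixel gives one equation in two unknowns; pooling the window's equations is what makes (u, v) recoverable, provided the gradients in the window are not all collinear.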
Finding the right topology of an artificial neural network (ANN) for a specific application is a challenging task, and only through several simulations of different ANN topologies, varying the number of neurons per layer and the number of hidden layers, can the optimal one be found.
In our case, after several tries, we obtained good results using a two-layer feed-forward network with a sigmoid activation function on both the hidden and output layers. The training function adopted was trainlm, which updates the weight and bias values of the neural network according to Levenberg-Marquardt optimization.
As the input data was formed from 90 sets of data, 9 vectors for each human action of the video dataset, the number of neurons in the input layer was set to 20. Considering that there are 10 human actions to be recognized, ten neurons were placed on the output layer.
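The inference pass of the 20-10-10 sigmoid network described above can be sketched as follows. The random weights stand in for trained parameters, and the Levenberg-Marquardt (trainlm) training itself is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, b1, W2, b2):
    """Two-layer feed-forward pass: 20 inputs -> 10 sigmoid hidden units
    -> 10 sigmoid output units (one per action class)."""
    h = sigmoid(W1 @ x + b1)
    return sigmoid(W2 @ h + b2)

# Untrained stand-in parameters for the 20-10-10 architecture.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((10, 20)), np.zeros(10)
W2, b2 = rng.standard_normal((10, 10)), np.zeros(10)

x = rng.standard_normal(20)           # one 20-dimensional feature vector
y = forward(x, W1, b1, W2, b2)        # 10 output activations in (0, 1)
predicted_action = int(np.argmax(y))  # index of the recognized action
```

At test time the winning output neuron directly indexes the recognized action, matching the one-to-n output coding used below.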
Different numbers of neurons were tried for the hidden layer. We finally chose a feed-forward backpropagation (FF-BP) architecture with:
- 20 neurons on the input layer
- 10 neurons on the hidden layer
- 10 neurons on the output layer
- Levenberg-Marquardt as the training algorithm
- MSE as the performance function
For testing purposes we chose data from the training set; the obtained recognition rate was 100%.
We also designed and tested a feed-forward backpropagation network with two layers, having 10 neurons on the hidden layer and 5 neurons on the output layer. The recognition rate was 99.96%. The simulation result is presented in Fig. 5.
A new dataset was acquired, which includes the ten human actions presented in Fig. 6. The dataset was split into 3 subsets for training, validation, and testing.
The output layer has 10 neurons because we adopted one-to-n coding for the output neurons to indicate the recognized human action, and there are 10 human actions to be recognized, as follows:
'bend', 'jack', 'jump', 'pjump', 'run', 'side', 'skip', 'walk', 'wave1', 'wave2'
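The one-to-n output coding can be sketched directly from the action list; the helper names are illustrative.

```python
ACTIONS = ['bend', 'jack', 'jump', 'pjump', 'run',
           'side', 'skip', 'walk', 'wave1', 'wave2']

def one_hot(action):
    """One-to-n coding: the target vector has a 1 at the action's index
    and 0 everywhere else."""
    vec = [0] * len(ACTIONS)
    vec[ACTIONS.index(action)] = 1
    return vec

def decode(vec):
    """Recover the action label from the output neuron with the largest value."""
    return ACTIONS[max(range(len(vec)), key=lambda i: vec[i])]

code = one_hot('run')   # target vector for the 'run' class
```

Since the network outputs real values in (0, 1), `decode` applies the same argmax rule to raw activations as to the clean target vectors.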
After several tests we found that 10 neurons in the hidden layer seem to be enough for a good recognition rate. The resulting neural network architecture for activity recognition is shown in Fig. 7. The testing sets were chosen from the training set and were set up from 10 times 90 consecutive vectors representing each activity to be recognized. The recognitions and errors are displayed in Fig. 8. The total number of errors was only 456, meaning a 99.088% recognition rate.
9 Experimental results
10 Conclusion
We tested our method on the well-known Weizmann and KTH datasets; our method shows effective results compared with other methods.
References