
Human Activity Prediction:

Early Recognition of Ongoing Activities from Streaming Videos

M. S. Ryoo
Electronics and Telecommunications Research Institute, Daejeon, Korea
mryoo@etri.re.kr

Abstract

In this paper, we present a novel approach to human activity prediction. Human activity prediction is a probabilistic process of inferring ongoing activities from videos containing only the onsets (i.e. the beginning part) of the activities. The goal is to enable early recognition of unfinished activities, as opposed to the after-the-fact classification of completed activities. Activity prediction methodologies are particularly necessary for surveillance systems, which are required to prevent crimes and dangerous activities from occurring. We probabilistically formulate the activity prediction problem, and introduce new methodologies designed for the prediction. We represent an activity as an integral histogram of spatio-temporal features, efficiently modeling how feature distributions change over time. A new recognition methodology named dynamic bag-of-words is developed, which considers the sequential nature of human activities while maintaining the advantages of the bag-of-words paradigm in handling noisy observations. Our experiments confirm that our approach reliably recognizes ongoing activities from streaming videos with high accuracy.

[Figure 1. A comparison between (A) the activity classification problem and (B) the activity prediction problem. In contrast to the classification task, a system is required to infer the ongoing activity before fully observing the activity video in the prediction task.]

1. Introduction
Human activity recognition, the automated detection of events performed by humans from video data, is an important computer vision problem. In the past 10 years, the field of human activity recognition has grown dramatically, corresponding to societal demands to construct various important applications including smart surveillance, quality-of-life devices for elderly people, and human-computer interfaces. Researchers are now graduating from recognizing simple human actions such as walking and running [16, 4, 2, 10, 6], and the field is gradually moving towards recognition of complex realistic human activities involving multiple persons and objects [12, 17, 18].

Particularly, in the past 5 years, approaches utilizing spatio-temporal features obtained successful results on activity recognition in real-world environments [16, 4, 10, 6, 12, 17]. Motivated by the success of scale-invariant local patch features in object recognition, these approaches extract sparse local features from the 3-D XYT video volume formed by concatenating image frames along the time axis. The bag-of-words paradigm, which ignores locations of features, has been widely adopted by many approaches, successfully classifying actions (e.g. videos in [16, 2]).

However, most of the existing activity recognition approaches are missing an important aspect of human activity analysis. Most previous researchers, including those discussed above, focused only on the after-the-fact detection of human activities (i.e. classifying activities after fully observing the entire video sequence), making the approaches unsuitable for the early detection of unfinished activities from video streams. In many real-world scenarios, the system is required to identify an intended activity of humans (e.g. criminals) before they fully execute the activity. For example, in a surveillance scenario, recognizing the fact that certain objects are missing after they have been stolen may not be meaningful. The system could be more useful if it is able to prevent the theft and catch the thieves by predicting the ongoing stealing activity as early as possible based on live video observations.

Similarly, if an autonomous vehicle wants to avoid an accident, its vision system is required to predict the accident that is about to occur and escape from it before any damage is caused. Even though one may extend traditional sequential models such as hidden Markov models (HMMs) to roughly approximate the prediction problem, they are unsuitable for modern high-dimensional features, which provide a sparse discontinuous representation of the video. The development of a new activity prediction methodology which is able to recognize an ongoing (i.e. unfinished) activity from a video containing only the early part of the activity (i.e. its onset) is necessary.

In this paper, we provide a formal definition of activity prediction as an inference of the ongoing activity given temporally incomplete observations (Figure 1). The focus of this paper is the introduction of the paradigm of activity prediction and the presentation of new methodologies designed for the prediction. Our objective is to enable the construction of an intelligent system which will perform early recognition from live video streams in real-time. We formulate the prediction problem probabilistically, and discuss the novel methodologies to solve the problem by estimating the activity's ongoing status efficiently.

This paper introduces two new human activity prediction methodologies which are able to cope with videos of unfinished activities. These methods compute the posterior probability of 'which activity has progressed to which point', based on the observations available at the time. 3-D XYT spatio-temporal features robust to noise, changing backgrounds, and scale changes are adopted. We designed the activity prediction approach called integral bag-of-words, modeling how feature distributions of activities change as observations increase. Integral histogram representations of the activities are constructed from training videos. In addition, a new recognition methodology named the dynamic bag-of-words approach is developed, extending the prediction algorithm to consider the sequential structure formed by video features. Structural similarities between activity models and incomplete observations are computed using our dynamic programming algorithm.

2. Previous works

Many researchers have studied human activity recognition since the early 1990s [1].

As discussed in the introduction, approaches utilizing local spatio-temporal features have been popularly studied in the last 5 years [16, 4, 10, 6, 12, 17]. These features are shown to be invariant to affine transformations and robust to lighting changes, making the approaches following the bag-of-words paradigm robust to noise and changing environments. These approaches were designed to perform after-the-fact classification of activities, assuming that each video being tested contains a complete execution of a single known activity. There have also been previous works that attempted to recognize activities based on video segments (e.g. a single frame [9]), which may make decisions before fully observing the activities. However, even though they obtained successful results on recognizing simple actions, they were limited in recognizing more complex activities (e.g. interactions in [13]) composed of similar body gestures.

On the other hand, previous recognition approaches using sequential state models (e.g. HMMs) [3, 8] displayed an ability to infer the intermediate status of human actions. For example, [3] modeled each human action as a sequence of hidden states generating posture features per frame, in order to enable early recognition of actions. The limitation of these sequential approaches is that they were unsuitable for the prediction of high-level activities with noisy observations and concurrent movements. By their nature, HMMs relied on per-frame body features of a human, and thus had difficulties processing realistic videos with changing backgrounds, dynamic lighting conditions, multiple actors, and/or unrelated pedestrians.

Recognition approaches using hierarchically organized models (e.g. stochastic context-free grammars [5]) were able to process high-level human activities by recognizing their sub-events. Particularly, Ryoo and Aggarwal's system representing human activities in terms of logical predicates [15] was able to analyze the progress status of activities based on the sub-event detection results. However, such approaches were unable to make an appropriate analysis if human activities share similar sub-events (e.g. pointing vs. punching). Recognizing complex activities at their early stage was difficult for the previous approaches.

An important contribution of this paper is the systematic formulation of the concept of activity prediction, which has not been studied in depth in previous research. This paper presents novel methodologies that reliably identify unfinished activities from video streams by analyzing their onsets. Our experiments confirm that our approach is able to correctly predict ongoing activities even when videos containing less than the first half of the activity are provided, in contrast to the previous systems.

3. Problem formulation

In this section, we probabilistically formulate the activity prediction problem. We first briefly present our probabilistic interpretation of the previous human activity classification problem. Next, we formulate the new problem of human activity prediction, while contrasting it with the previous activity classification problem.

3.1. Human activity classification
The goal of human activity classification is to categorize the given videos (i.e. testing videos) into a limited number of classes. Given a video observation O composed of image frames from time 0 to t, the system is required to select the activity label A_p which the system believes to be contained in the video. Various classifiers including K-nearest neighbors (K-NNs) and support vector machines (SVMs) have been popularly used in previous approaches. In addition, sliding window techniques were often adopted to apply activity classification algorithms to the localization of activities in continuous video streams.

Probabilistically, activity classification is defined as the computation of the posterior probability of the activity A_p given a video O with length t. In most cases, the video duration t is ignored, assuming it is independent of the activity:

P(A_p | O, t) = P(A_p, d^* | O) = \frac{P(O | A_p, d^*) P(A_p, d^*)}{\sum_i P(O | A_i, d^*) P(A_i, d^*)}   (1)

where d^* is a variable describing the progress level of the activity, indicating that the activity has fully progressed. As a result, the activity class with the maximum P(A_p, d^* | O) is chosen as the activity of the video O.

The probabilistic formulation of activity classification implies that the classification problem assumes each video (either a training video or a testing video) provided to the system contains a full execution of a single activity. That is, it assumes the after-the-fact categorization of video observations rather than the analysis of ongoing activities, and there have been very few attempts to recognize unfinished activities.
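To make the formulation concrete, the following is a minimal Python sketch of the MAP decision of Equation (1). The function name, the toy likelihood values, and the equal priors are illustrative assumptions, not components of an actual classifier.

```python
import numpy as np

def classify_activity(likelihoods, priors):
    # Posterior-based classification per Equation (1): P(A_p, d* | O) is
    # proportional to P(O | A_p, d*) P(A_p, d*); the shared normalizer
    # does not change the arg max.
    joint = np.asarray(likelihoods) * np.asarray(priors)
    posteriors = joint / joint.sum()
    return int(np.argmax(posteriors))

# Toy example: three candidate activities with equal priors.
print(classify_activity([0.02, 0.07, 0.01], [1/3, 1/3, 1/3]))  # -> 1
```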
3.2. Human activity prediction

The problem of human activity prediction is defined as an inference of unfinished activities given temporally incomplete videos. In contrast to activity classification, the system is required to make a decision on 'which activity is occurring' in the middle of the activity execution. In activity prediction, there is no assumption that the ongoing activity has been fully executed. The prediction methodologies must automatically estimate each activity's progress status that seems most probable based on the video observations, and decide which activity is most likely to be occurring at the time. Most of the previous classification methods are not directly applicable for this purpose, since they only support after-the-fact detections.

We probabilistically formulate the activity prediction process as:

P(A_p | O, t) = \sum_d P(A_p, d | O, t) = \frac{\sum_d P(O | A_p, d) P(t | d) P(A_p, d)}{\sum_i \sum_d P(O | A_i, d) P(t | d) P(A_i, d)}   (2)

where d is a variable describing the progress level of the activity A_p. For example, d = 50 indicates that the activity A_p has progressed from the 0th frame to the 50th frame of its representation. That is, the formulation describes that the activity prediction process must consider the various possible progress statuses of the activities for all 0 ≤ d ≤ d^*. P(t|d) represents the similarity between the length of the observation t and that of the activity progress d.

The key to the activity prediction problem is the accurate and efficient computation of the likelihood value P(O|A_p, d), which measures the similarity between the video observation and the activity A_p having the progress level d. A brute force method of solving the activity prediction problem is to construct multiple probabilistic classifiers (e.g. probabilistic SVMs) for all possible values of A_p and d. However, training and maintaining hundreds of classifiers to cover all progress levels d (e.g. 300 SVMs per activity if the activity takes 10 seconds at 30 fps) incurs a significant computational cost. Furthermore, the brute force construction of independent classifiers ignores sequential relations among the likelihood values, making the development of robust and efficient activity prediction methodologies necessary.
(2) feature extractor first converts a video into the 3-D XYT

4. Prediction using integral bag-of-words

In this section, we present our human activity prediction methodology named integral bag-of-words. The major difference between the approach introduced in this section and the previous approaches is that our approach is designed to efficiently analyze the status of ongoing activities from video streams. In Subsection 4.1, we first discuss the video features used. Next, our new probabilistic activity prediction methodology is presented in Subsection 4.2.

4.1. Features and visual words

Our approach takes advantage of 3-D space-time local features to predict human activities. A spatio-temporal feature extractor (e.g. [16, 4]) detects interest points with salient motion changes in a video, providing descriptors that represent the local movements occurring in the video. The feature extractor first converts a video into the 3-D XYT volume formed by concatenating image frames along the time axis, and locates 3-D volume patches with salient motion changes. A descriptor is computed for each local patch by summarizing the gradients inside the patch.

Once local features are extracted, our method clusters them into multiple representative types based on their appearance (i.e. feature vector values). These types are called 'visual words', which essentially are clusters of features. We use the k-means clustering algorithm to form visual words from features extracted from sample videos. As a result, each detected feature in a video belongs to one of the k visual words. Figure 2 shows example features and words.

[Figure 2. Example 3-D spatio-temporal features (left) and visual words (right) from a hand-shake video. Features grouped into an identical visual word are described with the same color.]
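As an illustration of this clustering step, the sketch below forms visual words with k-means; the random vectors stand in for real cuboid descriptors, and k is reduced from the hundreds used in practice to keep the toy example small.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(500, 100))   # stand-ins for extracted feature descriptors

k = 8                                       # our experiments use k = 800
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(descriptors)

# Each detected feature now belongs to one of the k visual words.
word_ids = kmeans.predict(descriptors)
print(word_ids[:10])
```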
4.2. Integral bag-of-words

Integral bag-of-words is a probabilistic activity prediction approach that constructs integral histograms to represent human activities. In order to predict the ongoing activity given a video observation O of length t, the system is required to compute the likelihood P(O|A_p, d) for all possible progress levels d of the activity A_p. What we present in this subsection is an efficient methodology to compute the activity likelihoods by modeling each activity as an integral histogram of visual words.

Our integral bag-of-words method is a histogram-based approach, which probabilistically infers ongoing activities by computing the likelihood P(O|A_p, d) based on feature histograms. The idea is to measure the similarity between the video O and the activity model (A_p, d) by comparing their histogram representations. The advantage of the histogram representation is that it is able to handle noisy observations with varying scales. For all possible (A_p, d), our approach computes the histogram of the activity, and compares it with the histogram of the testing video.

A feature histogram is a set of k histogram bins, where k is the number of visual words (i.e. feature types). Given an observation video, each histogram bin counts the number of extracted features of the same type, ignoring their spatio-temporal locations. The histogram representation of an activity model (A_p, d) is computed by averaging the feature histograms of the training videos while discarding features observed after the time frame d. That is, each histogram bin of the activity model (A_p, d) describes the expected number of occurrences of the corresponding visual word, which will be observed if the activity A_p has progressed to the frame d.

In order to enable the efficient computation of likelihoods for any (A_p, d) using histograms, we model each activity by constructing its integral histogram. Formally, an integral histogram of a video is defined as a sequence of feature histograms, H(O_l) = [h_1(O_l), h_2(O_l), ..., h_{|H|}(O_l)], where |H| is the number of frames of the activity video O_l. Let v_w denote the wth visual word. Then, the value of the wth histogram bin of each histogram h_d(O_l) is computed as:

h_d(O_l)[w] = |\{ f \mid f \in v_w \wedge t_f < d \}|   (3)

where f is a feature extracted from the video O_l and t_f is its temporal location. That is, each element h_d(O_l) of the integral histogram H(O_l) describes the histogram distribution of the spatio-temporal features whose temporal locations are less than d. Our integral histogram can be viewed as a temporal version of the spatial integral histogram [11].

Figure 3 shows an example integral histogram. Essentially, an integral histogram is a function of time describing how histogram values change as the observation duration increases. The integral histograms are computed for all training videos of the activity, and their mean integral histogram is used as a representation of the activity. The idea is to keep track of the changes in the visual words being observed as the activity progresses.

[Figure 3. An example integral histogram representing a kicking video, shown at t = 1, 17, 24, 31, 35, 42, and 53. An integral histogram models how the histogram distribution changes over time. Each histogram bin counts the number of features grouped into its visual word.]
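A minimal sketch of the integral histogram construction of Equation (3) follows, assuming each feature has already been assigned a visual word and a temporal location; the function name and the toy data are illustrative.

```python
import numpy as np

def integral_histogram(word_ids, times, k, num_frames):
    # H[d][w] counts the features of visual word w with temporal
    # location t_f < d, exactly as in Equation (3).
    H = np.zeros((num_frames + 1, k))
    for w, t_f in zip(word_ids, times):
        H[t_f + 1:, w] += 1      # the feature contributes to every d > t_f
    return H

# Toy video: four features over 10 frames, three visual words.
H = integral_histogram(word_ids=[0, 2, 0, 1], times=[1, 3, 5, 7], k=3, num_frames=10)
print(H[6])   # histogram of features with t_f < 6 -> [2. 0. 1.]
```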
The constructed integral histograms enable the prediction of human activities. Modeling the integral histograms of activities with Gaussian distributions having a uniform variance, the problem of predicting the most probable activity A^* follows from Equation (2) as:

A^* = \arg\max_p \sum_d P(A_p, d | O, t) = \arg\max_p \frac{\sum_d M(h_d(O), h_d(A_p)) P(t | d)}{\sum_i \sum_d M(h_d(O), h_d(A_i)) P(t | d)}   (4)

where

M(h_d(O), h_d(A_i)) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(h_d(O) - h_d(A_i))^2}{2\sigma^2}}.   (5)

Here, H(A_i) is the integral histogram of the activity A_i, and H(O) is the integral histogram of the video O. An equal prior probability among activities is assumed, and σ² describes the uniform variance.
The proposed method is able to compute the activity likelihoods for all d with O(k · d^*) computations given the integral histogram of the activity. The time complexity of the integral histogram construction for each activity is O(m · log m + k · d^*), where m is the total number of features in the training videos of the activity. That is, our approach requires significantly fewer computations than the brute force method of applying previous classifiers. For instance, the brute force method of training SVMs for all d takes O(n · k · d^* · r) computations, where n is the number of training videos of the activity and r is the number of iterations needed to train an SVM.
5. Prediction using dynamic bag-of-words

In this section, we present a novel activity recognition methodology, dynamic bag-of-words, which predicts human activities from onset videos using a sequential matching algorithm. The integral bag-of-words presented in the previous section is able to perform activity prediction by analyzing the ongoing status of activities, but it ignores the temporal relations among the extracted features. In Subsection 5.1, we introduce the new concept of dynamic bag-of-words that considers human activities' sequential structures for the prediction. Subsection 5.2 presents our dynamic programming implementation to predict ongoing activities from videos.

5.1. Dynamic bag-of-words

Our dynamic bag-of-words is a new activity recognition approach that considers the sequential nature of human activities, while maintaining the bag-of-words' advantages in handling noisy observations. An activity video is a sequence of images describing human postures, and its recognition must consider the sequential structure displayed by the extracted spatio-temporal features. The dynamic bag-of-words method follows our prediction formulation (i.e. Equation (2)), measuring the posterior probability of the given video observation being generated by the learned activity model. Its advantage is that the likelihood probability, P(O|A_p, d), is now computed to consider the activities' sequential structures.

Let Δd be a sub-interval of the activity model (i.e. A_p) that ends at d, and let Δt be a sub-interval of the observed video (i.e. O) that ends at t. In addition, let us denote the observed video O more specifically as O^t, indicating that O is obtained from frames 0 to t. Then, the likelihood between the activity model and the observed video can be enumerated as:

P(O^t | A_p, d) = \sum_{\Delta t} \sum_{\Delta d} \left[ P(O^{t - \Delta t} | A_p, d - \Delta d) \, P(O^{\Delta t} | A_p, \Delta d) \right]   (6)

where O^{Δt} corresponds to the observations obtained during the time interval Δt, and O^{t−Δt} corresponds to those obtained during the interval t − Δt.

The idea is to take advantage of the likelihood computed for the previous observations (i.e. P(O^{t−Δt} | A_p, d − Δd)) to update the likelihood of the entire observation (i.e. P(O^t|A_p, d)). This incremental likelihood computation not only enables efficient activity prediction for increasing observations, but also poses a temporal constraint that the observations must match the activity model sequentially.

[Figure 4. The matching process of our dynamic bag-of-words.]

Essentially, the above recursive equation divides the activity progress interval d into a set of sub-intervals D = {Δd_1, Δd_2, ..., Δd_q} with varying lengths, and the observed video O into a set of sub-intervals T = {Δt_1, Δt_2, ..., Δt_q}. The likelihood is computed by matching the q pairs of sub-intervals (Δd_j, Δt_j). That is, the method searches for the optimal D and T that maximize the overall likelihood between the two sequences, which is measured by computing the similarity between each (Δd_j, Δt_j) pair. Figure 4 illustrates this process.

The motivation is to divide the activity model and the observed sequence into multiple segments to find the structural similarity between them. Notice that the duration of the activity model segment (i.e. Δd) to match the new observation segment (i.e. O^{Δt}) is dynamically selected, finding the best-matching segment pairs to compute their similarity recursively. The segment likelihood, P(O^{Δt}|A_p, Δd), is computed by comparing their histogram representations. That is, the bag-of-words paradigm is applied for matching the interval segments, while the segments themselves are sequentially organized based on our recursive activity prediction formulation.

Video segment matching using integral histograms. Our dynamic bag-of-words method takes advantage of integral histograms for computing the similarity between interval segments (i.e. P(O^{Δt}|A, Δd)). Integral histograms enable efficient construction of the histogram of the activity segment Δd and that of the video segment Δt for any possible (Δd, Δt). Let [a, b] be the time interval of Δd. The histogram corresponding to Δd is computed as:

h_{\Delta d}(A_p) = h_b(A_p) - h_a(A_p)   (7)

where H(A_p) is the integral histogram of the activity A_p. Similarly, the histogram of Δt is computed based on the integral histogram H(O), providing us h_{Δt}(O).
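Equation (7) reduces segment matching to a constant-time lookup; as a minimal sketch:

```python
import numpy as np

def segment_histogram(H, a, b):
    # Equation (7): the histogram of features with a <= t_f < b is the
    # difference of two rows of the integral histogram H.
    return H[b] - H[a]

# With the toy integral histogram built in Section 4 (features at t = 1, 3, 5, 7),
# segment_histogram(H, 4, 8) counts only the features falling in [4, 8).
```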

Using the integral histograms, the likelihood computation of our dynamic bag-of-words is described with the following recursive equation. Similar to the case of the integral bag-of-words method, the feature histograms of the activities are modeled with Gaussian distributions:

F_p(t, d) = \sum_{\Delta t} \sum_{\Delta d} \left[ F_p(t - \Delta t, d - \Delta d) \cdot M(h_{\Delta t}(O), h_{\Delta d}(A_p)) \right]   (8)

where F_p(t, d) is equivalent to P(O^t | A_p, d).
5.2. Dynamic programming algorithm

In this subsection, we present a dynamic programming implementation of our dynamic bag-of-words method to find the ongoing activity in a given video. We construct a maximum a posteriori (MAP) classifier to decide which activity is most likely to be occurring.

Given the observation video O with length t, the activity prediction problem of finding the most probable ongoing activity A^* is expressed as follows:

A^* = \arg\max_p \frac{\sum_d F_p(t, d) P(t | d) P(A_p, d)}{\sum_i \sum_d F_i(t, d) P(t | d) P(A_i, d)}.   (9)

That is, in order to predict the ongoing activity given an observation O^t, the system is required to calculate the likelihood F_p(t, d) (i.e. Equation (8)) recursively for all activity progress statuses d.

However, even with integral histograms, brute force searching of all possible combinations of (Δt, Δd) for a given video of length t requires O(k · (d^*)² · t²) computations: in order to find A^* at each time step t, the system must compute F_p(t, d) for d^* possible values of d. Furthermore, the computation of each F_p(t, d) requires the summation of F_p values over all possible combinations of Δt and Δd, as described in Equation (8).

In order to make the prediction process computationally tractable, we design an algorithm that approximates the likelihood F_p(t, d) by fixing Δt to a constant duration. The image frames of the observed video are divided into several segments with a fixed duration (e.g. 1 second), and they are matched with the activity segments dynamically. Let u be the variable describing the unit time duration. Then, the activity prediction likelihood is approximated as:

F'_p(u, d) = \max_{\Delta d} F'_p(u - 1, d - \Delta d) \, M(h_{\tilde{u}}(O), h_{\Delta d}(A_p))   (10)

where ũ is the unit time interval between u − 1 and u.

Our algorithm sequentially computes F'_p(u, d) for all u. At each iteration of u, the system searches for the best-matching segment Δd for ũ that maximizes the function F', as described in Equation (10). Essentially, our method interprets a video as a sequence of ordered sub-intervals (i.e. ũ), where each of them is represented with a histogram of the features in it. As a result, F'_p(u, d) provides an efficient approximation of the activity prediction likelihood, measuring how probable it is that the observation O was generated from the activity (i.e. A_p) progressed to the dth frame.

[Figure 5. An illustration of the process of our dynamic programming algorithm. The algorithm iteratively fills in the array to obtain the optimum likelihood.]

A traditional dynamic programming algorithm that corresponds to the above recursive equation is designed to calculate the likelihood. The goal is to search for the optimum activity model divisions (i.e. Δd) that best describe the observation, matching them with the observation stage-by-stage. Figure 5 illustrates the process of our dynamic programming algorithm computing the likelihood of the ongoing activity from an incomplete video. The time complexity of the algorithm is O(k · (d^*)²) for each time step u, which is in general much smaller than t.
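A minimal sketch of this dynamic programming recursion (Equation (10)) is given below. It assumes integral histograms as in Section 4 and the Gaussian segment similarity of Equation (5) with its constant factor dropped (it is shared by all hypotheses); the variable names and the fixed unit length are illustrative.

```python
import numpy as np

def dynamic_bow_likelihoods(H_obs, H_act, unit, sigma=1.0):
    # F[u][d] approximates P(O | A_p, d) after u fixed-length observation
    # units, per Equation (10).
    def match(h1, h2):
        return np.exp(-np.sum((h1 - h2) ** 2) / (2.0 * sigma ** 2))

    num_units = (len(H_obs) - 1) // unit
    d_max = len(H_act) - 1
    F = np.zeros((num_units + 1, d_max + 1))
    F[0, 0] = 1.0                       # an empty observation matches zero progress
    for u in range(1, num_units + 1):
        h_u = H_obs[u * unit] - H_obs[(u - 1) * unit]   # histogram of unit u
        for d in range(1, d_max + 1):
            # select the model segment length delta_d maximizing the recursion
            F[u, d] = max(F[u - 1, d - dd] * match(h_u, H_act[d] - H_act[d - dd])
                          for dd in range(1, d + 1))
    return F
```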

6. Experiments

In this section, we implement and evaluate our human activity prediction methodologies while comparing them with previous classification works. The methods' ability to perform early detection of activities is tested with a public video dataset containing high-level multi-person interactions. We confirm the advantages of our approaches in inferring ongoing activities at their early stage.

6.1. Dataset

For our experiments, we used the segmented version of the UT-Interaction dataset [13] containing videos of six types of human activities: hand-shaking, hugging, kicking, pointing, punching, and pushing. The UT-Interaction dataset is a public video dataset containing high-level human activities of multiple actors. The dataset is composed of two different sets with different environments (Figure 6), containing a total of 120 videos of the six types of human-human interactions. Each set is composed of 10 sequences, and each sequence contains one execution per activity. The videos involve camera jitter and/or background movements (e.g. trees). Several pedestrians are present in the videos as well, hindering the recognition. The UT-Interaction dataset was used for the human activity recognition contest (SDHA 2010) [14], and it has been tested by several state-of-the-art methods [17, 18, 19].

[Figure 6. Example snapshots from the UT-Interaction dataset, set #1 (left) and set #2 (right).]

We chose a dataset composed of complex activities having sufficient temporal durations, instead of testing our system with videos of periodic and instantaneous actions. Even though the KTH dataset [16] and the Weizmann dataset [2] have been popularly used for action classification in previous works, they were inappropriate for our experiments: their videos are composed of short periodic movements which require only a few frames (e.g. a single frame [9]) for a reliable recognition.

6.2. Experimental settings

We implemented two human activity prediction systems based on the two methods presented in this paper: the integral bag-of-words (BoW) method and the dynamic BoW method. We adopted the cuboid feature descriptors [4] as the spatio-temporal features used by the systems. In principle, our proposed approaches are able to cope with any spatio-temporal feature extractor as long as it provides the XYT locations of the detected features. Extracted features were then clustered into 800 visual words (i.e. k = 800), and integral histograms were constructed.

In addition, we implemented several previous human activity classification approaches to compare them with our methods. Four types of previous classifiers using the same features (i.e. cuboid features) were implemented: we implemented (i) a brute-force prediction approach of constructing SVMs for all progress levels (BP-SVMs) [7] and (ii) a basic voting-based approach that casts a probabilistic vote for each spatio-temporal feature. In addition, in order to show the limitation of the previous classification framework, we implemented (iii) Bayesian classifiers with Gaussian models and (iv) standard SVMs, which are designed to classify videos assuming that they contain full activity executions. The testing was performed by applying the learned classifiers to the videos containing ongoing activities.

Throughout our experiments, the leave-one-sequence-out cross validation setting was used to measure the performances of the systems. There are 10 sequences (each sequence contains 6 videos) for each UT-Interaction dataset, and thus 10-fold cross validation was performed. That is, for each round, the videos in one sequence were selected as the testing videos, and the videos in the other sequences were used for the training. Integral histograms were constructed from the training videos to recognize the activities in the testing videos. This testing/training video selection process was repeated for 10 rounds, measuring the average recognition accuracy.

6.3. Results

In order to test the systems' ability to predict ongoing activities at their earlier stage, we measured the systems' recognition performances when classifying videos of incomplete activity executions. The experiments were conducted with 10 different observation ratio settings, from 0.1 to 1.

[Figure 7. Activity recognition performances with respect to the observed ratio, tested with the UT-Interaction dataset #1 (top) and #2 (bottom). A higher graph suggests that the corresponding method is able to recognize activities more accurately and earlier than the ones below it. Our dynamic BoW showed the best performance by considering sequential relations among the feature points. BP-SVMs, which require a large amount of computations, also showed a fair performance; however, they generated inferior results particularly when making early decisions. We are also able to observe that the per-feature-voting approach is able to recognize human activities at a relatively earlier stage than the other basic classifiers (e.g. SVMs), but it displays worse performance overall because of the characteristics of the activities in the UT-Interaction dataset (i.e. they share many gestures, generating similar features).]

Let c denote the length of an original video containing a fully executed activity. Then, the setting with an observed ratio of x indicates that the video provided to the system is formulated by segmenting the initial x · c frames of the original video. For example, the systems' performances at observed ratio 0.5 describe the classification accuracies given testing videos containing only the first halves of the activities. The observed ratio of 1 indicates that all testing videos contain full executions of the activities, making the problem a conventional activity classification problem.
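A minimal sketch of this test-video construction, together with a hypothetical evaluation loop (the predictor, the video list, and the labels are assumed, not part of our implementation):

```python
def truncate_video(frames, observed_ratio):
    # Keep only the initial x*c frames of a fully executed activity video.
    c = len(frames)
    return frames[:int(round(observed_ratio * c))]

# Hypothetical evaluation loop over the 10 observation ratio settings:
# for x in [r / 10 for r in range(1, 11)]:
#     correct = sum(predict(truncate_video(v, x)) == y
#                   for v, y in zip(videos, labels))
#     print(x, correct / len(videos))
```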
Figure 7 illustrates the performance curves of the implemented systems. Its X axis corresponds to the observed ratio of the testing videos, while the Y axis corresponds to the activity recognition accuracy of the system. We averaged the systems' performances over 20 runs, considering that the visual word clustering contains randomness.

The figure confirms that the proposed methods are able to recognize ongoing activities at an earlier stage of the activities, compared to the previous approaches. For example, on dataset #1, the dynamic BoW is able to make a prediction with an accuracy of 0.7 after observing only the first 60% of the testing video, while the BP-SVMs must observe the first 82% to obtain the same accuracy. In addition, our integral BoW and dynamic BoW are computationally more efficient than the previous brute-force approaches (e.g. BP-SVMs). The time complexity to construct activity models in our two methods is O(m · log m + k · d^*), while that of the BP-SVMs is O(n · k · d^* · r), as discussed in Subsection 4.2.

Table 1 compares the classification accuracies measured with the UT-Interaction dataset #1. Each accuracy in the table can be viewed as the highest performance one can obtain using the method with the optimum parameters and visual words. Together with the classification accuracies given videos containing full activity executions, the table lists the prediction accuracies measured with the videos of the observed ratio 0.5. We are able to observe that our approaches perform superior to the previous methods given the videos with the observed ratio 0.5, confirming that they predict ongoing activities earlier. In the traditional classification task (i.e. full videos), our approach performed comparably to the state-of-the-art results.

Table 1. Recognition performances measured with the UT-Interaction dataset #1. The classification accuracies of our approaches and the previous approaches are compared. Most of the listed classification approaches used 3-D spatio-temporal features. In addition, [17] took advantage of automatically extracted bounding boxes of persons.

System                   | Accuracy w. half videos | Accuracy w. full videos
Dynamic BoW              | 70.0 %                  | 85.0 %
Integral BoW             | 65.0 %                  | 81.7 %
Waltisberg et al. [17]   | -                       | 88.0 %
Cuboid + SVMs [14]       | 31.7 %                  | 85.0 %
BP-SVM [7]               | 65.0 %                  | 83.3 %
Yu et al. [18]           | -                       | 83.3 %
Yuan et al. [19]         | -                       | 78.2 %
Cuboid + Bayesian        | 25.0 %                  | 71.7 %

Our algorithms run in real-time with our unoptimized C++ implementation, except for the adopted feature extraction component. That is, if appropriate features are provided to our system in real-time, the system is able to analyze input videos and predict ongoing activities in real-time.

7. Conclusion

In this paper, we introduced the new paradigm of human activity prediction. The motivation was to enable the early detection of unfinished activities from initial observations. We formulated the problem probabilistically, and presented two novel recognition methodologies designed for the efficient prediction of human activities. The experimental results confirmed that the proposed approaches are able to recognize ongoing human-human interactions at a much earlier stage than the previous methods.

References

[1] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM Computing Surveys, 43:16:1-16:43, April 2011.
[2] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In ICCV, 2005.
[3] J. W. Davis and A. Tyagi. Minimal-latency human action recognition using reliable-inference. Image and Vision Computing, 24:455-472, May 2006.
[4] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In IEEE Workshop on VS-PETS, 2005.
[5] Y. A. Ivanov and A. F. Bobick. Recognition of visual activities and interactions by stochastic parsing. IEEE T PAMI, 22(8):852-872, 2000.
[6] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[7] K. Laviers, G. Sukthankar, M. Molineaux, and D. W. Aha. Improving offensive performance through opponent modeling. In Artificial Intelligence for Interactive Digital Entertainment Conference, 2009.
[8] F. Lv and R. Nevatia. Single view human action recognition using key pose matching and Viterbi path searching. In CVPR, 2007.
[9] J. C. Niebles and L. Fei-Fei. A hierarchical model of shape and appearance for human actions classification. In CVPR, 2007.
[10] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. IJCV, 79(3), Sep 2008.
[11] F. Porikli. Integral histogram: A fast way to extract histograms in Cartesian spaces. In CVPR, 2005.
[12] M. S. Ryoo and J. K. Aggarwal. Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In ICCV, 2009.
[13] M. S. Ryoo and J. K. Aggarwal. UT-Interaction Dataset, ICPR Contest on Semantic Description of Human Activities (SDHA), 2010.
[14] M. S. Ryoo, C.-C. Chen, J. K. Aggarwal, and A. Roy-Chowdhury. An overview of contest on semantic description of human activities (SDHA) 2010. In Proceedings of ICPR Contests, 2010.
[15] M. S. Ryoo, K. Grauman, and J. K. Aggarwal. A task-driven intelligent workspace system to provide guidance feedback. CVIU, 114(5):520-534, May 2010.
[16] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In ICPR, 2004.
[17] D. Waltisberg, A. Yao, J. Gall, and L. V. Gool. Variations of a Hough-voting action recognition system. In ICPR Contest on Semantic Description of Human Activities (SDHA), in Proceedings of ICPR Contests, 2010.
[18] T.-H. Yu, T.-K. Kim, and R. Cipolla. Real-time action recognition by spatiotemporal semantic and structural forests. In BMVC, 2010.
[19] F. Yuan, V. Prinet, and J. Yuan. Middle-level representation for human activities recognition: the role of spatio-temporal relationships. In ECCV Workshop on Human Motion: Understanding, Modeling, Capture and Animation, 2010.

