M. S. Ryoo
Electronics and Telecommunications Research Institute, Daejeon, Korea
mryoo@etri.re.kr
[Figure: an observed video O and its integral histogram H, illustrated at frames t = 1, 17, 24, 31, 35, 42, and 53.]

…volume formed by concatenating image frames along the time axis, and locates 3-D volume patches with salient motion changes. A descriptor is computed for each local patch by summarizing the gradients inside the patch.

Once local features are extracted, our method clusters them into multiple representative types based on their appearance (i.e. feature vector values). These types are called ‘visual words’, which essentially are clusters of features.
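As a concrete illustration of this visual-word construction, the following Python sketch clusters local feature descriptors with k-means. The descriptor array, the vocabulary size k = 400, and the use of scikit-learn's KMeans are illustrative assumptions, not details specified in the paper.

```python
# Hypothetical sketch: building a visual-word vocabulary by clustering
# local feature descriptors (e.g. gradient summaries of 3-D volume patches).
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors: np.ndarray, k: int = 400) -> KMeans:
    """Cluster descriptors into k 'visual words' (cluster centers)."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(descriptors)

def quantize(descriptors: np.ndarray, vocab: KMeans) -> np.ndarray:
    """Assign each descriptor the index of its nearest visual word."""
    return vocab.predict(descriptors)

# Illustrative usage: 10,000 patch descriptors of dimension 100.
feats = np.random.rand(10000, 100)
vocab = build_vocabulary(feats)
words = quantize(feats, vocab)  # one word index per local feature
```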
…O(m · log m + k · d∗), where m is the total number of features in the training videos of the activity. That is, our approach requires a significantly smaller amount of computation compared to the brute-force method of applying previous classifiers. For instance, the brute-force method of training SVMs for all d takes O(n · k · d∗ · r) computations, where n is the number of training videos of the activity and r is the number of iterations to train an SVM.
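To get a feel for this gap with purely illustrative numbers (not taken from the paper): for m = 10,000 features, k = 800 visual words, and d∗ = 150 frames, m · log m + k · d∗ is on the order of 2.5 × 10⁵ operations, whereas n · k · d∗ · r with n = 60 training videos and r = 1,000 SVM training iterations is on the order of 7 × 10⁹, a difference of roughly four orders of magnitude.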
5. Prediction using dynamic bag-of-words
In this section, we present a novel activity recognition methodology, dynamic bag-of-words, which predicts human activities from onset videos using a sequential matching algorithm. The integral bag-of-words presented in the previous section is able to perform activity prediction by analyzing the ongoing status of activities, but it ignores temporal relations among the extracted features. In Subsection 5.1, we introduce the new concept of dynamic bag-of-words, which considers human activities' sequential structures for the prediction. Subsection 5.2 presents our dynamic programming implementation to predict ongoing activities from videos.
5.1. Dynamic bag-of-words
Our dynamic bag-of-words is a new activity recognition approach that considers the sequential nature of human activities, while maintaining the bag-of-words paradigm's advantages in handling noisy observations. An activity video is a sequence of images describing human postures, and its recognition must consider the sequential structure displayed by the extracted spatio-temporal features. The dynamic bag-of-words method follows our prediction formulation (i.e. Equation (2)), measuring the posterior probability that the given video observation is generated by the learned activity model. Its advantage is that the likelihood probability, P(O | Ap, d), is now computed so as to consider the activities' sequential structures.
Let ∆d be a sub-interval of the activity model (i.e. Ap) that ends with d, and let ∆t be a sub-interval of the observed video (i.e. O) that ends with t. In addition, let us denote the observed video O more specifically as Ot, indicating that O is obtained from frames 0 to t. Then, the likelihood between the activity model and the observed video can be enumerated as:

$$P(O_t \mid A_p, d) = \sum_{\Delta t} \sum_{\Delta d} \big[\, P(O_{t-\Delta t} \mid A_p, d - \Delta d) \cdot P(O_{\Delta t} \mid A_p, \Delta d) \,\big] \qquad (6)$$

where O∆t corresponds to the observations obtained during the time interval ∆t, and Ot−∆t corresponds to those obtained during the interval t − ∆t.
The idea is to take advantage of the likelihood computed for the previous observations (i.e. P(Ot−∆t | Ap, d − ∆d)) to update the likelihood of the entire observation (i.e. P(Ot | Ap, d)). This incremental likelihood computation not only enables efficient activity prediction as the observation grows, but also imposes a temporal constraint that observations must match the activity model sequentially.

Essentially, the above recursive equation divides the activity progress time interval d into a set of sub-intervals D = {∆d1, ∆d2, ..., ∆dq} with varying lengths, and the observed video O into a set of sub-intervals T = {∆t1, ∆t2, ..., ∆tq}. The likelihood is computed by matching the q pairs of sub-intervals (∆dj, ∆tj). That is, the method searches for the optimal D and T that maximize the overall likelihood between the two sequences, which is measured by computing the similarity between each (∆dj, ∆tj) pair. Figure 4 illustrates this process.

[Figure 4. The matching process of our dynamic bag-of-words: the video observation O is divided into segments ∆t1, ∆t2, ..., the activity model Ap into segments ∆d1, ∆d2, ..., and each histogram pair (h∆tj(O), h∆dj(Ap)) is compared with M(·, ·).]

The motivation is to divide the activity model and the observed sequence into multiple segments and to find the structural similarity between them. Notice that the duration of the activity model segment (i.e. ∆d) to match the new observation segment (i.e. O∆t) is dynamically selected, finding the best-matching segment pairs to compute their similarity recursively. The segment likelihood, P(O∆t | Ap, ∆d), is computed by comparing their histogram representations. That is, the bag-of-words paradigm is applied for matching the interval segments, while the segments themselves are sequentially organized based on our recursive activity prediction formulation.

Video segment matching using integral histograms. Our dynamic bag-of-words method takes advantage of integral histograms for computing the similarity between interval segments (i.e. P(O∆t | Ap, ∆d)). Integral histograms enable efficient construction of the histogram of the activity segment ∆d and that of the video segment ∆t for any possible pair (∆d, ∆t). Let [a, b] be the time interval of ∆d. The histogram corresponding to ∆d is computed as:

$$h_{\Delta d}(A_p) = h_b(A_p) - h_a(A_p) \qquad (7)$$

where ha(Ap) and hb(Ap) are the entries of the integral histogram H(Ap) of the activity Ap at times a and b. Similarly, the histogram of ∆t is computed based on the integral histogram H(O), providing us h∆t(O).
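To make the integral-histogram bookkeeping concrete, here is a minimal Python sketch of Equation (7). The per-frame lists of visual-word indices and the NumPy array layout are illustrative assumptions; only the subtraction rule itself comes from the text.

```python
# Hypothetical sketch of Equation (7): the histogram of any temporal segment
# is obtained in O(1) from an integral histogram. word_ids[t] holds the
# visual-word indices observed at frame t; k is the vocabulary size.
import numpy as np

def integral_histogram(word_ids, k):
    """H[t] = histogram of all features observed in frames 0..t-1."""
    H = np.zeros((len(word_ids) + 1, k))
    for t, words in enumerate(word_ids):
        H[t + 1] = H[t]
        for w in words:
            H[t + 1, w] += 1
    return H

def segment_histogram(H, a, b):
    """Histogram of the interval [a, b]: h_b - h_a, as in Equation (7)."""
    return H[b] - H[a]
```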
Using the integral histograms, the likelihood computation of our dynamic bag-of-words is described with the following recursive equation. Similar to the case of the integral bag-of-words, the activities are modeled with Gaussian distributions.

$$F_p(t, d) = \sum_{\Delta t} \sum_{\Delta d} \big[\, F_p(t - \Delta t, d - \Delta d) \cdot M(h_{\Delta t}(O), h_{\Delta d}(A_p)) \,\big] \qquad (8)$$

where Fp(t, d) is equivalent to P(Ot | Ap, d).
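The segment similarity M compares an observed segment histogram with a model segment histogram. The paper models activities with Gaussian distributions; the concrete parametrization below (an isotropic Gaussian over the histogram difference, with an assumed bandwidth sigma) is our illustrative assumption, not the paper's exact formula.

```python
# Hypothetical sketch of the segment similarity M(h_∆t(O), h_∆d(A_p)) used
# in Equation (8); sigma is an assumed, manually chosen bandwidth.
import numpy as np

def M(h_obs, h_model, sigma=1.0):
    """Gaussian similarity between observed and model segment histograms."""
    sq_dist = float(np.sum((h_obs - h_model) ** 2))
    return float(np.exp(-sq_dist / (2.0 * sigma ** 2)))
```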
5.2. Dynamic programming algorithm
In this subsection, we present a dynamic programming implementation of our dynamic bag-of-words method to find the ongoing activity from the given video. We construct the maximum a posteriori (MAP) classifier that decides which activity is most likely to be occurring.

Given the observation video O with length t, the activity prediction problem of finding the most probable ongoing activity A∗ is expressed as follows:

$$A^{*} = \arg\max_{p} \frac{\sum_{d} F_p(t, d)\, P(t \mid d)\, P(A_p, d)}{\sum_{i} \sum_{d} F_i(t, d)\, P(t \mid d)\, P(A_i, d)} \qquad (9)$$

That is, in order to predict the ongoing activity given an observation Ot, the system is required to calculate the likelihood Fp(t, d) (i.e. Equation (8)) recursively for all activity progress statuses d.
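A minimal sketch of the MAP decision in Equation (9); treating the priors P(t | d) and P(Ap, d) as uniform is our simplifying assumption, and the F tables are assumed to come from the dynamic programming routine described below.

```python
# Hypothetical sketch of Equation (9): MAP selection of the ongoing activity.
# F maps activity index p to the array F_p(t, d) over progress levels d;
# with uniform priors the shared denominator does not change the argmax.
import numpy as np

def map_activity(F):
    scores = {p: float(np.sum(F_p)) for p, F_p in F.items()}
    return max(scores, key=scores.get)
```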
However, even with integral histograms, brute-force searching of all possible combinations of (∆t, ∆d) for a given video of length t requires O(k · (d∗)² · t²) computations: in order to find A∗ at each time step t, the system must compute Fp(t, d) for d∗ possible values of d. Furthermore, the computation of each Fp(t, d) requires the summation of Fp values over all possible combinations of ∆t and ∆d, as we described in Equation (8).

In order to make the prediction process computationally tractable, we design an algorithm that approximates the likelihood Fp(t, d) by making its ∆t have a fixed duration. The image frames of the observed video are divided into several segments of a fixed duration (e.g. 1 second), and they are matched with the activity segments dynamically. Let u be the variable describing the unit time duration. Then, the activity prediction likelihood is approximated as:

$$F'_p(u, d) = \max_{\Delta d} \; F'_p(u - 1,\, d - \Delta d) \cdot M\big(h_{\tilde{u}}(O),\, h_{\Delta d}(A_p)\big) \qquad (10)$$

where ũ is the unit time interval between u − 1 and u.
Our algorithm sequentially computes F′p(u, d) for all u. At each iteration of u, the system searches for the best-matching segment ∆d for ũ that maximizes the function F′, as described in Equation (10). Essentially, our method interprets a video as a sequence of ordered sub-intervals (i.e. ũ), where each of them is represented with a histogram of the features in it. As a result, F′p(u, d) provides us an efficient approximation of the activity prediction likelihood, measuring how probable it is that the observation O is generated from the activity (i.e. Ap) progressed to the dth frame.

A traditional dynamic programming algorithm that corresponds to the above recursive equation is designed to calculate the likelihood. The goal is to search for the optimum activity model divisions (i.e. ∆d) that best describe the observation, matching them with the observation stage-by-stage. Figure 5 illustrates the process of our dynamic programming algorithm to compute the likelihood of the ongoing activity from an incomplete video. The time complexity of the algorithm is O(k · (d∗)²) for each time step u, which is in general much smaller than t.

Figure 5. A figure illustrating the process of our dynamic programming algorithm. The algorithm iteratively fills in the array to obtain the optimum likelihood.
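The following Python sketch fills in the dynamic programming table of Equation (10), reusing the segment_histogram and M helpers sketched above. The table layout, the initialization, and the cap max_dd on the model-segment length are illustrative assumptions, not details from the paper.

```python
# Hypothetical DP sketch for Equation (10). F[u, d] approximates F'_p(u, d):
# the likelihood that the first u unit-length video segments match the first
# d frames of the activity model A_p. H_O and H_A are integral histograms
# (video and activity model); `unit` is the segment length in frames.
import numpy as np

def predict_likelihood(H_O, H_A, num_units, d_star, unit, max_dd=40):
    F = np.zeros((num_units + 1, d_star + 1))
    F[0, 0] = 1.0  # an empty observation matches zero activity progress
    for u in range(1, num_units + 1):
        h_u = segment_histogram(H_O, (u - 1) * unit, u * unit)  # h_ũ(O)
        for d in range(1, d_star + 1):
            best = 0.0
            for dd in range(1, min(d, max_dd) + 1):  # candidate ∆d lengths
                h_dd = segment_histogram(H_A, d - dd, d)  # h_∆d(A_p)
                best = max(best, F[u - 1, d - dd] * M(h_u, h_dd))
            F[u, d] = best
    return F  # F[num_units, d] ≈ P(O | A_p, d) for each progress level d
```

Computing such a table for every activity model and feeding the resulting likelihoods to the map_activity sketch above would then realize the MAP rule of Equation (9).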
6. Experiments

In this section, we implement and evaluate our human activity prediction methodologies while comparing them with other previous classification works. The methods' ability to perform early detection of activities is tested with a public video dataset containing high-level multi-person interactions. We confirm the advantages of our approaches in inferring ongoing activities at their early stage.

6.1. Dataset

For our experiments, we used the segmented version of the UT-Interaction dataset [13] containing videos of six types of human activities: hand-shaking, hugging, kicking, pointing, punching, and pushing. The UT-Interaction dataset is a public video dataset containing high-level human activities of multiple actors. The dataset is composed of two different sets with different environments (Figure 6), containing a total of 120 videos of six types of human-human interactions. Each set is composed of 10 sequences, and each sequence contains one execution per activity. The videos involve camera jitter and/or background movements (e.g. trees). Several pedestrians are present in the videos.

[Figure 6. Snapshots from the UT-Interaction set #1 and set #2.]

Although the KTH dataset [16] and the Weizmann dataset [2] have been popularly used for action classification in previous works, they were inappropriate for our experiments: their videos are composed of short periodic movements which require only a few frames (e.g. a single frame [9]) to perform a reliable recognition.

[Figure 7. Recognition accuracy (Y axis, 0 to 1) as a function of the observed ratio (X axis, 0.1 to 1.0) for Dynamic BoW, Integral BoW, SVM, Bayesian, BP-SVM, and Voting.]
Let c denote the length of an original video containing a fully executed activity. Then, the setting with an observed ratio of x indicates that the video provided to the system is formulated by segmenting the initial x · c frames of the original video. For example, the systems' performances at observed ratio 0.5 describe the classification accuracies given testing videos having only the first halves of the activities. The observed ratio of 1 indicates that all testing videos contain full executions of the activities, making the problem a conventional activity classification problem.
conventional activity classification problem. BP-SVM [7] 65.0 % 83.3 %
Figure 7 illustrates the performance curves of the imple- Yu et al. [18] - 83.3 %
Yuan et al. [19] - 78.2 %
mented systems. Its X axis corresponds to the observed ra- Cuboid + Bayesian 25.0 % 71.7 %
tio of the testing videos, while the Y axis corresponds to the
activity recognition accuracy of the system. We have av-
eraged the systems’ performances for 20 runs, considering ficient prediction of human activities. The experimental
that the visual word clustering contains randomness. results confirmed that the proposed approaches are able to
The figure confirms that the proposed methods are able recognize ongoing human-human interactions at their much
to recognize ongoing activities at earlier stage of the activi- earlier stage than the previous methods.
ties, compared to the previous approaches. For example, in
dataset #1, the dynamic BoW is able to make a prediction
References
with the accuracy of 0.7 after observing only first 60% of [1] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM
Computing Surveys, 43:16:1–16:43, April 2011.
the testing video, while the BP-SVMs must observe the first [2] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-
82% to obtain the same accuracy. In addition, our integral time shapes. In ICCV, 2005.
BoW and dynamic BoW are computationally more efficient [3] J. W. Davis and A. Tyagi. Minimal-latency human action recognition using
reliable-inference. Image Vision Computing, 24:455–472, May 2006.
than the previous brute-force approaches (e.g. BP-SVMs). [4] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via
The time complexity to construct activity models in our two sparse spatio-temporal features. In IEEE Workshop on VS-PETS, 2005.
methods is O(m·log m+k ·d∗ ) while that of the BP-SVMs [5] Y. A. Ivanov and A. F. Bobick. Recognition of visual activities and interactions
by stochastic parsing. IEEE T PAMI, 22(8):852–872, 2000.
is O(n · k · d∗ · r), as discussed in Subsection 4.2.
Table 1 compares the classification accuracies measured with the UT-Interaction dataset #1. Each accuracy in the table can be viewed as the highest performance one can get using the method with the optimum parameters and visual words. Together with the classification accuracies given videos containing full activity executions, the table lists the prediction accuracies measured with the videos of the observed ratio 0.5. We are able to observe that our approaches perform superior to the previous methods given the videos with the observed ratio 0.5, confirming that they predict ongoing activities earlier. In the traditional classification task (i.e. full videos), our approach performed comparably to the state-of-the-art results.

Table 1. Recognition performances measured with the UT-Interaction dataset #1. The classification accuracies of our approaches and the previous approaches are compared. Most of the listed classification approaches used 3-D spatio-temporal features. In addition, [17] took advantage of automatically extracted bounding boxes of persons.

System                  | Accuracy w. half videos | Accuracy w. full videos
Dynamic BoW             | 70.0 %                  | 85.0 %
Integral BoW            | 65.0 %                  | 81.7 %
Waltisberg et al. [17]  | -                       | 88.0 %
Cuboid + SVMs [14]      | 31.7 %                  | 85.0 %
BP-SVM [7]              | 65.0 %                  | 83.3 %
Yu et al. [18]          | -                       | 83.3 %
Yuan et al. [19]        | -                       | 78.2 %
Cuboid + Bayesian       | 25.0 %                  | 71.7 %

Our algorithms run in real-time with our unoptimized C++ implementation, except for the adopted feature extraction component. That is, if appropriate features are provided to our system in real-time, the system is able to analyze input videos and predict ongoing activities in real-time.
7. Conclusion

In this paper, we introduced the new paradigm of human activity prediction. The motivation was to enable the early detection of unfinished activities from initial observations. We formulated the problem probabilistically, and presented two novel recognition methodologies designed for the efficient prediction of human activities. The experimental results confirmed that the proposed approaches are able to recognize ongoing human-human interactions at a much earlier stage than the previous methods.

References

[1] J. K. Aggarwal and M. S. Ryoo. Human activity analysis: A review. ACM Computing Surveys, 43:16:1–16:43, April 2011.
[2] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In ICCV, 2005.
[3] J. W. Davis and A. Tyagi. Minimal-latency human action recognition using reliable-inference. Image and Vision Computing, 24:455–472, May 2006.
[4] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In IEEE Workshop on VS-PETS, 2005.
[5] Y. A. Ivanov and A. F. Bobick. Recognition of visual activities and interactions by stochastic parsing. IEEE T-PAMI, 22(8):852–872, 2000.
[6] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[7] K. Laviers, G. Sukthankar, M. Molineaux, and D. W. Aha. Improving offensive performance through opponent modeling. In Artificial Intelligence for Interactive Digital Entertainment Conference, 2009.
[8] F. Lv and R. Nevatia. Single view human action recognition using key pose matching and Viterbi path searching. In CVPR, 2007.
[9] J. C. Niebles and L. Fei-Fei. A hierarchical model of shape and appearance for human actions classification. In CVPR, 2007.
[10] J. C. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. IJCV, 79(3), Sep 2008.
[11] F. Porikli. Integral histogram: A fast way to extract histograms in Cartesian spaces. In CVPR, 2005.
[12] M. S. Ryoo and J. K. Aggarwal. Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In ICCV, 2009.
[13] M. S. Ryoo and J. K. Aggarwal. UT-Interaction Dataset, ICPR Contest on Semantic Description of Human Activities (SDHA), 2010.
[14] M. S. Ryoo, C.-C. Chen, J. K. Aggarwal, and A. Roy-Chowdhury. An overview of contest on semantic description of human activities (SDHA) 2010. In Proceedings of ICPR Contests, 2010.
[15] M. S. Ryoo, K. Grauman, and J. K. Aggarwal. A task-driven intelligent workspace system to provide guidance feedback. CVIU, 114(5):520–534, May 2010.
[16] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: a local SVM approach. In ICPR, 2004.
[17] D. Waltisberg, A. Yao, J. Gall, and L. V. Gool. Variations of a Hough-voting action recognition system. In ICPR Contest on Semantic Description of Human Activities (SDHA), in Proceedings of ICPR Contests, 2010.
[18] T.-H. Yu, T.-K. Kim, and R. Cipolla. Real-time action recognition by spatiotemporal semantic and structural forests. In BMVC, 2010.
[19] F. Yuan, V. Prinet, and J. Yuan. Middle-level representation for human activities recognition: the role of spatio-temporal relationships. In ECCV Workshop on Human Motion: Understanding, Modeling, Capture and Animation, 2010.