
Video Summarization Using Clustering

Tommy Chheng
Department of Computer Science
University of California, Irvine
tchheng@uci.edu

Abstract

In this paper, we approach the problem of video summarization. We propose an
automated algorithm to identify the unique segments of a video. The video
segments are grouped using k-means clustering, with the Euclidean distance
between the histograms of the corresponding segments as the distance metric.
YouTube videos are used to test our procedure.

1 Introduction
We have seen YouTube and other media sources pushing the bounds of video consumption
in the past few years. As media sources compete for more of a viewer's time every day,
one possible alleviation is a video summarization system. A movie teaser is an example
of a video summary; however, not everyone has the time to edit their videos into a concise
version. See [2] for a more detailed description of the problem statement.
This paper highlights a fast and efficient algorithm for creating a video summary using
k-means clustering with RGB histograms. It is aimed particularly at low-quality media,
specifically YouTube videos.

2 Approach
An outline of our system is as follows:

1. Split the input file into time segments of s seconds each: f0 ...fn .

2. Take the first frame of each segment and let it represent the segment. We
assign these frames x0 ...xn .
3. Compute the histograms of x0 ...xn and assign them y0 ...yn .
4. Cluster the histograms (y0 ...yn ) into k groups using k-means, with Euclidean
distance as the error function.
5. Select segments round robin: iterate through the k groups, choose a segment
at random from each cluster, and add it to a list l until the desired number
of segments is chosen.
6. Join the segments in list l to generate the video summary.
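As an illustration, the round-robin selection step can be sketched as follows. The function name, its arguments, and the seeded random source are our own assumptions for this sketch, not the system's actual code:

```python
import random

def round_robin_select(clusters, num_segments, seed=0):
    """Cycle through the clusters, picking one random segment from each,
    until the desired number of segments is chosen."""
    rng = random.Random(seed)
    pools = [list(c) for c in clusters if c]  # drop empty clusters
    chosen = []
    i = 0
    while len(chosen) < num_segments and any(pools):
        pool = pools[i % len(pools)]
        if pool:
            chosen.append(pool.pop(rng.randrange(len(pool))))
        i += 1
    return chosen  # segments are then joined in this order
```

Note that the chosen segments are joined in selection order rather than temporal order, which is consistent with the dispersed-credits behavior discussed in the results.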

A diagram of the system can be seen in Figure 1.


Figure 1: Overview

Figure 2: RGB Histogram

2.1 Feature Selection

We selected RGB color histograms as our feature due to their global nature
and speed of processing. In Rui's Unified Video Summarization system [2], he cites
histograms as a good trade-off between accuracy and speed. Additionally, Valdés' work [1]
for the TRECVID 2007 Rushes Task cites video summarization methods based on
histograms as comparable to other features but without the performance loss. One
particular attribute of histograms is their global content: a histogram is a frequency
representation that compresses the information of a video frame into a vector, where
each entry is a count of a color. Histograms lose spatial information, but in a task like
video summarization the spatial information may not be needed. The majority of YouTube
videos are of lower quality, so extracting more sophisticated features tends to be
difficult. Histograms can perform well because they do not attempt to infer any semantic
meaning from the segments.
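A minimal sketch of such a histogram extraction, assuming frames arrive as H x W x 3 uint8 RGB arrays (the bin count and function name here are illustrative assumptions):

```python
import numpy as np

def rgb_histogram(frame, bins=8):
    """Compress an RGB frame (H x W x 3, uint8) into a color-count vector:
    a per-channel histogram, concatenated. Spatial layout is discarded,
    reflecting the global nature of the feature."""
    hists = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    return np.concatenate(hists).astype(float)
```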

2.2 K-means Clustering

For our task, we chose an unsupervised learning approach because of the lack of prior
knowledge about Internet videos. We use k-means clustering to group together related
scenes.

2.2.1 Algorithm
We want to group all the similar histograms into k clusters. Each histogram represents
its corresponding video segment. Our version of the k-means algorithm is defined below:

1. Select k random centroid points in our multi-dimensional space.

2. Compute the distance from each histogram to every cluster centroid.
3. Assign each histogram to the cluster that minimizes the error function.
4. Recompute the cluster centroids.
5. Check whether the centroids have converged; if not, go to step 2.
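The steps above can be sketched as a standard Lloyd's iteration over the histogram vectors. The function signature, seed, and empty-cluster handling below are our own assumptions, not the paper's actual implementation:

```python
import numpy as np

def kmeans(Y, k, max_iter=100, seed=0):
    """Cluster histogram vectors Y (n x d) into k groups.
    Returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # step 1: pick k distinct histograms as initial centroids
    centroids = Y[rng.choice(len(Y), size=k, replace=False)]
    for _ in range(max_iter):
        # steps 2-3: assign each histogram to its nearest centroid
        dists = np.linalg.norm(Y[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # step 4: recompute centroids (keep the old one if a cluster empties)
        new = np.array([Y[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        # step 5: stop when the centroids have converged
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, assign
```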

2.2.2 Error function


We use Euclidean distance as our error function. This is the general approach when directly
comparing histograms.
S = \sqrt{\sum_{i=1}^{I} (x_i - y_i)^2}    (1)

We also experimented with cosine similarity and saw no noticeable difference in the
clustering output.
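Both measures are straightforward to express over histogram vectors; the following is an illustrative sketch, not the system's actual code:

```python
import numpy as np

def euclidean(x, y):
    """The error function above: L2 distance between two histogram vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sqrt(np.sum((x - y) ** 2))

def cosine_similarity(x, y):
    """The alternative we tried; 1.0 means the vectors point the same way."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
```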

3 Results

We selected k = 8 as our k-means parameter and used 20 segments for the output video.

3.1 Dataset

We processed the following YouTube videos in our system. All of these videos are 320x240.

1. MotoGP: Recent round of the world motorcycle racing series. This represents a
typical sports video.
2. Chad Vader: A typical comedy video.
3. Tour of LA beaches: A semi-edited amateur web video.
4. Man vs Wild: a full television episode.

3.2 Clusters Generated

We see some interesting and useful results. In the Tour of LA beaches video shown in
Figure 3, the clustering grouped the beach, boardwalk, and indoor scenes into separate
clusters. This is a good summary for viewers because it shows all the major sections of
the video clip.
When we clustered the MotoGP clip, it was able to separate all the action footage from the
pit stand footage. This is particularly useful for viewers who only want to watch the race
and not the pit stand.
In Figure 5, the Chad Vader video clip separated all the credits into one cluster. This
has a negative side effect on summary creation: since we use a round-robin approach
for segment joining, the credits were dispersed throughout the summary.
In the Man vs Wild episode, the algorithm correctly clustered the different segments.
It helped that each distinct set of segments had strong internal color similarity: when
Bear (the host) was in the desert, the colors had higher intensity, while in the Florida
Everglades the colors were lower in intensity.

3.3 Performance

The majority of our runtime is processing overhead, including the histogram extraction.
In each iteration of k-means clustering, the n frames are compared against the k
centroids. The number of iterations is roughly constant; convergence took approximately
10 iterations. This gives an O(kn) runtime for the clustering algorithm, which is
certainly scalable for production use.

Figure 3: Tour of LA beaches clusters: Each row is a cluster.

Figure 4: MotoGP clusters

Figure 5: Chad Vader clusters

Figure 6: Man vs Wild clusters

Name                   Video Duration   Processing Time
MotoGP                 9:53             15 seconds
Chad Vader             5:33             22 seconds
Tour of LA beaches     8:46             20 seconds
Man vs Wild Episode    50:00            2 minutes 59 seconds

Figure 7: Performance runtime

4 Problems
4.1 Repeated segments

We run into problems of repeated segments when dealing with static images in the videos.
When a static image is present for a long time, two or more segments are created from the
same image. During clustering, all of the segments containing the static image are placed
in the same group, and the round-robin segment selection then litters these static images
throughout the summary video. This was the case in the Tour of the LA Beaches video, as
seen in Figure 3.

4.2 Background

In the MotoGP video clip, the majority of the segments consist of the road in the
background. Our algorithm grouped most of these shots into one cluster. The intended
behavior would be to capture the different teams in different clusters, because each team
has a unique color scheme. However, the background dominated and grouped most of these
segments together. It would be interesting future work to see whether two levels of
clustering would help: one for the initial segments and another sub-clustering within
each set.

5 Conclusion
We have presented a system to automatically create a summarized video from a YouTube
video. K-means is a simple and effective method for clustering similar frames together.
Our system is modular in design, so future work can substitute various components.
Instead of histograms, other features such as motion vectors or even audio could be used.
However, we have demonstrated that a simple feature combined with a simple unsupervised
learning technique is a good starting point for a video summarization system.

Acknowledgments
Thanks to Deva Ramanan and the CS273 class for the experience in Machine Learning.

References
[1] Víctor Valdés and José M. Martínez. On-line video skimming based on histogram
similarity. In TVS '07: Proceedings of the International Workshop on TRECVID Video
Summarization, pages 94-98, New York, NY, USA, 2007. ACM.
[2] Yong Rui, Ziyou Xiong, Regunathan Radhakrishnan, Ajay Divakaran, and
Thomas S. Huang. Unified framework for video summarization. MERL, Sept 2004.
http://www.merl.com/publications/TR2004-115/.
