
Digital Video Segmentation

Arun Hampapur, Ramesh Jain* and Terry Weymouth
Artificial Intelligence Laboratory
Electrical Engineering and Computer Science
University of Michigan
1101 Beal Ave, Ann Arbor, MI 48109-2110
arun@eecs.umich.edu

*Now at the University of California at San Diego, La Jolla, CA

ABSTRACT

The data driven, bottom up approach to video segmentation has ignored the inherent structure that exists in video. This work uses the model driven approach to digital video segmentation. Mathematical models of video based on video production techniques are formulated. These models are used to classify the edit effects used in video and film production. The classes and models are used to systematically design the feature detectors for detecting edit effects in digital video. Digital video segmentation is formulated as a feature based classification problem. Experimental results from segmenting cable television programming with cuts, fades, dissolves and page translate edits are presented.

1 INTRODUCTION

The data volume in multimedia is dominated by its digital video component. The effective management of this component is essential to the development of successful multimedia systems. Digital Video Segmentation (the problem of decomposing video into its component units) is one of the first steps in digital video management. This paper presents a new model driven approach to digital video segmentation.

Several researchers have addressed the problem of digital video segmentation. The problem has been posed as automatic location of camera motion breaks in video sequences. The emphasis of existing research is on designing digital image processing operators for identifying camera motion breaks. Nagasaka and Tanaka [12] have presented work on detecting shot boundaries in digital video. They have evaluated a number of image processing measures for detecting cut edits in video. They conclude that an image subwindow based histogram comparison measurement is the best for detecting cuts. Zhang, Kankanhalli and Smoliar [14] have also presented the evaluation of different image processing routines for detection of cut edits. They have also addressed the problem of detecting special effects like fades and dissolves and propose an approach which uses a dual threshold to detect gradual transitions. Arman, Hsu and Chiu [1] have presented techniques which operate directly on compressed video to detect shot boundaries. This technique relies on the properties of the coefficients of the discrete cosine transform used in encoding the video to detect the transitions.

All the existing approaches have addressed the problem of digital video segmentation from a purely data analysis point of view. A study of video production techniques [2] reveals several constraints originating in the production process which can be used in the design of the video segmentation system. This work approaches the problem of digital video segmentation by proposing a model for video, based on the production process. Video edit effects are classified based on these models. The edit effect models are used to design feature detectors, which are used in a feature based classification approach to segment the video.

Section 2 presents the definition of some terms used in the paper. Section 3 discusses modeling of digital video and proposes a classification of edits. The design of feature detectors based on the models is addressed in section 4. Feature based classification is discussed in section 5. Experimental results and sample feature plots are presented in section 6. A summary of the work and future directions concludes the paper.

2 TERMINOLOGY DEFINITION

Some of the terms used in the paper are defined below:

Image: A digitized representation of a picture. An image has a number of discrete pixel locations and is represented by $E(x, y) = (r, g, b)$, where $x \in (1..M)$, $y \in (1..N)$, $(x, y)$ represents the location of a pixel within the image, $M \times N$ represents the size of the image and $(r, g, b)$ represents the brightness values in the red, green and blue bands respectively.

Image Sequence: A set of images that are indexed by time. An image sequence is represented by $E(\bar{s}) = \bar{i}$, where $\bar{s} = (x, y, t)$, $\bar{i} = (r, g, b)$, and $t$ represents the temporal index.


Feature: A measurement or set of measurements made from an image sequence. A feature can be a function of individual images in the sequence or some subset of images from the sequence.


Shot: An image sequence which presents continuous action which appears to be from a single operation of the camera [10, 3]. In other words, it is the sequence of images that is generated by the camera from the time it begins recording images to the time it stops recording images.



Figure 1: Video Production.

Video: An image sequence which is generated by composing several shots by a process called editing. This is also referred to as the final cut.

Figure 2: Feature based classification.

Difference Image: An image generated by taking the pixel wise difference between two consecutive frames in an image sequence [8, 9]. This is the partial derivative ($\partial E / \partial t$) of an image sequence with reference to time. This is also referred to as the inter frame change image.

Edit Frames: The set of images generated during the process of editing two shots.

Scene Activity: Changes that occur in the video caused by changes that occurred in the world during the production process. For example, changes in the image sequence due to movement of objects, movement of the camera or changes in lighting.

Edit Activity: Changes that are introduced into the video during the editing process, like cuts, fades and dissolves.

Sequence Activity Graph: A graph indicating the activity in progress over time in an image sequence.

3 MODELING DIGITAL VIDEO

The process of making a video or film involves the production of individual shots. The editor assembles the shots into the final cut. The process of editing may introduce additional frames into the final cut. Figure 1 shows the process of video production and the structure of a video. The problem of digital video segmentation can be posed as shot boundary detection, i.e., locate the points $S_{1b}, S_{1e}, S_{2b}, S_{2e}$ in the video $V$. The problem can also be posed as edit detection, i.e., locate the points $E_b, E_e$ in the video $V$. These two ways of posing the problem are analogous to the region growing vs edge detection problem of image segmentation. The shot boundary detection approach requires analytical models of shots and the edit detection approach requires models of edits. The space of all possible shots of video is very ill defined, and the cost of developing models for shots of video is very high because of the large number of degrees of freedom available in shot production. However the space of edits is much smaller and it is much easier to develop analytical models of edits. Hence the problem of digital video segmentation is posed as a problem of Edit Detection in this work.

3.1 EDIT EFFECT MODELS

The detection of edits in a video sequence requires the use of models for the edits. The edit effect model presented here is an Image Transform of the two shots involved in the editing process. Let $E(x, y, t)$ represent the edit frames from editing two shots $S_1$ and $S_2$:

$$E(x, y, t) = S_1(\bar{s} \times T_{s1}) \times T_{c1} \;\oplus\; S_2(\bar{s} \times T_{s2}) \times T_{c2} \qquad (1)$$

where $\bar{s} = (x, y, t, 1)$ is the space of pixels in a shot and $\bar{i} = (r, g, b, 1)$ the intensity (color) values in a shot. The use of homogeneous coordinates to represent the pixel space and the color space allows the use of affine 2D transforms [13] to represent edit effects. $T_s$ is the transformation applied to the pixel space $\bar{s}$, and $T_c$ represents the transformation applied to $\bar{i}$ by the editor. $\oplus$ represents the way in which the two sequences are combined during the edit. $\Phi$ represents the identity transform. Given two video shots, the options of the editor are summarized in table 1. Each of these possibilities leads to a different class of edits, and the values in the transformation matrices determine the specific type of edit. Modeling the edit as a spatio-chromatic transformation over time provides the flexibility of being able to represent a wide variety of edits under the same formalism.

| Edit Type | $T_s$ | $T_c$ | Meaning | Examples |
|---|---|---|---|---|
| Null | $\Phi$ | $\Phi$ | Concatenate | Cut |
| Spatial | $T_s$ | $\Phi$ | Manipulate pixel space | Translate, Page, Wipe |
| Chromatic | $\Phi$ | $T_c$ | Manipulate intensity space | Fade, Dissolve |
| Spatio-chromatic | $T_s$ | $T_c$ | Manipulate pixels and intensity | Morphing |

Table 1: Edit Types
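As a concrete illustration of the model, the following is a minimal sketch (not from the paper: the array shapes, the linear ramp for $T_c$, and the use of addition for $\oplus$ are assumptions made for the sketch) of how a purely chromatic edit instantiates equation (1) with both spatial transforms set to the identity:

```python
import numpy as np

def chromatic_edit_frame(s1, s2, t, t_start, length):
    """One edit frame per equation (1) with T_s = identity: each shot is
    chromatically scaled (a linear ramp standing in for T_c) and the two
    results are combined by addition, playing the role of 'oplus'.
    s1, s2 are (H, W, 3) float arrays; t is the frame index."""
    alpha = np.clip((t - t_start) / float(length), 0.0, 1.0)  # fade-in weight for s2
    return s1 * (1.0 - alpha) + s2 * alpha                    # dissolve-style combination

# Example: a 10-frame dissolve between two constant dummy shots.
shot1 = np.full((4, 4, 3), 200.0)
shot2 = np.full((4, 4, 3), 50.0)
frames = [chromatic_edit_frame(shot1, shot2, t, t_start=0, length=10) for t in range(10)]
```

A spatial edit would instead warp the pixel coordinates of each shot before combination, which is the $T_s \neq \Phi$ row of table 1.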

4 VIDEO SEGMENTATION USING FEATURE BASED CLASSIFICATION

This work adopts a feature based classification approach to the problem of edit detection. In this approach measurements made from the video (image sequence) are used to decide which frame in the video is an edit frame as opposed to a shot frame. The first step in the formulation of the problem is to identify the features which respond to each of the edit classes to be detected. The second is to classify the video frames based on these features. Figure 2 illustrates the process. Feature vectors extracted from the video are used in conjunction with a video model to classify and segment the video into edits and shots. The task of video segmentation can be achieved under a feature based classification formalism by first assigning to each frame in the video either the label shot or edit. Once the labels are assigned a simple filtering technique can be used to detect the segment boundaries. The initial set of labels can be assigned by extracting features that respond to cuts, spatial edits and chromatic edits.


4.1 CUT DETECTION

A cut is a null edit, i.e. the transformation matrices for a cut are $T_s = T_c = \Phi$. A cut by itself has no model; hence designing a cut detector requires information about the shots being edited. Several other research efforts have addressed the problem of cut detection based on frame comparisons [12, 14, 1]. The basic idea used in all the frame comparison approaches is that frames belonging to a single shot are more similar than frames belonging to different shots. Thus all the different approaches present the use of different measurements for comparing frames. However, since there is no underlying model for a cut, the only way of validating different measurements is based on experimental analysis. Experiments on a wide variety of sequences were carried out using different operators to detect cuts. Some of the measurements used were Gray Level Template Match, Gray Level Histogram Difference, Chi Square comparison of gray level histograms and Average Intensity Difference. The detailed definition of these measures can be found in [12, 14, 1]. Based on the experimental study conducted it was found that the average intensity difference offered a reasonable tradeoff between the cost of computational resources and cut detection performance. The exact form of the Average Intensity Measurement is given below: $F_c$ is the cut feature, $F_{ccr}, F_{ccg}, F_{ccb}$ are the average intensities over the entire image in the R, G, B color bands respectively, and $A(V(t))$ is the average intensity over the entire image for the $t$-th frame in the video.

$$F_c(t) = (F_{ccr}(t) + F_{ccg}(t) + F_{ccb}(t)) \qquad (2)$$

The cut feature is the difference in average intensity between adjacent frames, taken as a forward or a backward difference:

$$F_{ca}(t) = |A(V(t)) - A(V(t+1))| \qquad (3)$$

$$F_{ca}(t) = |A(V(t-1)) - A(V(t))| \qquad (4)$$
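A minimal sketch of this measurement in Python with NumPy (the function names and the (T, H, W, 3) frame layout are assumptions, not the paper's implementation):

```python
import numpy as np

def average_intensity(frame):
    """A(V(t)): average intensity over an entire (H, W, 3) frame, i.e. the
    combination of the per-band averages Fccr, Fccg, Fccb of equation (2)."""
    return float(frame.mean())

def cut_feature(video):
    """Fca(t) = |A(V(t)) - A(V(t+1))| for a (T, H, W, 3) video array
    (equation 3). An isolated spike in this signal suggests a cut."""
    avgs = np.array([average_intensity(f) for f in video])
    return np.abs(np.diff(avgs))

# Example: a synthetic 6-frame video with a cut after frame 2.
video = np.concatenate([np.full((3, 8, 8, 3), 40.0), np.full((3, 8, 8, 3), 180.0)])
print(cut_feature(video))  # large value at index 2
```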

The feature presented above and the features discussed in [12, 14, 1] are not optimized for detecting cuts between any particular types of shots. Better cut detectors can be designed by optimizing the features to respond to cuts between specific categories of shots. This will require a categorization of shots and feature design for cuts between various combinations of shots. This approach to cut feature design will be addressed in future research.

4.2 CHROMATIC EDIT DETECTOR

A chromatic edit is achieved by manipulating the color or intensity space of the two shots being edited. From table 1 a chromatic edit has $T_s = \Phi$ and the transform $T_c$ can take on different types of values depending on the specific type of chromatic edit. Fades and dissolves are the two most prevalent types of chromatic edits. The following discussions present the design of feature detectors that respond to fades and dissolves.

The function of a chromatic edit detection feature is to discriminate between intensity changes in the video due to scene activity as opposed to intensity changes due to chromatic editing. The key difference between the intensity change introduced into a video by scene activity as opposed to chromatic editing is the uniformity of the change: the changes due to editing are more uniform than naturally occurring changes. Thus the key idea behind the design of the chromatic feature detector is to amplify uniform intensity changes in the video.

4.3 FADES AND DISSOLVES

A fade is a gradual transition in the image sequence. The picture gradually darkens to black in the case of a fade out and gradually brightens in the case of a fade in [10]. A dissolve is a simultaneous application of a fade in and a fade out to the two shots being edited. A fade is typically achieved using an optical printer or a video editing suite. The fade in (fade out) effect is achieved by gradually increasing (decreasing) the light intensity during the optical printing or video editing process. The effect of changing the projection light intensity or image brightness over the entire image is a chromatic editing operation (table 1). The exact model of the fade operation is an image intensity scaling. The edit frames generated by the fade and dissolve operations are image sequences which have the chromatic scaling feature. A global view of the relationships between such sequences and the space of all image sequences is shown in figure 3.

$$F_I \subset D \subset CS \subset CA \qquad (5)$$

$$F_O \subset D \subset CS \subset CA \qquad (6)$$

Equations (5, 6) are the set relationships between the various classes of edit effect sequences, where $F_I$ denotes fade in sequences, $F_O$ fade out sequences, $D$ dissolves, $CS$ chromatic scaling sequences and $CA$ the space of all image sequences. This illustrates that the space of chromatic scaling sequences covers the spaces of fade ins, fade outs and dissolves. Thus the problem of detecting fades and dissolves can be treated as designing a chromatic scaling sequence detector and verifying that the detector can be used to detect both fades and dissolves.

Figure 3: Relationship between image sequences, fade and dissolve sequences.

4.4 CHROMATIC SCALING

This section presents the model for the chromatic scaling effect and the operations necessary to detect chromatic scaling in a digital video sequence. Consider an image sequence $g(x, y, t)$. Let a rigid scaling be applied to the sequence over the length of $l_s$ frames. Then the model for the output of the scaling operation is:

$$V(x, y, t) = g(x, y, t) \left(1 - \frac{t}{l_s}\right) \qquad (7)$$

Equation 7 is a model of the chromatic scaling operation. The following operations are necessary to detect a chromatic scaling operation in an image sequence. Firstly, differentiate $V(t)$ (equation 7) with respect to time:

$$\frac{\partial V}{\partial t} = -\frac{g(x, y, t)}{l_s} \qquad (8)$$

Equation 8 can be rewritten as

$$V_{ds}(t) = \frac{\partial V}{\partial t} \times \frac{1}{g(x, y, t)} = -\frac{1}{l_s} \qquad (9)$$

In equation 9, $V_{ds}$, the scaled first order difference image [8, 9], is a constant image with the constant value being proportional to the fade rate. A simple function based on the distribution of intensities can be designed to provide a measure of the constancy of an image. Let $F_{cs}$ represent the chromatic scaling feature.

$$F_{cs}(t) = Constant(V_{ds}(t)) \qquad (10)$$

where $Constant$ is an operation that measures the constancy of the image $V_{ds}$. $F_{cs}$ is the feature that can be used to verify the presence of the chromatic scaling effect in any image sequence. Computing $F_{cs}$ involves a simple image difference, image division and an image constancy measurement (presented in section 4.7).
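The following is a minimal sketch of these operations (difference, division, constancy check) in Python; the epsilon guard and the function signature are assumptions, and any constancy measure such as the one of section 4.7 can be passed in:

```python
def chromatic_scaling_feature(prev_frame, frame, constancy, eps=1e-6):
    """Fcs(t) = Constant(Vds(t)) per equations (9, 10): the first order
    difference image is divided by the current gray level image; during a
    fade the quotient approximates the constant fade rate -1/ls.
    Frames are 2D gray scale float arrays; 'constancy' is any image
    constancy measure, e.g. the one sketched in section 4.7."""
    diff = frame - prev_frame      # approximates dV/dt
    v_ds = diff / (frame + eps)    # scaled difference image Vds
    return constancy(v_ds)
```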

4.5 FADES AND DISSOLVES AS CHROMATIC SCALING

The fade and dissolve operations can be represented as some combination of chromatic scaling operations. Once fades and dissolves are modeled in terms of the chromatic scaling operations, the next step is to verify that the chromatic scaling detector can be used to extract the features necessary to detect fades and dissolves in image sequences.

Fade Detection

Two types of fades are commonly used in commercial video production, fade in from black and fade out to black. In both these cases the fades can be modelled as chromatic scaling operations with a positive and negative fade rate. $E_{fo}$ (equation 11) represents the sequence of images generated by fading out a video $g_1$ to black; $l_1$ is the fade out rate in terms of the number of frames and $\bar{O}$ represents the black image sequence. Similarly, $E_{fi}$ (equation 12) represents the sequence of images generated by fading in a sequence $g_2$ at the rate of $l_2$.

$$E_{fo}(x, y, t) = g_1(x, y) \left(\frac{l_1 - t}{l_1}\right) \qquad (11)$$

$$E_{fi}(x, y, t) = \bar{O} + g_2(x, y) \left(\frac{t}{l_2}\right) \qquad (12)$$

Comparing equations 1 and 11, for a fade out one of the shots is $S_1 = g_1$ and $S_2 = \bar{O}$. The equations (11, 12) represent how the fade operation maps on to the edit effect model of equation 1. Since the operations on the individual sequences in the fade are chromatic scaling operations, the chromatic scaling feature (equation 9) can be used for detecting fades in videos.

Dissolve Detection

A dissolve is a chromatic scaling of two shots simultaneously. There are two parameters that can be controlled for each shot during a dissolve:

Start Time: $t$, the time at which the scaling of the sequence starts.

Dissolve Length: $l$, the duration for which the scaling of a sequence lasts.

The relative values of the above parameters for the two sequences can be used to group dissolves into different classes. Such a grouping can be used to analyze the effectiveness of the detection approach. Let $E_d$ be the set of edit frames generated by dissolving two shots $g_1$ and $g_2$. The shot $g_1$ is referred to as the Out Shot and the shot $g_2$ is referred to as the In Shot. Equation (13) models the process of dissolving two shots.

$$E_d(x, y, t) = g_1(x, y) \left(\frac{l_1 - (t - t_1)}{l_1}\right) \;\oplus\; g_2(x, y) \left(\frac{t - t_2}{l_2}\right) \qquad (13)$$

Here $l_1$ and $l_2$ are the dissolve lengths of the two shots in the dissolve, and $t_1$ and $t_2$ are the start times of $g_1$ and $g_2$. Comparing equations (11, 12) and equation 13 it can be seen that the dissolve is a combination of the fade in and fade out operations occurring simultaneously on two different shots. A dissolve is a particular type of chromatic scaling (equations 5, 6). Designing the dissolve detector can now be treated as a problem of verifying that the chromatic scaling detector (equation 10) can be used to detect the dissolve. The approach followed in this work is to classify the dissolves into groups based on their qualitative properties and to verify the detectability of each group using the chromatic scaling operator.

Figure (4) presents the sequence activity graph (SAG) during a dissolve edit. It shows the qualitative labeling of dissolves. The hatched area indicates the Out Shot activity and the filled area the In Shot. A positive slope in the SAG indicates a fade in operation and the negative slope indicates a fade out operation. A zero slope in the SAG indicates no sequence activity due to editing.

The basis used for the qualitative labeling is the start time and the dissolve length. The labels are based on the Shot of Initial Activity and Dominating Shot attributes of the sequence, which are defined as follows:

Shot of Initial Activity: This is defined as the shot $s$, where

$s = In$ if $t_1 > t_2$ (14)

$s = Out$ if $t_1 < t_2$ (15)

$s = Both$ if $t_1 = t_2$ (16)

where the equations have the following meanings:

Equation 14: Fade In begins before Fade Out.
Equation 15: Fade Out begins before Fade In.
Equation 16: Fade In and Fade Out begin together.

Dominating Shot: This is defined as the shot $s$, where

$s = In$ if $l_1 < l_2$ (17)

$s = Out$ if $l_1 > l_2$ (18)

$s = Equal$ if $l_1 = l_2$ (19)

where the equations have the following meanings:

Equation 17: In Shot dominates the dissolve.
Equation 18: Out Shot dominates the dissolve.
Equation 19: No Shot dominates the dissolve.
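These definitions amount to a simple rule table; the following sketch of the qualitative labeling mirrors equations (14)-(19) directly (the parameter and label names are taken from the text):

```python
def label_dissolve(t1, l1, t2, l2):
    """Qualitative dissolve label per equations (14)-(19). t1, l1 are the
    start time and length of the Out Shot's fade out; t2, l2 those of the
    In Shot's fade in. Returns (shot_of_initial_activity, dominating_shot)."""
    initial = "In" if t1 > t2 else ("Out" if t1 < t2 else "Both")
    dominating = "In" if l1 < l2 else ("Out" if l1 > l2 else "Equal")
    return initial, dominating

# A fade in that starts first combined with a shorter fade out -> ('In', 'In')
print(label_dissolve(t1=5, l1=10, t2=2, l2=20))
```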

A shot is said to dominate the dissolve if its activity slope is higher; in other words, if the shot contributes more to the inter frame change [8, 9] in the video sequence. From figure 4 it can be seen:

1. that except in the case of Both-Equal type of dissolves, all the other types have portions during which there is an exclusively single sequence chromatic scaling in progress.

2. that except in the case of Equal Dominance sequences the change contribution of one sequence dominates the other.

Thus the cases in which the chromatic scaling feature (equation 10) will not respond to the dissolve are those in which very similar sequences are being dissolved with precisely equal fade rates over the dissolve. In most commercially produced video sequences dissolves are seldom executed with such precision and dissolving very similar sequences is avoided. Hence the chromatic scaling feature can be used to detect most dissolves. The experimental studies conducted confirm these observations over a wide range of commercial video.

Figure 4: Sequence Activity Graph (SAG) during a dissolve, showing the qualitative dissolve classification.

4.6 SPATIAL EDIT DETECTOR

Spatial edits are achieved by transforming the pixel space of the shots being edited. The transformation matrix is $T_c = \Phi$ and $T_s$ takes on different values depending on the specific type of spatial edit. One of the most commonly used edits is the page translate, where the shot preceding the edit is translated out of the viewport, uncovering the shot that follows the edit. This type of edit is used as an exemplar of the class of spatial edits and a feature derivation is presented. Similar ideas can be used to design features for translations in various directions and other types of spatial edits.

Consider a video $V(t)$ with a translate x spatial edit. In such an edit the first shot translates out, uncovering the second shot. Let $V(t)$ be a gray scale sequence. Let $l_1$, $l_e$ be the length of shot $S_1$ and the length of the edit respectively. Let $g_1(x, y, t)$ be the gray scale representation of the first shot. The translate edit can now be modeled as

$$E(x, y, t) = g_1(x + tr_x \cdot t, \; y + tr_y \cdot t) \qquad (20)$$

In the case of a pure spatial edit the brightness of a particular point does not change over time; the change in the video is caused by the motion of the point due to the edit. This fact can be used to write down the constant brightness equation [11, 5, 7] for the edit:

$$\frac{dE}{dt} = 0 \qquad (21)$$

Using the chain rule for differentiation, equation 21 can be rewritten as equation 22:

$$\frac{\partial E}{\partial x}\frac{dx}{dt} + \frac{\partial E}{\partial y}\frac{dy}{dt} + \frac{\partial E}{\partial t} = 0 \qquad (22)$$

Substituting for $E$ from equation 20 in equation 22:

$$\frac{\partial E}{\partial x}\frac{d(x + tr_x \cdot t)}{dt} + \frac{\partial E}{\partial y}\frac{d(y + tr_y \cdot t)}{dt} + \frac{\partial E}{\partial t} = 0 \qquad (23)$$

which can be rewritten as

$$\frac{\partial E}{\partial x}\left(\frac{dx}{dt} + tr_x\right) + \frac{\partial E}{\partial y}\left(\frac{dy}{dt} + tr_y\right) + \frac{\partial E}{\partial t} = 0 \qquad (24)$$

Assuming that there is no scene action in progress during the edit (i.e. the first shot freezes before the translation begins), there will be no relative changes in the image due to scene motion. Hence $\frac{dx}{dt} = \frac{dy}{dt} = 0$. Therefore equation 24 can be rewritten as follows:

$$\frac{\partial E}{\partial x} tr_x + \frac{\partial E}{\partial y} tr_y + \frac{\partial E}{\partial t} = 0 \qquad (25)$$

For the case of pure translation in the x direction, $tr_y = 0$. Hence:

$$V_{ds}(t) = tr_x = -\frac{\partial E / \partial t}{\partial E / \partial x} \qquad (26)$$

Equation 26 shows that in the case of the edit being a pure translation in the x direction, the scaling of the difference image by the x gradient image results in a constant image $V_{ds}$. Let $F_{sgx}$ represent the spatial translate feature.

$$F_{sgx} = Constant(V_{ds}) \qquad (27)$$

where $Constant$ denotes the measure of constancy of the scaled difference images (section 4.7). Thus the feature $F_{sgx}$ can be used as an indicator of the spatial translate in x. For the case of translation in an arbitrary direction, the same result can be applied if the gradient is computed in the direction of translation. Thus a set of features covering a set of quantized directions can be used to cover all types of translation edits. Many other types of spatial transforms, like the page turn and several other digital editing effects, can be modeled as piece wise transforms applied to image windows. A similar process can be used to design detectors for these various types of edits. Complex edits with significant special effects make the design of effective detectors more difficult.
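A minimal sketch of the x-translate feature of equations (26, 27), assuming gray scale frames and a finite difference gradient; the epsilon guard against flat regions is an implementation assumption:

```python
import numpy as np

def translate_feature(prev_frame, frame, constancy, eps=1e-6):
    """Fsgx = Constant(Vds) per equations (26, 27): for a pure x translation
    the temporal difference divided by the x gradient is the constant -tr_x.
    Frames are 2D gray scale float arrays."""
    dt = frame - prev_frame          # approximates dE/dt
    dx = np.gradient(frame, axis=1)  # approximates dE/dx
    v_ds = -dt / (dx + eps)          # candidate constant image (tr_x)
    return constancy(v_ds)
```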


4.7 THE CONSTANT IMAGE FEATURE

The features designed for both the chromatic and spatial edits show that a particular set of operations on the image sequence will yield an image which has a constant value in all its pixels. This is true in the ideal case. When the feature is extracted from a real image sequence the resulting image is seldom exactly constant, but the distribution of intensities corresponding to a positive detector response will be more uniform than for a negative detector response. A simple measurement of the uniformity of the pixel distribution in an image can therefore be designed. The gray levels in the image should have a low variance in the case of a constant image. The number of active difference pixels should be large; this requirement arises because the edit typically affects all parts of the image and hence almost all the pixels in the image should change. The active pixels should also be distributed uniformly over the entire image; the distance between the centroid of the active pixels and the physical image center can be used as a measure of this. A combination of these conditions is used as a measure of image constancy in equation 28. Let $I_c$ represent the constant image, $I_d$ the difference image, $N(I_d)$ the number of non zero pixels in the difference image, $(C_x, C_y)$ the centroid of the constant image and $(c_x, c_y)$ the image center.

Figure 5: Feature extractor block diagram.

$$Constancy = \frac{N(I_d)}{\sigma(I_c) \times (1.0 + |c_x - C_x| + |c_y - C_y|)} \qquad (28)$$

This equation was found to provide a good indication of the constancy of the image in most of the experiments.
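A sketch of one plausible implementation of equation (28); the active-pixel threshold and the use of the standard deviation as the spread term are assumptions about details the text does not fix:

```python
import numpy as np

def constancy(image, active_threshold=1.0):
    """Constancy measure in the spirit of equation (28): the score grows with
    the number of active pixels N(Id), shrinks with the spread of their
    values, and shrinks as the active-pixel centroid moves away from the
    image center."""
    active = np.abs(image) > active_threshold
    n_active = active.sum()
    if n_active == 0:
        return 0.0
    ys, xs = np.nonzero(active)
    cy, cx = (image.shape[0] - 1) / 2.0, (image.shape[1] - 1) / 2.0
    centroid_dist = abs(xs.mean() - cx) + abs(ys.mean() - cy)
    spread = image[active].std() + 1e-6
    return float(n_active / (spread * (1.0 + centroid_dist)))
```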


4.8 FEATURE DETECTOR SUMMARY

The above sections presented a systematic approach to designing features. The feature design was based on models of the different types of edit effects commonly used in the production of video. Features were designed to respond to the three classes of edits, namely cuts, chromatic effects and spatial effects. These features can now be used as the input to a classification and segmentation system which performs various filtering steps and outputs the video segments. A few sample feature responses are presented in the experimental results section. Figure 5 shows the operations used in the design of the three types of feature detectors. $F(t-1)$, $F(t)$, $F(t+1)$ are images from the video at times $(t-1, t, t+1)$. The boxes represent the operations that need to be performed on the images to derive the edit effect features.

5 CLASSIFICATION AND SEGMENTATION


Figure 6: Steps in feature based classification.

Once the features are extracted from the video, these features undergo several steps of processing before the video segments are output. This section presents the various steps involved. Figure 6 provides an illustration of the details involved in the classification and segmentation process. A modification of a standard two class discrete classifier [4] has been used to achieve the segmentation process. The lack of a priori probabilities for the various features makes the use of standard Bayesian type decision models unsuitable for this application.

Feature Thresholding: The first step in the classification and segmentation process is feature thresholding. The response space of each of the features, namely cut, chromatic edit and spatial edit, is partitioned into three regions based on feature thresholds. The regions of response are (positive, uncertain, negative). A feature with response in the uncertain region can be overridden by the other feature responses depending on the discriminant function. The thresholds for this operation can be chosen based on conditional probability distributions of the features. In the experiments reported in this paper, thresholds were chosen based on a study of the conditional distributions over various classes of shots and edits.

Discriminant Function: The discriminant function is a function designed to combine the feature responses of the three features. The output of the discriminant function thus assigns to each frame in the video one of the two labels edit, shot. The label assignment takes into account the correlation that exists between the features and the conditional distributions of the features. Based on a study of the cut, chromatic and spatial features, the discriminant function allowed uncertain feature responses to be overridden by either positive or negative responses, with positive responses winning out in the case of a tie.

Segmentation: The input to the segmentation is a two label pulse train, i.e., each frame is either called an edit or shot. Segmentation is now reduced to a problem of grouping consecutive labels into segments.
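The final grouping step is straightforward run length grouping; a minimal sketch (the label strings and the tuple layout are assumptions):

```python
def segment(labels):
    """Group a per-frame two-label pulse train into segments: each run of
    identical labels becomes one (label, start_frame, end_frame) tuple."""
    segments, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((labels[start], start, t - 1))
            start = t
    return segments

# [('shot', 0, 2), ('edit', 3, 4), ('shot', 5, 5)]
print(segment(["shot", "shot", "shot", "edit", "edit", "shot"]))
```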

6 EXPERIMENTAL RESULTS

The technique described in this paper has been applied to cable television data. The experiments used about half an hour of video that included various types of cable programming ranging from sitcoms and commercials to music television and news casts. The video data is stored on a video disk to gain frame access to the data. The following is a presentation of a set of example images and the feature responses to the images. A detailed set of experimental results and analysis can be found in [6].



Example 1: Fade Sequence. Figure 7 (left) shows images from a fade in sequence. There is a significant amount of object motion in progress close to the camera as the image sequence is being faded in. Figure 8 shows the response of the chromatic scaling feature to this sequence. The detector responds positively during the fade in part of the sequence. The response drops off considerably after the fade in, although the same object motion continues. The detector also responds to a cut.

Example 2: Dissolve Sequence. Figure 7 (middle) shows a few images taken from the dissolve sequence. The top image is from the first shot and the bottom image is from the second shot; the intermediate images are during the dissolve. The response of the chromatic scaling feature is shown in figure 9. The first peak corresponds to the sequence of images in figure 7 (middle); the rest of the response corresponds to further dissolves in the sequence. It should be noted that the detector misses a dissolve and gives a spurious response. On examining the sequences carefully it was found that the spurious response was due to a large object moving very close to the camera in a low contrast scene, and the missed response was due to a dissolve between very similar sequences in the music video.

Example 3: Translate Edits. Figure 7 (right) shows images taken from a page translate edit between two shots; the first shot is a zoom in shot and the second shot is a pan shot. Figure 10 shows the output of the translate edit detector. An observation of the result shows that the detector responds well to the edit while suppressing both pans and zooms.

Table 2 summarizes the results obtained using this algorithm. The result summary indicates an 88% correct segmentation, which implies that 12 out of every 100 edits were missed, while about 12 edits were falsely detected. The performance of the system can be improved by improving the performance of the cut detector, as the number of cuts is significantly larger than the other types of edits.

Table 2: Edit Detector Results. CD: Correctly Detected Edits; FD: Falsely Detected Edits; TN: Total Number of Edits; PC: Percentage (%) Correctly Detected; PF: Percentage (%) Falsely Detected.


Figure 7: Experimental Image Sequences: Fade, Dissolve, Translate.

Figure 8: Chromatic Scaling Feature: Fade In Sequence.

Figure 9: Chromatic Scaling Feature: Dissolve Sequence.

Figure 10: Spatial Translation Feature.

7 CONCLUSIONS AND FUTURE WORK

A modeling scheme for the production of video was used as the basis for arriving at a taxonomy of video edit effects. This taxonomy along with the models was used to systematically design video feature detectors for detecting particular classes of edits. The features were used in a feature based classification approach to implement shot segmentation. The feature detectors designed are efficient since they rely solely on difference images, scaling of images and simple average computations. Shot boundary identification is one of the preliminary processes in the design of a video data management system. The fundamental difference between the approach presented here and other existing work is the use of models of the video edit effects and the classification of these models to design low level video features. This approach leads to a systematic way of analyzing the problem and allows the effective use of domain constraints to aid in the solution of the problem. The results obtained using the modeling and classification approach were very encouraging. Future research will address the problem of improving the segmentation performance based on tuned cut detectors. Studies on the effect of individual features on the performance of the entire system are currently underway.

8 ACKNOWLEDGEMENTS

The first author would like to thank Clint Bidlack, Gopal Pingali, Sandy Bartlett and Jordi Ribas-Corbera for the stimulating discussions on various topics. These discussions have been helpful in formulating ideas about the problem.

REFERENCES

[1] Farshid Arman, Arding Hsu, and Ming-Yee Chiu. Image processing on compressed data for large video databases. In Proceedings of ACM Multimedia, pages 267-272, California, USA, June 1993. Association for Computing Machinery.

[2] David Bordwell and Kristin Thompson. Film Art: An Introduction. Addison-Wesley Publishing Company, 1980.

[3] Gloriana Davenport, Thomas Aguirre Smith, and Natalio Pincever. Cinematic primitives for multimedia. IEEE Computer Graphics & Applications, pages 67-74, July 1991.

[4] Richard O. Duda and Peter E. Hart. Pattern Classification and Scene Analysis. A Wiley-Interscience Publication. John Wiley and Sons, 1973.

[5] Claude L. Fennema and William B. Thompson. Velocity determination in scenes containing several moving objects. Computer Graphics and Image Processing, 9:301-315, 1979.

[6] Arun Hampapur, Ramesh Jain, and Terry Weymouth. Production model based digital video segmentation. Technical report, The University of Michigan, Ann Arbor, 1994.

[7] Berthold K. P. Horn and Brian G. Schunck. Determining optical flow. Artificial Intelligence, 17:185-203, 1981.

[8] R. Jain. Difference and accumulative difference pictures in dynamic scene analysis. Image and Vision Computing, 2(2):99-108, May 1984.

[9] R. Jain and H. H. Nagel. On the analysis of accumulative difference pictures from image sequences of real world scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1(2):206-214, April 1979.

[10] Ira Koenigsberg. The Complete Film Dictionary. Penguin Books, 1989.

[11] J. O. Limb and J. A. Murphy. Estimating the velocity of moving images in television signals. Computer Graphics and Image Processing, 4:311-327, 1975.

[12] Akio Nagasaka and Yuzuru Tanaka. Automatic video indexing and full-video search for object appearances. In 2nd Working Conference on Visual Database Systems, pages 119-133, Budapest, Hungary, October 1991. IFIP WG 2.6.

[13] George Wolberg. Digital Image Warping. IEEE Computer Society Press, 1992.

[14] HongJiang Zhang, Atreyi Kankanhalli, and Stephen W. Smoliar. Automatic partitioning of full-motion video. Technical report, Institute of Systems Science, National University of Singapore, Heng Mui Keng Terrace, Kent Ridge, Singapore 0511, 1992.


