BU CLA CS 835: Seminar on Image and Video Computing

Class commentary on articles: Video Cut Detection



Gregory Ganarz

In "Production Model Based Digital Video Segmentation" A. Hampapur et al. present a method for detecting various scene edits based on the outputs of three sets of feature detectors. Chromatic scaling was used to detect fades and dissolves. One limitation of using chromatic scaling is that chromatic translations and rotations are difficult to detect. Spatial edits were detected using a spatial edit detector (I'm not sure how it worked). Scene cuts were detected by using histogramming and template matching. One limitation with the approach of designing task specific features is that such an approach leads to a proliferation of features. However, the authors do note that the spatial and chromatic edif features were also useful in detecting cuts. Being able to use detectors for more than one purpose is a useful ability. In "Automatic Partitioning of Full-Motion Video" H. Zhang et al. show that by using multiple thresholds, both gradual transitions and camera breaks can be detected using the same histogram. One difficulty with the histogram approach is that object motion can induce frame-to-frame differences which fool the algorithm into signaling a camera break. By incorporating motion detection, the authors hope to eventually make their system robust to object motion. A histogram approach would also be sensitive to fast lighting changes. The use of "higher-level" features such as objects seems like a better approach than the use of histograms. Still, the authors show that the utility/complexity ratio for histograms is quite good.

Shrenik Daftary

"Production Model Based Digital Video Segmentation" by Hampapur, Jain, and Weymouth. An efficient method to sort digital databases is presented in order to provide easy retrieval of information contained in such a database. The first thing that needs to be accomplished in order to separate a database is to have an automated segmentation algorithm that will identify different shots in a video. This paper provides an explicit model of video, which is called the production model based classification approach to segmentation. Modeling Digital Video The process of editing a sequence involves both the actual decision of the order for shots, and the assembly of the shots into frames in a final cut. A set of shots in the final cut can be considered to be represented by a closed time interval between its beginning and end. The set of edits B can then be represented by the duration of the edit, the type of edit, and the transformation to perform the edit. The assembly model is simply Vam = S1,Se12,S2,Se23,....Sn. An edit between two shots is modeled as a 2D Image transformation between the out shot of the first sequence to the in shot of the second sequence. Models are presented for concatenating shots, translating a shot, fading, and morphing. Video Segmentation Video segmentation involves the process of either separating edits, or shots into discrete partitions. This paper addressed the problem of edit boundary detection since that is simpler to determine. Video segmentation using production model based classification The stages in model based classification are model formulation (isolation of essential steps in data production), feature extractor design - (feature extraction and classification). Cut detection relies on the detection of the sudden transition between shots. Depending on the director's methods cuts can be very sharp or very subtle, so cut detectors must detect varying amounts of discontinuity between scenes - intensity differences, or distribution differences. Chromatic detectors rely on detecting the scaling of the chromatic scale in terms of fading out or fading in. Spatial edits involve the transformation of pixel space. A translate spatial edit involves the initial shot being translated out uncovering a second shot. The limitation of this type of filter is that there are some transforms that do not fit into any transition effects. Application to subregions of the image make detector design more complex as well. Uniformity of an image is measured in terms of spatial uniformity, and value uniformity. The feature detectors that are presented have simple methods to store images. Classification and Segmentation Feature thresholding involves the process of cutting parts of the image if they are lower than a given magnitude. A discriminant function is then applied to the image and each frame in the video is labeled as either an edit or shot. Next the segmented pulse train of edit/shot decisions is placed in a finite state machine that will determine beginning and end of segments. Comparison to existing work This paper presents a method to detect edit areas unlike other papers which rely on a bottom up approach to segment detection. Error measures Error is measured in terms of either improper identification or improper labeling of a region. 
The types of segmentation error possible are presented as undersegmentation (more false negatives than false positives), equal segmentation (the same number of false positives as false negatives), and oversegmentation (more false positives than false negatives).

Experimental Results: Results are presented for data sets taken from the local cable television system. Segmentation performance is given on page 40. Sensitivity analysis was also performed on a sequence from a headline news program: thresholds were set for the cut, chromatic, and spatial features, and the results were determined for the different thresholds. The cut threshold was the most significant in terms of system detection of cuts, due to its attempt to detect a null edit (page 42).

"Automatic partitioning of full-motion video" by Zhang, Kankanhalli, and Smoliar. This paper presents another technique to segment video. A segment is defined as a single shot from a camera; camera breaks are defined as scene cuts; dissolves are defined as something similar to a fade. Metrics for video partitioning are presented. The first is a simple pixel-by-pixel comparison at two different time points, which of course produces errors in scenes with either camera or object motion. An improvement over this simple pixel-by-pixel comparison is to compare regions; the metric relies on a likelihood ratio computed from each region's mean and variance. Next, the idea of simply comparing histograms of the images is presented. All three metrics suffer when there are large or fast-moving objects, or in cases where the constant-lighting assumption does not hold.

Gradual transition detection: The first technique for detecting effects is a simple comparison of histogram differences. This leads to a twin-comparison metric with a threshold for break detection (Tb) and a lower threshold for special effect detection (Ts). If the difference is greater than Tb, a break is declared; if the difference falls between Ts and Tb, a potential gradual transition is recorded. A method is presented to verify that a camera pan, zoom, or object movement is not what is causing the change in the histogram. Optical flow techniques are presented; a technique similar to the one from last week would work well.

Applying the comparison techniques: Problems with improper threshold selection are mentioned. Thresholds are selected by assuming a Gaussian distribution of frame-to-frame differences and choosing a Tb value that is a function of their mean and standard deviation. The Ts value was shown to work best at values between 8 and 10. The benefits of skipping frames are also mentioned: if it becomes necessary, frames can be examined again when significant changes occur in the skipped interval. This improves the time performance of the technique.

Implementation and evaluation: The methods were tested with a cartoon video. In this case the results were better for the single-pass algorithm, but the pair-wise method seemed to have the fewest misses of each system. The pair-wise pixel comparison in the multipass algorithm was the most time-efficient, but missed two more segments than the single-pass method. The technique functioned fairly well in the other cases presented, but the system failed when there were flickering lights, or dissolves in which the two end shots had similar histograms.
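The region likelihood ratio mentioned above is easy to state concretely. For reference, a minimal sketch of the Yakimovsky-style ratio commonly used in this literature; the grid size and threshold below are illustrative assumptions, not the paper's parameters.

    import numpy as np

    def likelihood_ratio(region_a, region_b):
        # Near 1 for similar regions; grows with change in mean or
        # variance between the corresponding regions.
        m_a, m_b = region_a.mean(), region_b.mean()
        s_a, s_b = region_a.var(), region_b.var()
        num = ((s_a + s_b) / 2.0 + ((m_a - m_b) / 2.0) ** 2) ** 2
        return num / max(s_a * s_b, 1e-9)   # guard against flat regions

    def changed_regions(frame_a, frame_b, grid=(4, 4), threshold=3.0):
        # Split both frames into a grid and count regions whose ratio
        # exceeds the threshold; a break is declared when enough do.
        rows = np.array_split(np.arange(frame_a.shape[0]), grid[0])
        cols = np.array_split(np.arange(frame_a.shape[1]), grid[1])
        count = 0
        for r in rows:
            for c in cols:
                a = frame_a[np.ix_(r, c)].astype(float)
                b = frame_b[np.ix_(r, c)].astype(float)
                if likelihood_ratio(a, b) > threshold:
                    count += 1
        return count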

Paul Dell

HongJiang Zhang, Atreyi Kankanhalli, Stephen W. Smoliar, "Automatic partitioning of full-motion video". The goal of the Zhang et al. work is to create an index structure and table of contents for video which can be stored as part of the video package. To this end, one major step is to break the video into meaningful segments. In the article, a segment is defined as "a single, uninterrupted camera shot." The article presents a number of methods to detect boundaries between consecutive camera shots. The simplest transition in a video sequence is a camera break, but more sophisticated transitions including dissolve, wipe, fade-in, and fade-out must also be detected. Three difference metrics are presented: a "pair-wise comparison," which counts how many pixels changed; a "likelihood ratio," which compares corresponding regions; and "histogram comparison." All three metrics are susceptible to mistaking large object movements and sharp illumination changes for transitions. To detect gradual transitions, a new approach called the "twin-comparison approach" is given, which uses two thresholds. The lower threshold is used to detect special effects. When the lower threshold is exceeded, an accumulated difference value is maintained (see the sketch at the end of this commentary). If the difference value drops back below the lower threshold without the accumulated value exceeding the upper threshold, the accumulated value is dropped; if the accumulated value grows and exceeds the upper threshold, a transition is assumed to have occurred. In addition to accommodating transitions, some effort is given to distinguishing camera movements from transitions. Data is presented on the performance of the simpler difference techniques and of the more elaborate system presented in the paper. Overall, the best systems detected 90% of the transitions correctly, and the multipass systems performed faster than the simpler difference techniques. It is this reader's opinion that the multipass system presented, while performing faster than the other techniques, did not significantly improve the performance of the color-histogram-based approach. Any system will have to perform much better than the 90% accuracy given in the paper. Also, it seems that it would be much better to have extra false positives than to miss even 1% of the actual transitions.

Arun Hampapur, Ramesh Jain, and Terry E. Weymouth, "Production Model Based Digital Video Segmentation". The authors argue that a video-specific model is needed to segment video transitions more effectively. The proposed model has three components: the Edit Decision Model, the Assembly Model, and the Edit Effect Model. This models the normal video production processes of editing and assembly. The video segmentation problem can take two equivalent approaches: one can detect boundaries between shots, or one can detect the edits themselves. The approach taken in the paper is to detect edit boundaries. The performance of the system is on par with other systems; in particular, the overall percentage of positive detections is 88%. This system does involve more processing than other systems, and the 12% error rate may or may not be adequate. One advantage of the system is that specific transitions can be modeled and detected; one disadvantage is the increased level of complexity involved.
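A minimal sketch of the twin-comparison logic described above, assuming a frame-difference function such as the histogram metric sketched earlier. The names and exact bookkeeping are illustrative; the paper's version also passes candidates through a motion analysis step before declaring a transition.

    def twin_comparison(frames, diff, t_break, t_special):
        # diff(a, b) is any frame-difference metric; t_break > t_special.
        # Returns (start, end, kind) events, where kind is "cut" or "gradual".
        events = []
        start = None          # first frame of a candidate gradual transition
        accumulated = 0.0     # difference between the start frame and now
        for i in range(1, len(frames)):
            d = diff(frames[i - 1], frames[i])
            if d >= t_break:
                events.append((i - 1, i, "cut"))
                start, accumulated = None, 0.0
            elif d >= t_special:
                if start is None:
                    start = i - 1    # save the first frame of the candidate
                accumulated = diff(frames[start], frames[i])
            elif start is not None:
                # consecutive difference fell back to normal: decide now
                if accumulated >= t_break:
                    events.append((start, i - 1, "gradual"))
                start, accumulated = None, 0.0   # otherwise drop the candidate
        return events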

John Petry

AUTOMATIC PARTITIONING OF FULL-MOTION VIDEO, by Zhang, Kankanhalli, and Smoliar. The authors are looking for breaks and other boundaries in video segments. They discuss two principal approaches: spatial comparisons (pixel-to-pixel over sequential frames, or groups of pixels to the corresponding group of pixels) and histogram comparisons (grey or color intensity distributions over the whole image). While these approaches are generally sufficient to detect cuts (a clean break, where sequential frames are from different segments), they are not enough to detect gradual transitions due to editing effects such as fading or dissolving.

The authors' chief contribution to the latter problem is a twin-comparison algorithm, which works as follows: run one of the above low-level algorithms with a low threshold, so that it detects not only breaks but also locations that might be breaks (i.e., which score lower than obvious breaks, but higher than typical frame-to-frame differences). At each such candidate location, save the first frame in the sequence, then continue examining subsequent frames. As soon as the difference between adjacent frames falls back into the normal range, compare the frame at hand with the saved frame from the beginning of the sequence. If the two frames differ by more than the strict break threshold, a gradual break has probably occurred; if not, consider the whole sequence to be part of the current segment.

Once a likely break has been detected (by the twin-comparison or any single-threshold method), it is examined using optical flow techniques to decide whether the break was actually caused by object or camera motion. In either of those cases, it will be considered part of the existing sequence. Only if object or camera motion (pan or zoom, for instance) are not likely explanations will the sequence be classified as a break.

In addition, the authors suggest an implementation improvement over standard single-pass approaches. By running a first pass at low temporal resolution (skipping n of every m frames), then running the tool of choice at full temporal resolution only over those sections which trip the threshold, a big speedup can be attained without significantly increasing the number of missed breaks.

The authors claim the best technique is color histogramming. I'm not sure I see that from their data: pixel-by-pixel comparison seems equally good, at least on the two-pass approach, where it speeds up greatly. Both types of approaches (spatial difference or histogram) have cases they can't handle. For spatial difference, any strong motion can confuse it, as can sharp intensity changes. The histogram approach is most affected by intensity changes (false break detected) and by switching to scenes with similar intensity distributions (true break not detected); the latter can happen when cutting between scenes of the same type in the same location. This idea doesn't make use of higher-level video information, in contrast with the next paper, but the twin-comparison approach seems relatively elegant.

PRODUCTION MODEL BASED DIGITAL VIDEO SEGMENTATION, by Hampapur, Jain and Weymouth. These authors take issue with the whole low-level partitioning approach urged by Zhang et al. They feel low-level techniques don't use all possible data. In particular, they consider the case of commercial video (i.e. television and movies, as opposed to, say, scientific or security videos).
Commercial video is edited by humans, and segment breaks usually fall into one of four types: cut (switching from one segment to another without intermediate editing); intensity (fade in/out, dissolve); spatial (translation); and a mixture of intensity and spatial. The authors took a clever approach: since the last three of these are man-made (usually computer-driven) transformations, they look at the transforms and design detectors specifically for them. They contrast this with low-level approaches, which make almost no assumptions about the transformation that takes place within a break. For cuts, the authors simply make use of existing single-pass, single-algorithm approaches of the form initially discussed by Zhang et al. For the mixed class of transition, they feel the whole problem is too complex, and ignore it. That leaves the intensity and spatial transformation classes, which are the focus of their work.

They spend a good deal of time on their theoretical models, and some limits of these models are quite apparent. For instance, they assume there is little motion in segments immediately before or after one of these transformations, and that any motion within an editing sequence can be described by an affine transformation.

The authors claim their technique is significantly better than the low-level-only approaches of Zhang et al. and others. Unfortunately, they present almost no data to back this up. They assert that a comparison of cut detection techniques is not possible at this time, though I don't see the basis for that statement. In addition, the evidence that they do present does not seem better than that in Zhang et al., though it's hard to say since the data streams are different.

If we accept their claims, though, it's still possible to make some interesting judgements about the two approaches. Hampapur et al. have highly-tuned algorithms for common types of transitions. To the extent that these transitions are the ones present in a video stream, their approach may be the best. However, the low-level approach is almost certainly better if the video stream incorporates different types of transitions (such as mixed chromatic and spatial) or violates some of the assumptions about minimal motion. Hampapur et al. allude to this when they state that new graphics computing capabilities are leading to changes which are beyond the ability of their model to detect, such as shattering edits or morphing. New detectors would have to be designed for these edits, and that may not be simple.

I left feeling less impressed by this paper, though that may have been partly due to the presentation. Their idea of examining the transformations used in editing from the top down is very good, and it adds a certain clarity to the problem that is missing from Zhang et al. But I think they have spent too much effort parameterizing their model and not enough testing it against other approaches.
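To make the intensity-transformation class concrete: a fade-out can be modeled as chromatic scaling, roughly I_t(x, y) = (1 - t/T) * I_0(x, y), so mean brightness should fall close to linearly across the edit. A minimal detection sketch under that assumption follows; it illustrates the underlying idea only, not the actual feature detectors of Hampapur et al.

    import numpy as np

    def fade_out_evidence(frames):
        # Fit a line to mean frame brightness over a candidate window.
        # A steady, strongly negative slope with a small residual is
        # consistent with chromatic scaling (a fade-out); a cut instead
        # shows one abrupt jump and hence a large residual.
        means = np.array([f.mean() for f in frames], dtype=float)
        t = np.arange(len(means), dtype=float)
        slope, intercept = np.polyfit(t, means, 1)
        fit = slope * t + intercept
        residual = float(np.sqrt(np.mean((means - fit) ** 2)))
        return slope, residual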

John Isidoro

I have to say, I enjoyed the most recent batch of papers on video segmentation. It seems as though the introductory papers in certain areas of machine vision, although sometimes naive, can be the most helpful in explaining the underlying problems and concepts. My project for today was to review "Automatic partitioning of full-motion video." I enjoyed this paper; I felt that every technique discussed was useful in some way. It's not surprising that there is so much overlap between still-image analysis techniques and video analysis (e.g., color histograms); I think that almost every similarity metric we have studied in the past can be applied to video segmentation. Obviously, some of the more time-consuming ones, such as using Markov random fields for image segmentation, may be too much computation for full-motion video analysis. I wish they had at least explained a minimal optical flow algorithm in this paper; I would assume that most optical flow papers explain how to apply the algorithm to detect panning and zooming as this paper did. The next paper, "Production Model Based Digital Video Segmentation," was also pretty decent. It focused more on detecting certain types of video transitions and effects. I think this may help when you know the exact nature of the scene transitions in question, but it seems to harm the segmentation algorithm otherwise. These techniques had an 88% success rate, while the above paper had a 90% success rate.


Stan Sclaroff
Created: Oct 1, 1995
Last Modified: