BU CLA CS 835: Seminar on Image and Video Computing

Class commentary on articles: Video Cut Detection



Gregory Ganarz

In "Production Model Based Digital Video Segmentation" A. Hampapur et al. present a method for detecting various scene edits based on the outputs of three sets of feature detectors. Chromatic scaling was used to detect fades and dissolves. One limitation of using chromatic scaling is that chromatic translations and rotations are difficult to detect. Spatial edits were detected using a spatial edit detector (I'm not sure how it worked). Scene cuts were detected by using histogramming and template matching. One limitation with the approach of designing task specific features is that such an approach leads to a proliferation of features. However, the authors do note that the spatial and chromatic edif features were also useful in detecting cuts. Being able to use detectors for more than one purpose is a useful ability. In "Automatic Partitioning of Full-Motion Video" H. Zhang et al. show that by using multiple thresholds, both gradual transitions and camera breaks can be detected using the same histogram. One difficulty with the histogram approach is that object motion can induce frame-to-frame differences which fool the algorithm into signaling a camera break. By incorporating motion detection, the authors hope to eventually make their system robust to object motion. A histogram approach would also be sensitive to fast lighting changes. The use of "higher-level" features such as objects seems like a better approach than the use of histograms. Still, the authors show that the utility/complexity ratio for histograms is quite good.

Shrenik Daftary

"Production Model Based Digital Video Segmentation" by Hampapur, Jain, and Weymouth. An efficient method to sort digital databases is presented in order to provide easy retrieval of information contained in such a database. The first thing that needs to be accomplished in order to separate a database is to have an automated segmentation algorithm that will identify different shots in a video. This paper provides an explicit model of video, which is called the production model based classification approach to segmentation. Modeling Digital Video The process of editing a sequence involves both the actual decision of the order for shots, and the assembly of the shots into frames in a final cut. A set of shots in the final cut can be considered to be represented by a closed time interval between its beginning and end. The set of edits B can then be represented by the duration of the edit, the type of edit, and the transformation to perform the edit. The assembly model is simply Vam = S1,Se12,S2,Se23,....Sn. An edit between two shots is modeled as a 2D Image transformation between the out shot of the first sequence to the in shot of the second sequence. Models are presented for concatenating shots, translating a shot, fading, and morphing. Video Segmentation Video segmentation involves the process of either separating edits, or shots into discrete partitions. This paper addressed the problem of edit boundary detection since that is simpler to determine. Video segmentation using production model based classification The stages in model based classification are model formulation (isolation of essential steps in data production), feature extractor design - (feature extraction and classification). Cut detection relies on the detection of the sudden transition between shots. Depending on the director's methods cuts can be very sharp or very subtle, so cut detectors must detect varying amounts of discontinuity between scenes - intensity differences, or distribution differences. Chromatic detectors rely on detecting the scaling of the chromatic scale in terms of fading out or fading in. Spatial edits involve the transformation of pixel space. A translate spatial edit involves the initial shot being translated out uncovering a second shot. The limitation of this type of filter is that there are some transforms that do not fit into any transition effects. Application to subregions of the image make detector design more complex as well. Uniformity of an image is measured in terms of spatial uniformity, and value uniformity. The feature detectors that are presented have simple methods to store images. Classification and Segmentation Feature thresholding involves the process of cutting parts of the image if they are lower than a given magnitude. A discriminant function is then applied to the image and each frame in the video is labeled as either an edit or shot. Next the segmented pulse train of edit/shot decisions is placed in a finite state machine that will determine beginning and end of segments. Comparison to existing work This paper presents a method to detect edit areas unlike other papers which rely on a bottom up approach to segment detection. Error measures Error is measured in terms of either improper identification or improper labeling of a region. 
The types of segmentation error possible are presented as undersegmentation (more false negatives than false positives), equal segmentation (the same number of false positives as false negatives), and oversegmentation (more false positives than false negatives).

Experimental Results: Results are presented for data sets taken from the local cable television system. Segmentation performance is given on page 40. Sensitivity analysis was also performed on a sequence from a headline news program: thresholds were set for the cut, chromatic, and spatial features, and the results were determined for the different thresholds. The cut threshold was the most significant in terms of system detection of cuts, due to its attempt to detect a null edit (page 42).

"Automatic partitioning of full-motion video" by Zhang, Kankanhalli, and Smoliar. This paper presents another technique to segment video. A segment is defined as a single shot from a camera; camera breaks are defined as scene cuts; dissolves are defined as something similar to a fade. Metrics for video partitioning are presented. The first is a simple pixel-by-pixel comparison at two different time points, which of course produces errors in scenes with either camera or object motion. An improvement over this simple pixel-by-pixel comparison is to compare regions; the metric relies on a likelihood ratio computed from each region's mean and variance. Next, the idea of simply comparing histograms of the images is presented. All three metrics suffer when there are large or fast-moving objects, or in cases where the constant-lighting assumption does not hold.

Gradual transition detection: The first technique for detecting effects is a simple comparison of histogram differences. This leads to a twin-comparison metric with a threshold for break detection (Tb) and a lower threshold for special effect detection (Ts). If the difference is greater than Tb, a break is declared; if the difference falls between Ts and Tb, a potential gradual transition is recorded. A method is presented to verify that a camera pan, zoom, or object movement is not what is causing the change in the histogram. Optical flow techniques are presented; a technique similar to the one from last week would work well.

Applying the comparison techniques: Problems with improper threshold selection are mentioned. Thresholds are selected by assuming a Gaussian distribution of frame-to-frame differences and choosing a Tb value that is a function of their mean and standard deviation. The Ts value was shown to work best at values between 8 and 10. The benefits of skipping frames are also mentioned: if it becomes necessary, frames can be examined again when significant changes occur in the skipped interval. This improves the time performance of the technique.

Implementation and evaluation: The methods were tested with a cartoon video. In this case the results were better for the single-pass algorithm, but the pair-wise method seemed to have the fewest misses of each system. The pair-wise pixel comparison in the multipass algorithm was the most time-efficient, but missed two more segments than the single-pass method. The technique functioned fairly well in the other cases presented, but the system failed when there were flickering lights, or dissolves in which the two end shots had similar histograms.
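The region likelihood ratio mentioned above is easy to state concretely. For reference, a minimal sketch of the Yakimovsky-style ratio commonly used in this literature; the grid size and threshold below are illustrative assumptions, not the paper's parameters.

    import numpy as np

    def likelihood_ratio(region_a, region_b):
        # Near 1 for similar regions; grows with change in mean or
        # variance between the corresponding regions.
        m_a, m_b = region_a.mean(), region_b.mean()
        s_a, s_b = region_a.var(), region_b.var()
        num = ((s_a + s_b) / 2.0 + ((m_a - m_b) / 2.0) ** 2) ** 2
        return num / max(s_a * s_b, 1e-9)   # guard against flat regions

    def changed_regions(frame_a, frame_b, grid=(4, 4), threshold=3.0):
        # Split both frames into a grid and count regions whose ratio
        # exceeds the threshold; a break is declared when enough do.
        rows = np.array_split(np.arange(frame_a.shape[0]), grid[0])
        cols = np.array_split(np.arange(frame_a.shape[1]), grid[1])
        count = 0
        for r in rows:
            for c in cols:
                a = frame_a[np.ix_(r, c)].astype(float)
                b = frame_b[np.ix_(r, c)].astype(float)
                if likelihood_ratio(a, b) > threshold:
                    count += 1
        return count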

Paul Dell

HongJiang Zhang, Atreyi Kankanhalli, Stephen W. Smoliar, "Automatic partitioning of full-motion video". The goal of the Zhang et al. work is to create an index structure and table of contents for video which can be stored as part of the video package. To this end, one major step is to break the video into meaningful segments. In the article, a segment is defined as "a single, uninterrupted camera shot." The article presents a number of methods to detect boundaries between consecutive camera shots. The simplest transition in a video sequence is a camera break, but more sophisticated transitions including dissolve, wipe, fade-in, and fade-out must also be detected. Three difference metrics are presented: a "pair-wise comparison," which counts how many pixels changed; a "likelihood ratio," which compares corresponding regions; and "histogram comparison." All three metrics are susceptible to mistaking large object movements and sharp illumination changes for transitions. To detect gradual transitions, a new approach called the "twin-comparison approach" is given, which uses two thresholds. The lower threshold is used to detect special effects. When the lower threshold is exceeded, an accumulated difference value is maintained (see the sketch at the end of this commentary). If the difference value drops back below the lower threshold without the accumulated value exceeding the upper threshold, the accumulated value is dropped; if the accumulated value grows and exceeds the upper threshold, a transition is assumed to have occurred. In addition to accommodating transitions, some effort is given to distinguishing camera movements from transitions. Data is presented on the performance of the simpler difference techniques and of the more elaborate system presented in the paper. Overall, the best systems detected 90% of the transitions correctly, and the multipass systems performed faster than the simpler difference techniques. It is this reader's opinion that the multipass system presented, while performing faster than the other techniques, did not significantly improve the performance of the color-histogram-based approach. Any system will have to perform much better than the 90% accuracy given in the paper. Also, it seems that it would be much better to have extra false positives than to miss even 1% of the actual transitions.

Arun Hampapur, Ramesh Jain, and Terry E. Weymouth, "Production Model Based Digital Video Segmentation". The authors argue that a video-specific model is needed to segment video transitions more effectively. The proposed model has three components: the Edit Decision Model, the Assembly Model, and the Edit Effect Model. This models the normal video production processes of editing and assembly. The video segmentation problem can take two equivalent approaches: one can detect boundaries between shots, or one can detect the edits themselves. The approach taken in the paper is to detect edit boundaries. The performance of the system is on par with other systems; in particular, the overall percentage of positive detections is 88%. This system does involve more processing than other systems, and the 12% error rate may or may not be adequate. One advantage of the system is that specific transitions can be modeled and detected; one disadvantage is the increased level of complexity involved.
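A minimal sketch of the twin-comparison logic described above, assuming a frame-difference function such as the histogram metric sketched earlier. The names and exact bookkeeping are illustrative; the paper's version also passes candidates through a motion analysis step before declaring a transition.

    def twin_comparison(frames, diff, t_break, t_special):
        # diff(a, b) is any frame-difference metric; t_break > t_special.
        # Returns (start, end, kind) events, where kind is "cut" or "gradual".
        events = []
        start = None          # first frame of a candidate gradual transition
        accumulated = 0.0     # difference between the start frame and now
        for i in range(1, len(frames)):
            d = diff(frames[i - 1], frames[i])
            if d >= t_break:
                events.append((i - 1, i, "cut"))
                start, accumulated = None, 0.0
            elif d >= t_special:
                if start is None:
                    start = i - 1    # save the first frame of the candidate
                accumulated = diff(frames[start], frames[i])
            elif start is not None:
                # consecutive difference fell back to normal: decide now
                if accumulated >= t_break:
                    events.append((start, i - 1, "gradual"))
                start, accumulated = None, 0.0   # otherwise drop the candidate
        return events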

John Petry

AUTOMATIC PARTITIONING OF FULL-MOTION VIDEO, by Zhang, Kankanhalli, and Smoliar. The authors are looking for breaks and other boundaries in video segments. They discuss two principal approaches: spatial comparisons (pixel-to-pixel over sequential frames, or groups of pixels to the corresponding group of pixels) and histogram comparisons (grey or color intensity distributions over the whole image). While these approaches are generally sufficient to detect cuts (a clean break, where sequential frames are from different segments), they are not enough to detect gradual transitions due to editing effects such as fading or dissolving.

The authors' chief contribution to the latter problem is a twin-comparison algorithm, which works as follows: run one of the above low-level algorithms with a low threshold, so that it detects not only breaks but also locations that might be breaks (i.e., which score lower than obvious breaks, but higher than typical frame-to-frame differences). At each such candidate location, save the first frame in the sequence, then continue examining subsequent frames. As soon as the difference between adjacent frames falls back into the normal range, compare the frame at hand with the saved frame from the beginning of the sequence. If the two frames differ by more than the strict break threshold, a gradual break has probably occurred; if not, consider the whole sequence to be part of the current segment.

Once a likely break has been detected (by the twin-comparison or any single-threshold method), it is examined using optical flow techniques to decide whether the break was actually caused by object or camera motion. In either of those cases, it will be considered part of the existing sequence. Only if object or camera motion (pan or zoom, for instance) are not likely explanations will the sequence be classified as a break.

In addition, the authors suggest an implementation improvement over standard single-pass approaches. By running a first pass at low temporal resolution (skipping n of every m frames), then running the tool of choice at full temporal resolution only over those sections which trip the threshold, a big speedup can be attained without significantly increasing the number of missed breaks.

The authors claim the best technique is color histogramming. I'm not sure I see that from their data: pixel-by-pixel comparison seems equally good, at least on the two-pass approach, where it speeds up greatly. Both types of approaches (spatial difference or histogram) have cases they can't handle. For spatial difference, any strong motion can confuse it, as can sharp intensity changes. The histogram approach is most affected by intensity changes (false break detected) and by switching to scenes with similar intensity distributions (true break not detected); the latter can happen when cutting between scenes of the same type in the same location. This idea doesn't make use of higher-level video information, in contrast with the next paper, but the twin-comparison approach seems relatively elegant.

PRODUCTION MODEL BASED DIGITAL VIDEO SEGMENTATION, by Hampapur, Jain and Weymouth. These authors take issue with the whole low-level partitioning approach urged by Zhang et al. They feel low-level techniques don't use all possible data. In particular, they consider the case of commercial video (i.e. television and movies, as opposed to, say, scientific or security videos).
Commercial video is edited by humans, and segment breaks usually fall into one of four types: cut (switching from one segment to another without intermediate editing); intensity (fade in/out, dissolve); spatial (translation); and a mixture of intensity and spatial. The authors took a clever approach: since the last three of these are man-made (usually computer-driven) transformations, they look at the transforms and design detectors specifically for them. They contrast this with low-level approaches, which make almost no assumptions about the transformation that takes place within a break. For cuts, the authors simply make use of existing single-pass, single-algorithm approaches of the form initially discussed by Zhang et al. For the mixed class of transition, they feel the whole problem is too complex, and ignore it. That leaves the intensity and spatial transformation classes, which are the focus of their work.

They spend a good deal of time on their theoretical models, and some limits of these models are quite apparent. For instance, they assume there is little motion in segments immediately before or after one of these transformations, and that any motion within an editing sequence can be described by an affine transformation.

The authors claim their technique is significantly better than the low-level-only approaches of Zhang et al. and others. Unfortunately, they present almost no data to back this up. They assert that a comparison of cut detection techniques is not possible at this time, though I don't see the basis for that statement. In addition, the evidence that they do present does not seem better than that in Zhang et al., though it's hard to say since the data streams are different.

If we accept their claims, though, it's still possible to make some interesting judgements about the two approaches. Hampapur et al. have highly-tuned algorithms for common types of transitions. To the extent that these transitions are the ones present in a video stream, their approach may be the best. However, the low-level approach is almost certainly better if the video stream incorporates different types of transitions (such as mixed chromatic and spatial) or violates some of the assumptions about minimal motion. Hampapur et al. allude to this when they state that new graphics computing capabilities are leading to changes which are beyond the ability of their model to detect, such as shattering edits or morphing. New detectors would have to be designed for these edits, and that may not be simple.

I left feeling less impressed by this paper, though that may have been partly due to the presentation. Their idea of examining the transformations used in editing from the top down is very good, and it adds a certain clarity to the problem that is missing from Zhang et al. But I think they have spent too much effort parameterizing their model and not enough testing it against other approaches.
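To make the intensity-transformation class concrete: a fade-out can be modeled as chromatic scaling, roughly I_t(x, y) = (1 - t/T) * I_0(x, y), so mean brightness should fall close to linearly across the edit. A minimal detection sketch under that assumption follows; it illustrates the underlying idea only, not the actual feature detectors of Hampapur et al.

    import numpy as np

    def fade_out_evidence(frames):
        # Fit a line to mean frame brightness over a candidate window.
        # A steady, strongly negative slope with a small residual is
        # consistent with chromatic scaling (a fade-out); a cut instead
        # shows one abrupt jump and hence a large residual.
        means = np.array([f.mean() for f in frames], dtype=float)
        t = np.arange(len(means), dtype=float)
        slope, intercept = np.polyfit(t, means, 1)
        fit = slope * t + intercept
        residual = float(np.sqrt(np.mean((means - fit) ** 2)))
        return slope, residual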

John Isidoro

I have to say, I enjoyed the most recent batch of papers on video segmentation. It seems as though the introductory papers in certain areas of machine vision, although sometimes naive, can be the most helpful in explaining the underlying problems and concepts. My project for today was to review "Automatic partitioning of full-motion video." I enjoyed this paper; I felt that every technique discussed was useful in some way. It's not surprising that there is so much overlap between still-image analysis techniques and video analysis (e.g., color histograms); I think that almost every similarity metric we have studied in the past can be applied to video segmentation. Obviously, some of the more time-consuming ones, such as using Markov random fields for image segmentation, may be too much computation for full-motion video analysis. I wish they had at least explained a minimal optical flow algorithm in this paper; I would assume that most optical flow papers explain how to apply the algorithm to detect panning and zooming as this paper did. The next paper, "Production Model Based Digital Video Segmentation," was also pretty decent. It focused more on detecting certain types of video transitions and effects. I think this may help when you know the exact nature of the scene transitions in question, but it seems to harm the segmentation algorithm otherwise. These techniques had an 88% success rate, while the above paper had a 90% success rate.


Stan Sclaroff
Created: Oct 1, 1995
Last Modified: