BU CLA CS 835: Seminar on Image and Video Computing

Class commentary on articles: Motion Estimation and Representation



William Klippgen

Performance of Optical Flow Techniques
--------------------------------------
by J.L. Barron, D.J. Fleet and S.S. Beauchemin

This article compares existing optical flow techniques, concentrating on their ability to compute velocities. All the code and images used in the tests are nicely arranged at URL=ftp://ftp.csd.uwo.ca/pub/vision and easy to make use of!

Many of the techniques can be considered as having three main processing stages:

1. Smoothing / prefiltering to extract the signal of interest.
2. Extraction of basic measurements on the time series, e.g. derivatives in time and space.
3. Construction of a 2-dimensional flow field based on the data from stage 2.

Differential techniques make use of spatio-temporal derivatives to come up with velocity estimates. The approaches are variations on how to minimize an equation incorporating the gradient constraint equation. Lucas and Kanade's implementation proved to be the second best of all methods in the test when estimating the speed of the moving square, "Square2". Their method also performed second best when detecting motion in the translating tree sequence, where the camera moves normal to its line of sight. The same method also won for 2-D motion detection in the Yosemite sequence, which contains a wide range of velocities. (A small sketch of this kind of first-order estimator appears after these comments.)

Region-based matching tries to detect velocities by comparing the position of a region in subsequent images. It performs better than the differential techniques when noise, a small number of frames or aliasing makes pure pixel-by-pixel techniques fail.

Frequency-based methods use filters to produce "energy" estimates based on Fourier transforms. Certain methods using this approach have been proven to be equivalent to correlation-based methods.

Phase-based techniques consider velocity as defined by the phase behaviour of band-pass filter outputs. Fleet and Jepson's version of this method did excellently when estimating the speed of the sinusoidal field named "Sinusoid 1". Speed estimates of the moving square, "Square2", were also most successful with this method, as were those for the translating and diverging tree sequences.

It is remarkable that the simple first-order differential technique proposed by Lucas and Kanade performs so well on a wide range of the synthetic data. However, Fleet and Jepson's approach was the overall winner, with the most reliable results across the sequences. Both methods also performed well on the real-image sequences. One important lesson learnt is that temporal smoothing is very useful to make up for aliasing effects. This paper is very important for establishing a common test-bed for velocity detection algorithms. There is still much to be done to find good error metrics that can be applied to the great variety of available methods.

Representing Moving Images with Layers
--------------------------------------
by J.Y.A. Wang and E.H. Adelson

The proposed method decomposes image sequences into a number of layers, where each layer is defined by, at a minimum, an intensity map, a velocity map and an opacity map. Each layer has an ordered position and occludes the ones beneath it. A delta map contains data that allows a layer to change internally over time. By doing this, several "real" layers can be represented as one single layer. In the tests carried out, no delta maps were used, but the representation still proved successful. The segmentation based on motion uses affine motion decomposition.
After first estimating optical flow with a simple differential technique, the problem is then to find coherent motion regions. By splitting and merging, a minimum number of layers with approximately similar motion is obtained.

A very interesting aspect of this representation is motion picture compression, as the layers and their motion can represent the original sequence. The representation avoids the inherent redundancy of video frames, but introduces errors because the real world can seldom be divided into a finite number of moving layers. There is, however, an enormous benefit to this approach, as it can give a much higher image resolution by constructing a given layer from a large number of samples. HDTV-quality video can in this way be constructed from lower-resolution video signals, given that the sequence is a good candidate for a layered representation.
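To make the first-order differential technique referred to above a bit more concrete, here is a minimal sketch of a Lucas-and-Kanade-style window estimate with the eigenvalue reliability test, assuming two grayscale frames stored as NumPy arrays. The spatiotemporal Gaussian prefiltering and window weighting of the implementation tested in the paper are omitted, so this is an illustration of the idea rather than the authors' code.

    # Minimal Lucas-Kanade-style velocity estimate for one image window.
    # Assumes two grayscale frames as 2-D NumPy float arrays; the tested
    # implementation additionally prefilters in space-time and weights the window.
    import numpy as np

    def lk_velocity(frame0, frame1, y, x, half=7, tau=1.0):
        """Estimate (vx, vy) in a (2*half+1)^2 window centred on (y, x)."""
        Iy, Ix = np.gradient(frame0)          # spatial derivatives
        It = frame1 - frame0                  # temporal derivative (2-frame approx.)

        win = slice(y - half, y + half + 1), slice(x - half, x + half + 1)
        A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
        b = -It[win].ravel()

        AtA = A.T @ A
        eigvals = np.linalg.eigvalsh(AtA)     # ascending: eigvals[0] <= eigvals[1]
        if eigvals[0] > tau:                  # both eigenvalues above threshold:
            return np.linalg.solve(AtA, A.T @ b)   # full 2-D velocity
        if eigvals[1] > tau:                  # only the larger one: normal velocity
            _, vecs = np.linalg.eigh(AtA)
            d = vecs[:, 1]                    # dominant gradient direction
            return d * (d @ (A.T @ b)) / eigvals[1]
        return None                           # estimate considered unreliable

The two branches correspond to the full-velocity and normal-velocity cases discussed in the survey.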

Lars Liden

"Systems and Experiment: Performance of Optical Flow Techniques" Barron, Fleet & Beauchemin This paper provided an excellent overview of optical flow techniques including differential techniques, region-based matching, energy based-methods and phase-based techniques. The authors have done extensive work implementing and comparing each of the techniques. Unfortunately, even when the algorithms compared head-to-head it still seems somewhat difficult to tell which is "the best". Each technique seems to be susceptible to its own particular difficulties: (e.g. the altered version of Singh's technique has problems with periodic inputs, some methods have more difficulty discriminating between normal vs. 2-D velocity than others, and matching methods in general have problems with sub-pixel velocities). It seems like the various techniques may have to be tuned to specific problems and may be able to overcome them if properly addressed. For example, the matching based-techniques only looked at 2 or 3 frames at a time, and perhaps the problem with sub-pixel velocities can be solved by propagating information between frames which are further apart in time. This paper also raises the possibility of combining more than one technique to create velocity estimates. One can conceptualize a kind of voting scheme in which multiple techniques each contribute a velocity estimate which are then used to create a grouped velocity estimate. I think it is worth noting that all of these techniques did not use any information about object segmentation. Not only is motion a useful technique for segmentation, but it also seems that the segmentation itself can help to constrain information about velocity estimates and deal with such difficulties as the aperture problem. Although the techniques are compared for their ability to create accurate velocity estimates, in the real world of vision processing it would seem that having a very accurate estimate of direction and magnitude of movement is not as important and having an accurate segmentation and a rough idea of how fast and in what direction a segmented object is moving. Finally, (a side note), I wasn't familiar with one of the terms used in this paper and couldn't figure it out exactly from the paper. What exactly is the "aliasing" problem. "Representing Moving Images with Layers" Wang & Adelson Wang & Adelson introduced a significantly different method for dealing with motion of objects in an image sequence. Perhaps the most interesting feature of this method is that in using a layer representation of velocity maps it combines information about motion and segmentation of objects in the image. Because the segmentation is done using a velocity map coding the entire image, occluded objects which are never in physical contact in the image can be segmented as a single image in the layered map as each part of the separated image shares the same motion. Another interesting result of this method is that the background is treaded as an extended object that is larger than the frame of the image. The authors are also able to create artificial images by selectively removing objects (layers) when images are recreated. Although the method works exceptionally well there are a few difficulties. First, regions without texture cannot be assigned to a layer in any simple fashion using the techniques in the paper. The authors comment that areas such as 'sky' can be combined into a single layer since they share the same movement information. 
However, what if we have two large textureless objects, say a square and a triangle, moving in opposite directions? There will be no motion information from the interior of the objects, as they are untextured. The authors suggest that these two objects should be assigned to a single layer that describes stationary textureless objects. However, these objects are not only not stationary, they are different objects. Even if they were two different colors or intensities, they would still be classified together. Note, however, that there is information about the motion of these objects contained in the object borders which could be propagated to their interiors. This information would seem to be lost using only the layered support method.

Another difficulty with this method is that it applies only to motions which can be approximated by an affine transformation. The authors point out that in most cases this is adequate to produce motions that appear normal to the human eye on re-synthesis of the image. However, one can think of motions, such as rotation in and out of the plane, which cannot be easily represented by an affine transformation.
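As a purely hypothetical illustration of the voting scheme suggested above, the sketch below fuses flow fields from several techniques by a confidence-weighted average. The estimators, their confidence maps and the weighting rule are assumptions made up for illustration; neither paper proposes this.

    # Hypothetical confidence-weighted fusion of flow fields from several
    # techniques; the weighting rule is an illustrative assumption, not
    # something proposed in either paper under review.
    import numpy as np

    def fuse_flows(flows, confidences, eps=1e-6):
        """flows: list of (H, W, 2) arrays; confidences: list of (H, W) arrays."""
        flows = np.stack(flows)                       # (N, H, W, 2)
        conf = np.stack(confidences)[..., None]       # (N, H, W, 1)
        return (flows * conf).sum(axis=0) / (conf.sum(axis=0) + eps)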

Gregory Ganarz

In "Performance of Optical Flow Techniques" Barron et al. compare a variety of optical flow algorithms on both real and synthetic image sequences. One difficulty with this approach to comparison is that there generally must be agreement about what problem is trying to be solved. For the translating square sequence, Barron states "Of course, we expect normal estimates along the edges of the square and 2-D velocities only at the corners." p.58. However, some models such as the Motion BCS model of Grossberg and Mingolla (1993) would reorganize the velocity estimates at the edges based on the information at the corners, and thus obtain an accurate direction measurement for all areas of the square. None of the algorithms reviewed in the Barron paper have this ability. Also, algorithms generally require quite a bit of tuning to get them into a good operating range. While some of this tuning is image specific, some is not. A difficulty with one group simulating a number of algorithms is determining how well In "Representing Moving Images with Layers" Wang and Adelson present a method for segmenting a scene into objects ordered in depth. The idea of using depth planes is not a new one. Grossberg (1987) used them in his FACADE model for figure ground segmentation. One difficulty with "planes" is that researchers such as Z. He have provided evidence supporting the idea of surfaces and not depth planes. Surfaces can be curved in depth, while planes can't. The Adelson model would be unable to represent an object moving only in depth (looming or receeding). I believe the performance of the Wang model could be improved by including two dimensional cues to depth such as T-junctions. One of the interesting properties of their model is the ability to reconstruct sequences of images without certain objects (e.g. the flowerbed sequence without the tree).

Shrenik Daftary

Synopsis for "Systems and Experiment: Performance of Optical Flow Techniques by Barron, Fleet & Beauchemin" This paper provides a survey of methods to compute optical flow. Nine different techniques are presented. Unfortunately the first method's description was not included in my copy of the paper. The first described method in my copy was the Lucas and Kanade method which uses a weighted least-squares fit of constraints using a smoothing Gaussian filter. In this particular method a threshold is used for the eigenvalues to determine in what manner a velocity should be computed; if both l1 and l2 are greater than T the v(velocity) vector is calculated, if only l1 is greater than T only the normal velocity is calculated. The Nagel method uses second order derivatives to measure flow. The constraint based on oriented smoothness in this method attempts to allow occlusion. The next technique also second order is Uras et al, which uses a gradient to constrain the velocity. The actual implementation of these techniques, which rely on differentiation is that in high noise images differentiation is meaningless. The method suggested to deal with problems with noise is to use region based matching. Similarity measurements are presented to determine regions. So techniques which use matching were devised. The first technique by Anandan uses a Laplacian pyramid. The next method presented was Singh's two-stage matching method. The first stage is based on a sum-of-squared differences metric, while the second step propagates velocity using neighborhood constraints. The velocity is calculated by maximizing a likelihood function. A single energy-based method is presented, which can be tuned based on the Fourier domain representation of the image. Heeger is based on a spatiotemporal energy least-squares fit, where energy is extracted using a Gabor-energy filter applied to a Gaussian pyramid. Phase-based techniques are presented next. Waxman, Wu, and Bergholm's technique begins with the smoothing of edge maps, and tracking of contours using differential methods. Velocity is determined by using second derivatives of the activation profile which are used by convolving the appropriate Gaussian derivative with the edge map. The next phase based technique is the Fleet Jepson technique which relies on the derivatives of the phase. Next the experimental technique is presented. The first image that is tested is a synthetic input of sinusoidal waves in a plane. The second simple test involves the translation of a dark square. The sequences that are used, include a diverging tree sequence, and the Yosemite sequence. The correct flow fields are shown for both of these sequences. Finally some "real" life situations are presented. (rotating Coke can, taxi sequence, rotating Rubik's cube, and translation of camera angle. The results for the different techniques is presented for all of the above cases. The Fleet and Jepson technique appeared to work well in all cases, which leads me to believe that the selected tests were selected to work well with the author's technique. However the real-image data sets provides a good idea of what can happen in actual situations. The best techniques appeared to have higher magnitude velocities where the object was in motion. Problems with the methods are presented. In cases with little aliasing, first order systems performed well. Second-order methods worked well in all cases. Schunk's and Anandan's methods both presented techniques that can't distinguish normal from 2-D estimates. 
Energy-based techniques did not work well because of their dependence on initial conditions.

Synopsis of "Representing Moving Images with Layers" by Wang and Adelson

This paper presents a method to encode movement in an image sequence, where an underlayer is created for a stable background while separate layers are created for objects in the foreground. Rubber-sheet models are criticized, since they cannot cope well with image boundaries. The first part of the presented technique relies on using either a stable background or, in the case of camera pans, a moving background. Background regions that are occluded throughout the sequence are ignored: if they are never seen, they do not need to be reconstructed. Motion is used to segment the foreground after a more traditional 2-D segmentation is performed. The actual segmentation is based on local motion estimation and on fitting the movement to an affine model (a small sketch of this fitting step follows this synopsis). Layer synthesis is presented in terms of using stable information within a layer and establishing depth and occlusion relationships between layers.

The implementation of this technique was presented for three frames of the Flower Garden sequence. The frames were shown with affine transformations that align the flowerbed without regard to the other parts of the images. This demonstrated that if the flowerbed could be separated into its own layer, its "movement" could be described using affine motions exclusively. It also demonstrated that regions with similar affine transformations from frame to frame can be considered part of the same layer. Segmentation was performed using local motion estimation and affine model fitting; the actual methods are presented, as are methods for assigning regions by hypothesis testing. The actual layer segmentation was also presented, in terms of combining stable information within a layer to determine as much of the object as possible. Finally, occlusion relationships are determined and placed into maps.

The results of this technique were presented for the three frames of the Flower Garden sequence shown earlier, and the layer representation seemed to work well. The method is also presented in terms of compression, where a video sequence can be preserved in terms of its layers and the affine transformations required to fill the video space, and in terms of creating scenes that never existed by removing a layer. Overall this technique appeared to work well for video sequence reconstruction, but depending on the amount of detail that needs to be preserved it may not suffice, and in cases of non-affine motion the system would not work well.
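As a rough illustration of the affine fitting step mentioned above, here is a minimal sketch that fits the six affine motion parameters of one region to an already-estimated dense flow field by linear least squares. The iterative clustering, region splitting/merging and hypothesis testing described in the paper are not shown, and the function name is my own.

    # Least-squares fit of a 6-parameter affine motion model to a dense flow
    # field over one region; a sketch of the fitting step only -- the paper's
    # clustering and hypothesis testing are not shown.
    import numpy as np

    def fit_affine(flow, mask):
        """flow: (H, W, 2) array of (vx, vy); mask: boolean (H, W) region support.
        Returns (ax0, ax1, ax2, ay0, ay1, ay2) with
        vx = ax0 + ax1*x + ax2*y and vy = ay0 + ay1*x + ay2*y."""
        ys, xs = np.nonzero(mask)
        A = np.stack([np.ones_like(xs, float), xs, ys], axis=1)   # design matrix
        px, *_ = np.linalg.lstsq(A, flow[ys, xs, 0], rcond=None)  # vx parameters
        py, *_ = np.linalg.lstsq(A, flow[ys, xs, 1], rcond=None)  # vy parameters
        return (*px, *py)

Regions whose flow is well described by such a parameter set can then be grouped into the same layer, which is the intuition behind the split-and-merge step described above.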

John Isidoro

John Petry

REPRESENTING MOVING IMAGES WITH LAYERS, by Wang and Adelson

This was a very interesting paper. The authors' approach models moving images in a very natural way by decomposing the scene into separate objects which are segmented, layered, and given independent affine motion descriptions automatically. They raise a good point about how layers solve a key problem with standard optical-flow-only approaches, by dealing with the nonlinearities of occlusion and separate motion rather than averaging out the effects of separate objects with different motion paths. If the image sequence matches their constraints, it appears to work well, though it is quite slow.

The chief constraints are that the images can only contain a limited number of distinct objects, especially in the foreground. Only affine transformations can be modeled, so they can't handle something as common as a person walking across the image with arms and legs swinging. And while they do not discuss this, I'm not sure whether scaling introduces problems. A related issue is that most object motion must be in the image plane. Motion toward or away from the camera can only occur if it doesn't affect the layering of objects, which is held constant.

The compactness of the representation is very impressive: effectively two images per object (one for intensity, one for the mask) plus 6 motion terms per frame. This does not include the "delta" images they describe but did not implement. Delta images are intended to handle non-linearities and other effects of motion, lighting changes, etc. These could presumably amount to a full image per frame per object, which would not be efficient.

I have several implementation questions:

1) How would one determine the assignment of intermediate values between 0 and 1 to the alpha channel (mask image), which they suggest? (A small compositing sketch using such an alpha map appears at the end of these comments.)

2) They mention the splitting and merging of discontinuous regions that have the same motion, but it's not clear how they decide whether to keep them as a single object or as separate ones; e.g., the house and flowers are not grouped in the "Flower Garden" sequence. Is this because the house is set back sufficiently that its motion is noticeably different from that of the flowers? And is there any problem when new objects appear partway through a sequence?

3) I assume their clustering algorithm at each stage is guaranteed to stabilize. I'm not sure how that is assured, though. Also, I assume their system doesn't significantly depend on the number of initial arbitrary clusters or their starting positions, except in terms of the time to reach a stable arrangement, but this is not stated.

4) Is there any good way to incorporate local motion in a complex scene? For instance, on a windy day the flowers might all be moving in a way that is only loosely related, and is certainly distinct from the general camera zoom effect. Is this captured at all, or do we get the same initial view of the flowers throughout the recreated sequence, distorted only for camera motion?

5) When motion is computed on a frame basis, is there a built-in drift effect? I'm referring particularly to their remark that in the "Calendar" sequence the calendar moves so slowly that it merges into the wallpaper background on a per-frame basis, even though it is clearly moving in the sequence as a whole. Also, is this related to their comment that a recreated sequence may contain significant error but will still look natural to a human observer?
6) What precisely is done with pixels that don't fall into a moving region (i.e., that don't cluster well)? Do they go into the delta image?

A very interesting side effect was their ability to completely remove an object (the tree) when all the intensity and motion data was available for the underlying layers. Another is how the layer representation provides frame-rate independence; interpolation looks really easy. Overall, very good. I'd like to know more about *where* this is used as well, though.

PERFORMANCE OF OPTICAL FLOW TECHNIQUES, by Barron, Fleet and Beauchemin

This paper is essentially a set of experimental data comparing the results of nine implementations of optical flow algorithms. The results are quite informative, though, and would certainly be useful to anyone building a system with an optical flow component. Optical flow measurement is an attempt to determine the 2-D motion of a scene across a time-varying image sequence. The authors examine several differential methods, region-based matching, and energy-based and phase-based methods.

To cut directly to the key results, they found that two approaches were most successful across the range of images they tested on: the local differential method of Lucas and Kanade, and the local phase-based method of Fleet and Jepson. In addition, the authors explored confidence measures associated with the different algorithms as a way of knowing when they were producing reasonable results. The phase-based method was best, but it still had several weaknesses, namely temporal aliasing, the lack of a single confidence measure, and high computational cost.

However, I think the biggest limitations are those common to all of the optical flow approaches, namely that they treat all motion as a 2-D deformation of the whole scene. Although local deformations can occur, they are still part of the entire scene, and as such these approaches cannot handle occlusion or edge effects very well. On the other hand, as I understand it, they will do a better job of representing some types of local deformation than the layered approach of Wang and Adelson. For instance, the local motions of a walking man -- arms and legs swinging -- can probably be handled well as a 2-D deformation of the original image, given a uniform background behind him. But his motion across the scene as a whole is best handled by the layer method. Within constrained scenes, for instance the independent regions of the Wang and Adelson paper, optical flow is well suited, and in fact probably much superior to most other approaches. For instance, neither correlation nor generalized Hough transforms would work as well, for several reasons, principally an inability to handle deformations without retraining on every frame.
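Regarding the alpha-channel question above, here is a minimal back-to-front "over" compositing sketch, assuming each layer's intensity and alpha maps have already been warped into the current frame. This is the standard compositing rule rather than necessarily the paper's exact synthesis step; fractional alpha values simply blend a layer with whatever lies behind it.

    # Back-to-front "over" compositing of warped layers; assumes each layer's
    # intensity and alpha maps have already been warped into the current frame.
    # A standard compositing rule shown as an illustration, not necessarily the
    # paper's exact synthesis step.
    import numpy as np

    def composite(layers):
        """layers: list of (intensity, alpha) pairs ordered back to front,
        each an (H, W) float array with alpha in [0, 1]."""
        intensity0, _ = layers[0]
        out = np.zeros_like(intensity0, dtype=float)
        for intensity, alpha in layers:
            out = alpha * intensity + (1.0 - alpha) * out
        return out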


Stan Sclaroff
Created: Oct 1, 1995
Last Modified: