BU CLA CS 835: Seminar on Image and Video Computing

Class commentary on articles: Motion Estimation and Representation



William Klippgen

Performance of Optical Flow Techniques
--------------------------------------
by J.L. Barron, D.J. Fleet and S.S. Beauchemin

This article compares existing optical flow techniques, concentrating on their ability to compute velocities. All the code and images used in the tests are nicely arranged at URL=ftp://ftp.csd.uwo.ca/pub/vision and easy to make use of!

Many of the techniques can be considered as having three main processing stages:

1. Smoothing / prefiltering to extract the signal of interest.
2. Extraction of basic measurements on the time series, e.g. derivatives in time and space.
3. Construction of a 2-dimensional flow field based on the data from stage 2.

Differential techniques make use of spatio-temporal derivatives to come up with velocity estimates. The approaches are variations on how to minimize an equation incorporating the gradient constraint equation. Lucas and Kanade's implementation proved to be the second best of all methods in the test when estimating the speed of the moving square, "Square2". Their method also performed second best when detecting motion in the translating tree sequence, where the camera moves normal to its line of sight. The same method also won for 2-D motion detection in the Yosemite sequence, which contains a wide range of velocities. (A small sketch of this kind of first-order estimator appears after these comments.)

Region-based matching tries to detect velocities by comparing the position of a region in subsequent images. It performs better than the differential techniques when noise, a small number of frames or aliasing makes pure pixel-by-pixel techniques fail.

Frequency-based methods use filters to produce "energy" estimates based on Fourier transforms. Certain methods using this approach have been proven to be equivalent to correlation-based methods.

Phase-based techniques consider velocity as defined by the phase behaviour of band-pass filter outputs. Fleet and Jepson's version of this method did excellently when estimating the speed of the sinusoidal field named "Sinusoid 1". Speed estimates of the moving square, "Square2", were also most successful with this method, as were those for the translating and diverging tree sequences.

It is remarkable that the simple first-order differential technique proposed by Lucas and Kanade performs so well on a wide range of the synthetic data. However, Fleet and Jepson's approach was the overall winner, with the most reliable results across the sequences. Both methods also performed well on the real-image sequences. One important lesson learnt is that temporal smoothing is very useful to make up for aliasing effects. This paper is very important for establishing a common test-bed for velocity detection algorithms. There is still much to be done to find good error metrics that can be applied to the great variety of available methods.

Representing Moving Images with Layers
--------------------------------------
by J.Y.A. Wang and E.H. Adelson

The proposed method decomposes image sequences into a number of layers, where each layer is defined by, at a minimum, an intensity map, a velocity map and an opacity map. Each layer has an ordered position and occludes the ones beneath it. A delta map contains data that allows a layer to change internally over time. By doing this, several "real" layers can be represented as one single layer. In the tests carried out, no delta maps were used, but the representation still proved successful. The segmentation based on motion uses affine motion decomposition.
After first estimating optical flow with a simple differential technique, the problem is then to find coherent motion regions. By splitting and merging, a minimum number of layers with approximately similar motion is obtained.

A very interesting aspect of this representation is motion picture compression, as the layers and their motion can represent the original sequence. The representation avoids the inherent redundancy of video frames, but introduces errors because the real world can seldom be divided into a finite number of moving layers. There is, however, an enormous benefit to this approach, as it can give a much higher image resolution by constructing a given layer from a large number of samples. HDTV-quality video can in this way be constructed from lower-resolution video signals, given that the sequence is a good candidate for a layered representation.
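To make the first-order differential technique referred to above a bit more concrete, here is a minimal sketch of a Lucas-and-Kanade-style window estimate with the eigenvalue reliability test, assuming two grayscale frames stored as NumPy arrays. The spatiotemporal Gaussian prefiltering and window weighting of the implementation tested in the paper are omitted, so this is an illustration of the idea rather than the authors' code.

    # Minimal Lucas-Kanade-style velocity estimate for one image window.
    # Assumes two grayscale frames as 2-D NumPy float arrays; the tested
    # implementation additionally prefilters in space-time and weights the window.
    import numpy as np

    def lk_velocity(frame0, frame1, y, x, half=7, tau=1.0):
        """Estimate (vx, vy) in a (2*half+1)^2 window centred on (y, x)."""
        Iy, Ix = np.gradient(frame0)          # spatial derivatives
        It = frame1 - frame0                  # temporal derivative (2-frame approx.)

        win = slice(y - half, y + half + 1), slice(x - half, x + half + 1)
        A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
        b = -It[win].ravel()

        AtA = A.T @ A
        eigvals = np.linalg.eigvalsh(AtA)     # ascending: eigvals[0] <= eigvals[1]
        if eigvals[0] > tau:                  # both eigenvalues above threshold:
            return np.linalg.solve(AtA, A.T @ b)   # full 2-D velocity
        if eigvals[1] > tau:                  # only the larger one: normal velocity
            _, vecs = np.linalg.eigh(AtA)
            d = vecs[:, 1]                    # dominant gradient direction
            return d * (d @ (A.T @ b)) / eigvals[1]
        return None                           # estimate considered unreliable

The two branches correspond to the full-velocity and normal-velocity cases discussed in the survey.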

Lars Liden

"Systems and Experiment: Performance of Optical Flow Techniques" Barron, Fleet & Beauchemin This paper provided an excellent overview of optical flow techniques including differential techniques, region-based matching, energy based-methods and phase-based techniques. The authors have done extensive work implementing and comparing each of the techniques. Unfortunately, even when the algorithms compared head-to-head it still seems somewhat difficult to tell which is "the best". Each technique seems to be susceptible to its own particular difficulties: (e.g. the altered version of Singh's technique has problems with periodic inputs, some methods have more difficulty discriminating between normal vs. 2-D velocity than others, and matching methods in general have problems with sub-pixel velocities). It seems like the various techniques may have to be tuned to specific problems and may be able to overcome them if properly addressed. For example, the matching based-techniques only looked at 2 or 3 frames at a time, and perhaps the problem with sub-pixel velocities can be solved by propagating information between frames which are further apart in time. This paper also raises the possibility of combining more than one technique to create velocity estimates. One can conceptualize a kind of voting scheme in which multiple techniques each contribute a velocity estimate which are then used to create a grouped velocity estimate. I think it is worth noting that all of these techniques did not use any information about object segmentation. Not only is motion a useful technique for segmentation, but it also seems that the segmentation itself can help to constrain information about velocity estimates and deal with such difficulties as the aperture problem. Although the techniques are compared for their ability to create accurate velocity estimates, in the real world of vision processing it would seem that having a very accurate estimate of direction and magnitude of movement is not as important and having an accurate segmentation and a rough idea of how fast and in what direction a segmented object is moving. Finally, (a side note), I wasn't familiar with one of the terms used in this paper and couldn't figure it out exactly from the paper. What exactly is the "aliasing" problem. "Representing Moving Images with Layers" Wang & Adelson Wang & Adelson introduced a significantly different method for dealing with motion of objects in an image sequence. Perhaps the most interesting feature of this method is that in using a layer representation of velocity maps it combines information about motion and segmentation of objects in the image. Because the segmentation is done using a velocity map coding the entire image, occluded objects which are never in physical contact in the image can be segmented as a single image in the layered map as each part of the separated image shares the same motion. Another interesting result of this method is that the background is treaded as an extended object that is larger than the frame of the image. The authors are also able to create artificial images by selectively removing objects (layers) when images are recreated. Although the method works exceptionally well there are a few difficulties. First, regions without texture cannot be assigned to a layer in any simple fashion using the techniques in the paper. The authors comment that areas such as 'sky' can be combined into a single layer since they share the same movement information. 
However, what if we have two large textureless objects, say a square and a triangle, moving in opposite directions? There will be no motion information from the interior of the objects, as they are untextured. The authors suggest that these two objects should be assigned to a single layer that describes stationary textureless objects. However, these objects are not only not stationary, they are different objects. Even if they were two different colors or intensities, they would still be classified together. Note, however, that there is information about the motion of these objects contained in the object borders which could be propagated to their interiors. This information would seem to be lost using only the layered support method.

Another difficulty with this method is that it applies only to motions which can be approximated by an affine transformation. The authors point out that in most cases this is adequate to produce motions that appear normal to the human eye on re-synthesis of the image. However, one can think of motions, such as rotation in and out of the plane, which cannot be easily represented by an affine transformation.
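As a purely hypothetical illustration of the voting scheme suggested above, the sketch below fuses flow fields from several techniques by a confidence-weighted average. The estimators, their confidence maps and the weighting rule are assumptions made up for illustration; neither paper proposes this.

    # Hypothetical confidence-weighted fusion of flow fields from several
    # techniques; the weighting rule is an illustrative assumption, not
    # something proposed in either paper under review.
    import numpy as np

    def fuse_flows(flows, confidences, eps=1e-6):
        """flows: list of (H, W, 2) arrays; confidences: list of (H, W) arrays."""
        flows = np.stack(flows)                       # (N, H, W, 2)
        conf = np.stack(confidences)[..., None]       # (N, H, W, 1)
        return (flows * conf).sum(axis=0) / (conf.sum(axis=0) + eps)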

Gregory Ganarz

In "Performance of Optical Flow Techniques" Barron et al. compare a variety of optical flow algorithms on both real and synthetic image sequences. One difficulty with this approach to comparison is that there generally must be agreement about what problem is trying to be solved. For the translating square sequence, Barron states "Of course, we expect normal estimates along the edges of the square and 2-D velocities only at the corners." p.58. However, some models such as the Motion BCS model of Grossberg and Mingolla (1993) would reorganize the velocity estimates at the edges based on the information at the corners, and thus obtain an accurate direction measurement for all areas of the square. None of the algorithms reviewed in the Barron paper have this ability. Also, algorithms generally require quite a bit of tuning to get them into a good operating range. While some of this tuning is image specific, some is not. A difficulty with one group simulating a number of algorithms is determining how well In "Representing Moving Images with Layers" Wang and Adelson present a method for segmenting a scene into objects ordered in depth. The idea of using depth planes is not a new one. Grossberg (1987) used them in his FACADE model for figure ground segmentation. One difficulty with "planes" is that researchers such as Z. He have provided evidence supporting the idea of surfaces and not depth planes. Surfaces can be curved in depth, while planes can't. The Adelson model would be unable to represent an object moving only in depth (looming or receeding). I believe the performance of the Wang model could be improved by including two dimensional cues to depth such as T-junctions. One of the interesting properties of their model is the ability to reconstruct sequences of images without certain objects (e.g. the flowerbed sequence without the tree).

Shrenik Daftary

Synopsis for "Systems and Experiment: Performance of Optical Flow Techniques by Barron, Fleet & Beauchemin" This paper provides a survey of methods to compute optical flow. Nine different techniques are presented. Unfortunately the first method's description was not included in my copy of the paper. The first described method in my copy was the Lucas and Kanade method which uses a weighted least-squares fit of constraints using a smoothing Gaussian filter. In this particular method a threshold is used for the eigenvalues to determine in what manner a velocity should be computed; if both l1 and l2 are greater than T the v(velocity) vector is calculated, if only l1 is greater than T only the normal velocity is calculated. The Nagel method uses second order derivatives to measure flow. The constraint based on oriented smoothness in this method attempts to allow occlusion. The next technique also second order is Uras et al, which uses a gradient to constrain the velocity. The actual implementation of these techniques, which rely on differentiation is that in high noise images differentiation is meaningless. The method suggested to deal with problems with noise is to use region based matching. Similarity measurements are presented to determine regions. So techniques which use matching were devised. The first technique by Anandan uses a Laplacian pyramid. The next method presented was Singh's two-stage matching method. The first stage is based on a sum-of-squared differences metric, while the second step propagates velocity using neighborhood constraints. The velocity is calculated by maximizing a likelihood function. A single energy-based method is presented, which can be tuned based on the Fourier domain representation of the image. Heeger is based on a spatiotemporal energy least-squares fit, where energy is extracted using a Gabor-energy filter applied to a Gaussian pyramid. Phase-based techniques are presented next. Waxman, Wu, and Bergholm's technique begins with the smoothing of edge maps, and tracking of contours using differential methods. Velocity is determined by using second derivatives of the activation profile which are used by convolving the appropriate Gaussian derivative with the edge map. The next phase based technique is the Fleet Jepson technique which relies on the derivatives of the phase. Next the experimental technique is presented. The first image that is tested is a synthetic input of sinusoidal waves in a plane. The second simple test involves the translation of a dark square. The sequences that are used, include a diverging tree sequence, and the Yosemite sequence. The correct flow fields are shown for both of these sequences. Finally some "real" life situations are presented. (rotating Coke can, taxi sequence, rotating Rubik's cube, and translation of camera angle. The results for the different techniques is presented for all of the above cases. The Fleet and Jepson technique appeared to work well in all cases, which leads me to believe that the selected tests were selected to work well with the author's technique. However the real-image data sets provides a good idea of what can happen in actual situations. The best techniques appeared to have higher magnitude velocities where the object was in motion. Problems with the methods are presented. In cases with little aliasing, first order systems performed well. Second-order methods worked well in all cases. Schunk's and Anandan's methods both presented techniques that can't distinguish normal from 2-D estimates. 
Energy-based techniques did not work well because of their dependence on initial conditions.

Synopsis of "Representing Moving Images with Layers" by Wang and Adelson

This paper presents a method to encode movement in an image sequence, where an underlayer is created for a stable background while separate layers are created for objects in the foreground. Rubber-sheet models are criticized, since they cannot cope well with image boundaries. The first part of the presented technique relies on using either a stable background or, in the case of camera pans, a moving background. Background regions that are occluded throughout the sequence are ignored: if they are never seen, they do not need to be reconstructed. Motion is used to segment the foreground after a more traditional 2-D segmentation is performed. The actual segmentation is based on local motion estimation and on fitting the movement to an affine model (a small sketch of this fitting step follows this synopsis). Layer synthesis is presented in terms of using stable information within a layer and establishing depth and occlusion relationships between layers.

The implementation of this technique was presented for three frames of the Flower Garden sequence. The frames were shown with affine transformations that align the flowerbed without regard to the other parts of the images. This demonstrated that if the flowerbed could be separated into its own layer, its "movement" could be described using affine motions exclusively. It also demonstrated that regions with similar affine transformations from frame to frame can be considered part of the same layer. Segmentation was performed using local motion estimation and affine model fitting; the actual methods are presented, as are methods for assigning regions by hypothesis testing. The actual layer segmentation was also presented, in terms of combining stable information within a layer to determine as much of the object as possible. Finally, occlusion relationships are determined and placed into maps.

The results of this technique were presented for the three frames of the Flower Garden sequence shown earlier, and the layer representation seemed to work well. The method is also presented in terms of compression, where a video sequence can be preserved in terms of its layers and the affine transformations required to fill the video space, and in terms of creating scenes that never existed by removing a layer. Overall this technique appeared to work well for video sequence reconstruction, but depending on the amount of detail that needs to be preserved it may not suffice, and in cases of non-affine motion the system would not work well.
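As a rough illustration of the affine fitting step mentioned above, here is a minimal sketch that fits the six affine motion parameters of one region to an already-estimated dense flow field by linear least squares. The iterative clustering, region splitting/merging and hypothesis testing described in the paper are not shown, and the function name is my own.

    # Least-squares fit of a 6-parameter affine motion model to a dense flow
    # field over one region; a sketch of the fitting step only -- the paper's
    # clustering and hypothesis testing are not shown.
    import numpy as np

    def fit_affine(flow, mask):
        """flow: (H, W, 2) array of (vx, vy); mask: boolean (H, W) region support.
        Returns (ax0, ax1, ax2, ay0, ay1, ay2) with
        vx = ax0 + ax1*x + ax2*y and vy = ay0 + ay1*x + ay2*y."""
        ys, xs = np.nonzero(mask)
        A = np.stack([np.ones_like(xs, float), xs, ys], axis=1)   # design matrix
        px, *_ = np.linalg.lstsq(A, flow[ys, xs, 0], rcond=None)  # vx parameters
        py, *_ = np.linalg.lstsq(A, flow[ys, xs, 1], rcond=None)  # vy parameters
        return (*px, *py)

Regions whose flow is well described by such a parameter set can then be grouped into the same layer, which is the intuition behind the split-and-merge step described above.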

John Isidoro

John Petry

REPRESENTING MOVING IMAGES WITH LAYERS, by Wang and Adelson

This was a very interesting paper. The authors' approach models moving images in a very natural way by decomposing the scene into separate objects which are segmented, layered, and given independent affine motion descriptions automatically. They raise a good point about how layers solve a key problem with standard optical-flow-only approaches, by dealing with the nonlinearities of occlusion and separate motion rather than averaging out the effects of separate objects with different motion paths. If the image sequence matches their constraints, it appears to work well, though it is quite slow.

The chief constraints are that the images can only contain a limited number of distinct objects, especially in the foreground. Only affine transformations can be modeled, so they can't handle something as common as a person walking across the image with arms and legs swinging. And while they do not discuss this, I'm not sure whether scaling introduces problems. A related issue is that most object motion must be in the image plane. Motion toward or away from the camera can only occur if it doesn't affect the layering of objects, which is held constant.

The compactness of the representation is very impressive: effectively two images per object (one for intensity, one for the mask) plus 6 motion terms per frame. This does not include the "delta" images they describe but did not implement. Delta images are intended to handle non-linearities and other effects of motion, lighting changes, etc. These could presumably amount to a full image per frame per object, which would not be efficient.

I have several implementation questions:

1) How would one determine the assignment of intermediate values between 0 and 1 to the alpha channel (mask image), which they suggest? (A small compositing sketch using such an alpha map appears at the end of these comments.)

2) They mention the splitting and merging of discontinuous regions that have the same motion, but it's not clear how they decide whether to keep them as a single object or as separate ones; e.g., the house and flowers are not grouped in the "Flower Garden" sequence. Is this because the house is set back sufficiently that its motion is noticeably different from that of the flowers? And is there any problem when new objects appear partway through a sequence?

3) I assume their clustering algorithm at each stage is guaranteed to stabilize. I'm not sure how that is assured, though. Also, I assume their system doesn't significantly depend on the number of initial arbitrary clusters or their starting positions, except in terms of the time to reach a stable arrangement, but this is not stated.

4) Is there any good way to incorporate local motion in a complex scene? For instance, on a windy day the flowers might all be moving in a way that is only loosely related, and is certainly distinct from the general camera zoom effect. Is this captured at all, or do we get the same initial view of the flowers throughout the recreated sequence, distorted only for camera motion?

5) When motion is computed on a frame basis, is there a built-in drift effect? I'm referring particularly to their remark that in the "Calendar" sequence the calendar moves so slowly that it merges into the wallpaper background on a per-frame basis, even though it is clearly moving in the sequence as a whole. Also, is this related to their comment that a recreated sequence may contain significant error but will still look natural to a human observer?
6) What precisely is done with pixels that don't fall into a moving region (i.e., that don't cluster well)? Do they go into the delta image?

A very interesting side effect was their ability to completely remove an object (the tree) when all the intensity and motion data was available for the underlying layers. Another is how the layer representation provides frame-rate independence; interpolation looks really easy. Overall, very good. I'd like to know more about *where* this is used as well, though.

PERFORMANCE OF OPTICAL FLOW TECHNIQUES, by Barron, Fleet and Beauchemin

This paper is essentially a set of experimental data comparing the results of nine implementations of optical flow algorithms. The results are quite informative, though, and would certainly be useful to anyone building a system with an optical flow component. Optical flow measurement is an attempt to determine the 2-D motion of a scene across a time-varying image sequence. The authors examine several differential methods, region-based matching, and energy-based and phase-based methods.

To cut directly to the key results, they found that two approaches were most successful across the range of images they tested on: the local differential method of Lucas and Kanade, and the local phase-based method of Fleet and Jepson. In addition, the authors explored confidence measures associated with the different algorithms as a way of knowing when they were producing reasonable results. The phase-based method was best, but it still had several weaknesses, namely temporal aliasing, the lack of a single confidence measure, and high computational cost.

However, I think the biggest limitations are those common to all of the optical flow approaches, namely that they treat all motion as a 2-D deformation of the whole scene. Although local deformations can occur, they are still part of the entire scene, and as such these approaches cannot handle occlusion or edge effects very well. On the other hand, as I understand it, they will do a better job of representing some types of local deformation than the layered approach of Wang and Adelson. For instance, the local motions of a walking man -- arms and legs swinging -- can probably be handled well as a 2-D deformation of the original image, given a uniform background behind him. But his motion across the scene as a whole is best handled by the layer method. Within constrained scenes, for instance the independent regions of the Wang and Adelson paper, optical flow is well suited, and in fact probably much superior to most other approaches. For instance, neither correlation nor generalized Hough transforms would work as well, for several reasons, principally an inability to handle deformations without retraining on every frame.
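Regarding the alpha-channel question above, here is a minimal back-to-front "over" compositing sketch, assuming each layer's intensity and alpha maps have already been warped into the current frame. This is the standard compositing rule rather than necessarily the paper's exact synthesis step; fractional alpha values simply blend a layer with whatever lies behind it.

    # Back-to-front "over" compositing of warped layers; assumes each layer's
    # intensity and alpha maps have already been warped into the current frame.
    # A standard compositing rule shown as an illustration, not necessarily the
    # paper's exact synthesis step.
    import numpy as np

    def composite(layers):
        """layers: list of (intensity, alpha) pairs ordered back to front,
        each an (H, W) float array with alpha in [0, 1]."""
        intensity0, _ = layers[0]
        out = np.zeros_like(intensity0, dtype=float)
        for intensity, alpha in layers:
            out = alpha * intensity + (1.0 - alpha) * out
        return out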


Stan Sclaroff
Created: Oct 1, 1995
Last Modified: