BU CAS CS 585: Image and Video Computing

Applications of Pyramid Image Representations
November 14, 1996


Readings:

A. Blake, R. Curwen, and A. Zisserman, "A Framework for Spatiotemporal Control in the Tracking of Visual Contours," International Journal of Computer Vision, 11(2):127-145, 1993.

T. Darrell and A. Pentland, "Space-Time Gestures."


Kriss Bryan
Bin Chen
Jeffrey Considine
Cameron Fordyce
Timothy Frangioso
Jason Golubock
Jeremy Green
Daniel Gutchess
John Isidoro
Tong Jin
Leslie Kuczynski
Hyun Young Lee
Ilya Levin
Yong Liu
Nagendra Mishr
Romer Rosales
Natasha Tatarchuk
Leonid Taycher
Alex Vlachos


Kriss Bryan


Bin Chen


Jeffrey Considine

A Framework for Spatiotemporal Control in the Tracking of Visual Contours

Blake, Curwen, and Zisserman describe a method for contour tracking using elastic models and stochastic filtering combined with affine invariance. Conceptually, the method is descended from snakes, for which the elastic framework is approximately equivalent to a Kalman filter; here the authors develop the use of the Kalman filter explicitly in that context. Another refinement is the use of variable temporal resolution, which corresponds to changing the spatial resolution: if a feature is lost, the model changes faster and the search space is expanded until the feature is found, at which point the search space zooms in again.
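
A minimal sketch of this predict-correct-and-rescale loop, in one dimension, with an illustrative `measure` function standing in for the image feature search (my own toy example, not the authors' formulation):

```python
import numpy as np

def kalman_track(measure, x0, q=0.01, r=0.25, steps=100):
    """Toy 1-D Kalman tracker. When the feature is lost, the state
    variance keeps growing, which widens the search window (coarse
    scale, fast-changing model); a successful measurement shrinks it
    again (fine scale, steady tracking)."""
    x, p = x0, 1.0                 # state estimate and its variance
    for _ in range(steps):
        p = p + q                  # predict: uncertainty grows with motion noise
        window = 3.0 * np.sqrt(p)  # search radius ~ predicted std. dev.
        z = measure(x, window)     # returns None if no feature found in window
        if z is not None:
            k = p / (p + r)        # Kalman gain
            x = x + k * (z - x)    # correct toward the measurement
            p = (1.0 - k) * p      # uncertainty shrinks: "zoom in"
        yield x, window

# Hypothetical usage: a noisy feature drifting right at 0.1 per frame.
rng = np.random.default_rng(0)
def measure(x, window, truth=[0.0]):  # mutable default keeps state across frames
    truth[0] += 0.1
    z = truth[0] + rng.normal(0.0, 0.3)
    return z if abs(z - x) < window else None

for x, w in kalman_track(measure, x0=0.0, steps=5):
    print(f"estimate={x:+.2f}  search radius={w:.2f}")
```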

I found this paper difficult to follow because of the large amount of math involved. The method of finding features to lock on to seems very expensive and redundant, and their solution with random sampling seems unreliable unless the data is blurred to increase the area in which a particular feature may be detected.

Space-Time Gestures

Darrell and Pentland describe a method for applying view-based methods to recognizing the pose of articulated objects and their gestures. The method is not dependent on any particular view model; only a score relating the input to the various view models is required. Pose is simply found by taking the model for which the input has the greatest score. Gestures are found by modeling these scores with respect to time and comparing them to models of the gestures.

This method for recognizing gestures seems very elegant. It is not tied to any particular method of comparing images, and identification of gestures is done in the same way as pose recognition, using the pose scores as input. However, given the overlapping nature of the data used to determine gesture scores, some sort of incremental calculation would be useful here, since most of the data is shared between two consecutive gesture windows.
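
To make the score-pattern idea concrete, here is a hedged sketch (the names and the plain Euclidean distance are my assumptions; the paper normalizes scores and aligns sequences with dynamic time warping):

```python
import numpy as np

def gesture_distance(score_history, template):
    """Distance between a window of pose scores over time (a T x M
    array: T frames, M view models) and a stored gesture template of
    the same shape; smaller means a better match."""
    return float(np.linalg.norm(score_history - template))

def classify_gesture(score_history, templates):
    """Return the name of the best-matching gesture template,
    given a dict mapping gesture names to T x M templates."""
    return min(templates,
               key=lambda name: gesture_distance(score_history, templates[name]))
```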


Cameron Fordyce

Space-Time Gestures by T. Darrell and A. Pentland

This paper outlines a method to track and recognize human gestures. As might be obvious, this is a very hard problem due to changes in the orientation of the object and its deformation, i.e., the object being tracked does not maintain a static shape. This latter characteristic is central to human gestures. The method presented here is primarily a statistical pattern recognition approach with semi-supervised learning of models of the gestures to be tracked. At the date of writing, this algorithm was primarily user dependent, meaning that once trained on one person's gestures, it can only recognize that person's gestures. This is similar to current commercial automatic speech recognition programs that must be trained on individual users before they can be used and are limited to those users. I use this parallel to speech recognition because in many ways the paper draws on research carried out in the speech recognition community (see reference nos. 7 and 9, and the references to DTW).

Needless to say, the goal of this type of research is to find a way to track all gestures from any user, not just the user whose models are stored in the computer. This is not to criticize the authors, because the problem is a hard one and as yet largely unsolved in other pattern recognition communities (such as speech research).

There are several limitations to the algorithm, such as the assumptions of fixed illumination and a uniform background. I would also have liked to see a comparison of the 'good bye' hand with the 'hello' hand that is given in the article. In this vein, the authors do not address the issue of a user's gestural variability: how one signals 'hello' can vary significantly from a simple palm-out pose of the hand.

Since there appears to be a great deal of similarity of approaches between the speech recognition community and this area of object tracking, let me go out on a limb and suggest that at some level the use of Hidden Markov Models to track and recognize gestures might be the next step.


Timothy Frangioso


Jason Golubock


Jeremy Green


Daniel Gutchess

11. Space-Time Gestures

This paper describes a view-based technique for the recognition of human gestures. Matches are made by taking the normalized correlation between the input image and previously learned models. Gestures are described by these patterns of scores directly, instead of first translating them into a parameter space. The authors used special image processing hardware to do the correlation. Ways of "pruning" the search (for the correct model) by making assumptions about the input (e.g., limited movement from frame to frame) are also discussed. Experimental results for recognizing a "hello wave" gesture are shown and seem pretty accurate. The paper explained the ideas in a very intuitive way and left out details, making it more enjoyable reading, for me at least.
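
For reference, a short sketch of zero-mean normalized correlation, the standard formulation of the matching score described here (the paper computed this on special hardware; this NumPy version is only illustrative):

```python
import numpy as np

def normalized_correlation(patch, model):
    """Zero-mean normalized correlation in [-1, 1]; insensitive to
    uniform brightness and contrast changes between patch and model."""
    a = patch.astype(float) - patch.mean()
    b = model.astype(float) - model.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom else 0.0
```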

12. A Framework for Spatiotemporal Control in the Tracking of Visual Contours

This paper was very difficult for me to understand, but this much I got out of it. The tracker first estimates a contour, and then updates the estimate continuously by looking at image features surrounding it. In this regard, it sounds very similar to snakes. Contour tracking can be made easier (faster) if assumptions are made about the way things move. A Kalman filter can help here by predicting the next frame's image.


John Isidoro

The papers we read this week have to do with view-based tracking. The first one uses a moving B-spline to track the contour of a moving object. The second one uses a series of small still frames and normalized correlation for tracking.

The first paper, by Blake, Curwen, and Zisserman, was pretty long and difficult to understand. It would be nice if Kalman filtering and isotropy were explained in class. However, the basics of the paper seemed to make sense. They were more or less using a snake to track a moving object, but with more complex energy terms based on shape (affine invariance) and statistically based tracking (Kalman filtering). It seems to work pretty well if the background is much darker than the object you are trying to track; their test on page 139 shows what happens when even a small amount of background interferes with the tracking. I think it would be interesting to see what would happen if this technique were combined with optical flow to determine new snake positions. That would probably help solve some of the background interference problems, but would probably slow down the engine tremendously.

In contrast, the space-time gesture paper was perhaps one of the simplest papers I've ever read. It's pretty cool, though. They track a moving object by building up a library of views of that object: if a new view of the object doesn't correlate well with any of the views in the library, the new view is recorded. One thing I found unclear about this paper is the dynamic time warping algorithm they use; I wish they had added an extra section explaining it.


Tong Jin


Leslie Kuczynski

Space-Time Gestures

Authors T. Darrell and A. Pentland present a method for learning, tracking, and recognizing human gestures in real time. Their method uses a view-based recognition approach whereby a library of human gestures is built and then used as the basis for interpreting and classifying incoming data (e.g., video of a moving hand).

Of course, the ideal case would be to store all possible views of an object (such as a hand at all possible positions, scales, rotations, etc.), as well as all possible instances of the hand. However, this is unrealistic in terms of storage capacity as well as the real-time constraint. The authors use a method whereby they initially identify the object to be classified (this is the first model), track the object, and sequentially add new models to the search set whenever the tracking (correlation) score falls below a certain threshold. In this way, a subset of object poses is obtained. The actual library of gestures is then built from sequences of these object poses. To minimize the search space in the recognition phase, predictive techniques are used: the expected view model score at each time t+1 is computed given the view model score at time t. This approach makes sense for objects that can only logically be in k positions at time t+1 given their current position at time t.
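
A minimal sketch of that acquisition loop (the threshold value and the `match_score` function are illustrative assumptions, not the paper's):

```python
def acquire_view_models(frames, match_score, threshold=0.7):
    """Keep the first frame as the initial model; whenever the best
    score of the current frame against the existing models falls
    below threshold, record the frame as a new view model."""
    models = [frames[0]]
    for frame in frames[1:]:
        if max(match_score(frame, m) for m in models) < threshold:
            models.append(frame)
    return models
```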

In instances where gestures are very specific in nature (e.g., sign language), this method seems reasonable. However, the authors only showed results of testing the algorithm with seven test subjects, and they did not specify the context in which the gestures were recorded (did the video merely contain the hand gesturing "hello" or "goodbye" on a black background?).

A Framework for Spatiotemporal Control in the Tracking of Visual Contours

Authors A. Blake, R. Curwen, and A. Zisserman present a framework for contour tracking as a relatively autonomous, real-time process. They develop a mechanism for incorporating a shape template (e.g., a parameterized shape or 'snake') into a contour tracker via an affine-invariant coupling between the two. This allows the tracker to be selective with regard to shape and able to ignore background clutter. The methods proposed rely on a number of assumptions about (1) the geometry of the shape being tracked and (2) the distribution of uncertainty. These assumptions are necessary to satisfy the real-time constraint.

An interesting aspect of the method is that the system maintains a memory of the object being tracked. In this way, system stability is maintained: if a feature on the tracked object becomes obscured, the tracker will not be pushed completely out of its steady tracking state. Initially, the tracker operates at a large spatial scale with a short memory while it is locating the object. Once it locks on to the object, however, the spatial scale is reduced and its memory increased.


Hyun Young Lee


Ilya Levin

A Framework for Spatiotemporal Control in the Tracking of Visual Contours


The article, written by Blake, Curwen, and Zisserman, describes the principles of tracking curves in motion. The reader is introduced to a mechanism for incorporating a shape template into a contour tracker via an affine-invariant coupling; affine invariance ensures that the effect of varying the viewpoint is accommodated. Earlier versions of the tracker, which is further developed in this paper, have been used in the control of a robot arm, supporting closed-loop tracking and various aspects of hand-eye coordination. The authors set out a framework for contour tracking as a relatively autonomous process, in which tracking behavior is determined as a mathematical consequence of some natural assumptions about geometry and uncertainty. This framework evolved from the principles of snakes, an elastic model for shapes in motion that can be coupled to image features. Another theme related to the snakes idea is the representation of geometric prior information, which can be incorporated into the tracker by means of a template. A detailed description of the mathematical theory is given, along with a discussion of the advantages and disadvantages of each method introduced to the reader.


Space-Time Gestures


An important new application of machine vision is to extend the interface between man and machine, allowing the machine to directly perceive what its user is doing. This article, written by Darrell and Pentland, describes a system that can recover such information in real time, so that it can help mediate human-machine interaction. The authors adopt a view-based representation of objects and gestures, which allows the modeling of complex, articulated objects for which no simple 3-D model or recovery method is available. A gesture recognition method is introduced: a gesture can be thought of as a set of views observed over time, so previously trained gestures can be recognized by collecting the model outputs produced when viewing a novel sequence and comparing them to a library of stored patterns. Analyzing the observed model outputs makes it possible to train the gesture models. Notably, the methods introduced in this paper run on computational equipment of modest power.


Yong Liu

As an application of active contour models, Andrew Blake, Rupert Curwen, and Andrew Zisserman developed a new method for tracking moving contours at video rate. They presented their research in the 1993 article 'A Framework for Spatiotemporal Control in the Tracking of Visual Contours' (International Journal of Computer Vision, 11(2):127-145, 1993, Kluwer Academic Publishers).

As an alternative to a pixel-based representation, the authors employed B-splines, a parametric curve representation with a low-dimensional basis. They also carry over a characteristic of snakes: the use of parameterized shapes (which they call templates) in nondynamic shape-fitting processes.

They achieved the following:

  1. They applied a template to the tracking of a moving contour in the presence of elastic deformability and temporal noise.
  2. Instead of using a manually preset spatial scale for tracking, they derived a statistical basis to control the spatio-temporal scale automatically during tracking.
  3. To accommodate 3-D rigid motion and camera calibration, their template was made affine invariant (see the sketch after this list).
  4. Instead of setting up a 3-D template, they used an affine-invariant 2-D template and generalized it to full 3-D motion of a nonplanar shape.
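
To illustrate point 3, here is a minimal sketch (my own, under the assumption that the template is a set of 2-D control points) of how an affine-invariant template constrains the tracked shape:

```python
import numpy as np

def affine_view(template, A, t):
    """Map template control points (an N x 2 array) through a planar
    affine transform x -> A x + t. Restricting the tracked contour to
    this six-parameter family, rather than moving every control point
    freely, is what makes the template affine invariant."""
    return template @ np.asarray(A, float).T + np.asarray(t, float)

# Hypothetical usage: a unit-square template under shear plus shift.
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], float)
print(affine_view(square, A=[[1.0, 0.3], [0.0, 1.0]], t=[2.0, -1.0]))
```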

This article provided deeper insight into the application of the snake method and offers inspiration for doing the extra-credit part of the homework.

An alternative approach to a single 2-D template is discussed in Trevor Darrell and Alex Pentland's paper, Space-Time Gestures. A collection of gestures is used there to describe human body language.

Their view-based recognition rests on scoring the target object by its statistical match with a series of base models. In instances such as eye movement or the 3-D motion of a box, scores can be successfully used to evaluate the gesture. In other instances, however, a numerical score was not obtainable, and completely new models were developed to accommodate this situation.

They used special pattern recognition hardware for understanding hand gestures. In this case, only the general idea carried over from the eye-movement study was retained. Furthermore, they employed training and context guiding to help their base models. Their results in distinguishing between 'hello' and 'goodbye' hand waves were successful; however, the extent of their methodology's applicability is uncertain.


Nagendra Mishr

Space-Time Gestures

The authors present a mathematically efficient model for recognizing objects in a time series. They contend that if you know the object in the picture, then you do not need much information to determine its position in time.

They implement an auto-learning model which automatically partitions the input images into sets of data that are parametrically equivalent up to some threshold. This is pretty cool, but problems arise when gestures come into the picture. To get around this, they mention a dynamic time warping (DTW) algorithm. It would have been nice if they had briefly described what it does, but that was not the case. The DTW algorithm helps them organize the learned views and the associated playback speed; my guess is that it varies the time between frames based on the context of the input signal. They also mention that for retrieval, if you assume temporal coherence, i.e. the world is continuous and does not jitter, the search space can be reduced.
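
Since the paper leaves DTW unexplained, here is a compact sketch of the classic algorithm (the textbook form, not necessarily the paper's exact variant): it finds the cheapest alignment between two sequences that may run at different speeds.

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping distance between 1-D sequences a and b.
    d[i, j] is the cheapest cumulative cost of aligning a[:i] with
    b[:j]; the three-way min lets one sequence stretch or compress
    relative to the other."""
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return float(d[n, m])

# The same "wave" performed slowly and quickly still aligns cheaply.
print(dtw([0, 1, 2, 1, 0], [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]))
```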

The authors did not mention how the data are parameterized, but they do cite pretty good result times for their gesture search using 128x120-pixel pictures.

A Framework for Spatiotemporal Control in the Tracking of Visual Contours

This is a long article that covers a lot of ground. The authors mention B-splines and jump right into a lot of parametric detail. They lack a good way of presenting what they are trying to do: math is power, but not when you're trying to explain yourself to people who do not understand the math.


Romer Rosales

A Framework for Spatiotemporal Control in the Tracking of Visual Contours

(Article Review)

This work discusses the principles of tracking curves in motion. It is based on the idea of making contour tracking an autonomous process. This is achieved by considering natural assumptions like geometry and uncertainty of the object to be tracked.

The idea of the state-space representation is based on assuming that the moving object is a contour. The tracker (a curve (x(t), y(t))) is updated continually according to the visual feature it is looking for by searching in its neighborhood. Search thus occurs in a defined window in which the current estimate is the center and the visual feature attracts it.

This work uses a tracking filter (a Kalman filter) for the estimated contour. Initial conditions for the filter are computed by initializing the estimator to a fixed template. The authors show that, with this formulation, the tracking behavior is accurate and efficient.

In general, this work uses statistical analysis for contour tracking. It shows a mechanism for automatic control of spatio-temporal scale. One of its basic ideas is to use some natural assumptions and a template mechanism to model a simpler system and achieve shape-selective tracking.

This work is the result of a compilation of ideas from different articles related to tracking theory, together with the authors' own work. Although it is easy to get lost analyzing the mathematical foundations they use, I think good solutions are presented to different problems in the tracking field. I consider it an excellent work that can be very helpful for future research that improves on it or takes concepts from it as a base for solving similar problems.

Space-Time Gestures

(Article Review)

This work is based on the importance of recognizing meaningful gestures or spatio-temporal patterns for efficient machine-human interaction. Although the paper is grounded on this idea, in general it introduces a technique for learning, tracking, and recognizing spatio-temporal patterns (gestures being one example) using a view-based approach.

The pattern is represented using a view-based technique, which allows learning by observation and does not have to be very precise. A problem arises with this type of representation: complex articulated objects have a very large range of appearances, which makes view-based matching difficult.

This is solved by using a representation based on interpolation of appearance from a reduced number of views. The approach computes the correlation between the image and the set of learned view models, each of which contains one or more instances of images of a view of an object. View models also include information about which other models respond similarly, and the pattern of model responses that normally precedes and follows.

Gesture recognition is based on the idea that, for stereotyped behaviors, it is not necessary to know the parameter values associated with each model. It is therefore possible to identify a gesture by the pattern of view model scores directly (because the parameter values of the object pose are associated with the set of correlation values).

The approach used allows recognition of gestures without explicit knowledge of the transformation parameters.

To demonstrate this approach, the work implements a system that can build a view-based object model and its contextual dependencies at the same time as it tracks the object in space and time.

A remaining problem is learning an appropriate set of view models with which the parameters of the object's changes can be sampled. For complex cases (such as articulated objects) there are no analytical models; for simple transforms, some methods provide good approximations. This work develops a data-driven method for constructing a set of view models whose normalized correlation values give good results when interpolated. See the paper for more information.

An interesting approach is proposed to guide the search using contextual information. Temporal correlation between frames can be exploited so that the search is more efficient, and restrictions on the variation of the object between two consecutive frames can also be used. Other knowledge about the object can be useful for generating a context-based model; the problem is that this would make the system less general. The basic idea is to compute the conditional expected score and position.

The training (construction of models) of a gesture is achieved by providing examples of the gesture and computing the mean and variance of the view model correlation values as a function of time.
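
A small sketch of that training step, assuming the example score sequences have already been time-aligned to a common length (the array shapes here are my assumption):

```python
import numpy as np

def train_gesture(examples):
    """Given N aligned example sequences, each a T x M array of view
    model correlation scores over time, the gesture template is the
    per-time-step mean and variance across the examples."""
    stack = np.stack(examples)  # shape (N, T, M)
    return stack.mean(axis=0), stack.var(axis=0)
```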

Based mainly on simple statistical models, this approach can be very useful in recognizing certain kinds of patterns. The results presented are considerably accurate, and the kinds of constraints the example uses are very well suited to the assumptions of the model. Perhaps the most important issue is the way the system can be trained to recognize future gestures. I am not sure how well this model would perform with a different set of inputs.


Natasha Tatarchuk

This paper is of interest to me because it presents the topic of recognizing human gestures by a computer. There are many potential uses for a gesture recognition technique. For example, you could have a robot waiter understand your gesture of calling him to the table and bring you a Coke in little time (and if the gesture recognition system is good, it'll be a Coke and not Sprite!). Of course, this example is but a trivial use of such an important technique.

The authors approach this problem in a slightly different way than simply matching against available templates of gestures, which seems hopelessly inefficient considering their multitude. Rather, they use sets of view models that are learned from available examples, obtained by receiving new views of a familiar object from a tracking system. It seemed to me that the authors weren't quite clear on the implementation of the system, or of the tracking system, in much detail.

An explanation of many real-time issues is given in the paper, noting that the implementation for this work was done using a special Cognex 4400 vision processor and Sun computers. An example of a gesture recognition problem (simple 'hello' and 'good-bye' waving to the computer) is then quickly explained. I still don't think the authors did a good job of giving much detail or intuition for this kind of problem.


Leonid Taycher


Alex Vlachos

Space-Time Gestures

This paper shows a method of tracking a person's gestures in real time. The method they explain isn't extremely computationally demanding, so it seems like it could be applied to real-life programs. They give an example of tracking someone's eye movements. Maybe sometime in the near future, a camera will replace the mouse by tracking where someone is looking on their screen... after all, you're always looking at the mouse pointer.

They mention the problem of coming up with a good method for building a usable set of view models; their solution is to create the set automatically. They use a specialized computer for the real-time computations needed to recognize hand gestures, but I was unsure whether these computations would be too much for today's average home PC. It would be very interesting to see this work in real time on a home computer tracking eye movements.


Stan Sclaroff
Created: Sep 26, 1996
Last Modified: Nov 1, 1996