BU CAS CS 585: Image and Video Computing

Model-Based Object Recognition December 5, 1996


Readings:

R. A. Brooks, "Model-Based Three-Dimensional Interpretations of Two-Dimensional Images"
A. P. Pentland, "Perceptual Organization and the Representation of Natural Form"

Kriss Bryan
Bin Chen
Jeffrey Considine
Cameron Fordyce
Timothy Frangioso
Jason Golubock
Jeremy Green
Daniel Gutchess
John Isidoro
Tong Jin
Leslie Kuczynski
Hyun Young Lee
Ilya Levin
Yong Liu
Nagendra Mishr
Romer Rosales
Natasha Tatarchuk
Leonid Taycher
Alex Vlachos


Kriss Bryan

The article "Model-Based Three-Dimensional Interpretations of Two-Dimensional Images" is quite interesting. Throughout the article a model based system called ACRONYM is used to extract and determine models from an image. The system uses generalized cones (ribbons) and ellipses to determine parts of the object and each cone is given its own coordinate system. A few objectives of the system are to determine constraints implied by the image on the quantifiers in the models and determine the location and orientation of the camera if it was not already done.

The paper interested me because a few of the concepts seemed quite similar to computer graphics; for instance, the system assumed that the airplanes were on the ground, i.e., the plane z = 0. This reminded me of a 3D clipping algorithm that was just an extension of the 2D version. This system seems to do a lot of computations on the same cones over and over to refine the prediction, and I wondered whether there is a more efficient method to accomplish the same task. The system also seems interactive in that it uses image measurements to generate constraints on the original three-dimensional models.
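To make the idea of "constraints on the quantifiers" concrete, here is a minimal, purely illustrative sketch (in Python; the names and numbers are hypothetical, and this is not ACRONYM's actual constraint manipulation system): each model parameter carries an interval of feasible values, and each image measurement narrows that interval.

    # Minimal sketch of interval constraints on model quantifiers.
    # Hypothetical example; not ACRONYM's actual constraint system.

    class Quantifier:
        def __init__(self, name, lo=float("-inf"), hi=float("inf")):
            self.name, self.lo, self.hi = name, lo, hi

        def constrain(self, lo, hi):
            """Intersect the current bounds with a new [lo, hi] constraint."""
            self.lo, self.hi = max(self.lo, lo), min(self.hi, hi)
            return self.lo <= self.hi   # False means the model is inconsistent with the image

    # hypothetical quantifier for a wide-bodied jet model
    fuselage_length = Quantifier("fuselage_length_m", 40.0, 75.0)

    # a ribbon measured in the image, back-projected through the assumed
    # camera (aircraft on the ground plane z = 0), might imply 45-60 m
    still_consistent = fuselage_length.constrain(45.0, 60.0)
    print(fuselage_length.lo, fuselage_length.hi, still_consistent)   # 45.0 60.0 True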


Bin Chen


Jeffrey Considine


Cameron Fordyce

This paper outlines the structure of, and the relevant algorithms used in, the ACRONYM system, an image understanding system. The system is used to extract three-dimensional models from single two-dimensional images, without the use of multiple images (e.g., stereo pairs or image mosaics). The system is basically an example of model-based vision.

Model-based vision seems like a good idea at first glance, but upon further reflection there seem to be many roadblocks to getting a system that works well. The primary one in this system appears to be the line detection algorithm by Nevatia and Babu. The ability of the system depends on this data, which, judging from the illustrations presented in the paper, gives relatively noisy and incomplete representations of the desired objects in the image. But the main problem with model-based vision as described in this paper is the need to create rules describing the relationships of the subparts of an object (e.g., a plane) to each other. This rule creation depends on the creator's observations and is thus extremely time-consuming. This is also a drawback of many artificial intelligence algorithms in this field and others. While the base system is domain independent, the rules necessary for the system to classify objects are not and must be created by hand.

A further drawback is that if the system were set up to classify an oil tank, it is possible that, given another image with objects of the same geometric shape and scale, there would be no way to distinguish between the new object and the oil tank, since the algorithm relies purely on geometric information and not on other sources of information such as texture or context. That is, to recognize that a group of shapes is a wide-bodied plane, we have to know that the image is of an airport and not of a harbor or a train depot.

A. P. Pentland presents a very good summary of the drawbacks of systems such as this one in the next paper reviewed here.

Despite these drawbacks, the system, given such noisy information and implemented with such limited hardware, does relatively well.


Perceptual Organization and the Representation of Natural Form by A.P. Pentland

This paper presents a very detailed summary both of human visual perception processes and of researchers' approaches to duplicating that capability with machines. The paper then proposes a new representation of visual information based more closely on what researchers believe is the representation used by human beings. This representation falls somewhere between model-based representations (see the previous paper reviewed and, in general, CAD models of objects) and the representation of a scene via pixel-level information (i.e., edges, contrast differences, etc.). Both extremes have many drawbacks and do not seem to correspond to the way that humans perceive the visual world. This representation relies on finding 'primitives', or 'a set of generically applicable part models', that can be deformed mathematically and added to one another to create an infinite number of real-world objects. The purpose is to create a representation that can also mirror a possible history of the creation of the object (i.e., how a naive observer would describe it).

This new representation appears to give researchers a method of describing objects in a scene that is capable both of classifying these objects and of generating models of them. This is akin to a problem in linguistics: currently there is no single grammar that is used both to recognize a phrase and to generate it automatically; there are grammars that do one task or the other. The representation presented by Pentland seems to resolve this problem for vision.


Timothy Frangioso

"Perceptual organization and the Representation of Natural Form', by Alex P. Pentland

This paper discusses a technique for approaching machine vision from a common-sense view. The author talks about two major schools of thought in the interpretation of sensory data: the first uses high-level, object-specific models and the second uses low-level models. The presented technique is a third alternative.

The proposed approach is called the intermediate-grain part model and is based on the philosophy that all objects in the world are made up of recurring patterns within nature. Basically, one uses these patterns to find out what the object is. The benefit of such a system is that one does not have to make the unrealistic assumptions about the object that were required for previous methods to be useful. The author envisions these primitives as lumps of clay from which any object can be constructed. The technique seems to work well with biological forms such as living creatures but has problems with extremely detailed inanimate forms such as crumpled newspaper or grass. To solve this problem, the basic primitive model is combined with a fractal model. This gives a more robust representation of the image and allows for much more variation in the objects created.

This technique has applications in any area where you need a higher-level representation of an image, that is, when a general common-sense system is needed to find anything from people to cars. The major strength of this approach is that it does not have to impose the constraints or assumptions of previous techniques and it does not need much specific information about the problem. This makes it extremely versatile.

"Model-Based Three-Dimensional Interpretations of Two-Dimensional Images" by Rodney A. Brooks

This paper describes a process for predicting and relating different objects within an image. The interpretation of the images is accomplished by a combination of geometric relations and algebraic constraints.

This modeling system uses coordinate systems to place objects in the world: the first is an object coordinate system and the second is the camera coordinate system. The world coordinate system created from these is based on a scheme of generalized cones. Basically, these generalized cones are made to represent the objects in the world. After these models are made, a series of constraints or rules is applied in order to classify the object into a general category.
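As a rough illustration of what "objects as generalized cones with their own coordinate systems" might look like in code, here is a small, hypothetical Python sketch (the part names, sizes, and transforms are invented; Brooks' actual system is far richer than this):

    # Illustrative sketch only: an object as a hierarchy of generalized cones,
    # each with a transform placing its local frame relative to its parent.

    import numpy as np

    def translation(x, y, z):
        T = np.eye(4)
        T[:3, 3] = [x, y, z]
        return T

    class GeneralizedCone:
        def __init__(self, name, spine_length, radius, to_parent, children=()):
            self.name = name
            self.spine_length = spine_length   # length of the cone's spine
            self.radius = radius               # circular cross-section radius
            self.to_parent = to_parent         # 4x4 transform into the parent frame
            self.children = list(children)

        def world_origin(self, parent_to_world=np.eye(4)):
            """Origin of this cone's local frame, expressed in world coordinates."""
            to_world = parent_to_world @ self.to_parent
            return (to_world @ np.array([0.0, 0.0, 0.0, 1.0]))[:3]

    # hypothetical two-part aircraft model: a fuselage with one wing attached
    wing = GeneralizedCone("wing", spine_length=20.0, radius=1.0,
                           to_parent=translation(0.0, 5.0, 0.0))
    fuselage = GeneralizedCone("fuselage", spine_length=50.0, radius=3.0,
                               to_parent=translation(100.0, 200.0, 0.0),  # sits on the ground plane z = 0
                               children=[wing])

    print(fuselage.world_origin())                   # [100. 200.   0.]
    print(wing.world_origin(fuselage.to_parent))     # [100. 205.   0.]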

The major problem with this technique is that the classifications have to be made specifically for an object to be recognized. The problem with the testing was that the images were noisy and of low quality. I wonder why the author did not do more research with better-quality images to better demonstrate the usefulness of the technique when low-level processing can be improved, or at least when good images can be obtained. It seems that airfields were a bad choice of test subject.


Jason Golubock


Jeremy Green


Daniel Gutchess


John Isidoro


Tong Jin


Leslie Kuczynski


Hyun Young Lee


Ilya Levin


Yong Liu


Nagendra Mishr

Model-Based Three-Dimensional Interpretations of Two-Dimensional Images

The author presents a method for representing three-dimensional objects and for recognizing them. A good bit of the work involves coming up with 3D representations of the objects.

The input images are sent through a line filter, which outputs a line drawing representing the original image. This drawing is then interpreted using cones and ellipses, and the conversion process generates objects which represent the 2D image. Consequently, a 3D representation of the image in question is generated. The searching process uses a derivative of predicate calculus to match the pattern to something in memory.
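The review does not spell out what such matching rules look like; as a purely hypothetical illustration of the flavor of rule-based matching over image ribbons (Brooks' actual rules were written in Lisp and are not reproduced here), one toy rule might hypothesize a wing pair whenever two ribbons have nearly equal length and width:

    # Toy, hypothetical production rule; the thresholds and data are invented.

    def wing_pair_rule(ribbons, length_tol=0.15, width_tol=0.25):
        """Return pairs of ribbons that could be the two wings of one aircraft."""
        hypotheses = []
        for i, a in enumerate(ribbons):
            for b in ribbons[i + 1:]:
                close_len = abs(a["length"] - b["length"]) <= length_tol * max(a["length"], b["length"])
                close_wid = abs(a["width"] - b["width"]) <= width_tol * max(a["width"], b["width"])
                if close_len and close_wid:
                    hypotheses.append((a["id"], b["id"]))
        return hypotheses

    ribbons = [{"id": "r1", "length": 18.0, "width": 2.0},
               {"id": "r2", "length": 19.5, "width": 2.2},
               {"id": "r3", "length": 50.0, "width": 3.0}]   # a fuselage-sized ribbon
    print(wing_pair_rule(ribbons))                            # [('r1', 'r2')]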

This method is error-prone and can probably yield faulty 3D interpretations. For example, in their examples the images are of planes viewed from thousands of feet in the air. The method could easily be tricked by the use of cardboard cutouts of planes; e.g., in Desert Storm, the Iraqis used similar structures to fool the allied bombers. The author also does not state whether traditional FOPL (first-order predicate logic) can be extended to use these techniques.

Perceptual Organization and the Representation of Natural Form

A. Pentland describes a technique for representing complete objects as compositions of simpler objects. This has the effect of minimizing storage space and subsequently can be used to simplify calculations regarding image manipulation.

By using a combination of parameterized subobjects and fractal textures, the images generated can be very detailed. The problem is to take a real-world image and come up with its accurate object/fractal representation. The author glosses over the image recovery problem with lots of equations but does not give enough intuitive feeling for how this is accomplished.


Romer Rosales

Model-Based Three-Dimensional Interpretations of Two-Dimensional Images

Rodney A. Brooks

(Article Review)

This work approaches the problem of image feature and feature-relation prediction. It also deals with developing techniques for matching, controlled by geometric relations and algebraic constraints, when interpreting images. The paper considers a particular domain, aerial views of airfields.

Among its goals, it improves on previous approaches used in a modeling system (ACRONYM) which was based only on three-dimensional geometric models to perform image understanding. The techniques developed include the recognition of object classes and the extraction of 3D information from images.

The paper gives a general description of the ACRONYM modeling system. In general, the world is modeled as a coordinate system and objects as generalized cones with local coordinate systems. Cameras are modeled as coordinate systems with the viewing direction along the z axis.

In the image interpretation process, this approach needs to have some classes of generic geometric models, so that it can identify instances of the object in the image, their location and orientation, and the camera orientation.

This technique can be useful in industrial settings (better than in a more complicated or unpredictable environment); reasoning about how to grasp objects based on algebraic constraints can be one of these applications.

This is one example of how the use of geometry and symbolic algebraic constraints can generate a more or less accurate image interpretation. But the use of geometry alone can be confusing when many objects in the environment share the same basic geometric characteristics (which is the realistic situation).

Perceptual Organization and the Representation of Natural Form

Alex P. Pentland

(Article Review)

In this paper, a system for accurately representing an extensive variety of natural and artificial forms is presented. This is done by representing objects with descriptive (simple) parts, achieving significant efficiency in storage and algorithm performance. I agree that by using the properties of this representation it is possible to recover descriptions of given objects from image data, and also that it can be useful in supporting higher-level cognition processes. Reasoning and machine communication are also discussed.

The parts that this model mentions simplify the process of analyzing structure from the globally perceived scene, rather than from detailed atomic structures that by themselves do not represent very much. Here it is assumed that parts are reliably recognizable from image data.

For this model, a parameterized family of shapes known as superquadrics is the essential representation. This family has a mathematical representation (see the paper). It includes cubes, spheres, diamonds, pyramids, and intermediate forms of them.
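For reference, the superquadric family the review refers to is usually written (following Barr's standard parameterization, given here from general knowledge rather than quoted from the paper) as

    \mathbf{x}(\eta,\omega) =
      \begin{pmatrix}
        a_1 \cos^{\epsilon_1}\eta \, \cos^{\epsilon_2}\omega \\
        a_2 \cos^{\epsilon_1}\eta \, \sin^{\epsilon_2}\omega \\
        a_3 \sin^{\epsilon_1}\eta
      \end{pmatrix},
      \qquad -\pi/2 \le \eta \le \pi/2, \quad -\pi \le \omega < \pi,

where a_1, a_2, a_3 set the extent along each axis and the exponents \epsilon_1, \epsilon_2 control the "squareness": values near 0 give cube-like shapes, 1 gives an ellipsoid, and 2 gives pinched, diamond-like shapes.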

I liked this idea of simplifying our representation of the world; I think it can be useful in representing it at an appropriate level (for machines). It loses a lot of detail, but the general idea about objects is captured efficiently, and it can describe an immense variety of forms in a somewhat natural way. This efficiency in describing complicated forms can be very useful in running complicated algorithms in a short period of time, in simplifying the work of analysis, etc.

This work uses such descriptions for representing biological forms, which can be described by hierarchical Boolean combinations of the basic primitives. It explains, for example, that a human body requires about 40 primitives of 300 bytes each. Complex inanimate forms are more difficult to represent, especially complex natural surfaces like clouds and mountains, so the representation problem for these kinds of objects is not well addressed by this model alone: they are too complex, there is too much information, and many objects can end up in the same class when they are not even similar to human cognition.

This problem is addressed by using fractal properties, which are found to describe these natural surface shapes well. People's perceptual notion of roughness has been found to be very similar to models based on fractals.
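As a hypothetical aside on what "fractal roughness" means computationally (this is a generic midpoint-displacement construction, not Pentland's formulation), a one-dimensional fractal profile can be generated like this, with a single persistence parameter controlling how rough the result looks:

    # Generic 1-D midpoint-displacement sketch; smaller H gives a rougher profile.
    import random

    def fractal_profile(levels, H=0.7, scale=1.0, seed=0):
        random.seed(seed)
        pts = [0.0, 0.0]                       # endpoints of the profile
        for _ in range(levels):
            nxt, scale = [], scale * (0.5 ** H)
            for a, b in zip(pts, pts[1:]):
                mid = 0.5 * (a + b) + random.gauss(0.0, scale)
                nxt.extend([a, mid])
            nxt.append(pts[-1])
            pts = nxt
        return pts

    smooth = fractal_profile(levels=6, H=0.9)   # gentle, hill-like profile
    rough  = fractal_profile(levels=6, H=0.3)   # jagged, mountain-like profile
    print(len(smooth), len(rough))              # 65 65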

By using these two descriptions, this work constructs fractal surfaces with superquadric lumps describing the surface features; illustrations of this process are provided (though they are difficult to see).

A description of the recognition of these model primitives from image data is also given.

To finish this review, I will add that the general frame of ideas that motivates this work is not commonly found in typical computer-vision approaches. Besides this, cognition and naturalness of representation are intelligently approached and very well described and explained through every section of the paper. A higher-level modeling stage could adopt part of the orientation of this work.


Natasha Tatarchuk

This paper is a perfect example of how the meaning of the work done by the author can get completely buried by ambiguous and pompous wording, or at least so it seemed to me. The author describes a 'system' (what exactly does he mean by that word? not clear!) called ACRONYM, with virtually no details about the method of implementation except a little sneak comment about the rules being done in Lisp. ACRONYM is "a comprehensive, domain-independent, model-based system for vision and manipulation tasks." That sentence alone amazes me. I don't mean to harp on the particularities of the language with which this paper is written, but after laboring to excavate the real work that has been done, it is still not clear to me what the system does. Of course, a lot of very ambiguous details are related to the reader, but what relates to what is very unclear to me.

From what I've extracted from this paper, this system is trying to take a huge problem onto its shoulders. The author claims that the system will be able to recognize some objects from their 3D geometry descriptions given a planar image representation, where some geometry description is assumed to be given beforehand. Besides the fact that the areas of object detection and image feature and feature-relation prediction are complicated and ambiguous as it is, the problem with this paper is compounded by the fact that the author does not introduce the system well enough in the beginning for the reader to have a good understanding of what is going on. The model domain for ACRONYM is described, in which an a priori model of the world is given to ACRONYM. I am really confused at this point already. What does the author mean by that? Then he describes the geometric model, which consists of subpart hierarchies of generalized cones, with those described later. What I found really curious is that every time the author introduces a confusing moment in this work, he sends the reader to dig the explanation out of some reference. The two sections about predictions and constraints are clear examples of that, where the language is very ambiguous ('informal empirical evidence suggests...', and so on), and all the shaky points and definitions are described in detail in the [X] reference paper.

Well, without going off too much on the style of the paper, though it seems to undermine all the work done by the author and plant a multitude of doubts in the reader, I finished reading this paper with a few vague notions of what the system does and no clear understanding.


Leonid Taycher

Model-Based 3-D Interpretation of 2-D Images

This paper is more pure AI than computer vision. It describes the way a normal production-rule system would be able to process graphical input. The system tries to match the output of the low-level filters and edge finders to the projections of the generalized cones (in terms of which all of the object models are described). Since no details are given on HOW this is done, it is hard to judge whether the method worked in the examples because it works in general or because these were lucky cases.

Perceptual Organization and the Representation of Natural Form

This article describes a representation of natural objects using building blocks (superquadrics).


Alex Vlachos


Stan Sclaroff
Created: Sep 26, 1996
Last Modified: Nov 1, 1996