BU CAS CS 585: Image and Video Computing

Structure from Motion
November 26, 1996


Readings:


Kriss Bryan
Bin Chen
Jeffrey Considine
Cameron Fordyce
Timothy Frangioso
Jason Golubock
Jeremy Green
Daniel Gutchess
John Isidoro
Tong Jin
Leslie Kuczynski
Hyun Young Lee
Ilya Levin
Yong Liu
Nagendra Mishr
Romer Rosales
Natasha Tatarchuk
Leonid Taycher
Alex Vlachos


Kriss Bryan


Bin Chen

Existing solutions to the structure-from-motion problem work well for perfect images, but are very sensitive to noise. This paper presents a new approach, called the factorization method, which is accurate and robust.

Most previous work by other researchers uses perspective projection and a camera-centered representation of shape. This makes shape estimation difficult, unstable, and noise sensitive. The factorization method formulates the problem in world-centered coordinates and under orthography, which leads to a simple and well-behaved solution. The reason is that the mutual independence of shape and motion, together with the linearity of orthographic projection, makes it possible to cast the structure-from-motion problem as a factorization problem. To do this, the image stream is represented by a 2FxP measurement matrix, which is then registered and used as the input to the factorization method. The rank theorem plays an extremely important role in the paper. It captures precisely the nature of the redundancy that exists in an image sequence. It states that without noise the registered measurement matrix is at most of rank three, and that with noise the theorem can be extended in a well-defined manner. The rank theorem ensures that the measurement matrix can be expressed in the form W = RS, where R represents the camera rotation and S represents the shape. Based on this, the factorization algorithm is given in the paper. Furthermore, the authors extend the factorization method to deal with feature occlusions.
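
A minimal NumPy sketch of the registration step described above (the function and variable names are illustrative, not from the paper): the horizontal and vertical feature coordinates are stacked into a 2FxP measurement matrix, and subtracting each row's mean registers it.

    import numpy as np

    def register(u, v):
        """u, v: F x P arrays of tracked horizontal and vertical feature coordinates.
        Returns the registered 2F x P measurement matrix and the row means,
        which are the projection of the object centroid (the translation)."""
        W = np.vstack([u, v])                  # 2F x P measurement matrix
        t = W.mean(axis=1, keepdims=True)      # mean of the P entries in each row
        return W - t, t

    # Rank theorem (noise-free case): np.linalg.matrix_rank(W_registered) <= 3.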

Several experiments, including both laboratory and outdoor experiments, are done to demonstrate the factorization method. They show that the motion recovery is precise, that the factorization method does not smooth the results, and that it can deal without difficulty with streams containing degenerate substreams. Also, the factorization method does not require a smooth-motion assumption. All of the above show the high accuracy and robustness of the factorization method.


Jeffrey Considine

Tomasi and Kanade present a method for recovering shape and motion under orthography without computing depth as an intermediate result, giving better results for objects that are distant with respect to their size. They do this by constructing a matrix from all the data points and factoring it to get a shape matrix and a camera rotation matrix. They go through all the math involved, but I was unable to follow it. The concept, however, seems very simple even if the linear algebra is complicated. The examples seem to work well and the technique seems very general. One potential problem is the reliance on trackable points within the image...


Cameron Fordyce

The authors of this paper present a mathematically based method for recovering shape and motion from an image stream under orthographic projection. The method is based on tracking feature points from frame to frame. By finding the trajectories of the features, a matrix is built of their coordinates. The camera motion and the shape of the object are then factored into separate matrices, so that both are recovered in one pass. The authors contend that this method is robust with regard to occlusions and to a small amount of camera yaw and roll. On the mathematical side, their method relies on singular value decomposition, a relatively well-behaved and stable algorithm.

The most surprising aspect of this method is its reported robustness. If it can deal with 'degenerate image streams' and sparse data, then the algorithm is indeed very good. However, the outdoor experiment shows that the feature-tracking algorithm strongly affects the success of the method. Nevertheless, the results in this case are still significant.

The primary limitation of this method, as reported by the authors, is that camera motion along the optical axis cannot be recovered and must remain small over the whole of the stream. Their method also relies on the capabilities of the feature-tracking algorithm, which introduces another important limitation. This is not a shortcoming of the paper, but I would have liked to read the references cited in the paper concerning this algorithm. The authors also mention an algorithm that automatically chooses the number of feature points to track. This type of automation is extremely important to the advancement of machine vision.

I had a few questions that were left unanswered in the paper. First, what are the limits on the frame rate as it affects the tracking? Is this method only viable at a frame rate of 30 images/second (the video rate)? Is there some lower bound on the frame rate relative to feature movement? This is a characteristic more of the tracking algorithm than of the Factorization algorithm.

In summary, the Factorization algorithm appears to produce extremely good results with relatively simple calculations. It appears to be robust in many environments. The use of the redundant information in the image stream is especially appealing. As the authors noted, further experimentation should be done with objects that have more occlusions.


Timothy Frangioso

This paper describes a process for recovering both the motion and the shape of an object by tracking feature points through a sequence of images. The Factorization Method, as it is called, can recover both shape and motion under orthography. That is, it can recover the shape and motion from a sequence of images where the projecting lines are perpendicular to the plane of projection. Given this constraint, one can track the P desired points through the F frames to recover the two desired matrices, R (rotation) and S (shape).

The method uses the differences in the points being tracked through the F frames to obtain trajectories of the points. This allows for the calculation of motion through the frames. To do this, one makes a measurement matrix out of the trajectory coordinates. Then, by subtracting from each entry the mean of the entries in the same row, one gets a registered measurement matrix, call it Wr. This matrix Wr holds all the information about shape and motion that is needed: Wr = RS, where R and S are as defined above. The reason that one can estimate the shape and rotation even with occlusions is that an approximate registered measurement matrix, call it Ar, still gives us AR and AS, corresponding to the approximate R and the approximate S matrices. It is shown in the paper that AR is a linear transformation of the true R, and likewise AS of the true S. Therefore it is possible to obtain R and S from AR and AS.
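
A tiny NumPy illustration of the "linear transformation" point above (the names are made up for the sketch): any invertible 3x3 matrix Q can be slipped between the two factors of the registered measurement matrix without changing their product, so an approximate factorization determines the true R and S only up to such a transformation, which the paper removes with additional (metric) constraints.

    import numpy as np

    rng = np.random.default_rng(0)
    R_hat = rng.standard_normal((10, 3))  # stand-in for an approximate rotation factor
    S_hat = rng.standard_normal((3, 7))   # stand-in for an approximate shape factor
    Q = rng.standard_normal((3, 3))       # an arbitrary (almost surely invertible) 3x3 matrix

    # (R_hat Q)(Q^-1 S_hat) reproduces exactly the same measurement matrix.
    print(np.allclose(R_hat @ S_hat, (R_hat @ Q) @ (np.linalg.inv(Q) @ S_hat)))  # True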

This method worked very well in the lab experiments. Even in the experiments where there were occlusions, it performed very well. This technique opens up all kinds of possibilities for recognition. Not only would it allow one to find objects, but also to find activities. One could use this technique to find particular objects behaving in particular ways. You could build a recognition system to find actions. For example, one could use the trajectories to determine how an object was moving and match that to a desired motion. The example of the ball makes this clearest: you could find balls that were spinning on their axis, or perhaps ones that had been rolled off a desk.


Jason Golubock


Jeremy Green


Daniel Gutchess

Authors Tomasi and Kanade describe a method of extracting shape and camera movement from video sequences. This "factorization method" is robust to noise and occlusions. The measurement matrix is a 2FxP matrix giving the horizontal feature coordinates in the first F rows and the vertical coordinates in the bottom half. The registered measurement matrix is then calculated by subtracting the mean of each row from its elements. The rank theorem is then presented in detail; it shows that the registered measurement matrix is rank-deficient, having rank at most 3. The registered measurement matrix can also be broken into the product of two matrices, R and S, where R is the camera rotation matrix and S is the shape matrix, and an outline of the algorithm is presented.

Several test cases are presented, some in a controlled environment, others outside in the real world. In the lab, the method performed very well: only minute errors in yaw, roll, and pitch were found. Outside, they used a hand-held camera to show that it would work even with somewhat jerky camera motion. A way of dealing with occlusions is also shown: a hidden feature point can be reconstructed if the point is visible in three other frames and there are three other points that are visible in all four frames. Two examples with occlusion are shown which successfully reconstruct shape. This seems like a very useful shape and motion recovery technique, because solutions are given for dealing with noise.
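
A hypothetical sketch of the occlusion handling mentioned above (the function and argument names are mine, not the paper's): once rotation, shape, and translation have been estimated from the visible entries, a hidden measurement can be predicted by re-projecting the corresponding 3-D point.

    import numpy as np

    def hallucinate(R, S, t, f, p, F):
        """Predict the occluded image coordinates of point p in frame f, given
        estimates of the 2F x 3 rotation matrix R, the 3 x P shape matrix S,
        and the length-2F translation vector t."""
        u_fp = R[f]     @ S[:, p] + t[f]       # horizontal coordinate in frame f
        v_fp = R[F + f] @ S[:, p] + t[F + f]   # vertical coordinate in frame f
        return u_fp, v_fp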

John Isidoro


Tong Jin


Leslie Kuczynski

Authors C. Tomasi and T. Kanade present a method for 'solving' the structure-from-motion problem, that is, recovering scene geometry and camera motion from a sequence of images. Their approach differs from traditional methods in that they deviate from techniques that use perspective projection and a camera-centered representation of shape, and instead reformulate the problem in world-centered coordinates under orthography. This approach is not in itself a novel one, however, and the original proof of the existence of a solution was given by Ullman (1979). What is novel (or was in 1992, as claimed by the authors) is the introduction of a factorization method used in conjunction with world-centered coordinates.

The basic way the authors approach the problem is as follows: two matrices are computed by tracking a number of feature points over a number of frames in an image stream. One matrix is composed of the horizontal feature coordinates and the other of the vertical feature coordinates. By subtracting, from each entry in both matrices, the mean of the entries in the same row, what is called the registered measurement matrix is obtained. This matrix is then factored into two matrices, where one represents camera rotation and the other represents shape (I am not perfectly clear on how this is done). Additionally, the authors present methods for using their algorithm with noisy image streams and with image streams where the features being tracked can appear and disappear from the image because of occlusions. For noise-free image streams where features suffer from occlusion, lost feature points can be recovered, or 'hallucinated', if the feature point is visible in enough of the other frames. For noisy image streams where features suffer from occlusion, lost feature points can be recovered by taking a full sub-block of the measurement matrix and 'growing' it one row or column at a time.

The following is a list of the limitations/questions I found relevant to the authors' proposed techniques:


Hyun Young Lee

A factorization method is developed to infer scene geometry and camera motion by recovering shape and motion under orthographic projection. Starting from the rank theorem and based on the singular value decomposition, the algorithm defines two matrices that are linear transformations of the true rotation and shape matrices, and from them computes the rotation and shape matrices R and S.

Originating in Ullman's solution to the structure-from-motion problem, this method takes advantage of reformulating the problem in world-centered coordinates and under orthography, unlike previous work, which usually employed a camera-centered representation of shape and perspective projection.

The experimental results show that the algorithm works correctly and robustly, not only for a controlled indoor image stream but also for an outdoor image stream with more unstable camera movements. Even for occluded parts of the image, if the stream is noise-free and certain conditions for the extension are met, the full motion and shape can be reconstructed.


Ilya Levin


Yong Liu

Carlo Tomasi and Takeo Kanade's 'Shape and Motion from Image Streams under Orthography: a Factorization Method' (International Journal of Computer Vision, 11:2, 127-145 (1993), Kluwer Academic Publishers) introduces a method by which scene geometry and camera motion can be recovered in a way that is insensitive to noise.

The method introduced here is an extension of a factorization method the authors proposed in 1990. There are several input quantities in this method. The measurement matrix W is obtained by stacking the horizontal feature coordinates (matrix U) and the vertical feature coordinates (matrix V). At the same time, a registered measurement matrix is formed by subtracting the row means.

The rank theorem states that without noise, the registered measurement matrix is at most of rank three. In the presence of noise, the best possible shape and rotation estimate is obtained by considering only the three greatest singular values of the registered matrix, together with the corresponding left and right singular vectors.

The method requires the following steps:

  1. Compute the singular value decomposition of the registered measurement matrix to obtain O1, Sigma, and O2.
  2. From O1, Sigma, and O2, deduce the reduced, rank-three representations of R and S, denoted R' and S'.
  3. Compute the matrix Q by imposing the metric constraints.
  4. Compute the R and S matrices from Q, R', and S'.

Based on their examples, the method they introduce appears to be powerful. However, it takes more explanation to understand the concept in real depth.
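
A rough NumPy sketch of these four steps (the variable names are mine, and this is a simplified reading of the algorithm, not the authors' code): the registered measurement matrix is reduced to rank three with the SVD, the metric constraints are solved as a linear system for the symmetric matrix L = Q Q^T, and Q is then recovered by a Cholesky factorization.

    import numpy as np

    def factorize(W_reg, F):
        """Factor a registered 2F x P measurement matrix into a 2F x 3 rotation
        matrix R and a 3 x P shape matrix S (orthography, modest noise assumed)."""
        # Step 1: singular value decomposition of the registered matrix.
        O1, sigma, O2t = np.linalg.svd(W_reg, full_matrices=False)
        # Step 2: keep the three largest singular values (rank theorem) and
        # split them between the two factors.
        R_hat = O1[:, :3] * np.sqrt(sigma[:3])           # 2F x 3, up to a 3x3 ambiguity
        S_hat = np.sqrt(sigma[:3])[:, None] * O2t[:3]    # 3 x P, same ambiguity
        # Step 3: metric constraints -- the rows i_f and j_f of the true R must
        # be orthonormal.  Solve linearly for the symmetric matrix L = Q Q^T.
        def coeffs(a, b):  # coefficients of [L00, L11, L22, L01, L02, L12] in a^T L b
            return [a[0]*b[0], a[1]*b[1], a[2]*b[2],
                    a[0]*b[1] + a[1]*b[0], a[0]*b[2] + a[2]*b[0], a[1]*b[2] + a[2]*b[1]]
        A, rhs = [], []
        for f in range(F):
            i_f, j_f = R_hat[f], R_hat[F + f]
            A += [coeffs(i_f, i_f), coeffs(j_f, j_f), coeffs(i_f, j_f)]
            rhs += [1.0, 1.0, 0.0]
        l = np.linalg.lstsq(np.asarray(A), np.asarray(rhs), rcond=None)[0]
        L = np.array([[l[0], l[3], l[4]],
                      [l[3], l[1], l[5]],
                      [l[4], l[5], l[2]]])
        Q = np.linalg.cholesky(L)  # assumes L comes out positive definite
        # Step 4: remove the ambiguity to obtain the final rotation and shape.
        return R_hat @ Q, np.linalg.inv(Q) @ S_hat

In the paper the remaining rotational freedom is fixed by aligning the first frame's camera axes with the world axes; that alignment is omitted from this sketch.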


Nagendra Mishr


Romer Rosales

Recovering scene geometry and camera motion from a set of frames runs into problems when the objects are distant relative to their size, or when the images are not well suited to the task. Some approaches are very sensitive to noise.

This article presents a technique for recovering structure from motion, called the factorization method, which uses a sequence of images under orthographic projection.

This method does not consider camera translation along the optical axis, but according to the authors this restriction improves the quality of the computed shape. In general, the method is based on representing an image sequence as a 2FxP matrix W, which contains the x and y coordinates of P points that are tracked through F frames. Then the rank theorem is used (measuring the image coordinates with respect to their centroid) to show that W can be factored into the product of R and S, where R is a 2Fx3 matrix (camera rotation) and S is a 3xP matrix (shape in a coordinate system attached to the object centroid).

The redundancy in image sequences permits a large number of points and frames to be processed efficiently. Simplifying the technique, the averages of the rows of W are used to compute the camera translation. Some features may disappear at times, so W is only partially filled; this is solved by growing a partial solution obtained from an initial submatrix.

Some experiments were run to test this approach, under controlled conditions in the laboratory and with a hand-held camera, with occlusions and without them. Good results were obtained, especially in the lab, although it was necessary to do some of the steps by hand.

The basic result stated in the article is that three pictures of four points determine structure and motion under orthography. This is implemented here using a factorization method that seems to be very efficient and stable, without strong assumptions about camera motion or object shape.


Natasha Tatarchuk


Leonid Taycher


Alex Vlachos


Stan Sclaroff
Created: Sep 26, 1996
Last Modified: Nov 1, 1996