Readings:
Existing solutions of the structure-from-motion problem work well for perfect images,but are very sensitive to noise. This paper presents a new method called factorization method which is accurate and robust.
Most previous work done by other researchers use perspective projection and a camera-centered representation of shape. This makes shape estimation difficult, unstable and noise sensitive. The factorization method formulate the problem in world-centered coordinates and under orthography,which leads to simple and well behaved solution. The reason is the mutual independence of shape and motion together with the linearity of orthographic projection makes it possible to cast the structure-from motion problem as a factorization problem. To do this, the image steam is represented by a registered 2FXP measurement matrix, then registered as the input to the factorization method. The rank theorem plays an extremely important role in the paper. It captures precisely the nature of the redundancy that exists in an image sequence. It claims that without noise, the registered measurement matrix is at most of rank three, while with noise, the rank theorem can also be extended in a well-defined manner. The rand theorem makes sure that the measurement matrix can be expressed in a form W=RS where R represents the camera rotation and S represents the shape. Based on that, the factorization algorithm is given in the paper. Further more, the author extend the factorization method for dealing with feature occlusions.
Several experiments including both laboratory experiments and outdoor experiments are done to demonstrate the factorization method. They show that the motion recovery is precise, the factorization method doesn't smooth the results and this method can deal without difficulty with streams that contain degenerate substreams. Also, the factorization method does not require a smooth motion assumption. All of the above show that the high accuracy and robustness of the factoriztion method.
Tomasi and Kanade present a method for recovering shape and motion under orthography without computing depth as an intermediate result, giving better results for objects that are distant with respect to their size. They do this by constucting a matrix from all the data points and factor it to get a shape matrix and a camera rotation matrix. They go through all the math involved but I was unable to follow it. The concept however seems very simple even if the linear algebra is complicated. The examples seem to work well and the technique seems very general. One potential problem is the reliance on trackable points within the image...
The most surprising aspect of this method is the reported robustness of this algorithm. If it can deal with 'degenerate image streams' and sparse data then the algorithm is indeed very good. However, the outdoor experiment shows that the feature tracking algorithm will strongly affect the success of the algorithm. Nevertheless, the results in this case are still significant.
The primary limitation to this method, as reported by the authors, is that the camera motion with respect to the optical axis cannot be recovered and must remain small over the whole of the stream. Their method relies on the ability of the feature tracking algorithm which introduces another important limitation. This is not a lack of the paper, but I would have liked to read the references reported in the paper concerning this algorithm. The authors also mention an algorithm that automatically chooses the number of feature points to track. This type of automation is extremely important to the advancement of machine vision.
I had a few questions that were unanswered in the paper. One, what are the limits of the frame rate as they affect the tracking? Is this method only viable with a frame rate of 30 images/second (the video rate?)? Is there some lower bound to the frame rate vs. feature movement? This is a characteristic more of the tracking algorithm than of the Factorization algorithm.
In summary, the Factorization algorithm appears to produce extremely good results with relatively simple calculations. It appears to be robust in many environments. The use of the redundant information of the image stream is especially appealing. As the authors noted, further experimentation should go on with objects that have more occlusions.
This paper talks about a process of detecting both motion and shape of an object by tracking them through a sequence of images. The Factorization Method, as it is called, can detect both shape and motion under orthography. That is it can recover the shape and motion from a sequence of images where the projecting lines are perpendicular to the plane of projection. Given this constraint one can track the P desired points through the F frames to recover the two R (rotation) and S (shape) matrixes that are desired.
The method uses the differences in the points that are being track through the F frames to obtain trajectories of the points. This allows for the calculation of motion through the frames. To do this one makes a measurement matrix out of the trajectory coordinates. Then by subtracting from each entry the mean of the entries in the same row one gets a registered measurement matrix, call it Wr. This matrix Wr holds all the information about shape and motion that is needed. Wr = RS where R and S are as defined above. The reason that one can estimate the shape and rotation when there are occlusions is that even if we have and approximate registered measurement matrix, call it Ar. This would give us AR and AS corresponding to the approximate R and The approximate S matrices. It is shown in the paper that AR is a linear transformation of the true R and like wise with AS and S. Therefore it is possible to obtain R and S from AR and AS.
This method worked very good in the lab experiments. Even in the experiments where there were occlusions it performed very good. This technique brings up all kinds of possibilities for recognition. Not only would this technique allow one to find objects but to find activities. One could use this technique to find particular objects behaving in particular ways. You could make a recognition system to find actions. For example one could use the trajectories to find out how the object was moving and match it to the desired motion. In the example of the ball I think that it is clearest. You could find balls that were spinning on there axis or maybe ones that were rolled off a desk.
Authors C. Tomasi and T. Kanade present a method to ‘solving’ the structure-from-motion problem – recovering scene geometry and camera motion from a sequence of images. Their approach differs from traditional methods in that they deviate from the technique that uses perspective projection and a camera-centered representation of shape and reformulate the problem in world-centered coordinates under orthography. However, this is approach is not a novel one and the original proof of existence of a solution was done by Ullman (1979). What is novel (or was in 1992 – as claimed by the authors) was the introduction of a factorization method used in conjunction with world-centered coordinates.
The basic way the authors approach the problem is as follows: Two matrices are computed by tracking a number of features points in an image over a number of frames in an image stream. One matrix is composed of the horizontal feature coordinates and the other is composed of the vertical feature coordinates. By subtracting, from each entry in both matrices, the mean of the entries in the same row, what is called a registered measurement matrix, can be obtained. This matrix is then factored into two matrices where one matrix represents camera rotation and the other represents shape (I am not perfectly clear on how this is done). Additionally, the authors present method’s for using their algorithm with noisy image streams and with image streams where the features being tracked can appear and disappear from the image because of occlusions. For noise-free image streams where features suffer from occlusion, lost feature points can be recovered, or ‘hallucinated’ if the feature point is visible in enough of the other frames. For noisy image streams where features suffers from occlusion, lost feature points can be recovered, or ‘hallucinated’ by taking a full sub-block of the measurement matrix and ‘growing’ it one row, or column at a time.
The following is a list of the limitations/questions I found relevant to
the author’s proposed techniques:
A factorization method is developed to infer scene geometry and camera motion by recovering shape and motion under orthographic projection. Starting with the rank theorem and based on the singular-value decomposition, the algorithm defines two matrices which are linear transformations of true rotation and shape matrices and from them computes the rotation and shape matrices R and S.
Originated from Ullman's solution for structure-from-motion problem, this method takes advantage of reformulating the problem in world-centered coordinates and under orthography, different than previous works which usually employed a camera-centered representation of shape and perspective projection.
The experimental results show that not only for a controlled indoor image stream , but also for an outdoor image stream with more unstable movements of camera, the algorithm works correct and robust. Even for an occluded part of image, if they are noise-free, with certain conditions for extension given, the full motion and shape could be reconstructed.
Carlo Tomasi and Takeo Kanade's'Shape and Motion from Image Streams under Orthography:a Factorization Method' (International Journal of Computer Vision, 11:2, 127-145 (1993) 1993 Kluwer Academic Publishers) introduced a method where scene geometry and camera motion can be recovered from in a way that is insensitive to noises.
The method introduced here is an extention of a factorization method the authors proposed in the 1990. There are several input quantities in this method. Measurement matrix W is obtained by combining the horizontal feature coordinates (matrix U) and vertical feature coordinate (matrix V). At the same time, a registered measurement matrux is formed by substracting the means.
The rank theorem states that without noise, the registered measurement matrix is at most of rank three. In the presence of noise, the best possible shape and rotation estimate is obtained by considering only the three greatest singular values of the registered matrix, together with the corresponding left and right eigenvectors.
The method requires following steps:
The method they introduced appears to be powerful based on their examples. However, it takes more explanation to understand their concept in real depth.
This article presents a technique for understanding structure-from-motion tha is called the factorization method which uses a sequence of images under orthographic projection.
Thi method does not consider camera translation along the optical axis but acording to them, it improves the quality of the computed shape. In general this method is based in representing an image sequence as a 2FXP matrix (W). W is the x and y coordinate s of P points that are tracked through F frames. Then the rank thorem is used (measuring the image coordinates with respect to their centroid) to show that W can be factored into the product of R and S, whre R is a 2FX3 matrix (camera rotation) nad S is a 3XP matrix (shape in a coordinate system attached to the object centroid).
The redundance in image sequences permits a large number of points and frames to be processed efficiently. Simplifying the technique, the average of the rows of W are used to compute camera translation. Some features may dissapear sometimes so W is partia lly filled, this is solved by growing a partial solution obtained from the initial submatrix.
Some experimets were used to test this aproach under controlled conditions in the laboratoury and with a hand-held camera, with occlusions and without them. Good results were obtained specially in the lab, although it was necessary to do some of the step s by hand.
The basic approach is stated in the article: three pictures of four points determine the structure and motion under orthograpy. This is implemented here by using a factorization method that seems to be very efficient and stable, whithout worrying about c amera motion, and object shape.
Stan Sclaroff
Created: Sep 26, 1996
Last Modified: Nov 1, 1996