BU CLA CS 835: Seminar on Image and Video Computing

Class commentary on articles: People in Video



William Klippgen

Tracking and Recognizing Rigid and Non-Rigid Facial Motions using Local Parametric Models of Image Motion
==========================================================================================================
by Black, M. J. and Yacoob, Y.

Previous methods for determining facial expressions vary in the way they represent knowledge about the human head. While some approaches build on a 3-D model of the human face, others rely simply on the computed flow field. This paper describes an approach that extracts portions of the face and performs calculations on each of them to determine the overall expression. The face is considered flat and its image a projection from 2-D to 2-D. The authors come up with an eight-parameter model (equations 6 and 7, p. 375) to represent rigid face motion, and with a seven-parameter model that uses six of the parameters from the previous model plus an additional parameter to represent curvature in the horizontal direction (equations 8 and 9; a rough sketch of both models follows at the end of this commentary).

The approach can be divided into three stages or levels:

1. Take the face, mouth, eye and eyebrow regions and estimate the rigid and non-rigid motion using the parametric models. Robust regression is used to determine the parameters. (Two subsequent images of a face are aligned; then the specific intra-face regions are compared to determine their motion.)

2. The facial segments are then described in terms of mid-level predicates derived from their motion parameters. Tables 1 and 2 show predicates for mouth and head motion (p. 377).

3. The third stage is the application of high-level rules to map the predicates into one of the six standard expressions listed in Table 3. This high-level representation considers expressions to consist of a beginning, an apex and an ending. The various expression stages might span several subsequent frames. In Figure 3, p. 378, we see a temporal model of the beginning, apex and ending of a smile. Parameter explanation for Figure 3: a3 - mouth vertical motion; div - mouth expansion; c - curving of the mouth.

In the recognition results, most tests showed a recognition rate of over 90 percent. While no details are given on how the test expressions were formed, the robustness seems very promising. It would have increased the value of the paper significantly if they could have waited a little longer in front of the TV screen when collecting real-world expressions. Still, expressions from 36 video clips seem to be quite well captured. There is no indication of the size of the faces in the video clips or of how many faces were present simultaneously. To use this method for general expression detection in a video stream, more work probably has to be done on the initial segmentation. There is also a question about the image resolution required for stable results. On a higher level, psychological models have to be used in combination with a sense of context to interpret a collection of expressions into semantics like "angry", "joking", "nervous", etc.
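To make the two motion models concrete, here is a minimal sketch in Python. The exact parameter ordering and equation numbering are my assumptions from reading the commentary above, not code from the paper:

    def planar_flow(x, y, a):
        """Eight-parameter planar motion model (roughly eqs. 6-7 of the paper).
        a = [a0..a5, p6, p7]; returns the horizontal and vertical flow (u, v)."""
        a0, a1, a2, a3, a4, a5, p6, p7 = a
        u = a0 + a1 * x + a2 * y + p6 * x * x + p7 * x * y
        v = a3 + a4 * x + a5 * y + p6 * x * y + p7 * y * y
        return u, v

    def curvature_flow(x, y, a, c):
        """Seven-parameter model: the six affine terms plus a curvature term c
        (my reading: c is added to the vertical flow as a function of x, used
        for the mouth and eyebrows)."""
        a0, a1, a2, a3, a4, a5 = a
        u = a0 + a1 * x + a2 * y
        v = a3 + a4 * x + a5 * y + c * x * x
        return u, v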
Analyzing and Recognizing Walking Figures in XYT
================================================
by Niyogi, S. A. and Adelson, E. H.

The paper has one great idea, namely to use pattern analysis in a time-space representation of an image sequence. Applied to walking humans, a template model of a zig-zag pattern is fitted to the real pattern of the walking legs, as shown in Figure 4, p. 471. The approach makes it possible to construct a walking stick model of a human. The test results indicate that the following preconditions have to be met:

- a single human
- approximately constant speed
- motion close to 90 degrees to the camera

By transforming the created contours of the gait patterns to a pattern without translation, a basis is formed for recognizing people based on their walking pattern. The test results contain far too few examples to be interesting.

Analysis and Synthesis of Facial Image Sequences Using Physical and Anatomical Models
======================================================================================
by Terzopoulos, D. and Waters, K.

The paper starts with a thorough presentation of the underlying dynamic properties of the human face. The hierarchy in the representational model developed for the face consists of face expression, muscle control, muscle model, tissue model, face geometry and, finally, the visual representation by a series of images. Their model consists of a three-layer representation simulating cutaneous tissue, subcutaneous tissue and the muscle layer. The layers are represented as elements interconnected by springs, where the elasticity of the springs is similar to the properties of the real-world tissues. The elements of the muscle layer are fixed. Each element's displacement in the two upper layers is a weighted sum of the several muscle nodes or elements connected to it. The model uses a subset of the FACS representation. The synthetic tissue includes about 960 elements with approximately 6500 springs.

The paper proposes texture-mapping of real 360-degree scans onto the element model, enabling various expressions to be simulated with a high degree of realism. By using deformable contour models that lock onto ravines (extended local minima), it is possible to track significant features in the face. The mouth and the eyebrows are typical candidates used for this tracking. The deformable contours each give rise to one reference frame and 11 dynamic fiducial points. The model uses such contours for the hairline, the two eyebrows, the two nasal furrows, the tip of the nose, the upper and lower lips and the chin boss.

The steps to detect the facial expression can be written as follows:

1. Paint the subject along all feature contours with a dark color.
2. Manually position the contours in the correct positions in the first image.
3. For each picture, find the potential by using a gradient filter.
4. Compute the muscle contractions corresponding to the contour movements.
5. Find expressions corresponding to the detected muscle movements.

Clearly, steps 1 and 2 are not acceptable for use in unsupervised environments. When it comes to facial expression detection, I cannot see that this paper comes up with promising results, considering the large underlying model and the heavy computation needed. However, if a graphic representation of a human head with high realism is wanted, this approach is very interesting.
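To make the spring-layer description above a bit more concrete, here is a minimal, generic mass-spring sketch. This is my own illustration, not the authors' code; the node layout, constants and the explicit Euler step are assumptions:

    import numpy as np

    def spring_forces(pos, springs, rest_len, k):
        """Hooke's-law forces for springs connecting tissue nodes.
        pos: (N, 3) node positions; springs: (M, 2) index pairs."""
        f = np.zeros_like(pos)
        for (i, j), l0, ki in zip(springs, rest_len, k):
            d = pos[j] - pos[i]
            length = np.linalg.norm(d)
            if length == 0:          # skip degenerate springs
                continue
            force = ki * (length - l0) * d / length
            f[i] += force
            f[j] -= force
        return f

    def step(pos, vel, springs, rest_len, k, mass=1.0, damping=0.1, dt=0.01, ext=0.0):
        """One explicit Euler step of the tissue lattice under spring,
        damping and external (muscle) forces."""
        f = spring_forces(pos, springs, rest_len, k) - damping * vel + ext
        vel = vel + dt * f / mass
        pos = pos + dt * vel
        return pos, vel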

Shrenik Daftary

Synopsis for "Analyzing and Recognizing Walking Figures in XYT" by Niyogi et al This paper presents a technique to recognize gait in cases where a fixed camera is used, the heights of the head and feet are roughly known, and individuals are walking frontoparallel to the camera, ie not at oblique angles relative to the camera. The first step is to take a slice of the xyt volume at the head height. Hough transforms are used to find the parameters of the stripes. Change detection is performed by recovering the background by using filtering techniques. A correlation is measured between movement templates and the actual filtered image. If the correlation is high enough then a potential walk is determined to exist. Next snakes are used in the image sequence to fit the walker's ankles at each time point. The fact that the slice of the xyt pattern will be roughly equivalent for each of the body parts is used to analyze the entire image. The contours that are created by applying snakes through different parts of the body can create a stick model of the walker. All of the information is used to recreate the direction of motion. Some mentioned problems with the technique is the slowness of the correlation process, the use of the entire image sequence, the fit of spatiotemporal snakes to XT slices, and the missing information about arm motion. Some methods to improve this technique is to cut out 3/4 of the time points perhaps, depending on the expected speed of the movement. Synopsis of "Analysis and Synthesis of Facial Image Sequences Using Physical and Anatomical Models" by Terzopoulos and Waters This paper presents a method to reconstruct facial expression. The model for facial expression is based on six levels of abstraction; expression, control, muscles, physics, geometry, and images. The physics model is based on the mechanical properties of the tissue and use of spring models to numerically model the physical motion. Snakes are introduced in this paper as well to model the movement over time. The functions that are used to reconstruct movements are based on contours that have elasticity and rigidity. This technique seemed to be accurate in reconstructing the motion of surprise, but some potential problems include the ability to calculate the reconstructions of expressions. Additionally the some of the images that were shown for the animated face model (Figure 5) were truly horrifying. This technique could be improved by a more complete understanding of the muscular structure underneath the face and creating a model based exclusively on these properties. Synopsis for "Tracking and Recognizing Rigid and Non-Rigid Facial Motions using Local Parametric Models of Image Motion" by Black and Yacoob This paper presents a method to track human facial expressions given a sequence of images. The three salient features that are mentioned are 1) the application to articulated motion of parameterized models, 2) interpretation of image-motion parameters to high-level features, and 3) development of a system to recognize facial expressions. The recognition algorithm that is presented has a low level - segmenting regions of the face, mouth, eyebrows, and eyes; mid level corresponding to image motion, and high level rules describing temporal structure of facial expressions. An affine model is defined for both horizontal, and vertical components of the flow at each point. Definitions for divergence, curl, and deformation were derived based on the affine model parameters. 
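Here is a minimal sketch of the first two steps described above: slicing the XYT volume at head height and finding the slanted head stripe with a Hough transform. The use of scikit-image and the specific preprocessing are my own assumptions, not taken from the paper:

    import numpy as np
    from skimage.feature import canny
    from skimage.transform import hough_line, hough_line_peaks

    def head_stripe_params(video, head_row):
        """video: (T, H, W) grayscale sequence; head_row: y index near head height.
        Returns the (angle, distance) of the dominant line in the XT slice."""
        xt = video[:, head_row, :]                        # XT slice: one row per frame
        edges = canny(xt.astype(float) / (xt.max() + 1e-9))
        h, angles, dists = hough_line(edges)
        _, best_angles, best_dists = hough_line_peaks(h, angles, dists, num_peaks=1)
        return best_angles[0], best_dists[0]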
Synopsis of "Analysis and Synthesis of Facial Image Sequences Using Physical and Anatomical Models" by Terzopoulos and Waters

This paper presents a method to reconstruct facial expression. The model for facial expression is based on six levels of abstraction: expression, control, muscles, physics, geometry, and images. The physics model is based on the mechanical properties of the tissue and the use of spring models to numerically simulate the physical motion. Snakes are introduced in this paper as well, to model the movement over time. The functions used to reconstruct movements are based on contours that have elasticity and rigidity. The technique seemed to be accurate in reconstructing the motion of surprise, but a potential problem remains in the ability to calculate reconstructions of expressions in general. Additionally, some of the images shown for the animated face model (Figure 5) were truly horrifying. This technique could be improved by a more complete understanding of the muscular structure underneath the face and by creating a model based exclusively on those properties.

Synopsis for "Tracking and Recognizing Rigid and Non-Rigid Facial Motions using Local Parametric Models of Image Motion" by Black and Yacoob

This paper presents a method to track human facial expressions given a sequence of images. The three salient features mentioned are (1) the application of parameterized models to articulated motion, (2) the interpretation of image-motion parameters as high-level features, and (3) the development of a system to recognize facial expressions. The recognition algorithm has a low level, segmenting regions of the face, mouth, eyebrows, and eyes; a mid level, corresponding to image motion; and a high level, with rules describing the temporal structure of facial expressions. An affine model is defined for both the horizontal and vertical components of the flow at each point. Definitions for divergence, curl, and deformation are derived from the affine model parameters (see the sketch after this synopsis). A new model that allows the yaw and pitch of the camera to be modeled is introduced. The method to recover the parameters is given, along with the corresponding changes in mid-level predicates; from there, the high-level description of what is going on is obtained. The technique appeared to be quite good in the tests they performed, although their error numbers corresponded only to missed expressions, not to both missed and falsely identified expressions. One potential improvement might be to incorporate (somehow) the feel of a scene to provide more contextual information; although it is not obvious how to accomplish this, a technique working on histograms might provide additional contextual information. An additional drawback of the system is that it is limited to only six types of expression, which may be sufficient but does not cover all classes of human expression.
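A minimal sketch of the divergence, curl, and deformation quantities mentioned above, computed from the affine flow parameters. The parameter ordering follows the convention u = a0 + a1*x + a2*y, v = a3 + a4*x + a5*y, which is my reading of the paper:

    def affine_descriptors(a):
        """a = [a0, a1, a2, a3, a4, a5].
        Returns (divergence, curl, deformation) of the affine flow."""
        _, a1, a2, _, a4, a5 = a
        divergence = a1 + a5      # isotropic expansion (e.g. mouth opening)
        curl = -a2 + a4           # in-plane rotation
        deformation = a1 - a5     # squashing/stretching along the axes
        return divergence, curl, deformation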

John Petry

"ANALYSIS AND SYNTHESIS OF FACIAL IMAGE SEQUENCES USING PHYSICAL AND ANATOMICAL MODELS," by Terzopoulos and Waters

This was a very intriguing paper. The authors have created a six-layer hierarchical model of frontal facial deformations in the expression of emotions. The model runs from high level (which emotion is being expressed) to low level (the intensity of pixels in a grey-level image), with intermediate levels such as which groups of facial muscles contract in which fashion. By relying heavily on an anatomical basis, rather than directly on image features, the authors claim very good results in synthesizing emotions on a face template. They also suggest their approach would work well in automatically determining which emotions were being expressed in a video sequence of a person.

The authors concentrate primarily on six universal human expressions: anger, disgust, fear, happiness, sadness and surprise. They note that faces are highly complex and deformable, and that various portions of the face interact to a high degree during many expressions. For those reasons, they build their model on an anatomical framework. A facial expression is formed by starting a muscle control process. There is such a process for each expression, containing a set of instructions for each group of muscles involved. The muscle instructions are used by their "muscle model." A muscle model contains physical details such as the muscle size, location, type, attachment to the skeleton, etc., as well as information regarding its relationship with the surrounding tissue. Tissue motion is handled in terms of muscle deformations, using a physics model of nodes and springs, and noting the elasticity of the tissue involved. This information is used to manipulate a low-level geometric model of the face, which in turn produces image manipulations.

I think this is a great idea, given a willingness to make the models sufficiently detailed. The underlying structure is quite complex, but the authors seem to have risen to this challenge. For instance, their synthetic tissue model contains 960 discrete elements with 6500 springs. Given this detailed model and an input image, a human matches initial facial features to the corresponding points of the model. The project can then do one of two things: either track the features during a video sequence, then work backwards up the chain to determine the muscle movements that generated them, and then predict the emotion being expressed; or use the image as a starting point to synthesize emotions based on deformations of the image.

The former seems quite useful for automatic database annotation, given a rigid and segmented frontal view of the subject, and automatic location of the starting features used in tracking. If the subject or the camera moves, the current model will fail. The authors do discuss moving to a 3-D model, which would be good, but I anticipate much more difficulty given the degree to which hair, hats, etc. can occlude key body parts, and in that the side and back of a person's head are much less expressive than the front. The second type of approach, synthesizing emotion given a starting image, would be very useful to people doing computer graphics, if it works as well as the authors say it does (and think of the great political commercials that could be created this way -- negative campaigning sinks to new depths).
But their tests were pretty vague; in terms of recognizing emotions, they appear to have judged for themselves which emotions were seen in a video sequence, then determined whether the code produced the same answer. Also, the synthetic images displayed in the figures look more like caricatures than real faces, even the ones that started with a human subject rather than a completely synthetic "face."

"TRACKING AND RECOGNIZING RIGID AND NON-RIGID FACIAL MOTIONS USING LOCAL PARAMETRIC MODELS OF IMAGE MOTION," by Black and Yacoob

These authors set out to achieve a limited subset of the goals of the previous paper. Specifically, they are trying to recognize the six universal expressions from video sequences. They do so by implementing an approach between the previous paper's strongly physical model and a purely image-oriented one that relies on optical flow and other intensity and motion properties of the entire face. These authors locate specific features of interest (e.g. eyes, mouth), then try to map their local motions as affine or related transformations. They extract properties from these measurements (e.g. translation, curl), then use a table to determine the associated expression.

In one sense, this is similar to the Terzopoulos and Waters approach, in that they are converting movement of facial features into subcomponents and indexing into an expression table based on them. They lack the detail of the physical model, however, and the resulting understanding of what facial motion is involved. I'm not sure how serious a drawback that is for expression recognition, though it probably precludes any reverse transform to generate synthetic expressions.

I'm somewhat skeptical of their reported results. At least they had an outside group classify the sample set, which the other paper didn't. But it is not clear whether they created their expression table beforehand and used the experiment to validate its accuracy, or whether there was a feedback loop during development whereby they ran their code, compared results to expected values, adjusted the code appropriately, and then ran again, with the reported results simply being those from the final run. Also, their accuracy rate ignores false positives entirely. Rather than reporting correct / total expressions, where total expressions = correct + false negatives, they should use correct / (total expressions + false positives). I'm also curious because this paper came out two years later than the first, yet is much simpler. Does that imply that the work in the first paper didn't bear out the authors' expectations? Or that the sources for the first work are not publicly available and too hard for others to duplicate?

"ANALYZING AND RECOGNIZING WALKING FIGURES IN XYT," by Niyogi and Adelson

This paper seemed like an interesting observation taken too far. The gist of it is that the authors noticed that in a controlled setting (constant human-camera distance; fixed camera orientation; a priori knowledge of the people in the scene), it is possible to take an XY-Time video volume and cut it to produce an XT slice with useful data on human motion. Specifically, a human head will appear as a wide line in this image, while legs will produce a braid pattern. They use this knowledge to match snakes to potential leg candidates to measure motion parameters, which they claim is accurate enough to identify individuals (not likely!). They also use it to extract profiles of the person walking. The latter is more believable.
Several problems are apparent here. First, the constraints are pretty strict. This wouldn't work as is for arbitrary video, since the height at which the Y slice should be made isn't known (we don't know the subject-to-camera distance or the subject's height), and since it can't be assumed that people walk in a straight line. Also, I suspect two people together would confuse the algorithm pretty badly. As far as recognizing individuals goes, their sample set only contained a few people. While it's impressive that they can say anything about who people are from their gait, I can't imagine this is that extensible. Try running this on a military formation marching past! While that would be a rare case, it demonstrates how easy it is to confuse this algorithm. Also, in the fine print they mention that they arbitrarily compressed or expanded the time axis to compensate for an individual walking at different speeds in different sequences. That seems like a real hack. Finally, and more from a presentation point of view than from an algorithmic one, it is clear that since the background is fixed, it can be removed from the XT slice without difficulty, leaving only an absolute difference image, which would appear to be easier to work with.

I'll certainly concede that it is an interesting observation, and some steps could be taken to make it more reliable. For instance, standard motion detection or optical flow code could extract the top and bottom bounds of a person walking by, permitting automatic determination of the correct Y height for slicing. Also, this could compensate to some extent for people who didn't walk frontoparallel (I assume this is a word and means what they use it to mean!) to the camera.
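Here is a minimal sketch of the kind of automatic Y-height selection suggested above, using simple frame differencing. The threshold and the head-offset heuristic are my own assumptions:

    import numpy as np

    def head_row_from_motion(video, thresh=10.0):
        """video: (T, H, W) grayscale sequence from a fixed camera.
        Uses absolute frame differences to find the vertical extent of the
        moving person, then returns a row index near head height."""
        diffs = np.abs(np.diff(video.astype(float), axis=0))   # (T-1, H, W)
        motion_per_row = diffs.max(axis=0).max(axis=1)          # (H,)
        moving_rows = np.nonzero(motion_per_row > thresh)[0]
        if moving_rows.size == 0:
            return None                                          # no motion found
        top, bottom = moving_rows[0], moving_rows[-1]
        # assume the head occupies roughly the top tenth of the moving region
        return top + (bottom - top) // 10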

Lars Liden

"Analyzing and Recognizing Walking Figures in XYT" Niyogi & Adelson

Two methods for analyzing spatio-temporal gait patterns:
1) Analyze a single frame and track the motion of the body parts in each successive frame
2) Consider the properties of the spatio-temporal pattern as a whole

Spatio-temporal structure has regularities that are simpler than those found in single frames:
1) a translating head generates a slanted stripe in XT
2) walking legs generate braids of the same XT slant

Restrictions:
1) the camera is fixed
2) the heights of the head and feet of the walker are roughly known
3) individuals are walking frontoparallel to the camera at relatively constant speeds
* the first two can be overcome

Method:
1) Slice the volume at the candidate head height
2) Use Hough transforms to find the parameters of the tilted stripes
3) Use three-parameter template matching, done once at a single height, to find the amplitude, period and skew of the moving person
4) The template match is used to initialize two one-dimensional snakes which (if the template is good enough) will be attracted to the center of each ankle
5) Each of the two snakes is split into two; by taking the blurred positive and negative spatial derivatives, they can get one to go to each bounding contour of each ankle
6) The body only needs two snakes (one for each bounding contour)

Results:
1) The periodicity of the image solves the occlusion problem (I would have liked this to be explained in more detail)
2) The contours can be used to construct a stick model of the human walker

Time warping is required to recognize gait at different speeds; this is done by examining the head translation. A Euclidean distance metric and a weighted distance metric are used to recognize gaits, giving a 58-81% recognition rate.

Limitations (listed by the authors):
- They use the entire image for analysis (an incremental approach may be better)
- One could conceivably fit snakes to XYT cubes instead of XT slices, but this is computationally difficult
- Missing information about arms (see comment below)
- Restricted to gait frontoparallel to the camera

Misleading things:
- The Figure 2 sample shows multiple people in an image. There was no indication in the paper that their system could handle multiple people, and it isn't clear how the splines would handle such intersecting paths.
- Figure 3 is a straw-man argument. They use an edge detector which only looks at one XYT frame and compare it with their method, which uses the whole sequence of frames. Obviously something incorporating information over multiple frames will do better! Not a fair comparison; they should have compared against an edge detection method that uses multiple frames.
- All moving figures show no arm motion! Look at the pictures of the people: they all appear to have an unnatural walking movement with hands held tight to their sides. Why did the authors neglect to mention this? Arm motion would add much extra complexity which would have to be dealt with.
- The method also seems to depend on the y-axis being constant, especially as the head is used as a reference. A natural gait involves a bobbing motion (which would show up as blobs on the XT slice). In addition to keeping their arms at their sides, did the subjects restrict their bobbing motion?

"Analysis and Synthesis of Facial Image Sequences Using Physical and Anatomical Models" Terzopoulos & Waters

This was quite an impressive paper in that the authors obviously spent a significant amount of time studying and modeling face musculature and skin properties in detail. I don't have a lot to comment on directly, apart from a few questions about terminology.
First, they mention an image processing technique which converts digitized image frames into "2-D potential functions" with "ravines", and the ravines are used to find the salient facial features. I'm not sure what a 2-D potential function involves or how it is created.

One thing the authors weren't clear about was the amount of automation actually present in the system. They mention that the mesh is conformed "semi-interactively" and that the user may interact with the deformable contours by directly applying forces, but they don't explain with any clarity the amount of user intervention required for the system to perform adequately.

The greatest difficulty with the paper seems to be the video facial analysis, which appears to have some problems. First, in the example they gave, the subject required a "humiliating makeup job" in order for the system to accurately find the important features. This obviously would be infeasible in any real-world setting. Additionally, their system is highly dependent on the hairline. The hairline contour is used to create a head reference frame, and all other computations of feature positions are made from this reference. This works fine when dealing with short-haired individuals, but anyone with long hair is likely to have significant changes in their hair position over time, making such tracking unreliable.

"Tracking and Recognizing Rigid and Non-Rigid Facial Motions Using Local Parametric Models of Image Motion" Black & Yacoob

This paper was rather disappointing. Although it did have a very well written description of affine and planar modeling, the actual methodology seemed oversimplified. First, the system is limited to a very small set of stereotyped facial expressions. Many facial expressions (such as the Billy Idol grimace) would not be properly processed by the system, as it cannot deal with asymmetric curvatures.

Perhaps the most discouraging thing about the paper was the authors' presentation of "results". First, I think it is important to note that the system was tested on a set of expression images collected from subjects ASKED to make certain expressions. This set of data is not representative of real-world images. Studies have clearly shown that a different set of muscles is used when one is asked to smile than when one smiles naturally, and that these two smiles look different. When one is asked to make an expression, one is likely to fall into one of the 6 stereotypical examples, which may not be representative of real expressions. A stereotyped smile or frown is likely to be highly exaggerated, with large changes in mouth position. Real expressions are usually much more subtle, with only minuscule changes in position indicating expression. Notice that when the system was tested on real images (talk shows, etc.) it did significantly worse.

More importantly, it would appear that the authors incorrectly calculated the accuracy rate! If you look at Tables 4 and 5, the authors counted false alarms as correct answers when calculating the accuracy rate. If one recalculates the accuracy rate correctly, one finds the results are significantly worse -- in the mid-80% range.
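A minimal sketch of the recalculation argued for above, following my reading of the critique; the counts are hypothetical and only illustrate how false alarms should enter the denominator rather than the numerator:

    def accuracy_as_reported(correct, false_alarms, missed):
        # counts false alarms as successes -- the calculation being criticized
        return (correct + false_alarms) / (correct + false_alarms + missed)

    def accuracy_recalculated(correct, false_alarms, missed):
        # false alarms are errors, so they belong only in the denominator
        return correct / (correct + false_alarms + missed)

    # hypothetical numbers for illustration only
    print(accuracy_as_reported(105, 10, 10))    # 0.92
    print(accuracy_recalculated(105, 10, 10))   # 0.84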

Gregory Ganarz

In "Analysis and Synthesis of Facial Image Sequences Using Physical and Anatomical Models" D. Terzopoulos and K. Waters present a method for estimating and resynthesizing human facial expressions. This method is computationally expensive, accomplishing the above by maintaining a detailed physical model of a face. For facial analysis, snakes were used to track the position of certain features. Their method needed "help" to find these features initially, and also "it was necessary to enhance lips, eyebrows, and nasolabial furrows by a humiliating makeup job." p. 576. The method presented seems better suited for rendering than for image analysis. In "Analyzing and Recognizing Walking Figures in XYT" S. Niyogi and E. Adelson present a technique for gait analysis using a spatiotemporal (XYT) volume. This memory intensive technique makes a variety of assumptions about the image sequence such as knee height of the walker and a fixed camera. Further, this technique cannot operate on a single frame. At first glance, the method also appears limited to sequences filmed in advance, and not arriving "on-line". However, the method could be generalized to process "on-line" by maintaining a memory trace of frames. In "Tracking and Recognizing Rigid and Non-Rigid Facial Motions using Local Parametric Models of Image Motion" M. Black and Y. Yacoob present a method for recognizing facial expressions using local optical flow techniques. The method determines the motions of features such as mouth and eyebrows, and then matches these motions to those characteristic of the transition to certain facial configurations (expressions). One limitation of the techniques is that it requires a dynamic face to recognize expressions and thus must operate on multiple frames. Still, the technique has been tested on a variety of image sequences and performed quite well.
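A minimal sketch of the "memory trace of frames" generalization suggested above: a rolling buffer of the most recent rows at head height, from which an XT slice can be maintained on-line. The buffer length and update scheme are my assumptions:

    from collections import deque
    import numpy as np

    class OnlineXTSlice:
        """Maintains an XT slice over the last `maxlen` frames so the XYT
        analysis can run on a live stream instead of a pre-recorded volume."""
        def __init__(self, head_row, maxlen=128):
            self.head_row = head_row
            self.rows = deque(maxlen=maxlen)   # memory trace of recent frames

        def add_frame(self, frame):
            # keep only the row at head height from each incoming frame
            self.rows.append(frame[self.head_row].copy())

        def slice(self):
            # (T, W) XT image built from the buffered rows
            return np.stack(self.rows) if self.rows else np.empty((0, 0))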

Paul Dell

M. Black and Y. Yacoob, "Tracking and Recognizing Rigid and Non-rigid Facial Motions using Local Parametric Models of Image Motion," in Proc. International Conf. on Computer Vision, pp. 374-381, 1995.

The approach taken by Black and Yacoob associates parameters with local features and uses the parameter values to detect facial expressions. The system detects happiness, surprise, anger, disgust, fear, and sadness with 80% or better accuracy. The feature parameters are modeled with a hierarchy of representations. The low-level representations are the parameter values; the mid-level representations combine these values with thresholds to determine movements such as "mouth rightward", "mouth curving downward", etc. The high-level representations combine the mid-level ones to encode rules such as "Anger (begin) = inward lowering of brows and mouth contraction" (see the sketch at the end of this commentary). This approach, combined with code that can locate various facial features, could be used for automatic annotation of facial expressions in video. That would be a very useful automatic annotation function to have.

D. Terzopoulos and K. Waters, "Analysis and Synthesis of Facial Image Sequences Using Physical and Anatomical Models," in IEEE Trans. on Pattern Analysis and Machine Intelligence, 15(6):569-579, 1993.

The technique described in this paper uses a camera to detect facial muscle movements and resynthesize the expressions on a physics-based synthetic face model. Overall the technique is very interesting and may have a number of applications, though the reader has doubts as to the acceptability of the system for resynthesizing human facial expressions in video conferencing applications. One shortfall of the system that should not be difficult to resolve is that it does not track eye motion; this would likely create an eerie resynthesis for the user. Another shortfall of the system is that it ignores the z coordinate in facial movements. This creates difficulty when the head turns and would likely be a problem if the face moved toward or away from the camera.

S. Niyogi and E. H. Adelson, "Analyzing and Recognizing Walking Figures in XYT," in Proc. IEEE Conf. on Vision and Pattern Recognition, pp. 469-474, 1994.

The approach taken in this paper was novel and interesting, but the reader has doubts about the robustness and applicability of the system. The problem the authors addressed was identifying walking persons. To that end, the authors took XYT sections of video and identified a common line and braid pattern for a person walking across a camera field. Then a model of the braid pattern was developed, along with an assumption about where these patterns would occur. The system reportedly achieved a recognition rate as high as 81%. The system is limited to shots where the camera is fixed, the heights of the head and feet are approximately known, and the persons are walking frontoparallel. The system also seems to be limited to situations where there is only one walker. Since the reader does not know what the authors meant by "four sequences were of AJA, seven were of LWC, ...", it is not known whether any of the tested sequences contained multiple subjects. The system was somewhat interesting, especially in its use of spatio-temporal data, but the usefulness and robustness of the system may not be acceptable for real-world applications.
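A minimal sketch of the kind of mid-level-to-high-level rule table described above. The predicate names, thresholds and the happiness rule are illustrative assumptions; only the anger rule is paraphrased from the commentary:

    # mid-level predicates derived from motion parameters by thresholding
    def mid_level_predicates(params, eps=0.01):
        return {
            "mouth contraction": params.get("mouth_div", 0.0) < -eps,
            "inward lowering of brows": params.get("brow_vertical", 0.0) < -eps
                                        and params.get("brow_inward", 0.0) > eps,
            "mouth curving upward": params.get("mouth_curvature", 0.0) > eps,
        }

    # high-level rules map predicate combinations to the onset of an expression
    RULES = {
        "anger (begin)": ["inward lowering of brows", "mouth contraction"],
        "happiness (begin)": ["mouth curving upward"],    # illustrative only
    }

    def detect_onsets(params):
        preds = mid_level_predicates(params)
        return [expr for expr, needed in RULES.items() if all(preds[p] for p in needed)]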


Stan Sclaroff
Created: Dec 4, 1995
Last Modified: Dec 6, 1995