BU CLA CS 835: Seminar on Image and Video Computing

Class commentary on articles: People in Video



William Klippgen

Tracking and Recognizing Rigid and Non-Rigid Facial Motions using Local Parametric Models of Image Motion
==========================================================================================================
by Black, M. J. and Yacoob, Y.

Previous methods for determining facial expressions vary in the way they represent knowledge about the human head. While some approaches build on a 3-D model of the human face, others rely simply on the computed flow field. This paper describes an approach that extracts portions of the face and performs calculations on each of them to determine the overall expression. The face is considered flat and its image a projection from 2-D to 2-D. The authors come up with an eight-parameter model (equations 6 and 7, p. 375) to represent rigid face motion, and with a seven-parameter model that uses six of the parameters from the previous model plus an additional parameter to represent curvature in the horizontal direction (equations 8 and 9; a rough sketch of both models follows at the end of this commentary).

The approach can be divided into three stages or levels:

1. Take the face, mouth, eye and eyebrow regions and estimate the rigid and non-rigid motion using the parametric models. Robust regression is used to determine the parameters. (Two subsequent images of a face are aligned; then the specific intra-face regions are compared to determine their motion.)

2. The facial segments are then described in terms of mid-level predicates derived from their motion parameters. Tables 1 and 2 show predicates for mouth and head motion (p. 377).

3. The third stage is the application of high-level rules to map the predicates into one of the six standard expressions listed in Table 3. This high-level representation considers expressions to consist of a beginning, an apex and an ending. The various expression stages might span several subsequent frames. In Figure 3, p. 378, we see a temporal model of the beginning, apex and ending of a smile. Parameter explanation for Figure 3: a3 - mouth vertical motion; div - mouth expansion; c - curving of the mouth.

In the recognition results, most tests showed a recognition rate of over 90 percent. While no details are given on how the test expressions were formed, the robustness seems very promising. It would have increased the value of the paper significantly if they could have waited a little longer in front of the TV screen when collecting real-world expressions. Still, expressions from 36 video clips seem to be quite well captured. There is no indication of the size of the faces in the video clips or of how many faces were present simultaneously. To use this method for general expression detection in a video stream, more work probably has to be done on the initial segmentation. There is also a question about the image resolution required for stable results. On a higher level, psychological models have to be used in combination with a sense of context to interpret a collection of expressions into semantics like "angry", "joking", "nervous", etc.
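To make the two motion models concrete, here is a minimal sketch in Python. The exact parameter ordering and equation numbering are my assumptions from reading the commentary above, not code from the paper:

    def planar_flow(x, y, a):
        """Eight-parameter planar motion model (roughly eqs. 6-7 of the paper).
        a = [a0..a5, p6, p7]; returns the horizontal and vertical flow (u, v)."""
        a0, a1, a2, a3, a4, a5, p6, p7 = a
        u = a0 + a1 * x + a2 * y + p6 * x * x + p7 * x * y
        v = a3 + a4 * x + a5 * y + p6 * x * y + p7 * y * y
        return u, v

    def curvature_flow(x, y, a, c):
        """Seven-parameter model: the six affine terms plus a curvature term c
        (my reading: c is added to the vertical flow as a function of x, used
        for the mouth and eyebrows)."""
        a0, a1, a2, a3, a4, a5 = a
        u = a0 + a1 * x + a2 * y
        v = a3 + a4 * x + a5 * y + c * x * x
        return u, v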
Analyzing and Recognizing Walking Figures in XYT
================================================
by Niyogi, S. A. and Adelson, E. H.

The paper has one great idea, namely to use pattern analysis in a time-space representation of an image sequence. Applied to walking humans, a template model of a zig-zag pattern is fitted to the real pattern of the walking legs, as shown in Figure 4, p. 471. The approach makes it possible to construct a walking stick model of a human. The test results indicate that the following preconditions have to be met:

- a single human
- approximately constant speed
- motion close to 90 degrees to the camera

By transforming the created contours of the gait patterns to a pattern without translation, a basis is formed for recognizing people based on their walking pattern. The test results contain far too few examples to be interesting.

Analysis and Synthesis of Facial Image Sequences Using Physical and Anatomical Models
======================================================================================
by Terzopoulos, D. and Waters, K.

The paper starts with a thorough presentation of the underlying dynamic properties of the human face. The hierarchy in the representational model developed for the face consists of face expression, muscle control, muscle model, tissue model, face geometry and, finally, the visual representation by a series of images. Their model consists of a three-layer representation simulating cutaneous tissue, subcutaneous tissue and the muscle layer. The layers are represented as elements interconnected by springs, where the elasticity of the springs is similar to the properties of the real-world tissues. The elements of the muscle layer are fixed. Each element's displacement in the two upper layers is a weighted sum of the several muscle nodes or elements connected to it. The model uses a subset of the FACS representation. The synthetic tissue includes about 960 elements with approximately 6500 springs.

The paper proposes texture-mapping of real 360-degree scans onto the element model, enabling various expressions to be simulated with a high degree of realism. By using deformable contour models that lock onto ravines (extended local minima), it is possible to track significant features in the face. The mouth and the eyebrows are typical candidates used for this tracking. The deformable contours each give rise to one reference frame and 11 dynamic fiducial points. The model uses such contours for the hairline, the two eyebrows, the two nasal furrows, the tip of the nose, the upper and lower lips and the chin boss.

The steps to detect the facial expression can be written as follows:

1. Paint the subject along all feature contours with a dark color.
2. Manually position the contours in the correct positions in the first image.
3. For each picture, find the potential by using a gradient filter.
4. Compute the muscle contractions corresponding to the contour movements.
5. Find expressions corresponding to the detected muscle movements.

Clearly, steps 1 and 2 are not acceptable for use in unsupervised environments. When it comes to facial expression detection, I cannot see that this paper comes up with promising results, considering the large underlying model and the heavy computation needed. However, if a graphic representation of a human head with high realism is wanted, this approach is very interesting.
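To make the spring-layer description above a bit more concrete, here is a minimal, generic mass-spring sketch. This is my own illustration, not the authors' code; the node layout, constants and the explicit Euler step are assumptions:

    import numpy as np

    def spring_forces(pos, springs, rest_len, k):
        """Hooke's-law forces for springs connecting tissue nodes.
        pos: (N, 3) node positions; springs: (M, 2) index pairs."""
        f = np.zeros_like(pos)
        for (i, j), l0, ki in zip(springs, rest_len, k):
            d = pos[j] - pos[i]
            length = np.linalg.norm(d)
            if length == 0:          # skip degenerate springs
                continue
            force = ki * (length - l0) * d / length
            f[i] += force
            f[j] -= force
        return f

    def step(pos, vel, springs, rest_len, k, mass=1.0, damping=0.1, dt=0.01, ext=0.0):
        """One explicit Euler step of the tissue lattice under spring,
        damping and external (muscle) forces."""
        f = spring_forces(pos, springs, rest_len, k) - damping * vel + ext
        vel = vel + dt * f / mass
        pos = pos + dt * vel
        return pos, vel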

Shrenik Daftary

Synopsis for "Analyzing and Recognizing Walking Figures in XYT" by Niyogi et al This paper presents a technique to recognize gait in cases where a fixed camera is used, the heights of the head and feet are roughly known, and individuals are walking frontoparallel to the camera, ie not at oblique angles relative to the camera. The first step is to take a slice of the xyt volume at the head height. Hough transforms are used to find the parameters of the stripes. Change detection is performed by recovering the background by using filtering techniques. A correlation is measured between movement templates and the actual filtered image. If the correlation is high enough then a potential walk is determined to exist. Next snakes are used in the image sequence to fit the walker's ankles at each time point. The fact that the slice of the xyt pattern will be roughly equivalent for each of the body parts is used to analyze the entire image. The contours that are created by applying snakes through different parts of the body can create a stick model of the walker. All of the information is used to recreate the direction of motion. Some mentioned problems with the technique is the slowness of the correlation process, the use of the entire image sequence, the fit of spatiotemporal snakes to XT slices, and the missing information about arm motion. Some methods to improve this technique is to cut out 3/4 of the time points perhaps, depending on the expected speed of the movement. Synopsis of "Analysis and Synthesis of Facial Image Sequences Using Physical and Anatomical Models" by Terzopoulos and Waters This paper presents a method to reconstruct facial expression. The model for facial expression is based on six levels of abstraction; expression, control, muscles, physics, geometry, and images. The physics model is based on the mechanical properties of the tissue and use of spring models to numerically model the physical motion. Snakes are introduced in this paper as well to model the movement over time. The functions that are used to reconstruct movements are based on contours that have elasticity and rigidity. This technique seemed to be accurate in reconstructing the motion of surprise, but some potential problems include the ability to calculate the reconstructions of expressions. Additionally the some of the images that were shown for the animated face model (Figure 5) were truly horrifying. This technique could be improved by a more complete understanding of the muscular structure underneath the face and creating a model based exclusively on these properties. Synopsis for "Tracking and Recognizing Rigid and Non-Rigid Facial Motions using Local Parametric Models of Image Motion" by Black and Yacoob This paper presents a method to track human facial expressions given a sequence of images. The three salient features that are mentioned are 1) the application to articulated motion of parameterized models, 2) interpretation of image-motion parameters to high-level features, and 3) development of a system to recognize facial expressions. The recognition algorithm that is presented has a low level - segmenting regions of the face, mouth, eyebrows, and eyes; mid level corresponding to image motion, and high level rules describing temporal structure of facial expressions. An affine model is defined for both horizontal, and vertical components of the flow at each point. Definitions for divergence, curl, and deformation were derived based on the affine model parameters. 
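Here is a minimal sketch of the first two steps described above: slicing the XYT volume at head height and finding the slanted head stripe with a Hough transform. The use of scikit-image and the specific preprocessing are my own assumptions, not taken from the paper:

    import numpy as np
    from skimage.feature import canny
    from skimage.transform import hough_line, hough_line_peaks

    def head_stripe_params(video, head_row):
        """video: (T, H, W) grayscale sequence; head_row: y index near head height.
        Returns the (angle, distance) of the dominant line in the XT slice."""
        xt = video[:, head_row, :]                        # XT slice: one row per frame
        edges = canny(xt.astype(float) / (xt.max() + 1e-9))
        h, angles, dists = hough_line(edges)
        _, best_angles, best_dists = hough_line_peaks(h, angles, dists, num_peaks=1)
        return best_angles[0], best_dists[0]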
Synopsis of "Analysis and Synthesis of Facial Image Sequences Using Physical and Anatomical Models" by Terzopoulos and Waters

This paper presents a method to reconstruct facial expression. The model for facial expression is based on six levels of abstraction: expression, control, muscles, physics, geometry, and images. The physics model is based on the mechanical properties of the tissue and the use of spring models to numerically simulate the physical motion. Snakes are introduced in this paper as well, to model the movement over time. The functions used to reconstruct movements are based on contours that have elasticity and rigidity. The technique seemed to be accurate in reconstructing the motion of surprise, but a potential problem remains in the ability to calculate reconstructions of expressions in general. Additionally, some of the images shown for the animated face model (Figure 5) were truly horrifying. This technique could be improved by a more complete understanding of the muscular structure underneath the face and by creating a model based exclusively on those properties.

Synopsis for "Tracking and Recognizing Rigid and Non-Rigid Facial Motions using Local Parametric Models of Image Motion" by Black and Yacoob

This paper presents a method to track human facial expressions given a sequence of images. The three salient features mentioned are (1) the application of parameterized models to articulated motion, (2) the interpretation of image-motion parameters as high-level features, and (3) the development of a system to recognize facial expressions. The recognition algorithm has a low level, segmenting regions of the face, mouth, eyebrows, and eyes; a mid level, corresponding to image motion; and a high level, with rules describing the temporal structure of facial expressions. An affine model is defined for both the horizontal and vertical components of the flow at each point. Definitions for divergence, curl, and deformation are derived from the affine model parameters (see the sketch after this synopsis). A new model that allows the yaw and pitch of the camera to be modeled is introduced. The method to recover the parameters is given, along with the corresponding changes in mid-level predicates; from there, the high-level description of what is going on is obtained. The technique appeared to be quite good in the tests they performed, although their error numbers corresponded only to missed expressions, not to both missed and falsely identified expressions. One potential improvement might be to incorporate (somehow) the feel of a scene to provide more contextual information; although it is not obvious how to accomplish this, a technique working on histograms might provide additional contextual information. An additional drawback of the system is that it is limited to only six types of expression, which may be sufficient but does not cover all classes of human expression.
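A minimal sketch of the divergence, curl, and deformation quantities mentioned above, computed from the affine flow parameters. The parameter ordering follows the convention u = a0 + a1*x + a2*y, v = a3 + a4*x + a5*y, which is my reading of the paper:

    def affine_descriptors(a):
        """a = [a0, a1, a2, a3, a4, a5].
        Returns (divergence, curl, deformation) of the affine flow."""
        _, a1, a2, _, a4, a5 = a
        divergence = a1 + a5      # isotropic expansion (e.g. mouth opening)
        curl = -a2 + a4           # in-plane rotation
        deformation = a1 - a5     # squashing/stretching along the axes
        return divergence, curl, deformation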

John Petry

"ANALYSIS AND SYNTHESIS OF FACIAL IMAGE SEQUENCES USING PHYSICAL AND ANATOMICAL MODELS," by Terzopoulos and Waters

This was a very intriguing paper. The authors have created a six-layer hierarchical model of frontal facial deformations in the expression of emotions. The model runs from high level (which emotion is being expressed) to low level (the intensity of pixels in a grey-level image), with intermediate levels such as which groups of facial muscles contract in which fashion. By relying heavily on an anatomical basis, rather than directly on image features, the authors claim very good results in synthesizing emotions on a face template. They also suggest their approach would work well in automatically determining which emotions were being expressed in a video sequence of a person.

The authors concentrate primarily on six universal human expressions: anger, disgust, fear, happiness, sadness and surprise. They note that faces are highly complex and deformable, and that various portions of the face interact to a high degree during many expressions. For those reasons, they build their model on an anatomical framework. A facial expression is formed by starting a muscle control process. There is such a process for each expression, containing a set of instructions for each group of muscles involved. The muscle instructions are used by their "muscle model." A muscle model contains physical details such as the muscle size, location, type, attachment to the skeleton, etc., as well as information regarding its relationship with the surrounding tissue. Tissue motion is handled in terms of muscle deformations, using a physics model of nodes and springs, and noting the elasticity of the tissue involved. This information is used to manipulate a low-level geometric model of the face, which in turn produces image manipulations.

I think this is a great idea, given a willingness to make the models sufficiently detailed. The underlying structure is quite complex, but the authors seem to have risen to this challenge. For instance, their synthetic tissue model contains 960 discrete elements with 6500 springs. Given this detailed model and an input image, a human matches initial facial features to the corresponding points of the model. The project can then do one of two things: either track the features during a video sequence, then work backwards up the chain to determine the muscle movements that generated them, and then predict the emotion being expressed; or use the image as a starting point to synthesize emotions based on deformations of the image.

The former seems quite useful for automatic database annotation, given a rigid and segmented frontal view of the subject, and automatic location of the starting features used in tracking. If the subject or the camera moves, the current model will fail. The authors do discuss moving to a 3-D model, which would be good, but I anticipate much more difficulty given the degree to which hair, hats, etc. can occlude key body parts, and in that the side and back of a person's head are much less expressive than the front. The second type of approach, synthesizing emotion given a starting image, would be very useful to people doing computer graphics, if it works as well as the authors say it does (and think of the great political commercials that could be created this way -- negative campaigning sinks to new depths).
But their tests were pretty vague; in terms of recognizing emotions, they appear to have judged for themselves which emotions were seen in a video sequence, then determined whether the code produced the same answer. Also, the synthetic images displayed in the figures look more like caricatures than real faces, even the ones that started with a human subject rather than a completely synthetic "face."

"TRACKING AND RECOGNIZING RIGID AND NON-RIGID FACIAL MOTIONS USING LOCAL PARAMETRIC MODELS OF IMAGE MOTION," by Black and Yacoob

These authors set out to achieve a limited subset of the goals of the previous paper. Specifically, they are trying to recognize the six universal expressions from video sequences. They do so by implementing an approach between the previous paper's strongly physical model and a purely image-oriented one that relies on optical flow and other intensity and motion properties of the entire face. These authors locate specific features of interest (e.g. eyes, mouth), then try to map their local motions as affine or related transformations. They extract properties from these measurements (e.g. translation, curl), then use a table to determine the associated expression.

In one sense, this is similar to the Terzopoulos and Waters approach, in that they are converting movement of facial features into subcomponents and indexing into an expression table based on them. They lack the detail of the physical model, however, and the resulting understanding of what facial motion is involved. I'm not sure how serious a drawback that is for expression recognition, though it probably precludes any reverse transform to generate synthetic expressions.

I'm somewhat skeptical of their reported results. At least they had an outside group classify the sample set, which the other paper didn't. But it is not clear whether they created their expression table beforehand and used the experiment to validate its accuracy, or whether there was a feedback loop during development whereby they ran their code, compared results to expected values, adjusted the code appropriately, and then ran again, with the reported results simply being those from the final run. Also, their accuracy rate ignores false positives entirely. Rather than reporting correct / total expressions, where total expressions = correct + false negatives, they should use correct / (total expressions + false positives). I'm also curious because this paper came out two years later than the first, yet is much simpler. Does that imply that the work in the first paper didn't bear out the authors' expectations? Or that the sources for the first work are not publicly available and too hard for others to duplicate?

"ANALYZING AND RECOGNIZING WALKING FIGURES IN XYT," by Niyogi and Adelson

This paper seemed like an interesting observation taken too far. The gist of it is that the authors noticed that in a controlled setting (constant human-camera distance; fixed camera orientation; a priori knowledge of the people in the scene), it is possible to take an XY-Time video volume and cut it to produce an XT slice with useful data on human motion. Specifically, a human head will appear as a wide line in this image, while legs will produce a braid pattern. They use this knowledge to match snakes to potential leg candidates to measure motion parameters, which they claim is accurate enough to identify individuals (not likely!). They also use it to extract profiles of the person walking. The latter is more believable.
Several problems are apparent here. First, the constraints are pretty strict. This wouldn't work as is for arbitrary video, since the height at which the Y slice should be made isn't known (we don't know the subject-to-camera distance or the subject's height), and since it can't be assumed that people walk in a straight line. Also, I suspect two people together would confuse the algorithm pretty badly. As far as recognizing individuals goes, their sample set only contained a few people. While it's impressive that they can say anything about who people are from their gait, I can't imagine this is that extensible. Try running this on a military formation marching past! While that would be a rare case, it demonstrates how easy it is to confuse this algorithm. Also, in the fine print they mention that they arbitrarily compressed or expanded the time axis to compensate for an individual walking at different speeds in different sequences. That seems like a real hack. Finally, and more from a presentation point of view than from an algorithmic one, it is clear that since the background is fixed, it can be removed from the XT slice without difficulty, leaving only an absolute difference image, which would appear to be easier to work with.

I'll certainly concede that it is an interesting observation, and some steps could be taken to make it more reliable. For instance, standard motion detection or optical flow code could extract the top and bottom bounds of a person walking by, permitting automatic determination of the correct Y height for slicing. Also, this could compensate to some extent for people who didn't walk frontoparallel (I assume this is a word and means what they use it to mean!) to the camera.
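Here is a minimal sketch of the kind of automatic Y-height selection suggested above, using simple frame differencing. The threshold and the head-offset heuristic are my own assumptions:

    import numpy as np

    def head_row_from_motion(video, thresh=10.0):
        """video: (T, H, W) grayscale sequence from a fixed camera.
        Uses absolute frame differences to find the vertical extent of the
        moving person, then returns a row index near head height."""
        diffs = np.abs(np.diff(video.astype(float), axis=0))   # (T-1, H, W)
        motion_per_row = diffs.max(axis=0).max(axis=1)          # (H,)
        moving_rows = np.nonzero(motion_per_row > thresh)[0]
        if moving_rows.size == 0:
            return None                                          # no motion found
        top, bottom = moving_rows[0], moving_rows[-1]
        # assume the head occupies roughly the top tenth of the moving region
        return top + (bottom - top) // 10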

Lars Liden

"Analyzing and Recognizing Walking Figures in XYT" Niyogi & Adelson

Two methods for analyzing spatio-temporal gait patterns:
1) Analyze a single frame and track the motion of the body parts in each successive frame
2) Consider the properties of the spatio-temporal pattern as a whole

Spatio-temporal structure has regularities that are simpler than those found in single frames:
1) a translating head generates a slanted stripe in XT
2) walking legs generate braids of the same XT slant

Restrictions:
1) the camera is fixed
2) the heights of the head and feet of the walker are roughly known
3) individuals are walking frontoparallel to the camera at relatively constant speeds
* the first two can be overcome

Method:
1) Slice the volume at the candidate head height
2) Use Hough transforms to find the parameters of the tilted stripes
3) Use three-parameter template matching, done once at a single height, to find the amplitude, period and skew of the moving person
4) The template match is used to initialize two one-dimensional snakes which (if the template is good enough) will be attracted to the center of each ankle
5) Each of the two snakes is split into two; by taking the blurred positive and negative spatial derivatives, they can get one to go to each bounding contour of each ankle
6) The body only needs two snakes (one for each bounding contour)

Results:
1) The periodicity of the image solves the occlusion problem (I would have liked this to be explained in more detail)
2) The contours can be used to construct a stick model of the human walker

Time warping is required to recognize gait at different speeds; this is done by examining the head translation. A Euclidean distance metric and a weighted distance metric are used to recognize gaits, giving a 58-81% recognition rate.

Limitations (listed by the authors):
- They use the entire image for analysis (an incremental approach may be better)
- One could conceivably fit snakes to XYT cubes instead of XT slices, but this is computationally difficult
- Missing information about arms (see comment below)
- Restricted to gait frontoparallel to the camera

Misleading things:
- The Figure 2 sample shows multiple people in an image. There was no indication in the paper that their system could handle multiple people, and it isn't clear how the splines would handle such intersecting paths.
- Figure 3 is a straw-man argument. They use an edge detector which only looks at one XYT frame and compare it with their method, which uses the whole sequence of frames. Obviously something incorporating information over multiple frames will do better! Not a fair comparison; they should have compared against an edge detection method that uses multiple frames.
- All moving figures show no arm motion! Look at the pictures of the people: they all appear to have an unnatural walking movement with hands held tight to their sides. Why did the authors neglect to mention this? Arm motion would add much extra complexity which would have to be dealt with.
- The method also seems to depend on the y-axis being constant, especially as the head is used as a reference. A natural gait involves a bobbing motion (which would show up as blobs on the XT slice). In addition to keeping their arms at their sides, did the subjects restrict their bobbing motion?

"Analysis and Synthesis of Facial Image Sequences Using Physical and Anatomical Models" Terzopoulos & Waters

This was quite an impressive paper in that the authors obviously spent a significant amount of time studying and modeling face musculature and skin properties in detail. I don't have a lot to comment on directly, apart from a few questions about terminology.
First, they mention an image processing technique which converts digitized image frames into "2-D potential functions" with "ravines", and the ravines are used to find the salient facial features. I'm not sure what a 2-D potential function involves or how it is created.

One thing the authors weren't clear about was the amount of automation actually present in the system. They mention that the mesh is conformed "semi-interactively" and that the user may interact with the deformable contours by directly applying forces, but they don't explain with any clarity the amount of user intervention required for the system to perform adequately.

The greatest difficulty with the paper seems to be the video facial analysis, which appears to have some problems. First, in the example they gave, the subject required a "humiliating makeup job" in order for the system to accurately find the important features. This obviously would be infeasible in any real-world setting. Additionally, their system is highly dependent on the hairline. The hairline contour is used to create a head reference frame, and all other computations of feature positions are made from this reference. This works fine when dealing with short-haired individuals, but anyone with long hair is likely to have significant changes in their hair position over time, making such tracking unreliable.

"Tracking and Recognizing Rigid and Non-Rigid Facial Motions Using Local Parametric Models of Image Motion" Black & Yacoob

This paper was rather disappointing. Although it did have a very well written description of affine and planar modeling, the actual methodology seemed oversimplified. First, the system is limited to a very small set of stereotyped facial expressions. Many facial expressions (such as the Billy Idol grimace) would not be properly processed by the system, as it cannot deal with asymmetric curvatures.

Perhaps the most discouraging thing about the paper was the authors' presentation of "results". First, I think it is important to note that the system was tested on a set of expression images collected from subjects ASKED to make certain expressions. This set of data is not representative of real-world images. Studies have clearly shown that a different set of muscles is used when one is asked to smile than when one smiles naturally, and that these two smiles look different. When one is asked to make an expression, one is likely to fall into one of the 6 stereotypical examples, which may not be representative of real expressions. A stereotyped smile or frown is likely to be highly exaggerated, with large changes in mouth position. Real expressions are usually much more subtle, with only minuscule changes in position indicating expression. Notice that when the system was tested on real images (talk shows, etc.) it did significantly worse.

More importantly, it would appear that the authors incorrectly calculated the accuracy rate! If you look at Tables 4 and 5, the authors counted false alarms as correct answers when calculating the accuracy rate. If one recalculates the accuracy rate correctly, one finds the results are significantly worse -- in the mid-80% range.
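A minimal sketch of the recalculation argued for above, following my reading of the critique; the counts are hypothetical and only illustrate how false alarms should enter the denominator rather than the numerator:

    def accuracy_as_reported(correct, false_alarms, missed):
        # counts false alarms as successes -- the calculation being criticized
        return (correct + false_alarms) / (correct + false_alarms + missed)

    def accuracy_recalculated(correct, false_alarms, missed):
        # false alarms are errors, so they belong only in the denominator
        return correct / (correct + false_alarms + missed)

    # hypothetical numbers for illustration only
    print(accuracy_as_reported(105, 10, 10))    # 0.92
    print(accuracy_recalculated(105, 10, 10))   # 0.84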

Gregory Ganarz

In "Analysis and Synthesis of Facial Image Sequences Using Physical and Anatomical Models" D. Terzopoulos and K. Waters present a method for estimating and resynthesizing human facial expressions. This method is computationally expensive, accomplishing the above by maintaining a detailed physical model of a face. For facial analysis, snakes were used to track the position of certain features. Their method needed "help" to find these features initially, and also "it was necessary to enhance lips, eyebrows, and nasolabial furrows by a humiliating makeup job." p. 576. The method presented seems better suited for rendering than for image analysis. In "Analyzing and Recognizing Walking Figures in XYT" S. Niyogi and E. Adelson present a technique for gait analysis using a spatiotemporal (XYT) volume. This memory intensive technique makes a variety of assumptions about the image sequence such as knee height of the walker and a fixed camera. Further, this technique cannot operate on a single frame. At first glance, the method also appears limited to sequences filmed in advance, and not arriving "on-line". However, the method could be generalized to process "on-line" by maintaining a memory trace of frames. In "Tracking and Recognizing Rigid and Non-Rigid Facial Motions using Local Parametric Models of Image Motion" M. Black and Y. Yacoob present a method for recognizing facial expressions using local optical flow techniques. The method determines the motions of features such as mouth and eyebrows, and then matches these motions to those characteristic of the transition to certain facial configurations (expressions). One limitation of the techniques is that it requires a dynamic face to recognize expressions and thus must operate on multiple frames. Still, the technique has been tested on a variety of image sequences and performed quite well.
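A minimal sketch of the "memory trace of frames" generalization suggested above: a rolling buffer of the most recent rows at head height, from which an XT slice can be maintained on-line. The buffer length and update scheme are my assumptions:

    from collections import deque
    import numpy as np

    class OnlineXTSlice:
        """Maintains an XT slice over the last `maxlen` frames so the XYT
        analysis can run on a live stream instead of a pre-recorded volume."""
        def __init__(self, head_row, maxlen=128):
            self.head_row = head_row
            self.rows = deque(maxlen=maxlen)   # memory trace of recent frames

        def add_frame(self, frame):
            # keep only the row at head height from each incoming frame
            self.rows.append(frame[self.head_row].copy())

        def slice(self):
            # (T, W) XT image built from the buffered rows
            return np.stack(self.rows) if self.rows else np.empty((0, 0))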

Paul Dell

M. Black and Y. Yacoob, "Tracking and Recognizing Rigid and Non-rigid Facial Motions using Local Parametric Models of Image Motion," in Proc. International Conf. on Computer Vision, pp. 374-381, 1995.

The approach taken by Black and Yacoob associates parameters with local features and uses the parameter values to detect facial expressions. The system detects happiness, surprise, anger, disgust, fear, and sadness with 80% or better accuracy. The feature parameters are modeled with a hierarchy of representations. The low-level representations are the parameter values; the mid-level representations combine these values with thresholds to determine movements such as "mouth rightward", "mouth curving downward", etc. The high-level representations combine the mid-level ones to encode rules such as "Anger (begin) = inward lowering of brows and mouth contraction" (see the sketch at the end of this commentary). This approach, combined with code that can locate various facial features, could be used for automatic annotation of facial expressions in video. That would be a very useful automatic annotation function to have.

D. Terzopoulos and K. Waters, "Analysis and Synthesis of Facial Image Sequences Using Physical and Anatomical Models," in IEEE Trans. on Pattern Analysis and Machine Intelligence, 15(6):569-579, 1993.

The technique described in this paper uses a camera to detect facial muscle movements and resynthesize the expressions on a physics-based synthetic face model. Overall the technique is very interesting and may have a number of applications, though the reader has doubts as to the acceptability of the system for resynthesizing human facial expressions in video conferencing applications. One shortfall of the system that should not be difficult to resolve is that it does not track eye motion; this would likely create an eerie resynthesis for the user. Another shortfall of the system is that it ignores the z coordinate in facial movements. This creates difficulty when the head turns and would likely be a problem if the face moved toward or away from the camera.

S. Niyogi and E. H. Adelson, "Analyzing and Recognizing Walking Figures in XYT," in Proc. IEEE Conf. on Vision and Pattern Recognition, pp. 469-474, 1994.

The approach taken in this paper was novel and interesting, but the reader has doubts about the robustness and applicability of the system. The problem the authors addressed was identifying walking persons. To that end, the authors took XYT sections of video and identified a common line and braid pattern for a person walking across a camera field. Then a model of the braid pattern was developed, along with an assumption about where these patterns would occur. The system reportedly achieved a recognition rate as high as 81%. The system is limited to shots where the camera is fixed, the heights of the head and feet are approximately known, and the persons are walking frontoparallel. The system also seems to be limited to situations where there is only one walker. Since the reader does not know what the authors meant by "four sequences were of AJA, seven were of LWC, ...", it is not known whether any of the tested sequences contained multiple subjects. The system was somewhat interesting, especially in its use of spatio-temporal data, but the usefulness and robustness of the system may not be acceptable for real-world applications.
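A minimal sketch of the kind of mid-level-to-high-level rule table described above. The predicate names, thresholds and the happiness rule are illustrative assumptions; only the anger rule is paraphrased from the commentary:

    # mid-level predicates derived from motion parameters by thresholding
    def mid_level_predicates(params, eps=0.01):
        return {
            "mouth contraction": params.get("mouth_div", 0.0) < -eps,
            "inward lowering of brows": params.get("brow_vertical", 0.0) < -eps
                                        and params.get("brow_inward", 0.0) > eps,
            "mouth curving upward": params.get("mouth_curvature", 0.0) > eps,
        }

    # high-level rules map predicate combinations to the onset of an expression
    RULES = {
        "anger (begin)": ["inward lowering of brows", "mouth contraction"],
        "happiness (begin)": ["mouth curving upward"],    # illustrative only
    }

    def detect_onsets(params):
        preds = mid_level_predicates(params)
        return [expr for expr, needed in RULES.items() if all(preds[p] for p in needed)]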


Stan Sclaroff
Created: Dec 4, 1995
Last Modified: Dec 6, 1995