BU CLA CS 835: Seminar on Image and Video Computing

Class commentary on articles: Combining Similarity Measures



John Petry

SIMILARITY MATCHING, by Santini and Jain

The authors start with the observation that image databases require a searching method which will match human perception of whether a training example is "similar" to a database sample. They make the point that this differs from the traditional machine vision problem of trying to match a perfect model to a degraded instance of the same object. They then review the state of the current psychological models of human similarity judgement, particularly as they pertain to spatial relationships. After pointing out limits in the current state of these models, they show an implementation of a fuzzy logic classification system which emulates human decisions as to the similarity of spatial features. These features include faces (measuring distances between key facial features) and synthetic objects.

Their analysis of the state of the models employed in psychological experiments may be good, but the leap to their approach seems unsupported. Namely, they offer little analysis as to the extent to which their approach actually mimics human judgement; instead, they compare it to other vision techniques. It may be that a fuzzy logic system would provide a good interface for humans attempting to describe objects for which they are searching, but that is not what the authors try to show. They use it more for the specific task of physical similarity measurement, without showing that this would generalize to other types of objects which humans might wish to find when querying a database; nor do they show that it does a good job of emulating human judgement for the specific size-similarity applications they present. I found little to recommend this particular paper in terms of image database retrieval and searching. It may have been a good paper for psychologists looking for a coherent analysis of various models in their field and their validity, but it doesn't say much of direct interest regarding the authors' stated topic.

INTERACTIVE LEARNING USING A "SOCIETY OF MODELS", by Minka and Picard

This paper presents a multilayer method of image database organization and searching. It combines operator interaction with offline image processing, and remembers data from previous interactions. This basic approach seems to me to be a key attribute for any successful image database system. There is so much data stored in image databases that real-time searching is computationally expensive. The more that can be done offline, in advance, the better. Obviously the exact data requested by a user will not be known in advance, but to the extent that patterns of use can be detected, the searching operation can be made much faster by taking advantage of predetermined patterns and image groupings. Even when no operator has used the system, it can at least run its basic classification tools to form hypothetical groupings beforehand.

This is the approach that Minka and Picard take. Images are segmented a priori for one or more model types. These segments are then grouped according to similarity. A user looking for instances of a template will show positive and negative examples of it. The groupings are examined to see which includes the most positive examples without negative cases, with a weighting factor that takes into account previous search groupings. The user can choose more positive or negative examples from this set and iterate again. As valid samples are chosen, weights are assigned to the various tests to show the extent to which they are used to choose the correct samples.
When new runs are performed, these weights provide a starting point and influence the choice of groups which are believed to match the user's request, as mentioned above. In addition, the groups themselves are clustered and averaged for simplicity and ease of computation.

Some drawbacks are: (1) extra layers of code and organization are needed; (2) it works best if user requests and similarity judgements are consistent between interaction sessions, rather than being totally unrelated; (3) adding new images requires recomputing groupings over the entire database; and (4) this method requires a unified database and search procedure, so it cannot be directly applied to a new database, for instance. Finally, I suspect it helps if the database consists of related images, rather than dissimilar sets of images where the system may create groupings across the sets where humans would see no natural connection.

In general, though, I think this is probably a very good way to organize a database and its search tools. It certainly implies that these two (images and search tools) be unified. The benefits are (1) increased speed of match-finding and (2) minimized user interaction (the two are related but not identical). In addition, it provides an indirect test of various vision algorithms and tests, by seeing which end up being chosen by the system in satisfying (real!) user requests rather than contrived experiments.
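
A minimal sketch, in Python, of the selection step described above; the data structures (groupings as sets of image ids, one learned weight per grouping) are hypothetical illustrations of the idea, not the authors' implementation:

    # Sketch only: pick the precomputed grouping that covers the most positive
    # examples, scaled by its learned weight, and contains no negative examples.
    def best_grouping(groupings, positives, negatives, weights):
        best_id, best_score = None, 0.0
        for gid, members in groupings.items():
            if members & negatives:              # rule out groupings with negatives
                continue
            score = weights.get(gid, 1.0) * len(members & positives)
            if score > best_score:
                best_id, best_score = gid, score
        return best_id

    # Example: "g1" contains the negative example 9, so the weighted "g2" wins.
    groupings = {"g1": {1, 2, 9}, "g2": {1, 2, 3}, "g3": {1, 7}}
    print(best_grouping(groupings, {1, 2}, {9}, {"g2": 1.5}))

In each iteration the user's new positive and negative examples shrink or reweight the candidate set, which is what keeps the interaction loop fast.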

William Klippgen

Minka and Picard, "Interactive Learning using a 'society of models'"
---------------------------------------------------------------------

This paper suggests the use of a semi-automated tool for indexing and finding related groups of image regions, both within a single image and across the image database. By learning from interaction with users, various groupings based on various comparison methods are constantly reweighted as the user refines his search and finally accepts the results. By using pre-compiled groupings and a restricted method for choosing from and combining the groupings, the proposed interactive-time learning system, "FourEyes", can give results within an acceptable time span of continuous search refinements. The performance metric suggested is the number of examples required to achieve user satisfaction.

Groupings can be made based on any kind of measure, be it static or dynamic. The measure can be derived from image content as well as from user access to the images or image groupings. The grouping clustering makes use of a single-link method (shared neighbour algorithm) which reduces local error instead of global error. When within-image grouping is performed, a feature image is computed where each pixel is a point in feature space. After computing a coarse image based on the feature image, clustering of the coarse image pixels gives a hierarchy of image elements. This within-image segmentation tries to find areas that should be treated as a whole. Across-image groupings are created based on features found over within-image groupings. This hierarchical clustering also leads to an inter-connected set of groupings. The next step is to create compound groupings based on various features.

By collecting user feedback as a set of negative and positive examples, the next suggested compound grouping is the one which maximizes the number of positive examples times the grouping weight while containing no negative examples. But the user's time is valuable, so the number of training examples leading to a satisfactory result should be kept to a minimum. The weighting of the groupings is made task-dependent. This means that the user interaction has to be classified before the appropriate weight vector can be applied and updated. Methods exist for dealing with the community of weight vectors (stored in so-called self-organizing map (SOM) units), so that SOM units never get too close without merging and never get too unpopular before they are removed and die.

A test was carried out to evaluate FourEyes' labeling performance on natural landscape scenes. I will try to clarify in class what really took place in this test, as it is pretty unclear from the paper how learning takes place for within-image segmentation and how the a:b annotation measure is applied. I think the method of using user interaction is very promising. The proposed method, which proceeds in stages, can actually be applied to information retrieval in general. The detection of the current user task and its mapping to the SOM is a good general approach for working on pre-selected sets of data. An extension to user identification and membership in user communities would probably make it easier to distinguish various user tasks when a database is subject to a large number of users with totally different views of the data.

Santini and Jain, "Similarity Matching"
---------------------------------------

This paper examines the concept of similarity as interpreted by humans.
They present the Fuzzy Feature Contrast (FFC) model. A popular and much-proposed similarity measure is distance in some perceptual metric space; stimuli are thus represented as points in a high-dimensional space. Axioms are presented for the distance function d, which takes as its two parameters two points in the space. But because there is a major difference between the true distance function and the one available for computation, none of the four presented axioms can be easily accepted.

In Thurstone-Shepard similarity models, stimuli are modeled as points in a high-dimensional space. The momentary distance is a Minkowski metric: the absolute differences of the vector components are each raised to a power r, summed, and the sum is raised to 1/r, i.e. d(a,b) = (sum_i |a_i - b_i|^r)^(1/r). The similarity is derived from the function g(d) = exp(-d^alpha). A further refinement by Krumhansl also takes into account the spatial density around the stimuli's representations. Using set theory, the feature contrast model views stimuli as sets of features that can be compared using an asymmetric measure S(a,b). In stochastic models, the focus is on the probability that a certain stimulus gives a certain response. Obviously, there are many models for similarity that have been derived from psychological observations. There are also various combinations of similarity measures for different features that in turn make up the final similarity judgement.

The Fuzzy Feature Contrast model is based on Tversky's feature contrast model, where the features' relations to each other also come into account when judging similarity. The interesting effect of hysteresis in human perception makes the judgement of a series of features dependent on the direction of change, e.g. between female and male faces. The use of fuzzy logic seems able to account for many of the peculiarities in human perception when the computer tries to simulate it. By implementing learning techniques, the fuzzy membership functions can be continuously updated.
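
A small sketch of the Thurstone-Shepard form just described; the metric exponent r, the exponent alpha, and the numeric values below are illustrative, not taken from the paper:

    # Sketch only: Minkowski distance d(a,b) = (sum_i |a_i - b_i|^r)^(1/r),
    # turned into a similarity via g(d) = exp(-d^alpha).
    import math

    def minkowski(a, b, r=2.0):
        return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1.0 / r)

    def ts_similarity(a, b, r=2.0, alpha=1.0):
        return math.exp(-minkowski(a, b, r) ** alpha)

    # r = 1 gives the city-block metric mentioned elsewhere in these commentaries.
    print(ts_similarity([0.1, 0.4, 0.0], [0.3, 0.2, 0.1], r=1.0, alpha=1.0))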

Lars Liden

"Similarity Matching" Santini & Jain This paper addresses the issues involved in trying to model human similarity assessment. It addresses some of the problems with conventional similarity theories, yet still seems inadequate. Psychologists have shown that humans do not reason using logical operatives (ie modus ponens, modus tollens etc), incorporate contextual knowledge, and show non-linear, discontinuous in reasoning and decision making. These qualities are undesirable from a theoretical point of view, and make creating a model impossible. The authors mention that the fact that similarity must be assessed using using completely different mechanisms for different types of stimuli is unsatisfactory, yet it seems natural in human reasoning. Perhaps what makes such modeling intractable is that humans always make similarity judgements based on the context of an enormous database of their sensory interactions with objects (e.g. visual, tactile) built up over a lifetime, encorporating information about object's use, whether an object is inanimate/animate, man-made/natural. One couldn't possibly hope to model the data provided by such a database. The argument was briefly made that one can instead try to model "perceptual similarity", however, this argument fails. It has also been clearly demonstrated that humans are not consciously aware of the primitive perceptual data in the environment. Top-down information about contextual knowledge is constantly altering perception of sensory data. It is impossible to extract the purely perceptual aspects of human similarity judgements from the contextual knowledge of the human subjects. The "similarity experiments" in the paper were less than satisfying. The propositions used for similarity judgements were completely arbitrary, and require a priori knowledge about the salient/relevant features of an image. Another serious drawback (pointed out by the authors) is that each of the propositions are independent, which had been shown not to be the case of features in human similarity judgements. The authors also were obviously not aware of the large history of literature in human face recognition showing the relative salience of features, (as shown by the facial features chosen for their experiments). Finally from a computational point of view the results reported on face recognition seemed primitive compared the eigenface approach. "Interactive learning using a 'society of models'. Minka & Picard This paper was much better than the previous one, most notably in that it addressed the fact that similarity judgements in humans are highly task dependent and subject to the particular judgement strategies used by an individual. It allows for the weighting of features depending on the task at hand, where such weighting is based on previous user interactions and tailored to specific users during in interactive session. The fact that the system can tune itself to use a specific set of features for a given task makes it much more robust to contextual effects from the human user. The only difficulty with the particular algorithm (which we have also seen in other papers) is that when novel images are added to a database the entire database need to be reclustered for all the features. Although one can try to add a new image, without doing so, the authors suggest that the clustering sub-optimal.

Gregory Ganarz

In "Interactive learning using a society of models" Minka and Picard address the problem of how to combine many different models of similarity. By interacting with the user, the system learns which groupings of the data (which models) to favor. One difficulty with the system is that it precomputes the possible groupings of the data in a manner which requires the entire database to be recomputed when a novel image is added to the data set. An incremental approach is required for a database of significant size. Further, as the number of models and images increases, the memory requirements for storing all the possible groupings may become prohibitive. In "Similarity Matching" Santini and Jain present a fuzzy generalization of Tversky's feature contrast model. One of the difficulties with their approach is choosing what features to use and ranking their importance. It seems likely that the relative ranking of features used by humans is context sensitive. Also, the membership functions are probably task specific, and they must be determined experimentaly. The Santini model currently has no way to learn the membership functions.

Shrenik Daftary

Synopsis for "Similarity Matching" by Santini and Jain

This paper attempts to present a connection between psychological and computer science perceptions of image similarity. The first idea introduced is the metric of dissimilarity between images. The axioms of metrics are presented and attacked in terms of psychological analyses of similarity. Euclidean, city block, and alternate metrics that correspond to the psychological data are presented. The rigidity of the distance axioms is attacked in terms of their conflict with actual human perception. In their place another set of properties for measures is given: consistency, transitivity, and corner inequality "axioms".

The next part of the paper focuses on the use of non-discrete models. Fuzzy models are introduced to answer questions such as whether a person's hair is long. The feature contrast model for the image is based on computing fuzzy set predicates u(o) and u(s), which are compared by first calculating the intersection, the difference, and an important salience function. The validity of using outside measures to compare the perception of an object is presented with an example. Next, a method that incorporates the perceptual effect of outside objects is presented in terms of the Choquet integral. This part was not clearly stated, and the paper indicates that use of the Choquet integral reduces to the usual Tversky similarity in cases where the features are binary. A testing method for humans that is similar to fuzzy logic is presented, where a test taker can assign a relative value to the truth of a statement.

A comparison of different metrics is presented for a silhouette, showing a series of images that are most similar to a given image. Next, the same test was performed for faces, which showed the effects of using different metrics to determine similar images. Hysteresis in human studies was also presented in terms of the different perception of an intermediate point in the morph between a man and a woman, depending on whether the woman or the man was shown first. This paper, while presenting several models for distance, did not provide an idea of which method would work best in a particular situation, although it did present data in terms of the best methods for meeting psychological criteria.

Synopsis for "Interactive learning using a society of models" by Minka and Picard

This paper presents a method to determine the best specialized model for a set of objects in which a search will be performed. The paper mentions the MRF model, Gaussian models, and active contour boundary models. The method presented relies on the slow, offline calculation of a stable set of parameters that can be compared quickly when a similarity match needs to be determined. Examples and counterexamples are presented for the similarity metric, allowing the user to "teach" the computer what is considered similar. In the database, the actual computations for the hierarchical upper level of the system must be recomputed each time an image is added to the database, but since this occurs infrequently compared to lookups, this additional cost does not affect the overall system performance too severely. Groupings are determined (1) within a single image, by feature estimation in terms of mean and standard deviation followed by hierarchical clustering, and (2) across images, by comparing the relative positions of features in a feature space and somehow determining their distance in that space.
The feature space allows the user to choose their own preferences when determining which image parameters need to be closest in order for a match to occur. The number of examples needed for optimum performance was also demonstrated. Problems occur for this system with expanding databases and with adding new sets of examples to the system.
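
A minimal sketch of a Tversky-style feature contrast computed over fuzzy predicates, as summarized above; the min/max operators, the sum as salience function, and the weights are illustrative assumptions rather than the paper's exact FFC definition:

    # Sketch only: S(a,b) = theta*f(A and B) - alpha*f(A - B) - beta*f(B - A),
    # with fuzzy intersection = min, fuzzy difference = truncated subtraction,
    # and salience f = sum. The weights theta/alpha/beta are illustrative.
    def fuzzy_contrast(u_a, u_b, theta=1.0, alpha=0.5, beta=0.5):
        common  = sum(min(x, y) for x, y in zip(u_a, u_b))
        a_not_b = sum(max(x - y, 0.0) for x, y in zip(u_a, u_b))
        b_not_a = sum(max(y - x, 0.0) for x, y in zip(u_a, u_b))
        return theta * common - alpha * a_not_b - beta * b_not_a

    # Two stimuli described by three fuzzy predicates (e.g. "hair is long");
    # choosing alpha != beta makes the measure asymmetric, as Tversky's model allows.
    print(fuzzy_contrast([0.9, 0.2, 0.7], [0.8, 0.1, 0.3]))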

Paul Dell

Simone Santini, Ramesh Jain, "Similarity Matching", ???

It is argued that similarity assessment is an important function needed for multimedia databases. To construct a computational model of similarity, the authors examine articles from the past 70 years in the field of psychology. The paper presents some of the metric-based theories that assume similarity is based on a distance in a high-dimensional perceptual space. It then critiques the axioms of many of the distance models. A fuzzy set-theoretic measure is presented; some definitions and modifications to the traditional fuzzy set approach are given to define the Fuzzy Feature Contrast (FFC) model.

An experimental system and results are then given. An experiment is required to examine the methods since "there is no perceptual theory that allow us to decide a priori the value of the membership functions parameters: they must be set experimentally". In the experimental system it is interesting to note that "absolute judgments are extremely unreliable" but comparative judgment is more reliable. In the system users are asked to give their level of agreement with a given observation (e.g. the mouth is wide). Four different models were tested to rank similarity on two different data sets. First some silhouettes were tested; second, the system was tested on a set of face images. It would have been interesting if the authors had compared the rankings of the various models with rankings given by humans on the same data sets. The similarity functions in human experiments are not symmetric; instead, a hysteresis is observed. The Thurstone-Shepard and FFC models also exhibit hysteresis. One advantage of utilizing fuzzy sets is the opportunity to fine-tune the membership functions in a series of experiments to model the human system.

T. P. Minka, R. W. Picard, "Interactive learning using a 'society of models'", MIT Media Laboratory Perceptual Computing Section Technical Report No. 349

The authors note that "the possibility of an a priori optimal context-dependent selection among similarity measures, either by human or computer, seems unlikely". Therefore the authors argue that a variety of features for similarity measures is needed. To this end, the various groupings (likely calculated by various models) need to be incorporated into one system. The approach taken is an interactive one between the user and the computer. Some data-dependent groupings are calculated, and then the user interacts with the system to positively or negatively reinforce the various groupings. One disadvantage of the system is that when a novel image is added to the data set, a full reclustering over all the features is needed for an optimal grouping. Another drawback is the large number of examples that are required when there are many groupings to choose from. The system presented is most effective at "getting a quick first-cut labeling". Since the system integrates various feature groupings, the user does not have to choose a particular way of organizing the data. This contrasts with other systems (QBIC, SWIM, Photobook, and CORE), which each have their own way of organizing the data but do not give the user any guidance as to which approach to choose. The "FourEyes" system provides various groupings and determines which to use based on user interaction.
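
A small sketch of the kind of tunable membership function being discussed, i.e. the degree to which "the mouth is wide" holds for a measured width; the sigmoid form, the units, and the parameter values are assumptions for illustration only, not the paper's fitted functions:

    # Sketch only: a fuzzy membership function whose parameters (midpoint,
    # steepness) would be the quantities fit to human agreement ratings.
    import math

    def mu_wide(width_cm, midpoint=4.0, steepness=2.0):
        return 1.0 / (1.0 + math.exp(-steepness * (width_cm - midpoint)))

    for w in (3.0, 4.0, 5.0):
        print(w, round(mu_wide(w), 2))   # degree of "wide" rises from ~0.12 to ~0.88

Tuning such parameters from comparative rather than absolute judgments is exactly the experimental opportunity the commentary points out.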


Stan Sclaroff
Created: Oct 1, 1995