BU CLA CS 835: Seminar on Image and Video Computing

Class commentary on articles: Combining Similarity Measures



John Petry

SIMILARITY MATCHING, by Santini and Jain

The authors start with the observation that image databases require a searching method which will match human perception of whether a training example is "similar" to a database sample. They make the point that this differs from the traditional machine vision problem of trying to match a perfect model to a degraded instance of the same object. They then review the state of the current psychological models of human similarity judgement, particularly as they pertain to spatial relationships. After pointing out limits in the current state of these models, they show an implementation of a fuzzy logic classification system which emulates human decisions as to the similarity of spatial features. These features include faces (measuring distances between key facial features) and synthetic objects.

Their analysis of the state of the models employed in psychological experiments may be good, but the leap to their approach seems unsupported. Namely, they offer little analysis as to the extent to which their approach actually mimics human judgement; instead, they compare it to other vision techniques. It may be that a fuzzy logic system would provide a good interface for humans attempting to describe objects for which they are searching, but that is not what the authors try to show. They use it more for the specific task of physical similarity measurement, without showing that this would generalize to other types of objects which humans might wish to find when querying a database; nor do they show that it does a good job of emulating human judgement for the specific size-similarity applications they present. I found little to recommend this particular paper in terms of image database retrieval and searching. It may have been a good paper for psychologists looking for a coherent analysis of various models in their field and their validity, but it doesn't say much of direct interest regarding the authors' stated topic.

INTERACTIVE LEARNING USING A "SOCIETY OF MODELS", by Minka and Picard

This paper presents a multilayer method of image database organization and searching. It combines operator interaction with offline image processing, and remembers data from previous interactions. This basic approach seems to me to be a key attribute for any successful image database system. There is so much data stored in image databases that real-time searching is computationally expensive. The more that can be done offline, in advance, the better. Obviously the exact data requested by a user will not be known in advance, but to the extent that patterns of use can be detected, the searching operation can be made much faster by taking advantage of predetermined patterns and image groupings. Even when no operator has used the system, it can at least run its basic classification tools to form hypothetical groupings beforehand.

This is the approach that Minka and Picard take. Images are segmented a priori for one or more model types. These segments are then grouped according to similarity. A user looking for instances of a template will show positive and negative examples of it. The groupings are examined to see which includes the most positive examples without negative cases, with a weighting factor that takes into account previous search groupings. The user can choose more positive or negative examples from this set and iterate again. As valid samples are chosen, weights are assigned to the various tests to show the extent to which they are used to choose the correct samples.
When new runs are performed, these weights provide a starting point and influence the choice of groups which are believed to match the user's request, as mentioned above. In addition, the groups themselves are clustered and averaged for simplicity and ease of computation.

Some drawbacks are: (1) extra layers of code and organization are needed; (2) it works best if user requests and similarity judgements are consistent between interaction sessions, rather than being totally unrelated; (3) adding new images requires recomputing groupings over the entire database; and (4) this method requires a unified database and search procedure, so it cannot be directly applied to a new database, for instance. Finally, I suspect it helps if the database consists of related images, rather than dissimilar sets of images where the system may create groupings across the sets where humans would see no natural connection.

In general, though, I think this is probably a very good way to organize a database and its search tools. It certainly implies that these two (images and search tools) be unified. The benefits are (1) increased speed of match-finding and (2) minimized user interaction (the two are related but not identical). In addition, it provides an indirect test of various vision algorithms and tests, by seeing which end up being chosen by the system in satisfying (real!) user requests rather than contrived experiments.
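
A minimal sketch, in Python, of the selection step described above; the data structures (groupings as sets of image ids, one learned weight per grouping) are hypothetical illustrations of the idea, not the authors' implementation:

    # Sketch only: pick the precomputed grouping that covers the most positive
    # examples, scaled by its learned weight, and contains no negative examples.
    def best_grouping(groupings, positives, negatives, weights):
        best_id, best_score = None, 0.0
        for gid, members in groupings.items():
            if members & negatives:              # rule out groupings with negatives
                continue
            score = weights.get(gid, 1.0) * len(members & positives)
            if score > best_score:
                best_id, best_score = gid, score
        return best_id

    # Example: "g1" contains the negative example 9, so the weighted "g2" wins.
    groupings = {"g1": {1, 2, 9}, "g2": {1, 2, 3}, "g3": {1, 7}}
    print(best_grouping(groupings, {1, 2}, {9}, {"g2": 1.5}))

In each iteration the user's new positive and negative examples shrink or reweight the candidate set, which is what keeps the interaction loop fast.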

William Klippgen

Minka and Picard, "Interactive Learning using a 'society of models'"
---------------------------------------------------------------------

This paper suggests the use of a semi-automated tool for indexing and finding related groups of image regions, both within a single image and across the image database. By learning from interaction with users, various groupings based on various comparison methods are constantly reweighted as the user refines his search and finally accepts the results. By using pre-compiled groupings and a restricted method for choosing from and combining the groupings, the proposed interactive-time learning system, "FourEyes", can give results within an acceptable time span of continuous search refinements. The performance metric suggested is the number of examples required to achieve user satisfaction.

Groupings can be made based on any kind of measure, be it static or dynamic. The measure can be derived from image content as well as from user access to the images or image groupings. The grouping clustering makes use of a single-link method (shared neighbour algorithm) which reduces local error instead of global error. When within-image grouping is performed, a feature image is computed where each pixel is a point in feature space. After computing a coarse image based on the feature image, clustering of the coarse image pixels gives a hierarchy of image elements. This within-image segmentation tries to find areas that should be treated as a whole. Across-image groupings are created based on features found over within-image groupings. This hierarchical clustering also leads to an inter-connected set of groupings. The next step is to create compound groupings based on various features.

By collecting user feedback as a set of negative and positive examples, the next suggested compound grouping is the one which maximizes the number of positive examples times the grouping weight while containing no negative examples. But the user's time is valuable, so the number of training examples leading to a satisfactory result should be kept to a minimum. The weighting of the groupings is made task-dependent. This means that the user interaction has to be classified before the appropriate weight vector can be applied and updated. Methods exist for dealing with the community of weight vectors (stored in so-called self-organizing map (SOM) units), so that SOM units never get too close without merging and never get too unpopular before they are removed and die.

A test was carried out to evaluate FourEyes' labeling performance on natural landscape scenes. I will try to clarify in class what really took place in this test, as it is pretty unclear from the paper how learning takes place for within-image segmentation and how the a:b annotation measure is applied. I think the method of using user interaction is very promising. The proposed method, which proceeds in stages, can actually be applied to information retrieval in general. The detection of the current user task and its mapping to the SOM is a good general approach for working on pre-selected sets of data. An extension to user identification and membership in user communities would probably make it easier to distinguish various user tasks when a database is subject to a large number of users with totally different views of the data.

Santini and Jain, "Similarity Matching"
---------------------------------------

This paper examines the concept of similarity as interpreted by humans.
They present the Fuzzy Feature Contrast (FFC) model. A popular and much-proposed similarity measure is distance in some perceptual metric space; stimuli are thus represented as points in a high-dimensional space. Axioms are presented for the distance function d, which takes as its two parameters two points in the space. But because there is a major difference between the true distance function and the one available for computation, none of the four presented axioms can be easily accepted.

In Thurstone-Shepard similarity models, stimuli are modeled as points in a high-dimensional space. The momentary distance is a Minkowski metric: the absolute differences of the vector components are each raised to a power r, summed, and the sum is raised to 1/r, i.e. d(a,b) = (sum_i |a_i - b_i|^r)^(1/r). The similarity is derived from the function g(d) = exp(-d^alpha). A further refinement by Krumhansl also takes into account the spatial density around the stimuli's representations. Using set theory, the feature contrast model views stimuli as sets of features that can be compared using an asymmetric measure S(a,b). In stochastic models, the focus is on the probability that a certain stimulus gives a certain response. Obviously, there are many models for similarity that have been derived from psychological observations. There are also various combinations of similarity measures for different features that in turn make up the final similarity judgement.

The Fuzzy Feature Contrast model is based on Tversky's feature contrast model, where the features' relations to each other also come into account when judging similarity. The interesting effect of hysteresis in human perception makes the judgement of a series of features dependent on the direction of change, e.g. between female and male faces. The use of fuzzy logic seems able to account for many of the peculiarities in human perception when the computer tries to simulate it. By implementing learning techniques, the fuzzy membership functions can be continuously updated.
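
A small sketch of the Thurstone-Shepard form just described; the metric exponent r, the exponent alpha, and the numeric values below are illustrative, not taken from the paper:

    # Sketch only: Minkowski distance d(a,b) = (sum_i |a_i - b_i|^r)^(1/r),
    # turned into a similarity via g(d) = exp(-d^alpha).
    import math

    def minkowski(a, b, r=2.0):
        return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1.0 / r)

    def ts_similarity(a, b, r=2.0, alpha=1.0):
        return math.exp(-minkowski(a, b, r) ** alpha)

    # r = 1 gives the city-block metric mentioned elsewhere in these commentaries.
    print(ts_similarity([0.1, 0.4, 0.0], [0.3, 0.2, 0.1], r=1.0, alpha=1.0))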

Lars Liden

"Similarity Matching" Santini & Jain This paper addresses the issues involved in trying to model human similarity assessment. It addresses some of the problems with conventional similarity theories, yet still seems inadequate. Psychologists have shown that humans do not reason using logical operatives (ie modus ponens, modus tollens etc), incorporate contextual knowledge, and show non-linear, discontinuous in reasoning and decision making. These qualities are undesirable from a theoretical point of view, and make creating a model impossible. The authors mention that the fact that similarity must be assessed using using completely different mechanisms for different types of stimuli is unsatisfactory, yet it seems natural in human reasoning. Perhaps what makes such modeling intractable is that humans always make similarity judgements based on the context of an enormous database of their sensory interactions with objects (e.g. visual, tactile) built up over a lifetime, encorporating information about object's use, whether an object is inanimate/animate, man-made/natural. One couldn't possibly hope to model the data provided by such a database. The argument was briefly made that one can instead try to model "perceptual similarity", however, this argument fails. It has also been clearly demonstrated that humans are not consciously aware of the primitive perceptual data in the environment. Top-down information about contextual knowledge is constantly altering perception of sensory data. It is impossible to extract the purely perceptual aspects of human similarity judgements from the contextual knowledge of the human subjects. The "similarity experiments" in the paper were less than satisfying. The propositions used for similarity judgements were completely arbitrary, and require a priori knowledge about the salient/relevant features of an image. Another serious drawback (pointed out by the authors) is that each of the propositions are independent, which had been shown not to be the case of features in human similarity judgements. The authors also were obviously not aware of the large history of literature in human face recognition showing the relative salience of features, (as shown by the facial features chosen for their experiments). Finally from a computational point of view the results reported on face recognition seemed primitive compared the eigenface approach. "Interactive learning using a 'society of models'. Minka & Picard This paper was much better than the previous one, most notably in that it addressed the fact that similarity judgements in humans are highly task dependent and subject to the particular judgement strategies used by an individual. It allows for the weighting of features depending on the task at hand, where such weighting is based on previous user interactions and tailored to specific users during in interactive session. The fact that the system can tune itself to use a specific set of features for a given task makes it much more robust to contextual effects from the human user. The only difficulty with the particular algorithm (which we have also seen in other papers) is that when novel images are added to a database the entire database need to be reclustered for all the features. Although one can try to add a new image, without doing so, the authors suggest that the clustering sub-optimal.

Gregory Ganarz

In "Interactive learning using a society of models" Minka and Picard address the problem of how to combine many different models of similarity. By interacting with the user, the system learns which groupings of the data (which models) to favor. One difficulty with the system is that it precomputes the possible groupings of the data in a manner which requires the entire database to be recomputed when a novel image is added to the data set. An incremental approach is required for a database of significant size. Further, as the number of models and images increases, the memory requirements for storing all the possible groupings may become prohibitive. In "Similarity Matching" Santini and Jain present a fuzzy generalization of Tversky's feature contrast model. One of the difficulties with their approach is choosing what features to use and ranking their importance. It seems likely that the relative ranking of features used by humans is context sensitive. Also, the membership functions are probably task specific, and they must be determined experimentaly. The Santini model currently has no way to learn the membership functions.

Shrenik Daftary

Synopsis for "Similarity Matching" by Santini and Jain

This paper attempts to present a connection between psychological and computer science perceptions of image similarity. The first idea introduced is the metric of dissimilarity between images. The axioms of metrics are presented and attacked in terms of psychological analyses of similarity. Euclidean, city block, and alternate metrics that correspond to the psychological data are presented. The rigidity of the distance axioms is attacked in terms of their conflict with actual human perception. In their place another set of properties for measures is given: consistency, transitivity, and corner inequality "axioms".

The next part of the paper focuses on the use of non-discrete models. Fuzzy models are introduced to answer questions such as whether a person's hair is long. The feature contrast model for the image is based on computing fuzzy set predicates u(o) and u(s), which are compared by first calculating the intersection, the difference, and an important salience function. The validity of using outside measures to compare the perception of an object is presented with an example. Next, a method that incorporates the perceptual effect of outside objects is presented in terms of the Choquet integral. This part was not clearly stated, and the paper indicates that use of the Choquet integral reduces to the usual Tversky similarity in cases where the features are binary. A testing method for humans that is similar to fuzzy logic is presented, where a test taker can assign a relative value to the truth of a statement.

A comparison of different metrics is presented for a silhouette, showing a series of images that are most similar to a given image. Next, the same test was performed for faces, which showed the effects of using different metrics to determine similar images. Hysteresis in human studies was also presented in terms of the different perception of an intermediate point in the morph between a man and a woman, depending on whether the woman or the man was shown first. This paper, while presenting several models for distance, did not provide an idea of which method would work best in a particular situation, although it did present data in terms of the best methods for meeting psychological criteria.

Synopsis for "Interactive learning using a society of models" by Minka and Picard

This paper presents a method to determine the best specialized model for a set of objects in which a search will be performed. The paper mentions the MRF model, Gaussian models, and active contour boundary models. The method presented relies on the slow, offline calculation of a stable set of parameters that can be compared quickly when a similarity match needs to be determined. Examples and counterexamples are presented for the similarity metric, allowing the user to "teach" the computer what is considered similar. In the database, the actual computations for the hierarchical upper level of the system must be recomputed each time an image is added to the database, but since this occurs infrequently compared to lookups, this additional cost does not affect the overall system performance too severely. Groupings are determined (1) within a single image, by feature estimation in terms of mean and standard deviation followed by hierarchical clustering, and (2) across images, by comparing the relative positions of features in a feature space and somehow determining their distance in that space.
The feature space allows the user to choose their own preferences when determining which image parameters need to be closest in order for a match to occur. The number of examples needed for optimum performance was also demonstrated. Problems occur for this system with expanding databases and with adding new sets of examples to the system.
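
A minimal sketch of a Tversky-style feature contrast computed over fuzzy predicates, as summarized above; the min/max operators, the sum as salience function, and the weights are illustrative assumptions rather than the paper's exact FFC definition:

    # Sketch only: S(a,b) = theta*f(A and B) - alpha*f(A - B) - beta*f(B - A),
    # with fuzzy intersection = min, fuzzy difference = truncated subtraction,
    # and salience f = sum. The weights theta/alpha/beta are illustrative.
    def fuzzy_contrast(u_a, u_b, theta=1.0, alpha=0.5, beta=0.5):
        common  = sum(min(x, y) for x, y in zip(u_a, u_b))
        a_not_b = sum(max(x - y, 0.0) for x, y in zip(u_a, u_b))
        b_not_a = sum(max(y - x, 0.0) for x, y in zip(u_a, u_b))
        return theta * common - alpha * a_not_b - beta * b_not_a

    # Two stimuli described by three fuzzy predicates (e.g. "hair is long");
    # choosing alpha != beta makes the measure asymmetric, as Tversky's model allows.
    print(fuzzy_contrast([0.9, 0.2, 0.7], [0.8, 0.1, 0.3]))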

Paul Dell

Simone Santini, Ramesh Jain, "Similarity Matching", ???

It is argued that similarity assessment is an important function needed for multimedia databases. To construct a computational model of similarity, the authors examine articles from the past 70 years in the field of psychology. The paper presents some of the metric-based theories that assume similarity is based on a distance in a high-dimensional perceptual space. It then critiques the axioms of many of the distance models. A fuzzy set-theoretic measure is presented; some definitions and modifications to the traditional fuzzy set approach are given to define the Fuzzy Feature Contrast (FFC) model.

An experimental system and results are then given. An experiment is required to examine the methods since "there is no perceptual theory that allow us to decide a priori the value of the membership functions parameters: they must be set experimentally". In the experimental system it is interesting to note that "absolute judgments are extremely unreliable" but comparative judgment is more reliable. In the system users are asked to give their level of agreement with a given observation (e.g. the mouth is wide). Four different models were tested to rank similarity on two different data sets. First some silhouettes were tested; second, the system was tested on a set of face images. It would have been interesting if the authors had compared the rankings of the various models with rankings given by humans on the same data sets. The similarity functions in human experiments are not symmetric; instead, a hysteresis is observed. The Thurstone-Shepard and FFC models also exhibit hysteresis. One advantage of utilizing fuzzy sets is the opportunity to fine-tune the membership functions in a series of experiments to model the human system.

T. P. Minka, R. W. Picard, "Interactive learning using a 'society of models'", MIT Media Laboratory Perceptual Computing Section Technical Report No. 349

The authors note that "the possibility of an a priori optimal context-dependent selection among similarity measures, either by human or computer, seems unlikely". Therefore the authors argue that a variety of features for similarity measures is needed. To this end, the various groupings (likely calculated by various models) need to be incorporated into one system. The approach taken is an interactive one between the user and the computer. Some data-dependent groupings are calculated, and then the user interacts with the system to positively or negatively reinforce the various groupings. One disadvantage of the system is that when a novel image is added to the data set, a full reclustering over all the features is needed for an optimal grouping. Another drawback is the large number of examples that are required when there are many groupings to choose from. The system presented is most effective at "getting a quick first-cut labeling". Since the system integrates various feature groupings, the user does not have to choose a particular way of organizing the data. This contrasts with other systems (QBIC, SWIM, Photobook, and CORE), which each have their own way of organizing the data but do not give the user any guidance as to which approach to choose. The "FourEyes" system provides various groupings and determines which to use based on user interaction.
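
A small sketch of the kind of tunable membership function being discussed, i.e. the degree to which "the mouth is wide" holds for a measured width; the sigmoid form, the units, and the parameter values are assumptions for illustration only, not the paper's fitted functions:

    # Sketch only: a fuzzy membership function whose parameters (midpoint,
    # steepness) would be the quantities fit to human agreement ratings.
    import math

    def mu_wide(width_cm, midpoint=4.0, steepness=2.0):
        return 1.0 / (1.0 + math.exp(-steepness * (width_cm - midpoint)))

    for w in (3.0, 4.0, 5.0):
        print(w, round(mu_wide(w), 2))   # degree of "wide" rises from ~0.12 to ~0.88

Tuning such parameters from comparative rather than absolute judgments is exactly the experimental opportunity the commentary points out.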


Stan Sclaroff
Created: Oct 1, 1995