BU CLA CS 835: Seminar in Image and Video Computing

Fall 1995

Proposed Term Projects


Shrenik Daftary: Swallowing Position Identification For Single Frames
Paul Dell: Indexing an Atmospheric Image Database
Gregory Gancarz: An Active Vision Model for Geometric Indexing
John Isidoro: Housecat Type Recognition Using Color and Orientation Histograms
William Klippgen: Simple Real Time Recognition of Facial Expression
Lars Liden: Video Segmentation Using Relative Motion
John Petry: Multi-scale Eigenfaces

Swallowing Position Identification For Single Frames --- Shrenik Daftary

I. Introduction

As high-speed imaging of anatomical functions becomes more prevalent, an automated identifier of a particular part of the physiological cycle may play a significant role in developing a completely automated system that can point to areas of physiological dysfunction and to possible reasons why the dysfunction occurs. Such a system can work in one of several ways. One technique is to compare the whole sequence of images to other sequences of images, placing each sequence into either a normal bin or one of any number of abnormal bins. Another technique is to first identify the key time points in the sequence and then compare those time points to specific prototypes. The latter has the advantage of being more efficient and less complex than a full video comparison. In addition, the latter technique should not reduce the robustness of such a test system.

II. Description of data

The time point comparison will be performed on sequences of 50 images acquired by echo-planar magnetic resonance imaging. Most of the sequences in the current database were imaged at 10 Hz. There are seven normal subjects with approximately 3 sagittal sequences each - one medial sequence, one right lateral sequence, and one left lateral sequence. The plan is to establish a single prototype for the propulsion stage of swallowing - the stage that forces the bolus toward the pharynx - that will work within all three slices. The data compared in this test will be fairly controlled in terms of resolution, but not in terms of gray scale.

III. Implementation

The comparison algorithm will first normalize the gray levels of the tongue sequence to adequately separate the bolus from the tongue. The next step will be to segment the tongue using either a standard edge detection algorithm or a segmentation technique established for MRI [1]. Once the tongue's shape is identified in each image, the actual process of determining the maximum propulsion point will ignore the rest of the image and scale the segmented region minimally to account for different tongue sizes and shapes. Each image in the sequence will then be compared to the prototype, and the one with the highest correlation will be selected as the propulsion stage. Initially the prototype will be an arbitrarily selected image. Eventually the system may progress to averaging in new data to adjust the prototype, or to maintaining different prototypes based on the physical characteristics of the subject. If the prototype does not match any of the 50 images well, then either a diseased prototype or normal extrema can be compared with the sequence.
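A minimal sketch of this frame-selection step appears below, assuming the tongue has already been segmented into a binary mask for each frame; the nearest-neighbor resizing, the correlation measure, and all names here are illustrative choices rather than a committed design.

    import numpy as np

    def resize_nearest(img, shape):
        """Nearest-neighbor resize; a rough stand-in for the minimal scaling step."""
        rows = np.arange(shape[0]) * img.shape[0] // shape[0]
        cols = np.arange(shape[1]) * img.shape[1] // shape[1]
        return img[rows][:, cols]

    def normalized_correlation(a, b):
        """Correlation coefficient between two equal-sized grayscale regions."""
        a = a.astype(float).ravel() - a.mean()
        b = b.astype(float).ravel() - b.mean()
        denom = np.sqrt((a * a).sum() * (b * b).sum())
        return 0.0 if denom == 0 else float((a * b).sum() / denom)

    def find_propulsion_frame(frames, masks, prototype):
        """Pick the frame whose segmented tongue region correlates best with
        the propulsion-stage prototype.
        frames    : list of 2D grayscale arrays (the 50-image sequence)
        masks     : matching list of binary tongue masks
        prototype : 2D grayscale array of the prototype tongue region"""
        best_index, best_score = -1, -np.inf
        for i, (frame, mask) in enumerate(zip(frames, masks)):
            ys, xs = np.nonzero(mask)
            if ys.size == 0:
                continue
            # crop to the tongue's bounding box and ignore the rest of the image
            crop = np.where(mask, frame, 0)[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
            score = normalized_correlation(resize_nearest(crop, prototype.shape), prototype)
            if score > best_score:
                best_index, best_score = i, score
        return best_index, best_score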

After the propulsion-stage image is automatically selected, the next step is to compare it to those of normal and abnormal cases. Since no abnormal cases exist in the database, in this test the selected image will be compared to other normal cases from different subjects, and the subject will be assigned to the bin of the best-matching normal case.

IV. Expectations

The major problem will be to properly segment the tongue from the bolus. Additionally, there may be cases where the propulsion stage is more readily distinguished by other identifiers, and in some cases the propulsion prototype may not fit the given sequence at all. Nevertheless, the system should perform well at its stated goals: first identifying the propulsion stage in the sequence, and then finding the subject with the most similar swallow during the propulsion stage. It may turn out that a full video comparison is necessary for the latter stage in order to reduce the chance of false positives.

[1] E. Ashton, M. Berg, K. Parker, J. Weisberg, C. Chen, L. Ketonen, "Segmentation and Feature Extraction Techniques, with Applications to MRI Studies", Magnetic Resonance in Medicine, 33:670-677 (1995)

[2] M. Kass, A. Witkin, D. Terzopoulos, "Snakes: active contour models", International Journal of Computer Vision 1, 321-331 (1987)

Indexing an Atmospheric Image Database --- Paul Dell

Problem:

The BU TERRIERS project (http://veebs.bu.edu/terriers.html) will launch a satellite in the Spring of '97 to study ions in the atmosphere. One of the data sets produced will be 2D arrays of ion measurements, with latitude on the x axis and altitude on the y axis. Each array element will be a 16-bit value corresponding to an intensity measurement at a certain wavelength. For each pass, up to 10 different frequencies will be measured and a 2D array will be generated for each frequency.

This 2D array can be visualized as a 2D color or grayscale image. Assuming this format, the scientist would be interested in answering a number of questions. First, the average intensity value would need to be calculated so the scientist could see the "depleted" and "enhanced" regions; a region is considered depleted or enhanced if its intensity is less than or greater than the average value by some deviation. (A more accurate criterion may be devised later.) Second, the scientist would need to query for images that contain a feature (a depleted or enhanced region) within a certain latitude/altitude bounding box.
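A rough sketch of the first calculation follows, assuming a single pass arrives as a 2D numpy array of 16-bit intensities; the use of k standard deviations as the "deviation from the average" is only a placeholder for whatever criterion is settled on later.

    import numpy as np

    def classify_regions(scan, k=1.0):
        """Label each cell of a latitude-by-altitude intensity array as
        depleted (-1), normal (0), or enhanced (+1), relative to the mean."""
        scan = scan.astype(float)
        mean, std = scan.mean(), scan.std()
        labels = np.zeros(scan.shape, dtype=int)
        labels[scan < mean - k * std] = -1   # depleted
        labels[scan > mean + k * std] = +1   # enhanced
        return mean, labels

    # usage on a fake 16-bit pass at one wavelength
    scan = np.random.randint(0, 2 ** 16, size=(64, 128)).astype(np.uint16)
    mean, labels = classify_regions(scan, k=1.5)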

Open Issues:

Three main things need to be decided: the format of the input (array or image format), the feature detection algorithm, and the indexing scheme. Ideally I would like to find existing public domain feature detection code and transform the input to match the program's requirements. Then I would like to concentrate on indexing schemes, perhaps using an indexing tree like the GiST approach (J. Hellerstein, J. Naughton, and A. Pfeffer, "Generalized Search Trees for Database Systems", Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995) and/or integrating a retrieval method for spatial relationships to extend the system to searches such as "depleted region below enhanced region" (A. Sistla, et al., "Similarity based Retrieval of Pictures Using Indices on Spatial Relationships", VLDB 1995).
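For illustration only, the bounding-box query could be served by a linear scan over feature records extracted offline; a GiST-style tree would replace the scan in the eventual system, and the record layout below is simply an assumption.

    def overlaps(a, b):
        """Axis-aligned overlap test on (lat_min, lat_max, alt_min, alt_max) boxes."""
        return not (a[1] < b[0] or b[1] < a[0] or a[3] < b[2] or b[3] < a[2])

    def query(feature_index, kind, box):
        """Return ids of images containing a feature of the given kind
        ('depleted' or 'enhanced') whose bounding box overlaps the query box.
        feature_index is a list of (image_id, kind, box) records built offline."""
        return [img for img, k, b in feature_index if k == kind and overlaps(box, b)]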

Interesting Extensions:

"features of certain shape"- Instead of a simple bounding box, maybe the scientist could query for other shapes. eg. horizontal ovals, circles, triangles (unlikely in data?), or others

"flat 3D view"- The data could be visualized in 3D as a series of atmosphere cross sections

"extrapolated 3D view"- The atmosphere cross section data possibly could be extrapolated to visualize the larger 3D structures. (would probably be pretty difficult to do.)

"movement of features"- A series of images could be compaired to see similiarity of features and any repeating movement of features. Scientist could query for eg. "depletion zones moving right". (would likely be difficult to get meaningful science. But may be good.)

An Active Vision Model for Geometric Indexing --- Gregory Gancarz

A major difficulty in designing an image database search system is deciding what should be precomputed. If a system must rely solely on precomputed information, it is forced to lose some generality; with no precomputation, the system is prohibitively slow. While some information must be precomputed to speed search, the system should also be able to compute directly from the image in cases where information not captured by the precomputed transform is needed, and this ability must be fast. In human vision, the problem of analyzing a scene has been made tractable by lowering sensor resolution with eccentricity (the resolution of the retina decreases with radial distance from the fovea).

By combining an active vision model (Gancarz and Wolfe, to be presented at ARVO '96) with a structural object representation similar to the geometric histogram method of Evans et al., a computationally fast, invariant object recognition system is proposed. As a demonstration of the model's ability, a small database of segmented objects will be used as learned prototypes (categories). A subset of these will be warped (rotated, scaled, contaminated by noise, stretched, and occluded) and submitted to the model for identification. These images will be obtained from the net (if I can't find suitable images, I could use hand-written letters/digits (zip codes), which I believe are available, or I could have a program construct a small number of simple images).
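As a loose sketch in the spirit of the geometric histogram idea (not a faithful reimplementation of Evans et al.), one could histogram pairwise distances and relative orientations over an object's edge pixels; normalizing distances by the largest pairwise distance makes the signature translation and rotation invariant and roughly scale invariant.

    import numpy as np

    def geometric_histogram(edge_points, orientations, d_bins=8, a_bins=8):
        """Histogram of pairwise (distance, relative orientation) over edge pixels.
        edge_points  : list of (x, y) coordinates of edge pixels
        orientations : matching list of edge orientations in radians"""
        pts = np.asarray(edge_points, dtype=float)
        ori = np.asarray(orientations, dtype=float)
        d = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
        d = d / (d.max() + 1e-9)                       # normalize by object size
        rel = np.abs(ori[:, None] - ori[None, :]) % np.pi
        hist, _, _ = np.histogram2d(d.ravel(), rel.ravel(),
                                    bins=[d_bins, a_bins],
                                    range=[[0, 1], [0, np.pi]])
        return hist / (hist.sum() + 1e-9)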

Housecat Type Recognition Using Color and Orientation Histograms --- John Isidoro

The final programming project that I propose is to write a program that can recognize different types of housecats from pictures of them. Housecat types can generally be distinguished from one another by coloration, markings (stripes, splotches, etc.), and fur length.

My procedure for cat type recognition is as follows. First an image is loaded into an offscreen buffer (images will be in MS-Win BMP format; I also have converters for other formats). Next the cat's outline is traced by the user with the mouse. The tracing should not exclude any parts of the cat (with the exception of whiskers and eyelashes). Once the outline is traced, a point on the interior of the cat is selected to seed a flood fill, which detects which pixels in the image are on the cat.
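A minimal sketch of the flood-fill step, assuming the traced outline is available as a set of (x, y) pixel coordinates and the seed is the user-selected interior point:

    from collections import deque

    def flood_fill_mask(width, height, outline, seed):
        """Return a boolean mask of pixels inside the user-traced outline.
        outline : set of (x, y) pixels on the traced boundary
        seed    : (x, y) point chosen inside the cat"""
        mask = [[False] * width for _ in range(height)]
        queue = deque([seed])
        while queue:
            x, y = queue.popleft()
            if not (0 <= x < width and 0 <= y < height):
                continue
            if mask[y][x] or (x, y) in outline:
                continue
            mask[y][x] = True
            queue.extend([(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)])
        return mask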

Once all the cat pixels are selected, a color histogram is computed. I think I will use the HSV color space since most cats have coloring in the grayscale range (luminosity); the only other cat colors are in the orange range. The color histogram will basically tell what colors are present in the cat, and might also provide some information about whether or not the cat is one solid color.
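One possible form of the histogram step, assuming the image is an RGB array and the mask comes from the flood fill above; the bin counts are arbitrary choices.

    import colorsys
    import numpy as np

    def cat_color_histogram(rgb_image, mask, h_bins=12, s_bins=4, v_bins=4):
        """HSV color histogram over the cat pixels only.
        rgb_image : H x W x 3 array with values in 0..255
        mask      : H x W boolean array of cat pixels (from the flood fill)"""
        hist = np.zeros((h_bins, s_bins, v_bins))
        for r, g, b in rgb_image[mask] / 255.0:
            h, s, v = colorsys.rgb_to_hsv(r, g, b)
            hist[min(int(h * h_bins), h_bins - 1),
                 min(int(s * s_bins), s_bins - 1),
                 min(int(v * v_bins), v_bins - 1)] += 1
        return hist / max(hist.sum(), 1)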

The second property of cats that needs to be analyzed is markings. I plan to use orientation histogramming to do this. If there is not a large amount of directionality in the image, the cat is probably a solid color; if there is, the cat might have some form of markings. I think telling stripes from splotches will be a hard problem. One possible idea is that if there is a dominant direction to the markings, the cat is probably striped; if not, the cat may have splotches.
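One simple way to build the orientation histogram is from local gradient directions over the cat pixels, discarding weak gradients so that flat, solid-colored fur contributes little; the magnitude threshold is an assumed tuning parameter, and a strongly peaked histogram would then hint at stripes with a dominant direction.

    import numpy as np

    def orientation_histogram(gray, mask, bins=16, min_magnitude=10.0):
        """Histogram of local gradient orientations over the cat pixels."""
        gy, gx = np.gradient(gray.astype(float))
        magnitude = np.hypot(gx, gy)
        angle = np.arctan2(gy, gx) % np.pi           # orientation, not direction
        keep = mask & (magnitude > min_magnitude)
        hist, _ = np.histogram(angle[keep], bins=bins, range=(0, np.pi))
        return hist / max(hist.sum(), 1)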

I also have another technique that may be able to distinguish between the different types of markings if needed. Working backward from the color histogram, I think each of the markings can be separated by looking for clusters in the color histogram and then finding the regions on the cat that correspond to those clusters. Once a marking is isolated, I should be able to determine its size and shape.

The last feature of cats that needs to be analyzed is fur length. I believe this to be a very hard problem. One possible approach relies on the following hypothesis: the longer the fur on a cat, the more the outline of the cat will show directionality perpendicular to the outline. The way I can measure this perpendicularity is by creating a specialized angle histogram, whose bins are sorted by the angle made between the dominant orientation of each pixel and the direction to the closest point on the traced outline of the cat. Only pixels within a certain distance of the outline of the cat will be considered in this histogram. If the specialized angle histogram tends to have more energy around the 90 degree range than at the 0 and 180 degree ranges, the cat probably has long hair; otherwise it probably has short hair.
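The specialized angle histogram could be sketched as follows, using the gradient orientation as a stand-in for each pixel's dominant orientation; the distance threshold, bin count, and brute-force nearest-point search are all simplifying assumptions.

    import numpy as np

    def fur_angle_histogram(gray, outline_points, mask, max_dist=6.0, bins=18):
        """For each cat pixel within max_dist of the traced outline, bin the
        angle (0..180 degrees) between its gradient orientation and the
        direction toward the nearest outline point.  Long fur should
        concentrate energy near 90 degrees; short fur near 0 and 180."""
        outline = np.asarray(outline_points, dtype=float)    # (x, y) pairs
        gy, gx = np.gradient(gray.astype(float))
        orient = np.arctan2(gy, gx)
        hist = np.zeros(bins)
        ys, xs = np.nonzero(mask)
        for x, y in zip(xs, ys):
            d = outline - (x, y)
            dists = np.hypot(d[:, 0], d[:, 1])
            nearest = int(dists.argmin())
            if dists[nearest] > max_dist:
                continue
            to_outline = np.arctan2(d[nearest, 1], d[nearest, 0])
            diff = (orient[y, x] - to_outline) % (2 * np.pi)
            angle = min(diff, 2 * np.pi - diff)              # 0 .. pi
            hist[min(int(angle / np.pi * bins), bins - 1)] += 1
        return hist / max(hist.sum(), 1)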

One other consideration I can think of is the location of markings on the cat (white chin, gray paws, etc.); for this I would need to determine the orientation of the cat. Since a cat is a highly deformable object, I really don't think I will have any way to tell one part of a cat from another unless I have the user trace the outlines of certain parts of the cat.

Once the main cat properties are determined, I will find the closest matches against a database of previously analyzed cats. I may have to tweak this database a little to make certain properties of the different types of cats more distinct (e.g., calico cats must have orange and black splotches).

I believe this will be a highly challenging project. I am especially interested in seeing whether or not my fur length hypothesis will work. I like the concept of working on more of an application-based project rather than a theoretical one, because it seems my goals can be more focused.

Simple Real Time Recognition of Facial Expression --- William Klippgen

1. Motivation

As computer software evolves to adapt to a user's needs by implementing agent-like personal assistants [1], timely and accurate user feedback is of utmost importance. Since a lot of information can be derived simply by monitoring conventional user interaction with the software and its environment, it would be useful to be able to monitor the user's reactions in the physical world as well.

At the same time, semantic interpretation of recorded video would also benefit from combining image interpretation of the user's appearance with interpretation of the audio stream (speech and emotional utterances).

2. Project outline

I will investigate techniques to interpret the mood of the user via his or her facial expressions. Since software should be able to adapt quickly to user reactions, and since large amounts of video will have to be parsed, the recognition process should be able to operate in real time.

The investigation should analyze the various mood changes and try to pinpoint what other detectable features, such as head and body movement, would assist in interpreting the facial expression. I will not make use of audio feedback.

3. Suggested approaches

There are two properties of expressions that can be used as a basis for a technique or a combination of techniques. The first property is visual coherence over a series of facial expressions representing the same mood. This static property can be used for similarity matching. The other property is visual coherence in the dynamic change from one mood to another. This property can be used in optical flow analysis to detect mood changes.

4. Image segmentation

4.1 Initial segmentation

I will first try to segment the image, assuming that the face and parts of the shoulder are visible in the video picture. If segmentation using optical flow or pure edge detection is difficult with a varying background, I will change to using a plain white background.

I will try to carry out the segmentation on my processed image by moving three virtual rectangles - one from above moving downward, one from the upper left moving leftward, and one from the upper right moving rightward - until they surround the head and shoulders. Their final positions will tell what part of the picture should be used to represent the lower half of the face. I will have to assume that the user faces the computer. To further improve recognition, I should also come up with a fast method of finding the center of the face and then segment the picture again so that the center of the face appears in the center of the picture.
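One very simple reading of the moving-rectangle idea, sketched here under the assumption that a foreground map (from edge detection or optical flow) is already available, is to slide each boundary inward until it first meets a row or column with enough foreground pixels; the threshold is an assumed parameter.

    import numpy as np

    def bound_head_and_shoulders(foreground, min_fraction=0.02):
        """Slide boundaries inward - down from the top, in from the left, in
        from the right - until each meets a row or column containing enough
        foreground pixels.  foreground is an H x W boolean array."""
        h, w = foreground.shape
        row_frac = foreground.sum(axis=1) / w
        col_frac = foreground.sum(axis=0) / h
        top = next((i for i in range(h) if row_frac[i] > min_fraction), 0)
        left = next((j for j in range(w) if col_frac[j] > min_fraction), 0)
        right = next((j for j in range(w - 1, -1, -1) if col_frac[j] > min_fraction), w - 1)
        return top, left, right    # head/shoulder region lies below top, between left and right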

4.2 Image resizing

I will then resize the segmented picture (from now on called the sample) to 64 by 64 pixels, possibly 128 by 128 pixels. This concludes the segmentation process.

5. Similarity matching

As I understand from conversations with Stan, I can use one of the following approaches:

5.1 Eigenfaces

Eigenface representation and matching is probably a very stable approach, but it requires heavy computation. It can match static images of various moods, but it would seem to require at least some flow processing to extract sample pictures in which the moving parts of the face (the lips) have come to a halt, since my initial guess is that the method will be unable to process more than 1 or 2 pictures a second.

5.2 Orientation histogram

This approach detects edges in the sample and generates an edge orientation histogram. With proper edge detection, it should be possible to get results from this technique, and it is also fairly fast.

5.3 Pure correlation matching

Use of correlation is the crudest approach, but given good segmentation and proper pre-processing, e.g. with a Sobel filter, it could work. I will certainly try this approach in the initial phase of the project.
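A sketch of what this baseline might look like, assuming 64 by 64 samples and a small dictionary of mood templates that were Sobel-filtered in the same way; the filter code and names are illustrative, not taken from any particular tool.

    import numpy as np

    def sobel_magnitude(gray):
        """Sobel edge magnitude used as a simple pre-processing step."""
        kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
        ky = kx.T
        padded = np.pad(gray.astype(float), 1, mode="edge")
        h, w = gray.shape
        gx, gy = np.zeros((h, w)), np.zeros((h, w))
        for dy in range(3):
            for dx in range(3):
                window = padded[dy:dy + h, dx:dx + w]
                gx += kx[dy, dx] * window
                gy += ky[dy, dx] * window
        return np.hypot(gx, gy)

    def best_match(sample, templates):
        """Correlate a Sobel-filtered sample against stored mood templates
        (a dict mapping mood name to an equally filtered array of the same
        size) and return the best-scoring mood."""
        s = sobel_magnitude(sample).ravel()
        s = (s - s.mean()) / (s.std() + 1e-9)
        scores = {}
        for mood, t in templates.items():
            t = t.ravel()
            t = (t - t.mean()) / (t.std() + 1e-9)
            scores[mood] = float((s * t).mean())
        return max(scores, key=scores.get), scores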

5.4 Optical flow analysis

Using available software tools, optical flow analysis can detect familiar mood changes by mapping the resulting vectors into a flow histogram. The main weakness of this approach may be that users change their facial expressions at varying speeds. Speed normalization is difficult, I would guess, since low speeds could also be the result of noise, e.g. in the form of head movements.
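A flow histogram of the kind mentioned here could, for example, bin a dense flow field (such as the output of "dynamo") by direction and speed, so that familiar mood changes are matched by comparing histograms rather than raw flow fields; the bin counts and speed cap are assumptions.

    import numpy as np

    def flow_histogram(flow_u, flow_v, angle_bins=8, speed_bins=4, max_speed=5.0):
        """2D histogram of flow direction versus speed over a dense flow field."""
        angle = np.arctan2(flow_v, flow_u) % (2 * np.pi)
        speed = np.clip(np.hypot(flow_u, flow_v), 0, max_speed)
        hist, _, _ = np.histogram2d(angle.ravel(), speed.ravel(),
                                    bins=[angle_bins, speed_bins],
                                    range=[[0, 2 * np.pi], [0, max_speed]])
        return hist / max(hist.sum(), 1)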

6. Available hardware and software tools

I have an Indy workstation with a video camera (IndyCam) available. A software tool called "movieConvert" converts frames grabbed from the IndyCam to ppm files; "nv2frame" is also available. The tool "dynamo" does optical flow analysis. A general set of image processing techniques is also available as a set of TCL scripts with attached C code.

References:

[1] Maes, Pattie, "Reducing work overload...", IEEE Multimedia

[2] Santini, Simone and Jain, Ramesh, "Similarity Matching"

[3] Freeman, William T. and Roth, Michael, "Orientation Histograms for Hand Gesture Recognition", Intl. Workshop on Automatic Face and Gesture Recognition, Zurich, Switzerland, June 1995

[4] Darrell, Trevor and Pentland, Alex, "Space-Time Gestures", Perceptual Computing Group, MIT Media Laboratory, 1993

[5] Beymer, D., Shashua, A. and Poggio, T.,"Example Based Image Analysis and Synthesis", A.I. Memo No. 1431, C.B.C.L. Paper No. 80, MIT Artificial Intelligence Laboratory, November 1993

Video Segmentation Using Relative Motion --- Lars Liden

There are many methods for achieving segmentation of objects in an image, one of the most useful being the relative motion of objects in an image sequence. However, when processing visual motion (especially for complex, textured objects) the "aperture problem" becomes apparent (i.e., local information about the direction of movement is ambiguous). The only way to get an accurate estimate of motion would seem to be the use of more global image information, such as the existence of connected objects within an image which share the same type of motion. In other words, once an object is segmented, the aperture problem is no longer a difficulty; but if our goal is segmentation in the first place, this poses a dilemma.

I propose to combine local movement information (in the form of a calculated vector field) with another simple segmentation process (e.g. pixel brightness thresholding) to form a coherent labeling of objects and their motions within several frames of a moving image. The goal is for the two segmentation methods to complement each other, getting both the utility of motion for segmentation and the ability of brightness segmentation to overcome the aperture problem.
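A minimal sketch of the combination, assuming the toy images allow a simple brightness threshold and that a (noisy, aperture-limited) flow field has already been computed: connected components of bright pixels define the objects, and each object's motion is the average of the local flow inside it. All names and thresholds are placeholders.

    from collections import deque
    import numpy as np

    def label_regions_by_brightness(gray, threshold):
        """Connected components of pixels brighter than the threshold."""
        fg = gray > threshold
        labels = np.zeros(gray.shape, dtype=int)
        next_label = 0
        for sy, sx in zip(*np.nonzero(fg)):
            if labels[sy, sx]:
                continue
            next_label += 1
            labels[sy, sx] = next_label
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < gray.shape[0] and 0 <= nx < gray.shape[1]
                            and fg[ny, nx] and not labels[ny, nx]):
                        labels[ny, nx] = next_label
                        queue.append((ny, nx))
        return labels

    def region_motions(labels, flow_u, flow_v):
        """Average the local flow over each brightness region, giving one
        motion estimate per segmented object (sidestepping the aperture
        problem at the region level)."""
        motions = {}
        for lab in range(1, labels.max() + 1):
            m = labels == lab
            motions[lab] = (float(flow_u[m].mean()), float(flow_v[m].mean()))
        return motions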

I will begin with a toy problem and gradually increase the complexity depending on the amount of success achieved at each stage. Images will be generated artificially using a graphics software package. Initially the images will consist of several (4-5) uniform geometric shapes (triangles, squares, trapezoids) moving (translation only) over a uniform background for several frames. Each image sequence will be classified by the number of objects, their directions of motion, and perhaps their general locations in the image. Given a sample image sequence, the program will search for other image sequences in the database which share similar numbers of objects and motions. Once accurate identification of translation has been achieved, other types of motion (rotation, contraction, expansion) will be added.

If success is achieved with uniform geometric figures, complexity will be added by making the brightness segmentation more difficult. Instead of a uniform pattern, each object will be given a textured surface. If the system works with textured objects, the background will also be given texture and a direction of motion.

Multi-scale Eigenfaces --- John Petry

I propose to investigate the use of eigenfaces for low-resolution face detection and recognition. There are two types of applications where this is relevant:

First, when a training set has been created at a particular resolution, but a new scene is presented in which the user wishes to detect or recognize faces at a lower resolution (possibly not even known in the unconstrained case).

Second, as a speed improvement to full-resolution detection and recognition when the location or even presence of a face in a scene is not known, but the scale is identical to the training-time scale. Here, rather than run a full-resolution detection pass over the entire image, it might be faster to subsample or spatially-average the image to a lower resolution and run the detection algorithm over that image. Any candidate locations resulting from this pass could then be inspected at high resolution.

The key question in both of these cases is: what is the best representation for the eigenface training set at lower resolution? The choice is primarily between two alternatives. In the first, we scale down all of the training images, the average face, and the eigenfaces; compute the new face vectors (the difference from the average face in terms of eigenfaces) at this lower resolution; and run them as usual. This implies the same number of eigenfaces will be used as at high resolution.

The second approach is to note that as the image resolution decreases, high frequency information is lost while low frequencies are preserved. If we assume that the lowest-order eigenfaces correspond to low frequencies, and high-order eigenfaces to high frequencies, then at low resolution only the low order eigenfaces contribute meaningfully to the detection and recognition cases. In fact, at lower resolutions the presence of high-order eigenfaces might actually be detrimental.

So for the first and major portion of this project, I will compare the two approaches, namely: scaling down the training images but keeping the same number of eigenfaces as at high resolution (8-15 in the Pentland paper), versus doing the same but using only the lower-order eigenfaces.
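For concreteness, here is a sketch of the standard eigenface computation (not the Pentland ftp code) together with the two alternatives being compared; the downsampling factor and eigenface counts are placeholders.

    import numpy as np

    def train_eigenfaces(images, num_eigenfaces):
        """Average face and top eigenfaces from equal-sized grayscale images."""
        X = np.stack([img.astype(float).ravel() for img in images])   # M x N
        average = X.mean(axis=0)
        # eigenfaces are the leading right singular vectors of the differences
        _, _, vt = np.linalg.svd(X - average, full_matrices=False)
        return average, vt[:num_eigenfaces]                           # k x N

    def face_vector(image, average, eigenfaces):
        """Project an image's difference from the average face onto the
        eigenfaces, giving its low-dimensional face vector."""
        return eigenfaces @ (image.astype(float).ravel() - average)

    def downsample(img, factor):
        """Spatial averaging over factor x factor blocks (crude resolution drop)."""
        h = img.shape[0] // factor * factor
        w = img.shape[1] // factor * factor
        return img[:h, :w].reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

    # Alternative 1: rescale the training images and retrain at low resolution,
    # keeping the same number of eigenfaces as at high resolution.
    #   low_res = [downsample(img, 4) for img in training_images]
    #   avg_lo, eig_lo = train_eigenfaces(low_res, num_eigenfaces=15)
    #
    # Alternative 2: same low-resolution training set, but keep only the
    # lowest-order eigenfaces, assuming they carry the low-frequency content
    # that survives the resolution drop.
    #   avg_lo2, eig_lo2 = train_eigenfaces(low_res, num_eigenfaces=5)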

The aim will be to compare performance as a function of the number of eigenfaces used, both for detection and for recognition. I expect that the results for detection will favor the low-order approach. Recognition may favor the high-order approach, or may not be possible at all as resolution decreases.

There are two ways to implement this test. The first is to use the original high resolution training data and scale the resulting image vectors directly. The other is to scale the training images and create new eigenfaces and difference vectors. Which I use will depend on the feasibility of doing either within the existing code. [I intend to use the ftp version of the code cited in the Pentland et al. paper, so that I can concentrate on these tests rather than rewriting the underlying functionality.] If both approaches to reducing resolution are possible, I'll try both and compare results.

In addition to this, if time allows I'd like to look specifically at the "resolution pyramid for speed" idea. The question there would be to see which works better: reducing the resolution of the image and running a scaled-down version of the entire training set (all M' eigenfaces) as a first pass for face detection, followed by a high resolution inspection at any likely candidate locations; or running a high resolution inspection with a very reduced set of eigenfaces (just the low order ones) for face detection, followed by the high resolution pass at the likely candidates. A combination of the two might even be best, i.e., running a low resolution version of the low-order eigenfaces.

I propose to use the Pentland code as the basis of this, plus the face database that you said you'd try to make available. If somehow I can't use the face database (recreating it would be too much work), I'd use a database similar to your rabbit and fish field guide examples, since the basic properties of those animals are similar to faces (roughly constant size and shape, with minor variations both in outline and interior pixels).



Stan Sclaroff
Created: October 30, 1995