BU CLA CS 835: Seminar on Image and Video Computing

Class commentary on articles: Video Annotation and Retrieval Interfaces



John Isidoro

This week's batch of papers described three different user interfaces for querying a multimedia database of video clips. The first paper, "Indexes for User Access to Large Video Databases," was about doing keyword searches into a relational database. This paper really bored me; there were no cool image/video processing algorithms involved, and it was basically a description of a database. Since the video streams need to be human annotated, this technique also requires a lot of manpower. It's basically nothing more than a computerized card catalog for video clips.

The second paper, "Media Streams: An Iconic Visual Language for Video Annotation," described using an iconic language to query a video database. I thought this was a pretty good idea; when looking for scenes in video it only makes sense to use pictures to describe them. Another added benefit of using pictures to describe scenes is that you don't have to speak a certain language to use the system. All you need to be able to do is see and understand the icons. Personally I think it might take a little time to get used to manipulating the icons to say what you want (because these icons can have multiple meanings), but after a while I think it would become second nature.

The next paper, "Video Query Formulation," takes the next logical step in user input for video querying. The concept of an iconic language is made more robust by specifying motion along with the icons. You make a mini movie out of icons, and the MovEase system finds video clips whose motion is similar. This was definitely the best paper of the three, because the concept is unique. I can see something like this being a feature of future cable networks, where a person at home could choose a movie to watch by manipulating icons on the TV set. I can also see this at future libraries, where historical and documentary video clips could be accessed this way, e.g. looking for video clips showing Clinton shaking hands with other politicians. I think an interesting next step in user interfaces for video querying would be, instead of having the user manipulate icons, to have the user act out the action or scene being looked for. For instance, a user looking for martial arts footage might get into a tae kwon do stance and throw a kick.

Lars Liden

"Media Streams: An Iconic Visual Language for Video Annotation" Mark Davis This paper was based on the assumption that current technologies as well as any technology likely to be developed in the near future for image processing/machine vision will not be capable of creating an automated mechanism for efficiently searching video databases in any useful sense. If one is willing to accept this assumption, the question becomes what are the alternatives. The only obvious one is that of video annotation. Current video annotation predominantly uses the keyword approach which as the author explains has many limitations. As an alternative the author suggests the use of a standardized iconic language which is capable of representing objects, actions, locations, etc. The author mentions several advantages for such a method. The are several disadvantages with an iconic representation which aren't discussed by the author. Primary among them is the sheer number of icons which must be created to fully represent all the objects, actions and relationships between objects. One can imagine that in order to create an adequate set of icons one would have to have literally 1000's, perhaps even 10000's of icons all of which would have to be learned and easily recognized by anyone using such an annotation method. The task of learning such a language is daunting to say the least. The next problem is that of accessing the icons. Unlike a written language whose words can be easily accessed, there is no easy way to access individual icons in an iconic language. The author's suggest the use of a hierarchical, graph-like structure which allows one to traverse a series of iconic paths to reach individual icons. This is still far from adequate. One can imagine that even if one could picture the icon for say "cuisinart" one would still have to search down a tree of icons, clicking at the least 4-5 times to reach the "cuisinart" icon and one would have to do so for every object in a scene if one wanted a complete annotation. Such a method would be far too time consuming. To generate the iconic representation for the author's example "in a bar in the USA" seems excessively protracted compared to the generation of a verbal representation. It seems like a useful alternative would be the use of non-pictorial icons for common objects, actions. The author suggests that natural languages are not good for sequential overlapping actions, however, is one created icons consisting of words, sequential, overlapping actions are possible. Also, such icons are easily recognizable and could be easily accessed. One could imagine having a dual representation, with each icon consisting of a word and associated picture if deemed necessary. There are additional problems with annotation, such as it is highly dependent on what that annotators goal are, but such drawbacks have already been extensively discussed in class. "Video Query Formulation" Ahanger, G., Benson, D., & Little, T. Unfortunately I found this paper to be a bit vague on several of the details. At the beginning it seemed to skim over some of the ideas we have encountered in previous papers, making a brief comparison between query by example vs. iconic query and mentioning the importance of the various types of camera shots and camera degrees of freedom, but didn't really seem to make a strong point about any of them, apart from mentioning some basic ideas incorporated into MovEase. 
It discussed using a classification hierarchy for the types of motion in a sequence, but it wasn't completely clear how the hierarchy was used by the authors. (I assume it was used in creating a query?) The key objective of the paper seemed to be the consideration of motion as an important quality for retrieving video data, but this wasn't really addressed until midway through the paper. The paper also seemed to assume that the motion attributes were already generated by some other means. (I assume someone determined them manually for the examples given?) I guess I'm just not clear on the contribution of this paper. It seems the new idea presented by the paper is the user interface for generating motion queries, which allows the user to specify the path of an object, a camera motion, and a duration estimate. It would have been nice to see more detail in this area of the paper and on how exactly the information given by the user is used to search the annotated images in the database.
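To make the kind of query described above concrete, here is a minimal sketch of what a motion query carrying an object path, a camera motion, and a duration estimate might look like. All names and fields are hypothetical; this is not MovEase's actual interface, just an illustration of the information the user would supply.

    # A minimal sketch (hypothetical names, not MovEase's actual interface) of a
    # motion query carrying an object path, a camera motion, and a duration estimate.
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class MotionQuery:
        object_icon: str                                                # e.g. "car"
        path: list[tuple[float, float]] = field(default_factory=list)  # normalized (x, y) waypoints
        camera_motion: str = "static"                                   # e.g. "pan-right", "zoom-in"
        duration_seconds: Optional[float] = None                        # user's rough estimate

    # Example: a car crossing the frame left to right while the camera pans right.
    query = MotionQuery(
        object_icon="car",
        path=[(0.1, 0.5), (0.5, 0.5), (0.9, 0.5)],
        camera_motion="pan-right",
        duration_seconds=4.0,
    )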

Gregory Ganarz

In "Media streams" M. Davis describes a system for the annotation and retrieval of video information based on the use of icons. Davis argues that icons are a better representation of video content than keywords, and that icons could enable the search and retrieval of video from large archives. Part of the basis for this argument is that icons are not language specific, though the author later concedes that different cultures often interpret the same icon in different ways. Also, to express ideas in icons requires learning a new language, that of what icons are availible and how to express ones ideas using them. Davis also seems to claim that an iconic representation would be easier for both computers and humans to understand, though no evidence supporting such a claim is presented. The whole idea of using icons for video search seems suspect, since humans are used to expressing themselves in language, and most humans would likely prefer to use a language that they already know. In "Video query formulation" G. Ahanger et al. present a technique based largely on motion for the retrieval of video data. Like the Media streams system, icons are used to represent the video. One of the troubles with using icons is that the number of icons required (the vocabulary) increases rapidly with the specificity of the description. Plus, icons have difficulty representing the finer points of motion beyond translation, e.g. speed and acceleration. Finally, the automization of video indexing into an iconic language is likely a very difficult problem, and may be no easier than translating into more familiar languages such as English. In "Indexes for user access to large video databases" L. Rowe et al. present survey evidence suggesting that three indexes capture many of the queries users typically ask when searching video databases. These indexes are bibliographic, structural, and content. Largely based on keywords and keyframes of the video, the technique presented is a very practical approach to providing video on demand. One of the areas which is not addressed is how to automatize the selection of the keywords and keyframes from the video database.

Shrenik Daftary

Synopsis of "Video Query Formulation" by Ahanger, Benson, and Little This paper presents a method to perform an advanced query formulation on a video database. The system allows the user to describe predicates interactively while also giving the user the option to provide feedback that is similar to the video data. Query by Example Techniques Some techniques that perform similar functions are IMAID, which uses pattern recognition, and image processing manipulation functions to extract a pictorial description from an example image. The system returns all pictures of the sequences that satisfy the selection criteria. The next mentioned method is ART MUSEUM, which requires the searcher to sketch a rough outline of the object to be retrieved. The final technique under this grouping that is mentioned is QBIC, which stores textual information on an image, and searches based on that representation as well as on features such as color, texture, shape, and layout. Iconic Query Techniques Searches done in this method rely on a user's knowledge of the world, and use icons to represent entities in the world. Most of these techniques do not allow user defined icons, which limits their flexibility. Some techniques using this method are Virtual Video Browser, Video Database Browser, and Media Streams. Additional limitations of these systems include the need for computational power, and processing time. The result of examining these techniques was the determination that a video retrieval system should be flexible and provide a facile way to formulate queries. A database that store video information should have distinctive features to distinguish each video from other videos. An object in a video database can be divided according to its function in terms of rigidity, and articulation. Additionally camera motion should be separated from object motion to ensure the ability to capture the camera motion. A motion classification hierarchy is presented. The retrieval of information in this technique allows a combination of textual and visual queries. The system uses a fuzzy logic description, so a user could find a reddish apple for instance. Icons can also be formed to generalize object, camera motions. Indexing the database can be performed both off-line, and on-line. This technique seems to be fine, but indexing the videos seems to be a difficulty with using the technique. Synopsis for "Indexes for User Access to Large Video Databases" by Rowe, Boreczky, and Eads This paper presents an extension to the standard video on demand system, which would allow access to any video clip in a large database based on the video contents. The type of indexes for each video would be based on bibliographic data, structural data, and content data. The first goal in the paper was to ensure that the developed structure would be able to answer queries that would be likely to take place. The description of the indices is presented, as well as potential queries using the system. The VDB is presented which allows a user to select a desired set of frame sequences. Some problems with the Videodatabase Browser are presented above. Synopsis for "Media streams: an iconic visual language for video annotation" by Marc Davis The paper presents the goal of video annotation as being the ability of computer 1 being able to understand computer 2's annotations. The state of annotation now is that one person can use their own annotations. 
The necessity for universal understanding is presented in terms of a potential scenario in which a video recorded by one news team is useful to another news team, in a different country, many years in the future, for a different purpose. Problems with keywords are presented: they do not provide an adequate hierarchical structure. The video annotation language should create representations that are durable and sharable. The representation of a video sequence should be one that makes clips, not a representation of clips; if the latter were chosen, the information that could be captured would be limited. The annotation scheme must allow new annotations to be developed, and allow differences between annotations to be expressed. The importance of surrounding events in a video sequence is presented (providing the contextual information of human perception). This system also uses an iconic representation to search the video database. The system also provides the ability to order the search, so that when searching for a dog biting a man, only scenes with the proper ordering are returned, rather than scenes of a man biting a dog. Object actions are represented horizontally, using both motion and state changes. Characters are represented vertically, using sex, occupation, and number of persons. The method is extended to weather systems and cinematography. Transitions are also represented. This technique seems to be an accurate method for retrieving information. The actual annotation would be performed on-line, however, which could limit the ability to create a large database quickly.
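As a toy illustration of the fuzzy matching mentioned above for the "reddish apple" query, here is a small sketch. The membership function, thresholds, and data are invented for illustration and are not from the paper.

    # A toy sketch of fuzzy color matching for a "reddish apple" query.
    # Nothing here is from the paper; the membership function is made up.
    def redness(rgb):
        """Fuzzy membership in 'reddish': near 1.0 for pure red, falling off as
        green or blue dominate. The result is clamped to [0, 1]."""
        r, g, b = rgb
        return max(0.0, min(1.0, (r - max(g, b)) / 255.0))

    # Rank annotated objects by how well they satisfy "reddish apple".
    objects = [
        {"label": "apple", "color": (200, 40, 30)},
        {"label": "apple", "color": (120, 160, 60)},   # a greenish apple
        {"label": "car",   "color": (220, 20, 20)},    # red, but not an apple
    ]
    apples = [o for o in objects if o["label"] == "apple"]
    best = max(apples, key=lambda o: redness(o["color"]))
    print(best)   # the reddest apple wins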

William Klippgen

Media streams: an iconic visual language for video annotation
--------------------------------------------------------------
by Davis, M.

The paper claims that a video annotation language should support visualization and browsing of the video as well as pure retrieval of content. This is suggested to be done with an iconic language that is more general across culture and time than textual annotation, yet more abstract than the pure video content. Media Streams annotates video in a stratified manner, i.e. annotations can be linked to arbitrary video streams, possibly overlapping other annotated streams. The iconic language is made up of constructs that take one or more composed icons to specify a certain annotation. The icons are stored in tree structures as a growing palette as users add new icons. A problem not solved is of course the great chance of inconsistency that might arise from adding two or more icons to the palette with the same or overlapping meaning. Nevertheless, the paper has taken an unprecedented turn in actually suggesting a way to annotate pictures with pictures; in other words, an attempt to use a slightly more abstract language than text.

Both characters and objects can have their actions represented by the icons. Human character actions are divided into conventionalized physical motions and abstract physical motions. Object actions can be represented by icons subdivided into object motions and object state changes. Action icons and others can be represented as movie icons, or micons, to underline the dynamic properties of what they are describing. Both entities can additionally be described in terms of their relative positions to each other and their absolute screen position (and depth). The paper also suggests icon classes for mise-en-scene, cinematography, recording medium, thoughts (e.g. ratings), and transitions. The media time line makes use of thumbnails, small sample frames from the stream, and a so-called videogram to represent a video stream. The videogram is simply the result of concatenating the centre strip from every video frame in the stream (a small sketch of this construction appears after this commentary). This gives a fairly good description of the inter-frame action when combined with the sample keyframes.

I think Media Streams is a very interesting approach, but it meets some of the same consistency problems as the use of pure textual annotations. Textual descriptors can equally well be arranged in hierarchical tree structures like the icons, so both can serve as a highly user-structured language. One would possibly find that a strong combination of icons and text would prove helpful, where icons serve as general high-level descriptors and text as the detailed, specific descriptors.

Indexes for User Access to Large Video Databases
------------------------------------------------
by Rowe, L. A., Boreczky, J. S. and Eads, C. A.

This paper investigates how to index and query video in a large VOD system. The authors propose five index types: document, bibliographic, structural, object and keyword, based on an investigation into what queries users might make. While the suggested bibliographic and structural annotations are straightforward, the paper presents an interesting variant of keyword indexing. Keyword stems are stored in a separate class, as are the various documents with titles, abstracts or scripts. A link class combines the text documents and keywords, stating how many times a given keyword appears in the document.
This approach makes keyword searches efficient, but it does not directly give the location(s) of the keywords in the document. Objects and people in the video are annotated using an object class and a people class. An OBJ_INST class contains motion and other appearance information related to a given object or person. The user interface presents the user with both a browsing and a querying interface to the video data. The browser in figure 5 shows a hierarchical view of a tree of descriptors identifying an entire movie. It seems unclear, though, how the authors will implement the interface for the stratified object and person annotations.

I am not impressed with the paper if it is supposed to be one of the best in its field. It does present interesting queries based on the annotations, but it treats neither consistency problems nor the browsing of complex annotations. However, their use of content, spatial and temporal queries is very interesting and is similar to work done at the Norwegian Institute of Technology by Midtstraum and Hjelsvold. The use of many of the annotations in a practical query is at best vague and needs much more work. The authors should have been much clearer in stating the limitations of the paper, not only when it comes to the actual implementation of the VDB.

Video query formulation
-----------------------
by Ahanger, G., Benson, D. and Little, T.D.C.

This paper points out that video inherits all the properties associated with images, such as color, shape and texture. In addition, video has properties of implied motion, sequential composition, advanced temporal and spatial relationships within or across frames, and synchronized audio signals. The main contribution of the paper is the suggested visual query formulation, which tries to overcome the shortcomings of textual descriptions of many aspects of video's static and dynamic content. The application, MovEase, is described, where the primary attribute under consideration is motion. Icons represent objects, textures, actions and shapes. Unlike Media Streams, no composite icons are introduced, which seems to be a weakness in the system, as each icon has to be self-contained. Attributes can however be assigned to the icons, e.g. color and shape. A very interesting concept is the query icon, which represents a previous query and which can be combined into new composite queries. The authors make a point of the fuzziness of video querying and claim this leads to a lesser need for exact annotation of spatial relationships like position and motion. The query interface of MovEase can specify icons, their movements in space and time, and camera movements and operations in space and time. The composition of low-level motion into high-level motion representations like rocking or swinging is mentioned and suggested to be a good representation of human perception of temporal content.
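The videogram mentioned in the Media Streams commentary above is simple enough to sketch: take the centre vertical strip of every frame and concatenate the strips side by side. The sketch below assumes frames arrive as numpy arrays of shape (height, width, 3); the strip width and function name are my own, not from the paper.

    # A minimal sketch of videogram construction: concatenate the centre strip
    # of every frame. Frame format and strip width are assumptions.
    import numpy as np

    def videogram(frames, strip_width=2):
        """Concatenate the centre strip of each frame into one wide image."""
        strips = []
        for frame in frames:
            h, w, _ = frame.shape
            left = w // 2 - strip_width // 2
            strips.append(frame[:, left:left + strip_width, :])
        return np.concatenate(strips, axis=1)

    # 100 frames of 240x320 RGB video yield a 240 x 200 summary image.
    dummy = [np.zeros((240, 320, 3), dtype=np.uint8) for _ in range(100)]
    print(videogram(dummy).shape)   # (240, 200, 3)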

John Petry

INDEXES FOR USER ACCESS TO LARGE VIDEO DATABASES
________________________________________________
by Rowe, Boreczky and Eads

The authors propose a method for indexing (large) video databases. They create four main indexing classes: bibliographic, structural, object and keyword. Bibliographic data is high-level text containing standard information about a movie, such as its title, director and cast. Structural data is a hierarchy of film shots, scenes, segments and movies. Object data pertains to visual features: people, cars, etc. (essentially, "nouns") and their visual properties (e.g. color histograms, area). Keyword data is exactly that -- key words that describe objects, actions, or film techniques (panning, zooming, etc.). The authors have described a query language which permits searching of these index types. This search method is entirely textual -- even visual objects are searched in the same manner as keywords. Several problems exist with this approach:
1) These indices are separate (not connected by hypertext links), I believe.
2) How is all of this video to be annotated? It's hard enough to annotate according to one set of criteria (well, the bibliographic part is simple, but not the rest). But who is going to annotate separate structural, object and keyword information?
3) The user interface as described is awful. Where's the GUI?
4) It should be possible to specify object data visually.
5) There is no discussion of how objects and keywords handle degrees of similarity or inheritance hierarchies.
Finally, the authors have only tested this on 1.5 hours of video. That's far too little to draw any useful conclusions.

VIDEO QUERY FORMULATION
_______________________
by Ahanger, Benson and Little

The authors describe a video query approach for a previously annotated video database. This annotation includes not only the statistical and object-type annotation described by Rowe et al., but also motion segmentation. A very large set of icons is used rather than keywords. The authors' key improvement on Rowe et al. is that they explicitly consider motion to be an item to be indexed against. For example, it is possible to say "find me cars moving from left to right." In addition, camera effects that induce apparent motion (e.g. panning, zooming) are handled. There is little information about several critical issues:
1) How are the hierarchical and similarity issues treated within their iconography?
2) How is the searching done?
3) Is it reasonable to assume that all of the segmentation and annotation is done a priori? Is this a topic they've already handled but simply don't describe here, or is it being handwaved away?
4) Assuming others have done the annotation, are there differences in annotation standards that preclude or affect the use of the icons or their relationships?
This is interesting, but I wish it had more detail. I can't tell if the paper is a good top-level view of a project with an acceptable but unstated way of handling low-level problems, or if the project itself treats these low-level issues superficially.

MEDIA STREAMS: AN ICONIC VISUAL LANGUAGE FOR VIDEO ANNOTATION
_____________________________________________________________
by Marc Davis

This is a much more detailed look at an implementation of a visual representation and indexing scheme. The vast majority of the paper covers representation issues, not indexing per se, but much of the indexing approach falls directly out of the representation.
A key goal is to have an annotation be usable by people other than the original annotator, and by computers. Current computer-supported annotation principally relies on keywords. Keywords have several limitations in the author's view, namely:
1) They fail to handle temporal structures [why?].
2) They are not semantic representations -- they fail to encode similarity and hierarchical relationships.
3) They don't scale well -- as the number of keywords increases, the chance of an exact match decreases, and given the poor handling of relationships, this is a significant problem.

The author encodes many types of data in his iconic language. This can include camera motion (e.g. zoom, pan), film information (e.g. size, density, speed), object data, time data, physical environment data, location, etc. He permits annotation in parallel timelines, so that breaks in any one type of data (e.g. a scene cut) are independent of others (e.g. which actor is present, or the location). One additional argument the author makes for icons is that since they are visual, they are culture-independent (or at least language-independent). This is not really so; they must be learned, like any other language (there are far too many, and their resolution is far too low, for them to be inherently obvious), and just because something is represented outside of a written language doesn't imply that it is culture-independent (viz. his example of video shot by Germans in Brazil for a Korean company and viewed by Americans).

There is some confusion in the level of data he is trying to capture. In one place he says that he wants to record both context-dependent and context-independent features (example: the Kuleshov Effect). But elsewhere he describes recording a handshake at a treaty signing as a physical occurrence, not a higher-level event, in this case an implication of agreement. This is contradictory. The latter can be important, and a mechanism should be provided for it and its use allowed, even if it is not culture-independent. In general, while computers can use the representations of low-level features easily, I doubt they will be as big a help on higher-level abstractions, since the meanings of these become fuzzier and more dependent on the annotator. This can also be seen in his classification of people by "apparent" profession, e.g. Marcus Welby is an MD because he wears a lab coat and has a stethoscope. That is hardly culture-independent!

Media Streams has three stages:
1) Director's Workshop. Users choose and/or create related icons.
2) Icon Palettes. Icons from (1) are grouped here.
3) Media Time Lines. Video is annotated in parallel timelines using the above icons. Each timeline represents a type of data (film data, camera data, object data, location data, etc.).
Icon placement in Time Lines is constrained by a syntax, so it is not just an iconography, but a language. Director's Workshop uses an intelligent cascading hierarchy of icons, with increasing specificity in one direction and parallel-level class members in the other. Hyperlinks connect icons to multiple predecessors and successors, depending on the path chosen, so this is a true graph, not a tree (a small sketch of such a graph appears at the end of this commentary).

DW Comments:
1) It's not clear how new icons are placed in the graph.
2) The icons for motion seem awkward.
3) The number of icons is huge -- how will a user learn them easily?
4) Icons seem like a bit of text with an associated picture. The text can be translated, but the picture is not. Is this so? If so, it is not that different from a keyword approach. If not, what else is going on?
5) Some of his icons, such as body parts, seem unrealistically complex. Is someone really going to annotate a film by noting the motion of every body part?
6) "Thought" icons exist for annotator comments. Is this just a catch-all for the kind of subjective notation that he is trying to avoid?

Icon Palette Comments:
1) He mentions new icons being spontaneously creatable when certain icon groupings occur frequently. Doesn't this defeat the goal of sharing among users and computers, if new icons are created only for particular video streams?

Time Lines Comments:
1) Regarding the earlier comments about the advantages of icons, I have to say some of his examples (see p. 63) were unintelligible without the accompanying text notation.

Overall, the idea of parallel timelines is very good, as is the graph arrangement between icons. But I am not yet convinced of the advantage of icons over keywords (or perhaps I should say key phrases), since icons seem to have a hidden, underlying textual nature. The time to learn the iconic language seems quite long. And the degree of annotation he proposes is prohibitive.
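As a toy illustration of the Director's Workshop graph structure described above, here is a minimal sketch in which an icon can be reached along multiple paths, i.e. it has multiple predecessors. The icon names and the traversal helper are hypothetical, not taken from Media Streams.

    # A toy icon graph: icons may have multiple predecessors and successors,
    # so the structure is a graph rather than a tree. All names are invented.
    from collections import defaultdict

    successors = defaultdict(list)   # icon -> more specific icons reachable from it

    def link(parent, child):
        if child not in successors[parent]:
            successors[parent].append(child)

    # "dog" is reachable from both "animal" and "pet": two predecessors.
    link("object", "animal"); link("animal", "dog")
    link("object", "pet");    link("pet", "dog")

    def paths_to(target, start="object", path=None):
        """Enumerate all icon paths from start to target (depth-first)."""
        path = (path or []) + [start]
        if start == target:
            yield path
            return
        for nxt in successors[start]:
            yield from paths_to(target, nxt, path)

    print(list(paths_to("dog")))
    # [['object', 'animal', 'dog'], ['object', 'pet', 'dog']]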


Stan Sclaroff
Created: Nov 28, 1995
Last Modified: