BU CLA CS 835: Seminar on Image and Video Computing

Class commentary on articles: Video Annotation and Retrieval Interfaces



John Isidoro

This week's batch of papers described three different user interfaces for querying a multimedia database of video clips. The first paper, "Indexes for User Access to Large Video Databases," was about doing keyword searches into a relational database. This paper really bored me; there were no cool image/video processing algorithms involved, and it was basically a description of a database. Since the video streams need to be human annotated, this technique also requires a lot of manpower. It's basically nothing more than a computerized card catalog for video clips.

The second paper, "Media Streams: An Iconic Visual Language for Video Annotation," described using an iconic language to query a video database. I thought this was a pretty good idea; when looking for scenes in video it only makes sense to use pictures to describe them. Another added benefit of using pictures to describe scenes is that you don't have to speak a certain language to use the system. All you need to be able to do is see and understand the icons. Personally I think it might take a little time to get used to manipulating the icons to say what you want (because these icons can have multiple meanings), but after a while I think it would become second nature.

The next paper, "Video Query Formulation," takes the next logical step in user input for video querying. The concept of an iconic language is made more robust by specifying motion along with the icons. You make a mini movie out of icons, and the MovEase system finds video clips whose motion is similar. This was definitely the best paper of the three, because the concept is unique. I can see something like this being a feature of future cable networks, where a person at home could choose a movie to watch by manipulating icons on the TV set. I can also see this at future libraries, where historical and documentary video clips could be accessed this way, e.g. looking for video clips showing Clinton shaking hands with other politicians. I think an interesting next step in user interfaces for video querying would be, instead of having the user manipulate icons, to have the user act out the action or scene being looked for. For instance, a user looking for martial arts footage might get into a tae kwon do stance and throw a kick.

Lars Liden

"Media Streams: An Iconic Visual Language for Video Annotation" Mark Davis This paper was based on the assumption that current technologies as well as any technology likely to be developed in the near future for image processing/machine vision will not be capable of creating an automated mechanism for efficiently searching video databases in any useful sense. If one is willing to accept this assumption, the question becomes what are the alternatives. The only obvious one is that of video annotation. Current video annotation predominantly uses the keyword approach which as the author explains has many limitations. As an alternative the author suggests the use of a standardized iconic language which is capable of representing objects, actions, locations, etc. The author mentions several advantages for such a method. The are several disadvantages with an iconic representation which aren't discussed by the author. Primary among them is the sheer number of icons which must be created to fully represent all the objects, actions and relationships between objects. One can imagine that in order to create an adequate set of icons one would have to have literally 1000's, perhaps even 10000's of icons all of which would have to be learned and easily recognized by anyone using such an annotation method. The task of learning such a language is daunting to say the least. The next problem is that of accessing the icons. Unlike a written language whose words can be easily accessed, there is no easy way to access individual icons in an iconic language. The author's suggest the use of a hierarchical, graph-like structure which allows one to traverse a series of iconic paths to reach individual icons. This is still far from adequate. One can imagine that even if one could picture the icon for say "cuisinart" one would still have to search down a tree of icons, clicking at the least 4-5 times to reach the "cuisinart" icon and one would have to do so for every object in a scene if one wanted a complete annotation. Such a method would be far too time consuming. To generate the iconic representation for the author's example "in a bar in the USA" seems excessively protracted compared to the generation of a verbal representation. It seems like a useful alternative would be the use of non-pictorial icons for common objects, actions. The author suggests that natural languages are not good for sequential overlapping actions, however, is one created icons consisting of words, sequential, overlapping actions are possible. Also, such icons are easily recognizable and could be easily accessed. One could imagine having a dual representation, with each icon consisting of a word and associated picture if deemed necessary. There are additional problems with annotation, such as it is highly dependent on what that annotators goal are, but such drawbacks have already been extensively discussed in class. "Video Query Formulation" Ahanger, G., Benson, D., & Little, T. Unfortunately I found this paper to be a bit vague on several of the details. At the beginning it seemed to skim over some of the ideas we have encountered in previous papers, making a brief comparison between query by example vs. iconic query and mentioning the importance of the various types of camera shots and camera degrees of freedom, but didn't really seem to make a strong point about any of them, apart from mentioning some basic ideas incorporated into MovEase. 
It discussed using a classification hierarchy for the types of motion in a sequence, but it wasn't completely clear how the hierarchy was used by the authors. (I assume it was used in creating a query?) The key objective of the paper seemed to be the consideration of motion as an important quality for retrieving video data, but this wasn't really addressed until midway through the paper. The paper also seemed to assume that the motion attributes were already generated by some other means. (I assume someone determined them manually for the examples given?) I guess I'm just not clear on the contribution of this paper. It seems the new idea presented by the paper is the user interface for generating motion queries, which allows the user to specify the path of an object, a camera motion, and a duration estimate. It would have been nice to see more detail in this area of the paper and on how exactly the information given by the user is used to search the annotated images in the database.
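To make the kind of query described above concrete, here is a minimal sketch of what a motion query carrying an object path, a camera motion, and a duration estimate might look like. All names and fields are hypothetical; this is not MovEase's actual interface, just an illustration of the information the user would supply.

    # A minimal sketch (hypothetical names, not MovEase's actual interface) of a
    # motion query carrying an object path, a camera motion, and a duration estimate.
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class MotionQuery:
        object_icon: str                                                # e.g. "car"
        path: list[tuple[float, float]] = field(default_factory=list)  # normalized (x, y) waypoints
        camera_motion: str = "static"                                   # e.g. "pan-right", "zoom-in"
        duration_seconds: Optional[float] = None                        # user's rough estimate

    # Example: a car crossing the frame left to right while the camera pans right.
    query = MotionQuery(
        object_icon="car",
        path=[(0.1, 0.5), (0.5, 0.5), (0.9, 0.5)],
        camera_motion="pan-right",
        duration_seconds=4.0,
    )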

Gregory Ganarz

In "Media streams" M. Davis describes a system for the annotation and retrieval of video information based on the use of icons. Davis argues that icons are a better representation of video content than keywords, and that icons could enable the search and retrieval of video from large archives. Part of the basis for this argument is that icons are not language specific, though the author later concedes that different cultures often interpret the same icon in different ways. Also, to express ideas in icons requires learning a new language, that of what icons are availible and how to express ones ideas using them. Davis also seems to claim that an iconic representation would be easier for both computers and humans to understand, though no evidence supporting such a claim is presented. The whole idea of using icons for video search seems suspect, since humans are used to expressing themselves in language, and most humans would likely prefer to use a language that they already know. In "Video query formulation" G. Ahanger et al. present a technique based largely on motion for the retrieval of video data. Like the Media streams system, icons are used to represent the video. One of the troubles with using icons is that the number of icons required (the vocabulary) increases rapidly with the specificity of the description. Plus, icons have difficulty representing the finer points of motion beyond translation, e.g. speed and acceleration. Finally, the automization of video indexing into an iconic language is likely a very difficult problem, and may be no easier than translating into more familiar languages such as English. In "Indexes for user access to large video databases" L. Rowe et al. present survey evidence suggesting that three indexes capture many of the queries users typically ask when searching video databases. These indexes are bibliographic, structural, and content. Largely based on keywords and keyframes of the video, the technique presented is a very practical approach to providing video on demand. One of the areas which is not addressed is how to automatize the selection of the keywords and keyframes from the video database.

Shrenik Daftary

Synopsis of "Video Query Formulation" by Ahanger, Benson, and Little This paper presents a method to perform an advanced query formulation on a video database. The system allows the user to describe predicates interactively while also giving the user the option to provide feedback that is similar to the video data. Query by Example Techniques Some techniques that perform similar functions are IMAID, which uses pattern recognition, and image processing manipulation functions to extract a pictorial description from an example image. The system returns all pictures of the sequences that satisfy the selection criteria. The next mentioned method is ART MUSEUM, which requires the searcher to sketch a rough outline of the object to be retrieved. The final technique under this grouping that is mentioned is QBIC, which stores textual information on an image, and searches based on that representation as well as on features such as color, texture, shape, and layout. Iconic Query Techniques Searches done in this method rely on a user's knowledge of the world, and use icons to represent entities in the world. Most of these techniques do not allow user defined icons, which limits their flexibility. Some techniques using this method are Virtual Video Browser, Video Database Browser, and Media Streams. Additional limitations of these systems include the need for computational power, and processing time. The result of examining these techniques was the determination that a video retrieval system should be flexible and provide a facile way to formulate queries. A database that store video information should have distinctive features to distinguish each video from other videos. An object in a video database can be divided according to its function in terms of rigidity, and articulation. Additionally camera motion should be separated from object motion to ensure the ability to capture the camera motion. A motion classification hierarchy is presented. The retrieval of information in this technique allows a combination of textual and visual queries. The system uses a fuzzy logic description, so a user could find a reddish apple for instance. Icons can also be formed to generalize object, camera motions. Indexing the database can be performed both off-line, and on-line. This technique seems to be fine, but indexing the videos seems to be a difficulty with using the technique. Synopsis for "Indexes for User Access to Large Video Databases" by Rowe, Boreczky, and Eads This paper presents an extension to the standard video on demand system, which would allow access to any video clip in a large database based on the video contents. The type of indexes for each video would be based on bibliographic data, structural data, and content data. The first goal in the paper was to ensure that the developed structure would be able to answer queries that would be likely to take place. The description of the indices is presented, as well as potential queries using the system. The VDB is presented which allows a user to select a desired set of frame sequences. Some problems with the Videodatabase Browser are presented above. Synopsis for "Media streams: an iconic visual language for video annotation" by Marc Davis The paper presents the goal of video annotation as being the ability of computer 1 being able to understand computer 2's annotations. The state of annotation now is that one person can use their own annotations. 
The necessity for universal understanding is presented in terms of a potential scenario in which a video recorded by one news team is useful to another news team, in a different country, many years in the future, for a different purpose. Problems with keywords are presented: they do not provide an adequate hierarchical structure. The video annotation language should create representations that are durable and sharable. The representation of a video sequence should be one that makes clips, not a representation of clips; if the latter were chosen, the information that could be captured would be limited. The annotation scheme must allow new annotations to be developed, and allow differences between annotations to be expressed. The importance of surrounding events in a video sequence is presented (providing the contextual information of human perception). This system also uses an iconic representation to search the video database. The system also provides the ability to order the search, so that when searching for a dog biting a man, only scenes with the proper ordering are returned, rather than scenes of a man biting a dog. Object actions are represented horizontally, using both motion and state changes. Characters are represented vertically, using sex, occupation, and number of persons. The method is extended to weather systems and cinematography. Transitions are also represented. This technique seems to be an accurate method for retrieving information. The actual annotation would be performed on-line, however, which could limit the ability to create a large database quickly.
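As a toy illustration of the fuzzy matching mentioned above for the "reddish apple" query, here is a small sketch. The membership function, thresholds, and data are invented for illustration and are not from the paper.

    # A toy sketch of fuzzy color matching for a "reddish apple" query.
    # Nothing here is from the paper; the membership function is made up.
    def redness(rgb):
        """Fuzzy membership in 'reddish': near 1.0 for pure red, falling off as
        green or blue dominate. The result is clamped to [0, 1]."""
        r, g, b = rgb
        return max(0.0, min(1.0, (r - max(g, b)) / 255.0))

    # Rank annotated objects by how well they satisfy "reddish apple".
    objects = [
        {"label": "apple", "color": (200, 40, 30)},
        {"label": "apple", "color": (120, 160, 60)},   # a greenish apple
        {"label": "car",   "color": (220, 20, 20)},    # red, but not an apple
    ]
    apples = [o for o in objects if o["label"] == "apple"]
    best = max(apples, key=lambda o: redness(o["color"]))
    print(best)   # the reddest apple wins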

William Klippgen

Media streams: an iconic visual language for video annotation
--------------------------------------------------------------
by Davis, M.

The paper claims that a video annotation language should support visualization and browsing of the video as well as pure retrieval of content. This is suggested to be done with an iconic language that is more general across culture and time than textual annotation, yet more abstract than the pure video content. Media Streams annotates video in a stratified manner, i.e. annotations can be linked to arbitrary video streams, possibly overlapping other annotated streams. The iconic language is made up of constructs that take one or more composed icons to specify a certain annotation. The icons are stored in tree structures as a growing palette as users add new icons. A problem not solved is of course the great chance of inconsistency that might arise from adding two or more icons to the palette with the same or overlapping meaning. Nevertheless, the paper has taken an unprecedented turn in actually suggesting a way to annotate pictures with pictures; in other words, an attempt to use a slightly more abstract language than text.

Both characters and objects can have their actions represented by the icons. Human character actions are divided into conventionalized physical motions and abstract physical motions. Object actions can be represented by icons subdivided into object motions and object state changes. Action icons and others can be represented as movie icons, or micons, to underline the dynamic properties of what they are describing. Both entities can additionally be described in terms of their relative positions to each other and their absolute screen position (and depth). The paper also suggests icon classes for mise-en-scene, cinematography, recording medium, thoughts (e.g. ratings), and transitions. The media time line makes use of thumbnails, small sample frames from the stream, and a so-called videogram to represent a video stream. The videogram is simply the result of concatenating the centre strip from every video frame in the stream (a small sketch of this construction appears after this commentary). This gives a fairly good description of the inter-frame action when combined with the sample keyframes.

I think Media Streams is a very interesting approach, but it meets some of the same consistency problems as the use of pure textual annotations. Textual descriptors can equally well be arranged in hierarchical tree structures like the icons, so both can serve as a highly user-structured language. One would possibly find that a strong combination of icons and text would prove helpful, where icons serve as general high-level descriptors and text as the detailed, specific descriptors.

Indexes for User Access to Large Video Databases
------------------------------------------------
by Rowe, L. A., Boreczky, J. S. and Eads, C. A.

This paper investigates how to index and query video in a large VOD system. The authors propose five index types: document, bibliographic, structural, object and keyword, based on an investigation into what queries users might make. While the suggested bibliographic and structural annotations are straightforward, the paper presents an interesting variant of keyword indexing. Keyword stems are stored in a separate class, as are the various documents with titles, abstracts or scripts. A link class combines the text documents and keywords, stating how many times a given keyword appears in the document.
This approach makes keyword searches efficient, but it does not directly give the location(s) of the keywords in the document. Objects and people in the video are annotated using an object class and a people class. An OBJ_INST class contains motion and other appearance information related to a given object or person. The user interface presents the user with both a browsing and a querying interface to the video data. The browser in figure 5 shows a hierarchical view of a tree of descriptors identifying an entire movie. It seems unclear, though, how the authors will implement the interface for the stratified object and person annotations.

I am not impressed with the paper if it is supposed to be one of the best in its field. It does present interesting queries based on the annotations, but it treats neither consistency problems nor the browsing of complex annotations. However, their use of content, spatial and temporal queries is very interesting and is similar to work done at the Norwegian Institute of Technology by Midtstraum and Hjelsvold. The use of many of the annotations in a practical query is at best vague and needs much more work. The authors should have been much clearer in stating the limitations of the paper, not only when it comes to the actual implementation of the VDB.

Video query formulation
-----------------------
by Ahanger, G., Benson, D. and Little, T.D.C.

This paper points out that video inherits all the properties associated with images, such as color, shape and texture. In addition, video has properties of implied motion, sequential composition, advanced temporal and spatial relationships within or across frames, and synchronized audio signals. The main contribution of the paper is the suggested visual query formulation, which tries to overcome the shortcomings of textual descriptions of many aspects of video's static and dynamic content. The application, MovEase, is described, where the primary attribute under consideration is motion. Icons represent objects, textures, actions and shapes. Unlike Media Streams, no composite icons are introduced, which seems to be a weakness in the system, as each icon has to be self-contained. Attributes can however be assigned to the icons, e.g. color and shape. A very interesting concept is the query icon, which represents a previous query and which can be combined into new composite queries. The authors make a point of the fuzziness of video querying and claim this leads to a lesser need for exact annotation of spatial relationships like position and motion. The query interface of MovEase can specify icons, their movements in space and time, and camera movements and operations in space and time. The composition of low-level motion into high-level motion representations like rocking or swinging is mentioned and suggested to be a good representation of human perception of temporal content.
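The videogram mentioned in the Media Streams commentary above is simple enough to sketch: take the centre vertical strip of every frame and concatenate the strips side by side. The sketch below assumes frames arrive as numpy arrays of shape (height, width, 3); the strip width and function name are my own, not from the paper.

    # A minimal sketch of videogram construction: concatenate the centre strip
    # of every frame. Frame format and strip width are assumptions.
    import numpy as np

    def videogram(frames, strip_width=2):
        """Concatenate the centre strip of each frame into one wide image."""
        strips = []
        for frame in frames:
            h, w, _ = frame.shape
            left = w // 2 - strip_width // 2
            strips.append(frame[:, left:left + strip_width, :])
        return np.concatenate(strips, axis=1)

    # 100 frames of 240x320 RGB video yield a 240 x 200 summary image.
    dummy = [np.zeros((240, 320, 3), dtype=np.uint8) for _ in range(100)]
    print(videogram(dummy).shape)   # (240, 200, 3)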

John Petry

INDEXES FOR USER ACCESS TO LARGE VIDEO DATABASES
________________________________________________
by Rowe, Boreczky and Eads

The authors propose a method for indexing (large) video databases. They create four main indexing classes: bibliographic, structural, object and keyword. Bibliographic data is high-level text containing standard information about a movie, such as its title, director and cast. Structural data is a hierarchy of film shots, scenes, segments and movies. Object data pertains to visual features: people, cars, etc. (essentially, "nouns") and their visual properties (e.g. color histograms, area). Keyword data is exactly that -- key words that describe objects, actions, or film techniques (panning, zooming, etc.). The authors have described a query language which permits searching of these index types. This search method is entirely textual -- even visual objects are searched in the same manner as keywords. Several problems exist with this approach:
1) These indices are separate (not connected by hypertext links), I believe.
2) How is all of this video to be annotated? It's hard enough to annotate according to one set of criteria (well, the bibliographic part is simple, but not the rest). But who is going to annotate separate structural, object and keyword information?
3) The user interface as described is awful. Where's the GUI?
4) It should be possible to specify object data visually.
5) There is no discussion of how objects and keywords handle degrees of similarity or inheritance hierarchies.
Finally, the authors have only tested this on 1.5 hours of video. That's far too little to draw any useful conclusions.

VIDEO QUERY FORMULATION
_______________________
by Ahanger, Benson and Little

The authors describe a video query approach for a previously annotated video database. This annotation includes not only the statistical and object-type annotation described by Rowe et al., but also motion segmentation. A very large set of icons is used rather than keywords. The authors' key improvement on Rowe et al. is that they explicitly consider motion to be an item to be indexed against. For example, it is possible to say "find me cars moving from left to right." In addition, camera effects that induce apparent motion (e.g. panning, zooming) are handled. There is little information about several critical issues:
1) How are the hierarchical and similarity issues treated within their iconography?
2) How is the searching done?
3) Is it reasonable to assume that all of the segmentation and annotation is done a priori? Is this a topic they've already handled but simply don't describe here, or is it being handwaved away?
4) Assuming others have done the annotation, are there differences in annotation standards that preclude or affect the use of the icons or their relationships?
This is interesting, but I wish it had more detail. I can't tell if the paper is a good top-level view of a project with an acceptable but unstated way of handling low-level problems, or if the project itself treats these low-level issues superficially.

MEDIA STREAMS: AN ICONIC VISUAL LANGUAGE FOR VIDEO ANNOTATION
_____________________________________________________________
by Marc Davis

This is a much more detailed look at an implementation of a visual representation and indexing scheme. The vast majority of the paper covers representation issues, not indexing per se, but much of the indexing approach falls directly out of the representation.
A key goal is to have an annotation be usable by people other than the original annotator, and by computers. Current computer-supported annotation principally relies on keywords. Keywords have several limitations in the author's view, namely:
1) They fail to handle temporal structures [why?].
2) They are not semantic representations -- they fail to encode similarity and hierarchical relationships.
3) They don't scale well -- as the number of keywords increases, the chance of an exact match decreases, and given the poor handling of relationships, this is a significant problem.

The author encodes many types of data in his iconic language. This can include camera motion (e.g. zoom, pan), film information (e.g. size, density, speed), object data, time data, physical environment data, location, etc. He permits annotation in parallel timelines, so that breaks in any one type of data (e.g. a scene cut) are independent of others (e.g. which actor is present, or the location). One additional argument the author makes for icons is that since they are visual, they are culture-independent (or at least language-independent). This is not really so; they must be learned, like any other language (there are far too many, and their resolution is far too low, for them to be inherently obvious), and just because something is represented outside of a written language doesn't imply that it is culture-independent (viz. his example of video shot by Germans in Brazil for a Korean company and viewed by Americans).

There is some confusion in the level of data he is trying to capture. In one place he says that he wants to record both context-dependent and context-independent features (example: the Kuleshov Effect). But elsewhere he describes recording a handshake at a treaty signing as a physical occurrence, not a higher-level event, in this case an implication of agreement. This is contradictory. The latter can be important, and a mechanism should be provided for it and its use allowed, even if it is not culture-independent. In general, while computers can use the representations of low-level features easily, I doubt they will be as big a help on higher-level abstractions, since the meanings of these become fuzzier and more dependent on the annotator. This can also be seen in his classification of people by "apparent" profession, e.g. Marcus Welby is an MD because he wears a lab coat and has a stethoscope. That is hardly culture-independent!

Media Streams has three stages:
1) Director's Workshop. Users choose and/or create related icons.
2) Icon Palettes. Icons from (1) are grouped here.
3) Media Time Lines. Video is annotated in parallel timelines using the above icons. Each timeline represents a type of data (film data, camera data, object data, location data, etc.).
Icon placement in Time Lines is constrained by a syntax, so it is not just an iconography, but a language. Director's Workshop uses an intelligent cascading hierarchy of icons, with increasing specificity in one direction and parallel-level class members in the other. Hyperlinks connect icons to multiple predecessors and successors, depending on the path chosen, so this is a true graph, not a tree (a small sketch of such a graph appears at the end of this commentary).

DW Comments:
1) It's not clear how new icons are placed in the graph.
2) The icons for motion seem awkward.
3) The number of icons is huge -- how will a user learn them easily?
4) Icons seem like a bit of text with an associated picture. The text can be translated, but the picture is not. Is this so? If so, it is not that different from a keyword approach. If not, what else is going on?
5) Some of his icons, such as body parts, seem unrealistically complex. Is someone really going to annotate a film by noting the motion of every body part?
6) "Thought" icons exist for annotator comments. Is this just a catch-all for the kind of subjective notation that he is trying to avoid?

Icon Palette Comments:
1) He mentions new icons being spontaneously creatable when certain icon groupings occur frequently. Doesn't this defeat the goal of sharing among users and computers, if new icons are created only for particular video streams?

Time Lines Comments:
1) Regarding the earlier comments about the advantages of icons, I have to say some of his examples (see p. 63) were unintelligible without the accompanying text notation.

Overall, the idea of parallel timelines is very good, as is the graph arrangement between icons. But I am not yet convinced of the advantage of icons over keywords (or perhaps I should say key phrases), since icons seem to have a hidden, underlying textual nature. The time to learn the iconic language seems quite long. And the degree of annotation he proposes is prohibitive.
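As a toy illustration of the Director's Workshop graph structure described above, here is a minimal sketch in which an icon can be reached along multiple paths, i.e. it has multiple predecessors. The icon names and the traversal helper are hypothetical, not taken from Media Streams.

    # A toy icon graph: icons may have multiple predecessors and successors,
    # so the structure is a graph rather than a tree. All names are invented.
    from collections import defaultdict

    successors = defaultdict(list)   # icon -> more specific icons reachable from it

    def link(parent, child):
        if child not in successors[parent]:
            successors[parent].append(child)

    # "dog" is reachable from both "animal" and "pet": two predecessors.
    link("object", "animal"); link("animal", "dog")
    link("object", "pet");    link("pet", "dog")

    def paths_to(target, start="object", path=None):
        """Enumerate all icon paths from start to target (depth-first)."""
        path = (path or []) + [start]
        if start == target:
            yield path
            return
        for nxt in successors[start]:
            yield from paths_to(target, nxt, path)

    print(list(paths_to("dog")))
    # [['object', 'animal', 'dog'], ['object', 'pet', 'dog']]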


Stan Sclaroff
Created: Nov 28, 1995
Last Modified: