BU CLA CS 835: Seminar on Image and Video Computing

Class commentary on articles: Video Mosaics



Shrenik Daftary

Synopsis of "Quicktime VR" by Chen This paper presented a method to store images that will allow an user the ability to perform a virtual walkthrough from different viewing positions and orientations. The first method that is presented is the use of modeling systems that are rendered when a user accesses them. This is attacked because it is laborious, limits scene complexity, and requires special hardware. The next method is the creation of branching movies that allow only limited points where movements are allowed, unless a lot of storage space is used. The presented technique attempts to overcome the computational complexity, and storage requirements of the other techniques. The goals of the method that is presented in this paper were to allow playback on personal computers without hardware acceleration, recreation of both real and synthetic scenes, and high quality rendering independent of scene complexity. Camera rotation is described in terms of pitch, yaw, and roll. Methods are presented to reconstruct just pitch and yaw. Camera motion can involve a change in viewpoint (difficult to model) or direction (an environment map). Environment maps may be preserved in a cubic form (6 times the normal space). Camera zooming can be accomplished by tiling an image into multiple levels of the same size. By using different levels of resolution, infinite zoom can be simulated. Quicktime VR was tested in terms of presenting a panoramic movie, which are "multi-dimensional, event-driven spatially-oriented movies." Users are allowed to interact with the movie and perform pannings, zooms, and motion. Videos are stored with one panoramic track (pointers to hot points), and 2 video tracks. For slow media such as CDROMs, tracks should be interleaved. The system was tested with different PCs and shown that up to 29 updates of 1D Panning can be performed per second on high speed PowerPCs. The actual system to create the movies was presented next. Nodes are selected in a space, panoramas are created with computer rendering, hot spots are identified, panoramas are linked together, panoramic images are linked together. Finally the system's potential and actual applications were presented. The overall system seemed okay, although it did not allow camera roll (which could be modeled in software as an affine transformation). Its main problem though was its slow update speed for even powerful personal computers such as the pentium. The other goals though in terms of displaying real, and imaginary scenery with high quality complex images seemed feasible. "Mosaic Based Representations of Video Sequences and Their Applications" by Irani et al. A mosaic is a panoramic view taken from a video of a scene. Static mosaics refer to a single frame average of a sequence or a few frame average of the sequence. Dynamic mosaics refer to the reconstruction of a sequence of images by keeping a constant background. Unfortunately this technique does not allow immediate access to a particular frame, since the reconstruction is based on incremental frame reconstruction. Zooming can be performed by using a spatial pyramid to store the images. Image alignment is performed by first aligning successive images, then aligning images directly to the current composite mosaic images, and aligning the current mosaic image to the current image. Integration can be performed after the alignment. The image intensity is maintained within a reasonable level of change to allow smooth transitions. 
Mosaics are presented as enabling video compression, visualization, and video enhancement. The paper gives a visual comparison between MPEG and mosaic-based coding, and the reconstruction appeared much more effective for the mosaic representation. The results did not include much quantitative data, so it was difficult to judge the efficiency of the system. It would be nice to have a quantitative measure of how much storage is required for a given video sequence, and then to reconstruct some arbitrary frames from the sequence with the mosaic method and compare their quality against MPEG storage of the same sequence.
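To make the reconstruction and storage question above a little more concrete, here is a minimal sketch (not the authors' code; it assumes NumPy and OpenCV are available, and all names are invented) of how a frame could be recovered from a static mosaic, its stored projective transform, and its residual image. Comparing the coded size of the mosaic, the transforms, and the residuals against an MPEG encoding of the same sequence would give the kind of quantitative comparison asked for above.

import numpy as np
import cv2  # OpenCV; assumed available

def reconstruct_frame(mosaic, H_mosaic_to_frame, residual, frame_shape):
    """Approximately recover one original frame from a static mosaic.

    mosaic             -- panoramic image built from the whole sequence
    H_mosaic_to_frame  -- 3x3 projective transform mapping mosaic coords to frame coords
    residual           -- stored signed difference image for this frame
    frame_shape        -- (height, width) of the original frames
    """
    h, w = frame_shape
    # Resample the relevant part of the mosaic into the frame's coordinate system.
    predicted = cv2.warpPerspective(mosaic, H_mosaic_to_frame, (w, h))
    # Adding the residual back restores detail (e.g. moving objects) that was
    # averaged away when the mosaic was built.
    restored = predicted.astype(np.int16) + residual
    return np.clip(restored, 0, 255).astype(np.uint8)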

Paul Dell

M. Irani, P. Anandan, and S. Hsu, "Mosaic Based Representations of Video Sequences and Their Applications," in Proc. International Conference on Computer Vision 1995, pp. 605--611.

The paper examines two basic types of mosaics and some extensions of them. The two types are "static" and "dynamic" mosaics. A static mosaic is a construction of the common elements of a video sequence: if the scene contains no moving objects and the camera pans around, the mosaic consists of an image containing every point the camera captured, so one can construct a panoramic view. A dynamic mosaic updates the "background" mosaic with each new image frame, which allows moving objects to be displayed without the fading present in the static mosaic. Some applications of static and dynamic mosaics are video compression, low-bitrate transmission, "key frame" construction for later browsing, synopsis of motion, and video enhancement. One of the interesting applications was the use of mosaics as indexing and search tools: the operator could query a video sequence by finding certain static mosaics (i.e., backgrounds) and residuals of the mosaics (the changes).

S. E. Chen, "QuickTime VR -- An Image-Based Approach to Virtual Environment Navigation," in Proc. ACM SIGGRAPH 1995, pp. 29--38.

An image-based method for 3D scene synthesis and navigation is presented. The technique is based on 360-degree cylindrical panoramic images for the virtual environment and 180-degree by 360-degree object movies. Panoramic images spaced 5-10 feet apart (for indoor shots) are taken to form a grid of the environment; these images are then "linked" together manually by viewing direction to form the virtual environment. Another aspect of a VR scene is "hot spots," regions of the panoramic image reserved for user interaction. (I don't understand how these appear to the user. Are they just 8-bit images placed on the scene? Are they icons? How are they differentiated from the panoramic image?) This approach differs from the 3D modelling-and-rendering approach and the branching-movies approach. The 3D rendering approach consists of collections of 3D object models rendered in real time; the author mentions three drawbacks: 1) creating geometric objects is a "laborious manual process," 2) the rendering engine limits scene complexity, and 3) special-purpose rendering hardware is not widely available. The two major drawbacks of the branching-movies approach are that it limits navigation and requires a large amount of storage. A big limitation of the image-based approach is that scenes and objects need to be static; some movement is allowed around particular points where cyclical time-varying behavior has been captured and the image sequences have been looped. Another limitation is that the user cannot look straight up or down (correct? or can the user not look up at all?), due to the use of a cylindrical environment map.
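As a concrete illustration of the cylindrical environment map, the warp below resamples a planar view with a chosen pan and tilt from a cylindrical panorama. This is my own rough sketch in Python with NumPy and OpenCV, not Apple's implementation, and all names are invented; QuickTime VR performs an optimized version of such a warp at interactive rates, and the cylindrical geometry is also why the viewer cannot tilt all the way to the poles.

import numpy as np
import cv2

def render_view(pano, pan, tilt, fov_deg, out_w=320, out_h=240):
    """Resample a planar view from a cylindrical panorama.

    pano      -- cylindrical panoramic image (columns span 360 degrees of yaw)
    pan, tilt -- viewing direction in radians (yaw, pitch)
    fov_deg   -- horizontal field of view of the virtual camera
    """
    ph, pw = pano.shape[:2]
    f = (out_w / 2) / np.tan(np.radians(fov_deg) / 2)   # virtual camera focal length
    f_cyl = pw / (2 * np.pi)                            # cylinder radius in pixels

    u, v = np.meshgrid(np.arange(out_w) - out_w / 2,
                       np.arange(out_h) - out_h / 2)
    # Ray through each output pixel, rotated by tilt (about x) then pan (about y).
    x, y, z = u, v, np.full_like(u, f)
    y, z = (y * np.cos(tilt) - z * np.sin(tilt),
            y * np.sin(tilt) + z * np.cos(tilt))
    x, z = (x * np.cos(pan) + z * np.sin(pan),
            -x * np.sin(pan) + z * np.cos(pan))
    # Intersect each ray with the cylinder and convert to panorama coordinates.
    theta = np.arctan2(x, z)                            # yaw of the ray
    height = y / np.sqrt(x**2 + z**2)                   # vertical position on cylinder
    map_x = ((theta / (2 * np.pi)) % 1.0) * pw
    map_y = ph / 2 + height * f_cyl
    return cv2.remap(pano, map_x.astype(np.float32), map_y.astype(np.float32),
                     cv2.INTER_LINEAR, borderMode=cv2.BORDER_WRAP)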

William Klippgen

Mosaic Based Representation of Video Sequences and Their Applications
---------------------------------------------------------------------
Irani, Anandan and Hsu

This paper presents a technique that exploits the static properties of what is depicted in a video stream. A given video stream is not only strongly correlated with its immediately preceding and succeeding frames, but very often across the entire stream. In other words, there is heavy redundancy in many video streams that is not removed properly by frame-to-frame difference representations such as the MPEG standard. By using all frames in a scene sequence, a MOSAIC IMAGE is constructed, giving a panoramic view of the given "scene-space." The original video stream can now be represented as frame movements inside this panoramic view combined with one residual per original frame, where the residual contains the difference between the original frame and the corresponding area of the panoramic view.

A static mosaic, or salient still, is constructed by aligning all frames to a fixed coordinate system. Different filters can be used to combine the frames into this mosaic. The residuals needed to get back to the original stream typically contain more information than in the dynamic case, but any frame can be recovered by adding its residual to the proper frame area of the static mosaic. The other kind of mosaic, the dynamic mosaic, represents the true dynamic behaviour as a sequence of mosaics; the way this mosaic is presented varies in how camera motion is mapped to viewing-frame movement. The original sequence is represented by the first dynamic mosaic image and subsequent residuals representing the change from each mosaic to the next. Compared to the static-mosaic residuals these contain less information, but to get to a particular frame all preceding residuals have to be added. The temporal pyramid is a construct of mosaics ranging from a single coarsest static mosaic of the whole sequence down to succeeding levels of shorter temporal integration and downsampling. When zooming occurs within the scene, a small area is represented at higher resolution, so the paper introduces a mosaic pyramid.

* The mapping from a video stream to the mosaic representation:
To do the image alignment, they make use of a so-called Laplacian pyramid. A function is defined for the squared difference between the new frame and the existing image, which can be the entire mosaic; by minimizing this function, the proper alignment of the new frame is found (a rough coarse-to-fine sketch of this step appears below). To integrate the new image with the current mosaic, several weightings are suggested; one approach is to give little weight to pixels far from the new image's center. Image enhancement can be obtained through a proper model of the image-capturing device, by comparing subsequent frames and producing in-between pixels. The density of the detector elements and their spatial response are most likely unevenly distributed in space, which enables resolution enhancement by comparing frames captured from positions corresponding to sub-pixel displacements in the projected image. This method also removes noise in the stream.

* Reconstruction of a given frame in the original video stream:
Any frame can be reconstructed by first applying the transformation parameters associated with that frame to find the corresponding area in the mosaic; this area and the associated residual are then added together. By using a weight factor, the importance of each residual can be represented and used to scale the residuals. I have some trouble understanding why this leads to a more efficient description of the stream!
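The coarse-to-fine alignment mentioned above can be sketched roughly as follows. This is my own simplification, assuming NumPy and OpenCV: it searches only for a translation over a Gaussian pyramid, whereas the paper estimates a full parametric motion by minimizing the squared difference.

import numpy as np
import cv2

def align_translation(ref, new, levels=4, search=2):
    """Coarse-to-fine estimate of the (dx, dy) shift aligning `new` to `ref`."""
    # Build Gaussian pyramids; index 0 is the finest level.
    pyr_ref, pyr_new = [ref.astype(np.float32)], [new.astype(np.float32)]
    for _ in range(levels - 1):
        pyr_ref.append(cv2.pyrDown(pyr_ref[-1]))
        pyr_new.append(cv2.pyrDown(pyr_new[-1]))

    dx = dy = 0
    for r, n in zip(reversed(pyr_ref), reversed(pyr_new)):
        dx, dy = dx * 2, dy * 2          # propagate the coarse estimate down a level
        best, best_err = (0, 0), np.inf
        h, w = r.shape[:2]
        for sy in range(-search, search + 1):
            for sx in range(-search, search + 1):
                M = np.float32([[1, 0, dx + sx], [0, 1, dy + sy]])
                shifted = cv2.warpAffine(n, M, (w, h))
                err = np.mean((shifted - r) ** 2)   # squared-difference criterion
                if err < best_err:
                    best, best_err = (sx, sy), err
        dx, dy = dx + best[0], dy + best[1]
    return dx, dy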
* Applications
- Low-bitrate transmission
- Storage compression
- Visualization (increases local context): large key frames, synopsis mosaics, mosaic video

* Some critical remarks
As a general representation of video content, a mosaic such as the synopsis mosaic can be very confusing if the line of events is not a simple sequence of objects moving in one direction. If several objects cross each other or make repeating movements within a single scene, the mosaic representation will fail to give a good description. However, the mosaic creates an excellent background upon which more advanced representations of action could be built. I also think the compression aspect makes a lot of sense: for video transmission of, e.g., newscasts, where the same background is reused every day, a mosaic video buffer in the receiver could be combined with a mosaic residual for the current transmission.

QuickTime VR - An Image-Based Approach to Virtual Environment Navigation
------------------------------------------------------------------------
Shenchang Eric Chen, Apple Computer, Inc.

The QuickTime VR technique enables the construction of a pseudo-virtual-reality environment from a number of 360-degree panoramic views and a set of object rotations represented by a picture series. A particular view of the available panorama is constructed depending on the shape the panorama is mapped onto; QuickTime VR uses a cylindrical map onto which the environment is projected. Panoramic cameras can record this cylindrical representation directly, or tools included in QuickTime VR can stitch together a series of overlapping planar photographs. Object rotation is represented by a series of images covering all allowable poses of the object in some angular step, typically 10 degrees. A QuickTime VR movie is a combination of panoramic views, rotatable objects, and spatial relations between the various panoramic views, which allows a viewer to move from one viewpoint to another. There is also a concept of hot spots: regions of the panorama that can trigger some action, e.g. moving to another viewpoint. The panoramic movie is stored as a QuickTime movie with three tracks. The first track contains the viewpoint nodes and their spatial relations to other viewpoints, expressed as links, along with links to the corresponding panoramic pictures and objects. The actual panoramic image data is stored on the second track as a series of small images, along with the images for rotating objects. The third track contains images of the hot spots, represented as a given color in 8-bit images, allowing 256 hot spots for any viewpoint. By introducing alternative images, both for the panoramic and the object representations, the time dimension can be included to some extent, but this consumes a lot of storage space. I think some form of residuals, as in the dynamic mosaic solution of the first paper, would prove interesting here; algorithms to "clean" the noise introduced by the moving frame border would have to be added. Apple provides both the player and an authoring environment for its QuickTime VR technology.
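Regarding the earlier question about how hot spots appear to the user: based on the description above, one plausible (and purely hypothetical) mechanism is to treat the 8-bit hot-spot image as an index map, as in the sketch below; the pixel values need not be visible, since the viewer only needs to know which region ID lies under the cursor. In practice the hot-spot image would have to be warped with the same pan/tilt transform as the panorama so that it stays registered with what the user sees.

import numpy as np

def hotspot_at(hotspot_map, x, y, actions):
    """Look up which hot spot (if any) lies under a screen position.

    hotspot_map -- 8-bit image, same geometry as the rendered view, where each
                   pixel value is a hot-spot ID (0 assumed here to mean no hot spot)
    x, y        -- cursor position in the rendered view
    actions     -- dict mapping hot-spot ID to an action, e.g. a link to another node
    """
    region_id = int(hotspot_map[int(y), int(x)])
    if region_id == 0:
        return None                      # cursor is not over any hot spot
    return actions.get(region_id)        # e.g. "jump to node 7"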

Lars Liden

"QuickTime VR - An Image-Based Approach to virtual Environment Navigation" Chen This paper examined mosaic representations of motion sequences. Two basic types of mosaics were discussed static mosaics and dynamic mosaics. Static mosaics are formed by combining a series of frames into one still image, dynamic mosaics produce a sequence of mosaics. Static mosaics have the disadvantage that moving objects blur or disappear, however they require less storage size. Dynamic mosaics have the disadvantage that individual frames cannot be randomly accessed. Both methods capture the information not captured by the mosaics themselves by recoding the residual information lost between frames. In general this information tends to be rather small but is better in the dynamic case. A hybrid version of the dynamic mosaic, called the temporal pyramid by the authors uses a hierarchy of temporal sampling. In other words multiple mosaics are created, each one using a different amount of spatial integration. The authors cite video compression as being the biggest application for this type of representation, however the idea of getting key frames of a video sequence seems much more interesting from our point of view. If one was to use an algorithm to generate scene cuts, followed by the creation of a mosaic for each scene, one could create a highly efficient method for human search of video database. In a sense one would be creating a "story board". "Mosaic Based Representations of Video Sequences" Irani, Anadan & Hsu This was a very interesting paper that dealt with the difficulties of using conventional rendering techniques to create virtual reality for the average user - namely that building and rendering 3D models is computationally very intensive, and machines with the power to do such rendering in real-time are not widely available. The approach discussed in the paper was an extension of Apples' QuickTime system which presents users with multimedia scene and allows them to choose branching points at selected intervals. QuickTime VR allows users to have more control as individual images are replaced by cylindrical images, of which only one portion is available to the viewer at a time. The user can then control the direction of viewing making the image orientation independent. The ability to zoom in on object is created by using a pyramidal representation whereby an image is represented at several resolutions which can be accessed by the user. This is an improvement over the method of simple magnification which does not actually produce more detail. One major limitation suggested by the authors is that the images themselves must be static. However its not clear to me why this is necessarily the case (at least for computer generated image data). Could not a panorama be created for each frame of motion and then the user control the viewing direction as each frame is incremented? It's not clear to me how this paper relates directly to image database search techniques. Standard image/video data wouldn't be applicable to this technique as it requires a camera setup capable of collecting a cylindrical set of images.

Gregory Ganarz

In "QuickTime VR", S. Chen presents a commercial product which produces virtual environments based on 360-degree cylindrical panoramic images. The technique is similar to branching movies, with the additional user options of changing view direction and zooming. While a large improvement, the quicktime technique still suffers from the restraint that views are confined to be made from particular points in space. Further, when zooming, true 3D effects do not occur, thus reducing the ability of the method to simulate "realistic" environments. Also, since the images are captured and not constructed, the environment is static: i.e. objects cannot be freely manipulated in the environment. Still, the computational savings are sufficient to make this a worthwhile technique. In "Mosaic Bases Representations of Video Sequences" M. Irani et al. argue that mosaics can serve as an efficient representation of video sequences. This seems likely only for images sequences which maintain similar backgrounds, and contain sufficient background overlap so the frame matching algorithm can function. Zooming might be difficult for the matching algorithm to deal with. One idea discussed in the paper which seems especially interesting was the concept of increasing frame resolution beyond that present in any single frame.


Stan Sclaroff
Created: Nov 20, 1995
Last Modified: