BU CLA CS 835: Seminar on Image and Video Computing

Class commentary on articles: Video Mosaics



Shrenik Daftary

Synopsis of "Quicktime VR" by Chen This paper presented a method to store images that will allow an user the ability to perform a virtual walkthrough from different viewing positions and orientations. The first method that is presented is the use of modeling systems that are rendered when a user accesses them. This is attacked because it is laborious, limits scene complexity, and requires special hardware. The next method is the creation of branching movies that allow only limited points where movements are allowed, unless a lot of storage space is used. The presented technique attempts to overcome the computational complexity, and storage requirements of the other techniques. The goals of the method that is presented in this paper were to allow playback on personal computers without hardware acceleration, recreation of both real and synthetic scenes, and high quality rendering independent of scene complexity. Camera rotation is described in terms of pitch, yaw, and roll. Methods are presented to reconstruct just pitch and yaw. Camera motion can involve a change in viewpoint (difficult to model) or direction (an environment map). Environment maps may be preserved in a cubic form (6 times the normal space). Camera zooming can be accomplished by tiling an image into multiple levels of the same size. By using different levels of resolution, infinite zoom can be simulated. Quicktime VR was tested in terms of presenting a panoramic movie, which are "multi-dimensional, event-driven spatially-oriented movies." Users are allowed to interact with the movie and perform pannings, zooms, and motion. Videos are stored with one panoramic track (pointers to hot points), and 2 video tracks. For slow media such as CDROMs, tracks should be interleaved. The system was tested with different PCs and shown that up to 29 updates of 1D Panning can be performed per second on high speed PowerPCs. The actual system to create the movies was presented next. Nodes are selected in a space, panoramas are created with computer rendering, hot spots are identified, panoramas are linked together, panoramic images are linked together. Finally the system's potential and actual applications were presented. The overall system seemed okay, although it did not allow camera roll (which could be modeled in software as an affine transformation). Its main problem though was its slow update speed for even powerful personal computers such as the pentium. The other goals though in terms of displaying real, and imaginary scenery with high quality complex images seemed feasible. "Mosaic Based Representations of Video Sequences and Their Applications" by Irani et al. A mosaic is a panoramic view taken from a video of a scene. Static mosaics refer to a single frame average of a sequence or a few frame average of the sequence. Dynamic mosaics refer to the reconstruction of a sequence of images by keeping a constant background. Unfortunately this technique does not allow immediate access to a particular frame, since the reconstruction is based on incremental frame reconstruction. Zooming can be performed by using a spatial pyramid to store the images. Image alignment is performed by first aligning successive images, then aligning images directly to the current composite mosaic images, and aligning the current mosaic image to the current image. Integration can be performed after the alignment. The image intensity is maintained within a reasonable level of change to allow smooth transitions. 
Mosaics are presented as enabling video compression, visualization, and video enhancement. The paper gives a visual comparison between MPEG and mosaic-based coding, and the reconstruction appeared much more effective for the mosaic representation. The results did not include much quantitative data, so it was difficult to judge the efficiency of the system. It would be nice to have a quantitative measure of how much storage is required for a given video sequence, and then to reconstruct some arbitrary frames from the sequence with the mosaic method and compare their quality against MPEG storage of the same sequence.
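To make the reconstruction and storage question above a little more concrete, here is a minimal sketch (not the authors' code; it assumes NumPy and OpenCV are available, and all names are invented) of how a frame could be recovered from a static mosaic, its stored projective transform, and its residual image. Comparing the coded size of the mosaic, the transforms, and the residuals against an MPEG encoding of the same sequence would give the kind of quantitative comparison asked for above.

import numpy as np
import cv2  # OpenCV; assumed available

def reconstruct_frame(mosaic, H_mosaic_to_frame, residual, frame_shape):
    """Approximately recover one original frame from a static mosaic.

    mosaic             -- panoramic image built from the whole sequence
    H_mosaic_to_frame  -- 3x3 projective transform mapping mosaic coords to frame coords
    residual           -- stored signed difference image for this frame
    frame_shape        -- (height, width) of the original frames
    """
    h, w = frame_shape
    # Resample the relevant part of the mosaic into the frame's coordinate system.
    predicted = cv2.warpPerspective(mosaic, H_mosaic_to_frame, (w, h))
    # Adding the residual back restores detail (e.g. moving objects) that was
    # averaged away when the mosaic was built.
    restored = predicted.astype(np.int16) + residual
    return np.clip(restored, 0, 255).astype(np.uint8)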

Paul Dell

M. Irani, P. Anandan, and S. Hsu, "Mosaic Based Representations of Video Sequences and Their Applications," in Proc. International Conference on Computer Vision 1995, pp. 605--611.

The paper examines two basic types of mosaics and some extensions of them. The two types are "static" and "dynamic" mosaics. A static mosaic is a construction of the common elements of a video sequence: if the scene contains no moving objects and the camera pans around, the mosaic consists of an image containing every point the camera captured, so one can construct a panoramic view. A dynamic mosaic updates the "background" mosaic with each new image frame, which allows moving objects to be displayed without the fading present in the static mosaic. Some applications of static and dynamic mosaics are video compression, low-bitrate transmission, "key frame" construction for later browsing, synopsis of motion, and video enhancement. One of the interesting applications was the use of mosaics as indexing and search tools: the operator could query a video sequence by finding certain static mosaics (i.e., backgrounds) and residuals of the mosaics (the changes).

S. E. Chen, "QuickTime VR -- An Image-Based Approach to Virtual Environment Navigation," in Proc. ACM SIGGRAPH 1995, pp. 29--38.

An image-based method for 3D scene synthesis and navigation is presented. The technique is based on 360-degree cylindrical panoramic images for the virtual environment and 180-degree by 360-degree object movies. Panoramic images spaced 5-10 feet apart (for indoor shots) are taken to form a grid of the environment; these images are then "linked" together manually by viewing direction to form the virtual environment. Another aspect of a VR scene is "hot spots," regions of the panoramic image reserved for user interaction. (I don't understand how these appear to the user. Are they just 8-bit images placed on the scene? Are they icons? How are they differentiated from the panoramic image?) This approach differs from the 3D modelling-and-rendering approach and the branching-movies approach. The 3D rendering approach consists of collections of 3D object models rendered in real time; the author mentions three drawbacks: 1) creating geometric objects is a "laborious manual process," 2) the rendering engine limits scene complexity, and 3) special-purpose rendering hardware is not widely available. The two major drawbacks of the branching-movies approach are that it limits navigation and requires a large amount of storage. A big limitation of the image-based approach is that scenes and objects need to be static; some movement is allowed around particular points where cyclical time-varying behavior has been captured and the image sequences have been looped. Another limitation is that the user cannot look straight up or down (correct? or can the user not look up at all?), due to the use of a cylindrical environment map.
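As a concrete illustration of the cylindrical environment map, the warp below resamples a planar view with a chosen pan and tilt from a cylindrical panorama. This is my own rough sketch in Python with NumPy and OpenCV, not Apple's implementation, and all names are invented; QuickTime VR performs an optimized version of such a warp at interactive rates, and the cylindrical geometry is also why the viewer cannot tilt all the way to the poles.

import numpy as np
import cv2

def render_view(pano, pan, tilt, fov_deg, out_w=320, out_h=240):
    """Resample a planar view from a cylindrical panorama.

    pano      -- cylindrical panoramic image (columns span 360 degrees of yaw)
    pan, tilt -- viewing direction in radians (yaw, pitch)
    fov_deg   -- horizontal field of view of the virtual camera
    """
    ph, pw = pano.shape[:2]
    f = (out_w / 2) / np.tan(np.radians(fov_deg) / 2)   # virtual camera focal length
    f_cyl = pw / (2 * np.pi)                            # cylinder radius in pixels

    u, v = np.meshgrid(np.arange(out_w) - out_w / 2,
                       np.arange(out_h) - out_h / 2)
    # Ray through each output pixel, rotated by tilt (about x) then pan (about y).
    x, y, z = u, v, np.full_like(u, f)
    y, z = (y * np.cos(tilt) - z * np.sin(tilt),
            y * np.sin(tilt) + z * np.cos(tilt))
    x, z = (x * np.cos(pan) + z * np.sin(pan),
            -x * np.sin(pan) + z * np.cos(pan))
    # Intersect each ray with the cylinder and convert to panorama coordinates.
    theta = np.arctan2(x, z)                            # yaw of the ray
    height = y / np.sqrt(x**2 + z**2)                   # vertical position on cylinder
    map_x = ((theta / (2 * np.pi)) % 1.0) * pw
    map_y = ph / 2 + height * f_cyl
    return cv2.remap(pano, map_x.astype(np.float32), map_y.astype(np.float32),
                     cv2.INTER_LINEAR, borderMode=cv2.BORDER_WRAP)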

William Klippgen

Mosaic Based Representation of Video Sequences and Their Applications
---------------------------------------------------------------------
Irani, Anandan and Hsu

This paper presents a technique that exploits the static properties of what is depicted in a video stream. A given video stream is not only strongly correlated with its immediately preceding and succeeding frames, but very often across the entire stream. In other words, there is heavy redundancy in many video streams that is not removed properly by frame-to-frame difference representations such as the MPEG standard. By using all frames in a scene sequence, a MOSAIC IMAGE is constructed, giving a panoramic view of the given "scene-space." The original video stream can now be represented as frame movements inside this panoramic view combined with one residual per original frame, where the residual contains the difference between the original frame and the corresponding area of the panoramic view.

A static mosaic, or salient still, is constructed by aligning all frames to a fixed coordinate system. Different filters can be used to combine the frames into this mosaic. The residuals needed to get back to the original stream typically contain more information than in the dynamic case, but any frame can be recovered by adding its residual to the proper frame area of the static mosaic. The other kind of mosaic, the dynamic mosaic, represents the true dynamic behaviour as a sequence of mosaics; the way this mosaic is presented varies in how camera motion is mapped to viewing-frame movement. The original sequence is represented by the first dynamic mosaic image and subsequent residuals representing the change from each mosaic to the next. Compared to the static-mosaic residuals these contain less information, but to get to a particular frame all preceding residuals have to be added. The temporal pyramid is a construct of mosaics ranging from a single coarsest static mosaic of the whole sequence down to succeeding levels of shorter temporal integration and downsampling. When zooming occurs within the scene, a small area is represented at higher resolution, so the paper introduces a mosaic pyramid.

* The mapping from a video stream to the mosaic representation:
To do the image alignment, they make use of a so-called Laplacian pyramid. A function is defined for the squared difference between the new frame and the existing image, which can be the entire mosaic; by minimizing this function, the proper alignment of the new frame is found (a rough coarse-to-fine sketch of this step appears below). To integrate the new image with the current mosaic, several weightings are suggested; one approach is to give little weight to pixels far from the new image's center. Image enhancement can be obtained through a proper model of the image-capturing device, by comparing subsequent frames and producing in-between pixels. The density of the detector elements and their spatial response are most likely unevenly distributed in space, which enables resolution enhancement by comparing frames captured from positions corresponding to sub-pixel displacements in the projected image. This method also removes noise in the stream.

* Reconstruction of a given frame in the original video stream:
Any frame can be reconstructed by first applying the transformation parameters associated with that frame to find the corresponding area in the mosaic; this area and the associated residual are then added together. By using a weight factor, the importance of each residual can be represented and used to scale the residuals. I have some trouble understanding why this leads to a more efficient description of the stream!
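The coarse-to-fine alignment mentioned above can be sketched roughly as follows. This is my own simplification, assuming NumPy and OpenCV: it searches only for a translation over a Gaussian pyramid, whereas the paper estimates a full parametric motion by minimizing the squared difference.

import numpy as np
import cv2

def align_translation(ref, new, levels=4, search=2):
    """Coarse-to-fine estimate of the (dx, dy) shift aligning `new` to `ref`."""
    # Build Gaussian pyramids; index 0 is the finest level.
    pyr_ref, pyr_new = [ref.astype(np.float32)], [new.astype(np.float32)]
    for _ in range(levels - 1):
        pyr_ref.append(cv2.pyrDown(pyr_ref[-1]))
        pyr_new.append(cv2.pyrDown(pyr_new[-1]))

    dx = dy = 0
    for r, n in zip(reversed(pyr_ref), reversed(pyr_new)):
        dx, dy = dx * 2, dy * 2          # propagate the coarse estimate down a level
        best, best_err = (0, 0), np.inf
        h, w = r.shape[:2]
        for sy in range(-search, search + 1):
            for sx in range(-search, search + 1):
                M = np.float32([[1, 0, dx + sx], [0, 1, dy + sy]])
                shifted = cv2.warpAffine(n, M, (w, h))
                err = np.mean((shifted - r) ** 2)   # squared-difference criterion
                if err < best_err:
                    best, best_err = (sx, sy), err
        dx, dy = dx + best[0], dy + best[1]
    return dx, dy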
* Applications
- Low-bitrate transmission
- Storage compression
- Visualization (increases local context): large key frames, synopsis mosaics, mosaic video

* Some critical remarks
As a general representation of video content, a mosaic such as the synopsis mosaic can be very confusing if the line of events is not a simple sequence of objects moving in one direction. If several objects cross each other or make repeating movements within a single scene, the mosaic representation will fail to give a good description. However, the mosaic creates an excellent background upon which more advanced representations of action could be built. I also think the compression aspect makes a lot of sense: for video transmission of, e.g., newscasts, where the same background is reused every day, a mosaic video buffer in the receiver could be combined with a mosaic residual for the current transmission.

QuickTime VR - An Image-Based Approach to Virtual Environment Navigation
------------------------------------------------------------------------
Shenchang Eric Chen, Apple Computer, Inc.

The QuickTime VR technique enables the construction of a pseudo-virtual-reality environment from a number of 360-degree panoramic views and a set of object rotations represented by a picture series. A particular view of the available panorama is constructed depending on the shape the panorama is mapped onto; QuickTime VR uses a cylindrical map onto which the environment is projected. Panoramic cameras can record this cylindrical representation directly, or tools included in QuickTime VR can stitch together a series of overlapping planar photographs. Object rotation is represented by a series of images covering all allowable poses of the object in some angular step, typically 10 degrees. A QuickTime VR movie is a combination of panoramic views, rotatable objects, and spatial relations between the various panoramic views, which allows a viewer to move from one viewpoint to another. There is also a concept of hot spots: regions of the panorama that can trigger some action, e.g. moving to another viewpoint. The panoramic movie is stored as a QuickTime movie with three tracks. The first track contains the viewpoint nodes and their spatial relations to other viewpoints, expressed as links, along with links to the corresponding panoramic pictures and objects. The actual panoramic image data is stored on the second track as a series of small images, along with the images for rotating objects. The third track contains images of the hot spots, represented as a given color in 8-bit images, allowing 256 hot spots for any viewpoint. By introducing alternative images, both for the panoramic and the object representations, the time dimension can be included to some extent, but this consumes a lot of storage space. I think some form of residuals, as in the dynamic mosaic solution of the first paper, would prove interesting here; algorithms to "clean" the noise introduced by the moving frame border would have to be added. Apple provides both the player and an authoring environment for its QuickTime VR technology.
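Regarding the earlier question about how hot spots appear to the user: based on the description above, one plausible (and purely hypothetical) mechanism is to treat the 8-bit hot-spot image as an index map, as in the sketch below; the pixel values need not be visible, since the viewer only needs to know which region ID lies under the cursor. In practice the hot-spot image would have to be warped with the same pan/tilt transform as the panorama so that it stays registered with what the user sees.

import numpy as np

def hotspot_at(hotspot_map, x, y, actions):
    """Look up which hot spot (if any) lies under a screen position.

    hotspot_map -- 8-bit image, same geometry as the rendered view, where each
                   pixel value is a hot-spot ID (0 assumed here to mean no hot spot)
    x, y        -- cursor position in the rendered view
    actions     -- dict mapping hot-spot ID to an action, e.g. a link to another node
    """
    region_id = int(hotspot_map[int(y), int(x)])
    if region_id == 0:
        return None                      # cursor is not over any hot spot
    return actions.get(region_id)        # e.g. "jump to node 7"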

Lars Liden

"QuickTime VR - An Image-Based Approach to virtual Environment Navigation" Chen This paper examined mosaic representations of motion sequences. Two basic types of mosaics were discussed static mosaics and dynamic mosaics. Static mosaics are formed by combining a series of frames into one still image, dynamic mosaics produce a sequence of mosaics. Static mosaics have the disadvantage that moving objects blur or disappear, however they require less storage size. Dynamic mosaics have the disadvantage that individual frames cannot be randomly accessed. Both methods capture the information not captured by the mosaics themselves by recoding the residual information lost between frames. In general this information tends to be rather small but is better in the dynamic case. A hybrid version of the dynamic mosaic, called the temporal pyramid by the authors uses a hierarchy of temporal sampling. In other words multiple mosaics are created, each one using a different amount of spatial integration. The authors cite video compression as being the biggest application for this type of representation, however the idea of getting key frames of a video sequence seems much more interesting from our point of view. If one was to use an algorithm to generate scene cuts, followed by the creation of a mosaic for each scene, one could create a highly efficient method for human search of video database. In a sense one would be creating a "story board". "Mosaic Based Representations of Video Sequences" Irani, Anadan & Hsu This was a very interesting paper that dealt with the difficulties of using conventional rendering techniques to create virtual reality for the average user - namely that building and rendering 3D models is computationally very intensive, and machines with the power to do such rendering in real-time are not widely available. The approach discussed in the paper was an extension of Apples' QuickTime system which presents users with multimedia scene and allows them to choose branching points at selected intervals. QuickTime VR allows users to have more control as individual images are replaced by cylindrical images, of which only one portion is available to the viewer at a time. The user can then control the direction of viewing making the image orientation independent. The ability to zoom in on object is created by using a pyramidal representation whereby an image is represented at several resolutions which can be accessed by the user. This is an improvement over the method of simple magnification which does not actually produce more detail. One major limitation suggested by the authors is that the images themselves must be static. However its not clear to me why this is necessarily the case (at least for computer generated image data). Could not a panorama be created for each frame of motion and then the user control the viewing direction as each frame is incremented? It's not clear to me how this paper relates directly to image database search techniques. Standard image/video data wouldn't be applicable to this technique as it requires a camera setup capable of collecting a cylindrical set of images.

Gregory Ganarz

In "QuickTime VR", S. Chen presents a commercial product which produces virtual environments based on 360-degree cylindrical panoramic images. The technique is similar to branching movies, with the additional user options of changing view direction and zooming. While a large improvement, the quicktime technique still suffers from the restraint that views are confined to be made from particular points in space. Further, when zooming, true 3D effects do not occur, thus reducing the ability of the method to simulate "realistic" environments. Also, since the images are captured and not constructed, the environment is static: i.e. objects cannot be freely manipulated in the environment. Still, the computational savings are sufficient to make this a worthwhile technique. In "Mosaic Bases Representations of Video Sequences" M. Irani et al. argue that mosaics can serve as an efficient representation of video sequences. This seems likely only for images sequences which maintain similar backgrounds, and contain sufficient background overlap so the frame matching algorithm can function. Zooming might be difficult for the matching algorithm to deal with. One idea discussed in the paper which seems especially interesting was the concept of increasing frame resolution beyond that present in any single frame.


Stan Sclaroff
Created: Nov 20, 1995
Last Modified: