Jeremy Biddle
The ALIVE System: Full-body Interaction with Autonomous Agents

This paper describes a computer vision implementation of a wireless interactive system. One motivation behind this approach is the restrictive set of constraints imposed by wearing VR gear. The world used in the system contains inanimate objects as well as agents. Agents are autonomous entities that act on their own in response to their own sensory perception of the system. Interaction with the system (and the agents) is indirect, through gestures and signals. Agents appear semi-intelligent because of a set of internal needs and motivations, which are balanced in real time so as to provide feedback about their behavior. The feedback comes as both physical movement in the system and auditory response. The specification of an agent includes: sensors to judge the positions of objects; motivations, which indicate an agent's desires and needs; and activities, which tell the agent what to do under different circumstances. The activities are perhaps the most interesting aspect of the agent, as they start out at a very high level (identifying actions such as playing, feeding, etc.). These high-level activities are broken into lower-level child activities (for feeding, these would be chewing, looking for food, etc.). These, in turn, are broken down further until the lowest-level activities are reached, which specify particular motor-system actions. These different activities compete for control over the agent, so that only one activity is pursued at a time. As different parameters are modified under different conditions, the activity most important to the agent gains control and has the agent do something. In order for the system to pick out the salient details of the scene, several visual routines are used. The first, figure-ground processing, isolates the user of the system.
This is achieved with color and motion cues to determine the differences in the scene. After the user is identified, a bounding box is created around the user. Next, the user's Z dimension is determined by casting a ray from the camera to the user's feet. Since it is assumed that the bottom of the user is on the ground, and the ground is fixed and precomputed, the Z distance can be computed, and the user becomes a 2D plane in the 3D environment. After the body is located, the hands and feet are found using domain constraints. Finally, in order for the system to register an indirect command given by the user, it must recognize gestures. Some of the gestures recognized are pointing, waving, and kicking. In ALIVE I, the first incarnation of the ALIVE system, there were a Puppet agent and a Hamster agent. The Puppet simply tried to follow the user in 3D and hold the user's hand. The Hamster would avoid predators, eat food, and beg the user for food if there was none around. The Hamster was also a social little fellow, and enjoyed having its stomach scratched. The user could let out the predator, who would try to catch the Hamster but was afraid of the user. Evaluating the results revealed that the system is very user friendly, and also that users are more likely to be patient with agents than with inanimate objects.

Roberto Downs
The ALIVE System: Full-body Interaction with Autonomous Agents, by Maes, Darrell, Blumberg, and Pentland at the MIT Media Laboratory

This paper discusses the design and implementation of a system called the Artificial Life Interactive Video Environment (ALIVE), which allows wireless full-body interaction between a human participant and a rich graphical world inhabited by autonomous agents. The authors argue that such a system offers more complex and very different experiences than traditional virtual reality systems. Examples of uses of this extended model are in the areas of training and teaching, entertainment, and digital assistants or interface agents. Traditional VR interfaces require cumbersome equipment and allow only limited interaction, which has restricted the range of applications that could use this type of technology. In the model proposed in this paper, full-body interaction between a human participant and a richer graphical world inhabited by autonomous agents is accomplished through the use of a single video camera to obtain a color image of a person, which is then composited into a 3D graphical world. The resulting image is projected onto a large screen facing the user (referred to as a magic mirror), which shows the user embedded within the 3D world. Computer vision techniques are used to extract information about the human participant, such as 3D location, the positions of various body parts, and gestures. The 3D environment contains inanimate objects as well as agents. These agents are modeled as autonomous behaving entities which have their own sensors and goals and which can interpret the actions of the human participant and react to them in interactive time. Such a system offers a more powerful, indirect style of interaction in which gestures can have more complex meanings, which may vary according to the situations presented. The system resides at the MIT Media Laboratory, where it has undergone extensive testing by real users.
The results show that (1) the magic mirror approach has several advantages over head-mounted-display-based virtual reality systems, and (2) virtual worlds including autonomous agents can provide more complex and very different experiences than traditional virtual reality systems. The authors conclude that the ALIVE system significantly broadens the range of potential applications of VR systems; in particular, applications in the areas of training and teaching, entertainment, telecommunication, and interface agents. The modeling of autonomous agents is broken up into three components: (1) the sensors of the agent, (2) the motivations or internal needs of the agent, and (3) the activities and actions of the agent. Given this information, the system automatically infers which activities are most relevant to the agent at any particular moment, given the state of the agent, the situation it finds itself in, and its recent behavior history. By using a vision-based interface, the authors hoped to create an interface to a computer graphics world that is as non-intrusive as possible, while allowing a rich and intuitive set of gestures to be used in controlling and navigating the world. Ultimately, users of this system have found that they concentrate more on the environment itself than on the complex and unfamiliar equipment being used to interact with that environment. Visual search tasks, also referred to as active vision, are solved using the paradigm of goal-directed computation: specific search tasks are carried out depending on the state of the agents in the world, as opposed to computations performed uniformly across the image at each time step. Four basic types of visual routines have been developed for interacting with a VR world using the magic-mirror paradigm: (1) figure-ground processing, (2) body localization, (3) hand and feet localization, and (4) gesture spotting.
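The routines are summarized here only at a high level. As a rough sketch of the first two (my own illustration, not the authors' implementation), figure-ground separation and ground-plane depth recovery might look like the following, assuming a fixed camera, a precomputed background frame, and a known horizon row in the image:

```python
import numpy as np

def figure_ground(frame, background, threshold=30):
    """Isolate the user by differencing against a precomputed
    background frame (a stand-in for the paper's color/motion cues)
    and return a silhouette mask plus a bounding box."""
    diff = np.abs(frame.astype(int) - background.astype(int))
    mask = diff.max(axis=2) > threshold              # per-pixel change test
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return mask, None                            # no user in view
    bbox = (xs.min(), ys.min(), xs.max(), ys.max())  # box around silhouette
    return mask, bbox

def depth_from_feet(foot_row, cam_height=2.0, focal=500.0, horizon_row=120):
    """Estimate distance Z by intersecting the ray through the
    silhouette's lowest pixel with the fixed ground plane, using a
    pinhole model: Z = f * h / (v - v_horizon). Camera height, focal
    length, and horizon row are illustrative constants."""
    dv = foot_row - horizon_row
    if dv <= 0:
        return None                                  # feet above horizon: no hit
    return focal * cam_height / dv
```

With the user's Z fixed this way, the 2D silhouette can be placed as a flat "cardboard cutout" at the right depth in the 3D world, which matches the paper's assumption that the user stands on a precomputed ground plane.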
Implementation of the four routines relies on conventional image processing and computer vision techniques. By combining the behavior modeling and the vision techniques described in this paper, the authors have constructed a system for video-based interaction with artificial agents. Using the two together, ALIVE allowed the user to interact with both agents and inanimate objects. The ALIVE-II project expanded upon this interaction by giving agents a more sophisticated repertoire of behaviors than the previous agents, as well as simple auditory output consisting of prerecorded samples. This system seems well constructed, but it seems awkward to constantly refer to a huge screen in the area. In order to equal other interactive methods for VR, this system will have to somehow be cut down to a smaller size which does not compromise its power. Issues encountered by real users included the inability of the system to recognize hand gestures in front of the body (due to the silhouette imaging). A more dynamic system would perhaps move around so as to get a better sense of the subject, or better yet offer multiple references to the subject (multiple cameras). Both of these would be more of an engineering issue utilizing the imaging technology.

---

"Rich Interaction in the Digital Library" by Rao, Pedersen, Hearst, Mackinlay, Card, Masinter, Halvorsen, and Robertson

The authors discuss the development of techniques which support various aspects of the process of user/information interaction, in order to increase the "bandwidth and quality" of the interactions between users and information in an information workspace (an environment designed to support information work). Current access tools and applications could be considered information workspaces, but they limit the effectiveness of information access and the larger process of information work.
Conventional retrieval interfaces are based on the view of information retrieval as an isolated task in which the user formulates a query against a homogeneous collection to obtain matching documents. The authors argue that this view "misses the reality of users doing real work". A strong system should then offer iterative query refinement; source heterogeneity; parallel, interleaved access; and support for a larger work process. The authors address each of these four through presentation of examples from their own work that lead towards information workspaces supporting rich interaction. Exploration of these examples leads to some categories of meta-information which should be addressed: (1) content, (2) provenance, (3) form, (4) functionality, and (5) usage statistics. Emphasis is placed on visualizations which map sources into spatial and graphical elements based on meta-information, allowing interactions that let users select sources as well as build a spatial memory of sources. The rendering of retrieved data can be explored in much the same way. Access management of the data extends into analysis of time and cost, asynchronous performance, and status feedback to the client. The latter increases the user's ability to formulate and execute multiple operations in parallel and to manage search strategies more effectively (refer to the GAIA protocol). While this paper covers existing applications which contain commendable features, the authors might have considered the design of a system which incorporated all of these features. While remaining in the realm of theory, this proposed system (any system at all) would still add more content to this paper than a synopsis of existing systems and their limitations.

Bob Gaimari
"The ALIVE System: Full-body Interaction with Autonomous Agents" This paper describes a virtual environment system which is implemented in a unique way: instead of messing around with goggles, gloves, helmets, etc., the user sees him/herself in a mirror. The user is shown in the virtual world, interacting with the objects and agents there as if look into what they call the "magic mirror". A large screen in front of the user displays the image, and the users position and actions are recorded using a video camera mounted above the screen. Section 2 discusses how agents are modeled in this environment. These agents are semi-intelligent, and have the following things. a set of goals, motivations, and needs: The agents will want certain things, such as food or attention. a set of activities which they can perform: They have a set of activities which will allow them to meet these needs, such as walking around searching for food, or hopping up and down for attention. These activities are hierarchically designed, with a high-level activity made up of several low-level activities. a set of visual sensors: These are used to watch the user for gestures and queues, or to look around in the environment. Rays are shot out of the sensors, and the first thing they hit is recorded as seen. a behavior system: This, given the above sets, will determine what the agent will do in the next time step. There is a certain persistence, so that the agent won't flip/flop between actions, but it also won't keep doing them forever. Section 3 discusses the interface between the user and the environment. It is based upon vision, and the user can perform simple gestures to make things happen in the virtual world. The user can point at things, wave, pick up or manipulate inanimate objects, touch agents, etc. 
The camera which is pointed at the user records his/her actions, and the system interprets them by: 1) isolating the user from the background; 2) localizing the position of the body; 3) localizing the positions of the hands and feet; and 4) spotting gestures by the positions and orientations of the hands and feet. Gestures are temporal as well as positional; waving, for example, is a gesture which can only be discerned as a side-to-side motion over time. Also, the user may not want something interpreted as a gesture if he/she was simply moving his/her hand from one place to another. Section 4 discusses some environments they have used with the system, such as a puppet world, a hamster world, and a virtual dog. Each of them has specific gestures and interpretations of those gestures by the agents. The agents also have different motivations and abilities. Section 5 compares the system with other systems, and Section 6 evaluates the results. I think that the strongest success is the naturalness of the "magic mirror" interface. This is something people are already familiar with, and so will have very little trouble getting used to. Users don't have to wear special goggles and gloves, and they avoid the possibility of disorientation and tripping. Also, simple, everyday gestures can be used. Finally, Section 7 discusses possible applications, in such areas as entertainment, training, and interfacing to digital assistants.

"Rich Interaction in the Digital Library"

This paper discusses the need for better interaction between users searching for information and on-line information sources, such as databases, file servers, and digital libraries. Using current information retrieval tools, users have to overcome a number of barriers to accessing information, such as unfamiliarity with the different interfaces and functionalities of sources, lack of ability to interleave operations, or lack of smooth integration with the overall work process.
It then goes on to discuss various methods of overcoming these barriers using better interface tools. The first tool discussed is the "Scatter/Gather" paradigm of browsing. Here, a user will select a topic area to search from a list. The possible documents in this topic area are all "scattered", forming a small number of clusters of similar document areas, each with a list of key words and a sample document title. The user can select one or more of these clusters, and they will be "gathered" together to form a subcollection of the larger group, with its own list of topic areas, and the process can continue until the user finds a set of documents to directly access. Another tool is a "Snippet Search", which, given a keyword, will return "snippets" of context showing where these keywords occur. This can give the user a better idea of what other words may be useful to search for. Combining this with "TileBars", the user can see how often these key phrases are used in a document, whether they occur together often, etc. These show graphically where the words occur by filling in black spaces in a white rectangle, according to frequency of use. Next, they discuss how visualization tends to be a strong aid in gathering and sorting through information. They show what I think is the best method of visualizing data I have seen yet, the Butterfly, which combines search and browsing capabilities. This is used to search for articles through bibliographic sources. The display has the currently active article in the center (the head). The left wing contains a list of the articles which the paper references, and the right wing contains a list of articles which reference the current one. So the user can browse through a space of interrelated articles, and select particular ones for saving. References to the 3-D display systems from the previous paper are also mentioned toward the end, as examples of other methods of visualizing and browsing data.

Daniel Gentle
John Isidoro
The paper we had to read, "The ALIVE System", was very interesting. It described a system where a user, using a "magic mirror" interface, could interact with virtual agents as if actually touching them. The thing that struck me the most about this paper is how well the magic mirror system worked. I think if I were to use the system, I'd keep looking away from the dog's image in the magic mirror (supposing I was playing with the virtual dog) and expect to find the dog in the real world. I guess it just takes a little getting used to. :^) As for real-world applications, one thing I thought of is using this system as an interactive assistant to animators doing rotoscoping of 3D or cartoon models. The model or cartoon could be made to mimic the real-life person, and the user could get feedback on how his movements look when an animated actor is attempting to do them. The user could then adjust his movements according to how he wants the actor to act. I wonder if they did something like this for "Toy Story" or some other computer-rendered movies?

Dave Martin
The ALIVE System: Full-body Interaction with Autonomous Agents, by Pattie Maes, Trevor Darrell, Bruce Blumberg, Alex Pentland

The idea of ALIVE is so simple it is truly inspired: instead of strapping expensive and entangling sensing gear onto a VR world participant, use a single well-placed camera and image understanding techniques to locate the participant in 3-space; and instead of depositing a bulky display device on the participant's head, just put a really big screen on the wall. This puts almost all of the system implementation in software, giving the developers enormous flexibility to change their system without retooling. The participants have complete freedom of motion, and since their natural view isn't obstructed, any lack of clarity and realism in the system will more likely annoy than nauseate them. This system would excel in VR applications where only rough user controls are required. For instance, one could imagine a workstation with a camera pointed at the user's face for imaging, and another camera having a profile view of the user for gesture interpretation. The workstation screen would depict the user in an information space; just by rolling the hands, pointing, grabbing, and so on, the user would be able to very quickly change viewpoints and activate encountered objects. The authors describe a system that works with a single camera, sensing the user and rendering the synthesized agents at about 10 frames per second. This fine performance is due in part to what they call "active vision", wherein certain analyses are performed only under appropriate circumstances. For instance, it is not important to determine the precise location of a hand if it is not close to anything in the VR world. Their system features several critters with primitive but cute behavior. Much of the paper describes the implementation of these agents' behaviors.
While this helps the reader envision the running system, I did not find that part of the implementation particularly compelling. Nor do I see any reason to deify the agents by capitalizing them ("Puppet", "Hamster", "Dog"), but I guess that's just my pet peeve. The ALIVE system internally reduces the input image to silhouettes in order to extract gestures, so it loses much potentially valuable information. As mentioned in the paper, hands cannot be located if they are held close to the body. The authors suggest looking for particular flesh colors or using a second camera to resolve this problem. A second camera would also make it easier in principle for the system to track more than one participant (no extra equipment required for more participants!), but I'm sure I don't fully appreciate the complexity of the multi-camera, multi-participant problem. Most people point their heads pretty much directly at what they want to see, so I wonder if users of the ALIVE system suffer from some kind of stiff-neck-rigid-torso fatigue. But even this sounds preferable to having to wear sensing gear. All told, the system sounds very cool, and I want one.

Rich Interaction in the Digital Library, by Rao, Pedersen, Hearst, Mackinlay, Card, Masinter, Halvorsen, Robertson

This article describes the authors' approaches towards improving interaction with large on-line databases. They list four basic requirements and describe techniques for addressing them. Many of these techniques have been implemented by the authors. The requirements are (1) iterative query refinement, (2) source heterogeneity, (3) interleaved access, and (4) a larger work process. The idea behind iterative query refinement is that most users of large databases do not know precisely what they seek in terms of the database content and organization. One of the main roles of a reference librarian is to create the dialog that leads to query refinement, but in a completely on-line system, there is typically no such presence.
Therefore, it is important that the system make it very easy for the user to ask broad questions and subsequently refine the questions based on results. The authors describe a technique called scatter/gather to facilitate this need. In this technique, the system dynamically creates a sort of hierarchical view of the data. By navigating this view, the user implicitly refines the query. Beyond that, I did not really understand how the technique works: exactly what is scattered? The authors' identification of scattering with clustering is confusing. How much up-front effort is required to adapt a database for this system? It is hard to imagine that the system would automatically know to put "text print format menu page word font image mac size" in one category and "agency office government department contract center" in another. Finally, how quickly does the system run? Are the views built in advance, or are they really computed at the time of use? Are they computed at the client or server end? TileBar lets the user specify how often certain terms must appear in a document for the search to find it. In the result view, a little graphic shows the distribution of the terms in the result documents. These strike me as novel and useful techniques for whittling down the query space. The authors note that professional database searchers spend much of their time sharing information about sources, while the average user does not have this kind of knowledge. The GAIA protocol was built to provide an interface for uniform access to a wide variety of heterogeneous sources. GAIA also sends time estimates and status messages to the client while searches are in progress, which is an important means of enabling interleaved access. However, the authors do not describe GAIA in any detail; instead, they concentrate on a GAIA client, the butterfly viewer. The butterfly viewer displays matching articles on the left and articles that cite the matched articles on the right.
It can also show a 3D plot of the citing-cited relation between articles over time, making it very easy to identify fringe versus "hot spot" papers. The viewer allows the user to create "piles" of references at will and automatically generates references to related works in an unspecified manner. It seems like a very helpful searching tool. The authors also describe a tool called Protofoil for navigating image and OCR versions of scanned documents. Protofoil supports multiple search methods and result visualizations, including a TileBar display. It is particularly successful in its use of thumbnail images, which greatly speed searching of familiar documents. I have also found that I remember much more than text when reading: the page layout, paragraph shapes, shapes of displayed equations, etc. are very important in helping me locate a passage in a technical text. Finally, the authors refer to a variety of data visualization techniques, many of which we have seen: the perspective wall, cone tree, and others. This is a fine article. It describes both abstract goals and the state of the art with references, so that an interested reader can follow up if desired.

John Petry
"THE ALIVE SYSTEM: FULL-BODY INTERACTION WITH AUTONOMOUS AGENTS," by Maes, Darrell, Blumberg and Pentland. The ALIVE System is a combination of a full-body user interface and a variety of autonomous agents which interact with an image of the user. I'd like to consider three aspects of this paper independently: the autonomous agents; the full-body interface; the interaction of the agents with the interface. 1) Autonomous agents These are moderately complex, but in themselves are not particularly original. The most involved one seems to be a virtual dog which has an action-generating algorithm which consists of a tree several layers deep, with the upper levels denoting high-level actions (e.g., eat, move) and the leaves specific sub-tasks (open mouth, chew, swallow). This is fairly tangential to the core of the paper, which is the interface. As such, the agents simply give some purpose to the interface. They could just as easily be replaced with file system icons, video game controls, a virtual jukebox (pick your music interactively): the possibilities are wide-ranging. 2) The full-body interface This is the most superficially appealing aspect of the work, but to be honest, I found it to be just that: superficial. The vision portion is difficult, but the results don't justify the effort. A vision system that could read American Sign Language would be excellent, if incredibly hard; a vision system that could interpret facial motions of quadriplegics would be very useful; this is much cruder than either, and of much less use, since anyone who can make these types of gestures can also control a mouse or joystick, or type. 
I'm reminded that a person moving in front of a giant screen while moving her arms to interact with a program is much like the same person sitting in front of a monitor moving a mouse, except that the former uses 16' x 16' of floorspace, a wall-sized monitor, and an incredibly imprecise mechanism for specifying actions (compared to the fine control offered by a mouse), and needs the person to be standing the whole time if she wants access to the full range of controls. Oh, and it is much more expensive. While gloves and headsets have significant limitations, I think this system is at least as limited in its own way. I strongly suspect a combination of the two is the way to proceed. Perhaps wearing gloves or headwear that were visually distinctive, without requiring wiring or an internal power source, might work. For instance, a pair of IR-sensitive gloves and a headband could vastly simplify the vision task without causing any appreciable difficulties for the user, as well as being quite inexpensive.

3) The interaction between the autonomous agents and the full-body interface. This is somewhat interesting, but to some extent I'm not sure if it is meant to be the focus of the paper, or if the interface is the key part. To the extent that it is the main topic, I don't really understand what it is meant to show.

Overall, the application strikes me as very cute. It's quite easy for inexperienced users to learn how to operate, which certainly deserves credit, and it is a novel implementation that will no doubt generate considerable ideas as a result. But overall, I don't think there's a lot of depth to this paper.

Robert Pitts
The ALIVE System: Full-body Interaction with Autonomous Agents
by Maes et al.
==============

This paper describes a virtual reality system that uses a different interface than traditional systems and that has a richer virtual environment. The authors begin by contrasting their system to the traditional virtual reality approach. Their system differs in that:

o There are no devices attached to the user. The system uses computer vision techniques to determine the position and orientation of a user.

o The virtual environment contains autonomous agents that can interact with the user, in addition to static objects.

The model used for the autonomous agents in the virtual world has several realistic properties that help to produce believability; they are:

o Sensors for detecting aspects of the environment. The basic method uses a "what is in my line of sight" approach; however, the agents are privy to more than just visual properties of the objects in their sight (since all objects are modeled in the virtual system, including the user).

o Motivations and goals. These give the agents a reason to do something other than just sitting there.

o Activities/actions that the agent can perform. This is a pool of behaviors that an agent knows how to perform. They are organized in a hierarchical manner, with more specific behaviors at the leaves.

o A motor system that is "told to do something" based on what the current behavior is. The model of the motor system uses both physical and kinematic modeling.

o A control system that, in real time, takes sensor information, determines what the current goal should be, and activates activities to satisfy that goal.

I believe this model is a good tradeoff between the believability and complexity of the agents. Humans often attribute deeper cognition to behavior that can be generated by simple motivations interacting with an environment. Thus, agents behave as expected, but with less computation.
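The "line of sight" sensing above can be illustrated with a small ray-cast sketch (the 2D geometry and the object representation are my own assumptions; the paper does not give the actual routine). Each sensor ray returns the nearest modeled object it intersects, so the agent perceives whole objects with their properties rather than pixels:

```python
import math

def first_hit(origin, direction, objects):
    """Cast a sensor ray and return the name of the nearest modeled
    object it hits, or None. Because every entity (including the user)
    is modeled in the virtual world, the agent 'sees' the object itself.
    objects: list of (name, (cx, cy), radius) circles in a 2D plane."""
    ox, oy = origin
    dx, dy = direction
    norm = math.hypot(dx, dy)
    dx, dy = dx / norm, dy / norm                    # normalize ray direction
    nearest, nearest_t = None, float("inf")
    for name, (cx, cy), r in objects:
        # Solve |o + t*d - c|^2 = r^2 for the smallest t >= 0.
        fx, fy = ox - cx, oy - cy
        b = fx * dx + fy * dy
        disc = b * b - (fx * fx + fy * fy - r * r)
        if disc < 0:
            continue                                 # ray misses this object
        t = -b - math.sqrt(disc)
        if 0 <= t < nearest_t:                       # keep the closest hit
            nearest, nearest_t = name, t
    return nearest
```

Keeping only the closest intersection gives occlusion for free: an object hidden behind another is simply never reported, which matches the "first thing the ray hits is recorded as seen" behavior described in the reviews.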
In the third section, the authors describe the human interface to the virtual world. An advantage of their computer vision approach is that users are not tethered to unnatural devices. Because the system models users in the virtual system, they describe how, using computer vision techniques, the system can efficiently produce an internal model of the user. Their efficient techniques make assumptions about the positions of certain body parts, etc., based on other information. It would have been nice for them to mention limitations of this technique in this section. For example, do certain natural movements present difficulties to the system? Are there limitations on the speed of users' movements for reliable tracking? Lastly, they describe "gestures" that are supported by this system and used to control the behavior of agents. It is not clear in this section whether users experience any of the usual problems faced in a "mirror" interface. For example, humans have trouble coordinating certain tasks when faced with a mirror image of themselves. This issue is somewhat covered later. Next, some of the virtual environments and the agents found in them are described, and the ALIVE system is contrasted with previous work. The primary improvements are: the use of vision techniques, the use of 3D user positions, more complicated gestures, and more sophisticated agents. There are certain limitations to the ALIVE system. Nonetheless, an interesting observation is that users are more tolerant of mistakes by the agents than by inanimate objects. For example, an agent may have missed a gesture. This is a clever way to use human expectations to mask technological limitations. The authors do a decent job of addressing some of the limitations of the "mirror" metaphor, though they do not adequately address the question of whether users have trouble coordinating because of the mirrored image of themselves.
Finally, their discussion gives a good description of possible uses of the magic mirror technology. The authors concentrated on future applications. Although they briefly describe possible improvements to the underlying implementation (e.g., improved hand detection), I would have liked to hear more about these issues.

Rich Interaction in the Digital Library
by Rao et al.
=============

This article addresses the issue of constructing "information workspaces," allowing users to perform queries on heterogeneous sets of information and form a unified or multi-aspect view of that information. There are a few typical characteristics of queries performed by humans (when not limited by computer interface design or resources). Humans typically collect information from "multiple sources" and perform "parallel/interleaved retrieval" and "non-sequential processing." The authors present some schemes for improved querying of homogeneous sources. Techniques include iterative refinement of searches and geometric representations of document contents (both based on keywords). The preceding examples describe techniques for improving queries of homogeneous sources. To handle heterogeneous sources, a set of meta-information, information about the sources themselves, has to be constructed. The important pieces of meta-information identified are: information about the source, its content and structure, and how to access a service, as well as who is using an information service. The GAIA protocol, a protocol for querying multiple sources, is presented. It supports access to the pieces of meta-information that were identified as important, as well as the characteristics of human query sessions mentioned above. It is not evident when meta-information must be programmed into GAIA and when it can obtain this information automatically. However, the authors do mention that the system supports a certain set of sources, suggesting that it has been programmed to do so.
The importance of real-time constraints in a parallel, asynchronous query system is mentioned. A user (or the system) needs an estimate of how long a query will take before deciding to initiate that query. It is mentioned that these times can be estimated by GAIA, but it is not mentioned how this is done. Furthermore, the authors later state that "Query results typically have an unpredictable size and require an unpredictable amount of time to enumerate." Finally, the authors describe how they organized multiple views for a particular search operation. Even though they identify categories of different views, I did not find their example to be general enough. Their description covers only a single problem and doesn't really contain any theory about the task in general. In summary, this article provides a couple of good strategies for keyword-based search and does a minimal job of identifying a framework for dealing with multiple sources. It lacks convincing examples of how these principles can be applied. Lastly, the authors do not speculate as to what path this work should take in the future.
Stan Sclaroff
Created: Mar 13, 1996
Last Modified: Apr 3, 1996