Efficient User-Level Sandboxing Techniques for Extensible Services
Overview
This work is motivated by the desire to allow applications to customize and extend the system for their specific needs. The "User-Level Sandboxing" project is concerned with the development of safe and efficient methods for user-level extensibility of commercial off-the-shelf (COTS) systems that require only minimal changes to the kernel. As with micro-kernels, user-level services and extensions can be deployed without polluting the kernel address space with potentially unsafe application-specific code. User-level sandboxing provides a clean separation of core system functionality from higher-level abstractions. Likewise, user-level code can leverage libraries and system calls, can be rapidly prototyped without causing system failure, and can be written without knowledge of kernel internals.
Unfortunately, implementing service extensions at user-level incurs costs, due to communication across the kernel-user boundary, as well as scheduling and switching between the address spaces that isolate such extensions. To alleviate some of these costs, researchers have leveraged hardware support, such as segmentation and tagged translation lookaside buffers (TLBs). For example, segmentation on the Intel x86 processor enables the same page tables to be used for multiple logical protection domains, thereby avoiding expensive TLB flushes and reloads when switching between such domains. However, not all processors support segmentation or tagged TLBs, whereas many do support page-level protection, even in embedded systems. For example, the StrongARM SA-1110 and XScale processors, which are popular in handheld devices such as PDAs, have page-based memory management units.
Using Shared Virtual Memory Pages for Sandboxing
As part of an effort to provide a portable, safe and efficient method for user-level extensibility, we propose a sandboxing mechanism that relies only on page-based hardware protection. Logical protection domains are established within the page (or pages) of a sandbox using type-safe languages such as Cyclone (or even Java). Our approach involves modifying the page tables of all processes, when they are first created, to include a common set of virtual addresses mapped to the same physical memory (as shown in Figures 1 and 2).
Figure 1: Processes share a sandbox region that is made user-level accessible by kernel events.
Figure 2: A sandbox in each process has the same virtual-to-physical memory mapping.
This shared virtual address region defines an `upcall sandbox'. Applications can register handlers and extensions that are mapped into this sandbox, where they may be executed in the context of any process. This is possible since all processes will have page tables that can resolve virtual addresses of instructions and data in this memory area.
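The effect of a common mapping can be approximated entirely at user level, which may help make the idea concrete. The following is only an analogy under stated assumptions, not the project's kernel mechanism: it uses POSIX `mmap` with `MAP_SHARED` and `fork`, so that parent and child resolve the same virtual address to the same physical page. The names `sandbox_create` and `sandbox_shared_across_fork` are hypothetical, and the one-page `SANDBOX_SIZE` is chosen for brevity.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SANDBOX_SIZE 4096 /* one page here, purely for illustration */

/* Create a region that every process forked afterwards sees at the same
 * virtual address, backed by the same physical memory.
 * Returns NULL on failure. */
void *sandbox_create(void)
{
    void *p = mmap(NULL, SANDBOX_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Fork a child that writes msg through the sandbox address, then check
 * that the parent reads the same bytes at that address.
 * Returns 1 if the parent sees the child's write, 0 if not, -1 on error. */
int sandbox_shared_across_fork(void *sandbox, const char *msg)
{
    assert(sandbox != NULL);
    size_t len = strlen(msg) + 1;
    if (len > SANDBOX_SIZE)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: a different process, but the same virtual address. */
        memcpy(sandbox, msg, len);
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return memcmp(sandbox, msg, len) == 0;
}
```

Calling `sandbox_create()` once and then `sandbox_shared_across_fork(region, "hello")` should return 1. Unlike the sandbox described above, this region is shared only with descendants of the creating process and is always user-accessible.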
Under normal operation, the sandbox region is made inaccessible at user-level. This is to avoid arbitrary access to the sandbox region by application processes. However, when an event occurs in the kernel that requires the execution of sandbox code, the sandbox region is opened for user-level access. For example, on Linux x86 this requires toggling the user/supervisor flag in the current process's page directory or page table entry, and invalidating the corresponding TLB entries via the INVLPG instruction.
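The toggling step can be sketched as a user-level simulation. The flag positions below are those of 32-bit x86 paging entries (bit 0 present, bit 1 writable, bit 2 user/supervisor, bit 7 page size in a directory entry). Everything else is an assumption made for illustration: `sandbox_open`, `sandbox_close` and `user_can_access` are hypothetical names, the entry is an ordinary variable rather than a live page directory entry, and `invalidate_tlb_entry` is a counting stand-in for the privileged INVLPG instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Flag bits of a 32-bit x86 page directory or page table entry. */
#define PTE_PRESENT 0x001u
#define PTE_WRITE   0x002u
#define PTE_USER    0x004u /* set: user-level access; clear: supervisor only */
#define PDE_PSE     0x080u /* directory entry maps a 4MB superpage */

typedef uint32_t pte_t;

/* Counts simulated invalidations; a real kernel would execute INVLPG. */
unsigned tlb_invalidations = 0;

void invalidate_tlb_entry(uintptr_t vaddr)
{
    (void)vaddr; /* the simulation only records that a flush happened */
    tlb_invalidations++;
}

/* Open the sandbox for user-level access before an upcall. */
void sandbox_open(pte_t *entry, uintptr_t vaddr)
{
    assert(*entry & PTE_PRESENT);
    *entry |= PTE_USER;
    invalidate_tlb_entry(vaddr); /* drop the stale supervisor-only entry */
}

/* Close the sandbox again once the sandboxed code has finished. */
void sandbox_close(pte_t *entry, uintptr_t vaddr)
{
    assert(*entry & PTE_PRESENT);
    *entry &= ~PTE_USER;
    invalidate_tlb_entry(vaddr);
}

/* Non-zero if user-level code may touch the page mapped by this entry. */
int user_can_access(pte_t entry)
{
    return (entry & (PTE_PRESENT | PTE_USER)) == (PTE_PRESENT | PTE_USER);
}
```

The sketch models a single entry. On real hardware with 4KB pages, the user/supervisor flag must permit access at both the directory and the table level, whereas a superpage mapping needs only its directory entry changed.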
Activating Sandbox Code via Kernel Upcalls
Traditionally, signals and other such kernel event notification schemes have been used to invoke actions in user-level address spaces when there are specific kernel state changes. However, these schemes are inefficient, since they require the target user-level address space to be active before actions can be taken in response to specific events. With our approach, sandboxed functions can be executed at any time a kernel event occurs, without the need to schedule the execution of code in a specific address space. As a result, it would be useful to have an upcall mechanism that operated like the mirror image of a system call.
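One way to picture the "mirror image of a system call" is a dispatch table indexed by event number, much as a system call table is indexed by call number. The sketch below is a plain C illustration of that idea and makes no claim about the prototype's actual interface: `upcall_table`, `upcall_register`, `upcall_dispatch`, `count_event` and `MAX_EVENTS` are all hypothetical names, and the dispatch is an ordinary function call rather than a kernel-to-user transition.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EVENTS 8 /* arbitrary table size for the sketch */

/* A sandboxed handler receives the event number and a registered argument. */
typedef int (*upcall_handler)(int event, void *arg);

struct upcall_table {
    upcall_handler handler[MAX_EVENTS];
    void *arg[MAX_EVENTS];
};

/* Register a handler for an event. Returns 0 on success, -1 on bad input. */
int upcall_register(struct upcall_table *t, int event,
                    upcall_handler h, void *arg)
{
    assert(t != NULL);
    if (event < 0 || event >= MAX_EVENTS || h == NULL)
        return -1;
    t->handler[event] = h;
    t->arg[event] = arg;
    return 0;
}

/* Invoke the handler for an event, whichever process happens to be current.
 * Returns the handler's result, or -1 if no handler is registered. */
int upcall_dispatch(const struct upcall_table *t, int event)
{
    assert(t != NULL);
    if (event < 0 || event >= MAX_EVENTS || t->handler[event] == NULL)
        return -1;
    return t->handler[event](event, t->arg[event]);
}

/* Example handler: counts how many times its event has fired. */
int count_event(int event, void *arg)
{
    (void)event;
    int *counter = arg;
    return ++*counter;
}
```

In this sketch a return value of -1 is ambiguous between "no handler" and a handler that itself returns -1; a fuller design would separate the status from the handler's result.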
Unfortunately, many operating systems, such as Linux, that leverage hardware protection to separate user and kernel address spaces do not support conventional trap gates to user-level. General protection faults occur when attempting to trap to a `ring of protection' that is less privileged than that of the kernel. That said, architectures such as the Intel IA-32 support instructions such as SYSENTER and SYSEXIT that can be used in conjunction with Model Specific Registers (MSRs) to allow fast transitions between kernel and user-level address spaces. By using such instructions, we can implement fast upcalls to activate sandboxed code without the delays associated with scheduling and context switching. In the absence of these instructions, we can modify the kernel stack of the current process to give the impression that we are returning to a user-level function, as is typically done upon return from a system call.
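The fallback of rewriting the kernel stack can be illustrated with a simulated return frame. On 32-bit x86, a return to a less privileged ring pops EIP, CS, EFLAGS, ESP and SS in that order, so redirecting the saved EIP and ESP changes where user-level execution resumes. The sketch below manipulates an ordinary struct, not a real kernel stack; `return_frame`, `upcall_state`, `upcall_redirect` and `upcall_resume` are hypothetical names, and pointing the saved stack pointer at a stack inside the sandbox is an assumption of this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* The five words popped on an x86 return to a less privileged ring. */
struct return_frame {
    uint32_t eip;    /* user-level instruction pointer to resume at */
    uint32_t cs;     /* user code segment selector */
    uint32_t eflags;
    uint32_t esp;    /* user-level stack pointer */
    uint32_t ss;     /* user stack segment selector */
};

/* Remembers the interrupted context while an upcall is in progress. */
struct upcall_state {
    uint32_t saved_eip;
    uint32_t saved_esp;
    int pending;
};

/* Rewrite the frame so that the return lands in the sandboxed handler,
 * running on a stack inside the sandbox. */
void upcall_redirect(struct return_frame *f, struct upcall_state *s,
                     uint32_t handler, uint32_t sandbox_stack_top)
{
    assert(!s->pending); /* the sketch does not support nested upcalls */
    s->saved_eip = f->eip;
    s->saved_esp = f->esp;
    s->pending = 1;
    f->eip = handler;
    f->esp = sandbox_stack_top;
}

/* Restore the original return target once the handler has completed. */
void upcall_resume(struct return_frame *f, struct upcall_state *s)
{
    assert(s->pending);
    f->eip = s->saved_eip;
    f->esp = s->saved_esp;
    s->pending = 0;
}
```

The segment selectors and flags are left untouched, because the handler runs at the same user privilege level as the interrupted code; only the instruction and stack pointers change.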
To avoid the problem of triggering kernel events for upcall extensions when a user-level process is not running (e.g., when a kernel thread, having no user-level context outside the sandbox, is executing), all upcall extensions utilize a private stack in the upcall sandbox (see Figure 2). Moreover, to allow application processes to read from and write to the sandbox, pages of the sandbox can be mapped into process-private virtual address ranges. Type-safe languages such as those used in SafeX are still needed to ensure code executing inside a sandbox does not access addresses outside the sandbox since that could adversely affect processes or the system itself.
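Mapping sandbox pages into a process-private address range amounts to giving the same physical memory a second virtual address. A user-level analogy, assuming only POSIX `tmpfile`, `ftruncate` and `mmap`, is sketched below; `sandbox_map_twice` is a hypothetical name, and both views are left readable and writable here, whereas the real sandbox view would normally be closed to user-level access.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SANDBOX_SIZE 4096 /* one page here, purely for illustration */

/* Map the same backing page at two different virtual addresses.
 * *sandbox_view stands in for the shared sandbox address, and
 * *private_view for the process-private window onto the same memory.
 * Returns 0 on success, -1 on failure. */
int sandbox_map_twice(void **sandbox_view, void **private_view)
{
    assert(sandbox_view != NULL && private_view != NULL);

    FILE *backing = tmpfile(); /* anonymous temporary file as backing store */
    if (backing == NULL)
        return -1;
    int fd = fileno(backing);
    if (ftruncate(fd, SANDBOX_SIZE) != 0) {
        fclose(backing);
        return -1;
    }

    void *a = mmap(NULL, SANDBOX_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    void *b = mmap(NULL, SANDBOX_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    fclose(backing); /* established mappings outlive the descriptor */

    if (a == MAP_FAILED || b == MAP_FAILED) {
        if (a != MAP_FAILED)
            munmap(a, SANDBOX_SIZE);
        if (b != MAP_FAILED)
            munmap(b, SANDBOX_SIZE);
        return -1;
    }
    *sandbox_view = a;
    *private_view = b;
    return 0;
}
```

A byte written through `private_view` becomes visible through `sandbox_view`, which is the property an application would rely on to pass data to and from its extensions.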
Observations
A prototype sandboxing system has been developed for Linux x86 that implements sandboxes in 4MB superpages, thereby providing space for substantial service extensions. By using just one page for the sandbox, only one TLB flush/reload is necessary to toggle its user/supervisor protection flag. This compares favorably to the cost of switching between process-level address spaces on the x86, where the entire instruction and data TLBs must be flushed (as they are not tagged with address space identifiers). Initial tests indicate that inter-protection domain communication costs are close to those of hardware-based solutions leveraging segmentation. Likewise, a prototype upcall mechanism has been benchmarked at up to four times faster than signals.
- Richard West
- Gerald Fry
- Gabriel Parmer
- Others include Jason Gloudon, Xin Qi, Luis Hernandez, Sidelnik and Jason Gilanfarr
SafeX: Safe Kernel Extensions. This is a mechanism to support the compilation and dynamic-linking of application-specific `QoS safe' code into the kernel.