Lecture 20                                                11/17/1994

Latency Hiding Techniques
-------------------------

In distributed shared memory machines, access to remote memory is
likely to be slow compared to the ever-increasing speeds of
processors. Thus, any scalable architecture must rely on techniques
to reduce, hide, or tolerate remote-memory-access latencies.
Generally speaking, four methods could be adopted:

1. Use of prefetching techniques
2. Use of coherent caching techniques
3. Relaxing the memory consistency requirements
4. Using multiple contexts to hide latency

Prefetching Techniques
----------------------

Prefetching is either software-controlled or hardware-controlled. In
software-controlled prefetching, explicit "prefetch" instructions
are issued for data that is "known" to be remote (a code sketch
appears below, after the discussion of consistency models).
Hardware-controlled prefetching is done through the use of long
cache lines, to capitalize on spatial locality, or through
instruction lookahead. Long cache lines introduce the problem of
"false sharing", whereas instruction lookahead is limited by
branches and/or branch-prediction accuracy (on average there is a
branch every 4 instructions).

When prefetching is used, issues of coherence must be addressed. For
example, what should be done if a block is updated after it has been
prefetched, but before it has been used? Generally speaking, one of
two policies is adopted. Under the "binding prefetch" policy, the
fetch is assumed to have happened when the prefetch instruction is
issued; in other words, it is the responsibility of the fetching
process to ensure that no other processor updates the prefetched
value before it is actually used, which may require the use of
locks. Under the "non-binding prefetch" policy, the system's cache
coherence protocol invalidates a prefetched value if it is updated
prior to its use. It has been observed that prefetching on the
Stanford DASH multiprocessor may cut the latency of reads, writes,
and synchronization by almost one half.

Coherent Caches
---------------

We have studied examples of maintaining coherent caches using snoopy
caches for bus-based systems and using directory-based caches for
general distributed memory machines. As we have seen, a major
problem with the design of coherent caches is scalability.

Relaxed Memory Consistency Models
---------------------------------

We have studied two memory models: the Sequential Consistency (SC)
model and the Weak Consistency (WC) model. Two other relaxed
consistency models are processor consistency and release
consistency.

Under the sequential consistency model, there is a partial ordering
of all reads and writes by all processors that obeys each
processor's program order. Thus, the result of any execution appears
as some interleaving of the operations of the individual processors
(as if executed on some multithreaded sequential machine). Many
references to sequential consistency attribute stronger conditions
to it, requiring a total ordering of all reads and writes (a
property better named "dynamic atomicity").

Under the weak consistency model, synchronization operations are not
allowed to "perform" until all preceding load/store operations have
completed. Similarly, no load/store operation is allowed to
"perform" until all preceding synchronization operations have
completed. In addition, (only) the synchronization operations are
sequentially consistent.
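Before turning to litmus tests for these models, here is the code
sketch of software-controlled prefetching promised earlier. It is a
minimal sketch only: it uses the GCC/Clang intrinsic
__builtin_prefetch as a modern stand-in for the explicit prefetch
instructions discussed above, and the lookahead distance of 8
elements is an illustrative assumption, not a tuned value.

    /* Software-controlled prefetching: request the block a fixed
     * distance ahead of the current iteration so that it arrives
     * (from remote memory) before it is used.  In the lecture's
     * terms this is a non-binding prefetch: the coherence protocol
     * may still invalidate the line before the actual load.       */
    double sum(const double *a, long n)
    {
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            /* hint: fetch a[i+8] for reading (rw = 0) with low
             * temporal locality (1); prefetching slightly past the
             * end of the array is a harmless hint on typical
             * hardware                                            */
            __builtin_prefetch(&a[i + 8], 0, 1);
            s += a[i];
        }
        return s;
    }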
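Turning to the models themselves: the Dekker-style litmus test below
is the standard way to exercise sequential consistency. The sketch
uses C11 atomics and POSIX threads (both of which postdate these
notes) purely to pin the model down; under SC the outcome
r1 = r2 = 0 is impossible, since some interleaving must place one of
the writes before the other processor's read.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int x, y;      /* shared flags, initially 0 */
    int r1, r2;           /* per-thread results        */

    void *p1(void *arg)
    {
        atomic_store_explicit(&x, 1, memory_order_seq_cst);
        r1 = atomic_load_explicit(&y, memory_order_seq_cst);
        return NULL;
    }

    void *p2(void *arg)
    {
        atomic_store_explicit(&y, 1, memory_order_seq_cst);
        r2 = atomic_load_explicit(&x, memory_order_seq_cst);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, p1, NULL);
        pthread_create(&b, NULL, p2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("r1=%d r2=%d\n", r1, r2);  /* never 0,0 under SC */
        return 0;
    }

Under processor consistency (described below), a read may bypass the
write issued just before it, so r1 = r2 = 0 becomes a legal outcome.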
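The weak consistency rules can be expressed the same way. A
synchronization operation behaves as a two-way barrier: it waits for
all earlier loads/stores, and all later loads/stores wait for it.
The sketch below uses C11 atomic_thread_fence as a stand-in for the
lecture's synchronization operations; sync_flag, writer, and reader
are illustrative names.

    #include <stdatomic.h>

    int a, b;               /* ordinary shared data     */
    atomic_int sync_flag;   /* synchronization variable */

    void writer(void)
    {
        a = 1;
        b = 2;
        /* WC: the synchronization operation may not perform until
         * the stores to a and b above have completed              */
        atomic_thread_fence(memory_order_seq_cst);
        atomic_store_explicit(&sync_flag, 1, memory_order_relaxed);
    }

    void reader(void)
    {
        while (!atomic_load_explicit(&sync_flag, memory_order_relaxed))
            ;  /* spin until the writer has synchronized */
        /* WC: later loads/stores may not perform until the
         * synchronization operation has completed                 */
        atomic_thread_fence(memory_order_seq_cst);
        /* a == 1 and b == 2 are guaranteed to be visible here */
    }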
Under processor consistency, writes issued by the same processor are
never seen out of order by any other processor in the system, but
writes by different processors may be observed in different orders
by different processors. A read operation following a write
operation may bypass it (i.e., be observed by other processors as
happening before the write).

Under release consistency, synchronization operations are classified
into two categories: acquire (e.g., lock) and release (e.g.,
unlock). Load/store operations are not allowed to perform until all
previous acquire operations have performed, and a release operation
is not allowed to perform until all previous load/store operations
have completed. In addition, the acquire (a special read) and
release (a special write) operations across all processors are
guaranteed to be processor consistent: releases by the same
processor are never seen out of order, acquires may bypass releases
issued by the same processor, and releases by different processors
may be seen in different orders by different processors.

Adopting a processor or release consistency protocol is likely to
reduce the idle time due to write misses, because a following read
or acquire does not have to wait for the completion of the write.
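Release consistency splits weak consistency's single two-way barrier
into the one-way acquire and release operations described above. The
sketch below expresses this with C11's memory_order_acquire and
memory_order_release (a much later standardization of the same
idea); producer, consumer, ready, and the value 42 are illustrative
only.

    #include <stdatomic.h>

    int data;               /* ordinary shared data       */
    atomic_int ready;       /* guards the handoff of data */

    void producer(void)
    {
        data = 42;          /* ordinary store */
        /* release: may not perform until the store to data above
         * has completed (but need not wait for later accesses)   */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    void consumer(void)
    {
        /* acquire: later loads/stores may not perform until this
         * read has performed (earlier accesses need not wait)    */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;  /* spin */
        /* data == 42 is guaranteed: the acquire synchronizes with
         * the release, making the earlier ordinary store visible */
    }

Because acquire and release are each only one-way barriers, ordinary
accesses can overlap with them more freely than with a full fence,
which is the source of the reduced write-miss idle time noted above.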
Date of last update: November 16, 1994.