Lecture 20
                              11/17/1994


                      Latency Hiding Techniques
                      -------------------------

In distributed shared memory machines, access to remote memory is
likely to be slow compared to the ever-increasing speeds of
processors. Thus, any scalable architecture must rely on techniques to
reduce/hide/tolerate remote-memory-access latencies. Generally
speaking, four methods could be adopted:

  1. Use of prefetching techniques
  2. Use of coherent caching techniques
  3. Relaxing the memory consistency requirements
  4. Using multiple contexts to hide latency


Prefetching Techniques
----------------------
Prefetching is either software-controlled or hardware-controlled. In
software-controlled prefetching, explicit "prefetch" instructions are
issued for data that is "known" to be remote. Hardware-controlled
prefetching is done either through long cache lines, which capitalize
on spatial locality, or through instruction lookahead. Long cache
lines introduce the problem of "false sharing", whereas instruction
lookahead is limited by branches and/or branch-prediction accuracy
(on average there is a branch every 4 instructions).
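
The false-sharing cost of long cache lines comes down to simple
address arithmetic. The sketch below is only a toy model: the two
addresses, the line sizes, and the helper names line_index and
falsely_shared are made up for illustration and do not describe any
particular machine.

```python
def line_index(address, line_size):
    """Return the index of the cache line that holds the given byte address."""
    return address // line_size


def falsely_shared(addr_a, addr_b, line_size):
    """True if two distinct addresses fall into the same cache line."""
    same_line = line_index(addr_a, line_size) == line_index(addr_b, line_size)
    return addr_a != addr_b and same_line


# Hypothetical layout: processor P0 updates a counter at byte 0x1000 and
# processor P1 updates an unrelated counter 32 bytes later, at 0x1020.
P0_COUNTER = 0x1000
P1_COUNTER = 0x1020

for size in (16, 32, 64, 128):
    shared = falsely_shared(P0_COUNTER, P1_COUNTER, size)
    print(f"line size {size:3d} bytes: same line = {shared}")
```

With these assumed addresses the two counters sit in different lines
for 16- and 32-byte lines, but share a line once the line grows to 64
bytes or more, so every update by one processor would then invalidate
the other processor's copy even though the data is unrelated.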

When prefetching is used, issues of coherence must be addressed. For
example, what should be done if a block is updated after it has been
prefetched, but before it has been used? Generally speaking, one of two
policies is adopted. Using the "binding prefetch" policy, the fetch
is assumed to have happened when the prefetch instruction is issued
(in other words, it is the responsibility of the "fetching process" to
ensure that no other processor will update the prefetched value before
it is actually used -- this may require the use of locks, if
necessary). Using the "non-binding prefetch" policy, the
system cache coherence protocol will make sure to invalidate a
prefetched value if it is updated prior to its use. 
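
The difference between the two policies can be sketched with a toy
model. Everything here is an assumption made for illustration: the
ToyMemory class, the use of a plain dictionary as a "cache", and
invalidate-on-write as a stand-in for the coherence protocol. It is
not a description of any real machine.

```python
class ToyMemory:
    """Shared memory plus a list of caches to invalidate on every write."""

    def __init__(self):
        self.cells = {}
        self.caches = []

    def write(self, addr, value):
        self.cells[addr] = value
        # Stand-in for the coherence protocol: invalidate all cached copies.
        for cache in self.caches:
            cache.pop(addr, None)


def binding_use(memory, addr, remote_write):
    """Binding prefetch: the value is bound when the prefetch is issued."""
    register = memory.cells[addr]      # the prefetch binds the value here
    remote_write()                     # another processor updates the block
    return register                    # the later use still sees the old value


def non_binding_use(memory, addr, remote_write):
    """Non-binding prefetch: the prefetched copy stays under coherence control."""
    cache = {}
    memory.caches.append(cache)
    cache[addr] = memory.cells[addr]   # the prefetch fills the cache
    remote_write()                     # the write invalidates the cached copy
    if addr not in cache:              # miss at the time of use: fetch again
        cache[addr] = memory.cells[addr]
    return cache[addr]


mem = ToyMemory()
mem.write("x", 1)
stale = binding_use(mem, "x", lambda: mem.write("x", 2))
mem.write("x", 1)
fresh = non_binding_use(mem, "x", lambda: mem.write("x", 2))
print(stale, fresh)   # prints "1 2"
```

In the binding case the intervening update is simply not seen, which
is why the fetching process must prevent it (for example with a
lock). In the non-binding case the invalidation turns the later use
into an ordinary miss.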

It has been observed that prefetching on the Stanford DASH
multiprocessor may cut latency (of read/write/synchronize operations)
by almost one half.


Coherent Caches
---------------
We have studied examples of maintaining coherent caches using snoopy
caches for bus-based systems and using directory-based caches for
general distributed memory machines. As we have seen, a major problem
with the design of coherent caches is scalability.


Relaxed Memory Consistency Models
---------------------------------
We have studied two memory models: the Sequential Consistency (SC)
model and the Weak Consistency (WC) model. Two other relaxed
consistency models are processor consistency and release consistency.

Under the sequential consistency model, there is a partial ordering of
all reads and writes by all processors, which obeys the program
ordering. Thus, the result of any execution appears as some
interleaving of the operations of the individual processors (as if
executed on some multithreaded sequential machine). Many references
to sequential consistency attribute stronger conditions to it, in
that they require a total ordering of all reads and writes (a
property that is better named "dynamic atomicity").
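
The "some interleaving" view can be made concrete by enumerating
every interleaving of a small two-processor program. The program
below is an assumed example, not taken from the lecture: P0 writes x
and then reads y, P1 writes y and then reads x, and both variables
start at 0. The helper names are likewise made up for this sketch.

```python
from itertools import permutations

# Each operation is (processor, kind, variable); lists are in program order.
P0 = [("P0", "write", "x"), ("P0", "read", "y")]
P1 = [("P1", "write", "y"), ("P1", "read", "x")]


def respects_program_order(schedule, program):
    """True if the operations of `program` keep their order in `schedule`."""
    positions = [schedule.index(op) for op in program]
    return positions == sorted(positions)


def run(schedule):
    """Execute one interleaving against a single memory; return (r0, r1)."""
    memory = {"x": 0, "y": 0}
    result = {}
    for proc, kind, var in schedule:
        if kind == "write":
            memory[var] = 1
        else:
            result[proc] = memory[var]
    return result["P0"], result["P1"]


sc_outcomes = set()
for schedule in permutations(P0 + P1):
    if respects_program_order(schedule, P0) and respects_program_order(schedule, P1):
        sc_outcomes.add(run(schedule))

print(sorted(sc_outcomes))   # prints [(0, 1), (1, 0), (1, 1)]
```

No interleaving that obeys both program orders lets both processors
read 0, so an execution that returns (0, 0) for this program cannot
be sequentially consistent.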

Under the weak consistency model, synchronization operations are not
allowed to "perform" until all preceding load/store operations have
completed. Similarly, no load/store operations are allowed to
"perform" until all preceding synchronization operations have
completed. In addition, (only) synchronization operations are
sequentially consistent.

Under processor consistency, writes issued by the same processor are
never seen out of order (by any other processor in the system), but
writes by different processors may be observed in different orders by
different processors. Read operations following a write operation may
bypass it (i.e. be observed by other processors as happening before
the write). 
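
The effect of a read bypassing an earlier write can be sketched with
a toy write-buffer model. This is an assumed illustration, not a
full model of processor consistency: it uses a single shared memory
(so it cannot show different processors observing writes in
different orders), and the program (P0 writes x then reads y, P1
writes y then reads x, both initially 0) and all names are made up.

```python
from itertools import permutations

# Events per processor: issue the write into a write buffer, perform the
# read, and (at some later point) drain the buffered write to shared memory.
P0 = [("P0", "issue", "x"), ("P0", "read", "y"), ("P0", "drain", "x")]
P1 = [("P1", "issue", "y"), ("P1", "read", "x"), ("P1", "drain", "y")]


def legal(schedule, program):
    """The write is issued first; the read and the drain may come in either order."""
    issue, read, drain = (schedule.index(ev) for ev in program)
    return issue < read and issue < drain


def run(schedule):
    """Execute one schedule; return the values read by (P0, P1)."""
    memory = {"x": 0, "y": 0}
    buffered = {"P0": {}, "P1": {}}
    result = {}
    for proc, kind, var in schedule:
        if kind == "issue":
            buffered[proc][var] = 1
        elif kind == "drain":
            memory[var] = buffered[proc].pop(var)
        else:
            # A read checks the processor's own write buffer first, then memory.
            result[proc] = buffered[proc].get(var, memory[var])
    return result["P0"], result["P1"]


pc_outcomes = set()
for schedule in permutations(P0 + P1):
    if legal(schedule, P0) and legal(schedule, P1):
        pc_outcomes.add(run(schedule))

print(sorted(pc_outcomes))   # prints [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The outcome (0, 0) arises when both reads are performed while both
writes are still sitting in their buffers. Under sequential
consistency that outcome is impossible for this program, because
whichever write comes first in the interleaving is visible to the
other processor's read.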

Under release consistency, synchronization operations are classified
into two categories: acquire (e.g. lock) and release (e.g. unlock).
Load/store operations are not allowed to perform until all previous
acquire operations have been performed, and a release operation is not
allowed to perform until all previous load/stores have been
completed. In addition, the acquire (special read) and release
(special write) across all processors are guaranteed to be processor
consistent (i.e. releases by the same processor are never seen out of
order, and acquires may bypass releases on the same processor -- also
releases by different processors may be seen in different orders by
different processors). 
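
The two ordering rules above can be written down as a small checker
for a single processor. This is a hedged sketch: the function name,
the encoding of operations as strings, and the sample program are
assumptions made for illustration, and the checker deliberately
ignores the processor-consistency requirement on the acquires and
releases themselves.

```python
def allowed_by_release_consistency(program, perform_order):
    """Check one processor's perform order against the two rules in the text.

    program       -- list of operation kinds in program order
    perform_order -- list of indices into `program`, in the order performed
    """
    position = {op: when for when, op in enumerate(perform_order)}
    ordinary = ("load", "store")
    for later in range(len(program)):
        for earlier in range(later):
            # Rule 1: an ordinary access waits for every previous acquire.
            rule1 = program[earlier] == "acquire" and program[later] in ordinary
            # Rule 2: a release waits for every previous ordinary access.
            rule2 = program[earlier] in ordinary and program[later] == "release"
            if (rule1 or rule2) and position[earlier] > position[later]:
                return False
    return True


# Hypothetical program: one access before the critical section,
# two inside it, and one after it.
program = ["store", "acquire", "load", "store", "release", "load"]

print(allowed_by_release_consistency(program, [0, 1, 2, 3, 4, 5]))  # True
print(allowed_by_release_consistency(program, [1, 0, 3, 2, 5, 4]))  # True
print(allowed_by_release_consistency(program, [2, 1, 0, 3, 4, 5]))  # False
print(allowed_by_release_consistency(program, [0, 1, 2, 4, 3, 5]))  # False
```

The second order is accepted even though the final load is performed
before the release, since neither rule orders an access after a
release. Under weak consistency that same load would have to wait,
because no load/store may perform until all preceding
synchronization operations have completed.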

Adopting a processor or release consistency protocol is likely to
reduce the "idle time" due to write misses (because a following read
or acquire doesn't have to wait for the completion of the write).

This document has been prepared by Professor Azer Bestavros
<best@cs.bu.edu> as the WWW Home Page for CS-551, which is part of the
NSF-funded undergraduate curriculum on parallel computing at BU.
Date of last update: November 16, 1994.