Lecture 21                                                11/22/1994

Multithreading
--------------

Perhaps the most promising "scalable" approach to dealing with large
latencies (whether due to purely physical limitations or to process
synchronization) is to "hide" the latency (rather than reduce it).
Using multithreading, each processor in the system is responsible for
running a large number of "threads". When a thread issues a remote
access, then instead of idling (or busy waiting), the processor
switches contexts (i.e. executes a different thread). In other words,
the cost of the remote access is hidden because "useful work" on a
different thread is done. Obviously, for this technique to work, the
cost of a context switch must be much cheaper than the cost of a
remote access.

A major question in designing a multithreaded machine to hide latency
is "when to switch context?" There are various possibilities. For
example, a processor may switch context on a cache miss, or it may
switch context on every load (independent of whether it hits or
misses), or it may switch context after each instruction (i.e. it
interleaves the instructions from different threads on a
cycle-by-cycle basis -- this is particularly good for pipelined
execution because it removes pipeline hazards!)

Let (L) denote the communication latency on a remote memory access.
The value of L reflects the network delays as well as any cache miss
penalty, etc. Let (N) denote the number of threads that can be
interleaved in each processor. Let (C) denote the context switching
overhead. Finally, let (R) denote the average number of cycles
(instructions) between remote accesses.

Clearly the objective is to maximize the fraction of time that a
processor is busy doing useful work (i.e. its utilization). For a
single-threaded processor, the utilization will be:

             R         1
      U1 = ----- = -------
           R + L   1 + L/R

For large L and small R, it can easily be seen that the utilization
drops quite fast.
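The single-threaded utilization formula above can be checked with a
short calculation. This is a sketch; the sample values of R and L are
illustrative assumptions, not taken from the lecture:

```python
def u1(R, L):
    """Single-threaded utilization: U1 = R / (R + L)."""
    return R / (R + L)

# Assumed sample values: R = 20 cycles of useful work between remote
# accesses, L = 100-cycle remote access latency.
print(u1(20, 100))   # latency dominates: utilization is only ~0.17
print(u1(20, 1000))  # a 10x larger L drives utilization toward zero
```

As the lecture notes, the processor spends most of its time stalled
whenever L is large relative to R.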
For a multiple-threaded processor, the maximum utilization will occur
when there are enough threads to completely hide the latency (i.e.
there is always a ready thread to execute after each remote access).
For this to be true we need:

      (N-1)(R+C) > L

Thus,

              L
      N >= ------- + 1
            R + C

In this case, the utilization becomes:

               R         1
      Umax = ----- = -------
             R + C   1 + C/R

Obviously, if the context switching cost is reduced to 0 (as in the
Tera computer, for example), then 100% utilization becomes possible.
If N is below the level needed for maximum utilization, then the
utilization becomes:

                N.R
      Ulin = ---------
             R + C + L

An interesting observation is that multithreading may have an adverse
effect on the latency itself! In particular, multithreading results
in a much larger amount of data "in transit" (on the wire), which may
cause hot spots or congestion, and thus a higher latency. It is
important to strike the right balance (quite an open area of
research!)

Example Architectures Overviewed:
---------------------------------

  o The Wisconsin Multicube           (e.g. of a snoopy cache design)
  o The Stanford Dash Multiprocessor  (e.g. of a directory-based
                                       cache design)
  o The Tera Computer                 (e.g. of a multithreaded
                                       machine design)
  o The KSR Computer                  (e.g. of an ALLCACHE machine
                                       design)
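The saturation condition and the two utilization regimes above can be
sketched as one small function. The parameter values are assumptions
for illustration, not from the lecture:

```python
import math

def utilization(N, R, C, L):
    """Utilization of a multithreaded processor, per the notes:
    Ulin = N*R / (R + C + L) below saturation, capped at
    Umax = R / (R + C) once N >= L/(R+C) + 1 threads hide L."""
    n_min = L / (R + C) + 1          # threads needed to fully hide L
    if N >= n_min:
        return R / (R + C)           # Umax: latency fully hidden
    return N * R / (R + C + L)       # Ulin: not enough threads

# Assumed sample values: R = 20, C = 5, L = 100.
R, C, L = 20, 5, 100
n_min = math.ceil(L / (R + C) + 1)   # 5 threads for these values
print(n_min)
print(utilization(1, R, C, L))       # single thread: 20/125 = 0.16
print(utilization(n_min, R, C, L))   # saturated: 20/25 = 0.8
```

Note that with C = 0 (as in the Tera design mentioned above), the
saturated value R / (R + C) is exactly 1, i.e. 100% utilization.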
Date of last update: November 16, 1994.