Lecture 12
10/20/1994
Multiprocessors and Multicomputers
Earlier in this course we distinguished "shared-memory" multiprocessors
from "message-passing" multicomputers. In this lecture we look closer
at the architectural details of these systems.
Multiprocessors
---------------
A general multiprocessor system must have the following components:
- Processors
- Global shared memory (for all processors)
- Shared I/O devices
and may have the following components:
- Local memory (per processor)
- A cache memory (per processor, to hold a subset of shared memory)
Recall that depending on the architecture of the system, a
multiprocessor system may be UMA, NUMA, or COMA. In all cases, the
processors are attached to the global shared memory (and I/O) via an
interconnection network. The design of the network may vary depending
on the network topology (static or dynamic), routing (circuit
switching, packet switching, or wormhole routing), timing (synchronous
vs. asynchronous), and control (centralized or distributed). These
design dimensions were overviewed in previous lectures. In this
lecture, we look at alternatives to the "straightforward" architectures
we discussed earlier. Each of the concepts we present has been
incorporated in at least one "real machine". We mention these
machines as well.
Hierarchical Bus Systems (Encore's Ultramax)
------------------------
Such a system consists of a hierarchy of buses, with the global shared
memory connected to the highest level intercluster bus, cluster caches at
the intermediate nodes, and processors with their private caches at the
leaves.
Memory Units
| | | |
intercluster bus ----------------------------------------------------
| | | |
Cluster Cluster Cluster Cluster
Cache Cache Cache Cache
| | | |
cluster bus ------- ------- ------- -------
| | | | | | | | | | | |
C C C C C C C C C C C C
| | | | | | | | | | | |
P P P P P P P P P P P P
An example of such a design is Encore's Ultramax multiprocessor
architecture. It has a two-level bus (as shown above), except that the
memory is "distributed" at the leaves (i.e. each leaf node has a
processor, a private cache, and a "piece" of the main
memory). A multi-level cache coherence protocol is used to maintain the
consistency of the multi-level caches with main memory.
Combining Networks (IBM RP3 multiprocessor)
------------------
Many nonuniform communication patterns may create hot spots, where a
certain memory module becomes excessively accessed by many processors
at the same time (e.g. semaphores). All these accesses have to
serialize through the hot spot, which results in a considerable
slowdown. One way to solve this problem is to enable the network to
"combine" requests for the same memory module. This has been
implemented in the FETCH-AND-OP network of the IBM RP3. The IBM RP3 was
designed to include 512 processors using a high-speed Omega network
for reads and writes and a combining network for FETCH-AND-OP. The
advantage of using a combining network to implement FETCH-AND-OP is
achieved at a significant cost (in the NYU Ultracomputer, the
predecessor of the RP3, it amounted to 6 times the cost of a simple
Omega network).
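To make the combining idea concrete, here is a minimal sketch (not the RP3 hardware, and the class and method names are invented for illustration) of how a combining switch can merge several FETCH-AND-ADD requests aimed at the same memory word into a single memory access, then "de-combine" the reply so each processor sees a distinct value:

```python
# Hypothetical sketch of request combining for FETCH-AND-ADD.
# All names (Memory, CombiningSwitch, forward) are illustrative.

class Memory:
    def __init__(self):
        self.cells = {}

    def fetch_and_add(self, addr, inc):
        old = self.cells.get(addr, 0)
        self.cells[addr] = old + inc
        return old

class CombiningSwitch:
    """Merges requests for the same address into one memory access."""
    def __init__(self, memory):
        self.memory = memory

    def forward(self, requests):
        # requests: list of (processor_id, addr, increment)
        by_addr = {}
        for pid, addr, inc in requests:
            by_addr.setdefault(addr, []).append((pid, inc))
        replies = {}
        for addr, reqs in by_addr.items():
            total = sum(inc for _, inc in reqs)
            base = self.memory.fetch_and_add(addr, total)  # ONE access
            # De-combine: hand each processor a distinct old value,
            # as if the requests had been serialized at memory.
            running = base
            for pid, inc in reqs:
                replies[pid] = running
                running += inc
        return replies

mem = Memory()
switch = CombiningSwitch(mem)
out = switch.forward([(0, 100, 1), (1, 100, 1), (2, 100, 1)])
print(sorted(out.values()))   # [0, 1, 2]  -- distinct "ticket" values
print(mem.cells[100])         # 3          -- memory touched only once
```

Note that the hot-spot module is accessed once regardless of how many requests are combined, which is exactly what removes the serialization bottleneck.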
Cache Coherence
---------------
The existence of multiple cached copies of data creates the possibility
of inconsistency between a cached copy and the shared memory or between
cached copies themselves. This may result because of data sharing,
because of process migration, or because of I/O.
With a bus interconnection, cache coherence is usually maintained by
adopting a "snoopy protocol", where each cache controller "snoops" on
the transactions of the other caches and guarantees the validity of the
cached data. In a (single- or multi-stage) switching network, however,
the absence of a system "bus" on which all transactions are broadcast
makes snoopy protocols impractical. Directory-based schemes are used
in this case.
Snoopy Bus Protocols
--------------------
The Snoopy Cache Coherence Protocols differ in the number of states
they allow each cache block (line) to assume and the transition
relation between these states as a result of bus/cpu read/write
transactions. We will consider a number of protocols:
- Write-through
- Write-back
- The Illinois Write Invalidate Protocol
This protocol is one of the first write invalidate protocols. This
scheme introduces the exclusive unmodified state of data and entirely
avoids invalidations on write hits to unmodified non-shared blocks. To
implement this algorithm, it is necessary to associate two status bits
with each block in cache. The first bit indicates either Shared or
Exclusive ownership of a block, while the second is set if the block
has been locally modified. A write-back policy is used. At any
point in time the cache block may be in one of four
states: Invalid, Exclusive Unmodified, Shared
Unmodified, and Exclusive Modified.
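The four states and the transitions described above can be sketched as a small state table. This is a simplified illustration (bus signals and event names are my own shorthand, not the exact Illinois implementation), but it shows the key optimization: a write hit to an Exclusive Unmodified block needs no bus transaction at all.

```python
# Hedged sketch of the Illinois state machine for a single cache line.
# Event names are illustrative shorthand, not the real bus signals.

INVALID, EXCL_UNMOD, SHARED_UNMOD, EXCL_MOD = (
    "Invalid", "Exclusive Unmodified",
    "Shared Unmodified", "Exclusive Modified")

def illinois_next(state, event):
    table = {
        (INVALID, "cpu_read_miss_alone"): EXCL_UNMOD,  # no other copy
        (INVALID, "cpu_read_miss_shared"): SHARED_UNMOD,
        (INVALID, "cpu_write_miss"): EXCL_MOD,   # read-with-invalidate
        # Key optimization: writing an exclusive clean line is silent
        # (no invalidation traffic on the bus).
        (EXCL_UNMOD, "cpu_write_hit"): EXCL_MOD,
        (SHARED_UNMOD, "cpu_write_hit"): EXCL_MOD,  # invalidate others
        (EXCL_MOD, "cpu_write_hit"): EXCL_MOD,
        # Snooped transactions from other processors:
        (EXCL_UNMOD, "bus_read"): SHARED_UNMOD,
        (EXCL_MOD, "bus_read"): SHARED_UNMOD,  # supply data, write back
        (SHARED_UNMOD, "bus_read"): SHARED_UNMOD,
        (SHARED_UNMOD, "bus_write"): INVALID,
        (EXCL_UNMOD, "bus_write"): INVALID,
        (EXCL_MOD, "bus_write"): INVALID,
    }
    return table.get((state, event), state)

# A write hit to an exclusive unmodified block modifies it silently:
s = illinois_next(INVALID, "cpu_read_miss_alone")
s = illinois_next(s, "cpu_write_hit")
print(s)  # Exclusive Modified
```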
- The Synapse Protocol
This approach was used in the Synapse N + 1, a multiprocessor for
fault-tolerant transaction processing. The N + 1 differs from other
shared bus designs in that it has two system buses. The added
bandwidth of the extra bus allows the system to be expanded to a
maximum of 28 processors. In this protocol, cache blocks are in one of
three states: Invalid, Valid (data valid and may
be shared with other caches), and Dirty (data is valid,
shared with no other cache, and not consistent with main memory).
- The Goodman Write Once Protocol
This protocol was introduced in the early 1980s and was designed for
the Multibus. The requirement that the scheme work with an existing
bus protocol was a severe restriction, but one that resulted in
implementation simplicity. In this protocol, blocks in the local cache
can be in one of four states: Invalid, Valid (data
valid but may be shared with another cache), Reserved (data
valid, shared with no other cache, and consistent with main memory),
and Dirty (data is valid, shared with no other cache, and
not consistent with main memory -- write-back needed). A block is
Reserved if it has been modified exactly once. It becomes
Dirty if it is modified more than once.
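The "write once" idea above (first write goes through to memory, later writes stay local) can be sketched as a tiny transition function; the function name and action strings are illustrative, not part of Goodman's specification:

```python
# Hedged sketch of Goodman's write-once policy for one cache line:
# the FIRST write to a Valid block writes through to memory and marks
# the line Reserved; any LATER write stays local and marks it Dirty,
# so a write-back is needed only on replacement.

def write_once(state):
    """Return (next_state, bus_action) for a CPU write hit."""
    if state == "Valid":
        return "Reserved", "write word through to memory"  # 1st write
    if state in ("Reserved", "Dirty"):
        return "Dirty", "none (local write only)"          # 2nd write on
    raise ValueError("write hit impossible in state " + state)

state = "Valid"
state, action = write_once(state)
print(state, "|", action)   # Reserved | write word through to memory
state, action = write_once(state)
print(state, "|", action)   # Dirty | none (local write only)
```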
- The Berkeley Protocol
This protocol was implemented for a RISC multiprocessor designed at the
University of California at Berkeley. The approach is similar to
Synapse with two major improvements: cache-to-cache transfers are
implemented on the shared bus, and dirty data is not updated in main
memory when it becomes shared. The following states are used:
Invalid, Valid, Shared Dirty, and
Exclusive Dirty. Like the Synapse protocol, the Berkeley approach
uses the idea of ownership -- the cache that has the block in a
Dirty state is the owner of that block. If a block is not owned
by any cache, memory is the owner; in any case, the owner supplies the
block on a miss.
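The ownership rule can be sketched as follows; the state names match the lecture, but the function and data layout are invented for illustration:

```python
# Hedged sketch of Berkeley-style ownership: on a miss, the block is
# supplied by its owner -- the cache holding it in a dirty state if
# one exists, otherwise main memory.

def find_owner(block, caches):
    """caches: list of dicts mapping block name -> coherence state."""
    for i, cache in enumerate(caches):
        if cache.get(block) in ("Shared Dirty", "Exclusive Dirty"):
            return f"cache {i}"     # a dirty holder owns the block
    return "memory"                 # unowned blocks come from memory

caches = [
    {"A": "Valid"},
    {"A": "Shared Dirty", "B": "Exclusive Dirty"},
    {},
]
print(find_owner("A", caches))  # cache 1 (cache-to-cache transfer)
print(find_owner("C", caches))  # memory
```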
- Dragon and Firefly
The Firefly cache coherence protocol was used in the Firefly, a
multiprocessor workstation developed by DEC. The Dragon protocol is a
small variation of the Firefly, developed at the Xerox Palo Alto
Research Center shortly afterward. The main difference between these
two protocols is that Dragon protocol does not update memory on a
cache to cache transfer and delays the memory and cache consistency
until the data is evicted and written back, which saves time and
lowers the memory access requirements. These two schemes represent a
family of write-update (or write-broadcast) protocols. These snoopy
protocols can dynamically detect the sharing status of a block, using
a write-through policy for shared blocks and write-back for currently
non-shared blocks. These approaches are among the most powerful in
their class; the hit rate for such protocols was found to be close
to 100% for many applications. The Dragon protocol employs the
following five states for the cache blocks: Invalid,
Valid Exclusive, Shared Clean, Shared Dirty,
Exclusive Dirty. When a line is evicted from the cache on a
cache miss, a block tagged with a dirty bit has to be written back
to the main memory in order to keep memory consistent.
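The write-update behavior common to Dragon and Firefly can be sketched as follows. This is a simplified model (caches are modeled as dictionaries and the function name is invented): a write to a block held by other caches broadcasts the new value instead of invalidating the copies, while a write to a non-shared block stays local.

```python
# Hedged sketch of the write-update (write-broadcast) idea behind
# Dragon/Firefly: writes to SHARED blocks broadcast the new word to
# every cache holding a copy; writes to non-shared blocks stay local.

def cpu_write(writer, block, value, caches):
    """caches: list of dicts mapping block -> value."""
    others = [c for i, c in enumerate(caches)
              if i != writer and block in c]
    caches[writer][block] = value
    if others:                      # shared: update, don't invalidate
        for c in others:
            c[block] = value
        return "update broadcast"
    return "local write"            # non-shared: write back on eviction

caches = [{"X": 0}, {"X": 0}, {}]
t1 = cpu_write(0, "X", 7, caches)
print(t1)                 # update broadcast
print(caches[1]["X"])     # 7 -- the other copy stays valid and current
t2 = cpu_write(2, "Y", 1, caches)
print(t2)                 # local write
```

Because copies are never invalidated, a read that follows another processor's write still hits in the local cache, which is why measured hit rates for these protocols are so high.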
Directory-based Protocols
-------------------------
When a multistage network is used to build a large multiprocessor
system, the snoopy cache protocols must be modified. Since
broadcasting is very expensive in a multistage network, consistency
commands are sent only to caches that keep a copy of the block. This
leads to Directory Based protocols.
Various directory-based protocols differ mainly in how the directory
maintains information and what information is stored. Generally
speaking, the directory may be central or distributed. Contention and
long search times are two drawbacks in using a central directory
scheme. In a distributed-directory scheme, the information about memory
blocks is distributed. Each processor in the system can easily "find
out" where to go for "directory information" for a particular memory
block. Directory-based protocols fall under one of three categories:
Full-map directories, limited directories, and chained directories.
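A full-map directory, the simplest of the three categories, can be sketched as follows; the class and method names are invented for illustration. Each memory block carries a record of which caches hold a copy, so a write triggers point-to-point invalidations only to actual copy holders, with no broadcast:

```python
# Hedged sketch of a full-map directory: per memory block, the
# directory tracks which caches hold a copy, so consistency commands
# go only to those caches instead of being broadcast.

class FullMapDirectory:
    def __init__(self):
        self.presence = {}   # block -> set of cache ids holding a copy

    def read(self, cache_id, block):
        # A read miss fetches the block; record the new copy holder.
        self.presence.setdefault(block, set()).add(cache_id)

    def write(self, cache_id, block):
        # A write invalidates every OTHER copy holder (point-to-point),
        # leaving the writer as the sole holder.
        holders = self.presence.get(block, set())
        targets = sorted(holders - {cache_id})
        self.presence[block] = {cache_id}
        return targets

d = FullMapDirectory()
d.read(0, "B"); d.read(2, "B"); d.read(3, "B")
invalidated = d.write(2, "B")
print(invalidated)   # [0, 3] -- only these caches receive commands
```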
This document has been prepared by Professor Azer Bestavros
<best@cs.bu.edu> as the WWW home page for CS-551, which is part of the
NSF-funded undergraduate curriculum on parallel computing at BU.
Date of last update: October 19, 1994.