Lecture 12
                              10/20/1994

		  Multiprocessors and Multicomputers


Earlier in this course we distinguished "shared-memory" multiprocessors
from "message-passing" multicomputers. In this lecture we look closer
at the architectural details of these systems.

Multiprocessors
---------------
A general multiprocessor system must have the following components:
  - Processors
  - Global shared memory (for all processors)
  - Shared I/O devices

and may have the following components:
  - Local memory (per processor)
  - A cache memory (per processor, to hold a subset of shared memory)

Recall that depending on the architecture of the system, a
multiprocessor system may be UMA, NUMA, or COMA. In all cases, the
processors are attached to the global shared memory (and I/O) via an
interconnection network. The design of the network may vary depending
on the network topology (static and dynamic network topologies),
routing (circuit switching, packet switching, and wormhole routing),
timing (synchronous vs asynchronous) and control (centralized and
distributed). These design dimensions were overviewed in previous
lectures. In this lecture, we look at alternatives to the
"straightforward" architectures we discussed earlier. Each of the
concepts we present has been incorporated in at least one "real
machine"; we mention these machines as well.


Hierarchical Bus Systems (Encore's Ultramax)
------------------------
Such a system consists of a hierarchy of buses, with the global shared
memory connected to the highest level intercluster bus, cluster caches at
the intermediate nodes, and processors with their private caches at the
leaves.

                                      Memory Units
                                        | | | |
intercluster bus ----------------------------------------------------
                    |              |              |              |
                 Cluster        Cluster        Cluster        Cluster 
                  Cache          Cache          Cache          Cache
                    |              |              |              |
     cluster bus -------        -------        -------        -------   
                  | | |          | | |          | | |          | | |
                  C C C          C C C          C C C          C C C
                  | | |          | | |          | | |          | | |
                  P P P          P P P          P P P          P P P

An example of such a design is Encore's Ultramax multiprocessor
architecture. It has a two-level bus (as shown above), except that the
memory is "distributed" at the leaves (i.e. each leaf node has a
processor, a private cache, and a "piece" of the main
memory). A multi-level cache coherence protocol is used to maintain the
consistency of the multi-level caches with main memory.
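Hierarchical coherence protocols of this kind typically rely on an
inclusion property: each cluster cache holds a superset of the blocks
cached by the processors beneath it, so the cluster cache can filter
intercluster-bus traffic on behalf of its leaves. The sketch below (in
Python, with illustrative names, not taken from the Ultramax design
documents) checks that property:

```python
# Illustrative inclusion property for a two-level cache hierarchy:
# a cluster cache should hold a superset of the blocks cached by the
# processors beneath it, so the intercluster bus only needs to involve
# clusters that actually cache the block in question.

def violates_inclusion(cluster_cache, private_caches):
    """Return the blocks cached by some leaf but missing from the cluster cache."""
    cluster = set(cluster_cache)
    missing = set()
    for cache in private_caches:
        missing |= set(cache) - cluster
    return missing

cluster = {"A", "B", "C"}
leaves = [{"A"}, {"B", "C"}]
ok = violates_inclusion(cluster, leaves)        # empty set: inclusion holds
bad = violates_inclusion({"A"}, [{"A", "B"}])   # {"B"}: inclusion violated
```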


Combining Networks (IBM RP3 multiprocessor) 
------------------ 
Many nonuniform communication patterns may create hot spots, where a
certain memory module becomes excessively accessed by many processors
at the same time (e.g. semaphores). All these accesses have to
serialize through the hot spot, which results in a considerable
slowdown. One way to solve this problem is to enable the network to
"combine" requests for the same memory module. This has been
implemented in the FETCH-AND-OP network of the IBM RP3. The RP3 was
designed to include 512 processors using a high-speed Omega network
for reads and writes and a separate combining network for
FETCH-AND-OP. The advantage of using a combining network to implement
FETCH-AND-OP comes at a significant cost (in the NYU Ultracomputer,
the predecessor of the RP3, it amounted to 6 times the cost of a
simple Omega network).
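To illustrate the idea, the following sketch (in Python, with
hypothetical names) shows how a combining switch could merge two
FETCH-AND-ADD requests headed for the same memory location into one
request, then reconstruct both replies on the way back:

```python
# Sketch of request combining for FETCH-AND-ADD (hypothetical names).
# FETCH-AND-ADD(x, a) atomically returns the old value of x and adds a.
# A combining switch that sees two requests for the same location can
# forward a single combined request, halving the load on the hot spot.

def fetch_and_add(memory, addr, increment):
    """The memory module's view: one atomic read-modify-write."""
    old = memory[addr]
    memory[addr] = old + increment
    return old

def combine_and_forward(memory, addr, a, b):
    """Switch combines FAA(addr, a) and FAA(addr, b) into FAA(addr, a + b),
    then reconstructs the two individual replies locally."""
    old = fetch_and_add(memory, addr, a + b)   # one request reaches memory
    reply_first = old          # first requester sees the original value
    reply_second = old + a     # second sees the value after the first's add
    return reply_first, reply_second

memory = {0x10: 5}
r1, r2 = combine_and_forward(memory, 0x10, 3, 4)
# Net effect on memory equals two serialized FETCH-AND-ADDs.
```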


Cache Coherence
---------------
The existence of multiple cached copies of data creates the possibility
of inconsistency between a cached copy and the shared memory or between
cached copies themselves. This may result because of data sharing,
because of process migration, or because of I/O. 

With a bus interconnection, cache coherence is usually maintained by
adopting a "snoopy protocol", where each cache controller "snoops" on
the transactions of the other caches and guarantees the validity of the
cached data. In a multistage network, however, the absence of a system
"bus" on which transactions are broadcast makes snoopy protocols
impractical. Directory-based schemes are used in this case.

Snoopy Bus Protocols
--------------------
Snoopy cache coherence protocols differ in the number of states
they allow each cache block (line) to assume and in the transition
relation between these states as a result of bus/CPU read/write
transactions. We will consider a number of such protocols.
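As one concrete illustration (not necessarily one of the protocols
covered in this course), a simple three-state write-invalidate
protocol with INVALID, SHARED, and MODIFIED states can be sketched as
follows; all names are illustrative:

```python
# Illustrative three-state (MSI-style) write-invalidate snoopy protocol.
# Each cache line is INVALID, SHARED, or MODIFIED. Transitions are driven
# by local CPU reads/writes and by bus transactions snooped from other caches.

INVALID, SHARED, MODIFIED = "I", "S", "M"

def cpu_event(state, op):
    """Next state of a line in the local cache after a CPU read or write."""
    if op == "read":
        return SHARED if state == INVALID else state  # a miss fetches a copy
    if op == "write":
        return MODIFIED   # write hit or miss: gain exclusive ownership
    raise ValueError(op)

def snoop_event(state, bus_op):
    """Next state after snooping another cache's transaction on this line."""
    if bus_op == "bus_read":    # another cache reads the block
        return SHARED if state == MODIFIED else state  # supply data, demote
    if bus_op == "bus_write":   # another cache writes the block
        return INVALID          # invalidate our copy
    raise ValueError(bus_op)

# A write by another processor invalidates our shared copy:
s = cpu_event(INVALID, "read")     # line becomes SHARED
s = snoop_event(s, "bus_write")    # line becomes INVALID
```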



Directory-based Protocols
-------------------------
When a multistage network is used to build a large multiprocessor
system, the snoopy cache protocols must be modified. Since
broadcasting is very expensive in a multistage network, consistency
commands are sent only to caches that keep a copy of the block. This
leads to directory-based protocols.

Various directory-based protocols differ mainly in how the directory
maintains information and what information is stored. Generally
speaking, the directory may be central or distributed. Contention and
long search times are two drawbacks of using a central directory
scheme. In a distributed-directory scheme, the information about memory
blocks is distributed. Each processor in the system can easily "find
out" where to go for "directory information" for a particular memory
block. Directory-based protocols fall under one of three categories:
Full-map directories, limited directories, and chained directories.
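To make the full-map variant concrete, the sketch below (in Python,
with illustrative names) keeps one presence-bit vector per memory
block, so that on a write, invalidations go only to the caches that
actually hold a copy:

```python
# Illustrative full-map directory: one presence bit per cache per block.
# On a write, consistency commands (invalidations) are sent only to the
# caches whose presence bit is set, instead of being broadcast.

class FullMapDirectory:
    def __init__(self, num_blocks, num_caches):
        self.presence = [[False] * num_caches for _ in range(num_blocks)]

    def read(self, block, cache_id):
        """A cache fetches a copy: record it in the presence vector."""
        self.presence[block][cache_id] = True

    def write(self, block, cache_id):
        """A cache writes: invalidate all other copies, keep the writer's."""
        invalidated = [c for c, present in enumerate(self.presence[block])
                       if present and c != cache_id]
        self.presence[block] = [c == cache_id
                                for c in range(len(self.presence[block]))]
        return invalidated   # consistency commands go only to these caches

d = FullMapDirectory(num_blocks=4, num_caches=3)
d.read(0, 0)
d.read(0, 2)
victims = d.write(0, 1)   # caches 0 and 2 must invalidate block 0
```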


This document has been prepared by Professor Azer Bestavros <best@cs.bu.edu> as the WWW Home Page for CS-551, which is part of the NSF-funded undergraduate curriculum on parallel computing at BU.

Date of last update: October 19, 1994.