Lecture 15
                              11/01/1994

		    Message Passing Architectures


Caltech Cosmic Cube and the nCUBE/2
-----------------------------------
The Cosmic Cube is an example of an early multicomputer. Its
interconnection network is a 7-dimensional hypercube, so a maximum of
128 nodes could be supported. Each node consisted of a printed
circuit board with an 80286 processor, 512 Kbytes of local memory,
and 8 I/O ports, of which 7 are used to form the 7-dimensional
hypercube, while the 8th provides an Ethernet connection from each
node to the host. Routing on the Cosmic Cube used a store-and-forward
technique with an average latency on the order of a few milliseconds.
The nCUBE/2 is a follow-up architecture, which implemented a
hypercube of up to 8K nodes with a total aggregate memory of 512
GBytes. A major change was the departure from store-and-forward
routing to a pipelined "wormhole routing", which effectively cut the
latency by 3 orders of magnitude.


The Intel Paragon
-----------------
The Intel Paragon is an example of a "current" architecture for
multicomputers. Unlike hypercube systems, it employs a
low-dimensionality communication network (a mesh) and relies on a
heterogeneous design in which not all nodes are identical. In
particular, I/O responsibilities (disk, frame buffers) are delegated
to some of the nodes in the system, instead of being performed
through a "front-end" or a "host", which presented a serious
bottleneck for many applications with high I/O throughput
requirements. The mesh architecture of the Paragon is divided into 3
sections. The middle "compute" section is a mesh of "numeric nodes",
where "number crunching" takes place. All I/O is handled by two disk
I/O columns at the left and right edges of the mesh. Like the Caltech
Cosmic Cube, each node of the Paragon is implemented on one board. It
consists of a bus connecting a processor, a floating-point unit, an
external I/O unit (for processors of the disk I/O columns), and a
message I/O unit, which interfaces the node to the backplane mesh
through a router. The router at each mesh intersection is a 5x5
crossbar switch. The router has 10 buses: 8 for communication with
its NEWS (north/east/west/south) neighbors, and 2 for data transfer
with the board attached to the mesh at that intersection. Routing is
pipelined using wormhole routing with hardware support for virtual
channels (to resolve deadlocks).


Wormhole routing vs. Store-and-forward Message Passing
------------------------------------------------------
A message is the logical unit of internode communication. Messages
may vary in length, each consisting of a variable number of
fixed-length packets. The packet is the smallest part of a message
that can be independently routed. It is tagged with the destination
address and a sequence number (since packets may not reach their
destination in their original order within the message). Various
packets from the same message may follow different routes.
Store-and-forward message passing refers to architectures where a
packet is transmitted from one node to the next, being completely
stored at each intermediate node before being forwarded along its
path. Each such "store-and-forward" step is called a hop. If the size
of a packet is L, the channel bandwidth is W, and the total number of
hops is D, then it can be shown that the communication latency is
(D+1)*(L/W), which is proportional to the number of hops (i.e. the
distance traveled).
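
The formula above can be evaluated directly; the following is a
minimal sketch, with hypothetical parameter values (packet size,
bandwidth, and hop count chosen only for illustration).

```python
def store_and_forward_latency(L, W, D):
    """Store-and-forward latency: the full L-byte packet must be
    received over a channel of bandwidth W at each of the D hops,
    plus the initial injection, giving (D+1)*(L/W)."""
    return (D + 1) * (L / W)

# Hypothetical values: a 1 KB packet over 10 MB/s channels, 7 hops
# (e.g. the diameter of a 7-dimensional hypercube).
print(store_and_forward_latency(L=1024, W=10e6, D=7))
```

Note how the latency scales linearly with the hop count D.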

An alternative to the store-and-forward approach to routing is
"wormhole routing", or "pipelined routing". In this case, a packet is
divided into "flits", such that all flits within a packet are routed
along the same path. The first few flits of a packet carry the
destination address (necessary for routing) and the sequence number,
whereas the remaining flits are purely data flits. Wormhole routing
can be thought of as a "software" implementation of circuit
switching. In particular, several channels may be "reserved" for the
passage of a sequence of flits from a given packet. If the size of a
flit is F, then the latency for wormhole routing is (L/W) + (F/W)*D.
If L >> F*D, then the latency can be approximated by (L/W), which is
independent of the distance traveled. This property of wormhole
routing makes it particularly attractive for low-dimensionality
networks of high bandwidth (e.g. the Paragon, Tera, Multicube, and
the J-machine).
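
The two latency formulas can be compared side by side; again a
minimal sketch with hypothetical parameter values.

```python
def wormhole_latency(L, W, D, F):
    """Wormhole latency: the F-byte header flit pipelines through
    D hops in (F/W)*D, then the rest of the L-byte packet streams
    behind it in L/W."""
    return L / W + (F / W) * D

def store_and_forward_latency(L, W, D):
    """Store-and-forward latency for comparison: the whole packet
    is received at each hop before it is forwarded."""
    return (D + 1) * (L / W)

# Hypothetical values: 1 KB packet, 8-byte flits, 10 MB/s channels.
for D in (1, 7, 15):
    wh = wormhole_latency(L=1024, W=10e6, D=D, F=8)
    sf = store_and_forward_latency(L=1024, W=10e6, D=D)
    print(f"D={D:2d}  wormhole={wh:.2e}s  store-and-forward={sf:.2e}s")
```

Since L >> F*D here, the wormhole figures stay close to L/W while
the store-and-forward figures grow linearly with D.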

The physical communication channels between nodes are resources that
need to be reserved by the virtual connections established between a
source and a destination. The limited resources in the system may
result in a deadlock situation, where it is impossible to free
physical channels (or buffers). Such deadlocks could be avoided if the
physical channels are time-multiplexed. In other words, a channel
could be used to transmit multiple packets in a time-multiplexed
fashion. This gives rise to virtual channels.
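
The time-multiplexing idea can be sketched with a toy model (not any
real router's implementation): flits from the packet assigned to each
virtual channel are interleaved round-robin on the shared physical
link.

```python
from collections import deque

def multiplex(virtual_channels):
    """Round-robin one flit at a time from each non-empty virtual
    channel onto the shared physical channel. No packet holds the
    physical link until completion, so a packet on one virtual
    channel cannot starve the others."""
    link = []                          # sequence of flits on the wire
    queues = [deque(p) for p in virtual_channels]
    while any(queues):
        for q in queues:
            if q:
                link.append(q.popleft())
    return link

# Two hypothetical packets, one per virtual channel.
print(multiplex([["A0", "A1", "A2"], ["B0", "B1"]]))
# -> ['A0', 'B0', 'A1', 'B1', 'A2']
```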


Routing algorithms
------------------
Routing could be either deterministic or adaptive. In deterministic
routing the communication path is completely determined (a priori) by
the source and destination addresses. In adaptive routing the path
depends on network conditions.
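
A classic deterministic scheme is dimension-order ("e-cube") routing
on a hypercube: the path is fixed entirely by the source and
destination addresses, correcting one differing address bit at a
time. A minimal sketch, using the usual binary labeling of hypercube
nodes:

```python
def ecube_route(src, dst, dim):
    """Deterministic e-cube routing on a dim-dimensional hypercube:
    flip the differing address bits in a fixed (lowest-first) order.
    Returns the list of nodes visited, including src and dst."""
    path = [src]
    node = src
    for bit in range(dim):
        if (node ^ dst) & (1 << bit):   # this address bit differs?
            node ^= (1 << bit)          # hop across that dimension
            path.append(node)
    return path

# Route from node 0 (000) to node 5 (101) in a 3-cube.
print(ecube_route(0, 5, 3))
# -> [0, 1, 5]
```

Because every source/destination pair yields exactly one path, the
route never depends on network conditions, in contrast to adaptive
routing.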


Unicasting, Multicasting, and Broadcasting
------------------------------------------
One-to-one communication is often called "unicasting" to distinguish
it from the more general one-to-many "multicasting" and one-to-all
"broadcasting". Given a particular network and a routing mechanism
(store-and-forward vs wormhole), a particular multi/broadcasting
configuration may be better than another. Two measures are used to
evaluate such configurations: channel traffic and communication
latency. Channel traffic is the number of channels used to deliver the
messages involved. Latency is the longest packet transmission time
involved. Generally speaking, for wormhole routing it is better to
use a multi/broadcasting pattern that results in low channel traffic
rather than one that minimizes latency (why?).
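
As a toy illustration of these two measures (hypothetical, not from
the notes), consider broadcasting from node 0 on a d-dimensional
hypercube by recursive doubling: in step i, every node that already
holds the message forwards it across dimension i. This delivers the
message using 2^d - 1 channels (one per receiving node) in d steps.

```python
def hypercube_broadcast(dim):
    """Recursive-doubling broadcast from node 0 on a dim-cube.
    Returns (channel_traffic, steps): in each step, every node that
    already holds the message sends across one new dimension."""
    have = {0}                          # nodes holding the message
    channels = 0
    for bit in range(dim):
        new = {node ^ (1 << bit) for node in have}
        channels += len(have)           # one channel per sender
        have |= new
    return channels, dim

# In a 3-cube: 7 channels used (one per receiving node), 3 steps.
print(hypercube_broadcast(3))
# -> (7, 3)
```

Compare this with 2^d - 1 separate unicasts from the source, which
use far more channels in total even though each individual unicast is
short.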


This document has been prepared by Professor Azer Bestavros
<best@cs.bu.edu> as the WWW Home Page for CS-551, which is part of
the NSF-funded undergraduate curriculum on parallel computing at BU.

Date of last update: November 1, 1994.