Lecture 15
11/01/1994

Message Passing Architectures

Caltech Cosmic Cube and the nCUBE/2
-----------------------------------

The Cosmic Cube is an example of an early multicomputer. Its interconnection network is a 7-dimensional hypercube, so a maximum of 128 nodes could be supported. Each node consisted of a printed circuit board with an 80286 processor, 512 Kbytes of local memory, and 8 I/O ports, of which 7 were used to form the 7-dimensional hypercube and the 8th was used for an Ethernet connection from the node to the host. Routing on the Cosmic Cube was done using a store-and-forward technique, with an average latency on the order of a few milliseconds.

The nCUBE/2 is a follow-up architecture, which implemented a hypercube of up to 8K nodes with a total aggregate memory of 512 GBytes. A major change was the move from the store-and-forward routing protocol to pipelined "wormhole routing", which effectively cut the latency by 3 orders of magnitude.

The Intel Paragon
-----------------

The Intel Paragon is an example of a "current" architecture for multicomputers. Unlike hypercube systems, it employs a low-dimensionality communication network (a mesh) and relies on a heterogeneous design in which not all nodes are identical. In particular, I/O responsibilities (disk, frame buffers) were delegated to some of the nodes in the system, instead of being performed through a "front-end" or "host", an arrangement that presented a serious bottleneck for many applications with high I/O throughput requirements.

The mesh architecture of the Paragon is divided into 3 sections. The middle "compute" section is a mesh of "numeric nodes", where "number crunching" takes place. All I/O is handled by two disk I/O columns at the left and right edges of the mesh.

Like the Caltech Cosmic Cube, each node of the Paragon is implemented on one board.
It consists of a bus connected to a processor, an FP unit, an external I/O unit (for processors of the disk I/O columns), and a Message I/O unit, which interfaces the node to the backplane mesh through a router.

The router at each mesh intersection is a 5x5 crossbar switch. The router has 10 buses: 8 for communication with its NEWS (north, east, west, south) neighbors, and 2 for data transfer with the board attached to the mesh at that intersection. Routing is pipelined using wormhole routing, with hardware support for virtual channels (to resolve deadlocks).

Wormhole routing vs. Store-and-forward Message Passing
------------------------------------------------------

A message is a logical unit for internode communication. Messages may be variable in length, consisting of a variable number of fixed-length packets. The packet is the smallest part of a message that can be independently routed. It is tagged with the destination address and a sequence number (since packets may not reach their destination in their original order within the message). Different packets from the same message may follow different routes.

Store-and-forward message passing refers to architectures where packets are transmitted from one node to the next, and each packet is completely stored at a node before being forwarded to the next node along its path. Each "store-and-forward" step is called a hop. If the size of a packet is L, the average bandwidth of the network is W, and the total number of hops is D, then it can be shown that the communication latency is (D+1)*(L/W), which is proportional to the number of hops (i.e. the distance traveled).

An alternative to the store-and-forward approach to routing is "wormhole routing" or "pipelined routing". In this case, a packet is divided into "flits", and all flits within a packet are routed through the same path. The first few flits of a packet carry the destination address (necessary for routing) and the sequence number, whereas the remaining flits are purely data flits.
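
The message, packet, and flit structure described above can be sketched in a few lines of Python. This is only an illustration: the names Packet, packetize, reassemble, and to_flits, and the sizes PACKET_SIZE and FLIT_SIZE, are hypothetical and do not come from any of the machines discussed in these notes.

```python
import random
from dataclasses import dataclass

PACKET_SIZE = 4  # hypothetical fixed packet payload length, in bytes
FLIT_SIZE = 2    # hypothetical flit length, in bytes


@dataclass(frozen=True)
class Packet:
    dest: int       # destination node address
    seq: int        # position of this packet within its message
    payload: bytes  # up to PACKET_SIZE bytes of message data


def packetize(message, dest, size=PACKET_SIZE):
    """Split a message into packets tagged with destination and sequence number.

    The last payload may be shorter than `size`; a real system would pad it.
    """
    return [Packet(dest, seq, message[i:i + size])
            for seq, i in enumerate(range(0, len(message), size))]


def reassemble(packets):
    """Rebuild the message from packets that may have arrived out of order."""
    ordered = sorted(packets, key=lambda p: p.seq)
    return b"".join(p.payload for p in ordered)


def to_flits(packet, size=FLIT_SIZE):
    """Divide a packet into header flits (address, sequence number) and data flits."""
    header = [("header", packet.dest), ("header", packet.seq)]
    data = [("data", packet.payload[i:i + size])
            for i in range(0, len(packet.payload), size)]
    return header + data


msg = b"wormhole routing"
packets = packetize(msg, dest=5)
random.Random(0).shuffle(packets)  # simulate out-of-order arrival
print(reassemble(packets) == msg)
print(to_flits(Packet(dest=5, seq=0, payload=b"worm")))
```

The sequence number is what lets the receiver restore the original order, while only the header flits need to be examined by a router.
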
Wormhole routing can be thought of as a "software" implementation of circuit switching. In particular, several channels may be "reserved" for the passage of a sequence of flits from a given packet. If the size of a flit is F, then the latency for wormhole routing is (L/W) + (F/W)*D. If L >> F*D, then the latency can be approximated by (L/W), which is independent of the distance traveled. This property of wormhole routing makes it particularly attractive to use low-dimensionality networks of high bandwidth (e.g. Paragon, Tera, Multicube, and J-machine).

The physical communication channels between nodes are resources that need to be reserved by the virtual connections established between a source and a destination. The limited resources in the system may result in a deadlock situation, where it is impossible to free physical channels (or buffers). Such deadlocks can be avoided if the physical channels are time-multiplexed; that is, a single physical channel can be used to transmit multiple packets in a time-multiplexed fashion. This gives rise to virtual channels.

Routing algorithms
------------------

Routing can be either deterministic or adaptive. In deterministic routing, the communication path is completely determined (a priori) by the source and destination addresses. In adaptive routing, the path depends on network conditions.

Unicasting, Multicasting, and Broadcasting
------------------------------------------

One-to-one communication is often called "unicasting" to distinguish it from the more general one-to-many "multicasting" and one-to-all "broadcasting". Given a particular network and a routing mechanism (store-and-forward vs. wormhole), a particular multi/broadcasting configuration may be better than another. Two measures are used to evaluate such configurations: channel traffic and communication latency. Channel traffic is the number of channels used to deliver the messages involved. Latency is the longest packet transmission time involved.
Generally speaking, for wormhole routing it is better to use a multi/broadcasting pattern that results in low channel traffic, as opposed to one that reduces the latency (why?).
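
To make the two measures concrete, the following Python sketch evaluates three multicast configurations on a small mesh, using the store-and-forward and wormhole latency formulas given earlier in these notes. Everything specific here is an assumption made for illustration: the 2x3 mesh, the source and destinations, the chosen paths, the values of L, W, and F, and the function names. It also assumes that a channel shared by a multicast tree is counted once (the message is replicated at branch nodes), that separate unicasts are counted per copy, and that latency is the formula applied to the longest path.

```python
def channels(path):
    """Return the channels (pairs of adjacent nodes) traversed along a path."""
    return list(zip(path, path[1:]))


def channel_traffic(paths, shared):
    """Count channels used; shared channels count once if copies are merged."""
    used = [c for p in paths for c in channels(p)]
    return len(set(used)) if shared else len(used)


def max_hops(paths):
    """Number of hops D on the longest path of the configuration."""
    return max(len(p) - 1 for p in paths)


def store_and_forward_latency(L, W, D):
    return (D + 1) * (L / W)


def wormhole_latency(L, W, F, D):
    return L / W + (F / W) * D


# Nodes are (row, column) positions on a hypothetical 2x3 mesh; source is (0, 0).
tree_paths = [
    [(0, 0), (1, 0)],
    [(0, 0), (1, 0), (1, 1)],
    [(0, 0), (1, 0), (1, 1), (1, 2)],
    [(0, 0), (0, 1), (0, 2)],
]
# Same destinations, but (0, 2) is reached at the end of one long path.
chain_paths = tree_paths[:3] + [[(0, 0), (1, 0), (1, 1), (1, 2), (0, 2)]]

configurations = {
    "separate unicasts": (tree_paths, False),
    "multicast tree": (tree_paths, True),
    "single path": (chain_paths, True),
}

L, W, F = 1024, 4, 8  # made-up packet size, bandwidth, and flit size
for name, (paths, shared) in configurations.items():
    D = max_hops(paths)
    print(name,
          "traffic:", channel_traffic(paths, shared),
          "store-and-forward:", store_and_forward_latency(L, W, D),
          "wormhole:", wormhole_latency(L, W, F, D))
```

With these made-up numbers the channel traffic ranges from 4 to 8, while the wormhole latencies of the three configurations differ by less than 1%. Comparing the two latency columns may be a useful starting point for the question posed above.
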
Date of last update: November 1, 1994.