Lecture 19
                              11/15/1994

		    
			   SIMD Processing
			   ---------------

The CM-2 architecture
---------------------
The CM-2 can be thought of as a "glorified" vector unit. It consists
of an array of PEs connected to a scalar host (the front-end). One way
to view the CM-2 is as a front-end attached to a memory that is
distributed amongst a number of PEs. Data can be communicated between
the front-end and the PE array in one of three ways: broadcasting,
global combining, and the scalar memory bus. In broadcasting, the
front-end broadcasts a scalar value to all PEs (casting a scalar
variable/constant to a parallel variable/constant). In global
combining, the front-end obtains the scalar sum, max, min, OR, or AND
of a parallel variable. Using the scalar memory bus, the front-end can
read or write the memory of a particular PE.
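
Using the notation of MPL (the MasPar language introduced later in
these notes; C* on the CM-2 provides analogous operations), these
three forms of front-end/PE-array communication look roughly as
follows (a sketch, not actual CM-2 code):

plural int x ;          /* one copy of x on every PE */
int s ;

x = 5 ;                 /* broadcasting: the scalar 5 is cast to plural  */
s = reduceAdd32(x) ;    /* global combining: the front-end gets the sum  */
s = proc[7].x ;         /* scalar memory bus: read x from PE number 7    */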

The CM-2 has one control unit for all the PEs. A "parallel"
instruction is passed from the front-end to the controller, which
broadcasts a sequence of nanoinstructions to all PEs to execute that
instruction. Data can be moved around between the PEs using one of
three methods: the NEWS grids, the global router, and the scan
operations.

The CM-2 PE array consists of up to 2048 processing nodes. Each node
consists of a memory unit, a floating-point unit, and 32 bit-slice
processors (1 PE = a 3-input/2-output 1-bit ALU with latches)
organized into two groups of 16 PEs each. Each of these 16-PE groups
is connected to a NEWS/hypercube router interface.

The hypercube router consists of 4096 router nodes interconnected as a
12-dimensional hypercube. A hardware routing algorithm that may
introduce misrouting is used. When the router is "delivering"
messages, the PE array is idle.
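
To make the hypercube structure concrete, the following toy sketch
(plain C, our own illustration of classical dimension-order routing;
the CM-2's hardware algorithm may deviate from this greedy path, e.g.
by misrouting around congested links) routes a message by correcting
one differing address bit per hop:

#include <stdio.h>

#define DIM 12   /* 2^12 = 4096 router nodes */

/* Route from node src to node dst, crossing one hypercube
   dimension (i.e., fixing one differing address bit) per hop. */
static void hypercube_route(unsigned src, unsigned dst)
{
    unsigned cur = src;
    int d;

    for (d = 0; d < DIM; d++) {
        if (((cur ^ dst) >> d) & 1u) {
            cur ^= 1u << d;    /* cross dimension d */
            printf("hop across dimension %d -> node %u\n", d, cur);
        }
    }
}

int main(void)
{
    hypercube_route(0, 2730);  /* at most DIM = 12 hops between any two nodes */
    return 0;
}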


The MasPar architecture
-----------------------
The main conceptual differences between the MasPar and the CM-2 are
the departure from 1-bit PEs to 4-bit PEs and the departure from a
router that uses "packet switching" to one that uses "circuit
switching" (using a multistage 1024x1024 crossbar interconnect).
Another important difference between the CM-2 and the MasPar is the
ability of the MasPar PEs to access their local memories
"independently". This allows the machine to effectively simulate an
SPMD operation (the PEs being at different states of the same
automaton).

 
The basic language for programming the MasPar is MPL, which is a SIMD
language (like C*). MPL introduces a new basic datatype (plural) and
allows various data types to be derived from it. The following are
some examples: 


plural int x ;             /* a plural int: one instance of x on every PE   */

plural int *x ;            /* a singular pointer to a plural int            */

plural int * plural x ;    /* a plural pointer (one per PE) to a plural int */


As mentioned above, a major difference between C* (of the CM-2)
and MPL is MasPar's ability to use parallel indirection efficiently.
For example, the following is a segment of code that results in
independent string copying in parallel:

p_strcpy(s, t)
  plural char * plural s, * plural t;
{
  while ( *s++ = *t++ ) ;
}
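
As a minimal usage sketch (the array size and per-PE contents are our
own invention; we also rely on the usual implicit promotion of
singular pointers to plural), each PE can copy its own, different
string:

plural char src[4], dst[4] ;

src[0] = 'a' + iproc % 26 ;   /* a different one-character string per PE     */
src[1] = '\0' ;

p_strcpy(dst, src) ;          /* all PEs copy their own strings in lock step */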


In MPL, communication is done through the "proc" construct, the
"xnet" construct, and the "router" construct. The following are
some examples.

int values[nproc] ;
plural int val ;
int i ;

for (i=0; i < nproc; i++) {
  proc[i].val = values[i] ;
}

The above piece of code copies the ith element of the vector values[]
(stored in the ACU memory) into the variable val of processor i. The
argument of the proc construct can be either one-dimensional, in which
case the MasPar is viewed as a one-dimensional SIMD architecture, or
two-dimensional, in which case the MasPar is viewed as a
two-dimensional (array) SIMD architecture. There is no support for
meshes of higher dimensions.
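
For example, the following sketch (our own; we assume proc takes its
indices in [y][x] order, which should be checked against the MPL
manual) reads val out of every PE into a front-end matrix:

int m[64][64] ;      /* assuming a 64x64 PE array (nxproc == nyproc == 64) */
plural int val ;
int ix, iy ;

for (iy = 0; iy < nyproc; iy++)
  for (ix = 0; ix < nxproc; ix++)
    m[iy][ix] = proc[iy][ix].val ;   /* read val of the PE at (ix, iy) */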

The xnet construct is used for nearest neighbor communication using
the 2-D mesh of the MasPar. For example:

plural int i, j, k ;

i = xnetN[1].j ;
xnetS[1].k = i ;

The above two statements will result in all active processing elements
fetching the value of j from their northern neighbor and storing it
locally in variable i and then sending that value to the variable k of
their southern neighbor. Except that i is not updated, this is
equivalent to

xnetS[1].k = xnetN[1].j ;

The xnet construct can be used with any integer offset and in any
one of the Xnet directions, namely: N, S, E, W, NE, NW, SE, and
SW. For example:

sum  = xnetN[2].val  + xnetE[2].val  + xnetS[2].val  + xnetW[2].val  +
       xnetNW[2].val + xnetNE[2].val + xnetSE[2].val + xnetSW[2].val ;
temp = xnetE[1].val ;
sum  = sum + xnetN[2].temp + xnetNW[2].temp + xnetS[2].temp + xnetSW[2].temp ;
temp = xnetN[1].val ;
sum  = sum + xnetW[2].temp + xnetSW[2].temp + xnetE[2].temp + xnetSE[2].temp ;
avg  = sum / 16 ;

The above code computes the average of the 16 values on the perimeter
of the 5x5 square centered at each PE. The first statement collects
the 8 perimeter cells that lie along the eight compass directions;
the remaining off-axis cells (e.g., two rows north and one column
east) cannot be reached in a single xnet operation, so they are
gathered in two hops: a neighbor's val is first copied into temp, and
temp is then fetched from two cells away.

The router construct is used for general (arbitrary-pattern)
communication. For example, the following function defines the
transpose operation (for a square PE array):

plural int transpose( p )
  plural int p;
{
  /* fetch p from the PE at the transposed grid position */
  p = router [ iyproc + nyproc * ixproc ].p ;
  return p ;
}

In the above code, iyproc, ixproc, and nyproc are system-defined
variables: iyproc gives the y coordinate of a processor, ixproc gives
its x coordinate, and nyproc gives the number of processors along the
y dimension. nxproc and nproc (the total number of processors) are
defined similarly.
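
As a quick usage sketch (our own example), every PE can exchange its
value with the PE at the mirror position:

plural int a ;

a = iproc ;          /* iproc: the system-defined linear processor number */
a = transpose(a) ;   /* PE (x,y) now holds the old value of PE (y,x)      */

Note that the mapping is a true transpose only when the PE array is
square (nxproc == nyproc).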

The MPL language extends the semantics of the "if", "switch", "while",
and "for" control statements depending on whether the control
expression is singular or plural. For example, the following is a
plural version of the if-then-else control statement.

plural int i, j ;
...
if ( i < 0 ) 
  j = -i ;
else
  j = i ;

Only those processing elements with a value of i < 0 will be active
in the then-body of the "if" statement. Similarly, only those
processing elements with a value of i >= 0 will be active
in the else-body of the "if" statement. Notice that any serial code
that exists in the then-branch will be executed if at least one
processing element is active in it. The same goes for the else-branch.
Thus, it is possible for singular assignments in the then-branch and
in the else-branch to both be executed! On the other hand, if the
control expression of the if statement is singular, then all the
active processing elements participate in the execution of the
then-branch (or else-branch), depending on its value.
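
The same extension gives loops their plural semantics: with a plural
control expression, each processing element keeps iterating as long as
its own test holds, and the loop terminates only when the test has
failed on every active element. A minimal sketch (the variables are
our own invention):

plural int n, steps ;

n = iproc % 5 ;     /* a different iteration count on each PE */
steps = 0 ;

while ( n > 0 ) {   /* plural control expression */
  n = n - 1 ;
  steps = steps + 1 ;
}

After the loop, steps on each PE equals its initial n; a PE whose test
fails simply sits out the remaining iterations.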

The MPL language includes a rich library of reduction and scan
operations. For example, the following code computes the singular sum j
of the plural variable i.

#include "mpl.h"

int j ;
plural int i ;

j = reduceAdd32(i) ;

Other interesting functions include: "sendwith___" and "scan___". Some
examples follow:

plural int dest, i, j ;
plural int state ;          /* nonzero = active participant */

if (state)
  j = sendwithAdd32(i,dest) ;

In the above code the "senders" and "receivers" are those active
processors (whose state == 1). If more than one sender sends data to a
particular receiver, the data is combined using Add (or Max, or Min,
etc.) Notice the difference between this semantics and the semantics
of sending and receiving on the CM-2/CM-5 C*.
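
The scan functions compute parallel prefix (running) operations over
the PEs. As a rough sketch (we are assuming here that scanAdd32 takes
the plural operand together with a plural segment flag marking the
first element of each segment; the exact name and signature should be
checked against the MPL library reference):

plural int i, psum, segment ;

psum = scanAdd32(i, segment) ;   /* running sums of i within each segment */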


The CM-5 architecture
---------------------
The CM-5 is a "synchronized MIMD" machine. It consists of a
combination of processing nodes and control processors, both built
around SPARC processors, each with 32 MB of memory. The processing
nodes are equipped with a vector processing unit. The control
processors are additionally equipped with I/O and disks. The CM-5 has
three networks: a data network (for point-to-point communication), a
control network (for broadcasts, synchronization, and scans), and a
diagnostic network (for diagnosis and hardware tests). The data and
control networks are connected to the processing nodes via a network
interface (NI) chip.

The system operates one or more user partitions, each consisting of a
control processor, a collection of processing nodes, and a dedicated
portion of the data and control networks. The data and control
networks can be accessed directly from user space (nonprivileged),
thus avoiding OS kernel overhead. Access to the diagnostic network and
to I/O, on the other hand, is privileged and goes through system
calls.

The data network of the CM-5 is based on the fat-tree
topology. Processing nodes, control processors, and I/O channels are
at the leaves of a tree whose channel capacities increase as we
ascend from the leaves to the root. The hierarchical nature of the
fat tree is exploited to give each user partition a dedicated subtree,
so that traffic from other partitions does not interfere with it. To
route a message from one processor to another, the message is sent up
the tree to the least common ancestor of the two processors and then
down to the destination. As the message goes up the tree it may have
several choices as to which parent connection to take. The decision is
resolved by pseudo-randomly selecting from amongst those parents that
are unobstructed by other messages. As messages go down, they have
no choice. The randomization of the traffic up the fat tree balances
the load on the network and avoids undue congestion caused by
pathological permutations.
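
The following toy model (our own sketch in plain C, not the actual
CM-5 router; lca_height, route, and the fanout parameter are all our
inventions) illustrates the up/down discipline: choice, resolved at
random, exists only on the way up.

#include <stdio.h>
#include <stdlib.h>

/* Height of the least common ancestor of two leaves: the position of
   the most significant bit in which their addresses differ. */
static int lca_height(unsigned src, unsigned dst)
{
    int h = 0;
    unsigned diff = src ^ dst;
    while (diff != 0) {
        diff >>= 1;
        h++;
    }
    return h;
}

static void route(unsigned src, unsigned dst, int fanout)
{
    int top = lca_height(src, dst);
    int h;

    /* Up phase: several parent links are available at each level; the
       hardware picks pseudo-randomly among the unobstructed ones
       (here we simply pick at random, ignoring congestion). */
    for (h = 0; h < top; h++)
        printf("up   to level %d via parent link %d\n", h + 1, rand() % fanout);

    /* Down phase: no choice; each step is dictated by one bit of the
       destination address. */
    for (h = top; h > 0; h--)
        printf("down to level %d toward child %u\n", h - 1, (dst >> (h - 1)) & 1u);
}

int main(void)
{
    route(5, 12, 2);   /* e.g., from leaf 5 to leaf 12, 2 parent links/node */
    return 0;
}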

The control network is a complete binary tree with all system
components at the leaves. Each user partition is assigned a subtree of
the network. The control network provides the synchronization
mechanisms necessary for allowing data-parallel code to be executed
efficiently.

