Lecture 19                                                    11/15/1994

SIMD Processing
---------------

The CM-2 architecture
---------------------

The CM-2 can be thought of as a "glorified" vector unit.  It consists of
an array of PEs connected to a scalar host (the front-end).  One way to
view the CM-2 is as a front-end attached to a memory that is distributed
among a number of PEs.

Data can be communicated between the front-end and the PE array in one
of three ways: broadcasting, global combining, and the scalar memory
bus.  In broadcasting, the front-end broadcasts a scalar value to all
PEs (casting a scalar variable/constant to a parallel
variable/constant).  In global combining, the front-end obtains the
scalar sum, max, min, OR, or AND of a parallel variable.  Using the
scalar memory bus, the front-end can read or write the memory of a
particular PE.

The CM-2 has one control unit for all the PEs.  A "parallel"
instruction is passed from the front-end to the controller, which
broadcasts a sequence of nanoinstructions to all PEs to execute that
instruction.  Data can be moved between the PEs using one of three
methods: the NEWS grids, the global router, and the scan operations.

The CM-2 PE array consists of up to 2048 processing nodes.  Each node
consists of a memory unit, a floating-point unit, and 32 bit-slice
processors (1 PE = a 3-input/2-output 1-bit ALU with latches) organized
into 2 groups of 16 PEs each.  Each 16-PE group is connected to a
NEWS/hypercube router interface.  The hypercube router consists of 4096
router nodes interconnected as a 12-dimensional hypercube.  A hardware
routing algorithm that may introduce misrouting is used.  While the
router is delivering messages, the PE array is idle.

The MasPar architecture
-----------------------

The main conceptual differences between the MasPar and the CM-2 are the
departure from 1-bit PEs to 4-bit PEs and the departure from a router
that uses packet switching to one that uses circuit switching (via a
multistage 1024x1024 crossbar interconnect).
Another important difference between the CM-2 and the MasPar is the
ability of the MasPar PEs to access their local memories
"independently".  This allows the machine to effectively simulate SPMD
operation (with different PEs at different states of the same
automaton).

The basic language for programming the MasPar is MPL, which is a SIMD
language (like C*).  MPL introduces a new basic data type (plural) and
allows various data types to be derived from it.  The following are
some examples:

    plural int x ;
    plural int *x ;
    plural int * plural x ;

As mentioned above, a major difference between the C* (of the CM-2) and
MPL is the MasPar's ability to use parallel indirection efficiently.
For example, the following segment of code results in independent
string copying in parallel:

    p_strcpy(s, t)
    plural char * plural s, * plural t ;
    {
        while ( *s++ = *t++ ) ;
    }

In MPL, communication is done through the "proc" construct, the "xnet"
construct, and the "router" construct.  The following are some
examples.

    int values[nproc] ;
    plural int val ;

    for (i = 0; i < nproc; i++)
        proc[i].val = values[i] ;

The above piece of code copies the ith element of the vector values[]
(stored in the ACU memory) into the variable val of processor i.  The
argument of the proc construct can be either one-dimensional, in which
case the MasPar is viewed as a one-dimensional SIMD architecture, or
two-dimensional, in which case the MasPar is viewed as a
two-dimensional (array) SIMD architecture.  There is no support for
meshes of higher dimensions.

The xnet construct is used for nearest-neighbor communication over the
2-D mesh of the MasPar.  For example:

    plural int i, j, k ;

    i = xnetN[1].j ;
    xnetS[1].k = i ;

The above two statements cause all active processing elements to fetch
the value of j from their northern neighbor, store it locally in
variable i, and then send that value to the variable k of their
southern neighbor.
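A sequential C sketch of what these two xnet statements do on a small
mesh may help.  The mesh size, the variable layout, and the treatment of
the mesh boundary (edge PEs simply skipped) are assumptions here:

```c
/* Sequential sketch of the two xnet statements above on a small 2-D
   mesh.  ROWS, COLS, and the edge handling are assumptions. */
#include <assert.h>

#define ROWS 4
#define COLS 4

/* One copy of each plural variable per mesh position (row, col). */
static int i_v[ROWS][COLS], j_v[ROWS][COLS], k_v[ROWS][COLS];

/* i = xnetN[1].j : fetch j from the northern neighbor (row - 1). */
void fetch_north_j(void)
{
    for (int r = 1; r < ROWS; r++)       /* PEs with a northern neighbor */
        for (int c = 0; c < COLS; c++)
            i_v[r][c] = j_v[r - 1][c];
}

/* xnetS[1].k = i : store i into k of the southern neighbor (row + 1). */
void store_south_k(void)
{
    for (int r = 0; r < ROWS - 1; r++)   /* PEs with a southern neighbor */
        for (int c = 0; c < COLS; c++)
            k_v[r + 1][c] = i_v[r][c];
}
```

Composing the two, k at row r+1 ends up holding j from row r-1, which is
why the pair collapses into a single xnet statement.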
The two statements above are together equivalent to

    xnetS[1].k = xnetN[1].j ;

The xnet construct can be used with any integer offset and in any one
of the xnet directions, namely: N, S, E, W, NE, NW, SE, and SW.  For
example:

    sum = xnetN[2].val + xnetE[2].val + xnetS[2].val + xnetW[2].val +
          xnetNW[2].val + xnetNE[2].val + xnetSE[2].val + xnetSW[2].val ;
    temp = xnetE[1].val ;
    sum = sum + xnetN[2].temp + xnetNW[2].temp +
                xnetS[2].temp + xnetSW[2].temp ;
    temp = xnetN[1].val ;
    sum = sum + xnetW[2].temp + xnetSW[2].temp +
                xnetE[2].temp + xnetSE[2].temp ;
    avg = sum / 16 ;

The above code computes, at each processing element, the average of the
16 values on the perimeter of the 5x5 square centered on it.

The router construct is used for general communication.  For example,
the following function defines the transpose operation.

    transpose( p )
    plural int p ;
    {
        p = router [ iyproc + nyproc * ixproc ].p ;
        return p ;
    }

In the above code, iyproc, ixproc, and nyproc are system-defined:
iyproc returns the y coordinate of a processor, ixproc returns the x
coordinate of a processor, and nyproc returns the total number of
processors in the y dimension.  Similarly defined are nxproc and
nproc.

MPL extends the semantics of the "if", "switch", "while", and "for"
control statements depending on whether the control expression is
singular or plural.  For example, the following is a plural version of
the if-then control statement.

    plural int i, j ;
    ...
    if ( i < 0 )
        j = -i ;
    else
        j = i ;

Only those processing elements with a value of i < 0 will be active in
the then-body of the "if" statement.  Similarly, only those processing
elements with a value of i >= 0 will be active in the else-body.
Notice that any serial code in the then-branch will be executed if at
least one processing element is active there, and the same goes for the
else-branch.  Thus, it is possible for two singular assignments, one in
each branch of a conditional, to both be executed!
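The masked execution just described can be sketched sequentially in C:
the SIMD machine runs both branches, each over only the subset of PEs
whose local test succeeds.  The PE count and the array-per-plural
representation are illustrative assumptions:

```c
/* Sequential sketch of a plural "if": both branches execute, each
   under an activity mask.  NPES and the names are assumptions. */
#include <assert.h>

#define NPES 8

static int i_pl[NPES], j_pl[NPES];  /* one slot per PE for "plural int i, j" */

void plural_if(void)
{
    /* then-branch: only PEs with i < 0 are active */
    for (int p = 0; p < NPES; p++)
        if (i_pl[p] < 0)
            j_pl[p] = -i_pl[p];

    /* else-branch: only PEs with i >= 0 are active */
    for (int p = 0; p < NPES; p++)
        if (i_pl[p] >= 0)
            j_pl[p] = i_pl[p];
}
```

Both loops always run, which is exactly why serial code placed in either
branch executes whenever at least one PE is active there.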
On the other hand, if the control expression of the if statement is
singular, then all the active processing elements participate in the
execution of the then-branch (or else-branch).

MPL includes a rich library of reduction and scan operations.  For
example, the following code computes the singular sum j of the plural
variable i.

    #include "mpl.h"

    int j ;
    plural int i ;

    j = reduceAdd32(i) ;

Other interesting functions include "sendwith___" and "scan___".  Some
examples follow:

    plural int dest, i, j ;
    plural bool state ;

    if (state)
        j = sendwithAdd32(i, dest) ;

In the above code, the "senders" and "receivers" are the active
processors (those whose state == 1).  If more than one sender sends
data to a particular receiver, the data is combined using Add (or Max,
or Min, etc.).  Notice the difference between these semantics and the
semantics of sending and receiving in the CM-2/CM-5 C*.

The CM-5 architecture
---------------------

The CM-5 is a "synchronized MIMD" machine.  It consists of a
combination of processing nodes and control processors, all of which
are SPARC processors, each with 32 MB of memory.  The processing nodes
are equipped with vector processing units; the control processors are
equipped with I/O and disks.

The CM-5 has three networks: a data network (for point-to-point
communication), a control network (for broadcasts, synchronization, and
scans), and a diagnostic network (for diagnosis and hardware tests).
The data and control networks are connected to the processing nodes via
a network interface (NI) chip.

The system operates one or more user partitions, each consisting of a
control processor, a collection of processing nodes, and a dedicated
portion of the data and control networks.  Access to the data and
control networks can be done directly from user space (nonprivileged),
thus avoiding OS kernel overhead.  Access to the diagnostic network and
to I/O, on the other hand, is privileged and goes through system calls.

The data network of the CM-5 is based on the fat-tree topology.
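Returning to sendwithAdd32 for a moment, its combining behavior can be
sketched sequentially.  The function name sendwith_add, the PE count,
and the use of plain int arrays for the plural variables are
illustrative assumptions:

```c
/* Sketch of "if (state) j = sendwithAdd32(i, dest);" : active PEs
   send i[p] to PE dest[p], and values colliding at the same
   destination are combined with "+".  Names and NPES are assumptions. */
#include <assert.h>

#define NPES 6

void sendwith_add(const int i[NPES], const int dest[NPES],
                  const int state[NPES], int j[NPES])
{
    for (int p = 0; p < NPES; p++)
        j[p] = 0;                 /* receivers accumulate from 0 */
    for (int p = 0; p < NPES; p++)
        if (state[p])             /* only active PEs participate */
            j[dest[p]] += i[p];   /* collisions combine with Add */
}
```

Swapping "+=" for a max or min update gives the sendwithMax/sendwithMin
variants mentioned above.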
Processing nodes, control processors, and I/O channels are at the
leaves of a tree whose channel capacities increase as we ascend from
the leaves to the root.  The hierarchical nature of the fat tree is
exploited to give each user partition a dedicated subtree, with which
traffic from other partitions does not interfere.

To route a message from one processor to another, the message is sent
up the tree to the least common ancestor of the two processors and
then down to the destination.  As the message goes up the tree, it may
have several choices as to which parent connection to take.  The
decision is resolved by pseudo-randomly selecting from among those
parents that are unobstructed by other messages.  As messages go down,
they have no choices.  The randomization of the traffic up the fat
tree balances the load on the network and avoids undue congestion
caused by pathological permutations.

The control network is a complete binary tree with all system
components at the leaves.  Each user partition is assigned a subtree
of the network.  The control network provides the synchronization
mechanisms necessary for executing data-parallel code efficiently.
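The up/down routing rule described for the data network can be sketched
for a complete binary tree over leaf (processor) indices: the height of
the least common ancestor is the position of the highest bit in which
the two addresses differ.  Leaf numbering and function names are
assumptions, and the real fat tree also offers multiple parallel links
per level, which this sketch ignores:

```c
/* Sketch of least-common-ancestor routing in a complete binary tree.
   Leaf numbering and names are assumptions; the CM-5 fat tree also
   has several parallel links per level, not modeled here. */
#include <assert.h>

/* Height of the least common ancestor of leaves src and dst
   (0 means src == dst). */
int lca_height(unsigned src, unsigned dst)
{
    int h = 0;
    for (unsigned diff = src ^ dst; diff != 0; diff >>= 1)
        h++;                  /* climb one level per remaining differing bit */
    return h;
}

/* A message travels lca_height hops up and the same number down. */
int route_hops(unsigned src, unsigned dst)
{
    return 2 * lca_height(src, dst);
}
```

The pseudo-random choice among free parents happens only on the way up;
once at the least common ancestor, the downward path is forced, as the
notes above state.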
Date of last update: November 1, 1994.