CS-551: Homework Assignment -- SIMD matrix multiplication on the MasPar

BU/CLA CS-551

Parallel Computing: Models, Languages, and Architectures

SIMD matrix multiplication on the MasPar

We all know that to compute the product of two matrices, we do:

  C(i,j) = A(i,1)*B(1,j) + A(i,2)*B(2,j) + ... + A(i,N)*B(N,j)

To do this on the MasPar, storing one element per processor, is tricky because we have to get the appropriate elements to the appropriate processors. Doing so using general communication might prove to be overly expensive.

Another way of doing matrix multiplication is using Pipelining. Let (+) and (*) denote element-wise addition and multiplication of two arrays. To explain this technique, consider the following 3x3 matrix multiplication of A and B to yield C:

 /             \         /             \         /             \
| C11  C12  C13 |       | A11  A12  A13 |       | B11  B22  B33 |
| C21  C22  C23 |   =   | A22  A23  A21 |  (*)  | B21  B32  B13 |  (+)
| C31  C32  C33 |       | A33  A31  A32 |       | B31  B12  B23 |
 \             /         \             /         \             /


                         /             \         /             \
                        | A12  A13  A11 |       | B21  B32  B13 |
                        | A23  A21  A22 |  (*)  | B31  B12  B23 |  (+)
                        | A31  A32  A33 |       | B11  B22  B33 |
                         \             /         \             /


                         /             \         /             \
                        | A13  A11  A12 |       | B31  B12  B23 |
                        | A21  A22  A23 |  (*)  | B11  B22  B33 |
                        | A32  A33  A31 |       | B21  B32  B13 |
                         \             /         \             /

In the above computation, the elements of the 3 by 3 arrays A and B were initially arranged so that successive circular shifts (shifting up for A and left for B) resulted in the elements of A and B be aligned at each processing element (PE) so that the cumulative sum for the matrix multiplication can be computed in C, which remains static with respect to the PEs. Assuming that A and B are square matrices that conform to the physical dimensions of the MasPar, answer the following questions:

Write an MPL program to do matrix multiplication storing Aij, Bij, and Cij in the same processor. Your program should use the router function to perform general communication to to move the data around.
Write an MPL program to do matrix multiplication using the above pipelined method (i.e. using nearest neighbor communication only). You should assume that, initially, Aij, Bij, and Cij are stored on the same processor (thus initial "skewing" to match the above pattern is necessary).
Assuming that the cost of a general communication using the router function is 10 times as expensive as nearest neighbor communication using (say) the Xnet function. Compare the performance of the two algorithms. (Make assumptions if necessary regarding the timing of other operations).

This document has been prepared by Professor Azer Bestavros <best@cs.bu.edu> as the WWW Home Page for CS-551, which is part of the NSF-funded undergraduate curriculum on parallel computing at BU.

Date of last update: November 22, 1994.