Lecture 02
			      09/15/1994
			      ----------

Theoretical models of parallel machines are useful in studying and
comparing the complexity (cost) of parallel algorithms without having
to worry about the details of the physical machines. In essence,
these models extend the well-known notions of space and time
complexity from serial algorithms to their parallel counterparts.

PRAM complexity model
---------------------
The Parallel Random Access Machine (PRAM) model views a multiprocessor
as a UMA machine consisting of a number of processors with a globally
addressable memory. The processors are synchronized in their read,
compute, and write memory cycles. In other words, any memory access
takes exactly the same (unit) amount of time; there is no
synchronization or memory access overhead. Since more than one
processor may read/write a particular memory address, a policy has to
be adopted to regulate concurrent reads and concurrent writes and
determine the outcome of such concurrent accesses. The following are
possible PRAM variants: 

EREW-PRAM: 
The Exclusive Read Exclusive Write PRAM model allows no concurrent
reads or writes to the same memory address.

CREW-PRAM:
The Concurrent Read Exclusive Write PRAM model allows different
processors to concurrently read the same memory address, but forbids
concurrent writes. 

CRCW-PRAM: 
The Concurrent Read Concurrent Write PRAM model allows different
processors to concurrently read/write the same memory
address. Conflicting writes are resolved by adopting a policy that
either restricts all written values to be identical, or else "chooses"
one of the written values according to some known policy (e.g. random,
priority, minimum). 
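
To make the write-conflict policies concrete, the following is a small
Python sketch (added for illustration; the function name and policy
labels are ours, not part of any standard library) that simulates the
write phase of one CRCW-PRAM cycle under a few of the policies above.

    def resolve_concurrent_writes(requests, memory, policy="priority"):
        """Resolve the write phase of one CRCW-PRAM cycle.

        `requests` is a list of (processor_id, address, value) tuples, all
        issued in the same synchronous cycle; conflicting writes to the
        same address are resolved according to `policy`.
        """
        by_address = {}
        for pid, addr, value in requests:
            by_address.setdefault(addr, []).append((pid, value))

        for addr, writers in by_address.items():
            if policy == "common":
                # All values written to the same address must be identical.
                values = {v for _, v in writers}
                assert len(values) == 1, "COMMON-CRCW policy violated"
                memory[addr] = values.pop()
            elif policy == "priority":
                # The lowest-numbered processor wins.
                memory[addr] = min(writers)[1]
            elif policy == "minimum":
                # The smallest written value wins.
                memory[addr] = min(v for _, v in writers)

    memory = [0] * 8
    resolve_concurrent_writes([(0, 3, 42), (5, 3, 7), (2, 6, 1)], memory)
    print(memory)    # processor 0 wins at address 3: [0, 0, 0, 42, 0, 0, 1, 0]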

Because the EREW-PRAM is the most restrictive of these models, an
EREW-PRAM solution to a problem generally has a time/space complexity
at least as high as that achievable on a CREW-PRAM or a CRCW-PRAM. It
can be shown,
however, that any CRCW-PRAM can be simulated by an EREW-PRAM with an
O(log n) slowdown, where n is the number of processors.
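
The key ingredient of such simulations is replacing each concurrent
access by an O(log n) broadcast in which the number of copies of the
requested value doubles every round. A minimal Python sketch of this
doubling idea, written as a sequential simulation (added here for
illustration only), is given below.

    def erew_broadcast(value, n):
        """Distribute `value` to n private cells using only exclusive accesses.

        In each round, every cell that already holds the value is copied into
        one new, distinct cell, so no two accesses ever touch the same cell
        in the same round (the EREW restriction).  The number of holders
        doubles each round, so ceil(log2(n)) rounds suffice.
        """
        cells = [None] * n
        cells[0] = value                 # the single exclusive read of the source
        holders = 1
        rounds = 0
        while holders < n:
            for i in range(min(holders, n - holders)):
                cells[holders + i] = cells[i]   # copy into a fresh cell
            holders = min(2 * holders, n)
            rounds += 1
        return cells, rounds

    cells, rounds = erew_broadcast(42, 8)
    print(rounds)    # 3 rounds for n = 8: the O(log n) slowdown per access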

Example: Devise an O(log n) matrix multiplication algorithm for 
         n^3 processors. 
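
A sketch of one classical approach (a sequential Python simulation
added for illustration; the function name pram_matmul is ours, and
this is not offered as the official solution): processor (i,j,k)
computes the product A[i][k]*B[k][j] in one step, and the n products
belonging to each entry are then summed by a binary-tree reduction in
O(log n) steps. Note that step 1 reads each A[i][k] and B[k][j]
concurrently, so the sketch fits the CREW model.

    def pram_matmul(A, B):
        """Sequential simulation of the O(log n) PRAM matrix multiplication.

        Step 1 (one parallel step): processor (i, j, k) forms A[i][k] * B[k][j].
        Step 2 (ceil(log2 n) parallel steps): the n products belonging to each
        entry (i, j) are summed by a balanced binary-tree reduction, where
        each while-iteration below corresponds to one parallel round.
        """
        n = len(A)
        # Step 1: all n^3 products (conceptually computed simultaneously).
        P = [[[A[i][k] * B[k][j] for k in range(n)]
              for j in range(n)] for i in range(n)]

        # Step 2: tree reduction for every entry of the result.
        C = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                vals = P[i][j]
                while len(vals) > 1:
                    vals = [vals[t] + vals[t + 1] if t + 1 < len(vals)
                            else vals[t] for t in range(0, len(vals), 2)]
                C[i][j] = vals[0]
        return C

    print(pram_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]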

Example: Devise an O(log n) matrix multiplication algorithm for
         n^3/log n processors. 
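
One way to approach this (again an illustrative sketch, not the
official solution) is to reschedule the previous algorithm: each of
the n^3/log n processors first sums about log n products sequentially,
and the remaining partial sums per entry are combined by the same tree
reduction, so the running time stays O(log n).

    import math

    def pram_matmul_fewer(A, B):
        """Same algorithm rescheduled onto n^3 / log n processors.

        Each (conceptual) processor first forms the sum of about log2(n)
        products sequentially, in O(log n) time; the n / log n partial sums
        per entry are then combined by the same O(log n) tree reduction as
        before, so the total time is still O(log n).
        """
        n = len(A)
        g = max(1, math.ceil(math.log2(n)))           # products per processor
        C = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                partials = [sum(A[i][k] * B[k][j]
                                for k in range(s, min(s + g, n)))
                            for s in range(0, n, g)]  # sequential, O(log n)
                C[i][j] = sum(partials)               # tree-summed on a PRAM
        return C

    print(pram_matmul_fewer([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    # [[19, 22], [43, 50]]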


A VLSI complexity model
-----------------------
When algorithms are "hardwired" into VLSI chips, the "size" of the
chip and its "capacity" to process information by moving it from one
point to another become interesting complexity measures. We identify
two such measures: one bounds "memory" (size) and the other bounds
"I/O" (capacity). 

Today's VLSI technology is 2-dimensional, which means that VLSI chips
are manufactured by laying out circuitry in 2 dimensions. [The fact
that VLSI chips consist of many layers of 2-D circuitry doesn't change
the dimensionality; it simply increases the 2-D area of the chip by a
constant factor.] Since the size of a VLSI chip is proportional to the
amount of memory (space) such a chip can hold, we can estimate the
space complexity of an algorithm by the Area (A) of the VLSI chip
implementation of that algorithm. If T is the time (latency) needed to
terminate the algorithm, then A.T gives an upper bound on the total
number of bits processed through the chip (or I/O). It has been argued
that an appropriate quantity to bound both the space (VLSI area A) and
time (T) requirements of an algorithm is given by

                         A.T^2 = Omega(f(s))

which means that there is a lower bound on the area and latency of any
VLSI algorithm that implements a solution to the problem of size
s. Using the VLSI complexity model, we concern ourselves with
estimating the smallest A.T^2 achievable by a chip that realizes a
given algorithm.
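
As a concrete instance of how such a bound can be met, consider a
mesh-style design whose area is proportional to n^2 cells and whose
latency is proportional to n steps (one way of approaching the matrix
multiplication examples below):

                 A = O(n^2) and T = O(n)  ==>  A.T^2 = O(n^4)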

Example: Devise an O(n^4) VLSI chip for matrix multiplication using
         2-D broadcasting.

Example: Devise an O(n^4) VLSI chip for matrix multiplication using
         nearest-neighbor (local) communication only.
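
One well-known nearest-neighbor design is Cannon's algorithm on an
n x n mesh: O(n^2) cells and O(n) steps, hence A.T^2 = O(n^4). The
Python simulation below (added for illustration) captures the data
movement only; the previous example can be approached similarly, with
the shifts replaced by row and column broadcasts of the appropriate A
and B elements.

    def cannon_matmul(A, B):
        """Sequential simulation of Cannon's algorithm on an n x n mesh.

        Each of the n^2 cells holds one element of A, one of B, and a running
        sum; in every step each cell multiplies its pair and then passes its
        A element left and its B element up to a nearest neighbour, so all
        communication is local.  Area is O(n^2) cells and time is O(n) steps.
        """
        n = len(A)
        # Initial skew: row i of A shifted left by i, column j of B up by j.
        a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
        b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
        c = [[0] * n for _ in range(n)]

        for _ in range(n):               # n synchronous steps
            for i in range(n):
                for j in range(n):
                    c[i][j] += a[i][j] * b[i][j]
            a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]  # shift left
            b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]  # shift up
        return c

    print(cannon_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]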

Example: Devise an O(n^2) VLSI chip for multiplying two n-word
         integers using 1-word multiplication units. 
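
The arithmetic such a chip must carry out is the schoolbook
(convolution) product of the two n-word operands, which uses exactly
n^2 one-word multiplications. The Python sketch below (added for
illustration; it shows the word-level arithmetic only, not a chip
layout or its area/time figures) makes the n^2 word products and the
carry propagation explicit.

    def multiword_multiply(x, y, word_bits=16):
        """Schoolbook product of two n-word integers using 1-word multiplies.

        x and y are lists of n words, least-significant word first.  The n^2
        word-by-word products are accumulated by position (a convolution)
        and the carries are then propagated; on a chip each product would be
        produced by a 1-word multiplication unit.
        """
        base = 1 << word_bits
        n = len(x)
        acc = [0] * (2 * n)
        for i in range(n):
            for j in range(n):
                acc[i + j] += x[i] * y[j]     # one 1-word multiplication
        # Carry propagation brings every digit back below the word base.
        result = []
        carry = 0
        for d in acc:
            carry += d
            result.append(carry % base)
            carry //= base
        return result                          # 2n words, least-significant first

    # 0x0001_0002 * 0x0003_0004 with 16-bit words (least-significant first):
    print(multiword_multiply([2, 1], [4, 3]))  # [8, 10, 3, 0]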


Architectural Development Tracks
--------------------------------
The evolution of parallel computers has proceeded along three tracks.
These tracks are distinguished by the similarity of their underlying
parallel computational models.


                                        / - Multiprocessor track
       .--> Multiple processor track  -<
       |                                \ - Multicomputer track
       /
      /                                 / - Vector track
   --< ---> Multiple data track       -<
      \                                 \ - SIMD track
       \
       |                                / - Multithreaded track
       `--> Multiple threads track    -<
                                        \ - Dataflow track


In the multiple processor track, the source of parallelism is assumed
to be the concurrent execution of different threads on different
processors, with communication occurring either through shared memory
(multiprocessor track) or via message passing (multicomputer
track). In the multiple data track, the source of parallelism is
assumed to be the opportunity to execute the same code on massive
amounts of data. This could be through the execution of the same
instruction on a sequence of data elements (vector track) or through
the execution of the same sequence of instructions on similar data
sets (SIMD track). In the multiple threads track, the source of
parallelism is assumed to be the interleaved execution of different
threads on the same processor so as to hide synchronization delays
between threads executing on different processors. Thread interleaving
could be coarse (multithreaded track) or fine (dataflow track).


This document has been prepared by Professor Azer Bestavros <best@cs.bu.edu> as the WWW Home Page for CS-551, which is part of the NSF-funded undergraduate curriculum on parallel computing at BU.

Date of last update: September 29, 1994.