Lecture 02                                                    09/15/1994
----------

Theoretical models of parallel machines are useful in studying and
comparing the complexity (cost) of parallel algorithms without having
to worry about the details of physical machines. In essence, these
models extend the well-known notions of the space and time complexity
of serial algorithms to the parallel setting.

PRAM complexity model
---------------------

The Parallel Random Access Machine (PRAM) model views a multiprocessor
as a UMA machine consisting of a number of processors with a globally
addressable memory. The processors are synchronized in their read,
compute, and write memory cycles. In other words, any memory access
takes exactly the same (unit) amount of time; there is no
synchronization or memory access overhead.

Since more than one processor may read/write a particular memory
address, a policy has to be adopted to regulate concurrent reads and
concurrent writes and to determine the outcome of such concurrent
accesses. The following are the standard PRAM variants:

EREW-PRAM: The Exclusive Read Exclusive Write PRAM model allows no
  concurrent reads or writes to the same memory address.

CREW-PRAM: The Concurrent Read Exclusive Write PRAM model allows
  different processors to concurrently read the same memory address,
  but forbids concurrent writes.

CRCW-PRAM: The Concurrent Read Concurrent Write PRAM model allows
  different processors to concurrently read/write the same memory
  address. Conflicting writes are resolved by adopting a policy that
  either requires all written values to be identical, or else chooses
  one of the written values according to some known rule (e.g. random,
  priority, minimum).

A solution to a problem on the EREW-PRAM model will in general have a
time/space complexity at least as high as that achievable on a
CREW-PRAM or a CRCW-PRAM, since the weaker model forbids the concurrent
accesses that the stronger models exploit. It can be shown, however,
that any CRCW-PRAM algorithm can be simulated on an EREW-PRAM with an
O(log n) slowdown, where n is the number of processors.

Example: Devise an O(log n) matrix multiplication algorithm for n^3
         processors.

Example: Devise an O(log n) matrix multiplication algorithm for
         n^3/log n processors.
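To make the CRCW write-conflict policies above concrete, here is a
minimal serial sketch in C (added for illustration; it is not part of
the original notes, and the function names are made up). Each function
takes the values that several processors attempt to write to one memory
cell in the same cycle and returns the value the cell holds afterwards.

    #include <stdio.h>
    #include <limits.h>

    /* COMMON policy: all writers must agree; -1 is an illustrative
       error flag for the (illegal) case where they do not. */
    int resolve_common(const int vals[], int p)
    {
        int i;
        for (i = 1; i < p; i++)
            if (vals[i] != vals[0])
                return -1;          /* conflicting writes: not allowed */
        return vals[0];
    }

    /* PRIORITY policy: the lowest-numbered processor wins.
       vals[] is indexed by processor number. */
    int resolve_priority(const int vals[], int p)
    {
        (void)p;
        return vals[0];
    }

    /* MINIMUM policy: the smallest written value wins. */
    int resolve_minimum(const int vals[], int p)
    {
        int i, m = INT_MAX;
        for (i = 0; i < p; i++)
            if (vals[i] < m)
                m = vals[i];
        return m;
    }

    int main(void)
    {
        int writes[3] = {7, 3, 9};  /* processors 0..2 write one cell */
        printf("priority: %d\n", resolve_priority(writes, 3)); /* 7  */
        printf("minimum:  %d\n", resolve_minimum(writes, 3));  /* 3  */
        printf("common:   %d\n", resolve_common(writes, 3));   /* -1 */
        return 0;
    }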
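The first example above has a classical CREW-PRAM solution: processor
(i,j,k) computes the product A[i][k]*B[k][j] in one parallel step (each
matrix entry is read concurrently by n processors, hence CREW), and the
n products contributing to each C[i][j] are then summed in log n
parallel reduction steps. The C program below, a sketch written for
these notes (the size N and the test inputs are arbitrary), simulates
the parallel steps serially; each innermost loop body corresponds to
the work of one virtual processor in one cycle.

    #include <stdio.h>

    #define N 4                     /* matrix size; assumed a power of 2 */

    int main(void)
    {
        int A[N][N], B[N][N];
        int t[N][N][N];             /* t[i][j][k]: one partial product */
        int i, j, k, step;

        /* Arbitrary test input (made up for illustration). */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                A[i][j] = i + j;
                B[i][j] = (i == j); /* B = identity, so C should equal A */
            }

        /* Parallel step 1: processor (i,j,k) computes one product.
           The three loops serialize the n^3 virtual processors. */
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++)
                    t[i][j][k] = A[i][k] * B[k][j];

        /* log n further parallel steps: binary-tree reduction.  In
           each step, processor (i,j,k) with k < step adds
           t[i][j][k+step] into t[i][j][k]; at the end,
           t[i][j][0] = C[i][j]. */
        for (step = N / 2; step >= 1; step /= 2)
            for (i = 0; i < N; i++)
                for (j = 0; j < N; j++)
                    for (k = 0; k < step; k++)
                        t[i][j][k] += t[i][j][k + step];

        for (i = 0; i < N; i++) {   /* print C = A * B */
            for (j = 0; j < N; j++)
                printf("%4d", t[i][j][0]);
            printf("\n");
        }
        return 0;
    }

For the second example, letting each of the n^3/log n processors first
accumulate log n partial products serially, before the tree reduction,
keeps the total time at O(log n) with a factor of log n fewer
processors (an application of Brent's theorem).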
A VLSI complexity model
-----------------------

When algorithms are "hardwired" into VLSI chips, the "size" of the chip
and its "capacity" to process information by moving it from one point
to another become interesting complexity measures. We identify two such
measures: one bounds "memory" (size) and the other bounds "I/O"
(capacity).

Today's VLSI technology is 2-dimensional, which means that VLSI chips
are manufactured by laying out circuitry in 2 dimensions. [The fact
that VLSI chips consist of many layers of 2-D circuitry doesn't change
the dimensionality; it simply increases the 2-D area of the chip by a
constant factor.]

Since the size of a VLSI chip is proportional to the amount of memory
(space) such a chip can hold, we can estimate the space complexity of
an algorithm by the area (A) of the VLSI chip implementing that
algorithm. If T is the time (latency) needed for the algorithm to
terminate, then A.T gives an upper bound on the total number of bits
processed through the chip (its I/O). It has been argued that an
appropriate quantity bounding both the space (VLSI area A) and time (T)
requirements of an algorithm is

    A.T^2 = Omega(f(s))

which says that the area and latency of any VLSI implementation of a
solution to a problem of size s are jointly bounded from below.

Using the VLSI complexity model, we concern ourselves with estimating
the smallest A.T^2 that realizes an algorithm.

Example: Devise an O(n^4) VLSI chip for matrix multiplication using
         2-D broadcasting.

Example: Devise an O(n^4) VLSI chip for matrix multiplication using
         nearest-neighbor (local) communication only.

Example: Devise an O(n^2) VLSI chip for multiplying two n-word
         integers using 1-word multiplication units.
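As a worked illustration of how the bound is used (added here; it reads
the O(n^4) in the matrix-multiplication examples above as a statement
about A.T^2, and assumes a lower bound of A.T^2 = Omega(n^4) for n x n
matrix multiplication):

    % Illustrative consequences of the assumed bound A T^2 = Omega(n^4).
    \[
      A\,T^2 \;\ge\; c\,n^4 \quad\Longrightarrow\quad
      \begin{cases}
        A = \Theta(n^2) \;\Rightarrow\; T = \Omega(n)
          & \text{(one cell per matrix entry),}\\
        T = \Theta(\log n) \;\Rightarrow\; A = \Omega(n^4/\log^2 n)
          & \text{(fastest conceivable time).}
      \end{cases}
    \]

The point is the trade-off: halving T requires quadrupling A, so a
design is tuned along the curve A.T^2 = constant rather than by
minimizing A or T alone.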
Architectural Development Tracks
--------------------------------

The evolution of parallel computers has proceeded along three tracks,
distinguished by the parallel computational model each assumes:

                                            / - Multiprocessor track
          .--> Multiple processor track  -<
          |                                 \ - Multicomputer track
          |
          |                                 / - Vector track
       ---+--> Multiple data track       -<
          |                                 \ - SIMD track
          |
          |                                 / - Multithreaded track
          `--> Multiple threads track    -<
                                            \ - Dataflow track

In the multiple processor track, the source of parallelism is assumed
to be the concurrent execution of different threads on different
processors, with communication occurring either through shared memory
(multiprocessor track) or via message passing (multicomputer track).

In the multiple data track, the source of parallelism is assumed to be
the opportunity to execute the same code on massive amounts of data.
This could be through the execution of the same instruction on a
sequence of data elements (vector track) or through the execution of
the same sequence of instructions on similar data sets (SIMD track).

In the multiple threads track, the source of parallelism is assumed to
be the interleaved execution of different threads on the same
processor, so as to hide synchronization delays between threads
executing on different processors. Thread interleaving could be coarse
(multithreaded track) or fine (dataflow track).

Date of last update: September 29, 1994.