Lecture 06                                                        09/29/1994

Parallelism profile for a program
---------------------------------
The number of processors used by a program at a particular point in time
defines the Degree Of Parallelism (DOP). The plot of DOP against time is
called the parallelism profile of the program. Changes in the DOP depend on
many factors (algorithm, available resources, compiler optimizations, etc.).
The average value of the DOP (say A) defines the average parallelism of the
program over a period of time. The maximum value of the DOP (say m) defines
the peak achievable parallelism for that program. Let ti denote the total
amount of time during which the program's DOP = i. In the presence of an
infinite number of processors (i.e. n >> m), we define the average DOP to be
the ratio

            Sum for i=1,m ( i * ti )
     A  =   ------------------------
            Sum for i=1,m (   ti   )

Harmonic Mean Speedup
---------------------
Assume that a program may execute in different modes. We say that a program
executes in mode i if i processors are used. Define Ri to be the rate (e.g.
MIPS) at which execution proceeds with i processors. Suppose that a given
program is in mode i with probability fi, where 1 <= i <= n and n is the
number of available modes. We define the Harmonic Mean Speedup to be:

               Sequential execution time (total work)
     S(n)  =   -----------------------------------------
               Expected time of execution over all modes

W.L.O.G. we assume that Ri = i (i.e. R1 = 1). Also, we assume that
communication overhead is negligible. Under these conditions, we get:

                           1
     S(n)  =   -----------------------
               Sum for i=1,n ( fi/Ri )

Amdahl's Law
------------
If a program has only two modes of execution, serial or "maximally"
parallel, then the above Harmonic Mean Speedup reduces to:

                       n
     S(n)  =   -----------------
               1 + (n - 1) alpha

where n is the available number of processors and alpha is the probability
that the program will be in serial mode (i.e. no parallelism can be
exploited). Amdahl's Law says that the maximum achievable speedup is
upper-bounded by 1/alpha. So, for example, if a program executes in serial
mode 20% of the time, then even if an infinite number of processors is
available, one should not expect a speedup of more than 5. The constant
alpha is called the sequential bottleneck of the program. Notice that
Amdahl's Law holds even though communication overhead is excluded; including
it would only exacerbate the sequential bottleneck!

System efficiency, Redundancy, and Utilization
----------------------------------------------
If T(s,n) is the time it takes to execute a program of size s on n
processors (excluding communication overhead) and h(s,n) is the
communication overhead incurred, then the Mean Speedup for that program is

                      T(s,1)
     S(s,n)  =   ---------------
                 T(s,n) + h(s,n)

Ignoring communication overhead we get:

                 T(s,1)     Sequential execution time
     S(s,n)  =   ------  =  -------------------------
                 T(s,n)       Parallel execution time

Ideally we would want S(s,n) = n; therefore we define the efficiency of a
parallel solution to be

                 S(s,n)      T(s,1)
     E(s,n)  =   ------  =  --------
                    n       n.T(s,n)

Furthermore, if O(s,n) is the total number of operations executed on n
processors for a problem of size s, then the redundancy of the execution
(the extent of the workload increase in going from serial to parallel
execution) is:

                 O(s,n)
     R(s,n)  =   ------
                 O(s,1)

and the utilization is

                 O(s,n)     S(s,n)
     U(s,n)  =   ------  *  ------
                 O(s,1)        n

Assuming O(s,1) = T(s,1) (i.e. an operation takes a unit of time), we get

               O(s,n)
     U(n)  =  --------
              n.T(s,n)
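As a quick sanity check of these definitions, here is a minimal Python sketch
that evaluates the harmonic mean speedup, the Amdahl bound, and the
efficiency/redundancy/utilization ratios. The function names and all numeric
values in the driver are made up for illustration; they are not measurements
from any real program.

# Minimal sketch of the performance metrics defined above.
# All numbers in the driver are made-up illustrations, not measurements.

def harmonic_mean_speedup(f):
    """S(n) = 1 / Sum_i (fi / Ri), assuming Ri = i (mode i uses i processors)."""
    return 1.0 / sum(fi / i for i, fi in enumerate(f, start=1))

def amdahl_speedup(n, alpha):
    """S(n) = n / (1 + (n - 1) * alpha); the serial fraction alpha bounds S by 1/alpha."""
    return n / (1.0 + (n - 1) * alpha)

def efficiency(T1, Tn, n):
    """E(s,n) = T(s,1) / (n * T(s,n)), ignoring communication overhead."""
    return T1 / (n * Tn)

def redundancy(O1, On):
    """R(s,n) = O(s,n) / O(s,1): extra operations introduced by the parallel version."""
    return On / O1

def utilization(O1, On, T1, Tn, n):
    """U(s,n) = R(s,n) * E(s,n); reduces to O(s,n) / (n * T(s,n)) when O(s,1) = T(s,1)."""
    return redundancy(O1, On) * efficiency(T1, Tn, n)

if __name__ == "__main__":
    # A program that is serial 20% of the time and fully parallel 80% of the
    # time on 4 processors: f1 = 0.2, f4 = 0.8 (f2 = f3 = 0).
    print(harmonic_mean_speedup([0.2, 0.0, 0.0, 0.8]))   # 2.5
    print(amdahl_speedup(4, 0.2))                        # 2.5, same two-mode case
    print(amdahl_speedup(10**6, 0.2))                    # ~5, the 1/alpha ceiling
    # Hypothetical counts: T(s,1)=100, T(s,4)=30, O(s,1)=100, O(s,4)=110.
    print(efficiency(100, 30, 4), redundancy(100, 110),
          utilization(100, 110, 100, 30, 4))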
Quality of Parallel Execution
-----------------------------
The quality of a parallel execution is a measure that combines speedup,
efficiency, and redundancy:

                 S(s,n) E(s,n)
     Q(s,n)  =   -------------
                     R(s,n)

Since both the efficiency and the inverse of the redundancy are fractions
(at most 1), we conclude that Q(s,n) <= S(s,n).

Are MIPS and MFLOPS good enough?
--------------------------------
No. To factor out issues such as clock speed and instruction set
architecture, one should use benchmarks. Benchmarks can be "real" programs
(such as Spice or LaTeX) or synthetic, meaning that they are synthesized to
reflect a particular kind of application. Current industry-standard
synthetic benchmark metrics include KLIPS, KDhrystones/s, KWhetstones/s, etc.

The notion of scalability
-------------------------
Let s be the size of a problem and n be the size of the parallel machine to
be used to solve that problem. The amount of "useful" work to be done is
w(s) (a function of the problem size) and the communication overhead is
h(s,n) (a function of the problem size and of the machine size and
topology). Assuming 0% redundancy, the efficiency of a parallel algorithm
can be written as:

                w(s)
     E  =  -------------
           w(s) + h(s,n)

With a fixed problem size, the efficiency decreases as n increases.
Generally speaking, h(s,n) grows slower than w(s) as s increases; thus, with
a fixed machine size, the efficiency increases as s increases.
Consequently, as the size of a machine increases, one has to increase the
problem size in order to maintain a constant efficiency. The "amount of
growth" needed in the problem size to balance an increase in machine size
(for a constant efficiency) is called the isoefficiency function. The lower
the order of the isoefficiency function, the "more scalable" the
architecture is for that problem.

Example: It can be shown that for FFT on an n-node hypercube,
h(s,n) = O(n.log n + s.log n), whereas for FFT on a 2-D n-node mesh,
h(s,n) = O(n.log n + s.sqrt(n)). The workload (the amount of useful work)
for FFT is O(s.log s). To compute the isoefficiency for each architecture
we equate w(s) to h(s,n). For the hypercube we get

     O(n.log n + s.log n) = O(s.log s)    ==>  s = O(n)
                                          ==>  h(s,n) = O(n.log n)

For the mesh we get

     O(n.log n + s.sqrt(n)) = O(s.log s)  ==>  s = O(k**sqrt(n))
                                          ==>  h(s,n) = O(sqrt(n).k**sqrt(n))

Obviously the hypercube is a "more" scalable topology for FFT than the mesh.
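The trend behind this isoefficiency argument can be made concrete with a
small Python sketch of E = w(s) / (w(s) + h(s,n)) for the FFT case. The
big-O expressions above hide constants; here every constant is simply taken
to be 1, and the helper names (w_fft, h_hypercube, h_mesh, eff) are my own,
so the numbers only illustrate the trend, not actual FFT performance.

import math

def w_fft(s):
    return s * math.log2(s)                      # useful work: O(s.log s)

def h_hypercube(s, n):
    return n * math.log2(n) + s * math.log2(n)   # overhead on an n-node hypercube

def h_mesh(s, n):
    return n * math.log2(n) + s * math.sqrt(n)   # overhead on a 2-D n-node mesh

def eff(w, h, s, n):
    """E = w(s) / (w(s) + h(s,n)), with all hidden constants set to 1."""
    return w(s) / (w(s) + h(s, n))

if __name__ == "__main__":
    s = 2**16                                    # fixed problem size
    for n in (16, 64, 256, 1024):
        print(n,
              round(eff(w_fft, h_hypercube, s, n), 3),
              round(eff(w_fft, h_mesh, s, n), 3))
    # For fixed s, both efficiencies drop as n grows, and the mesh drops
    # faster: to hold E constant, s must grow like O(n) on the hypercube but
    # like O(k**sqrt(n)) on the mesh, as derived above.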
Fixed load/time/space speedup laws
----------------------------------
Amdahl's Law provides an upper bound on the achievable speedup given a
workload of fixed size; it is a fixed-load speedup law. In other words,
speedup comes from the reduction in CPU time (we wish to solve the same
problem faster on a larger machine). For many applications, it may be more
realistic to make a fixed-time assumption, whereby speedup comes from the
increase in problem size or amount of work (given a particular time frame,
we wish to solve larger problems on larger machines). In addition to the
fixed-time assumption, we may have a fixed-space assumption: given a
particular time frame and an upper bound on memory, we wish to solve larger
problems on larger machines.

Example: The solution of a discrete Laplace equation in 3 dimensions
involves computing, for each grid point, the average of its 6 neighboring
points. Assume that the problem size is N*N*N and that p processors are
available. Furthermore, assume that each processor performs 100 MFLOPS and
has 32 MB of memory. Assuming a fixed problem size of 10x10x10, how much
"speedup" is achievable if the communication latency for exchanging data
between two neighbors is 1 usec? Repeat for the case where the
communication latency is 2 usec. What is the minimum problem size needed to
achieve the best efficiency for communication latencies of 1 and 2 usec?
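One possible way to set up the fixed-load part of this example is sketched
below in Python. The partitioning (slabs along one axis), the flop count
per grid point (6: five additions and one division), and the cost model of
one latency per neighbor exchange per sweep are assumptions made here for
illustration; they are not given in the problem statement, and different
assumptions lead to different answers, so this is not the intended solution,
only a template for the calculation.

# Rough cost model for one relaxation sweep of the N*N*N Laplace problem.
# Assumptions (not from the problem statement): slab partitioning along one
# axis, 6 flops per grid point, and 2 neighbor exchanges per sweep.

FLOPS_PER_PROC = 100e6      # 100 MFLOPS per processor (given)
FLOPS_PER_POINT = 6         # assumed cost of averaging the 6 neighbors

def time_serial(N):
    """One sweep on a single processor."""
    return (N**3 * FLOPS_PER_POINT) / FLOPS_PER_PROC

def time_parallel(N, p, latency):
    """One sweep with the grid split into p slabs; each processor exchanges
    its two boundary planes with its neighbors (assumed: one latency each)."""
    compute = (N**3 * FLOPS_PER_POINT) / (p * FLOPS_PER_PROC)
    communicate = 0.0 if p == 1 else 2 * latency
    return compute + communicate

def speedup(N, p, latency):
    return time_serial(N) / time_parallel(N, p, latency)

if __name__ == "__main__":
    for lat in (1e-6, 2e-6):     # the two latencies from the example
        print(lat, [round(speedup(10, p, lat), 2) for p in (2, 5, 10)])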
Date of last update: September 29, 1994.