Lecture 06
                              09/29/1994

Parallelism profile for a program
---------------------------------
The number of processors used by a program at a particular point in
time defines the Degree Of Parallelism (DOP). The plot of DOP against
time is called the parallelism profile for the program. Changes in the
DOP depend on many factors (algorithm, available resources, compiler
optimizations, etc.). The average value of the DOP (say A) defines the
average parallelism for that program over a period of time. The
maximum value of the DOP (say m) defines the peak achievable
parallelism for that program. Let ti denote the total amount of time
during which the program's DOP equals i. Assuming that the number of
available processors is effectively unlimited (i.e. n >> m), we define
the average DOP to be the ratio
				   
                        Sum for i=1,m ( i * ti )
                  A =  --------------------------
                        Sum for i=1,m ( ti )
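
As a concrete illustration, the following Python sketch computes A and m
from a hypothetical parallelism profile, represented as a list of (i, ti)
pairs (the profile values below are made up for the example):

    # Sketch: average DOP from a (hypothetical) parallelism profile.
    # Each pair is (DOP i, total time ti spent executing at that DOP).
    profile = [(1, 4.0), (2, 3.0), (4, 2.0), (8, 1.0)]

    total_time    = sum(t for _, t in profile)        # Sum for i=1,m ( ti )
    weighted_time = sum(i * t for i, t in profile)    # Sum for i=1,m ( i * ti )

    A = weighted_time / total_time                    # average DOP
    m = max(i for i, _ in profile)                    # peak achievable parallelism
    print("average DOP A =", A, " peak DOP m =", m)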



Harmonic Mean Speedup
---------------------
Assume that a program may execute in different modes. We say that a
program is executed in mode i if i processors are used. Define Ri to
be the rate (e.g. MIPS) at which the execution proceeds with i
processors. Suppose that a given program executes in mode i with
probability fi, where 1 <= i <= n and n is the number of available
modes. We define the Harmonic Mean Speedup to be:

                    Sequential execution time (total work)
           S(n) =  -----------------------------------------
                   Expected time of execution over all modes

Without loss of generality, assume that Ri = i (i.e. R1 = 1), and that
communication overhead is negligible. Under these conditions, we get:

                                     1
                S(n) = ---------------------------
                         Sum for i=1,n ( fi/Ri )
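
A minimal sketch of this computation, assuming Ri = i and a hypothetical
vector of mode probabilities f (which must sum to 1):

    # Sketch: harmonic mean speedup S(n) = 1 / Sum for i=1,n (fi/Ri), with Ri = i.
    # The mode probabilities below are hypothetical.
    f = [0.4, 0.3, 0.2, 0.1]       # f[i-1] = probability of executing in mode i

    S = 1.0 / sum(fi / i for i, fi in enumerate(f, start=1))
    print("harmonic mean speedup S(n) =", S)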

Amdahl's Law
------------
If a program has two modes of execution: serial, or "maximally"
parallel, then the above Harmonic Mean Speedup reduces to: 

                                  n
                  S(n) = ----------------------
                           1 + (n - 1) alpha

where n is the available number of processors and alpha is the
probability that the program will be in a serial mode (i.e. no
parallelism could be exploited). Amdahl's Law implies that the maximum
achievable speedup is upper-bounded by 1/alpha. So (for example),
if a program executes in serial mode 20% of the time, then even if an
infinite number of processors is available, one should not expect a
speedup of more than 5. The constant alpha is called the sequential
bottleneck of the program. Notice that Amdahl's Law holds even though
communication overhead is excluded. Including it would only exacerbate
the sequential bottleneck!
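
The sequential bottleneck is easy to see numerically. The sketch below uses
alpha = 0.2 (the 20% example above) and shows S(n) approaching, but never
exceeding, 1/alpha = 5 as n grows:

    # Sketch: Amdahl's fixed-load speedup S(n) = n / (1 + (n - 1) * alpha).
    alpha = 0.2                    # serial fraction (sequential bottleneck)

    for n in [1, 2, 4, 16, 256, 10**6]:
        S = n / (1 + (n - 1) * alpha)
        print(f"n = {n:>7}   S(n) = {S:.3f}")

    print("upper bound 1/alpha =", 1 / alpha)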


System efficiency, Redundancy, and Utilization
----------------------------------------------
If T(s,n) is the length of time it takes to execute a program of size
s on n processors (excluding communication overhead), and if h(s,n) is
the communication overhead incurred, then the speedup for that program is:

                 T(s,1)
    S(s,n) = --------------
             T(s,n) + h(s,n)

Ignoring the communication overhead, we get:

             T(s,1)   Sequential execution time
    S(s,n) = ------ = -------------------------
             T(s,n)    Parallel execution time

Ideally, we would want S(s,n) = n; we therefore define the efficiency of
a parallel solution to be

             S(s,n)    T(s,1)
    E(s,n) = ------ = --------
                n     n.T(s,n)

Furthermore, if O(s,n) is the total number of operations executed on n
processors for a problem of size s, then the redundancy of the
execution (the extent of the workload increase for going from serial
to parallel execution) is:  

             O(s,n)
    R(s,n) = ------
             O(s,1)

and the utilization (which is the product R(s,n).E(s,n)) is

             O(s,n)   S(s,n)
    U(s,n) = ------ * ------
             O(s,1)      n     

Assuming O(s,1) = T(s,1) (i.e. each operation takes unit time), we get

              O(s,n)
    U(s,n) = ---------
             n.T(s,n)
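
The following sketch ties these definitions together; the timing and
operation counts are hypothetical measurements for some problem of size s
run on n = 8 processors:

    # Sketch: speedup, efficiency, redundancy, and utilization for one run.
    # All inputs below are hypothetical.
    n   = 8
    T_1 = 100.0     # T(s,1): sequential execution time
    T_n = 16.0      # T(s,n): parallel execution time (communication excluded)
    O_1 = 100.0     # O(s,1): operations executed serially (unit-time operations)
    O_n = 120.0     # O(s,n): operations executed on the n processors

    S = T_1 / T_n                  # speedup S(s,n)
    E = S / n                      # efficiency E(s,n)
    R = O_n / O_1                  # redundancy R(s,n)
    U = R * E                      # utilization; equals O(s,n)/(n.T(s,n)) here
    print("S =", S, " E =", E, " R =", R, " U =", U)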


Quality of Parallel Execution
-----------------------------
The quality of a parallel execution is a measure that combines
speedup, efficiency, and redundancy:

             S(s,n) E(s,n)
    Q(s,n) = -------------
                 R(s,n)

Since the efficiency is at most 1 and the redundancy is at least 1, we
conclude that Q(s,n) <= S(s,n).
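
Continuing the hypothetical values from the previous sketch, the quality is
simply S * E / R, and the bound Q(s,n) <= S(s,n) follows because E <= 1 and
R >= 1:

    # Sketch: quality of the parallel execution (values from the sketch above).
    S, E, R = 6.25, 0.78125, 1.2   # speedup, efficiency, redundancy
    Q = S * E / R
    assert Q <= S                  # E <= 1 and R >= 1, so Q never exceeds S
    print("Q(s,n) =", Q)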


Are MIPS and MFLOPS good enough?
--------------------------------
No. To factor out issues such as the clock speed and the instruction set
architecture, one should use benchmarks. Benchmarks could be "real"
programs (such as Spice or LaTeX) or synthetic, meaning that they are
synthesized to reflect a particular class of applications. Current
industry-standard synthetic benchmark ratings include KLIPS,
KDhrystones/s, and KWhetstones/s.


The notion of scalability
-------------------------
Let s be the size of a problem and n be the size of a parallel
machine to be used to solve that problem. The amount of "useful" work
to be done is w(s) (a function of the problem size) and the
communication overhead is h(s,n) (a function of the problem size and
machine size and topology). Assuming 0% redundancy, the efficiency of
a parallel algorithm can be written as:

                w(s)
        E = -------------
            w(s) + h(s,n)

With a fixed problem size, the efficiency decreases as n increases.
Generally speaking, h(s,n) grows more slowly than w(s) as s increases,
so with a fixed machine size the efficiency increases as s increases.
Consequently, as the size of a machine increases, one has to
increase the problem size in order to maintain a constant efficiency.
The "amount of growth" needed in the problem size to balance the
increase in machine size (for a constant efficiency) is called the
isoefficiency function. The lower the order of the isoefficiency
function the "more scalable" the architecture (for that problem). 

Example: It can be shown that for FFT on an n-node hypercube, 
         h(s,n) = O(n.log n + s.log n), whereas for FFT on a 2-D
         n-node mesh, h(s,n) = O(n.log n + s.sqrt(n)). The workload
         (the amount of useful work) for FFT is O(s.log s). To compute
         the isoefficiency of each architecture we equate w(s) to
         h(s,n). For the hypercube we get O(n.log n + s.log n) =
         O(s.log s) ==> s = O(n) ==> h(s,n) = O(n.log n). For the mesh
         we get O(n.log n + s.sqrt(n)) = O(s.log s) ==> s = O(k**sqrt(n))
         ==> h(s,n) = O(sqrt(n).k**sqrt(n)). Obviously, the hypercube
         is a "more" scalable topology for FFT than the mesh.


Fixed load/time/space speedup laws
----------------------------------
Amdahl's Law provides an upper bound on the achievable speedup given a
workload of a fixed size; it is a fixed-load speedup law. In other
words, speedup comes from the reduction in CPU time (we wish to solve
the same problem faster on a larger machine). For many applications,
it may be more realistic to make a fixed-time assumption, whereby
speedup comes from the increase in problem size or amount of work
(given a particular time frame, we wish to solve larger problems on
larger machines). In addition to the fixed-time assumption, we may
make a fixed-space assumption. In this case, given a particular time
frame and an upper bound on memory, we wish to solve larger problems on
larger machines.
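
As an illustration of the difference, the sketch below contrasts the
fixed-load (Amdahl) speedup with a fixed-time "scaled" speedup in which the
parallel part of the workload grows with n so that the time budget stays
fixed; the scaled expression alpha + (1 - alpha).n is the one commonly
attributed to Gustafson, and alpha = 0.2 is reused from the earlier example:

    # Sketch: fixed-load (Amdahl) vs fixed-time (scaled) speedup.
    alpha = 0.2                    # serial fraction

    for n in [1, 4, 16, 64, 256]:
        fixed_load = n / (1 + (n - 1) * alpha)   # same problem, more processors
        fixed_time = alpha + (1 - alpha) * n     # larger problem, same time budget
        print(f"n = {n:>4}  fixed-load = {fixed_load:6.2f}"
              f"  fixed-time = {fixed_time:7.2f}")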

Example: The solution of a discrete Laplace equation in 3 dimensions
         involves computing, for each grid point, the average of its 6
         neighboring points. Assume that the problem size is N*N*N and
         that p processors are available. Furthermore, assume that each
         processor performs 100 MFLOPS and has 32MB of memory. Assuming
         a fixed problem size of 10x10x10, how much "speedup" is
         achievable if the communication latency for exchanging data
         between two neighbors is 1 usec? Repeat for the case where the
         communication latency is 2 usec. What is the minimum problem
         size needed to achieve the best efficiency for communication
         latencies of 1 and 2 usec?
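
One way to begin the exercise is to write down a simple performance model.
The sketch below is only a skeleton, not the intended solution; it assumes 6
flops per grid point per relaxation sweep, a 1-D slab decomposition of the
N x N x N grid over p processors, and one latency charge per neighbor
exchange per sweep, all of which are modeling choices that the exercise
leaves open (the value p = 8 in the printout is likewise arbitrary):

    # Sketch of a performance model for the 3-D Laplace example.
    # Assumptions (hypothetical): 6 flops per grid point per sweep, 1-D slab
    # decomposition over p processors, one latency charge per neighbor exchange.
    FLOPS_PER_PROC = 100e6         # 100 MFLOPS per processor (from the problem)

    def time_per_sweep(N, p, latency):
        compute = 6 * N**3 / (p * FLOPS_PER_PROC)   # useful work, split over p
        comm = 0.0 if p == 1 else 2 * latency       # two neighbor exchanges per slab
        return compute + comm

    def speedup(N, p, latency):
        return time_per_sweep(N, 1, latency) / time_per_sweep(N, p, latency)

    for lat in [1e-6, 2e-6]:       # 1 usec and 2 usec latencies from the problem
        print("latency =", lat, "s   speedup with p = 8:",
              round(speedup(10, 8, lat), 2))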


This document has been prepared by Professor Azer Bestavros <best@cs.bu.edu> as the WWW Home Page for CS-551, which is part of the NSF-funded undergraduate curriculum on parallel computing at BU.

Date of last update: September 29, 1994.