Lecture 03
			      09/20/1994
			      ----------

Software Parallelism
--------------------
Many factors may limit the amount of parallelism that one may
"extract" from a given piece of code. These factors are usually
attributed to data dependence, control dependence, or resource
dependence. 

Data dependence imposes an ordering relationship between
statements. There are 5 types of data dependence (the first three are
illustrated in the code sketch after the list):

(1) Flow dependence: Statement S2 is reachable from statement S1 and
    statement S2 consumes the results of statement S1.
(2) Antidependence: Statement S2 is reachable from statement S1 and
    statement S1 uses results that statement S2 may change.
(3) Output dependence: Statements S1 and S2 may both result in
    changes to the same data.
(4) I/O Dependence: Statements S1 and S2 access the same I/O file,
    even if they access different parts.
(5) Undetermined Dependence: Independence between statements S1 and S2
    cannot be established (e.g. S1 and S2 manipulate the same
    subscripted variables and the subscript is itself a subscripted
    variable). 
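
A minimal C sketch of the first three dependence types (the variable
names and statements below are hypothetical, chosen only for their
read/write pattern):

    /* Flow, anti-, and output dependence on scalar code. */
    int dependence_demo(int a, int b)
    {
        int c, d;

        c = a + b;    /* S1: writes c                                  */
        d = c * 2;    /* S2: reads c  -> S2 is flow-dependent on S1    */
        c = b - 1;    /* S3: writes c -> S3 is antidependent on S2
                         (S2 must read the old c before S3 overwrites
                         it) and output-dependent on S1 (both write c) */
        d = a + c;    /* S4: writes d -> output-dependent on S2 and
                         flow-dependent on S3 (reads the new c)        */
        return d;
    }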

Bernstein's Conditions: A process is a software entity that consumes a
data set (input set, I) and produces another data set (output set,
O). Two processes pi and pj can execute in parallel if Bernstein's
conditions hold:

    Ii  and Oj are mutually exclusive sets
    Ij  and Oi are mutually exclusive sets
    Oi  and Oj are mutually exclusive sets

The order of execution of two processes doesn't affect the outcome of
their computation if they satisfy the above conditions.

Bernstein's conditions can be generalized to a set of software
processes.  A set of processes can be executed in parallel if they
satisfy Bernstein's conditions on a pairwise basis.
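
A minimal C sketch of these conditions, assuming (purely for
illustration) that each process's input and output sets are encoded as
bitmasks over a small collection of variables; the pairwise
generalization is checked by bernstein_all():

    #include <stdio.h>

    /* Input/output sets as bitmasks: bit k set means "variable k is in
       the set". */
    typedef struct { unsigned in, out; } process;

    /* Bernstein's conditions for a pair: Ii & Oj, Ij & Oi, and Oi & Oj
       must all be empty. */
    int bernstein_pair(process p, process q)
    {
        return (p.in & q.out) == 0 &&
               (q.in & p.out) == 0 &&
               (p.out & q.out) == 0;
    }

    /* Generalization: a set of processes may run in parallel if the
       conditions hold pairwise. */
    int bernstein_all(const process *p, int n)
    {
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                if (!bernstein_pair(p[i], p[j]))
                    return 0;
        return 1;
    }

    int main(void)
    {
        /* Variables: bit 0 = a, 1 = b, 2 = c, 3 = d.
           P1: c = a + b  (reads a,b; writes c)
           P2: d = a * 2  (reads a;   writes d)  -- independent of P1
           P3: a = c - 1  (reads c;   writes a)  -- conflicts with both */
        process p1 = { 0x3, 0x4 }, p2 = { 0x1, 0x8 }, p3 = { 0x4, 0x1 };
        process all[] = { p1, p2, p3 };

        printf("P1 || P2: %d\n", bernstein_pair(p1, p2));  /* prints 1 */
        printf("P1 || P3: %d\n", bernstein_pair(p1, p3));  /* prints 0 */
        printf("all three: %d\n", bernstein_all(all, 3));  /* prints 0 */
        return 0;
    }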

Control dependence refers to situations in which statements (or
program segments) in different parts of a program's control structure
(IF and LOOP constructs) must abide by a particular ordering for
correct execution. For example, statements in different loop
iterations may or may not be control dependent. Control dependence
could be eliminated entirely if the sequencing of instructions were
determined solely by data dependencies. Such an alternative yields
what is known as dataflow computing or demand-driven computing, as
opposed to traditional control-flow computing.  [Dataflow and
demand-driven computers are still in their experimental phase.]
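
A short C sketch of control dependence (the loop and data are
hypothetical): the guarded statement executes only if the IF test
succeeds, and whether loop iterations are control dependent on one
another depends on what the branch does:

    /* S2 is control-dependent on S1: whether it executes at all is
       decided by S1's branch outcome. */
    void control_dep(int *x, int n)
    {
        for (int i = 0; i < n; i++) {
            if (x[i] < 0)       /* S1: test                            */
                x[i] = -x[i];   /* S2: runs only when S1's test holds  */
        }
        /* Each iteration's branch depends only on its own x[i], so the
           iterations are not control dependent on each other and could
           run in parallel.  Replacing the body with
               if (x[i] < 0) break;
           would make every later iteration control dependent on the
           earlier ones. */
    }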

In control-flow computing, the ordering of instructions is determined
a priori by the programmer (or the compiler) through the use of
control structures (e.g. sequences and loops). In demand-driven
computing, the execution of an operation is triggered by the need for
its results; it's a top-down approach that corresponds to "lazy
evaluation". In dataflow computing, the execution of an operation is
triggered by the availability of its operands (inputs); it's a
bottom-up approach based on "eager evaluation". The degree of explicit
flow control decreases from control-flow to demand-driven to dataflow
computing.
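
A rough C illustration of the lazy-versus-eager distinction (a
hypothetical software simulation; real dataflow and demand-driven
machines realize these triggering rules in hardware): a "thunk"
postpones a computation until its result is actually demanded:

    #include <stdio.h>

    /* A suspended computation, run only when its result is demanded. */
    typedef struct { int (*fn)(int); int arg; } thunk;

    static int expensive(int x) { printf("computing %d^2\n", x); return x * x; }

    static int force(thunk t) { return t.fn(t.arg); }  /* demand the value */

    int main(void)
    {
        int want_square = 0;             /* run-time condition          */

        int   eager = expensive(7);      /* eager: computed now,        */
                                         /* whether needed or not       */
        thunk lazy  = { expensive, 7 };  /* lazy: nothing computed yet  */

        if (want_square)
            printf("eager %d, lazy %d\n", eager, force(lazy));

        /* Output shows a single "computing" line: the eager evaluation
           ran even though its result was never demanded, while the
           thunk never ran at all. */
        return 0;
    }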


Hardware Parallelism
--------------------
Parallel execution is possible only if hardware resources can be
utilized concurrently by several operations. The possible patterns of
concurrency amongst hardware functional units are a function of many
variables that are constrained by cost and performance tradeoffs.
Hardware parallelism exists at many levels of a computer design and
may be limited by intricate interactions.  For example, while extra
functional units may exist and may allow more parallelism, it may be
impossible to capitalize on them due to limitations imposed by
microprogramming (horizontal versus vertical microprogramming).

Example: Mismatch between software parallelism and hardware
         parallelism.
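
One hypothetical instance of such a mismatch (the numbers are
illustrative, not from the lecture): four mutually independent
multiplications offer a software parallelism of 4, but a machine with
only two multiply units can issue at most two of them per cycle:

    /* Software parallelism: all four products are independent (degree 4).
       Hardware parallelism: with two multipliers, only two can issue per
       cycle, so the code needs two cycles (degree 2). */
    void mismatch(int a, int b, int c, int d, int r[4])
    {
        r[0] = a * b;   /* cycle 1, multiplier 0 */
        r[1] = c * d;   /* cycle 1, multiplier 1 */
        r[2] = a * d;   /* cycle 2, multiplier 0 */
        r[3] = b * c;   /* cycle 2, multiplier 1 */
    }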


Program Partitioning
--------------------
Software parallelism can be exploited at various granularity levels.
In fine-grain parallelism, the basic unit of computation (program
segment or granule) may span tens to hundreds of instructions of
straight-line code or unfolded loop iterations. This level of
parallelism is often exploited by a parallelizing (vectorizing)
compiler. Medium-grain parallelism is bounded by function or procedure
call/return and may span thousands of instructions. This level of
parallelism often involves both the programmer and the compiler.
Coarse-grain parallelism results from the execution of essentially
independent programs and may span tens of thousands of instructions.
This level of parallelism is exploited by the run-time system (OS).
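
A hypothetical C sketch of the first two levels (routine names and the
decomposition are illustrative only); coarse-grain parallelism would
correspond to running essentially unrelated programs as separate jobs:

    /* Fine grain: the iterations of this straight-line loop body are
       independent, so a vectorizing/parallelizing compiler can exploit
       them. */
    void saxpy(int n, float a, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    /* Medium grain: each procedure call (one per row of an n-by-n
       matrix) is a task that the programmer and compiler can schedule
       in parallel. */
    void scale_rows(int n, float a, const float *x, float *y)
    {
        for (int row = 0; row < n; row++)
            saxpy(n, a, x + row * n, y + row * n);
    }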

Efficient parallel execution requires a delicate balance between
program granularity and the communication latency (synchronization
overhead) between the different granules. In particular, if
communication latency is minimal, then fine-grain partitioning will
yield the best performance.  This is the case when data parallelism is
used. If communication latency is large (as in a loosely coupled
system), then coarse-grain partitioning is more appropriate. The
problem of selecting the "right" grain size is NP-complete (it
involves exploring an exponential search space of possible schedules).

Example: Hiding latency by "grain packing". 
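
A toy C cost model of grain packing (the model and the numbers are
illustrative assumptions, not from the lecture): every grain pays a
fixed communication/synchronization latency, so packing fine-grain
tasks into larger grains amortizes that latency, while grains that are
too coarse forfeit parallelism:

    #include <stdio.h>

    /* Estimated parallel time when `tasks' fine-grain tasks are packed
       into grains of `grain_size' tasks each and spread over `procs'
       processors.  Each grain costs one latency plus its compute time. */
    double parallel_time(int tasks, int grain_size, int procs,
                         double work_per_task, double latency)
    {
        int grains = (tasks + grain_size - 1) / grain_size;
        int rounds = (grains + procs - 1) / procs;   /* ceil(grains/procs) */
        return rounds * (latency + grain_size * work_per_task);
    }

    int main(void)
    {
        /* 1000 tasks of 1 us each, 100 us latency, 10 processors. */
        for (int g = 1; g <= 1000; g *= 10)
            printf("grain size %4d -> %8.1f us\n",
                   g, parallel_time(1000, g, 10, 1.0, 100.0));
        return 0;
    }

With these parameters the sketch reports its best time at a grain size
of 100: very small grains are dominated by latency, while a single
huge grain leaves nine of the ten processors idle.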


This document has been prepared by Professor Azer Bestavros <best@cs.bu.edu> as the WWW Home Page for CS-551, which is part of the NSF-funded undergraduate curriculum on parallel computing at BU.

Date of last update: September 29, 1994.