Lecture 16
                              11/03/1994



			  Vector Processing
			  -----------------

A vector processor consists of a scalar processor and a vector unit,
which could be thought of as an independent functional unit capable of
efficient vector operations. Scalar instructions are executed on the
scalar processor, whereas vector instructions are executed on the
vector unit. When designing a vector unit, one has to choose an
ISA. Most of today's vector units have an instruction set that
"generalizes" the Load/Store ISA of scalar processors. In
particular, the architecture relies on the existence of General
Purpose Vector Registers that can be loaded from or stored to
memory, with all arithmetic operations performed amongst Vector
Registers.

The ISA of a scalar processor is augmented with vector instructions of
the following types:

o Vector-vector instructions: 
	f1: Vi --> Vj           (e.g. MOVE Va, Vb)
        f2: Vj x Vk --> Vi      (e.g. ADD  Va, Vb, Vc)

o Vector-scalar instructions:
        f3: s  x Vi --> Vj      (e.g. ADD  R1, Va, Vb)

o Vector-memory instructions: 
        f4: M --> V             (e.g. Vector Load)
        f5: V --> M             (e.g. Vector Store)

o Vector reduction instructions:
        f6: V --> s             (e.g. ADD V, s)
        f7: Vi x Vj --> s       (e.g. DOT Va, Vb, s)

o Gather and Scatter instructions:
        f8: M x Va --> Vb       (e.g. gather)
        f9: Va x Vb --> M       (e.g. scatter)

o Masking instructions:
        fa: Va x Vm --> Vb      (e.g. MMOVE V1, V2, V3)


Gather and scatter are used to process sparse matrices/vectors. The
gather operation uses a base address and a set of indices to load
from memory a "few" of the elements of a large vector into one of
the vector registers. The scatter operation does the opposite. The
masking operation allows conditional execution of an instruction
based on a "masking" register.

The major hurdle in designing a vector unit is to ensure that the
flow of data from memory to the vector unit does not become a
bottleneck. In particular, for a vector unit to be effective, the
memory must be able to deliver one datum per clock cycle. This is
usually achieved through interleaved memory banks, using the
C-access memory organization (concurrent access), the S-access
memory organization (simultaneous access), or a combination of the
two.
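
As a rough illustration (the constants are made up, not those of any
particular machine), low-order interleaving spreads consecutive
addresses across the banks, so a stride-1 vector access returns to a
given bank only once every NBANKS cycles -- enough to hide a bank
busy time of up to NBANKS clocks:

    /* Low-order interleaving across NBANKS memory banks. A stride-1
       stream visits banks 0,1,...,NBANKS-1,0,... so each bank has
       NBANKS cycles to recover before it is accessed again. */
    #define NBANKS 8

    int bank_of(unsigned addr)     { return addr % NBANKS; }
    int word_within(unsigned addr) { return addr / NBANKS; }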


The Relative Vector/Scalar Performance and Amdahl Law
-----------------------------------------------------
Let r be the vector/scalar speed ratio and f be the vectorization
ratio. For example, if the time it takes to add a vector of 64
integers using the scalar unit is 10 times the time it takes to do it
using the vector unit, then r = 10. Moreover, if the total number of
operations in a program is 100 and only 10 of these are scalar (after
vectorization), then f = 0.9 (i.e. 90% of the work is done by the
vector unit). It follows that the achievable speedup is:

       Time without the vector unit
       ----------------------------
         Time with the vector unit

For our example, assuming that it takes one unit of time to execute
one scalar operation, this ratio will be:

          100x1
      -------------  = 100/19 (approx 5.3).
      10x1 + 90x0.1

In general, assuming one time unit per scalar operation, a program of
N operations takes N time units without the vector unit and
(1-f)N + fN/r time units with it, so the speedup is:

          r
      ----------
      (1-f)r + f

So even if the performance of the vector unit is extremely high
(r = infinity), we get a speedup of at most 1/(1-f), which suggests
that the ratio f is crucial to performance, since it poses a limit
on the attainable speedup. This ratio depends on the efficiency of
the compilation, etc. It also suggests that a scalar unit with
mediocre performance (even if coupled with the fastest vector unit)
will yield a mediocre speedup.
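
The two observations above are easy to check numerically; the small
C program below (a sketch, with illustrative values of f and r)
evaluates the speedup formula:

    #include <stdio.h>

    /* Speedup = r / ((1-f)r + f), where f is the vectorization ratio
       and r is the vector/scalar speed ratio. */
    double speedup(double f, double r)
    {
        return r / ((1.0 - f) * r + f);
    }

    int main(void)
    {
        printf("%.2f\n", speedup(0.9, 10.0)); /* 100/19, approx 5.26 */
        printf("%.2f\n", speedup(0.9, 1e9));  /* approaches 1/(1-f) = 10 */
        return 0;
    }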


Strip-mining
------------
If a vector to be processed has a length greater than that of the
vector registers, then strip-mining is used, whereby the original
vector is divided into equal-size segments (each of the same length
as the vector registers) and these segments are processed in
sequence, as sketched below. Strip-mining is usually performed by
the compiler, but in some architectures (like the Fujitsu VP series)
it can be done by the hardware.
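
The following C sketch shows the shape of the code a compiler
generates when strip-mining a vector add; the register length of 64
is an assumption, and the inner loop stands for a single vector
instruction operating on one segment.

    #define VLEN 64   /* length of the vector registers (assumed) */

    void vadd(double *c, const double *a, const double *b, int n)
    {
        for (int lo = 0; lo < n; lo += VLEN) {         /* one strip per trip */
            int len = (n - lo < VLEN) ? n - lo : VLEN; /* last strip may be short */
            for (int i = 0; i < len; i++)              /* one "vector" operation */
                c[lo + i] = a[lo + i] + b[lo + i];
        }
    }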


Compound Vector Processing
--------------------------
A sequence of vector operations may be bundled into a "compound"
vector function (CVF), which can be executed as one operation
(without having to store intermediate results in vector registers,
etc.) using a technique called chaining, which is an extension of
bypassing (used in scalar pipelines). The purpose of "discovering"
CVFs is to exploit opportunities for concurrent processing of linked
vector operations. Notice that the number of available vector
registers and functional units imposes limitations on how many CVFs
can be executed simultaneously (e.g. on the Cray 1, CVP of the SAXPY
code leads to a speedup of 5/3; on the X-MP it results in a speedup
of 5).
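
For concreteness, the SAXPY kernel mentioned above computes
Y = a*X + Y. In C it is the loop below; the comment shows one
plausible decomposition into the linked vector operations that
chaining overlaps (the exact instruction sequence is machine
dependent).

    /* SAXPY as a CVF: per strip, the chain is roughly
       VLOAD X -> VLOAD Y -> VMULT a*X -> VADD -> VSTORE Y.
       With chaining, the multiply can start as soon as the first
       elements of X arrive, instead of waiting for the whole load. */
    void saxpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }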


Pipeline Nets
-------------
The idea of CVP can be generalized to non-linear pipelining patterns
(Pipeline Nets). 



			   SIMD Processing
			   ---------------

The CM-2 architecture
---------------------
The CM-2 can be thought of as a "glorified" vector unit. It consists
of an array of PEs connected to a scalar host (front-end). One way to
view the CM-2 is that it has a front-end attached to a 
memory distributed amongst a number of PEs. Data can be
communicated between the front-end and the PE array in one of three
ways: broadcasting, global combining, and the scalar memory bus. In
broadcasting, the front-end can broadcast a scalar value to all PEs
(casting a scalar variable/constant to a parallel
variable/constant). In global combining, the front-end can obtain
the scalar sum, max, min, OR, or AND of a parallel variable. Using
the scalar memory bus, the front-end can read or write the memory of
a particular PE.

The CM-2 has one control unit for all the PEs. A "parallel"
instruction is passed from the front-end to the controller, which
broadcasts a sequence of nanoinstructions to all PEs to execute that
instruction. Data can be moved around between the PEs using one of
three methods: the NEWS grids, the global router, and the scan
operations.
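
Of these, the scan operations are the least familiar: a plus-scan,
for instance, leaves in PE i the sum of the values held by PEs 0
through i. The sequential C sketch below only pins down this
semantics; the PE array computes it in O(log p) parallel steps.

    /* Inclusive plus-scan: out[i] = in[0] + ... + in[i]. */
    void plus_scan(const int *in, int *out, int p)
    {
        int sum = 0;
        for (int i = 0; i < p; i++) {
            sum += in[i];
            out[i] = sum;
        }
    }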

The CM-2 PE array consists of up to 2048 processing nodes. Each node
consists of a memory unit, a FP unit, and 32 bit-slice processors (1
PE = a 3-input/2-output 1-bit ALU with latches) organized into 2
groups of 16 PEs each. Each of these 16-PE groups is connected to a
NEWS/hypercube router interface.

The hypercube router consists of 4096 router nodes interconnected as a
12-dimensional hypercube. The router uses a hardware routing
algorithm that may introduce misrouting. While the router is
"delivering" messages, the PE array is idle.
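
The address arithmetic behind hypercube routing is worth spelling
out: the dimensions a message must cross are exactly the set bits in
the XOR of the source and destination addresses. The C sketch below
shows plain dimension-order routing; the CM-2's hardware algorithm
uses the same arithmetic but, as noted, may deliberately misroute a
message to avoid congestion.

    #include <stdio.h>

    /* Route from src to dst on a dims-dimensional hypercube by
       crossing, in order, every dimension in which they differ. */
    void route(unsigned src, unsigned dst, int dims)
    {
        unsigned node = src;
        unsigned diff = src ^ dst;       /* dimensions left to cross */
        for (int d = 0; d < dims; d++)
            if (diff & (1u << d)) {
                node ^= 1u << d;         /* hop to neighbor in dim d */
                printf("hop across dimension %d to node %u\n", d, node);
            }
    }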


The MasPar architecture
-----------------------
The main conceptual difference between the MasPar and the CM-2 is the
departure from 1-bit PEs to 4-bit PEs and the departure from a router
that uses "packet switching" to one that uses "circuit switching"
(using a multistage 1024x1024 crossbar interconnect). Another
important difference between the CM-2 and the MasPar is the ability
of the MasPar PEs to access their local memories "independently".
This allows the machine to effectively simulate an SPMD mode of
operation (with different PEs being at different states of the same
automaton).


The CM-5 architecture
---------------------


This document has been prepared by Professor Azer Bestavros
<best@cs.bu.edu> as the WWW Home Page for CS-551, which is part of
the NSF-funded undergraduate curriculum on parallel computing at BU.

Date of last update: November 1, 1994.