Lecture 16                                                  11/03/1994

Vector Processing
-----------------
A vector processor consists of a scalar processor and a vector unit,
which can be thought of as an independent functional unit capable of
efficient vector operations. Scalar instructions are executed on the
scalar processor, whereas vector instructions are executed on the
vector unit.

When designing the instruction set for a vector unit, one has to
choose an ISA. Most of today's vector units have an instruction set
that "generalizes" the Load/Store ISA of scalar processors. In
particular, the architecture relies on the existence of general-purpose
vector registers that can be loaded from or stored to memory, with all
arithmetic operations performed among vector registers. The ISA of a
scalar processor is augmented with vector instructions of the
following types:

o Vector-vector instructions:
    f1: Vi --> Vj            (e.g. MOVE Va, Vb)
    f2: Vj x Vk --> Vi       (e.g. ADD Va, Vb, Vc)
o Vector-scalar instructions:
    f3: s x Vi --> Vj        (e.g. ADD R1, Va, Vb)
o Vector-memory instructions:
    f4: M --> V              (e.g. Vector Load)
    f5: V --> M              (e.g. Vector Store)
o Vector reduction instructions:
    f6: V --> s              (e.g. ADD V, s)
    f7: Vi x Vj --> s        (e.g. DOT Va, Vb, s)
o Gather and scatter instructions:
    f8: M x Va --> Vb        (e.g. gather)
    f9: Va x Vb --> M        (e.g. scatter)
o Masking instructions:
    fa: Va x Vm --> Vb       (e.g. MMOVE V1, V2, V3)

Gather and scatter are used to process sparse matrices/vectors. The
gather operation uses a base address and a set of indices to load a
"few" of the elements of a large vector from memory into one of the
vector registers. The scatter operation does the opposite. The masking
instructions allow conditional execution of an instruction based on a
"masking" register.

The major hurdle in designing a vector unit is to ensure that the flow
of data from memory to the vector unit does not become a bottleneck.
In particular, for a vector unit to be effective, the memory must be
able to deliver one datum per clock cycle.
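The semantics of the gather, scatter, and masked-move instructions
above can be sketched in plain Python; this is an illustrative model
(the function names are ours, not any real vector ISA), with lists
standing in for memory and vector registers:

```python
def gather(memory, base, indices):
    """f8: M x Va --> Vb -- load memory[base + i] for each index i."""
    return [memory[base + i] for i in indices]

def scatter(memory, base, indices, values):
    """f9: Va x Vb --> M -- store each value at memory[base + i]."""
    for i, v in zip(indices, values):
        memory[base + i] = v

def masked_move(src, dst, mask):
    """fa: Va x Vm --> Vb -- take src[i] where mask[i] is set, else dst[i]."""
    return [s if m else d for s, d, m in zip(src, dst, mask)]

# A sparse row: only positions 1, 4, and 6 of 8 elements are nonzero.
memory = [0, 5, 0, 0, 7, 0, 2, 0]
v = gather(memory, 0, [1, 4, 6])                  # v = [5, 7, 2]
scatter(memory, 0, [1, 4, 6], [x * 2 for x in v]) # write back doubled values
# memory is now [0, 10, 0, 0, 14, 0, 4, 0]
```

Note how gather/scatter touch only the nonzero positions, which is
exactly why they are suited to sparse vectors.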
This is usually achieved through pipelining, using the C-access memory
organization (concurrent access), the S-access memory organization
(simultaneous access), or a combination thereof.

Relative Vector/Scalar Performance and Amdahl's Law
---------------------------------------------------
Let r be the vector/scalar speed ratio and f be the vectorization
ratio. For example, if the time it takes to add a vector of 64
integers using the scalar unit is 10 times the time it takes using the
vector unit, then r = 10. Moreover, if the total number of operations
in a program is 100 and only 10 of these are scalar (after
vectorization), then f = 0.9 (i.e. 90% of the work is done by the
vector unit). It follows that the achievable speedup is:

    Time without the vector unit
    ----------------------------
     Time with the vector unit

For our example, assuming that it takes one unit of time to execute
one scalar operation, this ratio is:

        100x1
    ------------- = 100/19 (approx 5).
    10x1 + 90x0.1

In general, the speedup is:

         r
    ----------
    (1-f)r + f

So even if the performance of the vector unit is extremely high
(r = oo), we get a speedup less than 1/(1-f), which shows that the
ratio f is crucial to performance since it places a limit on the
attainable speedup. This ratio depends on the efficiency of the
compiler's vectorization, etc. It also suggests that a scalar unit
with mediocre performance, even if coupled with the fastest vector
unit, will yield a mediocre speedup.

Strip-mining
------------
If a vector to be processed has a length greater than that of the
vector registers, then strip-mining is used, whereby the original
vector is divided into equal-size segments (equal to the size of the
vector registers) and these segments are processed in sequence.
Strip-mining is usually performed by the compiler, but in some
architectures (like the Fujitsu VP series) it can be done by the
hardware.
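The speedup formula and the strip-mining loop can both be sketched in
a few lines of Python (the names vector_speedup and strip_mine are
illustrative, not from any compiler or ISA):

```python
def vector_speedup(r, f):
    """Speedup = r / ((1-f)r + f), with r the vector/scalar speed
    ratio and f the vectorized fraction of the work (0 <= f <= 1)."""
    return r / ((1 - f) * r + f)

def strip_mine(vector, vlen):
    """Split a long vector into segments no longer than the vector
    register length vlen; the segments are processed in sequence."""
    return [vector[i:i + vlen] for i in range(0, len(vector), vlen)]

print(vector_speedup(10, 0.9))         # 100/19, approximately 5.26
print(strip_mine(list(range(10)), 4))  # strips of 4, 4, and 2 elements
```

Raising r with f fixed at 0.9 shows the 1/(1-f) ceiling: the speedup
approaches 10 no matter how fast the vector unit becomes.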
Compound Vector Processing
--------------------------
A sequence of vector operations may be bundled into a "compound"
vector function (CVF), which can be executed as one operation (without
having to store intermediate results in vector registers, etc.) using
a technique called chaining, which is an extension of bypassing (used
in scalar pipelines). The purpose of "discovering" CVFs is to expose
opportunities for concurrent processing of linked vector operations.
Notice that the number of available vector registers and functional
units imposes limitations on how many CVFs can be executed
simultaneously (e.g. on the Cray 1, CVP of the SAXPY code leads to a
speedup of 5/3; on the X-MP it results in a speedup of 5).

Pipeline Nets
-------------
The idea of CVP can be generalized to non-linear pipelining patterns
(pipeline nets).

SIMD Processing
---------------

The CM-2 architecture
---------------------
The CM-2 can be thought of as a "glorified" vector unit. It consists
of an array of PEs connected to a scalar host (front-end). One way to
view the CM-2 is that it has a front-end attached to a memory
distributed amongst a number of PEs. Data can be communicated between
the front-end and the PE array in one of three ways: broadcasting,
global combining, and the scalar memory bus. In broadcasting, the
front-end broadcasts a scalar value to all PEs (casting a scalar
variable/constant to a parallel variable/constant). In global
combining, the front-end obtains the scalar sum, max, min, OR, or AND
of a parallel variable. Using the scalar memory bus, the front-end can
read or write the memory of a particular PE.

The CM-2 has one control unit for all the PEs. A "parallel"
instruction is passed from the front-end to the controller, which
broadcasts a sequence of nanoinstructions to all PEs to execute that
instruction. Data can be moved around between the PEs using one of
three methods: the NEWS grids, the global router, and the scan
operations.
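The three front-end/PE-array communication paths can be modeled with a
toy Python class; this is a sketch of the semantics only (the class
and method names are ours, not CM-2 software), with one memory word
per PE for simplicity:

```python
class PEArray:
    """Toy model of a SIMD PE array seen from the front-end."""

    def __init__(self, n):
        self.mem = [0] * n  # one memory word per PE

    def broadcast(self, scalar):
        """Cast a scalar to a parallel variable: every PE gets a copy."""
        self.mem = [scalar] * len(self.mem)

    def global_combine(self, op):
        """Reduce a parallel variable to a scalar (sum, max, min, ...)."""
        return op(self.mem)

    def scalar_read(self, pe):
        """Scalar memory bus: read the memory of one particular PE."""
        return self.mem[pe]

    def scalar_write(self, pe, value):
        """Scalar memory bus: write the memory of one particular PE."""
        self.mem[pe] = value

pes = PEArray(8)
pes.broadcast(3)                # every PE now holds 3
pes.scalar_write(5, 10)         # front-end patches a single PE
print(pes.global_combine(sum))  # 7*3 + 10 = 31
print(pes.global_combine(max))  # 10
```

The point of the model is the asymmetry: broadcast and combine involve
all PEs at once, while the scalar memory bus touches exactly one.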
The CM-2 PE array consists of up to 2048 processing nodes. Each node
consists of a memory unit, a floating-point unit, and 32 bit-slice
processors (1 PE = a 3-input/2-output 1-bit ALU with latches)
organized into 2 groups of 16 PEs each. Each one of these 16-PE groups
is connected to a NEWS/hypercube router interface. The hypercube
router consists of 4096 router nodes interconnected as a
12-dimensional hypercube. A hardware routing algorithm that may
introduce misrouting is used. While the router is delivering messages,
the PE array is idle.

The MasPar architecture
-----------------------
The main conceptual differences between the MasPar and the CM-2 are
the departure from 1-bit PEs to 4-bit PEs and the departure from a
router that uses packet switching to one that uses circuit switching
(via a multistage 1024x1024 crossbar interconnect). Another important
difference between the CM-2 and the MasPar is the ability of the
MasPar PEs to access local memory "independently". This allows the
machine to effectively simulate SPMD operation (the PEs being at
different states of the same automaton).

The CM-5 architecture
---------------------
Date of last update: November 1, 1994.