Lecture 21                                                11/22/1994

Multithreading
--------------

Perhaps the most promising "scalable" approach to dealing with large
latencies (whether due to purely physical limitations or to process
synchronization) is to "hide" the latency (rather than reduce it).
Using multithreading, each processor in the system is responsible for
running a large number of "threads". When a thread issues a remote
access, then instead of idling (or busy waiting), the processor
switches contexts (i.e. executes a different thread). In other words,
the cost of the remote access is hidden because "useful work" on a
different thread is done. Obviously, for this technique to work, the
cost of a context switch must be much cheaper than the cost of a
remote access.

A major question in designing a multithreaded machine to hide latency
is "when to switch context?" There are various possibilities. For
example, a processor may switch context on a cache miss, or it may
switch context on every load (independent of whether it hits or
misses), or it may switch context after each instruction (i.e. it
interleaves the instructions from different threads on a
cycle-by-cycle basis -- this is particularly good for pipelined
execution because it removes pipeline hazards!)

Let (L) denote the communication latency on a remote memory access.
The value of L reflects the network delays as well as any cache miss
penalty, etc. Let (N) denote the number of threads that can be
interleaved in each processor. Let (C) denote the context switching
overhead. Finally, let (R) denote the average number of cycles
(instructions) between remote accesses.

Clearly the objective is to maximize the fraction of time that a
processor is busy doing useful work (i.e. its utilization). For a
single-threaded processor, the utilization will be:

             R         1
      U1 = ----- = -------
           R + L   1 + L/R

For large L and small R, it can easily be seen that the utilization
drops quite fast.
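The single-threaded utilization formula above can be checked with a
short calculation. This is a sketch; the sample values of R and L are
illustrative assumptions, not taken from the lecture:

```python
def u1(R, L):
    """Single-threaded utilization: U1 = R / (R + L)."""
    return R / (R + L)

# Assumed sample values: R = 20 cycles of useful work between remote
# accesses, L = 100-cycle remote access latency.
print(u1(20, 100))   # latency dominates: utilization is only ~0.17
print(u1(20, 1000))  # a 10x larger L drives utilization toward zero
```

As the lecture notes, the processor spends most of its time stalled
whenever L is large relative to R.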
For a multiple-threaded processor, the maximum utilization will occur
when there are enough threads to completely hide the latency (i.e.
there is always a ready thread to execute after each remote access).
For this to be true we need:

      (N-1)(R+C) > L

Thus,

              L
      N >= ------- + 1
            R + C

In this case, the utilization becomes:

               R         1
      Umax = ----- = -------
             R + C   1 + C/R

Obviously, if the context switching cost is reduced to 0 (as in the
Tera computer, for example), then 100% utilization becomes possible.
If N is below the level needed for maximum utilization, then the
utilization becomes:

                N.R
      Ulin = ---------
             R + C + L

An interesting observation is that multithreading may have an adverse
effect on the latency itself! In particular, multithreading results
in a much larger amount of data "in transit" (on the wire), which may
cause hot spots or congestion, and thus a higher latency. It is
important to strike the right balance (quite an open area of
research!)

Example Architectures Overviewed:
---------------------------------

  o The Wisconsin Multicube           (e.g. of a snoopy cache design)
  o The Stanford Dash Multiprocessor  (e.g. of a directory-based
                                       cache design)
  o The Tera Computer                 (e.g. of a multithreaded
                                       machine design)
  o The KSR Computer                  (e.g. of an ALLCACHE machine
                                       design)
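The saturation condition and the two utilization regimes above can be
sketched as one small function. The parameter values are assumptions
for illustration, not from the lecture:

```python
import math

def utilization(N, R, C, L):
    """Utilization of a multithreaded processor, per the notes:
    Ulin = N*R / (R + C + L) below saturation, capped at
    Umax = R / (R + C) once N >= L/(R+C) + 1 threads hide L."""
    n_min = L / (R + C) + 1          # threads needed to fully hide L
    if N >= n_min:
        return R / (R + C)           # Umax: latency fully hidden
    return N * R / (R + C + L)       # Ulin: not enough threads

# Assumed sample values: R = 20, C = 5, L = 100.
R, C, L = 20, 5, 100
n_min = math.ceil(L / (R + C) + 1)   # 5 threads for these values
print(n_min)
print(utilization(1, R, C, L))       # single thread: 20/125 = 0.16
print(utilization(n_min, R, C, L))   # saturated: 20/25 = 0.8
```

Note that with C = 0 (as in the Tera design mentioned above), the
saturated value R / (R + C) is exactly 1, i.e. 100% utilization.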
Date of last update: November 16, 1994.