
A. Instruction-level parallelism
A common design goal for general-purpose processors is to maximize throughput, which may be defined broadly as the amount of work performed in a given time. Average processor throughput is a function of two variables: the average number of clock cycles required to execute an instruction, and the frequency of clock cycles. To increase throughput, then, a designer could increase the clock rate of the architecture, or increase the average instruction-level parallelism (ILP) of the architecture. Modern processor design has focused on executing more instructions in a given number of clock cycles, that is, on increasing ILP. A number of techniques may be used. One technique, pipelining, is particularly popular because it is relatively simple and can be used in conjunction with superscalar and VLIW techniques. All modern CPU architectures are pipelined.

B. Pipelining
All instructions are executed in multiple stages. For example, a simple processor may have five stages: first the instruction must be fetched from cache, then it must be decoded, then the instruction must be executed, and any memory referenced by the instruction must be loaded or stored. Finally, the result of the instruction is stored in registers. The output from one stage serves as the input to the next stage, forming a pipeline of instruction implementation. These stages are frequently independent of each other, so, if separate hardware is used to perform each stage, multiple instructions may be "in flight" at once, with each instruction at a different stage in the pipeline. Ignoring potential problems, the theoretical increase in speed is proportional to the length of the pipeline: longer pipelines mean more simultaneous in-flight instructions and therefore fewer average cycles per instruction.

The major potential problem with pipelining is the potential for hazards. A hazard occurs when an instruction in the pipeline cannot be executed. Hennessy and Patterson identify three types of hazards: structural hazards, where there simply isn't sufficient hardware to execute all parallelizable instructions at once; data hazards, where an instruction depends on the result of a previous instruction; and control hazards, which arise from instructions that change the program counter (i.e., branch instructions). Various techniques exist for managing hazards. The simplest of these is simply to stall the pipeline until the instruction causing the hazard has completed.
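As a rough illustration of the stall described above, the following sketch models in-order issue into a five-stage pipeline. Everything here is a simplifying assumption for illustration, not a description of any real processor: the five stages, one new instruction per cycle, and no forwarding hardware (so a read-after-write hazard stalls until the producer's write-back).

```python
# Minimal sketch of in-order issue in a five-stage pipeline (IF, ID, EX, MEM, WB).
# Assumption: no forwarding, so an instruction that reads a register must wait
# until the producing instruction's write-back (WB) stage has completed.

def schedule(instrs):
    """instrs: list of (name, dest_reg, source_regs).
    Returns (name, start_cycle) pairs; a read-after-write data hazard
    stalls the consumer until the producer's result is written back."""
    writeback = {}             # register -> cycle of the producer's WB stage
    starts, next_start = [], 0
    for name, dst, srcs in instrs:
        start = next_start
        for r in srcs:
            if r in writeback:
                # decode (start + 1) must follow the producer's write-back
                start = max(start, writeback[r])
        starts.append((name, start))
        if dst is not None:
            writeback[dst] = start + 4   # WB is the fifth stage
        next_start = start + 1           # at most one new instruction per cycle
    return starts
```

Under this model, an `add` that reads `r1` immediately after a `load` of `r1` issues three cycles late; eliminating exactly this kind of stall is what forwarding hardware and compiler scheduling are for.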

VLIW

[Figure: a typical VLIW architecture]

All this additional hardware is complex, and contributes to the transistor count of the processor. All other things being equal, more transistors means more power consumption, more heat, and less on-die space for cache. Thus it seems beneficial to expose more of the architecture's parallelism to the programmer. This way, not only is the architecture simplified, but programmers have more control over the hardware, and can take better advantage of it. VLIW is an architecture designed to help software designers extract more parallelism from their software than would be possible using a traditional RISC design. It is an alternative to the better-known superscalar architectures. VLIW is far simpler than superscalar designs, but has not so far been commercially successful. The figure shows a typical VLIW architecture; note the simplified instruction decode and dispatch.

A. ILP in VLIW
VLIW and superscalar approach the ILP problem differently. The key difference between the two is where instruction scheduling is performed: in a superscalar architecture, scheduling is performed in hardware (and is called dynamic scheduling, because the schedule of a given piece of code may differ depending on the code path followed), whereas in a VLIW, scheduling is performed in software (static scheduling, because the schedule is "built in to the binary" by the compiler or assembly-language programmer).
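To make the static-scheduling idea concrete, here is a toy sketch of what a VLIW compiler pass might do: pack independent instructions into fixed-width bundles at compile time, so the hardware never has to discover the schedule itself. The two-slot width and the register-based dependence test are assumptions for illustration only.

```python
# Toy static scheduler for a hypothetical 2-issue VLIW: instructions are packed
# into wide words ("bundles") ahead of time; the hardware issues each bundle
# as-is, with no dynamic dependence checking.

def bundle(instrs, width=2):
    """instrs: list of (name, dest_reg, source_regs).
    Greedy in-order packing: an instruction may join the current bundle
    only if it reads no register written earlier in that same bundle."""
    bundles, current, written = [], [], set()
    for name, dst, srcs in instrs:
        if len(current) == width or any(r in written for r in srcs):
            bundles.append(current)        # seal the bundle, start a new one
            current, written = [], set()
        current.append(name)
        written.add(dst)
    if current:
        bundles.append(current)
    return bundles
```

A real VLIW compiler does far more (latency-aware slotting per functional unit, software pipelining), but the essential point survives: the schedule is fixed in the binary before the program ever runs.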

B. Superscalar
Usually, the execution phase of the pipeline takes the longest. On modern hardware, the execution of the instruction may be performed by one of a number of functional units. For example, integer instructions may be executed by the ALU, whereas floating-point operations are performed by the FPU. On a traditional scalar pipelined architecture, either one or the other of these units will always be idle, depending on the instruction being executed. On a superscalar architecture, instructions may be executed in parallel on multiple functional units. The pipeline is essentially split after instruction issue.
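The benefit of splitting the pipeline after issue can be sketched with a toy issue model. Everything here is assumed for illustration: exactly two functional units (one ALU, one FPU), in-order dual issue, and single-cycle execution.

```python
# Toy issue model: a scalar core issues one instruction per cycle, while a
# hypothetical 2-way superscalar core may pair an integer op (ALU) with a
# floating-point op (FPU) in the same cycle.

def superscalar_cycles(kinds):
    """kinds: in-order list of "int" / "fp" instruction kinds.
    Returns execute cycles when adjacent instructions of different
    kinds dual-issue to the two functional units."""
    cycles, i = 0, 0
    while i < len(kinds):
        if i + 1 < len(kinds) and kinds[i] != kinds[i + 1]:
            i += 2        # one to the ALU, one to the FPU, same cycle
        else:
            i += 1        # both need the same unit: serialize
        cycles += 1
    return cycles
```

A mixed int/fp stream finishes in roughly half the cycles of the scalar case, while an all-integer stream gains nothing: the FPU sits idle either way, just as in the scalar pipeline.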

C. Interlocking
Another architectural feature present in some RISC and VLIW architectures, but never in superscalars, is the lack of interlocks. In a pipelined processor, it is important to ensure that a stall somewhere in the pipeline won't result in the machine performing incorrectly. This could happen if later stages of the pipeline do not detect the stall, and thus proceed as if the stalled stage had completed. To prevent this, most architectures incorporate interlocks on the pipeline stages. Removing interlocks from the architecture is beneficial, because they complicate the design and can take time to set up, lowering the overall clock rate. However, doing so means that the compiler (or assembly-language programmer) must know details about the timing of pipeline stages for each instruction in the processor, and insert NOPs into the code to ensure correctness. This makes code incredibly hardware-specific. Both the architectures studied in detail below are fully interlocked, though Sun's ill-fated MAJC architecture was not, and relied on fast, universal JIT compilation to solve the hardware problems.
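The NOP insertion described above can be sketched as a tiny compiler pass. The timings are assumed for illustration: a hypothetical two-cycle load-use latency and no other hazards; on a real non-interlocked machine the compiler would need the true per-instruction pipeline timings.

```python
# Sketch of compile-time NOP insertion for a non-interlocked pipeline.
# Assumption: a loaded value is usable only two instruction slots after the
# load; the hardware performs no checks, so the compiler pads with NOPs.

LOAD_DELAY = 2   # hypothetical load-use latency, in instruction slots

def insert_nops(instrs):
    """instrs: list of (op, dest_reg, source_regs).
    Emits the same program with NOPs added wherever a value would
    otherwise be read before the load producing it has completed."""
    out, pending = [], {}          # pending: register -> slots until ready
    for op, dst, srcs in instrs:
        while any(pending.get(r, 0) > 0 for r in srcs):
            out.append(("nop", None, []))                       # pad one slot
            pending = {r: c - 1 for r, c in pending.items() if c > 1}
        out.append((op, dst, srcs))
        pending = {r: c - 1 for r, c in pending.items() if c > 1}
        if op == "load":
            pending[dst] = LOAD_DELAY
    return out
```

An interlocked machine achieves the same correctness in hardware; the point of removing interlocks is to trade this compiler burden for a simpler, potentially faster pipeline.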
