COA unit 3


  • 7/31/2019 COA unit 3


    5/27/12

    UNIT III: Pipelining

    Department of Computer Science and Engineering

    Syllabus

    • Basic concepts
    • Data hazards
    • Instruction hazards
    • Influence on instruction sets
    • Data path and control considerations
    • Performance considerations
    • Exception handling

    Basic concepts

    The speed of program execution can be increased in several ways.

    The simplest solutions are:

    • Use faster circuit technology to build the processor.
    • Increase the main memory.

    Another solution is:

    Arrange the hardware so that more than one operation can be performed at the same time.

    The number of operations performed per second is thereby increased, even though the elapsed time needed to perform any one operation is unchanged.

    Basic Concepts (Contd.)

    Pipelining can be adopted in a computer.

    A processor executes a program by fetching and executing instructions, one after the other.

    Execution of a program consists of a sequence of fetch and execute steps, as shown in Figure 1.

    Sequential Execution

    Now consider a computer that has two separate hardware units, one for fetching instructions and another for executing them, as shown in Figure 2.

    Hardware Organization

    The instruction fetched by the fetch unit is deposited in an intermediate storage buffer, B1.

    Basic Concepts

    This buffer is needed to enable the execution unit to execute the instruction while the fetch unit is fetching the next instruction.

    The computer is controlled by a clock whose period is such that the fetch and execute steps of any instruction can each be completed in one clock cycle (Figure 3).


    Pipelined Execution

    Basic Concepts

    The processing of an instruction need not be divided into only two steps.

    For example, a pipelined processor may process each instruction in four steps, as follows:

    F Fetch: read the instruction from the memory.

    D Decode: decode the instruction and fetch the source operand(s).

    E Execute: perform the operation specified by the instruction.

    W Write: store the result in the destination location.
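    The overlap that a four-step pipeline produces can be sketched in a few lines of Python. This is only an illustration under ideal conditions (no hazards, one clock cycle per step); the function names are hypothetical, and the fourth step, W (write), is the one referred to later in the text.

```python
# Illustrative sketch: ideal 4-stage pipeline, no hazards, one cycle per stage.
STAGES = ["F", "D", "E", "W"]

def pipeline_schedule(num_instructions, stages=STAGES):
    """Return {instruction: {stage: clock_cycle}} for an ideal pipeline.

    Instruction i (1-based) occupies stage number s (0-based) in cycle i + s.
    """
    return {
        i: {stage: i + s for s, stage in enumerate(stages)}
        for i in range(1, num_instructions + 1)
    }

def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles to finish everything: fill the pipeline once, then one result per cycle."""
    return num_stages + num_instructions - 1

if __name__ == "__main__":
    for i, row in pipeline_schedule(4).items():
        print(f"I{i}:", ", ".join(f"{s}{i} in cycle {c}" for s, c in row.items()))
    print("pipelined cycles:", total_cycles(4))
    print("sequential cycles:", 4 * len(STAGES))
```

    Under these assumptions four instructions finish in 7 cycles instead of the 16 that strictly sequential execution would need.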

    Instruction Execution

    Figure 4

    Hardware Organization

    Figure 5

    Role of Cache Memory

    Each stage in a pipeline is expected to complete its operation in one clock cycle.

    Hence, the clock period should be sufficiently long to complete the task being performed in any stage.

    If different units require different amounts of time, the clock period must allow the longest task to be completed.

    Role of Cache Memory (Contd.)

    A unit that completes its task early is idle for the remainder of the clock period.

    Instruction fetch requires main memory access.

    Main memory access takes much longer than the operations performed inside the processor.

    To avoid this problem, a cache memory is used.

    Pipeline performance

    Figure 6 shows an example in which the operation specified in instruction I2 requires three cycles to complete, while the others require only one cycle.

    Effect of an execution taking more than one clock cycle

    In the given example, the pipeline operation is said to have been stalled for two clock cycles.

    Any condition that causes the pipeline to stall is called a hazard.

    The different types of hazard that can occur are:

    • Data hazard
    • Control hazard (instruction hazard)
    • Structural hazard

    DATA HAZARD

    A data hazard is a situation in which the pipeline is stalled because the data to be operated on are delayed for some reason.

    Example of a data hazard

    Assume that A = 5, and consider two consecutive operations in which the data used in the second operation depend on the result of the first.

    Handling Data Hazards

    • Operand forwarding
    • Handling data hazards in software
    • Side effects

    Operand forwarding

    I1: Mul R2,R3,R4

    I2: Add R5,R4,R6

    The result of the multiply operation is stored in register R4.

    R4 is one of the source operands of the Add instruction.

    When the decode unit decodes the Add instruction in cycle 3, it recognizes that R4 is a source operand.

    Hence the D step of instruction I2 cannot be completed until the W step of instruction I1 has been completed.

    So the pipeline is stalled for 2 clock cycles.

    However, the data are available at the output of the ALU as soon as the Execute stage completes step E1.

    With operand forwarding, the result of instruction I1 is forwarded directly for use in step E2.
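    The stall count quoted above can be checked with a small model. The sketch below assumes the four-step F, D, E, W pipeline with one cycle per step; `stall_cycles` and its `distance` parameter are hypothetical names used only for illustration.

```python
def stall_cycles(distance, forwarding=False):
    """Stall cycles seen by a consumer placed `distance` instructions after
    the instruction that produces its operand (distance 1 = next instruction).

    Assumed model: 4 steps (F, D, E, W), one cycle each.
    - Without forwarding, the consumer's D step must come after the
      producer's W step, so the two must be 3 instruction slots apart.
    - With forwarding, the consumer's E step only has to come after the
      producer's E step, so 1 slot apart is enough.
    """
    if distance < 1:
        raise ValueError("the consumer must follow the producer")
    required_gap = 1 if forwarding else 3
    return max(0, required_gap - distance)

if __name__ == "__main__":
    # Mul R2,R3,R4 immediately followed by Add R5,R4,R6
    print("no forwarding:", stall_cycles(1))
    print("forwarding:   ", stall_cycles(1, forwarding=True))
```

    For the Mul/Add pair this gives a two-cycle stall without forwarding and none with it, in line with the description above.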

    SRC1, SRC2, and RSLT are registers that constitute the interstage buffers needed for pipelined operation.

    SRC1 and SRC2 are part of buffer B2.

    RSLT is part of buffer B3.

    The two multiplexers connected at the inputs to the ALU allow the data on the destination bus to be selected instead of the contents of either the SRC1 or the SRC2 register.

    The instructions are executed in the datapath, and the operations performed in each clock cycle are as follows.

    After decoding instruction I2 and detecting the data dependency, a decision is made to use data forwarding.

    The operand not involved in the dependency, register R5, is read and loaded into register SRC1 in clock cycle 3.

    In the next clock cycle, the product produced by instruction I1 is available in register RSLT.

    Because of the forwarding connection, it can be used in step E2.

    Handling Data Hazards in software

    An alternative approach is to leave the task of detecting data dependencies and dealing with them to the software.

    In this case, the compiler can introduce the two-cycle delay needed between instructions I1 and I2 by inserting two NOP (no-operation) instructions between them.

    If the responsibility for detecting such dependencies is left entirely to the software, the compiler must insert the NOP instructions to obtain a correct result.

    This illustrates the close link between the compiler and the hardware.

    The compiler can also attempt to reorder instructions to perform useful tasks in the NOP slots, and thus achieve better performance.
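    A compiler pass of this kind might look roughly like the sketch below. It is not taken from the slides: the tuple representation of an instruction and the `insert_nops` helper are assumptions made for illustration, and the required distance of three instruction slots corresponds to the two-cycle delay of the four-step pipeline.

```python
NOP = ("NOP", (), None)

def insert_nops(program, min_distance=3):
    """Pad `program` with NOPs so that every register is read at least
    `min_distance` instruction slots after the instruction that writes it.

    Each instruction is a tuple (opcode, source_registers, destination).
    """
    out = []
    last_write = {}  # register -> position in `out` of its latest writer
    for op, sources, dest in program:
        # Earliest position at which all source operands are safe to read.
        earliest = max(
            (last_write[r] + min_distance for r in sources if r in last_write),
            default=0,
        )
        while len(out) < earliest:
            out.append(NOP)
        out.append((op, sources, dest))
        if dest is not None:
            last_write[dest] = len(out) - 1
    return out

if __name__ == "__main__":
    program = [
        ("Mul", ("R2", "R3"), "R4"),   # I1: R4 <- [R2] * [R3]
        ("Add", ("R5", "R4"), "R6"),   # I2: R6 <- [R5] + [R4]
    ]
    for op, sources, dest in insert_nops(program):
        print(op, ",".join(sources + ((dest,) if dest else ())))
```

    For the Mul/Add pair the output sequence is Mul, NOP, NOP, Add. A real compiler would first try to fill those two slots with independent instructions.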

    Disadvantages

    On the other hand, the insertion of NOP instructions leads to larger code size.

    Also, a given processor architecture often has several hardware implementations.

    NOP instructions inserted to satisfy the requirements of one implementation may not be needed on another, where they would lead to reduced performance.

    Side effects

    The data dependencies in the previous examples are easily detected because the register involved is named explicitly as the destination of instruction I1 and as a source of instruction I2.

    Sometimes an instruction changes the contents of a register other than the one named as its destination (e.g. with autoincrement or autodecrement addressing).

    All the precautions needed to handle data dependencies involving the destination location must also be applied to the registers affected by an autoincrement or autodecrement operation.

    When a location other than one explicitly named in an instruction as a destination operand is affected, the instruction is said to have a side effect.

    Stack instructions such as push and pop produce similar side effects because they implicitly use the autoincrement and autodecrement addressing modes.

    Another possible side effect involves the condition code flags, which are used by instructions such as conditional branches and add-with-carry.

    Suppose that registers R1 and R2 hold a double-precision integer number, and we wish to add to it another double-precision number held in registers R3 and R4. This may be accomplished as follows:

    Add R1,R3

    AddWithCarry R2,R4

    An implicit dependency exists between these two instructions through the carry flag.

    This flag is set by the first instruction and used by the second instruction, which performs the operation R4 ← [R2] + [R4] + carry.

    Instructions with side effects give rise to multiple data dependencies, which lead to a substantial increase in the complexity of the hardware or software needed to resolve them.

    For this reason, instructions designed for execution on pipelined hardware should have few side effects.
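    The hidden dependency through the carry flag can be made visible with a short sketch. The 32-bit word length and the choice of R1 and R3 as the low-order words are assumptions (the slides do not state them), and the function names are hypothetical.

```python
WORD_BITS = 32                     # assumed word length
MASK = (1 << WORD_BITS) - 1

def add(a, b):
    """Add two words; return (result, carry_flag)."""
    total = a + b
    return total & MASK, total >> WORD_BITS

def add_with_carry(a, b, carry):
    """Add two words plus the incoming carry; return (result, carry_flag)."""
    total = a + b + carry
    return total & MASK, total >> WORD_BITS

def double_precision_add(r1, r2, r3, r4):
    """Add the number held in (R2:R1) to the number held in (R4:R3)."""
    r3, carry = add(r1, r3)                 # Add R1,R3: sets the carry flag
    r4, _ = add_with_carry(r2, r4, carry)   # AddWithCarry R2,R4: reads it
    return r3, r4

if __name__ == "__main__":
    low, high = double_precision_add(0xFFFFFFFF, 0, 1, 0)
    print(hex(high), hex(low))
```

    The second addition cannot produce a correct high-order word until the first has produced its carry, even though the two instructions name no register in common.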

    INSTRUCTION HAZARDS

    The pipeline may also be stalled because of a delay in the availability of an instruction.

    • Unconditional branches
    • Conditional branches and branch prediction

    Unconditional branches

    A branch instruction may also cause the pipeline to stall.

    Instructions I1 to I3 are stored in successive memory addresses.

    Assume that I2 is a branch instruction and that its target is instruction Ik.

    In clock cycle 3, the fetch operation for instruction I3 is in progress at the same time that the branch instruction is being decoded and the target address computed.

    In clock cycle 4, the processor discards I3, which has been incorrectly fetched, and fetches instruction Ik.

    In the meantime, the hardware unit responsible for the Execute step must be told to do nothing during that clock period.

    Thus the pipeline is stalled for one clock cycle.

    The time lost as a result of a branch instruction is often referred to as the branch penalty.

    In this case, the branch penalty is only one clock cycle.

    Reducing the branch penalty requires the branch address to be computed earlier in the pipeline.

    The instruction fetch unit has dedicated hardware to identify a branch instruction and compute the branch target address as quickly as possible after an instruction is fetched.

    With this additional hardware, both of these tasks can be performed in step D2.

    Instruction queue and prefetching

    A cache miss or a branch instruction stalls the pipeline for one or more clock cycles.

    To reduce the effect of these interruptions, many processors employ sophisticated fetch units that can fetch instructions before they are needed and put them in a queue.

    A separate unit, called the dispatch unit, takes instructions from the front of the queue and sends them to the execution unit.

    The dispatch unit also performs the decoding function.

    To be effective, the fetch unit must have sufficient decoding and processing capability to recognize and execute branch instructions.

    It attempts to keep the instruction queue filled at all times to reduce delays.

    When the pipeline stalls because of a data hazard, the fetch unit continues to fetch instructions and add them to the queue.

    Conversely, if there is a delay in fetching instructions because of a branch or a cache miss, the dispatch unit continues to issue instructions from the instruction queue.

    Assume that initially the queue contains one instruction.

    Every fetch operation adds one instruction to the queue, and every dispatch operation reduces the queue length by one.

    The queue length remains the same for the first 4 clock cycles.

    Instruction I1 introduces a 2-cycle stall. Because space is available in the queue, the fetch unit continues to fetch instructions, and the queue length rises to 3 in clock cycle 6.

    Instruction I5 is a branch instruction. Its target instruction, Ik, is fetched in cycle 7, and instruction I6 is discarded.

    The branch instruction would normally cause a stall in clock cycle 7 as a result of discarding instruction I6. Instead, instruction I4 is dispatched from the queue to the decoding stage.

    The queue length drops to 1 in cycle 8.

    Instructions I1, I2, I3, I4, and Ik complete execution in successive clock cycles.

    Hence the branch instruction does not increase the overall execution time.

    This is because the instruction fetch unit has executed the branch instruction concurrently with the execution of other instructions.

    This technique is referred to as branch folding.

    Branch folding occurs only if, at the time a branch instruction is encountered, at least one instruction other than the branch instruction is available in the queue.

    It is desirable to arrange for the queue to be full most of the time, to ensure an adequate supply of instructions for processing.

    The width of the connection between the fetch unit and the instruction cache allows more than one instruction to be read in each clock cycle.

    The instruction queue is also beneficial in dealing with cache misses.

    Conditional branch and branch prediction

    Branch instructions occur frequently; they represent about 20 percent of the dynamic instruction count of most programs.

    • Delayed branch
    • Branch prediction
    • Dynamic branch prediction

    Delayed branch

    Branch delay slot

    The location following a branch instruction is called a branch delay slot.

    A technique called delayed branching can minimize the penalty incurred as a result of conditional branch instructions.

    The instructions in the delay slots are always fetched, so we arrange for them to be fully executed whether or not the branch is taken.

    The objective is to place useful instructions in these slots. If no useful instructions can be placed in the delay slots, these slots must be filled with NOP instructions.

    The instructions are reordered as shown in the figure.

    The shift instruction is fetched while the branch instruction is being executed.

    After evaluating the branch condition, the processor fetches the instruction at LOOP or at NEXT, depending on whether the branch condition is true or false.

    In either case, it completes the execution of the shift instruction.

    The pipeline is not interrupted, and there are no idle cycles.

    Branch Prediction

    A simple form of branch prediction is to assume that the branch will not take place and to continue to fetch instructions in sequential address order.

    Instruction execution along the predicted path is done on a speculative basis.

    Executing instructions before the processor is certain that they are in the correct execution sequence is called speculative execution.

    Care must be taken that no processor registers or memory locations are updated until it is confirmed that these instructions should indeed be executed.

    If the branch decision indicates otherwise, the instructions and all their associated data in the execution units must be purged, and the correct instructions fetched and executed.


    Dynamic branch prediction

    The objective of branch prediction algorithms is to reduce the probability of making a wrong decision, to avoid fetching instructions that eventually have to be discarded.

    In the simplest scheme, the execution history used in predicting the outcome of a given branch instruction is the result of the most recent execution of that instruction.

    The processor assumes that the next time the instruction is executed, the result is likely to be the same.

    The two states are:

    LT: Branch is likely to be taken

    LNT: Branch is likely not to be taken

    Assume that the algorithm is started in state LNT. When the branch instruction is executed, if the branch is taken, the machine moves to state LT. Otherwise, it remains in state LNT.

    The next time the same instruction is encountered, the branch is predicted as taken if the corresponding state machine is in state LT. Otherwise, it is predicted as not taken.

    Once a loop is entered, the branch instruction that controls looping will always yield the same result until the last pass through the loop is reached.

    In the last pass, the branch prediction will turn out to be incorrect, and the branch history state machine will be changed to the opposite state.

    Unfortunately, this means that the next time this same loop is entered, and assuming that there will be more than one pass through the loop, the machine will lead to the wrong prediction.

    Better performance can be achieved by keeping more information about execution history.
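    The weakness of the two-state scheme on loops can be reproduced with a short simulation. This is a sketch of the LT/LNT algorithm described above; the function names and the 5-pass loop are illustrative choices.

```python
def one_bit_mispredictions(outcomes, state="LNT"):
    """Count mispredictions of the two-state (LT / LNT) predictor.

    `outcomes` lists the actual results of one branch instruction,
    True meaning that the branch was taken.
    """
    misses = 0
    for taken in outcomes:
        predicted_taken = (state == "LT")
        if predicted_taken != taken:
            misses += 1
        state = "LT" if taken else "LNT"   # only the latest outcome is kept
    return misses

def loop_branch(passes):
    """A loop-closing branch: taken on every pass except the last."""
    return [True] * (passes - 1) + [False]

if __name__ == "__main__":
    # The same 5-pass loop is executed twice.
    print(one_bit_mispredictions(loop_branch(5) * 2))
```

    With the initial state LNT, each execution of the loop costs two mispredictions: one on the first pass and one on the last.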

    An algorithm that uses four states, thus requiring two bits of history information for each branch instruction, is shown in the figure.

    The four states are:

    ST: Strongly likely to be taken

    LT: Likely to be taken

    LNT: Likely not to be taken

    SNT: Strongly likely not to be taken

    Assume that the state of the algorithm is initially set to LNT. After the branch instruction has been executed, if the branch is actually taken, the state is changed to ST; otherwise, it is changed to SNT.

    When a branch instruction is encountered, the instruction fetch unit predicts that the branch will be taken if the state is either LT or ST, and it begins to fetch instructions at the branch target address. Otherwise, it continues to fetch instructions in sequential order.

    When in state SNT, the instruction fetch unit predicts that the branch will not be taken. If the branch is actually taken, that is, if the prediction is incorrect, the state changes to LNT.

    This means that the next time the same branch instruction is encountered, the instruction fetch unit will still predict that the branch will not be taken. Only if the prediction is incorrect twice in a row will the state change to ST.

    Now assume that the branch instruction is at the end of a loop and that the processor sets the initial state of the algorithm to LNT.

    During the first pass, the prediction (not taken) will be wrong, and hence the state will be changed to ST.

    In all subsequent passes the prediction will be correct, except for the last pass. At that time, the state will change to LT.

    When the loop is entered a second time, the prediction will be correct (branch taken).
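    The four-state algorithm can be simulated in the same way. The transitions out of SNT, LNT, and ST below follow the description in the text; the figure is not reproduced here, so the "not taken" transition out of LT (to SNT) is an assumption taken from the usual form of this state machine. All names are illustrative.

```python
# (state, branch_taken) -> next state
TRANSITIONS = {
    ("SNT", True): "LNT", ("SNT", False): "SNT",
    ("LNT", True): "ST",  ("LNT", False): "SNT",
    ("LT", True): "ST",   ("LT", False): "SNT",   # LT/not-taken is assumed
    ("ST", True): "ST",   ("ST", False): "LT",
}

def two_bit_mispredictions(outcomes, state="LNT"):
    """Count mispredictions of the four-state predictor on `outcomes`."""
    misses = 0
    for taken in outcomes:
        predicted_taken = state in ("LT", "ST")
        if predicted_taken != taken:
            misses += 1
        state = TRANSITIONS[(state, taken)]
    return misses

def loop_branch(passes):
    """A loop-closing branch: taken on every pass except the last."""
    return [True] * (passes - 1) + [False]

if __name__ == "__main__":
    twice = loop_branch(5) * 2            # the same 5-pass loop, executed twice
    print(two_bit_mispredictions(twice))               # initial state LNT
    print(two_bit_mispredictions(twice, state="LT"))   # initial state LT
```

    Starting in LNT, two executions of the loop cost three mispredictions (the very first pass, plus the last pass of each execution). Starting in LT removes the misprediction on the very first pass.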

    We now add one final modification to correct the mispredicted branch at the time the loop is first entered.

    In the absence of additional information about the nature of the branch instruction, we assumed that the processor sets the initial state to LNT.

    The information needed to set the initial state correctly can be provided by any of the static prediction schemes discussed earlier.

    Either by comparing addresses or by checking a prediction bit in the instruction, the processor sets the initial state of the algorithm to LNT or LT.

    For a branch at the end of a loop, the compiler would indicate that the branch should be predicted as taken, causing the initial state to be set to LT.

    With this modification, the branch prediction will be correct all the time, except for the final pass through the loop.

    The state information used in dynamic branch prediction algorithms may be kept by the processor in a variety of ways.

    It may be recorded in a look-up table, in which case it is possible for two branch instructions to share the same table entry.

    This may lead to a branch being mispredicted, but it does not cause an error in execution. It only introduces a small delay in execution time.

    Influence on Instruction Sets

    • Addressing modes
    • Condition codes

    Addressing modes

    Addressing modes should provide the means for accessing a variety of data structures simply and efficiently.

    Useful addressing modes include index, indirect, autoincrement, and autodecrement.

    Many processors provide various combinations of these modes to increase the flexibility of their instruction sets.

    In choosing the addressing modes to be implemented in a pipelined processor, we must consider the effect of each addressing mode on instruction flow in the pipeline.

    Two important considerations are the side effects of modes such as autoincrement and autodecrement, and the extent to which complex addressing modes cause the pipeline to stall.

    Another important factor is whether a given mode is likely to be used by compilers.

    For example, consider the instruction Load (X(R1)),R2, and assume that the index offset, X, is given in the instruction word.

    After computing the address in cycle 3, the processor needs to access memory twice:

    • first to read location X+[R1] in clock cycle 4,
    • then to read location [X+[R1]] in cycle 5.

    If R2 is a source operand in the next instruction, that instruction would be stalled for three clock cycles, which can be reduced to two cycles with operand forwarding.

    To implement the same Load operation using only simple addressing modes requires several instructions. For example, on a computer that allows three operand addresses, we can use an Add instruction followed by two Load instructions.

    The Add instruction performs the operation R2 ← X + [R1].

    The two Load instructions fetch the address and then the operand from memory. This sequence of instructions takes exactly the same number of clock cycles as the original, single Load instruction, as shown in the figure.
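    That the two forms compute the same result can be seen from a small sketch. Memory is modelled as a Python dictionary, and the three-instruction sequence in the comments (Add #X,R1,R2 followed by two Load (R2),R2 instructions) is an assumed spelling, since the original listing is not reproduced in this transcript.

```python
def load_complex(memory, regs, x):
    """Load (X(R1)),R2: one address computation, then two memory reads."""
    address = x + regs["R1"]        # compute X + [R1]
    pointer = memory[address]       # first access: read location X + [R1]
    regs["R2"] = memory[pointer]    # second access: read location [X + [R1]]
    return regs

def load_simple(memory, regs, x):
    """The same operation using only simple addressing modes."""
    regs["R2"] = x + regs["R1"]         # Add  #X,R1,R2   (assumed mnemonic)
    regs["R2"] = memory[regs["R2"]]     # Load (R2),R2
    regs["R2"] = memory[regs["R2"]]     # Load (R2),R2
    return regs

if __name__ == "__main__":
    memory = {110: 200, 200: 42}        # hypothetical memory contents
    print(load_complex(memory, {"R1": 10, "R2": 0}, 100)["R2"])
    print(load_simple(memory, {"R1": 10, "R2": 0}, 100)["R2"])
```

    Both versions perform one addition and two memory reads, which is why the single complex instruction is not faster in a pipeline.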

    In a pipelined processor, complex addressing modes that involve several accesses to the memory do not necessarily lead to faster execution.

    Advantage:

    They reduce the number of instructions needed to perform a given task and thereby reduce the program space needed in the main memory.

    Disadvantages:

    Their long execution times cause the pipeline to stall, thus reducing its effectiveness. They require more complex hardware to decode and execute them, and they are not convenient for compilers to work with.

    The instruction sets of modern processors are designed to take maximum advantage of pipelined hardware.

    The addressing modes used in modern processors often have the following features:

    • Access to an operand does not require more than one access to the memory.
    • Only load and store instructions access memory operands.
    • The addressing modes used do not have side effects.

    Three basic addressing modes that have these features are register, register indirect, and index.

    The first two require no address computation. In the index mode, the address can be computed in one cycle, whether the index value is given in the instruction or in a register.

    None of these modes has any side effects, with one possible exception.

    Condition codes

    The condition code flags are stored in the processor status register. They are either set or cleared by many instructions, so that they can be tested by subsequent conditional branch instructions to change the flow of program execution.

    An optimizing compiler for a pipelined processor attempts to reorder instructions to avoid stalling the pipeline when branches or data dependencies occur.

    In doing so, the compiler must ensure that reordering does not cause a change in the outcome of a computation.

    Consider the sequence of instructions in Figure 8.17a. The branch decision takes place in step E2 rather than D2 because it must await the result of the Compare instruction.

    The execution time of the Branch instruction can be reduced by interchanging the Add and Compare instructions, as shown in Figure 8.17b.

    This will delay the Branch instruction by one cycle relative to the Compare instruction.

    As a result, at the time the Branch instruction is being decoded, the result of the Compare instruction will be available and a correct branch decision will be made.

    Interchanging the Add and Compare instructions can be done only if the Add instruction does not affect the condition codes.

    These observations lead to two important conclusions:

    • To provide flexibility in reordering instructions, the condition-code flags should be affected by as few instructions as possible.
    • The compiler should be able to specify in which instructions of a program the condition codes are affected and in which they are not.

    Datapath and control considerations

    Fig. 7.8

    There are separate instruction and data caches that use separate address and data connections to the processor. This requires two versions of the MAR register: IMAR for accessing the instruction cache and DMAR for accessing the data cache.

    The PC is connected directly to the IMAR, so that the contents of the PC can be transferred to IMAR at the same time that an independent ALU operation is taking place.

    The data address in DMAR can be obtained directly from the register file or from the ALU, to support the register indirect and indexed addressing modes.

    Separate MDR registers are provided for read and write operations. Data can be transferred directly between these registers and the register file during load and store operations without the need to pass through the ALU.

    The following operations can be performed independently in the processor of Figure 8.18:

    • Reading an instruction from the instruction cache
    • Incrementing the PC
    • Decoding an instruction
    • Reading from or writing into the data cache
    • Reading the contents of up to two registers from the register file
    • Writing into one register in the register file
    • Performing an ALU operation

    Because these operations do not use any shared resources, they can be performed simultaneously in any combination.

    The following actions all happen during clock cycle 4:

    • Write the result of instruction I1 into the register file.
    • Read the operands of instruction I2 from the register file.
    • Decode instruction I3.
    • Fetch instruction I4 and increment the PC.

    Performance considerations

    The execution time T is given by

    $T = (N \times S) / R$

    where

    N = dynamic instruction count
    S = average number of clock cycles it takes to fetch and execute one instruction
    R = clock rate

    The instruction throughput $P_s$ is given by $P_s = R / S$.
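    As a quick numerical illustration of these two formulas, the sketch below evaluates them for a hypothetical program of one million instructions on a 500 MHz processor. The instruction count and the function names are illustrative choices, not values from the slides.

```python
def execution_time(n, s, r):
    """T = (N * S) / R, in seconds when R is given in cycles per second."""
    return n * s / r

def throughput(r, s):
    """Instruction throughput R / S, in instructions per second."""
    return r / s

if __name__ == "__main__":
    R = 500e6          # clock rate: 500 MHz
    N = 1_000_000      # hypothetical dynamic instruction count
    print(execution_time(N, 4, R))   # S = 4: sequential, four cycles per instruction
    print(execution_time(N, 1, R))   # S = 1: ideal pipeline
    print(throughput(R, 4), throughput(R, 1))
```

    Reducing S from 4 to 1 cuts the execution time by a factor of four, which is the ideal gain of a four-stage pipeline.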

    A four-stage pipeline may increase instruction throughput by a factor of four.

    In general, an n-stage pipeline has the potential to increase throughput n times.

    Thus, it would appear that the higher the value of n, the larger the performance gain.

    This leads to two questions:

    • How much of this potential increase in instruction throughput can be realized in practice?
    • What is a good value for n?

    Any time a pipeline is stalled, the instruction throughput is reduced.

    Hence, the performance of a pipeline is highly influenced by factors such as branch and cache miss penalties.

    Effect of instruction hazards

    Consider a processor that uses the four-stage pipeline.

    The clock rate, and hence the time allocated to each step in the pipeline, is determined by the longest step.

    Let the delay through the ALU be the critical parameter.

    This is the time needed to add two integers. Thus, if the ALU delay is 2 ns, a clock of 500 MHz can be used.

    The on-chip instruction and data caches for this processor should also be designed to have an access time of 2 ns.

    Under ideal conditions, this pipelined processor will have an instruction throughput, $P_p$, given by $P_p = R$ = 500 MIPS (million instructions per second).

    The cache miss penalty, $M_p$, in that system is computed to be 17 clock cycles.

    Let $T_I$ be the time between two successive instruction completions.

    For sequential execution, $T_I = S$.

    However, in the absence of hazards, a pipelined processor completes the execution of one instruction in each clock cycle; thus $T_I$ = 1 clock cycle.

    A cache miss stalls the pipeline by an amount equal to the cache miss penalty for the instruction in which the miss occurs.

    A cache miss can occur for either instructions or data.

    Consider a computer that has a shared cache for instructions and data, and let d be the fraction of instructions that refer to data operands in the memory.

    The average increase in the value of $T_I$ as a result of cache misses is given by

    $\delta_{miss} = ((1 - h_i) + d(1 - h_d)) \times M_p$

    where $h_i$ and $h_d$ are the hit ratios for instructions and data, respectively.

    Example:

    Assume that 30 percent of the instructions access data in memory. With a 95 percent instruction hit rate and a 90 percent data hit rate, $\delta_{miss}$ is given by

    $\delta_{miss} = (0.05 + 0.3 \times 0.1) \times 17 = 1.36$ cycles

    Taking this delay into account, the processor's throughput would be

    $P_p = R / T_I = R / (1 + \delta_{miss}) = 0.42R$

    Note that because R is expressed in MHz, the throughput is obtained directly in MIPS. For R = 500 MHz, $P_p$ = 210 MIPS.

    Let us compare this value to the throughput obtainable without pipelining. A processor that uses sequential execution requires four cycles per instruction.

    Its throughput would be $P_s = R / (4 + \delta_{miss}) = 0.19R$

    For R = 500 MHz, $P_s$ = 95 MIPS. Clearly, pipelining leads to significantly higher throughput.

    But the performance gain of 0.42/0.19 = 2.2 is only slightly better than one-half of the ideal case.

    Reducing the cache miss penalty is particularly worthwhile in a pipelined processor. This can be achieved by introducing a secondary cache between the primary, on-chip cache and the memory.

    Assume that the time needed to transfer an 8-word block from the secondary cache is 10 ns.

    At R = 500 MHz the clock period is 2 ns, so a miss in the primary cache for which the required block is found in the secondary cache introduces a penalty, Ms, of 5 clock cycles.

    In the case of a miss in the secondary cache, the full 17-cycle penalty is still incurred.

    Hence, assuming a hit rate hs of 94 percent in the secondary cache, the average increase in TI is δmiss = ((1 - hi) + d (1 - hd)) * (hs * Ms + (1 - hs) * Mp) = 0.46 cycle

    The instruction throughput in this case is 0.68R, or 340 MIPS. An equivalent non-pipelined processor would have a throughput of 0.22R, or 110 MIPS.

    Thus the pipelining provides a performance gain of 0.68/0.22 = 3.1.
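
    The arithmetic in this example can be checked with a short script. This is only a sketch: the function names miss_delay and throughput_mips are made up for illustration, and the parameter values are the ones assumed in the example above (R = 500 MHz, hi = 0.95, hd = 0.90, d = 0.30, Mp = 17, Ms = 5, hs = 0.94).

```python
def miss_delay(h_i, h_d, d, penalty):
    """Average extra cycles per instruction caused by cache misses."""
    return ((1 - h_i) + d * (1 - h_d)) * penalty


def throughput_mips(clock_mhz, base_cycles, delta):
    """Throughput in MIPS when the clock rate is given in MHz."""
    return clock_mhz / (base_cycles + delta)


R = 500.0                       # clock rate in MHz
h_i, h_d, d = 0.95, 0.90, 0.30  # hit ratios and fraction of data-access instructions
M_p, M_s, h_s = 17, 5, 0.94     # penalties in cycles, secondary-cache hit rate

# Primary cache only: every miss costs the full penalty M_p.
delta_primary = miss_delay(h_i, h_d, d, M_p)

# With a secondary cache: the effective penalty is a weighted average.
delta_two_level = miss_delay(h_i, h_d, d, h_s * M_s + (1 - h_s) * M_p)

for label, delta in (("primary cache only", delta_primary),
                     ("with secondary cache", delta_two_level)):
    pipelined = throughput_mips(R, 1, delta)   # 1 cycle per instruction plus stalls
    sequential = throughput_mips(R, 4, delta)  # 4 cycles per instruction plus stalls
    print(f"{label}: delta = {delta:.2f} cycles, "
          f"Pp = {pipelined:.0f} MIPS, Ps = {sequential:.0f} MIPS, "
          f"gain = {pipelined / sequential:.2f}")
```

    Without rounding the intermediate ratios, the script gives about 212 and 93 MIPS (gain about 2.3) for the first case and about 343 and 112 MIPS (gain about 3.1) for the second. The small differences from the figures quoted above come from rounding 1/(1 + δmiss) and 1/(4 + δmiss) to two digits.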

    An optimizing compiler attempts to increase the distance between two instructions that create a dependency by placing other instructions between them whenever possible.

    Also, in a processor that uses an instruction queue, the cache miss penalty during instruction fetches may have a reduced effect, as the processor is able to dispatch instructions from the queue.

    Number of pipeline stages

    The fact that an n-stage pipeline may increase instruction throughput by a factor of n suggests that we should use a large number of stages.

    However, as the number of pipeline stages increases, so does the probability of the pipeline being stalled, because more instructions are being executed concurrently.

    Dependencies between instructions may still cause the pipeline to stall, and branch penalties may also become more significant.

    For these reasons, the gain from increasing the value of n begins to diminish, and the associated cost is not justified.

    Another important factor is the inherent delay in the basic operations performed by the processor.

    The most important among these is the ALU delay.

    In many processors, the cycle time of the processor clock is chosen such that one ALU operation can be completed in one clock cycle.

    Other operations are divided into steps that take about the same time as an add operation.

    It is also possible to use a pipelined ALU.

    Eg: The ALU of the Compaq Alpha 21064 processor consists of a two-stage pipeline, in which each stage completes its operation in 5 ns.

    Many pipelined processors use four to six stages. Others divide instruction execution into smaller steps and use more pipeline stages and a faster clock.

    UltraSPARC II uses a 9-stage pipeline.

    Pentium Pro uses a 12-stage pipeline.

    Pentium 4 has a 20-stage pipeline and uses a clock speed in the range 1.3 to 1.5 GHz.

    For fast operations, two pipeline stages are completed in one clock cycle.

    Exception Handling

    Exceptional situations are harder to handle in a pipelined CPU because the overlapping of instructions makes it more difficult to know whether an instruction can safely change the state of the CPU.

    In a pipelined CPU, an instruction is executed piece by piece and is not completed for several clock cycles.

    Unfortunately, other instructions in the pipeline can raise exceptions that may force the CPU to abort the instructions in the pipeline before they complete.

    Types of Exceptions

    I/O device request

    Invoking an operating system service from a user program

    Tracing instruction execution

    Breakpoint

    Integer arithmetic overflow

    FP arithmetic anomaly

    Page fault

    Misaligned memory accesses

    Memory-protection violation

    Undefined or unimplemented instruction

    Hardware malfunctions

    Privilege violation

    Hardware and power failure

    The requirements on exceptions can be characterized along five semi-independent axes:

    Synchronous vs. asynchronous:

    If the event occurs at the same place every time the program is executed with the same data and memory allocation, the event is synchronous.

    With the exception of hardware malfunctions, asynchronous events are caused by devices external to the CPU and memory.

    Asynchronous events can usually be handled after the completion of the current instruction, which makes them easier to handle.

    User requested vs. coerced

    If the task directly asks for it, it is a user-requested event.

    User-requested exceptions are predictable and can usually be handled after the completion of the current instruction, which makes them easier to handle.

    User Maskable vs. unmaskable

    If an event can be masked or disabled by a user task, it is user maskable. The mask simply controls whether the hardware responds to the exception or not.

    Within vs. between instructions

    Exceptions occurring within instructions are usually synchronous, since the instruction itself triggers the exception.

    It is harder to deal with exceptions that occur within instructions.

    Resume vs. terminate

    If the program's execution always stops after the interrupt, it is a terminating event.

    If the program's execution continues after the interrupt, it is a resuming event. It is easier to implement exceptions that terminate program execution.

    I/O device request:

    Errors that are related to an I/O request are usually indicated in the status data provided with the I/O interrupt. These errors are:

    Tracing instruction:

    The trace instruction is used to control the tracing of the execution, and is primarily used for debugging.

    trace all

    trace methods

    trace off

    trace results

    Page fault:

    A page fault is an interrupt (or exception) to the software raised by the hardware when a program accesses a page that is mapped in the address space but not loaded in physical memory.

    Hardware malfunction

    This behavior can occur if a hardware component malfunctions, or if there are damaged or incompatible drivers installed.

    Check the Memory

    Remove any extra memory modules that are in the computer, leaving only the least amount that is required for the computer to start and run Windows. Restart the computer to see whether the error messages persist.

    Check the Adapters

    Remove any adapters that are not required to start the computer and run Windows.

    In many cases, you can start your computer with only the drive subsystem controller and the video adapter.

    Check the Computer BIOS/Configuration

    Verify that you have installed the latest revisions for your computer's BIOS or firmware configuration software. Go into the BIOS and set load Fail-safe defaults or BIOS defaults, disable any antivirus protection in the BIOS, and then set Plug and Play OS to No.

    Check For Updated Drivers

    Arithmetic overflow

    If you use fixed precision datatypes (smallint, integer, bigint, decimal and numeric), it is possible that the result of a calculation doesn't fit the datatype. Try casting the values in complex expressions as double precision and see whether the error goes away. If it works and you don't care about being too precise, you can leave it at that. Otherwise you need to check every operation and calculate the result.

    Here's an example: if you multiply 9.12 by 8.11 (both numeric(18,2)) you would get 73.9632. If Firebird stored that into a numeric(18,2) datatype, we would lose 0.0032. That doesn't look like much, but when you have complex calculations, you can easily lose thousands (dollars or euros). Therefore, the result is stored in numeric(18,4).
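
    The loss of precision in this example can be reproduced with a few lines of Python. This is a sketch only: Python's decimal module stands in for a fixed-precision numeric type, and it does not model Firebird's actual storage rules.

```python
from decimal import Decimal

a = Decimal("9.12")   # two decimal places, like numeric(18,2)
b = Decimal("8.11")

product = a * b                             # exact product has four decimal places
stored = product.quantize(Decimal("0.01"))  # forced back to two decimal places
lost = product - stored                     # what the narrower type throws away

print(product, stored, lost)
```

    Keeping four decimal places in the result, as numeric(18,4) does, avoids this loss for a single multiplication.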

    Break point Exception

    This is a DMA problem. It sounds like your program is executing code in some DLL and that DLL has found something really wrong, like a destroyed heap. It is fairly common to program a debugger break instruction in the code so that a debugger gets a chance to stop. Fixing this is probably going to be tough.

    You would not normally see this exception, as it is used by a debugger under the direction of a programmer to help solve or validate an issue in a program. The debugger can define a breakpoint event, and when such an event occurs, the processor issues this exception, and the debugger will regain control.

    Memory protection violation

    Segmentation

    Segmentation refers to dividing a computer's memory into segments.

    Paging

    In paging, the memory address space is divided into equal, small pieces, called pages. Using a virtual memory mechanism, each page can be made to reside in any location of the physical memory, or be flagged as being protected. Virtual memory makes it possible to have a linear virtual memory address space and to use it to access blocks fragmented over the physical memory address space.

    A page table is used for mapping virtual memory to physical memory. The page table is usually invisible to the process. Page tables make it easier to allocate new memory, as each new page can be allocated from anywhere in physical memory.
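
    The role of the page table in raising page faults and memory-protection violations can be sketched as follows. Everything here is an illustrative assumption: the 4096-byte page size, the dictionary layout of a page-table entry, and the names translate, PageFault and ProtectionViolation do not come from any particular processor or operating system.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class PageFault(Exception):
    """Page is mapped in the address space but not loaded in physical memory."""


class ProtectionViolation(Exception):
    """Access is not permitted for this page."""


def translate(page_table, vaddr, write=False):
    """Map a virtual address to a physical address using a toy page table."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)   # virtual page number and offset
    entry = page_table.get(vpn)
    if entry is None:
        raise ProtectionViolation(f"address {vaddr:#x} is not mapped")
    if not entry["present"]:
        raise PageFault(f"page {vpn} must be loaded before retrying")
    if write and not entry["writable"]:
        raise ProtectionViolation(f"page {vpn} is read-only")
    return entry["frame"] * PAGE_SIZE + offset


# Hypothetical page table: virtual page number -> entry.
page_table = {
    0: {"frame": 7, "present": True, "writable": True},
    1: {"frame": 2, "present": True, "writable": False},
    2: {"frame": 0, "present": False, "writable": True},
}

print(translate(page_table, 4100))  # page 1, offset 4 -> frame 2
```

    A page fault is restartable: once the operating system has loaded the page and marked the entry present, the same access can be repeated. A protection violation normally is not.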

    Stopping & Restarting Execution

    The most difficult exceptions have two properties:

    They occur within instructions

    They must be restartable

    When an exception occurs, the pipeline control can do the following:

    Force a trap instruction into the pipeline on the next IF.

    Until the trap is taken, turn off all writes for the faulting instruction and all the instructions that follow.

    After the exception-handling routine in the operating system receives control, it saves the PC of the faulting instruction.

    When using delayed branching, it is impossible to recreate the state using a single PC, since the instructions may not be sequentially related.

    Exceptions in MIPS

    ----------------------------------------
    LD      IF   ID   EX   MEM  WB
    DADD         IF   ID   EX   MEM  WB
    ----------------------------------------

    This pair of instructions can cause a data page fault and an arithmetic exception at the same time.

    This case can be handled by dealing with only the data page fault and then restarting the execution. When the second exception occurs, it can be handled independently.

    Consider the instructions LD followed by DADD. The LD can get a data page fault, seen when the instruction is in MEM, and the DADD can get an instruction page fault, seen when the DADD instruction is in IF. The instruction page fault occurs first in time, even though it is caused by the later instruction.

    Since we are implementing precise exceptions, the pipeline is required to handle the exception caused by the LD instruction first.

    Call the instruction in the position of the LD instruction i, and the instruction in the position of the DADD instruction i+1.

    The MIPS Approach:

    Hardware posts all exceptions caused by a given instruction in a status vector associated with the instruction.

    The exception status vector is carried along as the instruction goes down the pipeline.

    Once an exception indication is set in the exception status vector, any control signal that may cause a data value to be written is turned off.

    Upon entering the WB stage, the exception status vector is checked and the exceptions, if any, will be handled according to the time they occurred.

    Allowing an instruction to continue execution till the WB stage is not a problem, since all write operations for that instruction will be disallowed.
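
    The status-vector scheme described above can be illustrated with a toy cycle-by-cycle model. This is a simplified sketch under stated assumptions, not the MIPS hardware: it assumes a five-stage pipeline with one instruction entering per cycle and no stalls, and the function name run and the way faults are described (a dictionary from stage name to exception name) are made up for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def run(program):
    """Simulate exception posting in a stall-free five-stage pipeline.

    program: list of (name, {stage: exception}) in program order.
    Returns (raised, handled, committed):
      raised    - (cycle, name, exception) in the order faults are detected
      handled   - (name, exception) taken at WB, or None if nothing faulted
      committed - names of instructions that completed normally
    """
    status = [[] for _ in program]  # one exception status vector per instruction
    raised, committed = [], []
    for cycle in range(len(program) + len(STAGES) - 1):
        for idx, (name, faults) in enumerate(program):  # oldest instruction first
            stage_no = cycle - idx                      # instruction idx enters IF at cycle idx
            if not 0 <= stage_no < len(STAGES):
                continue
            stage = STAGES[stage_no]
            if stage in faults:
                # Post the exception; from here on this instruction's writes are off.
                status[idx].append(faults[stage])
                raised.append((cycle + 1, name, faults[stage]))
            if stage == "WB":
                if status[idx]:
                    # Vector checked on entering WB: take the exception,
                    # younger instructions are discarded and restarted later.
                    return raised, (name, status[idx][0]), committed
                committed.append(name)
    return raised, None, committed


program = [
    ("LD", {"MEM": "data page fault"}),
    ("DADD", {"IF": "instruction page fault"}),
]
raised, handled, committed = run(program)
print("detected in time order:", raised)
print("handled first:", handled)
```

    In this run the DADD fault is detected in cycle 2 and the LD fault in cycle 4, yet the exception taken is the LD data page fault, because LD reaches WB first. That is the ordering precise exceptions require.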

    Notes:

    The MIPS machine design does not allow exceptions to occur at the WB stage.

    All write operations in the MIPS pipeline are in late stages.

    Machines that allow writing in early pipeline stages are difficult to handle, since exceptions can occur after the machine state has already been changed.