Superpipelining and VLIW
Summary
TLDR: This script delves into the concept of Superscalar Execution, also known as Super Pipelining, in processor architecture. It explains how processors can execute multiple instructions in parallel when they are independent, which requires multiple logical units for simultaneous instruction fetch, decode, and execution. The script also highlights challenges such as data dependency, branching, and memory latency that limit pipelining efficiency. It contrasts Super Pipelining with VLIW, discussing the trade-off between dynamic runtime decision-making and static, compiler-based instruction bundling.
Takeaways
- 🚀 Superscalar execution allows a processor to execute multiple instructions in parallel if they are independent of each other.
- 🛠️ Superscalar execution requires multiple hardware units for each stage (fetch, decode, execute) to process instructions simultaneously.
- ⛔ Instructions that depend on the results of previous instructions cannot be executed in parallel and must wait for the necessary data to be available.
- 📊 Superscalar execution is particularly beneficial in operations like linear algebra, where similar operations on independent data sets can be parallelized.
- 🔗 Data dependency is a major issue in pipelining and superscalar execution, as it can stall the pipeline if an instruction depends on the result of a previous one.
- 🔄 Branching can cause inefficiencies in pipelining, as instructions after a branch may need to be discarded if the branch is taken, leading to wasted work.
- ⏳ Memory latency is a significant challenge, as fetching data from memory can take hundreds of cycles, stalling the pipeline while the processor waits for the data.
- 🎯 Out-of-order execution allows the processor to issue instructions based on a window of code, enabling it to execute independent instructions together, even if they are not sequential.
- 💻 VLIW (Very Long Instruction Word) architecture offloads the decision of parallel instruction execution to the compiler, simplifying the processor's hardware.
- 📅 VLIW can analyze a larger window of code during compilation, but lacks the dynamic state awareness of super pipelining, making it less responsive to real-time conditions.
Q & A
What is Superscalar Execution also known as?
-Superscalar Execution is also known as Super Pipelining.
How does Superscalar Execution allow for parallel execution of instructions?
-Superscalar Execution allows for parallel execution by enabling the processor to execute multiple instructions simultaneously if it determines that they are independent of each other.
What is required for a processor to execute instructions in parallel using Superscalar Execution?
-For parallel execution, the processor requires multiple logical units, meaning it needs separate hardware for each stage of the instruction pipeline to fetch, decode, and execute multiple instructions at a time.
What is a 'no op' cycle in the context of Superscalar Execution?
-A 'no op' cycle, short for 'no operation', is a cycle where no operation is performed because there is a dependency on another instruction that has not completed yet.
In what kind of operations would Superscalar Execution architecture be particularly useful?
-Superscalar Execution is particularly useful in operations like linear algebra, where there are many independent operations on different data sets, such as scaling a vector or computing a dot product.
How can the dot product of two vectors be computed using Superscalar Execution?
-The dot product can be computed by performing independent multiplications of corresponding elements from each vector and then summing these products. Superscalar Execution can be used to execute these multiplications in parallel.
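The split-accumulator idea from the answer above can be sketched in Python. This is a minimal illustration, not code from the lecture: the function name and the half-and-half split are illustrative assumptions.

```python
def dot_product(a, b):
    """Dot product computed with two independent partial sums.

    The two halves share no data, so a superscalar processor can run
    their multiply-add chains in parallel; a single final addition
    merges the two accumulators, as described in the lecture.
    """
    assert len(a) == len(b)
    mid = len(a) // 2
    first_half = 0
    second_half = 0
    for i in range(mid):             # independent stream 1
        first_half += a[i] * b[i]
    for i in range(mid, len(a)):     # independent stream 2
        second_half += a[i] * b[i]
    return first_half + second_half  # one merge step at the end

print(dot_product([1, 2, 3, 4], [5, 6, 7, 8]))  # 1*5 + 2*6 + 3*7 + 4*8 = 70
```

In real code a compiler or the processor itself would interleave the two streams; keeping two accumulators simply makes the independence explicit.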
What are some of the issues typically faced with pipelining and super pipelining?
-Issues faced with pipelining and super pipelining include data dependency, branching, and memory latency. Data dependency can cause delays when instructions need to wait for data from previous operations. Branching can lead to wasted work if instructions following a branch are discarded. Memory latency can stall the pipeline if data retrieval from memory takes much longer than instruction execution.
What is the difference between 'in-order execution' and 'out-of-order issue' in the context of Superscalar Execution?
-In 'in-order execution', instructions are executed in the exact order they appear in the code. In contrast, 'out-of-order issue' allows the processor to issue instructions that are not in sequential order, based on their independence and the availability of resources, to maximize parallel execution.
What is VLIW and how does it differ from Superscalar Execution?
-VLIW stands for Very Long Instruction Words. It is an approach where the compiler determines which independent instructions can be executed together in one instruction word, simplifying the hardware but requiring more complex compilation. Superscalar Execution, on the other hand, makes these decisions dynamically at runtime, which can be more complex but also more responsive to the current state of execution.
How does the dynamic state of a processor affect its ability to execute instructions in Superscalar Execution?
-The dynamic state, including data availability and branch history, allows the processor to make real-time decisions about which instructions to issue in parallel. This dynamic decision-making is not available to the compiler in VLIW architectures, which must make these decisions offline during compilation.
Outlines
🔁 Superscalar Execution and Its Challenges
This paragraph introduces the concept of Superscalar Execution, also known as Super Pipelining, which is a technique that allows a processor to execute multiple instructions in parallel if they are independent of each other. The processor requires multiple logical units to handle the simultaneous fetch, decode, and execution of instructions. However, this approach faces challenges such as data dependencies, where an instruction must wait for another to complete before it can execute, resulting in 'no op' cycles. The paragraph also highlights the utility of superscalar execution in operations like linear algebra, where independent data sets can be processed in parallel. Issues with pipelining, such as data dependency and branching, are also discussed, where branching can cause a pipeline to be flushed of instructions that will not be executed.
🔄 Addressing Pipelining Issues: Dynamic vs. Static
The second paragraph delves into the problems associated with pipelining, such as wasted work due to branch instructions and memory latency, which can significantly delay instruction execution as the processor waits for data from memory. The paragraph contrasts dynamic super pipelining, where the processor decides in real-time which instructions to execute together, with Very Long Instruction Words (VLIW), where the compiler offloads the decision-making to determine which instructions can be executed in parallel. The dynamic nature of super pipelining requires complex circuitry and real-time decision-making, whereas VLIW simplifies hardware at the cost of flexibility to handle dynamic states and branch history.
🛠️ The Trade-offs of VLIW and Super Pipelining
The final paragraph discusses the trade-offs between VLIW and super pipelining architectures. VLIW allows for simpler hardware as the compiler statically determines which independent instructions can be bundled and executed together. This offline process can consider a larger set of instructions and their potential combinations. However, it lacks the ability to adapt to the dynamic runtime state of the processor, such as handling data not being present or making decisions based on branch history. Super pipelining, on the other hand, can leverage real-time information but is limited by the need for rapid decision-making and complex hardware requirements.
Keywords
💡Superscalar Execution
💡Super Pipelining
💡Instruction Fetch
💡Decode
💡Execute
💡Logical Units
💡Data Dependency
💡No Operation (No-Op)
💡Linear Algebra Operations
💡Memory Latency
💡In-Order Execution
💡Out-Of-Order Issue
💡VLIW (Very Long Instruction Words)
💡Branch Instruction
💡Operand Fetch
Highlights
Superscalar Execution, also known as Super Pipelining, allows for the parallel execution of independent instructions.
Processors can execute multiple instructions simultaneously if they determine that there are no dependencies between them.
Superscalar Execution requires multiple logical units for each stage of instruction processing.
Instruction dependencies can cause 'no op' cycles, where the processor has to wait before executing the next instruction.
Linear algebra operations, such as scaling vectors or computing dot products, are ideal candidates for Superscalar Execution.
Parallelizing operations on independent data sets can significantly benefit from Superscalar Execution.
Data dependencies are a common issue in pipelining and Super Pipelining, where instructions cannot be executed until their operands are available.
Branching in pipelines can lead to wasted work, as instructions following a branch may need to be discarded.
Memory latency is a significant challenge for pipelining, as the processor's speed far exceeds memory access times.
Superscalar Execution attempts to issue multiple instructions together, but is limited by the processor's ability to handle dependencies and memory latency.
In-order execution in Super Pipelining involves picking up the next instruction and deciding whether it can be executed simultaneously with the current one.
Modern processors maintain a window of instructions to determine which can be executed in parallel.
Out-of-order issue is a technique where instructions are issued in a different order than they appear in the code, to maximize parallel execution.
VLIW (Very Long Instruction Words) is an alternative approach where the compiler determines which instructions can be executed together, simplifying hardware complexity.
VLIW has the advantage of simpler hardware due to the compiler's role in determining instruction execution, but lacks the dynamic decision-making capability of Super Pipelining.
The dynamic state of the processor, including branch history and memory access times, is not available to the compiler in VLIW architectures.
Transcripts
So, the next feature that pushes this a little further is Superscalar Execution, and it is
also called Super Pipelining.
What happens in this is that the processor, if it determines that there are instructions
like in the previous example - we saw ‘add R1 10’ and ‘add R2 5’ - so, if it is
able to figure out that these instructions have nothing to do with each other, then it
can actually execute them both in parallel.
So, you have instruction fetch, decode, and execute, and simultaneously, (you could be
doing) you could do the instruction fetch, decode and execute for the second instruction.
Right, we could do both of these in parallel.
But what does this require, now?
This requires multiple logical units.
So, you need multiple hardware for each of the stages, right.
Because you want to be fetching 2 instructions at a time, you want to be decoding 2 instructions
at a time, you want to be executing 2 instructions at a time.
And of course, if you have something like, let us say that the second instruction, instead
of ‘add R2 5’, it was (let us say) ‘add R1 R3’ (right), then could you execute that
in parallel along with the first instruction?
(So) you could do the instruction fetch in parallel - there is nothing wrong with fetching
both these instructions in parallel; you could decode both these instructions in parallel,
right; but while this first instruction is getting executed, can you execute the second
instruction?
No - essentially what happens in this case is that for the second instruction, you have
to execute a ‘no op’ cycle, ‘no op’ is ‘no operation’, right.
You have to wait because it has a dependency on some other instruction.
Right?
And then you could execute it in the next cycle, right.
Once 10 has been added to R1, then you could add R3 to R1 in that cycle.
Okay?
Can you think of some examples where this kind of architecture functionality would be
very useful?
So, one common, I mean a lot of common operations is when you are doing linear algebra operations
(right), when you are working on matrices and vectors, right.
So, suppose you are trying to scale a vector or compute a dot product, you have lots of
operations which are very similar in nature and which are completely working on independent
data sets (right).
So, for instance, if you are computing the dot product of 2 vectors A and B, (right)
so what do you do - you pick up the first element of each of them, you multiply them
together, you pick up the second elements of each of the vectors and multiply them together,
then the third elements, and so on.
So, each of these is independent, right.
Yes, you want to add them up finally, to one common value, but (you know) there are easy
ways, you can design easy algorithms to take care of that (right).
For instance, you could keep track of the result of let us say first half of the vector
in a separate variable and you could keep the dot product of the second half of the
vectors in a separate variable and in the end you could add them up together, right.
So, you can parallelize this to a large extent.
So, when you are doing the operations on the first half of the vector and if you are doing
the operations on the second half of the vectors, they are very much independent (right), they
do not have anything to do with each other.
So, you could possibly rearrange your code so that the instructions for the first half
and the second half get executed together, and that would make very good use of superscalar
execution.
So, what are the issues that we typically face with pipelining and with super pipelining.
(So) there are several issues - the first one of them, we have already seen, is data
dependency.
This can take various forms, but one of them is that, for instance, over here (right), we
were adding 10 to R1 and then we were adding R3 to R1, but you could not perform the second
addition till that register was free.
It was already participating in some operation, something was being written into that register,
right.
So, you could not (you know) perform that execute cycle at that point in time.
(Right, so) there are different kinds of data dependencies, (so) we will talk about them
later.
The second issue is branching.
So, what are we doing with these pipelines (right)?
(So) we are filling up these pipelines.
Let us say that these are the cycles and this is how the pipeline, how the instructions
are getting executed (right).
So, let’s say that we have issued 2 instructions together - (so) this is super pipelining - and
then in the next cycle, we issued 2 more instructions, and in the next cycle, we issued 2 more instructions;
and at some point of time what we realize is that, this particular instruction, let
us say, was a branch instruction (right) - when you decode that instruction, you realize it
is a branch instruction.
So, what happens now?
(So) what happens is that the instructions that are executing before it, they are fine
(right), they have to be executed anyways, but what about all the instructions which
are being executed after it?
So, here we are assuming that (you know) the processor is just picking up the instructions
in sequential order and putting them into the pipeline.
So, what happens to all these instructions which come after the branch instruction?
(So) you have to basically get rid of them (right) because you are going to jump to some
other piece of code.
So, these instructions have to be thrown away.
So, that is wasted work, right - you picked up all these instructions, you put them in
the pipeline, but eventually (you know) that was wasted, you did not complete these instructions,
you could not finish them.
Another issue is memory latency.
Well, this goes beyond pipelining, this is an issue in general.
The problem is that the processor operates at frequencies of (you know) something like
3 GigaHertz, 4 GigaHertz.
If you do some memory operation (right), if you want to fetch some data from the memory,
it takes a substantial amount of time for that data to come.
It can take hundreds of cycles, even though your instruction can execute in about 4-5
cycles, in 4-5 nanoseconds, but just because it is trying to get something from the memory,
just to get that data, it can take hundreds of cycles.
(Right) so, there is a huge gap between the performance of the processor and the time
it takes to get data from the memory.
Okay?
So, what will happen to pipelining in this case?
Let us say that I have an instruction, say, ‘add R1’, but instead of adding a value
like 10 to it - a constant which (you know) I do not need to go to the memory for - suppose
I am trying to add to register R1, data that resides in memory location 1000.
So, what would happen over here; what would happen in the pipeline?
Well, you would have the ‘instruction fetch’, you would have the ‘decode’, and then
you would have the ‘operand fetch’.
(So) what does operand fetch do - it fetches the data from the memory.
How long is this going to take to execute?
Well this is going to eat up about maybe 100 cycles, right.
So, what happens to the next instruction which was put into the pipeline after this - instruction
fetch, decode, let us say the next instruction was ‘add R1 comma 3’.
So, what do I do now?
I cannot execute that instruction; that instruction is waiting also for R1, right.
So, it is also stuck - only when this data comes, and finally, this instruction gets
executed, after that can I execute this instruction.
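The stall described above can be modelled with a toy cycle counter. Everything here is an illustrative assumption, not from the lecture: the instruction encoding, the 100-cycle load latency, and the one-issue-per-cycle rule are all simplifications made up for the sketch.

```python
# Toy model of an in-order pipeline stalling on a long-latency load.
# Each instruction is (name, dest_register, source_registers, latency);
# the latencies are illustrative, not taken from any real processor.
def run_in_order(program):
    ready_at = {}  # register -> cycle at which its value is available
    cycle = 0
    for name, dest, srcs, latency in program:
        # stall until every source register has been written back
        start = max([cycle] + [ready_at.get(r, 0) for r in srcs])
        finish = start + latency
        ready_at[dest] = finish
        cycle = start + 1  # the next instruction issues one cycle later
        print(f"{name}: issued at cycle {start}, result ready at {finish}")
    return max(ready_at.values())

program = [
    ("load mem[1000] -> R1", "R1", [], 100),    # memory fetch: ~100 cycles
    ("add R1, 3",            "R1", ["R1"], 1),  # must wait for the load
]
print("total cycles:", run_in_order(program))  # 101: almost all of it waiting
```

The single-cycle add finishes at cycle 101 only because it sat behind the load, which is exactly the processor-memory gap the transcript describes.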
So, when we talk about superscalar execution (right), what are we trying to do over here?
We are trying to issue multiple instructions together - I am trying to pick up 2 instructions
and execute them together.
Now, if I am doing in-order execution (right); what is in-order execution - in-order execution
means that I am simply picking up the next instruction which is there in the code and trying
to execute it.
I pick up the current instruction, I issue it; I pick up the next instruction and I have
to check - can I issue it simultaneously or not?
Maybe I can, maybe I cannot - depends on the dependencies, depends on various things, right.
But I may be able to issue it, I may not be able to issue it, right.
And (you know) the chances of being able to issue the next instruction together with the
current instruction may be quite small.
So, how do I deal with that?
(So) typically what is done in modern day processors is that there is a window that
is maintained (right).
So, if this is your code, it generally maintains a window of a few instructions, and it examines
all those instructions and figures out which of them can be picked up to be executed in
parallel.
So, if this instruction is picked up, maybe the next instruction is dependent on it.
I cannot pick it up, but maybe the instruction after that is completely independent, right.
So, then it will pick up that instruction and issue it together with the current instruction.
Right?
(So) there are lots of intricacies involved which we are not going to get into.
For instance, you can issue instructions in different order, but you have to be very careful
that they complete in the order in which the code appears.
So, we are not going to get a whole lot into those intricacies - and this is actually called
‘out of order’ issue because it has issued instructions which are not in sequential order
as they appear in the code.
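The window-based issue logic just described can be sketched as a toy model. Everything here - the dual-issue width, the window size, the (name, destination, sources) encoding, the function name - is an illustrative assumption; real issue logic also tracks completion order, which this sketch deliberately omits.

```python
# Toy model of dual-issue with an instruction window: each cycle the
# processor issues the oldest instruction plus, if one exists in the
# window, a second instruction that is independent of it.
def issue_schedule(program, window=4):
    # program: list of (name, dest_register, source_registers)
    pending = list(program)
    schedule = []
    while pending:
        first = pending.pop(0)
        pair = [first]
        # scan the window for an independent second instruction
        for i, cand in enumerate(pending[:window]):
            no_raw = first[1] not in cand[2]  # cand does not read first's result
            no_waw = first[1] != cand[1]      # they write different registers
            no_war = cand[1] not in first[2]  # cand does not overwrite first's source
            if no_raw and no_waw and no_war:
                pair.append(pending.pop(i))
                break
        schedule.append([ins[0] for ins in pair])
    return schedule

prog = [
    ("add R1, 10", "R1", []),
    ("add R1, R3", "R1", ["R1", "R3"]),  # depends on the first add
    ("add R2, 5",  "R2", []),            # independent, so it is issued early
]
for cycle, group in enumerate(issue_schedule(prog)):
    print(cycle, group)
```

Note how ‘add R2, 5’ jumps ahead of the dependent ‘add R1, R3’: that reordering is exactly what makes this an out-of-order issue.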
(So) another way of handling the same situation is VLIW - Very Long Instruction Words.
So, what happens here is that just as in the case of super pipelining (right) - you were
trying to issue multiple instructions together, but who was determining which instructions
could be executed together? All of that was being done by the processor at runtime, right.
When it was seeing the code, at that time it was maintaining a window and trying to
decide which instructions can it execute together.
So, that makes the hardware complex, right; that makes the logic complex.
So, instead, another approach is to offload this to the compiler, right.
Why not do it in the compiler!
So, when I am compiling the code, at that time I can look at all the instructions and
figure out which instructions can be executed together and just club them up together; and
that is the idea behind VLIW architectures.
So, here you have instruction words which actually comprise multiple instructions
- instruction 1, instruction 2, instruction 3 and so on (right).
And the compiler figures out that these instructions have no dependency amongst them, and therefore,
I can issue them together.
Right?
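The compiler-side bundling just described can be sketched as well. This is a toy, not a real VLIW scheduler: the 3-slot word width, the nop padding, the instruction encoding, and the rule of only packing consecutive mutually independent instructions are all illustrative assumptions.

```python
# Sketch of a compiler-side VLIW bundler: pack consecutive, mutually
# independent instructions into one long instruction word of fixed
# width, padding unused slots with nops. Purely illustrative.
def bundle(program, width=3):
    # program: list of (name, dest_register, source_registers)
    pending = list(program)
    words = []
    while pending:
        word = [pending.pop(0)]
        # keep absorbing the next instruction while it has no RAW, WAW,
        # or WAR dependence on anything already in the current word
        while pending and len(word) < width:
            cand = pending[0]
            independent = all(
                cand[1] != prev[1]            # no WAW: different destinations
                and prev[1] not in cand[2]    # no RAW: cand does not read prev's result
                and cand[1] not in prev[2]    # no WAR: cand does not clobber prev's source
                for prev in word)
            if not independent:
                break
            word.append(pending.pop(0))
        names = [ins[0] for ins in word] + ["nop"] * (width - len(word))
        words.append(names)
    return words

prog = [
    ("mul R1, A0, B0", "R1", ["A0", "B0"]),
    ("mul R2, A1, B1", "R2", ["A1", "B1"]),  # independent of the first mul
    ("add R3, R1, R2", "R3", ["R1", "R2"]),  # needs both products
]
for word in bundle(prog):
    print(word)
```

The two multiplications land in one instruction word; the dependent add is forced into the next word with nop-filled slots - the classic cost of deciding everything statically.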
So, there are advantages and disadvantages of both these approaches.
(So) if we just compare VLIW versus super pipelining (right), being done dynamically,
what are the advantages or disadvantages?
(So) one is - in super pipelining - this involves more complex circuitry; whereas, in VLIW,
these decisions are not being made dynamically, they are being made by the compiler.
Right?
So, the circuit can be much simpler, the hardware can be much simpler.
Again, when you are doing super pipelining, when the processor is trying to determine
which instructions to issue dynamically, by itself (right), it is doing this in real time
(right).
This is in real time, whereas, this is offline; and because it is doing it in real time, there
is a limit to what it can do because it has to eventually, (I mean) it is looking at the
code and it has to issue the instruction in the next couple of cycles, right.
So, there is a very small window in which it has to make the decisions that which instruction
am I going to pick up to issue simultaneously, (right) because it is all in real time.
But that is not a restriction when you are doing it in the compiler (right), because
in the compiler (I mean) it is being done offline, you can take your own sweet time.
Yeah, the compilation is going to be slow, but that is fine.
(But) you can take your own sweet time and you can (you know) look at a larger window,
you can try to look at more permutations combinations and figure out which instructions can be issued
together.
But of course, one major drawback of VLIW, as opposed to super pipelining, is that it
does not have a view of the dynamic state (right) - of what is currently going on. Because
when you issue an instruction and, let us say, the data is not present, and you
have to go to the memory to fetch that data, which takes a substantial amount of time (right)
- based on that the processor can make decisions about (you know) which instructions it can
issue and which it cannot.
(So) also the branch history plays a role.
(So) again we are not going to get into that, but (you know) previously whether a branch
has been taken repeatedly or not, like for instance, in a loop, when you are executing
a loop - a tight loop (right) - you take the branch again and again.
So, that information is maintained in a branch history table, based on which
(you know) - when you are executing the instruction - you can say that ‘okay,
(you know), I have taken this branch condition, I have taken this loop ten times before’.
So, I am probably going to take it again.
So, let me fetch the instructions from there and start working on them.
(So) this kind of dynamic state is not available to the compiler, right.
So, whereas, all this information, because it is being done in real time in super
pipelining, is available to the processor.
So, it can (you know) use that information to make decisions.