Superpipelining and VLIW

Introduction to Parallel Programming in OpenMP
11 Aug 2017, 13:21

Summary

TL;DR: This script delves into the concept of Superscalar Execution, also known as Super Pipelining, in processor architecture. It explains how processors can execute multiple instructions in parallel when they are independent, requiring multiple logical units for simultaneous instruction fetch, decode, and execution. The script also highlights challenges like data dependency, branching, and memory latency that affect pipelining efficiency. It contrasts Super Pipelining with VLIW, discussing the trade-offs between dynamic runtime decision-making and static compiler-based instruction bundling.

Takeaways

  • 🚀 Superscalar execution allows a processor to execute multiple instructions in parallel if they are independent of each other.
  • 🛠️ Superscalar execution requires multiple hardware units for each stage (fetch, decode, execute) to process instructions simultaneously.
  • ⛔ Instructions that depend on the results of previous instructions cannot be executed in parallel and must wait for the necessary data to be available.
  • 📊 Superscalar execution is particularly beneficial in operations like linear algebra, where similar operations on independent data sets can be parallelized.
  • 🔗 Data dependency is a major issue in pipelining and superscalar execution, as it can stall the pipeline if an instruction depends on the result of a previous one.
  • 🔄 Branching can cause inefficiencies in pipelining, as instructions after a branch may need to be discarded if the branch is taken, leading to wasted work.
  • ⏳ Memory latency is a significant challenge, as fetching data from memory can take hundreds of cycles, stalling the pipeline while the processor waits for the data.
  • 🎯 Out-of-order execution allows the processor to issue instructions based on a window of code, enabling it to execute independent instructions together, even if they are not sequential.
  • 💻 VLIW (Very Long Instruction Word) architecture offloads the decision of parallel instruction execution to the compiler, simplifying the processor's hardware.
  • 📅 VLIW can analyze a larger window of code during compilation, but lacks the dynamic state awareness of super pipelining, making it less responsive to real-time conditions.
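The independence check behind the takeaways above can be sketched in a few lines. This is a minimal, illustrative model, not any real processor's issue logic: two adjacent instructions can be dual-issued only if the second one neither reads nor writes the register the first one writes. Encoding each instruction as a `(dest_register, source_operands)` tuple is an assumption made purely for the sketch.

```python
def can_dual_issue(first, second):
    """first/second are (dest_register, source_operands) tuples."""
    dest1, _ = first
    dest2, srcs2 = second
    # RAW hazard: the second instruction reads the register the first writes.
    # WAW hazard: both instructions write the same register.
    return dest1 not in srcs2 and dest1 != dest2

# 'add R1 10' next to 'add R2 5': independent, can issue together.
print(can_dual_issue(("R1", ["R1"]), ("R2", ["R2"])))        # True
# 'add R1 10' next to 'add R1 R3': RAW hazard on R1, second waits (no-op).
print(can_dual_issue(("R1", ["R1"]), ("R1", ["R1", "R3"])))  # False
```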

Q & A

  • What is Superscalar Execution also known as?

    -Superscalar Execution is also known as Super Pipelining.

  • How does Superscalar Execution allow for parallel execution of instructions?

    -Superscalar Execution allows for parallel execution by enabling the processor to execute multiple instructions simultaneously if it determines that they are independent of each other.

  • What is required for a processor to execute instructions in parallel using Superscalar Execution?

    -For parallel execution, the processor requires multiple logical units, meaning it needs separate hardware for each stage of the instruction pipeline to fetch, decode, and execute multiple instructions at a time.

  • What is a 'no op' cycle in the context of Superscalar Execution?

    -A 'no op' cycle, short for 'no operation', is a cycle where no operation is performed because there is a dependency on another instruction that has not completed yet.

  • In what kind of operations would Superscalar Execution architecture be particularly useful?

    -Superscalar Execution is particularly useful in operations like linear algebra, where there are many independent operations on different data sets, such as scaling a vector or computing a dot product.

  • How can the dot product of two vectors be computed using Superscalar Execution?

    -The dot product can be computed by performing independent multiplications of corresponding elements from each vector and then summing these products. Superscalar Execution can be used to execute these multiplications in parallel.
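The half-and-half scheme described in this answer can be sketched as follows: two independent accumulators cover the two halves of the vectors and are combined at the end. Because the two loops touch disjoint data, their multiply-add instructions have no dependencies between them and could be scheduled together by a superscalar core. This is an illustrative sketch, not a claim about how any particular compiler arranges the code.

```python
def dot_product_split(a, b):
    """Dot product via two independent accumulators (one per half)."""
    n = len(a)
    half = n // 2
    acc1 = acc2 = 0
    for i in range(half):       # first half: independent instruction stream
        acc1 += a[i] * b[i]
    for i in range(half, n):    # second half: independent instruction stream
        acc2 += a[i] * b[i]
    return acc1 + acc2          # combine the partial results at the end

print(dot_product_split([1, 2, 3, 4], [5, 6, 7, 8]))  # 70
```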

  • What are some of the issues typically faced with pipelining and super pipelining?

    -Issues faced with pipelining and super pipelining include data dependency, branching, and memory latency. Data dependency can cause delays when instructions need to wait for data from previous operations. Branching can lead to wasted work if instructions following a branch are discarded. Memory latency can stall the pipeline if data retrieval from memory takes much longer than instruction execution.
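The cost of the branching problem can be put in rough numbers. Under the simple in-order fill described here, every instruction issued behind a taken branch before the branch is recognized must be discarded. The issue width and resolve delay below are made-up toy values for illustration only.

```python
def flushed_instructions(issue_width, resolve_delay_cycles):
    """Instructions fetched behind the branch during the cycles it takes
    to discover that the branch is taken; all of them are wasted work."""
    return issue_width * resolve_delay_cycles

# 2-wide issue, branch recognized 3 cycles after fetch: 6 discarded.
print(flushed_instructions(2, 3))  # 6
```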

  • What is the difference between 'inorder execution' and 'out of order issue' in the context of Superscalar Execution?

    -In 'inorder execution', instructions are executed in the exact order they appear in the code. In contrast, 'out of order issue' allows the processor to issue instructions that are not in sequential order, based on their independence and the availability of resources, to maximize parallel execution.
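The instruction-window idea behind out-of-order issue can be sketched as a toy scan: look through a small window for the first instruction that does not depend on the one currently being issued. Instructions are again `(dest, sources)` pairs, an assumption for the sketch; real schedulers track far more state.

```python
def pick_partner(current, window):
    """Return the index of the first window entry independent of
    `current`, or None if every entry depends on it."""
    dest, _ = current
    for idx, (d, srcs) in enumerate(window):
        if dest not in srcs and d != dest:
            return idx          # independent: can issue together
    return None                 # nothing independent in the window

current = ("R1", ["R1"])        # e.g. add R1, 10
window = [
    ("R2", ["R1"]),             # reads R1: dependent, skip it
    ("R3", ["R4"]),             # independent: issue out of order
]
print(pick_partner(current, window))  # 1
```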

  • What is VLIW and how does it differ from Superscalar Execution?

    -VLIW stands for Very Long Instruction Words. It is an approach where the compiler determines which independent instructions can be executed together in one instruction word, simplifying the hardware but requiring more complex compilation. Superscalar Execution, on the other hand, makes these decisions dynamically at runtime, which can be more complex but also more responsive to the current state of execution.
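What the VLIW compiler does statically can be sketched as a greedy bundler: pack consecutive instructions into one instruction word until a dependency (or the word width) forces a new word. This is a simplified illustration of the idea, not a real scheduling algorithm.

```python
def bundle(instructions, width):
    """Greedily pack (dest, sources) instructions into instruction words
    of at most `width` slots, starting a new word on any dependency."""
    words, word, written = [], [], set()
    for dest, srcs in instructions:
        dependent = dest in written or any(s in written for s in srcs)
        if dependent or len(word) == width:
            words.append(word)          # close the current word
            word, written = [], set()
        word.append((dest, srcs))
        written.add(dest)
    if word:
        words.append(word)
    return words

prog = [
    ("R1", ["R1"]),        # add R1, 10
    ("R2", ["R2"]),        # add R2, 5   (independent: same word)
    ("R1", ["R1", "R3"]),  # add R1, R3  (depends on R1: new word)
]
print(len(bundle(prog, width=2)))  # 2 instruction words
```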

  • How does the dynamic state of a processor affect its ability to execute instructions in Superscalar Execution?

    -The dynamic state, including data availability and branch history, allows the processor to make real-time decisions about which instructions to issue in parallel. This dynamic decision-making is not available to the compiler in VLIW architectures, which must make these decisions offline during compilation.

Outlines

00:00

🔁 Superscalar Execution and Its Challenges

This paragraph introduces the concept of Superscalar Execution, also known as Super Pipelining, which is a technique that allows a processor to execute multiple instructions in parallel if they are independent of each other. The processor requires multiple logical units to handle the simultaneous fetch, decode, and execution of instructions. However, this approach faces challenges such as data dependencies, where an instruction must wait for another to complete before it can execute, resulting in 'no op' cycles. The paragraph also highlights the utility of superscalar execution in operations like linear algebra, where independent data sets can be processed in parallel. Issues with pipelining, such as data dependency and branching, are also discussed, where branching can cause a pipeline to be flushed of instructions that will not be executed.

05:00

🔄 Addressing Pipelining Issues: Dynamic vs. Static

The second paragraph delves into the problems associated with pipelining, such as wasted work due to branch instructions and memory latency, which can significantly delay instruction execution as the processor waits for data from memory. The paragraph contrasts dynamic super pipelining, where the processor decides in real-time which instructions to execute together, with Very Long Instruction Words (VLIW), where the compiler offloads the decision-making to determine which instructions can be executed in parallel. The dynamic nature of super pipelining requires complex circuitry and real-time decision-making, whereas VLIW simplifies hardware at the cost of flexibility to handle dynamic states and branch history.

10:01

🛠️ The Trade-offs of VLIW and Super Pipelining

The final paragraph discusses the trade-offs between VLIW and super pipelining architectures. VLIW allows for simpler hardware as the compiler statically determines which independent instructions can be bundled and executed together. This offline process can consider a larger set of instructions and their potential combinations. However, it lacks the ability to adapt to the dynamic runtime state of the processor, such as handling data not being present or making decisions based on branch history. Super pipelining, on the other hand, can leverage real-time information but is limited by the need for rapid decision-making and complex hardware requirements.

Keywords

💡Superscalar Execution

Superscalar Execution refers to a processor's ability to execute multiple instructions simultaneously, which is a significant advancement in processor design. In the video's context, it's explained as the processor's capability to identify and execute independent instructions in parallel, such as 'add R1 10' and 'add R2 5'. This concept is central to the video's theme of enhancing processor efficiency and performance.

💡Super Pipelining

Super Pipelining is another term for Superscalar Execution and represents the process of executing multiple instructions at different stages of the pipeline concurrently. The script discusses how this feature allows for the simultaneous instruction fetch, decode, and execution, which is crucial for improving processor throughput and is a key point in the video's exploration of processor architecture.

💡Instruction Fetch

Instruction Fetch is the process by which the processor retrieves instructions from memory. In the script, it is mentioned as the first step in the pipeline where the processor gets ready to execute an instruction. It is an essential part of the pipeline process and is highlighted in the context of Superscalar Execution to show how multiple instructions can be fetched at once.

💡Decode

Decode is the stage in the pipeline where the processor translates the fetched instruction into a set of operations it can execute. The script describes how, in Superscalar Execution, the processor can decode multiple instructions at the same time, which is vital for understanding how parallel execution is facilitated within the processor's architecture.

💡Execute

Execute is the phase where the processor carries out the operation specified by the instruction. The script uses 'execute' to illustrate how, with Superscalar Execution, multiple instructions can be executed in parallel, provided they do not interfere with each other, showcasing the efficiency gains in processor operations.

💡Logical Units

Logical Units are the separate hardware components within a processor that handle different stages of instruction processing. The script explains that for Superscalar Execution to occur, multiple logical units are required to handle the simultaneous fetching, decoding, and executing of instructions, which is a fundamental requirement for parallel processing capabilities.

💡Data Dependency

Data Dependency refers to a situation where the execution of one instruction depends on the result of another. The script uses the example of adding values to the same register to explain how data dependencies can prevent instructions from being executed in parallel, which is a critical issue addressed in the video concerning the limitations of Superscalar Execution.

💡No Operation (No-Op)

No Operation, or No-Op, is a placeholder instruction that does nothing and is used when an instruction cannot be executed due to dependencies. The script mentions No-Op in the context of an instruction that has to wait because it depends on the completion of another, illustrating a scenario where parallel execution is not possible.

💡Linear Algebra Operations

Linear Algebra Operations involve mathematical operations on vectors and matrices, such as scaling a vector or computing a dot product. The script provides these as examples where Superscalar Execution can be highly beneficial due to the independent nature of many operations involved, demonstrating a practical application of the concept discussed.

💡Memory Latency

Memory Latency is the delay between when a processor requests data from memory and when the data is actually available. The script discusses how memory latency can significantly impact pipelining and Superscalar Execution, as it creates a bottleneck where the processor must wait for data, highlighting the challenges in achieving continuous parallel execution.
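The gap the script describes is easy to put in rough numbers: a ~3 GHz core has a cycle time of about 0.33 ns, so a trip to main memory taking on the order of 100 ns costs hundreds of cycles. The 100 ns figure is an assumed ballpark for illustration.

```python
clock_hz = 3e9                  # 3 GHz processor (from the script)
cycle_ns = 1e9 / clock_hz       # ~0.33 ns per cycle
memory_latency_ns = 100         # assumed DRAM access time (ballpark)

# Cycles the pipeline stalls waiting for one memory operand fetch.
stall_cycles = memory_latency_ns / cycle_ns
print(round(stall_cycles))      # 300
```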

💡In-Order Execution

In-Order Execution is the process where instructions are executed in the exact order they appear in the code. The script contrasts this with Out-Of-Order execution, explaining that in Superscalar Execution, the processor may issue instructions out of order to maximize parallelism, but they must complete in the original order, which is a key concept in understanding the complexities of modern processor operation.

💡Out-Of-Order Issue

Out-Of-Order Issue is a technique where the processor issues instructions for execution in an order different from their appearance in the code. The script explains this as a method to overcome dependencies and improve parallel execution, but it also emphasizes the complexity it adds to ensuring that the final results are consistent with the original code sequence.

💡VLIW (Very Long Instruction Words)

VLIW is an architecture where multiple instructions are grouped into a single instruction word, which can be executed in parallel if they have no dependencies. The script contrasts VLIW with Superscalar Execution, noting that VLIW offloads the decision of which instructions to execute in parallel to the compiler, simplifying the hardware but potentially limiting flexibility based on runtime conditions.

💡Branch Instruction

A Branch Instruction is an instruction that tells the processor to jump to a different part of the code. The script discusses how branch instructions can disrupt pipelining, as they may require the processor to discard instructions that were lined up to be executed after the branch, illustrating a common challenge in maintaining efficient instruction flow.

💡Operand Fetch

Operand Fetch is the stage in the pipeline where the processor retrieves the data needed for an instruction to be executed. The script uses this term to describe the process of getting data from memory, which can introduce significant delays due to memory latency, impacting the efficiency of Superscalar Execution.

Highlights

Superscalar Execution, also known as Super Pipelining, allows for the parallel execution of independent instructions.

Processors can execute multiple instructions simultaneously if they determine that there are no dependencies between them.

Superscalar Execution requires multiple logical units for each stage of instruction processing.

Instruction dependencies can cause 'no op' cycles, where the processor has to wait before executing the next instruction.

Linear algebra operations, such as scaling vectors or computing dot products, are ideal candidates for Superscalar Execution.

Parallelizing operations on independent data sets can significantly benefit from Superscalar Execution.

Data dependencies are a common issue in pipelining and Super Pipelining, where instructions cannot be executed until their operands are available.

Branching in pipelines can lead to wasted work, as instructions following a branch may need to be discarded.

Memory latency is a significant challenge for pipelining, as the processor's speed far exceeds memory access times.

Superscalar Execution attempts to issue multiple instructions together, but is limited by the processor's ability to handle dependencies and memory latency.

In-order execution in Super Pipelining involves picking up the next instruction and deciding whether it can be executed simultaneously with the current one.

Modern processors maintain a window of instructions to determine which can be executed in parallel.

Out-of-order issue is a technique where instructions are issued in a different order than they appear in the code, to maximize parallel execution.

VLIW (Very Long Instruction Words) is an alternative approach where the compiler determines which instructions can be executed together, simplifying hardware complexity.

VLIW has the advantage of simpler hardware due to the compiler's role in determining instruction execution, but lacks the dynamic decision-making capability of Super Pipelining.

The dynamic state of the processor, including branch history and memory access times, is not available to the compiler in VLIW architectures.

Transcripts

00:00

So, the next feature that pushes this a little further is Superscalar Execution, and it is also called Super Pipelining. What happens in this is that the processor, if it determines that there are instructions like in the previous example - we saw ‘add R1 10’ and ‘add R2 5’ - so, if it is able to figure out that these instructions have nothing to do with each other, then it can actually execute them both in parallel. So, you have instruction fetch, decode, and execute, and simultaneously you could do the instruction fetch, decode, and execute for the second instruction. Right, we could do both of these in parallel.

00:54

But what does this require now? This requires multiple logical units. So, you need multiple hardware units for each of the stages, because you want to be fetching 2 instructions at a time, you want to be decoding 2 instructions at a time, you want to be executing 2 instructions at a time. And of course, if the second instruction, instead of ‘add R2 5’, was (let us say) ‘add R1 R3’, then could you execute that in parallel along with the first instruction? You could do the instruction fetch in parallel - there is nothing wrong with fetching both these instructions in parallel; you could decode both these instructions in parallel; but while the first instruction is getting executed, can you execute the second instruction? No - essentially what happens in this case is that for the second instruction, you have to execute a ‘no op’ cycle; ‘no op’ is ‘no operation’. You have to wait because it has a dependency on some other instruction. And then you could execute it in the next cycle: once 10 has been added to R1, then you could add R3 to R1 in that cycle.

02:19

Can you think of some examples where this kind of architectural functionality would be very useful? One common case is when you are doing linear algebra operations, when you are working on matrices and vectors. Suppose you are trying to scale a vector or compute a dot product - you have lots of operations which are very similar in nature and which are working on completely independent data sets. For instance, if you are computing the dot product of 2 vectors A and B, what do you do? You pick up the first element of each of them and multiply them together, you pick up the second elements of each of the vectors and multiply them together, then the third elements, and so on. Each of these is independent. Yes, you want to add them up finally to one common value, but there are easy ways - you can design easy algorithms to take care of that. For instance, you could keep track of the dot product of the first half of the vectors in one variable and the dot product of the second half in a separate variable, and in the end you could add them up together. So, you can parallelize this to a large extent. When you are doing the operations on the first half of the vectors and the operations on the second half, they are very much independent - they do not have anything to do with each other. So, you could possibly rearrange your code so that the instructions for the first half and the second half get executed together, and that would make very good use of superscalar execution.

03:52

So, what are the issues that we typically face with pipelining and with super pipelining? There are several issues. The first one of them, which we have already seen, is data dependency. This can take various forms, but one of them is that, for instance, over here we were adding 10 to R1 and then we were adding R3 to R1, but you could not perform the second addition till that register was free. It was already participating in some operation - something was being written into that register - so you could not perform that execute cycle at that point in time. There are different kinds of data dependencies; we will talk about them later.

04:39

The second issue is branching. What are we doing with these pipelines? We are filling them up. Let us say that these are the cycles and this is how the instructions are getting executed. Say we have issued 2 instructions together - this is super pipelining - and then in the next cycle we issued 2 more instructions, and in the next cycle 2 more; and at some point we realize that a particular instruction was a branch instruction - when you decode that instruction, you realize it is a branch instruction. So, what happens now? The instructions that were issued before it are fine - they have to be executed anyway - but what about all the instructions being executed after it? Here we are assuming that the processor is just picking up the instructions in sequential order and putting them into the pipeline. So, what happens to all these instructions which come after the branch instruction? You have to basically get rid of them, because you are going to jump to some other piece of code. These instructions have to be thrown away. That is wasted work: you picked up all these instructions, you put them in the pipeline, but eventually that was wasted - you did not complete these instructions, you could not finish them.

06:03

Another issue is memory latency. Well, this goes beyond pipelining; this is an issue in general. The problem is that the processor operates at frequencies of something like 3 GigaHertz or 4 GigaHertz. If you do some memory operation - if you want to fetch some data from the memory - it takes a substantial amount of time for that data to come. It can take hundreds of cycles, even though your instruction can execute in about 4-5 cycles, in 4-5 nanoseconds; just to get that data from the memory can take hundreds of cycles. So, there is a huge gap between the performance of the processor and the time it takes to get data from the memory. What will happen to pipelining in this case? Let us say that I have an instruction, say ‘add R1’, but instead of adding a constant like 10 - which I do not need to go to the memory for - suppose I am trying to add to register R1 data that resides in memory location 1000. What would happen in the pipeline? You would have the instruction fetch, you would have the decode, and then you would have the operand fetch. What does operand fetch do? It fetches the data from the memory. How long is this going to take? It is going to eat up maybe 100 cycles. So, what happens to the next instruction which was put into the pipeline after this one - say the next instruction was ‘add R1 comma 3’? I cannot execute that instruction; that instruction is also waiting for R1. So, it is also stuck - only when the data comes and the first instruction finally gets executed, after that can I execute this instruction.

07:43

So, when we talk about superscalar execution, what are we trying to do over here? We are trying to issue multiple instructions together - I am trying to pick up 2 instructions and execute them together. Now, what is inorder execution? Inorder execution is that I am simply picking up the next instruction which is there in the code and trying to execute it. I pick up the current instruction, I issue it; I pick up the next instruction and I have to check - can I issue it simultaneously or not? Maybe I can, maybe I cannot - it depends on the dependencies, it depends on various things. And the chances of being able to issue the next instruction together with the current instruction may be quite small. So, how do I deal with that? Typically what is done in modern day processors is that a window is maintained. If this is your code, the processor generally maintains a window of a few instructions, and it examines all those instructions and figures out which are the instructions that can be picked up to be executed in parallel. So, if one instruction is picked up, maybe the next instruction is dependent on it - I cannot pick it up - but maybe the instruction after that is completely independent. Then it will pick up that instruction and issue it together with the current instruction. There are lots of intricacies involved which we are not going to get into. For instance, you can issue instructions in a different order, but you have to be very careful that they complete in the order in which the code appears. This is actually called ‘out of order’ issue, because the processor has issued instructions which are not in the sequential order in which they appear in the code.

09:21

Another way of handling the same situation is VLIW - Very Long Instruction Words. Just as in the case of super pipelining, you were trying to issue multiple instructions together - but who was determining which instructions can be executed together? All of that was being done by the processor at runtime. When it was seeing the code, at that time it was maintaining a window and trying to decide which instructions it can execute together. That makes the hardware complex; that makes the logic complex. So, instead, another approach is to offload this to the compiler. Why not do it in the compiler? When I am compiling the code, at that time I can look at all the instructions, figure out which instructions can be executed together, and just club them together; that is the idea behind VLIW architectures. So, here you have instruction words which actually comprise multiple instructions - instruction 1, instruction 2, instruction 3 and so on. And the compiler figures out that these instructions have no dependency amongst them, and therefore they can be issued together.

10:40

So, there are advantages and disadvantages to both these approaches. If we just compare VLIW versus super pipelining being done dynamically, what are the advantages and disadvantages? One is that super pipelining involves more complex circuitry; whereas in VLIW, these decisions are not being made dynamically - they are being made by the compiler - so the circuit can be much simpler, the hardware can be much simpler. Again, when you are doing super pipelining, when the processor is trying to determine by itself which instructions to issue dynamically, it is doing this in real time; whereas the compiler works offline. And because the processor is doing it in real time, there is a limit to what it can do - it is looking at the code and it has to issue the instruction in the next couple of cycles. So, there is a very small window in which it has to make the decision of which instruction to pick up to issue simultaneously, because it is all in real time. But that is not a restriction when you are doing it in the compiler, because in the compiler it is being done offline - you can take your own sweet time. Yes, the compilation is going to be slow, but that is fine. You can take your own sweet time, you can look at a larger window, you can try more permutations and combinations and figure out which instructions can be issued together.

12:04

But of course, one major drawback of VLIW, as opposed to super pipelining, is that it does not have a view of the dynamic state - what is currently going on. Because when you issue an instruction and, let us say, the data is not present, and you have to go to the memory to fetch that data, which takes a substantial amount of time - based on that you can make decisions about which instructions you can issue and which you cannot. Also, the branch history plays a role. Again, we are not going to get into that, but whether a branch has been taken repeatedly or not - for instance, in a loop, a tight loop, you take the branch again and again - that information is maintained in a branch history table, based on which, when you are executing the instruction, you can say ‘okay, I have taken this branch condition, I have taken this loop ten times before, so I am probably going to take it again’ - so let me fetch the instructions from there and start working on them. This kind of dynamic state is not available to the compiler; whereas all this information, because it is being done in real time in super pipelining, is available to the processor. So, it can use that information to make decisions.

