Advanced CPU Designs: Crash Course Computer Science #9

CrashCourse
26 Apr 2017 · 12:22

Summary

TL;DR: This CrashCourse Computer Science episode explores the evolution of computer processors from mechanical devices to Gigahertz-speed CPUs. It delves into techniques like instruction pipelining, cache usage, and multi-core processors that enhance performance. The video also touches on challenges like data bottlenecks and the importance of efficient programming to harness the immense processing power available today.

Takeaways

  • 🚀 Computers have evolved from mechanical devices to processors running at Gigahertz speeds, executing billions of instructions per second.
  • 🔍 Early processors increased speed by improving transistor switching times, but this approach has limitations, leading to the development of various performance-boosting techniques.
  • 🛠️ Modern processors include specialized circuits for complex tasks like graphics operations, video decoding, and file encryption, exposed through instruction set extensions such as MMX, 3DNow!, and SSE.
  • 📈 The instruction set of processors has grown over time, with modern processors having thousands of instructions for enhanced capabilities and backward compatibility.
  • 🔄 High clock speeds create a data bottleneck with RAM, which is addressed by using caches to store frequently accessed data closer to the CPU, reducing access time.
  • 💡 Caching improves efficiency when requested data is already in the cache (a cache hit), and a 'dirty bit' flags modified blocks so data can be synchronized between cache and RAM.
  • 🔄 Instruction pipelining allows multiple instructions to be processed simultaneously in different stages of the CPU, increasing throughput.
  • 🔄 Out-of-order execution in high-end processors dynamically reorders instructions to minimize pipeline stalls and improve efficiency.
  • 🤖 Speculative execution and branch prediction are techniques used to deal with conditional jumps, guessing the flow of execution to reduce delays.
  • 🔢 Superscalar processors can execute multiple instructions per clock cycle by utilizing idle areas of the CPU or adding duplicate circuitry for popular instructions.
  • 💻 Multi-core processors allow for multiple independent processing units within a single CPU chip, sharing resources and improving performance on shared computations.
  • 🌐 Supercomputers, like the Sunway TaihuLight, utilize millions of cores to perform an immense number of calculations, showcasing the pinnacle of computational power.

Q & A

  • How have computers evolved from their early days to the present?

    -Computers have evolved from mechanical devices capable of one calculation per second to CPUs running at Gigahertz speeds, executing billions of instructions every second.

  • What was one of the early methods to make processors faster?

    -One of the early methods to make processors faster was by improving the switching time of the transistors inside the chip, which make up all the logic gates, ALUs, and other components.

  • Why are additional circuits added to modern computer processors?

    -Additional circuits are added to modern computer processors to perform more sophisticated operations and to execute instructions that would take many clock cycles with standard operations, such as graphics operations, video decoding, and file encryption.

  • What is the significance of MMX, 3DNow!, and SSE in processors?

    -MMX, 3DNow!, and SSE are extensions to the instruction set that allow processors to execute additional instructions for specific tasks like gaming and encryption, enhancing performance for these operations.

  • Why did the Intel 4004, the first integrated CPU, only have 46 instructions?

    -The Intel 4004 had 46 instructions because that was enough to build a fully functional computer at the time. As technology advanced, more instructions were needed to perform a wider variety of tasks.

  • What is the role of a cache in a CPU?

    -A cache is a small piece of RAM located on the CPU that stores data to speed up access times. It helps to alleviate the bottleneck caused by the slower speed of RAM compared to the CPU.

  • What is a cache hit and a cache miss?

    -A cache hit occurs when the data requested from RAM is already stored in the cache, allowing for faster access. A cache miss happens when the data is not in the cache, requiring a slower access from the main RAM.

  • What is instruction pipelining and how does it improve CPU performance?

    -Instruction pipelining is a technique where different stages of instruction processing (fetch, decode, execute) are overlapped, allowing for continuous operation and higher throughput, effectively executing one instruction per clock cycle.
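    For example, in a three-stage pipeline, 10 instructions finish in 3 + 9 = 12 clock cycles instead of 30, approaching a threefold throughput gain once the pipeline is full.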

  • What is the purpose of speculative execution in CPUs?

    -Speculative execution is a technique used by advanced CPUs to guess the outcome of a conditional jump instruction and start filling the pipeline with instructions based on that guess, reducing delays when the jump is resolved.

  • How do superscalar processors differ from regular pipelined processors?

    -Superscalar processors can execute more than one instruction per clock cycle by fetching and decoding multiple instructions at once and executing instructions that require different parts of the CPU simultaneously.

  • What is the advantage of multi-core processors over single-core processors?

    -Multi-core processors have multiple independent processing units within a single CPU chip, allowing for parallel processing of multiple instruction streams and improved performance for multi-threaded applications.

  • Why are supercomputers necessary for certain types of calculations?

    -Supercomputers are necessary for performing extremely large and complex calculations, such as simulating the formation of the universe, which require a massive amount of processing power beyond what is available in standard desktop or server CPUs.

Outlines

00:00

🚀 Evolution of Computer Processors and Performance Enhancement Techniques

Carrie Anne traces the progress of computer processors, from early mechanical devices to modern CPUs operating at Gigahertz speeds. The video discusses the limitations of increasing transistor speed and the development of various techniques to improve performance, such as specialized circuits for graphics, video decoding, and encryption. It also covers instruction set extensions like MMX, 3DNow!, and SSE, and the continuous growth of these sets for backward compatibility. The issue of data transfer speed between CPU and RAM is highlighted, along with the introduction of caches to mitigate this bottleneck, explaining how caches work, the concept of cache hits and misses, and the use of dirty bits for synchronization.

05:04

🔄 Advanced CPU Techniques: Pipelining, Caching, and Multi-Core Processing

This paragraph delves into instruction pipelining as a method to increase CPU performance, comparing it to washing laundry to illustrate the concept of parallelizing tasks. The explanation includes the benefits of pipelining, such as increased throughput, and the challenges it presents, like handling data dependencies and jump instructions. The paragraph also touches on advanced techniques like out-of-order execution and speculative execution with branch prediction to minimize pipeline stalls. The discussion then shifts to superscalar processors capable of executing multiple instructions per clock cycle and the concept of multi-core processors, which allow for parallel processing of multiple instruction streams, concluding with the mention of multi-CPU systems in high-performance computers.

10:06

🌐 Scaling Up: From Multi-Core Processors to Supercomputers

The final paragraph discusses the escalation from multi-core processors to the construction of supercomputers for massive computational tasks. It explains the necessity of supercomputers for complex calculations like simulating the universe's formation and introduces the Sunway TaihuLight, a supercomputer with over ten million cores capable of processing 93 quadrillion floating-point operations per second. The paragraph emphasizes the sophistication and speed of modern processors and sets the stage for the next episode, which will focus on programming and utilizing this computational power.


Keywords

💡Gigahertz

Gigahertz refers to a unit of frequency equal to one billion cycles per second. In the context of the video, it is used to describe the speed at which modern CPUs operate, executing billions of instructions every second. The script mentions that the device viewers are using is likely running at Gigahertz speeds, emphasizing the significant advancement in computing power over time.

💡Transistors

Transistors are semiconductor devices that act as switches or amplifiers and are fundamental components of modern electronic devices. The video script discusses how improving the switching time of transistors within a chip was one method of increasing processor speed in the early days of electronic computing, highlighting their importance in the evolution of computer processors.

💡ALU (Arithmetic Logic Unit)

An Arithmetic Logic Unit is a part of the CPU that performs arithmetic and logical operations. The script explains that modern processors have the division operation performed by the ALU in hardware, which is a more efficient approach than the earlier method of using many subtractions to achieve division, as demonstrated in the video's example.

💡Instruction Set

An instruction set is the collection of basic instructions that a processor can understand and execute. The video script notes the growth of instruction sets over time, with modern processors having thousands of instructions, as opposed to the 46 instructions of the Intel 4004, to perform a wide variety of complex operations.

💡Cache

A cache is a small, fast memory storage used to temporarily store frequently accessed data to reduce access time from main memory (RAM). The script describes the use of a cache in CPUs to speed up data retrieval, explaining how a cache hit is more efficient than a cache miss, which requires data to be fetched from RAM.

💡Pipelining

Pipelining is a method of improving performance by allowing multiple instructions to be processed at different stages of the execution pipeline simultaneously. The video script uses the analogy of washing sheets to explain how pipelining can increase throughput, allowing an instruction to be executed every clock cycle rather than every few.

💡Superscalar Processors

Superscalar processors are CPUs capable of executing more than one instruction per clock cycle. The script mentions superscalar processors as a further advancement in CPU design, allowing for even greater efficiency by utilizing different parts of the CPU for parallel execution of instructions.

💡Multi-core Processors

Multi-core processors contain multiple independent processing units (cores) within a single CPU. The video script discusses how multi-core processors can handle several streams of instructions at once, providing an example of dual-core or quad-core processors and their ability to share resources like cache.

💡Supercomputer

A supercomputer is a computer with a high level of performance compared to a general-purpose computer. The script refers to supercomputers as machines designed for massive computational tasks, such as simulating the universe's formation, and mentions the Sunway TaihuLight as an example of a supercomputer with over ten million cores.

💡FLOPS (Floating Point Operations Per Second)

FLOPS is a measure of a computer's performance, indicating the number of floating-point operations it can execute in one second. The video script uses FLOPS to quantify the immense computational power of the Sunway TaihuLight supercomputer, which performs 93 quadrillion floating-point operations per second (93 petaFLOPS).

💡Branch Prediction

Branch prediction is a technique used in pipelined processors to guess the direction a program will take at a conditional branch instruction. The script describes how advanced CPUs use branch prediction to minimize pipeline stalls, often guessing with over 90% accuracy, which helps maintain high execution efficiency.

Highlights

Computers have evolved from mechanical devices to CPUs running at Gigahertz speeds, executing billions of instructions per second.

Processors were traditionally made faster by improving the switching time of transistors.

Processor designers have developed techniques to boost performance beyond transistor efficiency.

Modern CPUs include hardware support for complex operations like division to reduce clock cycles.

ALUs have become more complex to perform additional operations like graphics and encryption.

Instruction sets have grown larger over time, retaining old opcodes for backward compatibility.

High-speed CPUs face bottlenecks with RAM due to data transmission delays.

Caching is used to mitigate RAM bottlenecks by storing data closer to the CPU.

Caches work by transmitting blocks of data, reducing the need for repeated RAM access.

Cache hits and misses are key concepts in CPU performance optimization.

The dirty bit in caches helps manage synchronization between cache and RAM.

Instruction pipelining allows for overlapping stages of instruction processing, increasing throughput.

Pipeline hazards, such as data dependencies, can be mitigated with advanced processor techniques.

Out-of-order execution and speculative execution are methods used to minimize pipeline stalls.

Branch prediction improves the efficiency of handling conditional jump instructions.

Superscalar processors can execute multiple instructions per clock cycle, further increasing performance.

Multi-core processors allow for multiple independent streams of instructions to run simultaneously.

Supercomputers, like the Sunway TaihuLight, utilize millions of cores for massive computational tasks.

Programming harnesses the power of sophisticated processors to perform useful computations.

Transcripts

00:02

Hi, I’m Carrie Anne and welcome to CrashCourse Computer Science!

00:06

As we’ve discussed throughout the series, computers have come a long way from mechanical devices capable of maybe one calculation per second, to CPUs running at kilohertz and megahertz speeds. The device you’re watching this video on right now is almost certainly running at Gigahertz speeds - that’s billions of instructions executed every second. Which, trust me, is a lot of computation!

00:24

In the early days of electronic computing, processors were typically made faster by improving the switching time of the transistors inside the chip - the ones that make up all the logic gates, ALUs and other stuff we’ve talked about over the past few episodes. But just making transistors faster and more efficient only went so far, so processor designers have developed various techniques to boost performance, allowing not only simple instructions to run fast, but also performing much more sophisticated operations.

00:49

INTRO

00:58

Last episode, we created a small program for our CPU that allowed us to divide two numbers. We did this by doing many subtractions in a row... so, for example, 16 divided by 4 could be broken down into the smaller problem of 16 minus 4, minus 4, minus 4, minus 4. When we hit zero, or a negative number, we knew that we were done. But this approach gobbles up a lot of clock cycles, and isn’t particularly efficient.

01:20

So most computer processors today have divide as one of the instructions that the ALU can perform in hardware. Of course, this extra circuitry makes the ALU bigger and more complicated to design, but also more capable - a complexity-for-speed tradeoff that has been made many times in computing history.
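
That repeated-subtraction idea is easy to see in a few lines of code. Here is a minimal Python sketch of it (the function and variable names are just for illustration; the video’s actual program was written in our example CPU’s instruction set):

```python
def divide_by_repeated_subtraction(dividend, divisor):
    # Mirrors the 16 / 4 example: keep subtracting the divisor until we
    # hit zero or would go negative. Each loop iteration stands in for a
    # pass through the CPU's subtract-and-jump loop, which is why this
    # approach burns so many clock cycles compared to a hardware divide.
    quotient = 0
    remainder = dividend
    while remainder >= divisor:
        remainder -= divisor      # one "subtract" instruction
        quotient += 1
    return quotient, remainder

print(divide_by_repeated_subtraction(16, 4))  # (4, 0)
```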

01:36

For instance, modern computer processors now have special circuits for things like graphics operations, decoding compressed video, and encrypting files - all of which are operations that would take many, many, many clock cycles to perform with standard operations.

01:48

You may have even heard of processors with MMX, 3DNow!, or SSE. These are processors with additional, fancy circuits that allow them to execute additional, fancy instructions - for things like gaming and encryption.

02:00

These extensions to the instruction set have grown and grown over time, and once people have written programs to take advantage of them, it’s hard to remove them. So instruction sets tend to keep getting larger and larger, keeping all the old opcodes around for backwards compatibility.

02:13

The Intel 4004, the first truly integrated CPU, had 46 instructions - which was enough to build a fully functional computer. But a modern computer processor has thousands of different instructions, which utilize all sorts of clever and complex internal circuitry.

02:28

Now, high clock speeds and fancy instruction sets lead to another problem - getting data in and out of the CPU quickly enough. It’s like having a powerful steam locomotive, but no way to shovel in coal fast enough.

02:40

In this case, the bottleneck is RAM. RAM is typically a memory module that lies outside the CPU. This means that data has to be transmitted to and from RAM along sets of data wires, called a bus. This bus might only be a few centimeters long, and remember those electrical signals are traveling near the speed of light, but when you are operating at gigahertz speeds - that’s billionths of a second - even this small delay starts to become problematic.

03:02

It also takes time for RAM itself to look up the address, retrieve the data, and configure itself for output. So a “load from RAM” instruction might take dozens of clock cycles to complete, and during this time the processor is just sitting there idly waiting for the data.

03:16

One solution is to put a little piece of RAM right on the CPU -- called a cache. There isn’t a lot of space on a processor’s chip, so most caches are just kilobytes or maybe megabytes in size, where RAM is usually gigabytes.

03:27

Having a cache speeds things up in a clever way. When the CPU requests a memory location from RAM, the RAM can transmit not just one single value, but a whole block of data. This takes only a little bit more time than transmitting a single value, but it allows this data block to be saved into the cache. This tends to be really useful because computer data is often arranged and processed sequentially.

03:45

For example, let’s say the processor is totalling up daily sales for a restaurant. It starts by fetching the first transaction from RAM at memory location 100. The RAM, instead of sending back just that one value, sends a block of data, from memory location 100 through 200, which are then all copied into the cache.

04:02

Now, when the processor requests the next transaction to add to its running total, the value at address 101, the cache will say “Oh, I’ve already got that value right here, so I can give it to you right away!” And there’s no need to go all the way to RAM. Because the cache is so close to the processor, it can typically provide the data in a single clock cycle -- no waiting required. This speeds things up tremendously over having to go back and forth to RAM every single time.

04:24

When data requested in RAM is already stored in the cache like this, it’s called a cache hit, and if the data requested isn’t in the cache, so you have to go to RAM, it’s called a cache miss.

04:34

The cache can also be used like a scratch space, storing intermediate values when performing a longer, or more complicated, calculation. Continuing our restaurant example, let’s say the processor has finished totalling up all of the sales for the day, and wants to store the result in memory address 150. Like before, instead of going all the way back to RAM to save that value, it can be stored in the cached copy, which is faster to save to, and also faster to access later if more calculations are needed.

04:59

But this introduces an interesting problem -- the cache’s copy of the data is now different to the real version stored in RAM. This mismatch has to be recorded, so that at some point everything can get synced up. For this purpose, the cache has a special flag for each block of memory it stores, called the dirty bit -- which might just be the best term computer scientists have ever invented.

05:18

Most often this synchronization happens when the cache is full, but a new block of memory is being requested by the processor. Before the cache erases the old block to free up space, it checks its dirty bit, and if it’s dirty, the old block of data is written back to RAM before loading in the new block.
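
To make the hit, miss and dirty-bit mechanics concrete, here is a small Python sketch of a one-slot, write-back cache along the lines described above. The single-slot capacity, the 100-address block size, and all the names are simplifications for illustration; real caches hold many blocks and use more elaborate replacement policies.

```python
class TinyCache:
    # A one-slot, write-back cache: a toy model of the ideas above.
    BLOCK = 100  # addresses per block, echoing the restaurant example

    def __init__(self, ram):
        self.ram = ram       # backing store: a plain list stands in for RAM
        self.base = None     # start address of the currently cached block
        self.block = []      # cached copy of that block
        self.dirty = False   # has the cached copy been modified?

    def _cached(self, addr):
        return self.base is not None and self.base <= addr < self.base + self.BLOCK

    def _load_block(self, addr):
        if self.dirty:       # write-back: sync the old block to RAM first
            self.ram[self.base:self.base + self.BLOCK] = self.block
        self.base = (addr // self.BLOCK) * self.BLOCK
        self.block = self.ram[self.base:self.base + self.BLOCK]
        self.dirty = False

    def read(self, addr):
        print("cache hit" if self._cached(addr) else "cache miss", "at", addr)
        if not self._cached(addr):
            self._load_block(addr)   # slow trip to RAM fetches a whole block
        return self.block[addr - self.base]

    def write(self, addr, value):
        if not self._cached(addr):
            self._load_block(addr)
        self.block[addr - self.base] = value
        self.dirty = True            # cache now differs from RAM

ram = [0] * 1000
cache = TinyCache(ram)
cache.read(100)       # miss: pulls in the block starting at address 100
cache.read(101)       # hit: that block is already in the cache
cache.write(150, 42)  # daily total saved to the cached copy, block marked dirty
```

When a later access falls outside the cached block, `_load_block` writes the dirty block back to RAM before replacing it - the write-back-on-eviction behavior the dirty bit exists to support.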

05:33

Another trick to boost CPU performance is called instruction pipelining. Imagine you have to wash an entire hotel’s worth of sheets, but you’ve only got one washing machine and one dryer.

05:42

One option is to do it all sequentially: put a batch of sheets in the washer and wait 30 minutes for it to finish. Then take the wet sheets out and put them in the dryer and wait another 30 minutes for that to finish. This allows you to do one batch of sheets every hour. Side note: if you have a dryer that can dry a load of laundry in 30 minutes, please tell me the brand and model in the comments, because I’m living with 90 minute dry times, minimum.

06:01

But, even with this magic clothes dryer, you can speed things up even more if you parallelize your operation. As before, you start off putting one batch of sheets in the washer. You wait 30 minutes for it to finish. Then you take the wet sheets out and put them in the dryer. But this time, instead of just waiting 30 minutes for the dryer to finish, you simultaneously start another load in the washing machine. Now you’ve got both machines going at once. Wait 30 minutes, and one batch is now done, one batch is half done, and another is ready to go in. This effectively doubles your throughput.

06:32

Processor designs can apply the same idea. In episode 7, our example processor performed the fetch-decode-execute cycle sequentially and in a continuous loop: fetch-decode-execute, fetch-decode-execute, fetch-decode-execute, and so on. This meant our design required three clock cycles to execute one instruction.

06:46

But each of these stages uses a different part of the CPU, meaning there is an opportunity to parallelize! While one instruction is getting executed, the next instruction could be getting decoded, and the instruction beyond that fetched from memory. All of these separate processes can overlap so that all parts of the CPU are active at any given time. In this pipelined design, an instruction is executed every single clock cycle, which triples the throughput.

07:08

But just like with caching, this can lead to some tricky problems. A big hazard is a dependency in the instructions. For example, you might fetch something that the currently executing instruction is just about to modify, which means you’ll end up with the old value in the pipeline. To compensate for this, pipelined processors have to look ahead for data dependencies, and if necessary, stall their pipelines to avoid problems.

07:28

High-end processors, like those found in laptops and smartphones, go one step further and can dynamically reorder instructions with dependencies in order to minimize stalls and keep the pipeline moving, which is called out-of-order execution. As you might imagine, the circuits that figure this all out are incredibly complicated. Nonetheless, pipelining is tremendously effective and almost all processors implement it today.
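
Here is a small Python sketch of the three-stage idea: each clock cycle, the fetch, decode and execute stages all advance one instruction, so once the pipeline is full, one instruction completes per cycle. It assumes no dependencies or jumps (so nothing ever stalls), and the instruction names are made up for illustration.

```python
# Toy three-stage pipeline with no hazards modeled.
instructions = ["LOAD_A 14", "LOAD_B 15", "ADD B A", "STORE_A 13"]

FETCH, DECODE, EXECUTE = 0, 1, 2
pipeline = [None, None, None]
fetched = completed = cycle = 0

while completed < len(instructions):
    cycle += 1
    executing = pipeline[DECODE]   # decoded last cycle -> executes now
    decoding = pipeline[FETCH]     # fetched last cycle -> decodes now
    fetching = instructions[fetched] if fetched < len(instructions) else None
    if fetching is not None:
        fetched += 1
    pipeline = [fetching, decoding, executing]
    if executing is not None:
        completed += 1             # finishes at the end of this cycle
    print(f"cycle {cycle}: fetch={fetching} decode={decoding} execute={executing}")

print(f"{completed} instructions in {cycle} cycles "
      f"(an unpipelined design would need {3 * completed})")
```

Four instructions finish in six cycles here instead of twelve; as the program gets longer, throughput approaches one instruction per cycle.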

07:49

Another big hazard is conditional jump instructions - we talked about one example, a JUMP NEGATIVE, last episode. These instructions can change the execution flow of a program depending on a value.

07:59

A simple pipelined processor will perform a long stall when it sees a jump instruction, waiting for the value to be finalized. Only once the jump outcome is known does the processor start refilling its pipeline. But this can produce long delays, so high-end processors have some tricks to deal with this problem too.

08:14

Imagine an upcoming jump instruction as a fork in a road - a branch. Advanced CPUs guess which way they are going to go, and start filling their pipeline with instructions based off that guess - a technique called speculative execution. When the jump instruction is finally resolved, if the CPU guessed correctly, then the pipeline is already full of the correct instructions and it can motor along without delay. However, if the CPU guessed wrong, it has to discard all its speculative results and perform a pipeline flush - sort of like when you miss a turn and have to do a u-turn to get back on route, and stop your GPS’s insistent shouting.

08:46

To minimize the effects of these flushes, CPU manufacturers have developed sophisticated ways to guess which way branches will go, called branch prediction. Instead of being a 50/50 guess, today’s processors can often guess with over 90% accuracy!
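
The video doesn’t say how that guessing works, but one classic scheme (among several used in real CPUs) is a two-bit saturating counter per branch: it takes two wrong guesses in a row to flip the prediction, so a single surprise - like a loop’s final iteration - doesn’t derail it. A minimal sketch, assuming a simple table keyed by branch address:

```python
# Two-bit saturating-counter branch predictor (a common textbook scheme).
# Counter values 0-1 predict "not taken"; values 2-3 predict "taken".
from collections import defaultdict

counters = defaultdict(lambda: 2)   # assumed start: weakly predict "taken"

def predict(branch_addr):
    return counters[branch_addr] >= 2

def update(branch_addr, actually_taken):
    c = counters[branch_addr]
    # Nudge the counter toward the real outcome, saturating at 0 and 3.
    counters[branch_addr] = min(3, c + 1) if actually_taken else max(0, c - 1)

# A loop branch that is taken 9 times and then falls through, run twice:
history = ([True] * 9 + [False]) * 2
correct = 0
for outcome in history:
    correct += predict(0x40) == outcome
    update(0x40, outcome)
print(f"{correct}/{len(history)} guesses correct")  # 18/20 for this pattern
```

Eighteen of twenty guesses are right here - the same ballpark as the 90%-plus accuracy mentioned above.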

08:59

In an ideal case, pipelining lets you complete one instruction every single clock cycle, but then superscalar processors came along, which can execute more than one instruction per clock cycle.

09:09

During the execute phase, even in a pipelined design, whole areas of the processor might be totally idle. For example, while executing an instruction that fetches a value from memory, the ALU is just going to be sitting there, not doing a thing. So why not fetch-and-decode several instructions at once, and whenever possible, execute instructions that require different parts of the CPU all at the same time!?

09:29

But we can take this one step further and add duplicate circuitry for popular instructions. For example, many processors will have four, eight or more identical ALUs, so they can execute many mathematical instructions all in parallel!
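
As a rough sketch of that dispatch idea, imagine a scheduler that, each cycle, issues every queued instruction for which a matching execution unit is free. The unit names, counts, and the instruction stream below are invented purely for illustration, and real superscalar hardware also has to respect the data dependencies discussed earlier.

```python
# Toy superscalar dispatch: multiple instructions issue per cycle
# whenever a matching execution unit is idle.
program = [("ALU", "ADD"), ("ALU", "SUB"), ("LOAD", "LOAD_A"),
           ("ALU", "MUL"), ("STORE", "STORE_B"), ("ALU", "DIV")]
units = {"ALU": 2, "LOAD": 1, "STORE": 1}   # e.g. two duplicate ALUs

pending, cycle = list(program), 0
while pending:
    cycle += 1
    free, issued, leftover = dict(units), [], []
    for unit, op in pending:
        if free.get(unit, 0) > 0:   # a matching unit is idle: issue now
            free[unit] -= 1
            issued.append(op)
        else:                       # all units of this type busy: wait
            leftover.append((unit, op))
    pending = leftover
    print(f"cycle {cycle}: issued {issued}")

print(f"{len(program)} instructions in {cycle} cycles")  # 2 cycles here
```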

09:42

OK, the techniques we’ve discussed so far primarily optimize the execution throughput of a single stream of instructions, but another way to increase performance is to run several streams of instructions at once with multi-core processors. You might have heard of dual-core or quad-core processors. This means there are multiple independent processing units inside of a single CPU chip.

10:01

In many ways, this is very much like having multiple separate CPUs, but because they’re tightly integrated, they can share some resources, like cache, allowing the cores to work together on shared computations.

10:11

But, when more cores just isn’t enough, you can build computers with multiple independent CPUs! High-end computers, like the servers streaming this video from YouTube’s datacenter, often need the extra horsepower to keep it silky smooth for the hundreds of people watching simultaneously. Two- and four-processor configurations are the most common right now, but every now and again even that much processing power isn’t enough.
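
In software, you take advantage of multiple cores by running several streams of instructions at once. A minimal Python sketch, summing the halves of a made-up dataset on two worker processes (the data and function names are illustrative; on a dual-core machine the two workers can genuinely run in parallel):

```python
# Two independent instruction streams, one per core, each totalling
# half the data; the partial results are then combined.
from multiprocessing import Pool

def partial_sum(chunk):                 # runs as its own process/stream
    return sum(chunk)

if __name__ == "__main__":
    sales = list(range(1, 10_001))      # stand-in "daily sales" data
    halves = [sales[:5000], sales[5000:]]
    with Pool(processes=2) as pool:     # e.g. one worker per core
        partials = pool.map(partial_sum, halves)
    print(sum(partials))                # 50005000, same as sum(sales)
```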

10:32

So we humans get extra ambitious and build ourselves a supercomputer! If you’re looking to do some really monster calculations - like simulating the formation of the universe - you’ll need some pretty serious compute power. A few extra processors in a desktop computer just isn’t going to cut it. You’re going to need a lot of processors. No... no... even more than that. A lot more!

10:52

When this video was made, the world’s fastest computer was located in the National Supercomputing Center in Wuxi, China. The Sunway TaihuLight contains a brain-melting 40,960 CPUs, each with 256 cores! That’s over ten million cores in total... and each one of those cores runs at 1.45 gigahertz.

11:11

In total, this machine can process 93 quadrillion - that’s 93 million-billions - floating point math operations per second, known as FLOPS. And trust me, that’s a lot of FLOPS!! No word on whether it can run Crysis at max settings, but I suspect it might.
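
As a quick check on those numbers: 40,960 CPUs × 256 cores per CPU is 10,485,760 cores - indeed just over ten million - and 93 quadrillion floating-point operations per second is 9.3 × 10^16 FLOPS, also written as 93 petaFLOPS.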

11:25

So long story short, not only have computer processors gotten a lot faster over the years, but also a lot more sophisticated, employing all sorts of clever tricks to squeeze out more and more computation per clock cycle. Our job is to wield that incredible processing power to do cool and useful things. That’s the essence of programming, which we’ll start discussing next episode.

11:44

See you next week.


Related Tags: Computer Science, Processor Speed, Instruction Set, Caching, Pipelining, Performance, CPU Design, Multi-Core, Supercomputers, Programming