Advanced CPU Designs: Crash Course Computer Science #9

CrashCourse
26 Apr 2017 · 12:22

Summary

TLDR: This video traces how computer processors evolved from early mechanical devices to modern gigahertz speeds. Early processors got faster mainly by improving transistor switching times, but designers went on to develop many other techniques to boost performance, such as performing complex operations like division directly in hardware. Modern processors include special circuitry for graphics operations, decoding compressed video, and encrypting files. Instruction sets have grown over time to preserve backwards compatibility. However, high clock speeds and complex instruction sets create a data-transfer bottleneck, especially when talking to RAM. To address this, CPUs integrate a small on-chip cache for fast data access. Instruction pipelining lets a CPU complete an instruction every clock cycle, raising throughput, and high-end processors even use out-of-order and speculative execution to minimize pipeline stalls. Multi-core processors run several instruction streams at once, while supercomputers combine millions of processor cores for enormous computing power. The video closes by noting that the essence of programming is harnessing all this processing power to do useful and interesting things.

Takeaways

  • 🚀 Computers have gone from mechanical devices managing maybe one calculation per second to CPUs running at kilohertz and megahertz speeds; today's devices typically run at gigahertz speeds, executing billions of instructions every second.
  • 🔍 In early electronic computing, processors were made faster by improving the switching time of the transistors inside the chip, but making transistors faster and more efficient only went so far.
  • 🔧 Processor designers developed a range of techniques to boost performance, allowing not only simple instructions to run fast but also much more sophisticated operations to be performed.
  • 📉 The arithmetic logic unit (ALU) in a modern processor typically performs operations such as division directly in hardware; the extra circuitry makes the ALU more complex, but also faster.
  • 🎮 Modern processors include special circuitry such as MMX, 3DNow!, or SSE: instruction-set extensions that provide additional, fancy instructions for things like gaming and encryption.
  • 📚 High clock speeds and fancy instruction sets lead to another problem: getting data in and out of the CPU quickly enough, with RAM becoming the bottleneck.
  • ⚡ Putting a small piece of RAM on the CPU itself, called a cache, speeds up data access; the cache holds recently accessed blocks of data, cutting down trips to main memory.
  • 🔄 On a cache hit the data is already in the cache and can be supplied immediately; on a cache miss it has to be fetched from RAM.
  • 🛠️ Instruction pipelining is another trick for boosting CPU performance: the stages of several instructions overlap within each clock cycle, increasing throughput.
  • 🔍 High-end processors use advanced techniques such as speculative execution and branch prediction to minimize pipeline stalls and improve efficiency.
  • 🔗 Superscalar processors can execute several instructions at once; by adding extra ALUs they can run many mathematical instructions in parallel.
  • 💻 Multi-core processors run multiple instruction streams at the same time; they contain several independent processing units inside a single CPU chip, much like having multiple CPUs.
  • 🌟 Supercomputers use millions of processor cores for massive computations, such as simulating the formation of the universe.

Q & A

  • How did processor speeds go from roughly one calculation per second to gigahertz?

    - Early gains came mostly from improving the switching time of the transistors inside the chip, which make up the logic gates, the ALU, and everything else. As technology progressed, processor designers developed many further techniques to boost performance, so that both simple instructions and much more sophisticated operations run fast.

  • Why do modern processors include a divide instruction?

    - Implementing division as a series of repeated subtractions gobbles up a lot of clock cycles and isn't very efficient. That is why most processors today include divide as one of the instructions the ALU can perform directly in hardware.
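
To make the repeated-subtraction approach concrete, here is a minimal Python sketch (an illustration, not the program from the video) that divides by counting how many times the divisor fits before the result would go negative:

```python
def divide_by_subtraction(dividend, divisor):
    """Divide a non-negative integer by a positive one the slow way: repeated subtraction."""
    quotient, remainder = 0, dividend
    while remainder >= divisor:        # stop before going negative
        remainder -= divisor           # each subtraction stands in for one or more clock cycles
        quotient += 1
    return quotient, remainder

print(divide_by_subtraction(16, 4))    # (4, 0): 16 - 4 - 4 - 4 - 4
```

Each loop iteration represents at least one subtract-and-compare pass through the ALU, which is why a divide instruction implemented directly in hardware is so much faster.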

  • What are MMX, 3DNow!, and SSE, and what do they do for a processor?

    - They are instruction-set extensions: additional circuitry that lets the processor execute additional, fancy instructions, which are very useful for things like gaming and encryption.

  • Why do high clock speeds and complex instruction sets create a data-transfer bottleneck?

    - They mean the CPU needs data ever faster, but RAM normally sits outside the CPU, so data has to travel along a data bus. Even though the bus is only a few centimetres long, at gigahertz speeds that tiny delay becomes a problem. RAM itself also needs time to look up the address, retrieve the data, and configure its output, so the processor can end up sitting idle while it waits.

  • How does a cache improve CPU performance?

    - A cache is a small piece of RAM placed right on the CPU that speeds up access by storing blocks of data from RAM. When the CPU requests data from RAM, the RAM can transmit a whole block rather than a single value, and that block is saved into the cache. Because computer data is often arranged and processed sequentially, this works very well.

  • What are a cache hit and a cache miss?

    - When data requested from RAM is already stored in the cache, it is called a cache hit, because the cache can supply the data immediately. If the requested data is not in the cache and RAM has to be accessed, it is called a cache miss.

  • What does the dirty bit do in a cache?

    - The dirty bit is a special flag the cache keeps for each block of memory it stores. It is set when the cached copy of the data differs from the real version in RAM. When the cache is full and the processor requests a new block, the cache checks the dirty bit; if it is set, the old block is written back to RAM before the new block is loaded.
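
As a rough illustration of block loading, hits, misses, and the dirty bit described above, here is a toy single-block write-back cache in Python. The block size, eviction behaviour, and all names are invented for the sketch; real caches are far more elaborate:

```python
BLOCK_SIZE = 16  # invented block size, just for the sketch

class ToyCache:
    """A toy cache that holds one block of RAM and a dirty bit for write-back."""
    def __init__(self, ram):
        self.ram = ram       # a plain list standing in for RAM
        self.base = None     # start address of the cached block (None = empty)
        self.block = []
        self.dirty = False

    def _load_block(self, addr):
        if self.dirty:       # write the modified block back to RAM before evicting it
            self.ram[self.base:self.base + BLOCK_SIZE] = self.block
        self.base = (addr // BLOCK_SIZE) * BLOCK_SIZE
        self.block = self.ram[self.base:self.base + BLOCK_SIZE]  # whole block, not one value
        self.dirty = False

    def read(self, addr):
        if self.base is None or not (self.base <= addr < self.base + BLOCK_SIZE):
            print(f"cache miss at {addr}: fetching a block from RAM")
            self._load_block(addr)
        else:
            print(f"cache hit at {addr}")
        return self.block[addr - self.base]

    def write(self, addr, value):
        self.read(addr)                      # make sure the block is cached
        self.block[addr - self.base] = value
        self.dirty = True                    # cached copy now differs from RAM

ram = list(range(256))
cache = ToyCache(ram)
cache.read(100)        # miss: loads the surrounding block
cache.read(101)        # hit: already in the cache
cache.write(101, 999)  # sets the dirty bit
cache.read(200)        # miss: the dirty block is written back to RAM first
```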

  • How does instruction pipelining improve CPU performance?

    - Pipelining splits instruction execution into stages, such as fetch, decode, and execute, and overlaps them. While one instruction is being executed, the next can be decoded and the one after that fetched from memory, so an instruction can complete every clock cycle, which raises throughput.
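
The throughput gain is easy to see with a tiny timing sketch (purely illustrative, using the three-stage fetch-decode-execute cycle from the video):

```python
STAGES = 3  # fetch, decode, execute

def sequential_cycles(n_instructions):
    # Each instruction finishes all three stages before the next one starts.
    return n_instructions * STAGES

def pipelined_cycles(n_instructions):
    # After the pipeline fills (2 cycles), one instruction completes every cycle.
    return (STAGES - 1) + n_instructions

for n in (1, 10, 100):
    print(n, sequential_cycles(n), pipelined_cycles(n))
# 100 instructions: 300 cycles sequentially vs. 102 pipelined -- roughly triple the throughput.
```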

  • What are superscalar processors, and how do they improve performance?

    - Superscalar processors execute more than one instruction per clock cycle. They fetch and decode several instructions at once and, whenever possible, simultaneously execute instructions that need different parts of the CPU. Many processors also add duplicate circuitry for popular instructions; for example, a processor might have four, eight, or more identical ALUs so it can run many mathematical instructions in parallel.
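
Here is a very rough sketch of the dual-issue idea, with an invented instruction format: consecutive instructions that have no data dependency and need different parts of the CPU are grouped into the same cycle.

```python
# Each instruction is (destination_register, source_registers, unit_needed).
program = [
    ("r1", ("r2", "r3"), "ALU"),
    ("r4", ("r5",),      "LOAD"),  # independent and uses a different unit
    ("r6", ("r1",),      "ALU"),   # depends on r1, so it cannot pair with the first
    ("r7", ("r8", "r9"), "ALU"),
]

def can_issue_together(a, b):
    a_dest, a_src, a_unit = a
    b_dest, b_src, b_unit = b
    independent = a_dest not in b_src and b_dest not in a_src and a_dest != b_dest
    return independent and a_unit != b_unit   # different CPU parts, no data dependency

cycles, i = [], 0
while i < len(program):
    if i + 1 < len(program) and can_issue_together(program[i], program[i + 1]):
        cycles.append([program[i], program[i + 1]])   # two instructions in one cycle
        i += 2
    else:
        cycles.append([program[i]])
        i += 1

for c, group in enumerate(cycles):
    print(f"cycle {c}:", ", ".join(instr[0] for instr in group))
```

With duplicate ALUs, a real superscalar core could also pair the two independent ALU instructions; this sketch only models the "different parts of the CPU" case.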

  • How do multi-core processors raise computing performance?

    - A multi-core processor packs multiple independent processing units into a single CPU chip. In many ways this is like having multiple separate CPUs, but because the cores are tightly integrated they can share resources such as cache, allowing them to work together on shared computations.
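
A minimal sketch of sharing a computation across cores, using Python's standard multiprocessing module (the workload and the chunking scheme are invented for illustration):

```python
from multiprocessing import Pool, cpu_count

def partial_sum(bounds):
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))   # an arbitrary chunk of work

if __name__ == "__main__":
    n, cores = 1_000_000, cpu_count()           # one chunk per available core
    step = n // cores
    chunks = [(c * step, n if c == cores - 1 else (c + 1) * step) for c in range(cores)]
    with Pool(cores) as pool:                   # each worker process can run on its own core
        total = sum(pool.map(partial_sum, chunks))
    print(cores, "cores ->", total)
```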

  • How do supercomputers achieve their enormous computing power?

    - By combining huge numbers of processors. For example, one of the world's fastest computers, the Sunway TaihuLight at the National Supercomputing Center in Wuxi, China, contains 40,960 CPUs with 256 cores each, more than ten million cores in total; each core runs at 1.45 gigahertz, and the machine can perform 93 quadrillion floating-point math operations per second.
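
The headline numbers can be checked with a couple of lines of arithmetic:

```python
cpus, cores_per_cpu = 40_960, 256
total_cores = cpus * cores_per_cpu       # 10,485,760 -- "over ten million cores"
clock_hz = 1.45e9                        # 1.45 GHz per core
print(total_cores, f"{total_cores * clock_hz:.2e} core-cycles per second")
# The quoted 93 quadrillion FLOPS (9.3e16) is several times larger than the
# core-cycle count, so each core must perform multiple floating-point
# operations per cycle to reach that peak figure.
```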

  • Why is programming described as the way to harness all this processing power?

    - Programming is the act of writing and designing software that directs the processor to carry out particular tasks. As processors have become dramatically faster and more sophisticated, programmers can build on all these clever techniques to make programs run efficiently and do richer, more useful things.

Outlines

00:00

🚀 The evolution of processors and their performance gains

Carrie Anne traces how computer processors went from early mechanical devices to modern CPUs, from roughly one calculation per second to gigahertz speeds. Performance gains come not just from faster transistor switching but from a whole range of techniques: the ALU performing division in hardware, special circuits for graphics operations, video decoding, and file encryption, and instruction-set extensions such as MMX, 3DNow!, and SSE. Another bottleneck is how quickly data can move to and from RAM; to address it, CPUs include a cache that loads whole blocks of data ahead of time and reduces trips to RAM. A request served from data already in the cache is a cache hit; one that has to go back to RAM is a cache miss.

05:04

🔍 Advanced CPU performance-optimization techniques

To push performance further, the video introduces instruction pipelining, which overlaps fetching, decoding, and executing so that an instruction can complete every clock cycle, greatly increasing throughput. Pipelining brings challenges of its own, such as dependencies between instructions that can stall the pipeline; modern processors tackle these with out-of-order execution and branch prediction. Superscalar processors go further and execute several instructions per clock cycle. The video also covers multi-core processors, which integrate multiple independent processing units into a single CPU chip so they can work together to raise performance.

10:06

🏢 Multi-processor machines and supercomputers

Finally, the video looks at what happens when even multiple cores are not enough: computers can be built with several independent CPUs. Servers like those in YouTube's datacenter commonly use two- or four-processor configurations to handle heavy load. For truly enormous computations, such as simulating the formation of the universe, you need a supercomputer. At the time the video was made, the world's fastest computer was at the National Supercomputing Center in Wuxi, China, with more than ten million cores capable of 93 quadrillion floating-point operations per second. These advanced techniques make processors not just faster but also far more sophisticated, and the job of programming is to harness that processing power to do useful and interesting things.

Keywords

💡CPU

The CPU, or central processing unit, is the core component of a computer, responsible for executing a program's instructions. The video follows the CPU's development from mechanical devices performing perhaps one calculation per second to processors running at kilohertz and megahertz speeds, and now at gigahertz speeds, executing billions of instructions per second. This progression is one of the video's central themes, illustrating how computer performance has advanced.

💡Transistor

Transistors are the basic elements that make up the logic gates, the arithmetic logic unit (ALU), and the other components inside a CPU. The video notes that early speed improvements came mainly from improving transistor switching times. Faster, more efficient transistors directly affect CPU performance, making them a foundational concept in computer science.

💡ALU

The ALU, or arithmetic logic unit, is the part of the CPU that performs arithmetic and logic operations. The video points out that the ALU in a modern processor can perform operations such as division directly in hardware by adding extra circuitry, which makes the ALU more complicated to design but also more capable.

💡Instruction set

The instruction set is the collection of all the instructions a CPU can execute. The video explains that instruction sets have grown over time to support more operations, such as graphics processing, video decoding, and file encryption. This growth matters for programming and performance, because programmers can use the new instructions to write more efficient programs.

💡Cache

A cache is a small piece of RAM located on the CPU that stores data the CPU accesses frequently. The video explains that when the CPU requests data from RAM, the RAM can transmit a whole block of data into the cache, so the CPU does not have to go back to RAM every time, which speeds up processing considerably. Cache hits and cache misses are the two key concepts for measuring how effective the cache is.

💡Instruction pipelining

Instruction pipelining is a CPU optimization technique that lets the CPU decode the next instruction, and fetch the one after that, while the current instruction executes. The video uses a laundry analogy to show how pipelining improves efficiency. With a pipelined design the CPU can complete an instruction every clock cycle, significantly raising throughput.

💡Superscalar processor

A superscalar processor can execute more than one instruction per clock cycle. The video points out that even in a pipelined design, whole areas of the processor can sit idle during the execute phase; superscalar processors put those idle resources to work by executing several instructions at once, further improving performance.

💡Multi-core processor

A multi-core processor is a single CPU chip containing multiple independent processing units. The video notes that this is much like having multiple separate CPUs, but because the cores are tightly integrated they can share resources such as cache. Multi-core processors can handle several instruction streams at once, which is crucial for raising performance.

💡Supercomputer

A supercomputer is a high-performance machine with a very large number of processors and cores. The video uses the Sunway TaihuLight at the National Supercomputing Center in Wuxi, China as its example: it has more than ten million cores and can perform 93 quadrillion floating-point operations per second. Supercomputers are used for extremely complex, computation-heavy tasks.

💡Branch prediction

Branch prediction is a technique advanced CPUs use to handle conditional jump instructions. The video explains that when the CPU reaches a jump, it guesses which way the branch will go and starts filling its pipeline based on that guess. If the guess is right, the CPU can keep going without delay; if it is wrong, it has to discard the speculative results and flush the pipeline. Branch prediction lets modern processors guess correctly more than 90% of the time, reducing the delays caused by pipeline flushes.
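
The video does not describe a specific prediction scheme, but a classic textbook example is a 2-bit saturating counter kept per branch; the sketch below is purely illustrative and not taken from the video:

```python
class TwoBitPredictor:
    """States 0-1 predict 'not taken', states 2-3 predict 'taken'.
    Two wrong guesses in a row are needed to flip the prediction."""
    def __init__(self):
        self.state = 2                   # start weakly predicting "taken"

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# A loop branch that is taken nine times and then falls through once.
history = [True] * 9 + [False]
predictor, correct = TwoBitPredictor(), 0
for taken in history:
    correct += predictor.predict() == taken
    predictor.update(taken)
print(f"{correct}/{len(history)} predictions correct")   # 9/10 for this pattern
```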

💡Out-of-order execution

Out-of-order execution is a technique used in high-end processors that dynamically reorders instructions with dependencies to minimize pipeline stalls and keep the pipeline moving. The video notes that the circuitry required to figure this out is incredibly complicated, yet pipelining is so effective that almost all processors implement it today.

Highlights

Computers have gone from mechanical devices capable of maybe one calculation per second to CPUs running at kilohertz and megahertz speeds.

Processors were made faster by improving the switching time of the transistors inside the chip, but that approach only goes so far.

Modern processors have special circuits for complex operations such as graphics, decoding compressed video, and encrypting files.

Instruction sets have grown over time to preserve backwards compatibility, adding extensions such as MMX, 3DNow!, or SSE.

High clock speeds and complex instruction sets make getting data in and out of the CPU a bottleneck, especially when talking to RAM.

Integrating a small piece of RAM on the CPU itself, called a cache, speeds up data access.

When data the CPU requests is already in the cache, it's a cache hit; if it isn't, it's a cache miss.

The cache uses a dirty bit to record when a block of data no longer matches what is stored in RAM.

Instruction pipelining boosts CPU performance, allowing an instruction to complete every clock cycle.

High-end processors use out-of-order execution, dynamically reordering dependent instructions to minimize pipeline stalls.

Conditional jump instructions can stall the pipeline for a long time, but high-end processors use speculative execution to predict the outcome of the jump.

Branch prediction lets modern processors guess which way a branch will go with over 90% accuracy.

Superscalar processors can execute more than one instruction per clock cycle.

Multi-core processors run several instruction streams at once, much like having multiple separate CPUs.

Building machines with multiple CPUs, or full supercomputers, provides extra processing power for workloads that need enormous amounts of computation.

When the video was made, the world's fastest computer was at the National Supercomputing Center in Wuxi, China, with over ten million cores.

Processors haven't just gotten faster; they've become far more sophisticated, using all sorts of clever tricks to squeeze more computation out of every clock cycle.

The essence of programming is wielding that incredible processing power to do cool and useful things.

Transcripts

play00:02

Hi, I’m Carrie Anne and welcome to CrashCourse Computer Science!

play00:06

As we’ve discussed throughout the series, computers have come a long way from mechanical

play00:09

devices capable of maybe one calculation per second, to CPUs running at kilohertz and megahertz speeds.

play00:15

The device you’re watching this video on right now is almost certainly running at Gigahertz

play00:19

speeds - that’s billions of instructions executed every second.

play00:22

Which, trust me, is a lot of computation!

play00:24

In the early days of electronic computing, processors were typically made faster by improving

play00:28

the switching time of the transistors inside the chip - the ones that make up all the logic

play00:33

gates, ALUs and other stuff we’ve talked about over the past few episodes.

play00:36

But just making transistors faster and more efficient only went so far, so processor designers

play00:41

have developed various techniques to boost performance allowing not only simple instructions

play00:45

to run fast, but also performing much more sophisticated operations.

play00:49

INTRO

play00:58

Last episode, we created a small program for our CPU that allowed us to divide two numbers.

play01:03

We did this by doing many subtractions in a row... so, for example, 16 divided by 4

play01:08

could be broken down into the smaller problem of 16 minus 4, minus 4, minus 4, minus 4.

play01:13

When we hit zero, or a negative number, we knew that we were done.

play01:17

But this approach gobbles up a lot of clock cycles, and isn’t particularly efficient.

play01:20

So most computer processors today have divide as one of the instructions that the ALU can

play01:25

perform in hardware.

play01:26

Of course, this extra circuitry makes the ALU bigger and more complicated to design,

play01:30

but also more capable - a complexity-for-speed tradeoff that has been made many times in

play01:35

computing history.

play01:36

For instance, modern computer processors now have special circuits for things like graphics

play01:40

operations, decoding compressed video, and encrypting files - all of which are operations

play01:45

that would take many many many clock cycles to perform with standard operations.

play01:48

You may have even heard of processors with MMX, 3DNow!, or SSE.

play01:53

These are processors with additional, fancy circuits that allow them to execute additional,

play01:57

fancy instructions - for things like gaming and encryption.

play02:00

These extensions to the instruction set have grown, and grown over time, and once people

play02:04

have written programs to take advantage of them, it’s hard to remove them.

play02:07

So instruction sets tend to keep getting larger and larger keeping all the old opcodes around

play02:12

for backwards compatibility.

play02:13

The Intel 4004, the first truly integrated CPU, had 46 instructions - which was enough

play02:19

to build a fully functional computer.

play02:21

But a modern computer processor has thousands of different instructions, which utilize all

play02:26

sorts of clever and complex internal circuitry.

play02:28

Now, high clock speeds and fancy instruction sets lead to another problem - getting data

play02:33

in and out of the CPU quickly enough.

play02:35

It’s like having a powerful steam locomotive, but no way to shovel in coal fast enough.

play02:40

In this case, the bottleneck is RAM.

play02:42

RAM is typically a memory module that lies outside the CPU.

play02:45

This means that data has to be transmitted to and from RAM along sets of data wires,

play02:49

called a bus.

play02:50

This bus might only be a few centimeters long, and remember those electrical signals are

play02:54

traveling near the speed of light, but when you are operating at gigahertz speeds – that’s

play02:58

billionths of a second – even this small delay starts to become problematic.

play03:02

It also takes time for RAM itself to look up the address, retrieve the data, and configure

play03:07

itself for output.

play03:08

So a “load from RAM” instruction might take dozens of clock cycles to complete, and during

play03:12

this time the processor is just sitting there idly waiting for the data.

play03:16

One solution is to put a little piece of RAM right on the CPU -- called a cache.

play03:20

There isn’t a lot of space on a processor’s chip, so most caches are just kilobytes or

play03:23

maybe megabytes in size, where RAM is usually gigabytes.

play03:27

Having a cache speeds things up in a clever way.

play03:29

When the CPU requests a memory location from RAM, the RAM can transmit not just one single

play03:34

value, but a whole block of data.

play03:36

This takes only a little bit more time than transmitting a single value, but it allows

play03:38

this data block to be saved into the cache.

play03:41

This tends to be really useful because computer data is often arranged and processed sequentially.

play03:45

For example, let's say the processor is totalling up daily sales for a restaurant.

play03:50

It starts by fetching the first transaction from RAM at memory location 100.

play03:54

The RAM, instead of sending back just that one value, sends a block of data, from memory

play03:58

location 100 through 200, which are then all copied into the cache.

play04:02

Now, when the processor requests the next transaction to add to its running total, the

play04:06

value at address 101, the cache will say “Oh, I’ve already got that value right here,

play04:10

so I can give it to you right away!”

play04:12

And there’s no need to go all the way to RAM.

play04:14

Because the cache is so close to the processor, it can typically provide the data in a single

play04:18

clock cycle -- no waiting required.

play04:21

This speeds things up tremendously over having to go back and forth to RAM every single time.

play04:24

When data requested in RAM is already stored in the cache like this it’s called a cache

play04:28

hit,

play04:29

and if the data requested isn’t in the cache, so you have to go to RAM, it’s called

play04:32

a cache miss.

play04:34

The cache can also be used like a scratch space, storing intermediate values when performing

play04:38

a longer, or more complicated calculation.

play04:41

Continuing our restaurant example, let’s say the processor has finished totalling up

play04:44

all of the sales for the day, and wants to store the result in memory address 150.

play04:48

Like before, instead of going back all the way to RAM to save that value, it can be stored

play04:53

in the cached copy, which is faster to save to, and also faster to access later if more calculations

play04:59

are needed.

play04:59

But this introduces an interesting problem -- the cache’s copy of the data is now different

play05:04

to the real version stored in RAM.

play05:05

This mismatch has to be recorded, so that at some point everything can get synced up.

play05:09

For this purpose, the cache has a special flag for each block of memory it stores, called

play05:13

the dirty bit -- which might just be the best term computer scientists have ever invented.

play05:18

Most often this synchronization happens when the cache is full, but a new block of memory

play05:23

is being requested by the processor.

play05:24

Before the cache erases the old block to free up space, it checks its dirty bit, and if

play05:29

it’s dirty, the old block of data is written back to RAM before loading in the new block.

play05:33

Another trick to boost cpu performance is called instruction pipelining.

play05:37

Imagine you have to wash an entire hotel’s worth of sheets, but you’ve only got one

play05:40

washing machine and one dryer.

play05:42

One option is to do it all sequentially: put a batch of sheets in the washer and wait 30

play05:45

minutes for it to finish.

play05:46

Then take the wet sheets out and put them in the dryer and wait another 30 minutes for

play05:50

that to finish.

play05:51

This allows you to do one batch of sheets every hour.

play05:53

Side note: if you have a dryer that can dry a load of laundry in 30 minutes, please tell

play05:57

me the brand and model in the comments, because I’m living with 90 minute dry times, minimum.

play06:01

But, even with this magic clothes dryer, you can speed things up even more if you parallelize

play06:06

your operation.

play06:07

As before, you start off putting one batch of sheets in the washer.

play06:10

You wait 30 minutes for it to finish.

play06:12

Then you take the wet sheets out and put them in the dryer.

play06:14

But this time, instead of just waiting 30 minutes for the dryer to finish, you simultaneously

play06:19

start another load in the washing machine.

play06:21

Now you’ve got both machines going at once.

play06:23

Wait 30 minutes, and one batch is now done, one batch is half done, and another is ready

play06:27

to go in.

play06:28

This effectively doubles your throughput.

play06:30

Processor designs can apply the same idea.

play06:32

In episode 7, our example processor performed the fetch-decode-execute cycle sequentially

play06:37

and in a continuous loop: Fetch-decode-execute, fetch-decode-execute, fetch-decode-execute,

play06:41

and so on.

play06:43

This meant our design required three clock cycles to execute one instruction.

play06:46

But each of these stages uses a different part of the CPU, meaning there is an opportunity

play06:50

to parallelize!

play06:51

While one instruction is getting executed, the next instruction could be getting decoded,

play06:55

and the instruction beyond that fetched from memory.

play06:57

All of these separate processes can overlap so that all parts of the CPU are active at

play07:01

any given time.

play07:02

In this pipelined design, an instruction is executed every single clock cycle which triples

play07:07

the throughput.

play07:08

But just like with caching this can lead to some tricky problems.

play07:11

A big hazard is a dependency in the instructions.

play07:14

For example, you might fetch something that the currently executing instruction is just

play07:17

about to modify, which means you’ll end up with the old value in the pipeline.

play07:21

To compensate for this, pipelined processors have to look ahead for data dependencies,

play07:25

and if necessary, stall their pipelines to avoid problems.

play07:28

High end processors, like those found in laptops and smartphones, go one step further and can

play07:32

dynamically reorder instructions with dependencies in order to minimize stalls and keep the pipeline

play07:37

moving, which is called out-of-order execution.

play07:40

As you might imagine, the circuits that figure this all out are incredibly complicated.

play07:44

Nonetheless, pipelining is tremendously effective and almost all processors implement it today.

play07:49

Another big hazard are conditional jump instructions -- we talked about one example, a JUMP NEGATIVE,

play07:54

last episode.

play07:55

These instructions can change the execution flow of a program depending on a value.

play07:59

A simple pipelined processor will perform a long stall when it sees a jump instruction,

play08:03

waiting for the value to be finalized.

play08:05

Only once the jump outcome is known, does the processor start refilling its pipeline.

play08:09

But, this can produce long delays, so high-end processors have some tricks to deal with this

play08:13

problem too.

play08:14

Imagine an upcoming jump instruction as a fork in a road - a branch.

play08:18

Advanced CPUs guess which way they are going to go, and start filling their pipeline with

play08:21

instructions based off that guess – a technique called speculative execution.

play08:27

When the jump instruction is finally resolved, if the CPU guessed correctly, then the pipeline

play08:31

is already full of the correct instructions and it can motor along without delay.

play08:35

However, if the CPU guessed wrong, it has to discard all its speculative results and

play08:38

perform a pipeline flush - sort of like when you miss a turn and have to do a u-turn to

play08:42

get back on route, and stop your GPS’s insistent shouting.

play08:46

To minimize the effects of these flushes, CPU manufacturers have developed sophisticated

play08:50

ways to guess which way branches will go, called branch prediction.

play08:54

Instead of being a 50/50 guess, today’s processors can often guess with over 90% accuracy!

play08:59

In an ideal case, pipelining lets you complete one instruction every single clock cycle,

play09:03

but then superscalar processors came along which can execute more than one instruction

play09:08

per clock cycle.

play09:09

During the execute phase even in a pipelined design, whole areas of the processor might

play09:13

be totally idle.

play09:14

For example, while executing an instruction that fetches a value from memory, the ALU

play09:18

is just going to be sitting there, not doing a thing.

play09:21

So why not fetch-and-decode several instructions at once, and whenever possible, execute instructions

play09:26

that require different parts of the CPU all at the same time!?

play09:29

But we can take this one step further and add duplicate circuitry

play09:32

for popular instructions.

play09:34

For example, many processors will have four, eight or more identical ALUs, so they can

play09:39

execute many mathematical instructions all in parallel!

play09:42

Ok, the techniques we’ve discussed so far primarily optimize the execution throughput

play09:46

of a single stream of instructions, but another way to increase performance is to run several

play09:50

streams of instructions at once with multi-core processors.

play09:54

You might have heard of dual core or quad core processors.

play09:57

This means there are multiple independent processing units inside of a single CPU chip.

play10:01

In many ways, this is very much like having multiple separate CPUs, but because they’re

play10:05

tightly integrated, they can share some resources, like cache, allowing the cores to work together

play10:10

on shared computations.

play10:11

But, when more cores just isn’t enough, you can build computers with multiple independent

play10:16

CPUs!

play10:17

High end computers, like the servers streaming this video from YouTube’s datacenter, often

play10:21

need the extra horsepower to keep it silky smooth for the hundreds of people watching

play10:25

simultaneously.

play10:26

Two- and four-processor configurations are the most common right now, but every now and

play10:30

again even that much processing power isn’t enough.

play10:32

So we humans get extra ambitious and build ourselves a supercomputer!

play10:36

If you’re looking to do some really monster calculations – like simulating the formation

play10:40

of the universe - you’ll need some pretty serious compute power.

play10:43

A few extra processors in a desktop computer just isn’t going to cut it.

play10:47

You’re going to need a lot of processors.

play10:49

No.. no... even more than that.

play10:51

A lot more!

play10:52

When this video was made, the world’s fastest computer was located in The National Supercomputing

play10:56

Center in Wuxi, China.

play10:58

The Sunway TaihuLight contains a brain-melting 40,960 CPUs, each with 256 cores!

play11:06

That’s over ten million cores in total... and each one of those cores runs at 1.45 gigahertz.

play11:11

In total, this machine can process 93 Quadrillion -- that’s 93 million-billions -- floating

play11:17

point math operations per second, known as FLOPS.

play11:20

And trust me, that’s a lot of FLOPS!!

play11:21

No word on whether it can run Crysis at max settings, but I suspect it might.

play11:25

So long story short, not only have computer processors gotten a lot faster over the years,

play11:30

but also a lot more sophisticated, employing all sorts of clever tricks to squeeze out

play11:33

more and more computation per clock cycle.

play11:36

Our job is to wield that incredible processing power to do cool and useful things.

play11:40

That’s the essence of programming, which we’ll start discussing next episode.

play11:44

See you next week.

Related Tags
processor evolution, computing speed, instruction sets, cache, data transfer, memory bottleneck, instruction pipelining, superscalar processors, multi-core processing, supercomputers, performance optimization