Advanced CPU Designs: Crash Course Computer Science #9
Summary
TLDR: This video script traces the evolution of computer processors from early mechanical devices to the gigahertz speeds of today. Early processors were sped up by improving transistor switching times, but as that approach ran out of headroom, designers developed many other techniques, including performing complex operations like division directly in hardware. Modern processors have special circuits for graphics operations, decoding compressed video, and encrypting files. Instruction sets have grown over time to preserve backwards compatibility. However, high clock speeds and complex instruction sets created a data-transfer bottleneck, especially with RAM. To address it, CPUs integrate an on-chip cache for fast data access. Instruction pipelining lets a CPU complete an instruction every clock cycle, raising throughput, and high-end processors even use out-of-order and speculative execution to minimize pipeline stalls. Multi-core processors run multiple instruction streams at once, while supercomputers combine millions of processor cores for enormous computing power. The video closes by noting that the essence of programming is harnessing this processing power to do useful and interesting things.
Takeaways
- 🚀 Computer processors have gone from mechanical devices performing perhaps one calculation per second to CPUs running at kilohertz and megahertz speeds; today's devices typically run at gigahertz speeds, executing billions of instructions per second.
- 🔍 In early electronic computing, processors were made faster by improving the switching time of the transistors inside the chip, but faster, more efficient transistors only took performance so far.
- 🔧 Processor designers developed many techniques to boost performance, making simple instructions run fast while also enabling far more sophisticated operations.
- 📉 The arithmetic logic unit (ALU) in modern processors typically performs operations like division in hardware, which makes the ALU more complex but much faster.
- 🎮 Modern processors include special circuits such as MMX, 3DNow!, or SSE, instruction set extensions that add fancy instructions for things like gaming and encryption.
- 📚 High clock speeds and complex instruction sets create another problem: getting data in and out of the CPU fast enough, with RAM becoming the bottleneck.
- ⚡ Placing a small piece of RAM on the CPU itself, called a cache, speeds up data access; it stores recently accessed data, reducing trips to main memory.
- 🔄 On a cache hit, the data is already in the cache and can be served immediately; on a cache miss, it must be fetched from RAM.
- 🛠️ Instruction pipelining is another trick for boosting CPU performance: it overlaps stages of multiple instructions in each clock cycle, raising throughput.
- 🔍 High-end processors use advanced techniques such as speculative execution and branch prediction to minimize pipeline stalls and improve efficiency.
- 🔗 Superscalar processors can execute multiple instructions at once, and by adding duplicate ALUs they can run many mathematical instructions in parallel.
- 💻 Multi-core processors run multiple instruction streams at once, with several independent processing units inside a single CPU chip, much like having multiple CPUs.
- 🌟 Supercomputers combine millions of processor cores for massive calculations, such as simulating the formation of the universe.
Q & A
How did processor speeds go from one calculation per second to gigahertz?
-Speed gains came largely from improving the switching time of the transistors inside the chip, which make up the logic gates, ALUs, and other components. As technology advanced, processor designers also developed many other techniques to boost performance, so that both simple instructions and sophisticated operations run quickly.
Why do modern processors include a divide instruction?
-Implementing division as repeated subtraction burns many clock cycles and is inefficient, so most processors include divide as one of the instructions the ALU can perform directly in hardware.
What are MMX, 3DNow!, and SSE, and how do they affect the processor?
-MMX, 3DNow!, and SSE are instruction set extensions: additional circuits that let the processor execute additional, fancy instructions, useful for operations like gaming and encryption.
Why do high clock speeds and complex instruction sets create a data-transfer bottleneck?
-They mean the CPU needs data faster, but RAM usually sits outside the CPU, and data must travel over a data bus. Even a short bus introduces delay at gigahertz speeds, and RAM itself takes time to look up an address, retrieve the data, and configure its output, so the processor can sit idle waiting for data.
How does a cache improve CPU performance?
-A cache is a small piece of RAM on the CPU that speeds up access by storing blocks of data from RAM. When the CPU requests data from RAM, RAM can transmit a whole block rather than a single value, and that block is saved in the cache. Because computer data is often arranged and processed sequentially, this works very well.
What are a cache hit and a cache miss?
-When requested data is already in the cache, it's called a cache hit, because the cache can supply it immediately. If the data isn't in the cache and RAM must be accessed, it's called a cache miss.
What role does the dirty bit play in a cache?
-The dirty bit is a special flag the cache keeps for each block of memory it stores. It is set when the cached copy differs from the version in RAM. When the cache is full and the processor requests a new block, the cache checks the dirty bit and, if it's set, writes the dirty block back to RAM before loading the new one.
How does instruction pipelining improve CPU performance?
-Pipelining breaks instruction execution into stages (such as fetch, decode, and execute) and overlaps them: while one instruction executes, the next is being decoded and the one after that is being fetched from memory. This lets the CPU complete an instruction every clock cycle, raising throughput.
What are superscalar processors, and how do they improve performance?
-Superscalar processors execute more than one instruction per clock cycle. They fetch and decode several instructions at once and, where possible, simultaneously execute instructions that use different parts of the CPU. Many processors also add duplicate circuitry for popular instructions; some have four, eight, or more identical ALUs so they can run many mathematical instructions in parallel.
How do multi-core processors improve performance?
-Multi-core processors put multiple independent processing units inside a single CPU chip. This is much like having several separate CPUs, but because the cores are tightly integrated they can share some resources, such as cache, letting them work together on shared computations.
How do supercomputers achieve their enormous computing power?
-By combining huge numbers of processors. For example, one of the world's fastest computers, the Sunway TaihuLight at the National Supercomputing Center in Wuxi, China, has 40,960 CPUs with 256 cores each, over ten million cores in total, each running at 1.45 gigahertz, and can process 93 quadrillion floating point math operations per second.
Why is programming described as the essence of harnessing this processing power?
-Programming is the act of writing and designing software that directs the processor to perform specific tasks. As processors have grown dramatically faster and more sophisticated, programmers can wield all that processing power to build richer and more useful things.
Outlines
🚀 The evolution of computer processors and performance gains
Carrie Anne explains how computers evolved from early mechanical devices to modern CPUs, from perhaps one calculation per second to gigahertz speeds. The gains came not just from faster transistor switching but from many techniques: hardware divide in the ALU, special circuits for graphics operations, video decoding, and file encryption, and instruction set extensions such as MMX, 3DNow!, and SSE. Another performance bottleneck is RAM's data-transfer speed, which CPUs address with an on-chip cache that preloads whole blocks of data to cut trips to RAM. An access served from the cache is a cache hit; one that must go out to RAM is a cache miss.
🔍 Advanced CPU performance techniques
To push performance further, the video introduces instruction pipelining, which overlaps instruction fetch, decode, and execute so that an instruction completes every clock cycle, greatly raising throughput. Pipelining brings challenges of its own, such as dependencies between instructions that can stall the pipeline; modern processors counter these with out-of-order execution and branch prediction. Superscalar processors go further still, executing multiple instructions at once. The video also covers multi-core processors, which place several independent processing units in a single CPU chip so they can work together for higher performance.
🏢 Multi-processor machines and supercomputers
Finally, the video discusses what happens when multiple cores still aren't enough: building computers with several independent CPUs. Servers like those in YouTube's datacenter commonly use two- or four-processor configurations to handle heavy load. For truly massive computations, such as simulating the formation of the universe, you need a supercomputer. When the video was made, the world's fastest computer, at the National Supercomputing Center in Wuxi, China, had over ten million cores and could perform up to 93 quadrillion floating point operations per second. These advances make processors not just faster but far more sophisticated, and the job of programming is to put that processing power to useful and interesting work.
Keywords
💡CPU
💡Transistor
💡ALU
💡Instruction Set
💡Cache
💡Instruction Pipelining
💡Superscalar Processor
💡Multi-core Processor
💡Supercomputer
💡Branch Prediction
💡Out-of-order Execution
Highlights
Computers evolved from mechanical devices performing perhaps one calculation per second to CPUs running at kilohertz and megahertz speeds.
Processors were sped up by improving the switching time of the transistors in the chip, but that approach only went so far.
Modern processors have special circuits for complex operations such as graphics, decoding compressed video, and encrypting files.
Instruction sets have grown over time to preserve backwards compatibility, adding extensions like MMX, 3DNow!, or SSE.
High clock speeds and complex instruction sets make getting data in and out of the CPU a bottleneck, especially when interacting with RAM.
Integrating a small block of RAM on the CPU, called a cache, speeds up data access.
When data the CPU requests is already in the cache, it's a cache hit; if not, it's a cache miss.
The cache uses a dirty bit to record whether a block differs from the data in RAM.
Instruction pipelining boosts CPU performance, allowing an instruction to complete every clock cycle.
High-end processors use out-of-order execution, dynamically reordering dependent instructions to minimize pipeline stalls.
Conditional jump instructions can stall the pipeline for a long time, so high-end processors use speculative execution to guess the jump's outcome.
Branch prediction lets modern processors guess correctly more than 90% of the time.
Superscalar processors can execute multiple instructions in a single clock cycle.
Multi-core processors run multiple instruction streams at once, much like having several independent CPUs.
Multi-CPU computers and supercomputers provide extra processing power for workloads that demand enormous computation.
The world's fastest computer, at the National Supercomputing Center in Wuxi, China, has over ten million cores.
Processors have become not just faster but far more sophisticated, using clever tricks to squeeze more computation out of every clock cycle.
The essence of programming is wielding that incredible processing power to do cool and useful things.
Transcripts
Hi, I’m Carrie Anne and welcome to CrashCourse Computer Science!
As we’ve discussed throughout the series, computers have come a long way from mechanical
devices capable of maybe one calculation per second, to CPUs running at kilohertz and megahertz speeds.
The device you’re watching this video on right now is almost certainly running at Gigahertz
speeds - that’s billions of instructions executed every second.
Which, trust me, is a lot of computation!
In the early days of electronic computing, processors were typically made faster by improving
the switching time of the transistors inside the chip - the ones that make up all the logic
gates, ALUs and other stuff we’ve talked about over the past few episodes.
But just making transistors faster and more efficient only went so far, so processor designers
have developed various techniques to boost performance allowing not only simple instructions
to run fast, but also performing much more sophisticated operations.
INTRO
Last episode, we created a small program for our CPU that allowed us to divide two numbers.
We did this by doing many subtractions in a row... so, for example, 16 divided by 4
could be broken down into the smaller problem of 16 minus 4, minus 4, minus 4, minus 4.
When we hit zero, or a negative number, we knew that we were done.
But this approach gobbles up a lot of clock cycles, and isn’t particularly efficient.
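The repeated-subtraction approach can be sketched in a few lines of Python (the function name and example values here are illustrative, not from the episode):

```python
def divide_by_subtraction(dividend, divisor):
    """Divide by repeatedly subtracting, the way the episode's CPU program did."""
    quotient = 0
    remainder = dividend
    while remainder >= divisor:   # stop once another subtraction would go negative
        remainder -= divisor
        quotient += 1             # each subtraction costs a whole loop of instructions
    return quotient, remainder

# 16 / 4: subtract 4 four times before hitting zero
print(divide_by_subtraction(16, 4))   # (4, 0)
```

Each loop iteration is several instructions (a subtract, a compare, a jump), which is exactly why this gobbles up clock cycles compared to a single hardware divide.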
So most computer processors today have divide as one of the instructions that the ALU can
perform in hardware.
Of course, this extra circuitry makes the ALU bigger and more complicated to design,
but also more capable - a complexity-for-speed tradeoff that has been made many times in
computing history.
For instance, modern computer processors now have special circuits for things like graphics
operations, decoding compressed video, and encrypting files - all of which are operations
that would take many many many clock cycles to perform with standard operations.
You may have even heard of processors with MMX, 3DNow!, or SSE.
These are processors with additional, fancy circuits that allow them to execute additional,
fancy instructions - for things like gaming and encryption.
These extensions to the instruction set have grown, and grown over time, and once people
have written programs to take advantage of them, it’s hard to remove them.
So instruction sets tend to keep getting larger and larger, keeping all the old opcodes around
for backwards compatibility.
The Intel 4004, the first truly integrated CPU, had 46 instructions - which was enough
to build a fully functional computer.
But a modern computer processor has thousands of different instructions, which utilize all
sorts of clever and complex internal circuitry.
Now, high clock speeds and fancy instruction sets lead to another problem - getting data
in and out of the CPU quickly enough.
It’s like having a powerful steam locomotive, but no way to shovel in coal fast enough.
In this case, the bottleneck is RAM.
RAM is typically a memory module that lies outside the CPU.
This means that data has to be transmitted to and from RAM along sets of data wires,
called a bus.
This bus might only be a few centimeters long, and remember those electrical signals are
traveling near the speed of light, but when you are operating at gigahertz speeds – that’s
billionths of a second – even this small delay starts to become problematic.
It also takes time for RAM itself to look up the address, retrieve the data, and configure
itself for output.
So a “load from RAM” instruction might take dozens of clock cycles to complete, and during
this time the processor is just sitting there idly waiting for the data.
One solution is to put a little piece of RAM right on the CPU -- called a cache.
There isn’t a lot of space on a processor’s chip, so most caches are just kilobytes or
maybe megabytes in size, where RAM is usually gigabytes.
Having a cache speeds things up in a clever way.
When the CPU requests a memory location from RAM, the RAM can transmit not just one single
value, but a whole block of data.
This takes only a little bit more time than transmitting a single value, but it allows
this data block to be saved into the cache.
This tends to be really useful because computer data is often arranged and processed sequentially.
For example, let's say the processor is totalling up daily sales for a restaurant.
It starts by fetching the first transaction from RAM at memory location 100.
The RAM, instead of sending back just that one value, sends a block of data, from memory
location 100 through 200, which are then all copied into the cache.
Now, when the processor requests the next transaction to add to its running total, the
value at address 101, the cache will say “Oh, I’ve already got that value right here,
so I can give it to you right away!”
And there’s no need to go all the way to RAM.
Because the cache is so close to the processor, it can typically provide the data in a single
clock cycle -- no waiting required.
This speeds things up tremendously over having to go back and forth to RAM every single time.
When data requested from RAM is already stored in the cache like this, it's called a cache
hit,
and if the data requested isn't in the cache, so you have to go to RAM, it's called
a cache miss.
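This block-fetching behavior can be modeled in a short sketch. The 16-value block size and the toy 200-entry "RAM" below are illustrative assumptions, not details from the video:

```python
# Toy model of the block-fetching cache described above.
RAM = {addr: addr * 10 for addr in range(200)}   # pretend memory contents
BLOCK_SIZE = 16

cache = {}
hits = misses = 0

def load(addr):
    """Return the value at addr, fetching a whole block from RAM on a miss."""
    global hits, misses
    if addr in cache:
        hits += 1                 # cache hit: no trip to RAM needed
    else:
        misses += 1               # cache miss: copy the whole surrounding block
        block_start = (addr // BLOCK_SIZE) * BLOCK_SIZE
        for a in range(block_start, block_start + BLOCK_SIZE):
            cache[a] = RAM[a]
    return cache[addr]

# Sequential access (like totalling up daily sales) mostly hits the cache.
total = sum(load(a) for a in range(100, 132))
print(hits, misses)   # 29 3 -- 32 accesses, only 3 of which fetch a block from RAM
```

Because the accesses are sequential, nearly every request after the first in a block is a hit, which is the whole point of transmitting blocks instead of single values.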
The cache can also be used like a scratch space, storing intermediate values when performing
a longer, or more complicated calculation.
Continuing our restaurant example, let’s say the processor has finished totalling up
all of the sales for the day, and wants to store the result in memory address 150.
Like before, instead of going back all the way to RAM to save that value, it can be stored
in the cached copy, which is faster to save to, and also faster to access later if more
calculations are needed.
But this introduces an interesting problem -- the cache’s copy of the data is now different
to the real version stored in RAM.
This mismatch has to be recorded, so that at some point everything can get synced up.
For this purpose, the cache has a special flag for each block of memory it stores, called
the dirty bit -- which might just be the best term computer scientists have ever invented.
Most often this synchronization happens when the cache is full, but a new block of memory
is being requested by the processor.
Before the cache erases the old block to free up space, it checks its dirty bit, and if
it’s dirty, the old block of data is written back to RAM before loading in the new block.
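A toy write-back cache with a dirty bit might look like the following. The two-entry capacity and one-value blocks are simplifying assumptions for illustration, far smaller than any real cache:

```python
# Minimal write-back cache with a dirty flag per cached block.
RAM = {100: 5, 150: 0, 151: 7}

CACHE_CAPACITY = 2
cache = {}          # addr -> cached value
dirty = {}          # addr -> True if the cached copy differs from RAM

def evict_one():
    """Make room by evicting a block, writing it back to RAM first if dirty."""
    addr, value = next(iter(cache.items()))   # evict the oldest entry
    if dirty[addr]:
        RAM[addr] = value                     # sync the dirty block back to RAM
    del cache[addr], dirty[addr]

def store(addr, value):
    """Write into the cache only, marking the block dirty."""
    if addr not in cache and len(cache) >= CACHE_CAPACITY:
        evict_one()
    cache[addr] = value
    dirty[addr] = True                        # RAM is now out of date

store(150, 42)      # the day's total, saved only in the cache
print(RAM[150])     # 0 -- RAM hasn't been updated yet
store(151, 9)
store(100, 1)       # cache full -> evict, writing the dirty value for 150 back
print(RAM[150])     # 42 -- synced on eviction
```

The eviction policy here (oldest entry first) is another simplification; real caches use schemes like least-recently-used, but the dirty-bit check before eviction works the same way.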
Another trick to boost CPU performance is called instruction pipelining.
Imagine you have to wash an entire hotel’s worth of sheets, but you’ve only got one
washing machine and one dryer.
One option is to do it all sequentially: put a batch of sheets in the washer and wait 30
minutes for it to finish.
Then take the wet sheets out and put them in the dryer and wait another 30 minutes for
that to finish.
This allows you to do one batch of sheets every hour.
Side note: if you have a dryer that can dry a load of laundry in 30 minutes, please tell
me the brand and model in the comments, because I’m living with 90 minute dry times, minimum.
But, even with this magic clothes dryer, you can speed things up even more if you parallelize
your operation.
As before, you start off putting one batch of sheets in the washer.
You wait 30 minutes for it to finish.
Then you take the wet sheets out and put them in the dryer.
But this time, instead of just waiting 30 minutes for the dryer to finish, you simultaneously
start another load in the washing machine.
Now you’ve got both machines going at once.
Wait 30 minutes, and one batch is now done, one batch is half done, and another is ready
to go in.
This effectively doubles your throughput.
Processor designs can apply the same idea.
In episode 7, our example processor performed the fetch-decode-execute cycle sequentially
and in a continuous loop: Fetch-decode-execute, fetch-decode-execute, fetch-decode-execute,
and so on.
This meant our design required three clock cycles to execute one instruction.
But each of these stages uses a different part of the CPU, meaning there is an opportunity
to parallelize!
While one instruction is getting executed, the next instruction could be getting decoded,
and the instruction beyond that fetched from memory.
All of these separate processes can overlap so that all parts of the CPU are active at
any given time.
In this pipelined design, an instruction is executed every single clock cycle which triples
the throughput.
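The throughput math for the 3-stage fetch-decode-execute design works out as sketched below, an idealized model with no stalls or hazards:

```python
# Cycle counts for a 3-stage fetch-decode-execute processor.
STAGES = 3

def sequential_cycles(n_instructions):
    """Non-pipelined: each instruction takes all 3 cycles before the next starts."""
    return STAGES * n_instructions

def pipelined_cycles(n_instructions):
    """Pipelined: the first instruction takes 3 cycles to fill the pipeline,
    then one instruction completes every cycle after that."""
    return STAGES + (n_instructions - 1)

n = 1000
print(sequential_cycles(n))   # 3000
print(pipelined_cycles(n))    # 1002 -- roughly triple the throughput
```

For any long run of instructions, the pipeline-fill cost becomes negligible and the speedup approaches the number of stages.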
But just like with caching this can lead to some tricky problems.
A big hazard is a dependency in the instructions.
For example, you might fetch something that the currently executing instruction is just
about to modify, which means you’ll end up with the old value in the pipeline.
To compensate for this, pipelined processors have to look ahead for data dependencies,
and if necessary, stall their pipelines to avoid problems.
High end processors, like those found in laptops and smartphones, go one step further and can
dynamically reorder instructions with dependencies in order to minimize stalls and keep the pipeline
moving, which is called out-of-order execution.
As you might imagine, the circuits that figure this all out are incredibly complicated.
Nonetheless, pipelining is tremendously effective and almost all processors implement it today.
Another big hazard are conditional jump instructions -- we talked about one example, a JUMP NEGATIVE,
last episode.
These instructions can change the execution flow of a program depending on a value.
A simple pipelined processor will perform a long stall when it sees a jump instruction,
waiting for the value to be finalized.
Only once the jump outcome is known, does the processor start refilling its pipeline.
But, this can produce long delays, so high-end processors have some tricks to deal with this
problem too.
Imagine an upcoming jump instruction as a fork in a road - a branch.
Advanced CPUs guess which way they are going to go, and start filling their pipeline with
instructions based off that guess – a technique called speculative execution.
When the jump instruction is finally resolved, if the CPU guessed correctly, then the pipeline
is already full of the correct instructions and it can motor along without delay.
However, if the CPU guessed wrong, it has to discard all its speculative results and
perform a pipeline flush - sort of like when you miss a turn and have to do a u-turn to
get back on route, and stop your GPS’s insistent shouting.
To minimize the effects of these flushes, CPU manufacturers have developed sophisticated
ways to guess which way branches will go, called branch prediction.
Instead of being a 50/50 guess, today’s processors can often guess with over 90% accuracy!
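The video doesn't name a specific prediction scheme; one classic textbook technique is a 2-bit saturating counter per branch, sketched here as an illustration of how predictors beat 50/50:

```python
# A 2-bit saturating counter: states 0-1 predict "not taken", states 2-3
# predict "taken". Each real outcome nudges the counter one step, so a
# single odd iteration doesn't flip an otherwise stable prediction.
counter = 2  # start in "weakly taken"

def predict():
    return counter >= 2          # True means "predict the branch is taken"

def update(taken):
    global counter
    counter = min(counter + 1, 3) if taken else max(counter - 1, 0)

# A loop branch that's taken 9 times, then falls through once at loop exit.
history = [True] * 9 + [False]
correct = 0
for outcome in history:
    if predict() == outcome:
        correct += 1
    update(outcome)

print(correct / len(history))    # 0.9 -- right 9 times out of 10
```

Loop branches like this one are why real predictors (which are far more elaborate) routinely exceed 90% accuracy: the branch behaves the same way almost every time.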
In an ideal case, pipelining lets you complete one instruction every single clock cycle,
but then superscalar processors came along which can execute more than one instruction
per clock cycle.
During the execute phase even in a pipelined design, whole areas of the processor might
be totally idle.
For example, while executing an instruction that fetches a value from memory, the ALU
is just going to be sitting there, not doing a thing.
So why not fetch-and-decode several instructions at once, and whenever possible, execute instructions
that require different parts of the CPU all at the same time!?
But we can take this one step further and add duplicate circuitry
for popular instructions.
For example, many processors will have four, eight or more identical ALUs, so they can
execute many mathematical instructions all in parallel!
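A toy model of superscalar issue shows how independent instructions can go out together while dependent ones wait. The 4-wide issue width and the instruction list below are made-up examples, not anything from the video:

```python
# Each cycle, issue up to ISSUE_WIDTH instructions whose inputs are ready.
# Instructions are (dest, src1, src2) triples; None means no input needed.
ISSUE_WIDTH = 4

program = [
    ("a", None, None),   # four independent operations
    ("b", None, None),
    ("c", None, None),
    ("d", None, None),
    ("e", "a", "b"),     # e depends on a and b
    ("f", "c", "d"),     # f depends on c and d
    ("g", "e", "f"),     # g depends on everything above
]

done = set()
cycles = 0
pending = list(program)
while pending:
    cycles += 1
    ready = [ins for ins in pending
             if all(src is None or src in done for src in ins[1:])]
    issued = ready[:ISSUE_WIDTH]           # at most ISSUE_WIDTH per cycle
    for ins in issued:
        pending.remove(ins)
    done.update(ins[0] for ins in issued)  # results become visible next cycle

print(cycles)   # 3 cycles for 7 instructions, instead of 7 one-at-a-time
```

The dependency check is the hard part in real hardware: the issue logic must prove instructions don't feed each other before letting them run side by side.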
Ok, the techniques we’ve discussed so far primarily optimize the execution throughput
of a single stream of instructions, but another way to increase performance is to run several
streams of instructions at once with multi-core processors.
You might have heard of dual core or quad core processors.
This means there are multiple independent processing units inside of a single CPU chip.
In many ways, this is very much like having multiple separate CPUs, but because they’re
tightly integrated, they can share some resources, like cache, allowing the cores to work together
on shared computations.
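Under the idealized assumption of perfectly parallel, independent work (real shared computations rarely split this cleanly), spreading tasks across cores cuts the time roughly by the core count:

```python
import math

# Idealized model: each core works on one independent task at a time.
def time_steps(n_tasks, n_cores, steps_per_task=1):
    """Time steps needed to finish all tasks with n_cores working in parallel."""
    return math.ceil(n_tasks / n_cores) * steps_per_task

print(time_steps(1000, 1))   # 1000 steps on a single core
print(time_steps(1000, 4))   # 250 steps on a quad core, in the ideal case
```

In practice, coordination overhead and the parts of a program that can't be parallelized keep real speedups below this ideal.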
But, when more cores just isn’t enough, you can build computers with multiple independent
CPUs!
High end computers, like the servers streaming this video from YouTube’s datacenter, often
need the extra horsepower to keep it silky smooth for the hundreds of people watching
simultaneously.
Two- and four-processor configurations are the most common right now, but every now and
again even that much processing power isn’t enough.
So we humans get extra ambitious and build ourselves a supercomputer!
If you’re looking to do some really monster calculations – like simulating the formation
of the universe - you’ll need some pretty serious compute power.
A few extra processors in a desktop computer just isn’t going to cut it.
You’re going to need a lot of processors.
No.. no... even more than that.
A lot more!
When this video was made, the world’s fastest computer was located in The National Supercomputing
Center in Wuxi, China.
The Sunway TaihuLight contains a brain-melting 40,960 CPUs, each with 256 cores!
That’s over ten million cores in total... and each one of those cores runs at 1.45 gigahertz.
In total, this machine can process 93 Quadrillion -- that’s 93 million-billions -- floating
point math operations per second, known as FLOPS.
And trust me, that’s a lot of FLOPS!!
No word on whether it can run Crysis at max settings, but I suspect it might.
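A quick arithmetic check on the figures quoted above:

```python
# Sanity-checking the Sunway TaihuLight numbers from the transcript.
cpus = 40_960
cores_per_cpu = 256
total_cores = cpus * cores_per_cpu
print(total_cores)            # 10485760 -- indeed over ten million cores

flops = 93 * 10**15           # 93 quadrillion FLOPS
print(flops // total_cores)   # roughly 8.9 billion FLOPS per core
```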
So long story short, not only have computer processors gotten a lot faster over the years,
but also a lot more sophisticated, employing all sorts of clever tricks to squeeze out
more and more computation per clock cycle.
Our job is to wield that incredible processing power to do cool and useful things.
That’s the essence of programming, which we’ll start discussing next episode.
See you next week.