Why Computers Can't Count Sometimes

Tom Scott
12 Nov 2018 · 08:44

Summary

TLDR: The video explains why large-scale systems miscount: race conditions, caching, and eventual consistency. When requests flood in, the system runs into read/write conflicts that cause counting errors. On top of that, caching means different servers hold slightly different data. As a result, the numbers bounce up and down, but they eventually converge on the correct value.

Takeaways

  • 😀 Counting something that looks simple, like video views, can actually be very complicated
  • 😊 Modern large-scale systems handle huge numbers of inputs and outputs, which makes accurate counting hard
  • 😞 Conflicting updates cause counting errors, known as race conditions
  • 🤔 Queues and serial processing avoid conflicts, but they don't scale to large systems
  • 😏 Eventual consistency scales better through batched updates and caching, but the counts aren't real-time
  • 😉 YouTube and Twitter scale by using eventual consistency
  • 🤨 Caches reduce the load on the database, but they lead to differing counts
  • 😕 When load spikes, the system can scale by adding more cache servers
  • 😣 Because of caching and eventual consistency, the numbers move up and down
  • 🙂 Over time the conflicts are resolved and the numbers settle

Q & A

  • Why do the numbers on Twitter and YouTube sometimes bounce up and down instead of rising steadily?

    - Because of a problem called a "race condition", together with the "eventual consistency" strategy and the caching used to handle data at large scale. Updates aren't applied in real time, so the numbers fluctuate up and down.

  • What is single-threaded programming, and why isn't it enough for modern, complex systems?

    - Single-threaded programming means the computer steps through a sequence of instructions in order, doing one thing at a time. That's enough for simple tasks, but it can't keep up with modern, complex systems that have to handle many requests at the same time.

  • What is a race condition, and how does it affect data accuracy?

    - A race condition occurs when code tries to do two or more things at once and the result depends on the order in which those operations happen, an order you cannot control. This can make the data inaccurate, for example missing some views when counting them.

  • What is eventual consistency, and how do platforms like YouTube and Twitter use it?

    - Eventual consistency is a consistency model that allows the system to be briefly inconsistent while guaranteeing that the data eventually becomes consistent. YouTube and Twitter run servers all over the world and bundle updates into fewer, larger writes to the central system, which makes the whole thing far more efficient.

  • Why do databases struggle when handling a large number of requests?

    - A single database has limited capacity: receiving a request, understanding it, making the change, and sending the response back all take time, so only so many requests fit into each second. Handling several requests at once can also cause problems such as one update overwriting another.

  • How does caching ease the load on the central database?

    - A cache stores a copy of the data so that requests don't all have to hit the central database. Large volumes of read requests can be answered quickly by the cache instead of the database itself, which greatly improves the efficiency of the system.

  • Computers are essentially calculators, so why does counting still go wrong?

    - Computers are very good at arithmetic, but in large, complex systems, issues with synchronisation, batched updates, and caching can make counts inaccurate or delayed.

  • Why do the numbers you see on YouTube or Twitter sometimes change depending on the device?

    - Because of how caching and eventual consistency are handled, different devices may be routed to different servers or caches, so the numbers they display can differ.

  • Why do some operations, such as buying concert tickets, use a queue rather than eventual consistency?

    - Operations that need absolute consistency, such as making sure the same seat is never sold twice, use a queue so that requests are processed one at a time and the data stays accurate.

  • How does YouTube make sure ad revenue and money are calculated accurately even when updates are delayed?

    - Although YouTube uses eventual consistency for views and other statistics, data that has to be exact, such as ad revenue, is handled with extra care, possibly through real-time processing or higher-priority update strategies.

Outlines

00:00

😊 The first segment poses a question: why do Twitter's like counts bounce up and down?

Using a screen recording of a tweet, the author asks why the like count on Twitter fluctuates erratically instead of rising steadily. It seems that even counting, one of the most basic operations, can go wrong on a computer. The reason lies in the difficulties big, complex systems run into when they scale.

05:03

😊 The second segment explains why this happens

The causes are "race conditions" and "eventual consistency". When many requests hit the database at once, different processing orders give different results. Putting the requests in a queue and handling them one at a time keeps things consistent, but it doesn't scale. Large systems therefore adopt an "eventual consistency" strategy: individual servers are allowed to lag and update the central database later. Caches are also placed in front of the central database to cut down the number of requests that reach it directly. That's why like counts bounce around but end up being counted correctly.

Keywords

💡Caching

A cache is a storage layer that temporarily keeps frequently accessed data so it can be served quickly. In the video, caching explains why numbers such as like counts on social media platforms like Twitter keep changing. By using caches, the system reduces the number of direct requests to the central database and improves performance. For example, if thousands of people request information about the same tweet, sending those requests to a cache rather than querying the database every time is far more efficient.

💡Eventual Consistency

Eventual consistency is a database replication strategy that allows data on different servers to lag slightly, while guaranteeing that all copies eventually converge to the same state. The video notes that large sites like YouTube and Twitter use eventual consistency to cope with enormous volumes of update requests, because they run servers all over the world. The approach doesn't reflect every action (such as a view or a retweet) immediately, but it keeps the system scalable and efficient while ensuring the data is accurate in the long run.

💡Race Condition

A race condition occurs when multiple processes or threads access shared data without proper synchronisation, so the final result depends on an unpredictable order of events. The video uses YouTube's view counting as an example: if several requests try to update the same value (the view count) and read and write it at the same time without coordination, the data can become inconsistent and some views are lost. This highlights a common problem in concurrent processing.

💡Scalability

Scalability is how well a system, network, or process keeps performing as its workload grows. The video points out that a single database or server can only handle so many requests, so as users and data grow, the system must be designed to carry a much larger load. Scalability explains why large systems like YouTube and Twitter need techniques such as caching and eventual consistency to handle huge volumes of data efficiently.

💡Database

A database is a system for storing and managing data that users and other software applications can read and update. The video stresses that many websites are essentially front ends to a database: YouTube is a database of videos and comments, Twitter is a database of short messages. This illustrates the central role databases play in modern information systems, especially when handling large numbers of user requests and updates.

💡Single-threaded

Single-threaded means a program or process executes only one task or instruction at a time. The video notes that the single-threaded model works well for simple tasks or small systems, because instructions run in order and complexity stays low. For large systems that must handle thousands of requests at once, however, a single thread isn't enough, which is why multi-threading and other concurrency techniques are needed.

💡Multi-threaded

Multi-threading is a program's ability to run several tasks, or threads, at the same time. The video discusses the race conditions that can arise when requests are handled by multiple threads. Multi-threading lets a system process several operations in parallel and improves throughput, but it also introduces new complexity, such as keeping data consistent and avoiding race conditions.

💡Concurrency

Concurrency is a system's ability to handle multiple tasks or processes at the same time. In the video it is closely tied to the discussion of multi-threading and race conditions. Concurrency lets large systems like YouTube and Twitter serve huge numbers of users efficiently, but it also brings the challenge of keeping data consistent and avoiding conflicting updates.

💡Request Queue

A request queue manages incoming requests by lining them up and processing them one at a time, so each task completes in order. The video notes that putting every request in a queue and handling them one by one avoids race conditions, but it isn't suitable for large-scale applications that need fast responses, because processing becomes slow and the user experience suffers.

💡Central Database

The central database is the primary system that stores the core data. In the video, it handles update requests coming in from all over the world. Large platforms like YouTube and Twitter run many servers and databases around the globe and keep them in sync through eventual consistency, but the design and management of the central database remain critical to the system's stability and scalability.

Highlights

Observation of fluctuating social media metrics beneath a tweet, questioning Twitter's counting ability.

Introduction to the concept of race conditions, caching, and eventual consistency to explain counting difficulties.

Explanation of the limitations of single-threaded code in handling multiple, simultaneous requests.

Description of modern websites as interfaces to databases, with challenges in scaling and handling requests.

Discussion on how simultaneous requests can lead to missed counts due to race conditions in databases.

The concept of eventual consistency as a solution for sites dealing with Big Data, allowing for accurate, albeit delayed, updates.

Explanation of caching as a strategy to reduce direct hits on the central database by serving repeated requests more efficiently.

Highlighting the impact of scaling on counting accuracy and the complications introduced by the need for real-time updates.

The use of queues to ensure accuracy in systems where immediate consistency is critical, such as ticket sales.

Introduction to the complexities of managing a central database across multiple servers for large-scale operations like YouTube.

Explanation of how social media platforms manage viewcounts and statistics through periodic updates to central systems.

Discussion on the trade-offs between immediate and eventual consistency in different application contexts.

The role of logs in managing data updates across servers in large systems to ensure eventual consistency.

Understanding why metrics like subscriber counts and view numbers on platforms like YouTube may not always immediately reflect changes.

Acknowledgment of the difficulty in accurately counting and updating data in real-time on large digital platforms.

Transcripts

play00:00

This is a brilliant tweet.

play00:03

But I don't want you to pay attention to the tweet.

play00:05

It's good, sure, but I want you to watch the numbers that are underneath it.

play00:07

That's a screen recording,

play00:09

and the numbers are going up and down, all over the place.

play00:13

They should steadily rise, but they don’t.

play00:15

There aren't that many people tapping ‘like’ by mistake and then taking it back.

play00:20

So why can't Twitter just count?

play00:26

You'll see examples like this all over the place.

play00:28

On YouTube, subscriber and view counts sometimes rise and drop seemingly at random,

play00:32

or they change depending on which device you're checking on.

play00:36

Computers should be good at counting, right?

play00:39

They're basically just overgrown calculators.

play00:41

This video that you're watching,

play00:43

whether it's on a tiny little phone screen or on a massive desktop display,

play00:46

it is all just the result of huge amounts of math that turns

play00:50

a compressed stream of binary numbers into amounts of electricity

play00:54

that get sent to either a grid of coloured pixels or a speaker,

play00:58

all in perfect time.

play01:00

Just counting should be easy.

play01:02

But sometimes it seems to fall apart.

play01:05

And that's usually when there's a big, complicated system

play01:07

with lots of inputs and outputs,

play01:09

when something has to be done at scale.

play01:12

Scaling makes things difficult. And to explain why,

play01:15

we have to talk about race conditions, caching, and eventual consistency.

play01:20

All the code that I've talked about in The Basics so far has been single-threaded,

play01:24

because, well, we’re talking about the basics.

play01:27

Single-threaded means that it looks like a set of instructions

play01:29

that the computer steps through one after the other.

play01:32

It starts at the top, it works its way through, ignoring everything else,

play01:36

and at the end it has Done A Thing.

play01:39

Which is fine, as long as that's the only thread,

play01:41

the only thing that the computer's doing,

play01:43

and that it's the only computer doing it.

play01:45

Fine for old machines like this,

play01:47

but for complicated, modern systems, that’s never going to be the case.

play01:51

Most web sites are, at their heart, just a fancy front end to a database.

play01:55

YouTube is a database of videos and comments.

play01:57

Twitter is a database of small messages.

play01:58

Your phone company's billing site is a database of customers and bank accounts.

play02:04

But the trouble is that a single computer holding a single database can only deal with

play02:08

so much input at once.

play02:10

Receiving a request, understanding it, making the change, and sending the response back:

play02:15

all of those take time,

play02:17

so there are only so many requests that can fit in each second.

play02:21

And if you try and handle multiple requests at once,

play02:24

there are subtle problems that can show up.

play02:27

Let's say that YouTube wants to count one view of a video.

play02:30

It just has the job of adding one to the view count.

play02:33

Which seems really simple, but it's actually three separate smaller jobs.

play02:37

You have to read the view count,

play02:39

you have to add one to it,

play02:40

and then you have to write that view count back into the database.
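
As a rough sketch of that decomposition, here is what "add one to the view count" looks like once it is split into those three smaller jobs. The dictionary standing in for the database and the key name are purely illustrative:

```python
# A plain dict standing in for the database; the key name is made up.
db = {"video123:views": 41}

views = db["video123:views"]        # 1. read the current view count
views = views + 1                   # 2. add one to it
db["video123:views"] = views        # 3. write the new count back

print(db["video123:views"])         # 42
```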

play02:43

If two requests come along very close to each other,

play02:45

and they’re assigned to separate threads,

play02:47

it is entirely possible that the second thread

play02:52

could read the view count

play02:53

while the first thread is still doing its calculation.

play02:57

And yeah, that's a really simple calculation, it's just adding one,

play03:02

but it still takes a few ticks of a processor.

play03:04

So both of those write processes would put the same number back into the database,

play03:09

and we've missed a view.

play03:11

On popular videos, there'll be collisions like that all the time.

play03:15

Worst case, you've got ten or a hundred of those requests all coming in at once,

play03:18

and one gets stuck for a while for some reason.

play03:21

It'll still add just one to the original number that it read,

play03:24

and then, much later,

play03:25

it'll finally write its result back into the database.

play03:27

And we've lost any number of views.
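
A minimal sketch of that lost-update problem, using Python threads and an artificial delay between the read and the write to make the collision easy to reproduce; the setup is illustrative, not how YouTube actually counts views:

```python
import threading
import time

db = {"views": 0}

def count_one_view():
    views = db["views"]            # read the view count
    time.sleep(0.001)              # the "few ticks" the calculation takes
    db["views"] = views + 1        # write back, possibly clobbering another thread's update

threads = [threading.Thread(target=count_one_view) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With no synchronisation this usually prints far fewer than 100:
# many threads read the same old value and overwrite each other's writes.
print(db["views"])
```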

play03:30

In early databases, having updates that collided like that could corrupt the entire system,

play03:34

but these days things will generally at least keep working,

play03:37

even if they're not quite accurate.

play03:40

And given that YouTube has to work out not just views,

play03:42

but ad revenue and money,

play03:44

it has got to be accurate.

play03:46

Anyway, that’s a basic race condition:

play03:49

when the code’s trying to do two or more things at once,

play03:52

and the result changes depending on the order they occur in,

play03:56

an order that you cannot control.

play03:59

One solution is to put all the requests in a queue,

play04:01

and refuse to answer any requests until the previous one is completed.

play04:06

That's how that single-threaded, single-computer programming works.

play04:08

It's how these old machines work.

play04:10

Until the code finishes its task and says "okay, I'm ready for more now",

play04:14

it just doesn't accept anything else.

play04:17

Fine for simple stuff, does not scale up.

play04:19

A million-strong queue to watch a YouTube video doesn't sound like a great user experience.

play04:24

But that still happens somewhere, for things like buying tickets to a show,

play04:28

where it'd be an extremely bad idea to accidentally sell the same seat to two people.

play04:34

Those databases have to be 100% consistent, so for big shows,

play04:37

ticket sites will sometimes start a queue,

play04:39

and limit the number of people accessing the booking site at once.

play04:43

If you absolutely must count everything accurately, in real time, that’s the best approach.
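
A sketch of that queue-based approach, with a single worker draining the queue so that updates can never collide; the names are illustrative, and real ticketing systems are far more involved:

```python
import queue
import threading

db = {"seats_sold": 0}
requests = queue.Queue()

def worker():
    # One thread processes requests strictly one after another.
    while True:
        request = requests.get()
        if request is None:            # sentinel value: no more requests
            break
        db["seats_sold"] += 1          # read-modify-write, but never concurrent

worker_thread = threading.Thread(target=worker)
worker_thread.start()

for _ in range(100):                   # 100 "buy a ticket" requests arrive
    requests.put("buy ticket")

requests.put(None)                     # tell the worker to stop
worker_thread.join()

print(db["seats_sold"])                # always exactly 100, but strictly serial
```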

play04:49

But for sites dealing with Big Data, like YouTube and Twitter,

play04:52

there is a different solution called eventual consistency.

play04:56

They have lots of servers all over the world,

play04:58

and rather than reporting every view or every retweet right away,

play05:03

each individual server will keep its own count,

play05:05

bundle up all the viewcounts and statistics that it’s dealing with,

play05:08

and it will just update the central system when there's time to do so.


play05:12

Updates don't have to be hours apart,

play05:14

they can just be minutes or even just seconds,

play05:16

but having a few bundled updates that can be queued and dealt with individually

play05:21

is a lot easier on the central system

play05:23

than having millions of requests all being shouted at once.

play05:26

Actually, for something on YouTube’s scale,

play05:28

that central database won't just be one computer:

play05:31

it'll be several, and they'll all be keeping each other up to date,

play05:34

but that is a mess we really don't want to get into right now.

play05:39

Eventual consistency isn't right for everything.

play05:42

On YouTube, if you're updating something like the privacy settings of a video,

play05:46

it's important that it's updated immediately everywhere.

play05:50

But compared to views, likes and comments, that's a really rare thing to happen,

play05:54

so it's OK to stop everything, put everything else on hold,

play05:57

spend some time sorting out that important change, and come back later.

play06:02

But views and comments, they can wait for a little while.

play06:04

Just tell the servers around the world to write them down somewhere, keep a log,

play06:07

then every few seconds, or minutes, or maybe even hours for some places,

play06:12

those systems can run through their logs,

play06:13

do the calculations and update the central system once everyone has time.
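
A toy sketch of that log-and-flush idea; the class and names are invented for illustration, and a real system would run the flush on a timer across many machines:

```python
class EdgeServer:
    """Keeps a local log of views instead of reporting each one immediately."""

    def __init__(self):
        self.log = []

    def record_view(self, video_id):
        self.log.append(video_id)        # cheap local append, no central round-trip

    def flush(self, central_counts):
        # Every few seconds, minutes, or hours: bundle the log into one update.
        for video_id in self.log:
            central_counts[video_id] = central_counts.get(video_id, 0) + 1
        self.log.clear()

central = {}                             # stand-in for the central database
servers = [EdgeServer() for _ in range(3)]

for server in servers:
    for _ in range(1000):
        server.record_view("video123")

print(central.get("video123", 0))        # 0: nothing flushed yet, so the central count lags
for server in servers:
    server.flush(central)
print(central["video123"])               # 3000: eventually consistent, and nothing was lost
```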

play06:19

All that explains why viewcounts and subscriber counts lag sometimes on YouTube,

play06:23

why it can take a while to get the numbers sorted out in the end,

play06:26

but it doesn't explain the up-and-down numbers you saw at the start in that tweet.

play06:32

That's down to another thing: caching.

play06:34

It's not just writing into the database that's bundled up. Reading is too.

play06:38

If you have thousands of people requesting the same thing,

play06:41

it really doesn't make sense to have them all hit the central system

play06:44

and have it do the calculations every single time.

play06:46

So if Twitter are getting 10,000 requests a second for information on that one tweet,

play06:52

which is actually a pretty reasonable amount for them,

play06:54

it'd be ridiculous for the central database to look up all the details and do the numbers every time.

play06:59

So the requests are actually going to a cache,

play07:02

one of thousands, or maybe tens of thousands of caches

play07:04

sitting between the end users and the central system.

play07:07

Each cache looks up the details in the central system once,

play07:10

and then it keeps the details in its memory.

play07:13

For Twitter, each cache might only keep them for a few seconds,

play07:16

so it feels live but isn't actually.
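
A minimal sketch of that kind of short-lived, read-through cache; the TTL, the key, and the numbers are illustrative rather than anything Twitter actually uses:

```python
import time

central_db = {"tweet42:likes": 12000}      # stand-in for the central database

class Cache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}                    # key -> (value, time it was fetched)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None or time.time() - entry[1] > self.ttl:
            value = central_db[key]        # only now does the central database get bothered
            self.store[key] = (value, time.time())
            return value
        return entry[0]

cache = Cache(ttl_seconds=1.0)
print(cache.get("tweet42:likes"))          # 12000, fetched from the central database
central_db["tweet42:likes"] = 12050        # more likes arrive centrally
print(cache.get("tweet42:likes"))          # still 12000: served stale, straight from memory
time.sleep(1.1)
print(cache.get("tweet42:likes"))          # 12050 once the cached copy has expired
```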

play07:18

But it means only a tiny fraction of that huge amount of traffic

play07:22

actually has to bother the central database:

play07:24

the rest comes straight out of memory on a system that is built

play07:27

just for serving those requests,

play07:29

which is orders of magnitude faster.

play07:31

And if there's a sudden spike in traffic,

play07:33

Twitter can just spin up some more cache servers,

play07:35

put them into the pool that's answering everyone's requests,

play07:37

and it all just keeps working without any worry for the database.

play07:42

But each of those caches will pull that information at a slightly different time,

play07:46

all out of sync with each other.

play07:47

When your request comes in, it's routed to any of those available caches,

play07:51

and crucially it is not going to be the same one every time.

play07:54

They've all got slightly different answers,

play07:56

and each time you're asking a different one.
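
A tiny illustration of why that makes the number bounce: three caches that took their snapshots at different moments, with each request routed to whichever one happens to answer (purely illustrative values):

```python
import random

# Three caches that last refreshed from the central database at different times,
# so they hold different snapshots of the same like count.
caches = [12040, 12125, 12098]

for _ in range(6):
    # Each request lands on a different cache, so consecutive reads
    # can go down as well as up instead of rising steadily.
    print(random.choice(caches))
```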

play08:00

Eventual consistency means that everything will be sorted out at some point.

play08:05

We won't lose any data, but it might take a while before it's all in place.

play08:10

Sooner or later the flood of retweets will stop, or your viewcount will settle down,

play08:14

and once the dust has settled everything can finally get counted up.

play08:17

But until then: give YouTube and Twitter a little leeway.

play08:22

Counting things accurately is really difficult.

play08:26

Thank you very much to the Centre for Computing History here in Cambridge,

play08:29

who've let me film with all this wonderful old equipment,

play08:31

and to all my proofreading team who made sure my script's right.
