Why Computers Can't Count Sometimes
Summary
TLDR: The video explains why counting goes wrong in large-scale systems: race conditions, caching, and eventual consistency. When requests flood in, reads and writes collide and views get miscounted. Caching also means different servers hold slightly different copies of the data. So the numbers bounce up and down, but eventually converge on the correct total.
Takeaways
- 😀 Counting seemingly simple things, like video views, can actually be surprisingly complex
- 😊 Modern large-scale systems handle huge volumes of input and output, which makes accurate counting difficult
- 😞 Conflicting data updates cause miscounts; this is called a race condition
- 🤔 Queues and serial processing avoid conflicts, but don't suit large-scale systems
- 😏 Eventual consistency scales through batched updates and caching, but counts aren't real-time
- 😉 YouTube and Twitter use eventual consistency mechanisms to scale
- 🤨 Caching helps relieve database load, but leads to differing counts
- 😕 When load spikes, the system can scale by adding more cache servers
- 😣 Because of caching and eventual consistency, numbers fluctuate up and down
- 🙂 Over time, conflicts get resolved and the numbers settle
Q & A
Why do the numbers on Twitter and YouTube sometimes bounce up and down instead of rising steadily?
-Because of a problem called a race condition, plus the eventual-consistency strategy and caching used to handle data at scale: updates aren't applied in real time, so the displayed numbers fluctuate.
What is single-threaded programming, and why isn't it enough for modern, complex systems?
-Single-threaded means the computer executes a list of instructions in order, doing one thing at a time. That's enough for simple tasks, but it can't keep up with modern systems that must handle many requests simultaneously.
What is a race condition, and how does it affect data accuracy?
-A race condition occurs when code tries to do two or more things at once and the result changes depending on the order those operations happen in, an order that can't be controlled. That can make data inaccurate, such as missed views when counting.
What is eventual consistency, and how do platforms like YouTube and Twitter use it?
-Eventual consistency is a data-consistency model that tolerates short-lived inconsistency but guarantees the data will become consistent in the end. YouTube and Twitter run many servers around the world and concentrate data updates into fewer, batched writes, which improves efficiency.
Why do databases struggle when handling large numbers of requests?
-A single database has limited capacity: receiving a request, understanding it, making the change, and sending the response all take time, so only so many requests fit into each second. Handling multiple requests at once can also cause problems such as data being overwritten.
How does caching ease the load on the central database?
-A cache stores a copy of the data, reducing direct requests to the central database. The bulk of read requests can be served quickly from the cache instead of the database itself, greatly improving the system's efficiency.
If computers are essentially calculators, why does counting still go wrong?
-Computers are great at arithmetic, but in large, complex systems, problems with synchronising, updating, and caching data can make counts inaccurate or delayed.
Why do the numbers you see on YouTube or Twitter sometimes differ depending on the device?
-Because of caching and eventual consistency, different devices may connect to different servers or caches, so they display slightly different data.
Why do some operations, like buying concert tickets, use a queue instead of eventual consistency?
-Operations that demand absolute consistency, such as making sure the same seat isn't sold to two people, process requests one at a time through a queue to guarantee accuracy.
How does YouTube keep ad revenue and money accurate even when data updates are delayed?
-YouTube uses eventual consistency for views and other statistics, but for data that must be exact, like ad revenue, it takes special measures to ensure accuracy, possibly real-time processing or higher-priority update strategies.
Outlines
😊 The first section poses a question: why does the like count on a tweet bounce up and down?
Using a screen recording of a tweet, the author asks why Twitter's like count fluctuates erratically instead of rising steadily. It suggests that even counting, a basic operation, can go wrong on a computer. The cause lies in the difficulties large, complex systems face when scaling.
😊 The second section explains the cause of the problem
The cause is race conditions and eventual consistency. When many requests hit the database at the same time, different processing orders give different results. Requests can be queued and processed one by one to guarantee consistency, but that doesn't scale. Large systems instead adopt eventual consistency, letting individual servers update the central database after a delay. Caches are also put in place to reduce the number of requests hitting the central database directly. That's why like counts fluctuate, but are eventually counted correctly.
Keywords
💡Caching
💡Eventual Consistency
💡Race Condition
💡Scalability
💡Database
💡Single-threaded
💡Multi-threaded
💡Concurrency
💡Request Queue
💡Central Database
Highlights
Observation of fluctuating social media metrics beneath a tweet, questioning Twitter's counting ability.
Introduction to the concept of race conditions, caching, and eventual consistency to explain counting difficulties.
Explanation of the limitations of single-threaded code in handling multiple, simultaneous requests.
Description of modern websites as interfaces to databases, with challenges in scaling and handling requests.
Discussion on how simultaneous requests can lead to missed counts due to race conditions in databases.
The concept of eventual consistency as a solution for sites dealing with Big Data, allowing for accurate, albeit delayed, updates.
Explanation of caching as a strategy to reduce direct hits on the central database by serving repeated requests more efficiently.
Highlighting the impact of scaling on counting accuracy and the complications introduced by the need for real-time updates.
The use of queues to ensure accuracy in systems where immediate consistency is critical, such as ticket sales.
Introduction to the complexities of managing a central database across multiple servers for large-scale operations like YouTube.
Explanation of how social media platforms manage viewcounts and statistics through periodic updates to central systems.
Discussion on the trade-offs between immediate and eventual consistency in different application contexts.
The role of logs in managing data updates across servers in large systems to ensure eventual consistency.
Understanding why metrics like subscriber counts and view numbers on platforms like YouTube may not always immediately reflect changes.
Acknowledgment of the difficulty in accurately counting and updating data in real-time on large digital platforms.
Transcripts
This is a brilliant tweet.
But I don't want you to pay attention to the tweet.
It's good, sure, but I want you to watch the numbers that are underneath it.
That's a screen recording,
and the numbers are going up and down, all over the place.
They should steadily rise, but they don’t.
There aren't that many people tapping ‘like’ by mistake and then taking it back.
So why can't Twitter just count?
You'll see examples like this all over the place.
On YouTube, subscriber and view counts sometimes rise and drop seemingly at random,
or they change depending on which device you're checking on.
Computers should be good at counting, right?
They're basically just overgrown calculators.
This video that you're watching,
whether it's on a tiny little phone screen or on a massive desktop display,
it is all just the result of huge amounts of math that turns
a compressed stream of binary numbers into amounts of electricity
that get sent to either a grid of coloured pixels or a speaker,
all in perfect time.
Just counting should be easy.
But sometimes it seems to fall apart.
And that's usually when there's a big, complicated system
with lots of inputs and outputs,
when something has to be done at scale.
Scaling makes things difficult. And to explain why,
we have to talk about race conditions, caching, and eventual consistency.
All the code that I've talked about in The Basics so far has been single-threaded,
because, well, we’re talking about the basics.
Single-threaded means that it looks like a set of instructions
that the computer steps through one after the other.
It starts at the top, it works its way through, ignoring everything else,
and at the end it has Done A Thing.
Which is fine, as long as that's the only thread,
the only thing that the computer's doing,
and that it's the only computer doing it.
Fine for old machines like this,
but for complicated, modern systems, that’s never going to be the case.
Most web sites are, at their heart, just a fancy front end to a database.
YouTube is a database of videos and comments.
Twitter is a database of small messages.
Your phone company's billing site is a database of customers and bank accounts.
But the trouble is that a single computer holding a single database can only deal with
so much input at once.
Receiving a request, understanding it, making the change, and sending the response back:
all of those take time,
so there are only so many requests that can fit in each second.
And if you try and handle multiple requests at once,
there are subtle problems that can show up.
Let's say that YouTube wants to count one view of a video.
It just has the job of adding one to the view count.
Which seems really simple, but it's actually three separate smaller jobs.
You have to read the view count,
you have to add one to it,
and then you have to write that view count back into the database.
If two requests come along very close to each other,
and they’re assigned to separate threads,
it is entirely possible that the second thread
could read the view count
while the first thread is still doing its calculation.
And yeah, that's a really simple calculation, it's just adding one,
but it still takes a few ticks of a processor.
So both of those write processes would put the same number back into the database,
and we've missed a view.
On popular videos, there'll be collisions like that all the time.
Worst case, you've got ten or a hundred of those requests all coming in at once,
and one gets stuck for a while for some reason.
It'll still add just one to the original number that it read,
and then, much later,
it'll finally write its result back into the database.
And we've lost any number of views.
In early databases, having updates that collided like that could corrupt the entire system,
but these days things will generally at least keep working,
even if they're not quite accurate.
And given that YouTube has to work out not just views,
but ad revenue and money,
it has got to be accurate.
Anyway, that’s a basic race condition:
when the code’s trying to do two or more things at once,
and the result changes depending on the order they occur in,
an order that you cannot control.
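The read, add, write interleaving described above can be shown deterministically in a minimal Python sketch. The `db` dict stands in for a hypothetical view-count table; the two "threads" are simulated by hand-interleaving their steps, so the lost update happens every time rather than by chance.

```python
# Deterministic illustration of the lost-update race described above.
db = {"views": 100}

# Both threads read before either one writes.
a_read = db["views"]       # thread A reads 100
b_read = db["views"]       # thread B reads 100 (A hasn't written yet)

a_result = a_read + 1      # A computes 101
b_result = b_read + 1      # B computes 101

db["views"] = a_result     # A writes 101
db["views"] = b_result     # B writes 101: A's increment is lost

print(db["views"])         # 101, not the 102 we wanted
```

Two real views, but the counter only moved by one, exactly the missed view the transcript describes.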
One solution is to put all the requests in a queue,
and refuse to answer any requests until the previous one is completed.
That's how that single-threaded, single-computer programming works.
It's how these old machines work.
Until the code finishes its task and says "okay, I'm ready for more now",
it just doesn't accept anything else.
Fine for simple stuff, does not scale up.
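As a rough sketch of that queue-everything approach, a lock can serialize the read-modify-write so no update is lost. This is illustrative only; the lock here stands in for the request queue, and in CPython the interpreter may mask the race anyway, but the lock makes the guarantee explicit:

```python
import threading

views = 0
lock = threading.Lock()

def count_view():
    global views
    with lock:          # only one thread may read-modify-write at a time
        views += 1

threads = [threading.Thread(target=count_view) for _ in range(1000)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(views)  # 1000: no lost updates, but every request waited its turn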
A million-strong queue to watch a YouTube video doesn't sound like a great user experience.
But that still happens somewhere, for things like buying tickets to a show,
where it'd be an extremely bad idea to accidentally sell the same seat to two people.
Those databases have to be 100% consistent, so for big shows,
ticket sites will sometimes start a queue,
and limit the number of people accessing the booking site at once.
If you absolutely must count everything accurately, in real time, that’s the best approach.
But for sites dealing with Big Data, like YouTube and Twitter,
there is a different solution called eventual consistency.
They have lots of servers all over the world,
and rather than reporting every view or every retweet right away,
each individual server will keep its own count,
bundle up all the viewcounts and statistics that it’s dealing with,
and it will just update the central system when there's time to do so.
Updates don't have to be hours apart,
they can just be minutes or even just seconds,
but having a few bundled updates that can be queued and dealt with individually
is a lot easier on the central system
than having millions of requests all being shouted at once.
Actually, for something on YouTube’s scale,
that central database won't just be one computer:
it'll be several, and they'll all be keeping each other up to date,
but that is a mess we really don't want to get into right now.
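The batching idea above can be sketched in Python. The names (`EdgeServer`, `central`, `flush`) are illustrative, not YouTube's real design: each edge server counts views locally and sends one bundled delta to the central store when it has time.

```python
# Hedged sketch of eventual consistency via bundled updates.
central = {"views": 0}

class EdgeServer:
    def __init__(self):
        self.pending = 0               # local count, not yet reported

    def record_view(self):
        self.pending += 1              # cheap local write, no central traffic

    def flush(self):
        central["views"] += self.pending   # one bundled update to the centre
        self.pending = 0

servers = [EdgeServer() for _ in range(3)]
for server, n in zip(servers, (5, 2, 8)):
    for _ in range(n):
        server.record_view()

# Before the flush, the central count lags reality: that's the "eventual".
print(central["views"])    # 0
for server in servers:
    server.flush()
print(central["views"])    # 15, consistent once every log is processed
```

Fifteen views spread across three servers arrive at the centre as three requests instead of fifteen, which is exactly the trade: less load, delayed totals.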
Eventual consistency isn't right for everything.
On YouTube, if you're updating something like the privacy settings of a video,
it's important that it's updated immediately everywhere.
But compared to views, likes and comments, that's a really rare thing to happen,
so it's OK to stop everything, put everything else on hold,
spend some time sorting out that important change, and come back later.
But views and comments, they can wait for a little while.
Just tell the servers around the world to write them down somewhere, keep a log,
then every few seconds, or minutes, or maybe even hours for some places,
those systems can run through their logs,
do the calculations and update the central system once everyone has time.
All that explains why viewcounts and subscriber counts lag sometimes on YouTube,
why it can take a while to get the numbers sorted out in the end,
but it doesn't explain the up-and-down numbers you saw at the start in that tweet.
That's down to another thing: caching.
It's not just writing into the database that's bundled up. Reading is too.
If you have thousands of people requesting the same thing,
it really doesn't make sense to have them all hit the central system
and have it do the calculations every single time.
So if Twitter are getting 10,000 requests a second for information on that one tweet,
which is actually a pretty reasonable amount for them,
it'd be ridiculous for the central database to look up all the details and do the numbers every time.
So the requests are actually going to a cache,
one of thousands, or maybe tens of thousands of caches
sitting between the end users and the central system.
Each cache looks up the details in the central system once,
and then it keeps the details in its memory.
For Twitter, each cache might only keep them for a few seconds,
so it feels live but isn't actually.
But it means only a tiny fraction of that huge amount of traffic
actually has to bother the central database:
the rest comes straight out of memory on a system that is built
just for serving those requests,
which is orders of magnitude faster.
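A minimal sketch of that kind of short-lived cache, under assumed names (`TTLCache`, `read_central_db` are hypothetical): the cache refetches from the central store only when its copy is older than the time-to-live, so thousands of reads collapse into one database hit.

```python
import time

DB_CALLS = 0

def read_central_db():
    """Stand-in for an expensive query against the central database."""
    global DB_CALLS
    DB_CALLS += 1
    return {"likes": 42}

class TTLCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.value = None
        self.fetched_at = None

    def get(self):
        now = time.monotonic()
        if self.value is None or now - self.fetched_at > self.ttl:
            self.value = read_central_db()   # refetch only on miss/expiry
            self.fetched_at = now
        return self.value

cache = TTLCache(ttl_seconds=2.0)
for _ in range(10_000):
    cache.get()            # ten thousand reads...

print(DB_CALLS)            # ...but only one touched the database
```

Each cache server fetches at its own moment, which is why two caches answering the same question can briefly disagree.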
And if there's a sudden spike in traffic,
Twitter can just spin up some more cache servers,
put them into the pool that's answering everyone's requests,
and it all just keeps working without any worry for the database.
But each of those caches will pull that information at a slightly different time,
all out of sync with each other.
When your request comes in, it's routed to any of those available caches,
and crucially it is not going to be the same one every time.
They've all got slightly different answers,
and each time you're asking a different one.
Eventual consistency means that everything will be sorted out at some point.
We won't lose any data, but it might take a while before it's all in place.
Sooner or later the flood of retweets will stop, or your viewcount will settle down,
and once the dust has settled everything can finally get counted up.
But until then: give YouTube and Twitter a little leeway.
Counting things accurately is really difficult.
Thank you very much to the Centre for Computing History here in Cambridge,
who've let me film with all this wonderful old equipment,
and to all my proofreading team who made sure my script's right.