How to avoid cascading failures in a distributed system 💣💥🔥
Summary
TL;DR: This system design video tackles the 'Thundering Herd' problem, where a surge of requests overwhelms servers. It explains rate-limiting as a solution, using per-server request queues to manage load and prevent cascading failures. The video also addresses challenges like viral traffic, job scheduling, and popular posts, suggesting strategies like pre-scaling, auto-scaling, batch processing, and approximate statistics. It concludes with best practices, including caching, gradual deployments, and cautious data coupling, to enhance performance and mitigate the impact of sudden traffic spikes.
Takeaways
- 🚦 The main problem addressed is the 'thundering herd' issue, which occurs when a large number of requests overwhelm the server, similar to a stampede of bison.
- 🔄 Rate limiting is introduced as a server-side solution to prevent server overload by controlling the rate at which users can send requests.
- 💡 The concept of load balancing is explained, where servers are assigned specific ranges of requests to handle, ensuring even distribution of load.
- 🔄 The script discusses the cascading failure problem, where the failure of one server can lead to additional load on others, potentially causing a system-wide crash.
- 🚫 A workaround mentioned is to stop serving requests for certain user IDs to prevent further overload, although not an ideal solution.
- 📈 The importance of having a smart load balancer or the ability to quickly bring in new servers during peak loads is highlighted.
- 🛑 The script suggests using request queues with each server having a defined capacity to handle requests, which helps in managing overloads.
- ⏱️ It emphasizes the need for clients to handle failure responses appropriately, possibly retrying after some time, to manage server load.
- 📈 Auto-scaling is presented as a solution for unpredictable traffic increases, such as during viral events or sales periods like Black Friday.
- 📅 Job scheduling is identified as a server-side problem, where tasks like sending new year wishes to all users should be batched to avoid overload.
- 📊 The script introduces the concept of approximate statistics, where displaying approximate numbers for metadata like views or likes can reduce database load.
- 💾 Caching is recommended as a best practice to handle common requests efficiently and reduce database queries.
- 📈 Gradual deployments are suggested to minimize server-side issues during updates, by deploying in increments and monitoring the impact.
- 🔗 The script ends with a cautionary note on coupling, where keeping sensitive data in cache can improve performance but also poses risks if not managed carefully.
Q & A
What is the main problem addressed in the system design video?
-The main problem addressed is the 'Thundering Herd' problem, which refers to a large number of requests overwhelming the server, potentially causing a cascading failure of the system.
What is rate limiting and why is it used?
-Rate limiting is a technique used on the server side to control the amount of incoming traffic to prevent the server from being overwhelmed by too many requests at once, thus avoiding system crashes.
How does the load balancing scenario with four servers work in the script?
-In the load balancing scenario, request IDs 1 to 400 are split across four servers so that each handles a block of 100. If one server crashes, the load balancer redistributes its block among the remaining servers, expanding their request ranges accordingly.
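To make the scenario concrete, here is a minimal Python sketch of range-based assignment with redistribution on failure. The `Server` class and `redistribute` helper are illustrative names, not anything from the video:

```python
# Hypothetical sketch of range-based load balancing with redistribution
# when a server fails, mirroring the video's four-server scenario.
from dataclasses import dataclass, field

@dataclass
class Server:
    name: str
    ranges: list = field(default_factory=list)  # list of (lo, hi) request-ID ranges

def redistribute(failed: Server, survivors: list) -> None:
    """Split each range of the failed server roughly evenly across the survivors."""
    for lo, hi in failed.ranges:
        chunk = (hi - lo + 1) // len(survivors)
        start = lo
        for i, s in enumerate(survivors):
            # The last survivor absorbs any remainder.
            end = hi if i == len(survivors) - 1 else start + chunk - 1
            s.ranges.append((start, end))
            start = end + 1
    failed.ranges.clear()

servers = [Server(f"s{i}", [((i - 1) * 100 + 1, i * 100)]) for i in range(1, 5)]
redistribute(servers[0], servers[1:])   # s1 crashes; s2-s4 each pick up ~33 IDs
for s in servers[1:]:
    print(s.name, s.ranges)
```

The catch, as the video points out, is the implicit assumption that every survivor can actually absorb its extra share.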
What is the cascading failure problem mentioned in the script?
-The cascading failure problem occurs when one server's crash leads to an increased load on other servers, which may also become overwhelmed and crash, causing a chain reaction that can take down the entire system.
How can a server queue help in managing the load?
-A server queue can help manage the load by allowing each server to have a limit on the number of requests it can handle. If the queue reaches its capacity, additional requests are either ignored or the server returns a failure response, preventing overload.
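As a rough illustration of the idea (the video describes it on a whiteboard, not in code), a bounded per-server queue might look like the following sketch; the capacity of 300 mirrors the video's s4 example, and the status codes are assumptions:

```python
# Hedged sketch of a per-server request queue: capacity equals the server's
# compute budget; requests beyond it are rejected immediately instead of
# piling up until the server crashes.
import queue

CAPACITY = 300  # e.g., s4 can sustain 300 requests per second
request_queue: "queue.Queue[dict]" = queue.Queue(maxsize=CAPACITY)

def accept(request: dict) -> dict:
    try:
        request_queue.put_nowait(request)          # enqueue if there is room
        return {"status": 202, "detail": "accepted"}
    except queue.Full:
        # Queue is at capacity: shed load rather than overload the server.
        return {"status": 503, "detail": "temporarily overloaded, retry later"}
```

Rejecting the 301st request keeps this one server healthy while the system buys time to scale.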
What is the difference between temporary and permanent errors in the context of rate limiting?
-Temporary errors indicate that the request failure is due to a temporary issue, such as server load, and the client may try again later. Permanent errors suggest there is a logical error in the request that needs to be corrected by the client.
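A client honoring that distinction might look roughly like this sketch; the exact status codes (400 for permanent, everything else retried as temporary) are an assumption for illustration:

```python
# Sketch of a client that respects the temporary/permanent distinction and
# backs off with jitter so retries don't re-create the thundering herd.
import random
import time

def send_with_retry(send, request, max_attempts=5):
    for attempt in range(max_attempts):
        response = send(request)
        if response["status"] < 400:
            return response
        if response["status"] == 400:
            raise ValueError("permanent error: fix the request, do not retry")
        # Temporary error: exponential backoff plus jitter before retrying,
        # so a herd of clients does not stampede the server in lockstep.
        time.sleep((2 ** attempt) + random.random())
    raise TimeoutError("server still overloaded after retries")
```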
How can pre-scaling help with events like Black Friday?
-Pre-scaling involves increasing the server capacity in anticipation of high traffic during specific events like Black Friday. This proactive approach helps to handle the increased load without overloading the existing servers.
What is auto-scaling and how does it differ from pre-scaling?
-Auto-scaling is a feature provided by cloud services that automatically adjusts the number of servers based on the current load. Unlike pre-scaling, which is based on predictions, auto-scaling reacts to real-time demand.
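As a hedged sketch of the reactive idea, independent of any particular cloud provider's API, a target-tracking policy could look like this; the 60% target utilization and fleet bounds are illustrative assumptions:

```python
# Illustrative auto-scaling policy: size the fleet proportionally to load,
# clamped to sane minimum and maximum fleet sizes.
def desired_fleet_size(current_servers: int, avg_utilization: float,
                       target: float = 0.6, min_servers: int = 2,
                       max_servers: int = 20) -> int:
    # Classic target-tracking: grow when hot, shrink when idle.
    desired = round(current_servers * (avg_utilization / target))
    return max(min_servers, min(max_servers, desired))

print(desired_fleet_size(4, 0.9))   # -> 6: roughly the Black Friday pre-scale
print(desired_fleet_size(6, 0.3))   # -> 3: shrink when traffic subsides
```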
Why is job scheduling a server-side problem that needs to be addressed?
-Job scheduling is a problem because tasks like sending email notifications to a large number of users at once can create a sudden spike in load. It needs to be managed to avoid overwhelming the server.
What is batch processing and how does it help in job scheduling?
-Batch processing involves breaking down large tasks into smaller chunks and executing them over time. In job scheduling, this can help distribute the load evenly, preventing server overload.
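A minimal sketch of the chunking idea from the New Year example, assuming a placeholder `send_email` call in place of the real notification service:

```python
# Break one million sends into batches of 1,000, one batch per minute,
# instead of firing everything at midnight.
import time

def send_email(user_id, message):
    pass  # stand-in for the real notification service

def run_batched(user_ids, batch_size=1_000, interval_seconds=60):
    for i in range(0, len(user_ids), batch_size):
        for user_id in user_ids[i:i + batch_size]:
            send_email(user_id, "Happy New Year!")
        time.sleep(interval_seconds)   # spreads 1M users over ~1,000 minutes
```

If users do care about timing, the same knobs (batch size and interval) let you compress the window.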
How can approximate statistics be used to improve server performance?
-Approximate statistics involve displaying estimated or rounded numbers for metadata like views or likes on a post, rather than exact numbers. This can reduce the load on the server by avoiding unnecessary database queries for exact counts.
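One possible, purely hypothetical implementation of this idea: serve a cached count plus a growth estimate instead of querying the database on every page load. The 1.5x hourly growth factor mirrors the video's 1,000 to roughly 1,500 example:

```python
# Approximate view counter: extrapolate from the last exact snapshot and
# re-sync with the database only periodically.
import time

class ApproximateViews:
    def __init__(self, exact_count: int, hourly_growth: float = 1.5):
        self.base = exact_count
        self.growth = hourly_growth
        self.snapshot_time = time.time()

    def estimate(self) -> int:
        hours = (time.time() - self.snapshot_time) / 3600
        return int(self.base * (self.growth ** hours))   # may drift from reality

    def refresh(self, exact_count: int) -> None:
        # Periodically re-sync with the database, e.g., once per hour.
        self.base, self.snapshot_time = exact_count, time.time()
```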
What are some best practices mentioned in the script to avoid the Thundering Herd problem?
-The best practices include caching common requests, gradual deployments to minimize disruptions, and careful consideration of data coupling and caching sensitive data to improve performance without compromising security or accuracy.
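As a sketch of the caching practice, a minimal TTL key-value cache might look like this; real systems would typically reach for Redis or memcached, but the idea is the same:

```python
# Tiny TTL response cache: serve repeated requests from memory and only
# run the expensive query when the entry is missing or expired.
import time

_cache: dict = {}   # key -> (expiry_timestamp, response)

def cached(key, compute, ttl_seconds=60):
    now = time.time()
    hit = _cache.get(key)
    if hit and hit[0] > now:
        return hit[1]                       # cache hit: skip the database
    response = compute()                    # expensive query happens here
    _cache[key] = (now + ttl_seconds, response)
    return response
```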
Outlines
🚫 Rate Limiting to Prevent Server Overload
This paragraph introduces the concept of rate limiting as a solution to the 'thundering herd' problem, where a massive influx of requests can overwhelm server capacity. It uses the analogy of a bison stampede to describe the impact of too many requests hitting the server at once. The scenario of four servers handling a load-balanced request range is presented, with the failure of one server causing a cascading effect that could lead to a total system crash. The paragraph emphasizes the importance of avoiding such scenarios by implementing rate limiting to manage server load effectively.
🔄 Handling Server Overload with Queues and Scaling
The second paragraph delves into how to manage server load through the use of request queues and scaling. It explains that by assigning a compute capacity to each server and expanding the queue up to its limit, servers can avoid overloading. The paragraph also touches on the importance of client-side awareness when requests fail due to server limits, suggesting that temporary error messages can guide users to retry after some time. Additionally, it discusses the challenges of scaling in response to unpredictable viral traffic or planned events like Black Friday sales, highlighting pre-scaling and auto-scaling as potential strategies.
📅 Job Scheduling and Batch Processing
This paragraph addresses the issue of job scheduling, particularly the problem of running cron jobs that could potentially flood the server with tasks at a specific time, such as sending New Year greetings to all users simultaneously. The solution proposed is to break down the job into smaller, manageable chunks, using batch processing to distribute the load over time. This approach ensures that the server does not get overwhelmed and that the service remains reliable and responsive.
🔄 Batch Processing and Approximations for Popular Content
The fourth paragraph discusses the challenges of handling popular content, such as a viral post or a popular YouTube video, and how batch processing can mitigate the load. It also introduces the concept of adding 'jitter' to the notification process to spread out the load (see the sketch below). Furthermore, the paragraph explores the idea of using approximate statistics for metadata, such as view counts, to reduce the load on the database and improve performance, even if it means displaying slightly inaccurate numbers to users who are not overly concerned with exact figures.
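A minimal sketch of the jitter idea, assuming a hypothetical delayed-job queue consumes the output; `notify` delays are randomized so a popular upload's fan-out does not land all at once:

```python
# Jittered fan-out: each follower's notification gets a random delay so the
# resulting page hits are spread over a window instead of one spike.
import random

def schedule_fanout(follower_ids, max_delay_seconds=600):
    schedule = []
    for follower_id in follower_ids:
        delay = random.uniform(0, max_delay_seconds)   # the "jitter"
        schedule.append((delay, follower_id))
    # In practice, hand these (delay, follower) pairs to a delayed-job queue.
    return sorted(schedule)
```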
🛡️ Best Practices for Avoiding the Thundering Herd
The final paragraph wraps up the discussion by outlining best practices to avoid the thundering herd problem. It highlights caching as a means to reduce database queries and improve system performance. The paragraph also touches on the importance of gradual deployments to minimize the impact of new service updates. Lastly, it presents the controversial practice of coupling, where sensitive data might be cached to reduce load, but cautions that this approach must be used judiciously to avoid security risks.
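To illustrate the coupling trade-off, here is a hedged sketch of caching authentication results for a short window; `verify` stands in for the external authentication call, and the one-hour TTL mirrors the video's example. This is exactly the pattern the video flags as dangerous for sensitive systems:

```python
# Double-edged coupling: trust a recent successful authentication instead of
# calling the auth service on every request. Fine for low-risk data, risky
# where a revoked password must take effect immediately.
import time

_auth_cache: dict = {}   # username -> (expiry, token)

def is_authenticated(username, token, verify, ttl_seconds=3600):
    now = time.time()
    hit = _auth_cache.get(username)
    if hit and hit[0] > now and hit[1] == token:
        return True                       # trust the cached result for up to an hour
    if verify(username, token):           # fall back to the authentication service
        _auth_cache[username] = (now + ttl_seconds, token)
        return True
    return False
```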
Keywords
💡Rate-limiting
💡Cascading failure
💡Load balancing
💡Queue
💡Compute capacity
💡Auto-scaling
💡Job scheduling
💡Batch processing
💡Jitter
💡Approximate statistics
💡Caching
💡Gradual deployments
💡Coupling
Highlights
Introduction to the problem of the 'thundering herd' in server-side systems due to overwhelming request volumes.
Illustration of the server load balancing scenario with a range of requests per server and the impact of a server crash.
Explanation of cascading failure problem where the failure of one server leads to the overload and potential crash of others.
Discussion on the importance of server capacity and the danger of overloading with additional request ranges.
Proposing rate limiting as a solution to avoid server crashes by managing the number of requests handled.
Description of how request queues can be used to limit the load on servers and prevent overloading.
The concept of compute capacity per server and how it relates to the number of requests that can be processed.
The dilemma of serving some users versus no users in the event of server failures and the principle behind rate limiting.
Differentiating between temporary and permanent error messages to manage client expectations during request failures.
Strategies for handling sudden traffic spikes, such as pre-scaling and auto-scaling, especially during events like Black Friday.
The challenges of going viral and the role of rate limiting in managing unexpected surges in traffic.
Approaches to job scheduling to prevent server overload during tasks like sending new year wishes to all users.
The technique of batch processing to distribute workload evenly and prevent thundering herd problems.
Smart solutions like adding jitter to notifications to manage the load during popular posts or events.
The use of approximate statistics to reduce database load and improve performance without sacrificing user experience.
Best practices like caching to handle common requests efficiently and reduce database load.
The benefits of gradual deployments to minimize server-side issues during service updates.
Controversial practice of coupling systems for performance improvement with the associated risks.
Conclusion summarizing the importance of these strategies in mitigating the thundering herd problem and improving system design.
Transcripts
Hi everyone, we are back with a new system design video on rate limiting. Specifically, the problem that we are trying to solve is that of the thundering herd. If you can imagine a huge group of bison charging towards you, crushing everything in their path, that's what it feels like on the server side when there's a ton of requests coming in from users: they're just going to crush your servers and crush your system completely. To avoid this problem, what we do on the server side is something called rate limiting.
Let's just try to understand the scenario first. Let's say you have four servers and a request range from 1 to 400, so every server is load balanced to serve a range of 100 requests. Now let us assume that s1 crashes, maybe because of some internal issue, resulting in s2, s3, and s4 taking additional load. s1 had the range from 1 to 100, and the load balancer is going to be smart about this and assign them loads: let's say this one gets an additional range of 1 to 33, this one gets 34 to 67, and this one gets 68 to 100. Okay, so s1 crashing did not affect the rest of the servers, or rather did not affect the users, because the rest of the servers are now able to serve their requests.
However, there's an implicit assumption over here, and the assumption is that each of these servers can handle the new load. Let us assume that s4 did not have that much compute power; it was barely surviving with a request range of 100, and by adding to the range that it needs to serve, s4 is now completely exhausted. The requests are taking too much time, there are too many timeouts, and s4 crashes. So s1 was already dead, s4 is now dead, and as you can expect, somebody needs to take the requests from its range. So I'm going to split s4's ranges across the remaining servers: this one gets 301 to 350 plus part of the range s4 had taken over, and this one gets 351 to 400 plus the rest. You're probably getting an idea now: these ranges are pretty big. Initially s3 was serving half of what it is serving right now as a request range, and there's a good chance that s3 will also crash.
So if s3 crashes, it was serving around 200% of the load it could take, maybe it just had 50% headroom, and it crashes, which means s2 has to serve around 400% of its original load, approximately. There's a very good chance that s2 will also crash, resulting in the whole system crashing and all of your users being upset, and this is something that we really want to avoid. This problem is called the cascading failure problem, and that's the first problem that we try to mitigate.
As you can see, this cascading failure is a race against time. When s1 crashed, there's that delta, that small time gap that you have for bringing in a new server before s4 takes on that much load and crashes. One of the things you could do is have a really smart load balancer or some seamless way of bringing in a new server, but we should assume the worst. There are some possible workarounds and one real solution to this problem. One workaround, of course, is to just stop serving requests for all users having request IDs 1 to 100. That's not really a solution, but if you see that the other servers can't take in more load, it's better to be available to some users than to be available to none of the users. And now we are out of workarounds.
So, the real solution: what we should do is take a queue and put our requests in this queue. What's going to happen here is that every server can have a request queue, and it can decide on answering or not answering a request. So what I'm going to do is give each of these servers a particular compute capacity. s1 has 100 units of compute capacity, where for me one unit of compute capacity means it can handle one request per second: so 100 requests per second, 300 requests per second, 400 requests per second, 200 requests per second. Okay.
Now let's say s1 crashes. Looking at the node s4, what we need to see is that 300 is the maximum number of requests it can take, so its queue is going to keep expanding till it hits 300. If the 301st request comes in, we are going to ignore that request; we are going to just say no. When we return a failed response to the client, at least this server is not being overloaded, and the client is now aware that the request failed and that maybe it should try again after five minutes. The user experience is going to be bad, of course; the user whose request failed is not going to be happy. But again, going by the principle that serving some users is better than serving no users, we are going to start dropping requests.
One small thing to remember here is that the client shouldn't be stupid: if the request fails and the client starts bombarding the server with retries demanding an answer, that is going to be bad. So there are some types of errors that you can send the client: one is temporary and one is permanent. If you say permanent, it means that there's some serious mistake in the request that was sent, a logical error. Temporary means that the client should try again in some time; maybe there's some internal server issue going on, maybe the database is too slow, or maybe there's too much load, so try after some time. The client can display messages accordingly. But the general idea, of course, is to limit the number of requests you take on the server side so that you can handle the load till the scaling bit comes in, till you can bring in the new server.
All right, so this is the first problem. The second problem that you can face is if you go viral, or if there's some sort of an event, let's say Black Friday; you know sales go up on Black Friday, so that might be an issue. Well, when you have an event, one of the things you can do, because you have prior knowledge, is scale beforehand. If you have four servers and you assume that on Black Friday you're going to have 50% more users, get six servers. That's the first solution, which is pre-scaling. However, if you are not very sure about the number of servers you'll need during the event, one thing you could do is auto-scale, and please don't quote this video if you spend too much money auto-scaling, but auto-scaling is something that is provided as a solution by cloud services. If you host your service on the cloud, you can probably ask them to auto-scale your service, and auto-scaling is not a very bad idea usually, because the increased traffic probably means that you're going to make more money out of that traffic. So that's one solution.
How about if you go viral? If you go viral, you can just fall back to the old solution of rate limiting; with rate limiting you will be capping the maximum number of users that you can actually serve. And of course auto-scaling and pre-scaling are a good idea, but going viral is something that you can't predict, so pre-scaling is not really a solution there; those are the two other solutions. The third problem is a real server-side problem, and that is job scheduling.
More often than not, we write cron jobs which run at some point in time; I mean, we decide when they're going to run. But imagine a cron job which is supposed to send email notifications to all users, wishing them a happy new year on the 1st of January. What could happen, if you do this in a naive way, is that you send all of the emails together when the clock hits the 1st of January, which means that if you have a million users, you're going to send one million email notifications, and that's of course like a huge herd of bison coming towards you. The way you avoid this is to break the job into smaller pieces. Let's say you have 1 million users, so you have 1 million user IDs; the users are broken into chunks. The first thousand users are going to get the email in the first minute, the second thousand users are going to get it in the second minute, and so on and so forth. 1 million divided by 1,000 is going to take 1,000 minutes, and with this, what's happening is that you have divided the work that you had on the server into smaller chunks which it can consume. One thousand per minute is not a tremendous load, so it's going to survive, and your users don't really care; if they don't get the auto-generated email notification in the first minute of the new year, they don't really care. Of course, if they do care, then you have to bring this range down from 1,000 minutes to whatever you like, but you see that you get to decide. All right, so batch processing is something that you should definitely do.
The fourth problem is as interesting as the other problems, actually. It's when someone popular posts something, or if a post becomes really popular, with a lot of people liking it, sharing it, subscribing to it (like you guys should). Say a user like PewDiePie posts something on YouTube; then you need to send it to all of their followers. If you do it in a naive way, the same issue as in job scheduling will come in: there are too many users and a very small delta. So what you could do is batch processing over there, sending notifications in chunks of 1,000, but something that YouTube does really smartly is adding jitter. If you have a lot of followers, the notifications are going to go out to them in a batch-processing way, but then those followers start hitting the page, let's say the video page. There is some content, the video content, which is core to YouTube, but there is a lot of data which actually doesn't matter if you think about it: the number of views, the number of likes, the number of comments, and so on and so forth.
Now, if you have a very popular user like PewDiePie actually posting a video, the number of views is going to be changing dramatically. You could faithfully display that, or you could do it the smart way. Let's say in the first hour we get 1,000 views. Then in the second hour, if there are a lot of users asking for the number of views on this video, I'm going to be smart and just say 1,000 times 1.5 is the total number of views now, so that's 1,500. Maybe the total number of views in reality was 1,700, so there's a mismatch between reality and what is being displayed, but we don't care, because this is metadata; this is something which is not core to the video. We are going to display some number which may or may not be true; it's an approximation. And do the users really care? Not really; they want a general idea of what's going on. Of course, this seems like a really big difference, but YouTube can be smart about this: they can figure out how the views change over time, and given the first hour and the second hour, instead of finding out the total number of views on the video, they can just run through this graph and figure out where the count should lie. YouTube must be much, much smarter than this, but I am just giving the general idea of approximation: instead of showing people the exact truth, approximate and save a lot of load on your service. Potentially this could save a lot of the database queries that you are making to get the metadata of a post. Okay, so that was the fourth smart solution, and this is the fifth smart solution: approximate statistics.
Apart from these solutions, of course, there are some good practices on the server side to avoid a thundering herd. The first one is the most common one, which is caching. If you're getting a lot of common requests, then the response is going to be the same, and you can just cache those requests, I mean, basically cache the responses for those requests. Those are key-value pairs, and this is going to save a lot of the queries you'd be making on the database; in turn, that will improve the performance of your system, and you can also handle more load.
Another thing that you can do, of course, is gradual deployments. Most of the issues that people get on the server side come when they're deploying the service, so there are a lot of stories about site reliability engineers who are fighting deployments: the developers want to deploy more because they want to get more features out, and the reliability engineers want to slow deployments down as much as they can because that makes the system more stable. It's an interesting tug-of-war. What you want to do, essentially, is deploy gradually. With gradual deployments, if you have a total of 100 servers, you don't deploy to them all together: you deploy to the first ten, have a look at what's going on, then deploy to the next ten, and so on and so forth. This won't be possible in certain scenarios where there's a breaking change, so to speak, but we are getting into too much detail; gradual deployment is a good idea. Deploy ten at a time, unless there is absolutely no choice and you have to deploy in parallel.
And the final point that I'm going to make is, of course, going to be controversial; it comes with a star, and that's called coupling. To improve performance, sometimes what you need to do is store data, which is very similar to caching. Let us assume that you have a service which, for every request that it gets, asks an authentication service to authenticate the user first, to authenticate this request, and then serves the request. Let's say that this network call is too much for you; maybe it's an external service. What you could do is cache the user's username and token, or password, or whatever you like. Now you can say that if the username and password worked once in the past hour, then maybe the password hasn't changed, and we are going to assume that this user is authenticated to call this service, and we are going to go ahead instead of talking to the authentication service and verifying whether it's true or not. It's good in a way, because you're not querying the authentication service all the time, so you're reducing the load on the authentication service, improving performance, improving user experience. Except that if this is a financial system and the password has changed, and there's a person who has hacked into the account and is now using the old password, then you're in big trouble, aren't you?
So this is a double-edged sword; in fact, it seems like a really bad idea in most cases. Coupling systems by keeping sensitive or important data in the cache should be avoided. However, if you have some data like, you know, the profile picture or something, you can keep that for one or two hours. That's why there's a star over here: you want to take this case by case and understand whether keeping some data from an external service in your own service is a good idea or not. That will improve performance, and in turn that will help you handle more requests, so the problem of the thundering herd will be slightly mitigated; but really this is more of a performance improvement, and probably we should put another star over here just to be sure.
That's it for this discussion on the thundering herd. We often have discussions on system design, so if you want notifications for those, you can subscribe, and of course, if you have any doubts or comments on this discussion, you can leave them in the comments below. I'll see you next time.