1: TinyURL + PasteBin | Systems Design Interview Questions With Ex-Google SWE
Summary
TLDR: This video delves into system design for URL-shortening services like TinyURL and Pastebin, tackling challenges in generating unique short URLs, handling large-scale data, and incorporating analytics for tracking clicks. The presenter explores various technical solutions, including hashing for URL generation, single-leader replication, partitioning for scalability, and caching to optimize read speeds. The discussion also covers strategies for managing hot links, stream processing for accurate click analytics, and considerations for handling expired links and large pastes, suggesting the use of object stores and CDNs for efficiency.
Takeaways
- 😀 The video discusses designing a system for URL shortening services like TinyURL and a paste bin service, focusing on generating unique short URLs and handling large-scale data.
- 🔄 The presenter accidentally recorded 20 minutes without sound, highlighting the importance of checking technical setup before long recordings.
- 🔗 The core functionality involves creating short links from long URLs and storing pastes with short access links, emphasizing the need for a unique and distributed approach to avoid collisions.
- 📈 The system design considers analytics, specifically tracking the number of clicks per link, which introduces challenges in ensuring data accuracy and performance at scale.
- 🚀 The design aims to handle an extremely high scale, with a hypothetical trillion URLs and varying sizes of data, from kilobytes to gigabytes, requiring partitioning and distributed storage.
- ⚖️ The system optimizes for more reads than writes, a common pattern in URL shortening services, which influences the choice of database replication and caching strategies.
- 🔑 Generating short URLs involves using a hashing function with elements like long URL, user ID, and timestamp to ensure an even distribution and handle collisions through probing.
- 🚫 The video rules out multi-leader or leaderless replication and write-back caching due to potential conflicts and inconsistencies in link generation.
- 📚 The choice of database is influenced by the need for single-leader replication, partitioning, and the use of B-tree indexes for efficient reading, leaning towards a traditional SQL database.
- 🔥 To handle hot links with high traffic, the system employs caching strategies, with considerations for cache invalidation and the use of write-around caching to avoid conflicts.
- ♻️ The system uses stream processing with tools like Kafka for handling analytics data, ensuring durability and fault tolerance, and avoiding race conditions in click counting.
Q & A
What is the primary purpose of the systems design interview questions discussed in the video?
-The primary purpose is to delve into the design of systems like TinyURL and Pastebin, focusing on generating short links and handling analytics such as click counts, while considering performance and scalability.
Why is generating a unique short URL considered challenging?
-Generating a unique short URL is challenging because it must be unique for each link, which can slow down the service as the number of URLs increases, leading to potential collisions and the need for efficient handling mechanisms.
What is the significance of using a hashing function in the context of generating short URLs?
-A hashing function is used to distribute the short URLs evenly across the system, reducing the likelihood of collisions and ensuring a more uniform distribution of link generations.
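As a rough illustration of this idea, here is a minimal Python sketch (not taken from the video) that hashes the long URL, user ID, and creation timestamp into an 8-character base-36 key; the specific hash function and alphabet are assumptions, chosen only to match the 36-characters-per-slot math discussed later.

```python
import hashlib
import time

# 36 possible characters per slot: 0-9 plus a-z, as discussed in the video.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"
KEY_LENGTH = 8  # 36^8 is roughly 2.8 trillion combinations

def short_url_key(long_url: str, user_id: str, created_at: float) -> str:
    """Hash (long URL, user ID, create timestamp) into an 8-char base-36 key."""
    payload = f"{long_url}|{user_id}|{created_at}".encode()
    digest = int.from_bytes(hashlib.sha256(payload).digest(), "big")
    chars = []
    for _ in range(KEY_LENGTH):
        digest, remainder = divmod(digest, len(ALPHABET))
        chars.append(ALPHABET[remainder])
    return "".join(chars)

print(short_url_key("https://google.com", "jordan", time.time()))  # e.g. 'k3v9q0ap'
```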
Why might using a monotonically increasing sequence number for short links be a bad idea?
-Using a monotonically increasing sequence number could lead to performance bottlenecks because it would require locking on that number for every single request, thus reducing concurrency and slowing down the link generation process.
What are some performance considerations when designing a system to handle a trillion URLs with varying click rates?
-Performance considerations include ensuring accurate click count analytics, data storage for potentially petabyte-scale data, and optimizing for a higher number of reads than writes due to usage patterns.
How can partitioning help in improving the performance of URL generation and click analytics?
-Partitioning can improve performance by distributing the load across multiple systems, reducing the chance of hotspots and allowing for more efficient data management and retrieval.
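Elsewhere the presenter recommends consistent hashing for this partitioning, so that resizing the cluster moves only a small slice of keys. Below is a rough Python sketch of that idea (not the video's implementation); the MD5 ring positions and virtual-node count are illustrative assumptions.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: adding a node only remaps the keys that
    fall between the new node and its predecessor on the ring."""

    def __init__(self, nodes, vnodes=100):
        self.positions = []     # sorted ring positions, for bisect
        self.ring = []          # parallel list of (position, node)
        for node in nodes:
            self.add_node(node, vnodes)

    def _pos(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str, vnodes=100):
        for i in range(vnodes):
            pos = self._pos(f"{node}#{i}")
            idx = bisect.bisect(self.positions, pos)
            self.positions.insert(idx, pos)
            self.ring.insert(idx, (pos, node))

    def node_for(self, short_url: str) -> str:
        # First node clockwise from the key's position, wrapping around the ring.
        idx = bisect.bisect(self.positions, self._pos(short_url)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["db1", "db2", "db3"])
print(ring.node_for("abc123xy"))
```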
What is the role of caching in the context of a URL shortener service?
-Caching can significantly speed up read operations by storing frequently accessed data, such as popular short URLs and their redirections, in a faster-access storage system, reducing the need to query the database repeatedly.
Why might multi-leader or leaderless replication not be suitable for a URL shortener service?
-Multi-leader or leaderless replication could lead to conflicts where multiple users generate the same short URL at the same time, resulting in incorrect link associations and a poor user experience.
What is the proposed solution for handling click analytics to ensure accuracy without performance degradation?
-The proposed solution involves using stream processing, such as Kafka, to handle click events and then process them in mini-batches using a system like Spark Streaming, ensuring accurate and efficient analytics.
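As a sketch of the mini-batching idea only (the video uses Kafka plus Spark Streaming; this plain-Python stand-in just shows the aggregation logic, with the batch size and the fake database sink as assumptions):

```python
from collections import Counter

BATCH_SIZE = 100  # flush once we've buffered this many click events

class ClickAggregator:
    """Stand-in for a Spark Streaming consumer: buffer click events from the
    broker, then flush one aggregated increment per short URL per mini-batch."""

    def __init__(self, db):
        self.db = db            # anything exposing add_clicks(short_url, n)
        self.buffer = []

    def on_event(self, short_url: str):
        self.buffer.append(short_url)
        if len(self.buffer) >= BATCH_SIZE:
            self.flush()

    def flush(self):
        counts = Counter(self.buffer)
        for short_url, n in counts.items():
            self.db.add_clicks(short_url, n)   # one write per URL per batch
        self.buffer.clear()

class FakeDB:
    def __init__(self):
        self.clicks = Counter()
    def add_clicks(self, short_url, n):
        self.clicks[short_url] += n

db = FakeDB()
agg = ClickAggregator(db)
for _ in range(250):
    agg.on_event("abc123xy")
agg.flush()
print(db.clicks["abc123xy"])  # 250
```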
How can a write-around cache help in managing the writes to the database for URL click analytics?
-In a write-around cache, writes go only to the database and bypass the cache; the cache is populated later, when a read misses and the value is fetched from the database. This keeps rarely read links out of the cache, avoids slowing down writes, and prevents the cache from serving links that were never successfully written.
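A minimal sketch of that read path, with Python dicts standing in for MySQL and Redis (an assumption for illustration; the real service would also handle expiry and eviction):

```python
class WriteAroundCache:
    """Writes go straight to the database; the cache is only populated
    when a read misses and fetches the value from the database."""

    def __init__(self):
        self.db = {}      # stand-in for MySQL
        self.cache = {}   # stand-in for Redis

    def create_link(self, short_url, long_url):
        self.db[short_url] = long_url          # write around the cache

    def resolve(self, short_url):
        if short_url in self.cache:            # cache hit
            return self.cache[short_url]
        long_url = self.db.get(short_url)      # cache miss -> hit the database
        if long_url is not None:
            self.cache[short_url] = long_url   # populate for future readers
        return long_url

store = WriteAroundCache()
store.create_link("abc123xy", "https://google.com")
print(store.resolve("abc123xy"))  # miss, then cached
print(store.resolve("abc123xy"))  # hit
```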
What are some considerations for handling large pastes in a paste bin service similar to TinyURL?
-Handling large pastes requires considering storage solutions like object stores (e.g., Amazon S3) instead of traditional databases, and using CDNs for efficient delivery of large, static files to users.
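A hedged sketch of how a large paste might be pushed to an object store while only a small pointer row lands in the database; the bucket name, key layout, and `metadata_db` dict are hypothetical, credentials are assumed to be configured, and real multi-gigabyte pastes would use a multipart upload (e.g. boto3's `upload_fileobj`) rather than a single `put_object`.

```python
import boto3

s3 = boto3.client("s3")
PASTE_BUCKET = "my-pastebin-pastes"   # hypothetical bucket name

def save_paste(short_url: str, paste_bytes: bytes, metadata_db: dict):
    """Store the (possibly huge) paste body in the object store, and only a
    small pointer row in the database, as discussed above."""
    object_key = f"pastes/{short_url}"
    s3.put_object(Bucket=PASTE_BUCKET, Key=object_key, Body=paste_bytes)
    # The database row just records where the blob lives, plus some metadata.
    metadata_db[short_url] = {"s3_key": object_key, "size": len(paste_bytes)}

db = {}
save_paste("abc123xy", b"Jordan is sexy", db)
print(db["abc123xy"])
```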
Outlines
😅 Systems Design Interview: TinyURL and Pastebin
The speaker begins by expressing frustration over a recording mishap but dives into discussing systems design interview questions, focusing on URL shortening services like TinyURL and Pastebin. They explain the concept of generating unique short links and the additional complexity of integrating analytics for tracking clicks. The importance of ensuring the accuracy of click counts is highlighted, along with the performance considerations at scale, such as supporting trillions of URLs and petabytes of data. The speaker emphasizes the need for a robust partitioning strategy and the predominance of read operations over writes.
🔄 Link Generation and Handling Hash Collisions
This paragraph delves into the technicalities of generating unique short URLs using a hashing function combined with the long URL, user ID, and creation timestamp. The discussion covers the number of characters needed in the hash result, given the possible combinations with alphanumeric characters. The speaker then addresses potential hash collisions, explaining the use of probing to find an available slot, as opposed to chaining, since maintaining per-bucket linked lists inside the database is not feasible in this context.
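A small Python sketch of the probing idea described above: treat the key as a base-36 number and keep incrementing until a free slot is found. The in-memory `taken` set stands in for the database's uniqueness check and is an assumption for illustration.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

def increment_key(key: str) -> str:
    """Treat the key as a base-36 number and add one (e.g. ...ca8 -> ...ca9)."""
    value = 0
    for ch in key:
        value = value * 36 + ALPHABET.index(ch)
    value = (value + 1) % (36 ** len(key))
    digits = []
    for _ in range(len(key)):
        value, rem = divmod(value, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def claim_slot(candidate: str, taken: set) -> str:
    """Linear probing: keep trying the next key until one is free."""
    while candidate in taken:
        candidate = increment_key(candidate)
    taken.add(candidate)
    return candidate

taken = {"32fc1ca8"}
print(claim_slot("32fc1ca8", taken))  # probes to '32fc1ca9'
```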
🚫 Challenges with Replication and Caching for URL Shortening
The speaker explores the challenges of using replication and caching to enhance write operations in a URL shortening service. They explain why multi-leader or leaderless replication is not suitable due to the potential for conflicts when multiple users generate the same short link simultaneously. The paragraph also discusses the limitations of write-back caching due to the inability to ensure immediate validity of the link, thus ruling out certain databases and caching strategies.
🔑 Database Schema and Handling Concurrent Writes
The paragraph discusses the database schema for a URL shortening service, including fields for the short URL, actual URL, user ID, creation time, expiration time, and click count. It addresses the issue of concurrent writes to the same short URL by users and how databases can handle this using predicate locks or by materializing conflicts. The speaker also touches on the efficiency of using indexes and the potential use of stored procedures to reduce network calls.
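A minimal sketch of that schema and the "claim a short URL" write path, using SQLite as a stand-in for MySQL; the column names approximate those listed above, and here a unique primary key (rather than an explicit predicate lock) is what rejects the second concurrent insert.

```python
import sqlite3
import time

# SQLite standing in for MySQL; schema approximates the one described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE urls (
        short_url   TEXT PRIMARY KEY,   -- unique key; a second insert of the same key fails
        actual_url  TEXT NOT NULL,
        user_id     TEXT,
        create_time INTEGER,
        expire_time INTEGER,
        clicks      INTEGER DEFAULT 0
    )
""")

def try_claim(short_url: str, actual_url: str, user_id: str) -> bool:
    """Returns False if another user already claimed this short URL."""
    try:
        conn.execute(
            "INSERT INTO urls (short_url, actual_url, user_id, create_time) VALUES (?, ?, ?, ?)",
            (short_url, actual_url, user_id, int(time.time())),
        )
        return True
    except sqlite3.IntegrityError:
        return False   # caller should probe the next key (x + 1)

print(try_claim("32fc1ca8", "https://google.com", "jordan"))       # True
print(try_claim("32fc1ca8", "https://mypresentation.com", "biz"))  # False -> try 32fc1ca9
```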
🛠 Choosing the Right Database Engine and Indexing Strategy
The speaker compares different database engines and indexing strategies for the URL shortening service. They discuss the trade-offs between LSM trees and B-trees, ultimately favoring B-trees for their read efficiency. The choice of a single-leader replication database is justified, and the speaker expresses a preference for MySQL due to the simplicity of the data model, despite the potential for NoSQL solutions.
🔍 Optimizing Read Speeds and Handling Hot Links
This paragraph focuses on optimizing read speeds, which is a priority due to the high number of reads compared to writes. The speaker discusses the use of replication and partitioning to distribute load and improve read performance. They also introduce the concept of hot links, which receive a high volume of traffic, and propose caching as a solution to handle these efficiently, suggesting a write-around cache strategy with an LRU eviction policy.
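A tiny illustration of least-recently-used eviction for the hot-link cache; in practice this would be Redis's built-in eviction policy rather than hand-rolled code, and the capacity of 2 is only for the demo.

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache: the least recently read entry is evicted first,
    mirroring the eviction policy suggested for hot links."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, short_url):
        if short_url not in self.entries:
            return None
        self.entries.move_to_end(short_url)      # mark as most recently used
        return self.entries[short_url]

    def put(self, short_url, long_url):
        if short_url in self.entries:
            self.entries.move_to_end(short_url)
        self.entries[short_url] = long_url
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)     # evict least recently used

cache = LRUCache(capacity=2)
cache.put("aaa11111", "https://a.com")
cache.put("bbb22222", "https://b.com")
cache.get("aaa11111")                  # touch 'aaa11111'
cache.put("ccc33333", "https://c.com") # evicts 'bbb22222'
print(list(cache.entries))             # ['aaa11111', 'ccc33333']
```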
📈 Stream Processing for Accurate Analytics
The speaker proposes using stream processing to accurately track URL clicks and avoid race conditions in the analytics process. They suggest using a message broker like Kafka for durability and fault tolerance, and discuss different methods of writing data to the cache. The paragraph also explores options for processing the stream of events, such as using Spark Streaming with mini-batching to update the click count in the database without the need for locking.
🔒 Ensuring Exactly-Once Semantics in Stream Processing
This paragraph addresses the challenge of ensuring exactly-once processing semantics in stream processing systems. The speaker discusses the potential for events to be processed more than once due to network issues or system failures. They propose solutions such as using an idempotency key or partitioning the Kafka queues and Spark Streaming consumers by short URL to avoid race conditions and ensure accurate click counts.
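A minimal sketch of the idempotency-key idea: each row remembers the last batch ID applied to it, so a replayed increment is dropped. Tracking only the last key works under the video's assumption of one partitioned consumer per short URL (batches for a row arrive in order); the class and field names here are illustrative.

```python
class ClickRow:
    """Per-row click count plus the last idempotency key applied to it."""
    def __init__(self):
        self.clicks = 0
        self.last_batch_id = None

def apply_batch(row: ClickRow, batch_id: str, increment: int):
    """Ignore a batch we've already applied (e.g. the consumer replays after a
    crash because it never saw the database's acknowledgement)."""
    if row.last_batch_id == batch_id:
        return                      # duplicate delivery: drop it
    row.clicks += increment
    row.last_batch_id = batch_id

row = ClickRow()
apply_batch(row, "a1245z", 100)   # first delivery
apply_batch(row, "a1245z", 100)   # replay of the same batch: no effect
print(row.clicks)                  # 100, not 200
```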
🗑️ Handling Expired Links and Paste Bin Considerations
The speaker discusses the handling of expired links through a simple batch job that clears out links past their expiration time. They also differentiate the handling of Paste Bin, which involves large files that cannot be stored as fields in a database. The paragraph suggests using an object store like Amazon S3 for large pastes and leveraging a CDN for quick delivery, advocating for a write-through cache strategy in this case.
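A rough sketch of the nightly cleanup described for expired links, written as a cron-style job against a toy SQLite table standing in for MySQL; the table contents are made up for the example.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (short_url TEXT PRIMARY KEY, expire_time INTEGER)")
conn.execute("INSERT INTO urls VALUES ('old11111', ?)", (int(time.time()) - 3600,))
conn.execute("INSERT INTO urls VALUES ('new22222', ?)", (int(time.time()) + 3600,))

def purge_expired(conn):
    """Nightly cron-style batch job: clear out every link past its expire time."""
    cur = conn.execute(
        "DELETE FROM urls WHERE expire_time IS NOT NULL AND expire_time < ?",
        (int(time.time()),),
    )
    conn.commit()
    return cur.rowcount

print(purge_expired(conn))  # 1 -> only 'old11111' is removed
```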
🌐 Comprehensive System Design for URL Shortening and Paste Bin
The final paragraph brings together all the components discussed in the previous paragraphs to present a comprehensive system design. The speaker outlines the processes for writers and readers, the use of load balancers, URL assigning services, databases, caches, Kafka for event streaming, and Spark Streaming for analytics. They also highlight the role of ZooKeeper as a coordination service for managing metadata and ensuring consistency across the system.
Mindmap
Keywords
💡TinyURL
💡Pastebin
💡URL Shortening
💡Analytics
💡Hashing Function
💡Partitioning
💡Replication
💡Caching
💡Consistent Hashing
💡Stream Processing
💡Write-Through Cache
💡Write-Around Cache
💡Idempotency
💡Zookeeper
Highlights
Introduction to systems design interview questions focusing on URL shortening services like TinyURL and Pastebin.
The challenge of generating unique short URLs and the potential impact on service performance.
Incorporating analytics to track the number of clicks per link and ensuring accuracy.
Performance considerations at scale, with a hypothetical trillion URLs scenario.
The importance of read-write ratio in system design, optimizing for more reads than writes.
Link generation strategies using hashing functions to distribute links evenly.
Handling hash collisions in URL generation with probing methods.
The pitfalls of using multi-leader or leaderless replication for URL assignment.
The role of caching in speeding up writes and its limitations in the context of URL shortening services.
Partitioning as a method to increase write throughput and its advantages.
The use of consistent hashing to minimize key redistribution during cluster size changes.
Database schema design for storing URLs, user IDs, creation times, and click counts.
Techniques to handle concurrent URL creation to avoid conflicts in a distributed system.
The choice between LSM tree and B-tree indexes for the database based on read and write priorities.
Optimizing read speeds through replication, partitioning, and handling hot links with caching.
Approaches to updating click analytics without causing race conditions in a high-traffic scenario.
Utilizing stream processing for handling click events and the role of systems like Kafka.
Ensuring exactly-once processing in stream consumers to prevent incorrect click counts.
The implementation of expiration logic for links and the use of batch jobs.
Distinguishing the design for a paste bin service, including handling large files with object stores and CDNs.
The overall system architecture integrating writers, readers, databases, caches, and stream processing.
Transcripts
hello everyone and welcome back to the
channel today after much long waited
time we are finally going to be getting
back into our systems design interview
questions starting off of course with
tiny URL and Pastebin now I've just had the
absolute pleasure of going 20 minutes
into this video and then realizing that
my microphone wasn't turned on and now
I'm on absolute tilt so if any of you
think that uh you know you're not
looking forward to watching this
probably hour-long video just know that
I'm looking forward to recording it and
editing it a whole lot less anyways
let's go ahead and get started before I
freak out so today we're going to be
talking about tiny URL and Pastebin and so
the gist of these two is pretty simple
basically we're going to be generating
short links so in the case of tiny URL
we might have something like
google.com and someone wants to make a
short link for it called tinyurl.com/ABC
same goes for Pastebin
if instead of entering my text here
I was to write Jordan is
sexy which we all know to be the case of
course our paste would be pointed to by
that short link so let's actually go
through some formal problem
requirements so basically what we're
going to talk about today is a couple of
important things the first thing is
generating this unique short URL now the
unique short URL more or less
has to be unique obviously like I've
just said but this is easier said than
done and it's of course going to make
our service slower so the other thing is
that in addition to that URL I'm
actually going to make this problem a
little bit more complicated and
hopefully a little bit more in-depth by
adding some amount of analytics in this
case the number of clicks per link now
you might think that this is super easy
as a matter of fact it is not and of
course the reason it's not easy is
because we have to be able to ensure
that this number is actually accurate
while it doesn't have to be there the
second that click occurs eventually we
do need to know the right
answer so let's think about some
performance considerations as well uh
disclaimer I doubt that if I were to
build a real tiny URL site there would
be a trillion URLs that I have to
support nonetheless this systems design
problem wouldn't be very much fun if
there weren't so as a result I'm going
to name some obscene amount of scale and
then we're going to try and figure it
all out so let's imagine that for our
median URL we've got 10,000 clicks the
most popular URL is in the mid millions
of them so that's going to be quite a
bit and make things tougher like I
mentioned there's going to be a trillion
overall and then if we're talking about
Pastebin certain pastes can actually be in
the gigabytes and others are going to be
in kilobytes probably the vast majority
of those so let's think that in that
case if we're just going to do some back
of the envelope math over here we've got
1 trillion short URLs times 1 kilobytes
worth of pastes that's going to be equal
to a petabyte and that is certainly more
data than we can store on one individual
system we're going to have to be doing
some partitioning at some point down the
line another thing to note is just based
on the actual use patterns of people who
use tiny URL a lot of people are going
to be clicking the link a lot more than
they are generating links so as a result
of that we're going to have many more
reads than writes and that's probably
the type of case that we want to be
optimizing on so let's go ahead and
actually start by talking about link
generation because that is going to be
the most critical part of this problem
that we really need to understand so is
possible if we really wanted to to
basically just have some monotonically
increasing sequence number for all of
our short links but I think that's a
pretty bad idea because then you
basically need to lock on that number
every single time for every single
request that everyone's making so you
can actually generate that short number
instead what we should probably be doing
is generating those links out evenly
across our system and the way that we
can do this is using a hashing function
I think just to continue to make sure
that we get very evenly distributed
links we should probably put in things
like in the tiny URL case with the
long URL a user ID and the create
timestamp so we've got one example of that
over here and as you can see it's going
to Output some sort of string for us to
use as our tiny URL key and so the
question we should now be asking
ourselves is how many characters do we
actually need in our hash result well
assuming that we're using 0 through 9
and A to Z that gives us 10 + 26 = 36
possible choices per character and so as
a result that means per slot we have
here we've got 36 * 36 * 36 * so on
which is 36 to the N combinations where
n is how many characters we're going to
generate and so as a result if n is
equal to 8 I looked this up on Google
because I can't do this in my head we
have around 2 trillion combinations and
considering I said we need about a
trillion links that should be good for
us so the question now is what do we
actually want to do on hash collisions
cuz when you only have two trillion
buckets and one trillion things you're
probably going to have a decent amount
of those well in the case of a hash
collision at least on a single normal
computer in our actual computer science
programs we basically have two options
one is that you can do something known
as chaining which is creating a linked
list out of each hash key bucket and the
other is probing which is saying oh if
this bucket is already taken why don't
we try and put our value over here now
in a database we can't really use a
linked list so probably the only
feasible thing is going to be probing so
for example if we had you know the hash
32 fc1 ca8 imagine this was like base 36
or something then the next hash to try
is going to be 32 fc1 ca9 and so on we
can basically keep trying until we find
an available one hopefully that makes
sense so now the first thing I wanted to
talk about for this video is writing
those URLs actually assigning them and
then as a result what type of
replication that we need to use within
our databases so of course even though
we care more so about reads than writes
we are still doing a ton of writes I
mentioned there are going to be a
trillion URLs and as a result of that we
want to maximize that throughput so the
first question that I wanted to ask is
can we do so with replication can we use
a multi-leader or leaderless replication
schema because these are two ways of
actually making sure you can speed up
writes it gives you the ability to write to
many replicas and as a result you can
increase your throughput and the answer
that I think is going to be the case
here is no sadly you cannot so let's
come up with example here's me on the
right Jordan and here's a businessman on
the
left we're both submitting URLs for
Generation to the database at the same
time and his is leading to my
presentation.com mine is leading to one
of my favorite sites personally and then
at the same time you'll notice that
we've got the same short link so as a
result we have a conflict and so when it
comes to things like multi-leader or
leaderless replication it's kind of
arbitrary a lot of the time how you
actually end up resolving which write is
going to win so let's imagine that we
use last write wins LWW and my time
stamp happens to be a little bit higher
than his mainly just out of random luck
because keep in mind we can't even trust
distributed timestamps So This Server
happens to have timestamp uh X and this
one's got time stamp X plus one so
Jordan's write wins and then all of a
sudden this guy is reading Jordan's Link
and he's saying WTF because he thought
this was a business
presentation so in theory you know if
you wanted to play devil's advocate here you
could go and say that well okay maybe a
few seconds later we could have told the
businessman hey your write is no longer
valid but at least in my opinion this is
not going to be good here because most
people generate their link they copy
paste it they leave the site and then
they paste it elsewhere so most of the
time I feel that you probably want to be
giving them the right link the second
they click that button so of course this
is going to rule out all sorts of
databases that actually use leaderless
replication and that's going to include
any of the Dynamo style ones so
Cassandra Scylla Riak etc we want to be using
single leader
replication okay next we are going to be
talking about caching because caching is
actually another way that you can speed
up your writes assuming you're doing it
properly by using a write-back cache and
so for example we've got these two guys
over here which are basically in memory
databases they're you know Redis
databases in particular and the gist is
that you can first write to them and
then eventually some point down the line
you can flush those out but what we
encounter here is basically the same
problem that we encountered above with
multileader or leaderless replication
which is that no one really knows the
second they make the write whether their
link is valid or not and as a result if
they send it around in the meantime it
could be someone else's link and that
would be a problem so again same issue
as before we can't really use write-back
caching to speed up our writes that's
unfortunate what about partitioning
that's another way you can speed up your
writes the more partitions that you have
the more load you can put on each one
and you can spread things out a little
bit
well in my opinion this is a perfectly
valid way of speeding up our writes and
we should totally be doing it so we can
actually just go ahead and take all of
our short URLs and partition on those so
we've got all the ones through a through
D over here e through h on the next one
and so on and so on and so you might
notice that I'm actually partitioning by
short URL range as opposed to Hash range
my argument for that is that the short
URL is itself already a hash so things
should be pretty evenly distributed we
shouldn't have to worry about load too
much also recall that I basically said
whenever you want a short URL X and you
can't get it you basically have to try x
+ one and so as a result x + one would
be on the same partition as X so it
means that you don't have to go to
another partition randomly to make that
write and it should in theory reduce
latency so another thing that is
important to know with partitioning in
general is that we probably should be
using some sort of consistent hashing
reasoning being that with consistent
hashing as opposed to hashing
where you basically do mod n where n is
the number of partitions with your short
key that basically it means that
whenever the cluster size changes fewer
uh fewer keys are going to have to be
redistributed so let's imagine this
Arrow right here represents everything
on this node over here and when I add a
new node to the cluster now the only
thing that's going to be moving is this
range over here as well as just uh as
opposed to just a variety of scattered
keys throughout our hash range and so
that is going to help us quite a bit
okay so let's actually talk about our
database schema and what things look
like on a single node so you can see
I've got an index right here or not an
index but an actual table which has our
tiny URL and ideally we want this to be
unique like I mentioned we've got an
actual URL a user ID a create time an
expire time we'll talk about how to
expire things towards the end of this
video and also a number of clicks which
again we'll talk about towards the end
of this video so let's let's imagine
we've got two users user one and user
two the way that we've organized our
schema we actually have a little bit of
a problem which is that in theory they
could be both adding the same row with
the same key for their short URL at the
same time and our database wouldn't be
able to do anything about it why well
normally you would want to be locking on
this key but in our case this row didn't
actually exist yet before they added
them and so we don't actually have
anything to lock on so how would a
database internally actually go ahead
and do something like this well there's
two possible solutions one of which is
called predicate locks predicate locks
are pretty simple they're basically just
locks on rows that don't exist yet and
you specify a query on that row so in
this case you know as you can see we
would be taking everything from the URLs
table where the short URL key is the
thing that the conflict was on and so
what this looks like uh in turn is
if user one says create the link but
also give me the lock on short uh short
URL X User two can do the same thing but
the database is now going to come back
and tell him to kick rocks because that
is now taken then he's going to have to
go back and create link X+1 and then
finally the database can say okay so a
couple things to note here first off is
that predicate queries can potentially
be expensive why because they have to go
through the whole database and actually
find all the rows that potentially apply
so something that could make this a
little bit faster is actually using an
index on this short URL field why
because an index basically means that
this guy is going to be sorted
internally and as a result of sorting by
tiny URL now all of our queries are
going to be o of log on that field
because we can binary search there
instead of having to do a linear scan
which would be o of n additionally note
that right over
here we've got two sets of network calls
made by the guy who basically lost the
race condition and now has to try to
get link X+ one what we could do instead
is use some sort of stored procedure
Advanced database function where it's
like if x is
taken grab x + one or something like
that and that way we could maybe do all
of that writing with just one network
call perhaps it would speed things up a
little
bit okay another way that we can also
handle this in addition to predicate
locks is by actually materializing
conflicts so as opposed to you know
locking on rows that don't exist yet
what if we actually just grab the lock
on rows that do exist well how can they
exist we basically just write every
single possible key to the database now
you may think to yourself oh this has
got to be a ton of Rights we have a
trillion possible short URLs or actually
two trillion because there are that many
combinations well keep in mind that uh
basically there's one byte per character
there's eight characters per short URL
and that actually only comes out to 16
terabytes which is really not that much
in the grand scheme of things you'll
probably still have to partition but
again not that much data and so now when
user one and user two go ahead and try
to both write to the same key user one
can go ahead and grab the lock user two
is going to lose the database is going
to say try again and then he can try
with a different
row okay now let's talk about potential
database engines because of course this
is something we want to be thinking
about as well the first thing to note is
that for a problem like this we actually
don't really ever need to do range
queries maybe with the exception of a
predicate lock so a hash index could be
fast but at the end of the day we are
storing about a petabyte of data and you
know your interviewer might not let you
get away with that one they might just
say that's going to be too expensive
choose an actual on disk index to use
and so if we are limited to on disk
indexes that gives us two choices first
off the LSM tree second off the B tree
now the trade-offs for these are in my
opinion somewhat simple the LSM tree is
going to be a little bit worse for reads
because when you read you typically have
to read from multiple SSTables and
possibly the LSM tree as well and when
you write you're basically just writing
over to the LSM tree which is in memory
so that should be relatively quick
you'll flush it later on the other hand
for a b tree you're basically writing
straight to disk which is a little bit
less efficient but at the end of the day
when you're reading you only have to
Traverse the tree one time there's no
concept of having to check multiple
different files and so a b tree is
literally just going right through
finding the piece of data that you want
it's all sorted and then you're good to
go so in this case because we are
prioritizing reads over writes generally
here I think I would like to opt for the
B tree oops accidentally did that but
I'm open to hearing what anyone else has
to say there so let's actually go ahead
and choose a database because now we
have some good parameters with which to
filter our choices down we've already
said we want single leader replication
we've already planned on doing some
amount of partitioning we want a
database that uses a b tree index and
personally for me considering the
Simplicity of all of it I think that
would just make it MySQL uh for me
I could see why someone might say MongoDB
if they're particularly inclined to
no SQL but it's not like our data
structures that we're using here in our
data model are particularly complex I
don't really see the argument for a
document data model so I personally
would lean towards
MySQL okay so let's start talking about
read speeds because that's the thing
that we really want to be optimizing
here like I mentioned there are a ton
more reads than there are writes so so far
we've already discussed the fact that
we're going to be using
replication not only for fault tolerance
but also to speed up reads because
replication is going to allow us to read
from our follower replicas in addition
to our leader and also multiple
partitions which also means that for
every partition we're going to have less
load as a result of the fact that there
are now more places to actually read
from which is going to be really great
now it is worth noting that in this case
of single leader
replication because we are going to be
using asynchronous consistency or
eventual consistency would probably be
be the better word to use it is possible
that a client over here could read from
a
follower get some stale data where maybe
it thinks there's no actual URL to
redirect you to when in reality there
has been over here but it just hasn't
been replicated yet um I do think that
in this case you could maybe add some
application logic to go check the leader
replica but at the same time you should
probably be careful with that you never
know how many times they're just like
Bots that are constantly spamming all of
your followers to actually look for you
know redirects or anything like that you
might end up putting a lot of load on
your leader as a
result okay so maximizing read speeds
we've spoken about replication we've
spoken about partitioning but next let's
speak about hot links because I did
mention that certain links are going to
have a ton of traffic they're going to
have millions of clicks and as a result
we need to find a way to actually deal
with those it would be great if not
every single one of those requests uh
for a particular hotlink result had to
go to the database especially because
they're all returning the same thing and
so what would be a good thing to do here
well we should probably introduce some
amount of caching so again the caching
layer the thing that's actually nice
about it is first of all we can scale
this thing independently if we have you
know a ton of hot links maybe we need
more caches additionally we can also
partition the caches by the short URL in
the same way that we would our databases
so that more or less all of the requests
for a particular hot link can go to the
same cache or same set of caches and that
way we don't have to basically recompute
that value multiple different times on
many caches so as you can see all these
guys want ABC right here the best thing
to do is to all have those requests go
to the top cache and then maybe this guy
would be I don't know like H through Z
requests or something like
that okay so we've already spoken about
the fact that we want caching but how
are we actually going to make sure that
the data that we want gets in the cache
well when talking about caching or CDN
or anything like that in general there
are two concepts that we have to
consider we can either push the data to
the cache basically in advance when it's
created or we can pull it in there so in
my opinion I don't think pushing is
really going to work here because at the
end of the day we don't know which links
are going to be popular beforehand
there's no way on our servers to say
like ooh I can tell this link is going
to be super hot let's put it in the
cache beforehand so we can warm it up no
that's probably not going to happen so
at the end of the day we're going to
have to be pulling that data in there
somehow so there are basically three
methods of writing data to our cache
that we can consider the first one we've
already spoken about which is the write-back
cache we've already said we can't
really do this because it's going to
lead to data inconsistencies and write
conflicts which is not going to be
feasible for us the second one which is
a possibility is going to be the write-through
cache so in the write-through
cache which you can see over here the
gist is basically that in addition to
writing to the database at a given time
you also write to your cache now if you
need to you can use two-phase commit
to make sure that they always stay in
line or you can just do best efforts uh
but the gist is you can do that
personally I don't think it's necessary
for us because it's going to slow down
our write speeds a lot and the vast
majority of these links we really don't
even want in our cache in the first place
so it's not that useful to do a write
through in my opinion we should probably
just do a write around cache which is
basically where you just go and write to
the database as per usual and then as
people will eventually read from the
cache the database will send its results
first to the cache and then back to
the user so as you can see that's what's
happening down over here and of course
you know we are obviously going to run
out of room as our cache gets filled up
with all of these short links and of
course we want to keep the most popular
results because that's how we get the
most usage out of our cache the way
that I personally would do this is
pretty standard which is just least
recently used eviction whatever entry in
the cache has least you know been least
recently used whenever we're performing
a read from it get rid of that one
populate it with a new piece of data
hopefully simple enough Okay so we've
spoken about some reads we've spoken
about some writes we seem to have a pretty
fast solution in terms of our
replication our partitioning our caching
but now let's actually go and talk about
a potential solution for our analytics
so if you recall when I showed us the
database schema we do have a column for
clicks and in theory what we could do is
just go ahead and update that right you
know we've got let's say 100 for the
short link a bc1 2 3 we could have the
guy on the left increment it by one when
he makes the click a guy on the right
increment it by one when he makes the
click however what many of you might
already be anticipating is that without
any sort of locking this is going to be
a race condition why because this guy
might read 100 first this guy might
read 100 first and then they're both
going to say set it to 100+ 1 so now
they're going to write 101 and this
guy's going to write 101 and then this
gets set to 101 instead of 102 too and
keep in mind that for very popular links
this is a real possibility if you've got
hundreds of thousands of people clicking
it every single minute you're going to
have a lot of these conflicts and you
would need to implement locking and when
you're implementing either locking or
Atomic operations for something that's
popular enough the database might not be
able to handle that so keep that in mind
it's probably too slow using this sort
of naive
implementation so what's a potentially
better way that we can do this well what
if we were to use stream processing so
basically my idea is that we dump the
data somewhere where we don't need to
grab a lock in order to dump it there
and then we can go ahead and aggregate
it later so in theory you know we could
dump it to a database but the question
is do we need to dump this to a database
a database might be slower than just
dumping it to something like a you know
inmemory message broker or a log base
message broker because at the end of the
day that's basically just either
something in memory or a write ahead log
that you're writing to so again I feel
like we should rule out the database
it's just going to be slower to write to
and additionally we've also got our
in-memory message broker which is good
however I also did say that I want to
ensure that the analytics results that
we get are actually correct now the
issue with an inmemory message broker is
that it's in memory which means it's
probably less fault tolerant barring it
using a write-ahead log and as a result
even though it's super fast it's not
going to be as durable and so what I
would prefer to do is meet these things
in the middle use a log-based message
broker where everything is kept on disk
we've got offsets for every single entry
and essentially what we're doing is
writing to a write-ahead log and as a
result of that it is going to be durable
now an example of such a solution would
be Kafka if you want to use AWS maybe
AWS Kinesis but uh let's try and keep
things open source here so I'm going to
say
Kafka okay so now let's actually talk
about the consumer of these events how
is it that once we have the events
placed in some sort of you know queue
that that we can go ahead and process
them what technology do we want to use
to do so well we've got a few options
one of which is that you know we could
have something dump to HDFS or you know
S3 or anything like that and then
eventually run a batch job on it in my
opinion uh this would probably give us
analytics too infrequently but it really
depends on your users right if they're
content to have analytics once per day
then you could do that but I feel like
most people would like a little bit of a
shorter granularity than that maybe
every couple minutes maybe even every
few seconds
so personally I'm not a fan of the batch
processing solution additionally another
thing that we could do is use something
like Apache Flink which is more of a
real-time solution so in this case we
would process every single event that we
receive from Kafka individually now
personally I don't also think this is
necessary because at the end of the day
if we're processing every single event
individually this also means that every
single event is going to lead to a write
to the database and there's not really
any reason to put that much load on it
now to be fair you could write custom
Flink code that basically just goes and
says you know maybe every 10 that you'll
upload to the database but in my
personal opinion the best option here is
probably just going to be something like
spark streaming where Spark streaming
natively supports something like mini
batching which is configurable and we
could just say something like hey give
me mini batches of 100 clicks and so as
a result every single 100 clicks you
would go ahead and write to the sync in
that type of
interval so the very nice thing about
these stream consumer frameworks like
spark streaming like Flink is that they
actually ensure correctness at least
within the stream processing system so
they basically say that between EV every
event that gets to Kafka and the actual
eventual state of our stream consumer
that all of those events are only going
to be processed once however spoiler
that is going to break down whenever you
have some sort of external system like a
database so let's go ahead and talk
about
that the question is are our events
actually going to be processed exactly
once if our thing here is just going to
be you know do plus 100 Maybe not maybe
things would be a little bit different
if we actually aggregated the total
count in our stream consumer and then
occasionally updated the database but I
think it makes sense to basically just
take every single mini batched event and
then upload it right to the database so
again like I said my question is is do
things actually get run exactly once and
is it possible to have certain race
conditions that could cause us to have
an incorrect click count in my opinion
yes so like I mentioned events are only
processed exactly once internally so
that means within here events are
processed exactly once but here's an
example of something that can happen
from spark streaming we say to the
database hey add 100 clicks and right
before the database can respond saying
hey you know I acknowledge this even
though it went through on the database
for some reason this internet connection
gets cut off spark streaming goes down
and then when it comes back
up it doesn't see that database
acknowledgement it doesn't realize that
the event was processed and now it's
going to have to process it again so
it's going to say plus 100 one more time
and that right there is going to be bad
because now we're going to have an
incorrect count of clicks so we've got a
couple of options that we can actually
do the first is two-phase commit right
we would have some coordinator node over
here which basically says hey are you
ready and are you ready to commit this
thing and assuming they both say
yes then it would say okay go ahead
and commit it now that is notoriously
slow ideally we would like to avoid it
so maybe another better option would be
to use something like an idempotency
key where in addition to storing the
total clicks count let's say there are
1500 clicks the database also says Hey
the last key I saw was uh key number a12
uh 45
Z and so you know let's say this one
this first write was a1245
Z whenever this next write comes along
because uh it is actually the same event
from spark streaming that would also
have key
a1245 Z and then the database would say
hey wait a second these two are equal
don't process this one I don't want to
listen to it and that is one way of
guaranteeing that our pushes to the
database are in fact idempotent again
another way like I mentioned which in
retrospect might be easier is to just
keep track of total
count in spark streaming because then we
don't have to worry about you know
external numbers being published but
that's also going to fail if we
potentially have multiple spark
streaming consumers so again I
personally think an idempotency key
would make a lot of sense the issue with
the idempotency key is let's say we had
you know 10 spark streaming consumers
you know
SS SS s SS and they're all publishing
over
here now we need to store this idempotency
key times 10 for every single
publisher which is inconvenient you know
if all of a sudden there are 100
Publishers now we need to store 100
extra keys that's not great how can we
do this uh or rather how can we get
around this well we could just make it
so that we only have one publisher per
row so what I'm saying is that we would
only have one instance of spark
streaming per short URL so let's say
short URL ABC is only being published
clicks by this guy right here and so the
way that we can do that is we can
actually partition our Kafka queues and of
course our spark streaming consumers by
the short URL and this way we ensure
that basically only one consumer is
going to be publishing clicks for a
specific row at a time which is really
good the first one I mentioned is that
one there's fewer idempotency keys to
store per row and additionally now let's
say we had two spark streaming things
publishing to the database at the same
time they would have to grab locks on
this row ABC because otherwise we would
run into the same problem as before we
would have a race condition and so again
we want to be trying to avoid grabbing
locks whenever
possible okay hopefully that much at
least has made sense let's quickly talk
about expired links because this is
definitely worth a discussion so I
mentioned in my data model that you know
when you create the tiny URL link you
should have some field for being able to
expire one so let's actually just go
ahead and use a batch job to do that we
probably don't need to run anything like
spark here it's not that much data uh I
feel like a nightly batch job would do
it and additionally you really don't
even need a lock either because at the
end of the day uh you know you're only
grabbing it for the row that's being read
so yeah maybe you do need a lock for
that particular row of the one currently
being read just so that no one is you
know continuing to overwrite it while
you clear out the data but at the same
time you know the most expensive batch
jobs are those that have to grab locks
on everything because they're doing some
sort of aggregation and in this case
we're really not I think uh a simple
cron job would probably get it done and
the pseudo code that we' be running is
right here which is basically you know
if the time of the batch job is greater
than the expired time of the row just
clear it out and uh yeah should be
simple
enough okay let's finally talk about
Pastebin because so far we've been
talking as if this was all tiny URL
and the reason I group Pastebin into
this kind of video is because they're
two very very similar problems but the
one exception with Pastebin is that you
can have super large pastes and so like
I mentioned some of them could be in the
multiple gigabytes of size we are going
to support that and so as a result we
probably cannot be storing them as a
field in our database I did look it up I
think uh Postgres supports fields up to
4 GB for long text so let's imagine we
could have one that's 10 gigabytes and
it's not going to fit in there so a few
options that we have one of which is uh
you know storing them in HDFS which we
probably shouldn't do because it's
generally more expensive uh especially
considering that we don't need the data
locality of being able to run batch jobs
where our pastes are stored because
there's no batch jobs to actually run on
them we're literally just storing them
and then returning them so in my opinion
something like an object store like
Amazon S3 would make a whole lot more
sense so like I mentioned likely
preferable cheaper no batch jobs blah
blah blah you just heard me say it the
other important thing to do is note that
because these files are so large we want
to be able to deliver them to the user
relatively quickly and so some sort of
caching would be of great use to us now
if you've watched my CDN video you would
know that those are great for things
like static content which our pastes are
you can't actually modify a paste after
the fact that you've made it and so
again a CDN is basically this
geographically distributed cache which
is going to enable us to deliver those
static files particularly quickly we
could again use some type of LRU policy
to make sure that only the most popular
pastes are going to be in there and as a
result hopefully things should go well
so as you can see what we would want to
do perhaps at least in my opinion is try
and perform all of these writes in
series because if we want all of these
writes to go through one two and three
the CDN to the S3 and the database then
we would have to use some sort of
two-phase commit which again can get
very expensive and annoying so
personally what I would do is from the
client I would just first write to CDN
when that basically goes through I would
then write to S3 assuming that still
goes through I would then write to the
database the reason I don't write to the
database first and then S3 and then the
CDN is that if the database write goes
through and then the write to S3 fails
it's going to seem like a paste exists
when the data for it actually doesn't so
it's more important that we get the data
uploaded to our Object Store and our CDN
first and then after that we can go and
talk about uh putting things in the
database so now you may note that I am
actually writing to the CDN before
pulling things in here
I think this is maybe one of the few
cases where uh write-through actually
makes a lot of sense because at the end
of the day uh if we use write-around
we're going to have to have a cache miss
and that cache miss is going to be
gigantically expensive for a 10 gigabyte
file so it would be greatly beneficial
for us if that was already in the CDN
considering how few writes there are
relative to reads that being said again
open to debate on this
one okay take a deep breath guys we're
almost done
so of course we finally made it to the
last diagram where I've tried to put
every single thing together so let's go
ahead and do that here's our writer this
is going to be someone who's generating
a link here's our reader this is someone
who's going to be reading a link so the
writer is going to do the following
assuming that this is for tiny URL and
not Pastebin then you're only going to
be writing over to the servers but if it
is for Pastebin like I mentioned you'll
probably be hitting S3 over here and
you're also probably going to to be
hitting your CDN which will eventually
geographically distribute that content
as needed so anyways we've got a load
balancer before we basically
horizontally scale out all of our URL
assigning servers note that I do have
our URL assigning Service as something
completely different than our URL
reading service over there on the right
however I think in reality they would
probably be on the same box I mainly did
it this way for clarity of diagram
though I do think it is fair enough to
make them their own microservices
considering the fact that there are so
many more reads than there are writes so
once we hit our assigning service we're
now going to have to hit a database and
actually write our data so as you can
see I have those partitioned by the
range of the short URLs a through l m
through Z and of course we've also got
our single leader replication single
leader rep and this is going to be our
URL table which ideally should be stored
in MySQL now of course like I mentioned
to be speeding up our reads we also want
some sort of distributed partitioned
replicated cache which we can just use
Redis instances for so as you can see
this is similarly partitioned by short
URL so that we can you know deduplicate
the amount of data that we're storing
and just use the same caches for the
same types of data now note that you
know for example if there's a key that
starts with a and it gets a ton of
traffic and there's a key that starts
with b and it gets a ton of traffic you
know one cache might not be enough for
that so in those cases we may have to
actually get some replicas of a
particular cache
instance okay so hopefully we've touched
on that a little bit now let's start to
touch upon reads so we've got our reader
over here and the reader in the at least
the tiny URL case is going to go to the
URL read it's going to return
something from the cache ideally or if
it's not in the cache we're going to have
a cache miss that's going to go over to
the database the database is going to
load the appropriate cache and then as a
result of you know going and clicking
that link We we are going to now upload
our click data the fact that we
performed a click over to Kafka which is
over here as you can see again we've got
multiple different partitions to support
the fact that we want just one spark
streaming instance publishing per row
and then we've got our spark streaming
layer over here which ideally should be
sharded in the same way that our Kafka
queues are now eventually you know let's
say every 10 seconds or something
or more so probably 10 messages because
this is mini batching spark streaming is
going to say oh shoot I just saw you
know you now have 50 more clicks well if
there's 10 messages you can probably
only have 10 more clicks and then it's
going to go over to MySQL and do a
single database upload which is going to
be a lot less expensive than having to
grab a ton of locks now of course the
last component of this whole system is
well where are we keeping all of this
partitioning info how do we know what
load balancer is pointing where you know
what database URL is
what that's what we have zookeeper for
which is really connected to everything
zookeeper is great it is you know
completely consistent and I should say
strongly consistent and it is a
coordination service it is effectively a
nice key Value Store of all of the
metadata information that we're going to
need about the various pieces of our
system well guys give yourselves a pat
on the back I certainly need one I know
this was a pretty long video hopefully
the content was actually useful because
I now need to go cry a little bit
anyways have a great rest of the night I
will see you in the next one