Tiny URL - System Design Interview Question (URL shortener)
Summary
TLDRThe video covers the TinyURL system design interview question, exploring a robust approach to solving it. The key focus is on creating a short URL from a long URL and redirecting users back to the original URL. The speaker explains API design, schema, scalability concerns, and the use of tools like Zookeeper for distributed synchronization. It also touches on caching strategies, database considerations, and optional features like analytics, rate limiting, and security enhancements. The video emphasizes scalability and performance, aiming to address typical interviewer expectations in such design challenges.
Takeaways
- 😀 The Tiny URL system requires both functional (URL shortening and redirection) and non-functional requirements (low latency, high availability).
- 📊 A simple REST API with a POST endpoint for URL creation and a GET endpoint for redirection is proposed.
- 🔢 Key decisions involve determining the length of the short URL based on expected scale, like 1,000 URLs per second, leading to 31 billion URLs annually.
- 🧮 By using a character set of 62 (a-z, A-Z, 0-9), a seven-character URL can handle up to 3.5 trillion URLs, making it scalable enough for large applications.
- 🌐 A naive approach would involve a load balancer distributing requests to web servers, fetching long URLs from a database and caching them for scalability.
- ⚠️ To avoid the single point of failure and collision issues with a caching system, a distributed Zookeeper is introduced to manage URL ranges assigned to web servers.
- 🛠️ Web servers request unique ranges from Zookeeper, ensuring that even if one web server fails, it won’t result in URL collisions.
- 🗄️ Database choices include SQL (Postgres with sharding) or NoSQL (Cassandra) based on trade-offs, with distributed caching like Redis or Memcached to reduce load.
- 📈 Analytics, like counting URL hits and storing IP addresses, can optimize performance and provide location-based cache distribution.
- 🔐 Security concerns include adding random suffixes to URLs to prevent predictability, with a trade-off between URL length and security.
Q & A
What are the functional requirements of the Tiny URL system design?
-The functional requirements are to create a short URL when provided with a long URL and to redirect the user to the long URL when given the short URL.
What are the non-functional requirements for the system?
-The non-functional requirements include very low latency and high availability.
Why is the API design important in this system, and what endpoints are proposed?
-API design is crucial because it defines how the system will interact with users. The system proposes two endpoints: a POST endpoint to create a short URL and a GET endpoint to redirect users to the long URL.
What is the purpose of choosing a short URL length, and how is it determined?
-The short URL length is determined based on the scale of the application. The system needs to handle billions of URLs, so using a seven-character string is ideal, allowing for 3.5 trillion unique combinations.
How does the system handle scalability issues in the proposed design?
-Scalability is addressed by horizontally scaling both web servers and cache systems, using a range allocation technique through Zookeeper to prevent collisions between web servers.
What role does Zookeeper play in the system, and how does it help prevent collisions?
-Zookeeper provides distributed synchronization by assigning each web server a unique range of values for URL generation. This prevents collisions by ensuring no two web servers generate the same short URL.
What are the potential drawbacks of using Zookeeper, and how can they be mitigated?
-One drawback is that if a web server dies after obtaining a range, that range of URLs will be lost. However, given the 3.5 trillion possible combinations, losing a few million URLs is not critical. Additionally, running multiple Zookeeper instances can prevent it from being a single point of failure.
Why might a NoSQL database like Cassandra be better suited for this system?
-Cassandra is better suited due to its ability to handle large-scale data, and it supports horizontal scaling which fits the high read/write demands of this URL shortening system.
How does the system optimize performance using caching?
-The system uses a distributed cache like Redis or Memcached to store frequently accessed short URLs, reducing the load on the database and improving response times.
What additional features or enhancements could be considered for the system?
-Additional features include analytics (tracking popular URLs), IP address logging for optimization, rate limiting to prevent DDoS attacks, and security measures such as adding a random suffix to short URLs to avoid predictability.
Outlines
🔧 Introduction to TinyURL System Design
This paragraph introduces the TinyURL system design interview question. It highlights that this is one of many possible solutions but emphasizes that the approach is robust, addressing common concerns. The problem is broken down into two main functional requirements: creating a short URL from a long URL and redirecting a short URL to the long one. Non-functional requirements like low latency and high availability are also essential, along with scalability, which interviewers expect candidates to explore deeply. The speaker starts with the design of a simple REST API, involving a POST endpoint to create the URL and a GET endpoint to handle redirection.
📊 Handling Scalability with Character Sets and URL Length
Here, the speaker begins addressing scalability, particularly in terms of the number of URLs the system must handle over time. Assuming a scale of 1,000 URLs per second, they estimate the system would need to handle over 31 billion URLs annually. The speaker discusses the character sets that can be used for short URLs, including lowercase and uppercase letters (a-z, A-Z) and digits (0-9), providing 62 characters in total. They then calculate how many unique combinations can be created with different numbers of characters, ultimately concluding that a seven-character URL would suffice to handle trillions of URLs over many years.
⚙️ High-Level System Design and Scaling Solutions
The paragraph outlines a simple high-level system architecture where a load balancer distributes requests to web servers that interact with a database to retrieve long URLs and redirect users. However, this approach won't scale well for billions of URLs. A caching mechanism is introduced to optimize URL generation, but the issue of having a single point of failure with the cache is raised. Horizontally scaling both web servers and the cache helps address this, but new problems such as collisions arise when multiple servers generate the same short URL simultaneously.
🛠 Improving Scalability with Zookeeper and Range Technique
This paragraph introduces a more advanced solution to solve scalability and collision issues by replacing the cache with Zookeeper. Zookeeper assigns ranges of values to web servers for URL generation, ensuring unique short URLs are created by servers operating within different ranges. If a server fails, the range of URLs it was working on will be lost, but since the system generates billions of URLs, losing some values is not a major concern. To prevent Zookeeper from becoming a single point of failure, multiple instances can be deployed, though it handles relatively few requests compared to the previous caching solution.
📚 Database Choices and Distributed Caching
In this section, the focus shifts to database scaling. The speaker discusses using either a traditional SQL database with sharding, like PostgreSQL, or a NoSQL database like Cassandra, which is better suited for large-scale applications. A distributed cache like Memcached or Redis could also be implemented to reduce database load, particularly for frequently requested URLs. The web server can first check the cache before querying the database, optimizing the system's performance under heavy traffic.
📬 Request Flow and URL Creation Process
This paragraph walks through the end-to-end flow of a post request for creating a URL. The request hits a load balancer, is routed to a web server, and the server interacts with Zookeeper to get a range of numbers. A base-62 encoded short URL is generated from this range, stored in the database (and possibly the cache), and finally returned to the user. For GET requests, the web server first checks the cache for the short URL, and if not found, retrieves it from the database before redirecting the user to the long URL.
📈 Additional Considerations: Analytics, Rate Limiting, and Security
In this concluding section, the speaker touches on extra features that may be required, such as analytics for tracking URL usage and improving caching by identifying popular URLs. Other enhancements include adding rate limiting to prevent Denial of Service (DoS) attacks and securing the URL generation process by introducing random suffixes to avoid sequential prediction. The trade-off between URL length and security is also briefly mentioned, as interviewers might expect deeper exploration into balancing these aspects.
Mindmap
Keywords
💡Tiny URL
💡Functional Requirements
💡Non-functional Requirements
💡API Design
💡Scalability
💡Zookeeper
💡Base 62 Encoding
💡Distributed Cache
💡Collision
💡Rate Limiting
Highlights
Introduction of Tiny URL system design, emphasizing that there are multiple ways to approach the problem, but this solution is robust and covers key concerns for interviewers.
Functional requirements include creating a short URL from a long URL and redirecting from the short URL to the long URL.
Non-functional requirements highlight the need for low latency and high availability.
API design consists of a simple REST API with two endpoints: POST for creating a short URL and GET for redirecting the user to the long URL.
Schema design includes fields like long URL, short URL, and created timestamp, with potential for additional fields later on.
To determine the length of the short URL, the scale of the application must be considered (e.g., 1,000 URLs per second would require handling 31 billion URLs annually).
The short URL character set is made up of 62 characters (a-z, A-Z, 0-9), which provides 3.5 trillion unique combinations with 7-character URLs.
Simple system architecture includes a load balancer, web servers, and a database, with initial caching to reduce load, but issues like single points of failure must be addressed.
Introduction of horizontally scaling the web servers and cache to avoid single points of failure, though this introduces potential collision issues.
Solution to collision problem involves using a Zookeeper for distributed synchronization, allocating unique ranges of URL values to each web server.
Zookeeper manages URL ranges to avoid collisions, though web servers that crash may waste some ranges, which is acceptable due to the large available character space.
NoSQL databases like Cassandra are suggested for their ability to handle high-scale read/write demands, potentially coupled with caching mechanisms like Redis or Memcached.
A sample workflow includes a POST request generating a short URL, storing it in the database, and possibly caching it, while GET requests check the cache first before querying the database.
Discussion of analytics, including tracking the number of hits per short URL and IP addresses for regional optimization and reduced latency.
Additional considerations include rate-limiting to prevent DoS attacks and security measures like randomizing URLs to make them harder to predict.
Transcripts
okay so today we're going to be solving
the tiny URL system design interview
question uh and before we start I just
think it's important to note that this
is one way of solving the interview
question there's many other ways but I
think this is a really robust way of
solving the problem that addresses the
main concerns that many interviewers
would be looking to be solved so with
that said let's jump straight into it so
for the functional requirements it's
really simple when we're given a long
URL we have to create a short URL and
then when we're given a short URL we
have to redirect the user to the long
URL and for the non-functional
requirements we're going to require very
low latency and very high availability
so it's a really simple concept but
there are some nice areas where you can
go super deep on the scalability issues
which many interviewers will expect you
to to really know so starting off I
always like to look at the API design so
for this I think a simple rest API will
do so we'll have one endpoint which will
be the the post endpoint which will be
the create URL and here the parameters
will take a long URL and then once we've
created the short URL we can just simply
return with a status code of 2011
indicating that that's being created and
then we'll have a second endpoint which
will be the get endpoint uh where in the
URL we provided with the short URL and
then what we'll do is we will then
redirect the user we'll do a 3 hour1
permanent redirect for the user to the
uh previously provided long URL so super
simple rest API here then for the schema
again really simple we'll have a long
URL which will be a string a short URL
which will be a string and then a creat
that which would be a time stamp but we
could also add additional Fields here
which I'll touch on later on so the
first key question we have to ask the
interor is how long should the URL be
and to to answer that we kind of need to
know what what is the scale of the
application so let's say for example the
interviewer said okay it's required that
the application can create 1,000 URLs
per second so then we know that every
year so if we multiply 1,000 time 60
seconds time 60 Minutes time 24 hours
time 365 days that we're going to be
required to have roughly 3.5 billion or
31 billion URLs created each year and
then if we anticipate a kind of r right
request ratio of 10:1 that means we're
going to be doing roughly 300 billion
reads per year so again it kind of is
something to keep in mind when thinking
about the kind of uh scalability of this
application so then the next question
then is well what characters can we use
and so typically most interviewers and
it's fairly safe to assume that we'll
we'll be able to use a to z lowercase A
to Z uppercase and 0 to9 and that means
then we'll have a total of 62 characters
to use so then with that information we
can then look at how many unique short
your can be created so if we've got one
character if we put that to the power of
62 that means we'll have 62 unique
combinations which makes sense because
there's only you know 62 characters to
choose from uh and then similarly two to
the pirate so we've used uh two
characters then we've got two to power
62 which is
3,800 and so if we skip on down to six
to the power of 62 so for six characters
we've got 56 billion and then for 7even
we've got uh seven characters we've got
3.5 trillion and so if you look here if
we've got you know if we're going to be
expected to generate you know 31.5
billion URLs each year if we look over
the span of 10 years well that's maybe
300 billion and so you can see here well
56 billion won't kind of you know it's
kind of it won't be enough but you know
seven characters at 3.5 trillion will be
so that's why we're going to go with
seven characters okay so now that we
have all the information we need from
the interviewer we can kind of start
putting together a really simple
highlevel um system design so here we've
got a user makes a request to load
balancer that will then distribute it to
a web server and the web server will
then uh get the information it needs
from the database and then redirect uh
the user to the to the long URL so while
this technically Works obviously you
know it's not going to be able to handle
the scale that we've previously talked
about so the first naive solution could
be able uh could be to introduce a c
cach which means that every time a web
server is given a long URL it goes to
the campcash and it asks for a number
the campcash will then return a number a
base 10 number and the web server will
then convert that into a base 62 which
will then be used as the short Euro and
so while this technically Works we've
introduced a single point of failure
here with the campcash which is
something we never want to do and so
we've kind of got to come up with a uh a
more nuanced
solution so here what we've done is
we've hor horizontally scaled both the
web servers and the Countach so that
we've got multiple instances of each so
that we can handle if certain nodes go
down however now we've introduced a new
problem so let's say for instance one of
the web servers makes a request to the
campcash and it gives back a value of 10
and then simultaneously another web
server makes a request to the count cash
and it also uh gives a value of 10 well
then now we've got a collision and this
is something that we cannot have in our
system we want it to be kind of uh
highly available so what we're going to
do now is we're going to move it on to
the next level okay so looking at this
next iteration what we're going to
introduce now is we're going to get rid
of our camp cach and we're going to
introduce a zookeeper and so Z zookeeper
is really good uh to be used as a kind
of centralized service for maintaining
you know configuration information and
kind of providing distributed
synchronization and so we're going to
use a range technique here so what will
happen is each web server will reach out
to the Zookeeper and the Zookeeper will
give it a range of values from which it
can create URLs and so for instance the
first web server might come over and
it'll give it the range of zero to 1
million and the next range will then be
1 million to two uh 1 million one to 2
million so what this means is that each
server will be creating uh unique short
URLs as they're all operating in
different ranges and so when a a new web
server comes on it'll reach out to the
Zookeeper and it will then ask okay what
ranges can I have and it'll give it a
new range and then similarly when a web
server completes its range it will then
reach out to zookeeper which will then
provide it a new unused range and then
you may be asking well okay well what if
a web server reaches out to zookeeper
it's given a range but then it instantly
dies well what that happens with that is
that range will no longer be used and
that those 1 million values will never
be used so we've kind of wasted a
million values but again because we
chose seven characters um we've got you
know 3.5 trillion so losing a million
here and there actually isn't that much
of a problem at all but you could also
say that the Zookeeper is also a single
point of failure so what we could do is
we could have multiple instances of it
um and you have to remember that the
Zookeeper does not get that many uh read
requests because you know unlike the
count cach which every single time a web
server wanted a number it went to it the
web servers only go here to get its
range of values and then unless it
finishes the range it will not come back
to thato keer service so this is a
really rub solution that kind of scales
really well the next step we can do then
is to scale our databases so again here
we can look at the classic you know SQL
nosql asset based tradeoff so you know
you could make an argument we could use
uh you know an SQL database like
postgress and then use sharding to
handle the scale but I think you know
maybe a slightly better approach might
be to go with maybe column orientated
database like Cassandra which could you
know really handle the scale that that
we're dealing with here so typically I'd
go with a nosql kind of Cassandra
approach here and then to kind of reduce
the load even more on the database I'd
also use maybe a distributed cache like
mcash or reddis uh and here I'd store
maybe the most popular short URL so that
the web servers would first check the
the cache uh to kind of reduce the load
on the databases as well so if we walk
through an example so let's say we've
got a post request so we're going to
send a post request to the create URL
endpoint that'll come to the load
balancer which will then distribute it
to the web servers one of the web
servers will then if it doesn't have a
range it'll go to the Zookeeper and get
its range and then it will generate the
short your Ur L uh from that number
using the B 62
encoding The Next Step then will be to
save the long URL and short URL in the
data base uh and it could also store it
in the distributed cache depending on
what the you know caching logic is for
the application uh and then finally the
web server will just respond to the user
with the uh the newly created short
URL and then for the get request what
will happen is a get request will come
in with the short URL uh in the actual
URL itself uh the web server will first
check the cache uh and see if it's in
there if it's not in there well then it
will retrieve it from the the long URL
from the database and then it will
simply uh redirect the user to the long
URL so really simple but again it's this
kind of scalability uh issue that you
know uh interviewers are really expect
you to to go deeper and then finally
there's there may be some additional
talking points which you may or may not
be required to know so one of them may
be analytics so what we could do is we
could also maybe maintain counts for
each of the URLs to determine you know
which short URLs are most popular and
then cach them and this kind of again
can help us with the short the hotkey
problem like uh like that Twitter has
and then similar we also can store maybe
IP address information to get location
info which can help us determine where
to locate our caches to just again
reduce that latency and improve the
overall
performance we can also introduce rate
limiting and so this can help prevent
against dos attacks whereby users try
and flood our servers with request to
kind of uh try and make them fail and so
rate limiting could be an interesting
talking point there uh and then finally
there's you know security consideration
so you know if you look at our uh range
Theory we are sequentially in
incrementing and so therefore
potentially we might want to add a
random suffix to the short URL to
prevent hackers being able to predict
URLs but then it we'll be making a
trade-off between the length of the URL
uh and you know the predictability so
that trade-off between length and
security is maybe a good talking point
uh that you can have with the
interviewer so this is just one way of
solving the problem if you got any other
ways definitely leave a comment below be
very interested to see you know how
other people solve it uh and look if you
liked it please like And subscribe it
helps the channel out a lot and I'll see
you in the next one
Посмотреть больше похожих видео
1: TinyURL + PasteBin | Systems Design Interview Questions With Ex-Google SWE
API designing: How to Design Best APIs | Best Practices | #api #backenddevelopment
System Design Mock Interview: Design Instagram
2: Instagram + Twitter + Facebook + Reddit | Systems Design Interview Questions With Ex-Google SWE
System Design: How to design Twitter? Interview question at Facebook, Google, Microsoft
WHATSAPP System Design: Chat Messaging Systems for Interviews
5.0 / 5 (0 votes)