Shazam Audio Recognition Design Deep Dive with Google SWE! | Systems Design Interview Question 23
Summary
TL;DR: In this video, the creator discusses the complexities of building a music recognition service like Shazam, an intriguing distributed systems problem. They delve into the functional requirements, capacity estimates, and database design, highlighting the use of spectrograms and constellation maps for audio fingerprinting. The video explores the algorithmic approach to song matching, including the use of combinatorial hashes and inverted indexes for efficient search. It also touches on the challenges of scaling the service, suggesting solutions like sharding, caching popular songs, and parallel processing for handling large datasets.
Takeaways
- 🎤 The speaker humorously introduces their channel and personal habits, setting a casual tone for the video.
- 🔍 The video discusses the problem of designing a music recognition service like Shazam, which is an interesting distributed systems challenge.
- 📈 The speaker provides functional requirements for the service, including the ability to take an audio clip and return song suggestions.
- 📚 Capacity estimates are given, with 100 million songs and 3,000 indexable keys per song, leading to a 2.4 terabyte index size.
- 🔑 The concept of 'fingerprints' is introduced as a way to represent audio data in a simplified 2D graph, which is crucial for matching songs.
- 🛠️ The algorithm for music recognition involves creating a constellation map from audio data and using combinatorial hashes to match audio clips to songs.
- 💡 The use of pairs of frequencies with a time delta between them is highlighted as a key optimization for creating a robust song index.
- 🗃️ The video outlines a database design involving an index for quick lookup and a database for storing audio files and their metadata.
- 🔒 The importance of sharding the index for scalability and in-memory storage for performance is discussed.
- 🔄 The need for parallelization in processing user requests and updating the index with new songs is emphasized.
- 🛠️ The video concludes with a high-level overview of the system's architecture, including client interaction, load balancing, and batch processing for new songs.
Q & A
What is the main topic discussed in the video script?
-The main topic discussed in the video script is the design and functioning of a music recognition service similar to Shazam, including its algorithm and distributed systems challenges.
Why is the speaker wearing underwear to celebrate the recording of this video?
-The speaker humorously mentions wearing underwear to celebrate the recording as an unusual accomplishment, indicating their tendency to stay at their desk and not get up often.
What is the functional requirement of the music recognition service discussed in the script?
-The functional requirement is to build a service that allows users to input an audio clip through a device like a phone microphone, and then the service returns a suggestion of the song that the audio clip is from.
What is the capacity estimate for the index in the music recognition service?
-The capacity estimate for the index is 2.4 terabytes, based on the assumption of 100 million songs with 3,000 indexable keys each, where each key is 64 bits.
What is a spectrogram and how is it used in the context of the music recognition service?
-A spectrogram is a 3D representation of audio with axes for time, frequency, and amplitude. To simplify the search space, its high-amplitude peaks are reduced to a 2D graph known as a constellation map, which is what the service actually matches against.
What is a combinatorial hash in the context of the music recognition algorithm?
-A combinatorial hash in this context pairs two constellation-map points, an anchor frequency and a later target frequency, together with the time offset between them, into a tuple. These tuples are far more discriminative than single frequencies and serve as the fingerprint keys used for efficient song matching.
Why is sharding necessary for the index in the music recognition service?
-Sharding is necessary because the index is too large to fit in a single machine's memory, and sharding allows for parallel processing and scalability as the index grows.
What is the role of an inverted index in the music recognition service?
-The inverted index in the music recognition service maps the combinatorial hashes (fingerprints) to song IDs, allowing for quick lookup of potential song matches from a user's audio clip.
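Concretely, the inverted index can be pictured as a map from each fingerprint hash to the songs that contain it. Below is a toy in-memory sketch, not Shazam's actual schema; storing the in-song time offset alongside each song ID is an assumption carried over from the matching phase, which needs it for alignment:

```python
from collections import defaultdict

# fingerprint hash -> list of (song_id, time offset of the anchor in the song)
inverted_index: defaultdict = defaultdict(list)

def index_song(song_id: int, fingerprints: list) -> None:
    """fingerprints: (hash, anchor time offset) pairs for one song."""
    for fp_hash, offset in fingerprints:
        inverted_index[fp_hash].append((song_id, offset))

def candidate_songs(clip_fingerprints: list) -> set:
    """Step one of the two-step match: every song sharing at least one hash."""
    return {song_id
            for fp_hash, _ in clip_fingerprints
            for song_id, _ in inverted_index.get(fp_hash, [])}
```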
How does the music recognition service handle the processing of potential song matches?
-The service uses parallelization to compute the similarity between the user's audio clip and potential song clips, possibly caching popular songs for faster access, and then aggregates the results to determine the best match.
What is the purpose of the batch job mentioned in the script for the music recognition service?
-The batch job is responsible for periodically updating the index and database with new songs, ensuring that the service can recognize the latest music.
Why does the speaker mention using a MongoDB database for storing song data?
-The speaker mentions MongoDB because it uses a B-tree for storage, which is beneficial for read performance, and as a document store, it provides good data locality, which is useful for fetching large documents containing song data and fingerprints.
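To make the data-locality argument concrete, a song document could bundle the metadata and the full ordered fingerprint list so that a single read serves the matcher. The schema below is purely illustrative; the field names and values are invented for the example:

```python
song_document = {
    "_id": 18234077,            # song ID, as referenced by the inverted index
    "title": "Example Song",
    "artist": "Example Artist",
    "duration_secs": 214,
    # Ordered (hash, anchor offset) fingerprints stored inline, so one
    # document fetch returns everything the matching service needs.
    "fingerprints": [
        {"hash": 0x6BCAE096, "offset": 0},
        {"hash": 0x1F30B2A4, "offset": 3},
        # ...a few thousand more entries per song
    ],
}
```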
Outlines
😀 Introduction and Overview
The speaker welcomes everyone back to the channel, mentioning their current hairstyle and a personal anecdote about wearing underwear. They introduce the topic of Shazam, a music recognition service, and explain its relevance despite its complexity. The video aims to explore the functional requirements and algorithms involved in such a system.
📊 Functional Requirements and Capacity Estimates
The speaker outlines the basic functional requirements of the music recognition service, which involves users recording audio clips to be analyzed by the system. They discuss capacity estimates, assuming 100 million songs with 3,000 indexable keys per song, resulting in a 2.4 terabyte index. The need for sharding to handle such a large index is highlighted.
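For reference, the arithmetic behind that 2.4-terabyte figure, as a quick sanity check using the video's own assumptions:

```python
# Back-of-the-envelope index sizing from the video's assumptions.
NUM_SONGS = 100_000_000   # 100 million songs under consideration
KEYS_PER_SONG = 3_000     # indexable fingerprint keys per song
BYTES_PER_KEY = 8         # each key is 64 bits

index_bytes = NUM_SONGS * KEYS_PER_SONG * BYTES_PER_KEY
print(f"{index_bytes / 1e12:.1f} TB")  # -> 2.4 TB
```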
📂 Database Design and Search Problem
The problem is framed as a search issue for databases. Two main components are required: an index for client-side audio clips and a database of audio files. The speaker begins to discuss an algorithm for music recognition, emphasizing that while interviewers may not expect detailed knowledge, understanding distributed systems is valuable.
🔍 Music Recognition Algorithm
The speaker delves into the algorithm for Shazam-like services, focusing on converting audio data into a format that matches existing data in the database. They introduce spectrograms and constellation maps, which simplify the search space by reducing 3D audio graphs to 2D scatter plots. This reduction aids in matching songs despite variations.
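The video stays at the whiteboard level, but the spectrogram-to-constellation step might be sketched like this with SciPy; the function name, neighborhood size, and amplitude threshold are illustrative choices, not values from Shazam's paper:

```python
import numpy as np
from scipy.ndimage import maximum_filter
from scipy.signal import spectrogram

def constellation_map(samples, sample_rate, neighborhood=20, min_power=1e-6):
    """Reduce raw audio to (time, frequency) peaks -- the 2D constellation map."""
    freqs, times, power = spectrogram(samples, fs=sample_rate)
    # Keep a point only if it is the loudest in its local time/frequency
    # neighborhood and above the noise floor; everything else is discarded.
    is_local_max = maximum_filter(power, size=neighborhood) == power
    peak_indices = np.argwhere(is_local_max & (power > min_power))
    return [(times[t], freqs[f]) for f, t in peak_indices]
```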
🧮 Combinatorial Hash and Song Indexing
The explanation continues with the concept of combinatorial hashing. By pairing frequency points and recording the time delta between them, the system creates compact tuple identifiers for audio clips. This method helps align user clips with database entries, facilitating efficient song identification. The hashing strategy is described in detail.
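Tying that to the 32-bit layout described later in the video (10 bits per frequency, 12 bits for the time delta), here is one minimal packing sketch, assuming the inputs are already quantized to those ranges:

```python
def pack_fingerprint(anchor_freq: int, target_freq: int, time_delta: int) -> int:
    """Pack an (anchor, target, delta) tuple into a single 32-bit key.

    Assumes frequencies are quantized to 10 bits (0..1023) and the time
    delta to 12 bits (0..4095). The anchor occupies the top bits, which
    later makes anchor-based shard routing straightforward.
    """
    assert 0 <= anchor_freq < 1024
    assert 0 <= target_freq < 1024
    assert 0 <= time_delta < 4096
    return (anchor_freq << 22) | (target_freq << 12) | time_delta

# e.g. two peaks 150 time bins apart:
key = pack_fingerprint(anchor_freq=431, target_freq=702, time_delta=150)
```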
💾 Index Storage and Sharding
Discussion moves to index storage, emphasizing the need for memory-based solutions for fast access. Sharding is necessary to handle the large index, and consistent hashing helps distribute the data efficiently. The importance of parallelizing requests to achieve low latency is highlighted.
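Because the anchor frequency sits in the top bits of each key, all of a clip's hashes that share an anchor can be routed to one shard and fetched in a single request. A minimal sketch follows, with plain modulo standing in for the consistent hashing the video recommends:

```python
NUM_SHARDS = 16  # illustrative; sized so each shard's slice of the index fits in RAM

def shard_for(fp_hash: int) -> int:
    """Route a 32-bit fingerprint by its anchor frequency (the top 10 bits)."""
    anchor = fp_hash >> 22
    return anchor % NUM_SHARDS

def group_by_shard(clip_hashes: list) -> dict:
    """Batch a clip's hashes per shard, so each shard gets one parallel lookup."""
    batches: dict = {}
    for h in clip_hashes:
        batches.setdefault(shard_for(h), []).append(h)
    return batches
```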
🚀 Matching and Processing
The speaker explains the second phase of the process: matching user clips with potential songs. This involves significant CPU work and parallelized computations to handle the large dataset. Strategies for optimizing this phase, such as using caches and sharding, are discussed to ensure fast and accurate song identification.
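The video describes scoring candidates by how well their fingerprints "line up" with the clip. The standard way to make that precise, and the alignment trick from Shazam's paper, is: for each shared hash, take the difference between its offset in the song and its offset in the clip; a true match yields many identical differences. A minimal sketch:

```python
from collections import Counter

def alignment_score(clip_fps, song_fps) -> int:
    """clip_fps / song_fps: iterables of (hash, time_offset) pairs.

    Returns the largest number of hash matches agreeing on one clip-to-song
    time shift; this is only high when the clip genuinely aligns with the song.
    """
    offsets_by_hash: dict = {}
    for h, song_offset in song_fps:
        offsets_by_hash.setdefault(h, []).append(song_offset)

    shift_votes = Counter()
    for h, clip_offset in clip_fps:
        for song_offset in offsets_by_hash.get(h, []):
            shift_votes[song_offset - clip_offset] += 1
    return max(shift_votes.values(), default=0)

def best_match(clip_fps, candidates: dict) -> int:
    """candidates: song_id -> fingerprint list. In the real system these
    scores would be computed in parallel across database shards."""
    return max(candidates, key=lambda sid: alignment_score(clip_fps, candidates[sid]))
```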
📅 Handling New Songs and Updates
The final part addresses the need for regular updates to the database and index as new songs are created. A batch job using a Hadoop cluster and Spark is proposed to handle these updates efficiently. The speaker emphasizes the importance of maintaining up-to-date data for accurate recognition.
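A rough shape for that batch job in PySpark; the loader, fingerprinting, and writer steps are stubbed out, since the video only specifies "a daily Spark job on a Hadoop cluster":

```python
from pyspark.sql import SparkSession

def fingerprint_song(song_id):
    """Stub: in the real job this runs the audio fingerprinting pipeline."""
    return [(0x6BCAE096, 0), (0x1F30B2A4, 3)]  # placeholder (hash, offset) pairs

spark = SparkSession.builder.appName("fingerprint-new-songs").getOrCreate()
new_song_ids = [18234077, 18234078]  # stub: would come from an uploads feed

postings = (
    spark.sparkContext.parallelize(new_song_ids)
    # Fingerprint each new song in parallel across the cluster.
    .flatMap(lambda sid: [(h, (sid, off)) for h, off in fingerprint_song(sid)])
    # Group by hash so each inverted-index shard can be appended to in bulk.
    .groupByKey()
)

for fp_hash, entries in postings.collect():
    # Real job: append `entries` to the index shard owning fp_hash, and
    # upsert the corresponding song documents into MongoDB.
    print(hex(fp_hash), list(entries))
```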
🏁 Conclusion and Practical Considerations
The speaker wraps up by formalizing the process with a diagram. They describe the client-server interaction, from recording an audio clip to matching it with database entries. The use of Redis and MongoDB for efficient data handling is justified. The video concludes with practical considerations and personal remarks about the recording environment.
Keywords
💡Jufrow
💡Shazam
💡Distributed Systems
💡Fingerprinting
💡Spectrogram
💡Constellation Map
💡Combinatorial Hash
💡Inverted Index
💡Sharding
💡Latency
💡High Availability
💡Caching
Highlights
Introduction to the concept of building a music recognition service like Shazam.
The speaker's personal anecdote about wearing underwear to celebrate the importance of the recording.
Discussion of the distributed systems problem inherent in music recognition technology.
Functional requirements for the music recognition service, including audio clip input and song suggestion output.
Capacity estimates involving 100 million songs and 3,000 indexable keys per song.
The creation of a 2.4 terabyte index for song recognition and the need for sharding.
The API design for the music recognition service, focusing on song finding with audio recording as a parameter.
Database design considerations for efficient music search and retrieval.
Algorithmic approach to music recognition, including audio data conversion and matching.
Use of spectrograms to represent audio and Shazam's optimization technique.
Introduction of constellation maps for simplifying audio data for search purposes.
The concept of combinatorial hash and its role in reducing search space for music recognition.
Method of using pairs of frequencies with time delta to create a robust song index.
Sharding strategy for the large index to enable in-memory processing and fast matching.
Parallelization of requests to different shards of the index for efficient song matching.
The secondary phase of processing to determine the best song match from potential candidates.
Caching popular songs' fingerprints to expedite the matching process.
Handling new song uploads with batch jobs to update the index and database regularly.
Diagrammatic representation of the music recognition service architecture from client to database.
Use of MongoDB for storing songs due to its B-tree indexing and document store advantages.
Implementation of a Hadoop cluster with Spark jobs for updating the song index and database.
The speaker's humorous commentary on the complexity of the problem and its unexpected nature in interviews.
Transcripts
Hello everybody, and welcome back to the channel. As you can see, my jufrow is in full effect today and I refuse to get a haircut, so it's probably going to stay like this for a hot second. This is a pretty important recording for me today, and I've actually put on some underwear to celebrate, which is a huge accomplishment for me because I don't like getting up from my desk very often. Anyways, today let's talk about Shazam. This is one of those services or problems where, if an interviewer actually asks you this, they're probably smoking crack, but at the same time it's genuinely an interesting distributed systems problem, and I feel like there are still things we can talk about in addition to the core algorithm that are exemplary of stuff you could be quizzed on. So yeah, I think it's still useful overall. But, you know, if you're just here to learn a little bit, that's what I did as well, and it's pretty interesting, so let's get into it.

Okay, so the functional requirements of our service are pretty simple. Basically, all we're going to be doing is building some sort of music recognition service, which means we're going to allow a user to take in an audio clip through something like their phone microphone; they'll send that to our servers, and afterwards we're going to return a suggestion of what we think the song they just heard is.
As far as capacity estimates go, this is going to be a little bit backwards, the reasoning being that I'm going to mention a key term here that we'll cover in a bit; it'll make a little more sense later. Basically, I'm going to assume that there are 100 million songs under consideration, and then, per song, let's imagine there are 3,000 indexable keys. I'll touch on this later; what I really mean is something known as fingerprints, but again, we'll speak about that later. If each of those indexable keys is 64 bits, then we can say that we're going to have to create an index that is 2.4 terabytes. So obviously that's going to be pretty big. We can't fit it all on one machine, or I guess we could with a hard drive, but if we plan on doing any sort of processing in memory, we're going to have to do sharding. And in addition, 2.4 terabytes is a lot anyway; you would probably want to do some sharding regardless, because that index obviously has the ability to grow.
The API for this service is pretty quick: it's just that we can go ahead and find a song, and the one parameter it takes in is the recording of a song. So let's go ahead and get into the database design.

Well, if you think about this problem as a search problem for databases, then there are really just two things you would need. The first is some sort of index that takes in our client-side clip of audio and spits out some potential ideas of the audio that could be playing, and the second is probably going to be some overarching database of all the audio files, or something pertaining to the audio, that we can search afterwards and return back to our user.
Okay, so let's talk about an algorithm for a music recognition service like Shazam. Like I said earlier, it's completely absurd for any interviewer to ever expect you to know something like this, because it's quite literally a multi-hundred-million-dollar idea that they would be expecting you to come up with on the spot. But I think it would be somewhat reasonable if they gave you a bunch of hints about what was going on and then said, "Well, now that we have this algorithm, how would we actually support it in a distributed environment?" So anyways, let's get into it. We know that our entire goal is basically to take a sample of audio data and convert it to a format where we can match it against the existing audio data that we have in our databases. And of course, this needs to be done with a service that is both highly available and has very low latency; that's going to be a requirement in any sort of systems design interview.

So before we can really start getting into the algorithm, we have to talk about how audio files work in general.
Audio can be represented by something known as a spectrogram, which is a 3D graph, and I'll put one of these up on screen so you guys can follow along. A spectrogram basically has three parameters: the time at which you heard a specific part of the audio, the frequency of the audio wave, and additionally the amplitude, which represents something like the volume.

Shazam does a clever optimization here, and this is something they've actually written in their paper; I'm not making it up on the spot. Basically, they take all of the peaks at certain amplitudes within this 3D graph and reduce it down to a 2D graph, creating something they call a constellation map: a 2D scatter plot of all the points that were at really high amplitude, plotted over time and frequency. This allows us to simplify our search space pretty considerably, and it also allows us to account for variations between the actual audio clip that we might have in our database and the audio clip that a user plays for us. Obviously there's going to be a lot of variation between those two, due to things like device compression, external noise, and a bunch of other factors, and as a result, reducing this complicated 3D graph down to a simpler 2D one gives us a better chance of matching these songs together.
So now that we have these 2D graphs, what can we actually do to start creating a song index and make it so we can search songs quickly? Well, this is where something known as a combinatorial hash comes in, and Shazam does something like the following. One option would be that, for every single point on that constellation graph, we go through all of those frequencies at a given time and ask which songs have all of these frequency points at that time. But there are a couple of issues with that. The first is that frequencies basically represent something like the note of the song you might hear, so if you're just looking at one frequency and trying to narrow down your field of songs based on which songs contain that frequency, most songs are probably going to have it. It's like asking which songs don't contain the note C; when you're looking at all of music, they probably all do.

So what Shazam does instead, to narrow down the space, is look at pairs of notes, and this has a bunch of really useful properties. Basically, you take two frequencies, that is, two of those points on the graph, where the first one comes before the second one in time; you call the first one the anchor point, and you create a tuple consisting of the anchor frequency, the second frequency, and the amount of time offset between them.
Now, this is really useful because one issue with having to create an index like this is that we don't know where the user's clip starts within the song. My song could be three minutes long, but I might be playing an audio clip from my phone that starts a minute in, so we have to be able to align all of these clips. Using pairs of frequencies with the time delta between them is super useful because, if we split the user audio into a bunch of different pairs of points, where we have the anchor point, then the second point, then the time between them, and let's say, as I estimated earlier, there are 3,000 indexable keys per song, where each key is one of those tuples, then all we have to do is figure out the song in the database with the most alignments between the user clip and that song. And we can genuinely get a lot of alignment because, like I said, we reduced this complicated 3D graph with a lot of potential for noise into a simpler 2D graph, which gets rid of a lot of the variation that comes from having to use client-side audio through a microphone.

So basically, here's what's going to happen: for our, let's say, 30-second user clip of audio, we're going to create a bunch of these anchor point / secondary point / time delta tuples; it could be up to something like a thousand, and in theory we can really get a lot of these hashes. The hash is basically going to be comprised of that tuple. Let's say we want a 32-bit hash: we can use something like 10 bits for the first frequency, 10 bits for the second frequency, and then 12 bits for the time delta between them.
And so now, for our given user clip, we have all of these different hashes that we can look up in the actual index, and that's what we're going to do. We have an inverted index that takes all of these hashes, or fingerprints, of all of the existing songs that we know and says: for this hash right here, here are all the song IDs that contain that exact hash. Then we can look up all of the hashes in our user clip, find all of the potential songs we have to compare against, and say, "Okay, now we have all these potential songs where at least one hash matches; what is the best match we can get when we look at all of the hashes in our user clip?" So it's a two-step process: we use all of the hashes individually in this index search, where we get a bunch of possible song IDs, and then we narrow all of the potential song IDs down to just one by comparing all of the hashes in our user clip and seeing how well they line up with some sequence of hashes within each candidate song.

And that's basically the algorithm right there. It's not overly complicated, but the reason I like this problem is that now, once we have this idea of first hitting this big inverted index and then pulling a bunch of things from a database and doing a bunch of processing to figure out which song matches best, there's a lot of CPU work to do there, and it's not easy to speed up. So how can we actually go ahead and do that?
Well, for starters, as I mentioned earlier in our capacity estimates section, our index is probably going to be on the order of terabytes; I estimated 2.4 terabytes. If we really wanted super fast matching, in an ideal world we'd be able to use memory with a hash map, where the key is that 32-bit tuple I mentioned earlier and the value is a list of song IDs held in the database. But on commodity servers these days it's rare to have more than 256 gigabytes of RAM, and as a result we're going to have to shard our index in order to really keep everything in memory. We could put things on disk, but then we're limiting ourselves to pretty slow performance, because we'd basically have to perform a binary search for all of the corresponding hashes. We've spoken about this in the past: you're basically limited to logarithmic time complexity when using a disk-based solution, whereas with memory, not only are you getting faster accesses just because it's in memory, but you also have a hash table, which gives you constant-time accesses. So that's probably the ideal way of storing this index.
That being said, we have to come up with a good way of actually sharding out our index. One nice property of the hashes we've created is that the 32-bit key starts with some encoding of the anchor point. Since we're going to have a bunch of hashes with the same anchor point, this lets us send a bunch of requests for one anchor point, but many different secondary points, all to the same search index node, which is really useful because consistent hashing will make sure that all of those keys with the same anchor point probably land on the same node. Beyond that, we're going to have to parallelize the rest of our requests, because they may have a different anchor point in the tuple; I'll make sure to draw this out so it makes sense how we're actually looking up all of those hashes. Either way, we're probably going to be making a bunch of different calls to different shards of that index, and in an ideal world, because they're parallelized, they'll all be decently fast and return to our server quickly enough.
But even once we do that, even if we're able to achieve low latency in the index alone, we still have to do this entire secondary phase of processing, where we check out all of the possible songs and calculate which one is the actual best match. So now we have another problem where things have to be fast. Basically, the only ways I can think to do it are these. Either you keep all of those songs in memory, which would be really useful because, in theory, keeping all of the fingerprints for every single song should be the exact same size as the inverted index, just stored in a different format, and you could shard it as well and still use memory. Or, and this might be a little more practical, for the popular songs, since at a given time a lot of people are going to be trying to figure out what one particular song is, you can just cache the sequence of fingerprints for that song, which should greatly speed up a lot of those operations and requests to the database. But either way, because you're going to have multiple terabytes of song data to store, you're going to have to parallelize these computations, and in fact you should, because all of those songs are separate entities: you can compute the similarity between the user audio clip and a given song's audio clip independently. So it should be totally fine if you're making a bunch of parallel requests to different shards of the database, and then, on some sort of aggregation server, you say, "Okay, which one of these songs matched our user-uploaded clip the closest?"
Then the final piece of the puzzle is that, every once in a while, new songs are going to be created, recorded, and uploaded, and as a result we're going to need some recurring batch job that takes in those new songs, computes their fingerprints according to the algorithm I mentioned earlier, places them in our index, and also places them in our database itself.
Okay, so as always, let's formalize all of that with a diagram. Imagine we have a client who wants to figure out what they're currently listening to over the radio. The first thing they're going to do, after recording that clip on their phone, is hit a load balancer and upload the clip to the recognition service. The recognition service is then going to say, "Okay, I now have all of these fingerprints for the user-uploaded clip; let's look them all up at the same time, in parallel, in our fingerprint index," which in an ideal world we can keep in a memory-based database solution such as Redis. Once it gets back all those lists of potential song IDs, the next thing it has to do is reach out to the matching service with that aggregated list. The matching service can then say, "Okay, I have all these song IDs that I need to get the fingerprints of." If they're in the fingerprint cache, great, that's going to speed up some of the requests; if not, it has to fetch them from the database.
You'll see that I used a MongoDB database for the songs here, and there are a couple of reasons for that. The first is that it still uses a B-tree even though it's NoSQL, which is nice because B-trees, in theory at least, should be faster for reads than an LSM tree. Additionally, another nice thing about MongoDB is that, because it is a document store, you should get really nice data locality, which should allow you to fetch the huge document comprising one song ID and all of its corresponding fingerprints in one go. Those documents can get pretty big, but as a result of that data locality, hopefully the reads will be a little bit faster. And then the final component of the puzzle is having some sort of Hadoop cluster which, as new songs are created and published, will use a daily Spark job to update both the index and the database to account for all of the new songs.
God damn, as you can see, I'm already starting to sweat, which is believable, but yeah, I turned my air conditioner off for like five minutes and this is what happens in this new place. I guess every time I record I'm just going to be sweaty, but it is what it is. So anyways, guys, I hope you enjoyed this one. It's an interesting problem. Like I said, I wouldn't expect that anyone would ever ask you this, but I don't think it's unfair if someone gave you, I don't know, a ten-minute primer on how those keys are created for a specific song, plus the process of how you would figure out the match once you have the fingerprints of the user clip and the fingerprints of the potential candidate songs, and then said, "How would you distribute this?" That's decently fair, but I just can't see it happening. I've just heard of this problem as a systems design problem, and, you know, I like to cover it; that's what I do. So, uh, yeah.