Shazam Audio Recognition Design Deep Dive with Google SWE! | Systems Design Interview Question 23

Jordan has no life
18 Aug 2022 · 18:09

Summary

TL;DR: In this video, the creator discusses the complexities of building a music recognition service like Shazam, an intriguing distributed systems problem. They delve into the functional requirements, capacity estimates, and database design, highlighting the use of spectrograms and constellation maps for audio fingerprinting. The video explores the algorithmic approach for song matching, including the use of combinatorial hashes and inverted indexes for efficient search. It also touches on the challenges of scaling the service and suggests solutions like sharding, caching popular songs, and parallel processing for handling large datasets.

Takeaways

  • The speaker humorously introduces their channel and personal habits, setting a casual tone for the video.
  • The video discusses the problem of designing a music recognition service like Shazam, which is an interesting distributed systems challenge.
  • The speaker provides functional requirements for the service, including the ability to take an audio clip and return song suggestions.
  • Capacity estimates are given, with 100 million songs and 3,000 indexable keys per song, leading to a 2.4 terabyte index size.
  • The concept of 'fingerprints' is introduced as a way to represent audio data in a simplified 2D graph, which is crucial for matching songs.
  • The algorithm for music recognition involves creating a constellation map from audio data and using combinatorial hashes to match audio clips to songs.
  • The use of pairs of frequencies with a time delta between them is highlighted as a key optimization for creating a robust song index.
  • The video outlines a database design involving an index for quick lookup and a database for storing audio files and their metadata.
  • The importance of sharding the index for scalability and in-memory storage for performance is discussed.
  • The need for parallelization in processing user requests and updating the index with new songs is emphasized.
  • The video concludes with a high-level overview of the system's architecture, including client interaction, load balancing, and batch processing for new songs.

Q & A

  • What is the main topic discussed in the video script?

    -The main topic discussed in the video script is the design and functioning of a music recognition service similar to Shazam, including its algorithm and distributed systems challenges.

  • Why is the speaker wearing underwear to celebrate the recording of this video?

    -The speaker humorously mentions wearing underwear to celebrate the recording as an unusual accomplishment, indicating their tendency to stay at their desk and not get up often.

  • What is the functional requirement of the music recognition service discussed in the script?

    -The functional requirement is to build a service that allows users to input an audio clip through a device like a phone microphone, and then the service returns a suggestion of the song that the audio clip is from.

  • What is the capacity estimate for the index in the music recognition service?

    -The capacity estimate for the index is 2.4 terabytes, based on the assumption of 100 million songs with 3,000 indexable keys each, where each key is 64 bits.

  • What is a spectrogram and how is it used in the context of the music recognition service?

    -A spectrogram is a 3D representation of audio with axes for time, frequency, and amplitude. In this service it is reduced to a 2D graph of its amplitude peaks, known as a constellation map, which simplifies the search space and makes audio clips easier to match.

  • What is a combinatorial hash in the context of the music recognition algorithm?

    -A combinatorial hash in this context combines a pair of constellation-map points into a single key: the anchor frequency, a second frequency, and the time offset between them form a tuple that is packed into a fixed-width hash and used for efficient song matching.

  • Why is sharding necessary for the index in the music recognition service?

    -Sharding is necessary because the index is too large to fit in a single machine's memory, and sharding allows for parallel processing and scalability as the index grows.

  • What is the role of an inverted index in the music recognition service?

    -The inverted index in the music recognition service maps the combinatorial hashes (fingerprints) to song IDs, allowing for quick lookup of potential song matches from a user's audio clip.

  • How does the music recognition service handle the processing of potential song matches?

    -The service uses parallelization to compute the similarity between the user's audio clip and potential song clips, possibly caching popular songs for faster access, and then aggregates the results to determine the best match.

  • What is the purpose of the batch job mentioned in the script for the music recognition service?

    -The batch job is responsible for periodically updating the index and database with new songs, ensuring that the service can recognize the latest music.

  • Why does the speaker mention using a MongoDB database for storing song data?

    -The speaker mentions MongoDB because it uses a B-tree for storage, which is beneficial for read performance, and as a document store, it provides good data locality, which is useful for fetching large documents containing song data and fingerprints.
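
To make the data-locality point concrete, here is a minimal sketch of what such a song document and its single-read fetch could look like, assuming pymongo; the database, collection, and field names are illustrative and not taken from the video.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
songs = client["shazam"]["songs"]          # database/collection names are illustrative

# One document per song keeps metadata and the full fingerprint sequence
# together, so a single read gives the matcher everything it needs.
example_doc = {
    "_id": 42,                             # song id referenced by the inverted index
    "title": "Example Song",
    "artist": "Example Artist",
    "fingerprints": [                      # (hash, offset in seconds into the song)
        {"hash": 439041101, "t": 1.37},
        {"hash": 725432910, "t": 1.58},
    ],
}
songs.replace_one({"_id": example_doc["_id"]}, example_doc, upsert=True)

def fetch_song_fingerprints(song_id):
    """Fetch one candidate song's fingerprints in a single, local read."""
    doc = songs.find_one({"_id": song_id}, {"fingerprints": 1})
    return doc["fingerprints"] if doc else []
```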

Outlines

00:00

Introduction and Overview

The speaker welcomes everyone back to the channel, mentioning their current hairstyle and a personal anecdote about wearing underwear. They introduce the topic of Shazam, a music recognition service, and explain its relevance despite its complexity. The video aims to explore the functional requirements and algorithms involved in such a system.

05:00

Functional Requirements and Capacity Estimates

The speaker outlines the basic functional requirements of the music recognition service, which involves users recording audio clips to be analyzed by the system. They discuss capacity estimates, assuming 100 million songs with 3,000 indexable keys per song, resulting in a 2.4 terabyte index. The need for sharding to handle such a large index is highlighted.
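
The 2.4 terabyte figure follows directly from those assumptions; a quick back-of-the-envelope check:

```python
# Index sizing from the video's stated assumptions.
songs = 100_000_000          # catalogue size
keys_per_song = 3_000        # indexable fingerprint keys per song
bytes_per_key = 8            # each key is 64 bits

index_bytes = songs * keys_per_song * bytes_per_key
print(index_bytes / 1e12, "TB")   # 2.4 TB, far more than one commodity server's RAM
```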

10:02

Database Design and Search Problem

The problem is framed as a database search problem. Two main components are required: an index that maps client-side audio clips to candidate songs, and a database of the audio files themselves. The speaker begins to discuss an algorithm for music recognition, emphasizing that while interviewers may not expect detailed knowledge of it, the distributed-systems reasoning around it is valuable.

15:03

๐Ÿ” Music Recognition Algorithm

The speaker delves into the algorithm for Shazam-like services, focusing on converting audio data into a format that matches existing data in the database. They introduce spectrograms and constellation maps, which simplify the search space by reducing 3D audio graphs to 2D scatter plots. This reduction aids in matching songs despite variations.
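
As a rough illustration of that reduction, the sketch below computes a spectrogram and keeps only local amplitude peaks to form a constellation map. It assumes SciPy is available, and the sample rate, window, and neighborhood parameters are arbitrary illustrative choices, not values from Shazam's paper.

```python
import numpy as np
from scipy.signal import spectrogram
from scipy.ndimage import maximum_filter

def constellation_map(samples, sample_rate=11025, neighborhood=(15, 15)):
    # Spectrogram: amplitude as a function of frequency (rows) and time (columns).
    freqs, times, amplitudes = spectrogram(samples, fs=sample_rate, nperseg=1024)
    # Keep a point if it is the maximum of its local neighborhood and louder
    # than the median, i.e. a "star" on the constellation map.
    local_max = maximum_filter(amplitudes, size=neighborhood) == amplitudes
    peaks = np.argwhere(local_max & (amplitudes > np.median(amplitudes)))
    return [(times[t_idx], freqs[f_idx]) for f_idx, t_idx in peaks]
```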

Combinatorial Hash and Song Indexing

The explanation continues with the concept of combinatorial hashing. By pairing an anchor frequency point with a nearby second point and recording the time delta between them, the system creates compact identifiers (tuples) for audio clips. Because the delta is relative, these hashes let a user clip be aligned with database entries regardless of where in the song the clip starts, enabling efficient identification. The hashing strategy, including a 32-bit packing of the tuple, is described in detail.
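
A minimal sketch of forming those tuples and packing them into the 32-bit key the video mentions (10 bits per frequency, 12 bits for the time delta); the quantization steps and the fan-out of pairs per anchor are assumptions made for illustration.

```python
def pack_fingerprint(f_anchor_hz, f_pair_hz, dt_seconds,
                     freq_step_hz=5.0, dt_step_s=0.01):
    f1 = min(int(f_anchor_hz / freq_step_hz), 1023)   # 10 bits: anchor frequency
    f2 = min(int(f_pair_hz / freq_step_hz), 1023)     # 10 bits: paired frequency
    dt = min(int(dt_seconds / dt_step_s), 4095)       # 12 bits: time delta
    return (f1 << 22) | (f2 << 12) | dt               # 32-bit combinatorial hash

def fingerprints(peaks, fanout=5):
    """peaks: (time, frequency) points; pair each anchor with the next few peaks."""
    peaks = sorted(peaks)
    out = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fanout]:
            out.append((pack_fingerprint(f1, f2, t2 - t1), t1))
    return out   # list of (hash, anchor time) pairs
```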

Index Storage and Sharding

Discussion moves to index storage, emphasizing the need for memory-based solutions for fast access. Sharding is necessary to handle the large index, and consistent hashing helps distribute the data efficiently. The importance of parallelizing requests to achieve low latency is highlighted.
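
One way to realize that routing, sketched under the assumption that the top bits of the packed key encode the anchor frequency: route on those bits so fingerprints sharing an anchor land on the same shard, with a tiny hash ring standing in for a real consistent-hashing setup. Shard names and the virtual-node count are illustrative.

```python
import bisect
import hashlib

class IndexRing:
    """Tiny consistent-hashing stand-in that routes packed fingerprint keys to shards."""
    def __init__(self, shard_names, vnodes=64):
        self._ring = sorted(
            (self._point(f"{name}-{i}"), name)
            for name in shard_names for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    @staticmethod
    def _point(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, packed_hash):
        anchor_bits = packed_hash >> 22                 # top 10 bits = anchor frequency
        idx = bisect.bisect(self._points, self._point(str(anchor_bits)))
        return self._ring[idx % len(self._ring)][1]

ring = IndexRing(["index-shard-0", "index-shard-1", "index-shard-2"])
print(ring.shard_for(0x1A2B3C4D))   # keys sharing an anchor resolve to the same shard
```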

Matching and Processing

The speaker explains the second phase of the process: matching user clips with potential songs. This involves significant CPU work and parallelized computations to handle the large dataset. Strategies for optimizing this phase, such as using caches and sharding, are discussed to ensure fast and accurate song identification.
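
The video describes choosing the song whose fingerprints "line up" best with the user clip. One common way to implement that check, offered here as a sketch rather than as Shazam's exact method, is to count how many matching hashes per candidate agree on the same offset between song time and clip time; a genuine match shows one dominant offset.

```python
from collections import Counter

def alignment_score(clip_prints, song_prints):
    """clip_prints / song_prints: lists of (hash, time) pairs."""
    song_times = {}
    for h, t in song_prints:
        song_times.setdefault(h, []).append(t)

    offsets = Counter()
    for h, t_clip in clip_prints:
        for t_song in song_times.get(h, []):
            # Quantize so nearby offsets vote for the same bucket.
            offsets[round(t_song - t_clip, 1)] += 1
    return max(offsets.values()) if offsets else 0

def best_match(clip_prints, candidates):
    """candidates: {song_id: fingerprint list}; returns the best-aligned song id."""
    return max(candidates, key=lambda sid: alignment_score(clip_prints, candidates[sid]))
```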

Handling New Songs and Updates

The final part addresses the need for regular updates to the database and index as new songs are created. A batch job using a Hadoop cluster and Spark is proposed to handle these updates efficiently. The speaker emphasizes the importance of maintaining up-to-date data for accurate recognition.
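
A minimal sketch of what that recurring job does per song, with plain dictionaries standing in for the inverted index and the song database; in the proposed design this logic would run inside the scheduled Spark job, and fingerprint_song() is a hypothetical wrapper around the peak-picking and hashing steps.

```python
def ingest_new_songs(new_songs, inverted_index, song_store, fingerprint_song):
    """new_songs: iterable of (song_id, metadata, samples) for newly released tracks."""
    for song_id, metadata, samples in new_songs:
        prints = fingerprint_song(samples)             # assumed to return [(hash, time), ...]
        song_store[song_id] = {"meta": metadata, "fingerprints": prints}
        for h, t in prints:
            # Inverted index entry: hash -> list of (song_id, time within the song).
            inverted_index.setdefault(h, []).append((song_id, t))
```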

๐Ÿ Conclusion and Practical Considerations

The speaker wraps up by formalizing the process with a diagram. They describe the client-server interaction, from recording an audio clip to matching it with database entries. The use of Redis and MongoDB for efficient data handling is justified. The video concludes with practical considerations and personal remarks about the recording environment.
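
Tying the phases together, the sketch below follows the request path implied by the diagram: phase one looks each clip hash up in a Redis-backed inverted index, and phase two fetches candidate fingerprints from a cache, falling back to the song database only on a miss. The key naming scheme and TTL are assumptions, not details from the video.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def candidate_song_ids(clip_hashes):
    """Phase one: union the song-id sets stored under each fingerprint key."""
    ids = set()
    for h in clip_hashes:
        ids.update(int(member) for member in r.smembers(f"fp:{h}"))
    return ids

def song_fingerprints(song_id, fetch_from_db):
    """Phase two: cache-aside read of a candidate song's fingerprint list."""
    cached = r.get(f"songprints:{song_id}")
    if cached is not None:
        return json.loads(cached)
    prints = fetch_from_db(song_id)                    # e.g. the MongoDB read sketched earlier
    r.set(f"songprints:{song_id}", json.dumps(prints), ex=3600)   # 1-hour TTL, arbitrary
    return prints
```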

Keywords

Jufrow

A playful term used in the video to describe the speaker's unkempt or wild hairstyle. The speaker mentions their 'jufrow' as part of their casual and humorous introduction, setting a relaxed and personal tone for the video.

Shazam

Shazam is a music recognition service that allows users to identify songs by capturing an audio sample. The speaker uses Shazam as an example to explore the complexities and challenges of designing distributed systems capable of handling large-scale audio processing and matching tasks.

Distributed Systems

Distributed systems refer to a network of interconnected computers that work together to achieve a common goal, such as processing data or running applications. In the video, the speaker discusses the design and challenges of creating a distributed system for a service like Shazam, emphasizing the need for high availability and low latency.

Fingerprinting

In the context of audio recognition, fingerprinting involves creating a unique digital representation of an audio clip based on its features like frequency and time. The speaker explains how Shazam uses audio fingerprints to match user-recorded clips with songs in its database, making this a core concept for understanding how the service functions.

Spectrogram

A spectrogram is a visual representation of the spectrum of frequencies in an audio signal as it varies with time. Itโ€™s used in the video to explain how audio can be broken down into time, frequency, and amplitude components, which are crucial for creating audio fingerprints in systems like Shazam.

Constellation Map

A constellation map simplifies a spectrogram by plotting only the highest amplitude points on a 2D graph. This reduction helps in efficiently matching audio fingerprints by focusing on significant audio features, as explained by the speaker while detailing Shazamโ€™s recognition algorithm.

Combinatorial Hash

A combinatorial hash is a unique identifier created by combining multiple features, such as pairs of frequency points and their time offsets in an audio clip. This hashing technique is used in Shazam to efficiently match and search for songs in its database, as discussed in the video.

Inverted Index

An inverted index is a data structure used to store a mapping from content, like audio fingerprints, to their locations in a database. In the context of Shazam, it allows quick lookup of potential song matches by indexing the fingerprints of songs, which the speaker highlights as crucial for fast and accurate recognition.

Sharding

Sharding involves partitioning a database into smaller, more manageable pieces called shards. The speaker discusses the necessity of sharding the large audio fingerprint index to distribute the data across multiple servers, ensuring the system can handle high volumes of data and requests efficiently.

Latency

Latency refers to the delay before a transfer of data begins following an instruction for its transfer. In the video, the speaker emphasizes the importance of minimizing latency in the music recognition service to provide quick responses to user queries, which is essential for a smooth and efficient user experience.

High Availability

High availability ensures that a system or service remains operational and accessible for a high percentage of time. The speaker discusses designing the Shazam-like system to be highly available, meaning it must be resilient and capable of handling failures without significant downtime.

Caching

Caching involves storing frequently accessed data in a location that allows for faster retrieval. In the video, the speaker suggests using caching to speed up the retrieval of fingerprints for popular songs, reducing the load on the database and improving overall performance.

Highlights

Introduction to the concept of building a music recognition service like Shazam.

The speaker's personal anecdote about wearing underwear to celebrate the importance of the recording.

Discussion of the distributed systems problem inherent in music recognition technology.

Functional requirements for the music recognition service, including audio clip input and song suggestion output.

Capacity estimates involving 100 million songs and 3,000 indexable keys per song.

The creation of a 2.4 terabyte index for song recognition and the need for sharding.

The API design for the music recognition service, focusing on song finding with audio recording as a parameter.

Database design considerations for efficient music search and retrieval.

Algorithmic approach to music recognition, including audio data conversion and matching.

Use of spectrograms to represent audio and Shazam's optimization technique.

Introduction of constellation maps for simplifying audio data for search purposes.

The concept of combinatorial hash and its role in reducing search space for music recognition.

Method of using pairs of frequencies with time delta to create a robust song index.

Sharding strategy for the large index to enable in-memory processing and fast matching.

Parallelization of requests to different shards of the index for efficient song matching.

The secondary phase of processing to determine the best song match from potential candidates.

Caching popular songs' fingerprints to expedite the matching process.

Handling new song uploads with batch jobs to update the index and database regularly.

Diagrammatic representation of the music recognition service architecture from client to database.

Use of MongoDB for storing songs due to its B-tree indexing and document store advantages.

Implementation of a Hadoop cluster with Spark jobs for updating the song index and database.

The speaker's humorous commentary on the complexity of the problem and its unexpected nature in interviews.

Transcripts

play00:00

hello everybody and welcome back to the

play00:02

channel as you can see my jufrow is in

play00:05

full effect today and i refuse to get a

play00:08

haircut so it's probably going to stay

play00:09

like this for a hot second this is a

play00:11

pretty important recording for me today

play00:13

and i've actually put on some underwear

play00:15

in order to celebrate which is a huge

play00:17

accomplishment for me because i don't

play00:18

like getting up from my desk very often

play00:21

anyways today let's talk about shazam

play00:24

this is one of those services or

play00:26

problems that if an interviewer asks you

play00:28

this they're probably smoking crack but

play00:30

at the same time it's genuinely an

play00:32

interesting distributed systems problem

play00:34

and i feel like there's still things we

play00:36

can talk about in addition to the core

play00:38

algorithm that are actually exemplary of

play00:40

stuff that you could be quizzed on and

play00:42

so yeah i think it's still useful

play00:43

overall but um you know if you're just

play00:45

here to learn a little bit that's what i

play00:47

did as well and it's pretty interesting

play00:49

so let's get into

play00:56

it okay so the functional requirements

play01:00

of our service are pretty simple

play01:02

basically all we're going to be doing is

play01:03

building some sort of music recognition

play01:05

service which means that we're going to

play01:07

allow a user to take in an audio clip

play01:09

into something like their phone

play01:10

microphone and then you know they'll

play01:12

send that to our servers and afterwards

play01:14

we're going to return them a suggestion

play01:17

of what we think the song that they just

play01:19

heard is

play01:27

as far as capacity estimates go this is

play01:30

going to be a little bit backwards the

play01:32

reasoning for that being that i'm kind

play01:34

of going to mention a key term here that

play01:37

we'll cover in a bit but it'll make

play01:39

sense a little bit more later

play01:41

basically i'm going to assume that there

play01:42

are 100 million songs that we're you

play01:45

know having under consideration

play01:47

and then basically per song let's

play01:49

imagine that there are 3 000 indexable

play01:52

keys and i'll touch upon this later what

play01:54

i really mean is something known as

play01:56

fingerprints but

play01:58

again we'll speak about that later and

play01:59

if each of those indexable keys is 64

play02:02

bits then we can basically say that

play02:04

we're going to have to create an index

play02:06

that is 2.4 terabytes so obviously

play02:09

that's going to be pretty big we can't

play02:11

fit it all on one machine or i guess we

play02:13

could with a hard drive but if we plan

play02:15

on doing any sort of processing in

play02:17

memory we're going to have to be doing

play02:18

sharding and in addition 2.4 terabytes

play02:21

is a lot anyway you would probably want

play02:22

to do some sharding regardless because

play02:24

that obviously has

play02:26

an ability to grow

play02:35

the api for this service is pretty quick

play02:37

it's just gonna be that we can go ahead

play02:39

and find a song and the one parameter

play02:41

that would take in is the recording of a

play02:43

song so let's go ahead and get on into

play02:45

the database design

play02:54

well if you think about this problem as

play02:56

kind of a search problem for databases

play02:59

then there's really just going to be two

play03:00

things that you would need the first one

play03:02

is going to be some sort of index that

play03:04

allows us to take in our client-side

play03:06

clip of audio and spits out some

play03:09

potential ideas of audio that you know

play03:11

we could be playing and then the second

play03:13

thing is probably going to be some

play03:15

overarching database of all the audio

play03:17

files or something pertaining to the

play03:18

audio that we can go ahead and search

play03:20

afterwards and return back to our user

play03:30

okay so let's talk about an algorithm

play03:32

for music recognition service like

play03:34

shazam like i said earlier it's you know

play03:37

completely absurd for any sort of

play03:39

interviewer to ever expect you to know

play03:40

something like this because i mean it's

play03:42

quite literally a multi-100 million

play03:44

dollar idea that they would be expecting

play03:46

you to just come up with on the spot but

play03:48

i think it would be somewhat reasonable

play03:50

if they kind of gave you a bunch of

play03:51

hints about what was going on and then

play03:53

said well now that we have this

play03:55

algorithm how would we actually kind of

play03:56

support this in distributed environment

play03:58

so anyways let's go ahead and get into

play04:00

it we know that our entire goal is

play04:03

basically to take a sample of audio data

play04:05

and convert it to a format where we can

play04:07

just go ahead and match that with you

play04:09

know existing audio data that we have in

play04:11

our databases and of course this needs

play04:13

to be done

play04:14

with the service that is both highly

play04:16

available and also

play04:18

has very low latency that's going to be

play04:20

a requirement in any sort of systems

play04:22

design interview

play04:24

so anyways before we can really start

play04:25

getting into the algorithm we have to

play04:27

talk about just how audio files work in

play04:30

general

play04:31

audio can be represented by something

play04:33

known as a spectrogram which is a 3d

play04:36

graph and i'll put one of these up on

play04:37

screen so you guys can follow along a

play04:39

spectrogram basically has three

play04:41

parameters

play04:42

it has the time that you know you heard

play04:44

a specific part of the audio it has the

play04:47

frequency of the wavelength of the audio

play04:49

and then additionally it has the

play04:51

amplitude which kind of represents

play04:52

something like the volume

play04:54

so shazam does kind of a clever

play04:56

optimization and this is something that

play04:58

they've actually written in their paper

play05:00

i'm just not making this up on the spot

play05:02

but basically they take all of the peaks

play05:05

at certain amplitudes within the graph

play05:08

in this 3d graph and they reduce this 3d

play05:10

graph down to a 2d graph and create

play05:13

something they call a constellation map

play05:15

where basically now we have a 2d scatter

play05:18

plot of all these points that were at

play05:20

really high amplitude and basically now

play05:22

they're just you know little scatters on

play05:25

um basically a plot of both the time and

play05:27

frequency so this allows us to simplify

play05:30

our search space pretty considerably and

play05:32

it also allows us to account for

play05:34

variations between things like the

play05:35

actual audio clip that we might have in

play05:37

our database and also the audio clip

play05:40

that a user might play for us obviously

play05:42

there's going to be a lot of variation

play05:43

that can happen between the two of those

play05:45

between things like you know device

play05:46

compression external noise and just a

play05:49

bunch of other factors and as a result

play05:50

of that kind of reducing this

play05:53

complicated 3d graph down to this more

play05:55

simple 2d one allows us to have a better

play05:58

chance to be able to match these songs

play06:00

together

play06:01

so now that we have these 2d graphs what

play06:03

can we actually do in order to kind of

play06:05

start dealing with creating a song index

play06:07

and making it such that we can search

play06:09

songs quickly

play06:10

well this is where something known as a

play06:13

combinatorial hash comes in and shazam

play06:15

goes ahead and does something like the

play06:17

following

play06:19

one option would be that every single

play06:21

point on that constellation graph we

play06:23

could go through all of those

play06:24

frequencies at a given time and we could

play06:27

say well what songs have all of these

play06:29

you know frequency points at a given

play06:30

time but there's a couple issues with

play06:32

that the first is that well frequencies

play06:35

are basically representing like the note

play06:37

of the song you might hear so if you're

play06:39

just looking at one frequency and trying

play06:41

to you know limit down your fields of

play06:43

songs based on which songs have that

play06:45

frequency most songs are probably going

play06:47

to have it it's like saying you know

play06:48

what songs don't have like the note c

play06:51

when you're looking at all of music they

play06:52

probably all do

play06:54

so what shazam does instead to limit

play06:56

down the space is they actually look at

play06:58

pairs of notes and this has a bunch of

play07:00

really useful properties

play07:02

so basically if you take two frequencies

play07:05

so two of those points on the graph

play07:08

and you take the first one which comes

play07:10

before the second one in time and you

play07:12

say the first one is what we'll call the

play07:14

anchor point

play07:16

basically create some sort of tuple

play07:18

where we have you know the anchor

play07:20

frequency the second frequency and then

play07:23

the last thing is going to be the amount

play07:25

of time and offset between them

play07:27

now this is really useful because you

play07:30

know one issue with having to create an

play07:32

index like this is we don't know where

play07:34

the user's clip is going to be played in

play07:36

terms of at what point in the song it

play07:38

starts right you know my song could be

play07:40

three minutes long but i might be

play07:42

playing an audio clip from my phone

play07:44

where it starts a minute in so we have

play07:46

to be able to align all of these clips

play07:48

so actually using pairs of these

play07:50

frequencies with the time delta between

play07:52

them is super useful because if we can

play07:55

find basically

play07:56

you know for our user audio if we split

play07:59

it into a bunch of different you know

play08:01

pairs of points where we have the anchor

play08:02

point then the second point the time

play08:04

between them let's say we have like i

play08:06

estimated earlier 3 000 indexable keys

play08:09

per song where the key is going to be

play08:11

one of those tuples then all we have to

play08:13

do is basically figure out the song in

play08:16

the database where we have the most

play08:18

alignments between the user clip and

play08:20

that song and we can get you know

play08:23

genuinely a lot of alignment because

play08:25

like i said we kind of reduce this

play08:26

complicated 3d graph with a lot of

play08:29

potential for noise into a more simple

play08:31

2d graph which you know gets rid of a

play08:34

lot of the variation that can occur from

play08:36

having to use client-side audio through

play08:38

a microphone so basically now here's

play08:40

what's going to happen for our let's say

play08:42

30 second user clip of audio we're going

play08:45

to create a bunch of all of these you

play08:47

know anchor point secondary point you

play08:49

know time delta tuples between them you

play08:52

know it could be up to something like a

play08:53

thousand and in theory we can really get

play08:55

a lot of these hashes and the hash is

play08:58

basically going to be comprised of that

play09:00

tuple let's say you know we want a

play09:02

32-bit hash we can basically use

play09:04

something like 10 bits for the first

play09:06

frequency 10 bits for the second

play09:08

frequency and then 12 bits for the time

play09:10

delta between them

play09:12

and so now for our given user clip

play09:14

we have all of these different hashes

play09:16

that we can look up in the actual index

play09:19

and we're going to do that so if we have

play09:21

an inverted index where basically

play09:23

the inverted index also is taking all of

play09:26

these hashes or these fingerprints of

play09:29

all of the existing songs that we know

play09:31

and says you know for

play09:33

this hash right here here all the song

play09:35

ids that correspond to it or have that

play09:38

that exact hash

play09:40

then we can look up all of the hashes in

play09:42

our user clip find all of the potential

play09:44

songs that we have to you know look at

play09:47

to compare them and then say okay now we

play09:49

have all these potential songs where one

play09:51

hash is matching

play09:53

what is the best match we can get when

play09:55

we look at all of the hashes in our user

play09:57

clip so basically it's a two-step

play09:59

process we use all of the hashes

play10:01

individually in this kind of index

play10:04

search where we get a bunch of possible

play10:06

song ids and then we limit down all of

play10:08

the potential song ids to just one by

play10:11

comparing all of our hashes in our user

play10:13

clip of audio and seeing how well they

play10:16

line up with

play10:17

some basically sequence of hashes within

play10:20

all the candidate songs and so that's

play10:21

basically the algorithm right there it's

play10:23

not overly complicated but the reason

play10:26

that i kind of like this problem is

play10:28

because now once we have this idea of

play10:30

okay first we're hitting this you know

play10:32

big inverted index and then after that

play10:34

we have to pull a bunch of things from a

play10:35

database and do a bunch of processing in

play10:38

order to kind of you know figure out

play10:40

which song matches best there's a lot of

play10:42

cpu work to do there and it's not easy

play10:44

to speed up so how can we actually go

play10:46

ahead and do that well for starters as i

play10:49

mentioned earlier in our capacity

play10:50

estimates section

play10:52

our index is probably going to be in the

play10:54

order of magnitude of terabytes in fact

play10:56

i estimated two terabytes and so if we

play10:59

really wanted to basically have super

play11:01

fast you know matching

play11:03

in an ideal world we'd be able to use

play11:05

something like memory with a hash map

play11:06

where the hashmap is the key which is

play11:09

going to be that tuple that 32-bit thing

play11:11

i mentioned earlier and then the value

play11:13

is going to be a list of song ids that

play11:15

are held in the database

play11:17

but in commodity servers these days it's

play11:19

rare that you can have more than 256

play11:21

gigabytes of ram and as a result of that

play11:24

we're going to have to be able to shard

play11:25

our index in order to really get

play11:27

everything in memory we could put things

play11:29

on disk but then we're limiting

play11:30

ourselves to pretty slow performance

play11:32

because we're basically going to have to

play11:34

perform a binary search for all of the

play11:36

you know corresponding hashes we've

play11:38

spoken about this in the past you're

play11:40

really basically limited to logarithmic

play11:42

time complexity when using a disk based

play11:44

solution whereas with memory not only

play11:46

are you getting faster accesses just

play11:48

because it's in memory but you also have

play11:50

a hash table and that allows you to get

play11:52

constant time accesses so that's

play11:54

probably the ideal way of storing this

play11:56

index

play11:58

that being said we have to come up with

play11:59

a good way of actually sharding out our

play12:01

index

play12:03

one nice property of the actual hashes

play12:06

that have been created

play12:08

is that 32-bit key is going to start

play12:11

with you know some encoding of the

play12:13

anchor point and so since we're going to

play12:16

have a bunch of hashes with the same

play12:18

anchor point it's going to allow us to

play12:20

send basically a bunch of requests for

play12:23

you know one anchor point but then a

play12:25

bunch of different secondary points all

play12:27

to the same search index node which is

play12:29

really useful because consistent hashing

play12:31

will basically make sure that all of

play12:33

those keys with the same anchor point

play12:35

are probably going to be on the same

play12:36

node

play12:37

and then in addition to that we're

play12:39

probably going to have to parallelize

play12:41

the rest of our requests because they

play12:42

may have a different anchor point in

play12:44

that tuple and i'll make sure to kind of

play12:47

draw this out so it makes sense for you

play12:49

know how we're actually looking to get

play12:51

all of those hashes

play12:53

so either way one way or another we're

play12:55

probably going to have to be making a

play12:56

bunch of different calls to different

play12:59

shards of that index and in an ideal

play13:01

world because they're parallelized it'll

play13:03

all be decently fast and returned back

play13:05

to our server quickly enough

play13:07

but even once we do that even if we're

play13:09

able to achieve low latency just in that

play13:12

index alone we still now have to do this

play13:14

entire secondary phase of processing

play13:17

where we check out all the possible

play13:18

songs and calculate which one is the

play13:21

actual best match

play13:22

so

play13:23

now again we could basically come up

play13:25

with another problem where you know that

play13:27

has to be fast but

play13:29

basically the only way i can think about

play13:31

how you would do it is a either you keep

play13:34

all of those songs in memory

play13:36

which would be really useful because in

play13:38

theory

play13:39

keeping all of the fingerprints for

play13:41

every single song should be the same

play13:43

exact size as the inverted index it's

play13:45

just stored in different formats you

play13:47

could shard it as well and still use

play13:49

memory again or perhaps and this might

play13:51

be a little bit more practical

play13:53

is

play13:54

for the popular songs you know at a

play13:56

given time a lot of people are going to

play13:57

be wanting to figure out what one

play13:59

particular song is you can actually just

play14:01

go ahead and cache the sequence of

play14:03

fingerprints for that song and as a

play14:05

result it should greatly speed up a lot

play14:07

of those operations and requests to the

play14:09

database but either way no matter what

play14:12

happens because you're going to have

play14:13

multiple terabytes of data at a store in

play14:15

terms of songs

play14:16

you're going to have to parallelize

play14:18

these computations and in fact you

play14:20

should because all of those songs are

play14:22

separate entities and as a result you

play14:24

can perform the computation of

play14:25

similarity between the user audio clip

play14:27

and a given song audio clip separately

play14:30

so it should be totally fine if you know

play14:32

you're making a bunch of parallel

play14:33

different requests two different shards

play14:35

of the database and then on some sort of

play14:37

aggregation server you would go ahead

play14:40

and basically say okay which one of

play14:42

these songs matched the closest to our

play14:45

user uploaded clip

play14:46

then the final piece of the puzzle is

play14:48

probably just going to be that every

play14:50

once in a while there are going to be

play14:52

new songs that are created and uploaded

play14:54

and recorded and as a result of that

play14:56

we're going to need some reoccurring

play14:58

batch job that's going to take in those

play15:00

new songs

play15:01

get the fingerprints for them according

play15:03

to the algorithm that i mentioned

play15:04

earlier place them in our index and also

play15:07

place them in our database itself okay

play15:10

so as always let's formalize all of that

play15:12

with a diagram so basically imagine we

play15:15

have a client who wants to figure out

play15:17

what it is that they're currently

play15:18

listening to over the radio and the

play15:20

first thing they're going to do after

play15:21

recording that clip on their phone is

play15:23

hit a load balancer and upload that clip

play15:26

to the recognition service the

play15:28

recognition service is then basically

play15:30

going to go ahead and say okay

play15:32

i know now that i basically have all of

play15:35

these fingerprints for the user uploaded

play15:37

clip let's look them all up at the same

play15:40

time in parallel in our fingerprint

play15:42

index which in an ideal world we can

play15:44

keep in

play15:46

you know a memory based database

play15:47

solution such as redis once it basically

play15:50

gets all those lists of potential song

play15:53

ids the next thing it has to do is reach

play15:55

out to the matching service with that

play15:57

aggregated list and then the matching

play15:59

service can basically say okay i have

play16:01

all these song ids which i need to get

play16:03

the fingerprints of

play16:05

if they're in the fingerprint cache then

play16:06

great that's going to speed up some of

play16:08

the requests if not i have to fetch them

play16:10

from the database

play16:12

you'll see that i used a mongodb

play16:14

database for the songs here and there

play16:16

are a couple reasons for that the first

play16:18

thing is that a it still uses a b tree

play16:21

even though it's nosql which is nice

play16:24

because b trees in theory at least

play16:25

should be faster for reads than an lsm

play16:28

tree additionally

play16:30

another nice thing about the mongodb is

play16:32

that because it is a document store you

play16:34

should have really nice data locality

play16:36

which should allow you to fetch

play16:38

basically the huge document that is

play16:40

going to be you know one song id and all

play16:42

of the corresponding fingerprints for

play16:44

that so those documents can get pretty

play16:45

big but you know as a result of that

play16:47

data locality hopefully the reads will

play16:49

be a little bit faster and then the

play16:51

final component of the puzzle is

play16:52

basically just having some sort of

play16:54

hadoop cluster which as new songs are

play16:56

created and published we'll go ahead and

play16:58

use a daily spark job to go ahead and

play17:01

update both the index and the database

play17:03

to account for all of the new songs

play17:06

god damn as you can see i'm already

play17:08

starting to sweat which is

play17:10

believable but uh yeah turn my

play17:13

air conditioner off for like five

play17:14

minutes and this is what happens in this

play17:16

new place so i guess every time i record

play17:18

i'm just gonna be a sweaty

play17:20

but

play17:20

is what it is so anyways guys i hope you

play17:23

enjoy this one um

play17:25

it's an interesting problem like i said

play17:27

i wouldn't expect that anyone would ever

play17:29

ask you this but i don't think it's

play17:30

unfair if someone gave you like

play17:33

i don't know a 10 minute primer on

play17:36

how those keys are created for a

play17:38

specific song and additionally you know

play17:41

kind of the process of how you would

play17:42

figure out

play17:44

once you have the fingerprints of the

play17:46

user clip and also the fingerprints of

play17:48

you know the potential candidate songs

play17:51

um and then basically say how would you

play17:53

distribute this that's decently fair but

play17:56

i just can't see it happening i've just

play17:57

heard of this problem as a systems

play17:59

design problem and

play18:01

you know i like to cover it that's what

play18:03

i do hit them all

play18:04

dirty looked

play18:06

so uh yeah


Related Tags
Music Recognition, Shazam Algorithm, Distributed Systems, Audio Fingerprinting, Spectrogram Analysis, Database Design, Inverted Index, System Scalability, Latency Optimization, Audio Processing, Tech Tutorial