The Problem With UUIDs

Theo - t3․gg
7 May 202425:53

Summary

TLDRThe video script discusses the complexities and considerations of using Universally Unique Identifiers (UUIDs) as primary keys in MySQL databases. It highlights that while UUIDs ensure uniqueness across systems, they can negatively impact database performance due to increased storage requirements and the challenges of maintaining a balanced B+ tree structure with random UUIDs. The script also explores various UUID versions, with a focus on time-based versions like V1 and V6, and contrasts them with random variants like V4. It further delves into alternative ID types, such as Snowflake IDs and Nano IDs, which offer more structured approaches to generating unique identifiers. The video emphasizes the importance of choosing the right ID system for database architecture, considering factors like security, performance, and storage efficiency.

Takeaways

  • 📈 **Sponsor Acknowledgment**: The video is sponsored by Planet Scale, which also created a blog post that serves as a starting point for the discussion on UUIDs.
  • 🔄 **Multiple Attempts**: This is the third time the video has been filmed over two years, indicating the complexity and importance of accurately covering the topic.
  • 🤔 **Performance Concerns**: Using UUIDs as primary keys in MySQL can hurt database performance due to the way B+ tree indexes need to be updated and balanced with random values.
  • 📅 **UUID Versions**: There are five official and three proposed versions of UUIDs, each with different properties and use cases, highlighting the evolution and issues in the standard.
  • 🚫 **UUID v2 Issues**: UUID version 2 is rarely used because replacing the low time segment with a POSIX user ID increases the chances of collisions.
  • 🔢 **Randomness in UUID v4**: Version 4 UUIDs are almost entirely random, which can lead to storage and performance issues due to the randomness affecting the B+ tree balance.
  • 🔬 **Technical Details**: The video dives into the technical aspects of how UUIDs are structured and the implications of using different versions as primary keys.
  • 💾 **Storage Utilization**: Storing UUIDs requires significantly more storage space compared to auto-incrementing integers, which can impact database storage requirements.
  • 🚀 **Best Practices**: The video suggests best practices for using UUIDs, such as using the binary data type and considering ordered UUID variants to mitigate performance and storage issues.
  • 🛡️ **Security Considerations**: There's a discussion on the security implications of using sequential keys versus random UUIDs, including the risk of guessable IDs leading to unauthorized access.
  • 🌟 **Alternative ID Types**: The video mentions alternative ID types like Snowflake IDs, ULIDs, and Nano IDs, which can be considered depending on the specific requirements and constraints of a system.

Q & A

  • What is the main topic of the video?

    -The main topic of the video is the discussion of using UUIDs (Universally Unique Identifiers) as primary keys in MySQL databases and the associated performance implications.

  • Why did the video creator film the video multiple times?

    -The video creator filmed the video multiple times because they wanted to ensure they got all the details about UUIDs correct and felt that previous attempts did not meet their standards for accuracy and comprehensiveness.

  • What is the significance of October 10th, 1568, in the context of UUIDs?

    -October 10th, 1568, is significant because it marks the start of the Gregorian calendar, which is used as a reference point for the embedded timestamp within a UUID, specifically in version 1.

  • What are the potential issues with using UUIDs as primary keys in a MySQL database?

    -Using UUIDs as primary keys can lead to performance issues due to the randomness of the values, which can cause problems with index rebalancing in B+ trees, increased storage utilization, and potential page splitting that is less efficient than with sequential keys.

  • What is the difference between UUID version 1 and version 6?

    -UUID version 6 is nearly identical to version 1, but the key difference is that the bits used to capture the timestamps are flipped, meaning the most significant portions of the timestamp are stored first, which can make the UUIDs more sortable.

  • Why might someone choose to use a different identifier format other than UUIDs in a distributed system?

    -Alternative identifier formats like Snowflake IDs, UL IDs, or Nano IDs might be chosen over UUIDs due to their ability to provide a more efficient and sortable unique identifier that can mitigate some of the performance and storage issues associated with UUIDs.

  • What is the recommendation for storing UUIDs to reduce storage requirements?

    -Storing UUIDs in their native binary format as a binary(16) column can reduce the storage requirement down to 16 bytes, which is more efficient than storing them as a 36-character string.

  • null

    -null

  • Why might the use of UUID version 4 lead to excessive storage usage in database indexing?

    -UUID version 4 is randomly generated, which means that the values are not sequential. This randomness can lead to inefficient page utilization, with pages being only around 50% full, thus using significantly more storage space for the index.

  • What is the primary advantage of using time-based UUIDs like version 6 or 7?

    -Time-based UUIDs like version 6 or 7 can guarantee uniqueness while keeping the generated values as close to sequential as possible, which can help avoid some of the page splitting issues and improve performance in database operations.

  • What is the concern regarding the use of sequentially ordered UUIDs from a security perspective?

    -The concern is that sequentially ordered UUIDs can be guessed, which might lead to security vulnerabilities such as unauthorized password resets or unauthorized access if an attacker can predict the ID sequence.

  • What is the significance of the B+ tree data structure in the context of MySQL databases?

    -The B+ tree data structure is significant in MySQL databases because it is used to create indexes that allow for efficient querying of data. The structure keeps data organized in a way that enables quick searches, insertions, and updates, which is crucial for database performance.

  • What is the role of Planet Scale in this video?

    -Planet Scale is a sponsor of the video and has created a blog post that the video creator uses as a starting point for the discussion on UUIDs. The video creator emphasizes that Planet Scale has not influenced the content of the video but provided a resource that helped shape the discussion.

Outlines

00:00

🎬 Introduction to UUIDs and Sponsorship Acknowledgement

The video's introduction discusses the topic of UUIDs (Universally Unique Identifiers) and the challenges in creating a video about them. The speaker mentions filming the video multiple times due to the complexity of the subject. The video is sponsored by Planet Scale, who also authored a blog post on the problems with using a UID primary key in SQL, which serves as the basis for the discussion. The speaker clarifies that Planet Scale has not influenced the content and that their sponsorship is appreciated for its educational value.

05:01

🔑 UUIDs as Primary Keys and Their Impact on Database Performance

This paragraph delves into the potential issues with using UUIDs as primary keys in MySQL databases. It explains how UUIDs are designed for uniqueness across systems and the problems that can arise when they are used incorrectly, such as performance degradation. The speaker shares their preference for ordered IDs, like time-based UUIDs, for their usefulness in tracking creation dates. The paragraph also touches on the various versions of UUIDs, highlighting the differences and issues with each, including the proposed versions 6, 7, and 8.

10:04

📈 UUIDs vs. Auto Incrementing Integers in MySQL

The speaker contrasts UUIDs with auto incrementing integers in terms of their use as primary keys in MySQL. They discuss the performance implications of inserting new records and how MySQL's B+ tree indexing structure is affected by the randomness of UUIDs, which can lead to longer rebalancing times and impact user experience. The storage utilization is also compared, noting that UUIDs require significantly more storage space than integers, which can have a substantial impact on database size and performance.

15:05

🚀 Best Practices for Using UUIDs as Primary Keys

The paragraph outlines best practices for using UUIDs as primary keys to minimize negative side effects. It suggests using the binary data type to reduce storage requirements and considering ordered UUID variants to maintain a more sequential order, thus avoiding page splitting issues. The speaker also recommends considering other identifier types like Snowflake IDs, UL IDs, and Nano IDs, which can offer similar benefits with fewer drawbacks.

20:06

🛠️ UUIDs, Security, and Alternatives in Database Design

The speaker discusses the security implications of using sequential IDs versus random UUIDs, mentioning that guessable IDs can lead to unauthorized access but also stressing that system security should not rely solely on ID complexity. They highlight the importance of choosing the right ID type for database design, considering factors like performance, storage, and uniqueness. The paragraph concludes with a recommendation to explore alternatives to UUIDs, such as cuid V2, and a mention of a useful tool for assessing ID collision probabilities.

25:07

📝 Conclusion and Final Thoughts on UUIDs in Databases

In the concluding paragraph, the speaker summarizes the key points discussed in the video about the trade-offs of using UUIDs as primary keys in MySQL. They emphasize the availability of multiple UUID versions and alternative ID types that can address these trade-offs. The speaker expresses relief at having completed the video and thanks Planet Scale for their sponsorship, which provided valuable educational content on the topic.

Mindmap

Keywords

💡UUID (Universally Unique Identifier)

UUIDs are 128-bit numbers used to uniquely identify information in computer systems. They are designed to have a negligible chance of being duplicated, ensuring uniqueness across different systems. In the video, the discussion revolves around the use of UUIDs as primary keys in MySQL databases and the associated performance implications. UUIDs are a central theme as they explore the various versions and their impact on database operations.

💡Planet Scale

Planet Scale is mentioned as a sponsor of the video and the creator of a blog post that the video is based on. The blog post discusses the issues with using UUIDs as primary keys in SQL, which is the starting point for the video's exploration. Planet Scale is portrayed as a valuable resource for educational content related to database management and performance.

💡Primary Key

In the context of the video, a primary key is a unique identifier for records in a database table. The video discusses the use of UUIDs as primary keys and compares them to auto-incrementing integers. The primary key is crucial for database indexing and query performance, making it a key concept in the discussion of database optimization.

💡B+ Tree

B+ Trees are a type of data structure used by MySQL for indexing. They allow for efficient searching, insertion, and deletion operations. The video explains how the introduction of randomness in UUIDs can affect the performance of B+ Trees, leading to longer times for the database to rebalance and maintain performance.

💡Storage Utilization

Storage utilization refers to how efficiently a database uses its allocated storage space. The video discusses how using UUIDs as primary keys can lead to higher storage utilization compared to traditional integer-based keys due to the larger size of UUIDs. This can impact the overall efficiency and performance of the database system.

💡CUID (Content-Addressable Unique Identifier)

CUID is an alternative to UUIDs that the video presenter has been advocating for due to its advantages in certain scenarios. CUIDs are designed to be more sortable and have been the presenter's preferred standard for IDs. The video explores the trade-offs between using CUIDs and UUIDs in the context of database primary keys.

💡Versioning of UUIDs

The video script discusses different versions of UUIDs (V1 through V8), each with specific characteristics and potential issues. Understanding the versioning is important as it relates to the unique properties of each UUID variant, such as time-based generation, random generation, and deterministic generation, which affect their suitability as primary keys in databases.

💡Auto-Incrementing Integer

An auto-incrementing integer is a type of primary key that automatically assigns a unique number to each new record. The video contrasts UUIDs with auto-incrementing integers, highlighting the performance benefits and storage efficiency of using integers in certain database scenarios.

💡Indexing

Indexing in databases is the process of creating a data structure that improves the speed of data retrieval operations. The video explains how the choice of primary key (UUIDs vs. integers) affects indexing, with implications for storage requirements and query performance due to the different ways data is organized and accessed.

💡Snowflake ID

Snowflake IDs are an alternative identifier system mentioned in the video as a part of the discussion on alternatives to UUIDs. They are designed to be unique and are often used in distributed systems. The video suggests considering Snowflake IDs as one of the options when architecting a database that requires unique identifiers.

💡Nano ID

Nano ID is another unique identifier system that is briefly mentioned in the video. It is used by Planet Scale and is noted for its compact size and simplicity. The video references a Nano ID Collision tool, which helps estimate the probability of ID collisions, making it a useful resource for developers considering different ID systems.

Highlights

The video discusses the complexities and considerations of using UUIDs (Universally Unique Identifiers) as primary keys in MySQL databases.

Sponsor Planet Scale has created a blog post that the video uses as a starting point for the discussion on UUIDs.

The video corrects the misconception that UUIDs are absolutely unique, noting that collisions can occur.

Version 8 of UUIDs allows vendor-specific implementations, which the video humorously refers to as 'barely a standard'.

The video explains the structure of different versions of UUIDs, such as V1 being time-based and V4 being random.

Using UUIDs as primary keys can hurt database performance due to the way MySQL B+ tree indexes need to be updated and balanced.

Storing UUIDs requires significantly more space compared to traditional 32-bit integers, impacting storage efficiency.

The video debates the pros and cons of using ordered UUID variants like version 6 or 7 for better performance.

The video discusses the security implications of using sequentially ordered IDs, such as the potential for unauthorized access.

Alternative ID types like Snowflake IDs, UL IDs, and Nano IDs are presented as options to UUIDs.

The video provides a detailed comparison between auto-incrementing integers and UUIDs in terms of performance and storage.

The use of UUIDs can lead to inefficient storage utilization due to the randomness of the IDs.

The video highlights the importance of choosing the right type of ID for database architecture to balance uniqueness, performance, and storage.

The video mentions the use of MySQL's built-in UUID functions and their potential benefits.

The video concludes by emphasizing the importance of making an informed decision when choosing an ID system for database design.

The video humorously refers to the complexity of UUID versions as a 'Kingdom Hearts playthrough chart', indicating the difficulty in keeping track of them.

The video provides a critical perspective on the use of UUIDs, suggesting that while they offer uniqueness, they may not always be the best choice for primary keys in every scenario.

Transcripts

play00:00

this video is sponsored by Planet scale

play00:01

I've wanted to do a video about uyu IDs

play00:03

for a while so much so that this is my

play00:06

third time filming it over 2 years last

play00:08

time I filmed it was on my couch in my

play00:10

old living room and the time before that

play00:11

was a random stream filming like kind of

play00:13

what I'm doing here neither of these

play00:15

were good enough for me to release

play00:16

because there's just so many details

play00:17

about uyu IDs I wanted to make sure I

play00:19

got right and even though one of those

play00:21

videos was edited I never ended up

play00:22

posting it and I wanted to take the time

play00:24

to get it exactly right thankfully I

play00:26

don't have to because one of my sponsors

play00:28

Planet scale actually went and did it

play00:29

for me they made a phenomenal blog post

play00:31

the problem with using a uid primary key

play00:34

in my SQL that I want to use as a

play00:36

starting point to have this discussion

play00:37

Planet skill does sponsor the channel

play00:39

and they are sponsoring this video but

play00:40

they have given me no input whatsoever I

play00:42

don't even think they know what I'm

play00:43

covering here they just wrote this blog

play00:45

post knowing I wanted it as something to

play00:46

cover and now I get to cover it so yeah

play00:49

I know Planet scale is a bit of a spicy

play00:50

topic right now but they're still a

play00:52

phenomenal resource for a lot of

play00:53

Education stuff and this is a good blog

play00:55

post to start with so let's go in the

play00:57

problem with using a uid primary key in

play01:00

MySQL understand the different versions

play01:02

of uu IDs and why using them as a

play01:04

primary key in MySQL can hurt DB

play01:06

performance interesting is going to be a

play01:07

performance thing I'm even more Curious

play01:10

now I'll lay out my biases ahead of time

play01:12

I think having the ability to order your

play01:14

IDs is particularly nice when you can

play01:17

and there's a lot of cases where it's

play01:18

nice to know when something was created

play01:20

and using the time as a way to make your

play01:23

unique values more uniques Not a Bad

play01:25

Thing cuid V2 in particular is something

play01:28

that I'm a bit of a shill of that I've

play01:30

been pushing pretty hard for a while now

play01:33

and I'll probably be coming back to cuid

play01:35

2 during this because it is my goto my

play01:38

preferred standard for IDs although I

play01:40

have rolled my own both with my own

play01:42

crazy requirements as well as following

play01:43

along with the things that planet scale

play01:45

recommends but I'm curious what they're

play01:46

recommending here universally unique

play01:48

identifiers also known as uu IDs are

play01:50

designed to allow developers to generate

play01:52

unique IDs in a way that guarantees

play01:54

uniqueness without knowledge of other

play01:56

systems this is the key piece here where

play01:58

uu IDs are meant to be so unique that if

play02:01

you generate them in multiple places at

play02:03

the same time the likelihood of a

play02:04

collision is effectively zero that

play02:07

across the whole universe you'll never

play02:08

generate the same ID twice there have

play02:10

now been multiple times that uu IDs

play02:12

collided so this is just sadly not true

play02:15

when you also add in the fact that

play02:16

random number generators aren't always

play02:18

particularly random and those are being

play02:20

used to then generate these uu IDs

play02:22

things can get messy fast yeah sadly the

play02:25

universal promise of uu IDs for many

play02:27

reasons has not been true for a bit

play02:29

thankfully these things are changing I

play02:30

already see people in chat dropping that

play02:32

the version 8 fixes some of the issues I

play02:34

was talking about around time I'll open

play02:36

this up so we have it to reference in a

play02:38

bit but as someone else already pointed

play02:40

out as well if you need eight versions

play02:41

of the standard to make it right it's

play02:43

probably the standard that's awful to be

play02:45

determined let's keep reading these are

play02:47

especially useful in distributed

play02:48

architecture where you have a number of

play02:50

systems and databases responsible for

play02:52

creating records again very handy when

play02:54

you have lots of things writing to your

play02:56

database at once if they're all writing

play02:58

with the same ID generator

play03:00

those can never Collide if two things

play03:02

are trying to write to the database at

play03:03

the same time with the same ID you're

play03:05

you might think that using uyu

play03:06

IDs as a primary key in a database is a

play03:08

great idea when used incorrectly they

play03:10

can drastically hurt DB performance and

play03:12

here we go let's see how this can affect

play03:14

things in this article you learn about

play03:16

the downsides of uids as primary keys in

play03:18

your mySQL database I am almost positive

play03:20

this isn't just for MySQL either way I

play03:22

promise this video will be valuable even

play03:24

if you're not a MySQL Dev and you're

play03:26

just using other SQL Solutions or maybe

play03:27

not even SQL at all I want to talk about

play03:29

these idas they're very interesting and

play03:31

contrasting the different solutions is

play03:33

something we can all learn from the many

play03:34

versions of uu IDs at the time of this

play03:36

writing there are five official versions

play03:38

of uu IDs and three proposed versions so

play03:41

that V8 by the way that's not official

play03:43

yet six seven and eight are all proposed

play03:45

still yes there's three future versions

play03:47

proposed that we haven't accepted yet V1

play03:50

is known as a time-based uid and can be

play03:52

broken down as follows we have the Time

play03:55

Low Time mid Time Low inversion clock

play03:57

sequence inversion so this is where some

play03:59

of the random comes from and then the

play04:00

general random node while much of modern

play04:03

Computing uses the Unix Epoch time which

play04:05

is January 1st 1970 God imagine how

play04:08

convenient it would be if you were

play04:09

actually born on January ver 1970 like

play04:11

every system would be very trivial it

play04:13

would almost be like free to

play04:14

mathematically compute your birthday

play04:16

anyways since that's the base uids

play04:18

actually use a different date of October

play04:20

10th 15 why why would you use your own

play04:22

thing I didn't even know this what which

play04:25

is the date that the Gregorian calendar

play04:26

started to be more widely what did

play04:29

anybody hear know already what October

play04:31

10th 1568 was the start of the calendar

play04:34

okay somebody said it in chat that's the

play04:35

start of the calendar one person knew it

play04:38

this feels like it's meant to be a like

play04:40

a a pop quiz type thing not a technical

play04:43

implementation detail what the what

play04:45

the is that like that's ice spice's

play04:48

original

play04:50

birthday like this feels like somebody

play04:53

putting in like the well actually the

play04:55

calendar started in 1568 so that that

play04:58

should be our starting point you never

play04:59

know and somebody might need to generate

play05:00

an ID in the 1600s even though we're

play05:03

inventing this in the '90s yeah anyways

play05:06

I don't want to talk about goryan

play05:07

calendars I want to talk about uu ID

play05:09

which we'll do here the embedded time

play05:11

stamp within a uuu ID grows in 100 nond

play05:14

increments from this date which is then

play05:16

used to set the Time Low Time mid and

play05:17

time high segments of the uid the third

play05:20

segment of the uuid contains the version

play05:22

as well as the time high and it occupies

play05:24

the first character of that segment this

play05:26

is true for all versions of uu IDs as

play05:28

shown in subsequent examples the reserve

play05:30

portion is also known as the variant of

play05:32

the uid which determines how the bits

play05:34

within the uuid are meant to be used

play05:36

finally the last segment of The UU ID is

play05:38

the node which is the unique address for

play05:40

the system generating that uu ID most of

play05:43

the little bits here like the reserved

play05:45

and the node not super important but the

play05:47

node is how it knows which of your

play05:49

systems was the one doing the generation

play05:52

so about V2 next version two of the uid

play05:54

implemented a change compared to version

play05:56

one yeah no obviously version two

play05:58

is changing things from version one duh

play06:00

That's How versions work anyways imagine

play06:03

if it didn't change anything if version

play06:05

two implemented no changes compared to

play06:06

version one that would be more

play06:07

interesting that's something I'd put a

play06:09

sentence in here for but since it

play06:10

changed something let's see what it

play06:11

changed the low time segment of the

play06:13

structure was replaced with a posix user

play06:14

ID the theory was that these uu IDs

play06:17

could be traced back to the user account

play06:19

that generated them this sounds like a

play06:20

terrible idea this sounds like an

play06:22

atrocious idea I don't know why they

play06:23

would do this since the low time segment

play06:26

is where much of the variability of uu

play06:27

IDs reside replacing the segment

play06:29

increases the chances of collision as a

play06:31

result this version of uu ID is rarely

play06:33

used yeah I would hope so this is a

play06:36

disaster uuid V3 and V5 versions three

play06:39

and five of uids are very similar I

play06:42

dislike that three and five are similar

play06:43

and four isn't this kind of feels like

play06:46

what happened with like the Final

play06:47

Fantasy versions where like most Final

play06:48

Fantasy games have nothing in common but

play06:50

then like seven and 10 have two

play06:52

characters that are overlapping even

play06:54

though different universes this is chaos

play06:57

versions three and five of uu IDs are

play06:59

very similar the goal of these versions

play07:00

is to allow uu IDs to be generated in a

play07:02

deterministic way so that given the same

play07:04

information the same uu IDs can be

play07:06

generated I'm getting more scared as

play07:09

this goes these implementations use two

play07:11

pieces of information a namespace which

play07:13

is itself a Hu ID oh boy nested U IDs

play07:16

yeah it's going to go well so we have

play07:17

the name space as well as a name these

play07:19

values are run through a hashing

play07:20

algorithm to generate 128bit value that

play07:23

can be represented as a uu ID the key

play07:25

differences between these versions is

play07:27

that version 3 uses an md5 hashing

play07:29

algorithm and version five uses an S1 oh

play07:32

boy so how is four different cuz four is

play07:34

what I see basically everyone using

play07:36

version four is known as the random

play07:37

variant because as the name implies the

play07:39

value of the uid is almost entirely

play07:42

random how how did we let this happen by

play07:45

the way how are V3 and V5 including

play07:48

meaningful information and we just have

play07:50

this one version between the two though

play07:51

it's like yeah what if we just made it

play07:53

all random that will solve the problem

play07:55

and there's since been a six and a seven

play07:57

proposed and I hope we get to those in

play07:59

here so version 4 is the one that almost

play08:01

everyone uses because it's entirely

play08:02

random the exception to this is the

play08:04

first position in the third segment of

play08:05

the uuid which will always be a four to

play08:07

signify the version used so if you see a

play08:09

four in this third section and it's

play08:11

eight digits four digits four digits

play08:13

this is a four it's probably a uid

play08:15

before version six and here's where we

play08:18

get into those ones that are proposed

play08:20

version six is nearly identical to

play08:22

version why we have to diagram this so

play08:25

v1's become V6 V3 has for the most part

play08:30

become V5 and then V4 is what we

play08:33

actually use does this correctly explain

play08:36

the history of uuid versions up to V6 so

play08:39

far because it only gets worse from here

play08:40

this is our starting point back to this

play08:43

article version six is nearly identical

play08:46

to version one the only difference is

play08:47

that the bits used to capture the time

play08:49

stamps are flipped meaning the most

play08:51

significant portions of the time stamp

play08:52

are stored first so time high is now in

play08:55

the front revolutionary huge the graphic

play08:58

below demonstrates you yeah yeah cool

play09:00

makes sense the main reason for this is

play09:01

to create a value that's compatible with

play09:03

version one while allowing these values

play09:04

to be more sortable since the most

play09:06

significant portion of the time stamp is

play09:07

up front this was an attempt to make uid

play09:09

sortable and it failed if you want a

play09:11

sortable uuid you probably want a CU uid

play09:12

we go there in a bit though now for

play09:15

version 7 version 7's also a time-based

play09:17

uid variant but it integrates the more

play09:19

commonly used Unix Epoch timestamp

play09:21

instead of the Gregorian calendar used

play09:23

by version one progress be cool if they

play09:24

use like primagen birthday or something

play09:26

that's pretty close to the like Epoch

play09:29

time for uh Unix right those would be

play09:31

pretty close the other key difference is

play09:33

that the node which is the value based

play09:35

on the systems generating the uid is

play09:36

replaced with Randomness making these

play09:38

uids less trackable back to their Source

play09:41

also cool including the system that's

play09:42

generating the uid is disgusting very

play09:45

happy that they have ditched that but

play09:48

now we have to update our diagram to

play09:50

include V7 which um is I guess most

play09:53

accurately represented by breaking off

play09:56

this fixes

play10:00

cool keep going uid V8 version 8 is the

play10:04

latest version that permits vendor

play10:05

specific implementations while adhering

play10:07

to RFC standards oh boy let's solve the

play10:10

problem by letting everyone introduce

play10:12

their own problems that sounds great the

play10:13

only requirement for uid V8 is that the

play10:16

versions be specified in the first

play10:17

portion of the third segment as all

play10:19

other

play10:20

versions so uui v8's barely even a

play10:23

standard if I'm understanding this

play10:25

correctly apparently I broke where V3 is

play10:27

pointing yeah V3 should oops oh star to

play10:29

do there fix that so V3 is still

play10:31

pointing at V5 and I have to add V8

play10:33

which when I add it's going to move all

play10:34

the arrows we love that so fix that so

play10:37

it's pointing there I'll fix this again

play10:40

I want to properly emphasize that this

play10:41

one's like

play10:44

imaginary this is barely a

play10:48

standard more A vibe cool I think that

play10:53

accurately communicates how I'm feeling

play10:55

thus far I wanted to maybe change the

play10:57

font family of V8 but I there's nothing

play11:00

here that is what I I want for it this

play11:02

looks like a Kingdom Hearts playthrough

play11:06

chart you're not wrong yeah uh I'm not

play11:10

going to go any further if you know you

play11:12

know anyways air from V4 to V7 does does

play11:15

V4 go to V7 it seemed like it's based on

play11:18

version one and V4 is its own thing that

play11:20

doesn't have any of the date time

play11:22

yeah it's almost entirely random the

play11:24

only thing that in V4 that isn't random

play11:26

is this digit everything else is random

play11:27

and all of the others time is faked in

play11:29

which is why V4 is so different oh so so

play11:32

V7 is version one plus version 4 with

play11:35

the random so I guess yeah the other key

play11:38

difference is that the node the values

play11:39

based on is replaced with Randomness so

play11:41

there a little more random there's just

play11:43

the node is random which the node isn't

play11:44

a big part okay I guess the no's a big

play11:47

part fine the no's a big part I

play11:49

guess by technicality that V4 does kind

play11:53

of also go into V7 cool looking better

play11:56

by the minute guys sure that yeah this

play11:58

will defin defitely not confuse anybody

play12:01

anyways let's keep reading this post

play12:03

we're going to be here forever yeah the

play12:05

version 8 is a Vibe it's not a standard

play12:07

let's see how we're actually using this

play12:08

uu IDs and MySQL using U IDs mostly

play12:11

guarantees uniqueness across all systems

play12:13

in your architecture so you might be

play12:15

inclined to use them as primary keys for

play12:16

your records be aware that there are

play12:18

several trade-offs to doing so when

play12:19

compared to an auto incrementing integer

play12:22

interesting that they're comparing to

play12:23

Auto incrementing integers primarily we

play12:24

we'll see where this goes I'm not

play12:26

necessarily the biggest fan of just

play12:27

using ins as our database IDs but we'll

play12:30

see what they have to say insert

play12:31

performance whenever a new record is

play12:33

inserted into a table in MySQL the index

play12:35

associated with the primary Keys needs

play12:37

to be updated so querying the table is

play12:39

performant indexes in MySQL take the

play12:42

form of a B+ tree which is a multi-layer

play12:44

data structure that allows queries to

play12:46

quickly find the data that they need

play12:47

there's a bunch of words for their going

play12:49

through the IDS in fancy data structure

play12:51

formats to actually find the things

play12:53

you're looking for the following diagram

play12:54

shows what this looks like with six

play12:55

entries with values from 1 to six a

play12:57

query comes asking for five my equal

play12:59

will start at the root and know from

play13:00

there if it has to Traverse to the right

play13:02

for what it's looking for cool this is a

play13:04

standard binary tree instead of a B+

play13:05

tree the key difference is a B plus tree

play13:07

the leaf nodes contain a reference to

play13:08

the data while on a b tree the leaf

play13:10

nodes do not cool be balanced not binary

play13:13

sorry brain fart it happens make fun of

play13:16

me in the comments anyways B tree we

play13:18

have four is where we're at and it

play13:20

points in the two different directions

play13:22

one is to where 1 2 and three are and

play13:24

the other is to where five and six are

play13:25

so we're looking for five and we're

play13:26

starting here we know we need to go to

play13:28

the right and we'll find it pretty quick

play13:30

cool if values 7 through 9 are added my

play13:32

SQL will split the right node and

play13:33

rebalance accordingly so now once we

play13:35

have these new nodes we split the tree

play13:37

so we have four and seven and we know

play13:38

four points to 1 2 3 on the left and

play13:40

five and six on the right and seven

play13:41

points 5 six on the left and 8 n on the

play13:43

right process is known as page splitting

play13:46

and the goal is to keep the b+3

play13:47

structure balanced so that MySQL can

play13:49

quickly find the data it's looking for

play13:51

with sequential values this process is

play13:53

relatively straightforward however when

play13:55

Randomness is introduced into the

play13:56

algorithm it can take significantly

play13:58

longer for my to rebalance the tree on a

play14:00

high volume database this can hurt users

play14:02

experience as MySQL tries to keep the

play14:04

tree in Balance learning things already

play14:06

oh boy yeah things for us to think about

play14:09

for our databases cool for more

play14:11

information about how B+ trees work we

play14:13

have a dedicated video in our MySQL for

play14:15

developers course good to know yeah RP

play14:17

Aon who created this wonderful course

play14:20

that is fully freely available highly

play14:22

recommend checking it out if you haven't

play14:23

it's one of the best ways to learn SQL

play14:25

and it's dope that they made this free I

play14:27

have actually considered doing this

play14:28

whole thing myself as a video just to

play14:30

show you guys how little I actually know

play14:32

about SQL and how much I can do with it

play14:34

but uh someday anyways oh we even have

play14:37

people in chat saying that that video by

play14:38

Aaron is dope and worth watching so very

play14:41

worth checking out higher storage

play14:42

utilization this I'll be honest is

play14:44

something I care about a lot less

play14:46

overall but I'm curious how important

play14:48

they think it is all primary keys in my

play14:50

SQL are indexed by default an auto

play14:51

incrementing integer will consume 32

play14:53

bits of storage per value compare this

play14:55

with uu IDs if store in a compact binary

play14:57

format a single uu ID would cons consume

play14:59

128 bits on dis already that's four

play15:02

times the consumption of a 32-bit

play15:03

integer If instead you chose to use a

play15:05

more human readable string based

play15:06

representation each uid could be stored

play15:08

as a care 36 consuming a whopping 688

play15:12

bits per uuid I can't remember last time

play15:14

I broke a sweat over 688 bits anyways

play15:17

this means that each record would store

play15:18

over 20 times more data than the 32-bit

play15:21

integer it's a record it has other

play15:22

things on it yes it adds up but if I

play15:25

have like a URL or a username or a name

play15:29

on these things it can get big this does

play15:31

get bad for indexing though which is I'm

play15:32

sure what we're about to get to in

play15:34

addition to the default index created on

play15:36

the primary key secondary indexes will

play15:38

also consume more space this is because

play15:40

secondary indexes use the primary key as

play15:42

a pointer to the actual row meaning they

play15:44

need to be stored with the index this is

play15:46

a good point in order for a an index

play15:48

that will help you find data in your

play15:49

database faster to work it needs to have

play15:52

all of the IDS as well so if you now for

play15:55

all of your indexes have to store 20

play15:57

times more data your database indes are

play15:59

suddenly massive and also have to be

play16:01

held in memory so that they can be

play16:02

performed and if these are moving to

play16:03

disc cuz they're too big now you're

play16:04

getting yes I understand this can

play16:07

lead to significant increases in storage

play16:09

requirements for your database depending

play16:10

on how many indexes are created on

play16:12

tables using uuids as the primary key

play16:14

this point alone is big enough that I'm

play16:16

I'm swaying my my feelings here you're

play16:19

making good points finally page

play16:21

splitting as described in the previous

play16:22

section can also negatively impact

play16:24

storage utilization as well as

play16:26

performance INB assumes that the primary

play16:28

key will increment predictably either

play16:30

numerically or lexicography

play16:32

lexicographically I was going to assume

play16:34

this is lexor graphically but the lexico

play16:36

well I have to worry about prime reading

play16:37

this article if true INB will fill the

play16:39

pages to about 94% of the page size

play16:42

before creating a new page when the

play16:43

primary key is random the amount of

play16:45

space utilized for each page can be as

play16:47

low as 50% also huge due to this using

play16:49

EU IDs that incorporate Randomness can

play16:51

lead to excessive usage of pages to

play16:53

store the index okay I've seen this this

play16:56

is funny because this is a case against

play16:57

the one good type of ID which was V4 and

play17:00

now it's saying that the randomness

play17:01

means that you have to store

play17:03

significantly more data Jesus yeah good

play17:06

to

play17:08

know best ways to use a uid primary key

play17:10

with SQL you absolutely need to use uids

play17:13

as the unique identifier for records in

play17:14

your table there's a few best practices

play17:16

you can follow to minimize the negative

play17:17

side effects of doing so first is you

play17:19

can use the binary data type well uids

play17:21

are often sometimes often sometimes I

play17:25

just thought I was having a stroke for a

play17:26

second

play17:27

uh this is not a sentence I feel yeah no

play17:31

I don't think the author had a stroke I

play17:32

think the author made a typo or deleted

play17:34

the wrong word and it made me feel like

play17:35

I was having a stroke but yeah while

play17:37

uids are often represented as 36

play17:39

character strings they can also be

play17:41

represented in their native binary

play17:42

format as well if converted to a binary

play17:44

value you can store it in a binary 16

play17:46

column which reduces the storage

play17:48

requirement down to 16 bytes this is

play17:50

still quite large compared to a 32-bit

play17:51

integer but certainly better than

play17:53

storing the uid is a care

play17:55

36 bits versus bites by the way I'm not

play17:58

telling you what the difference is there

play17:59

that's your problem hopefully you know

play18:01

that if you made it this far in anyways

play18:03

here we have a table of uu IDs U ID is

play18:05

character 36 binary binary 16 inserted

play18:09

with both and we can see that as binary

play18:11

it's able to represent it much smaller

play18:13

and take a Plus data good to know next

play18:16

option use an ordered uid variant using

play18:18

a uid version that supports ordering can

play18:20

mitigate some of the performance and

play18:21

storage impacts of using uids by making

play18:23

the generated values more sequential and

play18:25

by doing such you are able to avoid some

play18:27

of the page splitting issues described

play18:28

earlier I don't really recommend using

play18:31

uu IDs as a primary ID unless you're

play18:33

doing something like this I'm going to

play18:34

use this opportunity to plug CU ID which

play18:36

has been my go-to for this type of thing

play18:37

for a while if I want like a big unique

play18:40

ID that can be written from any of many

play18:42

different places without having to worry

play18:43

about collisions Co ID is proven to be a

play18:45

very very good and secure solution which

play18:48

allows me to sort properly so it handles

play18:51

a lot of these problems but it's also

play18:53

handling a ton of uniqueness which

play18:55

significantly lowers the chance that

play18:58

you'll have cion you'd have to reach

play19:00

some crazy multi- trillion trillion

play19:03

number to reach 50% chance of having a

play19:05

collision insane oh I forgot uh it's not

play19:08

good for sequential because it's not

play19:10

technically sortable as a V2 apparently

play19:12

they disagree about the

play19:20

sortabad at Fields instead according to

play19:22

this author since we have created at

play19:24

fields we don't have to worry about

play19:25

sorting this is something that is a big

play19:27

disagreement they even call out here

play19:29

that one of the reasons to use

play19:31

sequential Keys is to avoid ID

play19:32

fragmentation which can require a large

play19:34

amount of dis space for databases with

play19:35

billions of Records however at such a

play19:37

large scale modern systems often use

play19:39

cloud native databases that are designed

play19:41

to handle terabytes of data efficiently

play19:43

and at a low cost Additionally the

play19:45

entire database may be stored in memory

play19:46

providing fast random access Lookout

play19:47

performance not if you can't fit it in

play19:49

memory because you're using a giant ID

play19:51

standard and have to index it all over

play19:52

the place yeah this is where these

play19:55

disagree if we go back to this diagram

play19:57

if you can't sort the ID

play19:59

you can't split the table up logically

play20:01

based on an ID so if somebody wanted to

play20:03

look up ID 4 and they have to find that

play20:05

through your crazy tree and you can't

play20:07

use the Sorting of the IDS to do that

play20:09

now you have to look through everything

play20:11

to find your way there utter chaos as

play20:14

such I yeah I don't like that these

play20:16

disagree because no offense to the cuid

play20:19

guys CU as you all know I massively

play20:22

respect y'all and I use cuid for a bunch

play20:24

of the fact that they have backed

play20:27

out of having sort ability is scary

play20:29

there is a fair point that they can

play20:30

cause hot spots in the database like if

play20:32

you have a system that generates a large

play20:33

number of IDs in a short period the IDS

play20:35

will be generated in sequential order

play20:37

causing the tree to become unbalanced

play20:39

which will force frequent rebalancing

play20:40

the counter point there is that the

play20:42

rebalancing is much more expensive if

play20:44

the IDS can't be like quickly identified

play20:47

as which way they go if the value is

play20:49

going in a random you have to rebalance

play20:51

much more often because as they said

play20:53

when a primary key is random the amount

play20:54

of space utilized for each page can be

play20:56

as low as 50% so you're rebalance Bing

play20:59

almost twice as much off of this alone

play21:01

so like yeah it's cheaper to do the

play21:03

rebalance with sequential IDs than not

play21:05

I'm I'm not vibing with the cuid post

play21:07

anymore and it's crazy how quickly my

play21:09

own stance on these things is changing

play21:11

as I read this so yeah I happy that I'm

play21:15

reading this and that we're learning so

play21:17

back to this bit of using a ordered uid

play21:19

variant because apparently my cuid V2 is

play21:21

not something I can rely on anymore

play21:22

using a uid version that supports

play21:24

ordering can mitigate some of the

play21:25

performance and storage impacts of using

play21:27

uids by making the generated values more

play21:30

sequential which avoids some of the page

play21:31

splitting issues described earlier even

play21:33

when they are being generated on

play21:34

multiple systems time based uid such as

play21:37

version 6 or 7 can guarantee uniqueness

play21:39

while keeping values as close to

play21:40

sequential as possible the exception to

play21:42

this is uid V1 which has the least

play21:44

significant portion of the time stamp

play21:45

first uid V1 was such a fumble in so

play21:48

many ways I can't believe they recovered

play21:49

from that funny enough the main reason

play21:51

people are recommending cid2

play21:54

specifically the original maintainers of

play21:55

cv1 is that by having the IDS be

play21:58

sortable and auto incremented they

play22:00

believe this introduces security issues

play22:02

like unauthorized password resets via

play22:04

guessable IDs unauthorized access to

play22:07

gitlab Via guessable IDs unauthorized

play22:09

password resets via gu ID all things

play22:11

that our friend Eva absolutely loves I

play22:14

haven't seen her say anything in the

play22:16

last few seconds so I'm going to go

play22:17

through her chat history where she said

play22:18

some stuff about this earlier love uu

play22:20

IDs from a Dev perspective I like

play22:22

incremental IDs as a security pentester

play22:24

not because they're good just because

play22:25

they're easy to pentest yeah it's a lot

play22:28

easier to test things when you can guess

play22:31

IDs just by generating them yourself

play22:34

that said if your service can be

play22:36

compromised through an ID being known

play22:38

you need to rethink the way that service

play22:40

works because it should take a lot more

play22:41

data than that IDs are still ID even if

play22:44

random yes if your IDs are all it takes

play22:46

to compromise your system then you need

play22:47

to fix that not the ID system itself I

play22:50

get it now we want super small

play22:51

performance sortable unique access

play22:53

across the world support a fast access

play22:55

insert removal Etc okay simple yeah

play22:57

super easy go build it shouldn't be that

play22:59

hard at all anyways other options that

play23:02

are provided here are using the built-in

play23:03

MySQL uid functions probably an option

play23:06

worth considering I'm sure that their

play23:07

implementation is better never mind how

play23:10

does anyone take the standard seriously

play23:12

I yeah while it is not a great practice

play23:15

to use them by themselves there's a

play23:16

helper function in my SQL called uid to

play23:18

bin not only does this function convert

play23:21

the string value to Binary but you can

play23:22

use the option swap flag which will

play23:24

reorder the timestamp portion to make

play23:26

the resulting binary more sequential

play23:28

[Music]

play23:32

I'm going

play23:33

insane and the other option use an

play23:35

alternative ID Type U IDs are not the

play23:38

only type of identifier that provides

play23:39

uniqueness with a distributed

play23:40

architecture consider that they were

play23:42

first created in 1987 there has been

play23:44

plenty of time for other professionals

play23:46

to propose different formats such as

play23:48

snowflake IDs UL IDs and even Nano IDs

play23:51

which is what they use at Planet scale I

play23:52

actually really like this post I don't

play23:53

know if I've covered this before but I

play23:55

use Nano IDs for a bunch of random stuff

play23:57

because of this post they even have a

play23:59

little Link in here the Nano ID

play24:01

Collision tool that is super useful as

play24:03

you restrict things because something I

play24:05

did with RS is I didn't want to include

play24:08

lowercase letters because I didn't want

play24:09

users to have to like if they're typing

play24:11

in a a page based on the ID which was a

play24:13

real use case for what we were building

play24:15

I didn't want them to have to worry

play24:16

about those characters so I deleted all

play24:18

of that so it's just letters and numbers

play24:20

and I also think I got rid of zero and O

play24:22

because they look like numbers I think I

play24:24

got rid of I too for the same reason to

play24:26

make it a little easier to to read this

play24:28

but now that we have that we can pick

play24:31

different lengths and see how long it

play24:32

takes for a 1% probability of a

play24:36

collision and you can even change the

play24:38

speed at which you're generating over

play24:39

that time too super super handy tool if

play24:42

you want to generate your own Nano IDs

play24:44

and like know how realistic it is to hit

play24:46

a collision I liked this a lot so check

play24:49

out that post if you're curious about

play24:50

more there so yeah they recommend all of

play24:52

those options all of which make sense

play24:53

did not know snowflake IDs were just

play24:55

numbers and like really big ones

play24:57

interesting

play24:59

conclusion using a uid primary key in

play25:01

MySQL can nearly guarantee uniqueness in

play25:03

a distributed system however it comes

play25:05

with several trade-offs luckily with the

play25:07

many versions available as well as

play25:08

several Alternatives you have options

play25:10

that can better address some of these

play25:12

trade-offs after reading this article

play25:13

you should be in a better position to

play25:15

make an informed decision about the ID

play25:17

that you choose to use while

play25:18

architecting your next database yeah

play25:20

this is pretty solid let me go post the

play25:23

uid chaos on Twitter quick

play25:31

thank you to the Kingdom Hearts

play25:32

referencer earlier that will live in my

play25:35

head rentree forever that's all I to say

play25:37

about yuu IDs I'm so happy this video is

play25:39

finally done and people can stop asking

play25:40

for it let me know in the comments how

play25:42

you feel and thank you again to Planet

play25:43

scale for letting me finally put this

play25:45

one to rest once and for all and more

play25:47

importantly showing me that cuid has the

play25:49

same problems as well and until next

play25:51

time peace NS

Rate This

5.0 / 5 (0 votes)

Related Tags
Database PerformanceUUIDsMySQLPlanet ScalePrimary KeysDistributed SystemsData StructuresB+ TreeCUIDNano IDsSecurity Concerns