How Notion Handles 200 BILLION Notes (Without Crashing)

Coding with Lewis
20 Mar 2025 · 10:03

Summary

TL;DR: This video delves into how Notion scaled its rapidly growing database system, tracing the shift from a single Postgres database to a sharded architecture. As user numbers surged, Notion adopted sharding, a data lake, and open-source technologies such as Apache Spark and Kafka to scale efficiently. The video also covers the challenges along the way, including update-heavy workloads, ballooning data volumes, and mounting load on the database. Notion's engineers share how they overcame these hurdles and built a scalable system that supports millions of users with minimal downtime.

Takeaways

  • 😀 Notion is one of the fastest-growing software products, scaling from 1,000 users in 2020 to 100,000,000 users by 2024.
  • 😀 Notion's data model treats each block on a page as its own database row, storing the block's ID, its type, and references to its child and parent blocks.
  • 😀 Notion hit serious scalability limits as it reached 20 billion blocks and decided to shard the database to distribute the load.
  • 😀 Sharding splits a database into smaller pieces spread across many machines, reducing heavy queries and spreading traffic. Notion ended up with 32 database instances holding 480 logical shards.
  • 😀 Notion used the double-write method, writing data to both the new and old databases while a 'catch-up worker' backfilled the missing data over time.
  • 😀 A data lake, built on Snowflake and AWS S3, was introduced to manage and process Notion's raw data more efficiently.
  • 😀 Snowflake is optimized for insert-heavy workloads, but Notion's frequent block updates made it costly, prompting a custom data lake tailored to Notion's needs.
  • 😀 Notion's custom data lake uses Amazon S3, Apache Spark, Apache Kafka, and Apache Hudi to handle large datasets, optimize analytics, and manage data pipelines.
  • 😀 As the user base kept growing, Notion tripled its database machines and distributed traffic more efficiently using PgBouncer.
  • 😀 The final step was sharding PgBouncer into four groups to absorb the increased traffic, followed by a careful transition to the new database setup with no user-facing downtime.

Q & A

  • How does Notion store and manage database entries for billions of users?

    -Notion stores every block on a page as a separate database row. This lets Notion handle billions of entries, with each row carrying a unique ID for its block along with references to the block's child and parent blocks.
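
    To make the row-per-block model concrete, here is a minimal Python sketch. The field names (parent_id, child_ids, workspace_id) are illustrative assumptions, not Notion's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional
import uuid

# A minimal sketch of the block-per-row model described above.
# Field names are illustrative assumptions, not Notion's real schema.
@dataclass
class BlockRow:
    id: str                       # unique ID for this block
    type: str                     # e.g. "text", "heading", "image"
    parent_id: Optional[str]      # reference to the enclosing block/page
    child_ids: list = field(default_factory=list)  # ordered child blocks
    workspace_id: str = ""        # later used as the sharding key
    content: str = ""             # the block's payload

# A page materializes as one row per block:
page = BlockRow(id=str(uuid.uuid4()), type="page", parent_id=None,
                workspace_id="ws-123")
para = BlockRow(id=str(uuid.uuid4()), type="text", parent_id=page.id,
                workspace_id="ws-123", content="Hello, Notion")
page.child_ids.append(para.id)
```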

  • What was the primary reason Notion needed to implement sharding?

    -Notion implemented sharding because their single Postgres database was struggling to handle the rapid growth in user numbers, with 20 billion blocks slowing down performance. Sharding allowed them to distribute the data across multiple smaller machines to improve scalability and efficiency.

  • What does sharding involve, and how did Notion apply it to their database?

    -Sharding involves splitting a large database into smaller, more manageable pieces. Notion partitioned its database by workspace ID, since every block belongs to exactly one workspace. They created 32 separate database instances, each hosting 15 shards, for 480 logical shards in total.
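
    A minimal sketch of how workspace-based routing could work under the layout described above (480 logical shards over 32 instances, 15 per instance); the hash choice and function names are assumptions, not Notion's implementation.

```python
import hashlib

NUM_LOGICAL_SHARDS = 480
SHARDS_PER_INSTANCE = 15  # 480 shards / 32 instances

def logical_shard(workspace_id: str) -> int:
    """Map a workspace to one of the 480 logical shards."""
    digest = hashlib.md5(workspace_id.encode()).hexdigest()
    return int(digest, 16) % NUM_LOGICAL_SHARDS

def physical_instance(shard: int) -> int:
    """Map a logical shard to one of the 32 Postgres instances."""
    return shard // SHARDS_PER_INSTANCE

ws = "workspace-42"
shard = logical_shard(ws)
print(f"{ws} -> shard {shard} on instance {physical_instance(shard)}")
```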

  • How did Notion manage the migration to the new sharded system without losing data?

    -Notion used the double-write method, writing data to both the old and new databases. They also kept an audit log and ran a catch-up worker to periodically synchronize and backfill data. After verification, they performed a quick five-minute switch to the new system.
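
    The pattern can be shown with a runnable toy sketch, using in-memory dicts in place of the two databases; all names here are illustrative.

```python
# Toy double-write: the old DB stays the source of truth, the new DB
# gets best-effort writes, and an audit log lets a catch-up worker
# backfill anything the new DB missed before the final cutover.
old_db, new_db, audit_log = {}, {}, []

def write_block(block_id: str, data: dict, new_db_healthy: bool = True):
    old_db[block_id] = data        # old DB remains authoritative
    audit_log.append(block_id)     # record every write for backfill
    if new_db_healthy:
        new_db[block_id] = data    # best-effort write to the new DB

def catch_up_worker():
    """Replay the audit log to backfill writes the new DB missed."""
    for block_id in audit_log:
        if new_db.get(block_id) != old_db[block_id]:
            new_db[block_id] = old_db[block_id]

write_block("b1", {"type": "text"})
write_block("b2", {"type": "image"}, new_db_healthy=False)  # missed write
catch_up_worker()
assert new_db == old_db  # both databases converge before cutover
```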

  • What is a data lake, and how did Notion use it in their system?

    -A data lake is a centralized repository for large amounts of raw data, often used for analytics. Notion used a data lake to offload data from Postgres into Snowflake for easier processing. They later transformed this raw data for analytics and insights, using services like Apache Spark.
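
    A hedged PySpark sketch of the offload step, assuming a Postgres JDBC driver is on the classpath; hostnames, table names, and the S3 bucket are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("block-offload").getOrCreate()

# Read block rows out of a Postgres shard (connection details assumed).
blocks = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://shard-01.example.com/notion")
          .option("dbtable", "blocks")
          .option("user", "etl")
          .load())

# Land raw data in the lake, partitioned for cheaper incremental scans.
(blocks.write.mode("append")
       .partitionBy("workspace_id")
       .parquet("s3://example-notion-lake/raw/blocks/"))
```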

  • Why did Notion decide to build their own data lake rather than using Snowflake?

    -Notion built their own data lake because Snowflake wasn't optimized for handling Notion's update-heavy block data. They needed a system that could process both raw and processed data quickly while supporting modern features like AI and search, which required handling unstructured data efficiently.

  • What technologies did Notion integrate to manage their data lake and transform their data?

    -Notion integrated several open-source technologies, including Apache Spark for data transformation, Apache Kafka for reliable data streaming, and Apache Hudi for applying incremental updates to data stored in Amazon S3. Together these tools let them process and store their data efficiently.
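
    A hedged PySpark sketch of that pipeline, assuming the Kafka and Hudi Spark packages are available; the topic, schema, and S3 paths are illustrative assumptions. Hudi's upsert mode is the key design choice here: block edits overwrite prior versions instead of piling up as appends.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, LongType

spark = SparkSession.builder.appName("block-cdc").getOrCreate()

# Assumed shape of a block-change event on the Kafka topic.
schema = (StructType()
          .add("id", StringType()).add("workspace_id", StringType())
          .add("type", StringType()).add("updated_at", LongType()))

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka.example.com:9092")
          .option("subscribe", "block-changes")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("b"))
          .select("b.*"))

hudi_options = {
    "hoodie.table.name": "blocks",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "workspace_id",
    "hoodie.datasource.write.operation": "upsert",  # updates, not appends
}

(events.writeStream.format("hudi").options(**hudi_options)
 .option("checkpointLocation", "s3://example-notion-lake/ckpt/blocks")
 .outputMode("append")
 .start("s3://example-notion-lake/hudi/blocks"))
```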

  • How did Notion address performance and scalability issues with their Postgres database over time?

    -Notion faced scalability challenges when their shards reached 90% utilization. To solve this, they expanded from 32 to 96 database machines, reducing the number of shards per machine and improving system stability. They also used Postgres logical replication to synchronize data during this transition.
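
    A hedged sketch of the replication step using standard Postgres commands; the hostnames and publication names are assumptions, not Notion's actual setup.

```python
import psycopg2

# On the old (source) instance: publish the shard's tables.
src = psycopg2.connect("host=old-shard-01.example.com dbname=notion")
src.autocommit = True
src.cursor().execute("CREATE PUBLICATION shard_move FOR ALL TABLES;")

# On the new (target) instance: subscribing copies the existing rows,
# then streams every subsequent change until the cutover.
dst = psycopg2.connect("host=new-shard-65.example.com dbname=notion")
dst.autocommit = True  # CREATE SUBSCRIPTION cannot run in a transaction
dst.cursor().execute("""
    CREATE SUBSCRIPTION shard_move_sub
    CONNECTION 'host=old-shard-01.example.com dbname=notion user=repl'
    PUBLICATION shard_move;
""")
```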

  • What was the role of pgbouncer in Notion's system, and how did they manage connection limits?

    -PgBouncer is a connection pooler that sits between the application servers and the database to manage connections efficiently. Notion ran into connection limits as they scaled, so they sharded PgBouncer into four groups to spread traffic across more databases and avoid downtime from hitting those limits.
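
    A minimal sketch of routing traffic across four PgBouncer groups; the hostnames and the modulo rule are assumptions for illustration, not Notion's actual routing.

```python
# Each logical shard is pinned to one of four pooler groups, so no
# single PgBouncer instance exceeds its connection limit.
PGBOUNCER_GROUPS = [
    "pgbouncer-a.example.com",
    "pgbouncer-b.example.com",
    "pgbouncer-c.example.com",
    "pgbouncer-d.example.com",
]

def pgbouncer_for_shard(logical_shard: int) -> str:
    """Route a logical shard's traffic to one of the four groups."""
    return PGBOUNCER_GROUPS[logical_shard % len(PGBOUNCER_GROUPS)]

for shard in (0, 1, 2, 3, 479):
    print(f"shard {shard:3d} -> {pgbouncer_for_shard(shard)}")
```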

  • How did Notion ensure zero downtime during their database transition to the new sharded system?

    -Notion tested the new system with dark reads, comparing data returned by the old and new databases. They then transitioned one database at a time: stopping connections, verifying data consistency, updating PgBouncer, and resuming traffic. This careful process meant no downtime for users.
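
    A runnable toy sketch of a dark read: serve users from the old database while also reading the new one and logging any mismatch; the names and logging choice are illustrative.

```python
import logging

logging.basicConfig(level=logging.WARNING)

old_db = {"b1": {"type": "text"}}
new_db = {"b1": {"type": "heading"}}  # deliberately inconsistent

def dark_read(block_id: str) -> dict:
    primary = old_db.get(block_id)   # the old DB still serves users
    shadow = new_db.get(block_id)    # the new-DB read is "dark": unused
    if shadow != primary:
        logging.warning("dark read mismatch for %s: %r vs %r",
                        block_id, primary, shadow)
    return primary                   # users never see the new DB yet

dark_read("b1")  # logs a mismatch; the user response is unaffected
```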


Related Tags
Notion · Database Scaling · Sharding · Tech Innovation · Developers · Engineering · Postgres · Data Lake · AWS · Machine Learning · Analytics