Migrating Millions Of Databases | Scaling Postgres 374

Scaling Postgres

13 Jul 2025 18:35

Summary

TLDR: This episode of Scaling Postgres covers multi-tenancy in large-scale SaaS applications, centered on how Atlassian runs 4 million Jira databases across multiple AWS regions. It looks at the challenges of migrating databases at that scale, a PostgreSQL logical replication issue on read replicas, and change data capture practices that keep replication healthy. Other topics include tuning random page cost to improve query plans and handling large data in PostgreSQL. The episode closes with a consulting corner segment on database upgrades and AWS migration strategies.

Takeaways

😀 Multi-tenancy in databases is common in SaaS companies to isolate customer data. This can be done by using tenant IDs, separate schemas, or individual databases per tenant.
😀 Migrating databases in large-scale systems can be a regular task. Atlassian's Jira platform migrated 4 million databases across 3,000 Postgres servers in 13 AWS regions.
😀 Atlassian's migration process involves rebalancing databases to ensure an even spread of load, using backup, restore, and logical replication techniques.
😀 Managing millions of databases requires operational expertise to maintain performance and handle issues such as migration challenges or file size problems.
😀 Creating a logical replication slot on a read replica can hang: the backend spins in a tight loop while it searches the WAL for a consistent start point.
😀 The hang was traced to Postgres itself. A replica lacks the primary's real-time lock information, so it can take a long time to find a consistent point in the WAL.
😀 A patch was introduced to allow the cancellation of stuck replication slots on read replicas, improving control over logical replication processes in Postgres.
😀 Not all search tasks in AI require vector-based search; lexical search remains the better fit when exact text matters, as in codebase searches.
😀 To keep logical replication healthy, monitor how much WAL each replication slot retains and cap it, since unbounded growth can fill the primary's disk and break replication.
😀 In Postgres, hashing functions or generated columns can work around index size problems when large values need an index or a unique constraint.
😀 Active-active replication setups can be complex and might not be suitable for scaling write throughput. Issues such as conflict resolution, networking costs, and management overhead need to be considered.
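The hashing takeaway above can be illustrated outside the database. The following is a minimal Python sketch, not Postgres code: it enforces uniqueness on a fixed-size digest instead of on the full value, which is the same idea as indexing a hash of a large column. The `HashedUniqueIndex` class and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib


def digest_key(value: str) -> bytes:
    """Return a fixed-size (32-byte) key for a value of any length."""
    return hashlib.sha256(value.encode("utf-8")).digest()


class HashedUniqueIndex:
    """Hypothetical stand-in for a unique index built on a hash of the value."""

    def __init__(self):
        self._rows = {}  # digest -> row id

    def insert(self, value: str, row_id: int) -> None:
        key = digest_key(value)
        if key in self._rows:
            # Same digest means (for practical purposes) the same value.
            raise ValueError("duplicate value violates unique constraint")
        self._rows[key] = row_id

    def lookup(self, value: str):
        """Return the row id stored for this value, or None if absent."""
        return self._rows.get(digest_key(value))
```

The key stays 32 bytes whether the value is ten characters or ten megabytes, which is why the approach sidesteps index size limits. The trade-off is that a hashed key only supports equality lookups, not range or prefix searches.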

Q & A

What is multi-tenancy, and how is it typically implemented in databases?
-Multi-tenancy is a practice used by software-as-a-service (SaaS) companies to ensure that one customer's data is kept separate from another's. It can be implemented by segmenting data in a single table using a tenant ID or by using separate schemas or databases for each tenant. The latter approach leads to more complex operations, as managing large numbers of databases introduces unique challenges.
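As a rough illustration of the two ends of that spectrum, the sketch below uses Python's built-in `sqlite3` module as a stand-in for Postgres. The `issues` table and all function names are hypothetical; the point is only the shape of each model.

```python
import sqlite3


def make_shared_db():
    """Shared-table model: every row carries a tenant_id column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE issues (tenant_id INTEGER NOT NULL, title TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO issues (tenant_id, title) VALUES (?, ?)",
        [(1, "Login fails"), (1, "Add dark mode"), (2, "Export is slow")],
    )
    return conn


def issues_for_tenant(conn, tenant_id):
    """Every query must filter on tenant_id to keep tenants apart."""
    rows = conn.execute(
        "SELECT title FROM issues WHERE tenant_id = ? ORDER BY title",
        (tenant_id,),
    )
    return [title for (title,) in rows]


def make_tenant_dbs(tenant_ids):
    """Database-per-tenant model: one isolated database each, no tenant_id."""
    dbs = {}
    for tenant_id in tenant_ids:
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE issues (title TEXT NOT NULL)")
        dbs[tenant_id] = conn
    return dbs
```

In the shared model, isolation depends on every query carrying the right filter. In the per-tenant model, isolation is structural, but the number of databases to create, migrate, and monitor grows with the number of customers.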
How does Atlassian manage the 4 million Jira databases in AWS Aurora?
-Atlassian manages its 4 million Jira databases by using one database per tenant. These databases are distributed across 3,000 PostgreSQL servers in 13 AWS regions. To migrate these databases to AWS Aurora, they use backup and restore for small databases and logical replication for larger ones, and they rebalance databases across servers to maintain an even spread of load.
What operational challenges does managing millions of databases bring?
-Managing millions of databases involves significant operational challenges such as ensuring consistent performance, balancing load across instances, handling migrations effectively, and dealing with high overhead in terms of server and database management. Atlassian, for example, migrates 1,000 databases a day to ensure efficient use of resources and balanced load.
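The episode summary does not describe Atlassian's actual tooling, so the following is only a sketch of what a rebalancing planner could look like. The 1 GiB size cutoff, the greedy strategy, and every name here are assumptions; only the split between backup/restore for small databases and logical replication for large ones comes from the source.

```python
SMALL_DB_LIMIT_BYTES = 1 * 1024**3  # hypothetical cutoff, not from the episode


def choose_migration_method(size_bytes, small_limit=SMALL_DB_LIMIT_BYTES):
    """Small databases are copied; large ones are streamed to limit downtime."""
    if size_bytes <= small_limit:
        return "backup_restore"
    return "logical_replication"


def plan_rebalance(servers, max_moves):
    """Greedy plan: move databases from the fullest server to the emptiest.

    servers maps server name -> {database name: size in bytes}.
    Returns a list of (database, source, destination, method) tuples.
    """
    placement = {name: dict(dbs) for name, dbs in servers.items()}
    load = {name: sum(dbs.values()) for name, dbs in placement.items()}
    moves = []
    for _ in range(max_moves):
        src = max(load, key=load.get)
        dst = min(load, key=load.get)
        gap = load[src] - load[dst]
        # Moving a database of size s changes the gap to |gap - 2s|,
        # so only databases smaller than the gap reduce the imbalance.
        candidates = {d: s for d, s in placement[src].items() if 0 < s < gap}
        if not candidates:
            break
        # Prefer the database whose size is closest to half the gap.
        db = min(candidates, key=lambda d: abs(candidates[d] - gap / 2))
        size = placement[src].pop(db)
        placement[dst][db] = size
        load[src] -= size
        load[dst] += size
        moves.append((db, src, dst, choose_migration_method(size)))
    return moves
```

A daily job could call `plan_rebalance` with a move budget (the episode mentions about 1,000 migrations a day) and hand each planned move to whichever migration path the method field names.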
What was the issue with replication in Postgres 16 when creating logical replication slots on read replicas?
-In Postgres 16, creating logical replication slots on read replicas led to an issue where the backend process would hang while attempting to find a consistent start point. This problem was due to the difficulty in finding this point on replicas, as they don't have real-time lock information like the primary does. The replication process would be stuck in a tight loop, causing delays.
How did the team address the issue with replication slots hanging on read replicas?
-The team identified that the replication slot creation process on read replicas would hang due to a loop in the backend process that tried to find a consistent start point. They submitted a patch to Postgres to introduce an interrupt check, allowing the process to be cancelled when stuck. Additionally, there was a discussion about adding a new wait event to help track the replication state.
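The real patch is C code inside Postgres, which is not reproduced here. The Python sketch below only mirrors the idea: a loop that scans for a consistent point must check for a cancel request on every iteration, otherwise nothing can stop it. The record format, `find_consistent_point`, and `SlotCreationCancelled` are all hypothetical.

```python
import threading


class SlotCreationCancelled(Exception):
    """Raised when the caller cancels a slot creation that is still waiting."""


def find_consistent_point(next_record, cancel_event):
    """Scan records until one marks a consistent start point.

    next_record: hypothetical callable returning the next record as a dict.
    cancel_event: threading.Event set by whoever wants to cancel the wait.
    Returns the number of records scanned before the consistent point.
    """
    scanned = 0
    while True:
        # The interrupt check: without it, this loop cannot be cancelled.
        if cancel_event.is_set():
            raise SlotCreationCancelled(f"cancelled after {scanned} records")
        record = next_record()
        scanned += 1
        if record.get("consistent"):
            return scanned
```

The check costs almost nothing per iteration, but it turns a process that can only be killed into one that responds to an ordinary cancel request.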
What is the primary difference between lexical search and semantic search in databases?
-Lexical search focuses on exact or similar textual matches, while semantic search considers the meaning behind the words. Lexical search is more appropriate when the exact text matters, such as in codebases or when searching for precise values like 'get user by ID.' In contrast, semantic search is used when context and meaning are more important, but it can be less efficient for tasks requiring exact matches.
Why is it important to set a maximum replication slot size in PostgreSQL?
-Setting a maximum replication slot size is important to prevent a primary database from running out of disk space. If a consumer cannot keep up with the replication load, the WAL retained for its slot keeps growing on the primary. Capping that retention, for example at 50 GB (in PostgreSQL 13 and later via the max_slot_wal_keep_size setting), lets Postgres discard older WAL segments once the cap is exceeded. The lagging slot is then invalidated and its consumer has to be re-synchronized, which is the price of protecting the primary.
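To make the arithmetic concrete, here is a small Python sketch that computes how much WAL a slot is holding back from two LSN strings and flags slots over a cap. It assumes the LSN values have already been fetched (in Postgres they would typically come from `pg_current_wal_lsn()` and the `restart_lsn` column of `pg_replication_slots`); the helper names and the 50 GiB default are illustrative.

```python
GIB = 1024**3


def lsn_to_int(lsn):
    """Convert a Postgres LSN such as '16/B374D848' to a byte position.

    The part before the slash is the high 32 bits, the part after is the low.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def retained_wal_bytes(current_lsn, restart_lsn):
    """Bytes of WAL the server must keep for a slot at restart_lsn."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


def slots_over_cap(current_lsn, slots, cap_bytes=50 * GIB):
    """Return (slot name, retained bytes) for slots above the cap, worst first.

    slots maps slot name -> restart_lsn string.
    """
    over = []
    for name, restart_lsn in slots.items():
        retained = retained_wal_bytes(current_lsn, restart_lsn)
        if retained > cap_bytes:
            over.append((name, retained))
    return sorted(over, key=lambda item: item[1], reverse=True)
```

A check like this is best used for alerting before the server-side cap is reached, since once Postgres enforces the cap the slot is already lost.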
What is the significance of using heartbeats in PostgreSQL logical replication?
-Heartbeats in PostgreSQL logical replication help every replication slot on a host keep up with write-ahead log (WAL) activity. The WAL is shared by all databases on the host, so a slot for a low-traffic database can stall while busier databases keep generating WAL that the slot then holds back. By generating a small amount of artificial traffic in the quiet databases, heartbeats give their slots something to confirm, so the slots keep advancing instead of getting stuck.
What are the potential challenges of using active-active replication?
-Active-active replication, where multiple nodes accept writes and replicate them to each other, brings challenges such as conflict resolution, high infrastructure and networking costs, and increased operational overhead. Keeping multiple regions and databases synchronized adds complexity, can degrade performance, and makes issues harder to debug. It is also not a reliable way to scale write throughput, since every write still has to be applied on every node.
What is the impact of not choosing the optimal query plan in PostgreSQL?
-A suboptimal query plan executes inefficiently, which means higher resource usage and slower queries. A key input to plan selection is the random_page_cost setting, which influences whether the planner picks an index scan or a sequential scan. A value that does not match the underlying storage can cause performance problems, especially in large databases or complex queries.
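The effect of random page cost can be shown with a deliberately simplified cost model in Python. This is not the real planner, which also accounts for correlation, caching, and much more; it assumes the worst case of one random heap page read per matching row. The constants are the Postgres defaults for `seq_page_cost`, `cpu_tuple_cost`, and `cpu_index_tuple_cost`; the function names and table sizes are made up.

```python
SEQ_PAGE_COST = 1.0           # Postgres default seq_page_cost
CPU_TUPLE_COST = 0.01         # Postgres default cpu_tuple_cost
CPU_INDEX_TUPLE_COST = 0.005  # Postgres default cpu_index_tuple_cost


def seq_scan_cost(table_pages, table_rows):
    """Read every page sequentially and process every row."""
    return table_pages * SEQ_PAGE_COST + table_rows * CPU_TUPLE_COST


def index_scan_cost(matching_rows, random_page_cost):
    """Simplified: one random page fetch per matching row, plus CPU work."""
    io = matching_rows * random_page_cost
    cpu = matching_rows * (CPU_INDEX_TUPLE_COST + CPU_TUPLE_COST)
    return io + cpu


def cheaper_plan(table_pages, table_rows, matching_rows, random_page_cost):
    """Return the plan with the lower estimated cost under this toy model."""
    seq = seq_scan_cost(table_pages, table_rows)
    idx = index_scan_cost(matching_rows, random_page_cost)
    return "index_scan" if idx < seq else "seq_scan"
```

For a hypothetical table of 10,000 pages and 1,000,000 rows with 5,000 matching rows, this model prefers a sequential scan at the default `random_page_cost` of 4.0 (20,075 versus 20,000) and switches to the index scan at 1.1 (5,575 versus 20,000). That flip is the kind of plan change the setting controls.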