98 Percent Cloud Cost Saved By Writing Our Own Database

ThePrimeTime
20 Apr 2024 · 21:45

Summary

TL;DR: The video discusses the decision-making process and the technical challenges behind building a custom database for a cloud platform that tracks tens of thousands of people and vehicles simultaneously. The company, jokingly likened to the 'NSA' despite its European roots, faced escalating Amazon Aurora costs and on-premise database clusters that were easily overwhelmed. To address this, they developed a bespoke in-process storage engine that cut their cloud costs by 98%. The engine writes a minimal delta-based binary format (referred to as 'aen') that prioritizes write performance and low storage footprint, pairing each entry with a unique 4-byte identifier and a separate index file for fast retrieval. Despite skepticism about the engineering cost and the lack of a version field in the binary encoding, the company claims the approach provides exactly the functionality they need without losing any features. The video also covers their use of AWS EBS and Glacier for storage, and closes with a plea to always include version fields in binary protocols for future adaptability.

Takeaways

  • 🚀 **Innovative Database Development**: The company saved 98% on cloud costs by developing their own database, tailored to their specific needs for tracking and storing location data.
  • 🔍 **High Performance Requirements**: They needed a database that could handle up to 30,000 location updates per second per node, with the ability to buffer these updates.
  • 📊 **Data Compression**: The custom database uses a minimal delta-based binary format, which significantly reduces the storage space needed, allowing for about 30 million location updates per gigabyte (a minimal sketch of this write scheme follows this list).
  • 💾 **Storage Efficiency**: They replaced a costly Aurora instance with a much cheaper elastic block storage volume, which, combined with their custom storage engine, led to the massive cost reduction.
  • ⏱️ **Speed Improvements**: Queries and data retrieval have become much faster, with one example going from 2 seconds to 13 milliseconds for a specific operation.
  • 🔢 **Binary Data Format**: The database stores data in a binary format, which is more space-efficient but also requires careful design to accommodate future changes.
  • 📈 **Scalability**: The system is designed to allow for unlimited parallelism, with multiple nodes able to write data simultaneously without an upper limit.
  • 🌐 **Global Use Case**: The database serves a cloud platform that tracks a large number of people and vehicles, with use cases ranging from car rentals to precise location tracking for various industries.
  • 🔒 **Loss Tolerance over Strict Durability**: The company accepts potentially losing about one second of buffered updates on a server failure, a deliberate trade-off of durability for write throughput.
  • 📝 **Lack of Versioning**: There is a noted absence of a version field in the binary format, which could be crucial for future compatibility and upgrades.
  • 📘 **Archiving Strategy**: Data older than 10 days is moved to AWS Glacier, which further reduces costs and is aligned with customer query habits.
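
To make the delta scheme above concrete: the engine persists a full snapshot of an object every 200 writes and only the changed fields in between, much like keyframes and delta frames in video encoding. A minimal Go sketch of that write rule, with hypothetical `encodeFull`/`encodeDelta` helpers standing in for the real binary layout:

```go
package storage

// ObjectState is a simplified stand-in for the fields the video lists:
// an ID plus location data and arbitrary key-value attributes.
type ObjectState struct {
	ID       string
	Lat, Lng float64
}

// Writer persists a full snapshot every keyframeInterval writes and only
// deltas in between -- the "full state every 200 writes" rule.
type Writer struct {
	keyframeInterval int                    // 200, per the video
	writeCount       map[string]int         // writes seen per object ID
	last             map[string]ObjectState // last persisted state per object
	sink             func([]byte)           // wherever encoded entries go
}

func (w *Writer) Write(s ObjectState) {
	if w.writeCount[s.ID]%w.keyframeInterval == 0 {
		w.sink(encodeFull(s)) // self-contained snapshot of every field
	} else {
		w.sink(encodeDelta(w.last[s.ID], s)) // only the fields that changed
	}
	w.last[s.ID] = s
	w.writeCount[s.ID]++
}

// Hypothetical helpers: the real format packs flags, ID, type index,
// timestamp, latitude and longitude into ~34 bytes per delta entry.
func encodeFull(s ObjectState) []byte          { return nil /* ... */ }
func encodeDelta(prev, cur ObjectState) []byte { return nil /* ... */ }
```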

Q & A

  • What is the main reason the company decided to build their own database?

    -The company decided to build their own database to save on cloud costs, which were upwards of $10,000 a month, and to handle the high volume of location updates efficiently.

  • What are the key performance requirements for the new database system?

    -The key performance requirements include handling up to 30,000 location updates per second per node, having unlimited parallelism for simultaneous writes across multiple nodes, and maintaining a small disk footprint due to the large volume of data.

  • How does the company's database system differ from a general-purpose database like PostgreSQL?

    -The company's database system is a purpose-built, in-process storage engine with a limited set of functionality that is bespoke to their specific needs, as opposed to a general-purpose database like PostgreSQL which offers an expressive query language and broader functionality.

  • What is the significance of the binary format used in the company's database system?

    -The binary format is significant because it allows for a minimal delta-based binary storage, which is highly space-efficient. This format enables the storage of about 30 million location updates in a gigabyte of space.
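
    (As a quick sanity check on that figure: at 34 bytes per delta entry, a gigabyte holds roughly 1,000,000,000 / 34 ≈ 29 million of them; with the larger full-state snapshots written every 200 entries, "about 30 million per gigabyte" is the right order of magnitude.)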

  • How does the company manage data consistency and durability concerns?

    -The company acknowledges that they are okay with losing some data due to buffering and server failures. They maintain low consistency guarantees and are comfortable with the potential loss of one second's worth of updates in the current buffer.

  • What is RTK and how does it relate to GPS accuracy?

    -RTK stands for Real-Time Kinematics, a technique used to enhance the accuracy of position data from GPS signals. It can improve the accuracy to as low as 10 centimeters, which is significantly better than the traditional 6-meter accuracy.

  • Why did the company choose to move older data to AWS Glacier?

    -The company moved older data to AWS Glacier to reduce costs. Since their customers rarely query entries older than 10 days, archiving data that exceeds 30 gigabytes to Glacier is a strategic decision to optimize their EBS costs.

  • What is the role of the separate index file in the company's database system?

    -The separate index file translates the static string ID for each entry and its type to a unique 4-byte identifier, which allows for extremely fast retrieval of the history for a specific object.
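
A minimal sketch of how such an index file might work, with hypothetical names: identifiers are handed out sequentially, and each string-to-uint32 mapping is appended to a side file so it can be rebuilt on startup.

```go
package storage

import (
	"encoding/binary"
	"fmt"
	"os"
)

// Index maps each (type, string ID) pair to a compact 4-byte identifier,
// so log entries can carry a fixed-size uint32 instead of a variable-length
// string. The mapping itself is appended to a separate index file.
type Index struct {
	next uint32
	ids  map[string]uint32
	file *os.File
}

// Lookup returns the existing identifier for stringID, or assigns the next
// free one and persists the new mapping.
func (ix *Index) Lookup(stringID string, typ byte) (uint32, error) {
	key := fmt.Sprintf("%d:%s", typ, stringID)
	if id, ok := ix.ids[key]; ok {
		return id, nil
	}
	id := ix.next
	ix.next++
	ix.ids[key] = id

	// Record "uint32 id | type | string length | string id" in the side file.
	buf := binary.LittleEndian.AppendUint32(nil, id)
	buf = append(buf, typ, byte(len(stringID)))
	buf = append(buf, stringID...)
	_, err := ix.file.Write(buf)
	return id, err
}
```

Because every entry then stores the identifier at a fixed byte offset, fetching one object's history is a scan that compares four bytes per entry and skips everything else without parsing.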

  • How does the company ensure high uptime and reliability for their database system?

    -They use a provisioned IOPS SSD (io2) EBS volume with 3,000 IOPS and batch updates to one write per second per node. EBS's built-in automated backups and recovery give them uptime guarantees comparable to what Aurora offered.

  • What is the primary storage format used by the company's database system?

    -The primary storage format is a minimal delta-based binary format, which includes flags, ID, type index, timestamp, and latitude and longitude data, with full state storage occurring every 200 writes.
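
A hedged sketch of the read side, in Go. The bit assignments, field widths and offsets here are assumptions (the video only names the fields, notes an entry-length prefix, and says the two flag bytes are yes/no switches such as "has altitude"):

```go
package storage

import (
	"encoding/binary"
	"math"
)

// Illustrative flag bits; the real assignments aren't published.
const (
	flagHasAltitude uint16 = 1 << 0
	flagHasData     uint16 = 1 << 1
)

// Entry mirrors the described field order: flags, 4-byte ID, type index,
// timestamp, then latitude/longitude. Widths and offsets are guesses, and
// the entry-length prefix is omitted here.
type Entry struct {
	Flags     uint16
	ID        uint32
	Type      byte
	Timestamp uint32
	Lat, Lng  float64
	Altitude  float64 // meaningful only when flagHasAltitude is set
}

// decodeEntry parses one entry, reading optional fields only when the
// corresponding flag bit says they are present.
func decodeEntry(buf []byte) Entry {
	e := Entry{
		Flags:     binary.LittleEndian.Uint16(buf[0:2]),
		ID:        binary.LittleEndian.Uint32(buf[2:6]),
		Type:      buf[6],
		Timestamp: binary.LittleEndian.Uint32(buf[7:11]),
		Lat:       math.Float64frombits(binary.LittleEndian.Uint64(buf[11:19])),
		Lng:       math.Float64frombits(binary.LittleEndian.Uint64(buf[19:27])),
	}
	if e.Flags&flagHasAltitude != 0 {
		e.Altitude = math.Float64frombits(binary.LittleEndian.Uint64(buf[27:35]))
	}
	return e
}
```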

  • What is the impact of the custom database system on query performance?

    -The custom database system has significantly improved query performance. For example, recreating a particular point in time in a realm's history went from around 2 seconds to 13 milliseconds.

Outlines

00:00

🤔 The Risks and Rationale of Building a Custom Database

The paragraph discusses the decision to build a custom database to save on cloud costs, despite it being a common first rule of programming to avoid such endeavors. The speaker expresses skepticism about writing one's own database, referencing Jonathan Blow's opinions on the matter and the complexities involved in database creation, such as ACID properties, sharding, and fault recovery. They also mention the existence of well-established databases that are free to use. The context of the decision is related to a cloud platform tracking numerous people and vehicles, with the need to handle a vast number of location updates efficiently.

05:01

💰 Cost Analysis and the Search for a Solution

This section delves into the cost implications of using AWS Aurora for the database, which amounts to over $10,000 per month. The company faces scalability issues as their customer base and data volume grow. The speaker contemplates whether a managed database service or a self-hosted database could have been a more cost-effective solution. They also discuss the requirements for their database system, emphasizing the need for high performance, the ability to handle a large number of updates per second, and low storage footprint.

10:02

🔍 Custom Storage Engine Design and Its Impact

The speaker details the design of a custom in-process storage engine that is part of their server's executable. This engine uses a minimal delta-based binary format, which results in a highly space-efficient storage system. The system allows for high write performance and is designed to handle a large number of location updates. The trade-off is a relaxed consistency model, where the system can tolerate losing a small amount of data. The outcome is a significant reduction in cloud costs, moving from a high-cost Aurora instance to a much cheaper elastic block storage volume.

15:02

🛠️ Lessons Learned from Building a Bespoke Storage System

The paragraph highlights the importance of having a version field in binary encoding, which the speaker considers a critical oversight in the described storage system. They discuss the potential need for future changes to the data format and the importance of being able to accommodate these changes. The speaker also mentions the use of AWS Glacier for long-term storage of older data to further reduce costs. They emphasize the speed improvements in data retrieval and the overall performance benefits of their custom solution.

20:04

😂 A Call for a Version Field and Open Sourcing the Project

In a lighter tone, the speaker humorously suggests that the creators of the product should add a version field to their database system. They express admiration for the article and the work done by the creators. The speaker also playfully addresses the idea of being 'bribed' by the NSA for promoting their product. They end on a note that encourages the use of version fields in binary protocols, drawing from their own experience in building TCP packets.


Keywords

💡Cloud Cost

Cloud Cost refers to the expenses incurred from using cloud services, such as Amazon Web Services (AWS), for data storage and computation. In the video, the company saved 98% in cloud costs by developing their own database instead of relying on a managed service like AWS Aurora, which was costing them over $10,000 a month.

💡Database

A database is an organized collection of data, typically stored and accessed electronically. The video discusses the decision to write a custom database to handle specific data needs more efficiently, which is a departure from the common practice of using existing database management systems like PostgreSQL.

💡Geospatial Data

Geospatial data refers to information that is related to the geographic location and features of the Earth, such as coordinates and maps. The video mentions the use of geospatial data for tracking vehicles and people, which is crucial for the company's operations.

💡Amazon Aurora

Amazon Aurora is a fully managed relational database service provided by AWS. It is known for its high availability and automatic failover capabilities. The video discusses the high costs associated with using Aurora and the decision to migrate to a custom database solution.

💡PostgreSQL

PostgreSQL, often simply Postgres, is a powerful, open-source object-relational database system. It is mentioned in the video as an alternative to building a custom database, highlighting the common practice of using established systems instead of creating new ones from scratch.

💡Real-time Kinematics (RTK)

RTK is a technique used in GPS to achieve more precise location data down to the centimeter level. The video discusses RTK in the context of the high precision required for tracking vehicles and workers, emphasizing the need for accurate geospatial data.

💡Binary Format

Binary format refers to the way data is stored in a computer using only two symbols, typically 0 and 1. The video describes the custom database's use of a minimal delta-based binary format to store location updates efficiently, which significantly reduces storage space requirements.

💡In-memory Architecture

An in-memory architecture is a type of database management system where data is stored and managed in the main memory (RAM) rather than on disk. The video highlights the use of an in-memory architecture for fast query processing and real-time data streaming.
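
As a toy illustration of this split (names hypothetical): live updates land in a RAM map that real-time queries read, and the append-only disk log is consulted only for history requests or node warm-up.

```go
package storage

import "sync"

// Position is the in-memory view of an object's latest state.
type Position struct {
	Lat, Lng float64
	Unix     int64
}

// LiveStore answers real-time queries from RAM; the append-only disk log
// is read only when a client asks for history or a new node warms up.
type LiveStore struct {
	mu      sync.RWMutex
	current map[uint32]Position // latest state per 4-byte object ID
}

// Apply records an incoming update in memory; disk I/O happens elsewhere,
// on the batched write path.
func (s *LiveStore) Apply(id uint32, p Position) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.current[id] = p
}

// Latest serves a real-time query without touching the disk.
func (s *LiveStore) Latest(id uint32) (Position, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	p, ok := s.current[id]
	return p, ok
}
```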

💡Elastic Block Store (EBS)

EBS is a service provided by AWS for block-level storage that can be attached to Amazon EC2 instances. The video discusses the use of EBS with provisioned IOPS SSD io2 for the custom database, which offers high input/output rates and improved performance.

💡Data Compression

Data compression is the process of reducing the size of data to save storage space and improve efficiency. The video explains how the custom database uses data compression techniques to store only deltas between updates, rather than full states, leading to significant space savings.

💡Version Field

A version field in binary encoding is used to manage changes in data format over time. The video script suggests that the custom database lacks a version field, which could be a potential oversight for future compatibility and updates to the data format.

Highlights

The company saved 98% in cloud costs by building their own database instead of using a managed service like AWS Aurora.

Building your own database is generally not recommended, but this case shows how it can make sense in certain scenarios.

The company needed to handle 30,000 location updates per second per node, which is beyond the capabilities of many existing databases.

They created a custom in-process storage engine that writes a minimal delta-based binary format, which, combined with moving from Aurora to cheap block storage, produced the 98% reduction in cloud costs.

The custom database is part of the same executable as their core server, eliminating the need for a separate database server process.

They store the full state of an object every 200 writes, and in between store only the deltas, making the storage very efficient.

The database uses a separate index file to quickly retrieve the history for a specific object using a unique 4-byte identifier.

The custom database allows them to store location data in a highly compressed format, taking up minimal disk space.

They are okay with losing up to 1 second of location updates in the buffer in the rare event of a server failure.

The custom database provides faster query performance compared to their previous Aurora setup, with queries going from 2 seconds to 13 milliseconds.

They moved older data to AWS Glacier to further reduce costs, while still maintaining high uptime guarantees.

Writing to the custom database via the local file system is faster and has lower overhead compared to a managed database service.

The custom database is a binary file feed with limited functionality, but provides the exact features they need for their use case.

While creating a custom database is challenging and not recommended for most, this case shows the huge benefits it can provide for specific use cases.

The database's binary format is called 'aen'; the video closes by noting that Hivekit is a commercial product with public pricing, and wondering aloud whether the storage engine was open-sourced.

The article provides a fascinating look at the tradeoffs and considerations involved in building a custom database for a specific use case.

Transcripts

[00:00] For those that don't know, this article is clearly designed for me: "How we saved 98% in cloud cost by writing our own database." I don't see how that's ever a good idea. Like, yo dog, why not just host Postgres? How is building your own database the move? We'll find out, here we go. "What is the first rule of programming? Maybe something like 'do not repeat yourself,' or 'if it works, don't touch it,' or how about: do not write your own database?" This actually is a great first rule of programming. It should be: one, don't write your own language, which is why we have all the languages, and two should be: don't write your own database. Well, too late. I feel like databases are the only things growing faster than JavaScript frameworks right now, is that fair to say? So, Jonathan Blow disagrees. The thing is, when you write your own language, usually you start writing it after decades of experience, and I think at that point it's pretty okay to say, "hey, I think I have an objectively better way to do something." I actually do agree with that statement. Jonathan Blow is a walking LTE, the L meaning lovely. Perfect. He has a lot of good takes; I think he has a lot of bad takes, especially his ones about open source. I'm like half in, half out on those ones, I don't really know where I land. Anyways, that's a good one.

[01:21] "Databases are a nightmare to write. From atomicity, consistency, isolation and durability requirements, to sharding, to fault recovery, to administration, everything is hard beyond belief." If you haven't seen it, I have a video on my channel about TigerBeetle. The presentation for it starts off a little slow, but man, it gets so good, and then it turns into a video game they wrote in Zig to represent what's happening in the TigerBeetle database. It is wild. Writing a good database, and how they go about testing and everything, is just incredible. "Fortunately, there are amazing databases out there that have been polished over decades and don't cost a cent. So why on Earth would we be foolish enough to write one from scratch?" Well, here's the thing: this is actually how every bad meeting started for me.

[02:10] "All right, we are running a cloud platform that tracks tens of thousands of people and vehicles simultaneously." FBI? NSA? What's the name of your company? Oh, it just happens to be NSA, okay, cool. What does NSA stand for? "Every location update is stored and can be retrieved via a history API. The amount of simultaneously connected vehicles and the frequency of their location updates varies widely over time, but having around 13,000..." Oh, they're from the EU, okay. Hey, this isn't America, people. That's not us, that's you guys, that's on your side of the pond, that's not my problem. This must be Macron's creation. Okay: "...around 13,000 simultaneous connections, each sending around one update a second." Wow, that's a decent amount of updates coming flying in over just persistent connections, right?

[03:07] "Our customers use this data in very different ways. Some use cases are very coarse, e.g. when a car rental company wants to show an outline of the route a customer took that day. This sort of requirement can be handled with 30 to 100 location points for a one-hour trip, and allows us to heavily aggregate and compress the location data before storing it." Oh yeah, that makes sense. Okay, I'm starting to understand what they're doing, what they're tracking, and kind of what they're reporting. "But there are many other use cases where that's not an option: delivery companies that want to be able to replay the exact seconds leading up to an accident, or mines with very precise on-site location trackers that want to generate reports of which worker stepped into which restricted zone, by as little as half a meter." What's the accuracy of GPS? I thought it was like 6 meters, has that changed? Three meters these days? Okay, when I was doing stuff in college and post-college it was six. Well, I mean, 6 meters is super accurate. RTK can get down to 10 centimeters? What the hell is RTK? I don't know RTK, what's RTK? "RTK makes it even better." What is that? I've never heard of RTK. "You wouldn't use GPS for this," okay, okay. Yep, this is beyond my abilities; I haven't been in the hardware bits for over a decade. Real-time kinematics, oh, interesting.

[04:21] "So, given that we don't know upfront what level of granularity each customer will need, we store every single location update." Okay, this makes sense. In other words, you have like a table, then you have aggregate tables or some sort of post-processing tables. I wonder why that doesn't work. "At 13,000 vehicles, that's 3.5 billion updates per month, and that will only grow from here. So far we've been using AWS Aurora with the PostGIS extension for geospatial data storage, but Aurora costs us upwards of $10,000 a month already, just for the database alone, and that will only become more expensive in the future. But it's not just about Aurora pricing. While Aurora holds up quite well under load, many of our customers are using our on-premise version, and there they have to run their own database clusters, which are easily overwhelmed by this volume of updates." Okay, okay, this makes sense. I'm not going to lie, these costs are pretty tame as far as a tech company goes. I mean, they actually have customers here, literal customers, right? "So we burnt about 28,000 on our Aurora migration this week." Yeah, I figure a lot of people spend a lot more.

[05:33] This is such an interesting choice to make. I mean, I guess if you're trying to future-proof yourself, knowing it's going to go from $10,000 to, say, $100,000 over the course of the next two years, maybe it makes more sense to start preparing for this stuff. But I'm just curious whether not having a managed database, just hosting your own database, would have been a better choice, right? Maybe you could have reduced it without so much engineering talent and time and all the things that go with it, you know what I mean? "Unfortunately, there is no such thing. If there is, and we somehow overlooked it in our research, please let me know. Many databases, from [ __ ] and H2 to Redis, support..." Redis? Boo. Can we boo? Boo, Redis, boo. So, for those that don't know, Redis is not Redis the original open source, as we learned like two days ago. Redis is actually a company that usurped Redis the open source, and, what appears to be, kind of pressured the guy to sell the IP. The guy didn't really want to be a maintainer, he wanted to go off and write hardware or something, so he was like, all right, whatever, and he left. And then Redis the company, which was called something else, like Redis Online or something like that, became Redis; they went from Redis Labs to Redis and then changed the license and all that. Yeah, Garantia, or however it was. Anyways, there's a good video on that. "...they are exclusively extensions that sit on top of existing DBs. PostGIS, built on top of PostgreSQL, is probably the most famous one. There are others, like GeoMesa, that offer great geospatial querying capabilities on top of other storage engines. Unfortunately, that's not what we need. Here's what our requirement profile looks like. Extremely high write performance: we want to be able to handle up to 30,000 location updates per second per node. They can be buffered before writing, leading to a much lower number of IOPS."

[07:27] So, the thing is, I don't know much about geospatial databases. I know they exist, and obviously there's some level of already-solved nature to this in the industry, and I don't want to just [ __ ] on something being recreated, because obviously TigerBeetle got created and it turned out to be incredible for TigerBeetle, right? You wouldn't want to use Cassandra, you'd want to use Scylla, right? Scylla's the way to go for any of those things; Scylla is just Cassandra but written in a fast language, not the JVM. But 30,000 updates per second doesn't sound wild, right? For larger companies, I mean. Obviously for smaller companies, for anyone with 20 or fewer engineers, this would be a much harder thing to solve, because now you have to actually have a dedicated staff, maybe two dedicated staff, trying to solve these things, which could be a real big hit to your bottom line. But it doesn't seem like this isn't already out there. Well, you got on-call, you got stuff, you got things you got to think about, right? If you try to go from a managed service to you managing a service, there's uptime requirements, setting up all the infrastructure, and all that. It's per node. Yeah, per node, okay, so per node, per database, hmm, I'm still not sure. I mean, hey, I could be wrong. Remember, TigerBeetle came from a necessity that I probably would not have understood, and I probably would have been like, "hey, that's kind of silly, to write your own database," and then TigerBeetle is absolutely amazing. "30,000 privacy violations a second." It doesn't look like privacy violations. I actually don't think this is privacy violations at all, because it's talking about tracking workers in sensitive areas and rental-car stuff. When you do rental cars, of course you're getting tracked on that. That makes perfect sense: the rental company wants to be able to ensure that you did what you agreed to, right? You made a contractual agreement that you would drive their car like this, and they would get this data out of it. "Keep saying per node." Per node, I know, I see per node. Okay, so per node, maybe that makes more sense: 30,000 write operations per second. "Unlimited parallelism: multiple nodes need to be able to write data simultaneously, with no upper limit. Small size on disk: given the volume of data, we need to make sure that it takes as little space on disk as possible."

[09:44] To be fair, when you have your own data format, and this is all you need to store, and you know that it isn't changing much, it technically is most efficient to write your own bespoke storage for this specific operation. But man, you'd have to really convince me that this is a good idea, 'cause, I mean, you really just hired five engineers to do this, right? Yeah, famous last words, I know. I'm just saying, this is crazy. This is very interesting. "Moderate performance for reads from disk: our server is built around an in-memory architecture. Queries and filters for real-time streams run against data in memory and, as a result, are very fast. Reads from disk only happen when a new server comes online, when a client uses the history API, or, soon, when a user rewinds time on our digital twin interface. These disk reads need to be fast enough for good user experiences, but they are comparatively infrequent and low in volume." Okay, so optimizing for writes. "Low consistency guarantees: we are okay with losing some data. We buffer about one second's worth of updates." Twin, our digital twin interface; okay, stop making fun of it, I said "twine," okay, whatever. "In the rare instance where a server goes down and another takes over, we are okay with losing one second of location updates in the current buffer."
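
(That buffering trade-off is simple to picture in code. A minimal Go sketch, with illustrative names: encoded updates pile up in memory, and a ticker flushes them as one batched write per second, so a crash loses at most the current buffer.)

```go
package storage

import (
	"os"
	"time"
)

// flushLoop batches encoded entries and issues one write per second --
// so a dying process loses, at worst, the second of updates still in buf.
func flushLoop(entries <-chan []byte, log *os.File) {
	var buf []byte
	tick := time.NewTicker(time.Second)
	defer tick.Stop()
	for {
		select {
		case e := <-entries:
			buf = append(buf, e...) // accumulate in memory, no I/O yet
		case <-tick.C:
			if len(buf) == 0 {
				continue
			}
			log.Write(buf) // one batched write per second per node
			log.Sync()     // durable up to the last tick
			buf = buf[:0]
		}
	}
}
```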

[10:57] Okay, they even have fault tolerance. I am curious why the other solutions didn't work out for this. "What sort of data do we need to store? The main type of entity that we need to persist is an object: basically any vehicle, person, sensor or machine. Objects have an ID, label, location and arbitrary key-value data, e.g. for fuel levels or current rider ID. Locations consist of longitude, latitude, accuracy, speed, heading, altitude and altitude accuracy, though each update can change only a subset of these fields. In addition, we also need to store areas, tasks (something an object has to do) and instructions: tiny bits of spatial logic the Hivekit server executes based on the incoming data." Altitude, yeah, like how high up they are.

[11:40] "What we built: we've created a purpose-built, in-process storage engine that's part of the same executable as our core server. It writes a minimal, delta-based binary format. A single entry looks like this." Okay, entry length, nice. Okay, so this is starting to look like a TCP protocol. It almost looks like something you could also just UDP over to another server. Kind of interesting, because given the fact that they don't need consistency, they could technically UDP it out to many servers to be stored, right? It's interesting. Flags, ID, type index, timestamp, okay, latitude, longitude, byte-length data, byte-length label, okay, interesting. Sounds like overkill? I mean, I'm sure we're missing many pieces of information here that make this make sense, but let's see. They might lose data; they said they're okay with losing data, which I think is probably fine, right? "UDP reliability doesn't work great in lossy fabrics." Yeah, that's fair. "Each block represents a byte. The two bytes labeled 'flags' are a list of yes/no switches that specify 'has altitude,' 'has longitude,' 'has data,' telling our parser what to look for in the remaining bytes of the entry. We store the full state of an object every 200 writes; between those, we only store deltas. That means that a single location update, complete with time and ID, latitude and longitude, takes only 34 bytes. This means we can cram about 30 million location updates into a gigabyte of space." Okay, very interesting.

[13:14] I wonder how they decide when to do, like, I-frames, right? If you don't know what an I-frame is: an I-frame is the frame of video coming down that has the full picture, and then you do P-frames, which are the differentials, and there are also B-frames, which are like backward-forward differentials, but we're not going to talk about that. So, how often do you do diffs versus how often do you do full data-point storage? Pretty interesting stuff. It's kind of like they've created their own video encoding on top of spatial data, right? Every 200... oh, every 200. Oh yeah, you know, when you say it out loud like that, it makes perfect sense. Pre-read, pre-read. "We also maintain a separate index file that translates the static string ID for each entry, and its type, to a unique 4-byte identifier. Since we know that this fixed-size identifier is always at byte index 6 through 9 of each entry, retrieving the history for a specific object is extremely fast." Interesting; it's a B-tree probably, somewhere.

[14:12] "The result: a 98% reduction in cloud cost, and faster everything. The storage engine is part of our server binary, so the cost of running it hasn't changed. What has changed, though, is that we replaced our $10,000-a-month Aurora instance with a $200-a-month Elastic Block Store volume. We are using provisioned IOPS SSD (io2) with 3,000 IOPS, and are batching updates to one write per second per node and realm." I'm sure a lot of people are thinking something very similar to me, which is: the engineering cost has to be significantly more than $10,000 a month for this to actually be written, tested, validated, all that kind of stuff.

[14:55] Improvements: you now have binary storage, which means you need versioning. One thing I can see right away that they goofed up: if you look at this, the header does not have a version field. One of the biggest mistakes people make when they do novice binary encoding is not considering a version field. This is probably the single most important thing to do, and the thing is, one byte is probably enough for your version field, but two bytes if you want to be super safe. Like, real talk, how are they gonna know, right? How are they gonna know when they need to change the format? So for me, the missing version is a dead giveaway that this is maybe a bit more novice of a binary encoding attempt. "Each block represents a byte. The two bytes labeled 'flags' are a list of yes/no switches." Okay, so there you go: has altitude, has longitude, has data. So again, this is why you would desperately want a version field right here, because imagine if you needed more than 16 flags: all of a sudden you might find yourself needing to change your header format. Very, very important to do.

play16:08

right as a result so cool they've saved

play16:10

some money I'm very skeptical I mean

play16:12

obviously the layman in me says you save

play16:15

$10,000 a month but you might have cost

play16:17

yourself $50,000 a month of engineering

play16:20

effort which theoretically would get

play16:22

paid back but I think the thing that

play16:24

they're that they alluded to earlier

play16:25

makes this make a lot more sense which

play16:27

was they uh on on premise part there you

play16:30

go on premise version so they have an on

play16:33

premise version so maybe my guess is

play16:35

that this is what it's attempting to

play16:37

solve also is they have the on-prem kind

play16:40

of like thing that's kind of coming down

play16:42

on them that's actually causing a lot of

play16:44

difficulty why we added a a version

play16:46

field to our DB I really hope they do

play16:48

add a version field because honestly it

play16:50

is dude the the boy just bet that he

play16:52

could get it right first try and trust

play16:54

me every single binary protocol I've

play16:56

ever written in fact if you go to uh Vim

play16:58

with me Vim with me uh and we go in here

play17:01

and we look at what what do we got do we

play17:03

got do I have any word anything that's

play17:05

called encoding TCP

play17:08

uh the first thing I did when building

play17:11

our own TCP packets was put a version in

play17:13

it version is the first part of our

play17:15

encoding right this is Step requirement

play17:20

Numero Uno right because you just have

play17:23

to whenever you build anything that

play17:25

involves just raw dog and TCP or any of

play17:28

these type of stuff oh man you got to be

play17:29

ready for you to screw up you screwed up

play17:31

that's all there is to it you didn't

play17:33

foresee something it has to be step one

play17:36

uh EBS has an automated backups and

play17:38

Recovery built in in high high uptime

play17:40

guarantees so we don't feel that we've

play17:41

missed out on any reliability guarantees

play17:43

that Aurora offered we currently produce

play17:45

about 100 gigabytes of data per month

play17:47

but since our customers rarely uh query

play17:49

uh entries older than 10 days we've

play17:51

started moving everything above 30

play17:52

gigabytes to uh a AWS Glacier uh by the

play17:56

way Glacier I thought was one of the

play17:58

coolest

play17:59

characters and Killer Instinct for those

play18:00

that are just wondering thus reducing

play18:02

our EBS costs but it's not just costs
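
(Shipping cold segments off EBS doesn't need much machinery. A hedged sketch with the AWS SDK for Go v2, bucket and key names made up, that uploads a closed log segment straight into the GLACIER storage class.)

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"github.com/aws/aws-sdk-go-v2/service/s3/types"
)

// archiveSegment ships one closed log segment to S3 with the GLACIER
// storage class; bucket and key naming here are illustrative.
func archiveSegment(ctx context.Context, path string) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	client := s3.NewFromConfig(cfg)
	_, err = client.PutObject(ctx, &s3.PutObjectInput{
		Bucket:       aws.String("example-location-archive"),
		Key:          aws.String("segments/" + path),
		Body:         f,
		StorageClass: types.StorageClassGlacier, // cold, cheap, rarely read
	})
	return err
}

func main() {
	if err := archiveSegment(context.Background(), "realm-2024-03.aen"); err != nil {
		log.Fatal(err)
	}
}
```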

[18:04] "But it's not just costs. Writing to a local EBS volume via the file system is a lot quicker and has lower overhead than writing to a managed database. Queries have gotten a lot faster too. It's hard to qualify or quantify, since the queries aren't exactly analogous, but, for instance, recreating a particular point in time in a realm's history went from around 2 seconds to 13 milliseconds." Super cool. Again, I feel like I've said this more than once: creating your own storage for your specific needs, if you're good at it, of course, assuming you have no skill issues, will always be the single best way to store your data, 'cause it's bespoke to you. But it is also absolutely the hardest, most challenging, probably-you-shouldn't-do-it move, just throwing that out there, okay? Just saying, you probably shouldn't do that. But this is very, very impressive, 'cause this is several orders of magnitude. It's a nightmare of skill issues. It is a nightmare to do, and you should effectively never do it, unless you're TigerBeetle or, apparently, these guys. "Of course, that's an unfair comparison. After all, Postgres is a general-purpose database with an expressive query language, and what we've built is just a cursor streaming a binary file feed with a very limited set of functionality. But then again, it's the exact functionality we need, and we didn't lose any features." "If they have started archiving everything after 30 GB, then I probably would have started by keeping everything in RAM and buffering across 3x machines." Yeah, that's kind of interesting, and then you could do that whole UDP talk cycle, right? 'Cause once you hit the backbone, you're probably going to get very low packet loss, and you could just have some sort of crazy round-robin so everything stays up, have a node that can just kind of handle it at all times, and then hit it with the cold storage, the Glacier, afterwards. "We do not have skill issues," says the writer. Very, very interesting. Yeah, that feels like a little overkill, but still, it's super cool.

[19:55] Let's see: "you can learn more about Hivekit's API and features." Oh, cool, okay, so did they open-source this as well? "Location infrastructure for the internet. Hivekit provides..." Okay, when you say it this way, it does make you feel like this is actually the NSA again. Okay, I know you tried to trick me, NSA, with your European units and writing with periods instead of commas, but you're not faking me out this time. I know what's happening here, you're trying to bamboozle me, absolutely. "Moral of the story: just use ETS, the constant in-memory store built into Erlang." Dude, shy never misses an opportunity. Every good story can be made better by mentioning Erlang, it's just a fact of life. That was really cool, though. I actually, genuinely liked this article. Again, please, if somehow this gets out to you, creators of said product... I didn't even realize that we were reading about an actual product, with pricing and all this, until now. Is this an ad? Did I just get an ad? Can you pay me money? You want to pay me money for this? Uh, no. Anyways, please, for the love of all things good and holy, put a version field in here. You will never be sad that you put it in, and you will always be sad that you didn't have it. Hashtag: did we just ad ourselves? Did I just give you guys some ads? I think I might have. NSA, if you could pay me... I don't know how much money I'd need from the NSA to sell out my soul for being able to track people, but I mean, I assume we're getting the bag, NSA. All right, hey, that was awesome. Now let's talk about being un... unplayed.

[21:41] The name is... the database... "aen" is the binary. Aen.


Related Tags
Cloud Cost · Database · Data Management · Custom Storage · Geospatial Data · Performance · Efficiency · Engineering · Binary Format · Innovation