2: Instagram + Twitter + Facebook + Reddit | Systems Design Interview Questions With Ex-Google SWE
Summary
TL;DR: This video script outlines a comprehensive guide to building a social media platform supporting services like Instagram, Twitter, Facebook, and Reddit. It covers essential features including newsfeeds, user following, and Reddit-style nested comments. The speaker discusses system design considerations, database choices, and the use of technologies like Cassandra, Flink, and Kafka to ensure scalability, performance, and consistency. The script also delves into optimizing read operations, managing large volumes of data, and handling popular posts from verified users.
Takeaways
- 😀 The video covers building a quad-combo service for Instagram, Twitter, Facebook, and Reddit, focusing on similar features like Newsfeeds and Reddit-style nested comments.
- 🕵️♂️ The plan includes supporting a Newsfeed, user following/followers, and configurable privacy types for posts, with an emphasis on optimizing for read operations due to the nature of social media usage patterns.
- 📈 Capacity estimates are provided, assuming roughly 100 characters (about 100 bytes) per post or comment plus another 100 bytes of metadata, with storage requirements calculated for a billion posts per day and up to a million comments per post.
- 🔄 The use of Change Data Capture (CDC) is proposed to maintain follower relationships and avoid partial failure scenarios, ensuring data consistency without the need for two-phase commits.
- 💡 Derived data and stream processing frameworks like Kafka and Flink are recommended for keeping data in sync and ensuring no messages are lost, even in the event of a failure.
- 🛠️ Cassandra is suggested as the database of choice for the user followers table due to its high write throughput and the ability to handle write conflicts naturally by merging data.
- 🔑 The importance of proper partitioning and sorting in databases is highlighted to ensure fast query performance, especially for operations like deleting a follower or loading a user's Newsfeed.
- 📱 For Newsfeed optimization, the video discusses the concept of caching every user's Newsfeed on powerful servers to provide a fast reading experience, even considering the asynchronous nature of data updates.
- 🔍 A hybrid approach is considered for handling popular posts from verified users with many followers, using a combination of direct database reads and caching strategies to manage the load.
- 🗣️ The script touches on the implementation of security levels in posts, suggesting storing security permissions within the followers table and allowing Flink to manage these permissions when delivering posts to caches.
- 🌐 Finally, the video addresses the complexity of implementing nested comments, proposing a depth-first search index similar to a geohash for efficient range queries and good disk locality.
Q & A
What is the main focus of the video?
-The video focuses on building a system that supports features for Instagram, Twitter, Facebook, and Reddit, including newsfeed and Reddit-style nested comments.
What are the key features planned for the system?
-The key features include a newsfeed, support for Reddit-style nested comments, quickly loading who a user is following and who follows them, getting all posts for a given user, low latency newsfeed, configurable privacy types, and optimizing for read operations.
Why is optimizing for reads important in the context of a social media site?
-Optimizing for reads is important because the majority of user interactions on social media sites involve reading or 'lurking' rather than posting, making read operations more frequent.
What is the estimated storage requirement for a single post and how does it scale up to yearly storage for a billion posts per day?
-A single post is estimated to be around 200 bytes, including metadata. With a billion posts per day, this could lead to approximately 73 terabytes of storage per year.
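The arithmetic behind that estimate can be checked directly; this is a quick sketch using the video's round numbers (200 bytes per post, one billion posts per day):

```python
# Back-of-the-envelope storage estimate using the video's round numbers.
BYTES_PER_POST = 200            # ~100 bytes of text + ~100 bytes of metadata
POSTS_PER_DAY = 1_000_000_000   # one billion posts per day

bytes_per_day = BYTES_PER_POST * POSTS_PER_DAY   # 200 GB/day
bytes_per_year = bytes_per_day * 365             # 73 TB/year

print(bytes_per_year / 10**12)  # 73.0 (terabytes)
```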
How does the system plan to handle the follower and following relationships in a distributed database setting?
-The system plans to use a change data capture (CDC) method with a single source of truth table and stream processing frameworks like Kafka and Flink to ensure data consistency and avoid partial failure scenarios.
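One reason duplicate delivery is tolerable in this design: if the consumer applies CDC events idempotently, a replay after a Flink checkpoint recovery is harmless. A minimal sketch (names like `apply_follow` and `user_following` are illustrative, not from the video):

```python
from collections import defaultdict

# Derived user_following table: follower_id -> set of users they follow.
user_following = defaultdict(set)

def apply_follow(event):
    """Apply one CDC event from the user_followers source-of-truth table."""
    # A set-insert is idempotent, so replayed duplicates are no-ops.
    user_following[event["follower_id"]].add(event["user_id"])

apply_follow({"user_id": 4, "follower_id": 22})
apply_follow({"user_id": 4, "follower_id": 22})  # replayed duplicate after recovery
print(user_following[22])  # {4} -- the duplicate changed nothing
```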
What database is suggested for handling the user follower table and why?
-Cassandra is suggested due to its high write throughput, leaderless replication, and the use of LSM trees, which allow for fast ingestion and buffering in memory.
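A toy in-memory model of that layout may help: in real Cassandra, `user_id` would be the partition key and `follower_id` the clustering key, so a user's followers live on one partition in sorted order. This sketch only mimics that shape:

```python
import bisect

class FollowersTable:
    """Illustrative model: partitioned by user_id, rows sorted by follower_id."""

    def __init__(self, num_partitions=2):
        self.partitions = [dict() for _ in range(num_partitions)]

    def _partition(self, user_id):
        return self.partitions[hash(user_id) % len(self.partitions)]

    def follow(self, user_id, follower_id):
        rows = self._partition(user_id).setdefault(user_id, [])
        i = bisect.bisect_left(rows, follower_id)
        if i == len(rows) or rows[i] != follower_id:
            rows.insert(i, follower_id)       # keep clustering (sorted) order

    def unfollow(self, user_id, follower_id):
        rows = self._partition(user_id).get(user_id, [])
        i = bisect.bisect_left(rows, follower_id)
        if i < len(rows) and rows[i] == follower_id:
            rows.pop(i)                       # binary-search lookup, then delete

    def followers(self, user_id):
        # Single-partition read; already sorted, no aggregation step needed.
        return list(self._partition(user_id).get(user_id, []))
```

Because a delete first binary-searches within one partition, "X unfollows Y" never requires a scatter-gather query across nodes.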
How does the system handle the issue of popular users with millions of followers?
-For popular users, a hybrid approach is used where posts are read from the Post DB directly, and a caching layer for popular posts is introduced to handle the high volume of followers efficiently.
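The read-time merge in that hybrid approach can be sketched as follows (field names like `ts` are hypothetical; both inputs are lists of post records):

```python
def read_newsfeed(cached_posts, verified_posts, limit=10):
    """Merge the precomputed feed cache with posts fetched at read time
    from verified users (via the post DB or a popular-post cache)."""
    merged = sorted(cached_posts + verified_posts,
                    key=lambda p: p["ts"], reverse=True)
    return merged[:limit]

feed = read_newsfeed(
    [{"ts": 3, "text": "friend's post"}],
    [{"ts": 5, "text": "celebrity post"}, {"ts": 1, "text": "older celebrity post"}],
)
print([p["ts"] for p in feed])  # [5, 3, 1]
```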
What is the proposed method for implementing configurable privacy levels for posts?
-The implementation involves storing additional information in the followers table to indicate the security level of the relationship, which is then used by the Flink consumer to filter posts accordingly.
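A sketch of that filtering step inside the fan-out consumer; the level names and their ordering here are assumptions for illustration, not from the video:

```python
# Higher number = closer relationship. A post is delivered only to
# followers whose stored relationship level meets the post's visibility.
LEVELS = {"public": 0, "followers": 1, "close_friends": 2}

def eligible_followers(post_visibility, followers):
    """followers: dict of follower_id -> relationship level string."""
    required = LEVELS[post_visibility]
    return sorted(f for f, lvl in followers.items() if LEVELS[lvl] >= required)

followers = {1: "followers", 10: "close_friends", 22: "followers"}
print(eligible_followers("close_friends", followers))  # [10]
```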
What challenges arise when considering the replication of comments in a social media system?
-Challenges include maintaining causal dependencies and ensuring that the state of the replicas makes sense, avoiding situations where a comment's child exists on a replica but not its parent.
How does the video script address the problem of reading nested comments efficiently?
-The script suggests using a depth-first search index, similar to a geohash, which allows for range queries to efficiently retrieve entire branches of comments.
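The idea can be sketched with fixed-width path segments as keys: lexicographic key order then equals depth-first order, so an entire branch is one contiguous range. The four-digit segment width is an arbitrary choice for illustration:

```python
# Each comment's key concatenates fixed-width child indexes along its
# path from the root, geohash-style. Sorting keys lexicographically
# lays each subtree out contiguously, giving good disk locality.
comments = {
    "0001":         "top comment A",
    "00010001":     "reply to A",
    "000100010001": "reply to that reply",
    "0002":         "top comment B",
}

def load_branch(index, key):
    # In a real store this would be a single range scan over the prefix.
    return [text for k, text in sorted(index.items()) if k.startswith(key)]

print(load_branch(comments, "0001"))
# ['top comment A', 'reply to A', 'reply to that reply']
```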
What is the overall architecture of the system presented in the video?
-The system architecture includes services for user management, follower relationships, post management, and comments, with databases like MySQL, Cassandra, and potentially a graph database or a depth-first search index for comments, all interconnected through Flink nodes for stream processing.
Outlines
📺 Introduction to Building Social Media Services
The speaker introduces a video tutorial on constructing four similar social media services: Instagram, Twitter, Facebook, and Reddit. The goal is to build these services within an hour, covering features like Newsfeed and Reddit-style nested comments. The speaker also mentions personal anecdotes, such as needing a haircut and having eaten a lot, indicating a casual and humorous tone. The importance of optimizing for read operations due to the read-heavy nature of social media use is emphasized, along with initial capacity estimates for posts and comments storage.
🔍 Database Design for Efficient Follower and Following Operations
This paragraph delves into the challenges of database design for efficiently querying followers and followings. The speaker discusses the limitations of traditional indexing and the benefits of using change data capture (CDC) to maintain consistency without resorting to two-phase commits. The use of stream processing frameworks like Kafka and Flink is proposed to ensure no data loss and to update derived data. The choice of Cassandra as the database is justified due to its high write throughput and the use of leaderless replication and LSM trees.
🛠 Optimizing Newsfeed Generation for Social Media
The speaker outlines the process of generating a Newsfeed, discussing the naive approach of aggregating posts from a sharded database and the more optimal method of using Flink to manage user-following relationships and post deliveries. The importance of caching entire Newsfeeds in memory for quick access is highlighted, along with the potential use of multiple replicas to distribute the load. The challenges of delivering posts from popular users with millions of followers are also touched upon.
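The sizing argument behind caching every user's newsfeed can be reproduced with the video's figures (host count is approximate and ignores replicas):

```python
FANOUT = 100                    # average followers per user -> copies per tweet
TWEETS_PER_DAY = 1_000_000_000
BYTES_PER_TWEET = 200
HOST_RAM = 256 * 10**9          # one "super beefy" 256 GB cache host

total_cached = FANOUT * TWEETS_PER_DAY * BYTES_PER_TWEET  # 20 TB/day of copies
hosts = total_cached / HOST_RAM                           # ~78, round up to ~80-100

print(total_cached / 10**12, round(hosts))
```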
🔄 Hybrid Approach for Newsfeed Caching
The paragraph introduces a hybrid approach to handle Newsfeed caching, especially for popular users who may have a high volume of followers. The speaker suggests using change data capture to update both the Post DB and a popular post cache asynchronously. The process involves Flink consumers partitioned by user ID to manage data efficiently, ensuring that updates to posts and security permissions are propagated correctly.
🗨️ Designing for Nested Comments in Social Media
This section focuses on the complexities of designing a system to handle nested comments, like those found on Reddit. The speaker discusses the trade-offs between breadth-first and depth-first search approaches for loading comments and the challenges of using non-native graph databases. The limitations of binary search in databases and the advantages of native graph databases with pointers for faster depth-first search are explained.
📚 Implementing Depth-First Search Index for Comments
The speaker proposes a depth-first search index for comments, inspired by geohashing, to improve the performance of loading nested comments. The method involves creating a comment index based on the full path of comments, allowing for efficient range queries to retrieve entire branches of comments. The use of single-leader replication for the comment database is justified to maintain causal dependencies and ensure up-to-date comment data.
🌐 System Architecture for Social Media Services
The final paragraph presents a comprehensive system architecture diagram for the social media services discussed in the video. It includes services for user management, follower relationships, post management, and comments, each with their respective databases and data flow managed by Flink nodes. The architecture aims to balance consistency, speed, and scalability, with a focus on efficient data processing and caching strategies.
Keywords
💡Quad combo video
💡Newsfeed
💡Nested comments
💡Change data capture (CDC)
💡Stream processing
💡Cassandra
💡Partitioning and sorting
💡Load balancing
💡Hybrid approach
💡Graph database
💡Depth-first search index
Highlights
Introduction of a quad-combo video covering the development of four similar services within an hour.
Building services for Instagram, Twitter, Facebook, and Reddit with shared features like Newsfeed and Reddit-style nested comments.
Supporting quick loading of followers and followings, and fetching posts for a given user with an efficient solution.
Challenges of creating a low-latency Newsfeed and the importance of optimizing for reads over writes in social media platforms.
Requirement of supporting configurable privacy types for posts as requested by users.
Estimation of storage capacity for a billion posts per day and the implications for database design.
The use of change data capture (CDC) to maintain follower relationships and avoid partial failure scenarios.
Discussion on the choice of database for follower relationships, leaning towards Cassandra for its write throughput.
Explanation of partitioning and sorting keys in Cassandra for optimizing user follower table queries.
Innovative approach to Newsfeed generation using in-memory caches for each user's Newsfeed.
Use of Apache Flink for processing streams of data and updating Newsfeed caches.
Addressing the challenge of popular users with millions of followers and the introduction of a hybrid Newsfeed approach.
Implementation of a depth-first search index for nested comments to optimize read efficiency.
Comparison between native and non-native graph databases for handling nested comments.
Design of a single-leader replication system for comments to maintain causal dependencies.
Final system design overview connecting all services and databases for a comprehensive understanding.
Call to action for feedback and critique on the presented system design.
Transcripts
hello everybody and welcome back to the
channel today we'll be doing a quad
combo video of four different Services
which we're all going to build in one
video cuz they're all pretty damn
similar so hopefully we can get through
it within an hour because I have to get
a haircut after this and even besides
that I've done myself the absolute favor
of eating a ton of chicken wings and
protein shakes today and at some point I
may need a 15 to 20 minute break to go
spill out my guts anyways let's go ahead
and jump into this thing cuz I need to
get started all right so let's go ahead
and dive on into it so today we are
going to be doing Instagram Twitter
Facebook and Reddit well that's
obviously a lot so let's actually talk
about the features that we plan on
supporting I assume all of you use at
least one of these things so you'll
probably understand what I'm talking
about when I say we're going to try and
support a
Newsfeed and in addition to that we're
also going to be supporting Reddit style
nested comments where the comments
themselves basically form a tree you can
have some top comments and then you can
also have load more buttons that are
going to quickly fetch basically
the next branch of comments below that
so hopefully that makes sense let's go
ahead and move on to some
requirements so we've got a few
different objectives in a problem like
this oh boy my throat is already
starting to get sore but I'm going to
push through it so the first is that of
course we always want to be able to
quickly load who we're following and who
follows us those are just two common
features of all of these applications
additionally we want to be able to get
all the posts for a given user I say
this because when you see our eventual
solution this isn't necessarily
something that comes for free with that
we need to be able to support this as
well eventually as well we want to be
able to support a low latency Newsfeed
making a Newsfeed is easy making it
quick is hard there are a lot of talks
from Twitter Engineers to prove this
number four is that we want to be able
to support configurable privacy types
this isn't that hard but someone did ask
for uh for this in the comments of the
last version of this video so I figured
screw it why not throw this one in there
and then they also asked for Reddit
style comments where they can be
infinitely nested now if you're thinking
about the use patterns of something like
a social media site hopefully it makes
sense that 90% of the time that you're
on there you're probably just lurking in
reading stuff and you're posting pretty
infrequently so as a result the main
thing we want to keep in mind here is
that we're going to be aiming to
optimize reads as opposed to writes that
is going to be very
important okay so let's start thinking
about some capacity estimates let me
zoom in a little bit here so the first
thing is that I'm going to assume there
are 100 characters per post get that one
from Twitter even though it's 140 let's
do 100 for making the math a little bit
easier round 100 bytes because you know
a character is basically a byte and then
additionally let's estimate that you
know other metadata like user ID uh post
time stamp stuff like that adds another
100 bytes so maybe we can assume that
there's 200 bytes in a single post
additionally if we have a billion posts
per day which is actually pretty
realistic for some of these sites you
could be looking at uh 73 terabytes per
year of storage that's a lot it's
actually not a ton when you're a
platform like this and you make a ton of
money through ads 73 terabytes a year is
pretty little but we are going to need
more storage than that we'll discuss
that later additionally let's assume
that the average user has around 100
followers I personally have no followers
I like to just tweet into the abyss and
that there are some verified users with
Millions for example myself all the
women in the world follow me next we
also have comments I'm going to limit
those to 100 characters for Simplicity
sake and again that means they're
probably around 200 bytes in total to store
that and uh also when we start to think
about comments and making them
infinitely nested it's important to have
a sense of you know how much data
comments for a post actually take I'm
going to assume there can be up to a
million comments per post and that means
up to 200 megabytes of storage required
because a million times 200
bytes okay so the first thing that we're
going to talk about is starting to fetch
our follower/following so the issue
here is that we want both of these
operations to be very fast right we have
this one over here where we get the
followers for a specific user and then
we also want to get all of the users
that a specific user is following and we
want both of those to return quickly now
the issue with this is that if you were
to choose a database table that is
indexed in a manner such that it is
either by the user ID for all of their
followers or the user ID for all of
their followings that is going to be
very slow for the other type the reason
why is that these queries are going to
be distributed so if we look over here
on the right side of the screen let's
imagine that we used this type of table
over here right where we've got one user
and then a follower ID and that
represents a following relationship and
then we go ahead and index on that user
ID so you can see the user ID field is
sorted that's why fours are at the
bottom six is at the top and then let's
say we go ahead and partition also on
that user ID because they're going to be
a ton of follower following
relationships we're not going to be able
to store them all on one database table
and so that's going to be really good
when we want to find all the followers
of one particular user however it's
going to break down in a distributed
setting when we want to find all of the
users that a user follows so let's say
we wanted to find all the users that
user one follows so that would be this
guy over here as you can see because we
are partitioning on basically the person
that has a following getting all of a
person's uh users that they follow is
going to be very challenging because
we'd have to do a distributed query we
wouldn't have any indexing within those
nodes we would have to do a linear time
sort or rather a linear time scan to go
through every single row on here and
every single row on here and then we'd
have to aggregate them somewhere on some
other server and then we would have to
return that back to the user and that
probably just isn't feasible so what
I've opted to ultimately do instead is
use an actual derived data change data
capture type of method the reason I opt
for something like change data capture
here with one source of truth table is
the fact that it helps us avoid partial
failure scenarios because keep in mind
that if you're a client and you're
writing to two different databases at
once one this is one database here's the
other database you know one of these
writes could fail one of them could
succeed and really the only way to
guarantee that doesn't happen is going
to be something like two-phase commit
and that's going to be really slow so if
we really want to ensure consistency
another thing that we could do is use
something like change data capture and
then use one of our coveted stream
processing Frameworks to ensure that
none of those writes get lost you can
actually use one of the tables as shown
over here and you would use CDC from it
to basically go into something like
Kafka the reason Kafka is good is
because it's replicated so it's fault
tolerant and it is also basically a
persistent log so we know that if a
message uh doesn't get processed in the
moment we can always go back and process
it later and then we've got something
like Flink which if you recall is going
to checkpoint State
occasionally and this is going to make
sure that we're never not processing a
single message and then we can basically
Al take it and update our other table
and this is going to be our derived
data so you may notice that I've chosen
to have the user followers table be the
source of truth that means for user X
who
follows X and if you recall that schema
looks something like we've got our user
over here and then all the people that
follow them on the right and we index by
that user ID and we can partition on
that as well so then of course we've got
Flink listening to all of those
different partitions and then uploading
another table keeping it in check now
you may think to yourself well it is
possible that Flink uh in a failure
scenario might upload duplicate messages
to the user following table that's
really not a huge deal because at the
end of the day a duplicate upload can
just be deduped right like if I have 4 and
22 and then I have 4 and 22 again I
could just say well if it's already in
there just don't add it not a huge
deal okay so now that we have kind of an
idea of how we're actually going
to maintain our follower relationships
within databases what type of actual
database should we be using to do so so
of course we do want to have good read
speeds and good write speeds for these
tables especially for this guy over here
because it doesn't have any buffering
before the writes get there it's
basically just taking all the writes as
they come in so it's important that we
also do have some fast ingestion here so
in my opinion I think something like
Cassandra would be really good Cassandra
is very good in terms of its write
throughput for a couple of reasons one
of which is that it uses leaderless
replication so writes can go to any
replica another is that it uses LSM
trees so writes are first buffered in
memory but the point here is that like I
mentioned we don't really have conflicts
or write conflicts to worry about in the
relationship of kind of a user and their
follower because at the end of the day
you're just merging them all together
right like if I say that a user has one
follower on one replica and then I say
that a user has a follower B on another
replica the kind of combined state
of those two is just great now this
user has two followers so again write
conflicts not an
issue and this was unlike our tiny URL
video where they were certainly an issue
so again this is good for the reasons I
mentioned over here it's going to be
fast and basically what should our
partition and sort key be if we are going
to have a database like this especially
for our user follower table because
that's where we really care about the
latency so my thinking here is well the
obvious thing to partition on would be
the user right because we're already
indexing on there we want all of the
user's followers to be on a single
database node or a single database
partition because that is going to make
the queries a lot faster we don't have
to do any you know aggregations after we
hit multiple partitions and in addition
to that by actually sorting on the
follower following ID this just means
that if we ever need to delete a row say
a follower stops following a person we
can go hit that partition quickly find
their follower and then go delete that
row
so you know the example would be here's
partition one here's partition two you
may notice that uh obviously we're not
using just range based sorting or else
uh number six would be on this partition
and number 18 would be on the other
partition we're using a hash range
sorting or rather a hash range
partitioning because ideally that is
going to load balance our tables a
little bit better keep the partitions
more balanced consistent hashing yada
yada yada you guys get it by now if you
don't I recommend watching the tiny URL
video and the consistent hashing video
for some more details there so we've
spoken about how we actually want to go
ahead and make our follower and
following tables how we're using derived
data and change data capture to keep
those up and it's very important that we
actually have those in check because
they are crucial for actually
maintaining a Newsfeed the reason why is
that for any given user we need to know
who follows them if I post it's very
important that basically uh all of my
followers are going to see my my post
and so we need to be able to do that and
then of course once we actually go and
take a person and basically all of who
they follow then you have to generate
all the posts for them so kind of the
naive way of doing this is you've got a
client right this is the reader of
Newsfeed the first thing they could do
is potentially hit the user following
table now to clarify that's just who X
follows so that's going to respond with
some people and then they're going to go
and reach out to the post DB which is
probably going to be sharded on user ID
because that is the most rational way of
doing things and then it's going to
aggregate them all on one server blah
blah blah blah
blah and return it back to the user and
the reason I said this was naive is
because all of those distributed queries
and then the aggregation step is
probably just going to be too slow
you're basically bottlenecked by
whatever the slowest query could
possibly be and that is a bad thing and
hence this this is why I call this the
naive way of building the news feed in
reality what we would like to do is put
in a little bit more work on the right
path so that we can speed up our read
path a little bit more so let's actually
start to talk about that here's what I
would call the more optimal Newsfeed so
the first thing is that we should
recognize that if we really want to make
a news feed as fast as possible we would
somehow have to index all of the tweets
by which news feed they belong to and
the issue with that is that you can't
obviously do that because it belongs to
many different news feeds in fact we
even said it belongs to around a 100 of
them because the average user has around
100 followers so the issue is that you
know because there are so many different
places to put this tweet we would have
to end up storing a ton of data well how
much data actually we estimated there
were around a billion tweets a day and
there were around 200 bytes a tweet and
what if we did actually store 100 copies
of every single tweet and put it in a
different index well we could and then
we would actually only have 20 terabytes
of tweets per day which if if you think
about it is really not that bad for a
company like Twitter I mean they have
literally millions of dollars of servers
20 terabytes is nothing for them and
especially if we want to make this
really fast and throw it in memory let's
say we had 256 GB super beefy hosts
where they actually have that much RAM
in them that's only around 80 maybe we
can round that up to 100 in memory
caches and so effectively what we can
actually do is go ahead and cache every
single user's Newsfeed and we can do it
on one of these super beefy servers
maybe we've got a couple of replicas of
them so maybe instead of 80 it's more
like 200 but that's okay again these are
massive sites they make lots of money
200 massive servers is not going to make
or break things for them so let's
actually start looking at the newsfeed
diagram or at least the initial sketch
of it again this is going to be a lot to
ingest so no worries if it takes you a
little bit to uh put this all down you
can always pause the video and go back
for a second so let's say that we've got
client six over here the first thing
they're going to do is make a post
that's going to hit our post database
note that again I'm using change data
capture here because it keeps everything
consistent and we don't have to worry
about two-phase commit that is then
going to be ingested into a Kafka queue
for the same reasons that I mentioned
before this gives us fault tolerance and
it gives us replayability which is very
important for this guy here Flink so in
this case as opposed to just ingesting
one stream of data this Flink consumer
is going to be getting data from two
sources the first is going to be the
user followers table which if you recall
from before is going to be that source
of Truth which means that whenever a
following is done it goes right into
this table and so what's going to now
happen is that this user or rather this
Flink consumer can cache who follows user
six pardon me as I accidentally erase
when I mean to write but the gist is
that as you can see Flink is going to
look at all of those user follower
relationships and it can actually Shard
itself on that user ID of this table so
keep in mind this table is user ID
follower
ID and the good thing about that is that
these guys can be sharded the same way
because keep in mind post DB is also
sharded by user ID and so now Flink or a
particular Flink consumer only has to
hold a subset of this data and so it can
keep it all in memory and so let's see
now Flink says okay well we know that uh
this post has to be delivered to user
one user 10 user 22 and 44 and then it's
going to go ahead and write to the
appropriate feed caches we can load
balance and partition these feed caches
such that they each have certain ranges
of users that they represent and then
the Flink consumer can figure out where
it needs to send it it'll probably have
to reach out to zookeeper hit a load
balancer and then boom there you go now
we're in some caches so let's quickly
look at a couple of our notes to clarify
things basically Flink can uh go ahead
and keep the state of all the followers
as I mentioned and the reason that this
is so important that it keeps this state
is because it is a major optimization on
having to make a network call to the
database every single time saying well
who does user 6 actually have as a
following that would be very slow and in
addition to that it would be a ton of
repeat computation additionally as new
followers come in we're actually
streaming that data up to Flink via
change data capture so it is up to date
we don't have to worry about that and
then again we talked about sharding
we're basically sharding on the poster
ID which is the user ID here and the
user ID which the post DB is also
sharded on so this Flink consumer should
actually be able to handle all of
this data because we can partition it in
a Smart Way such that there's not enough
to overload it Okay so we've spoken
about that and then Additionally you may
be thinking to yourself well this works
great for new tweets but what if we want
to edit a tweet what if we want to
potentially update security permission
changes on it what if that tweet all of
a sudden gets new followers well keep in
mind that all of this data is actually
flowing through into this Flink consumer
and then it can do the necessary logic
to send those updates to the required
caches now of course keep in mind this
is pretty expensive right objectively it
is now taking a tweet a very long time
to get into one of these Newsfeed caches
but at the same time it doesn't really
matter if a tweet takes a while to get
there because all that matters is the
user sees their tweet hit the post DB
and then they think they're good and
then maybe 5 minutes later it gets into
everyone's caches but they don't have to
know that it's all asynchronous and at
the end of the day this is going to make
everyone else's reading experience a lot
better and they're going to keep coming
back to our site so let's actually go
ahead and talk about our post database
and schema because I kind of brushed
over that we haven't really mentioned it
very much yet and it's important that we
do keep this fast because in the case of
the user followers table it's actually
the same thing if writes are going
directly to this guy it needs to be able
to ingest them quickly or else it will
become a bottleneck in our system so for
the same reasoning as before I again
wanted to use Cassandra but this time I
wanted to partition it in a slightly
different way where I would Partition by
the user ID so that you keep all posts
from the same user on a given node and
then within a given partition we want to
use the sort key as the time stamp or
rather the time stamp as the sort key
and this means that when we run a query
let's say we've got 69 and then what's
today's day we've got
11/20/2023 now when I load all of these posts
they're already going to be in presorted
order and that's going to keep our query
very fast for when we say get
posts of a specific user and again
that's what I mentioned over here it is
going to make sure that basically
partitioning for a given user is nice
and fast and as a result of that that
endpoint will be
reasonable Okay so we've gone over our
actual table schema for the Post DB
which ensures that if we do want to load
posts for a given user that should
hopefully be decently fast it should all
be on the same partition it should
already be presorted by timestamp which
is great however we do come across one
big problem which is going to be popular
users so I did did mention that some
users are verified and they're actually
going to have millions of followers so
the problem with this is that our
previous step of basically uploading all
of our posts using change data capture
and then basically using a bunch of
different Newsfeed caches as our sync is
going to fail the reason being that when
you have millions of followers that post
has to get delivered to many many
different places and that is going to be
heinously slow so what could we
potentially do instead? What we're going to try to do is use a hybrid approach. In this case, let's imagine that we've got our newsfeed reader. Our bad approach would be to exclusively go to the post DB and aggregate; our ideal approach is that we exclusively read from the feed path. But what we could do instead is say: only for the verified users do we read from the post DB, and for the non-verified users we read from our feed cache, and then we aggregate those. Ideally this will result in a lot fewer partitions of the database that we'll have to hit; but even still, we are going to hit a bunch of the partitions of that database, and that could still be too slow. So how could we make this a little bit better?

Well, what we could do is introduce a caching layer for our popular posts. The nice thing is that when it comes to popular posts, or rather posts from popular users, we actually know in advance that they're going to be popular. If someone with a million followers makes a tweet, we know it's probably going to get at least 10,000 to 100,000 views within the first few minutes, so when that person makes a post, we can pre-load a cache so that everyone else can load from it. What this is going to look like is actually very, very similar to our previous setup to load all of our newsfeed caches.
For example, here's me; as I mentioned, all women on Twitter follow me, they love me. The first thing that we do is put the post in our post DB; this is going to be common amongst all tweets. The next thing is that it goes over CDC (I forgot to write out the Kafka queue here, but hopefully that makes sense) into Flink. Similarly, we've got our users table over here, which contains, for a given user ID, whether that user is verified or not, and again, over change data capture, this can go ahead and hit Flink. We can again shard this by the user ID, so that when a post from Jordan hits Flink, Jordan's verified status is also going to be in Flink, and as a result Flink can say: oh, Jordan is in fact verified, all the ladies love him, let's go and put one of his posts in the popular-post caches over here.

By using this style we accomplish a couple of things. The first is that we don't have to use a traditional write-through caching approach, where we would either use two-phase commit to ensure cache consistency, or accept a partial-failure scenario where our cache gets updated but our database doesn't, or the write goes through to the database and not the cache. This makes sure that everything is consistent and keeps everything asynchronous. It also allows us to see whether or not a given user is verified when they actually make a tweet. And again, the nice thing about this is that post edits will also go back through here, back into Flink, and then we can update our popular-post cache. Assuming that we shard these caches by user ID, we know exactly where a given popular user's posts are going to live, and that makes our life super easy. So what do things actually look like again? Well, now the gist is that if we want to use our hybrid solution, we would basically also read from the cache.

Okay, let's figure out where I'm at, because now even I'm starting to lose track.
So the next thing that I want to touch upon here is the following concept. We have mentioned that we want to read all of our popular posts from a popular-post cache. However, a challenging part of this is that for a given reader, who goes to the newsfeed service and then to the popular-post cache and the newsfeed cache, the only way that we actually know what to read from the popular posts is by figuring out who this guy follows that is verified. That in and of itself could be a bit of a tough query, because even though we already store all of our follower/following relationships, and that's decently fast, in order to actually get a sense of whether a particular person that I follow is verified, I would have to join that on the user table, and that could of course become slow.

So how can we do this? Well, again, we can use derived data. You might start to figure out a pattern here, which is that I do love me some stream processing, and I love using derived data, because derived data allows us to pre-compute data in the most optimal format so that we can make things as fast as possible. How could we actually do this? Yet again, we could use another Flink node. We could have our users table, which already exists, sharded by user ID, and we could have our user-following table, which, if you recall, is already a piece of derived data; so we're actually now using change data capture on derived data. By merging these two things into two different streams that go into the same Flink node, and partitioning properly, we can tell, for a given user, not only who they follow but whether those people are verified. From the user-following table I can say, for example: I know user 10 follows user 3 and user 22. From the users table over here I can say: well, I know user 3 is verified.
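Here's a toy sketch of that Flink-style join (plain Python, not the real Flink API; names are illustrative). Both CDC streams are keyed by the followee's user ID, so one consumer holds the verified flags and the follow edges for the same users, and can emit only the verified followees into the cache:

```python
# Toy stand-in for a keyed two-stream join in a Flink consumer.
def build_verified_following(user_events, follow_events):
    verified = set()   # state built from the users-table CDC stream
    cache = {}         # follower_id -> set of verified followees

    for user_id, is_verified in user_events:
        if is_verified:
            verified.add(user_id)

    for follower_id, followee_id in follow_events:
        # only verified followees land in the derived cache
        if followee_id in verified:
            cache.setdefault(follower_id, set()).add(followee_id)
    return cache

# user 3 is verified, user 22 is not; user 10 follows both
cache = build_verified_following(
    user_events=[(3, True), (22, False)],
    follow_events=[(10, 3), (10, 22)],
)
print(cache)  # {10: {3}}
```

A real Flink job would interleave the two streams and keep `verified` in keyed or broadcast state, but the resulting derived data is the same.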
And you might say to yourself: ah, shoot, it seems that the users table actually has to put verified users on every single Flink consumer. My argument to you here would be that there aren't that many verified users; I can't imagine there would be more than like 10,000 of them, and a 10,000-person set is really not a big deal. The gist is that, as a result of having this small verified set in memory on every single node, we can quickly tell: oh, you know what, user 10 is actually now following user 3, that person is verified, let's go ahead and upload to the cache over here. Now we can see that user 10 is following verified user 3, and we can use this as a cache to figure out my verified following. For example, if I'm following MrBeast, Donald Trump, and Obama, they would all be in this verified cache. Cool.
So let's quickly touch upon security levels on posts, because this was a specific request for this video and I would like to honor that. Let's say that a user can specify whether a post is for all of their followers or only, let's say, close-friend followers. We'll keep it simple and say there are only two configurable levels right now; however, this kind of scales out to three or four or five levels of potential security anyway. The easiest way to implement this, at least in my opinion, is just by putting all of this information in the followers table. We already have our user-followers table, which defines the relationship these users have with one another. For example, we've got user 1, follower 2, meaning 2 follows user 1, and you can see that their security level is "all". Same goes for here: user 3 follows user 1, and their relationship is "close friend". Recall that in Flink (oh boy, nice voice crack there) we actually have access to this data, so it would say something like: user 1 has follower 2 at level "all", and follower 3 at level "close friend". So when a post comes in marked "close friend", Flink can say: ah, you know what, this one is actually only going to that follower, not to everyone. As a result, we can continue to use our existing posting pipeline just by storing this data as well within our Flink consumer. Now, it is unfortunate that changes to the specific close-friend level of a follower or following will take a while to propagate through our pipeline, and the same goes for a post, but they do all eventually go through to this Flink consumer, which can then update the caches accordingly.
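The fanout filter inside that Flink consumer could be sketched like this (a minimal illustration; the function name and the level strings "all" / "close_friend" are my own labels, not anything from the actual design):

```python
# Decide which followers a post fans out to, based on its security level.
def deliver_to(post_level, followers):
    # followers: list of (follower_id, relationship_level) rows,
    # mirroring the followers table streamed in via CDC
    if post_level == "all":
        return [fid for fid, level in followers]
    # a close-friends post only fans out to close friends
    return [fid for fid, level in followers if level == "close_friend"]

# user 1's followers: user 2 at level "all", user 3 at "close friend"
followers_of_user1 = [(2, "all"), (3, "close_friend")]
print(deliver_to("close_friend", followers_of_user1))  # [3]
print(deliver_to("all", followers_of_user1))           # [2, 3]
```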
And so that is going to make life a little bit easier. It's expensive and it's asynchronous, so if there's a post out there that you already made, and you're like, "oh shoot, that wasn't meant for everyone, this was meant for close friends," changing that level is not going to take effect instantly. That is a trade-off here, but it is worth noting.

Okay, so this is the part of the
video that I definitely want to focus on a little bit, because I think it makes this video unique; I haven't really seen too many others on the internet that focus on nested comments all that much. So let's talk about them. Basically, we want to optimize for reading nested comments, and the question is: how can we actually partition those? Let's start with the easy part of this setup. Keep in mind that when we did our capacity estimates for this video, we said that per thread there's probably around 200 megabytes of comment data. The good thing is that 200 megabytes is actually very little, which means we can keep all of the comments, even on our most popular threads, on a single node. That is huge, because it means we don't have to do cross-partition queries: all we have to do is shard by our post ID, and we should be good there as far as partitioning goes.
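As a quick sanity check on that 200 MB figure, here's the back-of-the-envelope arithmetic. Both inputs are assumptions in the spirit of the capacity estimates from earlier in the video (roughly a million comments on a worst-case thread, around 200 bytes per comment):

```python
# Back-of-the-envelope: does a worst-case comment thread fit on one node?
comments_per_hot_thread = 1_000_000   # assumed upper bound for a viral post
bytes_per_comment = 200               # assumed text + metadata per comment

total_bytes = comments_per_hot_thread * bytes_per_comment
print(total_bytes // 10**6, "MB")  # 200 MB: easily a single partition
```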
The next question is: what about replication? Replication is a little more interesting, because we can actually have causal dependencies between comments, and things wouldn't make sense in certain scenarios. Let's say we have a multi-leader database, as you can see right here: here's leader one, here's leader two. Say one guy makes a comment on leader one, over on the left, and then a second guy reads it and responds to it on leader two. Now the issue is that the state leader two is in doesn't make sense: it has a comment that is a child of a comment that doesn't exist. That would obviously be problematic. Maybe when they sync up, things will be okay, but anyone who reads from this replica in the meantime is going to take a look at that and be like: I have no idea what's happening, this doesn't make sense. For this reason, I think I would opt for single-leader replication here. Could you maybe get away with quorum consistency in a multi-leader setup? Maybe, but at the end of the day some of the replicas still might not make sense, and that would be a problem.

Okay, so let's actually think
about our nested comments a little more abstractly. If we have nested comments, that's going to be a tree, right? This could be comment one, this could be the kid of comment one, this could be the kid of the kid of comment one, and here's another kid of that one; so we have this whole tree right here. Of course, depending on what site you're on, these comments are going to load in a different order. Some sites load a level at a time and show you the next comments that you could click on; others, like Reddit specifically, will load a branch at a time, where when you click "load more" it loads, say, these two right here, and that's more of a depth-first search. The other way that I just described is a bit more of a breadth-first search, or at least it shows you everything at a particular level so that you can click into it.

Now, I personally think the breadth-first search approach is pretty easy to implement, because you're basically just saying: this comment points here, this one points here, and every time we land on a comment, we do a query in our database for everything with a given parent ID, something like "WHERE parent_id = X", and then you just index on that parent ID, and that makes it nice and fast. I want to make this video a little bit hard and make everyone think a little, so I'm going to choose to think about this from the other perspective, where we're trying to get a depth-first-search-style single branch of comments at a time, because that's a little bit harder to do in a fast way. So let's actually go ahead and think about
this. One approach that we could possibly take is a graph database. How would this work? Well, if you haven't heard too much about graph databases in the past (I have spoken about them on this channel), I'll give a quick recap. There are basically two different types of graph databases: one is called a native graph database, and the other is a non-native graph database. Let's talk about non-native first, which I've illustrated over here on the left. A non-native graph database, especially in the scenario of comments, would look something like this: every single node has an ID, it's got a parent ID, and then it's got all of the things that it contains as its data.
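That non-native layout is just a flat table with an index on `parent_id`. Here's a minimal sketch (illustrative Python; I'm using a sorted list plus binary search to play the role of the index), which makes the upcoming complexity argument concrete: each child lookup is a binary search over the whole table, so it's O(log n) in the total table size, not the branch size:

```python
import bisect

# Flat adjacency table of (parent_id, comment_id, text) rows,
# kept sorted by parent_id (0 stands in for "no parent" / root).
rows = [
    (0, 1, "root comment"),
    (1, 2, "first reply"),
    (1, 3, "second reply"),
    (2, 4, "reply to the first reply"),
]
parent_ids = [r[0] for r in rows]  # the "index" on parent_id

def children(parent_id):
    # binary search over the ENTIRE table: O(log n) per hop,
    # which gets slower as the table grows
    lo = bisect.bisect_left(parent_ids, parent_id)
    hi = bisect.bisect_right(parent_ids, parent_id)
    return [r[1] for r in rows[lo:hi]]

print(children(1))  # [2, 3]
```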
The issue with this, as I've already kind of mentioned, shows up when you want to do a depth-first search. You would say: okay, I've got my ID over here (sorry, I can't really draw on the left side of the screen), now find me all of the nodes with parent ID 1; sorry, I indexed this guy incorrectly, but the point is you would say "find me all IDs with parent ID 1," and then you would get these two, and you could continue to depth-first search as you please. The problem is this: when you have an index on a particular field, the time complexity of finding an element by that field's value is O(log n), because we're actually binary searching the table. And the problem with binary searching a table is that it gets slower as the table gets bigger. So even if the branch of comments that you're looking for stays the same size, the query still gets slower as the table grows, and that in particular is why non-native graph databases are bad.

On the other hand, native graph databases are quite a bit faster, because they actually use pointers on disk. Obviously I've drawn out a tree right here, but the gist is that you can put a pointer to another location on disk and literally jump around; to do a depth-first search, I would just follow those pointers all the way down, and then we're good to go.
Nonetheless, native graph databases are actually not that fast either, the reason being that jumping around on disk is slow. The first thing I covered in my systems design 2.0 series is that a disk looks like this, right? You've got a little wheel, and you've got something that points around the wheel, and to jump from place to place to place means you actually have to seek to that location. Because these are mechanical parts, this is not like RAM; it's just really slow to do that. So this is potentially fine for truly graph-shaped data models, but if we can avoid representing our data as a graph, we can potentially do better. What we're going to try to do here is build a depth-first search index. (I'm just not going to talk about breadth-first search much, because I don't think it's that hard to do.) So let's see how we could do something like that. Before I actually
get too into this one, I'd like to thank System Design Fight Club for inspiring me a little bit here; I've taken a solution and adapted it a bit. The general idea is that the depth-first search index we're going to build is very similar to a geohash. If you think about it, we can give every child of a given node a letter: 'a' if it's the first child, 'b' if it's the second child (if we had a third one we could call it 'c' and put another node here), and every time you go down to the children of a particular node, you restart that sequence. Then, in our actual index, the ID of a comment is going to be its full path. So this guy right here is "aa", because it is the child of 'a' and it itself has the letter 'a', and that's why it's "aa". So what does this actually
buy us? Well, let's say that we wanted to get the entire contents of this branch right here. What is this branch, really? It's everything at or after "a" and before "ab". So the range query would look something like "a" to "ab", and if you look at our actual comment index right here, you can see that "a", "aa", and "aaa" all sort right before "ab", so we can just stop right there, and as a result we can perform a nice, clean range query. Even though the time complexity is still going to be O(log n), plus however many entries we have to pull after that, this is potentially going to be quite a bit faster, because it enables good disk locality, as opposed to jumping around every single place on disk.
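Here's a small in-memory sketch of that path-encoded index and its range query (the data and helper names are illustrative, and `next_key` assumes the last letter isn't 'z'; a real implementation would handle carrying). Loading a whole branch becomes one contiguous scan over keys that are already adjacent on disk:

```python
import bisect

# Stand-in for the path-encoded comment index, kept in sorted key
# order, exactly as a B-tree index would store it.
index = sorted([
    ("a",   "top-level comment"),
    ("aa",  "first child of a"),
    ("aaa", "child of aa"),
    ("ab",  "second child of a"),
    ("b",   "another top-level comment"),
])
keys = [k for k, _ in index]

def next_key(path):
    # smallest key sorting after every descendant of `path`:
    # bump the last letter, e.g. "aa" -> "ab"
    return path[:-1] + chr(ord(path[-1]) + 1)

def load_branch(path):
    # one contiguous range scan over [path, next_key(path))
    lo = bisect.bisect_left(keys, path)
    hi = bisect.bisect_left(keys, next_key(path))
    return [text for _, text in index[lo:hi]]

print(load_branch("aa"))  # ['first child of a', 'child of aa']
print(load_branch("a"))   # the whole subtree under "a"
```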
All we have to do is generate this comment index by appending the names of all of the comments in the chain, and then when we click "load more" it should be super easy to make that range query and hit our database. Keep in mind that I wanted this to be single-leader replication, and something like MySQL, I think, would work just fine for a database like this.

Okay, so as you can see, guys, we've
now gotten to the point where we have an absolute behemoth of a diagram to go through, so I'm going to attempt to do it nice and slowly. Let's say right here we've got our poster; this is the guy who is literally going to write a tweet or a comment, and there are going to be many different things that happen when they do. The first thing, starting at the top, is our user service. I drew one box, but ideally all of these services should be horizontally scaled out, with a load balancer in front of them; I just literally didn't have enough space to draw it. So keep in mind that we've got our user service, and we've got a user database, for which I elected to use MySQL. The reasoning is that I don't think that many changes are being made to profiles, so I'm not too worried about write throughput; I'm more worried about the actual consistency of user changes, and having a transactional single-leader database like MySQL is perfectly good for that. Single-leader replication, I think, is very reasonable here. The next thing that I'll cover is
our follower service. The follower service, as we mentioned before, is going to be backed by Cassandra, or at least the follower DB is. I wanted the user-followers DB (basically, for a given user, who follows them) to be our source of truth, the reason being that we can then stream those changes right into our Flink node for the eventual delivery of posts. I also noted that we should be sharding on user ID here; if you recall, our schema is literally going to be user ID, follower ID, security permission. Hopefully that makes some sense. The next part is the post service. Like I mentioned, this guy needs to be able to ingest writes very quickly, since posts happen all the time, so I thought Cassandra was the right choice for this one as well, and again this is something you can shard on user ID. The last part of this is the comment
database, for which, as we touched on a moment ago, I wanted to use single-leader replication, the reason being that we have causal dependencies in comments, and as a result I want writes to be as up to date as possible. Using something like MySQL, I think, will be pretty reasonable. I totally understand if you think the write ingestion won't be fast enough there; if you can think of a single-leader database that uses LSM trees, maybe that would be better. Could be something like HBase, who knows.

So now we've got all of these change data capture streams, and this is where we reach the middle of our setup, because we have all of our Flink nodes. We've got two Flink nodes. The first is what I would call the "following" Flink node, because it helps us generate all of our derived data in order to make faster queries for things like loading a user's following, or figuring out who they follow that is verified. So, if you recall, we've got the user service telling us who's verified, and likewise who we follow, and we can output that into the user-verified-following cache. This is going to make that query as quick as possible when we actually have to load up our newsfeed.
Additionally, we also want a user-following DB, because I want to be able to quickly see: hey, how many people am I actually following, and who are they? We're going to do that over here, and I wanted it in Cassandra, because again, for single-partition reads Cassandra is actually quite quick, especially if you have a good database schema where you make sure to partition everything by the user ID. That would look something like: user ID, and then following ID, which is the person they're following.

Cool. So we have our user-verified-following cache; put that guy in Redis. It's not going to be that much data, so I think keeping it in memory is pretty fair; you can replicate it as much as you want, and you can shard it out, probably by user ID. I think that's all pretty reasonable. The other type of Flink node that we have over here is going to be our posts Flink node. This is going to take in new posts over here from change data capture; it's going to take in users, to see who's verified, so that verified posts in particular can go to the popular-post cache; and it is also, finally, going to take in the follower-table changes, so that we can say: hey, user 6 has users 1, 2, and 3 following them, so deliver those posts to the corresponding caches. It could also then say: user 10 is verified, so deliver that post to the
popular-post cache. Now, the last piece of the puzzle is obviously going to be the reader: someone reading their newsfeed. As we mentioned, reading your newsfeed is pretty simple. The first thing you do is hit the feed service. Then you also reach out to the user-verified-following cache, because we need to know which verified users we want to load posts for; that way we can hit the proper shards of the popular-post cache over here. Once we do that, we can get all of the results we need back. Same thing goes for our particular newsfeed cache; it's just going to come from one of those replicas, one of those partitions, and then everything can get aggregated over here on the feed service. You can aggregate them by timestamp; it's not going to be more than like 100 posts, so it shouldn't be too hard to do that, and then you return it right to the user.

Similarly, there are a few other queries that the user is going to run. One is: hey, who am I following? That comes from over here. There's also: hey, who follows me? That comes from over here. There's also: what are my user stats, what does my profile look like? That comes from up top, in the user database. There's also: what are the comments that I'm trying to fetch for this post? That comes from the comments DB. And then, of course, finally: what have I posted before, or what have other users posted before? That would come from our post DB.
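To recap the read path, the feed service's merge step can be sketched as a simple timestamp merge of the two sources (illustrative Python; `merge_feed` and the sample data are my own, and both inputs are assumed to arrive already sorted newest-first, which is exactly what the presorted caches give us):

```python
import heapq

# Merge the precomputed newsfeed-cache entries with popular posts
# from verified followees, newest first, capped at ~100 posts.
def merge_feed(feed_cache, popular_posts, limit=100):
    # heapq.merge does a streaming merge of already-sorted inputs
    merged = heapq.merge(feed_cache, popular_posts,
                         key=lambda post: post[0], reverse=True)
    return list(merged)[:limit]

feed_cache = [(40, "friend post"), (10, "older friend post")]
popular = [(30, "MrBeast post")]
print(merge_feed(feed_cache, popular))
```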
Anyways, guys, I know this was a very, very long, in-depth, massive video, and frankly I may have even gone through it faster than I needed to. I don't mean to brush through any parts, and I absolutely want to keep this one as clear as humanly possible, but my main intention was to try and leave no stone unturned. There are definitely a lot of practical considerations when it comes to a design like this, and I think it's very easy to make a video that skips over the in-depth details of a lot of them. I could certainly see a lot of people not loving this design just due to the amount that I abuse change data capture, but I also do think it's a very good way to ensure that all of your data is actually in sync without having to use two-phase commit or introduce a bunch of partial-failure scenarios. As always, though, my solutions aren't perfect; I'm making these up myself, I'm not just going to other channels and copying them. So please do go ahead and critique me in the comments section, ask questions, ask anything you want, and I'm happy to defend myself; of course, you can always get me, and you're probably going to be right. Anyways, guys, I hope you enjoyed this video, and I will see you in the next one.