Google's Tech Stack (6 internal tools revealed)

NeetCode
13 Aug 2023 · 09:07

Summary

TL;DR: This video script delves into Google's revolutionary in-house technologies, starting with the Google File System for managing petabytes of data and moving to MapReduce for efficient data processing. It covers the evolution of RPCs and gRPC, influenced by Google's Stubby, and touches on Borg, Google's precursor to Kubernetes for job scheduling. The script also explores databases like Bigtable and Spanner, and wraps up with Google's Pub/Sub messaging system, providing a comprehensive look at the tools that power Google's massive infrastructure.

Takeaways

  • 😎 Google was once renowned for its innovation in distributed computing rather than data collection.
  • 🗂️ The Google File System (GFS) was designed to handle petabytes of data with high throughput and replication across multiple servers.
  • 🔄 GFS inspired the creation of Hadoop Distributed File System, which was open-sourced and widely adopted.
  • 📈 MapReduce was Google's initial big data processing framework, simplifying distributed programming with map and reduce functions.
  • 🚀 The simplicity of MapReduce allowed non-experts to process large datasets without deep knowledge of distributed systems.
  • 🔧 Stubby, Google's internal RPC system, predates gRPC and heavily influenced its design.
  • 🤖 Borg, Google's internal job scheduler, influenced the development of Kubernetes, an open-source container orchestration system.
  • 💾 Bigtable was created to overcome the limitations of relational databases, supporting high scalability and millions of requests per second.
  • 🌳 Bigtable stores data in an LSM tree for efficient writes, and its data model treats time as a third dimension.
  • 📚 The technologies developed by Google, such as Bigtable, have inspired other NoSQL databases like Cassandra and DynamoDB.
  • 📨 Pub/Sub is Google's message queuing system, offering a way to decouple services and handle high throughput in distributed architectures.

Q & A

  • What is the Google File System (GFS) and why was it created?

    -The Google File System (GFS) is a proprietary distributed file system developed by Google in 2003 to handle large amounts of data generated by their search engine's web crawlers. It was designed to store petabytes of data and allow concurrent read and write access by multiple machines, with high throughput, consistent data, and replicated files.

  • How does the Google File System store data differently from traditional file systems?

    -GFS stores data by splitting files into 64-megabyte chunks, each assigned a unique ID and stored on at least three servers for redundancy. It also uses a single Master server to maintain the directory structure and map files to their corresponding chunks, similar to the super block in a Linux file system but with distributed file chunks.
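The chunking-and-replication scheme described above can be sketched in a few lines of Python. This is a toy illustration, not GFS's actual implementation: the 64 MB chunk size and three-way replication come from the source, but the function names and the round-robin placement are invented here.

```python
import uuid

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as in the GFS design
REPLICAS = 3                   # each chunk lives on at least three servers

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a file's bytes into fixed-size chunks, each with a unique ID."""
    chunks = {}
    for offset in range(0, len(data), chunk_size):
        chunk_id = uuid.uuid4().hex  # stand-in for GFS's unique chunk handle
        chunks[chunk_id] = data[offset:offset + chunk_size]
    return chunks

def place_replicas(chunk_ids, servers, replicas: int = REPLICAS):
    """Toy placement policy: assign each chunk to `replicas` distinct servers."""
    placement = {}
    for i, chunk_id in enumerate(chunk_ids):
        placement[chunk_id] = [servers[(i + r) % len(servers)]
                               for r in range(replicas)]
    return placement
```

In the real system, the mapping from files to chunk IDs and from chunk IDs to servers is what the single Master server maintains.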

  • What inspired the development of the Hadoop Distributed File System (HDFS)?

    -The Hadoop Distributed File System (HDFS) was inspired by Google's GFS. After Google published a paper on GFS, engineers at Yahoo developed HDFS, which was later open-sourced.

  • What is MapReduce and how does it simplify big data processing?

    -MapReduce is a programming model and an associated implementation for processing and generating large datasets. It simplifies big data processing by allowing programmers to focus on implementing two functions: the map function that processes input data and produces intermediate key-value pairs, and the reduce function that aggregates these pairs by key.
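The classic word-count example makes the two-function contract concrete. This is a single-process toy, not a distributed implementation: in real MapReduce the shuffle moves pairs between machines, whereas here it is just a dictionary grouping.

```python
from collections import defaultdict

def map_fn(document: str):
    """Map step: emit an intermediate (word, 1) pair for every word."""
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle/sort step: group every value that shares a key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce step: aggregate the values for one key."""
    return (key, sum(values))

def map_reduce(documents):
    intermediate = [pair for doc in documents for pair in map_fn(doc)]
    return dict(reduce_fn(k, v) for k, v in shuffle(intermediate).items())
```

The programmer supplies only `map_fn` and `reduce_fn`; the framework owns splitting the input, the shuffle, and fault tolerance.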

  • Why did Google develop their own version of RPC called gRPC?

    -Google developed gRPC as an open-source RPC framework that serializes data in a compact binary format instead of human-readable JSON. It provides type safety through protocol buffer schemas, making it well suited to environments where type safety and performance are critical.
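A rough way to see the size gap between human-readable and binary encodings, using only the Python standard library. Note that `struct` is only a stand-in here: protobuf's real wire format uses field tags and varints, but the point about binary compactness is the same.

```python
import json
import struct

# A response carrying one 32-bit integer field, e.g. an id of 123456.
payload = {"id": 123456}

json_bytes = json.dumps(payload).encode("utf-8")  # human-readable encoding
binary_bytes = struct.pack("<I", payload["id"])   # fixed 4-byte little-endian int

print(len(json_bytes), len(binary_bytes))
```

The JSON form spends bytes on braces, quotes, the key name, and ASCII digits; the binary form spends exactly four bytes on the value itself.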

  • What is the relationship between Borg and Kubernetes?

    -Borg is Google's internal system for scheduling and managing jobs and tasks across thousands of machines. Kubernetes, an open-source tool created by Google, was influenced by Borg's design, with Borg jobs being similar to Kubernetes pods and Borg tasks being similar to Kubernetes containers.

  • What is Bigtable and how does it differ from traditional relational databases?

    -Bigtable is a distributed storage system for managing structured data that is designed to scale and support millions of requests per second. Unlike traditional relational databases, Bigtable uses a sparsely populated, three-dimensional table structure with columns, rows, and multiple versions of a value at each row-column intersection.
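That three-dimensional data model can be sketched as a map keyed by (row, column, timestamp). The class and method names below are invented for illustration; real Bigtable adds column families, tablet splitting, and garbage collection of old versions.

```python
import time
from collections import defaultdict

class TinyBigtable:
    """Toy sketch of Bigtable's data model:
    (row key, column, timestamp) -> value, sparsely populated."""

    def __init__(self):
        # Sparse by construction: cells exist only where something was written.
        self._cells = defaultdict(dict)  # (row, column) -> {timestamp: value}

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        self._cells[(row, column)][ts] = value

    def get(self, row, column):
        """Return the latest version of a cell, or None if it was never set."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        return versions[max(versions)]

    def versions(self, row, column):
        """All versions of a cell, newest first: time as the third dimension."""
        versions = self._cells.get((row, column), {})
        return [versions[ts] for ts in sorted(versions, reverse=True)]
```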

  • How does the LSM tree in Bigtable work?

    -The LSM tree in Bigtable is an acronym for Log-Structured Merge-tree. It stores writes in a memtable, which is kept in memory and sorted. Once the memtable reaches a certain size, it is flushed to an SSTable (Sorted String Table) on disk, where the data becomes immutable.
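The memtable-to-SSTable flow can be sketched like this. It is a deliberately tiny model: real LSM trees add a write-ahead log, bloom filters, binary search within SSTables, and background compaction, none of which appear here.

```python
class TinyLSM:
    """Toy LSM tree: writes land in an in-memory sorted memtable; when it
    fills up, it is flushed to an immutable SSTable (a tuple here)."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []  # flushed tables, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # SSTables hold keys in sorted order and are never modified again.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        # Check the memtable first, then SSTables newest-to-oldest,
        # so the most recent write for a key always wins.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None
```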

  • What is the significance of the CAP theorem in the context of Spanner?

    -Spanner is a database system developed by Google that is often described as 'breaking' the CAP theorem, which states that a distributed system cannot simultaneously guarantee all three of Consistency, Availability, and Partition tolerance. Spanner uses GPS receivers and atomic clocks (its TrueTime API) to tightly synchronize time across data centers, enabling it to offer strong consistency while remaining highly available in practice.

  • What is Cloud Pub/Sub and how does it function in Google's architecture?

    -Cloud Pub/Sub is a message queuing service that allows for asynchronous communication between services. It is similar to other message queuing systems like RabbitMQ and Kafka. In Google's architecture, it helps to decouple services, handle high throughput, and ensure data durability.
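The decoupling idea can be sketched as a toy in-process broker. Cloud Pub/Sub itself is a managed network service; this sketch only mirrors its topic/subscription/fan-out shape, and every name in it is invented for illustration.

```python
from collections import defaultdict, deque

class TinyPubSub:
    """Toy publish/subscribe broker: publishers push to a topic, each
    subscription gets its own queue, so neither side knows the other."""

    def __init__(self):
        self._queues = defaultdict(deque)  # (topic, subscription) -> messages
        self._subs = defaultdict(list)     # topic -> subscription names

    def subscribe(self, topic, subscription):
        self._subs[topic].append(subscription)

    def publish(self, topic, message):
        # Fan out: every subscription on the topic receives its own copy.
        for subscription in self._subs[topic]:
            self._queues[(topic, subscription)].append(message)

    def pull(self, topic, subscription):
        """Pop the oldest undelivered message, or None if the queue is empty."""
        queue = self._queues[(topic, subscription)]
        return queue.popleft() if queue else None
```

Because messages sit in per-subscription queues, a slow consumer only delays itself; publishers keep publishing at full throughput.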

  • What are some of the tools and technologies that have been influenced by Google's internal systems?

    -Several tools and technologies have been influenced by Google's internal systems: Hadoop's MapReduce and HDFS were inspired by Google's MapReduce and GFS papers, Kubernetes was influenced by Borg, and databases like DynamoDB and Cassandra were influenced by Bigtable.

Outlines

00:00

😲 Revolutionary Google Technologies and Tools

The first paragraph introduces the speaker's departure from Google and their playful intent to 'expose' Google's secret technologies. It then dives into Google's history of revolutionizing distributed computing with six key tools. The Google File System (GFS) is highlighted for its ability to handle massive data storage and concurrent access, with a unique system of chunking and replication. The paragraph also discusses the MapReduce programming model, which simplified the process of handling large datasets and inspired the creation of Hadoop and other big data processing frameworks. The speaker also touches on Google's use of RPCs and Protocol Buffers for efficient data serialization and type safety, referencing Google's internal version, Stubby, which influenced the development of gRPC.

05:01

🤖 Google's Infrastructure and Database Innovations

The second paragraph continues the exploration of Google's technological innovations, focusing on their infrastructure and database solutions. Borg, a job scheduling system that influenced Kubernetes, is mentioned, along with Google's container stack. The paragraph then describes Bigtable, a NoSQL database designed for high scalability and performance, which uses an LSM tree for efficient data storage. Bigtable's design inspired other databases like DynamoDB and Cassandra, and the section also touches on data warehousing solutions like BigQuery and Dremel. The speaker humorously mentions 'Goops' and 'Google Domains' before concluding with Cloud Pub/Sub, Google's message queuing service, which is essential for high throughput and service decoupling in distributed systems.

Keywords

💡Google File System (GFS)

Google File System is a proprietary distributed file system developed by Google to handle large amounts of data across machines. It is designed to store and manage petabytes of data, which is essential for search engines that need to crawl and index web pages. GFS splits files into 64-megabyte chunks and stores them across multiple servers for redundancy and high availability. The script mentions GFS as a revolutionary tool that inspired the development of Hadoop Distributed File System, showing its significance in distributed computing.

💡Distributed Computing

Distributed computing involves the use of multiple computers to work together to solve a problem or process large amounts of data. The video script discusses how Google has revolutionized distributed computing through various tools like GFS and MapReduce. Distributed computing is central to the video's theme as it underpins the technologies that enable Google's search engine and other services to operate at a massive scale.

💡MapReduce

MapReduce is a programming model and an associated implementation for processing and generating large datasets. It is composed of a map step, which processes input data and produces intermediate key-value pairs, and a reduce step, which aggregates these pairs by key. The script explains how MapReduce simplified the process of writing distributed programs for data processing, making it accessible to programmers without requiring them to be distributed systems experts.

💡Hadoop

Hadoop is an open-source framework that allows for distributed processing of large data sets across clusters of computers using simple programming models. It was inspired by Google's MapReduce and GFS, as mentioned in the script. Hadoop includes the Hadoop Distributed File System (HDFS) for storage and yet another implementation of MapReduce for processing data, which has become a standard in big data processing.

💡Protocol Buffers (Protobuf)

Protocol Buffers is a method of serializing structured data that was developed by Google. It is used for communication protocols, data storage, and more. The script describes how Protobuf provides schemas and type safety, making it more efficient than human-readable formats like JSON, especially for binary serialization in technologies such as gRPC.

💡gRPC

gRPC is an open-source RPC (Remote Procedure Call) framework that uses HTTP/2 for transport and Protocol Buffers as the interface definition language. The script contrasts gRPC with RESTful APIs, highlighting its efficiency and type safety due to binary serialization and the use of Protobuf. gRPC is also discussed alongside Stubby, the internal Google RPC framework that preceded it.

💡Borg

Borg is Google's internal cluster manager, which is similar to Kubernetes, an open-source tool that was influenced by Borg. The script explains that Borg schedules and manages jobs across thousands of machines, with each job consisting of one or more tasks, which are instances of a binary like a server or batch job. Borg is a key component in Google's infrastructure for job orchestration.

💡Bigtable

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size. It was developed by Google to address the limitations of relational databases when dealing with massive datasets. The script describes Bigtable as a NoSQL database that supports millions of requests per second and is highly scalable, using a three-dimensional table structure with time as the third dimension.

💡LSM Tree

An LSM tree (Log-Structured Merge-Tree) is an advanced data structure used in databases like Bigtable for managing large volumes of data. The script mentions that data in Bigtable is stored in an LSM tree, where writes are batched into a memtable and then flushed to SStables, which are immutable. This structure optimizes write performance and is crucial for the high throughput and scalability of Bigtable.

💡CAP Theorem

The CAP theorem is a concept in distributed systems that states that it is impossible for a distributed system to simultaneously provide all three of the following guarantees: Consistency, Availability, and Partition tolerance. The script humorously refers to Spanner, Google's database, as 'breaking' the CAP theorem, which suggests that Spanner aims to provide all three guarantees, an ambitious goal in distributed systems design.

💡Pub/Sub

Pub/Sub, short for Publish-Subscribe, is a messaging pattern that Google uses for decoupling services and handling high throughput. The script describes Cloud Pub/Sub as the public version of Google's internal message queuing system, which is essential for building scalable and resilient architectures by allowing services to communicate asynchronously.

Highlights

Google's revolutionary approach to distributed computing with six tools.

Introduction of the Google File System (GFS) for handling petabytes of data.

GFS's method of splitting files into 64MB chunks and replicating across multiple servers.

The role of the Master server in GFS for managing file metadata.

How GFS inspired the development of Hadoop Distributed File System.

The release of MapReduce for processing large datasets efficiently.

Explanation of MapReduce's three main steps: map, shuffle/sort, and reduce.

The simplicity of MapReduce for programmers, requiring only two function implementations.

Evolution from MapReduce to Flume, Apache Beam, and Cloud Dataflow.

Introduction to gRPC and Protocol Buffers for efficient API communication.

Comparison of gRPC with RESTful APIs in terms of efficiency and type safety.

Stubby, Google's internal RPC framework that preceded and influenced gRPC.

Borg, Google's job scheduler, and its influence on Kubernetes.

Bigtable, Google's NoSQL database designed for high scalability and performance.

Bigtable's architecture, including the use of LSM trees for data storage.

The impact of Bigtable on the development of Dynamo, Cassandra, and other NoSQL databases.

Spanner, Google's globally distributed database that is said to 'break' the CAP theorem.

Dremel, Google's data warehouse for large-scale data analysis.

Blaze, Google's build tool, and its open-source equivalent, Bazel.

Cloud Pub/Sub, Google's message queuing service for decoupling services.

Summary of Google's tools for data storage, movement, processing, and orchestration.

Transcripts

[00:00] I just quit my job at Google, so now I'm gonna expose all of their ultra-secret technologies. Just kidding, all of these can be found in this GitHub repo. Before Google was known for suckling on your sweet, sweet data, they were known for revolutionizing distributed computing again and again and again, and they did it by using six revolutionary tools that most people have never heard of. But I'm gonna show them to you today.

[00:24] Let's start with a classic: the Google File System. It's 2003. Most people are still learning how to open their file explorer, but Google has a different problem. Search engines need a web crawler to navigate through every web page available and save it, so that it can be converted to a weighted graph and used within PageRank. Now, that's quite a lot of data to store; today it would be on the order of petabytes. On top of that, these files need to be concurrently read and written by many machines from many developers at Google, requiring high throughput, consistent data, and replicated files. The high-level implementation is that files are split into 64-megabyte chunks before they are stored. Each chunk is assigned a universally unique ID, and a given chunk is not stored on only one server; it's stored on at least three servers. There's also a single master server which sort of acts as the table of contents: it tells you the directory structure, maps every file to a list of its corresponding chunks, and of course tells you the chunk locations as well. It kind of reminds me of the superblock in a Linux file system, but in this case our file chunks are distributed, which makes things a lot more complicated. Now, Google didn't just keep their secrets to themselves. They published a very revolutionary paper which inspired engineers at Yahoo to develop the Hadoop Distributed File System, which was later open sourced. The original GFS was also succeeded by Google's Colossus file system.

[01:53] Okay, but storing data is easy. How do we actually process the raw web pages? Well, conveniently, a year later Google released the MapReduce white paper. The problem is that you have to process a lot of data, like from the Google File System, and you could use a single machine to do it, but then you might never go home. Instead you can use the most powerful big data processing framework of the time, which your company just happened to invent. And don't worry, it's actually pretty simple. As the name implies, there are two main steps, mapping the data and reducing the data, at least from your perspective as a programmer; but there's a hidden middle step called the shuffle, or sort, step. Let's say our input is a bunch of raw text files. For the map step, we would split our data up into individual chunks, and each server would receive a portion of them. The output from each server would be a list of intermediate key-value pairs. Now, before the reduce step, we shuffle (or sort) the data by making sure that every pair with the same key ends up at the same server for the reduce step. This is important because when we reduce, or aggregate, our data, we do so by the key. So the input here is the intermediate key-value pairs, and the output is the final key-value pairs. Congratulations, you just wrote a distributed program to count words. Exciting! But the reason MapReduce was so revolutionary is that, from a programmer's perspective, you are only responsible for implementing two functions, and these are analogous to the map and reduce functions from functional programming. You didn't have to be a distributed systems expert to use this tech, and shortly after, Hadoop MapReduce, an open-source variation, was released. Nowadays no one really uses MapReduce anymore: Google uses Flume, Apache Beam is the open-source equivalent, and Cloud Dataflow is the Google-managed job runner. Meanwhile, the rest of the world uses Apache Spark and Flink for the same purposes.

[03:51] Okay, enough about infrastructure, show me some code. Well, you've probably heard of RESTful APIs, but if you really want to bust your balls, watch me build the same API with RPC. We're using proto version 3, and this is our hello world package. We start by defining the schemas for our RPC's requests and responses. Then we can create a service, which is where we define our RPCs; SayHello will basically receive a name and return a greeting message. Now, in our service code we start with some boilerplate, like importing the protocol buffers that we just declared. Then we can define and register our hello world RPC handler and create a gRPC server which listens for requests. Now, that's definitely a lot more work than the REST equivalent, and on top of that, gRPC is not natively supported by browsers. If you were paying attention, though, you might have noticed that we actually have schemas and type safety; that's one of the purposes of protocol buffers. It might be less obvious, though, that gRPC is more efficient, because data is binary serialized rather than being human readable like JSON. Okay, but what does this have to do with Stubby? Basically, it's the Google-internal version of gRPC, because Google really doesn't like open source for some reason, or maybe they're too lazy to migrate over. If someone from Google is watching this, let us know.

[05:06] Enter Borg. It's comparable to a really popular open-source tool that Google created. Can you guess which one? Let me give you a hint: it schedules and manages hundreds of thousands of jobs spanning thousands of machines, where a job is the smallest deployable unit of computing. It consists of one or more tasks, where a task is a runnable instance of a binary, like a server or a batch job. So if you guessed Kubernetes, nice job. Yes, Kubernetes was influenced by Borg and was open sourced in 2015, where a Borg job is kind of like a Kubernetes pod and a Borg task is kind of like a Kubernetes container. But instead of Docker, Google uses this container stack. If you can't tell, at Google there's a culture of creating the most esoteric names possible for everything, but I will say that at Google I never saw anything that resembled a Dockerfile, so this might be a rare case where the tech really just works. Borg has some more parallels with Kubernetes, which you can read more about in the white paper, but to summarize: Google doesn't use the complicated Kubernetes tool and instead uses something even more complicated. Yeah.

[06:13] Now, finally, the moment you've been waiting for: databases. Bigtable was created when Google ran into the limitations of relational databases. It was designed to support millions of requests per second and be extremely scalable. Now, just like SQL, data is still stored in tables of columns and rows, and related columns can be grouped together via a column family. Each row-column intersection can have multiple versions of a value, so it's kind of like a three-dimensional table where the third dimension is time. Another difference from SQL is that it's sparsely populated, because not every column has to be required by each row, and we have the flexibility to add columns as needed. Now, under the hood, data is stored in an LSM tree, where writes are batched into the memtable, stored in sorted order, before being flushed to the SSTables, where they will be immutable. Like I said, storing data is easy, guys. The 2006 Bigtable paper inspired the 2007 Dynamo paper, which later led to DynamoDB, and it also inspired the development of Cassandra. And I'm sure Amazon continued to give back to the open-source community after that.

[07:19] This is the part where I was going to have a MongoDB sponsorship; I'll just plug my site instead. I recently added a full-stack development course where we build out a YouTube skeleton and focus on the part that most people avoid: the back end, specifically the upload feature, where videos are asynchronously transcoded and then served to users. You can read more about it in a short design doc I wrote, which is free. I think this is the type of project that most people definitely don't have on their resume, so it might help set you apart.

[07:48] And now, number one, right after a few honorable mentions. Spanner is the crackhead database that uses GPS and atomic clocks to literally break the CAP theorem. Dremel is a data warehouse similar to BigQuery, the crown jewel of Google Cloud. And there's Blaze, Google's build tool that I never fully learned how to use, which was open sourced as Bazel; finally, a name that makes sense. And now for the final one: Google Domains. Just kidding, it's dead. Let's finish with the most secretive tool on this list: Goops. Okay, seriously, who's coming up with these names? There's not much public info on it, but Cloud Pub/Sub is the public version and is pretty much equivalent. At its core, Pub/Sub is a message queue, so it's comparable to tools like RabbitMQ and Kafka. Without a message queue your architecture might look like this; but if we want to handle high throughput, add a durability layer, and decouple our services, we can introduce Pub/Sub to our architecture.

[08:44] To summarize: we talked about how Google stores data, whether files or transactional data. We talked about how Google moves data, with gRPC, protocol buffers, and message queues. We talked about how Google processes data, originally with MapReduce and later Flume. And lastly Borg, which is how Google orchestrates all of the above. Thanks for watching, this was a fun one, and hopefully I'll see you soon.
