Google's Tech Stack (6 internal tools revealed)
Summary
TL;DR: This video delves into Google's revolutionary in-house technologies, starting with the Google File System for managing petabytes of data and moving to MapReduce for efficient distributed data processing. It covers Stubby, Google's internal RPC system that inspired gRPC, and touches on Borg, Google's precursor to Kubernetes for job scheduling. The video also explores the Bigtable and Spanner databases and wraps up with Google's Pub/Sub messaging system, providing a comprehensive look at the tools that power Google's massive infrastructure.
Takeaways
- 😎 Google was once renowned for its innovation in distributed computing rather than data collection.
- 🗂️ The Google File System (GFS) was designed to handle petabytes of data with high throughput and replication across multiple servers.
- 🔄 GFS inspired the creation of Hadoop Distributed File System, which was open-sourced and widely adopted.
- 📈 MapReduce was Google's initial big data processing framework, simplifying distributed programming with map and reduce functions.
- 🚀 The simplicity of MapReduce allowed non-experts to process large datasets without deep knowledge of distributed systems.
- 🔧 Google's internal RPC system, Stubby, predates gRPC and inspired it; gRPC is effectively Stubby's open-source counterpart.
- 🤖 Borg, Google's internal job scheduler, influenced the development of Kubernetes, an open-source container orchestration system.
- 💾 Bigtable was created to overcome the limitations of relational databases, supporting high scalability and millions of requests per second.
- 🌳 Bigtable stores data in an LSM tree for efficient writes, and its table model treats time as a third dimension by keeping multiple timestamped versions of each value.
- 📚 The technologies developed by Google, such as Bigtable, have inspired other NoSQL databases like Cassandra and DynamoDB.
- 📨 Pub/Sub is Google's message queuing system, offering a way to decouple services and handle high throughput in distributed architectures.
Q & A
What is the Google File System (GFS) and why was it created?
-The Google File System (GFS) is a proprietary distributed file system developed by Google in 2003 to handle large amounts of data generated by their search engine's web crawlers. It was designed to store petabytes of data and allow concurrent read and write access by multiple machines, with high throughput, consistent data, and replicated files.
How does the Google File System store data differently from traditional file systems?
-GFS stores data by splitting files into 64-megabyte chunks, each assigned a universally unique ID and stored on at least three servers for redundancy. A single Master server maintains the directory structure and maps each file to its corresponding chunks, similar to the superblock in a Linux file system, except here the file chunks are distributed across many machines.
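A toy sketch of the master's metadata role described above, using invented names (`create_file`, `chunk_locations`); this illustrates the chunking-and-replication idea, not GFS's actual interface:

```python
# Toy GFS-style master metadata (illustrative only): each file maps to
# an ordered list of 64 MB chunk IDs, and each chunk ID maps to the
# (at least three) chunkservers holding a replica.
import uuid

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

file_to_chunks = {}   # path -> ordered list of chunk IDs
chunk_locations = {}  # chunk ID -> list of chunkserver addresses


def create_file(path, size_bytes, chunkservers):
    num_chunks = -(-size_bytes // CHUNK_SIZE)  # ceiling division
    chunk_ids = [str(uuid.uuid4()) for _ in range(num_chunks)]
    file_to_chunks[path] = chunk_ids
    for i, cid in enumerate(chunk_ids):
        # Store every chunk on three servers for redundancy.
        chunk_locations[cid] = [chunkservers[(i + j) % len(chunkservers)]
                                for j in range(3)]
    return chunk_ids


servers = ["cs-1", "cs-2", "cs-3", "cs-4"]
create_file("/crawl/pages-00001", 200 * 1024 * 1024, servers)  # 4 chunks
```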
What inspired the development of the Hadoop Distributed File System (HDFS)?
-The Hadoop Distributed File System (HDFS) was inspired by Google's GFS. After Google published a paper on GFS, engineers at Yahoo developed HDFS, which was later open-sourced.
What is MapReduce and how does it simplify big data processing?
-MapReduce is a programming model and an associated implementation for processing and generating large datasets. It simplifies big data processing by allowing programmers to focus on implementing two functions: the map function that processes input data and produces intermediate key-value pairs, and the reduce function that aggregates these pairs by key.
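To make the division of labor concrete, here is a minimal single-machine word-count sketch; `map_fn` and `reduce_fn` stand in for the two functions a programmer would supply, while the `mapreduce` driver simulates the framework's shuffle/sort step:

```python
# Single-machine sketch of MapReduce word count: the programmer writes
# only map and reduce; the framework handles splitting, shuffling by
# key, and distribution (simulated here by a plain loop).
from collections import defaultdict


def map_fn(document):
    # Emit an intermediate (key, value) pair for every word.
    for word in document.split():
        yield word, 1


def reduce_fn(word, counts):
    # Aggregate all values that share a key.
    return word, sum(counts)


def mapreduce(documents):
    # Shuffle/sort step: group intermediate pairs by key.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))


print(mapreduce(["the quick brown fox", "the lazy dog"]))
# {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```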
Why did Google develop their own version of RPC called gRPC?
-gRPC, the open-source descendant of Google's internal Stubby system, improves efficiency by serializing data in a binary format via Protocol Buffers instead of human-readable JSON. Protocol Buffers also give every RPC a declared schema, providing type safety, which makes gRPC well suited to environments where type safety and performance are critical.
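For illustration, a minimal Python gRPC server along the lines of the canonical proto3 "hello world" example; it assumes the standard `helloworld.proto` (service `Greeter`, rpc `SayHello`) has already been compiled with `protoc` into `helloworld_pb2` and `helloworld_pb2_grpc`, and that the `grpcio` package is installed:

```python
# Minimal gRPC server sketch (assumes protoc-generated helloworld stubs).
from concurrent import futures

import grpc
import helloworld_pb2
import helloworld_pb2_grpc


class Greeter(helloworld_pb2_grpc.GreeterServicer):
    def SayHello(self, request, context):
        # Request and response are typed protobuf messages, sent as
        # binary on the wire rather than human-readable JSON.
        return helloworld_pb2.HelloReply(message=f"Hello, {request.name}!")


def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    helloworld_pb2_grpc.add_GreeterServicer_to_server(Greeter(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()


if __name__ == "__main__":
    serve()
```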
What is the relationship between Borg and Kubernetes?
-Borg is Google's internal system for scheduling and managing jobs and tasks across thousands of machines. Kubernetes, an open-source tool created by Google, was influenced by Borg's design, with Borg jobs being similar to Kubernetes pods and Borg tasks being similar to Kubernetes containers.
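A toy model of the hierarchy described above; these class and field names are invented for illustration and are not Borg's or Kubernetes' actual APIs:

```python
# Illustrative only: Borg's job/task hierarchy and its rough Kubernetes
# analogue (job ≈ pod, task ≈ container), per the video's description.
from dataclasses import dataclass, field


@dataclass
class Task:
    """A runnable instance of a binary, e.g. a server or a batch job."""
    binary: str
    cpu_millicores: int
    ram_mb: int


@dataclass
class Job:
    """The smallest deployable unit: one or more tasks scheduled together."""
    name: str
    tasks: list[Task] = field(default_factory=list)


# A Borg job with two tasks, analogous to a Kubernetes pod with two containers.
web_job = Job(
    name="frontend",
    tasks=[
        Task(binary="web-server", cpu_millicores=500, ram_mb=512),
        Task(binary="log-shipper", cpu_millicores=100, ram_mb=128),
    ],
)
```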
What is Bigtable and how does it differ from traditional relational databases?
-Bigtable is a distributed storage system for managing structured data that is designed to scale and support millions of requests per second. Unlike traditional relational databases, Bigtable uses a sparsely populated, three-dimensional table structure with columns, rows, and multiple versions of a value at each row-column intersection.
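A rough illustration of that sparse, three-dimensional model in plain Python (not Bigtable's actual API): each (row, column) cell holds multiple timestamped versions of a value:

```python
# Toy model of Bigtable's data model: (row key, column) -> list of
# timestamped versions, newest first. Sparse because cells only exist
# when written.
import time
from collections import defaultdict

# table[row_key][(column_family, qualifier)] = [(timestamp, value), ...]
table = defaultdict(lambda: defaultdict(list))


def put(row, family, qualifier, value):
    cell = table[row][(family, qualifier)]
    cell.insert(0, (time.time(), value))  # newest version first


def get(row, family, qualifier, version=0):
    return table[row][(family, qualifier)][version]


put("com.example/index.html", "contents", "html", "<html>v1</html>")
put("com.example/index.html", "contents", "html", "<html>v2</html>")
print(get("com.example/index.html", "contents", "html"))  # latest version
```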
How does the LSM tree in Bigtable work?
-The LSM tree in Bigtable is an acronym for Log-Structured Merge-tree. It stores writes in a memtable, which is kept in memory and sorted. Once the memtable reaches a certain size, it is flushed to an SSTable (Sorted String Table) on disk, where the data becomes immutable.
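A minimal sketch of that memtable-to-SSTable flow, omitting compaction, tombstones, and on-disk formats; names are illustrative:

```python
# Tiny LSM tree: writes land in an in-memory memtable; past a threshold
# it is flushed as an immutable, sorted SSTable. Reads check the
# memtable first, then SSTables from newest to oldest.
class TinyLSM:
    def __init__(self, flush_threshold=4):
        self.memtable = {}        # in-memory buffer, sorted on flush
        self.sstables = []        # immutable sorted (key, value) lists
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            # Flush: persist keys in sorted order; the result is immutable.
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newest SSTable wins; a real system would binary-search on disk.
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None
```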
What is the significance of the CAP theorem in the context of Spanner?
-The CAP theorem states that a distributed system cannot simultaneously guarantee all three of Consistency, Availability, and Partition tolerance. Spanner, developed by Google, is often described as "beating" the CAP theorem in practice: it uses GPS receivers and atomic clocks (the TrueTime API) to tightly bound clock uncertainty across data centers, which lets it offer strongly consistent transactions while maintaining very high availability.
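A toy sketch of the "commit wait" idea behind Spanner's TrueTime, assuming an invented `tt_now()` that returns an uncertainty interval around the true time (Spanner's real API and bounds differ):

```python
# Toy TrueTime commit wait: pick a commit timestamp at the upper bound
# of the clock-uncertainty interval, then wait until every clock is
# certainly past it, so later transactions see strictly greater
# timestamps (external consistency).
import time

EPSILON = 0.007  # assumed clock uncertainty bound, e.g. ~7 ms


def tt_now():
    """Return (earliest, latest): the true time lies in this interval."""
    now = time.time()
    return now - EPSILON, now + EPSILON


def commit(transaction):
    # Choose the commit timestamp at the interval's upper bound...
    _, commit_ts = tt_now()
    # ...then wait out the uncertainty before acknowledging the commit.
    while tt_now()[0] <= commit_ts:
        time.sleep(0.001)
    return commit_ts
```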
What is Cloud Pub/Sub and how does it function in Google's architecture?
-Cloud Pub/Sub is a message queuing service that allows for asynchronous communication between services. It is similar to other message queuing systems like RabbitMQ and Kafka. In Google's architecture, it helps to decouple services, handle high throughput, and ensure data durability.
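A minimal in-process sketch of the publish/subscribe pattern (invented names; real Pub/Sub adds durability, acknowledgments, and horizontal scaling):

```python
# In-process pub/sub: publishers push to a topic, the broker fans
# messages out to subscriber queues, and consumers read asynchronously.
import queue
from collections import defaultdict


class Broker:
    def __init__(self):
        self.topics = defaultdict(list)  # topic -> subscriber queues

    def subscribe(self, topic):
        q = queue.Queue()
        self.topics[topic].append(q)
        return q

    def publish(self, topic, message):
        for q in self.topics[topic]:
            q.put(message)  # decouples producer from consumer speed


broker = Broker()
sub = broker.subscribe("uploads")
broker.publish("uploads", {"video_id": "abc123", "status": "transcoding"})
print(sub.get())  # consumer pulls when it is ready
```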
What are some of the tools and technologies that have been influenced by Google's internal systems?
-Several tools and technologies have been influenced by Google's internal systems: Hadoop's MapReduce and HDFS were inspired by Google's MapReduce and GFS papers, Kubernetes grew out of lessons from Borg, and NoSQL databases such as DynamoDB and Cassandra were influenced by Bigtable.
Outlines
😲 Revolutionary Google Technologies and Tools
The first paragraph introduces the speaker's departure from Google and their playful intent to 'expose' Google's secret technologies. It then dives into Google's history of revolutionizing distributed computing with six key tools. The Google File System (GFS) is highlighted for its ability to handle massive data storage and concurrent access, with a unique system of chunking and replication. The paragraph also discusses the MapReduce programming model, which simplified the process of handling large datasets and inspired the creation of Hadoop and other big data processing frameworks. The speaker also touches on Google's use of RPCs and Protocol Buffers for efficient data serialization and type safety, referencing Google's internal version, Stubby, which influenced the development of gRPC.
🤖 Google's Infrastructure and Database Innovations
The second paragraph continues the exploration of Google's innovations, focusing on infrastructure and databases. Borg, the job scheduling system that influenced Kubernetes, is covered along with Google's container stack. The paragraph then describes Bigtable, a NoSQL database designed for high scalability and performance that uses an LSM tree for efficient storage; its design inspired databases such as DynamoDB and Cassandra. Honorable mentions include Spanner, which uses GPS and atomic clocks; Dremel, the data warehouse behind BigQuery; and Blaze, Google's build tool, open-sourced as Bazel. After a joke about the defunct Google Domains, the speaker closes with Goops, Google's secretive internal messaging system whose public counterpart, Cloud Pub/Sub, provides high throughput and service decoupling in distributed systems.
Keywords
💡Google File System (GFS)
💡Distributed Computing
💡MapReduce
💡Hadoop
💡Protocol Buffers (Protobuf)
💡gRPC
💡Borg
💡Bigtable
💡LSM Tree
💡CAP Theorem
💡Pub/Sub
Highlights
Google's revolutionary approach to distributed computing with six tools.
Introduction of the Google File System (GFS) for handling petabytes of data.
GFS's method of splitting files into 64MB chunks and replicating across multiple servers.
The role of the Master server in GFS for managing file metadata.
How GFS inspired the development of Hadoop Distributed File System.
The release of MapReduce for processing large datasets efficiently.
Explanation of MapReduce's three main steps: map, shuffle/sort, and reduce.
The simplicity of MapReduce for programmers, requiring only two function implementations.
Evolution from MapReduce to Flume, Apache Beam, and Cloud Dataflow.
Introduction to gRPC and Protocol Buffers for efficient API communication.
Comparison of gRPC with RESTful APIs in terms of efficiency and type safety.
Stubby, Google's internal RPC system and the precursor to gRPC.
Borg, Google's job scheduler, and its influence on Kubernetes.
Bigtable, Google's NoSQL database designed for high scalability and performance.
Bigtable's architecture, including the use of LSM trees for data storage.
The impact of Bigtable on the development of Dynamo, Cassandra, and other NoSQL databases.
Spanner, Google's globally distributed database, often said to "beat" the CAP theorem.
Dremel, Google's data warehouse for large-scale data analysis.
Blaze, Google's build tool, and its open-source equivalent, Bazel.
Cloud Pub/Sub, Google's message queuing service for decoupling services.
Summary of Google's tools for data storage, movement, processing, and orchestration.
Transcripts
I just quit my job at Google so now I'm
gonna expose all of their ultra secret technologies just kidding all of these
can be found in this GitHub repo before
Google was known for suckling on your
sweet sweet data they were known for
revolutionizing distributed computing
again and again and again and they did
it by using six revolutionary tools that
most people never heard of but I'm gonna
show them to you today let's start with
a classic the Google file system it's
2003 most people are still learning how
to open their file explorer but Google
has a different problem search engines
need a web crawler to navigate through
every web page available and save it so
that it can be converted to a weighted
graph and used within PageRank now
that's quite a lot of data to store
today that would be on the order of
petabytes of data on top of that these
files need to be concurrently read and
written to by many machines from many
developers at Google requiring high throughput consistent data and replicated files the high level
implementation is that files are split
into 64 megabyte chunks before they are
stored each chunk is assigned a
universally unique ID and a given chunk
is not only stored on one server but
it's stored on at least three servers
there's also a single Master server
which sort of acts as the table of
contents it tells you the directory
structure maps every file to a list of its corresponding chunks and of course tells you the chunk locations as well it kind of reminds me of the superblock in
a Linux file system but in this case our
file chunks are distributed which makes
things a lot more complicated now Google
didn't just keep their secrets to
themselves they published a very
revolutionary paper which inspired
engineers at Yahoo to develop the Hadoop
distributed file system which was later
open sourced the original GFS was also
succeeded by Google's Colossus file
system okay but storing data is easy how
do we actually process the raw web pages
well conveniently a year later Google
released the MapReduce white paper the
problem is that you have to process a
lot of data like from Google file system
and you could use a single machine to do
it but then you might never go home
instead you can use the most powerful
big data processing framework at the
time that your company just happened to
invent and don't worry it's actually
pretty simple as the name implies
there's two main steps mapping the data
and reducing the data at least from your
perspective as a programmer but there's
a hidden middle step called the shuffle
or sort step let's say our input is a
bunch of raw text files for the map step
we would split our data up into
individual chunks and each server would
receive a portion of them the output
from each server would be a list of
intermediate key value pairs now before the reduce step we will shuffle or sort the data by making sure that every pair with the same key ends up at the same server for the reduce step this is important because when we reduce or aggregate our data we do so by the key so the input here is the intermediate key value pairs and the output is the final key value pairs congratulations you just wrote a distributed program to count words exciting but the reason MapReduce was so revolutionary is that
from a programmer's perspective you are
only responsible for implementing two
functions and these are analogous to the
map and reduce functions from functional
programming you didn't have to be a
distributed systems expert to use this tech and shortly after Hadoop MapReduce an open source variation was released nowadays no one really uses MapReduce anymore Google uses Flume Apache Beam is the open source equivalent and Cloud Dataflow is the Google managed job runner meanwhile the rest of the world uses Apache Spark and Flink for the same
purposes okay enough about
infrastructure show me some code well
you've probably heard of restful apis
but if you really want to bust your
balls watch me build the same API with
RPC we're using Proto version 3 and this
is our hello world package we start by
defining our schemas for our RPCs' requests and responses then we can create a service which is where we define our RPCs SayHello will basically receive a name and return a greeting message now in our service code we start with some boilerplate like importing the protocol buffers that we just declared then we can define and register our hello world RPC handler and create a gRPC server which listens for requests now that's definitely a lot more work than the REST equivalent and on top of that gRPC is not natively supported by browsers now if you are paying attention
though you might have noticed that we
actually have schemas and type safety
that's one of the purposes of protocol
buffers it might be less obvious though
that gRPC is more efficient because data is binary serialized rather than being human readable like JSON okay but what does this have to do with Stubby basically it's the Google internal version of gRPC because Google really
doesn't like open source for some reason
or maybe they're too lazy to migrate
over if someone from Google is watching
this let us know enter Borg it's
comparable to a really popular open
source tool that Google created can you
guess which one let me give you a hint
it schedules and manages hundreds of thousands of jobs spanning thousands of machines where a job is the smallest deployable unit of computing it consists
of one or more tasks where a task is a
runnable instance of a binary like a
server or a batch job so if you guessed
Kubernetes nice job yes Kubernetes was influenced by Borg and was open sourced in 2015 where a Borg job is kind of like a Kubernetes pod and a Borg task is kind of like a Kubernetes container but
instead of Docker Google uses this
container stack if you can't tell at
Google there's a culture of creating the
most esoteric names possible for
everything but I will say that at Google
I never saw anything that resembled a
Dockerfile so this might be a rare case
where the tech really just works Borg
has some more parallels with Kubernetes which you can read more about in the white paper but to summarize Google doesn't use the complicated Kubernetes
tool and instead uses something even
more complicated yeah now finally the
moment you've been waiting for databases
Bigtable was created when Google ran
into the limitations of relational
databases it was designed to support
millions of requests per second and be
extremely scalable now just like SQL
data is still stored in tables of
columns and rows related columns can be
grouped together via a column family
each row column intersection can have
multiple versions of a value so it's
kind of like a three-dimensional table
where the third dimension is time
another difference from SQL is that it's
sparsely populated because not every
column has to be required by each row
and we have the flexibility to add
columns as needed now under the hood
data is stored in an LSM tree where
writes are batched into the memtable where they're stored in sorted order before being flushed to the SSTables where they will be immutable like I said storing data is easy guys the 2006 Bigtable paper inspired the 2007 Dynamo paper which later led to DynamoDB and it
also inspired the development of
Cassandra and I'm sure Amazon continued
to give back to the open source community after that this is the part where I was going to have a MongoDB sponsorship I'll just plug my site
instead I recently added a full stack
development course where we build out a
YouTube skeleton and we focus on the
part that most people avoid the back end
specifically focusing on the upload
feature where videos are asynchronously
transcoded and then served to users you
can read more about it in a short design
doc I wrote which is free I think this
is the type of project that most people
definitely don't have on their resume so
it might help set you apart and now
number one right after a few honorable
mentions Spanner is the crackhead database that uses GPS and atomic clocks to literally break the CAP theorem Dremel is a data warehouse similar to BigQuery the crown jewel of Google Cloud and there's Blaze Google's build tool that I never fully learned how to use and was open sourced as Bazel finally a name that
makes sense and now for the final one
Google Domains just kidding it's dead
let's finish with the most secretive
tool on this list Goops okay seriously
who's coming up with these names there's
not much public info on it but Cloud Pub/Sub is the public version and is pretty much equivalent at its core Pub/Sub is a message queue so it's comparable to tools like RabbitMQ and Kafka without a message queue your architecture might look like this but if we want to handle high throughput add a durability layer and decouple our services we can introduce Pub/Sub to our architecture to
summarize we talked about how Google
stores data whether we're talking about
files or transactional data we talked
about how Google moves data with gRPC protocol buffers and message queues we talked about how Google processes data originally with MapReduce and later
Flume and lastly Borg which is how
Google orchestrates all of the above
thanks for watching this was a fun one
and hopefully I'll see you soon