Distributed Systems in One Lesson by Tim Berglund

Devoxx Poland
1 Aug 2017 · 49:00

Summary

TL;DR: This video explains the problems that arise when building distributed systems and the open source projects that help solve them. It approaches the topic from three angles — storage, computation, and messaging — and walks through the challenges a distributed system faces compared with a single-machine system. It shows how open source projects such as Cassandra, Spark, and Kafka overcome the problems in each of these areas. Building distributed systems is hard, but the video delivers clear, practical insight with a dose of humor.

Takeaways

  • 🔑 Distributed systems run into far more problems than single-computer systems, with distinct challenges in each of three areas: storage, computation, and messaging.
  • 📚 With a single-master relational database, coping with scale forces you to sacrifice consistency and your data model. Scalable non-relational databases like Cassandra, by contrast, are designed as distributed systems from the start.
  • 🧮 Distributed computation uses frameworks such as MapReduce and Spark; Spark is far more developer-friendly than MapReduce.
  • 💬 Messaging systems enable loose coupling between microservices. Apache Kafka is a distributed message broker that scales through partitioning.
  • 🌐 In a distributed system you can achieve only two of the three properties consistency, availability, and partition tolerance at the same time (the CAP theorem).
  • 🔄 In distributed systems, solving one problem often creates another; for example, replicating data introduces consistency problems.
  • 🌉 Embracing a message bus thoroughly can change the architecture itself: services can process events directly as streams.
  • ⚖️ Guarantees that are trivial on a single server (consistency, message ordering, and so on) are hard to achieve in a distributed system.
  • 🔩 Sharding for scale constrains the data model, for example by making joins across shards impossible.
  • ✨ Building distributed systems is hard, but Apache open source projects help overcome the challenges.

Q & A

  • What is a distributed system?

    - A distributed system is a collection of computers that behaves as a single computer. It has three characteristics: it is made up of many computers, they operate concurrently, and parts of the system can fail independently.

  • What are the main challenges in building a distributed system?

    - The main challenges fall into three areas: storing data, distributing computation, and messaging. Each is relatively easy on a single system but becomes very hard in a distributed environment. Different problems arise in each area, forcing you to choose trade-offs.

  • What are the main challenges in storage, and how are they handled?

    - The main challenge is the trade-off between consistency and scalability. You end up adding read replicas at the cost of consistency, or sharding for scalability at the cost of your data model. Cassandra, a scalable non-relational store, addresses this by, among other things, letting you tune the consistency level.

  • What about distributed computation?

    - MapReduce was widely used with Hadoop, but its programming model was very painful. Spark is far more developer-friendly than MapReduce and also supports stream processing. Kafka is a streaming platform built around event streams and can be used for distributed computation as well.

  • What problems arise in distributed messaging?

    - A single-server message queue guarantees message ordering and exactly-once delivery, but it lacks scalability and redundancy. Apache Kafka, a distributed message broker, achieves scalability and redundancy through partitioning and replication, but gives up global ordering.

  • What is the typical approach to building a distributed system?

    - Typically you start from a single system and gradually distribute its components, choosing trade-offs as needed while the system grows. It is common to adopt different distributed solutions at the storage, compute, and messaging layers and to move toward an architecture like microservices.

  • What role does messaging play in a microservices architecture?

    - Messaging is key to loosely coupled communication between microservices. A messaging platform such as Kafka removes direct dependencies between services, enabling independent deployment and scaling, and supports event-driven architectures.

  • Any general advice for building distributed systems?

    - Distributed systems come with unavoidable problems. As you gradually distribute a single system, it is important to weigh the trade-offs carefully and pick the right solutions. Make good use of open source tools and platforms, and stay flexible.

  • Explain the CAP theorem for distributed systems.

    - The CAP theorem states that a distributed system cannot simultaneously provide all three of consistency (C), availability (A), and partition tolerance (P). When a network partition occurs, you must choose between consistency and availability. Keeping this trade-off in mind is essential when designing a system.

  • Explain consistency in distributed systems in more detail.

    - Consistency refers to data integrity. On a single system you always read what you wrote, but in a distributed environment replication can produce inconsistent copies. In Cassandra, you can control the consistency level by tuning how many replicas must acknowledge each read and write. A trade-off between eventual consistency and availability is unavoidable.
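    To make that Cassandra knob concrete, here is a minimal sketch using the DataStax Python driver (cassandra-driver); the cluster address, keyspace, and table are hypothetical placeholders:

```python
# Per-request consistency tuning with the DataStax Python driver.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # hypothetical cluster address
session = cluster.connect("coffee_shop")  # hypothetical keyspace

# Write at QUORUM: a majority of replicas must acknowledge the write.
write = SimpleStatement(
    "INSERT INTO favorites (username, drink) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, ("tim", "americano"))

# Read at ONE: fastest, but may return stale data (eventual consistency).
read = SimpleStatement(
    "SELECT drink FROM favorites WHERE username = %s",
    consistency_level=ConsistencyLevel.ONE,
)
row = session.execute(read, ("tim",)).one()
```

    Using QUORUM on both the write and the read satisfies W + R > N for a replication factor of 3, which is what the talk calls strong consistency.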

Outlines

00:00

👋 Opening: introduction and background

The speaker, Tim Berglund, introduces himself and frames the talk as a broad tour of distributed systems. Distributed systems are terrible, he says, yet an unavoidable reality — and he still recommends choosing a simpler system if you can. He announces that he will explain the problems of distributed systems from three angles (storage, computation, and messaging) and introduce the open source projects used in each area.

05:01

💾 Distributed storage: problems and remedies

This section explains the problems that appear as a single-server relational database is scaled up. Scaling with read replication sacrifices consistency, and sharding then breaks the data model — joins become impossible, for example. It introduces Cassandra, a distributed storage product built to address these problems, covering its architecture, data model, and handling of consistency. It notes that in distributed systems, solving one problem creates another, and touches briefly on the CAP theorem.

10:02

💻 Distributed computation: problems and remedies

Computation is easy on a single server or processor but much harder when distributed. This section explains how MapReduce works and its drawbacks, points out the problems Hadoop suffered from, and introduces Apache Spark as its successor. It also mentions Apache Kafka as a framework for distributed computation over streaming data. The gist: what matters is an environment where you can run distributed computation without worrying about where the data lives.

15:04

📩 Distributed messaging: problems and remedies

Messaging has attracted attention alongside the rise of microservices. This section explains the basic concepts and benefits, points out that guarantees that are easy on a single server (such as delivery guarantees) become hard in a distributed environment, and presents Apache Kafka as a solution. It further explains that by fully embracing distributed messaging, you can put event streams at the center of the system and decouple services from one another.

20:05

🚀 Wrap-up and plugs

The talk closes by recapping the challenges of distributed systems in storage, computation, and messaging, and the open source remedies for each. Distributed systems are terrible but an unavoidable reality, and what matters is how you put them to good use. The speaker ends with a plug for his online course and for local Kafka meetups.

Keywords

💡Distributed system

A system composed of multiple computers that appears to its users as a single system. This is the central concept of the talk, which explains the problems and solutions that arise when everything — storage, computation, messaging — is distributed. It is defined as "made up of lots of computers, operating concurrently, failing independently, and not sharing a global clock."

💡Storage

The mechanism for storing data. Easy on a single server, complex in a distributed environment. With a relational database, techniques such as replication and sharding are applied, but along the way you are forced to sacrifice consistency and the data model. Cassandra, a NoSQL database designed for distribution from the start, is presented as the alternative. The warning: "the more you scale, the more of the relational database's wonderful properties you lose."

💡Computation

Processing data. Easy on a single processor, very hard in a distributed environment. MapReduce is the basic technique for distributing computation, and more refined approaches such as Spark have since appeared: "Spark offers a far more programmer-friendly computation model than traditional MapReduce on Hadoop." Stream processing with Kafka is also introduced.

💡Messaging

The mechanism by which systems exchange data. It has grown in importance with the rise of microservices. A single message broker is limited, but a distributed message broker like Kafka provides high availability and scalability — with trade-offs, such as message ordering being guaranteed only within a partition. The benefit: "messaging lets you loosely couple the parts of a system so they can be released and versioned independently."

💡Consistency

Keeping stored data and messages coherent. Crucial in distributed systems, but hard to achieve. The CAP theorem is introduced: of consistency, availability, and partition tolerance, only two can be achieved at once. To preserve consistency you may sacrifice availability, or settle for eventual rather than strong consistency. "Cassandra lets you tune the consistency level on a per-request basis."

💡Scalability

The ability to grow a system. It is the biggest benefit of distributed systems, but it creates many problems. For storage, read and write scalability are handled differently, via replication and sharding. For computation, data is distributed across machines and processed with approaches such as MapReduce, Spark, and Kafka streaming. For messaging, a broker cluster like Kafka's provides high scalability. "Distributed systems let you build systems far larger than a single server could handle."

💡Microservices

An architectural style that builds a system from small units of functionality rather than a monolith. This movement raised the importance of distributed messaging: "with each microservice talking to a message queue in a loosely coupled way, the whole system becomes loosely coupled." With messaging at the center, coupling between services drops and independent release cycles become possible. Kafka is presented as the alternative to the anti-pattern of sharing a single database.

💡Queueing system

A system that holds messages temporarily and delivers them sequentially. Kafka is a distributed message broker providing this functionality: "Kafka uses named topics as namespaces and publishes/subscribes messages across partitions." In a distributed setting, global ordering is lost and order is guaranteed only within a partition; in exchange for accepting this trade-off, you get high scalability and availability.

💡Partitioning

Splitting data or workload across multiple nodes. In storage, computation, and messaging alike, it is the key concept behind scalability in distributed systems. Cassandra partitions data using consistent hashing; MapReduce partitions data for distributed processing; Kafka splits topics into partitions and routes messages by hashing their keys. "Partitioning is what makes the whole system scalable."

💡Vector clock

A mechanism for ordering events across nodes in a distributed system. Instead of relying on physical clocks, it captures the order in which events occur using logical clocks. The talk notes that "because a distributed system cannot share a global clock, vector clocks become essential." Each node keeps its own counters, preserving a consistent view of event ordering. Being logical clocks, they need no absolute time synchronization; what matters is that they capture the happens-before relationships between events on the nodes.
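As an illustration of the idea (my sketch, not code from the talk), a vector clock can be implemented in a few lines:

```python
# Minimal vector-clock sketch: each node keeps a counter per node.
class VectorClock:
    def __init__(self, node_id, nodes):
        self.node_id = node_id
        self.clock = {n: 0 for n in nodes}

    def tick(self):
        # Local event: increment our own counter.
        self.clock[self.node_id] += 1

    def merge(self, other_clock):
        # On receiving a message: element-wise max, then tick.
        for n, c in other_clock.items():
            self.clock[n] = max(self.clock[n], c)
        self.tick()

def happened_before(a, b):
    # a -> b iff a <= b element-wise and a != b; otherwise, if neither
    # precedes the other, the two events are concurrent.
    return all(a[n] <= b[n] for n in a) and a != b

# Usage: compare the .clock dicts of two events to order them.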

Highlights

Distributed systems have three characteristics: made up of lots of computers, operate concurrently, and don't share a global clock.

In a single server database, reads are usually cheaper than writes, so read replication is used to scale reads, but this creates an eventually consistent system.

Sharding a database can help scale writes, but it breaks the data model and ability to join across shards.

Cassandra uses consistent hashing to distribute data across nodes and allows tuning consistency on a per-request basis.

The CAP theorem states that in a distributed system, you can only have two of three properties: consistency, availability, and partition tolerance.

MapReduce funnels all computation through map and reduce functions, which is a terrible thing to do, but it was useful for large distributed data.

Spark provides a much more developer-friendly way to do distributed computation than MapReduce, with abstractions like RDDs and Datasets.

Kafka treats everything as a stream, allowing computation on data in flight without the need for a cluster.

Messaging enables loose coupling of systems, which is useful for microservices communication.

Kafka partitions topics across brokers, maintaining ordering within partitions but losing global ordering across the topic.

The lambda architecture emerged as an anti-pattern, forcing developers to write the same code twice for batch and stream processing.

Treating everything as an event stream in Kafka allows processing in-place and decoupling services from production and consumption of events.

Moving from a single server relational paradigm to distributed systems like Cassandra requires rethinking consistency, data modeling, and system architecture.

The speaker encourages attending Kafka meetups to learn more about distributed messaging and streaming.

The speaker makes a shameless plug for his O'Reilly video on distributed systems, which goes into more detail on the topics covered in this talk.

Transcripts

00:10

All right, I'm on. Hey, that's convenient. Good afternoon everybody, time to get started. I am Tim Berglund, and this is Distributed Systems in One Lesson. So these guys realized that's not what they wanted to be in... see you guys. If you're expecting to be in Distributed Systems in One Lesson, then you are in the right place.

00:34

A little bit about me. I already said my name, I'm Tim. As of a couple of months ago I work for a company called Confluent, and a lot of my attention is shifting to Kafka and that whole ecosystem — that's what Confluent does. I'm the director of developer experience there, so my team is in charge of documentation and evangelism and making sure meetups are staffed, and training, and all those things. It's not a big team yet, but we're hiring, and it's an exciting time to be at Confluent.

01:07

Also an exciting time to be here. Just so you know, I picked up a bottle of water and I thought it was still water, and it turns out it is light sparkling water. And do you guys know, when you have a microphone, if you have a choice between still water and sparkling water, always go with the still water. So I might not drink very much of this.

01:32

Anyway, we've got about 50 minutes to spend together, and I want to talk about what a distributed system is and why they are terrible. Now, this might seem strange coming from a guy who makes his living in distributed systems and has for some years, but that's kind of a theme of this talk: if there's anything I can do to talk you out of building a distributed system and choosing a simpler life, I want to. Now, I know I'm going to fail; I know you're going to do it anyway. But we'll talk about the various things that go wrong when we try to do simple things like store data, compute data, move data from one place to another. All of those things are relatively easy at small scale on a single computer; when we get multiple computers involved, they become terrible. So I'm going to show you the things that go wrong, and for each category of things we'll talk about — storage and computation and messaging — I'll try to dive into an open source technology that addresses that problem. That's really what we're doing today: kind of an introduction to some topics, an introduction to some open source technologies that solve the problems that come up when we build distributed systems.

02:46

A definition might be in order. Andrew Tanenbaum defines a distributed system as a collection of independent computers that appear to its users as one computer. So you might say something like amazon.com — I guess it looks like one computer, so that would be a distributed system. A Cassandra cluster, a Kafka cluster: all of these things, at one level of abstraction or another, act like a single system, but in fact they're composed of many computers.

03:20

Now, distributed systems have three characteristics. For something to qualify as a distributed system, it has to be made up of lots of computers. They have to operate concurrently — that's kind of obvious, right, those computers are all running at the same time. They might fail independently, and the possibility of one of our machines failing is a specter that is always looming over us; we always have to deal with that. And — those first two are frankly trivial things — this last one is a little subtle, and this is what makes things interesting: the computers don't share a global clock. So if you ever have a collection of processors that are in some sense synchronized — that their clocks are synchronized — that technically is not going to function as a distributed system, and some of the various terrible things that happen when we build conventional distributed systems won't happen if you have a global clock. But that's just not practical; that's not how we build real systems in the real world. So we have independent clocks, independent failure, and of course computers operating concurrently. That's what I mean when I say "distributed system."

04:24

So as I said, we will dive into three topics here: storage, computation, and messaging. Now, in the single-server, single-processor, even single-thread version of these three things, life is grand. These are more or less solved problems, and everybody knows how to do them. Like, everybody knows how to use a single-master relational database; that's a fairly commodity skill. Everybody knows how to do computation in a single thread; if you're a programmer at all, you know how to do that. But when we try to distribute these problems more broadly, life gets real hard. And like I said, I'll look at each one of those, and my goal here, again, is to convince you: don't do this. Get an easier job. Build a smaller system. Live in the country — slower pace of life. I'm just looking at you; I know you're not going to do it. I wish you would. I'm trying. I've got a few minutes left to convince you, but let's see where I get.

05:30

Also, by way of a shameless plug, there is an O'Reilly video available on Safari with the same title, Distributed Systems in One Lesson. That's like three or four hours long; I go into more detail. And yeah, it's by me — so that's the shamelessness of the plug — if you're interested and want more.

05:48

All right, let's talk about storage. Now, to remind you: single-master storage — the pleasant life you used to lead, the salad days, the simpler times — when you have a database that exists on one server. It's very simple, but life gets hard, life gets complicated. And if you could imagine the behavior of a database — if you had, like, a slider where you could turn the scale of the system up and down — certain things happen to us as we crank that scale slider up and the system gets busier and busier.

06:24

Now typically, in a lot of web systems, there are more reads than writes — there's more read traffic than write traffic. Now, in a relational database, reads are usually cheaper than writes, so that's okay. But imagine, as we're turning this scale slider up, at some point this single server runs out of gas. We've got all these reads happening, not as many writes happening, and we need to scale reads. What do we do? We all know the answer: we do read replication. Okay, that's easy enough. Here, we'll take writes that go to the master, and we'll propagate them to the two follower servers, and you're able to do reads against those. Keep the writes going to the master, and you buy yourself some time as this scale slider is turning up.

07:14

Now you've broken something already. You've created a distributed system — a distributed database — it's just not a particularly intentional one yet. And what is the thing that we have broken? We have broken consistency. When we were back in the simpler days — I can't say I didn't warn you — anytime I write to this database and then I read from it, I am always going to read the thing that I wrote. That is a guarantee, and, you know, for any database we're ever likely to use, we're going to read the thing I write. Now, when I go to read replication, that's not necessarily true: I'm going to write here, and when I read here, that write may not have propagated yet. So I now have an eventually consistent database — accidentally. And the bummer here is that I may be telling myself, "well wait, this is a relational database, I've got strong consistency guarantees, it's okay" — but I've broken that. So a little bit of life has gotten bad.

08:11

Also, this only works for so long. As I continue to turn up that slider, what happens? Well, I can keep adding follower servers for reads and scale my reads as much as I want, but all of the writes still go here, and so I'm going to run out of gas on the master — on the leader server. So then I do an even more terrible thing, and that is sharding. Again, this is not a new technique; people have been sharding relational databases for approximately as long as there have been relational databases. But what I do here is I will find some key — I'm using name here — and I basically just break up the database into three pieces based on that key. So if I'm storing people's settings based on their username, let's just say I will pick username as that key, and put A through F here, F through N here, and N through Z, as my people say, on the last server. I suppose I should say "zed," huh? No, I'm going to keep saying "z." All right.

09:12

And each one of these is its own little read-replicated cluster, so I can sort of make the database as big as I want there. But I've broken something else: I already broke consistency, and now, kind of, I've broken my data model — if this is a relational database, and when one scales like this it almost certainly is a relational database. Now, for example, I can't join across shards; I've got completely independent databases from shard to shard. And that's a little bit of a bummer. That works sometimes.
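To make the trade-off concrete, here is a minimal sketch (mine, not the speaker's) of range-based shard routing on username; the shard hostnames are hypothetical, and the A–F / F–N / N–Z split from the talk is made non-overlapping:

```python
# Range-based shard routing on the first letter of the username.
SHARDS = [
    ("a", "f", "db-shard-1.example.com"),
    ("g", "n", "db-shard-2.example.com"),
    ("o", "z", "db-shard-3.example.com"),
]

def shard_for(username: str) -> str:
    first = username[0].lower()
    for low, high, host in SHARDS:
        if low <= first <= high:
            return host
    raise ValueError(f"no shard for {username!r}")

# Every query must be routed first -- and a JOIN across two shards is
# simply impossible at the database layer.
assert shard_for("tim") == "db-shard-3.example.com"
```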

09:46

I have a talk later this afternoon where we're going to talk about sharding as a way of scaling an entire system, and in that case — sharding a whole system — sometimes I get away with that real well, sometimes I don't, and it can be a bit of a bummer. Anyway, briefly, that's just what you do with a database that's not built as a distributed database; that's how you can make it get a little bit bigger. Again, the bad thing is: certain behaviors and guarantees and claims of a relational database that I start off with — little by little, I give those up.

10:23

In fact, it's even a little worse than that. There are other things: if you can imagine that scale slider going up and up, I have to do things with the topology to address that — I have to add computers and make a distributed system — and I'll also have to take things away from my data model as I turn that slider up. So suppose, for example, your database comes under pressure: latency is going up, you're having higher transactional volume, and suddenly reads are not performing properly. What are you going to do? What's one of the first things you'll do in response to that? Most people will say at this point: add an index. Kind of an okay answer, right? Maybe that'll help; maybe you're missing an index, that's good. Suppose you've already done all the index optimizing you can — that's trivial. After that, what you do is you realize: oh wait, the reason reads are slow is because of this wonderful relational model. I'm doing all this joining because my schema is properly normalized. I have to take joins away in order to make reads perform. And how do you take joins away? Well, you denormalize — which we all know is immoral, right? That's what we're taught in school. So you actually become what you think is a terrible person, and you denormalize, because your scale is forcing you. You have no choice.

11:47

Also, if your first answer was to add indexes, that is another pressure that you'll come under on the write side: as write traffic increases, before you shard — or even after you shard — you're likely to give up on indexes, which will then put further pressure on you to denormalize your data model down the line. So all of the wonderful things that you thought were true about your database when you started — they're just gone; they're taken away from you by the time you have to scale it, and you end up with something that doesn't really look all that relational by the time you're done.

12:22

So why not just do it right? And here is an approach that is taken by — really, I'll be talking in terms of Cassandra, because it's a database that I know fairly well — but if you look at the scalable non-relational databases, that family of products, they all kind of work like this. They use a technique called consistent hashing. In fact, consistent hashing is a nice thing to know when you're looking at other systems and other open source projects in the distributed systems world; this method often kind of pokes up its head, like it's winking at you, from various systems. It's front and center in Cassandra, but it shows up in other places, so just be advised.

13:03

Now, here's an example of how you build a database from the ground up to be distributed. Now, it's still doing bad things to you, all right? There are still lots of features Cassandra's not going to give you that you wish you had, but it's very upfront about it. Right from the beginning it's saying: here are the limitations of the data model, here's how scale works, period — and sort of, just deal with it and move forward from there. So suppose I've got this database here: I've got eight computers, and they're all the same, they're all peers — it's a pretty simple architecture. They're just identified with a numeric token, and in the case of Cassandra we always draw this in a circle. Now, that's not because they're actually connected that way — these are regular computers on an Ethernet network, just like you're a normal person — but they all have this unique token, this unique numerical ID, and it helps to draw them in order, clockwise, in a circle, and I'll show you why right now.

14:02

I'm storing favorites — and, per the coffee picture (because I forgot to say this at the opening slide), as many of these examples as possible I will draw from a coffee shop, just to be cute. So suppose this web-scale coffee shop is trying to remember people's favorite coffee. Basically, I'm an americano guy; sometimes, if I'm feeling a little crazy, almond milk latte — that's about as fancy as it gets. But we want to store my favorite coffee, americano. What we do is we take that key-value pair — Cassandra's not a key-value database, but this is an easy abstraction to kind of get on board with, and this is as far as we'll go here in the summary — take the key and run it through a hash function. Doesn't matter what the hash function is, as long as it's always the same hash function. We hash it, and we say: okay, where is that number going to fit in this ring? We want to see — okay, it's bigger than eight thousand, it's less than a thousand — so, just based on the way those arrows are going, I'm going to write that key into that node. Simple as that. Now, the cluster can grow, the hash values can change, and as long as, internally, the database moves keys around as those things change, all I ever need to do when I'm writing to the database is look at it and see: what are your tokens right now? How many nodes do you have, what are your tokens? And hash my key — and I know where to write, and I know where to read from later on. If I say, "hey, what's Tim's favorite coffee?" — there's a new barista, he doesn't know what my favorite coffee is; it's fine, this is web scale, he can't remember all those things — we'll hash that, we'll say, "oh, 9F72, fine, that's got to come from a thousand," I'll go there, get that key, and pull it out. So that's basically how you read and write with a consistently hashed database.
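A minimal sketch of that ring lookup (the token values and the md5-based hash are illustrative assumptions, not Cassandra's actual implementation):

```python
# Consistent hashing in the spirit of the Cassandra ring: hash the key,
# then walk clockwise to the first node token after the hash.
import bisect
import hashlib

TOKENS = sorted([0, 1000, 2000, 3000, 4000, 5000, 6000, 7000])  # node tokens

def node_for(key: str) -> int:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16) % 8000
    i = bisect.bisect_right(TOKENS, h)   # first token strictly after h
    return TOKENS[i % len(TOKENS)]       # wrap around the ring

# "tim" always hashes to the same value, so reads and writes for the key
# land on the same node -- no central lookup table required.
print(node_for("tim"))
```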

15:51

Of course, we would never want to have just one copy of a piece of data; that would be crazy. So we can do things like replicate. If I say, "well, you know, the first copy would have gone to this node," I'll just go to the next two nodes down the ring in the cluster and put a copy on those nodes as well. That fixes the problem of — now that I'm in a distributed system, what was my second characteristic? Computers fail independently. We know this; we have no way to predict when a node's going to die. I would never want a key in just one place, so now I've got three copies of it. I've solved that problem: I'm now able to tolerate hardware failure and be up.
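Continuing the ring sketch under the same illustrative assumptions, replica placement is just "keep walking clockwise":

```python
# With replication factor 3, a key's copies live on its primary node
# plus the next two tokens clockwise around the ring.
import bisect
import hashlib

TOKENS = sorted([0, 1000, 2000, 3000, 4000, 5000, 6000, 7000])

def replicas_for(key: str, rf: int = 3):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16) % 8000
    i = bisect.bisect_right(TOKENS, h)
    return [TOKENS[(i + k) % len(TOKENS)] for k in range(rf)]

print(replicas_for("tim"))  # three distinct nodes hold the key
```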

16:36

Sounds great. I told you distributed systems are terrible: I created a problem by solving one. And what is that problem? Consistency. I have three copies, on three computers, of something that can change. I'm a fickle coffee drinker: it's americano today; tomorrow maybe it's a triple-skinny two-thirds-decaf half-chocolate no-whip grande mocha with extra foam. You have no idea what I might come up with tomorrow. I am the very definition of mutable state. So now that I have three copies of something that can change, I have created a consistency problem. And to be clear: if I write tomorrow my new — that crazy thing I just said — maybe one of these nodes is down while I'm doing the write, and it doesn't succeed, and it has an old copy of the favorite coffee, and the other two nodes have the new copy. Entirely possible. And then when I go to read, how do I know who's telling me the truth? All of these things become problems.

17:44

Now, basically, I want to give you a rule for thinking about consistency in a database like this. When I write to the database, I'm going to require that a certain number of nodes take the write. So as a client — as a client application — I'm going to say: you two, at least two of you, need to succeed; if only one of you succeeds, I'm going to consider it to be a failure. Or I'll say: only one of you needs to succeed — I know I have three replicas, but I'm in a hurry, I'll take one. Those are options that I have. I could be particularly fastidious and say I need all three of you to succeed; nobody would do that, really — just one and two are the practical options. Likewise, when I read, I need to say: well, I'm going to hear from two of you, or one of you.

18:27

I have a simple formula here for knowing whether you are using your database in a strongly consistent way: if the number of nodes I write plus the number of nodes I read is greater than the number of replicas, then I have strong consistency. So if I write one and read one, and there's three replicas, that's eventually consistent. If I write two and read two, and I have three replicas, that's strongly consistent. Cassandra is a pretty cool database — if you want a big database, it's kind of the way to go — and it lets you tune this on a request-by-request basis. You can change your consistency semantics on the fly. Pretty cool. So even though everything's a problem in a distributed system, and solving one problem creates another one, there are some things that have elegant answers — and Cassandra is one such.
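The speaker's rule, written out with his two worked cases (replication factor N = 3):

```latex
W + R > N \;\Longrightarrow\; \text{strong consistency}
\qquad
\begin{aligned}
W = 1,\ R = 1 &: \quad 1 + 1 = 2 \not> 3 \quad \text{(eventually consistent)} \\
W = 2,\ R = 2 &: \quad 2 + 2 = 4 > 3 \quad \text{(strongly consistent)}
\end{aligned}
```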

19:28

All right — the CAP theorem. Should we talk about the CAP theorem just a tiny bit? Back when NoSQL was new — it was like 2010 — reading Hacker News in the morning, which is a thing I do (it's like my newspaper), was painful, because about once a week there would be a new, and newly misinformed, blog post that would make it to the front page of Hacker News that was something about the CAP theorem. Everybody cared all of a sudden, because everybody was making scalable databases, and so people had to blog about it and say things that weren't true. It's really not hard to grasp, and it comes up sometimes, still, in distributed systems discussions, so I want to give you a simple way of understanding it.

20:09

The CAP theorem says: if you're building — let's just make it simple — a distributed database, you want your database to have three properties. You want it to be consistent — that's the C: when I read something, I want that to be the most recent thing I have written; so I write, and then I read, and I see what I wrote. I want it to be available, and that simply means when I try to write, it works, and when I try to read, it works. I also want it to be partition tolerant: it's a cluster — I've got lots of nodes being a database together — and if some of those nodes go away for a little bit and then come back together, I don't want the database to completely fall apart. I want it to be able to deal with the fact that sometimes parts of the system can't talk to the rest of the system. That's partition tolerance. (Oh wait, I forgot the slide — consistency, availability, partition tolerance. You guys got this.)

21:02

All right, imagine you're in a coffee shop — forget about databases for a minute — and you are writing something together. You're working on your screenplay with a friend, your co-writer, and you're old school: you're doing it on a yellow legal pad with a pen. It's you; it's not me. I don't know why you're doing it this way, but you are, and both of you have a copy of the screenplay. One of you makes a change here, the other one makes a change on their copy; that person makes a change, you delete something — you keep your copies consistent, because it's how you want to do it. It's a little strange, but it's what you're doing. The coffee shop closes — I'm not going to judge you, I don't know why you're doing it this way, but you just are — the coffee shop closes, but you're in the zone, so you go home and you keep going over the phone, and you keep in sync over the phone. You say, "well, I added this line of dialogue, I removed this scene heading," and you're keeping things up to date. But then your battery dies, and you can't talk anymore; you lost your charger. You now have a network partition. You keep writing, because you're in the zone; presumably your partner keeps writing. Your spouse comes in the room and says, "hey, how's the screenplay going? Read me back your most recent page." (Which is great, that your spouse is so interested, I think.)

22:17

Here we have a decision. So, back to the Venn diagram: you can't have all three of these things — that's the fundamental insight of the CAP theorem — in the presence of a partition. In this situation, you can either choose to be unavailable — that is, you say to your spouse, "I can't tell you, because I might tell you the wrong thing, and I would hate to do that; that would be unfair to my writing partner" — or you could say, "well, I'll just give you an answer; it might be wrong." So I could give up on availability and keep consistency by not answering, or I could keep availability and give up on consistency. That's the fundamental insight of the CAP theorem: I can only have two of these three good things. I always want all three, but sometimes I can't have all three. Again, there's a lot of writing on the internet about what this means and what its implications are and when it applies and so forth, but just think of distributed databases: it's a simple trade-off between those three things that you want.

play23:22

between those three things that you want

play23:26

all right

play23:27

enough of storage

play23:30

let's talk about computation

play23:34

again computation is super easy if you

play23:36

have just one

play23:38

processor um if you go to multiple

play23:40

threads on one processor and you try to

play23:42

do computation in a threaded way that's

play23:44

usually pretty terrible and uh there are

play23:47

talks at this conference about ways to

play23:49

avoid that terribleness

play23:51

and

play23:53

i should say but

play23:56

you got multiple computers and you're

play23:57

trying to distribute a computation task

play23:59

among multiple computers

play24:01

life gets really hard and we've created

play24:03

some truly terrible tools for solving

play24:06

this problem the good news is life is

play24:08

getting a lot better

play24:10

trying to do this 10 years ago or 12

play24:12

years ago was bad trying to do it now is

play24:16

starting to not feel like somebody is

play24:19

hitting you with a 2x4 so i think

play24:20

there's a lot of a lot of reason for um

play24:24

for optimism and mapreduce is not one of

play24:27

those reasons but let's talk about it

play24:30

just to make sure we're on the same page

play24:32

and you know the badness that we've come

play24:34

from so the typical way to explain

play24:36

mapreduce is with word count

play24:39

and this is a union rule people i have

play24:41

to do this

play24:42

so

play24:43

i'm going to use a word count example

play24:45

just to show you basically what's going

play24:46

on here so i've got this big chunk of

play24:49

text now if i'm actually

play24:52

oh this works really well in the u.s

play24:54

because everybody knows this

play24:55

it's an american poet he's from

play24:57

baltimore

play24:58

his name is edgar allan poe

play25:00

the american football team

play25:02

from baltimore does anybody know the

play25:03

name of the team

play25:06

the ravens

play25:08

because this poet is from baltimore this

play25:11

is his most favorite poem it's an

play25:13

amazing poem strongly recommended

play25:16

and i'm not even a football fan but i

play25:18

kind of love the ravens for this reason

play25:19

anyway if you actually had the text of

play25:22

the raven i think it's about 1800 words

play25:24

the right way to count that is to use

play25:26

the wc command don't screw around that's

play25:29

a small data problem but this is helpful

play25:31

just to see how mapreduce works map

play25:33

reduce

play25:35

funnels all of your computation through

play25:37

two functions one called map and one

play25:40

called reduce that is a terrible thing

play25:42

to do to a person but here's how it

play25:44

works i would dump the text into my map

play25:46

function and what map would do is it

play25:48

would it would tokenize it it would

play25:50

split it up into words

play25:52

and then it would turn those words into

play25:54

key value pairs

play25:56

this is i'd like to go through this

play25:58

because this is kind of a neat insight

play26:00

into the way you would solve a problem

play26:03

like this in a distributed way

play26:05

what we begin by doing

play26:07

is counting words

play26:10

that is what i have just done here

play26:12

that's the first step in the mapreduce

play26:14

canonical mapreduce word count example

play26:16

is you count the words now what people

play26:18

say is i tokenize the text and i turn

play26:21

the words into key value pairs the word

play26:23

is the key and the value is the number

play26:25

one really what i have done though is

play26:27

i've just counted each word

play26:29

and i've seen each word one time it's

play26:31

like very naive stateless counting i

play26:34

don't remember whether i've seen the

play26:35

word a or came or wrapping before i just

play26:38

count each one one time all right

play26:41

then

play26:42

there is a step in between that always

play26:44

gets ignored it's called shuffle so

play26:45

after i do that mapping i count all my

play26:47

words once then potentially i have to

play26:50

move these words around the network

play26:52

to different computers to make sure

play26:54

similar words are uh nearby each other

play26:57

so at or rather a

play27:00

wrapping tapping i want to get those in

play27:02

order

play27:03

then i take these collection of

play27:06

collections of similar words and i dump

play27:08

them into a function called reduce

play27:11

and what reduce is going to do is

play27:13

basically add up the numbers

play27:15

so it's a less naive counting it says

play27:17

well i counted the word a 3 000 times

play27:21

let me just add up all those numbers

play27:23

now i am skipping a bunch of steps here

play27:25

in the spirit of summary and a bunch of

play27:28

complexity here in the spirit of summary

play27:30

um

play27:32

but

play27:33

uh this i end up with with counted words

play27:36

and i'm able to do this

play27:39

um i'm able to split up the work of

play27:42

mapping and the work of reducing and

play27:44

distribute it among lots and lots of

play27:46

computers this is a really good idea if

play27:50

your data is already say in a

play27:52

distributed file system

play27:54

if you've got so much data that you need

play27:56

300 computers to store it all and

play27:59

replicate it all

play28:00

then

play28:01

doing your compute this way

play28:03

it seemed like a good idea at the time

play28:05

and it was a moderately good idea a lot

play28:07

of people built i shouldn't say a lot

play28:09

some people built systems using

play28:12

mapreduce that were worthwhile all right

play28:14

and it by itself mapreduce is just a

play28:16

pattern it's not a product but it's the

play28:18

pattern the compute pattern used by

play28:20

hadoop
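A minimal single-process sketch of the three phases just described, using the same word-count example (this is the shape of the pattern, not Hadoop's API):

```python
# Word count as map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(text):
    # Naive, stateless counting: every word is emitted as (word, 1).
    return [(word, 1) for word in text.lower().split()]

def shuffle_phase(pairs):
    # Group equal keys together (in a real cluster this step moves data
    # across the network so similar words end up on the same machine).
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    # Less naive counting: add up the ones for each word.
    return {word: sum(counts) for word, counts in groups.items()}

text = "once upon a midnight dreary while I pondered weak and weary"
print(reduce_phase(shuffle_phase(map_phase(text))))
```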

28:23

Hadoop is still widely deployed, but rapidly becoming — in my view, and in the view of many others — a legacy technology. It is not the cool new thing anymore, at all. But you should know: that whole MapReduce thing — yep, that's the fundamental layer, the bottom of Hadoop. And the other — I think more long-lived — part of Hadoop is the distributed file system, HDFS, which is, in the words of one of its co-creators, a cockroach. Nobody is really building any new MapReduce systems anymore; you don't see that in lots of new designs. HDFS is going to be with us until our grandchildren are writing software, I think. It's probably going to be a very long-lived piece of infrastructure, and for good reason: it works just fine.

29:12

Hadoop spawned an enormous ecosystem — and not necessarily in a good way, in my opinion. At least, it became very, very complicated, because the underlying programming model was so terrible that we needed this big ecosystem of more pleasant layers on top of it, and that's kind of what happened. Things like Hive grew up, and that became a relatively pleasant way to use Hadoop. Actually programming MapReduce — not many people did that, which is why, as we keep going through the world of distributed computation, a tool like Spark has really taken over the mind share of MapReduce in the last three years.

29:55

The way people have been talking about Spark — at least, you know, when it was new — they would start off by saying, "hey, this is a way better thing than Hadoop, just believe me, and MapReduce is dumb." And then you start looking at Spark internally, you kind of learn the API, and you're like: yeah, this is way better than MapReduce. Instead of map I have transform; instead of reduce I have action. I like it. So the underlying paradigm is still very similar: we still assume we've got a bunch of data sitting out on a bunch of computers, and then we need to go to that data and do stuff to it. And if that's what you're going to do, then that basic scatter-gather paradigm is always going to happen, and it's probably always going to follow that basic map/reduce sort of duality.

30:47

Now, Spark's insight was to say: let's create an abstraction on top of the data that's officially a part of our API. Originally — in early versions of Spark — that abstraction was the RDD; now, kind of the front-facing abstraction in Spark is a thing called the Dataset. It's very similar. But in contrast to Hadoop, Spark gave you an object that you could program. When you're writing MapReduce code, you feel like someone is hitting you with a stick; when you're writing Spark code, you feel like a programmer. You've got an object that represents some big collection of data, and what do you do? You call methods on that object, and it does stuff to it. Now, maybe they're essentially map and reduce with a little bit of syntactic sugar on them, but the Spark programming model is so much more developer-friendly than Hadoop that MapReduce has — even in the eyes of the Hadoop vendors, at this point — essentially been completely taken over by Spark. And it deserves it: Spark is a much, much more pleasant way to write code.

32:01

It also doesn't have storage opinions. Hadoop's MapReduce really said: you can put data anywhere you want, as long as it's HDFS. Spark says: of course I will operate on HDFS data; I'll operate on Cassandra data; I will operate on dang Parquet files in S3. I don't care. It just wants to make RDDs and Datasets and let you transform them and do actions on them and do that distributed computation. Now, the assumption is still, in the world of Spark: again, I've got a bunch of computers, I have a distributed storage problem — that's why we started with distributed storage — and now I have to do compute over all that data that's sitting out there. So if I have a distributed file system, or a distributed database — that's data at rest out there — then the Spark paradigm still makes sense. Hadoop — the MapReduce paradigm, really — you could think of that as yesterday's news, and Spark is more of a today thing.
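For contrast, the same word count as a minimal PySpark sketch — you call methods on an object (an RDD here) and the distribution is handled for you; the input path is hypothetical:

```python
# Word count against a Spark RDD: transformations are lazy, the action
# ("take") triggers the distributed execution.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

lines = spark.sparkContext.textFile("raven.txt")        # hypothetical path
counts = (
    lines.flatMap(lambda line: line.lower().split())    # transform...
         .map(lambda word: (word, 1))                    # ...transform...
         .reduceByKey(lambda a, b: a + b)                # ...transform
)
print(counts.take(10))                                   # action
spark.stop()
```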

32:59

Now, Kafka — it's not quite time for Kafka's day in the sun, but because we're talking about computation, and in the last year Kafka has entered this world, I should mention it. It is an approach to distributed computation in which everything is a stream. So as long as you've got streaming data, you can do computation on that streaming data as the data is in flight. You don't have the assumption that you have put the data out somewhere and you're going to go to that data and do compute on it. This is a stream processing framework where streams are first-class citizens, and you don't even need a cluster to do your computing. Spark, importantly, also has a stream processing API — and it's not just an API; there's actual infrastructure in a Spark cluster that helps with stream processing. There are other solutions to the stream processing problem, but Kafka needs mention here, because it is not just a messaging framework but is also a computation framework now.
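A minimal consume-transform-produce loop with the confluent-kafka Python client gives the flavor of computing on data in flight (broker address and topic names are hypothetical; Kafka's actual stream-processing library, Kafka Streams, is a Java API, so this is only the shape of the idea):

```python
# Read from one topic, transform each record as it flies past, write to
# another topic downstream.
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "uppercaser",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # The "computation": transform the record in flight, then publish it.
    producer.produce("orders-upper", key=msg.key(), value=msg.value().upper())
    producer.poll(0)  # serve delivery callbacks
```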

play34:04

all right

play34:05

we are at about the t-minus 15-minute

play34:07

mark

play34:09

oh you know what i didn't say

play34:12

at first i didn't say if you have a

play34:14

question ask it

play34:16

did you assume that it's a little late

play34:18

now because i i i won't i will never

play34:20

leave time for questions at the end i

play34:21

always prefer that those just happen

play34:23

when they come too late dang it

play34:26

now you know if you got a question ask

play34:27

it the worst thing i'll say is i don't

play34:29

know

play34:32

all right let's talk about

play34:34

my third promise and that is messaging

play34:38

um messaging is great because it lets us

play34:40

loosely couple systems and that lets us

play34:42

build little chunks of functionality and

play34:45

really release them and version them

play34:48

independently and still have them

play34:50

communicate and

play34:52

does that sound like anything we do

play34:53

these days if i say little chunks of the

play34:55

system that are released and versioned

play34:57

independently

play34:58

that sounds kind of micro service-y

play35:01

that's amazing so messaging is because

play35:04

of microservices in the last three years

play35:07

messaging and streaming i'm seeing come

play35:09

way more to the foreground in the last

play35:11

year or so people are realizing that

play35:14

microservices are a super good idea for

play35:16

a number of reasons and people have been

play35:18

struggling with how to get them to talk

play35:20

um

play35:21

you know early mike when maybe this is

play35:23

your microservices deployment and i

play35:25

should be respectful but kind of an

play35:27

early approach to microservices is well

play35:29

let's just have a bunch of little pieces

play35:31

of code and one big database and they

play35:32

talk through the database

play35:34

people realize that was a terrible idea

play35:35

it didn't really work and so getting

play35:37

those micro services to talk

play35:39

suddenly messaging is coming back into

play35:41

the foreground so

play35:42

it's a way of loosely coupling systems

play35:44

we usually use the term subscribers

play35:46

um

play35:47

or producers

play35:50

um sometimes subscribers we might call

play35:52

them consumers all these kind of uh

play35:54

words we almost always in messaging

play35:56

system organize messaging systems

play35:58

organize them into named topics which

play36:01

are just namespaces for similar messages

play36:03

and there's usually this this this uh

play36:06

term broker which is just a computer in

play36:08

the messaging framework

play36:11

um we typically think of messages as

play36:13

being persistent over the short term

play36:16

now it's up to you how you define short

play36:18

term

play36:19

uh the new york times is a kafka open

play36:21

source kafka user and they keep all of

play36:24

their newspaper content going back to

play36:26

the 1860s

play36:27

in kafka so you get to set

play36:30

the data retention yourself

play36:32

and believe me 1860s in america is a

play36:34

long time ago

play36:36

so

play36:37

that's that's a lot of newspaper data

play36:38

for us

play36:41

now yeah the basic idea here is uh

play36:44

you've got a producer it sends a sends

play36:46

messages into a topic and then the

play36:48

consumer gets to read those out at its

play36:50

own

play36:51

uh rate

play36:54

the problem is

play36:56

when that producer gets big

play37:02

what if a topic gets too big for one

play37:04

computer now how can it be too big well

play37:06

you can retain too much data suppose you

play37:09

want to retain data into the 19th

play37:11

century

play37:12

hey again i'm not going to judge it's

play37:14

your system so that's big or your

play37:17

messages are big or um

play37:20

you are transacting messages too fast

play37:24

for one computer to keep up maybe data

play37:25

retention isn't a big thing but you're

play37:27

reading and writing too fast for one

play37:29

machine given the fastest machine you

play37:31

can reasonably deploy right now what if

play37:33

one computer is not reliable enough

play37:34

remember they break

play37:36

and you would like your system not to go

play37:38

down

play37:41

um what if you really want to be able to

play37:43

deliver to to guarantee delivery

play37:46

even if a computer is down well

play37:50

Enter Apache Kafka. I should say, when you have a messaging system and it is a single-server messaging system, it can do all kinds of fantastic things: it can trivially guarantee that messages are delivered exactly one time. What it can't do is be particularly resilient to failure, or particularly scalable. I should also say it can always guarantee that messages are ordered; this is a message queue, after all, and we would expect messages to be ordered. When I have a single-server message queue or message broker, everything's great: my messages are always going to be in order. I'm going to have to give up on some of those things when I scale, but I'll get things in return, and, as it turns out, a lot of awesome things.

So, Kafka. Just a few quick definitions, in case I use these words:

- Message: that's the basic thing.
- Topic: I've already said that; a namespace for similar messages.
- Producer: a client process outside the Kafka cluster that puts messages in.
- Consumer: a client process outside the Kafka cluster that takes messages out.
- Broker: simply one of the computers that the Kafka cluster is made up of.

A broker will have many topics, for example. I'm not really deep-diving on Kafka here, just like I wasn't deep-diving on Cassandra. It does things like replication also, and it answers the consistency question in various ways, so all of those problems come up, and all of those problems are solved by this particular distributed message broker.

But the money in Kafka, the really interesting thing, is when we get past just being a message queue. So I want to show you how it works as a message queue, where the distributed-system ugliness creeps in (what Kafka does to you that's mean, because it's distributed), and of course what we get in return for that. Then I want to talk real quickly, in closing, about what happens when we have made everything into a message: how thoroughly can you live that life, and what sorts of things happen when you do? So, Kafka. All right: again, producer, consumer, topic.

This is the trivial account of Kafka: Kafka as a pipe. And this is currently how most people think of Kafka, because its value-added features, the ones that are more than just a pipe, are really only a year old. So let's just go through the pipe version. When that topic gets big, what do we do? Of course, we partition it: we'll just split it up onto several computers. So let's say instead of one computer, I've got that single topic partitioned over three computers. Now, each of these partitions I would of course also replicate; just put that to the side for a moment, we're not going to talk about that. We've got these three partitions, so these three rows here are those three brokers, each of which has one partition of the topic. These are independent computers. The gears over there are producers; they're going to produce messages. And these diligent people working at, like, 2010 iMacs, those are the consumers.

When I create a message as a producer, what's going to happen is we're going to look at some part of that message. Maybe I'll just hash the whole message; maybe I'll look into a field of the message, like username or IP address, something that means something to me as a producer. But I'm going to take some part of that message, up to all of it, and I'm going to hash it. Now, this is consistent hashing over there, poking its head up and winking at you like I warned you it would. I'm going to hash that message mod the number of partitions, and that's going to tell me what partition to write to.
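
In Java-simple terms, the routing decision is just a hash mod the partition count. This toy sketch uses `String.hashCode()` for clarity; Kafka's actual default partitioner hashes the record key with murmur2, but the principle is the same:

```java
public class PartitionChooser {
    /** Choose a partition by hashing some part of the message. */
    static int partitionFor(String keyField, int numPartitions) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(keyField.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        // Hash a field that means something to the producer, e.g. a username
        // or an IP address: equal keys always map to the same partition.
        System.out.println(partitionFor("tberglund", numPartitions));
        System.out.println(partitionFor("10.1.2.3", numPartitions));
    }
}
```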

Within each partition, I will be able to keep things in order; each partition is ordered. So I'm going to keep writing in here, and these messages will be assigned, randomly really, or uniformly, to my partitions. Oops, I consumed one; let's not consume that yet.

Now, already you can't remember what order these happened in. System-wide, or topic-wide, ordering is lost, because this is a distributed message queue. I can only have ordering within a partition. That's a limitation I have to accept: I don't get global, topic-wide ordering. That's a bummer; I would love to have it. There is no day I will wake up in the morning, roll out of bed, have a cup of coffee, and say, "I literally do not want global ordering on my topics, darn it, I wish I didn't have it," and shake my fist at the heavens. It's just not going to happen: I would always want it, but I can't have it. It's just not a possibility. But I do have it within partitions, and that's why I said that at the producer level you can be smart about what part of the message gets hashed. So I can guarantee, within a topic, that messages from the same user, or from the same host, or from the same whatever, will always end up in the same partition, and then I can consume them in order from each consumer. And those consumers are independent computers operating independently, scaling my system like a boss. So: cool.
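
Here is a hedged sketch of that producer-side choice with the Java client (the topic name, keys, and values are made up): by using the username as the record key, Kafka's default partitioner sends all of a user's events to the same partition, preserving per-user order even though global order is gone:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class UserEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("alice") is hashed to pick the partition, so all of
            // alice's events stay in order relative to one another.
            producer.send(new ProducerRecord<>("user-events", "alice", "logged-in"));
            producer.send(new ProducerRecord<>("user-events", "alice", "viewed-page"));
            producer.send(new ProducerRecord<>("user-events", "bob", "logged-in"));
        }
    }
}
```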

I have a message bus. I have solved the problem of distributed messaging. We already came up with a way to solve distributed storage (we talked about one), and there are a few different ways to solve distributed computation. Now I've got this distributed messaging system, so all my services can communicate with one another through topics in this elastically scalable, essentially infinitely scalable message bus. That's great.

And that's what I did yesterday: I would generate events into this bus (this is kind of how people start out), and then a service would consume one of those messages, and what do I do? I write it to a database. So I've got this event, I turn that event into a row, and that row sits in a place; the data is sitting in a place somewhere. Maybe tomorrow, instead of doing that, instead of saying "here's an event, I'll put it here, and I'll go back here when I need it," I could just do computation on the events. After all, I've got this giant message bus sitting there, and everybody has an API; everybody is coupled to that bus through that API, producing and consuming. Maybe I could do that.

Without that, this thing that is now really an anti-pattern emerges. About seven years ago this was considered pretty cool: the so-called lambda architecture. That means I've got all these events out here, and I have to build two parallel systems to process them. This is what happens if you don't use your message bus thoroughly. I have one system that all my batch processing takes place in; that's my distributed database. Those jobs are slow but very thorough; they can take minutes or hours. And then I've got this other system down here that really is just my temporary message-bus thing, and I might do little bits of stream processing for quick summaries. All right, this lambda architecture emerges. This is bad, because essentially what lambda did to us (and if you come to my talk this afternoon, I'll talk about this in more detail) is it forced people to write the same code twice: you had to write the stream code and the batch code. They say the best code is the code you never write; well, the worst code would be the code that you write two or more times. That's a terrible idea.

So the basic idea, once I've got a message bus, is to say: hey, look, how about I stop obsessing over taking events and writing them to some place in a database? Why not just let them be events, and process them in place through a stream processing API?
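
As a minimal sketch of processing events in place, here is what a small Kafka Streams application can look like, counting events per user continuously instead of landing them in a database first; the topic names and application id are hypothetical:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class EventCounter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-counter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
                Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
                Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("user-events");

        // The events never stop being events: the running per-user count is
        // maintained inside the streams application and emitted as a new
        // stream, rather than being turned into rows that sit in a place.
        KTable<String, Long> counts = events.groupByKey().count();
        counts.toStream().to("user-event-counts",
                Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

A downstream service that cares about those counts just subscribes to the output topic; it never needs to know who produced the original events.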

At the very top level, I can do something like this: I've got a Kafka cluster, and I've got various services attached to it. I've got some service that's maybe doing some monitoring, a service that's doing some analytics. I've got a couple of little databases; it's okay for a microservice to maintain a local database, but all of these are services. I might even have a Hadoop cluster, because it's still there, we spent a lot of money on it, and the CIO doesn't want to get in trouble for not using it, so you have to keep using the Hadoop cluster. I might have an Elasticsearch cluster down here. All of them are doing different kinds of things that my system needs, based on this messaging bus that they all talk to.

So: I have events out in the world that go into that cluster and get propagated to the services that need them. And the nifty thing here is that each service doesn't really need to worry about what other services might be doing with an event. An event is just an event: it happens, it's there in a topic, and a new service can grow up and say "hey, I'm interested in that event" without having any coupling to the way that event is produced, or to who else might consume it.

So once I start putting things into a message queue, it can cause me to think very differently about the architecture of the system. This distributed messaging thing can be an impactful thing. It's possible to go from my single-server relational paradigm to a distributed Cassandra paradigm; I have to change my mind about some things, about the way consistency works, about the way data modeling works. That's a big deal. When I go from single-threaded programming to distributed computation, that's kind of hard; that's a tough thing to do. And when I get streaming into my blood, suddenly the world is very, very different, and I start building systems in a very different way indeed.

Now, with that, we're just about out of time, and we have covered storage, computation, and messaging: a brief introduction to a few things that go wrong, and right, when you build distributed systems, and how a few open source projects solve them. Shameless plug: there are videos by the same name on O'Reilly, if you care to learn more. And an absolutely-not-shameless plug for Kafka meetups, because this is what I do: if you're interested in more, check that out; take a picture of that slide. If you want to go to a Kafka meetup, I would love to see you there. Sometimes I am even seen in Europe, not often enough. But thanks for having me here today. Have a good day.
