Simplify data integration with Zero-ETL features (L300) | AWS Events

AWS Events
4 Jun 2024 | 26:25

Summary

TLDR: This video introduces zero-ETL, a set of integrations that unifies the many data sources involved in data analytics. It automates moving data from relational and NoSQL databases into data warehouses and data lakes, giving analytics near real-time access to operational data. The speakers, Tas and Aman, walk through the data integration challenges across AWS services, the solution zero-ETL provides, its value proposition, and customer use cases.

Takeaways

  • 🗃️ Zero-ETL is a set of integrations that simplifies data integration for analytics and helps move data between different data stores.
  • 🔍 Data is everything for a business: the more data you have, the more analysis and insight you can get, but that data is spread across different places.
  • 🤝 Tas and Aman, specialists in relational databases and NoSQL respectively, each cover a different side of the zero-ETL integrations.
  • 🔄 Zero-ETL supports copying streaming data from multiple sources to multiple destinations.
  • 🚀 The zero-ETL integration between Amazon Aurora MySQL and Redshift enables near real-time analytics.
  • 🛠️ Zero-ETL reduces the undifferentiated heavy lifting of building, maintaining, monitoring, and securing data pipelines, so teams can focus on the business's analytics needs.
  • 📈 Analyzing in Redshift enables faster, more sophisticated analytical operations and strengthens business decision-making.
  • 🔗 Zero-ETL supports moving data from DynamoDB to OpenSearch, making it easy to keep a NoSQL database and a search service in sync.
  • 📊 That integration builds on Data Prepper, an open-source, serverless technology, to simplify the ETL process of extracting, transforming, and loading data.
  • 📈 Zero-ETL integrations also support importing DynamoDB data into Redshift, enabling analysis in the data warehouse.
  • 🔑 AWS is investing heavily in zero-ETL, and integrations with more data stores and data sources are expected in the future.

Q & A

  • What kind of solution is zero-ETL?

    -Zero-ETL is a set of integrations for copying streaming data from multiple sources to multiple destinations. It aims to copy the data securely and efficiently.

  • What problem is zero-ETL trying to solve?

    -Zero-ETL aims to reduce the complexity of building, maintaining, monitoring, and securing the many pipelines needed to move data from the operational side of a business to the analytical side.

  • How does zero-ETL work for data transfer between Amazon Aurora MySQL and Redshift?

    -Zero-ETL automates the transfer of data from Amazon Aurora MySQL to Redshift in near real time, enabling analytics without putting extra load on the operational database.

  • What techniques does zero-ETL use to achieve low-latency data movement?

    -Zero-ETL makes heavy use of Aurora's storage layer, capturing only the data that changed and sending it to Redshift, with a target latency in the single-digit seconds.

  • What types of integrations does zero-ETL provide?

    -The integrations between Amazon Aurora MySQL and Redshift and between Amazon DynamoDB and OpenSearch are currently generally available. Other integrations are in preview or on the way.

  • What are the benefits of the zero-ETL integration between DynamoDB and OpenSearch?

    -It keeps the data in sync between DynamoDB and OpenSearch, enriching search capabilities and improving the user experience.

  • What is Data Prepper, and what role does it play in the zero-ETL integration?

    -Data Prepper is an open-source, serverless ETL technology that the zero-ETL integration uses to automate extracting, transforming, and loading data.

  • How do data engineers benefit from using zero-ETL integrations?

    -Data engineers are freed from tedious work such as maintaining pipeline code and setting up infrastructure, and they get a faster, more scalable setup between data stores.

  • How will AWS continue to invest in zero-ETL?

    -AWS continues to invest in zero-ETL, with integrations for other databases and data sources in the works. AWS plans to expand zero-ETL support to cover more user needs.

  • How can I provide feedback on the zero-ETL integrations?

    -You can share feedback and requests after the session or through your AWS account team. AWS uses that feedback to improve the service.

Outlines

00:00

😀 Introduction to zero-ETL

In this section, Tas and Aman explain what zero-ETL is and why it matters. Tas is a relational database specialist and Aman is a DynamoDB specialist. They describe the challenge of analyzing data spread across many sources and the solution zero-ETL provides: a set of integrations that copy streaming data from operational databases into the analytics area, securing the data along the way and reducing the complexity of building analytics pipelines.

05:03

😉 Zero-ETL integration between Amazon Aurora MySQL and Redshift

This section covers the zero-ETL capability that automates data transfer from Aurora MySQL to Redshift. The process runs in near real time and enables analytics in Redshift without adding load to the operational database. Zero-ETL can also pull data from multiple clusters into Redshift, so large volumes of data can be analyzed together. By leaning on Aurora's storage layer and capturing only changed data to send to Redshift, the integration achieves low latency.

10:04

🎓 Aman on the zero-ETL integration between DynamoDB and OpenSearch

Aman explains the zero-ETL integration between DynamoDB and OpenSearch. The combination suits workloads that pair DynamoDB's high-throughput, low-latency access with OpenSearch's flexible search capabilities. Zero-ETL replaces the code and configuration (exports, DynamoDB Streams, Lambda functions) that would otherwise be needed to propagate DynamoDB changes into OpenSearch. It builds on Data Prepper, an open-source serverless technology, and the ETL process is defined in a YAML template, which makes extraction, transformation, and loading straightforward and keeps the two data stores efficiently in sync.

15:05

🛠️ Details of the zero-ETL integration from DynamoDB to Redshift

This section describes importing DynamoDB data into Redshift. Zero-ETL uses DynamoDB's incremental export feature to periodically transfer only the data that changed. The process captures deltas efficiently and makes the data available for analysis in Redshift. The target table is created with DynamoDB's schemaless nature in mind: rows are keyed by the partition key and sort key, with the raw item stored alongside. This approach reduces the effort data engineers spend maintaining pipelines.

20:07

🔍 Zero-ETL capabilities and AWS's future plans

The closing section emphasizes how zero-ETL helps data engineers. It reduces the complexity of maintaining pipelines and code, and delivers a low-latency, scalable setup between data stores. AWS is actively investing in zero-ETL's future, with integrations to other databases and data sources planned. The speakers welcome feedback and encourage attendees to contact their AWS account team with requests for new zero-ETL integrations.

25:11

📢 The importance of support and feedback for zero-ETL

This section stresses the importance of support and feedback for the zero-ETL integrations. Zero-ETL is a new set of tools for data engineers, and AWS is actively developing it. Attendees are encouraged to pass their feedback and requests to the AWS team; user input is central to improving zero-ETL.

Keywords

💡Data integration

Data integration means bringing data from different sources together in one place so it can be used for analysis and applications. In this video, the solution discussed extracts data from different databases and moves it to the analytics area, enabling things like trend analysis and fraud detection.

💡Zero-ETL

Zero-ETL refers to a set of capabilities that simplify the extract, transform, load (ETL) process. The video highlights how zero-ETL automates copying data between databases and streamlines data preparation for analytics.

💡Amazon Aurora MySQL

Amazon Aurora MySQL is one of AWS's relational database services, a cloud-native, MySQL-compatible database. The video covers the zero-ETL integration that moves data from Aurora MySQL into the analytics area.

💡Redshift

Redshift is AWS's data warehouse service, capable of analyzing large volumes of data quickly. The video shows how the zero-ETL integration automates moving Aurora MySQL data into Redshift for analysis.

💡DynamoDB

DynamoDB is AWS's NoSQL database service, characterized by high-throughput, low-latency data access. The video covers zero-ETL integrations that move DynamoDB data into OpenSearch and Redshift.

💡OpenSearch

OpenSearch is a search and analytics service offered on AWS, used for full-text search and analysis of data. The video explains how the zero-ETL integration keeps data in sync between DynamoDB and OpenSearch to power richer search and analytics.

💡Data Prepper

Data Prepper is an open-source, serverless ETL technology used by the zero-ETL integration to automate extracting, transforming, and loading data. The video highlights its role in simplifying data synchronization between DynamoDB and OpenSearch.

💡Real-time analytics

Real-time analytics is the ability to analyze data as it is generated and gain insight immediately. The video explains how zero-ETL integrations move Aurora MySQL and DynamoDB data into Redshift and OpenSearch in near real time for analysis.

💡Data pipeline

A data pipeline is a process that automates moving and transforming data. The video emphasizes that zero-ETL integrations reduce the complexity of managing and maintaining pipelines, speeding up analytics and business decisions.

💡AWS

AWS (Amazon Web Services) is Amazon's cloud computing division. The video shows how AWS database services combine with zero-ETL integrations to streamline data integration and analytics.

Highlights

Tas, a database specialist, and Aman, a DynamoDB solutions architect, jointly walk through the zero-ETL integrations.

Zero-ETL integrations address the problem of data being scattered across multiple databases, enabling unified analysis.

The data integration challenge is having to build, maintain, and monitor the many pipelines that move data from operational databases to analytical stores.

Zero-ETL integrations reduce undifferentiated heavy lifting and copy data securely from one database to another.

The Amazon Aurora MySQL to Redshift integration enables near real-time analytics without impacting the operational database.

The integration leans on Aurora and Redshift storage capabilities, using enhanced binlog to capture only the data that changed.

Zero-ETL lets users analyze large volumes of data from multiple clusters without extra operational work.

Aman covers the DynamoDB-to-OpenSearch integration, stressing the importance of using the right tool for the right job.

Data Prepper, a serverless open-source technology, simplifies the ETL process and reduces the work users have to do.

A YAML template defines the extract, transform, and load steps of the zero-ETL integration.

The DynamoDB-to-Redshift integration builds on DynamoDB's incremental export feature to keep the data warehouse in sync.

Details are given on how changes in DynamoDB are synchronized into Redshift.

The integrations expose monitoring metrics, so users can see what is happening inside the system.

AWS is investing in the future of zero-ETL and exploring integrations with other databases and data sources.

Users are encouraged to provide feedback so the zero-ETL integrations can improve and cover the data stores they need.

The talk closes by summarizing how zero-ETL reduces the effort data engineers spend maintaining data pipelines.

The variety of supported databases and data sources is highlighted, along with AWS's ongoing development of new integrations.

Finally, the speakers thank attendees and invite feedback to improve the session experience.

Transcripts

00:02

So, we had a few talks before us, and in almost every one of them zero-ETL was mentioned, so let's unpack it now. My name is Tas, I'm a database specialist focused on relational engines, and I'm accompanied by Aman; he's a DynamoDB specialist, and he will be helping us review the other part of the presentation about the zero-ETL integrations. What are we going to talk about? We will review what problem we're trying to solve with this integration, with this solution. We'll see what it is all about, how it works, and what its value proposition is; we will see how it is used by our customers and discuss it; and we have a call to action later.

00:56

It's been mentioned multiple times today that data is everything for our business, right? The more data we have, the more insight we can gain from it. But the data is located in different places, and we want to run almost the same analysis on all the sources of our data: we want trend analysis, to detect fraud, to understand that everything is good. So what is the challenge? Let's see with a raise of hands: how many of you need to analyze data from multiple databases? All right, a good amount of people, so you are in the right room. And who needs to build multiple pipelines and maintain them for this purpose? All right.

01:53

So now we understand the complexity, what the problem is. Our data sources, our operational data, are located in multiple databases or data sources, and we would like to move them to our analytical area; it can be a data warehouse, it can be a data lake. And we need to build, maintain, and of course monitor and secure all the pipelines that transfer this data from the operational side of the business to the analytical side of the business. So this is the challenge.

02:31

And this is the solution that we are suggesting for this, and it's called zero-ETL. It's not one solution; it's a set of integrations that help with copying streaming data from multiple sources to multiple destinations. We'll be reviewing them shortly. We would like to eliminate, or dramatically reduce, the amount of undifferentiated work that you or your teams would be investing in this process, and to make sure the data is copied in a secure way from database to database.

03:15

These are the integrations that are available currently. You can see that we have multiple sources, like relational databases, MySQL or Postgres, and of course NoSQL databases like DynamoDB, and OpenSearch, and we can move data through various pipelines to various destinations. Let's review some of them.

03:45

The first one we would like to discuss is the integration between Amazon Aurora MySQL and Redshift. So what is it all about? We currently provide the option to transfer data, without any effort on your side to build or maintain pipelines, in near real time from Amazon Aurora MySQL to Redshift. When we have this data in Redshift, we can start analyzing it and run our models on top of it without any additional load added to our operational database. The latency is a big question; we're going to discuss it in a few slides.

04:38

How would we usually tackle this? We have our sources, and we would like to have our data in the analytical area, in Redshift. Usually we would build multiple pipelines, and we would have to maintain them, monitor them, and secure them, and for this we need resources, people, and investment, right? All of this is undifferentiated heavy lifting that your business is not benefiting from. So we are trying to take this on and offload it from you to us, and we provide it as a service. It becomes much easier to just spin up this integration between your relational database, Aurora MySQL in this case (we have this ability in preview for Aurora PostgreSQL, so that one is coming soon), and it will copy the data to Redshift. This allows you near real-time analytics on your production data without impacting your production database or any need to upgrade it or anything else.
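As an illustrative sketch of what "spinning up" such an integration can look like programmatically, the snippet below uses the RDS CreateIntegration API via boto3; the integration name, ARNs, and region are placeholders, and this is only one possible way to do it (the console offers the same flow).

```python
"""Minimal sketch, assuming the RDS CreateIntegration API: create a zero-ETL
integration from an Aurora MySQL cluster into a Redshift Serverless namespace.
All ARNs and names below are placeholders."""
import boto3

rds = boto3.client("rds")

integration = rds.create_integration(
    IntegrationName="orders-to-redshift",
    # Source: the Aurora MySQL DB cluster that holds the operational data.
    SourceArn="arn:aws:rds:us-east-1:111122223333:cluster:my-aurora-mysql",
    # Target: the Redshift (Serverless) namespace that will receive the data.
    TargetArn="arn:aws:redshift-serverless:us-east-1:111122223333:namespace/analytics-ns",
)
print(integration["Status"])  # the integration takes a while to become active

# List integrations and their status; wait for "active" before querying in Redshift.
for item in rds.describe_integrations()["Integrations"]:
    print(item["IntegrationName"], item["Status"])
```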

05:56

And of course, we are aware that you may have more than one source of the data that you would like to analyze. So zero-ETL can pull data from multiple clusters and ship it to Redshift without anything additional that you need to run. This allows you to analyze large amounts of data in Redshift, with all the benefits that come with that, including faster queries, much more sophisticated analytic operations, and so on.
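On the Redshift side, once the integration is active, the data is typically exposed by creating a database from the integration and querying it like any other Redshift database. A minimal sketch follows, assuming the documented CREATE DATABASE ... FROM INTEGRATION syntax; the integration ID, schema, table, and column names are placeholders.

```sql
-- Map the zero-ETL integration into a local Redshift database.
-- The integration id comes from the integration created on the RDS side
-- (visible in the console once the integration is active).
CREATE DATABASE aurora_zetl FROM INTEGRATION 'integration-id-goes-here';

-- Then query the replicated tables like any other Redshift data, e.g. a
-- near real-time aggregation over an orders table (names are placeholders).
SELECT order_status,
       COUNT(*)         AS orders,
       SUM(order_total) AS revenue
FROM aurora_zetl.sales.orders
GROUP BY order_status
ORDER BY revenue DESC;
```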

06:37

So what do we do, how do we make this magic work? We try to offload as much of this as possible toward the storage layer, so our head nodes, writers, and compute nodes are not busy with it. We utilize Aurora storage capabilities and Redshift storage capabilities as much as possible, and we enable enhanced binlog on the Aurora side, so we are consistent in our operation, and we capture only the changes that happened on Aurora and send them to Redshift. From tests that I was running in my environment, usually the data will be in Redshift in single-digit seconds, so the latency will be really low, and I think that's a great achievement. Of course, it may be different in your workloads, but because it's optimized and done mostly on the storage layer, I hope your workloads will benefit in much the same way: single-digit-second latency between your production data and the option to analyze it in Redshift.
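For reference, a hedged sketch of what enabling enhanced binlog on the Aurora side can look like through a DB cluster parameter group; the parameter names follow the Aurora MySQL enhanced-binlog documentation as best recalled and should be verified against current docs, and the parameter group name is a placeholder.

```python
"""Hedged sketch: turn on enhanced binlog in an Aurora MySQL cluster parameter
group. Confirm parameter names and any zero-ETL prerequisites before use."""
import boto3

rds = boto3.client("rds")

rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="my-aurora-mysql-params",  # placeholder
    Parameters=[
        # Row-based binary logging is a prerequisite for change capture.
        {"ParameterName": "binlog_format", "ParameterValue": "ROW",
         "ApplyMethod": "pending-reboot"},
        # Enhanced binlog keeps binlog work in Aurora storage, off the writer's path.
        {"ParameterName": "aurora_enhanced_binlog", "ParameterValue": "1",
         "ApplyMethod": "pending-reboot"},
        {"ParameterName": "binlog_backup", "ParameterValue": "0",
         "ApplyMethod": "pending-reboot"},
        {"ParameterName": "binlog_replication_globaldb", "ParameterValue": "0",
         "ApplyMethod": "pending-reboot"},
    ],
)
```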

07:54

All right, this is an example of what we see our customers doing with this solution. Let's imagine that we are a company that has multiple streams of data, say structured data and unstructured data. We have a lot of data streaming into Aurora MySQL, and we have a lot of data streaming in from Kinesis Data Streams. Let's say we need to analyze both with low latency, and we'd like Redshift to hold them. We can offload this from you towards zero-ETL, which will help you ship your data from MySQL to Redshift with really low latency and without any need to build, maintain, or monitor it. And of course, we will expose a lot of counters, so you will be able to see what's going on, monitor it, and be aware of what's happening in the system.
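The Kinesis leg of this picture is usually handled by Redshift streaming ingestion rather than by a zero-ETL integration; the talk does not go into it, so the following is only a rough sketch under that assumption, with placeholder role, stream, and view names.

```sql
-- Hedged sketch of Redshift streaming ingestion from Kinesis Data Streams
-- (a separate feature from zero-ETL). Names and the IAM role are placeholders.
CREATE EXTERNAL SCHEMA kinesis_src
FROM KINESIS
IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-streaming-role';

CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       partition_key,
       -- Depending on your Redshift version/encoding you may need
       -- JSON_PARSE(FROM_VARBYTE(kinesis_data, 'utf-8')) here instead.
       JSON_PARSE(kinesis_data) AS payload
FROM kinesis_src."my-click-stream";
```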

09:10

All right, just to remind you, these are the integrations that are currently there: Aurora MySQL to Redshift and Amazon DynamoDB to OpenSearch are available. The rest are in preview: Aurora PostgreSQL is in preview, RDS for MySQL is in preview, and DynamoDB to Redshift is in preview. So get out there, check them, see how they work, and of course we'll be really happy to hear your feedback, because we are working on this; this is a project that we have just started, and we have a future for it. All right, thank you, and now I will invite Aman to speak about the NoSQL solutions for this.

10:00

Thank you. Am I audible? Okay. So, how many of us use DynamoDB already in our applications today? Okay, I'm in the right room. My name is Aman, I'm a DynamoDB solutions architect, and I've been with AWS for just under seven years, primarily focusing on the NoSQL technologies. Some of the customers that we see have use cases where they want to combine purpose-built solutions, purpose-built databases; it's all aimed at using the right tool for the right job. For example, they'll have DynamoDB for their OLTP use case, that high-throughput, low-latency access, but they also need a lot of flexibility for certain parts of the data access. Think about the Amazon retail store (we like to eat our own dog food): you go on the website, there is a search bar at the top, you start typing for products, that could be pillows, that could be laptops, and so forth, and as you're typing you already see autocomplete. That cannot be powered by DynamoDB, unfortunately, but there is a purpose-built solution for that, which is OpenSearch. OpenSearch can do your fuzzy searching, your analytics, your anomaly detection, your vector store and semantic searches, and so on, so it is the right tool for that particular job. So we see customers using both of these stores together in cases where they need the fuzzy searching, the rich text search, or the semantic search capabilities, but also the high-throughput, low-latency access, and this is something that we use on the Amazon retail store as well.

11:50

The other example where I've seen this being used: let's say you're a retail store and you have customer service desks in the shops, and they want to look up customer profiles, fetching a profile based on the customer standing in the queue. The customer could give any information, maybe their first name, last name, their contact information, their address, or what have you, and there may be mistakes in the spellings and so on. That's where you need OpenSearch, because DynamoDB will basically do a string match, but if you need those rich text search capabilities, that's where OpenSearch comes into the picture.

12:31

So, because we've seen customers build this sort of pipeline, where they keep the data in sync between DynamoDB and OpenSearch, we saw that there's a certain amount of effort required in order to make it possible. This is roughly the pipeline that customers need to build, and I've built it myself, so it's really a lot of work if you look at it. I'll start with number one: when you're first seeding your OpenSearch service, be it a cluster or a serverless domain, you basically need to take an export, a full data dump of your DynamoDB data, put that into S3, and then write an application that reads from that S3 data and seeds your OpenSearch service. With that you have taken an initial dump, but what happens to the data that's incoming while you're doing that dump-and-load process? For that you rely on DynamoDB Streams, which is number four there. DynamoDB Streams is a change data capture stream of every change that happens on the table, so you see a record of what data changed, how it changed, and what the target state of the data is. That can be consumed by a Lambda function, and that Lambda function can have the business logic and the code to ingest that data into the OpenSearch domain or cluster. So if there's an insert, an update, or a delete of a record in DynamoDB, that data needs to make its way through to OpenSearch. Think about a product use case, a product catalog for the retail website, where all the products are actually stored in DynamoDB but you need to support that top search capability, so you need OpenSearch to power it. Every time a product description changes, or certain things in the product change, those changes are made in the DynamoDB table, but you need them to replicate into the OpenSearch service. So this is the whole process that goes on behind building such a data pipeline.
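To make the amount of hand-written glue concrete, here is a minimal sketch of the Streams-to-OpenSearch piece of such a pipeline (step four above). It assumes a Lambda function with a DynamoDB Streams trigger, the requests and requests-aws4auth libraries, and environment variables for the endpoint and index; a production version would also unmarshal DynamoDB's attribute-value JSON, batch writes, and handle retries.

```python
"""Illustrative sketch of the hand-built pipeline described above: a Lambda
handler consuming DynamoDB Streams events and mirroring each change into an
OpenSearch index. Endpoint, index name, and library choices are assumptions."""
import os

import boto3
import requests
from requests_aws4auth import AWS4Auth

REGION = os.environ.get("AWS_REGION", "us-east-1")
ENDPOINT = os.environ["OPENSEARCH_ENDPOINT"]          # e.g. https://search-...
INDEX = os.environ.get("OPENSEARCH_INDEX", "products")

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   REGION, "es", session_token=credentials.token)

def handler(event, context):
    for record in event["Records"]:
        keys = record["dynamodb"]["Keys"]
        # Build a stable document id from the table's primary key, mirroring
        # what the zero-ETL integration automates for you.
        doc_id = "|".join(v for attr in keys.values() for v in attr.values())
        url = f"{ENDPOINT}/{INDEX}/_doc/{doc_id}"
        if record["eventName"] == "REMOVE":
            requests.delete(url, auth=awsauth)
        else:
            # Requires NEW_IMAGE / NEW_AND_OLD_IMAGES on the stream; a real
            # pipeline would convert this DynamoDB JSON into plain JSON first.
            new_image = record["dynamodb"]["NewImage"]
            requests.put(url, auth=awsauth, json=new_image,
                         headers={"Content-Type": "application/json"})
```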

14:38

Now, clearly this is a lot of undifferentiated heavy lifting, and that is the phrase to take home for the day, because we've said it so many times. So the zero-ETL integration between DynamoDB and OpenSearch leverages an open-source, secure, serverless technology called Data Prepper. Data Prepper is not something that's newly built; it's been there for years, and we're running something around the 2.7.0 release now, so it's out there, it's open source, it's secure, and it's also serverless. It does the ETL for you, so you do not need to think about the initial data dump, the export to S3, the change data capture, the Lambda function, and everything else that goes on behind doing it.

15:30

All of that undifferentiated heavy lifting translates into just writing a YAML template. This is the zero-ETL integration between DynamoDB and OpenSearch; if you're not able to read it, I'll break it down into the E, T, and L phases, but as a whole this is basically 26 lines of a template that will do the zero-ETL for me.

15:57

The first part is the extraction, which is taking the data from the source, your DynamoDB table. Here I pass the DynamoDB table ARN, the resource name, and I provide an S3 location where the zero-ETL integration itself will do the initial data dump, and I can secure it by tying it to a particular role, with permissions granted according to the least-privilege principle.

16:26

The second part is the transformation. This is where I can add certain transformation functions; the whole list is on the right. What I'm doing here is using two individual attributes, user latitude and user longitude, and combining them into a location attribute that Data Prepper will add on my behalf in real time. You can do a lot of things: anomaly detection, aggregation, compression, or adding new data. And if you want to vectorize the data, you can also use a Bedrock connector on your OpenSearch domain or cluster to vectorize that data as it is coming in.

17:15

The final part is the load. This is where I provide the target OpenSearch domain or cluster, depending on whether you're doing cluster-based or serverless, and again I can provide information about the index where the data should go. I can provide routes, so that based on certain characteristics within the data it ends up in a particular index, and it's not just one cluster or domain that I can target; I can target multiple OpenSearch clusters to index that data. One of the key things here is the document ID. This is the unique identifier of the data in OpenSearch, and the example uses the primary key, which is basically the partition key and the sort key of DynamoDB that uniquely identify each and every record. Because we are also thinking about keeping the two data stores in sync, whenever a change happens to a product in the DynamoDB data, it needs to be propagated to the OpenSearch side, so the same product ID should reflect the new change that happened. For that we need the same unique identifier as we had in DynamoDB, so we can automate the process of creating a document ID by leveraging the metadata in OpenSearch. When we set it up like this, Data Prepper goes to DynamoDB, gets the partition key and sort key attribute names, combines them, and that becomes your unique identifier for the document ID in OpenSearch.

18:53

Now all of that undifferentiated heavy lifting, the whole architecture, becomes much simpler with just the zero-ETL integration.
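For reference, here is a rough sketch of what such a template can look like, loosely following the shape of the Data Prepper DynamoDB source, a processor, and the OpenSearch sink; the ARNs, bucket, endpoint, and especially the processor syntax are illustrative and should be checked against the Data Prepper documentation for your version.

```yaml
version: "2"
dynamodb-to-opensearch-pipeline:
  source:
    dynamodb:
      tables:
        - table_arn: "arn:aws:dynamodb:us-east-1:111122223333:table/ProductCatalog"
          # Extract: a full export to S3 for the initial seed, then the stream for changes.
          export:
            s3_bucket: "my-zetl-export-bucket"
            s3_prefix: "productcatalog/"
          stream:
            start_position: "LATEST"
      aws:
        sts_role_arn: "arn:aws:iam::111122223333:role/pipeline-role"
        region: "us-east-1"
  processor:
    # Transform (illustrative): derive one attribute from two source attributes,
    # in the spirit of combining user_latitude/user_longitude into "location".
    - add_entries:
        entries:
          - key: "location"
            format: "${user_latitude},${user_longitude}"
  sink:
    # Load: write into the target index, keyed by the table's primary key so
    # updates and deletes land on the same document.
    - opensearch:
        hosts: ["https://search-my-domain.us-east-1.es.amazonaws.com"]
        index: "products"
        document_id: '${getMetadata("primary_key")}'
        action: '${getMetadata("opensearch_action")}'
        aws:
          sts_role_arn: "arn:aws:iam::111122223333:role/pipeline-role"
          region: "us-east-1"
```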

19:06

So not only OpenSearch: we also see customers leveraging the likes of Redshift for their analytics use cases. A lot of the time data engineers would need to set up pipelines where they export data from DynamoDB, run some transformations on it, put it into Redshift, and then run their reports or their analysis pipelines. For that we've launched a DynamoDB to Redshift zero-ETL integration as well. This is currently in preview, mind you, so if there are certain things that you see that maybe are not great, let us know; we want to hear feedback.

19:52

The thing that powers the zero-ETL integration between DynamoDB and Redshift is incremental exports by DynamoDB. DynamoDB supports incremental exports: you provide a start time and an end time, and it returns all the data that changed between those times. So if you need the delta of changes, you can call it as many times as you like, and you're only charged for the amount of data that's shared with you, not the whole data dump or the whole table size. What the zero-ETL integration does here is take an incremental export roughly every 15 minutes and then use that output to ingest into Redshift. You decide on a Redshift cluster or serverless resource, and it will do the syncing between DynamoDB and Redshift.
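To show the building block the integration drives for you, here is a hedged sketch of calling a DynamoDB incremental export directly (you would not normally do this yourself when using the integration); the table ARN, bucket, and 15-minute window are placeholders, and point-in-time recovery must be enabled on the table.

```python
"""Hedged sketch of a DynamoDB incremental export, the mechanism the talk says
powers the DynamoDB-to-Redshift zero-ETL integration."""
from datetime import datetime, timedelta, timezone

import boto3

dynamodb = boto3.client("dynamodb")
now = datetime.now(timezone.utc)

response = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:111122223333:table/Ecommerce",
    S3Bucket="my-zetl-export-bucket",
    S3Prefix="ecommerce/incremental/",
    ExportFormat="DYNAMODB_JSON",
    ExportType="INCREMENTAL_EXPORT",
    IncrementalExportSpecification={
        # Only items that changed in this window are exported (and billed).
        "ExportFromTime": now - timedelta(minutes=15),
        "ExportToTime": now,
        "ExportViewType": "NEW_AND_OLD_IMAGES",
    },
)
print(response["ExportDescription"]["ExportArn"])
```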

20:46

Now, this is the target table that was created for my integration. It was an e-commerce table, because, you know, Amazon. The schema of that target table is basically a partition key column, a sort key column, and a value column. Because DynamoDB is schemaless, a NoSQL database, the only mandatory parts of your record in DynamoDB are the partition key and the sort key, the primary key as a whole. So a single item in the DynamoDB table can have, let's say, five attributes, and the next item in the same table could have 200. Because of that schemaless nature, it's tricky to map it in real time into a schema-enforcing system such as Redshift. The way it works is that we keep the same primary key as the DynamoDB table in Redshift: there's a partition key (my partition key name was PK), my sort key name was SK, and then there is a raw JSON value in the third column, the value column itself. On the right is how the data appears in Redshift. Now it is up to you either to use this automatically created table in your reporting pipelines and reporting queries, where you will probably need to map the value JSON attribute as your analytical queries require, or to move the data from this particular Redshift table to your production table and sanitize it while doing so. That's the remaining work: the data is already there in Redshift; you either update your queries to point to this table, or you move data from this table into the table you're already using.
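As a sketch of the first option (querying the auto-created table directly), assuming the value column is exposed as Redshift's semi-structured SUPER type so that dot navigation works; the table name and attribute names below are placeholders modeled on the PK/SK example from the talk.

```sql
-- Hedged sketch: query the auto-created target table, where "pk" and "sk"
-- mirror the DynamoDB primary key and "value" holds the raw item (assumed SUPER).
-- order_status and order_total are placeholder attribute names.
SELECT pk,
       sk,
       "value".order_status::varchar       AS order_status,
       "value".order_total::decimal(10,2)  AS order_total
FROM ecommerce_ddb
WHERE sk LIKE 'ORDER#%'
ORDER BY order_total DESC
LIMIT 20;
```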

22:53

In terms of my test, and again your mileage may vary, I was loading the same randomized e-commerce data into DynamoDB, which was making its way into Redshift. The CloudWatch metrics offered by this integration include a lag between the time data was ingested into Redshift and the time it is available for querying, which was roughly about 20 to 25 minutes. On the left you see that lag, which is pretty good if you think about your analytical database, your data warehouse, and on the right side is the data transfer, the raw bytes transferred. There are many more CloudWatch metrics, not only for the Redshift integration but also for the OpenSearch integration, which give you an idea of the time it took end to end, and that can help you estimate whether it's good for you, whether you need it to be faster, whether it's too quick; well, nobody hates things being too quick, but anyway.
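If you want to pull those metrics programmatically, the general pattern is a CloudWatch query like the sketch below; the namespace and metric name here are hypothetical placeholders (the talk only says that lag and bytes-transferred metrics exist), so substitute the names your integration actually publishes.

```python
"""Hedged sketch of fetching an integration lag metric from CloudWatch.
Namespace, metric, and dimension names are hypothetical placeholders."""
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/DynamoDB-ZeroETL",          # hypothetical
    MetricName="IngestionToQueryableLag",      # hypothetical
    Dimensions=[{"Name": "IntegrationName", "Value": "ecommerce-to-redshift"}],
    StartTime=now - timedelta(hours=3),
    EndTime=now,
    Period=300,
    Statistics=["Average", "Maximum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```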

24:00

So, what I want you to take away from this session are the capabilities of zero-ETL. If you're a data engineer, you're already maintaining data pipelines, and there's a lot of effort that goes on behind maintaining the code, setting up the infrastructure, maintaining that infrastructure, and maintaining the code again; making it bug-free is almost impossible. All of those things get taken away with the zero-ETL integrations that are being supported for purpose-built databases. This can give you a low-latency, high-throughput, scalable setup between your data stores, where you can use the right tool for the right job but also extract insights in near real time.

24:52

These are the integrations that I want to put up again for you to grasp for a few seconds; these are the integrations that are supported. I did not cover OpenSearch to S3, but basically in the same YAML template that we saw, we can also add an S3 bucket where the OpenSearch data will be dumped into.
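A hedged sketch of what adding an S3 sink next to the OpenSearch sink might look like in that same pipeline template; the sink option names are recalled from the Data Prepper S3 sink and should be verified, and the bucket, prefix, and role are placeholders.

```yaml
# Fragment of the sink section from the earlier pipeline sketch, with an extra
# S3 sink so the same stream is also archived to S3 (names are illustrative).
  sink:
    - opensearch:
        hosts: ["https://search-my-domain.us-east-1.es.amazonaws.com"]
        index: "products"
    - s3:
        bucket: "my-opensearch-archive-bucket"
        object_key:
          path_prefix: "products/%{yyyy}/%{MM}/%{dd}/"
        codec:
          ndjson: {}
        aws:
          sts_role_arn: "arn:aws:iam::111122223333:role/pipeline-role"
          region: "us-east-1"
```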

25:17

And finally, this is not the only set of integrations that are, or will be, available; AWS is investing heavily in a zero-ETL future. So if there are any other integrations that you set up data pipelines for today, let us know. There are already a number of projects in progress that are integrating other databases and other data sources, but the whole idea is that zero-ETL is not going anywhere; it's only getting started. Feel free to reach out to us after the session, or to your AWS account teams, if you have requests about zero-ETL integrations for your data stores, and let us know; we'd love to learn.

26:09

With that, I'd like to thank you for being here with us today, and I would appreciate it if you let us know how we did via the feedback. Thank you.


Related Tags
Data analytics, Zero-ETL integration, Real-time, Data pipelines, AWS, DynamoDB, Redshift, OpenSearch, Automation, ETL, Cloud