DuckDB: An Embeddable Analytical Database
Summary
TLDR: This talk introduces DuckDB, an embeddable analytical database. DuckDB is a new database management system designed to process large amounts of data efficiently, with a particular focus on being embeddable into other software. It offers features suited to data analysis tasks, manages the whole database in a single file like SQLite, and makes data transfer very fast. Internally it uses a vectorized processing engine, which keeps the working data within the CPU caches to ensure good performance. DuckDB is open source, integrates with a variety of data analysis tools, and provides Python and R packages.
Takeaways
- 🌟 DuckDB is an analytical embedded database that can be built into other software.
- 👨🏫 The speaker is a researcher at CWI who learns about databases by building them, and also teaches computer science students.
- 📈 DuckDB is designed for crunching through large amounts of data, unlike transaction processing such as handling orders in an online store.
- 💾 Like SQLite, DuckDB stores the database in a single file format, but it is specialized for data analysis.
- 🚀 DuckDB ships a very fast vectorized data processing engine, which is what makes it fast for data analysis.
- 🔄 DuckDB has zero external dependencies and can be built as just two files: one header and one implementation file.
- 🔧 DuckDB's base layer is a C++ API, but Python and R packages are also provided, integrating it with data analysis tools.
- 📊 DuckDB has full SQL support, including advanced features such as window functions.
- 🛠️ DuckDB invests heavily in quality assurance, with automated testing that includes continuous integration, benchmark tests, and query fuzzing.
- 🆓 DuckDB is free software under the MIT license, and the project welcomes feedback and contributions from the open source community.
Q & A
What kind of database is DuckDB?
-DuckDB is an embeddable analytical database, specialized in processing large amounts of data.
Why is DuckDB notable as a new database management system?
-It is easy to embed into other software and can process large amounts of data efficiently.
What problem does DuckDB solve?
-DuckDB aims to clean up the current mess in data management and analytics by providing a database management system that is also suited to everyday data analysis tasks.
How does DuckDB differ from other databases?
-Unlike most databases, DuckDB processes data quickly with a vectorized processing engine and stores the entire database in a single file.
Which programming languages does DuckDB support?
-DuckDB has a C++ API as its base layer, and also provides Python and R packages, a command line interface, and a REST server.
Does DuckDB have external dependencies?
-No, DuckDB has zero external dependencies and can be used without installing any other programs.
What is vectorized processing?
-Vectorized processing works on a chunk of data at a time, making efficient use of the CPU caches and allowing large amounts of data to be processed quickly.
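The idea can be illustrated outside DuckDB (a toy sketch, not DuckDB's actual C++ engine): instead of evaluating an expression one row at a time, or over a whole column at once, work is done over fixed-size vectors small enough to stay in the CPU cache.

```python
# Toy illustration of vectorized (chunk-at-a-time) execution.
# The chunk size is arbitrary; real engines pick it so a chunk
# of each column fits in the L1/L2 caches.
VECTOR_SIZE = 1024

def vectorized_sum(column):
    """Sum a column by processing one cache-sized chunk at a time."""
    total = 0.0
    for start in range(0, len(column), VECTOR_SIZE):
        chunk = column[start:start + VECTOR_SIZE]  # small, cache-resident slice
        total += sum(chunk)                        # operate on the whole chunk
    return total

data = list(range(10_000))
print(vectorized_sum(data))  # 49995000.0
```

Because only one chunk needs to be materialized at a time, the same loop structure also works when the full column is streamed from disk and is larger than memory.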
Does DuckDB support internal data compression?
-DuckDB compresses data when writing it to disk, and support for compressed intermediates is currently under development.
Does DuckDB support statistical functions such as percentiles and histograms?
-DuckDB supports user-defined functions and has a fairly complete aggregate function library, but direct support for some statistical functions is limited; since data transfer to the host is fast, missing statistics can simply be computed after pulling the data out.
Can DuckDB work with SQLAlchemy and pandas?
-DuckDB supports the same query language as Postgres, so an SQLAlchemy connector should mostly be a matter of plumbing; people have worked on this, but its current status is unclear.
Is DuckDB open source?
-Yes, DuckDB is open source under the MIT license; anyone is free to use it, improve it, and contribute feedback.
Outlines
💻 Introducing DuckDB and the state of the field
The speaker introduces DuckDB, an embeddable analytical database, and takes a critical look at the current state of data management. DuckDB can crunch through large amounts of data and can be embedded into other software. The speaker argues that data management and data analytics are currently a mess: storing and processing data is hard, and a database system suited to data analysis tasks is needed.
🔧 DuckDB's features and internals
Like SQLite, DuckDB manages the database in a single file and is simple to install. It is built on a C++ API and has full SQL support. Integration with data analysis tools such as R and Python is a priority, and packages are provided for both. Internally it uses a vectorized processing engine that achieves fast data processing by operating on chunks of data.
📊 Benefits and performance of vectorized processing
The benefit of a vectorized processing engine is that the working data stays in the CPU caches, which is much faster than going to main memory. DuckDB is substantially faster than traditional engines (around 40x on the TPC-H benchmark shown), and a vectorized engine can still run analyses even when the data is larger than memory.
🔄 Development status and call to the community
DuckDB is currently in pre-release and is published under the MIT license. The speaker discusses internal data compression and statistical function support as features under development, invites the community to participate, and welcomes feedback and pull requests.
Keywords
💡DuckDB
💡Embeddable
💡Data analytics
💡Vectorized processing engine
💡Single-file storage format
💡SQL
💡Data compression
💡Statistical functions
💡Database library
💡Open source
Highlights
Introduces DuckDB, an embedded analytical database focused on processing large volumes of data, as opposed to handling transactions such as online store orders.
The speaker works at the Dutch national research lab for computer science (CWI), teaches databases, and builds them.
DuckDB is developed by a team including Mark Raasveldt and is designed to be embedded into other software.
Data management and data analytics are currently a mess; common approaches such as pandas have limitations.
DuckDB aims to make database management systems usable for common data analysis tasks.
DuckDB does not require running a separate server; it can be embedded into an application as a library.
DuckDB has a single-file storage format, simplifying data storage and access.
DuckDB has zero external dependencies, making it easy to install and integrate.
DuckDB provides a C++ API and a wrapper for the SQLite API, making it easy to replace SQLite.
DuckDB integrates well with data analysis tools such as R and Python.
DuckDB provides a command line interface and a REST server for different usage scenarios.
Python and R integration examples show how easily DuckDB can be used.
DuckDB internally uses a vectorized processing engine, which speeds up data processing.
Vectorized processing allows DuckDB to handle data larger than memory, avoiding out-of-memory failures.
DuckDB performs well on the standard TPC-H benchmark, about 40 times faster than a traditional engine.
DuckDB has a rigorous quality assurance process, including continuous integration and continuous benchmarking.
DuckDB is free and open source under the MIT license, currently in pre-release.
The DuckDB team encourages user feedback and code contributions to improve functionality and performance.
DuckDB is developing internal data compression to improve storage efficiency.
DuckDB supports user-defined functions to extend its statistical and analytical capabilities.
DuckDB may support an SQLAlchemy connector, easing integration with tools like pandas.
Transcripts
Hello everybody, and welcome to the FOSDEM lightning talks. I want to introduce Hannes Mühleisen, who will talk about DuckDB, an embedded analytical database. Please give him a warm welcome.
Thank you, and welcome everybody. A quick introduction: I work at CWI, which is the Dutch national research lab for computer science and mathematics. I also teach computer science students about the wonderful world of databases, but I have found that a good way of learning about databases is building them, and therefore I also do that. Today I'd like to talk to you about one of these projects, and that is DuckDB. Obviously DuckDB is not my own sole creation; there are other people involved, most notably Mark Raasveldt, who is not here today.

So we're going to talk about DuckDB, the database management system. It's completely new, and it's focused specifically on being embeddable, which means not embeddable as in hardware, but embeddable into other software. And it's analytical, which means it's focused on crunching through large amounts of data, as opposed to dealing with transactions like orders in your online shop. If you want to do orders in your online shop, go to the Postgres people next door; if you want to crunch large amounts of data, you can use DuckDB.
Now I have to find out whether my clicker works. It does. It is common to start these kinds of talks with a description of how terrible the state of the world is, and this is no exception: the present is very bad. Data management in data analytics is a huge mess. I don't know if any of you have ever tried to use things like pandas; it's great, it works with the five examples they have on the website, but one of the really overwhelming problems is the data storage itself. People tend to use text files: there's a well-known folder structure somewhere with a bunch of CSV files in it, and maybe some code on top of that that decides which CSV file should be read. Once we have loaded these files, we have these crude query processing engines, for example the one in pandas or the one in the R environment. Once people decide that CSV files are too slow, they start inventing their own crude hand-rolled binary formats on disk and start processing those; there has been a recent push in that direction. In general this is a zoo of one-off solutions, and that creates secondary problems: for example, changing anything about the data that you have becomes very difficult. So this is bad, and we don't want this. These are solved problems: we have database management systems, they've been around for 50 years or so, and what we're trying to do with DuckDB is make them usable for these data analysis tasks that are so common.
So now the contrast: the future is bright, obviously, with DuckDB. Who here has used SQLite? Okay, that is very many people, and in fact everybody has used SQLite, because it is in every browser, every phone, and every device that you can imagine. What we're trying to do is build something similar to SQLite but very different in the intended features, in the sense of what kind of data analysis questions you want to ask: you want to do data analytics, in contrast to SQLite, where you do transactional data management.

How do we do this? We have built a very fast so-called vectorized data processing engine; I will explain in a bit what that is. And we have stolen a lot of good ideas from SQLite. For example, DuckDB does not require you to run a separate server. You know this idea that you have to run a daemon that is your database, that you have to set up and configure and restart and whatever? No, it's database as a library: you run the DuckDB system inside your process. This has a nice side effect that data transfer between whatever you were using to talk to DuckDB and DuckDB itself becomes very fast, and for data analysis this is really a critical question. We've written a paper, and it was quite fun measuring, for example, the client protocol speed of various popular databases; the guys next door from Postgres came off pretty badly. What we have also stolen from SQLite is the idea of a single-file storage format: basically, your whole database, no matter how complex it is, no matter how many tables it has, is in a single file. And we've also stolen the idea that it should be simple to install; more on that in a bit. So this is the bright future.
How do we make that work? DuckDB is a library, so think of just a package that you embed into your application. We have zero external dependencies. This is really something that took a lot of work, but it is something we believe is quite necessary for a library to be successful: you don't have to install 57 other programs before you can use it. In fact, we have a special way to build DuckDB that results in two files, one header and one implementation.

DuckDB's base layer is a C++ API. We have full SQL support; I went through the wonderful job of implementing things like window functions in a database system, which I can tell you is not fun, so you don't have to do it, because you can use DuckDB. We have also built a wrapper for the API that SQLite uses, so in principle, if you have an application that talks to SQLite, you can do some library preload tricks and it will use DuckDB instead; this is something we have done to make it easy to switch. We've also learned from previous projects how important it is to integrate with the tools that people are using. In terms of data analysis, people use R and Python, so there are packages for R and Python, and I'll show an example in a bit; they basically include everything that you need to run DuckDB. And just to wrap it up, there is a command line interface, and for the people that want to do the web stuff, we have a REST server as well.
Let's show some examples. Here is one for Python, which by the way was also invented at CWI, so we are kind of obliged to integrate with Python. You say pip install duckdb, that's very complicated, and then you have it installed; there's no additional software required, all batteries included. Then you can just use the wonderful Python database API: you connect to a database, in this case a database is a file, and then you can run SQL queries, which is a required skill to work with DuckDB. Or maybe not, because in the R world we have a similar integration: you load the package, you connect to your database file, and the R people have invented this wonderful dplyr system of programmatically expressing queries, which is quite nice. And finally, the C++ API, which I wanted to show for the people that are more in C land: this is really the actual, fully functioning minimum integration of DuckDB into C++, where again you specify which file you want your database to be stored in, and then you can simply run SQL queries.

So that's the outside view. It's not very exciting, I realize this; not many people get excited about databases, I'm one of the few. But it is a tool that you can use to store your data, and, this is the big difference, you can get it out again quickly, and you can run queries on large amounts of data on your local computer quite quickly.
Now how do we do this? Let me talk briefly about some internals. We have something called vectorized processing; I'm not going to talk a lot about the other things, but this is the core of the engine that makes it fast. To understand vectorized processing, you have to understand that database engines come in different flavors. The traditional one is tuple-at-a-time, which is what PostgreSQL, SQLite, everybody uses: we basically look at one row of data at a time in the process of running queries. That's great; however, it's slow. Then we have the pandas/NumPy/R way of doing things, where we look at one column at a time, which is faster but has issues when the data becomes bigger than memory. And finally we have vectorized processing, which is the middle ground, where you look at chunks of data at a time. This is a very nice thing, because it means the data we look at during a query fits higher up in the CPU cache hierarchy. Here on the right you see a short overview of the CPU caches, and basically what we're trying to do with DuckDB is keep the data that is being worked on up in these very fast L1 and L2 caches, and avoid going into main memory, for performance reasons. And this is very nice because it also allows us to process data that is bigger than main memory. This is one of the limitations of things like pandas: once your data becomes bigger than memory, you're screwed. With a vectorized execution engine you actually have a reasonable chance of still completing your analysis questions, and you don't get wonderful out-of-memory errors.
Now I'm going to skip something. You would then ask: okay, why should I do vectorization? It's great that Hannes is excited about it, but what kind of difference does it make? This is a very crude benchmark: we ran a standard benchmark, TPC-H, on different systems. This is based on an old version, and we have gotten faster in the meantime, but basically, if you look at the bottom there, you can see the time it takes to complete these benchmark queries on the different systems, and then there is DuckDB up here, which is clearly much faster. Generally you could say this is 40 times faster than a traditional engine that works in a tuple-at-a-time fashion. But then you would say: Hannes, you're an academic, and you have a nice pet project, but I'm interested in something that I can use,
maybe even in a serious setting. And this is why I briefly want to talk about the quality assurance that we are doing with DuckDB. Basically we have continuous integration running, where millions of SQL queries are run on every single release. We know the correct result for every one of these queries, so whenever we get something wrong, it's instantly flagged. We have verified benchmark results for large standard benchmarks that we also check against. And basically we went around and stole everyone's test cases; with SQL engines you can do this, because they all have more or less the same query language, so the only thing you have to do is write a parser for whatever result format they have. My favorite part was writing a scraper for the SQL Server website, because they have example queries with answers, and from that we generated a bunch of test cases as well. We also do query fuzzing, where we auto-generate queries and try to break our system, which always works if you run the fuzzer long enough, but you find very important bugs along the way. And we also have something that we call continuous benchmarking, where every release is subjected to benchmarking and we can flag performance regressions quickly.
DuckDB is free and open source under the MIT license. We are currently in pre-release, which means that you can't yell at us if we change APIs internally, but it is fully functional: you can use it to run queries and to store data, it is all there. We have a website, and there's a GitHub page where you can go file a feature request if you want. We are very interested in hearing feedback, and if DuckDB doesn't do something that you want it to do, please tell us. If you're even more database-inclined, you can send us a pull request with new features, bug fixes, whatever; we have a long list of issues in the issue tracker tagged with "help wanted" or "good first issue", so these are good places to start. And with that, I'm happy to take questions. Thank you.
Q: Can I ask two questions? Do you do something for internal data compression? As you say, it's meant for big amounts of data.

A: Okay, so the question is: do we do something for internal compression? We are working on two things. One is that the storage on disk is going to be compressed, so whatever we write to the single-file format is going to be compressed. But we are also working with compressed intermediates, and this is really something we're working on right now: for example, if you have a vector of 1000 values and they're all the same, then we have compression that will not actually move those thousand values around, but just the fact that they are the same.

Q: And the second question: do you support any statistical functions, like computing percentiles and getting histograms back from the database engine?

A: That's a good question. Our philosophy there is that because the data transfer between DuckDB and the host is so fast, if you want things that we don't support, you're not going to die pulling a chunk of data into pandas, for example, and running it there. There is support for user-defined functions if you want to add anything, and we have a fairly complete aggregate function library, so there are multiple options. But the general idea is that we don't punish you for pulling a large chunk out of the system; we don't hold the data hostage.
Q: Hi, I have a question; thanks for the talk. Do you have a connector for SQLAlchemy, for example? In pandas you have a connector for SQLite, so you can write an SQL query against it.

A: I'm not sure what the status of that is, but people have worked on it. I think if it's not working already, it should be pretty straightforward to get working, because we support the exact same query language as Postgres, so I suspect it should already work and it's just a question of plumbing the connection. Okay, thank you very much; if you want to talk to me, I'm outside.

Host: Okay, perfect. Thank you for your talk.