DuckDB: An Embeddable Analytical Database

FOSDEM
24 Oct 2020, 16:19

Summary

TL;DR: This talk introduces DuckDB, an embeddable analytical database. DuckDB is a new database management system designed to process large amounts of data efficiently, with a particular focus on being embeddable into other software. It offers features well suited to data-analysis tasks: like SQLite, it stores the entire database in a single file, and data transfer to and from the host application is very fast. Internally it uses a vectorized processing engine, which keeps the data being worked on inside the CPU caches for performance. DuckDB is open source, integrates with a range of data-analysis tools, and provides packages for Python and R.

Takeaways

  • 🌟 DuckDB is an analytical, embedded database that can be built into other software.
  • 👨‍🏫 The speaker is a researcher at CWI who teaches computer science students and has found that building databases is a good way to learn about them.
  • 📈 DuckDB is designed for crunching through large amounts of data, as opposed to transaction processing such as handling orders in an online shop.
  • 💾 Like SQLite, DuckDB stores the database in a single file, but it is specialized for data analysis.
  • 🚀 DuckDB ships a very fast vectorized data-processing engine, which is what makes it fast for analytics.
  • 🔄 DuckDB has zero external dependencies and can be built into just two files: one header and one implementation file.
  • 🔧 DuckDB's base layer is a C++ API, but it also provides Python and R packages and integrates with data-analysis tools.
  • 📊 DuckDB has full SQL support, including advanced features such as window functions.
  • 🛠️ DuckDB invests heavily in quality assurance, with automated testing that includes continuous integration, benchmark verification, and query fuzzing.
  • 🆓 DuckDB is free software under the MIT license, and the project welcomes feedback and contributions from the open-source community.

Q & A

  • What kind of database is DuckDB?

    - DuckDB is an embeddable analytical database, specialized in processing large amounts of data.

  • Why is DuckDB notable as a new database management system?

    - Because it is easy to embed into other software and can process large amounts of data efficiently.

  • What problem does DuckDB solve?

    - It aims to clean up the current mess in data management and analytics by providing a database management system that is well suited to common data-analysis tasks.

  • How does DuckDB differ from other databases?

    - Unlike most databases, it uses a vectorized processing engine to process data quickly and stores the entire database in a single file.

  • What programming-language support does DuckDB have?

    - DuckDB's base layer is a C++ API, and it also provides Python and R packages, a command-line interface, and a REST server.

  • Does DuckDB have external dependencies?

    - DuckDB has zero external dependencies; it can be used without installing any other programs.

  • What is vectorized processing?

    - Vectorized processing works on chunks of data at a time, which makes efficient use of the CPU caches and allows large amounts of data to be processed quickly.

  • Does DuckDB support internal data compression?

    - DuckDB compresses data when writing it to disk, and support for compressed intermediate data is currently under development.

  • Does DuckDB support statistical functions such as percentiles and histograms?

    - DuckDB supports user-defined functions, so needed statistical functions can be added, but direct built-in support for statistical functions is limited.

  • Can DuckDB work with SQLAlchemy and Pandas?

    - Since DuckDB supports the same query language as PostgreSQL, it should be possible to connect it to SQLAlchemy and Pandas, but the current status of such a connector is unclear.

  • Is DuckDB open source?

    - Yes, DuckDB is open source under the MIT license, and anyone is free to use it, improve it, and provide feedback.

Outlines

00:00

💻 Introduction to DuckDB and the state of the world

The speaker introduces DuckDB, an embeddable analytical database, and takes a critical look at the current state of data management. DuckDB is a database that can crunch through large amounts of data and can be embedded into other software. The speaker points out that data management and data analytics are currently a mess: storing and processing data is hard, and a database system suited to data analysis is needed.

05:00

🔧 DuckDB's features and internals

Like SQLite, DuckDB keeps the whole database in a single file and is simple to install. It is built on a C++ API and has full SQL support. Integration with data-analysis tools such as R and Python is a priority, and packages are provided for both. Internally it uses a vectorized processing engine that operates on chunks of data to achieve fast processing.

10:01

📊 Benefits and performance of vectorized processing

The benefit of a vectorized engine is that the data being worked on stays in the CPU caches, which are much faster than main memory. DuckDB is substantially faster than traditional database engines, and the vectorized engine can run analyses even on data larger than memory.

15:01

🔄 Development status and call to the community

DuckDB is currently in pre-release and is published under the MIT license. The speaker discusses internal data compression and statistical-function support, both of which are under development, and invites the community to contribute feedback and pull requests.

Keywords

💡DuckDB

DuckDB is the database management system at the center of the talk. It is described as an "embeddable" database that can be built into other software and that focuses on analyzing large amounts of data. Like SQLite it can be embedded almost anywhere, but it is specialized for data-analysis tasks.

💡Embeddable

"Embeddable" refers to a database that can be integrated directly into other software. The talk emphasizes that DuckDB can be embedded into an application and act as its database without requiring a dedicated server process.

💡Data analysis

Data analysis is the process of working through large amounts of data to extract insights. The talk explains that DuckDB is specialized for data analysis, in particular for processing large amounts of data efficiently.

💡Vectorized processing engine

A vectorized processing engine speeds up a database engine by processing chunks of data at a time. The talk explains that DuckDB uses this technique to scan data quickly and make efficient use of the CPU caches.

💡Single-file storage format

A single-file storage format stores the entire database in one file. DuckDB adopts this format: no matter how complex the database is, all the data lives in a single file.

💡SQL

SQL is the language used to query databases. DuckDB provides full SQL support, so users interact with the database by running SQL queries.

💡Data compression

Data compression shrinks data to save storage space. The talk mentions that DuckDB compresses data when writing it to disk and is also working on compressing in-memory vectors.

💡Statistical functions

Statistical functions derive statistical information from data. The talk addresses whether DuckDB supports particular statistical functions and notes that functionality can be extended as needed by handing data to external tools such as Pandas.

💡Database library

A database library is software that provides database functionality for embedding into an application. DuckDB ships as a library with zero external dependencies, making it easy to integrate into applications.

💡Open source

Open-source software has publicly available source code that anyone may use, modify, and distribute. DuckDB is published under the MIT license, and the project welcomes community feedback and contributions.

Highlights

Introduction of DuckDB, an embedded analytical database focused on processing large volumes of data, as opposed to transactions such as orders in an online shop.

The speaker works at the Dutch national research lab for computer science (CWI), teaches databases, and also builds them.

DuckDB is developed by a team including Mark Raasveldt and is designed to be embedded into other software.

Data management and data analytics are currently a mess; common approaches such as Pandas have limitations.

DuckDB aims to make database management systems usable for common data-analysis tasks.

DuckDB does not require running a separate server; it can be embedded into an application as a library.

DuckDB has a single-file storage format, simplifying data storage and access.

DuckDB has zero external dependencies and is easy to install and integrate.

DuckDB provides a C++ API and a wrapper for the SQLite API, making it easy to swap in for SQLite.

DuckDB integrates well with data-analysis tools such as R and Python.

DuckDB provides a command-line interface and a REST server for different use cases.

Python and R integration examples show how easy DuckDB is to use.

Internally DuckDB uses a vectorized processing engine, which speeds up data processing.

Vectorized processing lets DuckDB handle data larger than memory, avoiding out-of-memory failures.

On the standard TPC-H benchmark DuckDB performs excellently, about 40 times faster than traditional engines.

DuckDB has a rigorous quality-assurance process, including continuous integration and benchmarking.

DuckDB is free and open source under the MIT license, and is currently in pre-release.

The DuckDB team encourages users to give feedback and contribute code to improve functionality and performance.

Internal data compression is under development to improve storage efficiency.

DuckDB supports user-defined functions to extend its statistical and analytical capabilities.

DuckDB may support a SQLAlchemy connector, easing integration with tools such as Pandas.

Transcripts

00:11

Hello everybody, welcome to the FOSDEM lightning talks in Building H. I want to introduce Hannes Mühleisen, who will talk about DuckDB, an embedded analytical database. Give him a warm welcome.

00:31

Thank you, welcome everybody. A quick introduction: I work at CWI, which is the Dutch national research lab for computer science and mathematics. I also teach computer science students about the wonderful world of databases, but I have found that a good way of learning about databases is building them, and therefore I also do that. Today I'd like to talk to you about one of these projects, and that is DuckDB. Obviously DuckDB is not my own sole creation; there are other people involved, most notably Mark Raasveldt, who is not here today. So we're going to talk about DuckDB, the database management system. It's completely new, and it's focused specifically on being embeddable, not embeddable as in hardware, but embeddable into other software. And it's analytical, which means it's focused on crunching through large amounts of data, as opposed to dealing with transactions like orders in your online shop. So if you want to do orders in your online shop, go to the Postgres people next door; if you want to crunch large amounts of data, you can use DuckDB.

01:44

Now I have to find out whether my clicker works. It does. It is common to start these kinds of talks with a description of how terrible the state of the world is, and this is no exception: the present is very bad. Data management in data analytics is a huge mess. I don't know if any of you have ever tried to use things like Pandas. It's great, it works with the five examples they have on the website, but one of the problems that is really overwhelming is the data storage itself. People tend to have these text files, a well-known folder structure somewhere with a bunch of CSV files in it, and maybe some code on top that decides which CSV file should be read. Once we have loaded these files, we have these crude query-processing engines, for example the one in Pandas or the one in the R environment. Once people decide that CSV files are too slow, they start inventing their own crude hand-rolled binary formats on disk and start processing those. There has been a recent push in this direction. In general this is a zoo of one-off solutions, and that creates secondary problems, for example making it very difficult to change anything about the data that you have. So this is bad; we don't want this. And these are solved problems: we have data management systems, they have been around for 50 years or so, and what we're trying to do with DuckDB is make them usable also for these data-analysis tasks that are so common.

03:33

So here is the contrast: the future is bright, obviously, with DuckDB. Who here has used SQLite? Okay, that is very many people, and in fact everybody has used SQLite, because it is in every browser, every phone, and every device you can imagine. What we're trying to do is build something similar to SQLite, but very different in the intended features, in the sense of what kind of data-analysis questions you want to ask: you want to do data analytics, in contrast to SQLite, where you do transactional data management. How do we do this? We have built a very fast, so-called vectorized data-processing engine; I will explain in a bit what that is. And we have stolen a lot of good ideas from SQLite. For example, DuckDB does not require you to run a separate server. You know this idea that you have to run a daemon that is your database, that you have to set up and configure and restart and whatever? No: it's database as a library. You run the DuckDB system inside your process. This has a nice side effect: data transfer between whatever you are using to talk to DuckDB and DuckDB itself becomes very fast, and for data analysis this is really a critical question. We've written a paper, which was quite fun, measuring for example the client protocol speed of various popular databases, and the guys next door from Postgres came off pretty badly. What we have also stolen from SQLite is the idea of a single-file storage format: basically, all of your database, no matter how complex it is, no matter how many tables it has, is in a single file. And we've also stolen the idea that it should be simple to install; more on that in a bit. So this is the bright future. How do we make that work?

05:38

DuckDB is a library. Think of it as just a package, a library that you embed into your application. We have zero external dependencies. This is really something that took a lot of work, but it is something we believe is quite necessary for a library to be successful: you don't have to install 57 other programs before you can use it. In fact, we have a special way to build DuckDB that results in two files, one header and one implementation. DuckDB's base layer is a C++ API. We have full SQL support; I went through the wonderful job of implementing things like window functions in a database system, which I can tell you is not fun, so you don't have to do it, because you can use DuckDB. We have also built a wrapper for the API that SQLite uses, so in principle, if you have an application that talks to SQLite, you can do some library-preload tricks and it will use DuckDB instead. This is something we have done to make it easy to switch. We have also learned from previous projects how important it is to integrate with the tools that people are using; in terms of data analysis, people use R and Python, so there are packages for R and Python (I'll show an example in a bit) that include everything you need to run DuckDB. And to wrap it up, there is a command-line interface, and for the people who want to do the web stuff, we have a REST server as well.

07:21

Let's show some examples. Here is an example for Python, which by the way was also invented at CWI, so we are kind of obliged to integrate with Python. You say `pip install duckdb`, that's very complicated, and then you have it installed; there's no additional software required, all the batteries are included. And then you can just use the wonderful Python database API, where you connect to a database (in this case a database is a file, so this would be a file) and then you can run SQL queries, which is a required skill to work with DuckDB. Or maybe not, because in the R world we have a similar integration, where you load up the database, you connect to your database file, and the R people have invented this wonderful dplyr system for expressing queries programmatically, which is quite nice. And finally, the C++ API, which I wanted to show for the people who are more in C land, is really just this: the actual, fully functioning, minimal integration of DuckDB into C++, where again you specify which file you want your database to be stored in, and then you can simply run SQL queries. So that's the outside view. It's not very exciting, I realize this; not many people get excited about databases, I'm one of the few. But it is a tool that you can use to store your data, and, this is the big difference, you can get it out again quickly, and you can run queries on large amounts of data on your local computer quite quickly.
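The Python usage the talk describes follows the standard Python database API. Since the slides themselves are not reproduced here, the sketch below illustrates that pattern using the stdlib `sqlite3` module (so it runs with a plain Python install); per the talk, with the `duckdb` package installed the same flow applies with `import duckdb` and `duckdb.connect(...)` instead. The file name and example table are assumptions for illustration.

```python
import sqlite3  # stand-in for `import duckdb`; the DB-API calls are the same

# Connect to a database: the database is just a single file on disk.
con = sqlite3.connect(":memory:")  # a file path like "mydata.db" also works

# Create a table, insert rows, and run an analytical-style SQL query.
con.execute("CREATE TABLE orders (item TEXT, amount INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("duck", 2), ("goose", 5), ("duck", 3)])

rows = con.execute(
    "SELECT item, SUM(amount) FROM orders GROUP BY item ORDER BY item"
).fetchall()
print(rows)  # [('duck', 5), ('goose', 5)]
```

Because the engine runs inside the host process, fetching `rows` back is a memory copy rather than a network round-trip, which is the data-transfer advantage the talk emphasizes.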

09:04

Now, how do we do this? Let me talk briefly about some internals. We have something called vectorized processing. I'm not going to talk a lot about the other things, but this is the core of the engine that makes it fast. To understand vectorized processing, you have to understand that database engines come in different flavors. There is the traditional tuple-at-a-time flavor, which is what PostgreSQL, SQLite and everybody else uses: basically, we look at one row of data at a time in the process of running queries. That's great; however, it's slow. Then we have the Pandas/NumPy/R way of doing things, where we look at one column at a time, which is faster but has issues when the data becomes bigger than memory. And then, finally, we have vectorized processing, which is the middle ground, where you look at chunks of data at a time. This is a very nice thing, because it means that the data we look at in a query fits high in the CPU cache hierarchy. Here on the right you see a short overview of the CPU caches; basically, what we're trying to do with DuckDB is keep the data that is being worked on up in these very fast L1 and L2 caches, and actually avoid going into main memory, for performance reasons. And this is very nice, because it also allows us to process data that is bigger than main memory. This is one of the limitations of things like Pandas: once your data becomes bigger than memory, you're screwed. With a vectorized execution engine you actually have a reasonable chance of still completing your analysis questions, and you don't get wonderful out-of-memory errors.
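The three engine flavors just contrasted can be sketched as control flow in plain Python. This is only an illustration of the idea, not DuckDB's engine: the same hypothetical sum is computed tuple-at-a-time, column-at-a-time, and chunk-at-a-time, where the chunk size is chosen small enough that each chunk's working set could stay in cache (and, in a real engine, chunks could be streamed from disk rather than held in memory all at once).

```python
# Illustrative sketch (not DuckDB internals): three ways to aggregate a column.
data = list(range(1_000_000))

# 1) Tuple-at-a-time: one round-trip through the engine per row (slow).
def sum_tuple_at_a_time(rows):
    total = 0
    for value in rows:
        total += value
    return total

# 2) Column-at-a-time: one bulk operation over the whole column
#    (fast, but the full column and its intermediates must fit in memory).
def sum_column_at_a_time(column):
    return sum(column)

# 3) Vectorized: process fixed-size chunks, so each step's working set
#    is small enough to live in the CPU caches, and chunks could be
#    streamed in when the data is bigger than memory.
def sum_vectorized(column, chunk_size=2048):
    total = 0
    for start in range(0, len(column), chunk_size):
        chunk = column[start:start + chunk_size]  # one "vector" of values
        total += sum(chunk)
    return total

assert sum_tuple_at_a_time(data) == sum_column_at_a_time(data) == sum_vectorized(data)
```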

10:48

Now I'm actually going to skip something. You might ask: okay, why should I do vectorization? It's great that Hannes is excited about it, but what kind of difference does it make? This is a very crude benchmark: we ran a standard benchmark, TPC-H, on different systems. This is based on an old version; we have gotten faster in the meantime. But basically, if you look at the bottom there, you can see the time it takes to complete these benchmark queries on the different systems, and then there is DuckDB up here, which clearly is much faster. Generally you could say that this is 40 times faster than a traditional engine that works in a tuple-at-a-time fashion.

11:36

But then you might say: Hannes, you're an academic, you have a nice pet project, but I'm interested in something I can use, maybe even seriously. This is why I briefly want to talk about the quality assurance that we are doing with DuckDB. Basically, we have continuous integration running, where millions of SQL queries are run on every single release. We know the correct result for every one of these queries, so whenever we get something wrong, it is instantly flagged. We have verified benchmark results for large standard benchmarks that we also check against. And basically we went around and stole everyone's test cases. With SQL engines you can do this, because they all have the same query language, so the only thing you have to do is write a parser for whatever result format they use. My favorite part was writing a scraper for the SQL Server website, because they have example queries with answers, and from that we generated a bunch of test cases as well. We also do query fuzzing, where we auto-generate queries to try to break our system, which always works if you run the fuzzer long enough, but you find very important bugs along the way. And we also have something we call continuous benchmarking, where every release is subjected to benchmarking, so we can flag performance regressions quickly.
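Query fuzzing of the kind described can be sketched in a few lines: auto-generate random SQL expressions, run them through an engine, and compare against an independent computation. The sketch below is a toy illustration (it is not DuckDB's actual fuzzer); it uses the stdlib `sqlite3` engine as the system under test, and restricts itself to integer arithmetic so the engine's answers can be checked against plain Python.

```python
import random
import sqlite3

# Toy query fuzzer (illustration only): generate random integer-arithmetic
# SQL expressions and check the engine's answer against an independent
# Python evaluation of the same expression.
def random_expression(rng, depth=3):
    if depth == 0 or rng.random() < 0.3:
        return str(rng.randint(1, 9))
    op = rng.choice(["+", "-", "*"])
    left = random_expression(rng, depth - 1)
    right = random_expression(rng, depth - 1)
    return f"({left} {op} {right})"

rng = random.Random(42)  # fixed seed so the fuzz run is reproducible
con = sqlite3.connect(":memory:")
for _ in range(1000):
    expr = random_expression(rng)
    (engine_result,) = con.execute(f"SELECT {expr}").fetchone()
    # eval() is safe here: the expression contains only digits, + - * and parens
    assert engine_result == eval(expr), expr
```

A real fuzzer would also generate table scans, joins, and aggregates, and compare against a reference engine rather than `eval`, but the shape of the loop is the same.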

12:58

So, DuckDB is free and open source under the MIT license. We are currently in pre-release, which means you can't yell at us if we change APIs internally, but it is fully functional: you can use it to run queries and to store data, it is all there. We have a website, and there is a GitHub page where you can go file a feature request if you want. We are very interested in hearing feedback, and if DuckDB doesn't do something that you want it to do, then please tell us. If you are even more database-inclined, you can send us a pull request with new features, bug fixes, whatever. We have a long list of issues in the issue tracker tagged with "help wanted" or "good first issue", so these are good places to start. And with that, I'm happy to take questions. Thank you.

13:55

[Audience] Can I ask two questions? Do you do something for internal data compression? As you say, it is used for big amounts of data.

14:08

Yeah, okay, so the question is: do we do something for internal compression? We are working on two things. One is that the storage on disk is going to be compressed, so whatever we write to the single-file format is going to be compressed. But we are also working, and this is really something we're working on right now, on compressed intermediates. So for vectors, for example, if you have a vector of 1000 values and they're all the same, we have compression that will not actually move those thousand values around, but only the fact that they're all the same.

play14:40

you know the fact that it's the same and

play14:43

the second question is

play14:45

do you support any statistical functions

play14:48

like

play14:49

computing percentiles and getting

play14:52

histograms back from

play14:53

the database engine that's a good

play14:55

question um so our philosophy there is

play14:57

that because the data

play14:58

transfer between db and the host is so

play15:00

fast

play15:01

that if you want things that we don't

play15:03

support it's actually you're not going

play15:04

to die

play15:05

pulling a chunk of data into pandas for

play15:07

example and running it there

play15:08

um there is support for user defined

play15:10

functions if you want to add anything

play15:12

we have a fairly complete aggregation

play15:14

functions library so

play15:16

there is multiple options there but but

play15:18

the general idea is that we don't

play15:20

um we don't punish you for pulling a

play15:23

large chunk

play15:24

out of the system we don't hold the data

play15:25

hostage

15:28

[Audience] Hi, I have a question. Thanks for the talk. Do we have a connector for SQLAlchemy, for example? In Pandas you have a connector for SQLite, so you can write a SQL query and then...

15:41

Yeah, I'm not sure what the status of that is, but people have worked on it. I think if it's not working already, it should be pretty straightforward to get working, because we support the exact same query language as Postgres, so I suspect it should already work and it's just a question of plumbing the connection.

16:03

Okay, thank you very much. If you want to talk to me, I'm outside.

16:07

Okay, perfect. Thank you for your talk.
