Open Source Friday with LIDA - Generate Infographics with LLMs

GitHub
31 May 2024 · 58:46

Summary

TLDR: In this video, Dr. Victor Dibia, a researcher at Microsoft, introduces LIDA, an open-source project he developed. LIDA is a tool that automatically generates visualizations and infographics from data, letting users visualize their data easily without needing data-science or machine-learning expertise. In the demo, a dataset of top YouTubers is used to show how LIDA summarizes the data, generates questions, and creates visualizations. The video also discusses LIDA's limitations and areas for improvement, and how to contribute to the project.

Takeaways

  • 😀 Kedasha, a Developer Advocate at GitHub, hosts the show and is joined by Dr. Victor Dibia, a Principal Research Software Engineer at Microsoft, to introduce the open-source project LIDA.
  • 🔍 LIDA is an open-source project that automatically generates visualizations and infographics from data; it grew out of Victor's 2018 research on automatic visualization generation and later started as a Microsoft Research hackathon project.
  • 📊 LIDA summarizes the data and generates hypotheses (visualization goals), then automatically writes the code needed to create the visualizations.
  • 🎓 Victor has a traditional software-engineering background. To understand the human side of software, he completed a PhD in Information Systems, applying theories from user behavior and psychology to study how people make decisions when they use tools and interfaces.
  • 🤖 In his earlier Data2Vis work, visualization generation was treated as a sequence prediction problem: sequence models such as RNNs were trained to learn the translation from raw data input to visualizations.
  • 📈 By visualizing data, LIDA reduces the cognitive load on people and makes it possible to extract insights from data quickly.
  • 🛠️ LIDA does not preprocess or clean data, but when the data is clean and in a suitable format it can generate visualizations with high reliability.
  • 🌐 LIDA provides a Python API and a web API, letting users point it at a dataset and generate visualizations (see the sketch after this list).
  • 🔧 To help ensure visualization quality, LIDA includes a self-evaluation module and self-repair capability that score generated visualizations and suggest improvements.
  • 👥 LIDA is considered very valuable for data scientists and visualization beginners alike: users without specialist knowledge can generate questions from their data and visualize it.
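
A minimal sketch of the Python workflow described in these takeaways, assuming the pip package name lida and the Manager/TextGenerationConfig interface that appears later in the demo; exact method names and parameters may differ between LIDA versions.

    # Summarize a dataset, generate visualization goals, then render one chart.
    # Assumes `pip install lida` and an OPENAI_API_KEY in the environment.
    from lida import Manager, TextGenerationConfig, llm

    lida = Manager(text_gen=llm("openai"))                        # other providers can be swapped in
    config = TextGenerationConfig(model="gpt-3.5-turbo", temperature=0.2)

    summary = lida.summarize("cars.csv", textgen_config=config)   # data summary (stats + LLM annotations)
    goals = lida.goals(summary, n=5, textgen_config=config)       # EDA-style visualization goals

    charts = lida.visualize(summary=summary, goal=goals[0],
                            textgen_config=config, library="seaborn")
    print(charts[0].code)   # generated plotting code; a rendered image is attached too (attribute names recalled from the docs)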

Q & A

  • What is Open Source Friday?

    -Open Source Friday is a weekly show hosted by Kedasha, a Developer Advocate at GitHub, to celebrate maintainers, contributors, and projects.

  • Who is the guest this time?

    -The guest is Dr. Victor Dibia, a Principal Research Software Engineer at Microsoft.

  • What is the project LIDA that Dr. Victor Dibia introduced?

    -LIDA is an open-source project that uses generative AI and large language models to automatically generate data visualizations and infographics.

  • What is Dr. Victor Dibia's educational background?

    -He earned bachelor's and master's degrees centered on computer science and software engineering, a master's in information networking at CMU, and then a PhD in Information Systems at City University of Hong Kong.

  • What are LIDA's main features?

    -Given a dataset, LIDA automatically generates visualizations based on that data, making it easier for users to understand the data visually.

  • How does LIDA differ from other AI tools?

    -LIDA provides a visualization-specific user experience, offers a Python API and a web API, and generates and executes code, then displays the results to the user.

  • What data formats does LIDA support?

    -LIDA supports data formats such as CSV and JSON; in practice, anything pandas can read works.
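
    As a short, hedged illustration (the documented path is a file path or URL; passing a pandas DataFrame directly is an assumption based on the "anything pandas can read" remark):

        import pandas as pd
        from lida import Manager

        lida = Manager()                              # defaults to the OpenAI provider
        summary_csv = lida.summarize("videos.csv")    # point LIDA at a CSV file
        df = pd.read_json("videos.json")              # load JSON yourself with pandas...
        summary_json = lida.summarize(df)             # ...then hand over the DataFrame (assumed supported)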

  • How does Dr. Victor Dibia use LIDA himself?

    -He uses LIDA to visualize data and add the resulting charts to presentation slides.

  • What state should the data be in when using LIDA?

    -Ideally, the data should already be reasonably clean and well organized before it is handed to LIDA.

  • How can people contribute to LIDA?

    -LIDA is open source; you can star the repository on GitHub and contribute through bug fixes, documentation improvements, and more.

Outlines

00:00

🎥 Introducing the open-source project

Kedasha, a Developer Advocate at GitHub, opens Open Source Friday, the show that celebrates contributors and maintainers. Guest Dr. Victor Dibia, a Principal Research Software Engineer at Microsoft, introduces LIDA, a project that uses generative AI and LLMs to create data visualizations and infographics.

05:02

👨‍🎓 Dr. Victor Dibia's background and the road to LIDA

Dr. Victor Dibia has a traditional software-engineering background and later earned a PhD in Information Systems to study the human side of software. An internship at IBM Research sparked his interest in AI, particularly in multimodal experiences that combine multiple interfaces. He also worked on Data2Vis, a project that automated data visualization, and LIDA is an open-source project built on that experience.

10:02

📈 Overview of LIDA and its features

LIDA is a tool that creates infographics from data: it automatically generates a data summary and visualization goals. It does not clean the data; it assumes the data is already in order. LIDA provides a Python API and a web API, making it possible to build visualization-specific user experiences and generate visualizations with high reliability.

15:02

🛠️ LIDA demo and usage

In the demo, LIDA is installed from the GitHub repository and the UI is launched from a new GitHub Codespace. LIDA is pointed at a dataset, builds a summary, generates visualization questions, then generates and executes code. Users can select a question and see the generated visualization.
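
The lines below collect the installation and launch steps from the demo as a rough sketch; `pip install lida` and `lida ui --port 8080` come from the video, while the rest (the editable install and the smoke test) is an assumption.

    # Shell steps from the demo, kept as comments:
    #   pip install lida              # or `pip install -e .` from a source checkout
    #   export OPENAI_API_KEY=...     # LIDA defaults to the OpenAI provider
    #   lida ui --port 8080           # serves the web UI on http://localhost:8080
    # Quick Python smoke test that the library imports and a Manager can be created:
    from lida import Manager

    lida = Manager()
    print("LIDA manager ready:", type(lida).__name__)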

20:03

🔧 LIDA's flexibility and code modification

LIDA does not just generate and execute code; it can also modify the code in response to user requests. For example, when asked to convert a chart to a bar chart, LIDA rewrites the code and the execution engine performs error correction.
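
A hedged sketch of the same interaction through the Python API; the edit method and its parameters are recalled from LIDA's documentation and may differ by version.

    # Rewrite existing visualization code from a natural-language instruction.
    from lida import Manager

    lida = Manager()
    summary = lida.summarize("stocks.csv")
    goals = lida.goals(summary, n=1)
    charts = lida.visualize(summary=summary, goal=goals[0], library="seaborn")

    edited = lida.edit(code=charts[0].code,
                       summary=summary,
                       instructions=["convert this to a bar chart"],
                       library="seaborn")
    print(edited[0].code)   # the execution engine re-runs the rewritten code and error-corrects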

25:06

📊 Visualization quality and self-evaluation

To help ensure visualization quality, LIDA also includes self-evaluation: it evaluates the visualization code, assigns scores along multiple dimensions, and suggests improvements. When needed, a self-repair module can then improve the visualization.
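
One way to approximate that evaluate-then-repair loop from the Python API, as a sketch; the evaluate method exists in LIDA, but the exact shape of its return value (dimension/score/rationale fields) is recalled from memory, and feeding the critiques back through edit is just one possible repair strategy.

    from lida import Manager

    lida = Manager()
    summary = lida.summarize("stocks.csv")
    goal = lida.goals(summary, n=1)[0]
    charts = lida.visualize(summary=summary, goal=goal, library="matplotlib")

    evals = lida.evaluate(code=charts[0].code, goal=goal, library="matplotlib")
    critiques = [e["rationale"] for e in evals[0] if e.get("score", 5) < 4]   # threshold is arbitrary
    if critiques:
        repaired = lida.edit(code=charts[0].code, summary=summary,
                             instructions=critiques, library="matplotlib")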

30:08

🔄 Data handling and real-time data

LIDA supports anything pandas can process and can also work with real-time data streams: you can build a data-chunking strategy and update the visualization each time new data arrives.
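
A minimal sketch of that chunking idea, independent of LIDA's API: fix (or generate once) the plotting code, then re-run it over a sliding window as new data arrives. File names and column names here are hypothetical.

    import time
    import pandas as pd
    import matplotlib.pyplot as plt

    def render(df: pd.DataFrame, path: str = "latest.png") -> None:
        # Stand-in for plotting code LIDA generated once; we simply re-execute it per chunk.
        ax = df.plot(x="timestamp", y="value", kind="line")
        ax.figure.savefig(path)
        plt.close(ax.figure)

    window = None
    for chunk in pd.read_csv("stream.csv", chunksize=500):        # stands in for a real data stream
        window = chunk if window is None else pd.concat([window, chunk]).tail(5000)
        render(window)                                            # refresh the chart for this window
        time.sleep(1)                                             # pace the updates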

35:11

🤖 Limitations and a call for contributions

LIDA still has limitations; complex instructions can lead to logic errors in the generated code. Community contributions are welcome, and there is plenty of room to help, such as adding data connectors and supporting multimodal evaluation.
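
For the logic-error failure mode, one common mitigation is an execute-and-retry loop that feeds the traceback back to the model. This is a generic sketch rather than LIDA's internal implementation; generate_code and fix_code are hypothetical stand-ins for calls into LIDA or an LLM.

    import traceback

    def run_with_repair(generate_code, fix_code, dataset_path: str, max_attempts: int = 3) -> str:
        """Execute generated plotting code; on failure, hand the traceback back for self-repair."""
        code = generate_code(dataset_path)
        for _ in range(max_attempts):
            try:
                exec(code, {"__name__": "__lida_sketch__"})    # run the generated script
                return code                                    # success
            except Exception:
                code = fix_code(code, traceback.format_exc())  # ask the model to repair using the error
        raise RuntimeError("visualization code still failing after repeated repairs")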

40:13

🌐 Data sources and comparing visualizations

LIDA supports data sources such as CSV and JSON, and anything pandas can handle. It can also compare two different snapshots of the data, including comparing visualization quality.
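
A hedged sketch of comparing visualizations generated for two snapshots of the same dataset, assuming the Manager API from the earlier sketches; the raster attribute is assumed to be a base64-encoded PNG, and the file names are hypothetical.

    import base64
    from lida import Manager

    lida = Manager()

    def chart_for(path: str, out: str) -> None:
        summary = lida.summarize(path)
        goal = lida.goals(summary, n=1)[0]
        chart = lida.visualize(summary=summary, goal=goal, library="seaborn")[0]
        with open(out, "wb") as f:
            f.write(base64.b64decode(chart.raster))   # save for side-by-side comparison

    chart_for("sales_2023.csv", "snapshot_2023.png")
    chart_for("sales_2024.csv", "snapshot_2024.png")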

45:14

📬 Contributing to LIDA and staying in touch

LIDA is an open-source project and actively welcomes stars and contributions, such as improving documentation, proposing new features, and adding data connectors. Dr. Victor Dibia also publishes a newsletter where he shares news about LIDA and his thinking on AI, machine learning, and design.

Keywords

💡 Open source

Open source means that software's source code is publicly available and anyone can freely use and modify it. In the video, the open-source project LIDA is introduced and positioned as a project that developers advance together.

💡 GitHub

GitHub is a web-based hosting service for repositories and distributed version control built on Git. In the video, Kedasha appears as a GitHub Developer Advocate and talks about supporting open-source projects.

💡 Generative AI

Generative AI is technology that uses machine learning to create new data. In the video, generative AI is used to create data visualizations and infographics; in particular, the LIDA project leverages this technology to automatically generate visualizations from data.

💡 LLM (Large Language Models)

LLMs are large-scale language models, the class of algorithms used in natural language processing. The video shows how the LIDA project uses LLMs to summarize data and generate visualizations.

💡 Data visualization

Data visualization is the process of representing information obtained from data visually. The LIDA project aims to automate data visualization so that insights can be drawn quickly from complex datasets.

💡 API (Application Programming Interface)

An API is an interface through which software components exchange data. The video touches on the Python API and web API that LIDA provides, which let users access LIDA's functionality from their own programs.

💡 Machine learning

Machine learning is the field in which computers analyze data using self-learning algorithms. The video describes how the LIDA project uses machine learning to automate data visualization.

💡 Multivariate analysis

Multivariate analysis is a statistical approach for analyzing data that contains multiple variables. The video explains that LIDA automates multivariate analysis of the data and uses it to summarize datasets and generate visualizations.

💡 User interface

A user interface is the part of a program that makes interaction between the user and the software easy. The video shows how visualizations are generated and edited through LIDA's user interface.

💡 Codespaces

Codespaces is GitHub's online development environment. In the video, the LIDA project is demoed from a GitHub Codespace.

Highlights

An introduction to the open-source project LIDA, developed by Dr. Victor Dibia, a Principal Research Software Engineer at Microsoft, for creating data visualizations and infographics with generative AI and large language models (LLMs).

The LIDA project existed before ChatGPT appeared, showing how generative AI and LLMs can be used to automatically generate data visualizations.

Victor Dibia's background: from traditional software engineering to a PhD in Information Systems, and on to an interest in AI and multimodal interaction.

The Data2Vis project at IBM Research used sequence models to automatically generate data visualizations, treating visualization as a sequence prediction problem.

Using RNNs and the declarative visualization language Vega-Lite, data and visualizations are represented as text and a model is trained to translate between them.

How visualization makes data easy to understand and reduces the cognitive burden of extracting insights from data.

Creating good visualizations takes skill and effort; LIDA aims to simplify this work by automating the process.

LIDA's workflow: from data summarization to generating candidate visualization goals, to letting users produce complex visualizations with no effort.

LIDA's limitation: it assumes the data is already clean and well formatted and does not focus on data cleaning.

LIDA's Python API and web API, and its research focus on visualization-specific user experiences.

The LIDA demo, including installation, the web interface, and integration with GitHub Codespaces.

How LIDA automatically generates questions and visualizations from sample datasets such as the S&P 500.

LIDA's interactivity: users can modify visualizations through text requests, for example converting a chart to a bar chart.

How LIDA uses an LLM to self-evaluate and repair errors in visualizations.

LIDA supports multiple visualization libraries and model providers, and how to configure them.

How LIDA handles real-time data streams, and the potential and challenges of real-time data visualization.

LIDA's open-source nature, which encourages community contribution and collaboration, and how to start contributing.

LIDA's current limitations and future directions, including multimodal evaluation and enhanced data connectors.

How LIDA can be used for teaching and workshops, making it more accessible to users without data-visualization experience.

How LIDA helps users understand large datasets, and what it innovates in the data-visualization space.

Transcripts

play00:01

[Music]

play00:12

[Music]

play01:39

[Music]

play01:52

hey everyone welcome to open source

play01:56

Friday I'm Kedasha and I work at

play01:59

GitHub as a developer Advocate and I'm

play02:02

super excited to be here today open

play02:04

source Friday as you know is the weekly

play02:06

show that we have to celebrate

play02:08

maintainers contributors and the project

play02:10

that makes our life so much easier today

play02:14

I'm super excited to have Victor dibia

play02:17

PhD who is here to show us all about how

play02:21

to build uh visualizations and

play02:23

infographics with generative AI and LLMs

play02:27

with his project LIDA now Victor is a

play02:31

principal research software engineer at

play02:33

Microsoft and he does a lot of great

play02:36

work with generative AI and LLMs and LIDA

play02:39

is a project that he built last year

play02:41

before ChatGPT came to the scene so I'm

play02:45

super happy to have Victor here today to

play02:47

tell us all about his project LIDA hi

play02:49

everyone welcome to the show thanks so

play02:52

much for joining us I'm gonna add Victor

play02:54

to the stage and we will get right into

play02:57

the tech hey

play02:58

Victor hi um hi everyone great to be

play03:02

here thanks for the

play03:04

introduction of course super happy to

play03:06

have you here Dr Dibia okay that's going

play03:09

to be the last time I say

play03:11

that but uh I think it's very impressive

play03:14

that you have a PhD like that takes a

play03:16

lot of work and like a lot of dedication

play03:18

to you know education and

play03:20

learning yeah well well thanks thanks

play03:23

for the kind

play03:24

words okay he's a little shy okay great

play03:27

so today we're chatting about project

play03:30

LIDA or so project LIDA from my

play03:33

understanding is an open- source project

play03:36

from Microsoft that Victor built last

play03:39

year and it helps us to generate like

play03:42

visualizations and infographics for our

play03:45

data now I'm super excited to talk about

play03:48

project LIDA because I've been learning

play03:50

some machine learning stuff and I'm just

play03:52

like okay yeah I have an appreciation

play03:54

for this project I know what's going on

play03:58

nice um yeah yeah but like tell us tell

play04:01

us more about yourself actually um what

play04:03

does it mean to be in the role that you

play04:06

do and like how did you start you know

play04:09

having an interest in generative Ai and

play04:11

large language models yeah um so my

play04:16

background is just traditional software

play04:18

engineering so I did like a bachelor's

play04:20

degree a masters mostly covering courses

play04:24

in computer science software engineering

play04:26

distributed systems I did a master's

play04:28

degree in information

play04:30

systems at uh information networking at

play04:33

CMU Pittsburgh and um it's just covering

play04:36

topics in software engineering

play04:38

distributed systems um some bit of

play04:41

security and um after that I did a

play04:44

startup for a little while um moved to

play04:46

West Africa to do that and then start

play04:49

other university really enjoyed that

play04:52

then I figured I also wanted to learn

play04:54

about like the human aspect of like

play04:56

software tools and systems and so I

play04:59

ended up doing a PhD in Information

play05:02

Systems um and so the information

play05:04

systems is this field that looks at

play05:07

um you know applying theories from user

play05:10

Behavior psychology and using that sort

play05:13

of understand how people make decisions

play05:15

as they use tools and

play05:16

interfaces and that was the PHD I did

play05:19

that in Hong Kong actually um wow rigar

play05:22

roll there um so I did that at C

play05:24

University of Hong Kong and um towards

play05:27

the end I got a chance to intern with a

play05:29

group at IBM research uh a primarily HCI

play05:32

group just again looking at like

play05:34

interface design applying theories from

play05:37

user Behavior decision making to study

play05:39

that but then we worked very closely

play05:41

with the applied machine learning group

play05:42

at IBM research and that's really where

play05:43

my AI sort of interest of started

play05:46

started out so I first first of all

play05:48

started to look at how we could apply

play05:50

these models build like multimodal

play05:52

interaction and so we we we worked on

play05:54

things like room scale interaction a

play05:56

person walks into a room there's a

play05:58

camera array there's a 3D pointing

play06:01

device there's a microphone array and

play06:04

essentially we doing text to speech

play06:05

speech to text and vision recognition

play06:08

pointing gesture recognition and sort of

play06:10

building that into multimodal like

play06:13

experiences wow and yeah and then I

play06:16

started out sort of applying things and

play06:18

then I started figur out hey you know

play06:20

like it would be nice to sort of build

play06:22

new models solve new interaction type

play06:24

problems and in 2018 at IBM research one

play06:29

of the things I worked on was a project

play06:31

called Data2Vis and I'll just share my

play06:33

screen if you want to put that up a

play06:36

little um essentially this project was

play06:39

in 2018 and the the the title was just

play06:43

automatic generation of data

play06:44

visualization using sequence-to-sequence

play06:47

models so these were the state of the art

play06:48

models back then rnns and what this

play06:51

project just showed for the first time

play06:53

was that we could represent

play06:56

visualization generation as um as a

play06:59

sequence to sequence prediction problem

play07:01

and we could train specialized um

play07:04

sequence to sequence models that could

play07:07

take raw data and then generate

play07:09

visualizations from

play07:11

Thats if you represent the visualization

play07:14

as text and so we could do this by just

play07:16

sampling let's say the first two rows

play07:17

from the data set and then we represent

play07:20

the visualization as text and so in this

play07:22

case we used like a a declarative

play07:25

visualization language called Vega-Lite

play07:27

and so essentially you could represent

play07:29

your visualizations just as JSON

play07:32

Json just like as a Json object and we

play07:36

took this pairs of sample data and

play07:39

visualizations and we made a fairly big

play07:41

data set out of that and essentially we just

play07:43

took an RNN a sequence-to-sequence model

play07:46

the exact type of model you would use to

play07:48

train like a language translation model

play07:50

so imagine that you would learn to

play07:52

translate between English and

play07:54

French you have all these all these

play07:57

pairs of English and French and so we

play07:59

just replace that with data and

play08:01

visualization and this model learned to

play08:03

translate between between that and then

play08:07

oh if the data is big enough yeah just

play08:10

just so I understand just so I

play08:12

understand so it's kind of like seeing

play08:15

visualizations as a language yes

play08:18

exactly oh wow that's represented as

play08:21

text as a as a just like like a language

play08:24

in this case it's just a specialized

play08:26

declarative language to represent

play08:28

visualizations and you represent your

play08:30

data again as text then you can get the

play08:32

models to actually just at run

play08:35

time you give them a sample of a data

play08:37

set and they'll actually come up with

play08:39

visualizations that are sort of grounded

play08:41

or contextually grounded in that data

play08:44

wow and so it was the first time we we

play08:47

sort of show why show that this was

play08:50

possible so the next question is like I

play08:52

don't know like why would you want to do

play08:54

that um and in in general I have this

play08:58

little slide that I I sort of use to

play09:01

explain the topic and the key idea is

play09:04

that like visualizations make data

play09:06

accessible and so let's say you had like

play09:09

a couple hundred thousand rows of data

play09:12

um but typically you want to make

play09:14

decisions from that right and so it just

play09:16

turns out that the human mind has

play09:18

evolved such that we can look at a chart

play09:20

and we can make really fast conclusions

play09:22

we can look at the chart and say like

play09:23

hey know my sales tends to be highest in

play09:26

the summer and so maybe I need to devote

play09:29

more of my resources and my marketing

play09:31

dollars all that kind of thing the

play09:33

summer so you could just look at a chart

play09:35

make a decision like that really really

play09:36

fast so in general chart sort of reduce

play09:39

the cognitive burden associated with

play09:41

sort of extracting insights from data

play09:43

super super important for many many

play09:45

fields um however to create good

play09:47

visualization it takes quite a bit of

play09:49

skill and effort so first of all you

play09:51

need to understand the data you to come

play09:53

up with good questions that takes some

play09:55

dat skill and when you come up with good

play09:58

questions you know you need to represent

play09:59

them as visualizations in some way right

play10:02

what encoding do you

play10:03

use do you use a bar chart a pie chart a

play10:06

radi chart what's the appropriate

play10:08

representation of that right then maybe

play10:12

the data might not be in its best format

play10:14

maybe you need to apply some

play10:15

transformations to the data right maybe

play10:17

you need to like combine two fields into a

play10:21

new field do some normalization all of

play10:23

that and then finally when you do all of

play10:26

that you need to figure out okay so how

play10:27

do I actually create this visualization

play10:29

um do I use Excel or powerbi to write

play10:32

code in up so there's quite a bit of

play10:34

effort there so the promise is um what

play10:38

if we could build a tool that will

play10:40

automatically understand the data in

play10:43

this case let's say summarize it in the

play10:45

case of Lia um what if we could based on

play10:49

this summary of the data could generate

play10:51

a bunch of potential visualization goals

play10:53

and this is exactly what um data

play10:57

scientists do with EDA yes exploratory data

play11:00

analysis just but imagine that you get

play11:02

this for free with zero effort and then

play11:05

for each of these exploratory goals or

play11:08

hypothesis can we generate complex

play11:11

visualizations without any just with

play11:13

zero requirements on the user and how

play11:16

can we do all of this with high

play11:17

reliability and and pretty much that's

play11:19

what LIDA set out to

play11:22

accomplish you know it it's it's so

play11:25

interesting and and I'm so happy that

play11:27

I've been studying and and it just so

play11:29

happens that I started learning about um

play11:32

like data cleaning and all all that jazz

play11:35

um before our stream because knowing

play11:37

that LIDA understands the data and then

play11:40

can generate goals and hypotheses

play11:43

automatically from the data that kind of

play11:45

blows my mind because it takes so long

play11:47

to you know like to like clean the data

play11:50

and then like you see and then you're

play11:51

just like okay what questions am I going

play11:53

to ask about this and like what am I

play11:55

trying to accomplish from all this it

play11:57

takes so much time right like instead of

play12:00

uh putting all that man power or U man

play12:03

energy into that you can use LIDA to you

play12:07

know generate those questions and then

play12:08

visualize the data and then from then

play12:10

you can you know go on to do like actual

play12:14

actual things actual work um yeah so so

play12:17

technically LIDA yeah so one small

play12:19

limitation is LIDA doesn't exactly

play12:22

focus in cleaning the data if the data

play12:25

exists it makes an assumption the data

play12:27

just exists in the potentially decent

play12:29

format and then all the stuff like you

play12:31

know come up with goals summarize it

play12:34

accomplish high high reliability all of

play12:37

that is after that and so data cleaning

play12:39

is another nice research topic but

play12:42

something I'd like to talk about at a

play12:43

different time but that's a little out

play12:45

of scope for LIDA gotcha so so so sorry

play12:49

to interrupt but like essentially the

play12:51

data should be clean before using LIDA I

play12:55

mean technically if it's if it's not

play12:56

clean you still get results it'll do its

play12:58

best effort um LMS are pretty cool

play13:01

actually in that way they they they're

play13:03

kind of robust to small amounts of noise

play13:05

and so they'll do a best effort and if

play13:07

we have some time you know we can go to

play13:09

some random data set on Kaggle and just

play13:11

see what LIDA does with this okay yeah

play13:15

how about that yeah

play13:17

yeah so the whole idea is how do we

play13:20

create a better visualization authoring

play13:22

experience right um go from data to

play13:24

natural language plus natural language

play13:27

intent to Reliable visualizations but

play13:29

you know a couple of the a few really

play13:32

really Vigilant folks on on the internet

play13:34

might be like you know but isn't that

play13:36

what ChatGPT and Code Interpreter does

play13:38

already um so there there there few

play13:41

differences so the first thing is um

play13:44

this project sort of started internally

play13:46

as a hackathon Microsoft research so I

play13:48

know Kedasha mentioned last year it was

play13:50

actually two years ago August

play13:53

2022 um and ChatGPT came out in in

play13:57

December but you know this project sort

play13:59

of showed how you could sort of take

play14:01

that data generate like summaries um and

play14:06

then sort of put an interface on top of

play14:08

all that um generate code executed and

play14:11

show the the result to a user and it it

play14:14

sort of used the earlier OpenAI models

play14:16

so the DaVinci series this space moves so

play14:19

fast everybody has forgotten the DaVinci

play14:21

series models but they really were the

play14:22

state of the art like two years ago so davinci

play14:25

davinci-instruct

play14:27

um and and now it's once ChatGPT came

play14:31

out essentially the GPT 3.5 turbo models

play14:35

it just made this app work a lot better

play14:37

so some of the differences here is that

play14:39

like it provides a python API and a web

play14:41

API um it has a visualization specific

play14:44

user experience sort of baked on top of

play14:46

it and then it the paper and the

play14:48

research around it sort of focuses on

play14:50

metrics for evaluating this kind of um

play14:53

um this kind of

play14:55

setup and uh and so at this point I'd

play14:59

like to probably switch to demo what you

play15:02

think yeah sounds

play15:04

great yeah so someone was just asking if

play15:07

there were any examples that we can

play15:09

share actually oh yeah yeah absolutely

play15:11

yeah um so the starting point for um for

play15:14

LIDA would be to go to the repo um

play15:18

which is I think it's yeah it's already

play15:20

in the chat and yes the first thing you

play15:23

want to do is to install it um do like a

play15:26

pip install lida and typically

play15:29

if you have GitHub code spaces you could

play15:32

essentially just click on a code space

play15:34

start out a new code space so I have one

play15:37

actually running and so the the the

play15:40

installation step would be to if you

play15:42

were installing from Source you would do

play15:44

pip install -e . or pip install

play15:46

lida I've kind of done that before

play15:48

just to save us some some time and then

play15:51

to spin up the UI you would see

play15:52

something like um lida ui and

play15:56

and essentially that just spins

play15:59

um um a web interface that sort of build

play16:04

on top of LIDA and we'll come back

play16:06

to the low level python API experience

play16:09

um in a moment but let's start out with

play16:11

the um let's start out with the with the

play16:14

UI experience and so once once you

play16:16

clicked on that you you went to the URL

play16:19

that it provided go to demo the first

play16:22

thing is you can point it at some data

play16:23

right and so let's assume that we

play16:25

pointed at the S&P 500 data set

play16:29

a few things happen so the first thing

play16:31

that happens is that there's a there's a

play16:33

module called the

play16:35

summarizer which sort of works in a

play16:37

two-stage process as a two-stage process

play16:40

in the first process we take everything

play16:41

we know about like data sets and we

play16:44

compute a bunch of Statistics so for

play16:45

example we use like pandas to infer that

play16:48

like the First Column is a date we

play16:50

compute the main and the max we generate

play16:52

a bunch of samples from that and then we

play16:54

compute number of unique samples so we

play16:57

know how to do this with code and so we

play16:59

this first next we take that

play17:01

representation and then we give that to

play17:03

an llm and the LM does things like

play17:05

adding a semantic type and a description

play17:08

field and so here we know it's a string

play17:11

but we also need to know the semantic

play17:12

type it's it's a date and in other data

play17:15

sets something might be just a string

play17:17

but we need to know if it's a country or

play17:19

it's heads of state and that kind of

play17:20

thing then also we add a description for

play17:23

each of these know we use the LM to use

play17:25

you know best effort and just based on

play17:27

its parametric knowledge come up with the

play17:29

sort of annotations to improve the

play17:31

representation of the data

play17:33

set so why why might we want this it

play17:36

just turns out even for humans the more

play17:38

data you have about the each of the

play17:40

columns and Fields the more interesting

play17:42

grounded questions you can ask of that

play17:45

of that data so that's the first piece

play17:47

so the the summarizer does that the next

play17:51

thing is the goal exploration module

play17:53

which is the EDA sort of module we

play17:55

talked about earlier comes up with a

play17:57

bunch of questions for data right and so

play18:00

right now the user has done nothing but

play18:01

here all this potential questions here

play18:03

and so the first one was the overall

play18:05

trend of the S&P 500 over the time

play18:09

period and we express this as like a

play18:13

multitask generation problem where we

play18:15

don't just ask the llm to come up with a

play18:18

visualization title or hypothesis we ask

play18:21

it to do three things like first of all

play18:23

the

play18:24

question the title of the

play18:27

visualization and then also rational for

play18:29

what we might learn from this

play18:31

visualization so it just turns out that

play18:33

if you if you express a problem like

play18:35

this then the

play18:36

llm sort of um comes up with more

play18:39

semantically rich and more likely

play18:41

correct um or interesting

play18:44

questions and so here and then for each

play18:47

of these

play18:49

questions um we get the visualization

play18:52

module to generate some code in this

play18:56

case we generate python code and the

play18:58

background execute that code do some

play19:00

preprocessing

play19:01

postprocessing um convert that to uh an

play19:04

image extract like a basics for

play19:06

representation that and stream that back

play19:08

to the user interface so so for each of

play19:10

these questions the llm

play19:12

um essentially it generates

play19:15

visualizations and so you just need to

play19:16

select select the particular um question

play19:20

and then you can see the visualization

play19:22

that was

play19:23

generated oh so it already did

play19:27

everything yep like as soon as you gives

play19:29

it the data set it just it just does it

play19:33

yeah I agree with uh I think it's Levi

play19:36

uh simply amazing and mindblowing no for

play19:38

real like this work takes so much time

play19:42

um and that's amazing yeah and another

play19:46

interesting thing is that because the

play19:47

visualizations here as I represented as

play19:50

code um just like you could ask Chad GPT

play19:54

to generate some code and then ask it to

play19:56

modify that code you could do that here

play19:58

too so for example example if you say

play19:59

something like convert this to a bar

play20:02

chart right and

play20:05

so the LM just takes the the user

play20:09

request takes the existing code and

play20:12

rewrites um rewrites the code the

play20:15

execution engine uh executes that does

play20:18

some error correction and then uh a line

play20:22

chart is converted to to a bar chart nice

play20:26

in addition to that you could also do

play20:27

other interesting things right right so

play20:29

um sometimes the visualization that

play20:31

comes back might not be correct and the

play20:33

user might not be a visualization expert

play20:35

so one way you can support them in

play20:38

making sense of the quality of the

play20:39

visualization is to ask the llm to

play20:42

retroactively sort of explain what

play20:44

what's going on in the chart so in this

play20:46

case it it makes three types of

play20:48

explanation so it talks about like just

play20:51

the overall appearance of the chart and

play20:53

so if you're building applications that

play20:54

have like accessibility requirements

play20:57

let's say you have user who are visually

play20:59

impaired you could include a feature

play21:01

like this that sort of describes what's

play21:04

in the chart so it says that the chart

play21:06

is a bar plot with blue bars

play21:07

representing the stock price over time the

play21:10

x-axis is this and all of that so s of

play21:13

explain explaining what's in the code

play21:15

and in the chart and then in terms of

play21:18

transformation it describes the type of

play21:20

transformation that's applied to the for

play21:22

data so first of all it says that the

play21:24

date column is formed as a date type

play21:28

there some sort of um error correction

play21:31

going on um and then also sort of

play21:34

describes the visualization so that's

play21:35

one thing another thing that you could

play21:38

do also is that you could try to

play21:40

leverage you know you know we all hear

play21:42

about very large language models and how

play21:44

they have consumed a lot of data and how

play21:47

they're actually General World experts

play21:50

and so it also turns out that like this

play21:52

LMS so in this case we're using I think

play21:54

GPT 3.5 turbo turns out they they

play21:57

they've read a lot of visualization

play21:58

books and you could actually ask them to

play22:00

evaluate visualization code across

play22:02

multiple Dimensions so in this case here

play22:05

like we ask it to generate it to to

play22:07

generate like ratings across Six

play22:09

Dimensions so the first is bugs um does

play22:13

it have any bugs or logic errors um so

play22:17

it says that like hey know I give the

play22:19

score 4.5 because you know the function

play22:21

could be improved by adding error

play22:23

handling to make sure that the input

play22:25

data is valid right in terms of trans

play22:28

transformation it says that like the

play22:30

data is transformed appropriately but

play22:32

here type seems a little low or yeah so

play22:36

here it says that like uh the

play22:37

visualization type is appropriate but a

play22:39

bar plot is good however a line plot might

play22:42

be more effective and essentially you

play22:45

can then say okay here's all these like

play22:47

assessments you've made on this

play22:48

particular chart how we how about we

play22:50

Auto Repair the chart and so it does a

play22:53

few things so it converts it back to a

play22:54

line chart and then it adds a a bunch of

play22:57

annotations so for examp

play22:59

it it it sort of plots it figures out

play23:02

that this is a stock price graph and it

play23:04

goes back in time and it sort

play23:07

of suggest a couple of important events

play23:11

to keep track of and in this case it

play23:15

just turns out some of these events

play23:16

actually have an effect on sort of the

play23:19

stock price um trajectory so say talks

play23:23

about the Lehman Brothers bankruptcy

play23:25

mining LEL and that sort of thing so so

play23:28

yeah we have a yeah we have a little

play23:31

question from Carlo they said that the

play23:34

dates are not visible at all is it

play23:36

overestimating the quality of the

play23:39

chart um so what do we mean by the dates

play23:42

are not visible you mean here I think it

play23:44

was the chart before this one yeah so

play23:48

yes so yeah on the chart previously um I

play23:51

was the one who sort of asked it like

play23:53

convert this to a bar chart and inste of

play23:55

did that and ideally you know I think

play23:58

bit if if you did ask the llm to

play24:00

generate it probably would be a bit more

play24:02

judicious in the use of like AIS and so

play24:05

that was like a forc error I did myself

play24:07

and then I asked the L to sort of

play24:09

self-critique that chart and then

play24:11

self-repair it and then it sort of fixed

play24:13

all of that error so yes um a good

play24:18

pipeline to have in general would be to

play24:20

first ask the to generate a chart and

play24:23

then um retroactively ask it to also

play24:29

retroactively ask it to also sort of

play24:31

self- evaluate the quality of that chart

play24:33

and then any kind of repairs and so that

play24:36

way you you get good quality charts um

play24:39

all all along the way um nice yeah so

play24:43

essentially oh keep going sure go ahead

play24:47

I was just GNA say so essentially laa is

play24:49

taking all the the packages that we

play24:52

would use like you know Matplotlib

play24:54

Seaborn and I think one's called scikit-

play24:57

learn um and it's and it's building the

play25:00

charts for US based on the data that's

play25:02

inputed yes so it's it's configurable so

play25:06

you could tell it hey you know use Seaborn so

play25:08

ah okay you could switch to Altair here

play25:10

which is an interactive web based like a

play25:13

python library on top of uh

play25:16

Vega-Lite you could use Matplotlib you could use

play25:18

ggplot so in theory as long as um it's a

play25:22

library that represents visualization as

play25:24

code in any programming language and we

play25:26

can execute that code you typically

play25:29

should be able to use that with LIDA

play25:31

yeah and then you can also select the

play25:32

type of model you want to use so on the

play25:35

backend LIDA supports multiple model

play25:37

providers OpenAI PaLM Cohere Hugging Face

play25:41

anthropic that kind of thing um and

play25:44

essential the llm is just like a

play25:45

generation engine so we can swap that

play25:47

out with multiple llms instead of build

play25:51

visualizations gotcha and do we need to

play25:53

add our own um API key to access those

play25:57

okay yes um

play25:58

so technically you would run this in

play26:00

your own environment and and essentially

play26:03

the first step would be to set up an llm

play26:06

that gets attached to all your requests

play26:09

gotcha yeah gotcha gotcha gotcha okay

play26:12

let's answer a few questions that we

play26:14

have and then I would love to know how

play26:15

can we feed it our own data um I know

play26:19

that these are example data in the demo

play26:22

yeah but let's answer a few questions

play26:24

before we show that so Carlos says is it

play26:27

possible to to control the quality of

play26:29

produced charts so we can automatically

play26:31

generate good charts like setting the

play26:34

predefined minimum quality levels yeah

play26:38

so um so controlling the lm's behavior

play26:43

um as a as an ml engineer you have

play26:45

multiple levers to sort of tweak or

play26:48

control the quality of whatever comes

play26:51

out of the llm so the LM is the primary

play26:53

driver here and the the lowest hanging

play26:56

frot is prompting so essentially um and

play27:01

LIDA already does this for you because

play27:03

if you if we look through um some of the

play27:06

code and so I'll just go through that

play27:09

lida

play27:11

components viz um I think viz

play27:15

generator so there there's like a base

play27:18

system prompt that sort of um tries to

play27:21

guide the llm as to how it should

play27:23

generate visualization and if if you did

play27:26

read this out carefully you'd see that

play27:27

like

play27:28

um know there's all this instructions

play27:30

around like writing perfect code for

play27:33

visualization it sort of says oh you

play27:36

know you must follow visualization best

play27:37

practices so that's like as an ml

play27:39

engineer this is your first lever to

play27:41

sort of control and ensure that like the

play27:43

quality of the chart is sort of good um

play27:47

the second thing you can do is um I I

play27:49

sort of showed this like self evaluation

play27:52

capability um where you could actually

play27:55

take a visualization and get the llm so

play27:58

is is a self evaluation module you get

play28:01

get it to sort of evaluate a

play28:03

visualization across multiple Dimensions

play28:05

so you could run this just as part of

play28:08

your pipeline before you show anything

play28:09

to your user so first of all you have

play28:11

good prompts you get a first version of

play28:13

visualization you pass it through like a

play28:15

self- evaluation module like this you

play28:17

get like a set of critiques and

play28:20

suggestions and in this case everything

play28:22

is sort of High um You probably you

play28:25

could then check you know if the

play28:27

visualization

play28:29

quality on these

play28:30

dimensions and then finally um you could

play28:33

if there were low versions you could

play28:35

then tell the LM okay use a self-repair

play28:37

module and fix all of these Dimensions

play28:40

um and and that's one that's like one

play28:42

General way to improve

play28:43

quality

play28:46

gotcha makes sense another question we

play28:48

have is from Levi he says uh does LIDA

play28:52

implements any hypothesis testing for

play28:55

the data supplied if data supplied is

play28:57

not normally distributed what measures

play29:00

does the llm take into consideration to

play29:03

fix it yeah um so one of the current

play29:07

limitations of LIDA is that it makes it

play29:11

it makes assumptions about data so it it

play29:13

it assumes that the data is in a fairly

play29:15

decent

play29:18

um is in a fairly decent state like does not need

play29:22

additional like

play29:25

um let's say data cleaning or that kind of

play29:27

thing

play29:28

um it assumes that data is sort of cleaned and

play29:30

in a good State and it's ready for

play29:32

visualization so that's one thing to

play29:34

note um the other thing around

play29:36

hypothesis testing is I

play29:39

think we will need to know if if you

play29:42

really wanted to focus on

play29:44

[Music]

play29:45

um hypothesis testing we would need like

play29:48

an additional like let's say some

play29:50

hypothesis module to S of looks at data

play29:54

say explore statistics of each column

play29:56

data and can rep represent that in some

play29:59

way to the to the the visualization

play30:01

generator at test time in the current in

play30:04

the current implementation though um one

play30:07

of the reasons why we include things

play30:09

like minimum maximum value of particular

play30:13

columns with draw samples is to sort of

play30:17

take a best effort approach to give the

play30:19

LM as much information about like sort

play30:22

of the overall General Behavior or the

play30:24

general properties of each column data

play30:26

and a lot of like examples of run

play30:28

sofware show that like the llm can

play30:31

utilize the sort of

play30:32

signals while it writes uh code to uh to

play30:36

generate visualizations but yes the

play30:38

first step is that adding Rich signals

play30:40

like this to the data summary which is

play30:43

attached to all the requests made to the

play30:45

llm is one way to ensure that the the

play30:47

llm makes fairly good decisions um

play30:51

another way would be to add an like an

play30:54

additional just like hypothesis

play30:56

exploration module that comes up with

play30:58

additional like directions that the

play31:00

model should focus on and sort of append

play31:02

that late to the summary yeah cool and

play31:06

um in regards to the um I guess the

play31:11

model testing or like analyzing its own

play31:14

work how do we swallow the emotional

play31:17

damage of the AI telling us that the

play31:19

code is not

play31:21

good that's

play31:24

funny that was just a funny question

play31:26

that I saw yeah

play31:28

yeah so was there a question about the

play31:30

llm evaluating

play31:33

itself um no but how does it do that um

play31:37

what how how does it yeah yeah that's an

play31:39

important question so so there's

play31:41

something a little fishy about like an

play31:43

llm evaluating its own code yeah

play31:46

definitely is fishy definitely an

play31:48

important question instead of

play31:50

triage so um visualization quality is a

play31:55

fairly subjective thing and so things

play31:57

like it's itics um en coding so some of

play32:01

this those aspects are um are subjective

play32:04

so ideally you know we would need like

play32:07

some human expert to do that but you

play32:11

know in practice right you want an

play32:13

approach that's

play32:15

offline and um and the the general the

play32:19

general current approach is to use um a

play32:24

very capable model your most capable

play32:26

model as evaluator and so mostly because

play32:29

let's say your most capable model is

play32:31

probably very expensive and you probably

play32:32

can't use it for all of the steps in the

play32:35

pipeline so you probably can use it

play32:38

for hypothesis summarization hypothesis

play32:41

Generation all that stuff and you would

play32:44

Reserve that model as your evaluator and

play32:47

it can sort of evaluate the quality of

play32:50

the visualization generated by smaller

play32:52

models so right now the main thing you

play32:55

will do is you just Reserve in this case

play32:57

I ideally you would use GPT-4 or the most

play33:00

capable version of GPT-4 as your

play33:03

evaluator and um you would use GPT-3.5 or

play33:09

let's say some models from hugging face

play33:12

um as the core model that drives all the

play33:15

other steps um and and outside of that

play33:19

you you would use a human evaluator um

play33:21

there are no easy there's no easy like

play33:23

there's no easy easy other approach to

play33:26

sort of evaluate quality of visualiz

play33:28

conditions one additional thing that's

play33:30

sort of new so when I worked in this um

play33:33

there were no multimodal models and so

play33:36

what that means is that um now with mult

play33:38

multimodel models um as opposed to just

play33:42

giving the llm the representation of the

play33:45

visualization as code critique you could

play33:47

also take both the image of the

play33:50

visualization and the code and give that

play33:53

to a multimodel model and now you can

play33:56

get like evaluations that look at actual

play34:00

representation of the visualization in

play34:03

pixel space so what does it look like

play34:05

and there are class of problems that can

play34:06

you can only detect quality issues you

play34:08

can only detect in pixel space things

play34:10

like overplotting or axes that overlap or

play34:15

that sort of thing maybe text that's

play34:17

just like very difficult to read that

play34:20

sort of thing so exploring the

play34:22

multimodal like self evaluation approach

play34:24

will probably improve the performance of

play34:26

a system like this

play34:29

gotcha nice and so like how would we how

play34:33

would we ask LIDA to I guess evaluate and

play34:37

chart our own

play34:38

data come again what does that process

play34:41

how do we feed it our own data what does

play34:43

that process look like so from the UI

play34:46

perspective you could literally just

play34:47

drag a file in here um a JSON or a CSV

play34:53

file and fun fact should we try to get

play34:55

something from Kaggle real fast and yeah

play34:57

do I do that see what

play35:00

happens

play35:02

set um let's

play35:06

see um which one should

play35:11

we emotions I don't

play35:15

know um yeah let's take a

play35:19

look

play35:25

oops see how many columns it's got just

play35:29

three

play35:30

columns ah probably not that interesting

play35:33

not that one

play35:35

yeah

play35:37

um top YouTubers worldwide okay let's

play35:41

see that see see how many columns does

play35:44

this have okay nine columns let's try

play35:45

let's try this yeah let's try that one

play35:47

I'm just

play35:56

gonna Okay so we have a youtubeers the

play35:59

CSV and if all goes well I will just

play36:02

drag that in

play36:09

here oh

play36:13

nice um where did it go okay so um the

play36:18

first thing it's done is that like it's

play36:20

come up with like you

play36:23

know let's start with this let come up

play36:26

with um

play36:29

like a representation of the data and so

play36:32

let's see what kind of uh questions he

play36:34

came up with says what's the

play36:36

distribution of subscribers among the

play36:38

top 1,000

play36:39

YouTubers and let's switch a model to a

play36:43

capable model let's

play36:45

see so like Victor of course you're

play36:47

going to review the work that the AI did

play36:50

right but like this is amazing like

play36:54

yeah I'm saying of course you're going

play36:56

to review the work that the AI did but

play36:58

like this is really amazing like it did

play37:00

a summary this is like okay here's what

play37:02

it is about and then it says okay here

play37:04

are some questions then it says okay

play37:06

here are some charts yeah it says what

play37:08

what is the distribution of subscribers

play37:10

across different categories um says what

play37:14

countries have the highest average

play37:15

reviews is what is the correlation

play37:18

between average likes and comments was

play37:20

the distribution of content types

play37:23

um it looks like we having some errors

play37:27

let's try another another

play37:33

one so one thing that could happen is in

play37:36

some in a small fraction of cases the

play37:38

llm might come up with um questions that

play37:41

maybe they're not sufficiently grounded

play37:43

in the data and so maybe there might be

play37:45

like columns that like

play37:47

um okay let's see what this says which

play37:50

countries have the highest average

play37:52

reviews okay so United States Japan

play37:56

Spain turkey

play37:58

Iran okay this looks sensible to me try

play38:01

another one what is the correlation

play38:04

between um average likes and average

play38:08

comments oh that's a good

play38:11

question yeah look at that get a

play38:14

plot

play38:16

yeah let's let's do something fun let's

play38:19

say can you add a line of best fit to

play38:23

the plot to the plot I have no idea what

play38:26

it would do but

play38:27

let's

play38:28

see so while this is running were there

play38:31

any other questions yes yes uh someone

play38:36

someone asked if LIDA is built on top

play38:39

of autogen

play38:42

Studio I'm not sure what AutoGen Studio is

play38:45

yeah yeah so uh it's it's a different

play38:47

project that I've been working on and so

play38:50

I think maybe what they're referencing

play38:51

is that the color scheme looks similar

play38:54

so AutoGen Studio just came out like two

play38:56

months ago LIDA is about two years old

play38:59

so no it was not built on AutoGen

play39:02

studio awesome and someone was asking if

play39:06

you can remind them how to access the

play39:07

demo

play39:09

website yeah so he opened up the he

play39:11

opened up the project in a code space

play39:14

yeah so essentially I did spin up the

play39:16

project in the code space and so the the

play39:19

overall approach will be to go to the

play39:21

LIDA repo so I I don't have a hosted

play39:24

version of this um you would need to

play39:26

like install it locally on your machine

play39:28

and the way you would do that would be

play39:30

that you would run pip install um pip

play39:33

install

play39:34

lida um if you had it previously you

play39:37

might need to update some some libraries

play39:39

that it depends on um you would make

play39:41

your OpenAI API key so by default it

play39:44

uses an open AI API key um you need to

play39:47

set that up in your environment and then

play39:50

you run this command here that says

play39:52

lida ui --port

play39:55

8080 and so this will just spin up

play39:57

um the interface on Port 880 on your

play40:01

local machine um and but however you you

play40:05

actually don't

play40:06

need you could also run it as a notebook

play40:09

so there's a button here that says open

play40:10

up opening collab and so essentially you

play40:13

could run this in the notebook

play40:15

completely so this just like shows the

play40:17

python the python API it shows how to uh

play40:22

set up multiple llm providers and all of

play40:26

the things we just showed in the UI you

play40:27

could do that in the notebook right so

play40:29

for example to create a bunch of goals

play40:31

so first to summarize the data you just

play40:33

say lida.summarize and you point it

play40:35

at your data um pretty straightforward

play40:38

to generate a bunch of goals say lida.

play40:40

goals um then you can sort of look

play40:44

at all the goals that it comes up with

play40:46

you could um generate goals based on a

play40:49

Persona and then for each of these goals

play40:51

you could ask it to generate the

play40:53

visualization um you could generate you

play40:57

could

play40:57

just ask your own type of question get a

play41:01

visualization out of that and then all

play41:03

the things around like asking questions

play41:06

modifying an existing visualization

play41:09

based on text chat um so you could do

play41:12

that here in this case we say like make

play41:13

the chart height and width equal change

play41:15

the color the chart to Red translate the

play41:17

chart to Spanish you get that out then

play41:22

explanations nice

play41:25

recommendations and then if you do have

play41:27

a GPU you could actually do the whole

play41:29

like

play41:30

um infographics thing where we could

play41:33

take an um an existing

play41:36

visualization and use a a diffusion

play41:39

model maybe some like stable diffusion

play41:42

to sort of go from um to go from um just

play41:48

a regular visualization to let's say

play41:50

stylized version of of of a

play41:51

visualization so I think I have like

play41:54

some some examples here so this is this

play41:57

part is still experimental so um it's

play42:00

not the core focus of LIDA but um it'll

play42:03

be great to see contributions people

play42:05

sort of figuring out how to make this a

play42:07

bit more more stable explore more

play42:09

examples here yeah awesome so we have

play42:13

more

play42:14

questions um Ugo said or Yugo says can I

play42:17

stream the intermediate steps as llm is

play42:20

doing its

play42:21

processing um so so yes so that might

play42:25

require some some update is the core

play42:27

Library um so right

play42:30

now okay

play42:32

so for most of the steps right um you

play42:36

really can't consume uh essentially

play42:39

let's assume that like we we're talking

play42:41

about the visualization module right

play42:43

we're generating code to create

play42:45

visualization um the entire code needs

play42:47

to be generated before you can actually

play42:49

execute it now we can execute partial

play42:51

code and all that so so technically you

play42:53

still need to like generate the entire

play42:55

thing but let's assume that you wanted a

play42:57

user experience where you were sort of

play42:59

showing the user what exactly the LM was

play43:01

doing at a token streaming level um that

play43:04

might require some modifications to the

play43:06

online library but the good news is it's

play43:08

open source you technically can do

play43:11

anything with it

play43:14

true okay great can you use a local llm

play43:18

yes so if you if you recall the in the

play43:21

notebook I just shared um LIDA uses a

play43:24

library I wrote something called llmx

play43:28

and technically it lets you um use

play43:31

multiple llm providers so you can use

play43:34

OpenAI Cohere PaLM or even Hugging Face

play43:37

model

play43:39

and in theory um a lot of local llms can

play43:44

be set up using tools like let's say LM

play43:47

Studio Ollama um vLLM and what that gives

play43:50

you is that it lets you spin up uh an

play43:53

API endpoint an open AI compatible

play43:57

like llm endpoint on your local machine

play44:00

and so that's what I would actually

play44:02

recommend so you take any of these nice

play44:04

Hugging Face models like Zephyr 7

play44:06

billion parameter models 13 billion par

play44:09

parameter models and you use like a

play44:11

server like Ollama or LM Studio and you

play44:14

spin up um an OpenAI compatible end

play44:16

point and you will just use that exactly

play44:18

how you would use like an open AI API so

play44:23

nice that's pretty cool that's pretty

play44:25

pretty cool okay more questions more

play44:27

questions uh does Lia support realtime

play44:31

data

play44:33

stream

play44:36

um real time data

play44:39

stream yeah so so technically you know

play44:42

if you think about the visualization

play44:44

problem you do need like a big chunk of

play44:46

data right in order to visualize it you

play44:48

need like a sufficient sample in order

play44:51

to to visualize it so um as as the ml

play44:55

engineer you have a lot of latitude and

play44:57

how you might build something like this

play44:59

if you do have like a data source that's

play45:00

streaming and so it comes in in real

play45:02

time you can figure out some sort of

play45:04

chunking

play45:06

strategy where you

play45:09

essentially um let's say within some

play45:11

sort of window you generate a chunk you

play45:14

visualize that chunk and you update um

play45:16

you update data source as the as the

play45:19

chunks or as more data becomes available

play45:21

and you generate visualizations I mean

play45:23

in theory you probably want to fix the

play45:25

visualization

play45:27

and just rerun the exact same code once

play45:30

the um once the data becomes available

play45:33

or once more data becomes available yeah

play45:36

yeah that makes sense that makes sense

play45:38

to me uh another question is what do you

play45:42

personally use LIDA for the

play45:44

most oh yeah um I'm a visualization

play45:48

expert um spent like many many years

play45:51

studying data visualizations um and so

play45:58

I I might use LIDA let's say I I have

play46:02

like um some data and I have like a talk

play46:05

I want to give and I just maybe I had

play46:08

like a version of this a this interface

play46:10

running running running running locally

play46:14

I might just drag in my data and um

play46:19

essentially ask for a bunch of

play46:20

visualization may put that in a slide

play46:22

deck that kind of thing um so that's

play46:25

kind of like my own workflow

play46:27

um and I'm happy to like chat with the

play46:30

interface request changes here and there

play46:33

until I get like the perfect

play46:35

visualization yeah but a tool like this

play46:37

my my my intuition is that a tool like

play46:39

this um is most valuable to users who

play46:43

really have no visualization experience

play46:47

like um yes really without without like

play46:51

essentially they're not exactly sure

play46:52

about like hey what questions do we even

play46:54

ask yes um

play46:57

and for each of these questions like why

play46:59

is one better than the other like is

play47:02

this the best visualization for for this

play47:05

data so I think for users like that um a

play47:07

tool like ligher has the most um most

play47:10

has the most like delter in terms of

play47:12

like

play47:13

benefit I agree like I have a workshop

play47:16

coming up in um in two weeks and the

play47:20

first part of the workshop is to um you

play47:23

know like clean and visualize data and

play47:25

like build a model

play47:27

uh and I think I'm I'm going to be using

play47:29

laa because uh the audience you know

play47:31

they're not they're not machine learning

play47:34

or data scientist folks and I think this

play47:36

just makes it a lot more accessible so

play47:39

I'm already thinking about like a great

play47:40

use case would be like workshops and

play47:42

like teaching um because it's a lot more

play47:45

accessible and you can digest it a

play47:46

little bit better yeah so so at this

play47:49

point you know I've talked about all the

play47:51

things you can do with a tool like this

play47:53

and yeah it's important to note that

play47:55

it's not magic it's has a lot of

play47:57

limitations right um and I I I also find

play48:00

that it's important to talk about like

play48:02

when it doesn't work and um what are the

play48:05

current limitations and so it's still

play48:09

there are still cases where you know

play48:10

let's say you gave very very complex

play48:13

instructions um the current like could

play48:16

execution path um might you know lighter

play48:20

might uh light might right code that has

play48:22

like logic errors in there you try to

play48:25

run it and then fills you see

play48:27

um now there's a way to solve problems

play48:30

like that right so you can just add like

play48:31

a an agent add like an agent some sort

play48:34

of agent based Behavior where you take

play48:36

the compile errors and you give that

play48:37

back to the model then you ask it hey

play48:40

repair do some self-repair based on that

play48:42

so that's one approach I me latency

play48:44

constraints there so you might see

play48:45
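That error-feedback loop is easy to sketch in plain Python. Here `generate_code` is a hypothetical stand-in for whatever LLM call you are using, and the retry count and prompt wording are arbitrary choices, not anything LIDA prescribes.

```python
import traceback

MAX_ATTEMPTS = 3  # each retry adds latency, as noted above


def generate_code(prompt: str) -> str:
    """Hypothetical LLM call that returns visualization code as a string."""
    return "print('chart placeholder')"  # stand-in so the sketch runs


def run_with_self_repair(task_prompt: str) -> str:
    code = generate_code(task_prompt)
    for _ in range(MAX_ATTEMPTS):
        try:
            exec(compile(code, "<generated>", "exec"), {})
            return code  # ran cleanly, keep this version
        except Exception:
            # Feed the traceback back to the model and ask for a repaired version.
            error = traceback.format_exc()
            code = generate_code(
                f"{task_prompt}\n\nThe previous code failed with:\n{error}\n"
                "Return a corrected version of the code."
            )
    return code  # best effort after the retry budget is spent
```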

The second thing is that not all models are equal. In my demo here I'm using the GPT-series models; if you use smaller models, 7-billion-parameter or 13-billion-parameter models, the difference in behavior can be pretty extreme. And that's an open area of research: how can we build better small models that do better at a task like this? It would make the entire space even more accessible, so these are potential areas of contribution.

The third thing is the quality of the visualization, which relates to the grammar you choose. Most of the time I use Seaborn, because it's one of the most popular visualization packages out there; there are so many examples of it on GitHub, and these models have seen most of GitHub, so they're quite proficient at writing with these grammars. If instead you point it at, say, a proprietary visualization tool your company has, LIDA might not work very well there. And there are some metrics in the paper; there's a paper for LIDA about how you might measure and benchmark systems like this. These are things that, as an ML engineer, you should care about.

you we should you should care about yeah

play50:09

that makes complete sense that makes

play50:10

complete sense definitely not magic

play50:13

definitely not you know automatically

play50:15

like you shouldn't automatically trust

play50:17

the output but it's definitely a good

play50:20

first start especially when it comes to

play50:23

understanding large data sets so I have

play50:26

two more questions that I can um that I

play50:29

can ask and then we're going to wrap up

play50:31

so the first question is what input data

play50:34

source is supported so we know CSV is

play50:37

supported we know Json is supported yeah

play50:40

uh Carlos said what about raser data or

play50:42

other binary data

play50:46

So essentially the short answer is that whatever pandas can process is supported.
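In other words, anything you can first load into a DataFrame can be handed to LIDA. A small sketch, assuming the lida Manager accepts a pandas DataFrame as well as a file path (check your installed version) and that the Parquet file name is just a placeholder:

```python
import pandas as pd
from lida import Manager

lida = Manager()  # default text generator; typically expects an API key in the environment

# Anything pandas can read (Parquet, Excel, SQL results, ...) becomes a DataFrame first.
df = pd.read_parquet("events.parquet")  # placeholder file name

summary = lida.summarize(df)   # assumes DataFrame input is accepted
goals = lida.goals(summary, n=2)
```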

We standardize on pandas. Now, for many business use cases this is insufficient; you probably have data at a scale much larger than what pandas supports, so essentially the solution would be to write your own data ingest engine. So there's another opportunity for contribution there: rewriting how we fundamentally process data in LIDA. It's an open source project, so there's a lot of potential to make that type of contribution. There's also supporting data connectors: most production deployments don't have data just lying around in CSV or JSON files; the data lives in a lakehouse, or you query Postgres, SQL, MongoDB, that sort of thing. So figuring out a good general data connector that lets you just import your data into a tool like this is something not yet supported in LIDA, but it's a good idea. And I've heard a lot of good things about Project Ibis, which is another upcoming open source project too.

project e is another upcoming open

play51:59

source project too yeah project what e

play52:03

or IB I I I just know IB project data

play52:10

format I think it's this one is it the

play52:14

one with the little geese oh yeah this

play52:17

one it's it's meant to be like a

play52:20

portable data frame library and the idea

play52:23

is that

play52:24

like your data might be dob click housee

play52:28

Beery pandis you can do the exact same

play52:32

pandas operations on the data sets

play52:34

without worrying about where the data

play52:35

comes from ah nice nice it's interesting

play52:39
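For example, with Ibis the same dataframe-style code can run against different backends, and the result can be materialized as a pandas DataFrame for a tool like LIDA. A small sketch, where the DuckDB file, table, and column names are placeholders:

```python
import ibis

# Connect to a backend; swapping this line for BigQuery, ClickHouse, etc.
# leaves the query code below unchanged.
con = ibis.duckdb.connect("warehouse.duckdb")  # placeholder database file

orders = con.table("orders")                   # placeholder table name
monthly = (
    orders.group_by(orders.month)              # placeholder column names
    .aggregate(total=orders.amount.sum())
)

df = monthly.execute()  # materialize as pandas, then hand it to LIDA
```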

Ah, nice. It's interesting, because someone else was asking if they could just take data from, say, SAP or ServiceNow into LIDA.

Yeah, this might be the way to think through a problem like that.

Cool, that's really great.

Final question for the evening: any chance to submit two snapshots of the data and ask LIDA to compare them?

So, do you mean snapshots of the charts, or snapshots of the data itself?

Yeah. So LIDA is not really designed to compare data; it's mostly meant to take a data set as input and provide a visualization as output. If what you meant was to compare the quality of two different visualizations, what you could do is get LIDA to generate two visualizations, run the evaluation module on both of them, and look at the scores.

Oh, interesting.

Yeah, so essentially you could explore the examples in the notebook and do exactly that: get two visualizations, then for each of these visualizations run LIDA's evaluate module and compare the scores.
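A rough sketch of that workflow with the notebook-style API. It assumes the Manager exposes an evaluate method that takes the generated code and a goal and returns per-dimension scores; the dataset name and parameter names are assumptions, so check them against the notebook examples in your installed version.

```python
from lida import Manager, TextGenerationConfig

lida = Manager()  # default text generator; expects an API key in the environment
config = TextGenerationConfig(model="gpt-3.5-turbo", temperature=0)

summary = lida.summarize("youtubers.csv", textgen_config=config)  # placeholder dataset
goals = lida.goals(summary, n=2, textgen_config=config)

charts_a = lida.visualize(summary=summary, goal=goals[0], textgen_config=config)
charts_b = lida.visualize(summary=summary, goal=goals[1], textgen_config=config)

# Evaluate each generated chart's code against its goal and compare the scores.
eval_a = lida.evaluate(code=charts_a[0].code, goal=goals[0], textgen_config=config)
eval_b = lida.evaluate(code=charts_b[0].code, goal=goals[1], textgen_config=config)
print(eval_a, eval_b)
```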

Nice, that's pretty cool. I like what this person said: it's so great to treat visualization as a language; there's really great multimodal thinking going on here. I think that's so true; I'm impressed by the idea of treating this as a language and translating it into a chart. This has been really great, Victor, a really informative session. I know everyone watching has enjoyed seeing project LIDA; I know I have, and I'm going to be digging into it, especially since this is what I'm learning right now. Thanks so much for taking the time. The final comment I saw is: imagine how middle managers are going to believe that data mining now only takes five seconds because of AI. I thought that was funny. But thanks, everyone, for joining us, and thank you so much, Victor. Any last words you want to share with the audience on how to contribute and support LIDA, all that jazz?

Yeah, so two things. LIDA is open source, and I think the link has been shared there, so give it a star. There are a lot of people who have started contributing; right now we have 12 contributors, and you could be contributor number 13. If you go to the issues, you could just look through all of them. I probably should do a better job of labeling them, but there are a few where, if there's a complaint that something doesn't work well, that's an excellent place to start helping. It might just be telling others, hey, here's how you fix that problem; it doesn't even have to be code contributions. You could look through the documentation, find issues there, make changes, open PRs; it's all welcome. I really look forward to seeing more of you share your ideas, and the things I talked about here today, like enabling multimodal evaluation and enabling multiple data connectors, say using Project Ibis, are all potential areas of contribution. And that's it. I also write a newsletter, mostly about how I think about AI, machine learning, and design problems, and there are a couple of posts there; I think there's one about LIDA. Let me see if I can put that in the chat; that could be another place to learn more about the project.

Yeah, I'm popping the newsletter link in the chat right now. Google superpowers.

But that's awesome. Thanks so much, Victor, for joining us today. I hope to stay in touch with you, especially as I'm learning this stuff, and you're kind of an expert in this area, or rather you are an expert in this area, not kind of. But yeah, thanks so much, everyone, for joining us; this has been a really good, really informative session, I've learned so much, and we'll see you next week.

Thank you, thank you for having me. Bye.
