Open Source Friday with LIDA - Generate Infographics with LLMs
Summary
TLDR In this video, Dr. Victor Dibia, a researcher at Microsoft, introduces his open source project LIDA. LIDA is a tool that automatically generates visualizations and infographics from data, letting users visualize data easily without needing a background in data science or machine learning. The demo uses a dataset of YouTubers to show how LIDA summarizes data, generates questions, and creates visualizations. The discussion also covers LIDA's limitations and possible improvements, along with ways to contribute to the project.
Takeaways
- 😀 GitHub developer advocate Kedasha and Microsoft research software engineer Dr. Victor Dibia appear to introduce the open source project LIDA.
- 🔍 LIDA is an open source project that automatically generates visualizations and infographics from data; it grew out of work Victor began in 2018.
- 📊 Through data summarization and goal (hypothesis) generation, LIDA automatically generates the code users need to create visualizations.
- 🎓 Victor has a traditional software engineering background. To understand the human side of software, he earned a PhD in Information Systems, applying theories from user behavior and psychology to study how people make decisions as they use tools and interfaces.
- 🤖 Victor's earlier project treated visualization generation as a sequence prediction problem, training sequence models such as RNNs to learn to translate raw data input into visualizations.
- 📈 By visualizing data, LIDA reduces cognitive load and makes it possible to extract insights from data quickly.
- 🛠️ LIDA does not preprocess or clean data, but when the data is clean and well formatted it can generate visualizations with high reliability.
- 🌐 LIDA provides a Python API and a web API, letting users point it at a dataset and generate visualizations.
- 🔧 To help ensure quality, LIDA includes a self-evaluation module and self-repair capability that score generated visualizations and suggest improvements.
- 👥 LIDA is considered highly valuable for data scientists and visualization beginners alike; even users without expertise can generate questions from data and visualize it.
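The "visualization generation as sequence prediction" idea in the takeaways can be made concrete: if both a data sample and a chart specification are plain text, a sequence model can learn to translate one into the other, just like machine translation. Below is a minimal illustrative sketch; the Vega-Lite field names (`mark`, `encoding`) are real, but the tiny dataset is invented for illustration:

```python
import json

# A sampled "source sentence": the first two rows of a dataset, as text.
data_sample = json.dumps([
    {"date": "2020-01-02", "price": 3257.85},
    {"date": "2020-01-03", "price": 3234.85},
])

# The "target sentence": a Vega-Lite spec, which is also just text (JSON).
vega_lite_spec = {
    "mark": "line",
    "encoding": {
        "x": {"field": "date", "type": "temporal"},
        "y": {"field": "price", "type": "quantitative"},
    },
}
target = json.dumps(vega_lite_spec)

# A sequence-to-sequence model (an RNN in the earlier work, an LLM in LIDA)
# is trained on many such (data_sample, target) pairs, exactly like
# English-French sentence pairs in machine translation.
pair = (data_sample, target)
print(pair[1])
```

Representing the chart declaratively (rather than as imperative plotting calls) is what makes this framing possible: the whole visualization fits in one short, structured string.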
Q & A
What is Open Source Friday?
-Open Source Friday is a weekly show hosted by Kedasha, a developer advocate at GitHub, to celebrate maintainers, contributors, and projects.
Who is this episode's guest?
-The guest is Dr. Victor Dibia, a principal research software engineer at Microsoft.
What is the project LIDA that Dr. Dibia presented?
-LIDA is an open source project that uses generative AI and large language models to automatically generate data visualizations and infographics.
What is Dr. Dibia's educational background?
-He earned a bachelor's and a master's degree in software engineering, a master's in information networking at CMU, and then a PhD in Information Systems at City University of Hong Kong.
What is LIDA's main function?
-Given a dataset, LIDA automatically generates visualizations based on that data, making it easier for users to understand the data visually.
How does LIDA differ from other AI tools?
-LIDA provides a visualization-specific user experience, offers a Python API and a web API, and generates and executes code, showing the result to the user.
What data formats does LIDA support?
-LIDA supports data formats such as CSV and JSON.
How does Dr. Dibia use LIDA himself?
-He uses LIDA to visualize data and add the charts to presentation slides.
What state should data be in when using LIDA?
-Ideally, data should already be reasonably clean and well organized before it is given to LIDA.
How can people contribute to LIDA?
-LIDA is open source; you can star the repository on GitHub and contribute bug fixes, documentation improvements, and more.
Outlines
🎥 Introducing the open source project
Kedasha, a developer advocate at GitHub, opens Open Source Friday, which celebrates contributors and maintainers. Guest Dr. Victor Dibia, a principal research software engineer at Microsoft, introduces LIDA, a project that uses generative AI and LLMs to create data visualizations and infographics.
👨🎓 Dr. Victor Dibia's background and the path to LIDA
Victor has a traditional software engineering background and later earned a PhD in Information Systems to study the human side of software tools and systems. An internship at IBM Research sparked his interest in AI, particularly multimodal experiences that combine multiple interfaces. He also worked on Data2Vis, a project that automated data visualization, and LIDA is an open source project that grew out of that experience.
📈 Overview and features of LIDA
LIDA is a tool that creates infographics from data, automatically generating a data summary and visualization goals. It does not clean data; it assumes the data is already in decent shape. LIDA provides a Python API and a web API, supports building visualization-specific user experiences, and aims to generate visualizations with high reliability.
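Based on LIDA's README at the time of writing, the Python API centers on a `Manager` object, and the summarize → goals → visualize flow described above maps directly onto method calls. The sketch below assumes `pip install lida` and an `OPENAI_API_KEY` in the environment, and `data/cars.csv` is a placeholder path; the calls are guarded so the block degrades gracefully when either is missing. Treat this as an illustrative sketch and check the current docs for exact signatures:

```python
import os

ran = False
try:
    # Requires: pip install lida
    from lida import Manager, llm
    have_lida = True
except ImportError:
    have_lida = False

if have_lida and os.environ.get("OPENAI_API_KEY"):
    # These calls follow the LIDA README (Manager, llm, summarize,
    # goals, visualize); "data/cars.csv" is a placeholder dataset path.
    lida = Manager(text_gen=llm("openai"))      # LLM backend is swappable
    summary = lida.summarize("data/cars.csv")   # two-stage data summary
    goals = lida.goals(summary, n=3)            # candidate questions/goals
    charts = lida.visualize(summary=summary, goal=goals[0],
                            library="seaborn")  # code-gen + execution
    ran = True
else:
    print("lida or OPENAI_API_KEY unavailable; skipping LIDA calls")
```

The `library` parameter is how you switch between plotting backends (e.g. matplotlib, Seaborn, Altair), and the `llm(...)` helper is how you swap model providers.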
🛠️ LIDA demo and usage
In the demo, LIDA is installed from the GitHub repository, a new GitHub Codespace is started, and the UI is launched. LIDA is pointed at a dataset, produces a summary, generates visualization questions, and then generates and executes code. Users can select a question and view the generated visualization.
🔧 LIDA's flexibility and code editing
Beyond generating and executing code, LIDA can modify code in response to user requests. For example, when asked to convert a chart to a bar chart, LIDA rewrites the code and the execution engine performs error correction.
📊 Visualization quality and self-evaluation
To help ensure visualization quality, LIDA also includes self-evaluation: it scores visualization code along multiple dimensions and suggests improvements. When needed, a self-repair module can improve the visualization.
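The evaluate-then-repair loop described here can be sketched with stand-ins for the LLM. The dimension names follow the talk (bugs, transformation, visualization type), but `fake_llm_evaluate`, `fake_llm_repair`, and the threshold are invented stubs for illustration, not LIDA's API:

```python
REPAIR_THRESHOLD = 4.0  # minimum acceptable per-dimension score (assumed)

def fake_llm_evaluate(code: str) -> dict:
    # Stand-in for an LLM scoring visualization code across dimensions
    # such as bugs, data transformation, and visualization type.
    return {"bugs": 4.5, "transformation": 5.0, "type": 3.5}

def fake_llm_repair(code: str, critiques: dict) -> str:
    # Stand-in for an LLM rewriting code to address low-scoring
    # dimensions (e.g. switching a bar plot back to a line plot).
    return code.replace("bar", "line")

def evaluate_and_repair(code: str) -> str:
    scores = fake_llm_evaluate(code)
    if any(s < REPAIR_THRESHOLD for s in scores.values()):
        return fake_llm_repair(code, scores)
    return code

fixed = evaluate_and_repair("plt.bar(dates, prices)")
print(fixed)  # → plt.line(dates, prices)
```

Running evaluation before anything is shown to the user, and repairing only when a dimension falls below a threshold, is the general pipeline Victor describes later in the interview.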
🔄 Data handling and real-time data
LIDA supports anything pandas can process and can also handle real-time data streams: by building a data chunking strategy, visualizations can be updated each time new data arrives.
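The chunking idea for real-time data can be sketched as a running summary that is folded forward as each chunk arrives; a downstream step could then regenerate the visualization from the updated summary. All names below are illustrative, not LIDA code:

```python
def update_summary(summary: dict, chunk: list) -> dict:
    # Fold one chunk of newly arrived numeric values into a running
    # summary; a re-visualization step would consume this afterwards.
    lo, hi = min(chunk), max(chunk)
    return {
        "count": summary["count"] + len(chunk),
        "min": min(summary["min"], lo) if summary["count"] else lo,
        "max": max(summary["max"], hi) if summary["count"] else hi,
    }

summary = {"count": 0, "min": 0.0, "max": 0.0}
for chunk in [[3257.9, 3234.9], [3246.3], [3237.2, 3253.1]]:
    summary = update_summary(summary, chunk)
print(summary)  # {'count': 5, 'min': 3234.9, 'max': 3257.9}
```

Keeping the summary incremental means the LLM prompt can be rebuilt cheaply per chunk instead of re-scanning the whole stream.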
🤖 Limitations and a call for contributions
LIDA still has limitations, and complex instructions can produce logic errors. Community contributions are welcome, with plenty of room to add data connectors, support multimodal evaluation, and more.
🌐 Data sources and visualization comparison
LIDA supports data sources such as CSV and JSON, and anything pandas can handle. It can also compare two different snapshots of data, including comparing visualization quality.
📬 Contributing and staying in touch
LIDA is an open source project that actively welcomes stars and contributions: improving documentation, proposing new features, adding data connectors, and more. Dr. Dibia also publishes a newsletter sharing information about LIDA and his thinking on AI, machine learning, and design.
Keywords
💡Open source
💡GitHub
💡Generative AI
💡LLM (Large Language Models)
💡Data visualization
💡API (Application Programming Interface)
💡Machine learning
💡Multivariate analysis
💡User interface
💡Codespaces
Highlights
Introduction to the open source project LIDA, built by Dr. Victor Dibia, principal research software engineer at Microsoft, for creating data visualizations and infographics with generative AI and large language models (LLMs).
LIDA predates ChatGPT, demonstrating how generative AI and LLMs can be used to automatically generate data visualizations.
Victor Dibia's background: from traditional software engineering to a PhD in Information Systems, to an interest in AI and multimodal interaction.
The Data2Vis project at IBM Research used sequence models to automatically generate data visualizations, treating visualization as a sequence prediction problem.
Using RNNs and the declarative visualization language Vega-Lite, data and visualizations are represented as text and a model is trained to translate between them.
How visualization makes data accessible, reducing the cognitive burden of extracting insights from data.
Creating good visualizations takes skill and effort; LIDA aims to simplify this work by automating the process.
LIDA's workflow: from data summarization to generating candidate visualization goals, to producing complex visualizations with zero effort from the user.
LIDA's limitation: it assumes data is already clean and well formatted, and does not focus on data cleaning.
LIDA's Python API and web API, and its research focus on visualization-specific user experiences.
A demo of LIDA, including installation, the web interface, and integration with GitHub Codespaces.
How LIDA automatically generates questions and visualizations from an example dataset such as the S&P 500.
LIDA's interactivity: users can modify visualizations through text requests, for example converting a chart to a bar chart.
How LIDA uses an LLM for self-evaluation and for repairing errors in visualizations.
LIDA supports multiple visualization libraries and model providers, and how to configure them.
How LIDA handles real-time data streams, and the potential and challenges of real-time data visualization.
LIDA's open source nature, encouraging community contribution and collaboration, and how to get started contributing.
LIDA's current limitations and future directions, including multimodal evaluation and enhanced data connectors.
How LIDA can be used for teaching and workshops, making it more accessible to users without data visualization experience.
How LIDA helps users understand large datasets, and what makes it innovative in the data visualization space.
Transcripts
[Music]
hey everyone welcome to open source
Friday I'm cadesa and I work at the
GitHub as a developer Advocate and I'm
super excited to be here today open
source Friday as you know is the weekly
show that we have to celebrate
maintainers contributors and the project
that makes our life so much easier today
I'm super excited to have Victor Dibia
PhD who is here to show us all about how
to build uh visualizations and
infographics with generative AI and LLMs
with his project LIDA now Victor is a
principal research software engineer at
Microsoft and he does a lot of great
work with generative AI and LLMs and LIDA
is a project that he built last year
before ChatGPT came to the scene so I'm
super happy to have Victor here today to
tell us all about his project LIDA hi
everyone welcome to the show thanks so
much for joining us I'm gonna add Victor
to the stage and we will get right into
the tech hey
Victor hi um hi everyone great to be
here thanks for the
introduction of course super happy to
have you here Dr Dibia okay that's going
to be the last time I say
that but uh I think it's very impressive
that you have a PhD like that takes a
lot of work and like a lot of dedication
to you know education and
learning yeah well well thanks thanks
for the kind
words okay he's a little shy okay great
so today we're chatting about project
LIDA or so project LIDA from my
understanding is an open-source project
from Microsoft that Victor built last
year and it helps us to generate like
visualizations and infographics for our
data now I'm super excited to talk about
project LIDA because I've been learning
some machine learning stuff and I'm just
like okay yeah I have an appreciation
for this project I know what's going on
nice um yeah yeah but like tell us tell
us more about yourself actually um what
does it mean to be in the role that you
do and like how did you start you know
having an interest in generative Ai and
large language models yeah um so my
background is just traditional software
engineering so I did like a bachelor's
degree a masters mostly covering courses
in computer science software engineering
distributed systems I did a master's
degree in information
systems at uh information networking at
CMU Pittsburgh and um it's just covering
topics in software engineering
distributed systems um some bit of
security and um after that I did a
startup for a little while um moved to
West Africa to do that and then start
other university really enjoyed that
then I figured I also wanted to learn
about like the human aspect of like
software tools and systems and so I
ended up doing a PhD in Information
Systems um and so the information
systems is this field that looks at
um you know applying theories from user
Behavior psychology and using that sort
of understand how people make decisions
as they use tools and
interfaces and that was the PHD I did
that in Hong Kong actually um wow rigar
roll there um so I did that at City
University of Hong Kong and um towards
the end I got a chance to intern with a
group at IBM research uh a primarily HCI
group just again looking at like
interface design applying theories from
user Behavior decision making to study
that but then we worked very closely
with the applied machine learning group
at IBM research and that's really where
my AI sort of interest of started
started out so I first first of all
started to look at how we could apply
these models build like multimodal
interaction and so we we we worked on
things like room scale interaction a
person walks into a room there's a
camera array there's a 3D pointing
device there's a microphone array and
essentially we doing text to speech
speech to text and vision recognition
pointing gesture recognition and sort of
building that into multimodal like
experiences wow and yeah and then I
started out sort of applying things and
then I started figuring out hey you know
like it would be nice to sort of build
new models solve new interaction type
problems and in 2018 at IBM research one
of the things I worked on was a project
called Data2Vis and I'll just share my
screen if you want to put that up a
little um essentially this project was
in 2018 and the the the title was just
automatic generation of data
visualization using sequence-to-sequence
models so these were the state-of-the-art
models back then RNNs and what this
project just showed for the first time
was that we could represent
visualization generation as um as a
sequence-to-sequence prediction problem
and we could train specialized um
sequence-to-sequence models that could
take raw data and then generate
visualizations from
that if you represent the visualization
as text and so we could do this by just
sampling let's say the first two rows
from the data set and then we represent
the visualization as text and so in this
case we used like a a declarative
visualization language called Vega-Lite
and so essentially you could represent
your visualizations just as JSON
just like as a JSON object and we
took these pairs of sample data and
visualizations and we made a fairly big
data set out of that and essentially we just
took an RNN a sequence-to-sequence model
the exact type of model you would use to
train like a language translation model
so imagine that you would learn to
translate between English and
French you have all these all these
pairs of English and French and so we
just replace that with data and
visualization and this model learned to
translate between between that and then
oh if the data is big enough yeah just
just so I understand just so I
understand so it's kind of like seeing
visualizations as a language yes
exactly oh wow that's represented as
text as a as a just like like a language
in this case it's just a specialized
declarative language to represent
visualizations and you represent your
data again as text then you can get the
models to actually just at run
time you give them a sample of a data
set and they'll actually come up with
visualizations that are sort of grounded
or contextually grounded in that data
wow and so it was the first time we we
sort of showed that this was
possible so the next question is like I
don't know like why would you want to do
that um and in in general I have this
little slide that I I sort of use to
explain the topic and the key idea is
that like visualizations make data
accessible and so let's say you had like
a couple hundred thousand rows of data
um but typically you want to make
decisions from that right and so it just
turns out that the human mind has
evolved such that we can look at a chart
and we can make really fast conclusions
we can look at the chart and say like
hey know my sales tends to be highest in
the summer and so maybe I need to devote
more of my resources and my marketing
dollars all that kind of thing the
summer so you could just look at a chart
make a decision like that really really
fast so in general chart sort of reduce
the cognitive burden associated with
sort of extracting insights from data
super super important for many many
fields um however to create good
visualization it takes quite a bit of
skill and effort so first of all you
need to understand the data you need to come
up with good questions that takes some
data skill and when you come up with good
questions you know you need to represent
them as visualizations in some way right
what encoding do you
use do you use a bar chart a pie chart a
radar chart what's the appropriate
representation of that right then maybe
the data might not be in its best format
maybe you need to apply some
transformations to the data right maybe
you need to like combine to field to a
new field do some normalization all of
that and then finally when you do all of
that you need to figure out okay so how
do I actually create this visualization
um do I use Excel or Power BI or write
code so there's quite a bit of
effort there so the promise is um what
if we could build a tool that will
automatically understand the data in
this case let's say summarize it in the
case of Lia um what if we could based on
this summary of the data could generate
a bunch of potential visualization goals
and this is exactly what um data
scientists do with EDA yes exploratory data
analysis just but imagine that you get
this for free with zero effort and then
for each of these exploratory goals or
hypothesis can we generate complex
visualizations without any just with
zero requirements on the user and how
can we do all of this with high
reliability and and pretty much that's
what LIDA set out to
accomplish you know it it's it's so
interesting and and I'm so happy that
I've been studying and and it just so
happens that I started learning about um
like data cleaning and all all that jazz
um before our stream because knowing
that LIDA understands the data and then
can generate goals and hypotheses
automatically from the data that kind of
blows my mind because it takes so long
to you know like to like clean the data
and then like you see and then you're
just like okay what questions am I going
to ask about this and like what am I
trying to accomplish from all this it
takes so much time right like instead of
uh putting all that manpower or human
energy into that you can use LIDA to you
know generate those questions and then
visualize the data and then from then
you can you know go on to do like actual
actual things actual work um yeah so so
technically LIDA yeah so one small
limitation is LIDA doesn't exactly
focus in cleaning the data if the data
exists it makes an assumption the data
just exists in the potentially decent
format and then all the stuff like you
know come up with goals summarize it
accomplish high high reliability all of
that is after that and so data cleaning
is another nice research topic but
something I'd like to talk about at a
different time but that's a little out
of scope for LIDA gotcha so so so sorry
to interrupt but like essentially the
data should be clean before using LIDA I
mean technically if it's if it's not
clean you still get results it'll do its
best effort um LLMs are pretty cool
actually in that way they they they're
kind of robust to small amounts of noise
and so they'll do a best effort and if
we have some time you know we can go to
some random data set on Kaggle and just
see what LIDA does does this make sense okay yeah
how about that yeah
yeah so the whole idea is how do we
create a better visualization authoring
experience right um go from data to
natural language plus natural language
intent to Reliable visualizations but
you know a couple of the a few really
really Vigilant folks on on the internet
might be like you know but isn't that
what ChatGPT and Code Interpreter do
already um so there there there few
differences so the first thing is um
this project sort of started internally
as a hackathon at Microsoft Research so I
know Kedasha mentioned last year it was
actually two years ago August
2022 um and ChatGPT came out in in
December but you know this project sort
of showed how you could sort of take
that data generate like summaries um and
then sort of put an interface on top of
all that um generate code executed and
show the the result to a user and it it
sort of used the earlier OpenAI models
so the davinci series this space moves so
fast everybody has forgotten the davinci
series models but they really were the
state of the art like two years ago so
davinci instruct
um and and now once ChatGPT came
out essentially the GPT-3.5 Turbo models
it just made this app work a lot better
so some of the differences here is that
like it provides a python API and a web
API um it has a visualization specific
user experience sort of baked on top of
it and then it the paper and the
research around it sort of focuses on
metrics for evaluating this kind of um
um this kind of
setup and uh and so at this point I'd
like to probably switch to demo what you
think yeah sounds
great yeah so someone was just asking if
there were any examples that we can
share actually oh yeah yeah absolutely
yeah um so the starting point for um for
LIDA would be to go to the repo um
which is I think it's yeah it's already
in the chat and yes the first thing you
want to do is to install it um do like a
pip install lida and typically
if you have GitHub Codespaces you could
essentially just click on a code space
start out a new code space so I have one
actually running and so the the the
installation step would be to if you
were installing from source you would do
pip install -e . or pip install
lida I've kind of done that before
just to save us some some time and then
to spin up the UI you would say
something like um lida ui and
and essentially that just spins
um um a web interface that's sort of built
on top of LIDA and we'll come back
to the low level python API experience
um in a moment but let's start out with
the um let's start out with the with the
UI experience and so once once you
clicked on that you you went to the URL
that it provided go to demo the first
thing is you can point it at some data
right and so let's assume that we
pointed at the S&P 500 data set
a few things happen so the first thing
that happens is that there's a there's a
module called the
summarizer which sort of works in a
two-stage process as a two-stage process
in the first process we take everything
we know about like data sets and we
compute a bunch of Statistics so for
example we use like pandas to infer that
like the first column is a date we
compute the min and the max we generate
a bunch of samples from that and then we
compute number of unique samples so we
know how to do this with code and so we
do this first next we take that
representation and then we give that to
an llm and the LM does things like
adding a semantic type and a description
field and so here we know it's a string
but we also need to know the semantic
type it's it's a date and in other data
sets something might be just a string
but we need to know if it's a country or
it's heads of state and that kind of
thing then also we add a description for
each of these you know we use the LLM to use
you know best effort and just based on
its parametric knowledge come up with the
sort of annotations to improve the
representation of the data
set so why why might we want this it
just turns out even for humans the more
data you have about the each of the
columns and Fields the more interesting
grounded questions you can ask of that
of that data so that's the first piece
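The stage-one (code-only) summarization Victor just described can be sketched with the standard library. LIDA itself uses pandas for this stage, and a second LLM pass then adds semantic types and description fields; the tiny CSV below is invented for illustration:

```python
import csv
import io

# A minimal in-memory CSV standing in for an uploaded dataset.
raw = io.StringIO("date,price\n2020-01-02,3257.85\n2020-01-03,3234.85\n")
rows = list(csv.DictReader(raw))

def summarize_column(name, values):
    # Stage one: statistics we can compute with plain code.
    # Stage two (not shown) would hand this dict to an LLM to add a
    # semantic type (e.g. "date") and a natural-language description.
    return {
        "column": name,
        "num_unique": len(set(values)),
        "samples": values[:2],
        "min": min(values),
        "max": max(values),
    }

summary = [summarize_column(c, [r[c] for r in rows]) for c in rows[0]]
print(summary[0]["column"])
```

The point of the richer summary is exactly what the transcript says next: the more grounded detail each column carries, the more interesting questions can be asked of the data.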
so the the summarizer does that the next
thing is the goal exploration module
which is the Eda sort of module we
talked about earlier comes up with a
bunch of questions for data right and so
right now the user has done nothing but
here all this potential questions here
and so the first one was the overall
trend of the S&P 500 over the time
period and we express this as like a
multitask generation problem where we
don't just ask the llm to come up with a
visualization title or hypothesis we ask
it to do three things like first of all
the
question the title of the
visualization and then also rational for
what we might learn from this
visualization so it just turns out that
if you if you express a problem like
this then the
llm sort of um comes up with more
semantically rich and more likely
correct um or interesting
questions and so here and then for each
of these
questions um we get the visualization
module to generate some code in this
case we generate python code and the
background execute that code do some
preprocessing
postprocessing um convert that to uh an
image extract like a base64
representation of that and stream that back
to the user interface so so for each of
these questions the llm
um essentially it generates
visualizations and so you just need to
select select the particular um question
and then you can see the visualization
that was
generated oh so it already did
everything yep like as soon as you give
it the data set it just it just does it
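The multitask goal-generation trick described above — asking the LLM for a question, a visualization title, and a rationale together rather than a title alone — can be sketched as a prompt plus a parsed response. The JSON response below is hand-written for illustration, not real model output:

```python
import json

# An illustrative prompt asking for all three fields at once.
GOAL_PROMPT = (
    "You are an analyst. Given this data summary, propose a goal as "
    "JSON with three fields: question, visualization (a chart title), "
    "and rationale (what we might learn from it)."
)

# A hand-written stand-in for what an LLM might return for S&P 500 data.
llm_response = """{
  "question": "What is the overall trend of the S&P 500 over the period?",
  "visualization": "Line chart of closing price over time",
  "rationale": "Reveals long-term direction and major drawdowns."
}"""

goal = json.loads(llm_response)
# Requesting question, title, and rationale together tends to yield
# richer, more grounded goals than requesting a title alone.
print(goal["question"])
```

Each parsed goal then becomes the input to the visualization module, which generates and executes the plotting code.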
yeah I agree with uh I think it's Levi
uh simply amazing and mindblowing no for
real like this work takes so much time
um and that's amazing yeah and another
interesting thing is that because the
visualizations here as I represented as
code um just like you could ask ChatGPT
to generate some code and then ask it to
modify that code you could do that here
too so for example if you say
something like convert this to a bar
chart right and
so the LLM just takes the the user
request takes the existing code and
rewrites um rewrites the code the
execution engine uh executes that does
some error correction and then uh a line
chart is converted to to a bar chart nice
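The edit loop just described — user instruction plus existing code in, rewritten code out — can be mocked the same way; `fake_llm_edit` is an invented stub, whereas LIDA actually sends both inputs to the LLM and re-executes the result:

```python
def fake_llm_edit(code, instruction):
    # Stand-in for the LLM rewriting visualization code to satisfy a
    # natural-language instruction like "convert this to a bar chart".
    if "bar chart" in instruction:
        return code.replace("plt.plot", "plt.bar")
    return code

original = "plt.plot(df['date'], df['price'])"
edited = fake_llm_edit(original, "convert this to a bar chart")
print(edited)  # plt.bar(df['date'], df['price'])
```

Because the chart lives as code rather than pixels, "editing" is just another code-generation call followed by execution and error correction.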
in addition to that you could also do
other interesting things right right so
um sometimes the visualization that
comes back might not be correct and the
user might not be a visualization expert
so one way you can support them in
making sense of the quality of the
visualization is to ask the llm to
retroactively sort of explain what
what's going on in the chart so in this
case it it makes three types of
explanation so it talks about like just
the overall appearance of the chart and
so if you're building applications that
have like accessibility requirements
let's say you have user who are visually
impaired you could include a feature
like this that sort of describes what's
in the chart so it says that the chart
is a bar plot with blue bars
representing the stock price over time the
x-axis is this and all of that so sort of
explain explaining what's in the code
and in the chart and then in terms of
transformation it describes the type of
transformation that's applied to the for
data so first of all it says that the
date column is parsed as a date type
there's some sort of um error correction
going on um and then also sort of
describes the visualization so that's
one thing another thing that you could
do also is that you could try to
leverage you know you know we all hear
about very large language models and how
they have consumed a lot of data and how
they're actually General World experts
and so it also turns out that like this
LLMs so in this case we're using I think
GPT-3.5 Turbo turns out they they
they've read a lot of visualization
books and you could actually ask them to
evaluate visualization code across
multiple Dimensions so in this case here
like we ask it to generate it to to
generate like ratings across Six
Dimensions so the first is bugs um does
it have any bugs or logic errors um so
it says that like hey know I give the
score 4.5 because you know the function
could be improved by adding error
handling to make sure that the input
data is valid right in terms of trans
transformation it says that like the
data is transformed appropriately but
here type seems a little low or yeah so
here it says that like uh the
visualization type is appropriate but a
bar plot is good however a line plot might
be more effective and essentially you
can then say okay here's all these like
assessments you've made on this
particular chart how we how about we
Auto Repair the chart and so it does a
few things so it converts it back to a
line chart and then it adds a a bunch of
annotations so for example
it it it sort of plots it figures out
that this is a stock price graph and it
goes back in time and it sort
of suggest a couple of important events
to keep track of and in this case it
just turns out some of these events
actually have an effect on sort of the
stock price um trajectory so it talks
about the Lehman Brothers bankruptcy
and that sort of thing so so
yeah we have a yeah we have a little
question from Carlo they said that the
dates are not visible at all is it
overestimating the quality of the
chart um so what do we mean by the dates
are not visible you mean here I think it
was the chart before this one yeah so
yes so yeah on the chart previously um I
was the one who sort of asked it like
convert this to a bar chart and it sort of
did that and ideally you know I think
bit if if you did ask the llm to
generate it probably would be a bit more
judicious in the use of like axes and so
that was like a forced error I did myself
and then I asked the LLM to sort of
self-critique that chart and then
self-repair it and then it sort of fixed
all of that error so yes um a good
pipeline to have in general would be to
first ask the LLM to generate a chart and
then um retroactively ask it to also
retroactively ask it to also sort of
self- evaluate the quality of that chart
and then any kind of repairs and so that
way you you get good quality charts um
all all along the way um nice yeah so
essentially oh keep going sure go ahead
I was just gonna say so essentially LIDA is
taking all the the packages that we
would use like you know matplotlib
Seaborn and I think one's called scikit-learn
um and it's and it's building the
charts for us based on the data that's
inputted yes so it's it's configurable so
you could tell it hey you know use Seaborn so
ah okay you could switch to Altair
which is an interactive web-based like a
Python library on top of uh
Vega-Lite you could use matplotlib you could use
ggplot so in theory as long as um it's a
library that represents visualization as
code in any programming language and we
can execute that code you typically
should be able to use that with LIDA
yeah and then you can also select the
type of model you want to use so on the
backend LIDA supports multiple model
providers OpenAI PaLM Cohere Hugging Face
Anthropic that kind of thing um and
essentially the LLM is just like a
generation engine so we can swap that
out with multiple LLMs and still build
visualizations gotcha and do we need to
add our own um API key to access those
okay yes um
so technically you would run this in
your own environment and and essentially
the first step would be to set up an llm
that gets attached to all your requests
gotcha yeah gotcha gotcha gotcha okay
let's answer a few questions that we
have and then I would love to know how
can we feed it our own data um I know
that these are example data in the demo
yeah but let's answer a few questions
before we show that so Carlos says is it
possible to to control the quality of
produced charts so we can automatically
generate good charts like setting the
predefined minimum quality levels yeah
so um so controlling the LLM's behavior
um as a as an ML engineer you have
multiple levers to sort of tweak or
control the quality of whatever comes
out of the LLM so the LLM is the primary
driver here and the the lowest hanging
fruit is prompting so essentially um and
LIDA already does this for you because
if you if we look through um some of the
code and so I'll just go through that
lida
components viz um I think viz
generator so there there's like a big
system prompt that sort of um tries to
guide the llm as to how it should
generate visualization and if if you did
read this out carefully you'd see that
like
um you know there's all these instructions
around like writing perfect code for
visualization it sort of says oh you
know you must follow visualization best
practices so that's like as an ml
engineer this is your first lever to
sort of control and ensure that like the
quality of the chart is sort of good um
the second thing you can do is um I I
sort of showed this like self evaluation
capability um where you could actually
take a visualization and get the LLM so
there is a self-evaluation module you
get it to sort of evaluate a
visualization across multiple Dimensions
so you could run this just as part of
your pipeline before you show anything
to your user so first of all you have
good prompts you get a first version of
visualization you pass it through like a
self- evaluation module like this you
get like a set of critiques and
suggestions and in this case everything
is sort of High um You probably you
could then check you know if the
visualization
quality on these
dimensions and then finally um you could
if there were low versions you could
then tell the LM okay use a self-repair
module and fix all of these Dimensions
um and and that's one that's like one
General way to improve
quality
gotcha makes sense another question we
have is from Levi he says uh does LIDA
implement any hypothesis testing for
the data supplied if data supplied is
not normally distributed what measures
does the llm take into consideration to
fix it yeah um so one of the current
limitations of LIDA is that it makes it
it makes assumptions about data so it it
it assumes that the data is in a fairly
decent
state um it does not do
additional like
um data cleaning or that kind of
thing
um it assumes that data is sort of cleaned and
in a good state and it's ready for
visualization so that's one thing to
note um the other thing around
hypothesis testing is I
think we will need to know if if you
really wanted to focus on
um hypothesis testing we would need like
an additional like let's say some
hypothesis module to sort of look at the data
say explore statistics of each column of
data and can represent that in some
way to the to the the visualization
generator at test time in the current in
the current implementation though um one
of the reasons why we include things
like minimum maximum value of particular
columns with draw samples is to sort of
take a best effort approach to give the
LM as much information about like sort
of the overall General Behavior or the
general properties of each column data
and a lot of like examples sort of
show that like the LLM can
utilize these sort of
signals while it writes uh code to uh to
generate visualizations but yes the
first step is that adding Rich signals
like this to the data summary which is
attached to all the requests made to the
llm is one way to ensure that the the
llm makes fairly good decisions um
another way would be to add an like an
additional just like hypothesis
exploration module that comes up with
additional like directions that the
model should focus on and sort of append
that late to the summary yeah cool and
um in regards to the um I guess the
model testing or like analyzing its own
work how do we swallow the emotional
damage of the AI telling us that the
code is not
good that's
funny that was just a funny question
that I saw yeah
yeah so was there a question about the
llm evaluating
itself um no but how does it do that um
what how how does it yeah yeah that's an
important question so so there's
something a little fishy about like an
llm evaluating its own code yeah
definitely is fishy definitely an
important question instead of
triage so um visualization quality is a
fairly subjective thing and so things
like aesthetics um encoding so some of
this those aspects are um are subjective
so ideally you know we would need like
some human expert to do that but you
know in practice right you want an
approach that's
offline and um and the the general the
general current approach is to use um a
very capable model your most capable
model as evaluator and so mostly because
let's say your most capable model is
probably very expensive and you probably
can't use it for all of the steps in the
pipeline so you probably can't use it
for summarization hypothesis
generation all that stuff and you would
reserve that model as your evaluator and
it can sort of evaluate the quality of
the visualization generated by smaller
models so right now the main thing you
will do is you just Reserve in this case
I ideally you would use gp4 or the most
capable version of gp4 as your
evaluator and um you would use G 3.5 or
let's say some models from hugging face
um as the co- model that drives all the
other steps um and and outside of that
you you would use a human evaluator um
there are no easy there's no easy like
there's no easy easy other approach to
sort of evaluate quality of visualiz
conditions one additional thing that's
sort of new so when I worked in this um
there were no multimodal models and so
what that means is that um now with mult
multimodel models um as opposed to just
giving the llm the representation of the
visualization as code critique you could
also take both the image of the
visualization and the code and give that
to a multimodel model and now you can
get like evaluations that look at actual
representation of the visualization in
pixel space so what does it look like
and there are class of problems that can
you can only detect quality issues you
can only detect in pixel space things
like overplotting or AES that overlap or
that sort of thing maybe text that's
just like very difficult to read that
sort of thing so exploring the
multimodal like self evaluation approach
will probably improve the performance of
a system like this
Gotcha, nice. And so how would we ask LIDA to evaluate and chart our own data? What does that process look like, how do we feed it our own data? From the UI perspective, you could literally just drag a file in here, a JSON or a CSV file. Fun fact, should we try to grab something from Kaggle real fast and see what happens? Yeah, let's see. Which one should we try? Emotions? I don't know. Let's take a look. Oops; see how many columns it's got, just three columns. Ah, probably not that interesting, not that one. Top YouTubers worldwide? Okay, let's see how many columns this one has. Nine columns; let's try that one. Okay, so we have a youtubers CSV, and if all goes well, I will just drag that in here. Oh, nice. Where did it go? Okay, so the first thing it has done is come up with a representation of the data. Let's see what kind of questions it came up with. It says: what is the distribution of subscribers among the top 1,000 YouTubers? And let's switch to a more capable model and see. So, Victor, of course you're going to review the work that the AI did, right? But this is really amazing. It did a summary, like, okay, here's what this dataset is about, then it says, okay, here are some questions, and here are some charts. Yeah, it asks: what is the distribution of subscribers across different categories? Which countries have the highest average views? What is the correlation between average likes and comments? What is the distribution of content types? It looks like we're having some errors; let's try another one. One thing that could happen, in a small fraction of cases, is that the LLM might come up with questions that are not sufficiently grounded in the data, maybe referencing columns that aren't quite there. Okay, let's see what this says: which countries have the highest average views? Okay, United States, Japan, Spain, Turkey, Iran. This looks sensible to me. Let's try another one: what is the correlation between average likes and average comments? Oh, that's a good question. Yeah, look at that, we get a plot. Let's do something fun: can you add a line of best fit to the plot? I have no idea what it will do, but let's see.
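Under the hood, a request like "add a line of best fit" typically turns into a least-squares fit. A hedged sketch of the kind of code an LLM might generate for it, using numpy and matplotlib with made-up likes/comments values (the real numbers come from the uploaded CSV):

```python
# Sketch: scatter plot of likes vs. comments with a least-squares
# best-fit line, the kind of chart code an LLM could produce here.
# The data points below are illustrative, not from the Kaggle dataset.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

avg_likes = np.array([120, 340, 560, 800, 1020], dtype=float)
avg_comments = np.array([15, 40, 52, 95, 130], dtype=float)

# Fit a degree-1 polynomial: comments ≈ slope * likes + intercept
slope, intercept = np.polyfit(avg_likes, avg_comments, deg=1)

fig, ax = plt.subplots()
ax.scatter(avg_likes, avg_comments, label="channels")
ax.plot(avg_likes, slope * avg_likes + intercept, color="red", label="best fit")
ax.set_xlabel("average likes")
ax.set_ylabel("average comments")
ax.legend()
fig.savefig("likes_vs_comments.png")
```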
So while this was running, were there any other questions? Yes, someone asked if LIDA is built on top of AutoGen Studio. I'm not sure what AutoGen Studio is. Yeah, so it's a different project that I've been working on, and I think maybe what they're referencing is that the color scheme looks similar. AutoGen Studio just came out about two months ago, and LIDA is about two years old, so no, it was not built on AutoGen Studio. Awesome. And someone was asking if you can remind them how to access the demo website.
website yeah so he opened up the he
opened up the project in a code space
yeah so essentially I did spin up the
project in the code space and so the the
overall approach will be to go to the
lighter repo so I I don't have a hosted
version of this um you would need to
like install it locally on your machine
and the way you would do that would be
that you would run pep install um pep
install
ligher um if you had it previously you
might need to update some some libraries
that it depends on um you would make
your open a API key so by default it
uses an open AI API key um you need to
set that up in your environment and then
you run this command here that says
lighter ui- Port
880 and so this will just spin up
um the interface on Port 880 on your
local machine um and but however you you
actually don't
need you could also run it as a notebook
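The setup steps just described, written out as commands. The package name and flags follow the LIDA README as I recall it, so treat this as a sketch and check the repo for the current invocation:

```shell
pip install -U lida           # -U also updates the libraries it depends on
export OPENAI_API_KEY=sk-...  # LIDA calls the OpenAI API by default
lida ui --port 8080           # serves the web UI on localhost:8080
```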
So there's a button here that says "Open in Colab," and essentially you could run this completely in a notebook. It shows the Python API, it shows how to set up multiple LLM providers, and all the things we just showed in the UI you could do in the notebook. For example, to summarize the data you just call lida.summarize and point it at your data; pretty straightforward. To generate a bunch of goals, you call lida.goals, and then you can look at all the goals it comes up with. You could generate goals based on a persona, and then for each of these goals you could ask it to generate the visualization. You could also just ask your own question and get a visualization out of that. And then there is everything around modifying an existing visualization based on text chat, so you could do that here; in this case we say make the chart height and width equal, change the color of the chart to red, translate the chart to Spanish, and you get that out. Then there are explanations and recommendations. And if you do have a GPU, you could do the whole infographics thing, where we take an existing visualization and use a diffusion model, maybe something like Stable Diffusion, to go from a regular visualization to, let's say, a stylized version of the visualization. I have some examples here. This part is still experimental; it's not the core focus of LIDA, but it would be great to see contributions from people figuring out how to make this a bit more stable, and to explore more examples here. Yeah, awesome.
So we have more questions. Ugo says: can I stream the intermediate steps as the LLM is doing its processing? So, yes, but that might require some updates to the core library. Right now, for most of the steps, you can't really consume partial output. Let's assume we're talking about the visualization module, where we're generating code to create a visualization: the entire code needs to be generated before you can actually execute it; we can't execute partial code and all that, so technically you still need to generate the entire thing. But let's assume you wanted a user experience where you show the user exactly what the LLM is doing at a token-streaming level; that might require some modifications to the underlying library. The good news is it's open source, so you technically can do anything with it.
True. Okay, great: can you use a local LLM? Yes. If you recall, in the notebook I just shared, LIDA uses a library called llmx, and technically that lets you use multiple LLM providers, so you can use OpenAI, Cohere, PaLM, or even Hugging Face models. And in theory a lot of local LLMs can be set up using tools like, say, LM Studio, Ollama, or vLLM. What those give you is the ability to spin up an OpenAI-compatible LLM API endpoint on your local machine, and that's what I would actually recommend: you take any of these nice Hugging Face models, like the Zephyr 7-billion-parameter models or 13-billion-parameter models, you use a server like Ollama or LM Studio, you spin up an OpenAI-compatible endpoint, and you just use that exactly how you would use the OpenAI API. Nice, that's pretty cool. Okay, more questions.
Does LIDA support real-time data streams? So, technically, if you think about the visualization problem, you do need a big chunk of data, a sufficient sample, in order to visualize it. As the ML engineer, you have a lot of latitude in how you might build something like this. If you do have a data source that's streaming, so data comes in in real time, you could figure out some sort of chunking strategy, where within some window you accumulate a chunk, you visualize that chunk, and you update the data source as more chunks become available and regenerate the visualizations. In theory, you probably want to fix the visualization and just rerun the exact same code once more data becomes available.
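The windowed-chunking strategy described here can be sketched as a small buffer class; the `render` callback is a stand-in for re-running the same chart code over the accumulated data (the names are illustrative, not part of LIDA):

```python
# Sketch of windowed chunking for a streaming source: each time a full
# window of records arrives, re-render the chart over everything so far.
from typing import Callable

class ChunkedVisualizer:
    def __init__(self, window_size: int, render: Callable[[list], None]):
        self.window_size = window_size
        self.render = render   # stand-in for "rerun the exact same chart code"
        self.data: list = []   # all records seen so far

    def ingest(self, record) -> None:
        self.data.append(record)
        # Once a full window has arrived, refresh the visualization
        # over the accumulated data.
        if len(self.data) % self.window_size == 0:
            self.render(self.data)

renders = []
viz = ChunkedVisualizer(window_size=3, render=lambda d: renders.append(len(d)))
for value in range(7):
    viz.ingest(value)
print(renders)  # → [3, 6]
```

Fixing the chart code and only swapping in fresh data, as suggested above, keeps the stream visualization stable between refreshes.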
Yeah, that makes sense to me. Another question: what do you personally use LIDA for the most? Oh yeah. I'm a visualization expert; I spent many years studying data visualization. I might use LIDA when, say, I have some data and a talk I want to give; maybe I have a version of this interface running locally, and I might just drag in my data, ask for a bunch of visualizations, and put them in a slide deck, that kind of thing. That's kind of my own workflow, and I'm happy to chat with the interface and request changes here and there until I get the perfect visualization. But my intuition is that a tool like this is most valuable to users who really have no visualization experience, users who are not exactly sure what questions to even ask, and, for each of those questions, why one visualization is better than another, or whether this is the best visualization for this data. For users like that, a tool like LIDA has the most delta in terms of benefit. I agree: I have a workshop coming up in two weeks.
And the first part of the workshop is to clean and visualize data and build a model, and I think I'm going to be using LIDA, because the audience aren't machine learning or data science folks, and this just makes it a lot more accessible. I'm already thinking a great use case would be workshops and teaching, because it's a lot more accessible and you can digest it a little bit better.
So at this point I've talked about all the things you can do with a tool like this, and it's important to note that it's not magic; it has a lot of limitations. I also find it important to talk about when it doesn't work and what the current limitations are. There are still cases where, let's say you gave very complex instructions, the current code-execution path might fail: LIDA might write code that has logic errors in it, you try to run it, it fails, and you see that. Now, there is a way to solve problems like that: you can add some sort of agent-based behavior, where you take the compile errors, give them back to the model, and ask it to do some self-repair based on that. That's one approach, though there are latency constraints there, so you might see things like that.
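The self-repair idea, handing the execution error back to the model and retrying, can be sketched as a small loop. The `repair` callable below is a stub standing in for an LLM call, and LIDA's actual mechanism may differ:

```python
# Sketch of agent-style self-repair: execute generated code, and on
# failure hand the traceback to a repair step (an LLM in practice).
import traceback

def run_with_self_repair(code: str, repair, max_attempts: int = 3):
    """Execute generated code, asking repair(code, error) for a fix on failure."""
    for _ in range(max_attempts):
        try:
            scope: dict = {}
            exec(code, scope)  # run the generated visualization code
            return scope
        except Exception:
            # Feed the full traceback back to the (stubbed) model.
            code = repair(code, traceback.format_exc())
    raise RuntimeError("could not repair generated code")

# Stub repair: pretend the model fixes an undefined-name bug.
broken = "result = n_subscribers * 2"
fixed = "n_subscribers = 10\nresult = n_subscribers * 2"
scope = run_with_self_repair(broken, repair=lambda code, err: fixed)
print(scope["result"])  # → 20
```

Capping the attempts is what keeps the latency cost mentioned above bounded.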
The second thing is that not all models are equal. In my demo here I'm using the GPT-series models; if you use smaller models, 7-billion or 13-billion-parameter models, the difference in behavior can be pretty extreme. That's an open area of research: how can we build better small models that do better at a task like this? It would make the entire space even more accessible, so these are potential areas of contribution. The third thing is that the quality of the visualization relates to the grammar that you choose. Most of the time I use seaborn, because it's one of the most popular visualization packages out there; there are so many examples of it on GitHub, and these models have seen most of GitHub, so they're quite proficient at writing in these grammars. If instead you look at something like a proprietary visualization tool your company has, LIDA might not work very well there. And there are some metrics in the paper, there's a paper for LIDA, about how you might measure and benchmark systems like this; these are things that, as an ML engineer, you should care about.
Yeah, that makes complete sense. Definitely not magic; you shouldn't automatically trust the output, but it's definitely a good first start, especially when it comes to understanding large datasets. So I have two more questions I can ask, and then we're going to wrap up. The first question is: what input data sources are supported? We know CSV is supported and JSON is supported; Carlos asked, what about raster data or other binary data containers?
containers so so essentially the short
answer is whatever pandas can process is
supported what about the P we we we
standardize on pandas now for many
business use cases this is insufficient
you probably have data scale much larger
than what hand support So essentially
the the solution would be to rewrite
your own just data inest engine and so
there's another opportunity for
contribution um to to rewrite like how
we fundamentally process data and Lia
and so it's an openness project lots of
lots of potential here to to make that
type of contribution um and also just
supporting data connectors right so most
most most um most production deployments
don't have data just lining around in
CSV adjacent files the data is s of dat
exist in a lak house the query pogress
SQL mongod DB that's sort of thing and
so figuring out a good General data
connector that lets you just import your
data to a tool like this it's just
something not yet supported in laa but
it's it's a good it's a good idea um and
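"Whatever pandas can process" covers the two formats shown in the demo; a small sketch with in-memory stand-ins for uploaded files (the rows are made up):

```python
# Loading CSV and JSON into pandas DataFrames, the ingestion path LIDA
# standardizes on. StringIO stands in for files dragged into the UI.
import io
import pandas as pd

csv_data = io.StringIO("channel,subscribers\nChannelA,111\nChannelB,72\n")
json_data = io.StringIO('[{"channel": "ChannelA", "subscribers": 111}]')

df_csv = pd.read_csv(csv_data)
df_json = pd.read_json(json_data)

print(df_csv.shape)   # → (2, 2)
print(df_json.shape)  # → (1, 2)
```

Anything else pandas can read, Parquet, Excel, SQL query results, would land in the same DataFrame representation.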
And I've heard a lot of good things about Project Ibis, another up-and-coming open source project. Project what? Ibis; I just know Ibis, a portable data project. I think it's this one; is it the one with the little geese? Oh yeah, this one. It's meant to be a portable dataframe library, and the idea is that your data might be in DuckDB, ClickHouse, BigQuery, or pandas, and you can do the exact same pandas-style operations on the datasets without worrying about where the data comes from. Ah, nice. That's interesting, because someone else was asking if they could just take data from, like, SAP or ServiceNow into LIDA. Yeah, this might be the way to think through a problem like that. Cool, that's really great.
The final question for the evening: any chance to submit two snapshots of the data and ask LIDA to compare them? Do you mean snapshots of the charts, or snapshots of the data itself? So, LIDA is not really designed to compare data; it's mostly meant to take a dataset as input and provide a visualization as output. If what you meant was to compare the quality of two different visualizations, what you could do is have LIDA generate two visualizations and then run the evaluation module on both of them and look at the scores. Oh, interesting. Yeah, essentially you could explore the examples in the notebook and do exactly that: get two visualizations, then for each of them run LIDA's evaluate module and compare the scores for both visualizations.
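The compare-by-scores workflow just described can be sketched with a stub in place of LIDA's evaluate module; the scoring heuristic here is entirely hypothetical:

```python
# Sketch: score two candidate chart snippets and keep the higher one,
# mimicking the "evaluate both, compare the scores" workflow.
def evaluate_stub(code: str) -> float:
    """Hypothetical scorer: returns a quality score for chart code."""
    score = 5.0
    if "ax.set_xlabel" in code:  # reward labeled axes
        score += 2.0
    if "legend" in code:         # reward a legend
        score += 1.0
    return score

chart_a = "ax.scatter(x, y)"
chart_b = "ax.scatter(x, y)\nax.set_xlabel('likes')\nax.legend()"

scores = {name: evaluate_stub(code)
          for name, code in [("chart_a", chart_a), ("chart_b", chart_b)]}
best = max(scores, key=scores.get)
print(scores, "->", best)  # chart_b wins: 8.0 vs 5.0
```

In the real workflow, each stub score would come from running LIDA's evaluation module on a generated visualization.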
Nice, that's pretty cool. I like what this person said: it's so great to treat visualization as a language; it's really great multimodal thinking that's going on here. I think that's so true. I'm so impressed that you can just treat this as a language and translate it into a chart.
Yeah, that's pretty cool. This has been a really great, really informative session, Victor. I know everyone watching has enjoyed seeing project LIDA; I know I have, and I'm going to be digging into it, especially since this is what I'm learning, so I'm just like, this is so cool. Thanks so much for taking the time. The final comment I saw was: imagine how middle managers are going to believe that data mining now only takes five seconds because of AI. I thought that was funny. But thanks, everyone, for joining us, and thank you so much, Victor. Any last words you want to share with the audience on how to contribute and support LIDA, all that jazz?
Yeah, so two things. LIDA is open source, and I think the link has been shared; give it a star. There are a lot of people who have started contributing; right now we have 12 contributors, and you could be contributor number 13. If you go to the issues, you could just look through all of them; I probably should do a better job of labeling them, but there are a few where someone complains that something doesn't work well, and that's an excellent place to start helping. It might just be telling others, hey, here's how you fix that problem; it doesn't even have to be code contributions. You could look through the documentation and find issues there; changes and open PRs are all welcome. I really look forward to seeing more of you share your ideas, and there are things I talked about today, like enabling multimodal evaluation and enabling multiple data connectors, let's say using Project Ibis; these are all potential areas of contribution. And that's it. I also write a newsletter, mostly just talking about how I think about AI, machine learning, and design problems, and there are a couple of posts there; I think there's one about LIDA. Let me see if I can put that in the chat. Yeah, there's one about LIDA; it could be another place to learn more about the project and that sort of thing. Yeah, I'm popping the newsletter link in the chat right now.
That's awesome. Thanks so much, Victor, for joining us today. I hope to stay in touch with you, especially as I'm learning this stuff, and you're kind of an expert in this area; or, you are an expert in this area, not kind of. But yeah, thanks so much, everyone, for joining us. This has been a really good, really informative session; I've learned so much, and we'll see you next week. Thank you. Thank you for having me. Bye.
[Music]