LlamaIndex Webinar: Advanced RAG with Knowledge Graphs (with Tomaz from Neo4j)
Summary
TLDR: This webinar is a joint workshop by LlamaIndex and Neo4j on the property graph index. It teaches how to build advanced knowledge graphs using the new property graph abstractions: creating a knowledge graph from documents via graph constructors, and retrieving information based on user questions via graph retrievers. It also covers applied topics such as custom entity identification and consolidation, and generating Cypher queries from text.
Takeaways
- 📚 This webinar is a workshop on the property graph index, jointly hosted by LlamaIndex and Neo4j, teaching how to build advanced knowledge graphs.
- 🔍 A property graph is a graph model in which both nodes and relationships carry properties; it is now backed by the new GQL standard, which is part of the ISO committee.
- 🛠️ LlamaIndex takes the text from documents and builds a knowledge graph through graph constructors.
- 🌟 LlamaIndex offers out-of-the-box tools for beginners and a modular, customizable pipeline for advanced users.
- 🔧 Three graph constructors are available in LlamaIndex: the implicit path extractor, the simple LLM path extractor, and the schema LLM path extractor.
- 🔗 Entity consolidation (resolving duplicate entities) is highlighted as an important step for improving the structural integrity of a knowledge graph.
- 🔎 Property graph retrievers contain the logic for fetching information from the knowledge graph based on the user's question; LlamaIndex ships several retrievers.
- 📝 The workshop demonstrates how to define custom entity identification and a custom retriever, enabling more flexible information retrieval.
- 🤖 Graph construction and retrieval with LLMs (Large Language Models) can produce different results depending on the model, so well-designed prompts and schema definitions matter.
- 📈 The webinar also discusses building and using knowledge graphs in specialized domains such as technical documentation, patents, and scientific papers, with real applications in business and research.
- 🔑 At the end of the workshop, attendees are invited to reach out about future applications, and about blog posts and case studies for enterprise developers.
Q & A
What is the main theme of this webinar, presented through the LlamaIndex and Neo4j partnership?
-The webinar teaches how to build advanced knowledge graphs by combining the property graph index with LlamaIndex.
What kind of data model is a property graph?
-A property graph is a graph model in which both nodes and relationships carry properties; it is backed by the new GQL standard, part of the ISO committee.
How many graph constructors does LlamaIndex provide?
-LlamaIndex provides three graph constructors: the implicit path extractor, the simple LLM path extractor, and the schema LLM path extractor.
How does a graph constructor create a knowledge graph from documents?
-A graph constructor takes the text of the given documents, extracts structured information from it, and stores that information in the knowledge graph.
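As a rough illustration of that flow, here is a minimal sketch, assuming the llama-index and llama-index-graph-stores-neo4j packages and a locally running Neo4j instance; the credentials and data directory are placeholders:

```python
# Minimal sketch: documents -> graph constructors -> knowledge graph in Neo4j.
from llama_index.core import PropertyGraphIndex, SimpleDirectoryReader
from llama_index.graph_stores.neo4j import Neo4jPropertyGraphStore

graph_store = Neo4jPropertyGraphStore(
    username="neo4j",
    password="password",
    url="bolt://localhost:7687",
)

documents = SimpleDirectoryReader("./data").load_data()

# With no kg_extractors specified, LlamaIndex falls back to its default
# constructors; the extracted graph is written into the Neo4j store.
index = PropertyGraphIndex.from_documents(
    documents,
    property_graph_store=graph_store,
)
```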
What approaches can you take when working with the property graph index?
-Because it is modular and customizable, beginners can start with the out-of-the-box components, while advanced users can tailor the pipeline to their own needs.
How does the webinar resolve duplicate entities?
-It uses text embeddings together with word-distance heuristics to find potential duplicate candidates and then merges them.
What does a graph retriever do?
-A graph retriever holds the logic for fetching data from the knowledge graph based on the user's question; four built-in retrievers are available.
What is the advantage of the Text2Cypher retriever, which generates Cypher statements from text?
-It is highly flexible: if the user knows exactly what information is in the database, it can answer more complex questions, including aggregations.
Describe the flow of the property graph index integration in LlamaIndex.
-The integration starts from documents, creates a knowledge graph using graph constructors, and then uses graph retrievers to fetch information based on the user's question.
What is the main purpose of the custom retriever presented in the webinar?
-The custom retriever is designed to detect specific entities in the input text and then select information from the knowledge graph based on those entities.
Outlines
📚 Introduction to the Property Graph Index by LlamaIndex and Neo4j
This section introduces the webinar, jointly hosted by LlamaIndex and Neo4j, which teaches how to build advanced knowledge graphs. The concept of a property graph is explained, emphasizing that both nodes and relationships carry properties. It also touches on the document-to-graph conversion process and the roles of graph constructors and graph retrievers.
🛠️ Building Property Graphs and Their Customizability
This section walks through the property graph construction process in detail: the flow of creating a knowledge graph from documents via graph constructors, and the types of constructors available. It also highlights the balance between out-of-the-box options for beginners and the flexibility of custom pipelines for advanced users.
🔍 Graph Constructors in Detail and Practical Applications
Here the available graph constructors are covered in depth: the implicit path extractor, the simple LLM-based extractor, and the more advanced schema LLM-based extractor. How each constructor works and how it can be customized is shown concretely.
🤖 Overview of Graph Retriever Types and Functions
This section explains graph retrievers, which hold the logic for fetching information from the knowledge graph based on a user's question. The four available retrievers are outlined, their advantages and drawbacks compared, and the trade-off between flexibility and accuracy discussed.
📝 Demonstration: Converting Text into a Property Graph
This section demonstrates converting text into a property graph using GPT-4o, building the graph according to a specific schema and resolving duplicate entities along the way. As a practical example, business-related information is extracted from news articles.
🔗 Entity Consolidation and Developing a Custom Retriever
Here the importance of, and methods for, entity consolidation are discussed: identifying duplicate entities and merging them to improve the structural integrity of the knowledge graph. The section then outlines how to develop a custom retriever, combining entity extraction with the vector context retriever.
🚀 Implementing and Applying the Custom Retriever
This final section details how to implement the custom retriever and where to apply it, from entity extraction through fetching information for each detected entity, emphasizing that this approach yields more accurate and comprehensive results.
Keywords
💡Property graph
💡Graph constructor
💡Graph retriever
💡Label
💡Document
💡Entity
💡Relationship
💡Knowledge graph
💡Text chunk
💡Custom retriever
Highlights
The webinar is a workshop on the property graph index, jointly hosted by LlamaIndex and Neo4j.
Introduction to the property graph concept and the GQL standard, which standardizes nodes and relationships carrying properties.
Overview of LlamaIndex's new property graph index integration and how structured information is extracted from documents into a graph.
The modularity and customizability of graph constructors and graph retrievers.
The concept of a "document" as a wrapper around text, and the process of converting text into a graph.
Introduction to the three graph constructors provided by LlamaIndex and the characteristics of each.
The importance of entity merging for custom entity identification and for improving the structural integrity of the graph.
A method for custom entity identification using text embeddings and word-distance heuristics.
The four types of property graph retrievers and how each performs retrieval.
How to build a custom retriever that runs entity extraction and graph search as a single pipeline.
LLM (Large Language Model) selection, and recommended models for knowledge graph construction.
Drawbacks of a human-in-the-loop approach to graph construction, and the advantages of an automated second extraction pass.
Neo4j's suitability for specialized documents such as technical documentation, patents, and scientific papers, with real-world use cases.
A shared, fairly involved Cypher query for entity deduplication, and why it is broadly useful.
A code example for the custom retriever, from entity extraction in text through identifying relationships.
Wrap-up of the webinar: knowledge graphs as a potential superset of existing RAG solutions from LlamaIndex and Neo4j.
Close of the workshop, upcoming plans, and the announcement that the webinar will be published on YouTube.
Transcripts
Hey everyone, welcome back to another episode of the LlamaIndex webinar series. Today is shaping up to be one of our most popular webinars and workshops ever: property graph indexes in LlamaIndex, in partnership with Neo4j. This is going to be a special workshop teaching you how to build advanced knowledge graph RAG, and you'll learn how to use our brand new property graph abstractions, both to construct a graph and to query an existing one. We're excited to host Tomaz from Neo4j as well as Logan from our side, so without further ado, feel free to kick it off.

Okay, so like I said, I'm happy to be here and talk about graphs. As mentioned, today we're going to talk about the new property graph index integration in LlamaIndex. You might be wondering: a property graph, what actually is that? Most of the time when dealing with graphs, especially in RAG-like frameworks, what we see is usually triples: subject, relationship type, and object. But property graphs now have an actual standard, the new GQL standard, which is part of the ISO committee, and that's very exciting. Basically, a property graph means that nodes have properties and relationships have properties as well. For example, here we have one node with the properties name "Amy", a date of birth, and an employee ID. But it also has one special property that we call the label in property graph models: as you can see, the green node has an Employee node label, and node labels are used to put nodes into sets, into categories. In this example we have three node labels: Employee, Company, and City. And as mentioned before, relationships can also have properties. So it's a slightly different data representation than what you might be used to from the previous knowledge graph implementation in LlamaIndex, where we only dealt with triples. So that's about property graphs.
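To make the model concrete, here is a minimal, hypothetical sketch of the slide's example using the official neo4j Python driver; the names, credentials, and property values are made up:

```python
# Two labeled nodes with properties, and a relationship that itself carries
# a property -- the defining features of the property graph model.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    session.run(
        """
        CREATE (amy:Employee {name: 'Amy', employeeId: 42})
        CREATE (acme:Company {name: 'Acme'})
        CREATE (amy)-[:WORKS_AT {since: 2019}]->(acme)
        """
    )
driver.close()
```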
Now let's talk a little bit about the property graph index integration. The flow you will follow most of the time when using the property graph index integration: you start with a bunch of documents, and LlamaIndex has great support for various types of documents, so I won't go into that, but basically you can think of documents as just wrappers around text. We take a bunch of documents and pass the text from those documents to graph constructors. In the integration that Logan did, you can use one or multiple graph constructors to create a knowledge graph; we'll talk a little more about them later, and I'll show you which are available out of the box. The graph constructors extract structured information from documents and store it as a graph in the knowledge graph. There are a couple of integrations that LlamaIndex already has; since I'm from Neo4j I will focus on the Neo4j integration, but there are others as well, and probably more coming up.

Once you've built your knowledge graph, on the other side we have the so-called graph retrievers. The graph retrievers' job is, based on the user question, to apply some sort of logic to retrieve data from the knowledge graph. Again, there are a couple available out of the box, and we'll also show how easy it is to define a custom graph retriever later in the workshop. So that's basically the flow you can think of. What LlamaIndex does is provide graph constructors and graph retrievers — obviously also the other parts; it's just that the graph constructors and graph retrievers are the pieces of the new property graph index integration — and the idea is that they are very modular and customizable: if you're a beginner you can just use something out of the box, but if you're an advanced user and need something custom, it's very easy to customize the pipeline for your needs.

So, at a high level, property graph construction: as I said, we take a bunch of documents and construct a knowledge graph. Here I have an example with documents about former OpenAI employees who founded new companies. You would have four different documents, each mentioning some information, and the nice thing about knowledge graphs is that you condense and unify that information, so that information previously spread across multiple documents is now easily accessible and nicely represented in a knowledge graph.

Then I prepared a couple of slides on what is available out of the box in LlamaIndex. Out of the box we have three graph constructors. The first one is the so-called implicit path extractor. We have a new word for what it builds — a "lexical graph" — but what it actually does: the green node is the original document, and the extractor just chunks up the document, so the gray nodes are text chunks, the text chunks are connected to the source document, and we also keep an ordered list of text chunks so that we know how they follow each other. This is the implicit path extractor graph constructor, and it doesn't require an LLM, because it's just chunking up the text and creating a linked list of text chunks.
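A hedged sketch of using only this constructor, reusing the documents and graph_store assumed in the earlier snippet; since it needs no LLM, node embeddings are disabled here as well:

```python
# Minimal sketch: the implicit path extractor alone builds the lexical graph
# (chunks linked to their source document and to each other in order).
from llama_index.core import PropertyGraphIndex
from llama_index.core.indices.property_graph import ImplicitPathExtractor

index = PropertyGraphIndex.from_documents(
    documents,
    property_graph_store=graph_store,
    kg_extractors=[ImplicitPathExtractor()],
    embed_kg_nodes=False,  # no embeddings needed for a purely lexical graph
)
```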
Then the next one is the simple LLM path extractor, and as the name suggests, you need an LLM for that. From what I've been digging around in the implementation, the simple LLM path extractor works basically through prompt engineering: in the prompt you define what the output should look like, and then you provide a parsing function that extracts that output from the LLM and creates a knowledge graph. So I would call it a prompt-based solution. By default, all nodes have the same label. Again, the purple node is the text chunk — we always store the reference text in the graph as well — and for the entities mentioned in a text chunk we create MENTIONS relationships to those entities. Then obviously entities can have relationships between each other, for example "Amelia Earhart was an American aviation pioneer." This is the simpler version of graph extraction; you can customize it and make it more advanced, but by default all nodes have the same label.
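For reference, a hedged sketch of what this prompt-based constructor looks like in code; the model name and cap on extracted paths are illustrative:

```python
# Minimal sketch: the simple LLM path extractor pulls (subject, relation,
# object) triples out of each chunk via prompting, with one shared node label.
from llama_index.core import PropertyGraphIndex
from llama_index.core.indices.property_graph import SimpleLLMPathExtractor
from llama_index.llms.openai import OpenAI

kg_extractor = SimpleLLMPathExtractor(
    llm=OpenAI(model="gpt-4o", temperature=0.0),
    max_paths_per_chunk=10,  # cap how many triples the LLM extracts per chunk
)

index = PropertyGraphIndex.from_documents(
    documents,
    property_graph_store=graph_store,
    kg_extractors=[kg_extractor],
)
```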
And then the more advanced one is the schema LLM path extractor. Here you have the ability to define which node labels and relationship types can be extracted. We'll also use this one in the workshop, so you'll see in practice how we define the schema. Basically, as you can see, the different colors of nodes mean different node labels; again, the purple ones are the text chunks, the text chunks mention the entities that appeared in them, and then obviously we have a bunch of relationships between those entities as well. This one works best with LLMs that provide native function calling, like the commercial ones — OpenAI's models, Gemini, Mistral, probably some others. Groq is a nice one as well, actually a really nice one, because it's really fast, and that's really helpful for knowledge graph construction. But Logan told me it will also work with models that don't provide native function calling, just not as well, or maybe the schema should be simpler with those LLMs. I haven't tested that out, but maybe you can test it and let us know how it works. So these are the out-of-the-box graph constructors that you can use.
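A hedged sketch of a schema-constrained constructor; the entity and relation names here are illustrative, not the workshop's exact schema:

```python
# Minimal sketch: the schema LLM path extractor only admits node labels and
# relationship types that the schema allows (when strict mode is on).
from typing import Literal
from llama_index.core.indices.property_graph import SchemaLLMPathExtractor
from llama_index.llms.openai import OpenAI

entities = Literal["PERSON", "ORGANIZATION", "LOCATION"]
relations = Literal["WORKS_AT", "LOCATED_IN"]
validation_schema = {
    "PERSON": ["WORKS_AT"],
    "ORGANIZATION": ["LOCATED_IN"],
    "LOCATION": [],
}

kg_extractor = SchemaLLMPathExtractor(
    llm=OpenAI(model="gpt-4o", temperature=0.0),
    possible_entities=entities,
    possible_relations=relations,
    kg_validation_schema=validation_schema,
    strict=True,  # drop anything the schema does not allow
)
```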
What's not in LlamaIndex yet — but possible, since LlamaIndex provides low-level connections to the graph store, Neo4j in this example — is custom entity disambiguation, which we will also do in this workshop. Entity disambiguation just means that if you have multiple nodes in the knowledge graph that reference the same real-world entity, you want to merge them into a single node so that you have better structural integrity. Here in this example, one node ends with "Limited", one has the abbreviation, and one doesn't have "Limited" at all, but they all reference the same entity, so we want to merge them into a single node for better structural integrity. In the workshop we'll use a combination of text embeddings and word-distance heuristics to find potential candidates and then merge them together. If you're eager, you can follow this link — it's the notebook we'll be using today.

Okay, and then on the other side we have the property graph retrievers. As you see, I just took the previous image and sliced it up, because here we have the remaining part of the arrow. The graph retrievers, as I said, apply some logic, based on the user input, to retrieve information from the knowledge graph and then pass that information to an LLM so that the LLM can generate the final answer — basically a typical RAG pipeline. We have, I think, four out-of-the-box retrievers you can use. I didn't have time to draw nice diagrams, so I'll just summarize them quickly.

The first one is the LLM synonym retriever. It takes the user query, generates synonyms using an LLM, and then finds relevant nodes using exact keyword match. That's really important, because the LLM is not aware of any values in the database when it's generating the synonyms, so it's not a given that the LLM will construct keywords that match any nodes in the graph. Because it uses exact keyword match — at least the Neo4j integration does — it's the least reliable of the four; we could optimize it to allow some misspellings and the like, but at the moment it's exact keyword match. Once it finds relevant nodes, it returns their direct neighborhood; you have the option to decide the distance, the neighborhood size of the nodes you want to return, but by default we return just the direct neighbors of each node.
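For reference, a hedged sketch of this retriever; the model name and parameters are illustrative:

```python
# Minimal sketch: LLM-generated synonyms, matched against graph nodes by exact
# keyword, then expanded to each matched node's direct neighborhood.
from llama_index.core.indices.property_graph import LLMSynonymRetriever
from llama_index.llms.openai import OpenAI

synonym_retriever = LLMSynonymRetriever(
    index.property_graph_store,
    llm=OpenAI(model="gpt-4o"),
    include_text=False,  # return graph paths rather than source chunks
    path_depth=1,        # direct neighbors only, as described above
)

retriever = index.as_retriever(sub_retrievers=[synonym_retriever])
nodes = retriever.retrieve("Who founded Acme?")
```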
The second one is the vector context retriever. In the previous one we used exact keyword search to find relevant nodes, but here we're using vector search, which makes it more robust and less reliant on exact keyword match: with vector search you will always get some results from the database, because you take the top n, and hopefully some relevant nodes are identified. Then we do the same thing as before: we return the direct neighborhood of the relevant nodes found via vector search.
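A hedged sketch, reusing the index above; the embedding model name and top-k are illustrative:

```python
# Minimal sketch: vector similarity finds the seed nodes, then the retriever
# expands to their direct neighborhood, mirroring the synonym retriever's
# second stage.
from llama_index.core.indices.property_graph import VectorContextRetriever
from llama_index.embeddings.openai import OpenAIEmbedding

vector_retriever = VectorContextRetriever(
    index.property_graph_store,
    embed_model=OpenAIEmbedding(model="text-embedding-3-small"),
    similarity_top_k=4,  # top-n nodes by vector similarity
    path_depth=1,        # then expand to their direct neighbors
)

nodes = vector_retriever.retrieve("Tell me about Acme's acquisitions")
```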
Another one is the text-to-Cypher retriever. As the name implies, we take the text and use an LLM to generate a Cypher statement. This is a very flexible approach, because the LLM can construct any sort of Cypher statement. For example — oops, how do I go back — with the vector context retriever, where you're just searching for relevant nodes using vector search and returning direct neighborhoods, it's very hard to answer questions like "how many nodes are in the graph?", because that's an aggregation query, and vector context is not suitable for aggregation queries, at least not at a global scale. But with text2cypher you could ask questions like how many nodes are in the graph, or how many people are in the graph, and the LLM will generate an appropriate Cypher statement and return that information for you. So text2cypher is much more flexible than the previous two, but there's obviously always a trade-off: it's less reliable, because we're using an LLM to generate the Cypher statements, and at the moment it's mostly correct, but not always. You're trading a bit of accuracy for flexibility; on the other side, you also gain things like aggregations, which the previous retrievers didn't allow.
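A hedged sketch of this retriever; the model name is illustrative:

```python
# Minimal sketch: the LLM translates the question into Cypher against the
# store's schema, so aggregations like MATCH (n) RETURN count(n) become
# possible.
from llama_index.core.indices.property_graph import TextToCypherRetriever
from llama_index.llms.openai import OpenAI

cypher_retriever = TextToCypherRetriever(
    index.property_graph_store,
    llm=OpenAI(model="gpt-4o"),
)

nodes = cypher_retriever.retrieve("How many nodes are in the graph?")
```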
And then the last one is the so-called Cypher template retriever. Here, instead of generating Cypher statements with an LLM, you define the Cypher statement you want executed and provide it as a parameterized Cypher template: you have a Cypher statement with one or more parameters, and you give the LLM instructions on how to populate them. At query time, the LLM extracts the relevant parameter values, populates the template, and the predefined Cypher template is executed. That's where the "template" name comes from: it's predefined.
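A hedged sketch following the pattern in the LlamaIndex docs; the query and parameter model here are illustrative:

```python
# Minimal sketch: the Cypher is fixed ahead of time; the LLM only fills in
# the $names parameter via the Pydantic model below.
from pydantic import BaseModel, Field
from llama_index.core.indices.property_graph import CypherTemplateRetriever

cypher_query = """
MATCH (c:Chunk)-[:MENTIONS]->(o)
WHERE o.name IN $names
RETURN c.text
"""

class TemplateParams(BaseModel):
    """Parameters the LLM must fill in for the predefined Cypher template."""
    names: list[str] = Field(description="Entity names to look up in the graph.")

template_retriever = CypherTemplateRetriever(
    index.property_graph_store, TemplateParams, cypher_query
)
```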
I have some questions on the slide here, but let's do a demo.

I'm just going to read through some questions in the chat, to make sure we cover some of them before the workshop. The first is about using LLMs: do you have a set of recommended LLMs that you think are better for, say, knowledge graph construction, and what about the cost of running LLMs across a large corpus of documents to construct a knowledge graph? I'm curious to get your initial thoughts, as well as recommendations for some of these users — some of them are thinking about using Groq, for instance.

Yeah, so what you will see is that graph construction with LLMs is very model dependent: different models will generate different graphs, and even different versions of GPT-4 will behave differently. I did some testing — I'm not an expert in all of the LLMs — but for example, what I can tell you is that when you're using a predefined schema, GPT-3.5 will try to fit all the information into the schema, so it kind of overfits information into the schema where it wouldn't really fit in reality, whereas GPT-4 Turbo and GPT-4o are much better at ignoring information that is not part of the schema. I would also really recommend using LLMs that are fast: plain GPT-4, for example — just throw it out of the window, because it takes forever and it's costly. Groq is really nice because it's really fast, but the problem with Groq is they don't want to take our money just yet, so when they do start taking our credit cards, that's something I would definitely look into. In general, the better the model, the better the results will be, and the more it will follow the prompt instructions.

Great, and just following up with one more quick question — this might not quite exist in our abstractions right now — one of the questions is about dealing with missing information in the graph, which sort of implies that maybe you do some LLM construction pass, it's not exactly where you want it to be, and so you do some human-in-the-loop pass to try to modify and shape the graph to better reflect what you want out of that data. Have you seen that kind of human-in-the-loop approach to graph construction?

So it's not really human in the loop; it's more that they use some heuristics. If you want to take a look, the GraphRAG paper by Microsoft is really nice, and it deals with some of these questions. The first one is what size of text chunks you should use, and the funny thing is that the number of nodes and relationships extracted is roughly independent of the chunk size. That means that if you're using smaller text chunks, more information will be extracted overall, while with bigger text chunks the number of extractions stays about the same, so in sum less information is extracted from larger chunks. That's one thing they mention in the paper. The second thing: they have some heuristics to decide that not enough information was extracted from the text, and then they do a second pass. So instead of having a human in the loop, it's automated: it says "you didn't do a good enough job, now let's do a second pass on the graph."

Oh okay, yeah, super interesting. Any other questions? Feel free to carry on — there's a ton of questions, but I think we'll manage.

The extraction part will take a couple of minutes, so we can answer questions while it runs. So here I define my graph store — okay, just a second — okay, that's fine. One thing I also noticed is that people are sometimes confused by documents, because LlamaIndex mostly deals with documents; but a document is just a wrapper around text, so it's very easy to go from text to document: we just instantiate a Document with the text property, and that's about it.
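A minimal, hypothetical example of that wrapping step (the article strings are made up):

```python
# Documents are just wrappers around text.
from llama_index.core import Document

news = ["Acme Corp acquired Widget Ltd.", "Jane Doe was appointed CEO of Acme."]
documents = [Document(text=article) for article in news]
```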
So here in this example we're going to create a bunch of documents based on news articles — we have a bunch of news — and we're going to use GPT-4o. One interesting thing, among the many things popping up lately: today or yesterday somebody did some benchmarks and said that if you use a slightly higher temperature than zero, even for deterministic tasks, you get better results, and that was specifically for GPT-4o. Again, that's quite interesting, and we're all learning as we go along.

As mentioned, we're going to use the schema LLM path extractor in this workshop. With the schema LLM path extractor, you have to define the types of nodes you want to extract. Here I went for person, location, organization, product, and event, so it's mostly a very typical extraction — and then there's "event", which is more ambiguous and allows the LLM to extract a lot of information, because an event can be basically anything. We also have to define the types of relationships we want to extract. Here I focused on the organization and business side, where we have suppliers, competitors, acquisitions, subsidiaries, CEOs, and so on, so hopefully we're going to extract some business-relevant, financial information from the knowledge graph. That's the first part, where we define the schema. The second part is that we also have to define which relationships are allowed on each node label, because not all relationships can apply to all node labels. For example, a product only has the "provides" relationship, and "provides" is only on the organization, so I would expect it to generate "provides" relationships only between organizations and products — it doesn't really make sense to have a "provides" relationship from a location to, say, a product. So this is a slightly more granular schema definition that we need to provide.
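A hedged sketch of the kind of schema just described; this paraphrases the workshop notebook rather than reproducing it exactly:

```python
# Node labels, relationship types, and which relationships each label may
# participate in -- the "granular" second half of the schema.
from typing import Literal

entities = Literal["PERSON", "LOCATION", "ORGANIZATION", "PRODUCT", "EVENT"]
relations = Literal[
    "SUPPLIER_OF", "COMPETITOR", "ACQUISITION", "SUBSIDIARY", "CEO", "PROVIDES"
]

validation_schema = {
    "PERSON": ["CEO"],
    "ORGANIZATION": [
        "SUPPLIER_OF", "COMPETITOR", "ACQUISITION", "SUBSIDIARY", "PROVIDES"
    ],
    "PRODUCT": ["PROVIDES"],  # products participate only in PROVIDES
    "LOCATION": [],
    "EVENT": [],
}
```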
Then we just go with 100 articles, and we pass the possible entities, the possible relationships, and the validation schema to the LLM. Here you also have the strict mode. Even if you provide instructions to the LLM about which types of nodes and relationships it should use, that doesn't mean it will follow them 100% correctly — LLMs just do what they want. So Logan implemented a strict mode: since we know which types of relationships and nodes we expect, we can filter everything else out in code if we want to, or we can keep any other nodes and relationships that were identified. In this case, let's just allow any information the LLM decides to extract additionally. Now, GPT-4o is quite good at following the provided schema, partly because, as I said, it's a native function-calling model, and when you're using functions or tools to extract information you get much better accuracy. Llama 3 via Ollama, by contrast, doesn't have function calling; it's still a really good model, but it might not always follow the schema, and that's why you have the option to filter in postprocessing if you want to — here we'll go for not. We're going to extract information from 100 articles, and it's going to take two or three minutes, I think, so we have time for a couple of questions.

Yeah, for sure. I'm trying to figure out which questions to ask. Maybe one thing, actually, going back to the retrieval side: there's vector search with the vector context retriever, and then there's also text2cypher, and you mentioned some of its limitations. In your mind, what are some of the tips and tricks for getting text2cypher to work a little better for users, in terms of making sure it generates more reliable Cypher queries and actually retrieves valid context?

I mean, this is kind of a hard question. Text2cypher works well when the user knows what's in the database and knows how to ask questions that fit the schema. So how do you achieve that? One thing you could do is add a query-rewriting step that takes the user input and rewrites it into a question that fits the graph schema and is a little more verbose or explicit about how it wants the information to be retrieved. Providing few-shot examples is also very helpful: by default it uses zero-shot generation — we just give it the graph schema and hope for the best — but you can also provide few-shot examples and hope it follows them. And obviously, with more complex graph schemas, just describing the schema takes a lot of tokens, and — maybe not linearly — the bigger the schema, the lower the accuracy will be. So what you can also do is provide only parts of the schema: instead of one text2cypher retriever that deals with the whole graph schema, you can have an agent with several tools, where each tool focuses on a different part of the schema, so you lower the complexity of the task.

Yeah, that makes a lot of sense. I know it's about to finish up, but maybe just one more question, and we can carry this over after things are done: is Neo4j designed to work with technical-document use cases like patents and scientific papers? Will it help in identifying and building relationships between scientific and technical concepts? That's one of the questions from the audience.

Yeah, so Neo4j is domain agnostic — you can store any information you want in it. That said, it's quite funny that you mention patents and technical documentation, because that's really relevant to what we see with a lot of pharmaceutical and biomedical companies; there's a lot of money in patents. For me it was also interesting: biomedical companies, when they have a great idea for what they should research, you know what they do first? They check whether there's already a patent, and if there is, they don't research it, because it won't make money if you can't patent it. And I've seen big pharmaceutical companies that all have their own patent graph. They all pull from PubMed — you don't actually have to scrape it, because it has APIs — which you can think of as biomedical technical documentation with all the latest research, and they generate knowledge graphs from it and use them to inform or recommend. One thing they do is generate a graph from all the latest research and then use recommendations to suggest to doctors, based on their specialization, which articles they should read. So yes, Neo4j can definitely be used for patents and technical documentation, and it is actually used by existing customers for exactly that.

Okay, so now that we've imported the graph, we can also take a look at it — the graph visualizations are quite nice. Let's see what we have. For example, let's see why we have an Award: we have two Award node labels, which is kind of funny, because it wasn't in our description. So even GPT-4o can decide, "oh, an award" — the FA Cup and the English league title. Let's see — probably it will be about who won; it was Gold McQueen who won the cup, so there should probably be a football team in there, but it's interesting. And we have a Disease as well, so even GPT-4o can decide to add some information that wasn't in the schema, and that's why we have strict mode: if we had used strict mode, we wouldn't see these nodes in the graph, because we would have filtered them out automatically. Let me see if there's anything more connected... okay, cool. For example, UnitedHealth Group is a node, and we can see a bunch of competitors; we can also see it's probably not doing so well, because it had a stock sell-off and a stock price decline. As I mentioned, "event" is ambiguous and can be a lot of things — in this case it was a stock price decline, plus a person who works at UnitedHealth Group. So overall GPT-4o followed the schema quite nicely, and we can see a nice graph here. Let's go forward.

As I mentioned before, entity deduplication is kind of a must. I think it's often overlooked, but you want to find entities — nodes in the graph — that reference the same real-world entity and merge them. Here we have a fairly involved Cypher query, which took multiple people about eight hours to come up with, but in the end we found a nice way of combining text embeddings with a word-distance heuristic: there's a cosine-similarity threshold, and then a word distance — how many characters you can change in the string while still treating it as the same — and you can see it works quite well. For example, "Bank of America" and "Bank of America Corporation", the music groups, Newcastle United, Coinbase — overall it works really nicely for finding these duplicates. Obviously it's not perfect, because nothing in life is perfect: for example, these two are kind of the same, but one is a fireside chat, so maybe not really; and Baltimore — I can understand that these two should be merged, but maybe this one is a city and shouldn't be merged with them. So as always, you have the option to tweak these two parameters, and you also have the option to do some manual, human-in-the-loop review — here, human in the loop is important so you know which entities you're merging — but having some sort of baseline to start with is really nice, and I think this Cypher query is really nice, because you can see a lot of entities that should be merged together. So let's just merge them together.
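For reference, a hedged, heavily simplified sketch of the idea behind that query — the full version lives in the workshop notebook and accompanying blog. The vector index name, node label, thresholds, and the APOC edit-distance call are illustrative assumptions, and the actual merge step (e.g., via apoc.refactor.mergeNodes) is omitted:

```python
# Candidate duplicates: nearby in embedding space AND within a small edit
# distance of each other.
candidates = graph_store.structured_query(
    """
    MATCH (e:__Entity__)
    CALL db.index.vector.queryNodes('entity', $k, e.embedding)
    YIELD node, score
    WHERE score > $similarity_cutoff
      AND node <> e
      AND apoc.text.levenshteinDistance(e.name, node.name) <= $max_edit_distance
    RETURN e.name AS entity, collect(node.name) AS duplicates
    """,
    param_map={"k": 10, "similarity_cutoff": 0.95, "max_edit_distance": 3},
)
```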
And then, for the last part, as we said, we're going to implement a custom retriever. We have the four existing ones, but here we're going to implement a retriever that first identifies all the relevant entities in the text. The vector context retriever just takes the whole string, embeds it, and finds relevant nodes — but what if multiple entities are mentioned in the text? Then the vector index might not be the greatest, because it will embed both entities into a single embedding, and then who really knows what happens with those numbers? There's a bunch of zeros and ones, and what do they actually represent? Who knows.

Before the retrieval piece — actually, a quick question on the entity disambiguation: that Cypher query, given how involved it is, and given that I imagine a lot of people probably need to do some sort of dedup — is this a template that's shared publicly? It seems like it would be generally useful for a lot of people.

Yeah, this is part of the blog; it's all available over there. I mean, we can add a link in the webinar chat — let me spam it a little bit, to everyone.

Yeah, no problem — I think we shared the notebook in the chat. But basically, to the audience: if you want a nice Cypher query to do entity dedup — obviously there are some limitations; you'll probably need to tweak the word similarity and those types of things a little bit — but if you want an existing template to go off of, you can just copy and paste it from this notebook, because it's a pretty long Cypher string, and I wouldn't expect a lot of people to write it from scratch. I would make it a module in LlamaIndex; it's just that then it's Neo4j-specific, and it doesn't fit the best into LlamaIndex, because you want things that are, let's say, integration agnostic. But maybe we can figure out in the coming months how to add it, because it would be nice to have this out of the box — you just expose these two parameters and let it do the magic. Yeah, I think even the raw Cypher is useful for the audience.

I'm just doing a quick check on time: I know we technically have 5 to 10 minutes left, and the last section is the custom retriever section, but maybe we can just walk through the high-level concepts, go through the overall class, and that should be a good conclusion to this workshop.

Yeah, we can do this quite fast. As I said, we extract entities from the user input, and we use a Pydantic program — an OpenAI Pydantic program. I would imagine it uses function calling behind the scenes: we say, this is your output parameter, a list of named entities in the text, and we ask GPT-4o to fill it in. Okay, I'm rambling a bit — so how do you define your custom retriever? Your custom retriever just needs two methods, or actually just one, but the init is also quite handy if you want to instantiate some other functions or classes. In the init here we instantiate the entity extraction — the OpenAI Pydantic program that extracts relevant entities from text — and we also instantiate an existing vector context retriever so we can use it. Then, in the custom retrieve method, the code is very simple: we extract or find — detect is maybe the best word — whether there are entities in the text. If there are, we run the vector retriever for every entity in the text, and if the LLM doesn't find any specific entities, we just run the vector retriever on the whole text. That's basically it. Then you have a couple of options for the structure or format of the results you pass back to the LLM; in this example we just pass back the text. We can remove this part, because we don't need to change anything.
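A hedged sketch of the retriever just described, loosely following the CustomPGRetriever pattern from the LlamaIndex docs; the extraction prompt, model names, and class names are illustrative stand-ins for the workshop's code:

```python
# Custom retriever: detect named entities first, then run one vector search
# per entity so each gets its own top-k results.
from pydantic import BaseModel, Field
from llama_index.core.indices.property_graph import (
    CustomPGRetriever,
    VectorContextRetriever,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.program.openai import OpenAIPydanticProgram


class Entities(BaseModel):
    """Named entities detected in the user question."""
    names: list[str] = Field(description="Named entities found in the text")


class EntityVectorRetriever(CustomPGRetriever):
    def init(self, **kwargs) -> None:
        # Function-calling program that detects entities in the question.
        self.entity_program = OpenAIPydanticProgram.from_defaults(
            output_cls=Entities,
            prompt_template_str="Extract the named entities from: {text}",
            llm=OpenAI(model="gpt-4o"),
        )
        # Reuse the built-in vector context retriever for the graph lookups.
        self.vector_retriever = VectorContextRetriever(
            self.graph_store,
            embed_model=OpenAIEmbedding(),
            similarity_top_k=2,
        )

    def custom_retrieve(self, query_str: str):
        entities = self.entity_program(text=query_str).names
        if not entities:
            # No entities detected: fall back to the whole question.
            return self.vector_retriever.retrieve(query_str)
        results = []
        for entity in entities:
            results.extend(self.vector_retriever.retrieve(entity))
        return results


retriever = index.as_retriever(
    sub_retrievers=[EntityVectorRetriever(index.property_graph_store)]
)
```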
Then we just instantiate the whole thing and see what happens. So if we ask what it knows about two entities at once, the LLM detects both entities, and for each of them it runs the vector retriever separately, which ensures we get information for both. If you just used vector search on the whole string — the text embedding of the whole string — you might get results for one entity but not the other, because with a top-k cutoff, one entity may be more significant in the text embedding; with this approach we make sure to cover all the entities, so we get nice answers for both. So that's a high-level overview of the retriever, and now we can answer a couple more questions.

Yeah — and maybe just to say a few words to help wrap this up: what Tomaz really showed you was an end-to-end process of both constructing a knowledge graph and retrieving from it, and not just that, but showing both the high-level API as well as the lower-level API. So whether you're a beginner with knowledge graphs, LLMs, LlamaIndex, and Neo4j — you can basically do all this stuff in about five lines of code — or you're an advanced user who's pretty familiar with knowledge graphs, we offer a lot of opportunities for you to define your own custom extractors and retrievers with our core abstractions, including a robust property graph store as the underlying low-level storage system. I think a lot of people are interested in knowledge graphs; we basically see this as a potential superset of existing RAG solutions, especially if you're able to leverage these properties and relations to help augment your retrieval. And if you're, say, an enterprise developer building knowledge graphs within your company, feel free to reach out to one of us about a blog post or case study — we're always happy to feature really interesting use cases of knowledge graphs, LLMs, LlamaIndex, and Neo4j, and always happy to showcase interesting applications. Hopefully this workshop was useful to all of you today. We'll have this on our YouTube channel, and hopefully we'll even do a series covering other types of topics as we go forward — we're definitely looking forward to new types of applications built with knowledge graphs, KGs, and LLMs. With that said, it's probably a good time to wrap up. Really sorry — I know a lot of you had a lot of questions in the chat, and we weren't able to get through all of them, but we'll have the YouTube video out, and feel free to comment there as well. So thank you everyone, thank you Tomaz, and thanks Logan for hopping in.