Create Your Own Microsoft Recall AI Feature with RAG?
Summary
TLDR: This video presents a project that builds a custom version of Microsoft's controversial Recall feature. The developer originally planned to run everything locally, but because performance was unstable, the prototype was built with the cloud-based GPT-4o instead. The project is divided into three phases: record, analyze, and use. Screenshots are taken while the screen is monitored for changes, and in the analyze phase GPT-4o extracts information from them. Finally, a RAG system lets the user search over past screenshots. The video walks through the code and a live demo, and also introduces the learning platform Brilliant.org.
Takeaways
- 🤖 The project idea is inspired by Microsoft's Recall feature, with the goal of building a custom version of it.
- 🚫 The plan was to run everything locally, but performance was too unstable, so the developer had to fall back on GPT-4o.
- 🔄 The project is divided into three phases: record, analyze, and use (RAG).
- 📸 In the record phase, screenshots are taken while the screen is monitored, and a new screenshot is saved whenever at least 5% of the pixels change.
- 🧐 In the analyze phase, GPT-4o extracts information from each screenshot and stores it in an archive.
- 🔍 In the RAG phase, the saved archive and screenshots can be searched for specific actions or websites.
- 💻 The code walkthrough covers the strength of the GPT-4o model and why running this locally is still difficult.
- 🔗 The RAG search demonstration retrieves past screenshots and related information in response to specific questions.
- 👨‍🏫 brilliant.org is introduced as a resource for deepening your understanding of data analysis and language models.
- 🔗 A link to the community GitHub is provided and the code will be uploaded, with a caution about privacy.
- 🎉 The project works successfully, and the developer hopes it can eventually run fully locally.
Q & A
What is the main goal of the project presented in the video?
-The video presents a project that builds a custom version inspired by Microsoft's Recall feature. It takes screenshots of the computer screen, analyzes them, and uses a RAG (Retrieval-Augmented Generation) model to create a searchable archive.
Why was the project originally planned to run 100% locally?
-Running the project locally was meant to ensure stable performance and protect privacy. However, the available local models were not capable or stable enough, so the plan changed to a prototype using the cloud-based GPT-4o.
What are the three phases of the project?
-The project is divided into three phases. The first is the record phase, where screenshots are taken. The second is the analyze phase, where the screenshots are analyzed and user actions and URLs are extracted. The third is the RAG phase, which provides search over the archived information.
What condition has to be met before a screenshot is taken?
-A screenshot is only taken when at least 5% of the pixels have changed compared to the previous screenshot. This avoids capturing the same screen over and over and keeps the screenshot archive manageable.
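The video does not show this function in isolation, but a minimal sketch of such a pixel-diff check could look like the following, assuming OpenCV and NumPy; the per-pixel intensity threshold of 25 is an illustrative choice, not taken from the video.

```python
import cv2
import numpy as np

def percent_changed(prev_path: str, new_path: str) -> float:
    """Return the percentage of pixels that differ between two screenshots."""
    prev = cv2.imread(prev_path, cv2.IMREAD_GRAYSCALE)
    new = cv2.imread(new_path, cv2.IMREAD_GRAYSCALE)
    # Guard against missing files or resolution changes.
    if prev is None or new is None or prev.shape != new.shape:
        return 100.0
    diff = cv2.absdiff(prev, new)
    changed = np.count_nonzero(diff > 25)  # per-pixel intensity threshold (illustrative)
    return 100.0 * changed / diff.size

# Only keep the new screenshot if at least 5% of the pixels changed.
if percent_changed("previous.png", "current.png") >= 5.0:
    print("save and analyze the new screenshot")
```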
Why is GPT-4o used?
-GPT-4o is the best vision model the developer has tried so far, and it is very well suited to extracting user actions and URLs from screenshots. An open-source model of comparable quality would make a fully local version feasible, so the developer hopes such a model arrives in the future.
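As a rough illustration of the analyze step, here is a hedged sketch of calling GPT-4o through the OpenAI Python SDK with a base64-encoded screenshot. The prompt text is the one quoted later in the video; the function name and surrounding structure are assumptions.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def analyze_screenshot(path: str) -> str:
    """Ask GPT-4o to describe what is happening in a screenshot."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract the most important information from what is "
                         "happening in the image. Include URLs on the website "
                         "if applicable."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```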
What is the purpose of the RAG model?
-The RAG model builds a searchable archive so the user can look up past actions and specific screenshots. This lets the user ask about what they did on a given day or website and find the related screenshots.
What does the video sponsor, brilliant.org, offer?
-brilliant.org is an online learning platform that turns complex subjects, such as data analysis and how large language models work, into engaging, hands-on experiences. Its lessons teach by solving real problems, which builds critical thinking skills and a daily learning habit.
What is the chunking the developer uses?
-Chunking is the process of splitting text into chunks of at most 1,000 characters. This organizes the text so the RAG model can search it efficiently.
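A minimal sketch of that kind of chunking, using the 1,000-character limit from the video; the whitespace-based splitting and the append to history.txt are assumptions about how the pieces fit together.

```python
def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking on whitespace."""
    chunks, current = [], ""
    for word in text.split():
        if len(current) + len(word) + 1 > max_chars:
            chunks.append(current.strip())
            current = ""
        current += word + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks

# Each analyzed action is chunked and appended to history.txt, one chunk per
# line, ready to be embedded in the RAG phase.
with open("history.txt", "a", encoding="utf-8") as f:
    for chunk in chunk_text("...analyzed user action text..."):
        f.write(chunk + "\n")
```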
How is GPT-4o used to name the screenshots?
-GPT-4o is asked to generate a short, concise, relevant file name based on the description of each screenshot. This makes the archived screenshots effective search keywords, so the user can quickly find the information they need.
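A hedged sketch of that naming step; the prompt wording follows the one read out in the video, while the sanitization of the reply into a safe file name is an illustrative addition.

```python
from openai import OpenAI

client = OpenAI()

def generate_filename(description: str) -> str:
    """Ask GPT-4o for a short, concise, relevant file name for a description."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "Generate a short, concise, relevant file name for the "
                       "following description: " + description,
        }],
    )
    # Sanitize the reply into something safe to use on disk (illustrative).
    name = response.choices[0].message.content.strip().strip('"')
    return "".join(c if c.isalnum() or c in "-_ " else "_" for c in name) + ".png"
```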
What did the developer's prototype demonstration show?
-The developer demonstrated how screenshots are automatically taken, analyzed, and stored in an archive that the RAG model can search. The demo also showed actually using RAG to look up past actions and screenshots.
What is planned for future projects?
-The developer has more interesting projects planned and may find a better vision model that makes fully local execution possible. The code will also be uploaded to the community GitHub, with access provided through the channel membership.
Outlines
🤖 Building your own version of Microsoft's Recall feature
The video presents a project inspired by Microsoft's Recall feature. The plan was to run it 100% locally, but performance was unstable, so the prototype was built with GPT-4o instead. The project is divided into three phases: record, analyze, and use. Screenshots are taken, and a new one is only saved when at least 5% of the pixels have changed.
🔍 Analyzing and archiving screenshots
GPT-4o analyzes each screenshot and extracts information such as website URLs and user actions. The screenshots are saved to a new folder and archived for later reference. GPT-4o also generates a custom name for each screenshot so it can be found in searches later.
📚 Putting the system to use with RAG
A RAG (Retrieval-Augmented Generation) setup combines the saved screenshots with the extracted user actions to make them searchable. A local Ollama embedding model handles the search, so you can ask about specific actions or whether you visited a particular website on a given day; a sketch of this embed-and-search step follows below.
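A minimal sketch of that embed-and-search step, assuming the `ollama` Python package with Llama 3 and an embedding model pulled locally. The model names (`mxbai-embed-large`, `llama3`) and the cosine-similarity search are assumptions, since the video only says it reuses the developer's local RAG setup.

```python
# pip install ollama numpy
import ollama
import numpy as np

EMBED_MODEL = "mxbai-embed-large"  # assumed local embedding model

# Embed every line of history.txt with a local Ollama embedding model.
with open("history.txt", encoding="utf-8") as f:
    lines = [line.strip() for line in f if line.strip()]

embeddings = np.array([
    ollama.embeddings(model=EMBED_MODEL, prompt=line)["embedding"]
    for line in lines
])

def search(query: str, top_k: int = 3) -> list[str]:
    """Return the history lines most similar to the query (cosine similarity)."""
    q = np.array(ollama.embeddings(model=EMBED_MODEL, prompt=query)["embedding"])
    scores = embeddings @ q / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q))
    return [lines[i] for i in np.argsort(scores)[::-1][:top_k]]

# The top matches are passed as context to a local Llama 3 chat model.
question = "did I visit Discord yesterday?"
context = "\n".join(search(question))
answer = ollama.chat(model="llama3", messages=[
    {"role": "user",
     "content": f"Archive data:\n{context}\n\nQuestion: {question}"},
])
print(answer["message"]["content"])
```

Because both the embedding model and the chat model run through Ollama, this retrieval half of the pipeline stays fully local; only the screenshot analysis depends on the GPT-4o API.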
💻 Code details and demonstration
The video explains the most important functions in the code: the prompt used to analyze screenshots, the chunking, and how everything is fed into the RAG setup. It also covers how screenshots are compared and how the diff percentage that triggers a new screenshot is set. The script is then run live to show how browsing websites and working in the terminal gets recorded and becomes searchable.
🔗 Searching with RAG and checking the results
The video shows how to embed the new history with RAG and search over past screenshots and user actions. Given a specific question, it finds screenshots of a particular tweet or GitHub post and surfaces the details of that post.
🚀 Project wrap-up and outlook
The video closes with a summary of the prototype and the outlook for the future. It would be great to run this 100% locally if a capable local model existed; since the current GPT-4o version raises privacy concerns, the developer notes it should not really be used day to day. Still, there is hope it can be upgraded into something even more interesting in the future.
Keywords
💡Microsoft Recall feature
💡GPT-4o
💡record phase
💡analyze phase
💡RAG phase
💡Ollama embedding models
💡chunking
💡screenshot comparison
💡OpenCV (cv2)
💡community GitHub
Highlights
The project is inspired by Microsoft's Recall feature and attempts to build a custom version of it.
The original plan was a fully local implementation, but performance was unstable, so the GPT-4o model was used instead.
The project is divided into three phases: record, analyze, and use.
The record phase takes screenshots and monitors pixel changes to avoid duplicate captures.
The analyze phase uses GPT-4o to extract user interactions, URLs, and other information from the screenshots.
The archive function lets the user look back through their history.
The use phase performs retrieval with a local Ollama embedding model and Llama 3 search.
A demo shows a search query finding the screenshot of a Discord visit on a specific day.
The sponsor brilliant.org is introduced, offering courses on data analysis and large language models.
The code walkthrough shows how screenshot analysis and archiving are implemented.
GPT-4o is shown extracting the important information from images.
User activity and screenshot information is chunked and archived for retrieval.
Pixel-change comparison decides when a new screenshot is taken.
A local embedding model and search implement the recall functionality.
The open-source status of the code and data privacy concerns are discussed.
A GitHub link is provided for interested viewers.
The script is run live to retrieve specific information.
Future possibilities and improvements for the recall feature are discussed.
Viewers are encouraged to share local solutions, with potential future collaboration.
The video wraps up with the project's significance and outlook.
Transcripts
Today's project is of course going to be heavily inspired by the kind of controversial Microsoft Recall feature, so I really want to see if we can create our own version of this. My plan was to do this 100% locally, but the vision models just weren't good enough, the performance was not stable, so I had to abandon that and just make a prototype using GPT-4o, because that is by far the best vision model I have tried so far. So yeah, it's a bit of a shame, I wanted it to be local, but that could be something for the future.

Let me just go through how we set this up and how I wanted this to work. I divided this into three phases: we have the record phase, we have the analyze phase, and to actually use this we have a RAG phase. I'm going to explain that, but let's just start here on the record phase. Basically, when we fire up our script, it's going to screenshot our computer screen, save those screenshots, bring them on for further analysis, and put them into a RAG system. But I wanted to implement something that monitors pixel changes on our screen, because we don't want to spam the same screenshot over and over again. So I implemented something that looks at our monitor, and if there's a 5% pixel change compared to the previous screenshot, then it's going to take a new screenshot, save it, and bring it over to the analyze phase. I also implemented a step down here where GPT-4o creates a custom screenshot name based on its analysis, because we want to archive those screenshots so we can look them up later; that is kind of the idea behind the Recall feature, right?

If we go to the analyze phase now, you can see that when a screenshot is saved, it's going to be analyzed by GPT-4o. This extracts user interactions, what happened on the screen, any URLs, and of course the name that is associated with the screenshot, and this is going to be put in our archive. The screenshot itself is also saved to a new folder, so we can look it up if we want to go back in time and see what we did in that exact frozen moment. I think that's roughly how the Microsoft Recall feature works, so I thought it was pretty interesting, and it does work.
So let's move on to the RAG phase; this is how we can actually use the system. Here you can see we take the archive, which is the user action plus the link to the screenshot we have saved, and we create embeddings from it. This is done locally, so we are using local Ollama embedding models, and we use Llama 3 to search over that RAG space. Here we can just search for something like "did I visit Discord yesterday?", and then you can find the associated screenshot from when you visited Discord and what you talked about there, because you get the recent action from the Discord screenshot and you get the name, so you can look up the screenshot and see exactly what you did. That is basically the whole idea behind this. So now I think we're just going to take a quick look at the code and how I set this up, and then bring it out in action and see if it actually works.

But before we do that, if you want to learn more about how LLMs work, data analysis, science, and all that stuff, take a look at today's sponsor, brilliant.org. Are you eager to dive into the world of data analysis, or to understand how large language models work? Then you're going to love brilliant.org, the sponsor of today's video. Brilliant turns complex subjects into engaging, hands-on experiences that make learning fun and effective. I especially like the Building Regression Models course, which is perfect for learners at any level; you learn how to visualize massive data sets and make better-informed decisions, from Bayes' theorem to multiple linear regression. Another favorite of mine is of course the How LLMs Work course. This immersive AI workshop lets you explore the mechanics of large language models, showing you how they build vocabulary and generate different outputs like poetry or cover letters. Brilliant's approach to learning is proven to be six times more effective than traditional lectures: by solving real problems, you build critical thinking skills and a powerful daily learning habit. To try everything Brilliant has to offer free for 30 days, visit brilliant.org/allaboutai or just click the link in the description below. Go start your learning journey today, and a big thanks to Brilliant for sponsoring this video.
Okay, so now let's walk through some of the most important functions in our code to actually make this work. I want to start with the analyze-screenshot function, because this is of course the most important part. You can see we are running the GPT-4o model; it's just so good, by far the best vision model I've ever tried. I just wish there was an open-source model on this level soon so we could make this 100% local; think about how cool that would be. But as it stands, I don't even think you should use this, because it sends some of your proprietary data out over the API. So this is just a prototype of something we can hopefully have in the future, but it is really fun to play around with, and it does work.

Let's just focus on the code here for a while. Let me show you the prompt I use: "Extract the most important information from what is happening in the image. Include URLs on the website if applicable." I want to be able to extract the URLs I was on; let's say I remember I was on some web page but I don't remember the exact URL, then I can search up that information and find the URL I was on. That is the idea behind this. It's a pretty simple prompt, but it works for my use case here.
Okay, so now we come to the chunking part. This is important for the RAG, and it's taken from my easy local RAG setup: we divide the text into chunks of a maximum of a thousand characters, and every action is chunked into history.txt so we can search it with RAG. I just wanted to show you an example. Here is one action we took from a screenshot: you can see we have the PNG image that is associated with this saved action. The user was engaged in multiple activities on the computer: running a Python script, executing a Python script called recall in a PowerShell terminal, reviewing a directory (and we have the directory path), working on a Canva project. So we got a lot of information from this, and we can also look up the image. If we go to my folder now, you can see I have the archived user activity, so here you can see everything that happened in that image, and it's associated with this action. And we have another image here: the user was watching a YouTube video called "Introducing Copilot PC" by Microsoft, and this is our other screenshot, so you can see it here. So yeah, I think it's working pretty well, and that is how we set this up.
We also have a GPT-4o step to rewrite these queries: "You are an expert at extracting information from a text." We use that. And here is the compare-screenshots function: we take the previous screenshot and compare it with the new one, and if the difference in pixels is above the set percentage, then we execute on that screenshot. Because if I just leave my screen sitting like this, there are no pixel changes, so we're not going to take a screenshot; only if the user does something else. That is the idea behind it, and that is what I understood Microsoft is trying to do as well. I set the diff percentage to five; I don't know what the optimal value is here, but I set it to five for now and it seems to be working pretty well.
You can see here the GPT-4o prompt: we feed in the result from the image description, and then we say, from the image description, extract information about what the user is doing on the computer, and include a URL if applicable. The reason I run it again through a new function here is that the results from just the first pass weren't too good, so I tried running it again through a new function, just feeding in the results, and that seemed to work pretty well, so I left it like this. And here is the relevant-file-name query. If we look at the names of our images, you can see my file name here is "Microsoft Copilot PC keynote", and this file name comes from using the GPT-4o model: "Generate a short, concise, relevant file name for the following description." This file name is quite important, because it can be used in our RAG phase to get keywords; if we type "Microsoft Copilot", this will pop up in our RAG search. That is why I want these file names to be relevant and not just some random name.
Is there anything else to say about the code? It's pretty straightforward. It ends with us chunking everything and putting it into this history.txt, each entry on a new line, so it's ready to be embedded in our RAG model. I might do a follow-up video in my member section if people are interested in diving deeper into this. Of course, this code is going to be uploaded to the community GitHub, so if you want straight access to it, just become a member of the channel, follow the link in the description, and I will invite you to our community GitHub and our community Discord. At the end here, if we want to stop this, we can just do a keyboard interrupt and it will exit; other than that, it's just going to run in a while True loop until we stop it. I also added a 3-second delay before starting, because when we fire up our terminal I just wanted a small sleep there. And that is basically it; to compare screenshots we are using OpenCV (cv2), and it seems to be working pretty well. So I think I just want to show you how this works in action now and what we can do with it.
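Pulling those pieces together, a hedged sketch of the overall loop might look like this. It reuses `percent_changed` and `analyze_screenshot` from the sketches above; `pyautogui` for the capture and the 2-second polling interval are assumptions, while the 3-second startup delay, the while True loop, and the keyboard-interrupt exit come from the video. The chunking and file-naming steps are omitted here for brevity.

```python
import shutil
import time

import pyautogui  # pip install pyautogui (the capture library is an assumption)

def take_screenshot(path: str) -> str:
    """Capture the full screen and save it to path."""
    pyautogui.screenshot(path)
    return path

def main() -> None:
    time.sleep(3)                                   # small startup delay, as in the video
    prev = take_screenshot("prev.png")
    try:
        while True:                                 # run until the user stops the script
            time.sleep(2)                           # polling interval (illustrative)
            cur = take_screenshot("current.png")
            if percent_changed(prev, cur) >= 5.0:   # from the earlier sketch
                description = analyze_screenshot(cur)  # from the earlier sketch
                with open("history.txt", "a", encoding="utf-8") as f:
                    f.write(description.replace("\n", " ") + "\n")
                shutil.copy(cur, prev)              # the new screenshot becomes the baseline
    except KeyboardInterrupt:
        print("recorder stopped")                   # Ctrl+C exits the loop

if __name__ == "__main__":
    main()
```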
Okay, so let's run our script now and see how this works. Let's say I'm just starting my work day, so I fire up the script, go to this website, and pretend I'm reading it. Hopefully it has taken a screenshot of this now. I'll leave it here for a moment, because you can see it's measuring the diff percentage. Now let's switch over to this X post. Okay, so we changed our image, and if we go back you can see the percentage changed to 26%, which means we took a new screenshot, because the pixels changed. We take a look at this post on X, and now we get a second, different percentage because we brought up the terminal. If we just let it run now, we're probably going to stay on this, so let's just operate the computer as normal; this runs in the background. You can see we're on the Local Llama Reddit post, and we can read a bit of it. Of course, this doesn't take screenshots every single second, so it's not going to be perfect. We can click onto this GitHub page, and hopefully we're collecting screenshots as we scroll, but again it won't catch everything. I think we'll just stop it here and take a look at the results and how we can implement this into RAG.
If we go back now to our history.txt, we can reload it and we should have some more information here. You can see: the Karpathy GPT-2 reproduction in llm.c; the user is reading a post by Andrej Karpathy on X, yes, we did that; the user is reading a post on Local Llama; reading a post on GitHub; executing a Python script, that's true. So you can see this worked pretty well. Let's take a look at the images associated with this, starting with the Karpathy GPT-2 reproduction image. We can just go here, and yes, you can see this is the image of the X post we read, so we can actually line this up with our information and find the screenshot. I think the next step now is to look at our RAG and how we can embed this and start using our recall feature.
Here we are basically using exactly the setup I had in my easy local RAG; that's open source, and you can follow the link in the description to find the code. Basically the only difference now is that instead of feeding in the vault file from the previous version, we're feeding in our history.txt, embedding that, and then we can start searching over it. So let's fire up the terminal, clear our cache first, and then embed our new history: we run the recall RAG script with its clear-cache flag to wipe the previous cache, then we generate new embeddings on history.txt and save them to the vault embeddings JSON. Now we can start asking questions about our documents, which are actually our recall information. So let's try to ask: "I read a post about GPT-2 on X, but I forgot who the author was."
Okay, so we are fetching some information about GPT-2 here from our archive: according to your archive data, you read a post by Andrej Karpathy on X (formerly Twitter) about reproducing GPT-2. Okay, that's good. Let's also try to find our related screenshot: "Do I have any PNG files about GPT-2?" You can see: according to the archive data, you have two PNG files related to GPT-2. Let's check this one, "reproducing GPT-2, 124 million, 90 minutes"; we can go to our archive, and you can see "reproducing GPT-2", but this is actually from the GitHub page, not exactly our X post. It found something, though, so let's refine the query and add X: "Do I have any PNG files about GPT-2 from X?" According to the archive data, you have one PNG file, yes, GPT-2 from X. Okay, perfect, so this is the Karpathy GPT-2 reproduction llm.c screenshot; let's see if that works. And yes, perfect: when I was a bit more specific, I could actually find the post we took the screenshot of on X. So I've got to say, I'm pretty happy with how this worked, and it is pretty much how I wanted it to work. Imagine you build up a big store here in your history with a ton of different screenshots; this could be helpful if you want to track back and see what you did yesterday and things like that. So I think it's working the way I wanted with this RAG-phase implementation.
So yeah, that is basically what I wanted to share with you today. Like I said in the video, if you want access to this, just follow the link in the description and become a member of the channel; I will probably upload the code tomorrow, and I might even do a deeper dive into exactly how this works if people want that. I found it interesting. It's a shame that I couldn't do it locally yet, and maybe some of you can; if you find a good way to do this locally, please leave a comment, because I want to see it and try it, and maybe we can do something together. Other than that, I think it worked pretty well; I'm very happy with how it turned out, it was not that hard to implement, and I think we could upgrade this in the future, which is going to be very interesting. So, should you use this? Probably not, because you're sending a lot of private information over the API. I don't think I'm going to use this actively, but it was a fun prototype to showcase what we might have running locally in the future. We just need a bit better vision model, and then I think everything should work pretty well. Other than that, thank you for tuning in; I've got some cool projects coming up, so look out for that, and don't forget to check out brilliant.org via the link in the description. Thank you for tuning in, and I'll see you again on Sunday.