“Wait, this Agent can Scrape ANYTHING?!” - Build universal web scraping agent
Summary
TLDR: This video discusses the evolution of web scraping in the age of vast internet data. It explores the challenges of extracting information from websites designed for human interaction and introduces the use of large language models and headless browsers to create universal web scrapers. It also touches on the potential of multimodal models like GPT-4V to understand and interact with web pages, and the development of tools like AgentQL for more reliable web element interaction. The presenter shares insights on building intelligent web scraping agents capable of navigating complex websites and collecting structured data.
Takeaways
- 🌐 Web browsers have been the primary mode of internet interaction since 1993, with new data and websites being created at an astonishing rate.
- 📈 By the end of 2024, it's estimated that 147 zettabytes of data will be created, with platforms like Facebook generating over 4,000 terabytes of data daily.
- 💥 There are approximately 252,000 new websites created every day, which equates to three new websites every second.
- 🤖 A significant portion of web traffic is not from human users but from bots and automated systems scraping data from websites.
- 🕸️ Web scraping involves using scripts to mimic web browsers to extract information from websites, especially when no API is available.
- 🔄 The process of web scraping can be complex due to the dynamic nature of modern websites that often load content progressively or behind paywalls.
- 🧑💻 Developers use headless browsers to simulate user interactions for web scraping, which operate in the background without a user interface.
- 📚 Large language models have the potential to revolutionize web scraping by handling unstructured data and generating structured JSON outputs regardless of website structure.
- 🎯 Multimodal models like GPT-4V are advancing to understand and interpret visual elements on web pages, aligning machine and human browsing behaviors.
- 🔗 The emergence of universal web scraping agents powered by AI could reduce the need for custom scripts for each website, offering a more streamlined approach to data extraction.
- 🚀 The development of such agents could lead to the creation of an 'API for the entire internet,' where natural language prompts can be used to extract specific data points from various online sources.
Q & A
What is the significance of the year 1993 in the context of web browsers?
-1993 is significant because it's cited as the year Netscape Navigator was released, marking the beginning of web browsers as the primary means for people to interact with the internet and access online information.
What is the estimated amount of data that will be created by the end of 2024, and how much data does Facebook produce daily?
-By the end of 2024, it's estimated that there will be 147 zettabytes of data created. Facebook alone produces more than 4,000 terabytes of data every single day.
How many new websites are created every day according to the script?
-According to the script, approximately 252,000 new websites are created every day, which translates to about three new websites per second.
What is web scraping and why is it necessary?
-Web scraping is the process where developers write scripts to mimic web browsers and make HTTP requests to URLs to extract information. It's necessary because many websites do not offer API access, and scraping allows for the extraction of structured information from various websites.
What is 'curl' and how is it used in the context of the script?
-Curl is a command-line tool for transferring data with URLs. In the script, it's used to send a request to a website and retrieve the website content in HTML format, or to download data to a local file.
Why do some websites not provide API services for data access?
-Some websites do not provide API services because the data is often a valuable asset owned by the company. They may not want to allow others to easily grab data and use it to build competing websites or services.
What challenges do developers face when scraping data from modern websites?
-Developers face challenges such as websites being designed for human consumption with graphics and animations that are not machine-friendly, data being loaded dynamically or behind paywalls, and the need to simulate human behavior to access content.
What is a headless browser and how does it assist in web scraping?
-A headless browser is a web browser that accesses web pages but doesn't have a user interface. It allows for the simulation of user interactions like typing, clicking, and scrolling, which is useful for web scraping tasks that require complex user behavior.
How do libraries like Playwright and Puppeteer help in controlling web browsers for web scraping?
-Libraries like Playwright and Puppeteer provide high-level APIs to control web browsers, allowing developers to script actions like creating new pages, navigating to URLs, and filling in values for inputs across different browsers.
What role do large language models play in the future of web scraping according to the script?
-Large language models play a significant role in the future of web scraping by handling unstructured data and extracting structured information from any website structure. They can align how machines and humans browse and consume internet data, making it possible to create a universal web scraper.
Outlines
🌐 The Evolution of Web Scraping and Data Growth
This paragraph discusses the persistent dominance of web browsers as the primary interface for internet interaction since the advent of Netscape Navigator in 1993. It highlights the exponential growth of data and websites, with predictions of a staggering 147 zettabytes of data created by the end of 2024. It also touches on the significant share of web traffic generated by bots and the practice of web scraping, which involves writing scripts that mimic web browsers for data extraction. The paragraph introduces the concept of using 'curl' for data retrieval and the limitations of websites not offering API access because the data is a valuable asset.
🤖 The Complexity of Web Scraping and the Emergence of AI
The second paragraph delves into the complexities of web scraping, noting the challenges posed by websites designed for human consumption with rich graphics and interactive elements that are not machine-friendly. It discusses the difficulties in accessing data through traditional HTTP requests due to modern websites' techniques like progressive loading and paywalls. The paragraph introduces the concept of headless browsers, which operate in the background without a user interface, and the use of libraries like Playwright and Selenium to control them. It also acknowledges the heavy operational demands of web scraping due to the unique structure of each website.
🚀 The Impact of Large Language Models on Web Scraping
This paragraph explores the transformative impact of large language models on web scraping. It emphasizes the ability of these models to handle unstructured data and extract structured information from any website, regardless of its structure. The paragraph also discusses the advancements in multimodal models like GPT-4V, which can understand visual elements and interpret actions for web tasks. It highlights the potential of these models to align machine and human browsing behaviors, leading to more effective web scraping agents.
🛠 Building a Universal Web Scraping Agent with AI
The fourth paragraph outlines the process of building a universal web scraping agent powered by AI. It discusses two primary methods: one API-based, utilizing existing scrapers and large language models to extract and summarize data, and the other browser control-based, where the agent directly controls the web browser to simulate complex user behaviors. The paragraph also addresses the challenges of data cleanliness, agent planning and memory, and the complexities of interacting with various UI elements across different websites.
🔄 Continuous Data Collection and Memory Optimization
This paragraph focuses on the continuous data collection process and the importance of memory optimization for AI agents. It describes the functions and workflow for an agent to research within a company's domain and then the internet, updating a database with found information to avoid duplication. The paragraph introduces techniques to manage the agent's memory, such as summarizing old conversations when the context window limit is reached, ensuring efficient operation.
🔍 Advanced Web Scraping with Browser Control and Pagination
The final paragraph demonstrates advanced web scraping techniques for e-commerce websites, where an agent controls a web browser to scrape product information across multiple pages with pagination. It introduces the use of AgentQL for reliable UI element interaction and shows how the agent can navigate through web pages, handle pagination, and collect comprehensive product data in a CSV file, applicable to various e-commerce platforms.
Keywords
💡Web Browser
💡Data Scraping
💡API (Application Programming Interface)
💡HTTP Request
💡Headless Browser
💡Web Traffic
💡Large Language Models (LLMs)
💡Multimodal Model
💡Web Scraping Agent
💡Progressive Loading
💡Authentication
💡E-commerce Website
💡CSV File
Highlights
Web browsers have been the primary means of internet interaction since 1993.
By the end of 2024, it's estimated that 147 zettabytes of data will be created, with 1 zettabyte equaling roughly 1 trillion gigabytes.
Facebook alone produces over 4,000 terabytes of data daily.
There are 252,000 new websites created every day, equating to three new websites every second.
A significant portion of web traffic is from bots and computers scraping data from websites.
Web scraping involves writing scripts to mimic web browsers and make HTTP requests to extract information.
curl is a command-line tool used for transferring data with URLs, which can retrieve website content.
API endpoints are provided by some web services for accessing structured data, unlike many websites that do not offer such access.
Web scraping is complex due to the varied structures and interactive elements of modern websites.
Headless browsers are used for web scraping as they can simulate user interactions without a user interface.
Libraries like Playwright and Selenium provide high-level APIs for controlling web browsers in scripts.
Web scraping is operation-heavy due to the lack of standardization across websites.
Large language models are capable of handling unstructured data and can extract key insights from large datasets.
Multimodal models like GPT-4V can understand visual elements and align machine and human browsing behaviors.
The emergence of large language models has made the creation of a universal web scraper possible.
A universal web scraper can be powered by agents that understand natural language prompts and extract data from any website.
The development of such agents opens up opportunities for building advanced web browsing agents that can complete complex tasks.
The presenter has been exploring the creation of a universal web scraping agent and will share case studies and learnings.
CleanMyMac X is recommended for optimizing Mac performance, with features like smart scan and space lens.
Transcripts
Ever since Netscape Navigator came out in 1993, the web browser has remained the default way people interact with the internet and get information, from online shopping and reading to entertainment and communication. Every year a huge amount of new data and new websites are created for many different purposes. There's an estimate that by the end of 2024, 147 zettabytes of data will have been created; one zettabyte is roughly equal to 1 trillion gigabytes, which is a huge amount of data. Facebook by itself is estimated to produce more than 4,000 terabytes of data every single day. Apart from these giant web platforms, new websites come out all the time: by one estimate there are 252,000 new websites created every day, which basically means three new websites are built every single second. So just in the past 60 seconds, while you were watching this video, more than 100 new websites went live. We all know the amount of data and traffic on the internet is growing really fast, but what you might not know is that a great amount of that web traffic is not actually humans behind a computer browsing the internet, but bots and other computers trying to get information out of different websites. Since the internet is filled with valuable information, those bots and computers search for and extract the best content out of it, a process we often refer to as scraping, where developers write a script to mimic a web browser and make a simple HTTP request to a URL to get information out.
For example, I have this personal blog website called ai-jason.com, and this personal blog has no API endpoint for you to access all the blog information. To get information from my personal blog, I can just open the terminal, run curl with my website URL and hit enter. This command will automatically send a request to my website and get the website content back in raw HTML format. If you don't know what curl is, it's a command-line tool for transferring data with URLs. I can also download the data to a local file by running curl with the website URL and writing the output to a file called ai-jason.html, which downloads the HTML file onto my local computer; if I open the folder there's one web page, and if I open it, it's basically the whole website's data downloaded. On the other hand, apart from fetching a specific web page, there are a lot of web services that provide API endpoints for you to fetch internet data, like Google search results or weather data. For example, I can use curl to call a specific weather API service, which returns clean, structured JSON data about the weather in a specific city. However, most websites and web services don't really offer such API access, because quite often that data is an asset the company owns, like news media, LinkedIn, or e-commerce websites, so they don't really want to provide an API service for people to just grab the data and build a new website. And on the other hand, most websites' traffic is so low, like your personal blog, that they just don't have the need or budget to build an API service for people to use. That's why web scraping is the main way for developers to get structured information from all sorts of different websites.
But the complexity here is that website builders are often designing websites for humans to consume, with lots of beautiful graphics, nice navigation and animation, which is great for the user experience but not really good for machines that need access to the data. For example, to provide faster response times and a better user experience, modern websites often don't load everything on the first request; they use techniques that only load content as you scroll into a specific area. And quite often information is gated behind paywalls that require a subscription or certain authentication, so you can't just make a simple HTTP request to get the content out. There are all sorts of different interactions, like pop-ups that make people do something before they can access content, and for obvious reasons some websites have CAPTCHAs to prevent that bot traffic. So to be able to scrape those websites, developers need to write scripts that actually simulate real human behavior, for example waiting for a few seconds until the information has fully loaded before starting to extract it, or simulating typing, clicking, and scrolling behavior in the browser directly.
To achieve this, developers often use headless browsers. If you've never heard of a headless browser, it is basically a type of web browser that accesses web pages but doesn't have a real user interface. You can almost think of it as Chrome or Firefox, the kind of web browser you already use, but without any UI for you to interact with; instead, all interaction is driven by a script, which can simulate user interactions like typing, clicking, scrolling, downloading data, and all sorts of other things. Essentially, these browsers operate in the background, executing web pages exactly like a standard browser would but without displaying any content on the screen, which makes them useful for various automated web tasks like testing and web scraping. Popular browsers like Chrome and Firefox already have a headless mode, but directly operating and controlling those headless browsers is not that straightforward; you need to interact with the browser DevTools protocol. Fortunately, there are libraries like Playwright, Puppeteer, and Selenium which provide high-level APIs to control the web browser, so developers can write scripts a lot more easily. You can just describe very specific actions, like create a new page, go to a specific URL, or fill in a value for an input, and the same code can work across multiple different web browsers. With those you can basically write scripts to deal with different types of complex browser behavior like progressive loading, authentication, clicking and typing, and even CAPTCHAs.
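As a rough illustration of what that scripted control looks like, here is a minimal Playwright sketch in Python; the target URL and CSS selectors are made-up placeholders, and the library must be installed separately (pip install playwright, then playwright install).

```python
# Minimal headless-browser automation sketch with Playwright (sync API).
# The URL and CSS selectors below are hypothetical placeholders.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)   # runs without a visible window
    page = browser.new_page()
    page.goto("https://example.com/search")       # navigate to a page
    page.fill("input[name='q']", "rental listings in SF")  # type into an input
    page.click("button[type='submit']")           # simulate a click
    page.wait_for_load_state("networkidle")       # wait for content to finish loading
    html = page.content()                         # grab the rendered HTML
    browser.close()

print(len(html), "characters of rendered HTML")
```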
caveat is even though you can write
script to simulate real user Behavior
Web scripting is still a really
operation heavy task and the reason is
that each website structure is widely
different the scripper you build for
Airbnb won't really work for many other
travel website where you also want to
get hotels or listing data and there's
literally no standard way to have a
script that can work across all sorts of
different websites so to do proper
scraping has been a really heavy
automation work in some companies like
travel aggregator the used to have a big
amount engineer resource are spent on
just creating and maintaining different
web scrippers but with the burst of
But with the explosion of large language models, a universal web scraper that can scrape any type of website on the internet is finally possible, because large language models enable a few things that weren't possible before. Firstly, large language models are extremely good at handling unstructured data. A lot of people already have use cases where you feed a large language model a big blog post or a one-hour interview transcript and ask it to extract key insights. In the same way, we can feed a large language model any structure of DOM elements from a website and have it extract a specific JSON format; regardless of the website's structure, the model can always look at any website and extract the same JSON output. This means developers don't need to build specific scripts to handle and parse different website structures; all you need is to come up with a good prompt and get the large language model to generate unified JSON data.
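A minimal sketch of that idea, assuming the OpenAI Python SDK (v1-style client) and a hypothetical set of schema fields; the page content here would come from whatever scraper you already have.

```python
# Sketch: turn arbitrary page content into a fixed JSON schema with an LLM.
# Model name, schema fields, and the page_text variable are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_listing(page_text: str) -> dict:
    prompt = (
        "Extract the following fields from the web page content below and "
        "return ONLY valid JSON with keys: title, price, location, num_bedrooms.\n\n"
        f"PAGE CONTENT:\n{page_text[:12000]}"  # truncate to stay within context limits
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},  # ask for a JSON object back
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

# Usage: the same function works no matter which site the raw content came from.
# listing = extract_listing(raw_html_or_markdown)
```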
Secondly, as multimodal models like GPT-4V get better and better at understanding visual elements, this is the first time we can really align how machines and humans browse and consume internet data. As I mentioned before, most website builders design websites for humans as the consumer, so there is a lot of animation and UX that is great for the human experience but not necessarily good for machines to consume, or platforms that design great UI for humans with complex UI components that are great for the user experience but a lot harder for a bot to interact with. With the emergence of multimodal models like GPT-4V, those models show a strong capability to understand web pages and interpret which actions to take to complete web tasks. In a paper published by Microsoft, they tested all sorts of different visual tasks, and part of the test was web browsing and GUI navigation, where they gave GPT-4V different screenshots of a computer and a web browser and asked it to interpret what the next action to take should be. GPT-4V showcased the capability of completing certain web tasks end-to-end, like doing a Google search for a specific recipe, browsing through multiple different blogs, diving into the specific web page that contains the full recipe, and in the end extracting unified information. So those multimodal large language models can, for the first time, really align how machines and humans browse internet data, which will probably lead to a smaller gap between the information a human would manually scrape versus what the machine returns. And then multimodal ability combined with agentic behavior can enable some really powerful and advanced web browsing agents, where the agent has direct control of a web browser and uses a multimodal model to look at screenshots or DOM elements, decide what the next best action is, and in the end simulate real user behavior like clicking and typing in the browser itself. Platforms like MultiOn or HyperWrite have already created personal assistants that can directly control your web browser to complete complex web tasks.
So I personally think this opens up a huge opportunity for people to build a universal web scraper that is powered by agents. This agent is almost like an API for the whole internet, where you can just describe in natural language the specific data points you want to extract, and the agent can scrape whatever website or data source is needed to get that information back. For the past week I've been exploring how you can build such a universal web scraping agent, and I want to share my learnings with you, as well as specific case studies of how you can build one.
But before we dive in: as an AI builder who tries loads of different models and software on my MacBook, this MacBook I bought back in 2021 started feeling really, really slow, sometimes even laggy when I do normal browser tasks, so I was almost about to buy a new MacBook. But thanks to CleanMyMac X, I didn't have to. If you're using macOS and you haven't used CleanMyMac yet, I definitely recommend it; it is the most popular and easiest-to-use Mac performance optimization app, with a super clean UX. My favorite feature is Smart Scan: you just click one button and it does cleanup, malware removal, and speedup all in one combo. It takes only about two minutes to polish up even a really slow Mac, and boom, I just removed around 6 GB of unneeded junk. They also have a feature called Space Lens that gives me a bird's-eye view of my MacBook to see which files are taking up a huge amount of storage that I never use; I can dig into each folder multiple levels deep and see what types of files are taking up a lot of storage that I can remove, and I can also review and modify which applications run automatically when I start my Mac. So CleanMyMac X can really turn my old MacBook into something that almost feels like a new Mac, and I definitely recommend giving it a try. You can click the link in the description below to try CleanMyMac X for free for 7 days, and they also offer a 20% discount code exclusively for this YouTube channel's subscribers.
Now, back to our web scraping agent. At a high level, there are two ways I explored to build such an agentic web scraper. Method one is more API-based: calling different API services that can already give data back, but using agentic behavior to scrape multiple different data sources and return the information. The second method is to give the agent direct control of a web browser and teach the agent how to use the browser to get information back, for more complicated use cases where the website has complex pagination, infinite scrolling, CAPTCHAs, or authentication. For the first one, the API-based agentic scraper, the way it works is pretty straightforward: I basically want to build an agent that has access to different types of existing scrapers to get information from websites. The agent can receive any type of research task, like getting all the listed rental properties in SF from a specific website, and the agent can just start scraping those websites, getting the raw data, but utilizing the large language model's capability to return structured information. We can also utilize a lot of different marketplaces and platforms that already offer a huge number of data sources, like the RapidAPI marketplace, Apify, or even platforms like Exa AI, which basically embed the whole internet's data and allow you to do natural language and semantic search, which sometimes returns more relevant information than Google.
The challenges I faced with this type of agentic scraper: the first is clean data. The data it returns is either raw DOM elements, which contain a huge amount of noise I don't really need, or stripped-out text where a lot of information gets removed in the process, like links, images, and files. Without clean data to start with, building such an agentic researcher is quite hard. Fortunately, there's one open-source project that came out a few weeks ago called Firecrawl; it grew really fast and has more than 2,700 stars on GitHub. They basically provide an API endpoint that can turn any website into large-language-model-ready markdown data: they remove all the noise but still keep the links and website structure, which makes the data fed into the language model much cleaner and more relevant, so the performance is way better. I'll show you how you can use this to build a really high-performance web scraping agent.
The other challenge I faced was the planning and memory of the agent. If your scraping and research task is pretty simple, like you just want to get the number of employees of a certain company, that task is normally fine. But quite often the scraping need is a bit more complicated; the use case I know is that users want to collect a list of different data points for a specific entity. For one company, they might want to know multiple things, like number of employees, case studies, whether the product is live or not, pricing, competitors, all of that from the website or the internet. Through this process, the agent has to go through many different websites and data sources, so keeping the agent aware of what information has already been collected, so it can deliver the final results effectively, is a big challenge. One solution I tried that seems to work pretty well is having a scratchpad that the agent can read and write data to: even though the agent is scraping through multiple different websites, whenever it finds a specific piece of information, like the number of employees, a competitor, or a case study, it just saves the information to a database, then dynamically updates its system prompt and inserts it as short-term memory.
The second method I explored is a browser-control-based agentic scraper, which means we basically build an agent that has direct access to a web browser so it can simulate more complex user behavior, like going through multiple pages of pagination, handling CAPTCHAs, or even logging in to get past authentication. The way it works is that you use common tools and libraries like Playwright or Puppeteer to set up a browser session for the agent. The agent can either look at a simplified DOM element tree, decide what the next actions are, and actually simulate those user interactions like typing, clicking, and scrolling in the browser directly, or you can even use a multimodal model like GPT-4V: take a screenshot of the web page, do some pre-processing like drawing bounding boxes, feed that to GPT-4V to decide on the next actions, and then simulate the user interactions.
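A rough sketch of the screenshot-driven variant, assuming the OpenAI Python SDK and the vision-capable chat model available at the time; the model name, prompt, and action vocabulary are illustrative assumptions, not the presenter's exact code.

```python
# Sketch: ask a vision model to pick the next browser action from a screenshot.
# Model name and the action vocabulary are assumptions for illustration.
import base64
from openai import OpenAI

client = OpenAI()

def next_action_from_screenshot(screenshot_path: str, goal: str) -> str:
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # vision-capable model; swap in a current one
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Goal: {goal}. Reply with ONE next action such as "
                         "CLICK <element>, TYPE <text> INTO <element>, or SCROLL DOWN."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=100,
    )
    return response.choices[0].message.content

# The returned action string would then be translated into a Playwright call.
```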
These types of agents are really powerful; when you build things right, they can get past almost any website since they're simulating real user behavior. But this type of agent is also a lot more complex to build. I'll give you one example: if you're building an API-based agent, all it needs to do is call a function called send_email and the task is done. But if you're building a browser-based web agent to complete the same task of sending an email, it needs to take many different steps just to complete that simple task: visiting gmail.com, clicking on the search bar, searching for a specific email, clicking on the right email, clicking the reply button, typing out a response, and clicking Send. So this type of agent poses a much bigger challenge in terms of planning and memory. On the other side, there are also a lot of complexities in terms of locating the right UI elements to interact with. Take a web page like LinkedIn as an example: if you ask the agent to sign in, there are literally three different sign-in buttons on the page, and each one of them represents a different action, so sometimes it's quite difficult for the agent to know which is the right UI element to choose. And if you're using DOM-based agents instead of vision-based agents, this task is even harder. For example, this is one of the video clips I showcased before in another video, where another web agent builder talks about the challenges they faced in locating specific UI elements:
"This website has given me nightmares. The issues with the HTML encoding are just endless. For example, you see this placeholder text in this input over here? It's actually not a placeholder; it's set dynamically by another element, not the search box, by another element that's somewhere else in the document. That means when you go and look at the actual element for the search box, you don't actually know what it is, so that prevents the language model from understanding where to input its search queries."
Luckily, there are teams that have been working on libraries to make this task easier. One of the libraries I found works really well is called AgentQL. They basically built a special model where you can describe the specific UI element you want to retrieve, and the model will try to identify and return that specific UI element for the agent to interact with, so you can get an agent to interact with web UI elements quite reliably across different websites using a library like this. On the left is a quick demo of how you can build a scraper that goes through the different Tesla website pages to get information and save it somewhere, and the same method can even work on mobile UIs as well: you just tell the model which UI element you want, and it will automatically return the right UI element on the screen. I also built a kind of universal scraper for e-commerce websites using AgentQL that can scrape all sorts of different e-commerce websites.
The last part of the challenge is scale. Even though we can build some quite powerful web-based agents on a local machine to complete certain tasks, when you really want to put this into a business setup you actually need to run it at scale, and the libraries we're using here, like Playwright and Puppeteer, are actually not lightweight, so running thousands of those web sessions is not that easy. But there are also teams working on this problem to let you deploy headless web browsers in the cloud very easily. One team, called Browserbase, basically allows you to run headless web browsers at very large scale, so you can build web-based agents that serve thousands or millions of customers very easily, like MultiOn does here. So there are many super talented teams around the world working on solving those individual problems that can really enable such a universal web scraper agent, and you can already build a few super powerful web agents. I'm going to show you step by step how you can build those universal web scraper agents.
First, I want to build a universal agentic scraper where the user can just give a list of companies and their websites, then give a list of different data points they want to collect about each company, and the agent will do the research and fill in all those empty cells. I want the scraper to work in a very specific way: it should research as much as possible within that company's domain, which means it can navigate to different pages of that specific company website, which we know is the higher-quality data source, but if the agent can't find a specific piece of information there, it can then go to the internet to search.
To do that, I'll open Visual Studio Code. First, let's create a .env file; this is where we're going to store all the credentials. In our specific case we need a Firecrawl API key as well as an OpenAI API key. Next I'll create a new file called app.py. In here we first import a few different libraries and load the environment variables, then we define the OpenAI client, and the GPT model we're going to use is GPT-4 Turbo. First, let's create a few functions the agent is going to use. One is a scrape function, and I'm going to use the hosted Firecrawl service; they also have an open-source version, so if you want to host it yourself, feel free to do that. It's pretty straightforward: I just define the Firecrawl app, try to scrape the URL, and return only the markdown information.
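Here is a minimal sketch of what that scrape function could look like, assuming the firecrawl-py SDK; the exact method names and response shape vary between SDK versions, so treat this as an approximation rather than the presenter's exact code.

```python
# Sketch of a scrape tool backed by Firecrawl (assumes `pip install firecrawl-py`).
# Method names and response keys may differ across Firecrawl SDK versions.
import os
from firecrawl import FirecrawlApp

firecrawl = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

def scrape(url: str) -> str:
    """Return LLM-ready markdown for a single page, or an error string on failure."""
    try:
        result = firecrawl.scrape_url(url)
        # The hosted API returns the cleaned page as markdown alongside metadata.
        return result.get("markdown", "")
    except Exception as e:
        return f"Failed to scrape {url}: {e}"
```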
I also create a function called search. This is another service Firecrawl has that allows you to search across the internet and return clean data. The agent is going to use this search function if it can't find the information on the company's own domain, but I want to implement it in a way where the agent always saves whatever data points it finds into a database and uses that as additional context to know what information has been scraped already. The search endpoint cleans up the data and returns the most relevant search results, but the content returned can still be quite big. That's why the search function actually has two parts: on one hand it calls the API endpoint and gets the search results, but in the second part it feeds those results into a large language model to do a summary and extract only the most relevant information. So I call the search API endpoint to get the results, then I get the data keys, which is the list of column names the user defined, put them together as a prompt to feed the large language model, and run a chat completion to get the most relevant result back. This search function should return us only the most relevant information.
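A sketch of that two-part search function, again assuming the firecrawl-py SDK and the OpenAI client defined in the earlier sketch; `data_keys` stands in for the user-defined column names, and everything here is illustrative rather than the exact script from the video.

```python
# Sketch: Firecrawl web search followed by an LLM summary focused on the data keys.
# SDK method names and response shapes are assumptions; adapt to your versions.
def search(query: str, data_keys: list[str]) -> str:
    raw_results = firecrawl.search(query)  # list of result entries (title/url/content)

    prompt = (
        f"You are researching the following fields: {', '.join(data_keys)}.\n"
        "From the search results below, extract only information relevant to those "
        "fields and cite the source URL for each fact.\n\n"
        f"SEARCH RESULTS:\n{str(raw_results)[:12000]}"
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```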
Third, I'll also add a function called update_data. This is a function that allows the agent to continuously read and write information to a database, so that it stays aware of the information already found as well as the data sources already visited. This function basically takes some data points to update and saves them to a global state. These three are the main functions the agent will need access to.
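A minimal sketch of that scratchpad, using an in-memory dict as the "database" for simplicity; the field names and structure are assumptions to illustrate the read-and-write pattern described above.

```python
# Sketch: a global scratchpad the agent reads from and writes to between tool calls.
# A real implementation might persist this to a file or database instead of a dict.
data_points = {"catering_offering": None, "num_employees": None, "office_locations": None}
links_scraped: list[str] = []

def update_data(updates: dict) -> str:
    """Merge newly found values into the global state and report what is still missing."""
    for key, value in updates.items():
        if key in data_points and value:
            data_points[key] = value
    missing = [k for k, v in data_points.items() if v is None]
    return f"Saved. Still missing: {missing or 'nothing'}"

def build_system_prompt(entity: str) -> str:
    """Rebuild the system prompt so the agent's short-term memory reflects current state."""
    return (
        f"You are researching {entity}. Known so far: {data_points}. "
        f"Links already scraped: {links_scraped}. Only research missing fields."
    )
```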
Next, we just need to create a function-calling agent from scratch. I'll have some helper functions: one is a chat completion request, then another function to log all the thinking and results from the agent, and I'll create the tool list. I also create a function called memory_optimize to solve one challenge, which is that the large language model has a limited context window, so we don't want to exceed the token limit. I did a pretty simple implementation here: if there are more than 24 back-and-forth messages, or the agent's chat history is over 10,000 tokens, it automatically keeps the last 12 messages, does a summary of the older conversation using a smaller model like GPT-3.5 Turbo, and puts together a new chat history.
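A sketch of that memory-optimization step, with the thresholds taken from the description above; the token counting and summarization prompt are simplified assumptions.

```python
# Sketch: compress old conversation turns when the history gets too long.
# Thresholds (24 messages / 10,000 tokens / keep last 12) follow the video's description.
def memory_optimize(messages: list) -> list:
    approx_tokens = sum(len(str(m)) for m in messages) // 4  # rough token estimate
    if len(messages) <= 24 and approx_tokens <= 10_000:
        return messages                      # nothing to compress yet
    if len(messages) < 14:
        return messages                      # not enough history to compress safely

    system, old, recent = messages[0], messages[1:-12], messages[-12:]
    summary = client.chat.completions.create(
        model="gpt-3.5-turbo",  # cheaper model just for summarizing old context
        messages=[{"role": "user",
                   "content": "Summarize this conversation, keeping all facts found so far:\n"
                              + str(old)}],
    ).choices[0].message.content

    return [system,
            {"role": "assistant", "content": f"Summary of earlier work: {summary}"}] + recent
```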
I have one more helper function called call_agent, where we can basically decide whether we want the agent to make a plan before it starts acting. If plan is true, we first get the agent to think step by step and make a plan, and then continue the conversation; if not, we just get the agent working directly without a plan. Then we write a loop to get the agent to execute until the large language model returns a finish reason of "stop", and we run this memory optimization after every single step.
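A sketch of that agent loop using the OpenAI tools/function-calling interface; the tool schemas and dispatch table are illustrative assumptions, and `client`, `scrape`, `update_data`, and `memory_optimize` come from the earlier sketches (the search tool would be registered the same way for the internet-search stage).

```python
# Sketch: function-calling loop that runs until the model returns finish_reason == "stop".
import json

tools = [
    {"type": "function", "function": {
        "name": "scrape",
        "description": "Scrape a URL and return LLM-ready markdown.",
        "parameters": {"type": "object",
                       "properties": {"url": {"type": "string"}},
                       "required": ["url"]}}},
    {"type": "function", "function": {
        "name": "update_data",
        "description": "Save newly found data points to the scratchpad.",
        "parameters": {"type": "object",
                       "properties": {"updates": {"type": "object"}},
                       "required": ["updates"]}}},
]
available_functions = {"scrape": scrape, "update_data": update_data}

def call_agent(messages: list, plan: bool = True) -> list:
    if plan:
        messages.append({"role": "user",
                         "content": "Think step by step and write a short plan before acting."})
    while True:
        response = client.chat.completions.create(
            model="gpt-4-turbo", messages=messages, tools=tools, tool_choice="auto")
        choice = response.choices[0]
        messages.append(choice.message)
        if choice.finish_reason == "stop":
            return messages                     # the model says it is done
        for tool_call in choice.message.tool_calls or []:
            args = json.loads(tool_call.function.arguments)
            result = available_functions[tool_call.function.name](**args)
            messages.append({"role": "tool", "tool_call_id": tool_call.id,
                             "content": str(result)})
        messages = memory_optimize(messages)    # compress history after every step
```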
And that's pretty much it. All we need to do now is connect everything together. As I mentioned before, I want this agent to follow a very specific procedure: do as much research as possible within the company domain, and if nothing can be found on the company domain, then go search the internet. That's why I break the workflow down into two stages. Stage one runs the agent to do a website search, where the agent has access to two functions, one for scraping and one for update_data; the agent starts researching the data points that are not yet known within the company website. The second stage is internet search, where the agent gets the search function as well as update_data again, and starts researching, but on external internet sources. In the instructions I include the entity we want to research, the links we've already scraped, and the data points we haven't found yet. Now we can start triggering this agent: I create global state for the links scraped and the data points we want the agent to collect. In my specific case, I want to find a list of companies that offer catering to their employees, and to understand the number of employees as well as office locations, so if I'm Uber Eats for Business, I can tell whether a company is a good candidate or not. The entity name is Discord, along with its website. I'll run these two stages and, at the end, return the data points.
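Putting the pieces together, the two-stage workflow could be wired up roughly like this; the prompts and the way the stages hand off are assumptions based on the description above, not the exact script from the video.

```python
# Sketch: run the agent loop twice, first restricted to the company's own site,
# then (if fields are still missing) against the wider internet.
# Note: in the video each stage exposes a different tool list (scrape vs. search);
# the call_agent sketch above uses one shared tool list for brevity.
def research_company(entity: str, website: str) -> dict:
    # Stage 1: company-domain research with scrape + update_data.
    stage1 = [{"role": "system", "content": build_system_prompt(entity)},
              {"role": "user",
               "content": f"Research {entity} starting from {website}. "
                          "Only use pages on that domain."}]
    call_agent(stage1, plan=True)

    # Stage 2: internet search for whatever is still missing.
    if any(v is None for v in data_points.values()):
        stage2 = [{"role": "system", "content": build_system_prompt(entity)},
                  {"role": "user",
                   "content": f"Search the internet for the missing fields about {entity}."}]
        call_agent(stage2, plan=True)

    return data_points

# Example run, matching the video's Discord example:
# print(research_company("Discord", "https://discord.com"))
```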
Now I can open the terminal and run python app.py. First, the agent makes a plan about which data sources to research. It first scrapes the Discord page, which returns a bunch of sub-pages, then starts researching the careers page, I believe, but it didn't look like Discord's official website provided any useful information, so it moved to the second stage of internet search and started calling the search function. This time it found some information putting the number of employees at 750, from a data source I can click on (this is from the Zippia website), and then it updated its data knowledge about the number of employees and office locations. After that, it did a second search and found some catering-related information: Discord provides its employees with special perks like a work-setup fund as well as online events like cooking classes. In the end it returned the data points, the catering offering for employees, the number of employees, and the office locations, and each data point has a reference link to back it up. So this is an example of how you can create a super smart universal scraper that can go to the internet as well as the specific website it was given to find information. This works for general websites that don't have content gating and where the information doesn't require complex UI interaction like pagination. But if you want to scrape, say, information from an e-commerce website where you want to apply a specific search query and also need to work with pagination, then we need the agent to directly control the web browser, and this is the second type of agentic web scraper. I'm going to use a library called AgentQL to set up a universal e-commerce website scraper. What I want to achieve is an agent where I can give it a URL and it will be able to scrape all the product information and also click through every single page of the pagination until all the information has been collected.
To do this, I'll first go to the .env file and put in the AgentQL API key here as well. Then I'll create a new file called browser_agent.py and load a few different libraries. I'll first define a query called PRODUCTS; this is the query we give AgentQL to return the information we want, and for each product I want the product name, price, number of reviews, and rating. At the top level I also want to get the next page button, the next button in the pagination. Here you'll notice that I try to get two elements, one being next_page_button_enabled and another being next_page_button_disabled. The reason I want both is that even when we reach the last page of the pagination there will still be a next button, but it will be in a disabled state, so I can write a while loop to continuously go to the next page until next_page_button_disabled is true.
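The query itself could look roughly like this (AgentQL queries are written as nested field descriptions, here kept inside a Python string); the exact field names are whatever you choose, and AgentQL's model maps them to elements on the page.

```python
# Sketch of the AgentQL query described above: product fields plus both states
# of the pagination button. Field names are free-form descriptions, not CSS selectors.
PRODUCTS_QUERY = """
{
    products[] {
        product_name
        price
        number_of_reviews
        rating
    }
    next_page_button_enabled
    next_page_button_disabled
}
"""
```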
That's pretty much it. We can first start an AgentQL session by calling AgentQL's start_session and giving it the URL, and I can also control it to scroll to the bottom, because for some e-commerce websites the pagination only loads after you scroll to the bottom. Then I create a CSV file (products.csv) opened in append mode so it keeps adding new data to the file, and I create four columns: product name, product price, number of reviews, and rating. While the next-page button enabled element is found and the next-page button disabled element is None, I run another AgentQL query to get the list of products, and for each product returned I add a row to the CSV file. At the end of each iteration I click on the next page button, scroll to the bottom again, and fetch the next page button again, repeating this process until it reaches the last page.
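A sketch of that loop; the AgentQL Python API has changed across releases (older versions exposed a start_session helper, newer ones wrap a Playwright page), so the session, query, and element calls below should be read as pseudocode for whichever SDK version you have, and the URL and CSV filename are just examples.

```python
# Sketch: paginate through an e-commerce results page and append products to a CSV.
# AgentQL call names (start_session / query / scroll) are approximations of the SDK.
import csv
import agentql

URL = "https://www.amazon.com/s?k=coffee+machine"  # example search-results URL

session = agentql.start_session(URL)          # open a browser session on the page
session.driver.scroll_to_bottom()             # some sites only render pagination after scrolling

with open("products.csv", "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["product_name", "price", "number_of_reviews", "rating"])

    while True:
        page = session.query(PRODUCTS_QUERY)  # AgentQL resolves the query to page elements
        for item in page.products:
            # Extract visible text from each returned element (method name varies by SDK,
            # e.g. .inner_text() for Playwright-backed elements).
            writer.writerow([item.product_name.inner_text(), item.price.inner_text(),
                             item.number_of_reviews.inner_text(), item.rating.inner_text()])
        if page.next_page_button_disabled is not None:
            break                             # a disabled "next" button means the last page
        page.next_page_button_enabled.click() # go to the next page and repeat
        session.driver.scroll_to_bottom()

session.stop()
```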
That's pretty much it, so let's try it out. I'll open the terminal and run python browser_agent.py. It opens the browser and automatically scrolls to the end of the web page, where the pagination gets loaded, and if I open the terminal on the right side, you'll see it found the enabled next-page button pointing to page two while the disabled button is None. It scrapes the first page's product information, clicks the next-page button, goes to the next page, and starts scraping the second page's product information, and it also finds the next-page button pointing to page three, so it will repeat this process until it hits the last page. Once it's finished, all the product data is saved to the CSV file, and if I open that CSV file you can see it automatically returned all the product information across multiple pages. What's really cool about this is that this scraper works not only for Amazon but for almost any e-commerce website. For example, I can go to eBay, just copy an eBay search results URL into the agent, and run the exact same script again; this time it opens eBay, scrolls to the bottom to get the pagination buttons, and if I open the terminal on the right side you'll see it again finds the next-page button, returns that page's product information, fetches the next-page button again, and repeats this process over and over until it gets all the information it needs. So that's another quick example of how you can create a browser-based agent that can scrape specific websites. As you can see, there's a huge number of possibilities for the scrapers we can create, and I'm really interested to see what type of scraper you'll start building. This is one area I'm personally really passionate about, and I do plan to launch a universal agent scraper that lets people scrape any type of information on the internet. If you're interested in that universal agent scraper, I've put a link in the description below so you can sign up; once I finish that scraper agent I will let you know. Please comment below with any questions you have. Thank you, and I'll see you next time.