“Wait, this Agent can Scrape ANYTHING?!” - Build universal web scraping agent

AI Jason
16 May 2024 · 29:11

Summary

TL;DR: This video discusses the evolution of web scraping in the age of vast internet data. It explores the challenges of extracting information from websites designed for human interaction and introduces the use of large language models and headless browsers to create universal web scrapers. The script also touches on the potential of multimodal models like GPT-4V to understand and interact with web pages, and the development of tools like AgentQL for more reliable web element interaction. The presenter shares insights on building intelligent web scraping agents capable of navigating complex websites and collecting structured data.

Takeaways

  • 🌐 Web browsers have been the primary mode of internet interaction since 1993, with new data and websites being created at an astonishing rate.
  • 📈 By the end of 2024, it's estimated that 147 zettabytes of data will be created, with platforms like Facebook generating over 4,000 terabytes of data daily.
  • 💥 There are approximately 252,000 new websites created every day, which equates to three new websites every second.
  • 🤖 A significant portion of web traffic is not from human users but from bots and automated systems scraping data from websites.
  • 🕸️ Web scraping involves using scripts to mimic web browsers to extract information from websites, especially when no API is available.
  • 🔄 The process of web scraping can be complex due to the dynamic nature of modern websites that often load content progressively or behind paywalls.
  • 🧑‍💻 Developers use headless browsers to simulate user interactions for web scraping, which operate in the background without a user interface.
  • 📚 Large language models have the potential to revolutionize web scraping by handling unstructured data and generating structured JSON outputs regardless of website structure.
  • 🎯 Multimodal models like GPT-4V are advancing to understand and interpret visual elements on web pages, aligning machine and human browsing behaviors.
  • 🔗 The emergence of universal web scraping agents powered by AI could reduce the need for custom scripts for each website, offering a more streamlined approach to data extraction.
  • 🚀 The development of such agents could lead to the creation of an 'API for the entire internet,' where natural language prompts can be used to extract specific data points from various online sources.

Q & A

  • What is the significance of the year 1993 in the context of web browsers?

    -1993 is significant because the video cites it as the year Netscape Navigator came out, marking the beginning of web browsers as the primary means for people to interact with the internet and access online information.

  • What is the estimated amount of data that will be created by the end of 2024, and how much data does Facebook produce daily?

    -By the end of 2024, it's estimated that there will be 147 zettabytes of data created. Facebook alone produces more than 4,000 terabytes of data every single day.

  • How many new websites are created every day according to the script?

    -According to the script, approximately 252,000 new websites are created every day, which translates to about three new websites per second.

  • What is web scraping and why is it necessary?

    -Web scraping is the process where developers write scripts to mimic web browsers and make HTTP requests to URLs to extract information. It's necessary because many websites do not offer API access, and scraping allows for the extraction of structured information from various websites.

  • What is 'curl' and how is it used in the context of the script?

    -Curl is a command-line tool for transferring data with URLs. In the script, it's used to send a request to a website and retrieve the website content in HTML format, or to download data to a local file.
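    For readers who prefer Python over the command line, a rough equivalent of the curl calls described above might look like the sketch below; the blog URL is the one mentioned in the video, while the weather endpoint is a made-up placeholder.

```python
import requests

# Fetch a page the same way the video's curl example does: a plain HTTP GET
# that returns the raw HTML of the blog (no API involved).
html = requests.get("https://www.ai-jason.com", timeout=30).text

# Save the HTML to a local file, mirroring `curl <url> -o ai-jason.html`.
with open("ai-jason.html", "w", encoding="utf-8") as f:
    f.write(html)

# By contrast, a service that exposes an API returns structured JSON directly.
# The endpoint below is a hypothetical placeholder; substitute a real weather API.
weather = requests.get(
    "https://api.example-weather.com/v1/current",
    params={"city": "San Francisco"},
    timeout=30,
)
print(weather.json())
```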

  • Why do some websites not provide API services for data access?

    -Some websites do not provide API services because the data is often a valuable asset owned by the company. They may not want to allow others to easily grab data and use it to build competing websites or services.

  • What challenges do developers face when scraping data from modern websites?

    -Developers face challenges such as websites being designed for human consumption with graphics and animations that are not machine-friendly, data being loaded dynamically or behind paywalls, and the need to simulate human behavior to access content.

  • What is a headless browser and how does it assist in web scraping?

    -A headless browser is a web browser that accesses web pages but doesn't have a user interface. It allows for the simulation of user interactions like typing, clicking, and scrolling, which is useful for web scraping tasks that require complex user behavior.

  • How do libraries like Playwright and Puppeteer help in controlling web browsers for web scraping?

    -Libraries like Playwright and Puppeteer provide high-level APIs to control web browsers, allowing developers to script actions like creating new pages, navigating to URLs, and filling in values for inputs across different browsers.
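    As a rough illustration of the kind of high-level control these libraries provide, here is a minimal Playwright sketch in Python; the URL and selectors are placeholders, not taken from the video.

```python
from playwright.sync_api import sync_playwright

# A minimal sketch of scripted browser control using Playwright's sync API.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)  # runs with no visible window
    page = browser.new_page()                   # "create a new page"
    page.goto("https://example.com/search")     # "navigate to a URL"
    page.fill("input[name='q']", "rental listings in SF")  # "fill in a value"
    page.click("button[type='submit']")         # simulate a click
    page.wait_for_load_state("networkidle")     # wait for dynamic content
    print(page.content()[:500])                 # grab the rendered HTML
    browser.close()
```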

  • What role do large language models play in the future of web scraping according to the script?

    -Large language models play a significant role in the future of web scraping by handling unstructured data and extracting structured information from any website structure. They can align how machines and humans browse and consume internet data, making it possible to create a universal web scraper.
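    A minimal sketch of that idea using the OpenAI Python SDK: any page content goes in, one fixed JSON shape comes out. The prompt wording and field names are illustrative, not the video's exact prompt.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_listing(page_markdown: str) -> dict:
    """Turn arbitrary page content into one fixed JSON shape."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system",
             "content": "Extract data from the page and reply with JSON only, "
                        'using keys: "name", "price", "rating", "num_reviews".'},
            {"role": "user", "content": page_markdown},
        ],
    )
    return json.loads(response.choices[0].message.content)
```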

Outlines

00:00

🌐 The Evolution of Web Scraping and Data Growth

This paragraph discusses the persistent dominance of web browsers as the primary interface for internet interaction since the advent of Netscape Navigator in 1993. It highlights the exponential growth of data and websites, with predictions of a staggering 147 zettabytes of data creation by the end of 2024. The script also touches on the significant amount of web traffic generated by bots and the practice of web scraping, which involves writing scripts to mimic web browsers for data extraction. The paragraph introduces the concept of using 'curl' for data retrieval and the limitations of websites not offering API access due to data being a valuable asset.

05:01

🤖 The Complexity of Web Scraping and the Emergence of AI

The second paragraph delves into the complexities of web scraping, noting the challenges posed by websites designed for human consumption with rich graphics and interactive elements that are not machine-friendly. It discusses the difficulties in accessing data through traditional HTTP requests due to modern websites' techniques like progressive loading and paywalls. The paragraph introduces the concept of headless browsers, which operate in the background without a user interface, and the use of libraries like Playwright and Selenium to control them. It also acknowledges the heavy operational demands of web scraping due to the unique structure of each website.

10:02

🚀 The Impact of Large Language Models on Web Scraping

This paragraph explores the transformative impact of large language models on web scraping. It emphasizes the ability of these models to handle unstructured data and extract structured information from any website, regardless of its structure. The paragraph also discusses the advancements in multimodal models like GPT-4V, which can understand visual elements and interpret actions for web tasks. It highlights the potential of these models to align machine and human browsing behaviors, leading to more effective web scraping agents.

15:04

🛠 Building a Universal Web Scraping Agent with AI

The fourth paragraph outlines the process of building a universal web scraping agent powered by AI. It discusses two primary methods: one API-based, utilizing existing scrapers and large language models to extract and summarize data, and the other browser control-based, where the agent directly controls the web browser to simulate complex user behaviors. The paragraph also addresses the challenges of data cleanliness, agent planning and memory, and the complexities of interacting with various UI elements across different websites.

20:05

🔄 Continuous Data Collection and Memory Optimization

This paragraph focuses on the continuous data collection process and the importance of memory optimization for AI agents. It describes the functions and workflow for an agent to research within a company's domain and then the internet, updating a database with found information to avoid duplication. The paragraph introduces techniques to manage the agent's memory, such as summarizing old conversations when the context window limit is reached, ensuring efficient operation.

25:06

🔍 Advanced Web Scraping with Browser Control and Pagination

The final paragraph demonstrates advanced web scraping techniques for e-commerce websites, where an agent controls a web browser to scrape product information across multiple pages with pagination. It introduces the use of AgentQL for reliable UI element interaction and shows how the agent can navigate through web pages, handle pagination, and collect comprehensive product data in a CSV file, applicable to various e-commerce platforms.

Keywords

💡Web Browser

A web browser is a software application for accessing, retrieving, and displaying content over the World Wide Web. It is central to the video's theme as it discusses the evolution of internet interaction since the advent of Netscape Navigator in 1993. The script mentions that web browsers remain the primary means for users to shop, be entertained, and communicate online.

💡Data Scraping

Data scraping, also known as web scraping, is the process of programmatically extracting information from websites. The video discusses this as a method used by bots to extract valuable information from the internet. An example provided in the script is the use of 'curl' to retrieve the HTML content of a blog without an API.

💡API (Application Programming Interface)

An API is a set of rules and protocols for building software applications, which allows different software systems to communicate with each other. The script explains that many websites do not offer APIs because the data they control is often a valuable asset. However, some web services provide APIs for data such as Google Maps or weather information.

💡HTTP Request

An HTTP request is a message sent from a client to a server to request access to a resource. In the context of the video, it is mentioned in relation to data scraping, where a simple HTTP request is made to retrieve information from a URL, as demonstrated by the 'curl' command in the script.

💡Headless Browser

A headless browser is a web browser without a graphical user interface, designed to perform web page rendering and interaction via script commands. The video describes headless browsers as tools for developers to simulate user interactions like clicking and scrolling for web scraping tasks, operating in the background without displaying content on the screen.

💡Web Traffic

Web traffic refers to the amount of data transferred from web servers to a user's web browser. The script notes that a significant portion of web traffic is not from human users but from bots and computers attempting to gather information from various websites, highlighting the scale of data generation and the role of automated processes in web interaction.

💡Large Language Models (LLMs)

Large Language Models, such as GPT (Generative Pre-trained Transformer), are AI systems designed to understand and generate human-like text based on input data. The video discusses how LLMs can be used for tasks like summarizing content or extracting insights from unstructured data, which is transformative for web scraping and data extraction.

💡Multimodal Model

A multimodal model is capable of processing and understanding information from multiple types of data, such as text, images, and sound. The script mentions GPT-4V, a multimodal model, which can interpret visual elements and align machine browsing with human browsing behavior, enhancing the ability to navigate and interact with web pages programmatically.

💡Web Scraping Agent

A web scraping agent is a software entity that performs automated data extraction from websites. The video outlines the creation of a universal web scraping agent powered by AI, which can navigate and scrape data from any website using natural language prompts, offering a new approach to data collection on the internet.

💡Progressive Loading

Progressive loading is a web development technique where content is loaded in stages, often as the user scrolls down the page. The script discusses this as a challenge for web scraping, as it may require scripts to wait for content to load or to mimic user scrolling behavior to access all available data.
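A hedged sketch of how a script might handle progressive loading, assuming Playwright is used; the item selector is a placeholder.

```python
from playwright.sync_api import Page

def load_all_items(page: Page, item_selector: str = ".listing-card") -> int:
    """Keep scrolling until no new lazily loaded items appear (placeholder selector)."""
    previous = -1
    while True:
        count = page.locator(item_selector).count()
        if count == previous:        # nothing new appeared after the last scroll
            return count
        previous = count
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(1500)  # give the site time to fetch the next batch
```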

💡Authentication

Authentication is the process of verifying the identity of a user or device. In the context of web scraping, the script mentions authentication as a hurdle, as some information is gated behind login requirements or paywalls, necessitating the development of scripts that can simulate the login process to access content.
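A minimal sketch of simulating a login step with Playwright, assuming the page uses a simple email/password form; the URL and selectors are placeholders.

```python
from playwright.sync_api import Page

def login(page: Page, email: str, password: str) -> None:
    """Script past a login wall by filling a hypothetical email/password form."""
    page.goto("https://example.com/login")
    page.fill("input[type='email']", email)
    page.fill("input[type='password']", password)
    page.click("button[type='submit']")
    page.wait_for_load_state("networkidle")  # wait for the authenticated page
```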

💡E-commerce Website

An e-commerce website is a platform designed for online buying and selling of goods or services. The script provides an example of creating a universal scraper for e-commerce websites, which can navigate through product listings, handle pagination, and collect data such as product names, prices, and reviews.

💡CSV File

A CSV (Comma-Separated Values) file is a simple file format used to store tabular data, such as a spreadsheet or database. The script describes the use of a CSV file to store and organize the scraped data from e-commerce websites, allowing for easy access and analysis of the collected information.

Highlights

Web browsers have been the primary means of internet interaction since 1993.

By the end of 2024, it's estimated that 147 zettabytes of data will be created, with 1 zettabyte equaling roughly 1 trillion gigabytes.

Facebook alone produces over 4,000 terabytes of data daily.

There are 252,000 new websites created every day, equating to three new websites every second.

A significant portion of web traffic is from bots and computers scraping data from websites.

Web scraping involves writing scripts to mimic web browsers and make HTTP requests to extract information.

curl is a command-line tool used for transferring data with URLs, which can retrieve website content.

API endpoints are provided by some web services for accessing structured data, unlike many websites that do not offer such access.

Web scraping is complex due to the varied structures and interactive elements of modern websites.

Headless browsers are used for web scraping as they can simulate user interactions without a user interface.

Libraries like Playwright and Selenium provide high-level APIs for controlling web browsers in scripts.

Web scraping is operation-heavy due to the lack of standardization across websites.

Large language models are capable of handling unstructured data and can extract key insights from large datasets.

Multimodal models like GPT-4V can understand visual elements and align machine and human browsing behaviors.

The emergence of large language models has made the creation of a universal web scraper possible.

A universal web scraper can be powered by agents that understand natural language prompts and extract data from any website.

The development of such agents opens up opportunities for building advanced web browsing agents that can complete complex tasks.

The presenter has been exploring the creation of a universal web scraping agent and will share case studies and learnings.

CleanMyMac X is recommended for optimizing Mac performance, with features like Smart Scan and Space Lens.

Transcripts

[00:00] Ever since Netscape Navigator came out in 1993, the web browser has remained the default way people interact with the internet and get information online: shopping, reading, entertainment, and communication. Every year a huge amount of new data and new websites are created for many different purposes. One estimate says that by the end of 2024, 147 zettabytes of data will have been created; one zettabyte is roughly 1 trillion gigabytes, which is a huge amount of data. Facebook alone is estimated to produce more than 4,000 terabytes of data every single day. Apart from these giant web platforms, new websites come out all the time: around 252,000 new websites are created every day, which means roughly three new websites are built every single second. So just in the past 60 seconds, while you were watching this video, more than 100 new websites went live.

[00:52] We all know the amount of data and traffic on the internet is growing really fast, but what you might not know is that a great amount of that web traffic isn't humans behind a computer browsing the internet; it's bots and other computers trying to get information out of different websites. Since the internet is filled with valuable information, those bots and computers search for and extract the best content out of it, a process we often refer to as scraping: developers write a script that mimics a web browser and makes a simple HTTP request to a URL to get information out.

[01:25] For example, I have a personal blog called ai-jason.com, and this blog has no API endpoint for you to access the blog content. To get information from my personal blog, I can just open a terminal, run curl with my website's URL, and hit enter; the command sends a request to my website and gets the content back as raw HTML. If you don't know what curl is, it's a command-line tool for transferring data with URLs. I can also download the data to a local file by running curl with the website URL and writing the output to a file like ai-jason.html, which downloads the HTML file to my local computer; if I open that file, it's basically the whole page's data.

[02:08] On the other hand, apart from calling a specific web page, there are a lot of web services that provide API endpoints for fetching internet data, like Google results or weather data. For example, I can curl a specific weather API service, and it returns clean, structured JSON data about the weather in a specific city. However, most websites and web services don't offer that kind of API access, because quite often the data is an asset the company owns (think news media, LinkedIn, or e-commerce sites), so they don't want to provide an API that lets people just grab the data and build a competing website. And on the other hand, most websites' traffic is so low (like your personal blog) that they simply don't have the need or the budget to build an API service. That's why web scraping is the main way for developers to get structured information from all sorts of different websites.

[03:02] But the complexity here is that website builders usually design websites for humans to consume, with lots of beautiful graphics, nice navigation, and animation. That's great for user experience but not great for machines that need access to the data. For example, to provide faster response times and a better user experience, modern websites often don't load everything on the first request; they use techniques that only load content as you scroll into a specific area. Quite often information is gated behind a paywall that requires a subscription or some form of authentication, so you can't just make a simple HTTP request to get the content out. There are all sorts of interactions, like pop-ups asking people to do something before they can access the content, and, for obvious reasons, some websites use CAPTCHAs to block bot traffic. So to scrape those websites, developers need to write scripts that simulate real human behavior: waiting a few seconds for the information to fully load before extracting it, or simulating typing, clicking, and scrolling directly in the browser.

[04:05] To achieve this, developers often use headless browsers. If you've never heard of a headless browser, it's basically a type of web browser that accesses web pages but has no real user interface. Think of Chrome or Firefox, the browsers you already use, but without any controls for you to interact with; all the interaction is driven by a script, which can simulate user actions like typing, clicking, scrolling, downloading data, and so on. Essentially, these browsers operate in the background, executing web pages exactly like a standard browser would, but without displaying anything on screen. This makes them useful for all kinds of automated web tasks like testing and web scraping, and popular browsers like Chrome and Firefox already have a headless mode. But directly operating and controlling a headless browser is not that straightforward; you have to interact with the browser's DevTools interface. Fortunately there are libraries like Playwright, Puppeteer, and Selenium which provide high-level APIs to control the web browser, so developers can write scripts much more easily. You can give very specific actions, like "create a new page", "go to this URL", "fill in a value for this input", and that code works across multiple different browsers, so you can write scripts to deal with all kinds of complex browser behavior: progressive loading, authentication, clicking and typing, and even CAPTCHAs.

[05:27] One caveat: even though you can write scripts to simulate real user behavior, web scraping is still a really operations-heavy task, because every website's structure is wildly different. The scraper you build for Airbnb won't really work for other travel websites where you also want hotel or listing data, and there's literally no standard way to write one script that works across all sorts of different websites. So doing proper scraping has been heavy automation work; companies like travel aggregators used to spend a large amount of engineering resources just creating and maintaining different web scrapers.

[06:06] But with the burst of large language models, a universal web scraper that can scrape any type of website on the internet is finally possible, because large language models enable a few things that weren't possible before. First, large language models are extremely good at handling unstructured data. A lot of people already have use cases where you feed a model a big blog post or a one-hour interview transcript and ask it to extract key insights. In the same way, we can feed a large language model the DOM elements of any website, in any structure, and extract a specific JSON format. Regardless of the website's structure, the model can always look at the page and extract the same JSON output. That means developers don't need to build specific scripts to handle and parse each website's structure; all you need is a good prompt to get the large language model to generate unified JSON data.

[06:58] Second, as multimodal models like GPT-4V get better and better at understanding visual elements, this is the first time we can really align how machines and humans browse and consume internet data. As I mentioned, most website builders design websites for human consumers, so there are lots of animations and UX patterns that are great for the human experience but not necessarily good for machines, and platforms design great UIs for humans with complex components that are nice for user experience but much harder for a bot to interact with. Multimodal models like GPT-4V show strong capability to understand a web page and interpret which actions to take to complete web tasks. In a paper published by Microsoft, where they tested all sorts of different visual tasks, part of the test was web browsing and GUI navigation: they gave GPT-4V screenshots of a computer and a web browser and asked it to interpret the next action to take, and GPT-4V showed it could complete certain web tasks end to end, like running a Google search for a specific recipe, browsing through multiple different blogs, diving into the specific page that contains the full recipe, and finally extracting unified information. So multimodal large language models can, for the first time, really align how machines and humans browse internet data, which should shrink the gap between the information a human would collect manually and what the machine returns.

[08:24] Multimodal ability combined with agentic behavior enables some really powerful, advanced web-browsing agents, where the agent has direct control of a web browser and uses a multimodal model to look at screenshots or DOM elements, decide the next best action, and then simulate real user behavior like clicking and typing in the browser itself. Platforms like MultiOn and HyperWrite have already created personal assistants that can directly control your web browser to complete complex web tasks. I personally think this opens up a huge opportunity to build a universal web scraper powered by agents. Such an agent is almost like an API for the whole internet: you describe in natural language the specific data points you want to extract, and the agent scrapes whatever website or data source it needs to get the information back. So for the past week I've been exploring how you can build such a universal web scraping agent, and I want to share my learnings with you, as well as a specific case study of how you can build one.

[09:25] But before we dive in: as an AI builder who tries loads of different models and software on my MacBook, this MacBook I bought back in 2021 started feeling really slow, sometimes even laggy during normal browser tasks, so I was just about to buy a new MacBook. Thanks to CleanMyMac X, I didn't have to. If you're on macOS and haven't used CleanMyMac yet, I definitely recommend it; it's the most popular and easiest-to-use Mac performance optimization tool. It has a super clean UX, and my favorite feature is Smart Scan: you click one button and it does cleanup, malware removal, and speed-up all in one combo. It takes only about two minutes to polish up even a slow machine, and just like that I removed around 6 GB of unneeded junk. They also have a feature called Space Lens, which gives me a bird's-eye view of my MacBook so I can see which files are taking up a huge amount of storage that I never use; I can dig into each folder multiple levels deep, see which types of files are taking a lot of storage that I can remove, and review and modify which applications run automatically when I start my Mac. CleanMyMac X can really turn my old MacBook into something that almost feels like a new machine, so I definitely recommend giving it a try. You can click the link in the description below to try CleanMyMac X free for 7 days, and they also offer a 20% discount code exclusively for this YouTube channel's subscribers.

[10:52] Now, back to our web scraping agent. At a high level, there are two ways I explored to build this kind of agentic web scraper. Method one is more API-based: the agent calls different API services that can already return data, but uses agentic behavior to scrape multiple data sources and assemble the information. Method two is to give the agent direct control of a web browser and teach it how to use the browser to get information back, for more complicated use cases where the website has complex pagination, infinite scrolling, CAPTCHAs, or authentication.

[11:22] The first one, the API-based agentic scraper, works in a pretty straightforward way. I basically want to build an agent that has access to different existing scrapers for getting information from websites. The agent can receive any kind of research task, like getting all the listed rental properties in SF from a specific website, and it can start scraping those websites, getting the raw data and using the large language model's capability to return structured information. We can take advantage of many marketplaces and platforms that already offer a huge number of data sources, like the RapidAPI marketplace, Apify, or even platforms like Exa AI, which basically embed the whole internet's data and let you do natural-language, semantic search, which can sometimes return more relevant information than Google.

[12:08] The challenges I faced with this type of agentic scraper: the first is clean data. The data returned is either raw DOM elements, which contain a huge amount of noise I don't need, or stripped-out text, where a lot of information gets removed in the process, like links, images, and files. Without clean data to start with, building this kind of agentic researcher is quite hard. Fortunately, an open-source project came out a few weeks ago called Firecrawl; it has grown really fast, with more than 2,700 stars on GitHub. It provides an API endpoint that can turn any website into LLM-ready markdown data: it removes all the noise but still keeps the links and the website structure. This makes the data fed into the large language model much cleaner and more relevant, so the performance is way better, and I'll show you how to use it to build a really high-performing web scraper.

[12:59] The other challenge I faced was the agent's planning and memory. If your scraping and research task is pretty simple, say you just want the number of employees at a certain company, it's normally fine. But quite often the scraping needs to be a bit more complicated. A common use case is collecting a list of different data points for a specific entity: for one company, users might want to know the number of employees, case studies, whether the product is live, the pricing, the competitors, all of that, from the website or from the internet. That process requires the agent to go through many different websites and data sources, so keeping the agent aware of which information has already been collected, so it can deliver the final result effectively, is a big challenge. One solution I tried that seems to work pretty well is giving the agent a scratchpad it can read and write data into: even though the agent is scraping through multiple websites, whenever it finds a specific piece of information, like the number of employees, a competitor, or a case study, it saves that information to a database and then dynamically updates its system prompt, inserting it as short-term memory.

[14:04] The second method I explored is a browser-control-based agentic scraper, which means we build an agent that has direct access to a web browser so it can simulate more complex user behavior, like going through multiple pages of pagination, handling CAPTCHAs, or even logging in to get past authentication. The way it works is that you use an automation library like Playwright or Puppeteer to set up a browser session for the agent. The agent can either look at a simplified DOM, decide the next actions, and simulate those user interactions (typing, clicking, scrolling) directly in the browser, or you can even use a multimodal model like GPT-4V: take a screenshot of the web page, do some pre-processing like drawing bounding boxes, feed that to GPT-4V to decide on the next actions, and then simulate the user interactions. Agents like this are really powerful; when built right, they can get past almost any website since they're simulating real user behavior. But this type of agent is also a lot more complex to build.

[15:06] I'll give you one example. If you're building an API-based agent, all it needs to do is call a function called send_email and the task is done. But if you're building a browser-based web agent to complete the same task of sending an email, it needs to take many steps to complete even that simple task: go to gmail.com, click on the search bar, search for a specific email, click on the right email, click the reply button, type out a response, and click send. So this type of agent poses a much bigger challenge in terms of planning and memory. On top of that, there is a lot of complexity in locating the right UI elements to interact with. Take a page like LinkedIn as an example: if you ask the agent to sign in, there are literally three different sign-in buttons on the page, and each one of them represents a different action, so sometimes it's quite difficult for the agent to know which is the right element to pick. And if you're using DOM-based agents instead of vision-based agents, the task is even harder. Here's a clip I showed in another video, where another web-agent builder talks about the challenges they faced in locating a specific UI element: "This website has given me nightmares. The issues with the HTML encodings are just endless. For example, you see this placeholder text in this input over here? It's actually not a placeholder; it's set dynamically by another element, not the search box, an element that's somewhere else in the document. That means when you go and look at the actual element for the search box, you don't actually know what it is, so that prevents the language model from understanding where to type its search queries."

[16:47] Luckily, there are teams working on libraries to make this task easier. One library that I've found works really well is called AgentQL. They built a special model where you describe the specific UI element you want to retrieve, and the model identifies and returns that element for the agent to interact with, so you can get an agent to interact with web UI elements quite reliably across different websites. On the left is a quick demo of a scraper that goes through different Tesla web pages to collect information and save it somewhere, and the same method even works on mobile UIs: you give the model the specific UI element you want and it automatically returns the right element on the screen. I also built a universal scraper for e-commerce websites using AgentQL that can scrape through all sorts of different e-commerce sites.

[17:42] The last part of the challenge is scale. Even though we can build quite powerful browser-based agents on a local machine to complete certain tasks, when you want to put them into a business setting you actually need to run them at scale, and the libraries we're using here, like Playwright and Puppeteer, are not lightweight, so running thousands of those web sessions is not easy. But there are teams working on this problem too, to let you deploy headless web browsers in the cloud very easily. One team is called Browserbase; they let you run headless web browsers at very large scale, so you can build browser-based agents that serve thousands or millions of customers very easily, like MultiOn does here. So there are many super talented teams around the world working on these individual problems, and together they really enable this kind of universal web scraper agent. You can already build a few very powerful web agents, and I'm going to show you step by step how to build these universal web scrapers.

[18:37] First, I want to build a universal agentic scraper where the user can give a list of companies and their websites, plus a list of data points they want to collect about each company, and the agent does the research and fills in all the empty cells. I want the scraper to work in a very specific way: it should research as much as possible within the company's own domain, meaning it can navigate to different pages of that company's website, since we know that data source has higher quality; but if the agent can't find a specific piece of information there, it can go out to the internet and search.

[19:10] To do that, I open Visual Studio Code and first create a .env file; this is where we store all the credentials, in our case a Firecrawl API key and an OpenAI API key. Next I create a new file called app.py. In it, we first import a few libraries and load the environment variables, then we define the OpenAI client; the GPT model we're going to use is GPT-4 Turbo. First, let's create a few functions the agent is going to use. One is a scrape function: I'm going to use the hosted Firecrawl service (they also have an open-source version, so if you want to host it yourself, feel free). It's pretty straightforward: I define a Firecrawl app, try to scrape the URL, and return only the markdown content. I also create a function called search. This is another service Firecrawl offers that lets you search across the internet and get clean data back; the agent will use this search function if it can't find the information on the company's own website. I also want to implement things so that the agent always saves whatever data points it finds into a database and uses that as additional context about what has already been scraped. The search endpoint cleans up the data and returns the most relevant search results, but the content returned can still be quite big. That's why my search function actually has two parts: first it calls the API endpoint and gets the search results; then it feeds those results into a large language model to summarize and extract only the most relevant information. So I call the search API endpoint, take the data keys (the list of column names the user defined), put them together as a prompt for the large language model, and run a chat completion to get the most relevant result back. This search function should return the most relevant information. Third, I add a function called update_data. This function lets the agent continuously read and write information to a database so it stays aware of the information already found and the data sources already visited; it takes some data points, updates them, and saves them to a global state. These three are the main pieces of functionality the agent needs access to.
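The video does not show the full source, so the sketch below is a loose reconstruction of the three tools just described (scrape, search, update_data). The Firecrawl endpoint paths and response shapes are assumptions based on the description, not verified against the official SDK, and the scratchpad fields are simply the example data points used later in the video.

```python
import os
import json
import requests
from openai import OpenAI

client = OpenAI()
FIRECRAWL_KEY = os.environ["FIRECRAWL_API_KEY"]
HEADERS = {"Authorization": f"Bearer {FIRECRAWL_KEY}"}

# Scratchpad the agent reads and writes so it knows what it has already found.
data_points = {"num_employees": None, "office_locations": None, "catering": None}
links_scraped = []

def scrape(url: str) -> str:
    """Return a page as LLM-ready markdown. The endpoint path and response
    shape are assumptions about Firecrawl's hosted API; check their docs."""
    resp = requests.post("https://api.firecrawl.dev/v0/scrape",
                         headers=HEADERS, json={"url": url}, timeout=60)
    links_scraped.append(url)
    return resp.json().get("data", {}).get("markdown", "")

def search(query: str) -> str:
    """Web search via an assumed Firecrawl search endpoint, then an LLM pass
    that keeps only the fields we still need."""
    resp = requests.post("https://api.firecrawl.dev/v0/search",
                         headers=HEADERS, json={"query": query}, timeout=60)
    raw = json.dumps(resp.json())[:20000]          # keep the prompt bounded
    wanted = ", ".join(data_points.keys())
    summary = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "system",
                   "content": f"From these search results, extract only: {wanted}."},
                  {"role": "user", "content": raw}])
    return summary.choices[0].message.content

def update_data(points: dict) -> str:
    """Write newly found values into the global scratchpad."""
    data_points.update({k: v for k, v in points.items() if v})
    return f"Saved. Current state: {json.dumps(data_points)}"
```

Keeping the scratchpad in module-level globals mirrors the "global state" the video mentions; a real deployment would likely persist it in an actual database.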

[21:17] Next we just need to build a function-calling agent from scratch. I have a few helper functions: one is chat_completion_request; another logs all the agent's thinking and results; and I define the tool list. I also create a function called memory_optimize to deal with one challenge: the large language model has a limited context window, so we need to make sure the agent doesn't exceed the token limit. I did a pretty simple implementation: if there are more than 24 back-and-forth messages in the conversation, or the agent's chat history exceeds 10,000 tokens, it automatically keeps the last 12 messages, summarizes the older conversation using a smaller model like GPT-3.5 Turbo, and assembles a new chat history. I have one more helper function, call_agent, where we can decide whether we want the agent to make a plan before it starts acting: if plan is true, the agent first thinks step by step and makes a plan, then continues the conversation; if not, the agent just starts working directly without planning. Then we write a loop that keeps the agent executing until the large language model returns a finish reason of "stop", and we run the memory optimization after every single step. That's pretty much it.
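Again as a loose reconstruction rather than the video's actual code, here is roughly what the memory optimization and the function-calling loop could look like. The plan-first option and exact token counting are omitted for brevity, and `tools`/`tool_map` are assumed to be standard OpenAI function-calling schemas plus a name-to-function mapping for the tools sketched above.

```python
import json
from openai import OpenAI

client = OpenAI()

def memory_optimize(messages: list) -> list:
    """If the history is long (>24 turns, as in the video; token counting omitted),
    summarize everything except the last 12 messages with a cheaper model."""
    if len(messages) <= 24:
        return messages
    old, recent = messages[:-12], messages[-12:]
    summary = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": "Summarize this conversation."},
                  {"role": "user", "content": str(old)}],
    ).choices[0].message.content
    return [messages[0],  # keep the original system prompt
            {"role": "assistant", "content": f"Summary of earlier steps: {summary}"},
            *recent]

def run_agent(system_prompt: str, task: str, tools: list, tool_map: dict) -> str:
    """Loop until the model stops calling tools (finish_reason == 'stop')."""
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": task}]
    while True:
        resp = client.chat.completions.create(
            model="gpt-4-turbo", messages=messages, tools=tools)
        choice = resp.choices[0]
        messages.append(choice.message)
        if choice.finish_reason == "stop":
            return choice.message.content
        for call in choice.message.tool_calls or []:
            result = tool_map[call.function.name](**json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": str(result)})
        messages = memory_optimize(messages)  # run after every step
```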

[22:31] All we need to do now is connect everything together. As I mentioned before, I want this agent to follow a very specific procedure: do as much research as possible within the company domain, and if nothing can be found there, go search the internet. That's why I break the workflow into two stages. Stage one is the website search, where the agent has access to two functions, scrape and update_data; the agent researches the company website for the data points that aren't known yet. Stage two is the internet search, where the agent gets the search function plus update_data; again it does research, but on external internet sources. Here I pass in the entity we want to research, the links we've already scraped, and the data points we haven't found yet as instructions. Now we can trigger the agent: create a global state with the links scraped and the data points we want the agent to collect. In my specific case, I want to find companies that offer catering to their employees, plus the number of employees and the office locations, so that if I were Uber Eats for Business I'd know whether a company is a good candidate. The entity name is Discord, along with its website. I run the two stages and at the end return the data points. Then I open the terminal and run python app.py.
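Continuing the sketches above, the two-stage run could be wired together roughly like this; the prompts, the Discord example values, and the tool_schema helper are illustrative and not the video's exact code.

```python
def tool_schema(name: str, description: str, params: dict) -> dict:
    """Helper to build an OpenAI function-calling tool definition."""
    return {"type": "function",
            "function": {"name": name, "description": description,
                         "parameters": {"type": "object", "properties": params,
                                        "required": list(params)}}}

scrape_tool = tool_schema("scrape", "Fetch a URL as markdown",
                          {"url": {"type": "string"}})
search_tool = tool_schema("search", "Search the web",
                          {"query": {"type": "string"}})
update_tool = tool_schema("update_data", "Save found data points",
                          {"points": {"type": "object"}})

def missing() -> list:
    return [k for k, v in data_points.items() if v is None]

# Stage 1: stay on the company's own domain (scrape + update_data only).
run_agent(
    system_prompt="Research the entity using only pages on its own website. "
                  "Save anything you find with update_data.",
    task=f"Entity: Discord, website: https://discord.com. Missing: {missing()}",
    tools=[scrape_tool, update_tool],
    tool_map={"scrape": scrape, "update_data": update_data},
)

# Stage 2: whatever is still missing gets searched on the open internet.
run_agent(
    system_prompt="Search the internet for the remaining data points and save them.",
    task=f"Entity: Discord. Already scraped: {links_scraped}. Missing: {missing()}",
    tools=[search_tool, update_tool],
    tool_map={"search": search, "update_data": update_data},
)

print(data_points)
```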

[23:49] First the agent makes a plan about which data sources to research. It scrapes the Discord page, which returns a bunch of sub-pages, and then starts researching the careers page, I believe; but it doesn't look like Discord's official website provides any useful information, so it moves to the second stage, internet search, and starts calling the search function. This time it finds some information: the number of employees is 750, from a data source I can click on (one of the websites it searched). It then updates its data knowledge about the number of employees and office locations. After that it does a second search and finds some catering information: Discord provides employees with special perks like a fund to invest in their work setup, as well as online events like cooking classes. In the end it returns the data points (the catering offered to employees, the number of employees, and the office locations), and each data point has a reference link to back it up. So this is an example of how you can create a very smart universal scraper that can search the internet as well as the specific website it was given to find information.

[24:56] This works for general websites that don't have content gating, and for information that doesn't require complex UI interaction like pagination. But if you want to scrape, say, an e-commerce website where you need to apply a specific search query and also work through pagination, then you need the agent to control the web browser directly. This is the second type of agentic web scraper, and I'm going to use a library called AgentQL to set up a universal e-commerce website scraper.

[25:27] What I want to achieve is an agent where I can give it a URL and it will scrape all the product information, clicking through every page of the pagination until all the information has been collected. To do this, I first go to the .env file and add the AgentQL API key as well. Then I create a new file called browser_agent.py and load a few libraries. I first define a query called products; this is the query we give AgentQL to return the information we want: for each product I want the product name, price, number of reviews, and rating. Alongside that I also want the next-page button in the pagination, and you'll notice I ask for two elements: the next-page button in its enabled state and the next-page button in its disabled state. The reason I want both is that even when we reach the last page of the pagination there will still be a next button, but it will be disabled, so I can write a while loop that keeps going to the next page until the disabled next-page button shows up.

[26:29] So that's pretty much it. We first start an AgentQL session with the URL, and I also scroll to the bottom, because on some e-commerce websites the pagination only loads after you scroll to the bottom. Then I create a CSV file opened in append mode, so it keeps adding new data, with four columns: product name, product price, number of reviews, and rating. While the enabled next-page button exists and the disabled next-page button is None, I run another AgentQL query to get the list of products, and for each product returned I add a row to the CSV file. At the end of each iteration I click the next-page button, scroll to the bottom again, fetch the next-page button again, and repeat the process until it reaches the last page. That's pretty much it.
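The video drives this loop with AgentQL queries; since those exact SDK calls are not reproduced here, the sketch below shows the same control flow with plain Playwright and hypothetical CSS selectors standing in for the AgentQL query.

```python
import csv
from playwright.sync_api import sync_playwright

URL = "https://www.example.com/s?k=perfume"        # placeholder search-results URL
NEXT_ENABLED = "a.pagination-next:not(.disabled)"  # hypothetical selectors; the
NEXT_DISABLED = "a.pagination-next.disabled"       # video locates these elements
PRODUCT_CARD = "div.product-card"                  # via AgentQL queries instead

with sync_playwright() as p, open("products.csv", "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "price", "num_reviews", "rating"])
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto(URL)
    while True:
        # Pagination controls on many shops only render after scrolling down.
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(1500)
        for card in page.locator(PRODUCT_CARD).all():
            writer.writerow([
                card.locator(".name").inner_text(),
                card.locator(".price").inner_text(),
                card.locator(".reviews").inner_text(),
                card.locator(".rating").inner_text(),
            ])
        # Stop once the "next" button exists only in its disabled state.
        if page.locator(NEXT_DISABLED).count() > 0 or \
           page.locator(NEXT_ENABLED).count() == 0:
            break
        page.locator(NEXT_ENABLED).first.click()
        page.wait_for_load_state("networkidle")
    browser.close()
```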

[27:19] Let's try it out. I open the terminal and run python browser_agent.py. It opens the browser and automatically scrolls to the end of the page, where the pagination gets loaded. If I open the terminal on the right side, you can see it finds the enabled button ("go to next page, page 2") while the disabled button is None. It scrapes the first page's product information, clicks next, goes to the next page, starts scraping the second page's product information, and gets the next-page button for page 3. It repeats this process until it hits the last page, and once it's finished, all the product data is saved to the CSV file. If I open the CSV file, you can see it automatically returned all the product information across multiple pages.

[28:02] What's really cool is that this scraper works not only for Amazon but for almost any e-commerce website. For example, I can go to eBay, copy an eBay search-results URL into the agent, and run the exact same script again. This time it opens eBay, scrolls to the bottom to get the pagination buttons, and if I open the terminal on the right side you can see it again gets the next-page button, returns that page's product information, fetches the next-page button again, and repeats the process over and over until it has all the information it needs. So this is another quick example of how you can create a browser-based agent that scrapes a specific site.

[28:40] As you can see, there's a huge number of possibilities for the scrapers you can create, and I'm really interested to see what kind of scraper you'll start building. This is an area I'm personally really passionate about, and I do plan to launch a universal agent scraper that lets people scrape any type of information on the internet. If you're interested in that universal agent scraper, I've put a link in the description below so you can sign up; once I finish the scraper agent I will let you know. Please comment below with any questions you have. Thank you, and I'll see you next time.


Related Tags
Web Scraping, AI Agents, Data Extraction, Internet Bots, API Services, User Behavior, Headless Browsers, Scripting Challenges, E-commerce Scraper, Large Language Models