This AI Agent can Scrape ANY WEBSITE!!!
Summary
TL;DR: This video introduces a new approach to web scraping built on large language models and the Firecrawl library. It demonstrates how to extract structured data from web pages with minimal effort, eliminating the need for manual page inspection. The tutorial guides viewers through setting up an API key, using the library to scrape Markdown from URLs, and then leveraging OpenAI's GPT-3.5 Turbo model to convert this Markdown into JSON and Excel formats. The script showcases the process with examples, including scraping a real estate website and a French property listing, highlighting the flexibility and universality of this method. The video concludes with a reminder of the challenges, such as context length limits, and the potential of this technology to transform web scraping.
Takeaways
- 📚 The video discusses the use of libraries that leverage large language models to scrape web data without manual intervention, offering a more efficient alternative to traditional methods like BeautifulSoup.
- 🔍 It highlights the advantages of these libraries, such as saving effort and creating universal web scrapers that can be applied to various websites with minimal changes to the code.
- 🔑 The presenter introduces Firecrawl, an open-source library with a large community following, and demonstrates how to obtain an API key for using its services.
- 💻 The workflow for the universal web scraping agent involves passing a URL to Firecrawl to get Markdown, which is then processed by a large language model to extract structured data.
- 📝 The presenter guides through setting up a new Python project, including creating a virtual environment, handling API keys, and installing necessary packages.
- 👨💻 The script includes a step-by-step coding tutorial, starting from initializing the Firecrawl app with an API key to defining functions for scraping, saving, and formatting data.
- 🤖 The use of large language models, such as OpenAI's GPT models, is emphasized for intelligent text extraction and conversion from raw markdown to structured JSON format.
- 🏠 A practical example using Zillow's website is provided to illustrate how the code can extract real estate listing data, including address, price, and other relevant details.
- 🌐 The video demonstrates the flexibility of the code by showing its application on different websites, including those in a foreign language, highlighting the power of large language models in web scraping.
- 🛠 The presenter addresses potential issues, such as context length limitations of language models, and provides solutions like switching to a model with a larger context size.
- 📊 The tutorial concludes with a successful demonstration of extracting and saving data in JSON and Excel formats, showcasing the effectiveness of the approach.
Q & A
What are the advantages of using large language models for web scraping compared to traditional methods like Beautiful Soup?
-The advantages include saving effort, creating a universal web scraper for specific use cases, and the ability to scrape data from multiple websites with minimal changes to the code.
What is the Firecrawl library and how does it contribute to the web scraping process?
-Firecrawl is an open-source library with around 4,000 GitHub stars that can be used to scrape web pages. It contributes by returning the entire page as Markdown, without HTML tags, which simplifies the data extraction process.
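For illustration, a minimal sketch of that scraping step, assuming the firecrawl-py client, python-dotenv, and an API key stored under FIRECRAWL_API_KEY (the environment variable name and the exact response shape are assumptions and may differ across library versions):

```python
import os
from dotenv import load_dotenv
from firecrawl import FirecrawlApp  # pip install firecrawl-py

def scrape_data(url: str) -> str:
    """Fetch a page through Firecrawl and return its Markdown content."""
    load_dotenv()  # reads FIRECRAWL_API_KEY from the .env file
    app = FirecrawlApp(api_key=os.getenv("FIRECRAWL_API_KEY"))
    result = app.scrape_url(url)
    # The video describes the response as containing a 'markdown' field.
    if not result or "markdown" not in result:
        raise ValueError(f"Empty or invalid response for {url}")
    return result["markdown"]
```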
How does the speaker plan to demonstrate the effectiveness of the new web scraping libraries?
-The speaker plans to demonstrate by creating code that can scrape data from different types of websites with minimal changes, showcasing the universality of the approach.
What is the significance of obtaining Markdown from Firecrawl instead of raw HTML?
-Markdown is significant because it is a much cleaner data format that requires far fewer tokens to process with a large language model, making the extraction step more efficient and cost-effective.
What is the role of the large language model in the web scraping process described in the script?
-The large language model is used to extract structured data from the Markdown provided by Firecrawl. It acts as an intelligent text extraction and conversion assistant, generating JSON data from the raw Markdown.
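A minimal sketch of that extraction call, assuming the openai Python client (v1+) with JSON-mode chat completions; the prompt wording and field list below are paraphrased from the video rather than copied from its code:

```python
import json
import os
from openai import OpenAI

def format_data(markdown: str, fields: list[str]) -> dict:
    """Ask the model to pull the requested fields out of raw Markdown as JSON."""
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
    system_message = (
        "You are an intelligent text extraction and conversion assistant. "
        "Extract the requested information from raw text and return pure JSON."
    )
    user_message = (
        f"Extract the following information from the provided text: {fields}\n\n{markdown}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        response_format={"type": "json_object"},  # forces a JSON-formatted reply
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
    )
    return json.loads(response.choices[0].message.content)
```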
What is the workflow of the universal web scraping agent described in the script?
-The workflow involves passing a URL to Firecrawl to get Markdown, then using a large language model to extract information according to specified fields, resulting in semi-structured data that is then formatted and saved.
How does the speaker handle the potential issue of different JSON names inside the structured data?
-The speaker acknowledges that the key names inside the JSON cannot be fully controlled, which is why the data is referred to as semi-structured. The code includes a step to handle this variability, for example when all listings come back nested under a single top-level key.
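The exact check is only described in the video, so the sketch below is an assumption about its shape: if the model nests everything under one wrapper key, unwrap it before building the table.

```python
def unwrap_single_key(data: dict):
    """If the model returned e.g. {'listings': [...]}, return the inner value;
    otherwise return the data unchanged."""
    if isinstance(data, dict) and len(data) == 1:
        return next(iter(data.values()))
    return data
```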
What are the storage options mentioned by the speaker for saving the scraped data?
-The speaker mentions JSON and Excel as storage options for the scraped data, indicating flexibility in how the data can be saved and accessed.
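A short sketch of that saving step using the standard json module and pandas; the file names, output folder, and Excel engine (openpyxl) are assumptions rather than details taken from the video's code:

```python
import json
import os
import pandas as pd  # writing .xlsx also requires openpyxl

def save_formatted_data(data, timestamp: str, output_dir: str = "output") -> None:
    """Persist the extracted records both as a JSON file and as an Excel sheet."""
    os.makedirs(output_dir, exist_ok=True)

    json_path = os.path.join(output_dir, f"listings_{timestamp}.json")
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)

    records = data if isinstance(data, list) else [data]
    pd.DataFrame(records).to_excel(
        os.path.join(output_dir, f"listings_{timestamp}.xlsx"), index=False
    )
```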
How does the speaker address the issue of different website structures in the scraping process?
-The speaker uses a large language model to handle different website structures, allowing the same code to be used for scraping data from various websites without needing to inspect or understand each page's unique structure.
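As a usage sketch of that point, the same pipeline can be rerun against structurally different sites by changing only the URL. This reuses the scrape_data, format_data, and save_formatted_data helpers sketched in the answers above; the URLs and field list here are placeholders:

```python
from datetime import datetime

fields = ["address", "real_estate_agency", "price",
          "beds", "baths", "square_feet", "listing_url"]

urls = [
    "https://www.zillow.com/...",             # US real estate listings
    "https://www.example-immobilier.fr/...",  # hypothetical French listing site
]

for url in urls:
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    markdown = scrape_data(url)               # Firecrawl -> Markdown
    listings = format_data(markdown, fields)  # LLM -> semi-structured JSON
    save_formatted_data(listings, timestamp)  # JSON + Excel in ./output
```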
What is the potential limitation the speaker encounters when trying to scrape data from a French website?
-The potential limitation encountered is the model's maximum context length, which may not be sufficient to process very long raw data from certain websites, such as the French website mentioned.
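The video's fix is to switch manually to a larger-context model; as a hedged variation, the sketch below makes that fallback automatic, using tiktoken only to estimate the page's token count (the threshold is rough, based on the roughly 16k-token window of gpt-3.5-turbo-1106 versus the much larger window of gpt-4o):

```python
import tiktoken

def pick_model(markdown: str) -> str:
    """Fall back to a larger-context model when the scraped page is too long."""
    enc = tiktoken.get_encoding("cl100k_base")
    n_tokens = len(enc.encode(markdown))
    # Leave headroom for the system prompt and the JSON answer.
    return "gpt-3.5-turbo-1106" if n_tokens < 12_000 else "gpt-4o"
```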
Outlines
🤖 Leveraging Large Language Models for Web Scraping
This paragraph introduces the concept of using large language models to automate web scraping tasks. It discusses the benefits of such libraries, like saving effort and creating universal scripts that can scrape data from various websites with minimal changes. The speaker also covers obtaining an API key from the Firecrawl library, which is open source and has a significant community following. The workflow involves passing a URL to Firecrawl to retrieve Markdown, which is then given to a large language model for extraction into structured data. The speaker emphasizes the financial and practical advantages of this approach over traditional methods like BeautifulSoup.
🛠 Setting Up the Web Scraping Environment
The speaker outlines the steps for setting up the development environment for web scraping using large language models. This includes creating a new folder, initializing a new Python file, setting up a virtual environment, and obtaining necessary API keys. The paragraph details the installation of required packages and the initialization of the project with specific files for storing API keys. The speaker then proceeds to import necessary modules for the project and begins coding functions for scraping data and saving it in a structured format.
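A minimal sketch of that key handling, assuming python-dotenv and a .env file holding both keys; the variable names are illustrative, since the video does not show them on screen:

```python
# .env (kept out of version control)
#   FIRECRAWL_API_KEY=fc-...
#   OPENAI_API_KEY=sk-...

import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # pull both keys into the process environment
firecrawl_key = os.getenv("FIRECRAWL_API_KEY")
openai_key = os.getenv("OPENAI_API_KEY")
assert firecrawl_key and openai_key, "Add both API keys to the .env file first"
```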
🔍 Implementing the Data Scraping and Saving Functions
This section delves into the coding process for the web scraping project. The speaker describes the creation of functions to load API keys, initialize the Firecrawl app, scrape URLs, and handle Markdown responses, including error handling for empty responses. A function that saves the raw extraction as a timestamped file in a folder called 'output' is also explained. The speaker then focuses on the 'format_data' function, which uses OpenAI to extract structured data from the Markdown. The process involves defining fields for extraction, crafting a system message, and using OpenAI's chat completion to parse the data into JSON format.
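For the raw-data step specifically, a hedged sketch of saving the scraped Markdown into the 'output' folder under a timestamped name (the naming scheme is an assumption, not the video's exact code):

```python
import os

def save_raw_data(markdown: str, timestamp: str, output_dir: str = "output") -> str:
    """Write the raw Markdown to output/raw_<timestamp>.md and return the path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, f"raw_{timestamp}.md")
    with open(path, "w", encoding="utf-8") as f:
        f.write(markdown)
    return path
```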
📊 Running the Scraper and Analyzing Results
The speaker demonstrates running the web scraping code and discusses the initial issues encountered due to datetime module naming conflicts. After resolving these issues, the speaker shows the successful extraction of raw data, its conversion into JSON and Excel formats, and the storage of this data. The speaker highlights the ability of the code to scrape data from different websites with varying structures, including a French website, by simply rerunning the code with a new URL. The speaker emphasizes the universality of the scraper and its ability to handle different languages and structures without manual inspection of web pages.
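The naming conflict mentioned here is the usual `import datetime` versus `from datetime import datetime` mix-up; a small sketch of the fix the video applies:

```python
# Fails: with `import datetime`, the name refers to the module,
# so datetime.now() raises AttributeError.
# import datetime
# timestamp = datetime.now()

# Works: import the datetime class out of the datetime module.
from datetime import datetime

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
print(timestamp)
```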
🌐 Testing the Scraper with a Foreign Language Website
In this final paragraph, the speaker tests the scraper's capabilities by attempting to scrape data from a French website. Initially, an error occurs due to the model's maximum context length limit, which prevents processing of the lengthy raw data. After switching to a model with a larger context size, the speaker successfully extracts data from the French website. The speaker discusses the results, noting some discrepancies in the interpretation of 'beds' and the confusion between square feet and square meters. Despite these minor issues, the speaker concludes by celebrating the scraper's versatility and the ease with which it can be adapted to different websites and languages.
Keywords
💡Large language models
💡Web scraping
💡Firecrawl
💡Structured data
💡API key
💡Markdown
💡JSON
💡Virtual environment
💡Pandas
💡Universal web scraper
Highlights
Introduction of new libraries leveraging large language models for web scraping without manual inspection.
Advantages of using these libraries include effort saving and universality across different websites.
Demonstration of creating an account and obtaining an API key for the Firecrawl library.
Use of Firecrawl to scrape a webpage and receive Markdown without HTML tags, reducing the token count.
Explanation of the universal web scraping agent workflow involving URL input, markdown extraction, and data formatting.
Setup of a new project environment with a virtual environment and API keys for Firecrawl and OpenAI.
Installation of required packages for the web scraping project.
Coding the 'scrape_data' function to load API keys and initialize the Firecrawl app for URL scraping.
Development of the 'save_raw_data' function for saving scraped data in an 'output' folder.
Creation of the 'format_data' function, which uses OpenAI to extract structured data from the Markdown.
Use of system and user messages to instruct the AI on extracting specific fields from raw data.
Implementation of the 'save_formatted_text' function to save data in JSON and Excel formats.
Running the code to scrape a real estate listing website and structure the data.
Comparison of scraped data with the original website to validate accuracy.
Application of the same code to scrape a different website with a different structure.
Challenge of scraping a French website and the model's context length limitation.
Successful scraping of a French real estate website despite language and unit differences.
Conclusion on the capability of the created code to act as a universal web scraper using large language models.
Transcripts
So lately there has been a couple of
libraries that can use the power of
large language models to scrape the web
without us having to basically do
anything they will read the URL and give
us either a markdown or sometimes even a
structured data that we can add in our
Excel sheet or store in our database and
since we have really strong and really
cheap large language models today it
makes sense to use these libraries
instead of going with the free
alternative of beautiful soup or other
ways where we have to inspect the web
page understand the structure and locate
the specific elements that we want to
scrape so the advantages of these new
packages are so many and the biggest of
all of these advantages is as we said
saving the effort but also creating a
script that can act as a universal web
scraper for that specific use case that
you have say for example you want to
scrape data out of a News website you
can use the same code to scrape from
multiple news websites and sometimes you
can even use that same code to scrape
from other websites that have nothing to do
with news where you are looking for
totally different information and today
we are going to see how we can create
such code and how it can help you scrape
the web with minimal changes so let's go
ahead and jump to my screen all right so
before opening uh vs code and starting
to work let's just discover Firecrawl which
is the library that we are going to be
using by the way it's open sourced
and it has 4,000 stars so here if we
come back to Firecrawl and create an account
you can just basically go to accounts in
here and get an API key that is the API
key that we will use later on in our
project so if we go to playground and
let's say for example we want to scrape
the OpenAI pricing page here and click on
run we will see that we will receive
markdowns of the entire page and the
thing is we don't have any type of divs
or lists or any type of tags that we
have inside the HTML so this is very
important because before if you want to
do extraction using HTML you will
basically have to pass the whole
structure into a large language model
and that means that that is a lot of
tokens sometimes it goes well over
100,000 tokens so the fact that we have
markdowns will help tremendously in
order to get a cleaned enough data that
we can pass to a large language model
and then of course from there it will
make sense financially to use any new
cheap model in order to do the
extraction and make it a structured data
so now let's see the universal web
scraping agent workflow we have to see
that before opening vs code so you know
exactly what I am doing whenever I am
writing whatever code so our input is
going to be the URL that is always going
to be the case this URL will be passed
to Firecrawl in order to get the
markdowns once we get the markdowns from
Firecrawl we are going to give that to a large
language model it could be an OpenAI
model like GPT-3.5 or GPT-4o or Gemini
Flash from Google or any other model
then of course we are going to ask it to
extract something from the markdown
according to our fields we have to
basically tell exactly which fields we
want to extract from this markdown after
that we going to get semi structured
data so even though that we are going to
get a Json answer from the data
extraction we cannot control 100% of the
names inside of that Json this is why I
call it semi structured even though it is
structured and once we are going to
get that data we're going to go to
another stage where we are going to
format and save the data so we're going
to format it according to Json and then
we are going to have it in a data frame
a pandas data frame and we're going to
save both of them here you can basically
have a database or any sorts of storage
medium that you prefer and here I chose
Json and Excel so let's go ahead and
open VS Code here we are going to
create always a new folder so let's call
it listing firecrawl and then inside of
here we are going to create a new file
let's call it app.py and of course we are
going to create a virtual environment so
python -m venv venv and then let's go
ahead and get inside of the virtual
environment so let's do venv\Scripts\activate
let's clear this out
and then the latest step that we are
going to do in the initiation of every
project is creating a new file
let's call
it .env this is where we are going to place
our API keys so here if I go to accounts
I will be able to copy this API key
and then I will of course use open AI so
for that we are going to use OpenAI and
at this point everyone knows how to get
an OpenAI API key let's copy it and then
we are going to place it inside of here
okay so now we are going to install all
of the requirements that we need so I
already have a file and this file I am
going to put it right here these are all
the packages that we are going to need
in our project so now we are going to
pip install -r requirements.txt by the way
it's firecrawl-py not just firecrawl if you
want to install it independently without
the other ones okay now that it has
finished installing everything we are
going to start coding let's clear this
out and let's import so from firecrawl we
are going to import FirecrawlApp from openai
then we are going to import os import
json and then import pandas as pd and then
lastly we are going to import datetime
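A sketch of that import block as described here; the exact names are inferred from the spoken description (the OpenAI client class and the dotenv helper are assumptions), and the plain datetime import is later changed as discussed further down:

```python
from firecrawl import FirecrawlApp
from openai import OpenAI
from dotenv import load_dotenv
import os
import json
import pandas as pd
import datetime  # later changed to: from datetime import datetime
```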
okay so the first function that we are
going to start with is going to be
scrape data and the scrape data will
only take the URL first thing that we're
going to start with is to load the dotenv
so we can load the API keys that we
have here and then we are going to use
that API key to initialize the Firecrawl app
that we have here and then we're going
to use that app to scrape the URL that
we have here we are going to delete this
for now we're not going to need it and
then we are going to check for markdown
in case we have an empty response or we
have any kind of problem then we are
going to return the markdown if not the
case we are going to return an error the
second function that we will have is
basically saving that raw data cuz I am
not a fan of doing an extraction and
then not saving the data somewhere so we
are going to save it inside a folder
called output that we have here so it is
pretty straightforward by the way it
just created so let's call it output
let's create it in here and then we will
have a .md file which is
basically a text file inside of our
output folder that we have here and it's
going to be saved according to the
datetime of when we are running the
process so now we are going to go to the
most important part and please it is not
very complicated I'm going to paste it
in here it's just that the system prompt
is basically very long but other than
that it is not that complicated so this
is the format data function and this is
the function responsible of taking the
raw data that we will have and then
extracting the structured data from that
markdown that we had before so here we
are going to initiate our OpenAI
client inside of a client variable and then if we
don't provide this optional parameter we
will basically end up with this Fields
by the way the use case that we are
going to use is going to be Zillow so in
our use case this is going to be the
website that we are going to extract as
you can see here we have a map in here
we have the listings around here so
basically what we want to do is that we
want to extract data out of this website
here and structure it so here as you can
see the fields that we have here are
basically the fields that we can find
normally in a real estate listing so the
address real estate agency the price the
beds etc etc so these are the
information that we are going to be
extracting from the website then we will
have the system message and the user
message that we are going to store
inside of of a variable in here so here
you are an intelligent text extraction
and conversion assistant Your Role is
basically to take raw data and then
extract a Json format from it this is
very important we have to mention that
it is a Json format and then later on we
are going to see that the response
format being a Json object this is
incredibly important if we don't do this
we're not guaranteed to have a Json
response every time and then we are
going to get the response out of the
chat completion from OpenAI we're not
going to use GPT-4o or GPT-4 Turbo we're only
going to use gpt-3.5-turbo-1106 and then we are
going to pass the system message and the
user message by the way inside of the
user message we have is extract the
following information from the provided
text then we are going to have the data
this is the data that we have already
saved that we will get from scrape data
and then of course we will use the
fields that we have provided in here
good now we're going to go if we had
response and our response is not empty
we are going to parse that Json and
we're going to use json.loads in order
to get that string into a Json format
that we will later save in the next
function so here we will have our last
function which is basically save the
formatted text that we will save in a
Json format and then in an Excel format
in order for us to be able to visualize
it easily so inside of the output folder
always we are going to get that Json
format that we have just got from here
so the return value that we have here is
going to pass into here and then we are
going to create a Json format from here
and then we have a little function that
we will see why it will be useful later
on and then of course we are going to
get that format of data in a data frame
that we will later on save as an Excel
sheet okay so now let's go ahead and add
our last bit of code in order to run the
process and here we have the last part
of our code where we are going to
initiate a time stamp in order to use it
later on inside of our functions we are
going to call our functions using the
URL that we have here so this is
the first website that we are going to
extract and then we are going to save
that raw data that we just got here
using our function and then we are going
to pass that raw data again to format
the data and then we are going to save
it in the format of a Json and an Excel
sheet so let's go ahead and run our code
and see what's going to happen okay so
here we have a problem and it's because
of datetime so we should use from
datetime import datetime because there is a
naming collision it actually
uses the module instead of using the
class inside datetime but that's
not important so let's run it again and
see what's going to happen all right so
we can already see that it has saved
the raw data that we have in here so we
are going to open it and as you can see
this is basically the markdown of the
whole page this markdown will then be
handed to OpenAI in order for it to
give us a Json and then Excel sheet so
this is the Json that it has come up
with and then from this Json it has been
able to format it and then to save it as
an Excel and here we have the sorted
data and then we have it as an Excel
sheet we are going to see both of them
so as you can see here in the output we
can see that instead of having just
basically the information that we want
usually it will give us one key and
inside of this one key we will have all
of the information this is why in my
code I have told you in format_data I have to
basically check if our dictionary has
only one key and if that's the case I
am going to go inside of dictionary to
get all the keys for my formatted data
so here if I go to my project and then
go to Output I will find my Excel sheet
that I can then open and inside of here
I can find all of the information that I
want so basically now from this website
I have been able to generate a structured
Json and then a structured Excel that if
I open I find all of the information
that I want structured and even URLs to
take me to the exact listing that I have
scraped the data from so here if we
click on this listing here it will take
me exactly to the listing and I will be
able to compare between my scraping and
the listing that I have here and all of
this without using any tags or any type
of page inspection that I would usually
do if I am using beautiful soup or
traditional ways of scraping the data
and even more than this we can use the
same code that we have here to scrape
other websites that have nothing to do
with the structure of this website that
we have just scraped this is the closest
code to a universal web scraper that I
have ever created and it is amazing how
easy it is to create these processes
using large language models today so if
I go back to the website in order to
compare the data we can see that here we
have $630,000 it is exactly the price we
have ,600 square ft so exactly what we
want and then we have the home type and
the listing age so I would say that on
the listing Age part this has not been
very successful it should have kept it
empty but for all of the other ones we
can see that it has been able to
actually get the data and if we do a
simple date subtraction we can get the
exact date when this house has been
listed so that is a problem that can be
easily resolved okay so this was the
first URL now let's go ahead and go to a
totally different website that has
nothing to do with this URL so this is
going to be a different website so let's
open it here so let's see if it's going
to be able to get data from here okay so
let's run it okay so we got the raw data
now and it has been able to get the
sorted data and then get us the Excel
sheet so first of all let's go to the
sorted data and see what it has been
able to uh extract so there is the
sorted data that has been able to
extract and then if we go back to our
folder we can find this Excel file that
if we open it we are going to find this
information so literally this code has
been able to scrape two different URLs
that has nothing to do with each other
simply because we have used large
language models so if we go back to our
Excel sheets and do the same thing we
are going to find that for example the
first one is this address with this real
estate agency with this price and two
beds and then one bath and 1,000 square
ft okay so that is already very good now
let's see if we are going to be able to
do this with a website that is basically
in a foreign language for example let's
see if this going to be able to work on
a website that is in French so this is a
French website these are homes in a city
called Lyon in France let's see if it's
going to be able to get those listings
and understand how to get them even
though that our prompts are in English
so let's run our code and here we have
an error and this is very important the
error is that the model's maximum
context length is 16,000 tokens so here
we can see that the model that I have
used does not have enough of a context
length size in order to be able to treat
the raw data that we have just gotten in
here so here we have the raw data that
is basically so long so it hasn't been
able to actually get all of this
information and basically process it so
if I go back here to response here I am
using GPT 3.5 turbo 1106 we can just
change that to GPT-4o and we can run the
code again and let's see if it's going
to be able to treat it this time okay so
we got the raw data it's here and this
time it's going to take more time in
order to process it because it is so
many tokens
and finally we got an answer finally I
have waited so long for this one okay so
we got the raw data it is very long as
we said and then we got an answer so
let's go ahead and visualize the answer
let's group them by type Yep this is the
last one so let's open it and as you can
see here even in French it has been able
to scrape all the data so here I can see
the price of course the price is going
to be in Euros the address the real
estate agency the number of beds and of
course if I click here and I open it
it's going to open exactly the listing
so here we have it so as you can
see here if we compare between the two
we can see that we have here the address
the right one we have the agency I don't
know where I found it but yes here so
this is the agency the price is correct
and then we have here three beds I don't
know what beds mean in English in French
rooms that can be closed are called chambres
so maybe it is the equivalency but it's
like everything like rooms that can be
closed or living rooms or anything else
so here we have quatre so I don't know if
it should have four in here you guys
tell me what beds mean what do you mean
by beds in English you can have three
beds in one room I don't understand the
logic of it but you guys tell me in the
comments if this is correct or no we
don't have any indication about baths
it's not really something that the French
talk about that much but here we have
the first clearly wrong field here we
have square feet and here we have the
square meter so this is the imperial
system this is the metric one the
correct one and it could not understand
that this is basically square meters it
should have at least indicated that here
we have the square meter and of course
here we have the the rest of the
information so basically that is it so
it has been able same code to extract a
totally different URL so that is already
very good this code would have been impossible
a year or two years ago it was impossible
to create this kind of a universal web
scraper that will work in any instance
without having to deal with any
inspection or web page specifications
anyways that has been me guys thank you
guys so much for watching it has been a
long video I know but thank you guys for
staying all the way through I really
appreciate it and catch you guys next
time peace