This AI Agent can Scrape ANY WEBSITE!!!

Reda Marzouk
23 May 2024 · 17:44

Summary

TLDR: This video introduces an approach to web scraping built on large language models and the Firecrawl library. It demonstrates how to extract structured data from web pages with minimal effort, eliminating the need to inspect pages manually. The tutorial covers setting up an API key, using the library to scrape a URL into markdown, and then leveraging OpenAI's GPT-3.5 Turbo model (switching to GPT-4o when a page exceeds its context window) to convert that markdown into JSON and Excel output. The process is demonstrated with examples, including a real estate website and a French property listing, highlighting the flexibility and near-universality of the method. The video concludes with a reminder of the remaining challenges, such as context-length limits, and the potential of this technique to transform web scraping.

Takeaways

  • πŸ“š The video discusses the use of libraries that leverage large language models to scrape web data without manual intervention, offering a more efficient alternative to traditional methods like BeautifulSoup.
  • πŸ” It highlights the advantages of these libraries, such as saving effort and creating universal web scrapers that can be applied to various websites with minimal changes to the code.
  • πŸ”‘ The presenter introduces 'firec', an open-source library with a large community following, and demonstrates how to obtain an API key for using its services.
  • πŸ’» The workflow for the universal web scraping agent involves passing a URL to 'firec' to get markdown, which is then processed by a large language model to extract structured data.
  • πŸ“ The presenter guides through setting up a new Python project, including creating a virtual environment, handling API keys, and installing necessary packages.
  • πŸ‘¨β€πŸ’» The script includes a step-by-step coding tutorial, starting from initializing the 'firec' app with an API key to defining functions for scraping, saving, and formatting data.
  • πŸ€– The use of large language models, such as OpenAI's GPT models, is emphasized for intelligent text extraction and conversion from raw markdown to structured JSON format.
  • 🏠 A practical example using Zillow's website is provided to illustrate how the code can extract real estate listing data, including address, price, and other relevant details.
  • 🌐 The video demonstrates the flexibility of the code by showing its application on different websites, including those in a foreign language, highlighting the power of large language models in web scraping.
  • πŸ›  The presenter addresses potential issues, such as context length limitations of language models, and provides solutions like switching to a model with a larger context size.
  • πŸ“Š The tutorial concludes with a successful demonstration of extracting and saving data in JSON and Excel formats, showcasing the effectiveness of the approach.

Q & A

  • What are the advantages of using large language models for web scraping compared to traditional methods like Beautiful Soup?

    -The advantages include saving effort, creating a universal web scraper for specific use cases, and the ability to scrape data from multiple websites with minimal changes to the code.

  • What is the Firecrawl library and how does it contribute to the web scraping process?

    -Firecrawl is an open-source library with around 4,000 GitHub stars that can scrape web pages. It contributes by returning the entire page as markdown, without HTML tags, which simplifies the data extraction step.

  • How does the speaker plan to demonstrate the effectiveness of the new web scraping libraries?

    -The speaker plans to demonstrate by creating code that can scrape data from different types of websites with minimal changes, showcasing the universality of the approach.

  • What is the significance of obtaining markdown from Firecrawl instead of raw HTML?

    -Markdown is significant because it is a cleaner format that requires far fewer tokens for a large language model to process, making the extraction step more efficient and cost-effective.

  • What is the role of the large language model in the web scraping process described in the script?

    -The large language model extracts structured data from the markdown returned by Firecrawl. It acts as an intelligent text extraction and conversion assistant, turning the raw markdown into JSON (see the prompt sketch below).
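
A rough, paraphrased sketch of that prompt pair (the video's actual system prompt is longer, and the variable names and field list here are illustrative placeholders):

    raw_markdown = "..."                       # markdown returned by Firecrawl (placeholder)
    fields = ["Address", "Price", "Beds"]      # illustrative subset of fields

    system_message = (
        "You are an intelligent text extraction and conversion assistant. "
        "Your role is to take raw data and extract the requested information, "
        "returned strictly as a JSON object."
    )
    user_message = (
        "Extract the following information from the provided text:\n"
        f"{raw_markdown}\n"
        f"Fields to extract: {', '.join(fields)}"
    )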

  • What is the workflow of the universal web scraping agent described in the script?

    -The workflow is: pass a URL to Firecrawl to get markdown, then use a large language model to extract the information for the specified fields, producing semi-structured data that is subsequently formatted and saved.

  • How does the speaker handle the potential issue of different JSON names inside the structured data?

    -The speaker acknowledges that the key names inside the JSON cannot be controlled 100%, which is why the data is called semi-structured. The code therefore includes a step to handle this variability (a sketch of that step follows).
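
One way to handle that variability, roughly in line with what the video describes later (the helper name is hypothetical), is to unwrap the JSON when the model nests everything under a single top-level key:

    def unwrap_single_key(parsed):
        # The model sometimes wraps every listing under one key, e.g.
        # {"listings": [...]}; if so, step inside it, otherwise keep it as-is.
        if isinstance(parsed, dict) and len(parsed) == 1:
            return next(iter(parsed.values()))
        return parsed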

  • What are the storage options mentioned by the speaker for saving the scraped data?

    -The speaker mentions JSON and Excel as storage options for the scraped data, indicating flexibility in how the data can be saved and accessed.

  • How does the speaker address the issue of different website structures in the scraping process?

    -The speaker uses a large language model to handle different website structures, allowing the same code to be used for scraping data from various websites without needing to inspect or understand each page's unique structure.

  • What is the potential limitation the speaker encounters when trying to scrape data from a French website?

    -The potential limitation encountered is the model's maximum context length, which may not be sufficient to process very long raw data from certain websites, such as the French website mentioned.

Outlines

00:00

πŸ€– Leveraging Large Language Models for Web Scraping

This paragraph introduces the concept of using large language models to automate web scraping tasks. It discusses the benefits of such libraries, like saving effort and creating universal scripts that can scrape data from various websites with minimal changes. The speaker also covers obtaining an API key from Firecrawl, an open-source library with a significant community following. The workflow involves passing a URL to Firecrawl to retrieve markdown, which is then given to a large language model for extraction into structured data. The speaker emphasizes the financial and practical advantages of this approach over traditional methods like BeautifulSoup.
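
As a rough end-to-end sketch of that workflow (the function names follow the video's descriptions; their bodies are sketched in later sections, and the URL is a placeholder):

    from datetime import datetime

    url = "https://www.zillow.com/..."              # placeholder input URL
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    raw_markdown = scrape_data(url)                 # Firecrawl -> markdown
    save_raw_data(raw_markdown, timestamp)          # keep a raw copy on disk
    structured = format_data(raw_markdown)          # LLM -> semi-structured JSON
    save_formatted_text(structured, timestamp)      # write JSON + Excel output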

05:02

πŸ›  Setting Up the Web Scraping Environment

The speaker outlines the steps for setting up the development environment for web scraping using large language models. This includes creating a new folder, initializing a new Python file, setting up a virtual environment, and obtaining necessary API keys. The paragraph details the installation of required packages and the initialization of the project with specific files for storing API keys. The speaker then proceeds to import necessary modules for the project and begins coding functions for scraping data and saving it in a structured format.
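
The setup the video walks through roughly amounts to the following terminal commands on Windows, plus a .env file for the keys (the environment variable names are assumptions, not confirmed by the video):

    python -m venv venv
    venv\Scripts\activate              (on macOS/Linux: source venv/bin/activate)
    pip install -r requirements.txt

    # .env (values are placeholders)
    FIRECRAWL_API_KEY=fc-your-key-here
    OPENAI_API_KEY=sk-your-key-here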

10:03

πŸ” Implementing the Data Scraping and Saving Functions

This section delves into the coding process for the web scraping project. The speaker describes the creation of functions to load API keys, initialize the Firecrawl app, scrape URLs, and handle the markdown response, including error handling for empty responses. A function that saves the extracted raw data in a folder called 'output' is also explained; it writes the markdown to a timestamped text file. The speaker then focuses on the 'format_data' function, which uses OpenAI to extract structured data from the markdown. The process involves defining the fields to extract, crafting a system message, and using OpenAI's chat completion to parse the data into JSON format.
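
A minimal sketch of such a 'format_data' function, assuming the openai v1 Python client and the gpt-3.5-turbo-1106 model mentioned in the video (the field list and prompt wording are abbreviated approximations, not the video's exact text):

    import json
    from openai import OpenAI

    def format_data(data, fields=None):
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        if fields is None:
            fields = ["Address", "Real Estate Agency", "Price",
                      "Beds", "Baths", "Sqft", "Home Type", "Listing Age"]
        response = client.chat.completions.create(
            model="gpt-3.5-turbo-1106",
            response_format={"type": "json_object"},  # forces a JSON reply
            messages=[
                {"role": "system", "content":
                    "You are an intelligent text extraction and conversion assistant. "
                    "Return the requested information strictly as a JSON object."},
                {"role": "user", "content":
                    "Extract the following information from the provided text:\n"
                    f"{data}\nFields to extract: {', '.join(fields)}"},
            ],
        )
        return json.loads(response.choices[0].message.content)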

15:04

πŸ“Š Running the Scraper and Analyzing Results

The speaker demonstrates running the web scraping code and resolves an initial error caused by a naming conflict with the datetime module. After the fix, the speaker shows the successful extraction of raw data, its conversion into JSON and Excel formats, and the storage of this data. The speaker highlights the code's ability to scrape data from different websites with varying structures, including a French website, by simply rerunning it with a new URL, emphasizing the universality of the scraper and its handling of different languages and layouts without manual inspection of web pages.
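
The fix for that naming conflict is essentially the standard one (a minimal illustration):

    # `import datetime` makes `datetime.now()` resolve against the module,
    # which has no `now()` attribute; importing the class avoids the clash.
    from datetime import datetime

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")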

🌐 Testing the Scraper with a Foreign Language Website

In this final paragraph, the speaker tests the scraper's capabilities by attempting to scrape data from a French website. Initially, an error occurs due to the model's maximum context length limit, which prevents processing of the lengthy raw data. After switching to a model with a larger context size, the speaker successfully extracts data from the French website. The speaker discusses the results, noting some discrepancies in the interpretation of 'beds' and the confusion between square feet and square meters. Despite these minor issues, the speaker concludes by celebrating the scraper's versatility and the ease with which it can be adapted to different websites and languages.
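
The change that clears the context-length error is essentially a one-line model swap in the chat-completion call (this sketch reuses the client and messages from the earlier format_data sketch; gpt-3.5-turbo-1106 tops out around 16k tokens of context, while GPT-4o offers a much larger window):

    response = client.chat.completions.create(
        model="gpt-4o",                            # was "gpt-3.5-turbo-1106"
        response_format={"type": "json_object"},
        messages=messages,
    )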

Keywords

πŸ’‘Large language models

Large language models refer to advanced artificial intelligence systems designed to process and understand human language. In the context of the video, these models are used to turn scraped web content into structured data without the need for manual web page inspection. They are highlighted as a powerful tool that simplifies the web scraping process, as demonstrated with the Firecrawl library, whose markdown output is handed to such models for extracting information from web pages.

πŸ’‘Web scraping

Web scraping is the process of programmatically extracting information from websites. The video discusses how traditional web scraping involves inspecting web pages and locating specific elements, but with the advent of large language models, a more automated and universal approach is now possible. The script illustrates this with the creation of a code that uses these models to scrape data from various websites with minimal changes.

πŸ’‘Firec

Firecrawl is an open-source library mentioned in the video that supports web scraping workflows built on large language models. It converts web page content into markdown, which is then handed to a language model to generate structured data. The script describes creating a Firecrawl account and using an API key from the platform to initialize the scraping step.
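
A minimal usage sketch, assuming the firecrawl-py package (the exact shape of the response can vary between versions, so the 'markdown' check below mirrors what the video describes rather than a guaranteed API contract):

    from firecrawl import FirecrawlApp

    app = FirecrawlApp(api_key="fc-...")            # placeholder key from the Firecrawl dashboard
    result = app.scrape_url("https://www.example.com")
    if result and "markdown" in result:
        markdown = result["markdown"]               # page content without HTML tags
    else:
        raise RuntimeError("Empty or invalid response from Firecrawl")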

πŸ’‘Structured data

Structured data refers to information that is organized in a specific way, making it easily readable and accessible for various applications. In the video, the process of converting raw markdown data into a JSON format is discussed, which is a form of structured data. This allows for easier storage and analysis, as demonstrated by saving the scraped data in both JSON and Excel formats.
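
A sketch of that saving step, assuming pandas plus an Excel engine such as openpyxl is installed and that the output/ folder already exists (file names are illustrative):

    import json
    import pandas as pd

    def save_formatted_text(formatted, timestamp):
        # JSON copy for programmatic reuse
        with open(f"output/formatted_data_{timestamp}.json", "w", encoding="utf-8") as f:
            json.dump(formatted, f, indent=4, ensure_ascii=False)
        # Excel copy for quick visual inspection
        records = formatted if isinstance(formatted, list) else [formatted]
        pd.DataFrame(records).to_excel(f"output/formatted_data_{timestamp}.xlsx", index=False)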

πŸ’‘API key

An API key is a unique code that allows developers to access and use external software services, such as Firecrawl or OpenAI. In the script, obtaining API keys from Firecrawl and OpenAI is a prerequisite for using their services within the web scraping project. The keys are used to authenticate requests to the scraping service and to the language model.
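
Loading those keys from a local .env file typically looks like this with python-dotenv (the variable names are assumptions, not confirmed by the video):

    import os
    from dotenv import load_dotenv

    load_dotenv()  # reads key=value pairs from the local .env into the environment
    firecrawl_key = os.getenv("FIRECRAWL_API_KEY")
    openai_key = os.getenv("OPENAI_API_KEY")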

πŸ’‘Markdown

Markdown is a lightweight markup language used for formatting text. In the context of the video, Firec converts web page content into markdown, which is a cleaner and more streamlined format than HTML. This simplifies the data extraction process by reducing the complexity and size of the data that needs to be processed by the large language models.

πŸ’‘JSON

JSON (JavaScript Object Notation) is a format for storing and exchanging data that is easy for humans to read and write and for machines to parse and generate. In the video, the large language model is instructed to extract information and return it in JSON format, which is then used to create structured data and saved as part of the web scraping output.

πŸ’‘Virtual environment

A virtual environment is a self-contained directory tree that contains a Python installation and can have its own set of libraries, separate from the system's Python installation. In the script, creating a virtual environment is part of setting up the project to manage dependencies and isolate the project's Python environment from others.

πŸ’‘Pandas

Pandas is a Python library used for data manipulation and analysis. It provides data structures and operations for manipulating numerical tables and time series. In the video, a pandas DataFrame is used to store and save the structured data in an Excel file, showcasing its utility in handling and exporting data.

πŸ’‘Universal web scraper

A universal web scraper is a tool or script capable of extracting data from various websites without needing to be tailored for each specific site's structure. The video demonstrates creating a code that can scrape different types of websites, including those in different languages, by leveraging the capabilities of large language models, thus approaching a 'universal' scraper concept.

Highlights

Introduction of new libraries leveraging large language models for web scraping without manual inspection.

Advantages of using these libraries include effort saving and universality across different websites.

Demonstration of creating an account and obtaining an API key for the Firecrawl library.

Use of Firecrawl to scrape a webpage and receive markdown without HTML tags, reducing the token count.

Explanation of the universal web scraping agent workflow involving URL input, markdown extraction, and data formatting.

Setup of a new project environment with a virtual environment and API keys for Firecrawl and OpenAI.

Installation of required packages for the web scraping project.

Coding the 'scrape_data' function to load API keys and initialize the FirecrawlApp for URL scraping.

Development of the 'save_raw_data' function for saving scraped markdown in an 'output' folder.

Creation of the 'format_data' function using OpenAI to extract structured data from markdown.

Use of system and user messages to instruct the AI on extracting specific fields from raw data.

Implementation of the 'save_formatted_text' function to save data in JSON and Excel formats.

Running the code to scrape a real estate listing website and structure the data.

Comparison of scraped data with the original website to validate accuracy.

Application of the same code to scrape a different website with a different structure.

Challenge of scraping a French website and the model's context length limitation.

Successful scraping of a French real estate website despite language and unit differences.

Conclusion on the capability of the created code to act as a universal web scraper using large language models.

Transcripts

play00:00

So lately there has been a couple of

play00:01

libraries that can use the power of

play00:04

large language models to scrape the web

play00:06

without us having to basically do

play00:08

anything they will read the URL and give

play00:10

us either a markdown or sometimes even a

play00:12

structured data that we can add in our

play00:15

Excel sheet or store in our database and

play00:17

since we have really strong and really

play00:19

cheap large language models today it

play00:21

makes sense to use these libraries

play00:23

instead of going with the free

play00:25

alternative of beautiful soup or other

play00:27

ways where we have to inspect the web

play00:29

page understand the structure and locate

play00:32

the specific elements that we want to

play00:34

scrape so the advantages of these new

play00:36

packages are so many and the biggest of

play00:39

all of these advantages is as we said

play00:40

saving the effort but also creating a

play00:43

script that can act as a universal web

play00:45

scraper for that specific use case that

play00:47

you have say for example you want to

play00:48

scrape data out of a News website you

play00:50

can use the same code to scrape from

play00:53

multiple news websites and sometimes you

play00:55

can even use that same code to scrape

play00:58

from other websites that have nothing to do

play01:00

with news where you are looking for

play01:02

totally different information and today

play01:04

we are going to see how we can create

play01:06

such code and how it can help you scrape

play01:08

the web with minimal changes so let's go

play01:10

ahead and jump to my screen all right so

play01:13

before opening VS Code and starting

play01:15

to work let's just discover Firecrawl which

play01:18

is the library that we are going to be

play01:19

using by the way it's open source

play01:22

and it has 4,000 Stars so here if we

play01:24

come back to Firecrawl and create an account

play01:26

you can just basically go to accounts in

play01:28

here and get an API key that is the API

play01:31

key that we will use later on in our

play01:33

project so if we go to playground and

play01:36

let's say for example we want to scrape

play01:38

let's say for example OpenAI

play01:43

pricing, open the OpenAI pricing here and click on

play01:48

run we will see that we will receive

play01:51

markdowns of the entire page and the

play01:54

thing is we don't have any type of divs

play01:58

or lists or any type of tags that we

play02:01

have inside the HTML so this is very

play02:03

important because before if you want to

play02:05

do extraction using HTML you will

play02:08

basically have to pass the whole

play02:10

structure into a large language model

play02:12

and that means that that is a lot of

play02:13

tokens sometimes it goes well over

play02:16

100,000 tokens so the fact that we have

play02:18

markdowns will help tremendously in

play02:21

order to get a cleaned enough data that

play02:23

we can pass to a large language model

play02:25

and then of course from there it will

play02:26

make sense financially to use any new

play02:29

cheap model in order to do the

play02:31

extraction and make it a structured data

play02:33

so now let's see the universal web

play02:35

scraping agent workflow we have to see

play02:38

that before opening vs code so you know

play02:40

exactly what I am doing whenever I am

play02:43

writing whatever code so our input is

play02:45

going to be the URL that is always going

play02:47

to be the case this URL will be passed

play02:49

to Firecrawl in order to get the

play02:51

markdowns once we get the markdowns from

play02:53

Firecrawl we're going to give that to a large

play02:55

language model it could be an open AI

play02:57

model like GPT-3.5 or GPT-4o or Gemini

play03:01

Flash from Google or any other model

play03:04

then of course we are going to ask it to

play03:06

extract something from the markdown

play03:08

according to our fields we have to

play03:11

basically tell exactly which fields we

play03:13

want to extract from this markdown after

play03:15

that we going to get semi structured

play03:17

data so even though that we are going to

play03:19

get a Json answer from the data

play03:21

extraction we cannot control 100% of the

play03:25

names inside of that Json this is why I

play03:28

call it semi-structured even though it is

play03:30

structured up and once we are going to

play03:32

get that data we're going to go to

play03:33

another stage where we are going to

play03:35

format and save the data so we're going

play03:37

to format it according to Json and then

play03:39

we are going to have it in a data frame

play03:41

a pandas data frame and we're going to

play03:43

save both of them here you can basically

play03:45

have a database or any sorts of storage

play03:47

medium that you prefer and here I chose

play03:50

Json and Excel so let's go ahead and

play03:52

open VS Code here we are going to

play03:55

create always a new folder so let's call

play03:58

it listing firecrawl and then inside of

play04:01

here we are going to create a new file

play04:04

let's call it app.py and of course we are

play04:07

going to create a virtual environment so

play04:10

python -m venv venv and then let's go

play04:13

ahead and get inside of the virtual

play04:16

environment so let's do venv scripts

play04:20

activate for this let's clear this out

play04:24

and then the latest step that we are

play04:25

going to do in the initiation of every

play04:27

project is creating a new file

play04:31

sorry a new file let's call

play04:34

it .env this is where we are going to place

play04:37

our API keys so here if I go to accounts

play04:40

I will be able to copy this API key

play04:43

and then I will of course use open AI so

play04:45

for that we are going to use open Ai and

play04:48

at this point everyone knows how to get

play04:50

an OpenAI API key let's copy it and then

play04:53

we are going to place it inside of here

play04:54

okay so now we are going to install all

play04:56

of the requirements that we need so I

play04:59

already have a file and this file I am

play05:01

going to put it right here these are all

play05:03

the packages that we are going to need

play05:04

in our project so now we are going to

play05:07

pip install -r requirements.txt by the way

play05:11

it's firecrawl-py not just firecrawl if you

play05:14

want to install it independently without

play05:16

the other ones okay now that it has

play05:19

finished installing everything we are

play05:20

going to start coding let's clear this

play05:22

out and let's import so from firecrawl we

play05:26

are going to import FirecrawlApp from Open

play05:29

AI

play05:31

then we are going to import OS import

play05:34

Json and then import pandas as pd and then

play05:38

lastly we are going to import datetime

play05:40

okay so the first function that we are

play05:41

going to start with is going to be

play05:44

scrape data and the scrape data will

play05:46

only take the URL first thing that we're

play05:48

going to start with is to load the dot

play05:50

env so we can load the API keys that we

play05:53

have here and then we are going to use

play05:55

that API key to initialize the FirecrawlApp

play05:57

that we have here and then we're going

play05:59

to use that app to scrape the URL that

play06:01

we have here we are going to delete this

play06:03

for now we're not going to need it and

play06:05

then we are going to check for markdown

play06:07

in case we have an empty response or we

play06:09

have any kind of problem then we are

play06:11

going to return the markdown if not the

play06:13

case we are going to return an error the

play06:15

second function that we will have is

play06:17

basically saving that raw data cuz I am

play06:20

not a fan of doing an extraction and

play06:22

then not saving the data somewhere so we

play06:24

are going to save it inside a folder

play06:27

called output that we have here so it is

play06:29

pretty straightforward by the way it

play06:30

just created so let's call it output

play06:32

let's create it in here and then we will

play06:34

have a text file, a .md, which is

play06:36

basically a text file inside of our

play06:38

output folder that we have here and it's

play06:40

going to be saved according to the

play06:42

datetime of when we are running the

play06:44

process so now we are going to go to the

play06:47

most important part and please it is not

play06:50

very complicated I'm going to paste it

play06:52

in here it's just that the system prompt

play06:54

is basically very long but other than

play06:56

that it is not that complicated so this

play06:58

is the format data function and this is

play07:01

the function responsible of taking the

play07:03

raw data that we will have and then

play07:05

extracting the structured data from that

play07:08

markdown that we had before so here we

play07:10

are going to initiate our open AI

play07:12

clients inside of clients and then if we

play07:14

don't provide this optional parameter we

play07:16

will basically end up with these fields

play07:19

by the way the use case that we are

play07:20

going to use is going to be Zillow so in

play07:22

our use case this is going to be the

play07:24

website that we are going to extract as

play07:26

you can see here we have a map in here

play07:28

we have the listings around here so

play07:30

basically what we want to do is that we

play07:32

want to extract data out of this website

play07:34

here and structure it so here as you can

play07:37

see the fields that we have here are

play07:39

basically the fields that we can find

play07:41

normally in a real estate listing so the

play07:43

address real estate agency the price the

play07:45

beds etc. so these are the

play07:47

information that we are going to be

play07:49

extracting from the website then we will

play07:51

have the system message and the user

play07:52

message that we are going to store

play07:54

inside of of a variable in here so here

play07:57

you are an intelligent text extraction

play07:58

and conversion assistant Your Role is

play08:01

basically to take raw data and then

play08:03

extract a Json format from it this is

play08:05

very important we have to mention that

play08:07

it is a Json format and then later on we

play08:10

are going to see that the response

play08:12

format being a Json object this is

play08:14

incredibly important if we don't do this

play08:17

we're not guaranteed to have a Json

play08:20

response every time and then we are

play08:21

going to get the response out of the

play08:23

chat completion from open AI we're not

play08:26

going to use GPT-4o or GPT-4 Turbo we're only

play08:29

going to use gpt-3.5-turbo

play08:32

110 this should be 1106 and then we are

play08:35

going to pass the system message and the

play08:37

user message by the way inside of the

play08:39

user message we have is extract the

play08:41

following information from the provided

play08:43

text then we are going to have the data

play08:44

this is the data that we have already

play08:46

saved that we will get from scrape data

play08:49

and then of course we will use the

play08:50

fields that we have provided in here

play08:52

good now we're going to go if we had

play08:53

response and our response is not empty

play08:56

we are going to parse that Json and

play08:58

we're going to use json.loads in order

play09:00

to get that string into a Json format

play09:03

that we will later save in the next

play09:06

function so here we will have our last

play09:09

function which is basically save the

play09:11

formatted text that we will save in a

play09:13

Json format and then in an Excel format

play09:15

in order for us to be able to visualize

play09:17

it easily so inside of the output folder

play09:19

always we are going to get that Json

play09:22

format that we have just got from here

play09:23

so the return value that we have here is

play09:26

going to pass into here and then we are

play09:28

going to create a Json format from here

play09:31

and then we have a little function that

play09:32

we will see why it will be useful later

play09:35

on and then of course we are going to

play09:37

get that format of data in a data frame

play09:39

that we will later on save as an Excel

play09:42

sheet okay so now let's go ahead and add

play09:45

our last bit of code in order to run the

play09:48

process and here we have the last part

play09:50

of our code where we are going to

play09:51

initiate a time stamp in order to use it

play09:53

later on inside of our functions we are

play09:56

going to call our functions using the

play09:57

URL that we have here so this this is

play09:59

the first website that we are going to

play10:01

extract and then we are going to save

play10:02

that raw data that we just got here

play10:04

using our function and then we are going

play10:06

to pass that raw data again to format

play10:08

the data and then we are going to save

play10:10

it in the format of a Json and an Excel

play10:12

sheet so let's go ahead and run our code

play10:14

and see what's going to happen okay so

play10:17

here we have a problem and it's because

play10:20

of datetime so we should use from

play10:22

datetime import datetime because there is a

play10:24

problem of a naming conflict it actually

play10:27

uses the module instead of using the

play10:29

function inside the datetime module but that's

play10:31

not important so let's run it again and

play10:32

see what's going to happen all right so

play10:35

we can already see that it have saved up

play10:37

the raw data that we have in here so we

play10:39

are going to open it and as you can see

play10:41

this is basically the markdown of the

play10:42

whole page this markdown will then be

play10:46

handed to open AI in order for it to

play10:49

give us a Json and then Excel sheet so

play10:52

this is the Json that it has came up

play10:54

with and then from this Json it has been

play10:56

able to format it and then to save it as

play10:58

an Excel and here we have the sorted

play11:00

data and then we have it as an Excel

play11:02

sheet we are going to see both of them

play11:03

so as you can see here in the output we

play11:05

can see that instead of having just

play11:07

basically the information that we want

play11:09

usually it will give us one key and

play11:11

inside of this one key we will have all

play11:13

of the information this is why in my

play11:16

code I have told you in format I have to

play11:19

basically check if our dictionary have

play11:22

only one key and if it's that the case I

play11:25

am going to go inside of dictionary to

play11:27

get all the keys for my formatted data

play11:30

so here if I go to my project and then

play11:33

go to Output I will find my Excel sheet

play11:35

that I can then open and inside of here

play11:37

I can find all of the information that I

play11:39

want so basically now from this website

play11:41

I have been able to generate a structure

play11:44

Json and then a structured Excel that if

play11:46

I open I find all of the information

play11:49

that I want structured and even URLs to

play11:52

take me to the exact listing that I have

play11:55

scraped the data from so here if we

play11:58

click on this listing here it will take

play12:00

me exactly to the listing and I will be

play12:02

able to compare between my scraping and

play12:05

the listing that I have here and all of

play12:07

this without using any tags or any type

play12:10

of page inspection that I would usually

play12:12

do if I am using beautiful soup or

play12:15

traditional ways of scraping the data

play12:17

and even more than this we can use the

play12:19

same code that we have here to scrape

play12:21

other websites that have nothing to do

play12:24

with the structure of this website that

play12:26

we have just scraped this is the closest

play12:28

code to a universal web scraper that I

play12:30

have ever created and it is amazing how

play12:32

easy it is to create these processes

play12:34

using large language models today so if

play12:36

I go back to the website in order to

play12:38

compare the data we can see that here we

play12:40

have $630,000 it is exactly the price we

play12:43

have ,600 square ft so exactly what we

play12:47

want and then we have the home type and

play12:50

the listing age so I would say that on

play12:52

the listing Age part this has not been

play12:54

very successful it should have kept it

play12:56

empty but for all of the other ones we

play12:58

can see that it has been able to

play13:00

actually get the data and if we do a

play13:02

simple date subtraction we can get the

play13:04

exact date when this house has been

play13:07

listed so that is a problem that can be

play13:09

easily resolved okay so this was the

play13:11

first URL now let's go ahead and go to a

play13:14

totally different website that has

play13:15

nothing to do with this URL so this is

play13:18

going to be a different website so let's

play13:20

open it here so let's see if it's going

play13:22

to be able to get data from here okay so

play13:24

let's run it okay so we got the raw data

play13:28

now and it has been able to get the

play13:30

sorted data and then get us the Excel

play13:33

sheet so first of all let's go to the

play13:34

sorted data and see what it has been

play13:36

able to uh extract so there is the

play13:39

sorted data that has been able to

play13:40

extract and then if we go back to our

play13:43

folder we can find this Excel file that

play13:46

if we open it we are going to find this

play13:48

information so literally this code has

play13:50

been able to scrape two different URLs

play13:53

that has nothing to do with each other

play13:55

simply because we have used large

play13:56

language models so if we go back to our

play13:58

Excel sheets and do the same thing we

play14:00

are going to find that for example the

play14:02

first one is this address with this real

play14:04

estate agency with this price and two

play14:07

beds and then one bath and 1,000 square

play14:11

ft okay so that is already very good now

play14:13

let's see if we are going to be able to

play14:15

do this with a website that is basically

play14:17

in a foreign language for example let's

play14:19

see if this going to be able to work on

play14:21

a website that is in French so this is a

play14:24

French website these are homes in a city

play14:26

called Lyon in France let's see if it's

play14:28

going to be able to get those listings

play14:30

and understand how to get them even

play14:32

though that our prompts are in English

play14:34

so let's run our code and here we have

play14:37

an error and this is very important the

play14:40

error is that the model's maximum

play14:42

context length is 16,000 tokens so here

play14:45

we can see that the model that I have

play14:47

used does not have enough of a context

play14:50

length size in order to be able to treat

play14:53

the raw data that we have just gotten in

play14:55

here so here we have the raw data that

play14:57

is basically so long so it hasn't been

play14:59

able to actually get all of this

play15:02

information and basically process it so

play15:04

if I go back here to response here I am

play15:06

using GPT 3.5 turbo 1106 we can just

play15:10

change that to 4o and we can run the

play15:13

code again and let's see if it's going

play15:15

to be able to treat it this time okay so

play15:18

we got the raw data it's here and this

play15:21

time it's going to take more time in

play15:22

order to process it because it is so

play15:24

many tokens

play15:30

and finally we got an answer finally I

play15:33

have waited so long for this one okay so

play15:35

we got the raw data it is very long as

play15:37

we said and then we got an answer so

play15:39

let's go ahead and visualize the answer

play15:42

let's group them by type Yep this is the

play15:44

last one so let's open it and as you can

play15:47

see here even in French it has been able

play15:50

to scrape all the data so here I can see

play15:56

the price of course the price is going

play15:57

to be in Euros the address the real

play16:00

estate agency the number of beds and of

play16:02

course if I click here and I open it

play16:05

it's going to open exactly the listing

play16:07

so here we have to this so as you can

play16:10

see here if we compare between the two

play16:12

we can see that we have here the address

play16:14

the right one we have the agency I don't

play16:16

know where I found it but yes here so

play16:19

this is the agency the price is correct

play16:21

and then we have here three beds I don't

play16:23

know what beds mean in English in French

play16:25

rooms that can be closed are called chambres

play16:28

so maybe it is the equivalency but it's

play16:31

like everything like rooms that can be

play16:33

closed or living rooms or anything else

play16:36

so here we have quatre (four) so I don't know if

play16:38

it should have four in here you guys

play16:40

tell me what beds mean what do you mean

play16:41

by beds in English you can have three

play16:43

beds in one room I don't understand the

play16:45

logic of it but you guys tell me in the

play16:47

comments if this is correct or no we

play16:49

don't have any indication about baths

play16:51

it's not really something that the French

play16:52

talk about that much but here we have

play16:54

the first clearly wrong field here we

play16:57

have square feet and here we have the

play16:59

square meter so this is the imperial

play17:01

system this is the metric one the

play17:02

correct one and it could not understand

play17:05

that this is basically square meters it

play17:07

should have at least indicated that here

play17:09

we have the square meter and of course

play17:11

here we have the the rest of the

play17:13

information so basically that is it so

play17:15

it has been able same code to extract a

play17:19

totally different URL so that is already

play17:21

very good this code would have been impossible

play17:23

a year or two years ago it was impossible

play17:25

to create kind of a universal web

play17:27

scraper that will work in any instance

play17:29

without having to deal with any

play17:30

inspection or web page specifications

play17:33

anyways that has been me guys thank you

play17:35

guys so much for watching it has been a

play17:37

long video I know but thank you guys for

play17:38

staying all the way through I really

play17:40

appreciate it and catch you guys next

play17:42

time peace


Related Tags
AI Scraping, Web Data Extraction, Large Language Models, Web Scraping Automation, API Integration, Data Structuring, Python Coding, Natural Language Processing, Web Crawling, Machine Learning