Try this SIMPLE trick when scraping product data
Summary
TL;DR: This video script outlines a method for extracting product data from websites using the schema.org product schema. The tutorial demonstrates how to create a Python script with libraries like urllib3, selectolax, and Rich to scrape structured data from script tags on various e-commerce sites. The script simplifies the process of pulling product information, potentially saving time and effort in data collection for analysis or database management.
Takeaways
- 🌐 Websites from different companies use the schema.org product schema to structure product data on their pages.
- 🔍 The script tag with 'application/ld+json' contains structured data that can be scraped from these websites.
- 📚 Schema.org provides documentation on how to organize data using their product schema, including common data points.
- 🤖 A single scraping script can be used to pull data from multiple sites that use the same schema format, simplifying the process.
- 🛠️ The script involves creating a Python virtual environment, installing necessary packages, and using libraries like urllib3, selectolax, and Rich.
- 🔧 A function called 'new_client' is used to create an HTTP client with a proxy and custom headers for making requests.
- 📈 The 'schema_get_html' function fetches HTML content from a given URL using the HTTP client, decoding the response data.
- 📝 The 'parse_html' function extracts script tags containing 'application/ld+json' from the HTML and converts the JSON-LD data into a dictionary.
- 🔎 The 'run' function orchestrates the process by creating a client, fetching HTML, and extracting schema data from the page (see the skeleton sketched just after this list).
- 📊 The resulting data can be used for various purposes, such as populating a database or creating a spreadsheet, leveraging structured data across multiple sites.
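For orientation, here is a minimal skeleton of the script those takeaways describe, assuming the function names used in the video; the full bodies are sketched in the Transcripts section below:

```python
import json
import os

import urllib3
from rich import print
from selectolax.parser import HTMLParser


def new_client() -> urllib3.ProxyManager:
    """Build an HTTP client with a proxy and custom headers."""
    ...

def schema_get_html(client: urllib3.ProxyManager, url: str) -> HTMLParser:
    """Fetch a URL through the client and return parsed HTML."""
    ...

def parse_html(html: HTMLParser) -> "dict | None":
    """Pull the first schema.org JSON-LD block into a dict."""
    ...

def run() -> None:
    """Orchestrate: client -> fetch -> parse -> print."""
    ...
```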
Q & A
What do the two websites mentioned in the script have in common?
-Both websites use the schema.org product schema to organize product data on their pages.
What is the purpose of the schema.org product schema?
-The schema.org product schema is used to organize product data in a structured way across different websites, making it easier to scrape and analyze the data.
Why is using a common schema like schema.org beneficial for data scraping?
-Using a common schema like schema.org allows a single scraping script to pull data from multiple websites, as the data is organized in the same way, saving time and effort.
What programming language is being used to create the scraping script in the video?
-The programming language used in the video is Python.
What libraries are being installed and used in the Python script?
-The libraries being installed and used are urllib3, selectolax, and Rich.
What is the purpose of creating a new client function in the script?
-The new client function creates an HTTP client that can be used to make requests to web servers, with specific headers and proxy settings.
How does the script identify and extract the product data from the HTML?
-The script identifies the product data by searching for script tags with type 'application/ld+json' and extracting the JSON-LD content.
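In selectolax terms, that selection step looks roughly like this (a sketch; `html` is assumed to be an HTMLParser built from the page source):

```python
# Find every JSON-LD script tag on the page, keeping the ones that
# carry a schema.org "@type" marker.
for node in html.css("script[type='application/ld+json']"):
    if "@type" in node.text(strip=True):
        data = json.loads(node.text(strip=True))  # JSON-LD text -> dict
```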
What is JSON-LD and why is it used?
-JSON-LD (JavaScript Object Notation for Linked Data) is a standard for embedding linked data in web pages. It is used because it provides a structured and machine-readable way to represent data.
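As an illustration, here is roughly what one of those blocks looks like once decoded; the field values below are invented placeholders shaped like the Product examples on schema.org:

```python
# Illustrative only: the shape of a decoded schema.org Product block.
product = {
    "@context": "https://schema.org/",
    "@type": "Product",
    "name": "Example Widget",   # placeholder values
    "sku": "12345",
    "offers": {
        "@type": "Offer",
        "priceCurrency": "USD",
        "price": "119.99",
    },
}
```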
How does the script handle multiple URLs for scraping?
-The script uses a list of URLs and iterates through them, scraping data from each URL in turn.
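In code, that iteration is just a loop over the list, reusing one client (a sketch, assuming the helper names described above and placeholder URLs):

```python
urls = ["https://example.com/product-1", "https://example.com/product-2"]
client = new_client()
for url in urls:
    html = schema_get_html(client, url)
    print(parse_html(html))
```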
What can be done with the data once it is scraped and stored in a dictionary?
-The scraped data can be further processed, such as being put into a spreadsheet using pandas, or certain parts of the information can be extracted and stored in a database.
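For instance, a list of scraped product dicts can be flattened into a spreadsheet with pandas; pandas is mentioned in the video but not shown, so this is one possible approach (json_normalize handles the nested fields, and `client` and `urls` are assumed from the sketches above):

```python
import pandas as pd

# results: one dict (or None) per URL, as returned by parse_html.
results = [parse_html(schema_get_html(client, u)) for u in urls]
# json_normalize flattens nested keys like offers.price into columns.
df = pd.json_normalize([r for r in results if r is not None])
df.to_csv("products.csv", index=False)
```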
Outlines
🌐 Utilizing Schema.org for Structured Data Extraction
This paragraph discusses the use of Schema.org's product schema by different websites to structure product data. The speaker demonstrates how two distinct websites, selling different products, implement the same schema in their HTML code. Viewing the page source reveals 'application/ld+json' script tags containing product information in a standardized format. The speaker guides viewers to Schema.org to explore the product schema, highlighting the structured data points available. The script's purpose is to simplify data scraping across websites using this schema, making it easier to extract product information.
🛠️ Building a Python Script for Data Scraping
The speaker outlines the process of creating a Python script to scrape product data from websites using the Schema.org product schema. The script involves setting up a new virtual environment and installing necessary packages such as urllib3 for HTTP requests, selectolax for HTML parsing, and Rich for terminal display. The script includes functions for creating an HTTP client with proxy support and headers, and for fetching and parsing HTML content to extract data from the script tags. The speaker emphasizes the use of type hints and the installation of 'black' for code formatting, aiming for a clean and efficient codebase.
🔍 Extracting and Utilizing Schema Data with Python
In this paragraph, the speaker details the steps to extract schema data from HTML using the previously created Python script. The script includes functions to make HTTP requests with attached proxies and headers, parse HTML content, and specifically target script tags with 'application/ld+json' type. The extracted JSON data is then loaded into a dictionary. The speaker also discusses handling multiple script tags on a single page and the potential for varying data completeness. The script is designed to be versatile, allowing for easy data extraction from any URL that follows the Schema.org product schema format. The potential applications of the extracted data, such as database storage or spreadsheet creation, are also mentioned.
Keywords
💡Schema.org
💡Product Schema
💡JSON-LD
💡Web Scraping
💡HTML Parser
💡Virtual Environment
💡urllib3
💡Proxy Manager
💡Rich Library
💡Structured Data
💡Python Scripting
Highlights
Introduction to using schema.org product schema for structured data on websites.
Demonstration of how different websites use the same schema for product data.
Explanation of viewing the source code to find the 'application/ld+json' script tag.
Visiting schema.org to understand the organization of data points for products.
Description of JSON-LD as a standard for Linked Data in JavaScript Object Notation.
Advantages of using a common data format for web scraping across different sites.
Introduction to building a script to extract product information from websites using schema.org.
Setting up a new Python virtual environment and installing necessary packages.
Use of urllib3 for making HTTP requests with a proxy manager.
Importance of using a virtual environment in Python for project dependencies.
Explanation of using selectolax for parsing HTML and Rich for terminal output.
Importing necessary modules for the web scraping script, including json and os.
Creating a function to generate a new HTTP client with headers and proxy settings.
Details on using environment variables for proxy settings in Python scripts.
Developing a function to fetch HTML content from a given URL using the configured client.
Extracting script tags containing product schema data using CSS selectors.
Filtering script tags to find those with the 'application/ld+json' type attribute.
Parsing the script tag content to extract JSON-LD formatted product data.
Creating a main function to run the web scraping process and print the results.
Using the script to scrape product data from multiple URLs and output the results.
Potential applications of the scraped data, such as populating databases or spreadsheets.
Encouragement to join the community on Discord and subscribe for further updates.
Transcripts
These are two completely different websites, from different companies, selling completely different products, but what do they both have in common? The fact that they're both using the schema.org product schema to put some of the product data on the page. I've gone to View Source on this one, and you can see that in this script tag of type application/ld+json we have content with the schema.org type Product. It's exactly the same thing over here, even though it's not quite as nicely spaced out; you'll see that we have the same information.

So if we go to schema.org and have a look through, you'll see from the docs that it's basically just a standard way of organizing your data. Here are the commonly used types; the one we're interested in is Product, and it shows you all of the available data points you can have within it. If I scroll down to the examples and look at this one, you'll see the same application/ld+json, and if we click on JSON-LD you'll see that it stands for JavaScript Object Notation for Linked Data. It's a standard that's been around for quite a while.

What this basically means is that we can use one scraping script to pull this data from both of these sites, because they expose it in the same way. It makes our lives easier, and I find it's a really good check: you can run this script against product sites and quickly see whether this information exists at a given URL, which can save you a load of time. So what we're going to do in this video is build out a short script that can grab the information from this script tag for any site that has it, and pull it back.
I've got my new folder open here; there's nothing in it, so I'm just going to create a new virtual environment. Always use a virtual environment with Python; I think that goes without saying. I'll activate it and install a few things. I'm going to use urllib3 for this. urllib3 is an HTTP client that's used under the hood by requests and a whole load of other Python packages to make requests to web servers; I'm going to use it by itself in this case, for its proxy manager, which I'll show you. We're going to use selectolax, of course, because it's the only HTML parser I ever use now; love it. And we'll also use Rich, so that when I print things to my terminal they're easier to read. You don't need to use that if you don't want to.
So let's create a new file; I'm going to call this one main.py, and we're going to try using Helix for this. Helix is very similar to Neovim but written in Rust; it comes with more features out of the box, but it doesn't quite have the same keybindings, so we'll see how we get on. If it starts to get in the way, I'll swap back.

The first thing I'm going to do is import some things; let me make this nice and big. We'll have urllib3; from selectolax.parser we import the HTMLParser; and from Rich we'll import print. We're going to import json as well; we'll need that because the data we're pulling out goes through json.loads. I'm also going to import os, because my proxy string is saved as an environment variable on my system, so I can easily pull it into any code I want; it means I don't have to keep going to fetch it and showing it on my screen when I'm doing things like this. If you want to use proxies, there's a link in the description below for the proxies I use, if you want to check that out.
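A minimal version of the imports just described, assuming the package names from the video:

```python
import json
import os

import urllib3
from rich import print  # drop-in replacement for print, with nicer output
from selectolax.parser import HTMLParser
```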
That's cool. So we want to create our first function; this one creates a new client. This is optional: I use this function just to give myself a new HTTP client that I can then call and use. I lifted the pattern from when I've been learning Go. It's not essential, but it keeps things nice and neat, I think. I'm going to call it new_client, and we're going to return a urllib3.ProxyManager out of it.

Let's have some headers. Our headers are going to be a dictionary, of course, and the main one we want is a User-Agent, so I'm going to go back over here and grab my user agent; this one is fine, we'll stick it in there, on one line please. Then our client is going to be equal to the urllib3.ProxyManager. In urllib3, what you'd think of as a session or client is actually a PoolManager or a ProxyManager; if you look at the PoolManager docs you can just about see at the top that a ProxyManager is essentially a PoolManager that routes through a proxy. This is your session, essentially. So I'll do this, then pass in os.environ with the key, which for me I believe is called "proxy". We also then need headers equal to headers; I know this is behind my head, you'll see it in just a second, don't worry. Then we can return the client out, like so: return client.

One thing I need to install now is black, which will let me format the code. Back in our Python file, we can now run fmt in Helix because black is installed, and it makes everything a bit neater and tidier. I'm hiding part of the screen with my head, but you're not missing much; it just says headers=headers.
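A sketch of new_client as described; the "PROXY" environment-variable name and the User-Agent string are stand-ins, since the video reads both from the presenter's own setup:

```python
def new_client() -> urllib3.ProxyManager:
    # A ProxyManager behaves like a PoolManager (urllib3's "session"),
    # but routes every request through the given proxy.
    headers = {
        # Stand-in value; the video copies a User-Agent from a real browser.
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    }
    # "PROXY" is a guess at the environment-variable key used in the video.
    client = urllib3.ProxyManager(os.environ["PROXY"], headers=headers)
    return client
```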
Right, so now we have our new client that we can run and call, and we want to actually get the data from the page. So let's write schema_get_html. This takes in a client, which is our urllib3.ProxyManager (we really should use type hints in Python now), and a url, which is a string, and it returns an HTMLParser instance. I'll make this a bit smaller so you can see what I'm doing; hopefully it's still big enough to read.

So our response is equal to client.request; it works slightly differently in urllib3, taking the method "GET" and then our URL. What this means is that every time we make a request with the client created in that function, it always has our proxy attached and always has the headers we want. Then our html is going to be equal to the HTMLParser, and we give it response.data. We need to decode that data because it comes back as bytes, I think, so we'll say utf-8 to decode it to text, and then we return our html, which satisfies the return type hint. Format that. So this function basically just makes the request and returns the HTML for us, which is exactly what we want.
want now we can pass it so I'll just say
pass HTML and this is going to take in
the HTML that we got which is also type
of HTML passer and it's going to return
out of this function I think it's going
to be a dictionary um we'll leave it
like this for the moment so we want to
actually get that uh element that has
the script tags so let's do our
script data is going to be equal
to html. CSS now we want to do CSS
because we want to find all of them
because on this page there could be
multiple uh uh script tags that meet
this criteria so we want to find each
and every one so we can go through them
so we'll say it's going to be a script
tag
script and it's going to be a type and
we need to put this into a bracket so
I'm just going to grab this here
application Json not brackets we need to
put this into single quotes equals to
there we go so it formats it nice and
neatly from here what I'm going to do is
I'm going to Loop through each of these
and I'm going to find if it has the at
type in because that was the the
signifier here you could do at context
as well but I'm going to do at type so
we'll do for uh schema we'll call it
schema now in scripts
data dot uh text and I'm going to do
strip is equal to true I don't know if
this is strictly necessary here but
we'll do it
anyway and we'll do
if at
type
in schema so this is basically say if we
find this in there what's this
complaining
about of course we can't do the um EXC
excuse me we can't do this here we all
search through each of these and then we
want to search within the text here so
strip is equal to stri there we go that
makes more sense so if we find this type
this string of at type which is going to
be in every single one of these what
we're going to do is we're going to um
return json.
loads and we'll say our
schema.
text oh strip is equal to true
like so and now we're still our our type
pin is going to complain here because
like we could be returning none because
this is returning under an if statement
so we have a couple of choices I'm just
going to say this is either going to
return a dictionary or none just to
satisfy this this is Shifting the
problem along a little bit but you know
it'll be fine so we'll finally have a
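A sketch of parse_html as walked through above, returning either a dict or None to satisfy the type checker, as the video does:

```python
def parse_html(html: HTMLParser) -> dict | None:
    # 'dict | None' needs Python 3.10+; use Optional[dict] on older versions.
    # css() (rather than css_first) returns every matching node, since a
    # page can contain several ld+json script tags.
    script_data = html.css("script[type='application/ld+json']")
    for schema in script_data:
        # "@type" appears in every schema.org JSON-LD block, so it works
        # as the signifier ("@context" would also work).
        if "@type" in schema.text(strip=True):
            return json.loads(schema.text(strip=True))
    return None  # no matching schema found on the page
```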
Finally, we'll have a new function called run; we'll call it that for now. Our client is going to be equal to the new_client function we created; our html is going to be equal to schema_get_html with the client and the URL, which I'll get in just a second; and then our data is equal to parse_html on our html, and we print out the data. Then we just need the if __name__ == "__main__" guard, so that when we call this file directly it runs the run function.

Now we just need our URL: url equals... let's go and get this one, a nice long one. In fact, no, we'll just try it like this first. Cool, let's check that this actually works. I'm going to format it so it's all done, then come back out here and run our main file. Cool, that worked. So now we have all of this information, all this nicely formatted data, coming back in the schema.org product format into our code.

Now we can go ahead and add in the other URL if we want to. Let's grab this, scroll down, and create a urls list; we'll put this one in and get the other one as well, which I'd just removed. Sweet. Now we can just do for url in urls, like this, and indent all this code in; I'm sure there's a quick way to do this indenting in Helix that I haven't looked up yet. There we go; quick format, I like this formatter. Sweet. So let's save and try again; we should get two lots of output: one, two.
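A sketch of the final run function with the URL loop; the two product URLs are placeholders, since the video pastes in real ones:

```python
def run() -> None:
    # Placeholder URLs; substitute any product pages that embed
    # schema.org JSON-LD.
    urls = [
        "https://example.com/product-1",
        "https://example.com/product-2",
    ]
    client = new_client()
    for url in urls:
        html = schema_get_html(client, url)
        data = parse_html(html)
        print(data)


if __name__ == "__main__":
    run()
```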
So this is basically just leveraging the fact that structured data exists in this format, within this specific script tag, on a lot of different websites. Anywhere you find it, you can put the product URL in, it will match this code, and it will give you the actual data back. Now, this is a short script; I have one similar to it that I can call and run against any URL. Obviously I work in e-commerce, I do a lot of stuff with that, and I can just pull that data out and see what's going on. Sometimes there's more data than others; things like reviews and offers don't always exist, but quite often they do, so it's all there to be played with.

From here we have a dictionary in Python that we can do anything with. You could collect a load of these and use pandas to put them into a spreadsheet, or you could pick out the particular pieces of information you were after and put them into a database. It's all about identifying patterns and seeing what you can use to make your life easier when you're scraping multiple bits of data across multiple different sites.

Hopefully you enjoyed this video. If you did, please like, comment, and subscribe; it really does help me out. Also, jump in the Discord; we're almost at a thousand members now, which is absolutely insane. Thank you to everyone who's joined, and thank you to all the people who actively contribute; there are loads of you now. Thank you very much, and I'll see you in the next one. Bye.