Try this SIMPLE trick when scraping product data

John Watson Rooney
4 Feb 2024 · 13:31

Summary

TL;DR: This video outlines a method for extracting product data from websites via the schema.org Product schema. The tutorial demonstrates how to build a Python script with libraries like urllib3, selectolax, and Rich to scrape structured data from script tags on various e-commerce sites. The approach simplifies pulling product information, saving time and effort in data collection for analysis or database management.

Takeaways

  • 🌐 Websites from different companies use the schema.org product schema to structure product data on their pages.
  • 🔍 The script tag with 'application/ld+json' contains structured data that can be scraped from these websites.
  • 📚 Schema.org provides documentation on how to organize data using their product schema, including common data points.
  • 🤖 A single scraping script can be used to pull data from multiple sites that use the same schema format, simplifying the process.
  • 🛠️ The script involves creating a Python environment, installing necessary packages, and using libraries like urllib3, selectolax, and Rich.
  • 🔧 A function called 'new_client' is used to create an HTTP client with a proxy and custom headers for making requests.
  • 📈 The 'schema_get_html' function fetches HTML content from a given URL using the HTTP client, decoding the response data.
  • 📝 The 'parse_html' function extracts script tags containing 'application/ld+json' from the HTML and loads the JSON-LD data into a dictionary.
  • 🔎 The 'run' function orchestrates the process by creating a client, fetching HTML, and extracting schema data from the page.
  • 📊 The resulting data can be used for various purposes, such as populating a database or creating a spreadsheet, leveraging structured data across multiple sites.

Q & A

  • What do the two websites mentioned in the script have in common?

    -Both websites use the schema.org product schema to organize product data on their pages.

  • What is the purpose of the schema.org product schema?

    -The schema.org product schema is used to organize product data in a structured way across different websites, making it easier to scrape and analyze the data.

  • Why is using a common schema like schema.org beneficial for data scraping?

    -Using a common schema like schema.org allows a single scraping script to pull data from multiple websites, as the data is organized in the same way, saving time and effort.

  • What programming language is being used to create the scraping script in the video?

    -The programming language used in the video is Python.

  • What libraries are being installed and used in the Python script?

    -The libraries being installed and used are urllib3, selectolax, and Rich.

  • What is the purpose of creating a new client function in the script?

    -The new client function creates an HTTP client that can be used to make requests to web servers, with specific headers and proxy settings.
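    A minimal sketch of that factory function, following the transcript's description; the 'proxy' environment-variable name is the one the video mentions, and the User-Agent string is a placeholder:

    ```python
    import os

    import urllib3

    def new_client() -> urllib3.ProxyManager:
        """Build a proxy-aware HTTP client with a browser-like User-Agent."""
        headers = {
            # Any realistic browser User-Agent string will do here.
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        }
        # A ProxyManager works like urllib3's PoolManager (the "session"),
        # but routes every request through the configured proxy.
        return urllib3.ProxyManager(os.environ["proxy"], headers=headers)
    ```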

  • How does the script identify and extract the product data from the HTML?

    -The script identifies the product data by searching for script tags with type 'application/ld+json' and extracting the JSON-LD content.
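    A sketch of that extraction step using selectolax, as in the video; the '@type' check mirrors the transcript, and returning None when nothing matches is one way to satisfy the type hint:

    ```python
    import json

    from selectolax.parser import HTMLParser

    def parse_html(html: HTMLParser) -> dict | None:
        """Return the first JSON-LD blob on the page carrying an '@type' key."""
        # css() returns every matching node, since a page can hold
        # several JSON-LD script tags.
        for schema in html.css("script[type='application/ld+json']"):
            text = schema.text(strip=True)
            if "@type" in text:
                return json.loads(text)
        return None  # no schema data found on this page
    ```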

  • What is JSON-LD and why is it used?

    -JSON-LD (JavaScript Object Notation for Linked Data) is a standard for embedding linked data in web pages. It is used because it provides a structured and machine-readable way to represent data.
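    For illustration, a trimmed, hypothetical JSON-LD product block in the style of the schema.org examples (not taken from either site in the video), parsed the same way the script does:

    ```python
    import json

    raw = """
    {
      "@context": "https://schema.org",
      "@type": "Product",
      "name": "Example Widget",
      "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "GBP"}
    }
    """

    data = json.loads(raw)
    print(data["name"], data["offers"]["price"])  # Example Widget 19.99
    ```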

  • How does the script handle multiple URLs for scraping?

    -The script uses a list of URLs and iterates through them, scraping data from each URL in turn.
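    A sketch of that loop, reusing one client for every request; new_client, schema_get_html, and parse_html are the functions the video builds, and the URLs are placeholders:

    ```python
    urls = [
        "https://example.com/product-one",
        "https://example.com/product-two",
    ]

    client = new_client()  # one proxy-aware client for all requests
    for url in urls:
        html = schema_get_html(client, url)
        print(parse_html(html))  # one product dict (or None) per URL
    ```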

  • What can be done with the data once it is scraped and stored in a dictionary?

    -The scraped data can be further processed, such as being put into a spreadsheet using pandas, or certain parts of the information can be extracted and stored in a database.
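    For example, assuming `records` holds a list of scraped product dictionaries, pandas can flatten the nested JSON-LD fields into a spreadsheet-ready table (a sketch, not shown in the video):

    ```python
    import pandas as pd

    records = [
        {"name": "Example Widget", "offers": {"price": "19.99"}},
        {"name": "Another Widget", "offers": {"price": "4.50"}},
    ]

    # json_normalize flattens nested keys such as offers.price into columns.
    df = pd.json_normalize(records)
    df.to_csv("products.csv", index=False)
    ```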

Outlines

00:00

🌐 Utilizing Schema.org for Structured Data Extraction

This paragraph discusses the use of schema.org's Product schema by different websites to structure product data. The speaker demonstrates how two distinct websites, selling different products, implement the same schema in their HTML. Viewing the page source reveals 'application/ld+json' script tags containing product information in a standardized format. The speaker guides viewers to schema.org to explore the Product schema, highlighting the structured data points available. The script's purpose is to simplify data scraping across websites that use this schema, making it easier to extract product information.

05:00

🛠️ Building a Python Script for Data Scraping

The speaker outlines the process of creating a Python script to scrape product data from websites using the schema.org Product schema. The script involves setting up a new virtual environment and installing the necessary packages: urllib3 for HTTP requests, selectolax for HTML parsing, and Rich for terminal display. The script includes functions for creating an HTTP client with proxy support and headers, and for fetching and parsing HTML content to extract data from the script tags. The speaker emphasizes the use of type hints and the installation of 'black' for code formatting, aiming for a clean and efficient codebase.
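Based on the transcript, the import block ends up looking roughly like this (after installing urllib3, selectolax, and rich into the fresh virtual environment):

```python
import json  # json.loads for the JSON-LD payloads
import os    # the proxy string is read from an environment variable

import urllib3                             # HTTP client / ProxyManager
from rich import print                     # prettier terminal output
from selectolax.parser import HTMLParser   # fast HTML parsing
```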

10:02

🔍 Extracting and Utilizing Schema Data with Python

In this paragraph, the speaker details the steps to extract schema data from HTML using the previously created Python script. The script includes functions to make HTTP requests with attached proxies and headers, parse HTML content, and specifically target script tags with 'application/ld+json' type. The extracted JSON data is then loaded into a dictionary. The speaker also discusses handling multiple script tags on a single page and the potential for varying data completeness. The script is designed to be versatile, allowing for easy data extraction from any URL that follows the Schema.org product schema format. The potential applications of the extracted data, such as database storage or spreadsheet creation, are also mentioned.
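Putting the pieces together, a sketch of the fetch function and the run() entry point as the outline describes them; new_client and parse_html are the functions sketched earlier, and the URL is a placeholder:

```python
import urllib3
from selectolax.parser import HTMLParser

def schema_get_html(client: urllib3.ProxyManager, url: str) -> HTMLParser:
    # Every request made through this client carries the proxy and
    # headers configured in new_client().
    response = client.request("GET", url)
    # response.data is bytes, so decode before parsing.
    return HTMLParser(response.data.decode("utf-8"))

def run() -> None:
    client = new_client()
    html = schema_get_html(client, "https://example.com/some-product")
    data = parse_html(html)  # the JSON-LD dict, or None
    print(data)

if __name__ == "__main__":
    run()
```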

Keywords

💡Schema.org

Schema.org is a collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond. In the video, it is used to describe how different websites implement a standardized way to mark up their product data, which is crucial for structured data extraction and SEO purposes. The script mentions viewing the source code of a website and finding the 'application/ld+json' script tag, which carries schema.org's Product schema.

💡Product Schema

The Product Schema is a vocabulary type in Schema.org used to represent items that are sold or exchanged in an online or physical marketplace. In the context of the video, the Product Schema is highlighted as a common feature across different e-commerce websites, allowing for a uniform method to scrape product information regardless of the site's underlying structure.

💡JSON-LD

JSON-LD stands for JSON for Linked Data and is a method to serialize Linked Data using JSON. It is a standard that allows for a machine-readable format of structured data. In the video, the script tag with type 'application/ld+json' is identified as a carrier of product data in a standardized format that can be easily parsed and utilized by web scraping scripts.

💡Web Scraping

Web scraping is a technique used to extract large amounts of web data from websites. It involves the use of automated scripts to retrieve and process data from web pages. The video discusses building a script to scrape product data from websites using the structured data provided by Schema.org, making the process more efficient and less prone to errors.

💡HTML Parser

An HTML parser is a program that interprets HTML code and builds a document object model (DOM) representing the structure of the webpage. In the video, the script uses 'selectolax', a fast HTML parser, to navigate and extract data from the HTML content of web pages that contain product information.
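A tiny illustration of selectolax on hypothetical markup: build a tree, query it with a CSS selector, and read the text, which is exactly what the script does for the JSON-LD tags:

```python
from selectolax.parser import HTMLParser

tree = HTMLParser("<html><body><h1>Hello</h1></body></html>")
node = tree.css_first("h1")   # first match; css() returns all matches
print(node.text())            # Hello
```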

💡Virtual Environment

A virtual environment is an isolated working copy of Python that allows developers to work on different projects with their own dependencies without interfering with each other. The video script emphasizes the importance of using a virtual environment when setting up the Python environment for the web scraping project.

💡urllib3

urllib3 is a powerful, user-friendly HTTP client for Python that provides a straightforward way to make requests to web servers. In the video, it is used for making HTTP requests, including handling proxies, which is essential for web scraping tasks that require routing requests through other IP addresses to avoid being blocked by websites.
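Minimal urllib3 usage without a proxy, showing the PoolManager "session" that the ProxyManager builds on (example.com is a placeholder):

```python
import urllib3

http = urllib3.PoolManager()                       # plays the role of a session
resp = http.request("GET", "https://example.com")
print(resp.status, len(resp.data))                 # status code and body size
```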

💡Proxy Manager

A ProxyManager in urllib3 manages the use of a proxy when making HTTP requests. The script in the video uses a ProxyManager to handle the proxy settings, routing the scraper's requests through a proxy whose URL is stored in an environment variable.

💡Rich Library

Rich is a Python library for rich text and beautiful formatting in the terminal. It is used in the video to enhance the readability of the output when printing data to the terminal. This is particularly useful for debugging and displaying structured data in a more human-readable format.
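A one-line illustration: Rich's drop-in print pretty-prints nested structures with colour, which is why the video uses it to inspect the scraped dictionaries:

```python
from rich import print

print({"@type": "Product", "name": "Example Widget",
       "offers": {"price": "19.99"}})
```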

💡Structured Data

Structured data refers to data that is organized in a specific way, making it easier for machines to understand and process. In the video, the script discusses how Schema.org's product schema provides a structured format for product information, which simplifies the task of extracting and analyzing product data from various websites.

💡Python Scripting

Python scripting involves writing Python programs to automate tasks. In the video, the script describes building a Python script to automate the process of web scraping product data from websites using structured data. The script includes importing necessary libraries, creating functions for making HTTP requests, and parsing HTML to extract the desired information.

Highlights

Introduction to using schema.org product schema for structured data on websites.

Demonstration of how different websites use the same schema for product data.

Explanation of viewing the source code to find the 'application/ld+json' script tag.

Visiting schema.org to understand the organization of data points for products.

Description of JSON-LD as a standard for Linked Data in JavaScript Object Notation.

Advantages of using a common data format for web scraping across different sites.

Introduction to building a script to extract product information from websites using schema.org.

Setting up a new Python virtual environment and installing necessary packages.

Use of urllib3 for making HTTP requests with a proxy manager.

Importance of using a virtual environment in Python for project dependencies.

Explanation of using selectolax for parsing HTML and Rich for terminal output.

Importing necessary modules for the web scraping script, including json and os.

Creating a function to generate a new HTTP client with headers and proxy settings.

Details on using environment variables for proxy settings in Python scripts.

Developing a function to fetch HTML content from a given URL using the configured client.

Extracting script tags containing product schema data using CSS selectors.

Filtering script tags to find those with the 'application/ld+json' type attribute.

Parsing the script tag content to extract JSON-LD formatted product data.

Creating a main function to run the web scraping process and print the results.

Using the script to scrape product data from multiple URLs and output the results.

Potential applications of the scraped data, such as populating databases or spreadsheets.

Encouragement to join the community on Discord and subscribe for further updates.

Transcripts

00:00

These are two completely different websites, from two different companies, selling completely different products. But what do they both have in common? They're both using the schema.org Product schema to put some of the product data on the page. I've gone to View Source on this one, and you can see that in this script tag, application/ld+json, we have the content from schema.org, type Product. And it's exactly the same thing over here; even though it's not quite as spaced out, you'll see that we have the same information. So if we go to schema.org, let's paste this in, come to this site, and we can have a look through. If you read through the docs you'll see that it's basically just a way of organizing your data across these things. Here are the commonly used types; the one that we're interested in is Product, and it shows you all of the available data points that you can have within this. If I scroll down to the examples and look at this one, you'll see that we have this application/ld+json. If we click on JSON-LD you'll see that it is JavaScript Object Notation for Linked Data; it's kind of like a standard that's used for this, and it's been around for quite a while. What this basically means is that we can use one scraping script to pull this data from both of these sites, because they do it in the same way. It makes our lives easier, and I find it's a really good approach: you can just run the script against product sites and quickly see, does this information exist at this URL, and then go ahead and save yourself a load of time, which is really handy. So what we're going to do in this video is build out a short script that's able to grab this information from that script tag for any site that has it, so we can then pull it back.

01:49

So I've got my new folder open up here; there's nothing in it, so I'm just going to create a new virtual environment. Always use a virtual environment with Python; I think that goes without saying. I'm going to activate that, and we're going to install a few different things. I'm going to be using urllib3 for this. urllib3 is an HTTP client which is used by Requests and a whole load of other Python packages to make requests to web servers, and I'm going to use it by itself in this case for the ProxyManager, which I'll show you how that works. We're going to use selectolax, of course, because it's the only HTML parser I ever use now; love it. And we're also going to use Rich, so when I print stuff out to my terminal it's easier to see; you don't need to use that if you don't want to. So let's create a new file; I'm going to call this one main.py, and we're going to try using Helix for this. Helix is very similar to Neovim, written in Rust; it comes with more features out of the box, but it doesn't quite have the same key bindings, so we'll see how we get on. If it starts to get in the way I'll swap back.

02:57

So the first thing I'm going to do is import some stuff; let me make this nice and big. We'll have urllib3; from selectolax.parser we import the HTMLParser; and then from Rich we'll import print. We're going to import json as well; we'll need that because for the data we're going to be pulling out of here, we're going to be using json.loads. And I'm also going to import os, because I keep my proxy string saved as an environment variable on my system, so I can easily import it into any code that I want. It means I don't have to keep going and getting it and showing it on my screen when I'm doing stuff like this. If you want to use proxies, I have a link in the description below for some proxies that I use, if you want to check that out; that's cool.

03:45

So we want to create our first function; let's call this one creating a new client. This is optional; I use this function just to give myself a new HTTP client that I can then call and use. I lifted this from when I've been learning Go; it's not necessarily essential, but it makes it nice and neat, I think. So I'm going to call this new_client, and we're going to return out of this a urllib3.ProxyManager, like so. Let's have some headers: our headers are going to be equal to a dictionary, of course, and the main one we want is a user agent, so I'm just going to go back over here and grab my user agent. This one is fine; we'll stick it in there; put this on one string, on one line, please. Then we'll say that our client is going to be equal to the urllib3.ProxyManager. In urllib3, a session or a client is actually a PoolManager or a ProxyManager; if you look at PoolManager, you can just about see at the top there, it's a bit big, but basically a ProxyManager is just like a PoolManager. This is your session, essentially. So I'm going to do this, then we're going to put in our os.environ and then the key, which I think I call 'proxy', like so. We also then need to have headers equal to headers. I know this is behind my head; you'll see it in just a second, don't worry. Then we can return the client out like this: return client. One thing I need to install now, actually, is black, which is going to let me format the code, and we'll come back to our .py file here. We can now run format in Helix because we have black installed, and it's just going to make that a bit neater and tidier. I'm hiding stuff with my head; you're not missing out on much, it just says headers is equal to headers. Right, so now we have our new client that we can actually run and call.

05:41

We want to actually get the data from the page, so let's say schema_get_html. This is going to take in a client, which is going to be our urllib3.ProxyManager (we should use type hints in Python now), and it's also going to take in a URL, which is a string, and it's going to return an HTMLParser instance. I'll make this a bit smaller so you can kind of see what I'm doing there; hopefully it's still big enough to read. Okay, so we'll do: our response is equal to client.request (it works slightly differently in urllib3), and we'll have our GET and then our URL. What this means is every time we make a request with the client that we're creating in this function, it's always going to have our proxy attached, and it's always going to have the headers that we want attached too. So I can say that our HTML is going to be equal to the HTMLParser, and we need to give it our response.data. Now we need to decode this data, because it's going to be in bytes, I think, so we can say utf-8, just to decode it as text, and then we'll return our HTML, which will satisfy the type hint there. Format. So this one will basically just make the request and return the HTML for us, which is exactly what we want.

07:10

Now we can parse it, so I'll just say parse_html. This is going to take in the HTML that we got, which is of type HTMLParser, and it's going to return out of this function, I think, a dictionary; we'll leave it like this for the moment. So we want to actually get the elements that have the script tags, so let's do: our script_data is going to be equal to html.css. We want to use css because we want to find all of them; on this page there could be multiple script tags that meet this criteria, so we want to find each and every one so we can go through them. So we'll say it's going to be a script tag, script, and it's going to have a type, and we need to put this into brackets, so I'm just going to grab this here, application/ld+json; no, brackets; we need to put this into single quotes, equals. There we go; it formats it nice and neatly. From here, what I'm going to do is loop through each of these and find if it has '@type' in it, because that was the signifier here; you could use '@context' as well, but I'm going to use '@type'. So we'll do: for schema (we'll call it schema) in script_data, .text, and I'm going to do strip equal to True; I don't know if this is strictly necessary here, but we'll do it anyway. And we'll do: if '@type' in schema; so this basically says, if we find this in there. What's this complaining about? Of course, we can't do this here; we want to search through each of these and then search within the text, so strip is equal to True. There we go, that makes more sense. So if we find this string '@type', which is going to be in every single one of these, what we're going to do is return json.loads, and we'll say our schema.text, with strip equal to True, like so. And now our type hint is going to complain here, because we could be returning None, since this return sits under an if statement. So we have a couple of choices; I'm just going to say this is either going to return a dictionary or None, just to satisfy it. This is shifting the problem along a little bit, but it'll be fine.

09:50

So we'll finally have a new function called run; we'll call this one run for now. We'll say that our client is going to be equal to the new_client function that we created; our HTML is going to be equal to schema_get_html with the client and the URL, which I'll get in just a second; and then we can say that our data is equal to parse_html on our HTML, and let's print out the data. Then we just need our if __name__ == '__main__', so when we call this directly it runs the run function. So now we just need our URL: url is equal to... let's go and get this one, a nice long one. In fact, no, we'll just try this first; we'll just do it like this first. Cool, let's check to see that this actually works. I'm going to format this so it's all done, and then I'm going to come back out here and run our main file. Cool, so that's worked. So now we have all of this information, all this nicely formatted data that's come back in the schema.org product format, into our code.

11:09

So now we can actually just go ahead and add in the other URL if we wanted to. Let's grab this, scroll down here, and we'll create a urls list equal to this; we'll put this one in here, and we'll get the other one as well, which I've just removed. Sweet. So now we can just do: for url in urls, like this, and indent all this stuff in. I'm sure there's a quick way to do this indenting in Helix that I haven't looked up yet. There we go; quick format; I like this. Sweet. So let's save and try again; we should get two lots out: one, two. So this is basically just leveraging the fact that there is structured data in this format, within this specific script tag, for a lot of different websites that use it. Anywhere that you find this, you can put the URL of the product in, and it will match this code here and give you the actual data back. Now, this is a short script; I have one similar to this that I can just call and run against any URL.

12:29

Obviously I work in e-commerce, I do a lot of stuff with that, and I can just pull that data out and see what's going on. Sometimes there's more data than others: you have reviews and offers, which don't always exist, but quite often they do; so it's all there to be played with. From here we have a dictionary in Python that we can do anything with: you could collect a load of these and use pandas to put the dictionaries into a spreadsheet or something, or you could pick out certain parts of the information you were after and put it into a database. So it's all about identifying patterns and seeing what you can use to make your life easier when you're scraping multiple bits of data across multiple different sites. Hopefully you enjoyed this video; if you have, like, comment, and subscribe, please, it really does help me out. And also jump in the Discord; we're almost at a thousand members now, which is absolutely insane. Thank you to everyone who's joined and to all the people who actively contribute; there are loads of you now. Thank you very much, and I'll see you in the next one. Bye.

Related Tags
Web Scraping, Schema.org, Product Data, Python Script, Data Extraction, E-commerce Tools, JSON-LD, HTML Parsing, API Client, Proxy Management