Render Dynamic Pages - Web Scraping Product Links with Python

John Watson Rooney
27 Jul 2020 · 13:41

Summary

TL;DR: The video explores a method for web scraping dynamically loaded content using the `requests-html` library. The presenter demonstrates how to extract product information from a beer website where JavaScript loads the content. The technique involves creating a session, rendering the page in the background, and using XPath to extract product details such as names, prices, and availability. The video also covers handling pagination and how to avoid issues with missing elements. In part two, the presenter will expand on the script by adding CSV output and more advanced features.

Takeaways

  • 💻 This video explains how to scrape data from dynamically loaded websites using the `requests-html` library.
  • 🍺 The example website used in the video is Beer Wolf, which dynamically loads product data with JavaScript.
  • 📄 Standard web scraping tools like `requests` and `BeautifulSoup` don't work for such dynamic sites, requiring a more advanced approach.
  • 🧑‍💻 The `requests-html` module can be used to render JavaScript-heavy pages by simulating a browser in the background.
  • ⏳ A sleep parameter can be set to ensure that the page is fully rendered before scraping, preventing premature extraction.
  • 📑 XPath and CSS selectors are used to locate the desired elements (e.g., product containers, individual items) within the rendered page.
  • 🔗 Absolute links to each product are extracted, allowing navigation to individual product pages to retrieve more detailed information.
  • 💲 Details like product name, price, rating, and availability (in stock or out of stock) are extracted using relevant HTML classes.
  • 🔄 The video demonstrates looping through multiple product pages to collect data and handle pagination.
  • 📊 The next video promises to enhance the process by organizing the script into functions and exporting the data into CSV or Excel formats.
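The takeaways above can be sketched end to end. This is a minimal, hypothetical sketch: it assumes the `requests-html` library is installed (`pip install requests-html`), and the URL and XPath below are placeholders rather than the real site's values.

```python
def collect_product_links(page_url: str, container_xpath: str) -> list:
    """Render a JavaScript-loaded listing page and return its product links."""
    # Deferred import: requests-html is third-party (pip install requests-html).
    from requests_html import HTMLSession

    session = HTMLSession()            # rotates user agents automatically
    r = session.get(page_url)
    r.html.render(sleep=1)             # run the JS, then pause 1s to settle
    print(r.status_code)               # 200 means the page came back fine
    container = r.html.xpath(container_xpath, first=True)
    return sorted(container.absolute_links)   # absolute URL per product


if __name__ == "__main__":
    # Both arguments are hypothetical placeholders, not the site's real values.
    links = collect_product_links(
        "https://example.com/beers",
        '//*[@id="product-items-container"]',
    )
    print(len(links))
```

On the first `render()` call, requests-html downloads the headless Chromium it drives in the background, as the video notes.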

Q & A

  • What is the main focus of the video?

    -The video focuses on scraping dynamically loaded content from e-commerce websites, specifically using the 'requests-html' library in Python.

  • Why can't traditional web scraping methods like 'requests' and 'BeautifulSoup' be used for this task?

    -'requests' and 'BeautifulSoup' can't be used because the content is dynamically loaded via JavaScript, and these tools cannot render JavaScript to access the data.

  • What tool does the presenter recommend for handling dynamic content?

    -The presenter recommends using the 'requests-html' library, which can render JavaScript content in the background by launching a lightweight browser.

  • How does 'requests-html' handle JavaScript content differently from 'requests'?

    -'requests-html' creates a session and uses the 'render' function to execute JavaScript and render the page content, allowing access to dynamically loaded data.

  • Why does the presenter include a 'sleep' argument when rendering the page?

    -The 'sleep=1' argument is added to give the page time to fully load the content before trying to access it, which prevents failures when scraping the data.

  • What method is used to extract product links from the dynamically rendered page?

    -The presenter uses XPath to locate the product container, calling 'r.html.xpath' with the container's XPath (and 'first=True'), then reading the element's 'absolute_links' to collect every product link.

  • How does the presenter suggest handling multiple products on a page?

    -The presenter suggests looping through each product link and fetching data from individual product pages by visiting each link in the extracted list.

  • What key information does the presenter extract from each product page?

    -Key information extracted includes product name, subtext, price, stock status, and rating.

  • How does the script determine whether a product is in stock or out of stock?

    -The script checks for a 'div' element whose class marks the 'add to cart' container: if it is present the product is in stock, while out-of-stock products carry a 'disable container' class instead.

  • What future improvements does the presenter plan for the script?

    -In part two, the presenter plans to separate the script into distinct functions for requests, parsing, and output, handle pagination, and export the data into a CSV or Excel file.
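The in-stock check described in the Q&A boils down to a `None` test, because `find(selector, first=True)` in requests-html returns `None` when nothing matches. A small helper makes the logic explicit (the hyphenated selector in the comment is an assumed spelling of the class named in the video):

```python
def stock_status(add_to_cart_container) -> str:
    """Translate the result of looking up the add-to-cart container
    into the label used in the video.

    requests-html's find(selector, first=True) returns None when no
    element matches, so a missing container marks an out-of-stock item.
    """
    return "In stock" if add_to_cart_container is not None else "Out of stock"


# Usage sketch, selector assumed:
# stock = stock_status(r.html.find("div.add-to-cart-container", first=True))
```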

Outlines

00:00

🔍 Introduction to Scraping Dynamic Websites with Request-HTML

The video begins with an introduction by John, explaining that the tutorial will cover scraping product pages from dynamically loaded websites using a Python library called `requests-html`. He demonstrates with the example of 'Beer Wolf', a beer website that loads content through JavaScript, which makes traditional scraping techniques ineffective. John mentions that to handle this, we can use `requests-html` to render the page, which loads the necessary content in the background, allowing us to access the data dynamically.

05:02

📦 Setting Up the Session and Initializing the Scraping Process

John walks through setting up a session using `requests-html`, starting with importing necessary modules, creating a session, and making an initial GET request to the website. He explains that `requests-html` automatically handles user agents, and how to render the JavaScript of the page using the `.render()` function, which processes the JavaScript and retrieves the dynamic content. A delay (sleep) of one second is introduced after rendering to ensure that the content is fully loaded before extraction. John concludes this part by demonstrating how to verify successful page retrieval through status codes.

10:03

🛍 Extracting Product Information from the Web Page

In this section, John inspects the web page to locate the product container using the browser's developer tools. He shows how to use XPath to pinpoint the container that holds all the product items. By calling `r.html.xpath()` with the copied XPath (and `first=True`), John retrieves the container element from the page and demonstrates how to extract the absolute URLs of the products within it. He sets up a loop to iterate over the product links, showing how to scrape individual product details like name, price, and other product-related information from each item's specific page.
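The per-product loop described above might look like the following sketch. The session is passed in, and the CSS class names (`product-info-detail`, `product-subtext`, `price`) are illustrative guesses that should be read from the browser's inspect tool rather than taken as the site's real markup:

```python
def scrape_product(session, url: str) -> dict:
    """Fetch one product page and pull out its visible details.

    The class selectors below are illustrative placeholders; inspect the
    real page to find the actual class names.
    """
    r = session.get(url)
    name = r.html.find("div.product-info-detail", first=True).text
    subtext = r.html.find("div.product-subtext", first=True).text
    price = r.html.find("span.price", first=True).text
    return {"name": name, "subtext": subtext, "price": price}


# Usage sketch: loop over the links collected from the container element.
# for item in products.absolute_links:
#     print(scrape_product(s, item))
```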

💰 Scraping Additional Product Details: Price, Stock, and Rating

John demonstrates how to scrape further product details, including the product's name, description, price, and stock status. He explains how to handle cases where some products may be out of stock by identifying differences in the HTML structure for in-stock and out-of-stock products. Using conditional logic, John flags products as in-stock or out-of-stock based on the presence of certain HTML classes. Additionally, he covers how to retrieve product ratings from the span elements that contain rating values.

🔁 Handling Missing Data and Exception Handling

In this section, John deals with potential issues arising from missing product ratings, which can cause errors during scraping. He uses a try-except block to handle situations where the rating information is not available, setting the rating to 'None' if it's missing. This ensures the scraper can continue processing other products without failing. He reviews how the script loops through each product, retrieving relevant data such as product name, stock status, price, and ratings where available.
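The try/except pattern John describes can be isolated into a helper: `find(..., first=True)` returns `None` for a missing element, so calling `.text` on the result raises `AttributeError`, which the except clause turns into a `None` rating. The `span.label-stars` selector is an assumed spelling of the class mentioned in the video.

```python
from typing import Optional


def extract_rating(html) -> Optional[str]:
    """Return the rating text for a rendered product page, or None when
    the page has no rating element at all."""
    try:
        # find(..., first=True) gives None on no match, so .text raises.
        return html.find("span.label-stars", first=True).text
    except AttributeError:
        return None
```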

🔗 Preparing for Part Two: Pagination and Data Export

John wraps up this video by outlining the next steps, which will be covered in part two of the tutorial. He plans to break the script into functions for better organization, separating the request, parsing, and output stages. The next part will include handling pagination to scrape multiple pages of products and exporting the data into a CSV or Excel file. He briefly touches on pagination logic and encourages viewers to stay tuned for the upcoming continuation of the project.


Keywords

💡Dynamic Content Loading

Dynamic content loading refers to the process where web pages load content in real-time via JavaScript or other means, instead of being fully loaded when the page is first accessed. In the video, the speaker explains how many ecommerce websites, like the example Beer Wolf site, use JavaScript to dynamically load product information. Traditional web scraping tools would fail here because the content is not present in the initial HTML source.

💡Requests-HTML

Requests-HTML is a Python library that allows users to interact with dynamically rendered websites. In the video, the speaker explains how this library can be used to fetch content from websites that rely on JavaScript to load their elements. Unlike traditional scraping libraries, Requests-HTML renders the JavaScript on the page, making it possible to access dynamically loaded data.

💡Session

A session in web scraping refers to an ongoing connection with a website, allowing multiple requests to be made using the same set of cookies and other context information. In the video, the speaker creates a session using Requests-HTML, which enables consistent access to the site and renders the dynamic content as if a browser is loading it in the background.

💡XPath

XPath is a language for selecting nodes in an XML document, often used in web scraping to locate specific HTML elements on a webpage. In the video, the speaker uses XPath to identify and extract the product information from the Beer Wolf website, such as the product container holding all individual items on the page.

💡Render Function

The render function in Requests-HTML is used to process and load all the dynamic content on a page by simulating a browser environment. The video highlights how this function is critical for rendering JavaScript on pages like the Beer Wolf site, ensuring that product information is fully loaded and accessible before attempting to extract it.

💡CSS Selectors

CSS selectors are used in web development and scraping to target HTML elements based on their class, ID, or other attributes. The speaker mentions that Requests-HTML allows access to CSS selectors, which are used to pinpoint elements such as product details or links. This is crucial in gathering the correct data from dynamically loaded web pages.

💡Product Information

Product information in the context of the video includes data like product names, prices, ratings, and stock status. The speaker explains how to loop through all products on the Beer Wolf website and retrieve this information by rendering the page and locating the specific elements that contain these details.

💡Pagination

Pagination refers to the division of content across multiple pages, typically with a 'next' button to load more items. In the video, the speaker notes that the Beer Wolf website has multiple pages of products, and addresses how handling pagination is essential in order to scrape data from all product listings across the site.
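When the 'next' links follow a predictable query pattern, the page URLs can be generated up front. This sketch assumes a hypothetical `?page=N` scheme, which should be verified against the site's actual pagination links:

```python
def page_urls(base_url: str, pages: int) -> list:
    """Build one listing URL per page under an assumed '?page=N' scheme."""
    return [f"{base_url}?page={n}" for n in range(1, pages + 1)]


# e.g. page_urls("https://example.com/beers", 3)
```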

💡In Stock / Out of Stock

In stock and out of stock refer to whether a product is available for purchase or not. The video shows how to check the presence of certain HTML elements, such as the 'add to cart' button, to determine if a product is in stock. The speaker compares it to another element class used for out-of-stock products, showing how this information can be programmatically extracted.

💡CSV/Excel Output

CSV/Excel output refers to saving scraped data into a spreadsheet format for further analysis or use. Toward the end of the video, the speaker mentions that part two will involve writing the scraped product information to a CSV or Excel file, which is a common practice in web scraping to store data in a structured format.

Highlights

Introduction to scraping dynamically loaded content using requests-html

Demonstration of handling JavaScript-loaded content with requests-html

Explanation of creating a session to render pages in the background

Importing requests-html for web scraping tasks

Setting up a session with requests-html

Using the render function to execute JavaScript and load content

Pausing the script to ensure content is fully loaded

Checking the response status code for successful data retrieval

Navigating to the product container using the inspect tool

Using XPath to extract product information

Looping through product links to scrape individual product pages

Extracting product details like name, rating, and price

Determining product stock status by checking for the 'add to cart' button

Handling exceptions for products without ratings

Combining all extracted data into a structured format

Preview of part two focusing on outputting data to CSV or Excel

Discussion on handling pagination for large datasets

Final thoughts and call to action for likes and subscriptions

Transcripts

00:00

Hi everyone, and welcome, John here. In today's video we're going to be looking at more e-commerce websites and product pages, but this time we're going to be doing stuff that is dynamically loaded. This technique will work for pages like the one I'm about to show you, and also some other websites that use JavaScript or whatever to load their content dynamically. The website we are going to be looking at is this one: it's called Beer Wolf, it is a beer website, and if I go to view the page source you'll see right here that this is a load of script code that is loading up all of the product information for us. So if we were to use requests and Beautiful Soup to try and get the product information we wouldn't get anywhere, but what we can do is use requests-html to create a session and then render the page in the background for us. How that works is we give it the URL, we create our session, and then we use the render function; what that does is load a lightweight browser in the background as a process, render the page that we've given it, and then let us access that page.

01:08

So to start we need to import requests-html. If you don't have this installed already you can do pip install; it's this one right here, requests-html, "HTML parsing for humans", with the pip install command right there, so if you need to do that, go ahead and do that. To import it we are going to do from requests-html import the session, then we are going to set our URL, and we're going to copy this page right here and put that there. Then we can do s is equal to HTML session, so we're basically just setting an s variable to be our session. Then, similarly to our standard requests, we do r is equal to, but this time it's s dot get, because that's the session we've set here, and then the URL. For those of you that have watched my video on user agents, the requests-html module actually cycles through different user agents, so we don't need to specify our own when we're using this library, but if we are using the requests one we do.

02:22

So now we want to get it to render the page for us, so we can do r.html dot render. What that's going to do is load up the browser in the background and render the whole page for us, so execute all the JavaScript and then let us take the information from it. I'm going to put sleep is equal to one in here; what that does is give it a one-second break after rendering, just a little bit of time to make sure the information is there before we start trying to grab it. If you're trying to do this without that and it's failing, that could be why, so go ahead and put sleep is equal to one in there. Now I'm going to print out the r dot status code just to check that we're getting something back correctly. If this is the first time you're running this you'll see a loading bar; I think it's something to do with Puppeteer, and it'll say it installed Chromium or something. Just let that go ahead; that'll install everything you need, and that's the browser that runs in the background. So we've got a 200 response, which is good, so we can get rid of that.

03:23

We can go back to our page, and now we want our inspect tool, so let's do inspect and hover over the first item, which is this one here. We've got everything, but what we want is all of the products, not just the individual ones, and I can see right above it here's the product container. This has got everything in it; these are all the individual products, you can see them as I highlight over them. We're actually going to use the XPath for this one to get this information; when you use requests-html you have access to the CSS selectors or the XPath, so we're going to use XPath for this demo. Let's copy the XPath, and let's do products is equal to r.html, because we've just rendered it, dot xpath, and paste that in there. What's also useful when you're searching for selectors or XPaths is to put in first is equal to true, just in case it partially matches or matches other elements, so we know it's the first one on the page that's going to match. If there were other matches further down it would bring a list back, which would be more difficult for us to interrogate, so I'm going to put first is equal to true in there. If I now do print products, it's going to render the page for us and return an element. There we got it: we've got back the element div with the id of product items container, which matches this, so we know we're in the right place.

05:07

Now we can do various things with this element; we could print the text out, but what I'm interested in is the link to each and every element within it, so I'm going to do print products dot absolute links, like that. What that's done is print out every link within that element, and because that's the element that's got all of the products in it, these are all the product links, a nice big long list. So we want to create a loop so we can loop through each one of those: for item in products dot absolute links. Now if I print item, we get them back individually, and we can use those to go to each and every individual product page and get back the information from that page. When we go to one individual product we can see that we have a bit more information than we do on the main page: we've got the name, we've got our rating, we've got this little subtext information here, our price, description, and add to cart. We'll come back to that in a minute, because that'll be how we know whether it's in stock or not.

06:28

Back in our code, we want to do another request, to go out and get that page for each one of these links: r is equal to s dot get again, and then item. Now if we go back and look for where the name of the product is, we can see it's in a div with a class of product info detail, so we can copy this class. Now we can print r.html dot find; because it's a div we put div, and because it's a class we do a dot and then put our class in, with first is equal to true just to make sure we get the first one, and dot text. All this is doing is looking in r, the response we've just got from requesting each product link with our s variable, our session. We've already rendered it, and then we use r dot html dot find, div because we're looking for a div, then a dot because it was a class. So now if I run that we should get all of the names back, except I've missed a bracket here; there we go, all the names back for every product on that page, which I think was about 30 or 40 or something like that. You can see it's trickling through them all now, so I'm actually going to stop that, because we don't need to go through all of those; we know it's working. So let's change this into name.

08:16

Let's see what other information we can get. This bit here is nice and easy: it's again a div, with a class of product subtext, so that's exactly the same as this; copy that, put product subtext in there, and we'll just call that subtext. The next thing we'll do is price, same thing again: click on our element selector, and the price is in a span with a class of price, so that works for us; instead of div we do span, and the class was price. So let's test that: print name, subtext and price, and check that all of that information is there for us. Great, that's running through; again, I'm just going to stop that.

09:31

Okay, so what else can we get out? Well, as I mentioned earlier, we can see that this one says add to cart, and if we hover over that we can see we've got a button. If we go a bit further up we've got a div with a class of add to cart container. Now that's interesting, because if we go and find a product that is out of stock, it won't have this class in it. This product is out of stock, so if we again do inspect and look at the out of stock button, this is a div with a class of disable container, so there is no div with a class of add to cart container. So we can just do an if statement on this: if this class exists the product is in stock, if it isn't it's out of stock. If r.html.find, and it's a div: if that's there, then stock is equal to in stock, else stock is equal to out of stock. Okay, so now we can put stock in here as well.

10:54

Whilst we're doing this I'm also going to get the rating. With the inspect tool we can see the stars are probably a bit more difficult to get, I'm not sure we could do that, but fortunately it's got a number right here under the span with a class of label stars, and that gives us 3.10 that this one's rated. So we can just do rating is equal to r.html.find, and it was a span, label stars, again first is equal to true, dot text. So now we can print stock, and we can actually put the rating in as well; let's put it here and run through that.

11:38

Okay, we can see that it's failed. I think that is because one of these doesn't have a rating: it's looking for this label stars span and not finding it. So what I'm going to do is just put this into a try and except, so it's going to try to look for that element, and if it can't find it we are going to do rating is equal to none. All we're doing is it's going to try and find it, and if it's not there it's going to say that it is none. So if we print that now, it should run through every single one, and we should have some that are in stock, some that are out of stock, and some that don't have any ratings. You can see there's one that doesn't have a rating, which is what we were falling over on before. So what this has done is gone and got the link for every product; we've looped through it, we've loaded each page up, and we have found information for every individual product, including whether it's in stock and the price.

12:44

So that's where I'm going to leave it for this one, but in the next video, part two of this one, we're going to split this up properly into the three steps that I like: we're going to have the request, the parse and the output, and we're going to split them up into functions, and the output is going to be into a CSV or Excel file. We're also going to deal with the pagination: if we scroll right to the bottom of the page we can see that we've got lots of products and quite a few pages, ten plus, so we're going to deal with that as well. We'll end up with a script that's going to load up and render the JavaScript for this whole website and get every product's info from this category. So that'll do it for now; thank you guys, cheers. Like the video if you liked it, subscribe for more web scraping content, and I've got a lot more web scraping content already on my channel, so if you're looking for something specific, go back through my videos; you might find something that's useful to you. Thanks, bye.
