Render Dynamic Pages - Web Scraping Product Links with Python
Summary
TL;DR: The video explores a method for web scraping dynamically loaded content using the `requests-html` library. The presenter demonstrates how to extract product information from a beer website whose content is loaded by JavaScript. The technique involves creating a session, rendering the page in the background, and using XPath to extract product details such as names, prices, and availability. The video also covers how to handle missing elements without crashing the scraper. In part two, the presenter will restructure the script into functions, handle pagination, and add CSV output.
Takeaways
- 💻 This video explains how to scrape data from dynamically loaded websites using the `requests-html` library.
- 🍺 The example website used in the video is Beer Wolf, which dynamically loads product data with JavaScript.
- 📄 Standard web scraping tools like `requests` and `BeautifulSoup` don't work for such dynamic sites, requiring a more advanced approach.
- 🧑💻 The `requests-html` module can render JavaScript-heavy pages by simulating a browser in the background.
- ⏳ A sleep parameter can be set to ensure that the page is fully rendered before scraping, preventing premature extraction.
- 📑 XPath and CSS selectors are used to locate the desired elements (e.g., product containers, individual items) within the rendered page.
- 🔗 Absolute links to each product are extracted, allowing navigation to individual product pages to retrieve more detailed information.
- 💲 Details like product name, price, rating, and availability (in stock or out of stock) are extracted using relevant HTML classes.
- 🔄 The video demonstrates looping through individual product pages to collect data; pagination is deferred to part two.
- 📊 The next video promises to enhance the process by organizing the script into functions and exporting the data into CSV or Excel formats.
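Taken together, the steps above amount to a short helper. The sketch below is illustrative, not the presenter's exact code: the import is deferred inside the function so the module loads even where the third-party `requests-html` package is missing, and `render()` downloads a headless Chromium on first use.

```python
def scrape_rendered_page(url, wait=1):
    """Fetch a JavaScript-heavy page and return the rendered response."""
    # requests-html is a third-party package: pip install requests-html
    from requests_html import HTMLSession

    session = HTMLSession()            # cycles user agents automatically
    response = session.get(url)
    response.html.render(sleep=wait)   # run the page's JavaScript, then pause briefly
    return response
```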
Q & A
What is the main focus of the video?
-The video focuses on scraping dynamically loaded content from e-commerce websites, specifically using the 'requests-html' library in Python.
Why can't traditional web scraping methods like 'requests' and 'BeautifulSoup' be used for this task?
-'requests' and 'BeautifulSoup' can't be used because the content is dynamically loaded via JavaScript, and these tools cannot render JavaScript to access the data.
What tool does the presenter recommend for handling dynamic content?
-The presenter recommends using the 'requests-html' library, which can render JavaScript content in the background by launching a lightweight browser.
How does 'requests-html' handle JavaScript content differently from 'requests'?
-'requests-html' creates a session and uses the 'render' function to execute JavaScript and render the page content, allowing access to dynamically loaded data.
Why does the presenter include a 'sleep' argument when rendering the page?
-The 'sleep=1' argument is added to give the page time to fully load the content before trying to access it, which prevents failures when scraping the data.
What method is used to extract product links from the dynamically rendered page?
-The presenter uses XPath to locate the product container and extract all product links by accessing 'r.html.find' with the XPath of the container.
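A hedged sketch of that extraction step, with the container XPath passed in as a parameter, since the exact expression is copied from the browser's inspector and is specific to the site:

```python
def extract_product_links(rendered_html, container_xpath):
    """Return the absolute URL of every link inside the product container.

    `rendered_html` is the `.html` attribute of an already-rendered
    requests-html response; first=True keeps a single element instead
    of a list of partial matches.
    """
    container = rendered_html.xpath(container_xpath, first=True)
    if container is None:
        return set()                     # container not found on this page
    return container.absolute_links     # a set of fully-qualified URLs
```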
How does the presenter suggest handling multiple products on a page?
-The presenter suggests looping through each product link and fetching data from individual product pages by visiting each link in the extracted list.
What key information does the presenter extract from each product page?
-Key information extracted includes product name, subtext, price, stock status, and rating.
How does the script determine whether a product is in stock or out of stock?
-The script checks for a 'div' element with the class 'add to cart container' for in-stock products and 'disable container' for out-of-stock products.
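That check can be expressed as a small helper. The hyphenated class name below is a guess at the site's actual markup, so treat the selector as illustrative:

```python
def stock_status(rendered_html):
    """Classify a product page as in or out of stock.

    requests-html's find() with first=True returns None when the
    selector matches nothing, so a truthiness check is enough.
    """
    if rendered_html.find("div.add-to-cart-container", first=True):
        return "in stock"
    return "out of stock"
```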
What future improvements does the presenter plan for the script?
-In part two, the presenter plans to separate the script into distinct functions for requests, parsing, and output, handle pagination, and export the data into a CSV or Excel file.
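One way the part-two structure might look. The function names and CSV columns are assumptions, not taken from the video, and the parse step is left as a stub:

```python
import csv

def request_page(session, url):
    """Request step: fetch one page and render its JavaScript."""
    response = session.get(url)
    response.html.render(sleep=1)
    return response

def parse_products(rendered_html):
    """Parse step: return a list of product dicts (details omitted here)."""
    return []

def output_csv(rows, path="products.csv"):
    """Output step: write the collected product dicts to a CSV file."""
    fields = ["name", "subtext", "price", "stock", "rating"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```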
Outlines
🔍 Introduction to Scraping Dynamic Websites with Request-HTML
The video begins with an introduction by John, explaining that the tutorial will cover scraping product pages from dynamically loaded websites using a Python library called `requests-html`. He demonstrates with the example of 'Beer Wolf', a beer website that loads content through JavaScript, which makes traditional scraping techniques ineffective. John mentions that to handle this, we can use `requests-html` to render the page, which loads the necessary content in the background, allowing us to access the data dynamically.
📦 Setting Up the Session and Initializing the Scraping Process
John walks through setting up a session using `requests-html`, starting with importing necessary modules, creating a session, and making an initial GET request to the website. He explains that `requests-html` automatically handles user agents, and how to render the JavaScript of the page using the `.render()` function, which processes the JavaScript and retrieves the dynamic content. A delay (sleep) of one second is introduced after rendering to ensure that the content is fully loaded before extraction. John concludes this part by demonstrating how to verify successful page retrieval through status codes.
🛍 Extracting Product Information from the Web Page
In this section, John inspects the web page to locate the product container using the browser's developer tools. He shows how to use XPath to pinpoint the container that holds all the product items. By utilizing `r.html.find()` and `XPath`, John retrieves the list of product elements from the page and demonstrates how to extract their absolute URLs. He sets up a loop to iterate over the product links, showing how to scrape individual product details like name, price, and other product-related information from each item's specific page.
💰 Scraping Additional Product Details: Price, Stock, and Rating
John demonstrates how to scrape further product details, including the product's name, description, price, and stock status. He explains how to handle cases where some products may be out of stock by identifying differences in the HTML structure for in-stock and out-of-stock products. Using conditional logic, John flags products as in-stock or out-of-stock based on the presence of certain HTML classes. Additionally, he covers how to retrieve product ratings from the span elements that contain rating values.
🔁 Handling Missing Data and Exception Handling
In this section, John deals with potential issues arising from missing product ratings, which can cause errors during scraping. He uses a try-except block to handle situations where the rating information is not available, setting the rating to 'None' if it's missing. This ensures the scraper can continue processing other products without failing. He reviews how the script loops through each product, retrieving relevant data such as product name, stock status, price, and ratings where available.
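The guard described here can be packaged as a helper; the span selector is written as a guess at the real class name:

```python
def safe_rating(rendered_html):
    """Return the rating text, or None when a product has no rating.

    find() with first=True yields None for a missing element, so
    reading .text raises AttributeError; catching it keeps the
    scraping loop running, as in the video.
    """
    try:
        return rendered_html.find("span.label-stars", first=True).text
    except AttributeError:
        return None
```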
🔗 Preparing for Part Two: Pagination and Data Export
John wraps up this video by outlining the next steps, which will be covered in part two of the tutorial. He plans to break the script into functions for better organization, separating the request, parsing, and output stages. The next part will include handling pagination to scrape multiple pages of products and exporting the data into a CSV or Excel file. He briefly touches on pagination logic and encourages viewers to stay tuned for the upcoming continuation of the project.
Keywords
💡Dynamic Content Loading
💡Requests-HTML
💡Session
💡XPath
💡Render Function
💡CSS Selectors
💡Product Information
💡Pagination
💡In Stock / Out of Stock
💡CSV/Excel Output
Highlights
Introduction to scraping dynamically loaded content using requests-html
Demonstration of handling JavaScript-loaded content with requests-html
Explanation of creating a session to render pages in the background
Importing requests-html for web scraping tasks
Setting up a session with requests-html
Using the render function to execute JavaScript and load content
Pausing the script to ensure content is fully loaded
Checking the response status code for successful data retrieval
Navigating to the product container using the inspect tool
Using XPath to extract product information
Looping through product links to scrape individual product pages
Extracting product details like name, rating, and price
Determining product stock status by checking for the 'add to cart' button
Handling exceptions for products without ratings
Combining all extracted data into a structured format
Preview of part two focusing on outputting data to CSV or Excel
Discussion on handling pagination for large datasets
Final thoughts and call to action for likes and subscriptions
Transcripts
Hi everyone, and welcome. John here, and in today's video we're going to be looking at more e-commerce websites and product pages, but this time we're going to be doing stuff that is dynamically loaded. This technique will work for pages like the one I'm about to show you, and also for other websites that use JavaScript or similar to load their content dynamically. The website we're going to be looking at is called Beer Wolf; it's a beer website, and if I go to view the page source you'll see that it's a load of script code that loads up all of the product information for us. So if we were to use requests and BeautifulSoup to try and get the product information, we wouldn't get anywhere. What we can do instead is use requests-html to create a session and then render the page in the background for us. How that works is we give it the URL, we create our session, and then we use the render function. That loads a lightweight browser in the background as a process, renders the page we've given it, and then lets us access that page.
To start, we need to import requests-html. If you don't have it installed already you can do a pip install; if you look it up you'll see requests-html, "HTML parsing for humans", with the pip install command right there, so go ahead and do that if you need to. To import it we do `from requests_html import HTMLSession`, and then we set our URL by copying this page's address. Then we can do `s = HTMLSession()`, so we're basically just setting an `s` variable to be our session. Very similarly to standard requests, we then do `r = s.get(url)`, because `s` is the session we've just set. For those of you that have watched my video on user agents: the requests-html module actually cycles through different user agents, so we don't need to specify our own when we're using this library, whereas with plain requests we do.
Now we want to get it to render the page for us, so we can do `r.html.render()`. That's going to load up the browser in the background and render the whole page, executing all the JavaScript, and then let us take the information from it. I'm also going to put `sleep=1` in there; that just gives it a one-second break after rendering, a little bit of time to make sure the information is there before we start trying to grab it. If you're trying this without that and it's failing, that could be why, so go ahead and put `sleep=1` in there. Now I'm going to print out `r.status_code` just to check that we're getting something back correctly. If this is the first time you're running this, you'll see a loading bar; I think it's something to do with puppeteer, and it'll say it's installing Chromium or similar. Just let that go ahead; it installs everything you need, and that's the browser that runs in the background. We've got a 200 response, which is good, so we can get rid of that.
Now we can go back to our page and open the inspect tool. Let's hover over the first item, which is this one here. We've got everything, but what we want is all of the products, not just the individual ones, and right above it I can see the product container. This has got everything in it: these are all the individual products, and you can see them highlight as I hover over them. So we want to copy it; we're actually going to use the XPath for this one to get the information. When you use requests-html you have access to both CSS selectors and XPath, and we're going to use XPath for this demo. Let's copy the XPath as we normally would, and do `products =`. We want `r.html`, because we've just rendered it, and then the xpath method, and paste the XPath in there. Something that's also useful when you're searching for selectors or XPaths is to put `first=True` in, just in case it partially matches or matches other elements; that way we know we get the first one on the page that matches. If there were other matches further down it would bring a list back, which would be more difficult for us to interrogate, so I'm putting `first=True` in there. If I now print `products`, it's going to render the page and return an element for us, and there we've got it: a div with the id of product-items-container, which matches what we saw, so we know we're in the right place. Now we can do various things with this element. We could print the text out, but what I'm interested in is the link to each and every element within it, so I'm going to print `products.absolute_links`. What that's done is print out every link within that element, and because that is the element that's got all of the products in it, these are all the product links; we've got a nice big long list here.
Now we want to create a loop so we can go through each one of those: `for item in products.absolute_links`. If I print `item` we get them back individually, and we can use those to go to each individual product page and get the information back from that page. When we go to one individual product we can see that we have a bit more information than on the main page: we've got the name, the rating, this little subtext information, the price, the description, and add to cart, which we'll come back to in a minute, because that'll be how we know whether it's in stock or not. Back in our code, we want to do another request to go out and get that page for each one of these links: `r = s.get(item)`. Now if we go back and look for where the name of the product is, we can see it's in a div with a class of product info detail, so we can copy that class. Then we print `r.html.find(...)`; because it's a div we write `div`, because it's a class we add a dot followed by the class, plus `first=True` just to make sure we get the first one, and then `.text`. All this is doing is looking in `r`, the response we've just got back from fetching each product link with our `s` session variable. So now if I run that (after fixing a bracket I'd missed) we should get all of the names back for every product on that page, which I think was thirty or forty, and you can see it's trickling through them all now. I'm actually going to stop it there, because we don't need to go through all of those; we know it's working.
Let's change that into `name`, and see what other information we can get. This bit here is nice and easy: it's again a div, this time with a class of product subtext, so it's exactly the same pattern. Copy that in, and we'll just call it `subtext`. The next thing we'll do is price. Same thing again: click on our element selector, and the price is in a span with a class of price, so instead of `div` we do `span`, and the class is `price`. Let's test that by printing `name`, `subtext` and `price` to check all of that information is there for us. Great, that's running through again, so I'm just going to stop it.
So what else can we get out? Well, as I mentioned earlier, this one says add to cart, and if we hover over it we can see we've got a button. A bit further up we've got a div with a class of add to cart container. Now that's interesting, because if we go and find a product that is out of stock, it won't have this class in it. This product is out of stock, so if we again do inspect and look at the out of stock button, it's a div with a class of disable container; there is no div with a class of add to cart container. So we can just do an if statement on this: if the class exists, the product is in stock; if it doesn't, it's out of stock. So: `if r.html.find(...)` on that div, then `stock = 'In stock'`, else `stock = 'Out of stock'`. Okay, so now we can put `stock` in the print as well.
Whilst we're doing this, I'm also going to get the rating. Using the inspect tool, the stars themselves would probably be a bit more difficult to get, and I'm not sure we could do it, but fortunately there's a number right there under a span with a class of label stars, which gives us 3.10 for this one's rating. So we can just do `rating = r.html.find(...)` for that span, again with `first=True`, and `.text`. Now we can print the stock and put the rating in as well; let's run through that.
We can see that it's failed, and I think that's because one of these products doesn't have a rating: it's looking for that label stars span and not finding it. So what I'm going to do is put this into a try/except. It's going to try to look for that element, and if it can't find it, we're going to do `rating = None`. If we print that now it should run through every single one, and we should have some that are in stock, some that are out of stock, and some that don't have any ratings. You can see there's one that doesn't have a rating, which is the one we were falling over on before. So what this has done is it's gone and got the link for every product, we've looped through it, we've loaded each page up, and we've found information for every individual product, including whether it's in stock and the price.
That's where I'm going to leave it for this one, but in the next video, part two of this one, we're going to split this up properly into the three steps that I like: the request, the parse, and the output, all split into functions, with the output going into a CSV or Excel file. We're also going to deal with the pagination; if we scroll right to the bottom of the page we can see that we've got lots of products and quite a few pages, ten plus, so we're going to deal with that as well. We'll end up with a script that loads up and renders the JavaScript for this whole website and gets every product's info from this category. That'll do it for now. Thank you guys, cheers; like the video if you liked it, subscribe for more web scraping content, and I've got a lot more web scraping content already on my channel, so if you're looking for something specific, go back through my videos and you might find something that's useful to you. Thanks, bye.