Render Dynamic Pages - Web Scraping Product Links with Python

John Watson Rooney
27 Jul 2020 · 13:41

Summary

TL;DR: The video explores a method for web scraping dynamically loaded content using the `requests-html` library. The presenter demonstrates how to extract product information from a beer website where JavaScript loads the content. The technique involves creating a session, rendering the page in the background, and using XPath to extract product details such as names, prices, and availability. The video also covers handling pagination and how to avoid issues with missing elements. In part two, the presenter will expand on the script by adding CSV output and more advanced features.

Takeaways

  • 💻 This video explains how to scrape data from dynamically loaded websites using the `requests-html` library.
  • 🍺 The example website used in the video is Beer Wolf, which dynamically loads product data with JavaScript.
  • 📄 Standard web scraping tools like `requests` and `BeautifulSoup` don't work for such dynamic sites, requiring a more advanced approach.
  • 🧑‍💻 The `requests-html` module can be used to render JavaScript-heavy pages by simulating a browser in the background.
  • ⏳ A sleep parameter can be set to ensure that the page is fully rendered before scraping, preventing premature extraction.
  • 📑 XPath and CSS selectors are used to locate the desired elements (e.g., product containers, individual items) within the rendered page.
  • 🔗 Absolute links to each product are extracted, allowing navigation to individual product pages to retrieve more detailed information.
  • 💲 Details like product name, price, rating, and availability (in stock or out of stock) are extracted using relevant HTML classes.
  • 🔄 The video demonstrates looping through multiple product pages to collect data and handle pagination.
  • 📊 The next video promises to enhance the process by organizing the script into functions and exporting the data into CSV or Excel formats.
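The takeaways above can be sketched end to end. This is a minimal, hypothetical sketch: it assumes the `requests-html` library is installed (`pip install requests-html`), and the URL and XPath below are placeholders rather than the real site's values.

```python
def collect_product_links(page_url: str, container_xpath: str) -> list:
    """Render a JavaScript-loaded listing page and return its product links."""
    # Deferred import: requests-html is third-party (pip install requests-html).
    from requests_html import HTMLSession

    session = HTMLSession()            # rotates user agents automatically
    r = session.get(page_url)
    r.html.render(sleep=1)             # run the JS, then pause 1s to settle
    print(r.status_code)               # 200 means the page came back fine
    container = r.html.xpath(container_xpath, first=True)
    return sorted(container.absolute_links)   # absolute URL per product


if __name__ == "__main__":
    # Both arguments are hypothetical placeholders, not the site's real values.
    links = collect_product_links(
        "https://example.com/beers",
        '//*[@id="product-items-container"]',
    )
    print(len(links))
```

On the first `render()` call, requests-html downloads the headless Chromium it drives in the background, as the video notes.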

Q & A

  • What is the main focus of the video?

    -The video focuses on scraping dynamically loaded content from e-commerce websites, specifically using the 'requests-html' library in Python.

  • Why can't traditional web scraping methods like 'requests' and 'BeautifulSoup' be used for this task?

    -'requests' and 'BeautifulSoup' can't be used because the content is dynamically loaded via JavaScript, and these tools cannot render JavaScript to access the data.

  • What tool does the presenter recommend for handling dynamic content?

    -The presenter recommends using the 'requests-html' library, which can render JavaScript content in the background by launching a lightweight browser.

  • How does 'requests-html' handle JavaScript content differently from 'requests'?

    -'requests-html' creates a session and uses the 'render' function to execute JavaScript and render the page content, allowing access to dynamically loaded data.

  • Why does the presenter include a 'sleep' argument when rendering the page?

    -The 'sleep=1' argument is added to give the page time to fully load the content before trying to access it, which prevents failures when scraping the data.

  • What method is used to extract product links from the dynamically rendered page?

    -The presenter uses XPath to locate the product container, calling 'r.html.xpath' with the container's XPath (and 'first=True'), then reading the element's 'absolute_links' to collect every product link.

  • How does the presenter suggest handling multiple products on a page?

    -The presenter suggests looping through each product link and fetching data from individual product pages by visiting each link in the extracted list.

  • What key information does the presenter extract from each product page?

    -Key information extracted includes product name, subtext, price, stock status, and rating.

  • How does the script determine whether a product is in stock or out of stock?

    -The script checks for a 'div' element whose class marks the 'add to cart' container: if it is present the product is in stock, while out-of-stock products carry a 'disable container' class instead.

  • What future improvements does the presenter plan for the script?

    -In part two, the presenter plans to separate the script into distinct functions for requests, parsing, and output, handle pagination, and export the data into a CSV or Excel file.
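The in-stock check described in the Q&A boils down to a `None` test, because `find(selector, first=True)` in requests-html returns `None` when nothing matches. A small helper makes the logic explicit (the hyphenated selector in the comment is an assumed spelling of the class named in the video):

```python
def stock_status(add_to_cart_container) -> str:
    """Translate the result of looking up the add-to-cart container
    into the label used in the video.

    requests-html's find(selector, first=True) returns None when no
    element matches, so a missing container marks an out-of-stock item.
    """
    return "In stock" if add_to_cart_container is not None else "Out of stock"


# Usage sketch, selector assumed:
# stock = stock_status(r.html.find("div.add-to-cart-container", first=True))
```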

Outlines

00:00

🔍 Introduction to Scraping Dynamic Websites with Request-HTML

The video begins with an introduction by John, explaining that the tutorial will cover scraping product pages from dynamically loaded websites using a Python library called `requests-html`. He demonstrates with the example of 'Beer Wolf', a beer website that loads content through JavaScript, which makes traditional scraping techniques ineffective. John mentions that to handle this, we can use `requests-html` to render the page, which loads the necessary content in the background, allowing us to access the data dynamically.

05:02

📦 Setting Up the Session and Initializing the Scraping Process

John walks through setting up a session using `requests-html`, starting with importing necessary modules, creating a session, and making an initial GET request to the website. He explains that `requests-html` automatically handles user agents, and how to render the JavaScript of the page using the `.render()` function, which processes the JavaScript and retrieves the dynamic content. A delay (sleep) of one second is introduced after rendering to ensure that the content is fully loaded before extraction. John concludes this part by demonstrating how to verify successful page retrieval through status codes.

10:03

🛍 Extracting Product Information from the Web Page

In this section, John inspects the web page to locate the product container using the browser's developer tools. He shows how to use XPath to pinpoint the container that holds all the product items. By calling `r.html.xpath()` with the copied XPath (and `first=True`), John retrieves the container element from the page and demonstrates how to extract the absolute URLs of the products within it. He sets up a loop to iterate over the product links, showing how to scrape individual product details like name, price, and other product-related information from each item's specific page.
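The per-product loop described above might look like the following sketch. The session is passed in, and the CSS class names (`product-info-detail`, `product-subtext`, `price`) are illustrative guesses that should be read from the browser's inspect tool rather than taken as the site's real markup:

```python
def scrape_product(session, url: str) -> dict:
    """Fetch one product page and pull out its visible details.

    The class selectors below are illustrative placeholders; inspect the
    real page to find the actual class names.
    """
    r = session.get(url)
    name = r.html.find("div.product-info-detail", first=True).text
    subtext = r.html.find("div.product-subtext", first=True).text
    price = r.html.find("span.price", first=True).text
    return {"name": name, "subtext": subtext, "price": price}


# Usage sketch: loop over the links collected from the container element.
# for item in products.absolute_links:
#     print(scrape_product(s, item))
```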

💰 Scraping Additional Product Details: Price, Stock, and Rating

John demonstrates how to scrape further product details, including the product's name, description, price, and stock status. He explains how to handle cases where some products may be out of stock by identifying differences in the HTML structure for in-stock and out-of-stock products. Using conditional logic, John flags products as in-stock or out-of-stock based on the presence of certain HTML classes. Additionally, he covers how to retrieve product ratings from the span elements that contain rating values.

🔁 Handling Missing Data and Exception Handling

In this section, John deals with potential issues arising from missing product ratings, which can cause errors during scraping. He uses a try-except block to handle situations where the rating information is not available, setting the rating to 'None' if it's missing. This ensures the scraper can continue processing other products without failing. He reviews how the script loops through each product, retrieving relevant data such as product name, stock status, price, and ratings where available.
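The try/except pattern John describes can be isolated into a helper: `find(..., first=True)` returns `None` for a missing element, so calling `.text` on the result raises `AttributeError`, which the except clause turns into a `None` rating. The `span.label-stars` selector is an assumed spelling of the class mentioned in the video.

```python
from typing import Optional


def extract_rating(html) -> Optional[str]:
    """Return the rating text for a rendered product page, or None when
    the page has no rating element at all."""
    try:
        # find(..., first=True) gives None on no match, so .text raises.
        return html.find("span.label-stars", first=True).text
    except AttributeError:
        return None
```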

🔗 Preparing for Part Two: Pagination and Data Export

John wraps up this video by outlining the next steps, which will be covered in part two of the tutorial. He plans to break the script into functions for better organization, separating the request, parsing, and output stages. The next part will include handling pagination to scrape multiple pages of products and exporting the data into a CSV or Excel file. He briefly touches on pagination logic and encourages viewers to stay tuned for the upcoming continuation of the project.


Keywords

💡Dynamic Content Loading

Dynamic content loading refers to the process where web pages load content in real-time via JavaScript or other means, instead of being fully loaded when the page is first accessed. In the video, the speaker explains how many ecommerce websites, like the example Beer Wolf site, use JavaScript to dynamically load product information. Traditional web scraping tools would fail here because the content is not present in the initial HTML source.

💡Requests-HTML

Requests-HTML is a Python library that allows users to interact with dynamically rendered websites. In the video, the speaker explains how this library can be used to fetch content from websites that rely on JavaScript to load their elements. Unlike traditional scraping libraries, Requests-HTML renders the JavaScript on the page, making it possible to access dynamically loaded data.

💡Session

A session in web scraping refers to an ongoing connection with a website, allowing multiple requests to be made using the same set of cookies and other context information. In the video, the speaker creates a session using Requests-HTML, which enables consistent access to the site and renders the dynamic content as if a browser is loading it in the background.

💡XPath

XPath is a language for selecting nodes in an XML document, often used in web scraping to locate specific HTML elements on a webpage. In the video, the speaker uses XPath to identify and extract the product information from the Beer Wolf website, such as the product container holding all individual items on the page.

💡Render Function

The render function in Requests-HTML is used to process and load all the dynamic content on a page by simulating a browser environment. The video highlights how this function is critical for rendering JavaScript on pages like the Beer Wolf site, ensuring that product information is fully loaded and accessible before attempting to extract it.

💡CSS Selectors

CSS selectors are used in web development and scraping to target HTML elements based on their class, ID, or other attributes. The speaker mentions that Requests-HTML allows access to CSS selectors, which are used to pinpoint elements such as product details or links. This is crucial in gathering the correct data from dynamically loaded web pages.

💡Product Information

Product information in the context of the video includes data like product names, prices, ratings, and stock status. The speaker explains how to loop through all products on the Beer Wolf website and retrieve this information by rendering the page and locating the specific elements that contain these details.

💡Pagination

Pagination refers to the division of content across multiple pages, typically with a 'next' button to load more items. In the video, the speaker notes that the Beer Wolf website has multiple pages of products, and addresses how handling pagination is essential in order to scrape data from all product listings across the site.
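When the 'next' links follow a predictable query pattern, the page URLs can be generated up front. This sketch assumes a hypothetical `?page=N` scheme, which should be verified against the site's actual pagination links:

```python
def page_urls(base_url: str, pages: int) -> list:
    """Build one listing URL per page under an assumed '?page=N' scheme."""
    return [f"{base_url}?page={n}" for n in range(1, pages + 1)]


# e.g. page_urls("https://example.com/beers", 3)
```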

💡In Stock / Out of Stock

In stock and out of stock refer to whether a product is available for purchase or not. The video shows how to check the presence of certain HTML elements, such as the 'add to cart' button, to determine if a product is in stock. The speaker compares it to another element class used for out-of-stock products, showing how this information can be programmatically extracted.

💡CSV/Excel Output

CSV/Excel output refers to saving scraped data into a spreadsheet format for further analysis or use. Toward the end of the video, the speaker mentions that part two will involve writing the scraped product information to a CSV or Excel file, which is a common practice in web scraping to store data in a structured format.

Highlights

Introduction to scraping dynamically loaded content using requests-html

Demonstration of handling JavaScript-loaded content with requests-html

Explanation of creating a session to render pages in the background

Importing requests-html for web scraping tasks

Setting up a session with requests-html

Using the render function to execute JavaScript and load content

Pausing the script to ensure content is fully loaded

Checking the response status code for successful data retrieval

Navigating to the product container using the inspect tool

Using XPath to extract product information

Looping through product links to scrape individual product pages

Extracting product details like name, rating, and price

Determining product stock status by checking for the 'add to cart' button

Handling exceptions for products without ratings

Combining all extracted data into a structured format

Preview of part two focusing on outputting data to CSV or Excel

Discussion on handling pagination for large datasets

Final thoughts and call to action for likes and subscriptions

Transcripts

00:00

Hi everyone, and welcome, John here. In today's video we're going to be looking at more e-commerce websites and product pages, but this time we're going to be doing stuff that is dynamically loaded. This technique will work for pages like the one I'm about to show you, and also some other websites that use JavaScript or whatever to load their content dynamically. The website we are going to be looking at is this one: it's called Beer Wolf, it is a beer website, and if I go to view the page source you'll see right here that this is a load of script code that is loading up all of the product information for us. So if we were to use requests and Beautiful Soup to try and get the product information we wouldn't get anywhere, but what we can do is use requests-html to create a session and then render the page in the background for us. How that works is we give it the URL, we create our session, and then we use the render function; what that does is load a lightweight browser in the background as a process, render the page that we've given it, and then let us access that page.

01:08

So to start we need to import requests-html. If you don't have this installed already you can do pip install; it's this one right here, requests-html, "HTML parsing for humans", with the pip install command right there, so if you need to do that, go ahead and do that. To import it we are going to do from requests-html import the session, then we are going to set our URL, and we're going to copy this page right here and put that there. Then we can do s is equal to HTML session, so we're basically just setting an s variable to be our session. Then, similarly to our standard requests, we do r is equal to, but this time it's s dot get, because that's the session we've set here, and then the URL. For those of you that have watched my video on user agents, the requests-html module actually cycles through different user agents, so we don't need to specify our own when we're using this library, but if we are using the requests one we do.

02:22

So now we want to get it to render the page for us, so we can do r.html dot render. What that's going to do is load up the browser in the background and render the whole page for us, so execute all the JavaScript and then let us take the information from it. I'm going to put sleep is equal to one in here; what that does is give it a one-second break after rendering, just a little bit of time to make sure the information is there before we start trying to grab it. If you're trying to do this without that and it's failing, that could be why, so go ahead and put sleep is equal to one in there. Now I'm going to print out the r dot status code just to check that we're getting something back correctly. If this is the first time you're running this you'll see a loading bar; I think it's something to do with Puppeteer, and it'll say it installed Chromium or something. Just let that go ahead; that'll install everything you need, and that's the browser that runs in the background. So we've got a 200 response, which is good, so we can get rid of that.

03:23

We can go back to our page, and now we want our inspect tool, so let's do inspect and hover over the first item, which is this one here. We've got everything, but what we want is all of the products, not just the individual ones, and I can see right above it here's the product container. This has got everything in it; these are all the individual products, you can see them as I highlight over them. We're actually going to use the XPath for this one to get this information; when you use requests-html you have access to the CSS selectors or the XPath, so we're going to use XPath for this demo. Let's copy the XPath, and let's do products is equal to r.html, because we've just rendered it, dot xpath, and paste that in there. What's also useful when you're searching for selectors or XPaths is to put in first is equal to true, just in case it partially matches or matches other elements, so we know it's the first one on the page that's going to match. If there were other matches further down it would bring a list back, which would be more difficult for us to interrogate, so I'm going to put first is equal to true in there. If I now do print products, it's going to render the page for us and return an element. There we got it: we've got back the element div with the id of product items container, which matches this, so we know we're in the right place.

05:07

Now we can do various things with this element; we could print the text out, but what I'm interested in is the link to each and every element within it, so I'm going to do print products dot absolute links, like that. What that's done is print out every link within that element, and because that's the element that's got all of the products in it, these are all the product links, a nice big long list. So we want to create a loop so we can loop through each one of those: for item in products dot absolute links. Now if I print item, we get them back individually, and we can use those to go to each and every individual product page and get back the information from that page. When we go to one individual product we can see that we have a bit more information than we do on the main page: we've got the name, we've got our rating, we've got this little subtext information here, our price, description, and add to cart. We'll come back to that in a minute, because that'll be how we know whether it's in stock or not.

06:28

Back in our code, we want to do another request, to go out and get that page for each one of these links: r is equal to s dot get again, and then item. Now if we go back and look for where the name of the product is, we can see it's in a div with a class of product info detail, so we can copy this class. Now we can print r.html dot find; because it's a div we put div, and because it's a class we do a dot and then put our class in, with first is equal to true just to make sure we get the first one, and dot text. All this is doing is looking in r, the response we've just got from requesting each product link with our s variable, our session. We've already rendered it, and then we use r dot html dot find, div because we're looking for a div, then a dot because it was a class. So now if I run that we should get all of the names back, except I've missed a bracket here; there we go, all the names back for every product on that page, which I think was about 30 or 40 or something like that. You can see it's trickling through them all now, so I'm actually going to stop that, because we don't need to go through all of those; we know it's working. So let's change this into name.

08:16

Let's see what other information we can get. This bit here is nice and easy: it's again a div, with a class of product subtext, so that's exactly the same as this; copy that, put product subtext in there, and we'll just call that subtext. The next thing we'll do is price, same thing again: click on our element selector, and the price is in a span with a class of price, so that works for us; instead of div we do span, and the class was price. So let's test that: print name, subtext and price, and check that all of that information is there for us. Great, that's running through; again, I'm just going to stop that.

09:31

Okay, so what else can we get out? Well, as I mentioned earlier, we can see that this one says add to cart, and if we hover over that we can see we've got a button. If we go a bit further up we've got a div with a class of add to cart container. Now that's interesting, because if we go and find a product that is out of stock, it won't have this class in it. This product is out of stock, so if we again do inspect and look at the out of stock button, this is a div with a class of disable container, so there is no div with a class of add to cart container. So we can just do an if statement on this: if this class exists the product is in stock, if it isn't it's out of stock. If r.html.find, and it's a div: if that's there, then stock is equal to in stock, else stock is equal to out of stock. Okay, so now we can put stock in here as well.

10:54

Whilst we're doing this I'm also going to get the rating. With the inspect tool we can see the stars are probably a bit more difficult to get, I'm not sure we could do that, but fortunately it's got a number right here under the span with a class of label stars, and that gives us 3.10 that this one's rated. So we can just do rating is equal to r.html.find, and it was a span, label stars, again first is equal to true, dot text. So now we can print stock, and we can actually put the rating in as well; let's put it here and run through that.

11:38

Okay, we can see that it's failed. I think that is because one of these doesn't have a rating: it's looking for this label stars span and not finding it. So what I'm going to do is just put this into a try and except, so it's going to try to look for that element, and if it can't find it we are going to do rating is equal to none. All we're doing is it's going to try and find it, and if it's not there it's going to say that it is none. So if we print that now, it should run through every single one, and we should have some that are in stock, some that are out of stock, and some that don't have any ratings. You can see there's one that doesn't have a rating, which is what we were falling over on before. So what this has done is gone and got the link for every product; we've looped through it, we've loaded each page up, and we have found information for every individual product, including whether it's in stock and the price.

12:44

So that's where I'm going to leave it for this one, but in the next video, part two of this one, we're going to split this up properly into the three steps that I like: we're going to have the request, the parse and the output, and we're going to split them up into functions, and the output is going to be into a CSV or Excel file. We're also going to deal with the pagination: if we scroll right to the bottom of the page we can see that we've got lots of products and quite a few pages, ten plus, so we're going to deal with that as well. We'll end up with a script that's going to load up and render the JavaScript for this whole website and get every product's info from this category. So that'll do it for now; thank you guys, cheers. Like the video if you liked it, subscribe for more web scraping content, and I've got a lot more web scraping content already on my channel, so if you're looking for something specific, go back through my videos; you might find something that's useful to you. Thanks, bye.
