Scraping with Playwright 101 - Easy Mode

John Watson Rooney

29 Mar 2024 · 19:55

Summary

TLDR: The video demonstrates using Playwright for browser automation and web scraping without additional scraping tools. The presenter walks through scraping product data from a paginated e-commerce website, highlighting how to extract JSON-LD schema data directly from product pages. The tutorial covers setting up a Python virtual environment, configuring Playwright, and navigating through the site's pagination. It also gives tips for optimizing the process, such as blocking images to speed up scraping. The code is short, about 43 lines, and shows a practical way to extract and store data using Playwright.

Takeaways

🔍 Before scraping, always inspect the website's structure to understand its behavior, like pagination and product details.
📄 Playwright handles all of the scraping work in this video (navigation, element selection, and data extraction) without a separate parsing or HTTP library.
🧑‍💻 To extract product information, inspect the product page for JSON-LD schema, which contains structured data.
🔄 The script loops through product pages and fetches their details, then moves to the next page.
💻 A Python virtual environment is set up, and Playwright and Rich libraries are installed for web scraping and console formatting.
🌐 Chromium is used as the browser engine for Playwright in this demo, running with 'headless' mode disabled for better results.
📑 Each product page is opened in a new tab to avoid returning to the main product list page, improving efficiency.
📜 The script extracts JSON-LD data from each product page, allowing easy access to product details without heavy parsing.
📋 Pagination is handled by clicking the 'Next Page' button until the last page is reached, which is detected by comparing the number of items shown with the total product count.
🚫 Images and other heavy content are blocked during scraping to speed up the process, reducing data usage and load times.

Q & A

What is the purpose of the video?
-The video aims to demonstrate how to use Playwright for browser automation and data scraping, specifically to extract product information from a website with paginated content.
Why does the presenter prefer using Playwright in this scenario?
-The presenter prefers using Playwright because it allows them to easily access structured data in JSON format from product pages without needing extensive parsing.
How does the presenter handle the pagination on the website?
-The presenter uses a while loop to continuously iterate through the pages. They select the 'next' button element to navigate to subsequent pages until they reach the last page.
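A minimal sketch of such a loop, not the presenter's exact code. The function name `walk_pages`, the callbacks `on_page` and `is_last`, and the selector `a[rel='next']` are all assumptions chosen for illustration; the real selector depends on the site's markup.

```python
def walk_pages(page, on_page, is_last, next_selector="a[rel='next']"):
    """Process every listing page, clicking 'next' until is_last(page) is true.

    page          -- a Playwright Page already showing the first listing page
    on_page       -- callback that scrapes the page currently displayed
    is_last       -- callback that returns True on the final listing page
    next_selector -- hypothetical selector for the 'next' button
    """
    visited = 0
    while True:
        on_page(page)
        visited += 1
        if is_last(page):
            break
        # Move to the next listing page and give it a chance to load.
        page.locator(next_selector).first.click()
        page.wait_for_load_state("domcontentloaded")
    return visited
```

Checking the stop condition before clicking means the last page is still scraped, and the script never tries to click a 'next' button that may no longer exist.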
What is the significance of viewing the page source and searching for 'schema'?
-The presenter searches for 'schema' in the page source to find the structured data in JSON format, which includes detailed product information embedded in a script tag.
How is the JSON data extracted from each product page?
-The JSON data is extracted using a Playwright locator that selects the script tag containing the JSON data. The tag's text is then parsed from a JSON string into a Python object for easier handling.
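One way this extraction could look, as a sketch under assumptions: the helper names `find_product` and `product_from_page` are hypothetical, and the code assumes the product data sits at the top level of a `script[type="application/ld+json"]` tag (or in a top-level list) with `"@type": "Product"`. Sites that nest it under `@graph` would need extra handling.

```python
import json


def find_product(script_texts):
    """Return the first JSON-LD object whose @type is 'Product', or None."""
    for text in script_texts:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue  # skip script tags that do not contain valid JSON
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == "Product":
                return item
    return None


def product_from_page(page):
    """Collect the text of every JSON-LD script tag on a Playwright Page."""
    scripts = page.locator('script[type="application/ld+json"]')
    return find_product(scripts.all_text_contents())
```

Keeping the parsing in a plain function that takes strings makes it easy to check against saved page source without launching a browser.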
What method does the presenter use to run Playwright code?
-The presenter uses a synchronous Playwright context to run the code, which involves launching a Chromium browser, opening a new page, and navigating to the start URL.
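A rough sketch of that setup using Playwright's synchronous API. The function name `open_start_page` is hypothetical and the start URL is whatever site is being scraped; returning the page title is only there to show that the page loaded. The import sits inside the function so the snippet can be loaded on a machine where Playwright is not installed yet.

```python
def open_start_page(start_url, headless=False):
    """Launch Chromium, open one page, load start_url, and return its title."""
    # Imported here so the module still loads without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        # headless=False shows the browser window, as in the demo.
        browser = p.chromium.launch(headless=headless)
        page = browser.new_page()
        page.goto(start_url)
        title = page.title()
        browser.close()
    return title
```

Running it requires `pip install playwright` followed by `playwright install chromium` in the virtual environment.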
How does the presenter handle browser context closing after each product page is scraped?
-After scraping data from each product page, the presenter closes the browser context to prevent excessive open tabs, which helps manage system resources efficiently.
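The open-scrape-close pattern could be sketched as below. This is an illustration rather than the presenter's code: `scrape_in_new_tab` and the `extract` callback are hypothetical names, and the sketch closes the individual tab; whether the video closes the tab or its whole context is not clear from the summary.

```python
def scrape_in_new_tab(browser_or_context, url, extract):
    """Open url in a fresh tab, run extract(tab), and always close the tab.

    browser_or_context -- a Playwright Browser or BrowserContext
                          (both provide new_page())
    extract            -- callback that pulls data out of the loaded tab
    """
    tab = browser_or_context.new_page()
    try:
        tab.goto(url)
        return extract(tab)
    finally:
        # Runs even if goto() or extract() raises, so tabs never pile up.
        tab.close()
```

Because the listing page stays open in its own tab, the script can carry on to the next product link without navigating back.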
What approach does the presenter take to break out of the loop on the last page?
-The presenter breaks out of the loop by comparing two numbers in the page text: the number of items shown on the current page and the total items. When they match, it indicates the last page.
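A sketch of that comparison. The summary does not give the exact wording of the page text, so the format assumed here ("Showing 1 - 24 of 120", with the last two numbers being the last item shown and the total) is a guess, and `is_last_page` is a hypothetical helper name.

```python
import re


def is_last_page(summary_text):
    """Return True when the last item shown equals the total item count.

    Assumes the final two numbers in the text are 'last item shown' and
    'total items', e.g. 'Showing 97 - 120 of 120'.
    """
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", summary_text)]
    if len(numbers) < 2:
        raise ValueError(f"expected at least two numbers in {summary_text!r}")
    shown, total = numbers[-2], numbers[-1]
    return shown >= total
```

With a Playwright page, the text would come from something like `page.locator(...).inner_text()` using whichever selector matches the site's item counter.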
How does the presenter block images during scraping, and why?
-The presenter blocks images by intercepting the page requests for PNG and JPEG files. This saves bandwidth and speeds up the process by preventing unnecessary media from loading.
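Request interception of this kind might look like the following sketch. The helper names `is_blocked` and `block_images` and the extension list are assumptions; the video may instead register a glob pattern directly with `page.route`.

```python
from urllib.parse import urlparse

# Extensions to drop; extend with ".webp", ".gif", etc. if the site uses them.
BLOCKED_EXTENSIONS = (".png", ".jpg", ".jpeg")


def is_blocked(url):
    """Return True if the URL path ends in one of the blocked image extensions."""
    path = urlparse(url).path.lower()
    return path.endswith(BLOCKED_EXTENSIONS)


def block_images(page):
    """Abort image requests on a Playwright Page and let everything else through."""
    def handle(route):
        if is_blocked(route.request.url):
            route.abort()
        else:
            route.continue_()

    # "**/*" matches every request so the handler can decide per URL.
    page.route("**/*", handle)
```

Matching on the URL path rather than the full URL keeps query strings such as `?width=400` from hiding the extension.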
What are the final steps recommended by the presenter for handling the scraped data?
-The presenter suggests saving the JSON data to a file, such as a JSON or JSON Lines file, which allows for easy data handling and recovery if the script is interrupted.
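A small sketch of the JSON Lines option, with hypothetical helper names `append_jsonl` and `read_jsonl` and a file name chosen by the caller. Each product is appended as one line as soon as it is scraped, so anything collected before an interruption is already on disk.

```python
import json


def append_jsonl(path, record):
    """Append one record to a JSON Lines file (one JSON object per line)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Load every record back from a JSON Lines file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A single JSON array file would need to be rewritten in full on every save, which is why appending line by line suits a long-running scrape better.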