Always Check for the Hidden API when Web Scraping

John Watson Rooney
1 Aug 2021 · 11:49

Summary

TL;DR: This video script offers a detailed tutorial on web scraping without relying on Selenium for clicking actions. It guides viewers through inspecting network requests, identifying the right API calls, and using tools like Insomnia to mimic these requests. The script demonstrates how to extract raw product data, navigate through pagination, and convert JSON responses into a CSV file using Python and pandas, providing a streamlined method to scrape large amounts of data efficiently.

Takeaways

  • πŸ” The script discusses an alternative to Selenium for web scraping by analyzing network requests.
  • πŸ‘€ It emphasizes the importance of looking beyond the visual elements and understanding the underlying data flow.
  • πŸ›  The process involves using the browser's 'Inspect Element' tool to access the 'Network' tab and identify relevant requests.
  • πŸ”„ By reloading the page and filtering for 'XHR', one can observe the server requests and find useful data.
  • πŸ”‘ The script introduces a method to mimic server requests to extract raw data without the need for Selenium.
  • πŸ“š The use of API tools like Insomnia or Postman is recommended for crafting and sending custom requests.
  • πŸ”Ž It shows how to dissect and understand the structure of the API response to identify the relevant data.
  • πŸ“ˆ The script demonstrates adjusting query parameters, such as 'page size', to retrieve more data in fewer requests.
  • πŸ”„ It explains how to automate the process of iterating through pages to collect all the necessary information.
  • πŸ“ The final step involves using Python and the 'requests' library to automate the data retrieval process.
  • πŸ“Š The script concludes with converting the JSON data into a pandas DataFrame for easy manipulation and export to CSV.

Q & A

  • What is the main focus of the video script?

    -The main focus of the video script is to demonstrate how to scrape a website for product data without using Selenium, by inspecting network requests and mimicking them in code.

  • Why might one initially think Selenium is necessary for scraping?

    -One might initially think Selenium is necessary for scraping because the product data appears to be loaded dynamically through buttons and interactions that Selenium can automate.

  • What tool is suggested for inspecting network requests in a browser?

    -The 'Inspect Element' tool, specifically the 'Network' tab, is suggested for inspecting network requests in a browser.

  • What is the significance of looking at the 'XHR' requests in the network tab?

    -The significance of looking at the 'XHR' requests is to identify the server-to-server communication that might be loading the product data, which can then be mimicked in code.

  • Why is clicking the 'Load More' button important in the network analysis?

    -Clicking the 'Load More' button is important because it triggers new requests that contain the product data, which is what the scraper needs to identify and replicate.

  • What is the purpose of using an API program like Postman or Insomnia in this context?

    -The purpose of using an API program is to easily create, test, and mimic the network requests that retrieve the product data, and to generate code snippets for automation.

  • How can one determine the number of pages needed to scrape all products?

    -One can determine the number of pages needed by reading the 'total products' value in the API response and dividing it by the 'page size', rounding up (see the arithmetic sketch after this Q&A).

  • What is the benefit of increasing the 'page size' in the API request?

    -Increasing the 'page size' reduces the number of requests needed to scrape all data, making the scraping process more efficient and potentially reducing server load.

  • Why is it recommended to loop through pages in the scraping code?

    -Looping through pages in the scraping code ensures that all product data across all pages is retrieved, not just what is available on the initial page load.

  • How can the scraped data be organized and exported for further use?

    -The scraped data can be organized into a pandas DataFrame, normalized, and then exported to a CSV file for further analysis or use.

  • What is the advantage of using pandas to handle JSON data from the scrape?

    -Pandas allows for easy normalization and flattening of JSON data, making it simpler to manage and export structured data, such as to a CSV file.
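The page-count arithmetic from the video, as a minimal sketch: 1013 total products at a page size of 100 means the division must round up, which is why the video ends up looping over 11 pages.

```python
import math

# Figures reported in the video's API response once pageSize was raised to 100.
total_products = 1013
page_size = 100

# Integer division would give 10 and miss the last partial page,
# so round up: ceil(1013 / 100) == 11.
pages = math.ceil(total_products / page_size)
print(pages)  # 11
```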

Outlines

00:00

πŸ” Beyond the Surface: Understanding Web Scraping Techniques

This paragraph introduces the viewer to the nuances of web scraping, emphasizing the importance of looking beyond the visual elements of a website to understand the underlying data flow. The speaker suggests that instead of using Selenium for clicking buttons, one should inspect the network requests to find the data being fetched. The focus is on identifying the correct API requests that fetch product data, which might not be directly visible in the HTML. The speaker guides the viewer through using the browser's developer tools, specifically the network tab, to monitor and mimic these requests.

05:01

πŸ› οΈ Crafting Efficient Scraping with API Tools

In this segment, the speaker demonstrates how to use API tools like Postman or Insomnia to replicate the network requests discovered in the previous step. The process involves copying a GET request from the browser's network tab, adjusting parameters such as the page size to fetch more data in fewer requests, and then sending the request through the API tool. The speaker also discusses the benefits of using a high page size, such as reducing the number of requests needed and simplifying the process of iterating through pages. The goal is to automate the data fetching process by generating code snippets from the API tool that can be used in a Python script.
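A minimal sketch of that replication step in Python, assuming a placeholder endpoint and parameter names; the real URL, query parameters, and headers are whatever "Copy as cURL" reveals for your target site:

```python
import requests

# Placeholder endpoint and parameter names; substitute whatever the
# Network tab / "Copy as cURL" shows for the site you're scraping.
url = "https://www.example.com/api/products"

params = {
    "pageSize": 100,    # raised from the default, as in the video
    "currentPage": 1,   # the parameter to vary when paging
}

# Start with the full copied header set; trim experimentally later.
headers = {"User-Agent": "Mozilla/5.0"}

r = requests.get(url, params=params, headers=headers, timeout=30)
r.raise_for_status()
data = r.json()  # the raw product data, no HTML parsing needed
```

Passing the query string as a `params` dict rather than a single long URL keeps it editable, which matters once the page number has to change on every request.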

10:02

πŸ“ˆ Automating Data Collection and Exporting to CSV

The final paragraph focuses on automating the data collection process and exporting the results to a CSV file. The speaker shows how to loop through multiple pages of data by adjusting the 'current page' parameter in the API request. Once the data is fetched, the speaker introduces the use of pandas in Python to normalize and flatten the JSON data into a structured format. The ultimate aim is to create a CSV file that contains all the product information, which can be easily opened and analyzed in tools like Excel. The speaker also mentions the possibility of flattening the data further for more detailed analysis, but the primary focus is on demonstrating the end-to-end process of web scraping and data extraction.
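Putting those pieces together, here is a sketch of the end-to-end workflow the video builds up. The endpoint, headers, and the nested key path into the JSON are illustrative stand-ins, since the exact names depend on the site's response:

```python
import requests
import pandas as pd

URL = "https://www.example.com/api/products"   # placeholder endpoint
HEADERS = {"User-Agent": "Mozilla/5.0"}        # trimmed copy of browser headers

res = []  # one dict per product, accumulated across pages

# 1013 products at 100 per page -> 11 pages; range(1, 12) covers 1..11.
for x in range(1, 12):
    r = requests.get(
        URL,
        params={"pageSize": 100, "currentPage": x},
        headers=HEADERS,
        timeout=30,
    )
    r.raise_for_status()
    data = r.json()
    # Key names are illustrative; follow the nesting your own response shows.
    for p in data["result"]["products"]["products"]["product"]:
        res.append(p)

df = pd.json_normalize(res)            # flatten the JSON into columns
df.to_csv("results.csv", index=False)  # omit index=False to keep the index
```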

Keywords

πŸ’‘Web Scraping

Web scraping is the process of programmatically extracting data from websites. It's central to the video's theme, as the script describes a method to scrape product data without using Selenium for clicking buttons. The script demonstrates how to inspect network requests to identify the data source behind a website's interface.

πŸ’‘Inspect Element Tool

The inspect element tool is a feature in web browsers that allows users to view and edit the HTML and CSS of a webpage. In the video, it's used to access the 'Network' tab, which is crucial for analyzing the site's server requests and identifying the data-fetching mechanism.

πŸ’‘XHR (XMLHttpRequest)

XHR is a JavaScript object that allows developers to make asynchronous requests to a server without reloading the web page. The script mentions looking for 'xhr' in the Network tab to find requests that fetch product data, which is a key step in identifying how data is loaded on the page.

πŸ’‘API (Application Programming Interface)

An API is a set of rules and protocols for building and interacting with software applications. The video script refers to using an API tool like Insomnia to mimic and analyze the server requests that fetch product data, demonstrating how to interact with the site's backend directly.

πŸ’‘GET Request

A GET request is a method used in HTTP for requesting data from a specified resource. The script explains that the product data is fetched using a GET request, which is then copied and used in Insomnia to replicate the data retrieval process.

πŸ’‘JSON (JavaScript Object Notation)

JSON is a lightweight data format that is easy for humans to read and write and for machines to parse and generate. The script describes how the product data is returned in JSON format, which needs to be parsed to extract the information.
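As a small illustration of that parsing step (key names are invented here; the video's actual response nests two 'products' keys above the product list):

```python
# A toy response shaped like the one in the video; the real key names
# come from inspecting the saved JSON in Insomnia or VS Code.
data = {
    "result": {
        "products": {
            "products": {
                "product": [{"name": "hat-1"}, {"name": "hat-2"}],
            }
        }
    }
}

# Chain the keys downward until you reach the list of product dicts,
# then sanity-check with len() instead of printing the whole payload.
products = data["result"]["products"]["products"]["product"]
print(len(products))  # 2
```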

πŸ’‘Pagination

Pagination is the process of dividing a large set of data into smaller chunks or pages. The video discusses changing the 'current page' parameter in the GET request to navigate through different pages of product data, illustrating how to access all available data.

πŸ’‘Python Requests Library

The Python 'requests' library is used for making HTTP requests in Python code. The script shows how to generate Python code using Insomnia to automate the process of making requests to the server and retrieving data.

πŸ’‘Pandas Library

Pandas is a Python library for data manipulation and analysis. The script mentions using Pandas to create a DataFrame from the scraped JSON data, which allows for easy manipulation and export of the data to formats like CSV.

πŸ’‘CSV (Comma-Separated Values)

CSV is a file format used to store tabular data, with each line representing a row and commas separating each field. The video script concludes with exporting the scraped data into a CSV file using Pandas, demonstrating a practical application of the scraped data.

πŸ’‘Data Normalization

Data normalization is the process of organizing data to reduce redundancy and improve data integrity. The script refers to using Pandas' 'json_normalize' function to flatten the JSON data structure into a tabular format suitable for analysis and storage.
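A quick sketch of what that flattening looks like on toy data (field names invented):

```python
import pandas as pd

# Toy records shaped like scraped product JSON.
res = [
    {"name": "Hat A", "price": {"list": 30, "sale": 24}, "colors": 3},
    {"name": "Hat B", "price": {"list": 45, "sale": 45}, "colors": 1},
]

# Nested dicts become dot-separated columns such as 'price.list'.
df = pd.json_normalize(res)
print(sorted(df.columns))  # ['colors', 'name', 'price.list', 'price.sale']
```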

Highlights

Scraping a website without needing Selenium by looking at network requests.

Using the inspect element tool to access the network tab for server requests.

Identifying useful information by reloading the page and observing XHR requests.

Mimicking server requests to extract raw data without relying on JavaScript.

Utilizing API programs like Postman or Insomnia to replicate GET requests.

Analyzing the query parameters in the request to understand available options.

Increasing the page size to reduce the number of requests needed for data retrieval.

Adjusting the current page parameter to navigate through different pages of data.

Automating the process of data extraction using Python and the requests library.

Using Insomnia to generate Python code for automated data requests.

Experimenting with headers and payloads to customize the request (see the sketch after this list).

Looping through pages to collect all product information efficiently.

Converting JSON data into a pandas DataFrame for easier manipulation.

Flattening nested JSON data for a cleaner data structure.

Exporting the collected data to a CSV file for further analysis or use.

Demonstrating the process of extracting data that isn't directly available in HTML.

Providing a practical example of web scraping without the need for complex tools.
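On the header experimentation mentioned above, a sketch of one way to do it (URL and header values are placeholders): drop one header at a time, re-send, and watch which removals break the request.

```python
import requests

URL = "https://www.example.com/api/products"  # placeholder endpoint

# The full set that "Copy as cURL" produced (values invented here).
full_headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "application/json",
    "Cookie": "session=abc123",   # often removable, sometimes required
    "Referer": "https://www.example.com/",
}

# Re-send the request with each header removed in turn; a non-200 status
# (or an empty body) flags a header the API actually insists on.
for name in list(full_headers):
    trimmed = {k: v for k, v in full_headers.items() if k != name}
    r = requests.get(URL, headers=trimmed, timeout=30)
    print(f"without {name}: {r.status_code}")
```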

Transcripts

00:00

If you're scraping a site, and when your code looks like this you then see this, you might think that you need to use Selenium to click that button. In fact, what you need to do is look past the pretty pictures and the CSS and the HTML, and see what's actually happening behind all of that. Here's the code for getting all the products and all the data, even some information that isn't actually available in the HTML itself. If you follow along with me for the next seven or eight minutes, I'll show you exactly what I did to get here, what I used and where I looked.

00:30

So the first thing that we're going to do is come back to the website and open up the inspect element tool. This is normally where you'd go to have a look at the HTML, but what we're going to do is head over to the Network tab, click on XHR and reload the page. What that's going to do is show us all of the requests between us and the server, and we're going to see if there's any useful information that pops up.

00:51

If you're new to this sort of thing, this is possibly the first place you'd really want to come and have a look. If you're not new to this, you might think, "I know about this, but I can't see the actual products here." So let's click on a few. There's no product information there; this is just some random JavaScript stuff going on here. Let's make this a bit bigger so we can see. No product information. This looks promising... nope, no good though.

01:15

So here's a nice trick: scroll down to where that "Load More" button was. We're going to click on that, and it's going to fire off some new requests, and some of these are going to be the ones that we're interested in. Let's check this one out at the bottom. There we go, what does this look like? It's got a page size, it's got current page, product information. Excellent. So this is basically all the information that is being taken by the website, run through the JavaScript and turned into what you see on the left-hand side. Because we don't want any of these pretty pictures or any of this stuff, we just want the raw information, what we can do is simply mimic this request in our code to get this exact data out.

01:55

Now, there's a few different ways of doing this. You will need some kind of API program like Postman; I use Insomnia. They're both free, it doesn't matter which. What you want to do, when you find the response here, is go Copy > Copy as cURL (I'm clicking on the Windows one, it doesn't matter). I'm going to come over to my Insomnia, where we have our new environment up here, hit New Request, and we're going to call this one "sg huts". Then, in the GET request, because we saw it was a GET request, I'm just going to paste it and hit Send. Now, if everything works as we hoped it would, that is exactly what we saw in our browser.

02:35

But what can we do here? Well, because we're using our API tool, it split everything out for us nicely, so we can look at the query and see all these nice options that we can easily change. Now the page size, this one I'm thinking is quite interesting. What I like to do when I see a page size is smash it straight up and see what happens. So I'm going to hit 100 and we're going to see what comes back. You might get an error, but in this case what we get is a response that now says page size 100, and if we collapse the whole products section we can see there's a hundred products here. Now, this means a couple of things to me. The first is fewer requests to the server to get the information that I want. It also makes our lives a bit easier, because we know we can change more and more different things within our request. It tells us how many total products we have, and we have all the product information here that we saw before.

03:27

So what can we do to move through pages? Well, if you see up at the top here, it says current page is equal to two, and that's because that's the request I copied. If we come down, you can see current page here; let's just change that to one and run it again, and we've got page one. So now what we want to do is transfer this into something in our code, so we can automate going through all the pages and getting the information that we want. Fortunately, Insomnia and Postman do this for us. You can come over to the request, hit the down button here, click Generate Code, change it to Python and Requests, and there you go: that's everything you need to run in your code editor to get this information out.

04:07

Now, there's a lot of headers and stuff here. You can experiment and see which ones you need to remove, and we can see we've got an empty payload, which is okay. Generally speaking, I tend to just leave all the headers in for now, although if you wanted to, you could experiment here and start removing bits of information that you may or may not need, to customize your request. I'm going to copy this to the clipboard, come back to our code editor and paste it all in. Now, the headers here have everything, including our user agent and our cookie. The cookie could be important; it might be or it might not be, and you can try getting rid of it in your request if you want to. But because for me this is just getting this information out once, this time I'm going to leave it in and let it be. So I'm going to collapse the headers, because I'm happy with the way they look.

04:55

Now, here's our query string. This is all on one line, I know it's not very tidy, and you'd definitely want to tidy this up, but just for the case of this example what I'm going to do is scroll right across until I see that current page is equal to one (you see it there), put my f-string here, change this to our x, and hit save. Then we're going to come all the way back here and tidy some of this up. There is no payload, so we can remove that. We're going to put our x equal to one up here, just for demonstration purposes, hit run and see what we get back. Hopefully we get back all the information we just looked at. It all just flicked by; I'm guessing that is exactly it. That's good.

05:40

So what can we do from here? Well, we know that there were around about a thousand products. What you could do is make a request here, check out the response, grab the number of products, and then work out how many pages you needed. We can see over here: total products, 1013. So you could say 1013 at 100 per page, okay, how many pages do we need? We're going to need 11. You could do that if you were trying to make this repeatable, but in this case I'm just going to make a loop that goes through x from 1 to 11 and gets all the information. So let's do that here. Let's indent this, and then here we're going to write "for x in range", and we'll do 1 to 12, because that will be from 1 inclusive up to, but not including, 12. Let's get these headers collapsed again; they're taking up an awful lot of screen space.

06:34

There we go. Now, to deal with the response, we don't want the text response, we want the JSON response, and "response" is a long word, so let's get rid of that: r.json(). There we go. Let's put that inside our loop and run it, and if we get some info flicking by, we know that our loop is working. We're just checking for any errors. That seems good to me. Fine, and now we just need to do something with this data.

06:58

The easiest thing to do is to take it and put it into a pandas DataFrame, because we can normalize all the JSON, generally flatten it out nicely, and get something quick and easy that we can export to a CSV file or whatever output we need. So what we're going to do is import pandas as pd and save that. Now we just need to figure out where all the actual JSON product information that we want lives. The easiest way to do that is to come back to our API client; let me just smush this over a bit. If you're trying to work out how to get your data properly out of your JSON response, you can click up here and save it, or you can copy the whole lot, paste it into a VS Code file so you can look through it, examine it, and maybe write some code to get through it that way. But I don't need to do that in this one. We have one key up here that opens, so we need this key; then we have a products key underneath, then we have another products key, and then we have a product list, and that is the product list here that has 100 items. So I'm going from here, down here, here and here; I'm going to copy this: two lots of products and a product.

08:12

Okay, so let's do data is equal to r.json(), and now let's print out data and the first key, then the next key, which was products (you can see how I'm just chaining these together as I go down the tree), products, and then product (it needs the quote marks). What I'm going to do, instead of printing that all out, is just print the length of it, because it is a list. So I'm going to run this, and we should just get numbers each time: 100, 100, 100, etc. There we go, that's all the products. There were 99 on one page for some reason; one, two, three, four: page four only had 99, that's interesting. It's starting to slow down a bit; maybe we need to time our requests a bit better. So I'm going to stop that, but we know that it's working.

09:08

So what do we want to do with this information? We want to loop through each and every product in here and add it to a new product list. So up here I'm going to create a results list, res, and say "for p" (for product) "in" the product list we just saw, and in the loop do res.append(p). At the end of this, let's print the length of res just to check that this is working. Let's make the page range smaller, one to three, so we can quickly double-check that we get plenty of results in our results list. 200 results, two pages: seems good to me.

09:50

Now what we can do is take this results list and create a DataFrame. So, df (remember to call your DataFrame something better than df if this is in some kind of code that you're not just running through like I am), and we want to do pd.json_normalize and give it our res, and then we can just do df.to_csv, and let's call this one "first results.csv". Now, before running this, let's increase the pages; let's just do five, so four pages, meaning we should have 400 results. Let's let that run (maybe you would want to put some kind of print statement in so you can see what's going on), and it's finished, and here's our first results file. There was my test file; we can see that we have this information here.

10:38

If I open it up in Excel, we'll have a better idea of what we've actually got. So here's our results file. We can see we have our index (you may or may not want that), there's the number of colors you can have here, and then there's also the name of the product, the list price, all the information that was in that JSON format, all the way along to the URL, model name, and some other stuff. Now, we can see in here that we actually have a list of dictionaries depicting all the other colors and so on, so that's got all that information, and this is what I was showing you when I said that it wasn't in the HTML to start with. I haven't flattened this out, but you could quite easily write something that would flatten it all for you. This is just a rough demonstration of how to get the information, not necessarily how to deal with it all, but it's not too difficult to flatten this all out.

11:24

So if we pop back to our code now: what we started with was this, then we ran into a problem where we needed to load more and go through the pages, and we ended up with this, which gets us all of the information, super quick, super easy, and straight to a CSV file. If you've enjoyed this, you should check this video out, because it's got more information on how to web scrape like this.


Related Tags
Web Scraping, Data Extraction, API Requests, JavaScript, CSS/HTML, XHR Requests, Insomnia Tool, Postman Tool, JSON Parsing, Pandas DataFrame, CSV Export