Always Check for the Hidden API when Web Scraping
Summary
TL;DR: This video script offers a detailed tutorial on web scraping without relying on Selenium for clicking actions. It guides viewers through inspecting network requests, identifying the right API calls, and using tools like Insomnia to mimic these requests. The script demonstrates how to extract raw product data, navigate through pagination, and convert JSON responses into a CSV file using Python and pandas, providing a streamlined method to scrape large amounts of data efficiently.
Takeaways
- 🔍 The script discusses an alternative to Selenium for web scraping by analyzing network requests.
- 👀 It emphasizes the importance of looking beyond the visual elements and understanding the underlying data flow.
- 🛠 The process involves using the browser's 'Inspect Element' tool to access the 'Network' tab and identify relevant requests.
- 🔄 By reloading the page and filtering for 'XHR', one can observe the server requests and find useful data.
- 🔑 The script introduces a method to mimic server requests to extract raw data without the need for Selenium.
- 📚 The use of API tools like Insomnia or Postman is recommended for crafting and sending custom requests.
- 🔎 It shows how to dissect and understand the structure of the API response to identify the relevant data.
- 📈 The script demonstrates adjusting query parameters, such as 'page size', to retrieve more data in fewer requests.
- 🔄 It explains how to automate the process of iterating through pages to collect all the necessary information.
- 📝 The final step involves using Python and the 'requests' library to automate the data retrieval process.
- 📊 The script concludes with converting the JSON data into a pandas DataFrame for easy manipulation and export to CSV.
Q & A
What is the main focus of the video script?
-The main focus of the video script is to demonstrate how to scrape a website for product data without using Selenium, by inspecting network requests and mimicking them in code.
Why might one initially think Selenium is necessary for scraping?
-One might initially think Selenium is necessary for scraping because the product data appears to be loaded dynamically through buttons and interactions that Selenium can automate.
What tool is suggested for inspecting network requests in a browser?
-The 'Inspect Element' tool, specifically the 'Network' tab, is suggested for inspecting network requests in a browser.
What is the significance of looking at the 'XHR' requests in the network tab?
-The significance of looking at the 'XHR' requests is to identify the server-to-server communication that might be loading the product data, which can then be mimicked in code.
Why is clicking the 'Load More' button important in the network analysis?
-Clicking the 'Load More' button is important because it triggers new requests that contain the product data, which is what the scraper needs to identify and replicate.
What is the purpose of using an API program like Postman or Insomnia in this context?
-The purpose of using an API program is to easily create, test, and mimic the network requests that retrieve the product data, and to generate code snippets for automation.
How can one determine the number of pages needed to scrape all products?
-One can determine the number of pages needed by examining the 'total products' value in the API response and dividing it by the 'page size' to calculate the total number of pages.
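For example, with the numbers from this video, a quick calculation (`math.ceil` rounds up so the final partial page isn't dropped):

```python
import math

total_products = 1013  # value reported in the API response in the video
page_size = 100        # the page size set in the request

pages_needed = math.ceil(total_products / page_size)
print(pages_needed)  # 11
```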
What is the benefit of increasing the 'page size' in the API request?
-Increasing the 'page size' reduces the number of requests needed to scrape all data, making the scraping process more efficient and potentially reducing server load.
Why is it recommended to loop through pages in the scraping code?
-Looping through pages in the scraping code ensures that all product data across all pages is retrieved, not just what is available on the initial page load.
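A minimal sketch of such a loop using the `requests` library; the endpoint URL, parameter names, and JSON key are placeholders standing in for the ones found by inspecting the network tab:

```python
import requests

URL = "https://www.example.com/api/products"  # hypothetical endpoint

all_products = []
for page in range(1, 12):  # pages 1 through 11
    r = requests.get(URL, params={"pageSize": 100, "currentPage": page})
    r.raise_for_status()
    # The key path into the JSON depends on the site's response shape.
    all_products.extend(r.json().get("products", []))
```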
How can the scraped data be organized and exported for further use?
-The scraped data can be organized into a pandas DataFrame, normalized, and then exported to a CSV file for further analysis or use.
What is the advantage of using pandas to handle JSON data from the scrape?
-Pandas allows for easy normalization and flattening of JSON data, making it simpler to manage and export structured data, such as to a CSV file.
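A minimal sketch of that last step, assuming `all_products` is the list of product dictionaries collected from the API:

```python
import pandas as pd

# json_normalize flattens nested dictionaries into dotted column names.
df = pd.json_normalize(all_products)
df.to_csv("first_results.csv")
```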
Outlines
🔍 Beyond the Surface: Understanding Web Scraping Techniques
This paragraph introduces the viewer to the nuances of web scraping, emphasizing the importance of looking beyond the visual elements of a website to understand the underlying data flow. The speaker suggests that instead of using Selenium for clicking buttons, one should inspect the network requests to find the data being fetched. The focus is on identifying the correct API requests that fetch product data, which might not be directly visible in the HTML. The speaker guides the viewer through using the browser's developer tools, specifically the network tab, to monitor and mimic these requests.
🛠️ Crafting Efficient Scraping with API Tools
In this segment, the speaker demonstrates how to use API tools like Postman or Insomnia to replicate the network requests discovered in the previous step. The process involves copying a GET request from the browser's network tab, adjusting parameters such as the page size to fetch more data in fewer requests, and then sending the request through the API tool. The speaker also discusses the benefits of using a high page size, such as reducing the number of requests needed and simplifying the process of iterating through pages. The goal is to automate the data fetching process by generating code snippets from the API tool that can be used in a Python script.
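In code terms, the adjustment described here is a one-parameter change before re-sending the request. A hedged sketch; the endpoint and the original parameter values are placeholders for whatever was copied from the browser:

```python
import requests

URL = "https://www.example.com/api/catalog/products"  # hypothetical endpoint

# Parameters as copied from the browser's request (placeholder values).
params = {"currentPage": 2, "pageSize": 20}

# Bump the page size and see whether the server honors it.
params["pageSize"] = 100

r = requests.get(URL, params=params)
data = r.json()
print(data.get("pageSize"), len(data.get("products", [])))  # hypothetical keys
```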
📈 Automating Data Collection and Exporting to CSV
The final paragraph focuses on automating the data collection process and exporting the results to a CSV file. The speaker shows how to loop through multiple pages of data by adjusting the 'current page' parameter in the API request. Once the data is fetched, the speaker introduces the use of pandas in Python to normalize and flatten the JSON data into a structured format. The ultimate aim is to create a CSV file that contains all the product information, which can be easily opened and analyzed in tools like Excel. The speaker also mentions the possibility of flattening the data further for more detailed analysis, but the primary focus is on demonstrating the end-to-end process of web scraping and data extraction.
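Condensing the whole pipeline the outline describes into one hedged sketch (the endpoint, headers, and JSON key names are placeholders for the ones captured from the real site):

```python
import requests
import pandas as pd

URL = "https://www.example.com/api/catalog/products"   # hypothetical endpoint
HEADERS = {"user-agent": "Mozilla/5.0"}                # copied headers, trimmed

res = []
for x in range(1, 12):  # 11 pages x 100 products covers the ~1013 items
    r = requests.get(URL, headers=HEADERS,
                     params={"pageSize": 100, "currentPage": x})
    # "results" is a stand-in for the nested key path the video walks down.
    for p in r.json().get("results", []):
        res.append(p)

pd.json_normalize(res).to_csv("first_results.csv")
```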
Keywords
💡Web Scraping
💡Inspect Element Tool
💡XHR (XMLHttpRequest)
💡API (Application Programming Interface)
💡GET Request
💡JSON (JavaScript Object Notation)
💡Pagination
💡Python Requests Library
💡Pandas Library
💡CSV (Comma-Separated Values)
💡Data Normalization
Highlights
Scraping a website without needing Selenium by looking at network requests.
Using the inspect element tool to access the network tab for server requests.
Identifying useful information by reloading the page and observing XHR requests.
Mimicking server requests to extract raw data without relying on JavaScript.
Utilizing API programs like Postman or Insomnia to replicate GET requests.
Analyzing the query parameters in the request to understand available options.
Increasing the page size to reduce the number of requests needed for data retrieval.
Adjusting the current page parameter to navigate through different pages of data.
Automating the process of data extraction using Python and the requests library.
Using Insomnia to generate Python code for automated data requests.
Experimenting with headers and payloads to customize the request.
Looping through pages to collect all product information efficiently.
Converting JSON data into a pandas DataFrame for easier manipulation.
Flattening nested JSON data for a cleaner data structure.
Exporting the collected data to a CSV file for further analysis or use.
Demonstrating the process of extracting data that isn't directly available in HTML.
Providing a practical example of web scraping without the need for complex tools.
Transcripts
If you're scraping a site and your code looks like this, and then you see this, you might think that you need to use Selenium to click that button, when in fact what you need to do is look past the pretty pictures, the CSS, and the HTML, and see what's actually happening behind all of that. Here's the code for getting all the products and all the data, even some information that isn't actually available in the HTML itself. And if you follow along with me for the next seven or eight minutes, I'll show you exactly what I did to get here, what I used, and where I looked.
So the first thing we're going to do is come back to the website and open up the inspect element tool. This is normally where you'd go to have a look at the HTML, but instead we're going to head over to the Network tab, click on XHR, and reload the page. What that's going to do is show us all of the requests between us and the server here, and we're going to see if there's any useful information that pops up.
If you're new to this sort of request, this is possibly the first place you really want to come and have a look. If you're not new to this, you might think, "I know about this, but I can't see the actual products here." So let's click on a few. There's no product information there; this is just some random JavaScript stuff going on here. Let's make this bigger so we can see. No product information. This looks promising... nope, no good though. So here's a nice trick: scroll to the bottom, to where that "Load More" button was. We're going to click on that, and it's going to fire off some new requests, and some of these are going to be the ones that we're interested in. Let's check this one out at the bottom.
There we go. What does this look like? This looks like it's got a page size, it's got a current page, product information... excellent. So this is basically all the information that is being taken by the website, run through the JavaScript, and turned into what you see on the left-hand side. Because we don't want any of these pretty pictures or any of this stuff, we just want the raw information, what we can do is simply mimic this request in our code to get this exact data out.
Now, there are a few different ways of doing this. You will need some kind of API program like Postman; I use Insomnia. They're both free, and it doesn't matter which. What you want to do, when you find the response here, is go here and choose Copy, then Copy as cURL (I'm clicking on the Windows one; it doesn't matter). I'm going to come over to my Insomnia, where we have our new environment up here, and I'm going to hit New Request. We're going to call this one "sg huts", and then, because we saw it was a GET request, I'm just going to paste it into the GET request and hit send.
Now, if everything works as we hope it will, that is exactly what we saw in our browser. But what can we do here? Well, because we're using our API tool, it has split everything out for us nicely, so we can look at the query and see all these nice options that we can easily change. Now the page size, this one I'm thinking is quite interesting. What I like to do when I see a page size is smash it straight up and see what happens, so I'm going to hit 100 and we're going to see what comes back. You might get an error, but in this case our response now says page size 100, and if we collapse the whole products object, there are a hundred products here. Now, this means a couple of things to me. The first is fewer requests to the server to get the information that I want, and it also makes our lives a bit easier, because we know we can change more and more different things within our request. It tells us how many total products we have, and we have all the product information here that we saw before, so you can see all this information here.
So what can we do to move through pages? Well, if you look up the top here, it says current page is equal to two, and that's because that's the request I copied. If we come down, you can see "current page" here. Let's just change that to one and run it again, and we've got page one. So now what we want to do is transfer this into something in our code, so we can automate going through all the pages and getting the information that we want. Fortunately, Insomnia and Postman do this for us: you can come over to the request, hit the down button here, click Generate Code, and change it to Python and requests. And there you go, that's all the information you need to run in your code editor to get this information out.
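The generated snippet comes out looking roughly like the following. This is a sketch rather than the exact export: the URL, query parameters, and header values all come from whichever request you copied, so the ones shown here are placeholders:

```python
import requests

url = "https://www.example.com/api/catalog/products"  # hypothetical endpoint

querystring = {"pageSize": "100", "currentPage": "2"}  # names from the inspected request

headers = {
    "user-agent": "Mozilla/5.0 ...",  # copied from the browser
    "cookie": "session=...",          # may or may not be needed
}

response = requests.request("GET", url, headers=headers, params=querystring)
print(response.text)
```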
Now, there are a lot of headers and stuff here; you can experiment and change and see which ones you need to remove, and we can see we've got an empty payload, which is okay. Generally speaking, I tend to just leave all the headers in for now, although if you wanted to, you could experiment here and start removing bits of information that you may or may not need, to customize your request. I'm going to copy this to the clipboard, come back to our code editor, and paste it all in. Now, the headers here have everything, including our user agent and our cookie. The cookie could be important; it might be or it might not be. You can try getting rid of it in your request if you want to, but because for me this is just getting this information out once, this time I'm just going to leave it in and let it be there. So I'm going to collapse the headers, because I'm happy with the way they look.
Now here's our query string. This is all on one line, and I know it's not very tidy; you'd definitely want to tidy this up, but just for the case of this example, what I'm going to do is scroll right across until I see where the current page is equal to one. You see it there, so I'm just going to go ahead and make this an f-string and change that value to our x variable, and hit save. Then we're going to come all the way back here and tidy some of this up: there is no payload, so we can remove that, and we're going to put x equal to 1 up here just for demonstration purposes. We're going to hit run and see what we get back; hopefully we get back all the information we just looked at.
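The edit just described amounts to templating the page number into the query string. A heavily shortened, hypothetical version (the real string copied from the browser is much longer):

```python
x = 1  # current page; shortly this will come from a loop

# Only currentPage is templated; everything else stays as copied.
querystring = f"pageSize=100&currentPage={x}"
```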
It all just flicked by, and I'm guessing that is exactly it. That's good.
So what can we do from here? Well, we know that there were around about a thousand products. What you could do is make a request here, check out the response, grab the number of products, and then work out how many pages you needed. We can see over here: total products, 1013. So you could say 1013 divided by 100 per page; okay, how many pages do we need? We're going to need 11. You could do that if you were trying to make this repeatable, but in this case I'm just going to make a loop that goes through x from 1 to 11 and gets all the information. So let's do that here. Let's indent this, and then here we're going to make our for x in range loop, and we'll do 1 to 12, because that will be from 1 inclusive up to, but not including, 12. Let's get these headers collapsed again; they're taking up an awful lot of screen space. There we go. Now, to deal with the response, we don't want the text response, we want the JSON response, and "response" is a long word, so let's get rid of that: r.json(). There we go. Let's put that inside our loop and run it, and if we see some info flicking by, we know that our loop is working; we're just checking for any errors. That seems to be good to me. Fine, and now we just need to do something with this data.
something with this data
now the easiest thing to do is to take
it and put it into a panda's data frame
because we can normalize all the json
and we can generally flatten it out
nicely
and get something quick and easy that we
can export to a csv
file or whatever output we need so what
we're going to do is we're going to
import
pandas as pd we're going to save that
and now we just need to figure out where
all the actual json product information
is that we want
easiest way to do that is to come back
to our api client
just smush this over a bit if you're
trying to work out how to
get your data properly out of your json
response
you can click up here and you can save
it or you can copy the whole lot
paste it into a vs code file so you can
sort of look through it and
examine it maybe write some code to get
through it that way
but i don't need to do that in this one
i know that there's it we have one up
here
that opens so then we have this so we
need this key
then we have a products key underneath
then we have another products key
and then we have a product list and that
is the product list
here that has 100 items so i'm going
from here
down here here and here so i'm going to
copy this one two lots of products and a
product
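In code, that drill-down is just chained dictionary lookups. The first key isn't named in the transcript, so `TOP_KEY` below is a placeholder for whatever the first key of your response is:

```python
data = r.json()

TOP_KEY = "catalog"  # placeholder: the unnamed first key in the response

# Walk the tree exactly as it appears in the API client:
# <top key> -> "products" -> "products" -> "product" (a list of 100 items)
product_list = data[TOP_KEY]["products"]["products"]["product"]
print(len(product_list))  # expect 100 per full page
```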
Okay, so let's do data equals r.json(), and now let's print out data with the first key, then the next key, which was "products". You can see how I'm just chaining these together as I go down the tree: "products", and then "product" (it needs the quote marks). What I'm going to do, instead of printing that all out, is just print the length of it, because it is a list. So I'm going to run this, and we should just get numbers each time: 100, 100, 100, etc. There we go, that's all the products. There were 99 on that page for some reason. One, two, three, four... page four only had 99. That's interesting; it's starting to slow down a bit. Maybe we need to time our requests a bit better, so I'm going to stop that, but we know that it's working.
So what do we want to do with this information? We want to loop through each and every product in here and add it to a new results list. So up here I'm going to create a results list, and I'm going to say for p (for product) in, and this is where we just saw all of our product lists, so we can go in here and do res.append(p). At the end of this, let's print the length of res, just to check that this is working. Let's make the page numbers smaller, so we do one to three, so we can just double-check that we get plenty of results in our results list. 200 results for two pages; seems good to me.
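A sketch of that accumulation step, reusing the placeholder endpoint and `TOP_KEY` from above; note that `res.extend(...)` would do the same job as the explicit append loop:

```python
import requests

URL = "https://www.example.com/api/catalog/products"  # hypothetical endpoint
TOP_KEY = "catalog"  # placeholder for the unnamed first key in the response

res = []  # results list, created outside the page loop

for x in range(1, 3):  # two pages for a quick sanity check
    data = requests.get(URL, params={"pageSize": 100, "currentPage": x}).json()
    for p in data[TOP_KEY]["products"]["products"]["product"]:
        res.append(p)
    # equivalent one-liner: res.extend(...) on the same list

print(len(res))  # 200 for two full pages of 100
```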
Now what we can do is take this results list and create a DataFrame, so df (remember to call your DataFrame something better than df if this is in some kind of code that you're not just running through like I am). We want to do pd.json_normalize and give it our res, and then we can just do df.to_csv, and let's call this one "first_results.csv". Before running this, let's increase the pages; let's just do five, so four pages, which means we should have 400 results. Let's let that run; maybe you would want to put some kind of print statement in so you can see what's going on, but it's finished, and here's our first results file (that other one was my test file). We can see that we have this information here, so if I open it up in Excel, we'll have a better idea of what we've actually got. So here's our results file. We can see we have our index, which you may or may not want; there's a color, and the number of colors you can have; and then there's also the name of the product, the list price, all the information that was in that JSON format, all the way along to the URL, the model name, and some other stuff.

Now, we can see in here that we actually have a list of dictionaries depicting all the other colors, etc., so that's got all that information, and this is what I was showing you when I said that it wasn't in the HTML to start with. Now, I haven't flattened this out, but you could quite easily write something that would basically flatten this all out for you. This is just a rough demonstration of how to get the information, not necessarily how to deal with it all, but it's not too difficult to flatten all of this out.
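One hedged way to do that extra flattening with pandas: `json_normalize` can explode a nested list into its own rows via `record_path`, keeping parent fields via `meta`. The `"otherColors"` and `"name"` keys here are hypothetical stand-ins for the real column names:

```python
import pandas as pd

# res is the list of product dicts collected earlier. This produces one row
# per nested color entry, carrying the product name alongside each.
colors_df = pd.json_normalize(
    res,
    record_path="otherColors",  # hypothetical key holding the list of dicts
    meta=["name"],              # hypothetical parent field to keep
)
colors_df.to_csv("colors_flat.csv")
```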
If we pop back to our code now: what we started with was this, then we ran into a problem where we needed to load more and go through the pages, and we ended up with this, which is basically getting us all of the information, super quick, super easy, and straight to a CSV file. If you've enjoyed this, you should check this video out, because it's got more information on how to web scrape like this.