If you're web scraping, don't make these mistakes
Summary
TL;DR: This video script offers essential tips for web scraping, highlighting common mistakes to avoid, such as not checking for a site's hidden API, sending too many requests in a short time, and sticking to a single scraping tool. It emphasizes the importance of adapting to changes in websites and recommends tools like Selenium, Scrapy, and no-code options like ChatGPT for different scraping needs. The script also suggests using Bright Data's scraping solutions for complex sites and provides a bonus tip for handling login systems with strong security measures.
Takeaways
- 🔍 Always check if a website has a hidden API before scraping its HTML document, as it can simplify the scraping process.
- 🛠️ Use developer tools to inspect network requests: filter by 'XHR' in the Network tab and look for 'API' in request URLs to spot candidate endpoints.
- 📚 Learn to recognize APIs by examining JSON data in the preview tab to ensure it contains the desired information.
- 🚫 Avoid sending too many requests in a short period to prevent server overload and being blocked by the website.
- ⏱️ Implement delays in your scraping scripts using libraries like 'time' in Python to manage request frequency.
- 🔄 Websites change over time, so expect and prepare for your scraper to require updates in the future.
- 🛡️ Use tools like Bright Data's Web Unlocker and Scraping Browser to handle complex scraping tasks, including CAPTCHA solving and IP rotation.
- 🔧 Diversify your scraping toolset; don't rely solely on one tool like Selenium, as different tools are better suited for different tasks.
- 🤖 Selenium is ideal for dynamic websites and tasks that mimic human behavior but can be slow for large-scale scraping.
- 🕸️ Scrapy is fast and suitable for medium to large projects but does not handle JavaScript natively and requires additional tools like Splash.
- 🔑 For websites with strong login systems, consider using an existing browser session with Selenium to bypass login hurdles.
Q & A
What is the first mistake mentioned in the script when it comes to web scraping?
-The first mistake is scraping the HTML document of a site without checking if there is a hidden API that could be used for scraping instead.
Why is using an API for web scraping preferable to using Selenium for clicking buttons and scrolling?
-Using an API is preferable because it is easier and avoids the hassle of building a scraper with Selenium, which requires actions like clicking buttons and handling infinite scrolling.
How can one check if a website has an API available for scraping?
-One can check for an API by right-clicking on the website, selecting 'Inspect', going to the 'Network' tab, selecting 'XHR', and reloading the website to see if there are elements with the word 'API' in the link.
What should be done after identifying a potential API element on a website?
-After identifying a potential API element, one should click on it, go to the 'Preview' tab to see the data in JSON format, and expand elements to verify if it contains the data they want to scrape.
What is the common issue with using for loops for web scraping?
-The common issue with using for loops is that they can send too many requests in a short period of time, which can cause high traffic on the website and potentially burden or even shut down the server.
How can one avoid sending too many requests in a short period of time while web scraping?
-One can avoid this by adding delays using the 'time' library in Python, by using explicit waits with Selenium, or by setting the DOWNLOAD_DELAY setting in Scrapy.
Why is it important to expect the unexpected when web scraping?
-It is important because websites change over time, and new features can be added that might break your existing scraper, making it necessary to update your scraping strategy to handle these changes.
What does the script suggest as a solution for overcoming challenges like captchas and IP blocks during web scraping?
-The script suggests using Bright Data's scraping solutions, such as the Web Unlocker and the Scraping Browser, which offer features like browser fingerprinting, captcha solving, and IP rotations.
Why should one avoid sticking to only one web scraping tool?
-One should avoid sticking to only one tool because different tools have different strengths and weaknesses, and using a variety of tools can help efficiently and effectively scrape different types of websites.
What is the advantage of using Selenium for web scraping?
-Selenium is advantageous for scraping dynamic websites and for projects that require mimicking human behavior like clicking buttons or scrolling, due to its ease of learning and use.
What is the suggested method for handling login systems with strong security measures during web automation?
-The suggested method is to connect Selenium to an existing browser by manually logging in once and then using that browser session for subsequent script runs to avoid captchas or two-factor authentication.
Outlines
🔍 Avoiding Common Web Scraping Mistakes
This paragraph highlights the importance of checking for a hidden API before scraping a website's HTML document. It explains that using an API can simplify the scraping process by eliminating the need for actions like clicking buttons or scrolling. The speaker demonstrates how to inspect a website for an API by using developer tools and the network tab, and how to verify if the API contains the desired data. The summary also advises on copying the API link and using a web scraping library to extract data, emphasizing the ease and efficiency of API scraping over traditional methods.
⏱ Managing Request Rates in Web Scraping
The second paragraph discusses the pitfalls of sending too many requests in a short period, which can lead to server strain and potential blocking by the website. It suggests adding delays to for loops when scraping multiple pages or websites to reduce the frequency of requests. The speaker provides examples of how to implement delays using Python's time library and explains alternative methods such as explicit waits in Selenium. The paragraph also touches on the use of Scrapy's custom settings to add delays, advocating for responsible scraping practices to avoid being blocked by websites.
🔄 Expecting the Unexpected in Web Scraping
This paragraph emphasizes the importance of anticipating changes in websites, which can render a web scraper obsolete over time. The speaker advises web scrapers to consider potential issues such as IP blocking, new features, captchas, and HTML changes. The paragraph introduces Bright Data as a sponsor offering solutions like the Web Unlocker and the Scraping Browser to handle complex scraping challenges, including captcha solving and IP rotation. The speaker encourages viewers to start a free trial with Bright Data to enhance their web scraping capabilities.
🛠 Diversifying Your Web Scraping Toolset
The speaker warns against relying solely on one web scraping tool, as different tools have their own strengths and weaknesses. They provide a brief overview of various tools, including ChatGPT for no-code scraping, Selenium for dynamic websites, and Scrapy for large-scale projects. The paragraph also mentions the limitations of each tool, such as ChatGPT's inability to handle pagination and Scrapy's lack of JavaScript handling by default. The speaker encourages viewers to explore different tools and choose the one that best fits their specific scraping needs.
🔐 Handling Login Systems in Web Scraping
The final paragraph offers a tip for dealing with websites that require login, such as those with captchas, QR code verification, or two-factor authentication. The speaker suggests connecting Selenium to an existing browser to avoid repeated logins and captchas. They provide a step-by-step guide on how to launch Chrome with a specific profile and connect Selenium to that browser session. This method allows for seamless scraping without manual login interruptions, streamlining the web scraping process for websites with robust login systems.
Keywords
💡Web Scraping
💡API
💡Selenium
💡Beautiful Soup
💡XHR
💡Infinite Scrolling
💡Delays
💡IP Blocking
💡Bright Data
💡CAPTCHA
💡Scrapy
💡JavaScript Rendering
Highlights
Scraping an API is easier than using Selenium for actions like clicking buttons or infinite scrolling.
To find a site's API, filter network traffic by 'XHR' and look for the word 'API' in the request link.
Verify an element is the API by checking the 'Preview' tab for JSON-formatted data.
Copying the API link allows for data extraction using web scraping libraries.
Avoid sending too many requests in a short time to prevent server burden and being blocked.
Use Python's 'time' library to add delays between requests to mitigate high traffic.
Explicit waits in Selenium can be used to manage delays based on element presence.
For Scrapy, use custom settings to add a 'download delay' to the scraper.
Expect changes in websites and ensure your scraper can handle unexpected updates.
Bright Data offers solutions for scraping tough websites with features like browser fingerprinting and CAPTCHA solving.
Avoid sticking to one scraping tool; consider alternatives based on the project's needs.
ChatGPT is a no-code option for web scraping: give it a link and it extracts the requested data.
Selenium is suitable for dynamic websites and small to medium projects but is slower for large-scale scraping.
Scrapy is ideal for medium to large projects due to its speed but requires additional tools for JavaScript rendering.
Connecting Selenium to an existing browser can bypass login systems and CAPTCHAs.
Launch Chrome from its executable path with a dedicated profile folder, then connect Selenium to that session for automated logins.
This video provides tips and tools to avoid common web scraping mistakes and enhance efficiency.
Transcripts
If you're scraping a website, you shouldn't make these big mistakes. Some of them I used to make myself as a beginner, while I discovered the rest in the hundreds of questions that students of my web scraping course asked. So let's start with the list.

All right, the first mistake we make is scraping the HTML document of a site without checking whether the site has a hidden API that we could scrape instead. If you scrape an API, you won't have to add actions to your scraper, such as clicking on buttons, scrolling, and more. Here's an example. Right now I'm on the website quotes.toscrape.com, and if you check this website and scroll down, you'll see there's a next-page button you have to click to load more quotes. There's also a version of this website with infinite scrolling, which means you have to scroll down multiple times to load more data. Now, if you start building a web scraper right away without further analysis, you'll probably think Selenium is the best choice for this website, because with Selenium you can click a button to go to the next page, and you can also do some workarounds to handle infinite scrolling. That's not a bad idea, but you can avoid the hassle of building a Selenium scraper from scratch by checking whether there's an API available on the website. Scraping an API is way easier than clicking buttons with Selenium or handling infinite scrolling, and I'll show you how to do it in a few steps.

First, right-click on the website and select Inspect. This opens the developer tools, and here, instead of the Elements tab, we go to the Network tab, select XHR, and reload the website. After we do this, we'll see some new entries in the tab; for example, here I have 'quotes' followed by a question mark, and if you hover over this new entry, you'll see the word 'API' in the link, which is an easy way to recognize it. Of course, most websites won't make it this obvious where the API is located, and a good way to check whether an entry has the data you want to extract is to click on it. So here I click on this entry, and to see whether it's the API with the data we want, we go to the Preview tab. In the Preview tab you'll see the elements in JSON format, and you can expand some of them to verify they contain the data you want to extract. For example, if I expand the 'quotes' element, there are numbers from 0 to 9, and you can expand any of them to see what's inside. Here I expand the first one, and there's an 'author' element along with the text and tags; if you check on the left, you can see that the first author is Albert Einstein and the tags include 'change' and 'deep-thoughts', the same as on the left side of the page. So we can verify it's the same data, and if we scrape this element, we'll get all the data on the website and avoid the hassle of building a Selenium scraper that clicks buttons or does infinite scrolling. Now that we've identified the entry that exposes the API, we copy its link and then use any web scraping library to get the data from this API.
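For reference, here's a minimal sketch of that last step using the requests library. The endpoint URL and the field names (quotes, author, text) match what quotes.toscrape.com exposed at the time of the video, but verify them against the response you see in your own Network tab:

```python
import requests

# The hidden API endpoint found in the Network tab; the page number
# is just a query parameter, so pagination becomes trivial.
url = "https://quotes.toscrape.com/api/quotes?page=1"

data = requests.get(url).json()

# Each quote object carries the same fields seen in the Preview tab.
for quote in data["quotes"]:
    print(quote["author"]["name"], "-", quote["text"])
```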
OK, the next mistake we should avoid is sending too many requests in a short period of time. Usually when we scrape websites using Selenium or Beautiful Soup, a common step is adding a for loop, either to scrape multiple websites, multiple pages, or multiple links. Websites hate these for loops, because when we use them we send too many requests within seconds. Say we want to send requests to a list of links using a for loop: most of us will do something like this, creating a for loop and calling requests.get. Now, if there are thousands of links, you'll send thousands of requests, which can cause high traffic on the website and burden it; in the worst scenario, you could shut down their server. This is bad practice, because you're going to get blocked by the website. An easy way to stay safe here is to reduce the rate of requests by adding some delays, and a simple way to add delays in Python is to import the time library. So you import time, and then inside the for loop you call time.sleep and specify the number of seconds to wait. With these delays, we avoid sending multiple requests in a short period of time. I set this to 1 second, but you can set it to 2, 3, or 5 seconds; it depends on you, your project, and the website you're scraping.
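A rough sketch of the pattern just described (the list of links is a placeholder):

```python
import time

import requests

links = ["https://example.com/page/1", "https://example.com/page/2"]  # placeholder links

for link in links:
    response = requests.get(link)
    # ... parse response.text here ...
    time.sleep(1)  # wait 1 second between requests; tune per project and per site
```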
Now, the time library offers the easiest way to add delays, but there are other ways. For example, you can also add explicit waits using Selenium; for that, you have to import expected conditions from Selenium. Explicit waits work differently from implicit waits, because here the driver waits until an action happens, up to a limit. For example, the driver can wait at most 5 seconds until the presence of an element is located; if the element is located in 3 seconds, the driver only waits 3 seconds. That's different from the time library, where time.sleep(2) waits 2 seconds no matter what happens. So that's how to add delays in Beautiful Soup using the time library, and in Selenium using either the time library or explicit waits.
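Here's roughly what that explicit wait looks like in Selenium; the URL and the "quote" class name are taken from the demo site used earlier, so swap in your own locator:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get("https://quotes.toscrape.com/scroll")

# Wait at most 5 seconds for a quote to appear; if it shows up in
# 3 seconds, the driver only waits 3 seconds before moving on.
first_quote = WebDriverWait(driver, 5).until(
    EC.presence_of_element_located((By.CLASS_NAME, "quote"))
)
print(first_quote.text)
```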
And in case you're using Scrapy, you can add delays through custom settings: just add the DOWNLOAD_DELAY parameter and specify the delay you want for your scraper. So the next time you're scraping a website, don't forget to add delays.
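In Scrapy that looks something like the minimal spider below; DOWNLOAD_DELAY is the real setting name, while the spider itself is just an illustrative skeleton:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    # Per-spider settings: Scrapy waits 2 seconds between requests.
    custom_settings = {
        "DOWNLOAD_DELAY": 2,
    }

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
```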
All right, the next tip I want to give you, as somebody who has been scraping websites for many years, is to expect the unexpected. Remember that websites change over time, and new features can be added to a website you're scraping, so your scraper might be working right now, but in a few weeks, months, or years that same script may stop working. These website updates are what make web scraping a bit more complex, so we have to make sure our web scraper can handle any new roadblock. After building your web scraper, I want you to ask yourself: what if? What if I get IP blocked? What if new features are added to the page? What if there's a CAPTCHA? What if the HTML of the website changes? The answer to some of these questions will be as simple as updating the XPath you use to locate an element on the page, but other roadblocks, like CAPTCHAs or IP blocks, are more complex to deal with. A tool I recommend to overcome these challenges is Bright Data, the sponsor of this video. Bright Data offers different scraping solutions to unlock and scrape even the toughest websites. One of them is the Web Unlocker, which you can use to access public websites at scale, employing features like browser fingerprinting, CAPTCHA solving, IP rotation, and more. But my favorite Bright Data scraping solution is the Scraping Browser, which is good for projects that require browser interactions and can connect to third-party tools like Playwright and Selenium. Bright Data's Scraping Browser has features like CAPTCHA solving, which helps you analyze and solve CAPTCHAs you might run into when logging in to websites; automatic IP rotation, to reduce the risk of IP bans; and JavaScript rendering, to extract data from websites that rely on dynamic elements. You can connect the Scraping Browser to Puppeteer, Playwright, or Selenium, and it'll handle all proxying and unlocking operations behind the scenes. Start a free trial today and unlock the internet with Bright Data using the link in the description. Thanks to Bright Data for sponsoring this video, and now let's get back to it.
All right, the next mistake we should avoid at all costs is sticking to one scraping tool. I can't tell you how many times I've seen people struggling to scrape a website with tool A when they could have used tool B to easily get the job done. The problem is that they only know how to use one tool, for example Selenium, and want to stick to it even though there are other options out there that would help them scrape the website more easily or more efficiently. I made a complete video explaining the pros and cons of some Python libraries for web scraping, and even other no-code options like ChatGPT.

I'm going to start with ChatGPT, because it's a tool you can use for web scraping without writing any code. The easiest way to scrape a website with ChatGPT is by using a scraper GPT: you only need to give it the link of the website you want to scrape and specify which data you want to extract, and the GPT will pull all of this data in just a few seconds. Among the disadvantages of ChatGPT for web scraping: it doesn't support dynamic websites, and on websites with pagination it usually won't be possible to scrape data from multiple pages; it works in some scenarios, but in others it doesn't.

Then we have the Python libraries for web scraping, starting with Selenium. Selenium is a great option for scraping dynamic websites, and it's also easy to learn. In general, Selenium is a good choice for small or medium web scraping projects where you have to mimic human behavior, like clicking buttons or scrolling down pages. However, Selenium's biggest disadvantage is that it's slow, so it's not a good option for large web scraping projects. Then we have Scrapy, which is good for medium to large web scraping projects. One of Scrapy's biggest advantages is speed: Scrapy spiders don't have to wait to make requests one at a time; they can make requests in parallel. One of Scrapy's drawbacks is that it doesn't handle JavaScript by default; it relies on Splash for that, so you'll have to learn Splash to scrape JavaScript-dependent websites. Now, this is only a summary; if you want more details about each web scraping library, check out my other YouTube video linked in the description below.
All right, finally, as a bonus, I want to give you a tip that might come in handy when scraping or automating websites that require you to log in. Some sites have strong login systems: for example, OpenAI's CAPTCHAs, WhatsApp Web's QR code verification, or Tinder's two-factor authentication. A simple way I deal with this is by connecting Selenium to an existing browser: basically, I do the login manually once and then connect my script to that browser, so I don't have to log in again. To do so, we have to run a command in the terminal. This command needs the Chrome executable path, and we also need to create a new profile folder and copy its path. First, to obtain the Chrome executable path, right-click on Chrome and copy the location of chrome.exe; Mac users, go to Applications, right-click on Google Chrome, select Show Package Contents, then Contents, then MacOS, and copy the path of the file named Google Chrome. Next, create the profile folder: just create an empty folder and copy its path. Now that you have the Chrome executable path and the profile folder path, open a terminal and paste the Chrome executable path followed by a parameter with the port 9222 and another parameter with the path of the profile (I leave these commands in the description below). After running this command, Chrome will be launched on port 9222.
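The exact command isn't shown in the transcript, but based on the description it likely looks like the following, using Chrome's standard remote-debugging and profile flags; both paths are placeholders for the ones you copied:

```
# Windows (path copied from chrome.exe)
"C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222 --user-data-dir="C:\path\to\profile-folder"

# macOS (path copied from the app bundle)
"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" --remote-debugging-port=9222 --user-data-dir="/path/to/profile-folder"
```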
Then, in your script, you have to connect to the same port using Selenium. For this, you just need to add a few lines of code: first import Options, instantiate an Options object, then use the add_experimental_option method to connect Selenium to the session open on port 9222, and finally don't forget to pass the options parameter to the Chrome method, set equal to your options variable.
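Those few lines look roughly like this; debuggerAddress is the experimental option Selenium uses to attach to an already-running Chrome:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Attach to the Chrome instance already listening on port 9222.
options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")

driver = webdriver.Chrome(options=options)
print(driver.title)  # this now drives the browser where you logged in manually
```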
The next time you run the script, you'll connect to the same browser and avoid CAPTCHAs or two-factor authentication. And that's it for this video: these are the mistakes we should avoid and the tips we should follow when scraping websites. If you know another mistake or tip we should follow, let me know in the comments section below. That's it for this video, and I'll see you in the next one.