If you're web scraping, don't make these mistakes

The PyCoach
20 May 2024 · 12:06

Summary

TLDR: This video offers essential tips for web scraping, highlighting common mistakes to avoid such as not checking for a site's hidden API, sending too many requests in a short time, and sticking to a single scraping tool. It emphasizes the importance of adapting to changes in websites and recommends tools like Selenium, Scrapy, and no-code options like ChatGPT for different scraping needs. The script also suggests using Bright Data's scraping solutions for complex sites and provides a bonus tip for handling login systems with strong security measures.

Takeaways

  • πŸ” Always check if a website has a hidden API before scraping its HTML document, as it can simplify the scraping process.
  • πŸ› οΈ Use developer tools to inspect network requests and identify potential APIs by looking for 'xhr' and 'API' in the network tab.
  • πŸ“š Learn to recognize APIs by examining JSON data in the preview tab to ensure it contains the desired information.
  • 🚫 Avoid sending too many requests in a short period to prevent server overload and being blocked by the website.
  • ⏱️ Implement delays in your scraping scripts using libraries like 'time' in Python to manage request frequency.
  • πŸ”„ Websites change over time, so expect and prepare for your scraper to require updates in the future.
  • πŸ›‘οΈ Use tools like Bright Data's Web Unlocker and Scraping Browser to handle complex scraping tasks, including CAPTCHA solving and IP rotation.
  • πŸ”§ Diversify your scraping toolset; don't rely solely on one tool like Selenium, as different tools are better suited for different tasks.
  • πŸ€– Selenium is ideal for dynamic websites and tasks that mimic human behavior but can be slow for large-scale scraping.
  • πŸ•ΈοΈ Scrapy is fast and suitable for medium to large projects but does not handle JavaScript natively and requires additional tools like Splash.
  • πŸ”‘ For websites with strong login systems, consider using an existing browser session with Selenium to bypass login hurdles.

Q & A

  • What is the first mistake mentioned in the script when it comes to web scraping?

    -The first mistake is scraping the HTML document of a site without checking if there is a hidden API that could be used for scraping instead.

  • Why is using an API for web scraping preferable to using Selenium for clicking buttons and scrolling?

    -Using an API is preferable because it is easier and avoids the hassle of building a scraper with Selenium, which requires actions like clicking buttons and handling infinite scrolling.

  • How can one check if a website has an API available for scraping?

    -One can check for an API by right-clicking on the website, selecting 'Inspect', going to the 'Network' tab, selecting 'XHR', and reloading the website to see if there are elements with the word 'API' in the link.

  • What should be done after identifying a potential API element on a website?

    -After identifying a potential API element, one should click on it, go to the 'Preview' tab to see the data in JSON format, and expand elements to verify if it contains the data they want to scrape.
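
    For the demo site shown in the video, that last step might look like the minimal sketch below; the endpoint URL and field names follow what the Preview tab shows for quotes.toscrape.com and are assumptions for any other site:

    import requests

    # Endpoint copied from the Network tab (hypothetical for other sites).
    url = "https://quotes.toscrape.com/api/quotes?page=1"

    data = requests.get(url).json()  # the JSON seen in the Preview tab

    # Each entry carries the 'author', 'text', and 'tags' fields verified
    # in the Preview tab; print the text as a quick check.
    for quote in data.get("quotes", []):
        print(quote.get("text"))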

  • What is the common issue with using for loops for web scraping?

    -The common issue with using for loops is that they can send too many requests in a short period of time, which can cause high traffic on the website and potentially burden or even shut down the server.

  • How can one avoid sending too many requests in a short period of time while web scraping?

    -One can avoid this by adding delays with Python's 'time' library, by using explicit waits in Selenium, or by setting the DOWNLOAD_DELAY parameter in Scrapy's custom settings.
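
    As a minimal sketch of the time-library approach (the links and the one-second delay are placeholders to adapt per project):

    import time
    import requests

    # Placeholder list of links; in a real project this could be thousands long.
    links = ["https://example.com/page/1", "https://example.com/page/2"]

    for link in links:
        response = requests.get(link)
        # ... parse response.text with Beautiful Soup or another library ...
        time.sleep(1)  # pause between requests; 2-5 seconds is gentler on servers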

  • Why is it important to expect the unexpected when web scraping?

    -It is important because websites change over time, and new features can be added that might break your existing scraper, making it necessary to update your scraping strategy to handle these changes.

  • What does the script suggest as a solution for overcoming challenges like captchas and IP blocks during web scraping?

    -The script suggests using Bright Data's scraping solutions, such as the Web Unlocker and the Scraping Browser, which offer features like browser fingerprinting, CAPTCHA solving, and IP rotation.

  • Why should one avoid sticking to only one web scraping tool?

    -One should avoid sticking to only one tool because different tools have different strengths and weaknesses, and using a variety of tools can help efficiently and effectively scrape different types of websites.

  • What is the advantage of using Selenium for web scraping?

    -Selenium is advantageous for scraping dynamic websites and for projects that require mimicking human behavior like clicking buttons or scrolling, due to its ease of learning and use.

  • What is the suggested method for handling login systems with strong security measures during web automation?

    -The suggested method is to connect Selenium to an existing browser by manually logging in once and then using that browser session for subsequent script runs to avoid captchas or two-factor authentication.
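
    A rough sketch of that setup, assuming placeholder paths and the port 9222 mentioned in the video (the exact commands are in the video description):

    # First, launch Chrome from a terminal with remote debugging enabled, e.g.:
    #   "C:\Program Files\Google\Chrome\Application\chrome.exe"
    #       --remote-debugging-port=9222 --user-data-dir="C:\chrome-profile"
    # Log in manually once in that window, then run this script.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    # Attach to the already-running browser session instead of opening a new one.
    options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
    driver = webdriver.Chrome(options=options)

    print(driver.title)  # now controls the logged-in browser, no captcha or 2FA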

Outlines

00:00

πŸ” Avoiding Common Web Scraping Mistakes

This paragraph highlights the importance of checking for a hidden API before scraping a website's HTML document. It explains that using an API can simplify the scraping process by eliminating the need for actions like clicking buttons or scrolling. The speaker demonstrates how to inspect a website for an API by using developer tools and the network tab, and how to verify if the API contains the desired data. The summary also advises on copying the API link and using a web scraping library to extract data, emphasizing the ease and efficiency of API scraping over traditional methods.

05:02

⏱ Managing Request Rates in Web Scraping

The second paragraph discusses the pitfalls of sending too many requests in a short period, which can lead to server strain and potential blocking by the website. It suggests adding delays to for loops when scraping multiple pages or websites to reduce the frequency of requests. The speaker provides examples of how to implement delays using Python's time library and explains alternative methods such as explicit waits in Selenium. The paragraph also touches on the use of Scrapy's custom settings to add delays, advocating for responsible scraping practices to avoid being blocked by websites.

10:02

πŸ”„ Expecting the Unexpected in Web Scraping

This paragraph emphasizes the importance of anticipating changes in websites, which can render a web scraper obsolete over time. The speaker advises web scrapers to consider potential issues such as IP blocking, new features, captchas, and HTML changes. The paragraph introduces Bright Data as a sponsor offering solutions like the Web Unlocker and the Scraping Browser to handle complex scraping challenges, including captcha solving and IP rotation. The speaker encourages viewers to start a free trial with Bright Data to enhance their web scraping capabilities.

πŸ›  Diversifying Your Web Scraping Toolset

The speaker warns against relying solely on one web scraping tool, as different tools have their own strengths and weaknesses. They provide a brief overview of various tools, including ChatGPT for no-code scraping, Selenium for dynamic websites, and Scrapy for large-scale projects. The paragraph also mentions the limitations of each tool, such as ChatGPT's inability to handle pagination and Scrapy's lack of JavaScript handling by default. The speaker encourages viewers to explore different tools and choose the one that best fits their specific scraping needs.

πŸ” Handling Login Systems in Web Scraping

The final paragraph offers a tip for dealing with websites that require login, such as those with captchas, QR code verification, or two-factor authentication. The speaker suggests connecting Selenium to an existing browser to avoid repeated logins and captchas. They provide a step-by-step guide on how to launch Chrome with a specific profile and connect Selenium to that browser session. This method allows for seamless scraping without manual login interruptions, streamlining the web scraping process for websites with robust login systems.

Keywords

πŸ’‘Web Scraping

Web scraping is the process of programmatically extracting information from websites. It is central to the video's theme, discussing common mistakes and best practices. The script mentions web scraping in the context of using tools like Selenium and Beautiful Soup, and the importance of checking for a site's API before scraping.

πŸ’‘API

An API, or Application Programming Interface, is a set of rules and protocols for accessing a service or data. In the script, the presenter advises checking for a hidden API on a website before scraping its HTML document, as APIs can simplify the scraping process by providing direct access to data without the need for actions like clicking or scrolling.

πŸ’‘Selenium

Selenium is a popular tool for automating web browsers. It is mentioned in the script as a method to handle dynamic websites that require user interaction, such as clicking buttons or scrolling. However, it is also noted that Selenium can be slow and not ideal for large-scale scraping projects.

πŸ’‘Beautiful Soup

Beautiful Soup is a Python library used for parsing HTML and XML documents. It is referenced in the script as a common tool for web scraping, but the presenter cautions against sending too many requests in a short period of time, which can lead to being blocked by websites.

πŸ’‘XHR

XHR stands for XMLHttpRequest, a way for web pages to communicate with servers without refreshing the page. The script describes using the 'Network' tab in developer tools to look for XHR requests, which can indicate the presence of an API endpoint for data extraction.

πŸ’‘Infinite Scrolling

Infinite scrolling is a web design technique where content loads automatically as the user scrolls down the page. The script uses this as an example of a feature that can make web scraping more complex, but which can be avoided by finding an API endpoint.

πŸ’‘Delays

In the context of web scraping, delays refer to the intentional pauses between requests to a server. The script emphasizes the importance of adding delays to avoid overwhelming the server and getting blocked, suggesting the use of Python's time library or Selenium's explicit waits.

πŸ’‘IP Blocking

IP blocking is a security measure where a server refuses requests from a specific IP address. The script warns about the risk of IP blocking when scraping without delays and suggests using tools like Bright Data's Web Unlocker to handle such challenges.

πŸ’‘Bright Data

Bright Data is a platform offering various web scraping solutions. The script mentions it as a sponsor and highlights its features like browser fingerprinting, CAPTCHA solving, and IP rotation to overcome scraping obstacles.

πŸ’‘CAPTCHA

A CAPTCHA is a type of challenge-response test used to determine whether the user is human or a bot. The script discusses CAPTCHA as a common obstacle in web scraping and automation, and mentions Bright Data's Scraping Browser as a tool that can help analyze and solve CAPTCHAs.

πŸ’‘Scrapy

Scrapy is an open-source and collaborative web crawling framework for Python. The script describes Scrapy as suitable for medium to large web scraping projects due to its speed and ability to make requests in parallel, but notes that it does not handle JavaScript natively.

πŸ’‘JavaScript Rendering

JavaScript rendering refers to the process of executing JavaScript code on a webpage to generate dynamic content. The script mentions that some websites rely on JavaScript for their content, and tools like Bright Data's Scraping Browser can render this JavaScript to extract data.

Highlights

Scraping an API is easier than using Selenium for actions like clicking buttons or infinite scrolling.

To find a site's API, filter network traffic by XHR in developer tools and look for the word 'API' in the request link.

Verify an element is the API by checking the 'Preview' tab for JSON-formatted data.

Copying the API link allows for data extraction using web scraping libraries.

Avoid sending too many requests in a short time to prevent server burden and being blocked.

Use Python's 'time' library to add delays between requests to mitigate high traffic.

Explicit waits in Selenium can be used to manage delays based on element presence.

For Scrapy, use custom settings to add a DOWNLOAD_DELAY to the scraper.

Expect changes in websites and ensure your scraper can handle unexpected updates.

Bright Data offers solutions for scraping tough websites with features like browser fingerprinting and CAPTCHA solving.

Avoid sticking to one scraping tool; consider alternatives based on the project's needs.

ChatGPT is a no-code option that can extract data from a website without you writing any code.

Selenium is suitable for dynamic websites and small to medium projects but is slower for large-scale scraping.

Scrapy is ideal for medium to large projects due to its speed but requires additional tools for JavaScript rendering.

Connecting Selenium to an existing browser can bypass login systems and CAPTCHAs.

Use Chrome's executable path and a profile folder to connect Selenium for automated logins.

This video provides tips and tools to avoid common web scraping mistakes and enhance efficiency.

Transcripts

00:00

If you're scraping a website, you shouldn't make these big mistakes. Some of these mistakes I used to make myself as a beginner, while I discovered the rest in the hundreds of questions that students of my web scraping course asked. So let's start with this list.

All right, the first mistake we make is scraping the HTML document of a site without checking if the site has a hidden API that we could scrape. If you scrape an API, you won't have to add actions to your scraper such as clicking on buttons, scrolling, and more. Here's an example: right now I'm on the website quotes.toscrape.com, and if you check this website and scroll down, you'll see there is a next-page button that you have to click to load more quotes. There is also a version of this website with infinite scrolling, which means you have to scroll down multiple times to load more data. Now, if you start building a web scraper right away without further analysis, you'll probably think that using Selenium is the best idea for scraping this website, because with Selenium you can click on a button to go to the next page, and you can also do some workarounds to handle infinite scrolling. That's not a bad idea, but you can avoid the hassle of building a Selenium scraper from scratch by checking if there is an API available on the website. Scraping an API is way easier than clicking on buttons with Selenium or doing infinite scrolling, and I'll show you how to do it in a few steps.

All right, the first thing we have to do is right-click on the website and select Inspect. After we do this we'll see the developer tools, and here, instead of choosing the Elements tab, we go to the Network tab, select XHR, and reload the website. After we do this, we'll see some new elements in the tab; for example, here I have 'quotes' followed by a question mark, and if you hover over this new element, you'll see that the word 'API' is in the link, which is an easy way to recognize it. Of course, most websites are not going to make it obvious where the API is located, and a good way to check whether an element has the data you want to extract is to click on it. So here I'm going to click on this element, and to see whether it is the API and has the data we want to extract, we go to the Preview tab. In the Preview tab you'll see the elements in JSON format, and you can expand them to verify whether they contain the data you want to extract. For example, if I expand the 'quotes' element, there are different numbers from zero to nine, and you can expand one of these numbers to see what's inside. Here I expand the first element, and there is an 'author' element and also 'text', and if you check on the left you can see that, for example, the first author is Albert Einstein and the tags are 'change' and 'deep-thoughts', the same as on the left. So we can verify that it's the same data, and if we scrape this element we'll get all the data on the website and avoid the hassle of building a Selenium scraper that clicks on buttons or does infinite scrolling. Now that we've recognized that this is the element that has the API, we have to copy this link and then use any web scraping library to get the data from this API.

03:07

Okay, the next mistake we should avoid is sending too many requests in a short period of time. Usually when we scrape websites using Selenium or Beautiful Soup, a common step is adding a for loop, either to scrape multiple websites or to scrape multiple pages or links. Now, websites hate these for loops, because when we use them we send too many requests within seconds. Say we want to send requests to a list of links using a for loop; most of us will do something like this: create a for loop and call requests.get inside it. If there are thousands of links, you'll send thousands of requests, which can cause high traffic on the website and burden it; in the worst scenario, you could shut down the server. This is bad practice because you're going to get blocked by the website, and an easy way to stay safe is to reduce the rate of requests by adding some delays.

A simple way to add delays in Python is to import the time library, and then, inside the for loop, call time.sleep and specify the number of seconds to wait. With these delays we avoid sending multiple requests in a short period of time. I set this to 1 second, but you can set it to 2, 3, or 5 seconds; it depends on you, on your project, and on the website you're scraping. Now, the time library offers the easiest way to add delays, but there are other ways. For example, you can also add explicit waits using Selenium, and here I'm showing you an example: you have to import expected conditions from Selenium. An explicit wait works differently from an implicit wait, because here the driver waits some seconds until an action happens. For example, here the driver waits at most 5 seconds until the presence of an element is located; if the element is located in 3 seconds, the driver only waits 3 seconds. This is different from the time library, because when we call time.sleep(2) we wait 2 seconds no matter what happens. So I've shown you how to add delays in Beautiful Soup using the time library, and in Selenium using either the time library or explicit waits. In case you're using Scrapy, you can add delays through custom settings: just add the DOWNLOAD_DELAY parameter and specify the delay you want for your scraper. So the next time you're scraping a website, don't forget to add delays.
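
Here is a minimal sketch of the explicit-wait pattern described above, assuming the demo quotes site and an illustrative locator:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get("https://quotes.toscrape.com/scroll")  # illustrative target page

    # Wait at most 5 seconds for the element; if it appears after 3 seconds,
    # the driver only waits 3 seconds (unlike time.sleep, which always waits).
    quote = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.CLASS_NAME, "quote"))
    )
    print(quote.text)
    driver.quit()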

05:38

All right, the next tip I want to give you, as somebody who has been scraping websites for many years, is to expect the unexpected. Remember that websites change over time, and new features can be added to a website that you're scraping, so your scraper might be working right now, but in some weeks, months, or years that same script will stop working. These updates to websites are what makes web scraping a bit more complex, so we have to make sure our web scraper can handle any new roadblock. After building your web scraper, I want you to ask yourself these questions: What if I get IP blocked? What if new features are added to the page? What if there is a captcha? What if the HTML of the website changes? The answer to some of these questions will be as simple as updating the XPath you're using to locate an element on the website, but in other cases it will be complex to deal with these roadblocks, like captchas or IP blocks.

A tool I recommend to overcome these challenges is Bright Data, which is the sponsor of this video. Bright Data offers different scraping solutions to unlock and scrape even the toughest websites. One of them is the Web Unlocker, which you can use to access public websites at scale, employing features like browser fingerprinting, captcha solving, IP rotation, and more. But my favorite Bright Data scraping solution is the Scraping Browser, which is good for projects that require browser interactions and can connect to third-party tools like Playwright and Selenium. Bright Data's Scraping Browser has features like captcha solving, which will help you analyze and solve captchas you might find when logging in to websites; automatic IP rotation, to reduce the risk of IP bans; and JavaScript rendering, to extract data from websites that rely on dynamic elements. You can connect the Scraping Browser to Puppeteer, Playwright, or Selenium, and it'll handle all proxying and unlocking operations behind the scenes. Start a free trial today and unlock the internet with Bright Data using the link in the description. Thanks to Bright Data for sponsoring this video, and now let's go back to the video.

07:48

All right, the next mistake we should avoid at all costs is sticking to one scraping tool. I can't tell you how many times I've seen people struggling to scrape a website with tool A when they could have used tool B to easily get the job done. The problem is that they only know how to use one tool, for example Selenium, and want to stick to it even though there are other options out there that would help them scrape the website more easily or more efficiently. I made a complete video explaining the pros and cons of some Python libraries for web scraping, and even other no-code options like ChatGPT.

I'm going to start with ChatGPT, because it's a tool you can use for web scraping without writing any code. The easiest way to scrape a website with ChatGPT is by using a scraper GPT: you only need to give it the link of the website you want to scrape and specify which data you want to extract, and the GPT will extract all of it in just a few seconds. Some of the disadvantages of ChatGPT for web scraping are that it doesn't support dynamic websites, and on websites with pagination it will not always be possible to scrape data from multiple pages; it works in some scenarios, but not in others. Then we have Python libraries for web scraping, and let's start with Selenium. Selenium is a great option for scraping dynamic websites, and it's also easy to learn; in general, Selenium is a good option for small or medium web scraping projects where you have to mimic human behavior, like clicking on buttons or scrolling down on pages. However, the biggest disadvantage of Selenium is that it's slow, so it's not a good option for large web scraping projects. Then we have Scrapy, which is good for medium to large web scraping projects. One of the biggest advantages of Scrapy is speed: Scrapy spiders don't have to wait to make requests one at a time; they can make requests in parallel. One of the drawbacks of Scrapy is that it doesn't handle JavaScript by default but relies on Splash for that job, so you have to learn Splash to be able to scrape JavaScript-driven websites. Now, this is only a summary; if you want more details about each web scraping library, check out my other YouTube video in the description below.
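
To make the comparison concrete, here is a minimal sketch of a Scrapy spider for the demo quotes site, combining parallel pagination with the DOWNLOAD_DELAY setting mentioned earlier (the spider name and CSS selectors are illustrative):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/page/1/"]
        # Per-spider custom settings; DOWNLOAD_DELAY pauses between requests.
        custom_settings = {"DOWNLOAD_DELAY": 1}

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow pagination; Scrapy schedules these requests concurrently.
            next_page = response.css("li.next a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

Saved as quotes_spider.py, this could be run with 'scrapy runspider quotes_spider.py -o quotes.json' to write the results to a file.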

09:56

All right, finally, as a bonus, I want to give you a tip that might come in handy when scraping or automating websites that require you to log in. Some sites have strong login systems, for example captchas, WhatsApp Web's QR code verification, or Tinder's two-factor authentication. A simple way I deal with this is by connecting Selenium to an existing browser: basically, I do the login manually once and then connect my script to that browser, so I don't have to log in again. To do so, we have to run a command in the terminal. This command needs the Chrome executable path, and we also need to create a new profile folder and copy its path. First, to obtain the Chrome executable path, right-click on Chrome and copy the location of chrome.exe; Mac users go to Applications, right-click on Google Chrome, select Show Package Contents, then Contents, then MacOS, and copy the path of the file named Google Chrome. Next we create a profile folder: just create an empty folder and copy its path. Now that you have the Chrome executable path and the profile folder path, open up a terminal and paste the Chrome executable path followed by a parameter with the port 9222 and another parameter with the path of the profile (I leave these commands in the description below). After running this command, Chrome will be launched on port 9222. Then, in your script, you have to connect to the same port using Selenium. For this you just need to add a few lines of code: first import Options, instantiate Options, and then use the add_experimental_option method to connect Selenium to the session open on port 9222. Finally, don't forget to add the options parameter to the Chrome method and set it equal to the options variable. That's it: the next time you run the script, you'll connect to the same browser and avoid captchas or two-factor authentication.

11:52

And that's it for this video. These are the mistakes we should avoid and the tips we should follow when scraping websites. If you know another mistake or tip we should follow, let me know in the comments section below. That's it for this video, and I'll see you in the next one.


Related Tags
Web Scraping, API Checking, Selenium, Beautiful Soup, Rate Limiting, Data Extraction, Dynamic Websites, Scraper Efficiency, Bright Data, Automation