Web Scraping Tutorial | Data Scraping from Websites to Excel | Web Scraper Chrome Extension
Summary
TL;DR: In this tutorial video, Rafi demonstrates how to use a free Google Chrome extension called 'Web Scraper' to extract data from multiple web pages automatically. He provides a step-by-step guide on scraping information from the Yellow Pages business directory, focusing on car insurance service providers in New York City. The data collected includes business names, phone numbers, addresses, websites, and email addresses. The video also covers navigating pagination and setting up selectors for efficient data extraction.
Takeaways
- The video demonstrates how to scrape data from websites using a free Google Chrome extension called Web Scraper.
- The target data source is the Yellow Pages business directory, specifically car insurance service providers in New York City and State.
- The scraping process collects details such as business name, phone number, address, website, and email address from multiple pages.
- Web Scraper automates the process by moving from one page to the next after scraping the 30 results on each page.
- To begin, users install the Web Scraper extension from the Chrome Web Store and then reload the target website.
- Using 'Inspect Element', users can identify elements and create a sitemap with selectors to scrape specific information from the webpage.
- The tutorial covers selecting business listings, extracting information such as name, phone number, and website, and handling multi-page navigation for continuous scraping.
- The tool lets users adjust scraping intervals to avoid hitting website rate limits or being blocked.
- After the scraping finishes, users can export the gathered data to a CSV file for further use and cleaning.
- The video emphasizes cleaning data after extraction, such as removing the 'mailto:' prefix from email addresses.
Q & A
What is the main topic of the video?
-The main topic of the video is how to scrape data from websites using a free Google Chrome extension called Web Scraper.
What specific information is the presenter going to extract from the Yellow Pages business directory?
-The presenter is going to extract car insurance service providers' information from New York City and State, including business profiles' names, phone numbers, addresses, website addresses, and email addresses.
How does the tool handle pagination on the website?
-The tool automatically visits subsequent pages after completing the data extraction from the first page, continuing to scrape data from each page.
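The page-to-page loop the extension automates can be sketched in plain Python. This is an illustrative sketch, not the extension's internals: `fetch_page` is a stand-in for a real HTTP request, and the page data is invented to mirror the video's 30-results-per-page example.

```python
# Sketch of the pagination loop the Web Scraper extension automates.
# fetch_page is a stand-in for a real HTTP request; the data is invented.

FAKE_SITE = {
    1: {"listings": [f"Business {i}" for i in range(1, 31)], "next": 2},
    2: {"listings": [f"Business {i}" for i in range(31, 61)], "next": 3},
    3: {"listings": [f"Business {i}" for i in range(61, 91)], "next": None},
}

def fetch_page(page_number):
    """Pretend to download one results page (30 listings each)."""
    return FAKE_SITE[page_number]

def scrape_all(start=1):
    """Follow 'next' links until there are no more pages."""
    results, page = [], start
    while page is not None:
        data = fetch_page(page)
        results.extend(data["listings"])
        page = data["next"]          # None on the last page ends the loop
    return results

listings = scrape_all()
print(len(listings))  # 30 results per page, 3 pages -> 90
```

The key design point matches the video: the scraper never hard-codes a page count, it simply keeps following the "next" link until none exists.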
What is the name of the Google Chrome extension used in the video?
-The Google Chrome extension used in the video is called 'Web Scraper'.
How does one install the Web Scraper extension on Google Chrome?
-To install the Web Scraper extension, one needs to visit the extension page, click on 'Add to Chrome', and then confirm by clicking 'Add extension'.
What is a sitemap in the context of web scraping with the Web Scraper extension?
-A sitemap in the context of web scraping with the Web Scraper extension is a configuration that defines how the tool navigates and extracts data from a website.
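Sitemaps in the Web Scraper extension can be exported and imported as JSON. The sketch below shows the rough shape of such an export; the exact field names may differ by extension version, and the start URL and CSS selectors are invented placeholders rather than Yellow Pages' real markup.

```python
import json

# Rough shape of an exported Web Scraper sitemap (field names may vary by
# extension version; the URL and CSS selectors below are placeholders).
sitemap = {
    "_id": "yellow-page-extraction",
    "startUrl": ["https://www.yellowpages.com/new-york-ny/car-insurance"],
    "selectors": [
        {"id": "links", "type": "SelectorLink", "multiple": True,
         "parentSelectors": ["_root", "pages"], "selector": "a.business-name"},
        {"id": "name", "type": "SelectorText", "multiple": False,
         "parentSelectors": ["links"], "selector": "h1"},
        {"id": "pages", "type": "SelectorLink", "multiple": True,
         "parentSelectors": ["_root"], "selector": ".pagination a"},
    ],
}

print(json.dumps(sitemap, indent=2))
```

Note how the `parentSelectors` entries encode the navigation graph the video later visualizes: `pages` hangs off the root, and `links` is reachable from both the root and `pages`, which is what makes the scrape repeat on every page.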
How does the presenter select the data points to be scraped from each business listing?
-The presenter selects data points by clicking on 'Add new selector', choosing the type (text or link), and then selecting the specific elements on the webpage such as business name, phone number, address, website, and email.
What is the purpose of setting a delay between requests when scraping?
-Setting a delay between requests prevents the scraper from being blocked by the website due to too many rapid requests, as most websites have limitations on the number of accesses per user per day.
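The request-interval setting is just a pause between fetches. A minimal sketch, with a stand-in `fetch` function instead of real HTTP so it runs offline; the 2000 ms default mirrors the value used in the video.

```python
import time

def polite_fetch_all(urls, fetch, delay_ms=2000):
    """Call fetch(url) for each url, pausing delay_ms between requests
    so the site is not hammered (the video suggests 2000 ms or more)."""
    results = []
    for i, url in enumerate(urls):
        if i:                        # no need to sleep before the first request
            time.sleep(delay_ms / 1000)
        results.append(fetch(url))
    return results

# Demo with a stand-in fetch function and a tiny delay.
pages = polite_fetch_all(["page1", "page2"], fetch=str.upper, delay_ms=10)
print(pages)  # ['PAGE1', 'PAGE2']
```

As the video notes, a larger interval (e.g. 7000 ms) trades speed for a lower chance of tripping the site's per-user access limit.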
How can one export the scraped data from the Web Scraper extension?
-The scraped data can be exported by clicking the 'Export data' button and choosing 'Export data as CSV', which downloads a CSV file that can be opened in Excel.
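The CSV the extension produces is an ordinary comma-separated file with one row per scraped record. A minimal sketch of producing the same format with Python's standard library; the column names and sample records are invented for illustration.

```python
import csv, io

# Minimal sketch of writing scraped records to CSV, the same kind of file
# the extension's 'Export data as CSV' button produces (columns invented).
records = [
    {"name": "Acme Insurance", "phone": "212-555-0100",
     "address": "1 Main St, New York, NY", "email": "info@acme.example"},
    {"name": "Best Cover LLC", "phone": "212-555-0199",
     "address": "2 Broad St, New York, NY", "email": "hello@best.example"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "phone", "address", "email"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text.splitlines()[0])  # name,phone,address,email
```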
What is the final format of the scraped data as mentioned in the video?
-The final format of the scraped data is a CSV file containing the information such as business or person's name, phone number, address, website, and email.
How does the presenter clean the extracted email addresses in the CSV file?
-The presenter cleans the extracted email addresses by using the 'Find and Replace' feature in Excel to remove the 'mailto:' prefix from each email address.
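The same cleanup can be scripted instead of done by hand in Excel. A short sketch (sample addresses invented) that strips the `mailto:` prefix the scraper captures from email links; `str.removeprefix` requires Python 3.9+.

```python
# Scripted version of the presenter's Excel Find-and-Replace step:
# strip the 'mailto:' prefix the scraper captures from email links.
raw_emails = ["mailto:info@acme.example", "mailto:hello@best.example",
              "contact@plain.example"]   # some values may already be clean

clean = [e.removeprefix("mailto:") for e in raw_emails]
print(clean)
```

Unlike a blanket find-and-replace, `removeprefix` only touches the start of the string, so an address that happens to contain "mailto:" elsewhere would be left intact.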
Outlines
Introduction to Web Scraping with a Chrome Extension
The video starts with a warm welcome from the presenter, Ashore Rafi, explaining that the focus of the tutorial is on scraping data from websites using a free Chrome extension. He plans to demonstrate this by extracting car insurance service provider information from New York's Yellow Pages directory. The presenter explains that the tool will scrape business names, phone numbers, addresses, website URLs, and email addresses. He also highlights that the tool can automatically scrape multiple pages of data, with each page containing 30 results, and the process continues until all pages are scraped.
Installing the Web Scraper Chrome Extension
This section explains the step-by-step process of installing the Web Scraper Chrome extension. The presenter provides a detailed guide on how to navigate to the extension page, add it to Chrome, and verify the installation. Once installed, the user is advised to reload the search result page, after which they can begin scraping. He then shows how to open the browser's developer tools and highlights the 'Inspect' function, which allows access to the Web Scraper extension within the browser.
Setting Up Selectors for Web Scraping
In this paragraph, the presenter explains how to create a new sitemap within the Web Scraper tool and set up selectors for collecting data. He describes how to identify and select business listings by clicking on their links, assigning IDs, and selecting multiple listings. He walks through selecting the first few business profiles and how the tool automatically recognizes the rest, making the process efficient. Once the business listings are selected, the user is instructed to save the selector.
Collecting Business Details: Name, Phone, Address, and Website
This part dives into the process of extracting specific data points from each business listing. The presenter shows how to set up new selectors to scrape business names, phone numbers, addresses, websites, and email addresses. He describes how to handle different business profiles, whether personal or organizational, and ensure all relevant fields are captured. He also explains how the tool can automatically scrape multiple pages, collecting data from various business profiles listed on Yellow Pages.
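What a 'text' selector does under the hood can be sketched with Python's standard `html.parser`: find elements matching a CSS class and take their text. The sample HTML and class names below are invented for illustration, not Yellow Pages' real markup.

```python
from html.parser import HTMLParser

# Sketch of what a 'text' selector does: pull the text of elements that
# match a class. Sample HTML and class names are invented placeholders.
SAMPLE = """
<div class="listing">
  <h2 class="name">Acme Insurance</h2>
  <span class="phone">212-555-0100</span>
  <p class="address">1 Main St, New York, NY 10001</p>
</div>
"""

class FieldExtractor(HTMLParser):
    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted      # class names we care about
        self.current = None       # class of the element being read
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.wanted:
            self.current = cls

    def handle_data(self, data):
        if self.current:
            self.fields[self.current] = data.strip()

    def handle_endtag(self, tag):
        self.current = None

parser = FieldExtractor({"name", "phone", "address"})
parser.feed(SAMPLE)
print(parser.fields["name"])  # Acme Insurance
```

A field that is missing from a listing (as the video notes happens with some websites and emails) simply never appears in `fields`, which is why the exported CSV has blank cells for those rows.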
Automating the Scraping Process Across Multiple Pages
In this section, the presenter explains how to make the scraping tool visit subsequent pages after finishing the first one. He shows how to set up a new selector for pagination, selecting all the page numbers and the 'Next' button to ensure the tool can scrape across multiple pages. He advises adjusting settings to avoid scraping restrictions imposed by business directories and websites and demonstrates how to view the selector graph to visualize the scraping structure.
Running and Configuring the Scraping Script
Here, the presenter discusses how to configure the scraping script by setting appropriate time intervals to avoid triggering website restrictions. He walks through the process of starting the scraping operation and monitors the tool as it begins to extract data from multiple pages. The scraping process continues in the background, collecting data such as business names, phone numbers, addresses, and emails, all while avoiding website blocking issues.
Viewing and Exporting the Scraped Data
The presenter shows how to access and review the collected data within the Web Scraper tool. He explains how users can refresh the tool to view updated information and export the data in a CSV format. The presenter demonstrates how to clean the data in Excel, removing unwanted fields, formatting emails, and avoiding duplication. He explains that while some businesses may list multiple contacts, this is not an issue, as each entry is distinct.
Final Thoughts and Tips for Successful Web Scraping
The video concludes with the presenter cleaning up the scraped data and summarizing the process of automatic data extraction from a business listing site. He offers tips on using the 'Find and Replace' function in Excel to clean up email addresses and ensure a neat data structure. Finally, he encourages viewers to like, share, and comment if they found the video helpful, and invites them to subscribe for more tutorials.
Keywords
Web Scraping
Google Chrome Extension
Sitemap
Selectors
Yellow Pages
Data Extraction
Pagination
CSV File
Interval Settings
Duplicate Data
Highlights
Introduction to scraping data from websites using a free Google Chrome extension.
Demonstration of extracting business information from the Yellow Pages directory.
Focus on collecting car insurance service providers' details from New York City and State.
The tool scrapes data such as business name, phone number, address, website, and email.
A single page contains 30 results, and the tool can navigate to subsequent pages automatically.
Installation process of the 'Web Scraper' extension from Google Chrome Web Store.
Step-by-step guide on using the Inspect tool in Chrome to identify elements for scraping.
Creation of a new sitemap for scraping, starting with the Yellow Pages directory URL.
Selecting multiple business listings for scraping using a 'link' type selector.
Detailing the process of creating specific selectors for scraping business names, phone numbers, and addresses.
Instructions for capturing website URLs and email addresses from the listings.
Automating the navigation between multiple pages for scraping beyond the first page of results.
Explanation of handling website limitations by setting interval gaps between requests to avoid restrictions.
Exporting the collected data as a CSV file and cleaning the data, including removing unnecessary fields.
The final cleaned CSV file includes essential information such as business name, phone number, address, website, and email.
Transcripts
hello and welcome back this is ashore
rafi once again in this video i'm going
to show you how to scrape data from
websites we are going to learn how we
can get information from multiple web
pages automatically at one go by using a
free google chrome extension and to
demonstrate the full process step by
step to you i am going to extract data
from the yellow page business directory
and i'm going to collect car insurance
service providers information from new
york city and state so to make it more
clear to you i'm going to collect each
of these business profiles
name their phone number their address
information their website address their
email address and you can follow the
steps to collect any other information
which is required for your need so
without further ado and one thing that
i'd love to mention as well first of all
if you just notice on the first page we
have got 30 results and after completing
these 30 results the tool will
start visiting the second page the third
page fourth page fifth page and so on
and get all the other business listings
informations as well so let's sum it up
on the first page we have got 30 results
on the second page you have got more 30
result that means total 60 and from the
third page it is going to get more 30
results the total is going to be 90 and
it is going to be um this way right so
without further ado let me download or
actually install the required extension
which is
web
scraper on google chrome and i'm going
to let me open this link in a new tab
just take a look here it is this is the
extension page i'm going to attach this
link into the video description for your
easy access so after visiting this page
simply you have to click on this add to
chrome button right here then confirm it
by clicking on this add extension button
right here and within like seconds it is
going to be added just take a look web
scraper free web scraping has been
added to chrome so now we are all set
with the tool um installation now it's
time to reload our search result page on
yellow page or whatever uh directory
that you would like to scrape data from
all right so here we go now after
reloading the page simply you have to
click on the right button of your mouse
and then you are going to find this
inspect option or you can use a shortcut
key which is ctrl plus shift plus i but
i'd love to click here on this inspect
button so that will get this console tab
or this console information into uh your
browser right so after
coming up here you are going to notice
this option web scraper if you have
installed the extension already on your
chrome browser you will notice this
button or option so let's click on it
after that you have to click on create
new sitemap
and then we are going to click on create
sitemap and then we are going to give a
name to the sitemap so let's say i'm
going to type out yellow page extraction
and the url start url is going to be
this url
copy and paste it and then click on
create sitemap after that we have to add
a new selector for this root um sitemap
so let's click on add new selector and
then we are going to select on the first
stage we are going to select all of the
business listings page or these links
let's say this is the first link
for the first business professionals
information then i have got this is the
second link for the second business or
personal uh person's information right
and so on we have got third fourth etc
so now what we have to do we have to
click here and then we have to provide a
id name which is let's say i'd love to
give it as links and then we have to
change the type from text to link
and then we are going to select multiple
as we are going to select multiple links
from this page let's say so we are going
to click on select now after that we are
going to click here and just take a look
the name of this person has been
selected after that go a little bit down
click on the second listing and if you
just click on the second listing you
will notice that
rest of the listings has already been
selected automatically and the tool
worked for us right so if i take you to
the bottom of the page you are going to
notice that we have selected all these
30 uh profiles already right so now
let's go to the top of the page so after
selecting these listings we have to
click on done selecting button right
here and then go a little bit down you
will find this option save selector
let's click on save selector we are done
with creating our very first um
selector now what we have to do we have
to go inside the selector because now we
are going to collect information from
this
uh business listing inside this business
listing so let's click on it
after that we are going to click here as
well on this id
and now we are inside the first selector
now what we have to do we have to click
on add new selector button right here
after that we are going to select the
business name so that the name or the id
name is going to be let's say business
name
or let's just put name here
because i have noticed some of these
profiles are personal profiles some of
these profiles are business profiles as
well after that we can keep the type to
text then let's click on select and then
we are going to select this business
name just take a look it's just selected
now let's click on
done selecting after that we are going
to
the bottom of the page or this option
and then click on save selector so we
have selected our uh so we have just
selected this parameter for taking our
business name now it's time to select
another selector for um this phone
number address and all of this
information so these are basically these
two are basically the repetitive task of
the first one but we have to make a
little change here on website and email
address collection so let's just click
on it and then we are going to collect
let's say phone number
and then click on
keep it as text and then click on select
after that we are going to click here so
that the phone number will be selected
let's click on done selecting and then
we are going to click on save selector
after that we are going to click on add
new selector again after that we are
going to give it a name let's say
address
after that keep it as text click on
select then we are going to select both
of these uh lines so click here so that
both lines has been selected now let's
go
oh sorry we have to click on this done
selecting and then we have to click on
save selector okay so we are done with
all of this phone number address and
business name selection it's time to
select our website so let's click on add
new selector after that we're going to
give it a name let's say website
after that we have to change the type
from text to
link
and then we are going to click on select
and after that we have to click on this
visit website button right here and we
are going to click on this c
then click on done selecting click on
save selector
now let's do the same for the email
address let's click on add new selector
after that we are going to type out
email and then change the type from text
to link after that let's click on select
then we are going to click on email
business then we are going to click on c
we are going to click on done selecting
then click on this save selector all
right so we are done with this page
information selection now let's go back
to the previous page by clicking here
so
if we run this script now it is going to
collect information from all of these 30
results from this page only but we have
got few more pages appearing here so now
it's time to select these pages as well
so that our tool will go to this page
and then get information as they're
going to collect from the first page uh
listings to make this happen we have to
go up and then we'll notice this option
sitemaps let's click on sitemaps after
that we are going to click here
inside this
let's say
sitemap we are going to find our first
link sitemap then we are going to click
on this add new selector button right
here and then we're going to add a new
id let's say these are pages so i'm
going to type out let's say
pages after that we're going to change
the type from text to link
and then we are going to
click on select and this should be
multiple as you can see we have got two
three four five next so we are going to
click on multiple then lets click on
select
and after that i am going to click on
two three and just take a look it
automatically selected two three four
five and then it is going to go if there
are the six seven eight nine pages it is
going to go on these pages as well as
the next also selected now let's click
on save selector button right here
okay so
i missed something so let me click on
select again yeah okay so let's do this
part again so again i have provided the
id here pages type is link now it's time
to select it to multiple and then let's
click on select after that we are going
to click here on 2 3 and it is going to
be selected all of these pages now let's
click on done selecting okay so i
actually missed this part so let's click
on done selecting and then we are going
to click on save selector now one more
thing we have to do here we have to go
here on the first selectors edit option
and then from here we have to select as
you can see parent selectors we have to
select root
and the pages now let's click on save
selector after that if we go here and
click on this option and click on
selector graph we are going to see the
graph of what we have done so far so
just take a look from links we are going
to find all the pages just take a look
name phone address website and email
these are the points we have selected
now if i click on pages it is going to
show us inside from pages it is going to
visit each of the links and then it is
going to get all the information from
here
so our repetitive task has been settled
properly now what we have to do we have
to go here then click on script button
right here
and then make sure you have settled a
large number here let's say 2000 is the
basic but the more gap you are giving on
the intervals the better it is going to
work on the website because most of
these uh business directories or website
has some limitations of accessing their
website from a views a user each day so
after visiting let's say 10 15 pages or
more than that or whatever their limit
is after visiting these pages they are
going to send you a restrict message
restriction message or they're going to
make the tool stop so in this case if
you provide let's say 7 000 your chances
will be less to get
stopped faster so in this case as this
is a tutorial purpose so i'm going to
select 2 000 millisecond it is going to
work perfectly fine for this tutorial
purpose so in this case i'm going to
click on start scraping and it is going
to start visiting each of these pages
automatically and just take a look it is
working here
and
okay
let's just wait just take a look it's
already been visiting the 61st listing
31
and here it is visiting the first page
and
yeah
now it is going to visit each of the
persons or businesses profiles it is
going to collect their name phone number
website email address when they are
available some of the websites you will
notice for some of the business listings
you will notice that the website is
missing the email address is missing or
the phone number is missing and this is
because
there are no information provided for
these fields and it happens for some
pages okay so now if i click on this
refresh button we are going to be able
to see some information has been already
populated here just take a look
now the more time it will
um
stay active in the background the more
data it is going to collect for us
so
as i have settled everything properly it
is going to visit each of these pages it
is going to get the data from these
pages automatically into our um
database now it is going to do its work
so what i'm going to do i am going to
close this data extraction process now
and let me show you
how much we have got as of now these
things it's totally fine to show you the
process of downloading your in results
so what i'm going to do i need to click
on this button after that you have to
click on export data as csv to download
it into your um
excel document so let me click on it
after that we are going to click on this
download now button and then we have to
click on save
and the files will be saved on a csv
file now i'm going to open this file
and if you just notice here we have got
all the information
just take a look
all right so now what i'm going to do
i'm going to clean this um
things up
in real quick to show you the end result
i'm going to clean and i'd love to keep
these fields as a reference to each of
these listings then i have got these
business or the person's name we have
got the phone number address information
which includes the street address city
and state and the zip code then i have
got the website i'm going to delete this
one from here if you notice here we have
got the website and wow it looks that we
have got some uh repetitive or let's say
duplicate
websites but i don't think there is as
duplicate because if you just notice
here at these emails you want you are
going to see that these emails are
different and the person names are
different so these are actually these
businesses
uh has multiple people's listed on the
uh listing so this is the reason it we
are seeing some um relevancy here so it
is not a problem then we have got this
email field which we don't need we don't
need
the other pages from here and if you
just notice here we have got mail to
these types of text appearing on each of
these emails so what we can do we can
get rid of this easily to do this we
have to simply click on control f into
our keyboard and then we will find this
option find and replace then click on
replace we are going to type out this
information like mail to
mail to and then there is colon then
we're going to click on replace all
and just take a look we have got a clean
email list already right so we have got
27 listing it's totally fine
so this was the process of automatically
extracting data from any business
listing website or any website so i
believe you have found this video
helpful if you did please give this
video a like share this video to help
your friends and let me know if you have
any question by commenting below and
your opinion will be highly appreciated
and please subscribe to my channel to
get more helpful videos in near future i
hope to see you in my next videos have a
good day bye
Browse More Related Videos
LinkedIn Data Scraping Tutorial | 1-Click To Save to Sheets
Scrape website data without code using Bardeen
The easiest way to get data from ANY site in minutes
"Wait, this Agent can Scrape ANYTHING?!" - Build universal web scraping agent
How To Extract All Business Data And Emails From Google Maps
Local SEO - Outrank 99% Of Your Competitors