Scraping with Google Sheets
Summary
TLDR: This video script demonstrates how to import data from web tables into Google Sheets using the 'IMPORTHTML' formula. It explains the parameters for the formula, including the URL, query, and table index, and shows examples of fetching data from Wikipedia and a list of highest-grossing Indian films. The script also touches on sorting the imported data and mentions other web scraping formulas like 'IMPORTXML', 'IMPORTFEED', 'IMPORTRANGE', and 'IMPORTDATA', highlighting the dynamic nature of live formulas in Google Sheets.
Takeaways
- 🔍 The script discusses how to import data from web tables into Google Sheets using the 'IMPORTHTML' formula.
- 📝 'IMPORTHTML' requires a URL and a query to specify the data to be imported, such as a table or list from the webpage.
- 🔑 The third parameter of 'IMPORTHTML' is the index, which helps to select a specific table or list from the webpage.
- 🛠️ The script mentions that some formulas might require access permissions to interact with external data sources.
- 🔗 It's important to verify the correctness of the imported data by checking the source webpage, such as Wikipedia in the example.
- 📊 The script demonstrates how to import a list of the highest-grossing Indian films and sort them using Google Sheets features.
- 🚫 The imported range cannot be sorted in place because it is the output of a formula; the data must first be copied and pasted as values, and the pasted copy can then be sorted.
- 🌐 The 'IMPORTHTML' formula is dynamic, meaning it updates automatically if the source webpage changes.
- 📚 Other web import formulas mentioned include 'IMPORTXML' for structured data such as XML files, and 'IMPORTFEED' for Atom or RSS feeds.
- 🔄 'IMPORTXML' is highlighted as a powerful tool for fetching specific elements using XPath from a webpage.
- 📈 The script concludes by mentioning other import functions like 'IMPORTRANGE', 'IMPORTDATA', and their respective uses.
Q & A
What is the purpose of the 'IMPORTHTML' function in Google Sheets?
-The 'IMPORTHTML' function in Google Sheets is used to import data from tables on web pages into the spreadsheet.
What are the parameters required by the 'IMPORTHTML' function?
-The 'IMPORTHTML' function takes three parameters: the URL of the web page, a query that is either "table" or "list", and an index that selects which table or list on the page to import.
What does the query parameter in 'IMPORTHTML' represent?
-The query parameter in 'IMPORTHTML' can be a table or a list, indicating which table or list from the web page should be imported.
How does the index parameter in 'IMPORTHTML' work?
-The index parameter in 'IMPORTHTML' specifies the position of the table or list to be imported, such as the first, second, or third table or list on the page.
Why might you encounter an error when using 'IMPORTHTML'?
-An error might occur when using 'IMPORTHTML' if the spreadsheet attempts to send or receive data from an external party without access permission, which needs to be granted by the user.
How can you verify the correctness of the data imported using 'IMPORTHTML'?
-You can verify the correctness of the imported data by checking the source web page to ensure it contains the expected information.
What happens if the website data changes after using 'IMPORTHTML'?
-If the website data changes, the 'IMPORTHTML' function will automatically update with the new results when the spreadsheet is refreshed.
Can the data imported with 'IMPORTHTML' be sorted directly?
-No, the data imported with 'IMPORTHTML' cannot be sorted directly because it is the result of a formula. It needs to be copied and pasted as values (Paste special, by value), and the pasted copy can then be sorted.
What are some alternative functions to 'IMPORTHTML' for importing data into Google Sheets?
-Alternative functions to 'IMPORTHTML' include 'IMPORTXML' for structured data, 'IMPORTFEED' for Atom or RSS feeds, 'IMPORTRANGE' for data from another spreadsheet, and 'IMPORTDATA' for CSV or TSV formats.
What is the significance of a live formula in Google Sheets?
-A live formula in Google Sheets automatically updates its result when the source data changes, ensuring that the spreadsheet always reflects the most current information.
How can you use 'IMPORTHTML' to import a specific table from a Wikipedia page?
-You can use 'IMPORTHTML' to import a specific table from a Wikipedia page by providing the Wikipedia page URL, passing "table" as the query, and giving the table's position on the page as the index (the third parameter).
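The copy, paste-as-values, then sort workflow from the answers above can be sketched outside the spreadsheet. This is only an illustration in Python, not how Google Sheets works internally; the film names, the years, and the helper name `sort_snapshot` are made up for the example.

```python
def sort_snapshot(rows, column, header_rows=1):
    """Sort a pasted-as-values copy by one column, keeping header rows on top."""
    header, body = rows[:header_rows], rows[header_rows:]
    return header + sorted(body, key=lambda row: row[column])


# Hypothetical stand-in for the range that a live IMPORTHTML formula produces.
imported = [
    ["Film", "Year"],
    ["Film A", 2016],
    ["Film B", 2009],
    ["Film C", 2023],
]

# "Paste special, by value": take a static copy so the live range is untouched.
snapshot = [row[:] for row in imported]

# Equivalent of freezing the header row and sorting the sheet by the Year column.
by_year = sort_snapshot(snapshot, column=1)
for row in by_year:
    print(row)
```

The point of the snapshot step is that the sorted copy no longer follows the source page: it stays as it was when pasted, while the formula range keeps updating.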
Outlines
📊 Importing Web Data with Google Sheets' IMPORTHTML Function
This paragraph introduces the IMPORTHTML function in Google Sheets, which allows users to import data from web pages. The script explains the function's parameters, including the URL of the web page and a query to specify the table or list to be imported. It also discusses the third parameter, the index, which is used to select the specific table or list from the webpage. The video demonstrates how to use the function, troubleshoot errors related to accessing external data, and verify the imported data by cross-referencing it with the source webpage. Additionally, it shows how to sort the imported data by copying the results and using the spreadsheet's sorting features.
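To make the role of the query and index parameters concrete, here is a rough Python sketch of what selecting "the second table on a page" amounts to. It is an assumption-laden illustration, not Google's implementation: the sample page, the class `TableCollector`, and the function `import_html_table` are all hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # all tables found so far, in page order
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, index):
    """Return the index-th table as rows; index starts at 1, as in IMPORTHTML."""
    collector = TableCollector()
    collector.feed(html)
    return collector.tables[index - 1]


# Hypothetical page with two tables (assumed content, for illustration only).
PAGE = """
<table><tr><th>Rank</th></tr><tr><td>1</td></tr></table>
<table>
  <tr><th>Film</th><th>Year</th></tr>
  <tr><td>Film A</td><td>2016</td></tr>
  <tr><td>Film B</td><td>2009</td></tr>
</table>
"""

# Roughly what =IMPORTHTML(url, "table", 2) would pick from such a page.
second_table = import_html_table(PAGE, 2)
print(second_table)
```

As in the video, the practical difficulty is finding the right index: tables are counted in page order, so a page with several small tables before the one you want needs a larger index than expected.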
🌐 Exploring Other Data Import Functions in Google Sheets
The second paragraph expands on the variety of data import functions available in Google Sheets, aside from IMPORTHTML. It mentions functions like IMPORTXML for extracting structured data from XML files, which are typical of API responses, and IMPORTFEED for importing Atom or RSS feeds. The paragraph also includes IMPORTRANGE for importing data from another spreadsheet and IMPORTDATA for CSV or tab-separated value formats. The script notes that while all of these functions are useful for data scraping, the speaker has personally found IMPORTHTML to be the most commonly used of them.
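The XPath idea behind IMPORTXML, which the video mentions but does not cover, can be sketched with Python's standard library. This is a minimal illustration under assumptions: the feed content and the helper name `import_xml` are hypothetical, and `xml.etree.ElementTree` supports only a subset of XPath, so the query is written relative to the root (`.//item/title`) where a Sheets formula would typically use `//item/title`.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed-like XML document (assumed content, for illustration only).
FEED = """<rss><channel>
  <item><title>First post</title><link>https://example.com/1</link></item>
  <item><title>Second post</title><link>https://example.com/2</link></item>
</channel></rss>"""


def import_xml(xml_text, xpath):
    """Return the text of every element matching an XPath-style query."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall(xpath)]


# Roughly what =IMPORTXML(url, "//item/title") would extract from such a feed.
titles = import_xml(FEED, ".//item/title")
print(titles)
```

The query names the exact elements to fetch, which is why the speaker calls IMPORTXML the most powerful of these functions: it is not limited to whole tables or lists.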
Keywords
💡Import HTML
💡Query
💡Table
💡Index
💡Access Permission
💡Live Formula
💡Sorting
💡API
💡XPath
💡Import XML
💡RSS Feed
Highlights
Import data from web tables into Google Sheets using the IMPORTHTML formula.
IMPORTHTML formula requires a URL and a query to specify the data source.
Query can be a table or a list to specify which data structure to import.
The index parameter determines which table or list to import from a webpage.
Formula execution may require access permissions to external data sources.
The formula result is dynamic and updates when the webpage changes.
IMPORTHTML can be used to import specific tables from Wikipedia pages.
Sorting the data imported by IMPORTHTML requires copying it and pasting it as values into a new range.
IMPORTHTML can fetch tabular data such as the list of highest-grossing Indian films, imported in the video as the first table on that page.
Data imported with IMPORTHTML can be sorted by column to analyze information.
IMPORTXML is a powerful formula for fetching structured data from XML files or APIs.
IMPORTXML allows specifying an XPath to extract specific elements from a webpage.
IMPORTFEED can be used to import data from Atom or RSS feeds.
IMPORTRANGE allows importing data from another spreadsheet.
IMPORTDATA can be used to import CSV or TSV formatted data.
IMPORTHTML is, in the speaker's personal experience, the most commonly used web formula in Google Sheets.
Transcripts
[Music]
let's look at how we can import data
from tables on the web into Google
Sheets there's a formula which let me
zoom in and type equals import
HTML and this accepts a URL a page from
which we want to get stuff from and a
query now to understand what a query is
let's look at the documentation for
import HTML it has a web page and then a
second parameter query which can be a
table for instance which tells us that
it's going to pick a table or it can be
a list as well instead of a table so
it'll pick up the first list second list
Etc and the third parameter is the index
that tells us whether we want to pick up
the first table or the first list or the
fourth table and so on let's take this
formula itself and copy this
so I'm going to say equals import HTML
of whatever press enter now it's loading
and we have an error let's look at this
a little closer so firstly it's saying
some formulas are trying to send and
receive data from external parties and
we need to allow access fair enough
let's allow access now it's run and it's
gotten a result now is this correct
let's take a look at the Wikipedia page
and see if it actually contains that
information so okay there seems to be
okay this may be a table not sure this
is definitely a table this is a table
this is a table so maybe this is the
fourth table we have
5761 which is exactly the value that we
have so yes it picked up the fourth
table and if you wanted this big table
that would be table number two if you
wanted this it would be table number one
which is great uh let's pick another
page so I want to let's say uh get the
um uh biggest box office hits uh in
India all time and get that list that
seems to be there in the list of highest
grossing Indian
films um and I think this is the first
table it begins with Dangal then
Baahubali and so on so let's get this
list uh
equals import
HTML of this oh I keep making a
mistake and then table I want the first
table I think let's load that and yes it
seems to give us a list which is great
now I'd like to sort this but here's the
thing this is this whole thing is
actually the result of a formula so I
can't directly sort it even if I tried
what I'd have to do is maybe copy this
and uh Edit, Paste special by
value now I have the results copied and
then I can maybe sort this so if I
[Music]
said how do you sort Data, Sort sheet by
column L uh I should have said with the
header row so Data, Sort sheet by column or
okay ah I moved it one cell down let me
freeze the
rows and then
sort by column L and yeah it turns out
that 3 Idiots is the top on that
list year-wise that is the oldest movie
on that list and the newest movie on
that list is Pathaan and uh Adipurush
is just a little bit behind and so on
not in terms of um the gross amount but
rather in terms of years actually
there's a whole bunch of 2023 movies so
it's hard to say which is ahead and
which is behind but this is a live
formula what that means is if the
website changes and we refresh the page
it will automatically get the new
results again this is not the only web
formula there is you can also use other
formulas like import XML which is useful
for getting structured data so if you
have uh XML files which is typical for
an API or HTML itself and we want to get
specific parts of uh of an HTML if you
want to get CSV TSV Atom RSS which are
all XML formats this is useful import
XML is perhaps the most uh powerful of
these because you can also specify an
XPath that is the exact structure or the
exact element that you want to fetch
from a particular page but that's not
something we're going to be covering
right now uh then there is import feed
which can import an atom or an RSS feed
there is import range that can import
from another spreadsheet and finally
import data which can get you a CSV or a
tab separated value
format all of these are ways of
effectively scraping from something from
Google Sheets but I have personally
found the import HTML to be the most
commonly used of the lot