Scraping with Google Sheets

Anand S
1 Jun 2024 · 05:36

Summary

TLDR: This video script demonstrates how to import data from web tables into Google Sheets using the 'IMPORTHTML' formula. It explains the parameters for the formula, including the URL, query, and table index, and shows examples of fetching data from Wikipedia and a list of highest-grossing Indian films. The script also touches on sorting the imported data and mentions other web scraping formulas like 'IMPORTXML', 'IMPORTFEED', 'IMPORTRANGE', and 'IMPORTDATA', highlighting the dynamic nature of live formulas in Google Sheets.

Takeaways

  • 🔍 The script discusses how to import data from web tables into Google Sheets using the 'IMPORTHTML' formula.
  • 📝 'IMPORTHTML' requires a URL and a query to specify the data to be imported, such as a table or list from the webpage.
  • 🔑 The third parameter of 'IMPORTHTML' is the index, which helps to select a specific table or list from the webpage.
  • 🛠️ The script mentions that some formulas might require access permissions to interact with external data sources.
  • 🔗 It's important to verify the correctness of the imported data by checking the source webpage, such as Wikipedia in the example.
  • 📊 The script demonstrates how to import a list of the highest-grossing Indian films and sort them using Google Sheets features.
  • 🚫 The imported data cannot be sorted in place because it is the output of a formula; it must first be copied and pasted as values.
  • 🌐 The 'IMPORTHTML' formula is dynamic, meaning it updates automatically if the source webpage changes.
  • 📚 Other web import formulas mentioned include 'IMPORTXML' for structured data like XML files, and 'IMPORTFEED' for Atom or RSS feeds.
  • 🔄 'IMPORTXML' is highlighted as a powerful tool for fetching specific elements using XPath from a webpage.
  • 📈 The script concludes by mentioning other import functions like 'IMPORTRANGE', 'IMPORTDATA', and their respective uses.

Q & A

  • What is the purpose of the 'IMPORTHTML' function in Google Sheets?

    -The 'IMPORTHTML' function in Google Sheets is used to import data from tables on web pages into the spreadsheet.

  • What are the parameters required by the 'IMPORTHTML' function?

    -The 'IMPORTHTML' function takes three parameters: the URL of the web page, a query that is either "table" or "list", and an index that selects which table or list on the page to import.

  • What does the query parameter in 'IMPORTHTML' represent?

    -The query parameter in 'IMPORTHTML' is either "table" or "list", indicating whether a table or a list from the web page should be imported.

  • How does the index parameter in 'IMPORTHTML' work?

    -The index parameter in 'IMPORTHTML' specifies the position of the table or list to be imported, such as the first, second, or third table or list on the page.

  • Why might you encounter an error when using 'IMPORTHTML'?

    -An error might occur when using 'IMPORTHTML' if the spreadsheet attempts to send or receive data from an external party without access permission, which needs to be granted by the user.

  • How can you verify the correctness of the data imported using 'IMPORTHTML'?

    -You can verify the correctness of the imported data by checking the source web page to ensure it contains the expected information.

  • What happens if the website data changes after using 'IMPORTHTML'?

    -If the website data changes, the 'IMPORTHTML' function will automatically update with the new results when the spreadsheet is refreshed.

  • Can the data imported with 'IMPORTHTML' be sorted directly?

    -No, the data imported with 'IMPORTHTML' cannot be sorted directly because it is the result of a formula. It needs to be copied and pasted as values, and the pasted copy can then be sorted.

  • What are some alternative functions to 'IMPORTHTML' for importing data into Google Sheets?

    -Alternative functions to 'IMPORTHTML' include 'IMPORTXML' for structured data, 'IMPORTFEED' for Atom or RSS feeds, 'IMPORTRANGE' for data from another spreadsheet, and 'IMPORTDATA' for CSV or TSV formats.

  • What is the significance of a live formula in Google Sheets?

    -A live formula in Google Sheets automatically updates its result when the source data changes, ensuring that the spreadsheet always reflects the most current information.

  • How can you use 'IMPORTHTML' to import a specific table from a Wikipedia page?

    -You can use 'IMPORTHTML' to import a specific table from a Wikipedia page by providing the Wikipedia page URL, setting the query parameter to "table", and giving the table's position on the page in the index parameter.
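
To make the three arguments concrete, here is a minimal Python sketch that only assembles the text of an IMPORTHTML formula. The helper name `importhtml_formula` and the example URL are hypothetical and not from the video; the argument order (URL, query, index) and the 1-based index follow the description above.

```python
def importhtml_formula(url: str, query: str, index: int) -> str:
    """Build the text of an IMPORTHTML formula from its three arguments."""
    if query not in ("table", "list"):
        raise ValueError('query must be "table" or "list"')
    if index < 1:
        raise ValueError("index is 1-based: 1 is the first table or list")
    # Sheets string literals escape a double quote by doubling it.
    safe_url = url.replace('"', '""')
    return f'=IMPORTHTML("{safe_url}", "{query}", {index})'


# Hypothetical URL, used only for illustration.
print(importhtml_formula("https://example.com/page", "table", 4))
# prints: =IMPORTHTML("https://example.com/page", "table", 4)
```

The printed text is what would be typed into a cell; the index stays a separate third argument rather than part of the query.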

Outlines

00:00

📊 Importing Web Data with Google Sheets' IMPORTHTML Function

This paragraph introduces the IMPORTHTML function in Google Sheets, which allows users to import data from web pages. The script explains the function's parameters, including the URL of the web page and a query to specify the table or list to be imported. It also discusses the third parameter, the index, which is used to select the specific table or list from the webpage. The video demonstrates how to use the function, troubleshoot errors related to accessing external data, and verify the imported data by cross-referencing it with the source webpage. Additionally, it shows how to sort the imported data by copying the results and using the spreadsheet's sorting features.
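
As a rough analogue of what the "table" query and the index do, the following Python sketch collects every table in an HTML string and returns the one at a given 1-based position. It is not how Google Sheets implements IMPORTHTML; the class and function names and the sample HTML are made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> in a page as a list of rows (no nesting)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table>, in page order
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def nth_table(html: str, index: int):
    """Return the index-th table (1-based, like IMPORTHTML) as rows."""
    if index < 1:
        raise ValueError("index starts at 1")
    parser = TableCollector()
    parser.feed(html)
    return parser.tables[index - 1]


# Made-up page with two tables.
SAMPLE = """
<table><tr><th>City</th></tr><tr><td>Pune</td></tr></table>
<table><tr><th>Year</th><th>Value</th></tr><tr><td>2023</td><td>10</td></tr></table>
"""
print(nth_table(SAMPLE, 2))
# prints: [['Year', 'Value'], ['2023', '10']]
```

Counting tables in page order is also how the video checks the result: the tables on the source page are counted by eye until the one matching the imported values is found.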

05:01

🌐 Exploring Other Data Import Functions in Google Sheets

The second paragraph expands on the variety of data import functions available in Google Sheets, aside from IMPORTHTML. It mentions functions like IMPORTXML for extracting structured data from XML files, which is often used for APIs, and IMPORTFEED for importing Atom or RSS feeds. The paragraph also includes IMPORTRANGE for importing data from another spreadsheet and IMPORTDATA for CSV or tab-separated value formats. The script notes that while all of these functions are useful for data scraping, the presenter has personally found IMPORTHTML to be the most commonly used among them.
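
The CSV and tab-separated formats mentioned for IMPORTDATA differ only in their delimiter. The Python sketch below illustrates that point with the standard `csv` module; it is an analogy, not the Sheets implementation, and the helper name and sample text are invented for illustration.

```python
import csv
import io


def parse_delimited(text: str, delimiter: str = ","):
    """Parse CSV (comma) or TSV (tab) text into a list of rows."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Made-up data in both formats.
csv_text = "title,year\nFilm A,2009\n"
tsv_text = "title\tyear\nFilm A\t2009\n"

print(parse_delimited(csv_text))
print(parse_delimited(tsv_text, delimiter="\t"))
# both print: [['title', 'year'], ['Film A', '2009']]
```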

Keywords

💡Import HTML

Import HTML is a formula in Google Sheets that allows users to pull data from tables on web pages into the spreadsheet. It is central to the video's theme of data importation. The script demonstrates its use by specifying a URL and a query to extract information from a webpage, such as Wikipedia tables, and mentions the need for access permissions to execute the formula.

💡Query

In the context of the Import HTML function, a query is a parameter that specifies the part of the web page from which data should be imported. It could refer to a table or a list, and it determines the structure of the data to be fetched. The video script uses the term to explain how to select different tables or lists from a webpage for data extraction.

💡Table

A table in web terminology is a structured set of data presented in rows and columns. In the video, tables are the primary source of data for the Import HTML function. The script discusses identifying and selecting specific tables from a webpage, such as the fourth table on a Wikipedia page, to import data into Google Sheets.

💡Index

The index parameter in the Import HTML formula specifies the position of the table or list to be imported from a webpage. It is used to differentiate between multiple tables or lists when there is more than one available. The script uses the index to select the fourth table from a Wikipedia page as an example.

💡Access Permission

Access permission refers to the authorization required to execute certain functions that interact with external data sources. In the video, the script mentions that access needs to be allowed for formulas that send and receive data from external parties, such as when using the Import HTML function.

💡Live Formula

A live formula in Google Sheets is one that automatically updates when the source data changes. The script highlights this feature of the Import HTML formula, noting that if the webpage is updated, the imported data in Google Sheets will reflect those changes upon refreshing the page.

💡Sorting

Sorting is the process of arranging data in a specific order, typically alphabetical or numerical. The video script discusses the challenge of sorting data that is the result of a formula, such as Import HTML, and demonstrates a workaround: copy the data, paste it as values, freeze the header row, and then sort the pasted copy by a column.
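
The same idea can be sketched in Python: work on a plain copy of the values, keep the header row fixed, and sort the remaining rows by one column. The function name and the film rows are made up for illustration; in Sheets itself this is done through the menus shown in the video.

```python
def sort_by_column(rows, column, header_rows=1, reverse=False):
    """Sort data rows by one column while leaving the header rows on top."""
    header, body = rows[:header_rows], rows[header_rows:]
    return header + sorted(body, key=lambda row: row[column], reverse=reverse)


# Made-up values, standing in for a pasted-as-values copy of the import.
rows = [
    ["Title", "Year"],
    ["Film B", 2023],
    ["Film A", 2009],
    ["Film C", 2017],
]

for row in sort_by_column(rows, column=1):
    print(row)
# prints the header first, then Film A (2009), Film C (2017), Film B (2023)
```

Keeping the header out of the sort mirrors freezing the header row in the video, which avoids the header moving down into the data.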

💡API

API stands for Application Programming Interface, which is a set of rules and protocols for accessing a web-based software application. The script mentions XML files, which are often used in APIs to structure data for exchange between systems, indicating the potential use of Import XML for fetching data from APIs.

💡XPath

XPath is a language used to navigate through elements in an XML document. In the context of the Import XML function mentioned in the script, XPath can be used to specify the exact element or structure from a webpage that the user wants to fetch, providing a more precise method of data extraction.
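
For a feel of how a path expression picks out elements, here is a small Python sketch using the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath. The XML document and the helper name are invented for illustration, and the video does not show an IMPORTXML example, so treat this as an analogy only.

```python
import xml.etree.ElementTree as ET


def select_text(xml_text: str, path: str):
    """Return the text of every element matched by a path expression."""
    root = ET.fromstring(xml_text.strip())
    return [element.text for element in root.findall(path)]


# Made-up XML, similar in shape to what an API might return.
SAMPLE_XML = """
<films>
  <film year="2009"><title>Film A</title></film>
  <film year="2023"><title>Film B</title></film>
</films>
"""

print(select_text(SAMPLE_XML, "./film/title"))
# prints: ['Film A', 'Film B']
print(select_text(SAMPLE_XML, "./film[@year='2023']/title"))
# prints: ['Film B']
```

The second path adds an attribute filter, which is the kind of precision the script refers to when it calls IMPORTXML the most powerful of these functions.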

💡Import XML

Import XML is another Google Sheets function that allows users to import structured data from XML files, which can be particularly useful for data from APIs or specific parts of an HTML document. The script briefly introduces this function as a more powerful alternative to Import HTML for certain data import tasks.

💡RSS Feed

RSS stands for Really Simple Syndication, a type of web feed that allows users to access updates from websites in a standardized format. The script mentions Import Feed as a function to import data from an Atom or RSS feed, which can be useful for keeping spreadsheets updated with the latest content from these sources.

Highlights

Import data from web tables into Google Sheets using the IMPORTHTML formula.

IMPORTHTML formula requires a URL and a query to specify the data source.

Query can be a table or a list to specify which data structure to import.

The index parameter determines which table or list to import from a webpage.

Formula execution may require access permissions to external data sources.

The formula result is dynamic and updates when the webpage changes.

IMPORTHTML can be used to import specific tables from Wikipedia pages.

Sorting the data imported by IMPORTHTML requires copying it and pasting it as values first.

IMPORTHTML can fetch tables such as the one on the list of highest-grossing Indian films.

Data imported with IMPORTHTML can be sorted by column to analyze information.

IMPORTXML is a powerful formula for fetching structured data from XML files or APIs.

IMPORTXML allows specifying an XPath to extract specific elements from a webpage.

IMPORTFEED can be used to import data from Atom or RSS feeds.

IMPORTRANGE allows importing data from another spreadsheet.

IMPORTDATA can be used to import CSV or TSV formatted data.

IMPORTHTML is highlighted as the most commonly used web formula in Google Sheets.

Transcripts

00:00

[Music]

00:10

Let's look at how we can import data from tables on the web into Google Sheets. There's a formula which, let me zoom in and type, equals IMPORTHTML, and this accepts a URL, a page from which we want to get stuff from, and a query. Now, to understand what a query is, let's look at the documentation for IMPORTHTML. It has a web page and then a second parameter, query, which can be a table, for instance, which tells us that it's going to pick a table, or it can be a list as well instead of a table, so it'll pick up the first list, second list, etc. And the third parameter is the index, that tells us whether we want to pick up the first table or the first list or the fourth table and so on. Let's take this formula itself and copy this.

01:00

So I'm going to say equals IMPORTHTML of whatever, press Enter. Now it's loading, and we have an error. Let's look at this a little closer. So firstly, it's saying some formulas are trying to send and receive data from external parties and we need to allow access. Fair enough, let's allow access. Now it's run and it's gotten a result. Now, is this correct? Let's take a look at the Wikipedia page and see if it actually contains that information. So, okay, there seems to be, okay, this may be a table, not sure. This is definitely a table. This is a table. This is a table. So maybe this is the fourth table. We have 5761, which is exactly the value that we have. So yes, it picked up the fourth table, and if you wanted this big table, that would be table number two. If you wanted this, it would be table number one, which is great.

02:01

Uh, let's pick another page. So I want to, let's say, uh, get the, um, uh, biggest box office hits, uh, in India, all time, and get that list. That seems to be there in the list of highest-grossing Indian films, um, and I think this is the first table. It begins with Dangal, then Baahubali, and so on. So let's get this list. Uh, equals IMPORTHTML of this, oh, I keep making a mistake, and then table. I want the first table, I think. Let's load that, and yes, it seems to give us a list, which is great.

02:52

Now I'd like to sort this, but here's the thing: this whole thing is actually the result of a formula, so I can't directly sort it even if I tried. What I'd have to do is maybe copy this and, uh, Edit, Paste special, by value. Now I have the results copied, and then I can maybe sort this. [Music] So if I said, how do you sort, Data, Sort sheet by column L, uh, I should have said with the header row. So Data, Sort sheet by column, or, okay, ah, I moved it one cell down. Let me freeze the rows and then sort by column L. And yeah, it turns out that 3 Idiots is the top on that list. Year-wise, that is the oldest movie on that list, and the newest movie on that list is pan sent to, uh, Adipurush is just a little bit behind, and so on. Not in terms of, um, the gross amount but rather in terms of years. Actually, there's a whole bunch of 2023 movies, so it's hard to say which is ahead and which is behind.

04:07

But this is a live formula. What that means is, if the website changes and we refresh the page, it will automatically get the new results again.

04:18

This is not the only web formula there is. You can also use other formulas like IMPORTXML, which is useful for getting structured data. So if you have, uh, XML files, which is typical for an API, or HTML itself, and we want to get specific parts of, uh, of an HTML, if you want to get CSV, TSV, Atom, RSS, which are all XML formats, this is useful. IMPORTXML is perhaps the most, uh, powerful of these because you can also specify an XPath, that is, the exact structure or the exact element that you want to fetch from a particular page. But that's not something we're going to be covering right now.

05:04

Uh, then there is IMPORTFEED, which can import an Atom or an RSS feed. There is IMPORTRANGE, that can import from another spreadsheet, and finally IMPORTDATA, which can get you a CSV or a tab-separated value format. All of these are ways of effectively scraping from something from Google Sheets, but I have personally found the IMPORTHTML to be the most commonly used of the lot.
