Use wget to download / scrape a full website

Melvin L
18 Jan 2018, 14:35

Summary

TLDR: This video tutorial introduces the use of 'wget', a simple tool for downloading and scraping websites. It demonstrates basic usage, advanced functionalities, and common parameters for web scraping or file downloading. The video provides step-by-step examples, including downloading an entire website's content for offline access, with parameters to handle recursion, resource downloading, and domain restrictions. It also covers best practices for large-scale scraping, such as wait times and rate limiting, to avoid IP blacklisting and ensure efficient data retrieval.

Takeaways

  • 😀 The video introduces the use of Wget as a simple tool for downloading and scraping websites.
  • 🔍 The script outlines three examples of using Wget, starting with a basic example and moving to more advanced functionalities.
  • 📚 It is clarified that 'scraping' in this context may not involve in-depth data extraction but more about downloading web pages for offline use.
  • 🌐 The video mentions other tools like Scrapy for more advanced scraping capabilities, which have been covered in other videos.
  • 📁 The first example demonstrates downloading a simple HTML file using Wget, which can be opened in a browser but lacks other resources.
  • 🔄 The second example introduces parameters like recursive navigation, no-clobber, page-requisites, convert-links, and domain restrictions to download associated resources.
  • 🚀 The third example includes advanced parameters for handling larger sites, such as wait times between requests, rate limiting, user-agent specification, recursion levels, and logging progress.
  • 🛠 The script emphasizes the need to adjust parameters based on the specific requirements of the site being scraped.
  • 🔒 It highlights the importance of respecting website policies and being a considerate 'netizen' when scraping to avoid IP blacklisting.
  • 📝 The video concludes with a note on the utility of Wget for quick scraping tasks and post-processing of content offline.

Q & A

  • What is the main purpose of the video?

    -The main purpose of the video is to demonstrate how to use Wget as a simple tool to download and scrape an entire website.

  • What are the three examples outlined in the video?

    -The video outlines three examples: 1) A basic example similar to a 'hello world', 2) Advanced functionality with common parameters used for screen scraping or downloading files, and 3) More advanced parameters for more complex scenarios.
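    As a minimal sketch, the 'hello world' example amounts to pointing wget at a single URL; the scrapy.org address is assumed here from the demo and can be swapped for any site:

      wget https://scrapy.org/
      # fetches just the page's HTML (index.html) into the current directory; no CSS, images, or linked pages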

  • What does the term 'download and scrape' imply in the context of the video?

    -In the context of the video, 'download and scrape' mostly means downloading web page content to keep offline or process locally; the presenter notes that scraping usually implies extracting structured content from pages, which is not what Wget is designed for in depth.

  • What are some limitations of using Wget for scraping?

    -Some limitations of using Wget for scraping include that it does not download associated resources such as CSS files and images by default, and it is not as versatile as full-fledged web scraping tools.

  • What is the significance of the 'no-clobber' parameter in Wget?

    -The 'no-clobber' parameter in Wget ensures that if a page has already been crawled and a file has been created, it won't crawl and recreate that page, which is helpful for avoiding duplication and managing connectivity issues.
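    A hedged sketch of how the option is typically combined with a recursive crawl (URL assumed):

      wget -r -nc https://docs.scrapy.org/en/latest/
      # -nc / --no-clobber: files that already exist locally are skipped, so an interrupted crawl can be restarted without re-downloading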

  • How does the 'page-requisites' parameter in Wget affect the download process?

    -The 'page-requisites' parameter in Wget ensures that all resources associated with a page, such as images and CSS files, are also downloaded, making the content more complete for offline use.
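    For illustration, with an assumed URL:

      wget -p https://scrapy.org/
      # -p / --page-requisites: also download the images, stylesheets, and scripts the page needs to display properly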

  • What is the purpose of the 'convert-links' parameter in Wget?

    -The 'convert-links' parameter in Wget is used to convert the links in the downloaded pages to point to local file paths instead of pointing back to the server, making the content accessible for offline browsing.
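    A small sketch (URL assumed):

      wget -r -p -k https://scrapy.org/
      # -k / --convert-links: after the download, rewrite links in the saved HTML so they point at the local copies instead of the live server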

  • Why is the 'no-parent' parameter used in Wget?

    -The 'no-parent' parameter in Wget is used to restrict the download process to the specified hierarchy level, ensuring that it does not traverse up to higher levels such as parent directories or different language versions of the site.
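    A sketch with an assumed documentation-style URL, where sibling language trees sit above the starting directory:

      wget -r -np https://docs.scrapy.org/en/latest/
      # -np / --no-parent: never ascend above /en/latest/, which keeps the crawl out of sibling trees such as other language versions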

  • What are some advanced parameters discussed in the video for handling larger websites?

    -Some advanced parameters discussed for handling larger websites include 'wait' to introduce a delay between requests, 'limit-rate' to control the download speed, 'user-agent' to mimic a browser, 'level' to control the depth of recursion (the default is five levels), and redirecting output to a log file.
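    The exact values used in the video are not reproduced here, so the following is an illustrative sketch with assumed numbers and an assumed URL:

      wget -r -nc -p -E -k -np \
           --wait=5 --limit-rate=200k \
           --user-agent="Mozilla/5.0 (X11; Linux x86_64)" \
           --level=2 \
           https://docs.scrapy.org/en/latest/
      # --wait pauses 5 seconds between requests, --limit-rate caps bandwidth at ~200 KB/s,
      # --user-agent mimics a browser, --level limits recursion depth (wget's default is 5)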

  • How can Wget be used to monitor website changes?

    -Wget can be used to download webpages and keep the content offline, which can then be used for comparison over time to monitor if websites have changed, although the video does not detail specific methods for this monitoring.
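    The video does not demonstrate this, but a minimal sketch of the idea, with hypothetical snapshot file names, could be:

      wget -q -O snapshot-$(date +%F).html https://scrapy.org/
      diff snapshot-2018-01-17.html snapshot-2018-01-18.html
      # a non-empty diff output means the page changed between the two downloads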

Outlines

00:00

🌐 Introduction to Using Wget for Website Downloading and Scraping

The video introduces the use of Wget, a simple tool for downloading and scraping websites. The presenter outlines the plan to demonstrate three examples: a basic example akin to a 'hello world', advanced functionality, and more complex parameters. The purpose is to show how to keep a website offline or scrape parts of it. The presenter clarifies that Wget is not primarily for in-depth scraping but for downloading web pages for offline use or local processing. They also mention other tools like Scrapy for more advanced scraping capabilities.

05:01

🔍 Advanced Wget Parameters for Comprehensive Website Scraping

This paragraph delves into more advanced parameters of Wget for scraping websites. The presenter discusses recursive navigation to crawl through a site like a web crawler, the 'no-clobber' option to avoid re-crawling pages that were already saved, and the 'page-requisites' parameter to download associated resources such as images and CSS. The 'convert-links' option is highlighted for rewriting links to local file paths, together with an option for escaping characters in file names. The presenter also emphasizes restricting the download to a specific domain or subdomain to avoid unintentionally downloading external resources.
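A hedged reconstruction of this example-two style of command; the option set follows the description above, while the URL, the domain value, and the file-name restriction mode are assumptions:

  wget -r -nc -p -E -k -np \
       --restrict-file-names=windows \
       --domains=scrapy.org \
       https://docs.scrapy.org/en/latest/
  # -E saves dynamic pages with an .html extension, --restrict-file-names escapes awkward characters,
  # --domains keeps the crawl inside scrapy.org and its subdomains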

10:04

🛠️ Enhancing Wget Efficiency with Rate Limiting and User-Agent Customization

The final paragraph focuses on enhancing Wget's efficiency and etiquette when scraping larger sites. The presenter suggests using a delay between requests to avoid IP blacklisting and a rate limit to control the download speed. They also mention the importance of specifying a user agent to mimic a browser, which can be crucial for sites that check for familiar user agents. The default recursion depth is discussed, along with the option to adjust it for deeper or shallower crawling. The presenter concludes with a demonstration of running Wget in the background and logging the output to a file, which is useful for monitoring progress over time.
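A sketch of the background/logging variation described here, with --random-wait substituted for the fixed wait as the final note suggests; the file name and URL are assumptions:

  wget -r -nc -p -E -k -np --random-wait --limit-rate=200k \
       --user-agent="Mozilla/5.0" -o wget.log https://docs.scrapy.org/en/latest/ &
  tail -f wget.log   # the shell's & keeps wget running in the background; tail shows progress as the log grows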

Keywords

💡Wget

Wget is a free utility for non-interactive download of files from the web. It supports HTTP, HTTPS, and FTP protocols, and is often used for downloading web pages for offline viewing or scraping. In the video, Wget is used to demonstrate how to download and scrape content from websites, making it a central tool in the discussion.

💡Web scraping

Web scraping is the process of extracting data from websites. It involves programmatically accessing the content of web pages and extracting information from them. In the video, the concept is introduced as a way to extract content from web pages using Wget, although the speaker clarifies that Wget is not primarily a scraping tool but can be used for basic scraping tasks.

💡Recursion

In the context of web scraping, recursion refers to the process of following links on a webpage to discover and download additional pages. The video mentions using recursion with Wget to navigate through a website and download its content, simulating the behavior of a web crawler.
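A minimal sketch (placeholder URL):

  wget -r -l 2 https://scrapy.org/
  # -r turns on recursive crawling; -l 2 follows links at most two levels deep (wget's default depth is 5)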

💡No clobber

The 'no clobber' option in Wget prevents the program from overwriting existing files with the same name. In the video, this option is used to ensure that if a page has already been downloaded, it will not be re-downloaded and overwritten, which can be useful during development or testing phases.

💡HTML

HTML, or HyperText Markup Language, is the standard language used for creating web pages. The video script discusses downloading HTML files using Wget, which is a fundamental aspect of web scraping as it allows users to access the structure and content of web pages.

💡CSS

CSS, or Cascading Style Sheets, is used for describing the presentation of a document written in HTML. The video mentions that while Wget can download HTML files, it might not always download associated CSS files, which can affect the appearance of the downloaded pages when viewed offline.

💡Images

Images are a common element of web pages that may be downloaded during web scraping. The video script discusses the limitations of Wget in downloading images, especially when they are hosted on different domains, highlighting the need for additional parameters to ensure comprehensive content downloading.

💡User agent

The user agent is a string that identifies the software, operating system, and vendor of the requester agent to the accessed server. In the video, the speaker mentions that some websites might block requests from unfamiliar user agents, suggesting that specifying a user agent can be a strategy to avoid being blocked during web scraping.
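For illustration only; the user-agent string below is just an example of a browser-like value:

  wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)" https://scrapy.org/
  # some servers block or alter responses for the default "Wget/<version>" user agent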

💡Rate limit

A rate limit in web scraping is a constraint on the number of requests that can be made to a server in a given period. The video script suggests using a rate limit with Wget to avoid overwhelming the server and potentially getting the IP address blacklisted.
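A short sketch combining the two throttling options mentioned in the video (values assumed):

  wget -r --wait=5 --limit-rate=200k https://scrapy.org/
  # at most one request every five seconds, with download speed capped at roughly 200 KB/s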

💡Log file

A log file is a record of events that occur in a system. In the context of the video, Wget can be configured to output its progress to a log file, which is useful for monitoring the status of a download or scrape operation, especially when running in the background.
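As a sketch, wget's own background and logging options can be combined (log-file name assumed):

  wget -b -o wget.log -r https://scrapy.org/
  # -b forks wget into the background immediately; -o writes all progress messages to wget.log for later inspection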

Highlights

Introduction to using Wget as a simple tool for downloading and scraping websites.

Three examples outlined: a basic example, advanced functionality, and more advanced parameters.

Clarification that 'download and scrape' may have different implications but focuses on simple Wget usage.

Comparison with other tools like Scrapy for more advanced capabilities.

Demonstration of using Wget to download a simple HTML file from a website.

Explanation of limitations such as not downloading CSS, images, and other resources by default.

Introduction to advanced parameters like recursive navigation and no-clobber option.

Use of page-requisites, convert-links, and domain restrictions for more comprehensive scraping.

Demonstration of a complete website download including HTML and associated resources.

Discussion on the importance of adjusting parameters for different websites and scraping needs.

Introduction of additional parameters for handling larger sites, such as wait time and rate limiting.

Explanation of user agent specification to avoid being detected as a bot.

Demonstration of running Wget in the background with log file output for monitoring progress.

Note on the use of random wait times for more effective scraping on certain sites.

Final note on the quick and dirty nature of Wget compared to full-fledged web scraping tools.

Conclusion and thanks for watching the video on using Wget for downloading and scraping websites.

Transcripts

00:00

Hi everyone, in this video we'll take a look at how we can use wget as a simple tool to download an entire website / scrape an entire website. Now, for purposes of keeping this video really short and simple, I'll try to outline three examples. We'll take a look at a basic example, kind of like a hello world if you will, then we'll take a look at some more advanced functionality and go through a very common set of scenarios or parameters that we would use for screen scraping or downloading files off of a website, and then finally we'll take a look at some more advanced parameters. Now, I do want to clarify that when we talk about "download and scrape" it might have different implications, but keep in mind that this is a very simple example of using wget. I've covered other tools like Scrapy, and even off-the-shelf scraping tools, on the channel, so you might want to take a look at those for more advanced capability. But this is probably the simplest way you can download or keep a site offline, or even scrape some parts of a site. I say scrape with a bit of hesitation, because in most cases when you talk about scraping you're talking about extracting content from web pages: basically crawling the web pages, parsing them, taking content or structured content from the page and storing it in some other structured way, kind of like extracting information from web pages. So I just want to emphasize that this video is not really about in-depth scraping capability; that's not something you'd use wget for. It's more a case of wanting to download the web page and then keep the website content offline, or do some further processing locally on your desktop, or build other tools to monitor if websites have changed, etc. We are going to exclusively use the wget tool, so if a lot of the parameters of wget are unclear or you want to investigate further, you can head over to this URL; all of this is in the description of the video below.

02:22

To get started, let's start off with a simple example. I've taken a random website, and I say random somewhat tongue-in-cheek, because I'm pointing to Scrapy, which is one of the tools I have covered in an earlier video. Incidentally, Scrapy is an open-source web scraping tool, so it's kind of funny that we're using wget to scrape or download the Scrapy site. Anyway, that wasn't the original website I had in mind, but I thought it would be interesting from a video demo standpoint. So let's actually go and see what the website looks like, if I can get rid of wget... all right. So this is basically the content that we are trying to extract in all of these examples. You might argue that this isn't the best example, there are hardly any images here, but again, I wanted a simple site for a demo; feel free to substitute this URL with your preferred one. All right, so let's head over to the console. Let me just copy that again; on our console I've created a folder which currently does not contain anything, so let's just paste the wget command. Oh, it would be helpful if I... all right, connection established, let's try that again. So what you saw was wget in action, but it didn't do anything amazing. It just downloaded a simple HTML file which, if you open it, obviously opens up in the browser here, and you can see that it's pointing to a local resource. Now keep in mind that not all the contents of this page are extracted and pulled down to your computer; for example, CSS files, JPEG images, etc. are not downloaded, which actually brings us to the next example. So this was just a very simple example. If you've never used wget for HTML, chances are you've used wget in the past to download zip files and other software installation files, but it's interesting to think of it as a tool to download web pages.

04:52

Now things get more interesting in example two. Here we are going to use a few parameters. The first thing you'll notice is that I'm pointing to the same URI. What I've specified here is that it needs to recursively navigate the contents of this initial page and then start recursively traversing through the site, more like a web crawler would do. The next one is not a mandatory option, but no-clobber basically implies that if a URL has already been crawled and a page was created, don't crawl and recreate that page. It's typically helpful when you have issues with connectivity, or you want to stop and restart a couple of times, so typically during development or testing. This is where things get more interesting: we have page-requisites. Remember, in this particular case, when we extracted the site the first time it was just the HTML; it did not download or keep an offline copy of any of the other resources like images, CSS, etc. Setting this will ensure that all the other resources are also downloaded. The html-extension option is basically helpful when you're crawling, scraping, and downloading files which typically have an extension like JSP or ASPX or CGI scripts; when you store them on your local hard disk and click on them, you want the extension to be .html so that the file automatically opens in the browser. That's the only reason for this. convert-links basically ensures that any links, HTML anchors and various others, link to a local file path as opposed to pointing back to the server URIs; that's basically what convert-links is. Then we have some escaping of characters. And finally, what you're seeing is that we are specifying that it needs to stay only within the domain or the subdomain, and no-parent specifies that it needs to stay at this hierarchy level: it'll ensure that it does not go up the hierarchy and, say for example, into the French or German language versions.

07:27

So let's run this example now. All right, if we run that example, let me just cancel that and clear that folder so that it stays clean and we know what's going on. So now you can see that it has completely scraped and downloaded all these pages, and if we go down here you'll notice we have the index.html page. Let me close the old one, and now this is our latest downloaded page. Just be mindful that this selection of mine is actually a poor choice of website for this video, because this page does still rely on external CSS files and various external resources, and since we have locked it down to only the scrapy.org domain, it's not going to download those resources locally. Additionally, you'll find that if you try to copy from the same site or follow the same example, some of the images haven't been downloaded, because this is an example of an image that is in a different domain: we have restricted it to only the scrapy.org domain, whereas this is trying to point to the readthedocs.org domain. So again, your mileage depends on how much you want to scrape, but I'm just highlighting that based on the parameters you provide here it may or may not scrape the content to the fullest degree and keep everything offline. Obviously, you can tweak these parameters to your heart's content.

09:56

Now, just a quick recap given where we are right now: what we have done is download the files, it's all there on our local hard disk, and that's brilliant; it allows us to do offline browsing or do further processing offline. But in most cases you are trying to download or scrape a much larger site, and typically these parameters alone will not suffice, in which case that's typically where I use the third example. This has some additional parameters, and in most cases you'll definitely want to keep them. In the previous examples you noticed that the requests were being sent immediately, one after the other. This is okay for small sites, or if you are scraping only a small subset, but when you're scraping larger sites they might blacklist your IP. So one of the ways you can get around that, or just be a good netizen if you will, is to allow for a wait, so it waits for five seconds before it sends the next request. Also, you can specify a rate limit, that is, how much data you're downloading: by default it's in bytes, so you can specify 'k', or I believe you can also specify 'm' as in megabytes. Again, it helps ensure that you don't get blacklisted. Some sites are a little more intelligent: they check whether the user agent is in a familiar list, and if not they might not serve you content, so there are ways you can specify the user agent. And then finally, regarding recursion, you can check the docs for more up-to-date information, but you'll find that the default recursion I believe is only five levels deep; if you did want to recurse down to lower levels, or restrict it to something less, like maybe two for example, you can set the level here. The last thing I will point out is that in most cases, when I'm downloading a larger website, I typically want to run it in the background but at the same point in time have a log file where I can send that data. wget provides some inbuilt functionality so we can send the output, and see what the process or progress is, to kind of like a log file, and we will run it in the background.

12:15

So let's take a look at that in action. Just before I do that, let me split this into two sections here so that we can see the files, and let's run that. All right, so that's running in the background. If you take a look, you'll notice that the process is still running in the background and we have sent the data to this file here. So if I look at the contents, you'll notice we have a new file here; let's tail the activity. Here we can see the log is being updated as and when new requests and progress are made, and wget is sending the results to the log file. So I can just close all these consoles and maybe come back in a couple of hours, or maybe even much longer if it does take that long, and then I'll see what the progress is. That's quite handy; I would say in the vast majority of cases this is the kind of template that I would use.

13:29

One final note before we wrap up this video: again, you'll want to run it on different sites and see how it performs, but some sites do know that there's a bot, an automated process, making these requests. So instead of having a fixed five-second wait, you can remove this wait and instead put this, which is a random wait, and that works on some sites. Again, this tool is quick and dirty, in a manner of speaking, and it's not as versatile as a full-fledged web scraping tool, obviously; I've covered some other examples of web scraping tools in other videos in the past. But it's just something I found quite handy when I wanted to scrape some content really quickly and keep it offline, or do some post-processing and content extraction. All right, so that's it for this quick video; thanks everyone for watching.
