EvilZone

: Scrape a website with wget
: proxx June 10, 2013, 08:35:15 PM
Hello,

This is not really bash, but whatever; I don't know where else to put it.
Wget is a very useful tool, and for those unaware of its power I would like to show how to download a website, partially or completely.
One useful option is the ability to ignore robots.txt:
wget -H -r --level=5 -e robots=off http://somewebsite.org/

-H, --span-hosts    follow links to other hosts when recursing
-r, --recursive     download recursively
--level=5           maximum recursion depth, in this case 5
-e robots=off       run a .wgetrc-style command; robots=off makes wget ignore robots.txt


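This downloads the site recursively, following links up to five levels deep (and, because of -H, onto other hosts too), including images and other linked files.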
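If you would rather end up with a copy that stays on one site and is browsable offline, something like the sketch below might be a starting point. It drops -H and adds a few standard wget options. The mirror_site wrapper and its DRY_RUN switch are hypothetical names made up for this example, so the command can be looked at before anything is fetched; the URL is the placeholder from above.

```shell
# Hypothetical wrapper around wget; the function name and DRY_RUN are
# made up for this sketch and are not part of wget itself.
# Usage: mirror_site URL [DEPTH]
mirror_site() {
  url=$1
  depth=${2:-5}   # same depth as the command above unless told otherwise

  # Collect the options in the positional parameters (works in plain sh):
  #   --recursive        follow links found in downloaded pages
  #   --level            stop after this many levels of links
  #   --no-parent        never climb above the starting directory
  #   --page-requisites  also fetch the images/CSS a page needs to render
  #   --convert-links    rewrite links so the local copy works offline
  #   --wait=1           pause one second between requests
  #   -e robots=off      ignore robots.txt, as in the command above
  set -- \
    --recursive \
    --level="$depth" \
    --no-parent \
    --page-requisites \
    --convert-links \
    --wait=1 \
    -e robots=off

  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "wget $* $url"    # default: only print the command
  else
    wget "$@" "$url"       # DRY_RUN=0 really downloads
  fi
}

# Prints the command; prefix with DRY_RUN=0 to actually run it.
mirror_site http://somewebsite.org/
```

Printing first is just a precaution: a recursive download with robots.txt ignored can hit a server hard, so it is worth checking the depth and the wait before letting it loose.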
I must have used it a million times. Screw GUI tools :P
Have a nice day.
: Re: Scrape a website with wget
: vezzy June 10, 2013, 08:42:56 PM
Yeah, I'm pretty sure every *nix user is aware of this. Even some more advanced Windoo users too.
: Re: Scrape a website with wget
: Kulverstukas June 10, 2013, 08:49:57 PM
If you use link depth of 5, does wget download stuff outside the root website too?
Example: let's say there is a link to some blog post, does wget check if the link goes out of the website to another and skip or does wget download everything on the blog too?
: Re: Scrape a website with wget
: proxx June 10, 2013, 09:10:07 PM
Quote from Kulverstukas:
If you use link depth of 5, does wget download stuff outside the root website too?
Example: let's say there is a link to some blog post, does wget check if the link goes out of the website to another and skip or does wget download everything on the blog too?

My mistake, it's recursion depth, not link depth.
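
On the question of leaving the root website: as far as the wget manual goes, recursion stays on the starting host unless -H is given, and with -H the --domains option can limit which other hosts are followed. A rough sketch of the two variants is below. The span_cmd helper is a hypothetical name and only prints the command, and blog.example.org is a made-up stand-in for the blog in the question.

```shell
# Hypothetical helper: prints a wget command line, it does not run it.
# Usage: span_cmd URL [COMMA_SEPARATED_DOMAINS]
span_cmd() {
  url=$1
  domains=${2:-}
  if [ -n "$domains" ]; then
    # -H allows other hosts; --domains limits them to the listed domains
    echo "wget -H --domains=$domains -r --level=5 -e robots=off $url"
  else
    # without -H, recursion stays on the host of the starting URL
    echo "wget -r --level=5 -e robots=off $url"
  fi
}

# Stay on the starting host only:
span_cmd http://somewebsite.org/

# Follow links to the blog as well, but nowhere else (made-up domain):
span_cmd http://somewebsite.org/ somewebsite.org,blog.example.org
```

So with the plain -H from the first post and no --domains, the blog would be downloaded too, as long as it is within the recursion depth.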

Quote from vezzy:
Yeah, I'm pretty sure every *nix user is aware of this. Even some more advanced Windoo users too.

Me too, but for those new to *nix it might be a nice example of its power.
: Re: Scrape a website with wget
: zoup June 10, 2013, 11:46:33 PM
wget has a --spider switch too, so it doesn't download anything and you can go through the output and see if you find something interesting. But I think you already know that. But I have to get posts :-))))
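
A rough sketch of how that could look in practice: run the crawl with --spider, send wget's messages to a log file with -o, then pull the URLs out of the log. The urls_in_log helper and the spider.log file name are hypothetical, and the grep pattern is a loose guess at what counts as a URL rather than a parser for wget's log format. As far as I know, in recursive mode wget still fetches HTML pages to find their links but does not keep them.

```shell
# Hypothetical helper: list the unique URLs mentioned in a log file.
# The pattern is deliberately loose: anything starting with http:// or
# https:// up to the next space.
urls_in_log() {
  grep -Eo 'https?://[^ ]+' "$1" | sort -u
}

# Crawl without saving pages and write wget's messages to spider.log.
# Commented out so that nothing is fetched by accident; the URL is the
# placeholder from the first post.
# wget --spider -r --level=2 -e robots=off -o spider.log http://somewebsite.org/
# urls_in_log spider.log
```

From there the list can be piped into grep again to look for whatever counts as interesting, for example particular file extensions.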