
Re: Wget



Peter Mount wrote:
> However it takes quite some time (~1.5 hours) but it does work. The only
> down side is that it retrieves not just the current, but every version of
> each page, so it ends up downloading 22Meg.
>
> I don't think it's good for backup purposes, but is useful for making an
> offline copy.


I reckon Wiki's keeping the old pages in place provides an excellent way to 
detect and audit changes to the site, which may be helpful one day to 
resolve an argument.

By default wget will grab everything it's told to, unless there's a Robots 
Exclusion Protocol file (robots.txt) defined for the site. FYI this Protocol 
is designed to keep search engine robots and mirroring programs out of the 
parts of your site you don't want crawled. It is only advisory, and is 
routinely flouted by spam harvesters and other malicious programs (a 
misbehaviour you can turn to your advantage by deploying nasty 
countermeasures, but I digress...).
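
For anyone curious how a well-behaved robot decides what it may fetch, here 
is a minimal sketch using Python's standard urllib.robotparser module. The 
robots.txt contents, the example.com URLs and the allowed() helper are all 
made up for illustration; this shows the general idea, not wget's own code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the /private/ path is invented
# purely for illustration and does not describe any real site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(url, user_agent="wget"):
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/notes.html"):
        print(url, "allowed" if allowed(url) else "blocked")
```

A polite robot runs a check like this before each request. A spam harvester 
simply skips it, which is why the file is no substitute for real access 
control.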

Incidentally wget's --mirror option turns on recursion and time-stamping, so 
it only downloads the files that are new and/or changed since its previous 
run. You got them all this time because it was your ~first~ run; therefore 
all of the files were "new". Next time it won't be so bad - provided Tom 
doesn't upload his data CDs to Wiki first :-)
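
The "new and/or changed" test can be sketched roughly as below. This is my 
paraphrase of the time-stamping rule described in the wget manual, not 
wget's actual code, and needs_download() and its parameters are names I 
have invented. It also assumes the server reports a modification time and 
size for each page, which a dynamically generated Wiki page may not.

```python
import os


def needs_download(local_path, remote_mtime, remote_size):
    """Decide whether a remote file should be fetched again.

    remote_mtime is the server's modification time in seconds since the
    epoch and remote_size is its length in bytes (both hypothetical
    inputs that a real client would take from the HTTP response headers).
    """
    # No local copy yet: this is the "first run" case, so fetch it.
    if not os.path.exists(local_path):
        return True
    info = os.stat(local_path)
    # A size mismatch means the copy is stale or incomplete.
    if info.st_size != remote_size:
        return True
    # Otherwise fetch only when the remote copy is strictly newer.
    return remote_mtime > info.st_mtime
```

On a first run every file fails the first check, which is why the whole 
site came down in one go.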


cheers,

-- 

Fraser Farrell

----------------------------------
http://astronomy.trilobytes.com.au
----------------------------------