Tuesday, December 26, 2017

archiving - wget Download webpage completely into one file and an assets folder


I am trying to emulate right-clicking a page and choosing "Save As" (complete page) in Chrome or Firefox. I tried this:


wget -E -H -k -K -p http://gizmodo.com

But that created several folders, one for each domain that hosts resources. I also tried:


wget -r -N -l inf --no-remove-listing -x http://gizmodo.com

But that did not download all the dependencies. Both commands came from other posts, and I tried several other suggestions as well, but none of them did what I wanted.


What I really want is what Chrome and Firefox do: create one index.html file, with all the dependency paths rewritten to point to files that sit in an 'assets' folder next to it.


I've also read the wget manual and can't find anything beyond what these commands already do. Is this even possible?


Answer



From the Wget manual:


--no-directories (or -nd)


Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the filenames will get extensions .n).


--no-host-directories (or -nH)


Disable generation of host-prefixed directories. By default, invoking Wget with -r http://fly.srk.fer.hr/ will create a structure of directories beginning with fly.srk.fer.hr/. This option disables such behavior.


--page-requisites (or -p)


This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets. ...


--no-parent (or -np)


Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.


--convert-links (or -k)


After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.


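Those options should help. In particular, -nd stops wget from creating a separate folder for each host, which is what the first command in the question ran into.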

