Wednesday, April 10, 2019

Exclude list of specific files in wget

I am trying to download a lot of pages from a website on dial-up and it can be brutally slow. I have almost got the perfect wget command, but because I'm downloading pages from the same site wget wastes times downloading the same standard images for each page.


If I know the name of the default page images, is there any way to have wget ignore and thus avoid downloading those for each and every page?


Here is an example of one of the wget commands that my shell script generates into another shell script to download all of the pages:


mkdir candy-canes-on-the-flannel-board-in-preschool
cd candy-canes-on-the-flannel-board-in-preschool
wget -p -nd -A jpg,html -k http://www.teachpreschool.org/2011/12/candy-canes-on-the-flannel-board-in-preschool/
wget -c --random-wait --timeout=30 --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008092416 Firefox/3.0.3" http://www.teachpreschool.org/2011/12/candy-canes-on-the-flannel-board-in-preschool/ -O "candy-canes-on-the-flannel-board-in-preschool"
rm Baby-and-Toddler.jpg Childrens-Books.jpg Creative-Art.jpg Felt-Fun.jpg Happy_Rainbow-e1338766526528.jpg index.html Language-and-Literacy.jpg Light-table-Button.jpg Math.jpg Outdoor-Play.jpg outer-jacket1-300x153.jpg preschoolspot-button-small.jpg robots.txt Science-and-Nature.jpg Signature-2.jpg Story-Telling.jpg Tags-on-Preschool.jpg Teaching-Two-and-Three-Year-olds.jpg
cd ../

Now I realize the script is not likely as savvy as it could be but it is doing what I need at the moment except that you can see from the rm command that I would just like to prevent wget from downloading the files in the first place if possible.


I almost forgot to mention, there are two wget commands and that is because the first one downloads the page as index.html and for some reason it does not open in my browser, however, when I open it and look at it in vim all of the page's content is there, so I am not sure why it does not open. But if I just issue the second wget command as it is then that page, same file really with an alternate name, opens up fine. Something that if I could fix would also help to streamline the process.

No comments:

Post a Comment

hard drive - Leaving bad sectors in unformatted partition?

Laptop was acting really weird, and copy and seek times were really slow, so I decided to scan the hard drive surface. I have a couple hundr...