Sunday, March 25, 2018

linux - Httrack filter links with certain pattern

I am trying to use httrack to download an entire web archive from archive.org. The idea is to download as many archive links as possible, but only links that actually belong to the archive, not to the current live website. In other words, I want to download only the links that match this pattern:


/web/[archive_timestamp]/[website]/*

Here is an example of an archive link: http://web.archive.org/web/20011209181356/http://www.emag.ro:80/
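
To make the pattern above concrete, here is a minimal sketch of it as a regular expression. It is only an illustration of which URLs I want to keep, not something httrack itself uses. The 14-digit timestamp, the optional ":80" port and the name is_archive_link are assumptions based on the example link.

```python
import re

# Assumed shape of an archive link: /web/<14-digit timestamp>/<original URL>.
# The optional ":80" mirrors the port that appears in the example link.
ARCHIVE_LINK = re.compile(
    r"^https?://web\.archive\.org/web/(?P<timestamp>\d{14})/"
    r"https?://www\.emag\.ro(?::80)?/.*$"
)


def is_archive_link(url: str) -> bool:
    """Return True if url looks like an archived capture of www.emag.ro."""
    return ARCHIVE_LINK.match(url) is not None


if __name__ == "__main__":
    # An archived capture (wanted) versus the live site (not wanted).
    print(is_archive_link("http://web.archive.org/web/20011209181356/http://www.emag.ro:80/"))
    print(is_archive_link("http://www.emag.ro/"))
```

Note that the timestamp is matched as any 14 digits here, because links inside an archived page may point to captures taken at other times.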


In order to download the links that I need, I am using this command:


httrack http://web.archive.org/web/20011209181356/http://www.emag.ro:80/ -* +*/web/20011209181356/http://www.emag.ro/*

This should mean: first filter out all links (disable all of them), then enable only those that contain /web/20011209181356/http://www.emag.ro/
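
As a rough sanity check of that include filter, the sketch below tests it against two URLs using Python's fnmatch module. This is only an approximation: it assumes httrack's `*` behaves like a plain shell-style wildcard, which may not hold exactly, and the second URL (index.html) is made up for illustration.

```python
from fnmatch import fnmatchcase

# The include filter from the command, without the leading "+".
INCLUDE_FILTER = "*/web/20011209181356/http://www.emag.ro/*"

# The start URL from the command; note that it contains ":80".
start_url = "http://web.archive.org/web/20011209181356/http://www.emag.ro:80/"

# A hypothetical inner page without the port, for comparison.
inner_url = "http://web.archive.org/web/20011209181356/http://www.emag.ro/index.html"

for url in (start_url, inner_url):
    # fnmatchcase does a case-sensitive, shell-style wildcard match.
    print(url, "->", fnmatchcase(url, INCLUDE_FILTER))
```

Under this approximation, a URL that keeps the ":80" port does not match the filter, and neither would a link that carries a different timestamp. Whether httrack treats these the same way is something I have not verified.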


The command downloads just the homepage, so I guess I'm doing something wrong.


Does anybody have an idea of how to get this done? I would like to avoid building my own scraper in order to save time. A different tool would also be fine, as long as I can use it from the command line and it also works on Windows.
