Wednesday, November 14, 2018

unicode - How do I use HTTrack to download gzipped files from URLs with accented characters?

I am downloading a site with HTTrack and getting inconsistent results. Several directories end up with two or more versions of the same HTML file. The duplicates in a given directory may include:



  1. a file named índice.html (note the accented í) that shows gibberish in the browser. On closer inspection, this turns out to be a .z archive saved with the wrong extension; it contains the correct HTML file

  2. a file named índice.html.z, which is an archive containing a readable version of that file

  3. a file named índice-2.html, which is a good version of the original índice.html, perfectly readable in the browser

  4. a file named índice-2.html.z, which is an archive containing the same file again, but sometimes that file will be somewhat different in size from the first one

  5. etc
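
As a stopgap after the download, the mislabeled files from item 1 can be detected and unpacked in place. The following is only a minimal sketch: it assumes the archives are ordinary gzip streams (identified by the magic bytes 1f 8b), which the question does not confirm, and the helper names `is_gzipped` and `repair_tree` are made up for illustration. It does not touch the .z files or explain why HTTrack produced them.

```python
import gzip
import zlib
from pathlib import Path

# Assumption: the ".z" archives are plain gzip streams, which start with 1f 8b.
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path: Path) -> bool:
    """Return True if the file starts with the gzip magic bytes."""
    with path.open("rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def repair_tree(root: Path) -> list:
    """Decompress, in place, every *.html file under root that is really gzip data.

    Returns the list of files that were rewritten. Files that fail to
    decompress (for example truncated downloads) are left untouched.
    """
    repaired = []
    for path in sorted(root.rglob("*.html")):
        if not path.is_file() or not is_gzipped(path):
            continue
        try:
            data = gzip.decompress(path.read_bytes())
        except (OSError, EOFError, zlib.error):
            continue  # corrupt or truncated archive: skip rather than destroy it
        path.write_bytes(data)
        repaired.append(path)
    return repaired
```

Calling `repair_tree(Path("mirror"))`, where `mirror` is a placeholder for the local download directory, returns the files it rewrote; running it on a copy of the mirror first is the safer choice.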


The HTTrack error log shows the following:



18:07:32 Error: "Error when decompressing" (-1) at link example.com/conversación/índice.html



This is a Spanish site: some directory names contain accented characters, and the index files are called índice.html instead of index.html. This makes me suspect that the accents are what trips HTTrack up, but my only evidence is that the English version of the same site downloaded without any problems.
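
One reason accents can matter is that the same path can be percent-encoded in more than one way, so a crawler and a server may disagree on whether two URLs are the same resource. The sketch below only illustrates that ambiguity for the path from the error log using Python's standard library; whether HTTrack or this particular server actually mixes the two encodings is an assumption, not something the log shows.

```python
from urllib.parse import quote, unquote

# Path taken from the error log line above.
path = "/conversación/índice.html"

# The same characters percent-encoded as UTF-8 and as Latin-1 (ISO-8859-1).
utf8_form = quote(path, encoding="utf-8")
latin1_form = quote(path, encoding="latin-1")

print(utf8_form)    # /conversaci%C3%B3n/%C3%ADndice.html
print(latin1_form)  # /conversaci%F3n/%EDndice.html

# Decoding with the wrong charset does not give the original name back,
# which is one way a single page could end up saved under several names.
print(unquote(latin1_form, encoding="latin-1") == path)  # True
print(unquote(latin1_form, encoding="utf-8") == path)    # False
```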


To summarize, the problem might lie either in the accented characters in the URL or something else related to HTTrack's way of handling gzipped HTML files, but my main question remains the same:


Is this a bug in HTTrack or expected behavior, and how do I get around it to download the Spanish version of the site successfully?
