Wednesday, March 1, 2017

parsing - I installed and ran Heritrix Web Crawler. It stored data in .arc.gz files


If you have used Heritrix Web Crawler, I'd really appreciate your help.


3 questions:



  1. An ARC file probably contains the source of many pages. How do I figure out which is which?

  2. How do I interpret the .arc.gz files? I opened them in Vim and saw HTML code mixed with junk (which I can't parse with Python's SGMLParser because of the junk).

  3. Is compressing the files (.gz) recommended?


Basically, I have no idea what .arc files are or what I can do with them.
I'm used to using urllib2 to download HTML and parse it manually.


Answer



Here's a link to download ArcReader, along with an explanation of the format: http://crawler.archive.org/articles/developer_manual/arcs.html.


I Googled for reading arc files and this was the first link.


First you need to decompress the files (they are gzipped, hence the .gz extension). Then you can read the ARC file.
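
If you would rather stay in Python than use ArcReader, the two steps above can be sketched roughly as follows. This is a minimal sketch, not Heritrix's own reader, and it assumes the version 1 ARC layout: each record starts with a one-line header of space-separated fields (URL, IP address, archive date, content type, length), followed by exactly that many bytes of content. It is written for Python 3, and the names `ArcRecord`, `read_arc_records`, `read_arc_gz` and `split_http_response` are made up for illustration. The URL in each header is how you tell the pages apart, and the "junk" in front of the HTML is most likely the HTTP response headers stored with each page.

```python
import gzip
from typing import BinaryIO, Iterator, NamedTuple, Tuple


class ArcRecord(NamedTuple):
    """One entry from an ARC file (hypothetical helper type)."""
    url: str
    content_type: str
    body: bytes


def read_arc_records(stream: BinaryIO) -> Iterator[ArcRecord]:
    """Yield records from an already-decompressed ARC byte stream.

    Assumes ARC v1 headers: URL, IP, date, content type, length.
    """
    while True:
        header = stream.readline()
        if not header:
            return  # end of file
        header = header.strip()
        if not header:
            continue  # blank separator line between records
        fields = header.decode("latin-1").split(" ")
        if len(fields) < 5:
            raise ValueError("unexpected ARC header: %r" % header)
        length = int(fields[-1])  # the last field is the content length
        body = stream.read(length)
        yield ArcRecord(url=fields[0], content_type=fields[3], body=body)


def read_arc_gz(path: str) -> Iterator[ArcRecord]:
    """Decompress a .arc.gz file on the fly and yield its records."""
    # gzip.open reads concatenated gzip members as one continuous stream.
    with gzip.open(path, "rb") as stream:
        yield from read_arc_records(stream)


def split_http_response(body: bytes) -> Tuple[bytes, bytes]:
    """Split a record body into (HTTP headers, payload) at the first blank line."""
    head, sep, payload = body.partition(b"\r\n\r\n")
    return (head, payload) if sep else (b"", body)
```

Usage would look something like `for rec in read_arc_gz("crawl.arc.gz"): print(rec.url)`, where `crawl.arc.gz` is a placeholder file name. The first record is normally a `filedesc://` entry describing the file itself rather than a crawled page, so you will probably want to skip it before passing the payload from `split_http_response(rec.body)` to your HTML parser.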

