parsing - I installed and ran Heritrix Web Crawler. It stored data in .arc.gz files

Wednesday, March 1, 2017

If you have used Heritrix Web Crawler, I'd really appreciate your help.

3 questions:

An ARC file probably contains the source code of MANY pages. How do I figure out which is which?

How do I interpret the .arc.gz files? I opened them in Vim and found HTML code plus junk (which I can't even parse with Python's SGMLParser because of the junk).

Basically, I have no idea what .arc files are or what I can do with them.
I'm used to using urllib2 to download pages and parsing the HTML manually.

Answer

I Googled for reading arc files and this was the first link.

First you need to unzip the files (they are gzipped, hence the .gz extension). Then you can read the ARC file.
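To make the two steps concrete, here is a rough Python sketch, not a definitive reader. It assumes the version 1 ARC layout, where each record starts with a one-line header of space-separated fields (URL, IP address, archive date, content type, length) followed by exactly that many bytes of content. The function name `iter_arc_records` is made up for illustration. The first record is normally a `filedesc://` entry describing the file itself, and the "junk" in front of each page's HTML is most likely the stored HTTP response headers.

```python
import gzip


def iter_arc_records(path):
    """Yield (url, content_type, body) for each record in an ARC file.

    Assumption: version 1 header layout, i.e.
    "URL IP-address Archive-date Content-type Archive-length".
    """
    # gzip.open reads concatenated gzip members as one continuous stream.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        while True:
            header = f.readline()
            if not header:
                break  # end of file
            header = header.strip()
            if not header:
                continue  # blank separator line between records
            fields = header.split(b" ")
            url = fields[0].decode("latin-1")
            content_type = fields[-2].decode("latin-1")
            length = int(fields[-1])  # size of the record body in bytes
            body = f.read(length)
            yield url, content_type, body


def split_http_response(body):
    """Split a stored HTTP response into (raw_headers, payload)."""
    raw_headers, _, payload = body.partition(b"\r\n\r\n")
    return raw_headers, payload
```

Looping over `iter_arc_records("crawl.arc.gz")` (a placeholder file name) and printing each `url` answers the "which is which" question, and `split_http_response(body)[1]` should give you HTML you can hand to whatever parser you already use.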
