If you have used Heritrix Web Crawler, I'd really appreciate your help.
3 questions:
- An ARC file probably contains the source code of MANY pages. How do I figure out which is which?
- How do I interpret the .arc.gz files? I opened them in Vim and saw HTML code plus junk (which I can't even parse with Python's SGMLParser because of the junk).
- Is it recommended to compress? (.gz)
Basically, I have no idea what .ARC files are and what I can do with them.
I'm used to using urllib2 to download pages and parsing the HTML manually.
Answer
Here's a link to download ArcReader, and an explanation: http://crawler.archive.org/articles/developer_manual/arcs.html.
I Googled for reading arc files and this was the first link.
First you need to unzip the files (they are gzipped, hence the .gz extension). Then you can read the ARC file.
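If you would rather stay in Python than use ArcReader, the two steps above (gunzip, then read the records) can be sketched roughly as follows. This is a minimal Python 3 sketch, not Heritrix's own reader. It assumes the ARC version 1 layout, where each record starts with a one-line header of space-separated fields (URL, IP address, archive date, content type, length) and the last field gives the number of body bytes that follow. The names `iter_arc_records`, `open_arc` and `split_http` are made up for illustration. For a crawled page, the record body is normally the raw HTTP response (status line and headers, then the HTML), which would explain part of the "junk" around the markup.

```python
import gzip


def open_arc(path):
    """Open an ARC file for binary reading, gunzipping on the fly if needed."""
    if path.endswith(".gz"):
        # gzip.open reads through concatenated gzip members transparently.
        return gzip.open(path, "rb")
    return open(path, "rb")


def iter_arc_records(stream):
    """Yield (url, header_fields, body) for each record in the stream.

    Assumes ARC v1 headers: the first field is the URL and the last
    field is the body length in bytes.
    """
    while True:
        line = stream.readline()
        if not line:
            break  # end of file
        line = line.strip()
        if not line:
            continue  # blank line separating two records
        fields = line.decode("utf-8", "replace").split(" ")
        length = int(fields[-1])
        body = stream.read(length)
        yield fields[0], fields, body


def split_http(body):
    """Split a raw HTTP response into (headers, payload) at the first blank line."""
    head, _sep, payload = body.partition(b"\r\n\r\n")
    return head, payload


# Usage (the file name here is hypothetical):
# with open_arc("crawl-00001.arc.gz") as stream:
#     for url, fields, body in iter_arc_records(stream):
#         headers, html = split_http(body)
#         print(url, len(html))
```

The URL in each record header answers the "which is which" question: every page carries its own URL, so you can build a dict from URL to HTML and then hand the HTML to whatever parser you already use. The first record in a file is usually a `filedesc://` entry describing the archive itself rather than a crawled page, so you may want to skip it.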