This library handles web crawl data fetched with the Heritrix web crawler (or any other tool that produces WARC files): it extracts the plain text from structured formats and resaves the data as WARC "conversion" records.
The primary use for this tool is to extract text from web crawl data sets for use in machine learning and supervised classification work.
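To illustrate what "extracting the plain text from structured formats" means in practice, here is a minimal sketch for the HTML case, using only the Python standard library. It is not this library's own code or API: the names TextExtractor and html_to_text are hypothetical, and a real extractor would also need to deal with character encodings and other formats.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect visible text, skipping script/style."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Entering an element whose contents are not human-readable text.
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse all runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (illustrative only)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling html_to_text() on the payload of an HTML response gives the kind of plain text that would then be stored in a "conversion" record. Note that this sketch puts a space between every text chunk, so text split by inline tags (for example "foo<b>bar</b>") comes out as two words.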
WARC (Web ARChive) is a file format for storing web crawls: http://bibnum.bnf.fr/WARC/
Note:
This library was originally based upon the "warc" library by the Internet Archive and others, but was refactored to rely upon the hanzo warctools and now has no code in common with the original library. In particular, it is no longer linked in any real sense with the library from which it was originally forked on GitHub. All the code in this release has been written by Tom Nicholls; the other authors listed as collaborators on the original library are not responsible in any way for the bugs that are present!
The hanzo library on which this code depends can be installed with 'pip install warctools'. Beware that several old versions are floating around in the package index under different names.
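Because several differently named versions exist, it can help to check what is actually installed. The following is a minimal sketch using only the standard library (Python 3.8 or later). The function name describe_install is hypothetical, and the module name "hanzo" is an assumption about how the warctools package is imported; adjust both names if your installation differs.

```python
from importlib import metadata, util


def describe_install(dist_name="warctools", module_name="hanzo"):
    """Report whether a distribution is installed and its module importable.

    dist_name is the name used with pip; module_name is the assumed
    top-level import name (an assumption, not confirmed by this README).
    """
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # pip does not know a distribution under this name.
        version = None
    # find_spec returns None when a top-level module cannot be located.
    importable = util.find_spec(module_name) is not None
    return {
        "distribution": dist_name,
        "version": version,
        "importable": importable,
    }


if __name__ == "__main__":
    print(describe_install())
```

If the reported version is None while the module is importable, the module probably came from one of the older, differently named packages.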
At this stage the software should be considered feature-complete, though minor additions may be made in the future.