8. Release 1.10.2 - 2007-01-15

Abstract

This is primarily a bug-fix release with a couple of new features. It precedes a number of significant changes to the Heritrix project that will require adjustments by developers and crawl operators. After 1.10.2, Heritrix source code control, issue tracking, and the build process will migrate to new systems. In addition, updates to core classes, especially to the settings architecture, will break backward compatibility with the crawler settings files and formats of 1.10.2 and earlier.

8.1. Contributors

  • Olaf Freyer

  • Max Schöfmann

8.2. Changes

8.2.1. Jericho HTML Extractor

Olaf Freyer has contributed an HTML Extractor named JerichoExtractorHTML, based on the Jericho HTML Parser. The following quote from the JerichoExtractorHTML class comment describes how the new Extractor differs from ExtractorHTML, along with its advantages and downsides:

“This extractor extends ExtractorHTML and mimics its workflow - but has some substantial differences when it comes to internal implementation. Instead of heavily relying upon java regular expressions it uses a real html parser library - namely Jericho HTML Parser (http://jerichohtml.sourceforge.net). Using this parser it can better handle broken html (i.e. missing quotes) and also offer improved extraction of HTML form URLs (not only extract the action of a form, but also its default values).

Unfortunately this parser also has one major drawback - it has to read the whole document into memory for parsing, thus has an inherent OOME risk. This OOME risk can be reduced/eliminated by limiting the size of documents to be parsed (i.e. using NotExceedsDocumentLengthTresholdDecideRule). Also note that this extractor seems to have a lower overall memory consumption compared to ExtractorHTML. (still to be confirmed on a larger scale crawl)”
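The class comment quoted above recommends capping document size before a whole-document parse. The following is a minimal sketch of that idea in plain Java, not Heritrix code: the class BoundedDocumentReader, the method readIfWithinLimit, and the 1024-byte limit are all hypothetical names and values chosen for illustration. In Heritrix itself the cap would be configured through the decide rule named in the quote.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Heritrix): refuses to buffer oversized documents. */
class BoundedDocumentReader {

    /**
     * Buffers the stream only while it stays within maxBytes.
     * Returns null as soon as the limit is exceeded, so the caller can skip
     * a parser that needs the whole document in memory.
     */
    static byte[] readIfWithinLimit(InputStream in, long maxBytes) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        long total = 0;
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                total += n;
                if (total > maxBytes) {
                    return null; // over the threshold: give up before memory grows further
                }
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] small = "<a href=x>link</a>".getBytes(StandardCharsets.UTF_8);
        byte[] large = new byte[10000];
        // A small document is buffered; a large one is rejected.
        System.out.println(readIfWithinLimit(new ByteArrayInputStream(small), 1024) != null);
        System.out.println(readIfWithinLimit(new ByteArrayInputStream(large), 1024) != null);
    }
}
```

The check runs while reading rather than relying on a declared content length, so at most one chunk beyond the limit is ever read and nothing beyond the limit is buffered. That is one possible design; a rule that inspects the reported document length up front, as the quoted comment suggests, avoids reading the body at all.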

Table 1. All Tracked Changes

ID      | Type | Summary                                                    | Open Date  | By        | Filer
--------+------+------------------------------------------------------------+------------+-----------+---------
913002  | Add  | Make ExtractorHTML aggressiveness configurable             | 2004-03-09 | gojomo    | gojomo
1573708 | Add  | [Contrib] JerichoExtractorHTML                             | 2006-10-09 | nobody    | pandae
1633458 | Add  | [arcreader] Support for s3 and streaming improvements      | 2007-01-11 | stack     | stack
1629242 | Fix  | filehandle leak: ReplayInputStream/BufferedSeekInputStream | 2007-01-05 | karl-ia   | gojomo
1218961 | Fix  | "failed get of replay" in ExtractorHTML... usu: UTF-16BE   | 2005-06-11 | karl-ia   | gojomo
996161  | Fix  | Fix DNSJava issues (memory)                                | 2004-07-22 | karl-ia   | gojomo
1477371 | Fix  | ExtractorDOC wants whole doc in memory                     | 2006-04-26 | paul_jack | gojomo
1618928 | Fix  | Do not allow http:/ and https:/ urls                       | 2006-12-19 | stack-sf  | stack-sf
1596176 | Fix  | NotMatchesListRegExpDecideRule extends wrong class         | 2006-11-14 | nobody    | pandae
1593540 | Fix  | NPE in quotaEnforcer.checkQuotas                           | 2006-11-09 | nobody    | svc
1587413 | Fix  | [PATCH] Webapp doesn't find profiles and ignores jobsdir   | 2006-10-30 | nobody    | nobody
1572391 | Fix  | SURTs for IP-address URIs unhelpful                        | 2006-10-06 | gojomo    | gojomo
1501810 | Fix  | NPE in FetchHTTP.saveCookies                               | 2006-06-06 | gojomo    | stack-sf
1633117 | Fix  | Useragent compare because of case in RobotsExclusionPolicy | 2007-01-11 | stack-sf  | stack-sf