10. Release 1.10.0 - 2006-09-11

Abstract

Release 1.10.0 adds new configuration options, experimental new protocol and format support, and lots of fixes. 43 tracked bugs have been fixed and 35 feature requests added.

Release 1.10.0 requires Java facilities from JDK 1.5.x ("Java 5").

10.1. Contributors

Aside from the usual suspects, the following contributed to this release:

  • Eric C Jensen

  • Olaf Freyer

  • Karl Wright (of MetaCarta)

  • Frank McCown (of Old Dominion University)

  • Max Schöfmann

  • Søren Vejrup Carlsen (of Royal Library, Denmark)

10.2. Known Limitations/Issues

10.2.1. java.io.IOException: No locks available

See Section 11.1.1, “java.io.IOException: No locks available” in 1.8.0 Release Notes.

10.3. Pre-1.10.0 checkpoints

Checkpoints made with 1.8.0 are definitely not recoverable with 1.10.0.

10.4. Changes

10.4.1. No default login/password for web UI and JMX

The old default login of 'admin' and password of 'letmein' for access to the crawler web UI (and JMX agent control) have been eliminated. It is now necessary to specify an access username and password to start Heritrix. This may be done with the -a or --admin command-line argument or via the system property 'heritrix.cmdline.admin'. (These each take a colon-separated username and password, like 'username:password'.)
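As a rough illustration of the colon-separated 'username:password' form, the sketch below splits such a value into its two parts. AdminCredential and parse are hypothetical names, not Heritrix classes, and splitting at the first colon (so that a password may itself contain colons) is an assumption rather than documented Heritrix behavior. Only the property name 'heritrix.cmdline.admin' comes from these notes.

```java
/** Hypothetical holder for a "username:password" value; not part of Heritrix. */
final class AdminCredential {
    final String username;
    final String password;

    private AdminCredential(String username, String password) {
        this.username = username;
        this.password = password;
    }

    /**
     * Splits the value at the first colon (an assumption), rejecting
     * values with an empty username or an empty password.
     */
    static AdminCredential parse(String value) {
        if (value == null) {
            throw new IllegalArgumentException("no admin credential given");
        }
        int colon = value.indexOf(':');
        if (colon <= 0 || colon == value.length() - 1) {
            throw new IllegalArgumentException("expected username:password");
        }
        return new AdminCredential(value.substring(0, colon),
                                   value.substring(colon + 1));
    }

    public static void main(String[] args) {
        // Property name taken from the release notes; the fallback value
        // below is a made-up example, not a default.
        String raw = System.getProperty("heritrix.cmdline.admin", "operator:example");
        AdminCredential credential = parse(raw);
        System.out.println("admin user: " + credential.username);
    }
}
```

Rejecting a value with no colon up front mirrors the requirement stated above that both a username and a password must be supplied.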

10.4.2. Web UI binds to localhost only by default

Previously, the Jetty web server that runs the Heritrix web UI listened on all available network interfaces. In 1.10.0, Jetty will only bind to localhost by default. The -b or --bind command-line argument can be used to specify a different interface or list of interfaces to bind to instead. You may specify "-b /" to get the old behavior -- binding on all interfaces -- but only take this step after reading section 2.3 of the User Manual, "Security Considerations".
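The difference between binding to localhost and binding to all interfaces can be seen with plain java.net sockets. This is a generic sketch, not Jetty or Heritrix configuration code; BindDemo and its method names are hypothetical, and port 0 is used only so the operating system picks a free port.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

/** Hypothetical demo of loopback-only versus all-interface listeners. */
final class BindDemo {
    /** Listener reachable only from the local machine. */
    static ServerSocket bindLoopback() throws IOException {
        return new ServerSocket(0, 50, InetAddress.getByName("127.0.0.1"));
    }

    /** Listener on the wildcard address, i.e. every network interface. */
    static ServerSocket bindAll() throws IOException {
        // A null bind address tells ServerSocket to accept on any local address.
        return new ServerSocket(0, 50, null);
    }

    public static void main(String[] args) throws IOException {
        ServerSocket local = bindLoopback();
        ServerSocket all = bindAll();
        try {
            System.out.println("loopback only:  " + local.getInetAddress());
            System.out.println("all interfaces: " + all.getInetAddress());
        } finally {
            local.close();
            all.close();
        }
    }
}
```

A listener bound as in bindLoopback cannot be reached from other hosts, which is why it is the safer default for an administrative UI.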

10.4.3. QuotaEnforcer 'force-retire' option

The optional QuotaEnforcer processor has a new setting, 'force-retire', which is by default 'true', and changes the default behavior of QuotaEnforcer. Previously, when a URI was noted as being over-quota, it would be marked with a special over-quota failure code which caused it to complete processing as an error. As a result, all over-quota URIs would quickly be finished as errors and appear in the crawl.log, but there would be no opportunity to raise the quota and continue crawling.

The new default behavior instead marks the URI with a directive requesting its frontier queue be retired. If the frontier supports this directive, the URI will be returned to its queue as if never tried, and the whole queue retired from active crawling. This offers the opportunity to raise the quota and continue crawling the URI and others of its queue. (All settings changes cause all retired queues to be reevaluated.) However, the over-quota URIs will not appear as errors in the crawl.log.

If the old behavior is preferred, set 'force-retire' to 'false'.

10.4.4. URL canonicalization changes

In 1.10.0, URL canonicalization has changed in two ways. First, the stripping of sessionids has improved (see [1550797] Stripping sessionid can leave behind doubled ampersands). Previously, if the sessionid was in the middle of a query string, bookended by other query parameters, canonicalization would leave behind the surrounding ampersands. For example, if the URL http://a.com/?a=1&sid=00000000000000000000000000000000&b=1 was passed through canonicalization, the result would be http://a.com/?a=1&&b=1. This has been fixed so that the result is now http://a.com/?a=1&b=1.

The second change, [1550805] Add stripping of coldfusion sessionids, adds a new ColdFusion sessionid stripper to the list of default canonicalization rules.

We draw your attention to these seemingly minor changes because, for those of you running regular crawls, having both changes in place should reduce the overall number of (duplicate) pages crawled, depending on the type of crawl.
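One way to avoid the doubled-ampersand problem described above is to drop the sessionid parameter and then rejoin the remaining parameters, rather than cutting the sessionid out of the string in place. The sketch below does that. It is not the Heritrix canonicalization rule: SessionIdStripper is a hypothetical class, the pattern (a parameter named 'sid' with exactly 32 hex digits) is assumed from the example URL, and fragments ('#...') are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sessionid stripper; not the Heritrix implementation. */
final class SessionIdStripper {
    // Assumed shape of a session parameter: sid=<32 hex digits>.
    private static final Pattern SESSION_PARAM =
        Pattern.compile("(?i)sid=[0-9a-f]{32}");

    static String strip(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // no query string, nothing to strip
        }
        String base = url.substring(0, q);
        List<String> kept = new ArrayList<String>();
        for (String param : url.substring(q + 1).split("&")) {
            // Skip the sessionid and any empty segments, so no "&&" survives.
            if (param.length() > 0 && !SESSION_PARAM.matcher(param).matches()) {
                kept.add(param);
            }
        }
        // Design choice of this sketch: drop the '?' if nothing is left.
        return kept.isEmpty() ? base : base + "?" + String.join("&", kept);
    }

    public static void main(String[] args) {
        // URL taken from the example in the release notes.
        System.out.println(strip(
            "http://a.com/?a=1&sid=00000000000000000000000000000000&b=1"));
    }
}
```

Because the query string is rebuilt from the surviving parameters, the position of the sessionid (first, middle, or last) makes no difference to the result.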

10.4.5. WARC

This release includes experimental WARC readers and writers. Be warned that neither the code nor the specification is final, so both are subject to change with no guarantee of backward compatibility: i.e. newer readers may not be able to read WARCs written with older writers. See the org.archive.io.warc package documentation for more on the current state of the code, including documentation of the initial versions of the Arc2Warc and Warc2Arc tools.

10.4.6. FTP

This release also includes experimental support for FTP. This support is disabled in the default Heritrix configuration. See the User Guide for information on how to enable FTP.

Table 3. All Tracked Changes

ID | Type | Summary | Open Date | By | Filer
1545462 | Add | Experimental WARC Readers and Writers | 2006-08-23 | stack-sf | stack-sf
1494491 | Add | path/role-sensitive robots (eg ignore for inline images/css) | 2006-05-24 | karl-ia | gojomo
1550849 | Add | 'Implied' URI extractor (eg, YouTube) | 2006-09-01 | karl-ia | gojomo
1549665 | Add | Add experimental Warc2Arc and Arc2Warc scripts | 2006-08-30 | stack-sf | stack-sf
1546829 | Add | Secure admin UI: Bind cmd-line argument | 2006-08-25 | karl-ia | stack-sf
1545600 | Add | remove default admin username/password | 2006-08-23 | karl-ia | gojomo
1536441 | Add | hash-based CrawlMapper | 2006-08-08 | karl-ia | gojomo
1535744 | Add | force reread of disk settings (for out-of-JVM/bulk changes) | 2006-08-06 | karl-ia | gojomo
1534280 | Add | scriptable (beanshell) Processor, DecideRule options | 2006-08-03 | gojomo | gojomo
1522112 | Add | CrawlMapper skip mapping 'E'mbeds (etc) | 2006-07-13 | karl-ia | gojomo
1520269 | Add | keep over-limit (-500X) URIs in queues (don't 'finish/log) | 2006-07-10 | karl-ia | gojomo
1387423 | Add | [arcreader] Fetch records and iterate remote ARCs | 2005-12-21 | stack-sf | stack-sf
1351778 | Add | favicon.ico for heritrix web ui | 2005-11-08 | gojomo | gojomo
1209724 | Add | [contrib] Add BigMapFactory.getSynchronizedBigMap | 2005-05-27 | gojomo | ck-heritrix
1526781 | Add | broader rotation / wider 'frontline' frontier queue option | 2006-07-21 | karl-ia | gojomo
1092496 | Add | crawl.log should have hash of DNS records | 2004-12-28 | stack-sf | gojomo
1006194 | Add | FTP fetching | 2004-08-09 | karl-ia | gojomo
1550805 | Add | Add stripping of coldfusion sessionids -- add to default lis | 2006-09-01 | stack-sf | stack-sf
1547390 | Add | [contrib] patch to allow setting local IP to bind fetch from | 2006-08-26 | stack-sf | ecjensen
1545847 | Add | [contrib] allow to specify alternative conf location | 2006-08-24 | stack-sf | pandae
1545840 | Add | [contrib] ContentLengthFilter | 2006-08-24 | stack-sf | pandae
1537507 | Add | Add checkpointing selftest | 2006-08-09 | stack-sf | stack-sf
1535116 | Add | Add creation/deletion of Heritrix instances to UI | 2006-08-05 | karl-ia | stack-sf
1530557 | Add | [contrib] Enhanced UI seed and crawl reports | 2006-07-28 | stack-sf | stack-sf
1523276 | Add | Should support depth-first search priority scheduling (patch | 2006-07-15 | stack-sf | ecjensen
1518583 | Add | Improved handling when alloted runtime is exceeded | 2006-07-07 | kristinn_sig | kristinn_sig
1514538 | Add | (contrib) Provide Windows batch file version of scripts | 2006-06-29 | nobody | ecjensen
1510807 | Add | [contrib] Have Heritrix UI bind to localhost only | 2006-06-22 | karl-ia | stack-sf
1505111 | Add | Make deciding-default profile the default profile | 2006-06-12 | stack-sf | stack-sf
1489231 | Add | Move to java 5.0/1.5.0 | 2006-05-15 | nobody | stack-sf
1388295 | Add | [contrib] Throttling on a per-document basis | 2005-12-22 | karl-ia | stack-sf
1153882 | Add | change username/password after launch | 2005-02-28 | karl-ia | gojomo
1058324 | Add | Show old crawl reports in UI (Was: Reports on finished...) | 2004-11-01 | nobody | stack-sf
986985 | Add | Fix API to allow ARCWriter replacement | 2004-07-07 | stack-sf | stack-sf
1540381 | Fix | proxying of https gives errors/garbage/later problems | 2006-08-14 | karl-ia | gojomo
1534082 | Fix | override of user-agents and masquerade not working | 2006-08-03 | karl-ia | ia_igor
1495253 | Fix | multiple usage of same arc id number within same crawl | 2006-05-25 | karl-ia | ia_igor
1533571 | Fix | Checkpointing is broken (Parts 1 and 2) | 2006-08-02 | stack-sf | stack-sf
1511596 | Fix | incorrect resolving relative links from flash files (swf) | 2006-06-23 | karl-ia | ia_igor
1510289 | Fix | CSS keywords are case sensetive in extraction | 2006-06-21 | gojomo | cathcart
1489132 | Fix | Contain HttpClient HttpParser's OutOfMemoryError risk | 2006-05-15 | karl-ia | gojomo
1442679 | Fix | HTMLExtractor and application/xhtml+xml type? | 2006-03-03 | karl-ia | gojomo
1549627 | Fix | Archive file serialnumber is always 1 after checkpoint | 2006-08-30 | stack-sf | stack-sf
1546808 | Fix | Don't resume crawl after checkpoint if state is 'pausing' | 2006-08-25 | stack-sf | stack-sf
1542933 | Fix | adjust prominence of instance/identifier info/tab | 2006-08-18 | karl-ia | gojomo
1540030 | Fix | FetchDNS IOException: Stream closed | 2006-08-14 | stack-sf | stack-sf
1538489 | Fix | HeritrixProtocolSocketFactory synchronization causes delays | 2006-08-11 | stack-sf | gojomo
1534153 | Fix | don't insist on robots.txt if it need not be honored | 2006-08-03 | karl-ia | gojomo
1532787 | Fix | OnDomainsDecideRule not working as expected | 2006-08-01 | gojomo | gojomo
1532665 | Fix | AddRedirectFromRootServerToScope not working as expected | 2006-08-01 | gojomo | gojomo
1519056 | Fix | IPQueueAssignmentPolicy broken by method signature mismatch | 2006-07-07 | gojomo | gojomo
1514716 | Fix | heritrix fails to save accept-headers in an override | 2006-06-29 | karl-ia | magin-ia
1511624 | Fix | NoOnDomainsDecideRule/NotOnHostsDecideRule superclass wrong | 2006-06-23 | karl-ia | gojomo
1482210 | Fix | CachedBdbMap.keySet inefficient or broken | 2006-05-04 | karl-ia | gojomo
1475798 | Fix | ARCReader#read(byte [], off, len) broke for non-null offset | 2006-04-24 | stack-sf | stack-sf
1189825 | Fix | ARC problem causing .invalid suffix needs better reporting | 2005-04-25 | paul_jack | gojomo
1056919 | Fix | NPE at CrawlStateUpdater.java:70 http:/robots.txt | 2004-10-29 | karl-ia | stack-sf
998275 | Fix | doc security considerations | 2004-07-26 | gojomo | gojomo
1549587 | Fix | [jdk1.6] ComplexType#toString infinite loop | 2006-08-30 | stack-sf | stack-sf
1543751 | Fix | ConcurrentModificationException in web UI frontier report | 2006-08-20 | karl-ia | gojomo
1522108 | Fix | LinksScoper scope-embedded-links inconsistent/confusing | 2006-07-13 | gojomo | gojomo
1521563 | Fix | UURIFactory '//' collapsing overeager | 2006-07-12 | karl-ia | gojomo
1519055 | Fix | queued count wrong with retired queues; crawl doesn't end | 2006-07-07 | karl-ia | gojomo
1469517 | Fix | ARCWriterPool not fair to threads | 2006-04-12 | gojomo | gojomo
1379040 | Fix | regex for midfetch filter not being stored in crawl order | 2005-12-12 | gojomo | nobody
1550797 | Fix | Stripping sessionid can leave behind doubled ampersands | 2006-09-01 | stack-sf | stack-sf
1541645 | Fix | excessive WakeTask may be scheduled | 2006-08-16 | gojomo | gojomo
1534925 | Fix | Remove MirrorJNDI. Its GPL | 2006-08-04 | stack-sf | stack-sf
1517693 | Fix | [extractorhtml] Passes through entity-encodings | 2006-07-05 | stack-sf | stack-sf
1516354 | Fix | Job's crawl report link produces report for different job | 2006-07-03 | nobody | fmccown
1511609 | Fix | Browsers tolerate newlines in URLs, Heritrix doesn't | 2006-06-23 | nobody | stack-sf
1507554 | Fix | Values from dropdown getting tacked on for next hit. | 2006-06-16 | stack-sf | nobody
1503781 | Fix | [jmx] Add rebind to JNDI | 2006-06-09 | nobody | stack-sf
1490806 | Fix | hangs with queued documents not being assigned to queues | 2006-05-18 | nobody | pandae
1489155 | Fix | httpclient list of proto-factories is static | 2006-05-15 | stack-sf | stack-sf
1479727 | Fix | Non-serializable class ARCReader contains Exception | 2006-05-01 | stack-sf | lars_clausen
1469739 | Fix | escapeJavaScript should escape HTML problem characters | 2006-04-13 | paul_jack | pandae