Abstract
Release 1.10.0 adds new configuration options, experimental support for new protocols and formats, and many fixes. 43 tracked bugs have been fixed and 35 feature requests implemented.
Release 1.10.0 requires Java 5 (JDK 1.5.x).
Aside from the usual suspects, the following contributed to this release:
Eric C Jensen
Olaf Freyer
Karl Wright (of MetaCarta)
Frank McCown (of Old Dominion University)
Max Schöfmann
Søren Vejrup Carlsen (of Royal Library, Denmark)
See Section 11.1.1, “java.io.IOException: No locks available”, in the 1.8.0 Release Notes.
The old default login of 'admin' and password of 'letmein' for access to the crawler web UI (and JMX agent control) have been eliminated. It is now necessary to specify an access username and password to start Heritrix. This may be done with the -a or --admin command-line argument or via the system property 'heritrix.cmdline.admin'. (These each take a colon-separated username and password, like 'username:password'.)
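As an illustration of the colon-separated form, the sketch below splits such a value into its two parts. It is not Heritrix's actual argument handling: the class and method names (AdminLoginSketch, parse) are hypothetical, and splitting on the first colon and rejecting an empty username or password are assumptions made for the example.

```java
// Hypothetical sketch; not Heritrix's actual argument handling.
final class AdminLoginSketch {
    /**
     * Splits "username:password" on the first colon, so the password
     * may itself contain colons (an assumption of this sketch).
     */
    static String[] parse(String value) {
        int colon = (value == null) ? -1 : value.indexOf(':');
        // Assumption: both the username and the password must be non-empty.
        if (colon <= 0 || colon == value.length() - 1) {
            throw new IllegalArgumentException("expected 'username:password'");
        }
        return new String[] { value.substring(0, colon), value.substring(colon + 1) };
    }

    public static void main(String[] args) {
        // Use the first argument if given, else the system property named above.
        String value = args.length > 0
                ? args[0]
                : System.getProperty("heritrix.cmdline.admin");
        String[] login = parse(value); // throws if no login was supplied
        System.out.println("username=" + login[0]);
    }
}
```

Run with no argument and no system property set, the sketch fails with an exception, mirroring the rule that a login must now be supplied explicitly.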
Previously, the Jetty web server that runs the Heritrix web UI listened on all available network interfaces. In 1.10.0, Jetty will only bind to localhost by default. The -b or --bind command-line argument can be used to specify a different interface or list of interfaces to bind to instead. You may specify "-b /" to get the old behavior -- binding on all interfaces -- but only take this step after reading section 2.3 of the User Manual, "Security Considerations".
The optional QuotaEnforcer processor has a new setting, 'force-retire', which is by default 'true', and changes the default behavior of QuotaEnforcer. Previously, when a URI was noted as being over-quota, it would be marked with a special over-quota failure code which caused it to complete processing as an error. As a result, all over-quota URIs would quickly be finished as errors and appear in the crawl.log, but there would be no opportunity to raise the quota and continue crawling.
The new default behavior instead marks the URI with a directive requesting its frontier queue be retired. If the frontier supports this directive, the URI will be returned to its queue as if never tried, and the whole queue retired from active crawling. This offers the opportunity to raise the quota and continue crawling the URI and others of its queue. (All settings changes cause all retired queues to be reevaluated.) However, the over-quota URIs will not appear as errors in the crawl.log.
If the old behavior is preferred, set 'force-retire' to 'false'.
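The difference between the two behaviors can be summarized in a small model. This is only a sketch of the description above, not QuotaEnforcer's source: the class, enum, and method names are hypothetical, and it assumes the frontier supports the retire directive.

```java
// Hypothetical model of the behavior described above; not Heritrix source code.
final class ForceRetireSketch {
    /** What happens to a URI once the quota check has examined it. */
    enum Outcome { CONTINUE, FINISH_AS_ERROR, RETIRE_QUEUE }

    static Outcome decide(boolean overQuota, boolean forceRetire) {
        if (!overQuota) {
            return Outcome.CONTINUE; // under quota: processing carries on
        }
        // Assumption: the frontier supports the retire directive.
        // force-retire=true  -> URI goes back to its queue and the queue is retired
        // force-retire=false -> old behavior, URI finishes as an error in crawl.log
        return forceRetire ? Outcome.RETIRE_QUEUE : Outcome.FINISH_AS_ERROR;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, true));
        System.out.println(decide(true, false));
    }
}
```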
In 1.10.0, URL canonicalization has changed in two ways. First, the stripping of sessionids has improved (see [1550797] Stripping sessionid can leave behind doubled ampersands). Previously, if the sessionid was in the middle of a query string, bookended by other query parameters, canonicalization would leave behind the encasing ampersands. For example, if the URL http://a.com/?a=1&sid=00000000000000000000000000000000&b=1 was passed through canonicalization, the result would be http://a.com/?a=1&&b=1. This has been fixed so that the result is now http://a.com/?a=1&b=1.
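The doubled-ampersand problem and its fix can be reproduced with a pair of regular expressions. The sketch below is an illustration only: the class name and both patterns are assumptions, not the rules Heritrix actually ships, and it treats a sessionid as "sid=" followed by 32 alphanumeric characters.

```java
// Hypothetical sketch; the patterns are assumptions, not Heritrix's actual rules.
final class SidStripSketch {
    // Assumed shape of a sessionid parameter: "sid=" plus 32 alphanumerics,
    // appearing directly after '?' or '&'.
    private static final String SID = "(?<=[?&])sid=[0-9a-zA-Z]{32}";

    /** Old-style behavior: removes only the parameter itself. */
    static String stripNaive(String url) {
        return url.replaceAll(SID, "");
    }

    /** Fixed behavior: also removes one '&' that directly follows the parameter. */
    static String stripFixed(String url) {
        return url.replaceAll(SID + "&?", "");
    }

    public static void main(String[] args) {
        String sid = "0000000000000000" + "0000000000000000"; // 32 characters
        String url = "http://a.com/?a=1&sid=" + sid + "&b=1";
        System.out.println(stripNaive(url)); // http://a.com/?a=1&&b=1
        System.out.println(stripFixed(url)); // http://a.com/?a=1&b=1
    }
}
```

This sketch covers only the mid-query case discussed here. When the sessionid is the last parameter, it still leaves a trailing ampersand behind.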
The second change, [1550805] Add stripping of coldfusion sessionids, adds the new coldfusion sessionid stripper to the list of default canonicalization rules.
We bring these seemingly minor changes to your attention because, for those of you running regular crawls, having both changes in place should reduce the overall number of (duplicate) pages crawled, depending on the type of crawl.
This release includes experimental WARC readers and writers. Be warned that neither the code nor the specification is final, so both are subject to change with no guarantee of backward compatibility; i.e., newer readers may not be able to read WARCs written with older writers. See the org.archive.io.warc package documentation for more on the current state of the code, including documentation of the initial versions of the Arc2Warc and Warc2Arc tools.
This release also includes experimental support for FTP. This support is disabled in the default Heritrix configuration. See the User Guide for information on how to enable FTP.
Table 3. All Tracked Changes
ID | Type | Summary | Open Date | By | Filer |
---|---|---|---|---|---|
1545462 | Add | Experimental WARC Readers and Writers | 2006-08-23 | stack-sf | stack-sf |
1494491 | Add | path/role-sensitive robots (eg ignore for inline images/css) | 2006-05-24 | karl-ia | gojomo |
1550849 | Add | 'Implied' URI extractor (eg, YouTube) | 2006-09-01 | karl-ia | gojomo |
1549665 | Add | Add experimental Warc2Arc and Arc2Warc scripts | 2006-08-30 | stack-sf | stack-sf |
1546829 | Add | Secure admin UI: Bind cmd-line argument | 2006-08-25 | karl-ia | stack-sf |
1545600 | Add | remove default admin username/password | 2006-08-23 | karl-ia | gojomo |
1536441 | Add | hash-based CrawlMapper | 2006-08-08 | karl-ia | gojomo |
1535744 | Add | force reread of disk settings (for out-of-JVM/bulk changes) | 2006-08-06 | karl-ia | gojomo |
1534280 | Add | scriptable (beanshell) Processor, DecideRule options | 2006-08-03 | gojomo | gojomo |
1522112 | Add | CrawlMapper skip mapping 'E'mbeds (etc) | 2006-07-13 | karl-ia | gojomo |
1520269 | Add | keep over-limit (-500X) URIs in queues (don't 'finish/log) | 2006-07-10 | karl-ia | gojomo |
1387423 | Add | [arcreader] Fetch records and iterate remote ARCs | 2005-12-21 | stack-sf | stack-sf |
1351778 | Add | favicon.ico for heritrix web ui | 2005-11-08 | gojomo | gojomo |
1209724 | Add | [contrib] Add BigMapFactory.getSynchronizedBigMap | 2005-05-27 | gojomo | ck-heritrix |
1526781 | Add | broader rotation / wider 'frontline' frontier queue option | 2006-07-21 | karl-ia | gojomo |
1092496 | Add | crawl.log should have hash of DNS records | 2004-12-28 | stack-sf | gojomo |
1006194 | Add | FTP fetching | 2004-08-09 | karl-ia | gojomo |
1550805 | Add | Add stripping of coldfusion sessionids -- add to default lis | 2006-09-01 | stack-sf | stack-sf |
1547390 | Add | [contrib] patch to allow setting local IP to bind fetch from | 2006-08-26 | stack-sf | ecjensen |
1545847 | Add | [contrib] allow to specify alternative conf location | 2006-08-24 | stack-sf | pandae |
1545840 | Add | [contrib] ContentLengthFilter | 2006-08-24 | stack-sf | pandae |
1537507 | Add | Add checkpointing selftest | 2006-08-09 | stack-sf | stack-sf |
1535116 | Add | Add creation/deletion of Heritrix instances to UI | 2006-08-05 | karl-ia | stack-sf |
1530557 | Add | [contrib] Enhanced UI seed and crawl reports | 2006-07-28 | stack-sf | stack-sf |
1523276 | Add | Should support depth-first search priority scheduling (patch | 2006-07-15 | stack-sf | ecjensen |
1518583 | Add | Improved handling when alloted runtime is exceeded | 2006-07-07 | kristinn_sig | kristinn_sig |
1514538 | Add | (contrib) Provide Windows batch file version of scripts | 2006-06-29 | nobody | ecjensen |
1510807 | Add | [contrib] Have Heritrix UI bind to localhost only | 2006-06-22 | karl-ia | stack-sf |
1505111 | Add | Make deciding-default profile the default profile | 2006-06-12 | stack-sf | stack-sf |
1489231 | Add | Move to java 5.0/1.5.0 | 2006-05-15 | nobody | stack-sf |
1388295 | Add | [contrib] Throttling on a per-document basis | 2005-12-22 | karl-ia | stack-sf |
1153882 | Add | change username/password after launch | 2005-02-28 | karl-ia | gojomo |
1058324 | Add | Show old crawl reports in UI (Was: Reports on finished...) | 2004-11-01 | nobody | stack-sf |
986985 | Add | Fix API to allow ARCWriter replacement | 2004-07-07 | stack-sf | stack-sf |
1540381 | Fix | proxying of https gives errors/garbage/later problems | 2006-08-14 | karl-ia | gojomo |
1534082 | Fix | override of user-agents and masquerade not working | 2006-08-03 | karl-ia | ia_igor |
1495253 | Fix | multiple usage of same arc id number within same crawl | 2006-05-25 | karl-ia | ia_igor |
1533571 | Fix | Checkpointing is broken (Parts 1 and 2) | 2006-08-02 | stack-sf | stack-sf |
1511596 | Fix | incorrect resolving relative links from flash files (swf) | 2006-06-23 | karl-ia | ia_igor |
1510289 | Fix | CSS keywords are case sensetive in extraction | 2006-06-21 | gojomo | cathcart |
1489132 | Fix | Contain HttpClient HttpParser's OutOfMemoryError risk | 2006-05-15 | karl-ia | gojomo |
1442679 | Fix | HTMLExtractor and application/xhtml+xml type? | 2006-03-03 | karl-ia | gojomo |
1549627 | Fix | Archive file serialnumber is always 1 after checkpoint | 2006-08-30 | stack-sf | stack-sf |
1546808 | Fix | Don't resume crawl after checkpoint if state is 'pausing' | 2006-08-25 | stack-sf | stack-sf |
1542933 | Fix | adjust prominence of instance/identifier info/tab | 2006-08-18 | karl-ia | gojomo |
1540030 | Fix | FetchDNS IOException: Stream closed | 2006-08-14 | stack-sf | stack-sf |
1538489 | Fix | HeritrixProtocolSocketFactory synchronization causes delays | 2006-08-11 | stack-sf | gojomo |
1534153 | Fix | don't insist on robots.txt if it need not be honored | 2006-08-03 | karl-ia | gojomo |
1532787 | Fix | OnDomainsDecideRule not working as expected | 2006-08-01 | gojomo | gojomo |
1532665 | Fix | AddRedirectFromRootServerToScope not working as expected | 2006-08-01 | gojomo | gojomo |
1519056 | Fix | IPQueueAssignmentPolicy broken by method signature mismatch | 2006-07-07 | gojomo | gojomo |
1514716 | Fix | heritrix fails to save accept-headers in an override | 2006-06-29 | karl-ia | magin-ia |
1511624 | Fix | NoOnDomainsDecideRule/NotOnHostsDecideRule superclass wrong | 2006-06-23 | karl-ia | gojomo |
1482210 | Fix | CachedBdbMap.keySet inefficient or broken | 2006-05-04 | karl-ia | gojomo |
1475798 | Fix | ARCReader#read(byte [], off, len) broke for non-null offset | 2006-04-24 | stack-sf | stack-sf |
1189825 | Fix | ARC problem causing .invalid suffix needs better reporting | 2005-04-25 | paul_jack | gojomo |
1056919 | Fix | NPE at CrawlStateUpdater.java:70 http:/robots.txt | 2004-10-29 | karl-ia | stack-sf |
998275 | Fix | doc security considerations | 2004-07-26 | gojomo | gojomo |
1549587 | Fix | [jdk1.6] ComplexType#toString infinite loop | 2006-08-30 | stack-sf | stack-sf |
1543751 | Fix | ConcurrentModificationException in web UI frontier report | 2006-08-20 | karl-ia | gojomo |
1522108 | Fix | LinksScoper scope-embedded-links inconsistent/confusing | 2006-07-13 | gojomo | gojomo |
1521563 | Fix | UURIFactory '//' collapsing overeager | 2006-07-12 | karl-ia | gojomo |
1519055 | Fix | queued count wrong with retired queues; crawl doesn't end | 2006-07-07 | karl-ia | gojomo |
1469517 | Fix | ARCWriterPool not fair to threads | 2006-04-12 | gojomo | gojomo |
1379040 | Fix | regex for midfetch filter not being stored in crawl order | 2005-12-12 | gojomo | nobody |
1550797 | Fix | Stripping sessionid can leave behind doubled ampersands | 2006-09-01 | stack-sf | stack-sf |
1541645 | Fix | excessive WakeTask may be scheduled | 2006-08-16 | gojomo | gojomo |
1534925 | Fix | Remove MirrorJNDI. Its GPL | 2006-08-04 | stack-sf | stack-sf |
1517693 | Fix | [extractorhtml] Passes through entity-encodings | 2006-07-05 | stack-sf | stack-sf |
1516354 | Fix | Job's crawl report link produces report for different job | 2006-07-03 | nobody | fmccown |
1511609 | Fix | Browsers tolerate newlines in URLs, Heritrix doesn't | 2006-06-23 | nobody | stack-sf |
1507554 | Fix | Values from dropdown getting tacked on for next hit. | 2006-06-16 | stack-sf | nobody |
1503781 | Fix | [jmx] Add rebind to JNDI | 2006-06-09 | nobody | stack-sf |
1490806 | Fix | hangs with queued documents not being assigned to queues | 2006-05-18 | nobody | pandae |
1489155 | Fix | httpclient list of proto-factories is static | 2006-05-15 | stack-sf | stack-sf |
1479727 | Fix | Non-serializable class ARCReader contains Exception | 2006-05-01 | stack-sf | lars_clausen |
1469739 | Fix | escapeJavaScript should escape HTML problem characters | 2006-04-13 | paul_jack | pandae |