Abstract
Added a new prefix ('SURT') scope and filter, compression of the recovery log, mass adding of URIs to a running crawler, crawling via an HTTP proxy, adding of headers to requests, improved out-of-the-box defaults, a hash of content in the crawl log and in arcreader output, and many bug fixes.
Heritrix 1.0.0 uses disk-based queues to hold any number of pending URIs bounded only by available disk space, but still relies on in-memory structures to efficiently track all discovered hosts and previously-scheduled URIs. Crawls whose total scheduled URIs or discovered hosts exhaust all available memory will trigger out-of-memory errors, which freeze a crawl at the point of the error.
With the default settings and a 256MB Java heap assigned to the Heritrix process, a crawl that discovers up to 10 000 hosts and schedules over 6 000 000 URIs should be possible. Discovering more URIs or hosts will likely trigger out-of-memory problems unless a larger Java heap is assigned at startup.
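To get a rough feel for how these figures scale, the sketch below extrapolates linearly from the single reference point quoted above (256MB heap, about 6 000 000 scheduled URIs). The linear model, the class name `HeapEstimateSketch`, and its method names are assumptions made for illustration; they are not measured Heritrix behaviour, and host tracking also consumes heap.

```java
// Back-of-envelope heap sizing, assuming URI capacity grows linearly with heap size.
// This is an illustrative assumption, not a documented property of Heritrix.
class HeapEstimateSketch {
    // Reference point quoted in the release notes.
    static final long REFERENCE_HEAP_MB = 256;
    static final long REFERENCE_URIS = 6_000_000L;

    /** Estimated number of schedulable URIs for a heap of the given size in MB. */
    static long estimateSchedulableUris(long heapMb) {
        return REFERENCE_URIS * heapMb / REFERENCE_HEAP_MB;
    }

    /** Upper bound on average heap bytes per scheduled URI at the reference point. */
    static double bytesPerUriUpperBound() {
        return REFERENCE_HEAP_MB * 1024.0 * 1024.0 / REFERENCE_URIS;
    }

    public static void main(String[] args) {
        long[] heapSizesMb = {256, 512, 1024};
        for (long heapMb : heapSizesMb) {
            System.out.println(heapMb + "MB heap: roughly "
                    + estimateSchedulableUris(heapMb) + " scheduled URIs");
        }
        System.out.println("At most about " + bytesPerUriUpperBound()
                + " heap bytes per scheduled URI at the reference point");
    }
}
```

Treat the result as an order-of-magnitude guide only; a crawl that discovers many hosts will run out of memory sooner than this estimate suggests.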
Broad crawls -- those using the BroadScope or ranging over domains with many subdomains -- can easily and quickly exceed these parameters. Thus broad crawls in Heritrix 1.0.0 are not recommended, except for experimental purposes.
Narrower crawls, restricted to specific hosts or to domains with a limited number of subdomains, can run for a week or more, collecting millions of resources. Larger heaps can allow crawls to run into the tens of millions of collected URIs and tens of thousands of discovered hosts.
An experimental alternate Frontier, the DiskIncludedFrontier, is also available via the 'Modules' crawl configuration tab. It uses a capped amount of memory plus disk storage to remember any number of scheduled URIs, but its performance is poor and it has not received the same testing as our default Frontier. The memory cost of additional discovered hosts continues to rise without limit when using a DiskIncludedFrontier.
Future versions of Heritrix will include other frontier implementations allowing larger and unbounded crawls with minimal performance penalties.
It's possible to get a ConcurrentModificationException when editing options on a running crawl.
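For readers unfamiliar with this exception, the following minimal sketch shows the general Java mechanism: a standard collection detects that it was changed while being iterated and fails fast. It is not Heritrix's settings code; the class name `ConcurrentEditSketch` and the option names are hypothetical. The same check can fire when one thread edits a collection while another iterates over it.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

// Generic illustration of fail-fast iteration; not taken from Heritrix.
class ConcurrentEditSketch {
    /** Builds a small list of hypothetical option names. */
    static List<String> sampleOptions() {
        List<String> options = new ArrayList<>();
        options.add("option-a");
        options.add("option-b");
        options.add("option-c");
        return options;
    }

    /** Returns true if editing the list during iteration raised the exception. */
    static boolean editWhileIterating() {
        List<String> options = sampleOptions();
        try {
            for (String option : options) {
                // Structural change while the iterator is still in use.
                options.add(option + "-copy");
            }
        } catch (ConcurrentModificationException e) {
            return true;
        }
        return false;
    }

    /** Iterates over a snapshot instead, so edits to the original are safe. */
    static int editUsingSnapshot() {
        List<String> options = sampleOptions();
        for (String option : new ArrayList<>(options)) {
            options.add(option + "-copy");
        }
        return options.size();
    }

    public static void main(String[] args) {
        System.out.println("Exception raised: " + editWhileIterating());
        System.out.println("Size after snapshot edit: " + editUsingSnapshot());
    }
}
```

Iterating over a snapshot is one common way to avoid the failure in single-threaded code; it is shown here only to contrast with the failing case.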
On Macintoshes and on Linux kernel version 2.6, Heritrix fails to build (unit tests fail).
See issue [ 984390 ] Build fails: "rws" mode and Mac OS X interact badly, for a source code workaround.
Heritrix fails to build on Linux kernel 2.6.
The build fails unless you use a JDK newer than 1.5 beta 2 (it works with jdk1.5.0-rc). See [ 955975 ] Build fails: JVM and kernel 2.6+ (Was 2 tests fail...) and the issue above.
Table 10. Changes
ID | Type | Summary |
---|---|---|
939679 | Add | Mass-add URIs to running crawl and force reconsideration |
986977 | Add | SurtPrefix scope (and filter) |
989816 | Add | Specification of default CharSequence charset |
983001 | Add | crawl.log entries all on one line |
869584 | Add | Hash content-bodies, show in logs (and future ARCs) |
964581 | Add | option to preference (quick-get) embeds |
964493 | Add | Compress recover.log |
988106 | Add | [UURI] 'http:///...' converted to 'http://...' |
926143 | Add | enable use through HTTP proxy |
945922 | Add | Allow adding (subtracting?) http headers |
983109 | Add | Improved out-of-the-box defaults |
982909 | Add | ARCWriter makes FAT gzip header |
925734 | Add | exponential backoff URI/host retries |
- | Fix | Total data "written" isn't necessarily written (wording) |
- | Fix | embeds within scope problem |
- | Fix | NPE clearing alerts |
- | Fix | arcmetadata repeated once for every domain config |
- | Fix | CCE deserializing diskqueue [Was: IllegalArgumentExcepti...] |
- | Fix | no docs for recovery-journal feature |
- | Fix | Pause/Terminate ignored on 2.6 kernel 1.5 JVM |
- | Fix | Investigate "Relative URI but no base" |
- | Fix | User-Agent should be able to mimic Mozilla (as does Google) |
- | Fix | referral URL should be stored in recover.log |
- | Fix | ToeThreads hung in FetchDNS after Pause |
- | Fix | robots.txt lookup for different ports on same host |
- | Fix | Empty log percentages displayed as NaN% |
- | Fix | UURI doubly-encodes %XX sequences |
- | Fix | Single settings change causes two versions to be created |
- | Fix | New IA debian image is 2.6 (Was: Build fails: JVM and ...) |
- | Fix | NPE in PathDepthFilter |
- | Fix | [investigate & rule out] Thread report deadlock risks |
- | Fix | jetty susceptible to DoS attack |
- | Fix | 'ignore' robots does not ignore meta nofollow |
- | Fix | URI Syntax Errors stop page parsing. |
- | Fix | NPE in ExtractorHTML/TextUtils.getMatcher() |
- | Fix | ARCReader: Failed to find GZIP MAGIC |
- | Fix | javascript embedded URLs |
- | Fix | NoClassDefFoundError when starting a job |
- | Fix | Max number of deferrals hard-coded to 10. |
- | Fix | Frontier report thread safety problems? |
- | Fix | ARCReader hanging |
- | Fix | log-browsing by regexp outofmemoryerror |
- | Fix | Deferred URLs due the DNS problem -- Heritrix(-50)-Deferred |
- | Fix | Assertion failures shouldn't be more fatal than Runtime Exc. |
- | Fix | min-interval is superfluous; remove |
- | Fix | crawl doesn't end when using valence > 1 |
- | Fix | Giant (in # of files) state directory problematic |
- | Fix | robots-expiration units, default wrong |
- | Fix | NoSuchElementException in URI queues halts crawling |
- | Fix | #anchor links not trimmed, and thus recrawled |
- | Fix | arc's filedesc file name includes .gz |
- | Fix | [denmark-workshop] Cookie mangling |
- | Fix | HttpException: Unable to parse header |
- | Fix | bogus ARC-header when no Content-type |
- | Fix | paths when crawling without UI |
- | Fix | domain scope leakage |
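
Entry 988106 in the table above describes converting 'http:///...' to 'http://...'. A minimal sketch of that kind of normalization follows. It is not the actual UURI implementation: the class name `SchemeSlashSketch`, the method name, and the choice to also handle `https` and runs of more than three slashes are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Illustrative normalization of surplus slashes after the scheme; not Heritrix's UURI code.
class SchemeSlashSketch {
    // Matches "http:" or "https:" at the start, followed by three or more slashes.
    private static final Pattern EXTRA_SLASHES =
            Pattern.compile("^(https?:)/{3,}", Pattern.CASE_INSENSITIVE);

    /** Collapses three or more slashes after the scheme down to exactly two. */
    static String collapseSchemeSlashes(String uri) {
        return EXTRA_SLASHES.matcher(uri).replaceFirst("$1//");
    }

    public static void main(String[] args) {
        System.out.println(collapseSchemeSlashes("http:///example.com/a"));
        System.out.println(collapseSchemeSlashes("http://example.com//b"));
    }
}
```

Only the slashes directly after the scheme are touched, so doubled slashes later in the path are left as they are.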