17. Release 1.0.0 - 2004-08-06

Abstract

Added new prefix ('SURT') scope and filter, compression of the recovery log, mass adding of URIs to a running crawler, crawling via an HTTP proxy, adding of headers to requests, improved out-of-the-box defaults, hash of content to crawl log and to arcreader output, and many bug fixes.

17.1. Known Limitations

17.1.1. Crawl Size Upper Bounds

Heritrix 1.0.0 uses disk-based queues to hold any number of pending URIs bounded only by available disk space, but still relies on in-memory structures to efficiently track all discovered hosts and previously-scheduled URIs. Crawls whose total scheduled URIs or discovered hosts exhaust all available memory will trigger out-of-memory errors, which freeze a crawl at the point of the error.

With the default settings and a 256MB Java heap assigned to the Heritrix process, crawls that discover up to 10 000 hosts and schedule over 6 000 000 URIs should be possible. Discovering more URIs or hosts will likely trigger out-of-memory problems unless a larger Java heap was assigned at startup.
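As a rough back-of-the-envelope check on these figures, the sketch below divides a heap size by a URI count to get an upper bound on the memory budget per scheduled URI. The class and method names (`HeapBudget`, `bytesPerUri`) are hypothetical and not part of Heritrix, and the calculation assumes the whole heap is available for URI tracking, which overstates the real budget.

```java
public class HeapBudget {
    /** Upper bound on heap bytes available per scheduled URI (integer division). */
    public static long bytesPerUri(long heapBytes, long scheduledUris) {
        if (scheduledUris <= 0) {
            throw new IllegalArgumentException("scheduledUris must be positive");
        }
        return heapBytes / scheduledUris;
    }

    public static void main(String[] args) {
        long heap256 = 256L * 1024 * 1024; // the 256MB heap from the text
        long uris = 6_000_000L;            // the 6 000 000 URIs from the text
        System.out.println("Budget per URI with a 256MB heap: "
                + bytesPerUri(heap256, uris) + " bytes");
        // Heap actually granted to this JVM (raised with the standard -Xmx option).
        System.out.println("Max heap of this JVM: "
                + Runtime.getRuntime().maxMemory() + " bytes");
    }
}
```

With the figures above, the budget works out to 44 bytes per URI, so the number of scheduled URIs, rather than disk space, is what runs out first. Doubling the heap (for example with the standard JVM option `-Xmx512m`) doubles this budget.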

Broad crawls -- those using the BroadScope or ranging over domains with many subdomains -- can easily and quickly exceed these parameters. Thus broad crawls in Heritrix 1.0.0 are not recommended, except for experimental purposes.

Narrower crawls, restricted to specific hosts or to domains with a limited number of subdomains, can run for a week or more, collecting millions of resources. Larger heaps can allow crawls to run into the tens of millions of collected URIs, and tens of thousands of discovered hosts.

An experimental alternate Frontier, the DiskIncludedFrontier, is also available via the 'Modules' crawl configuration tab. It uses a capped amount of memory plus disk storage to remember any number of scheduled URIs, but its performance is poor and it has not received the same testing as our default Frontier. The memory cost of additional discovered hosts continues to rise without limit when using a DiskIncludedFrontier.

Future versions of Heritrix will include other frontier implementations allowing larger and unbounded crawls with minimal performance penalties.

17.1.2. It's possible to get a ConcurrentModificationException when editing options on a running crawl.

17.1.2.1. Workaround

Pause the crawl when making changes to crawl options.
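The exception named above is the standard Java `java.util.ConcurrentModificationException`, which fail-fast collections throw when they are structurally changed while being iterated. The sketch below is a minimal, generic illustration of that behaviour and of why deferring changes avoids it; it is not Heritrix code, and the class name and option names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

public class PauseBeforeEdit {
    /** Returns true if changing the list while iterating over it fails fast. */
    public static boolean editWhileIterating() {
        List<String> options = new ArrayList<>(
                Arrays.asList("max-hops", "max-retries", "timeout"));
        try {
            for (String option : options) {
                options.add(option + "-copy"); // structural change mid-iteration
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    /** Collects changes during iteration and applies them afterwards. */
    public static int editAfterIterating() {
        List<String> options = new ArrayList<>(
                Arrays.asList("max-hops", "max-retries", "timeout"));
        List<String> pending = new ArrayList<>();
        for (String option : options) {
            pending.add(option + "-copy"); // only recorded, not yet applied
        }
        options.addAll(pending);           // applied once iteration is over
        return options.size();
    }

    public static void main(String[] args) {
        System.out.println("edit while iterating throws: " + editWhileIterating());
        System.out.println("size after deferred edit: " + editAfterIterating());
    }
}
```

Pausing the crawl plays the same role as the deferred edit here: the settings are changed only while nothing is reading them.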

17.1.3. On Macintoshes and Linux kernel version 2.6, Heritrix fails to build (unit tests fail).

17.1.3.1. Workaround

See issue [ 984390 ] Build fails: "rws" mode and Mac OS X interact badly, which describes a source code edit that works around the problem.
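For context, "rws" is one of the mode strings accepted by the standard `java.io.RandomAccessFile` constructor: like "rw" it opens the file for reading and writing, but it also requires every update to the file's content or metadata to be written synchronously to the underlying storage device. The sketch below only shows how the two modes are requested; it does not reproduce the workaround from the issue, and the class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class RwsModeDemo {
    /** Writes data using the given RandomAccessFile mode; returns the file length. */
    public static long writeWithMode(File file, String mode, byte[] data)
            throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, mode)) {
            raf.setLength(0); // start from an empty file
            raf.write(data);
            return raf.length();
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("rws-demo", ".bin");
        tmp.deleteOnExit();
        byte[] data = "heritrix".getBytes("UTF-8");
        // "rw": plain read/write access.
        System.out.println("rw  length: " + writeWithMode(tmp, "rw", data));
        // "rws": read/write with synchronous content and metadata updates.
        System.out.println("rws length: " + writeWithMode(tmp, "rws", data));
    }
}
```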

17.1.4. Heritrix fails to build on Linux kernel 2.6.

17.1.4.1. Workaround

The build fails unless you use a JDK more recent than 1.5 beta 2 (it works with jdk1.5.0-rc). See [ 955975 ] Build fails: JVM and kernel 2.6+ (Was 2 tests fail...) and the issue referenced above.

17.2. Changes

Table 10. Changes

ID      Type  Summary
939679  Add   Mass-add URIs to running crawl and force reconsideration
986977  Add   SurtPrefix scope (and filter)
989816  Add   Specification of default CharSequence charset
983001  Add   crawl.log entries all on one line
869584  Add   Hash content-bodies, show in logs (and future ARCs)
964581  Add   option to preference (quick-get) embeds
964493  Add   Compress recover.log
988106  Add   [UURI] 'http:///...' converted to 'http://...'
926143  Add   enable use through HTTP proxy
945922  Add   Allow adding (subtracting?) http headers
983109  Add   Improved out-of-the-box defaults
982909  Add   ARCWriter makes FAT gzip header
925734  Add   exponential backoff URI/host retries
-       Fix   Total data "written" isn't necessarily written (wording)
-       Fix   embeds within scope problem
-       Fix   NPE clearing alerts
-       Fix   arcmetadata repeated once for every domain config
-       Fix   CCE deserializing diskqueue [Was: IllegalArgumentExcepti...]
-       Fix   no docs for recovery-journal feature
-       Fix   Pause/Terminate ignored on 2.6 kernel 1.5 JVM
-       Fix   Investigate "Relative URI but no base"
-       Fix   User-Agent should be able to mimic Mozilla (as does Google)
-       Fix   referral URL should be stored in recover.log
-       Fix   ToeThreads hung in FetchDNS after Pause
-       Fix   robots.txt lookup for different ports on same host
-       Fix   Empty log percentages displayed as NaN%
-       Fix   UURI doubly-encodes %XX sequences
-       Fix   Single settings change causes two versions to be created
-       Fix   New IA debian image is 2.6 (Was: Build fails: JVM and ...)
-       Fix   NPE in PathDepthFilter
-       Fix   [investigate & rule out] Thread report deadlock risks
-       Fix   jetty susceptible to DoS attack
-       Fix   'ignore' robots does not ignore meta nofollow
-       Fix   URI Syntax Errors stop page parsing.
-       Fix   NPE in ExtractorHTML/TextUtils.getMatcher()
-       Fix   ARCReader: Failed to find GZIP MAGIC
-       Fix   javascript embedded URLs
-       Fix   NoClassDefFoundError when starting a job
-       Fix   Max number of deferrals hard-coded to 10.
-       Fix   Frontier report thread safety problems?
-       Fix   ARCReader hanging
-       Fix   log-browsing by regexp outofmemoryerror
-       Fix   Deferred URLs due the DNS problem -- Heritrix(-50)-Deferred
-       Fix   Assertion failures shouldn't be more fatal than Runtime Exc.
-       Fix   min-interval is superfluous; remove
-       Fix   crawl doesn't end when using valence > 1
-       Fix   Giant (in # of files) state directory problematic
-       Fix   robots-expiration units, default wrong
-       Fix   NoSuchElementException in URI queues halts crawling
-       Fix   #anchor links not trimmed, and thus recrawled
-       Fix   arc's filedesc file name includes .gz
-       Fix   [denmark-workshop] Cookie mangling
-       Fix   HttpException: Unable to parse header
-       Fix   bogus ARC-header when no Content-type
-       Fix   paths when crawling without UI
-       Fix   domain scope leakage