A. Common Heritrix Use Cases

Frank McCown

Old Dominion University

There are many ways to perform a web crawl. The following use cases will help you become familiar with some of Heritrix's more frequently used crawling parameters.

A.1. Avoiding Too Much Dynamic Content

Suppose you want to crawl only pages from a particular host (http://www.foo.org/), and you want to avoid crawling too many pages of the dynamically generated calendar. Let's say the calendar is accessed by passing a year, month and day to the calendar directory, as in http://www.foo.org/calendar?year=2006&month=3&day=12.

When you first create the job for this crawl, you will specify a single seed URI: http://www.foo.org/. By default, your new crawl job will use the DecidingScope, which contains a default set of DecideRules. One of the default rules is the SurtPrefixedDecideRule, which tells Heritrix to accept any URI that matches our seed URI's SURT prefix, http://(org,foo,www,)/. Consequently, if the URI http://foo.org/ is encountered, it will be rejected, since its SURT prefix http://(org,foo,) does not match the seed's SURT prefix. To allow both foo.org and www.foo.org, you could use the two seeds http://foo.org/ and http://www.foo.org/. To allow every subdomain of foo.org, you could use the seed http://foo.org (note the absence of a trailing slash).
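
If you're curious how Heritrix derives these prefixes, the SURT form essentially reverses the host's components. The sketch below is a simplified illustration in Java, not Heritrix's actual SURT code:

  public class SurtDemo {
      // Simplified illustration of SURT prefix generation
      // (not Heritrix's actual SURT code).
      static String toSurtPrefix(String host) {
          String[] parts = host.split("\\.");        // "www.foo.org" -> ["www", "foo", "org"]
          StringBuilder surt = new StringBuilder("http://(");
          for (int i = parts.length - 1; i >= 0; i--) {
              surt.append(parts[i]).append(',');     // builds "org,foo,www,"
          }
          return surt.append(")/").toString();
      }

      public static void main(String[] args) {
          System.out.println(toSurtPrefix("www.foo.org")); // http://(org,foo,www,)/
          System.out.println(toSurtPrefix("foo.org"));     // http://(org,foo,)/
      }
  }

Because subdomain labels are appended at the end of the reversed host, an open-ended prefix such as http://(org,foo, (produced by a seed with no trailing slash) matches every subdomain of foo.org.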

You will need to delete the TransclusionDecideRule since this rule has the potential to lead Heritrix onto another host. For example, if a fetched URI returned a 301 (Moved Permanently) or 302 (Found) response code pointing to a URI with a different host name, the TransclusionDecideRule would accept that new URI. Removing this rule will keep Heritrix from straying off of our www.foo.org host.

A few of the rules, like the PathologicalPathDecideRule and TooManyPathSegmentsDecideRule, will allow Heritrix to avoid some types of crawler traps. The TooManyHopsDecideRule will keep Heritrix from following too many links away from the seed, so the calendar doesn't trap Heritrix in an infinite loop. By default, the hop path is set to 15, but you can change that on the Settings screen.

Alternatively, you may add the MatchesFilePatternDecideRule. Set use-preset-pattern to CUSTOM and set regexp to something like:

.*foo\.org(?!/calendar).*|.*foo\.org/calendar\?year=200[56].*
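
Heritrix evaluates this pattern with standard java.util.regex semantics, so you can sanity-check it against representative URIs before starting the crawl. The quick test below is hypothetical (the URIs are our own examples): the first alternative accepts any foo.org URI outside /calendar, and the second admits calendar URIs for 2005 and 2006 only.

  import java.util.regex.Pattern;

  public class CalendarPatternTest {
      public static void main(String[] args) {
          Pattern p = Pattern.compile(
              ".*foo\\.org(?!/calendar).*|.*foo\\.org/calendar\\?year=200[56].*");
          // Ordinary page: accepted by the first alternative.
          System.out.println(p.matcher("http://www.foo.org/index.html").matches());                 // true
          // Calendar page in an allowed year: accepted by the second alternative.
          System.out.println(p.matcher("http://www.foo.org/calendar?year=2006&month=3").matches()); // true
          // Calendar page in any other year: rejected by both alternatives.
          System.out.println(p.matcher("http://www.foo.org/calendar?year=1999").matches());         // false
      }
  }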

Finally, you'll need to set the user-agent and from fields on the Settings screen, and then you may submit the job and monitor the crawl.

A.2. Only Store Successful HTML Pages

Suppose you want to grab only the first 50 pages encountered from a set of seeds and archive only those pages that return a 200 response code and have the text/html MIME type. Additionally, you want to look for links only in HTML resources.

When you create your job, use the DecidingScope with the default set of DecideRules.

To look for links in HTML documents only, you will need to remove the following extractors, which tell Heritrix to look for links in style sheets, JavaScript, and Flash files:

  1. ExtractorCSS
  2. ExtractorJS
  3. ExtractorSWF

You should leave in the ExtractorHTTP since it is useful in locating resources that can only be found using a redirect (301 or 302).

You can limit the number of files to download by setting max-document-download on the Settings screen. Setting this value to exactly 50 will probably not have the results you intend, since each DNS response and robots.txt file is also counted toward this number. You'll likely want to use a value of 50 * number of seeds * 2; with 5 seeds, for example, that works out to 50 * 5 * 2 = 500.

Next, you will need to add filters to the ARCWriterProcessor so that it only records documents with a 200 status code and a MIME type of text/html. The first filter to add is the ContentTypeRegExpFilter; set its regexp setting to text/html.*. Next, add a DecidingFilter to the ARCWriterProcessor, then add the FetchStatusDecideRule to the DecidingFilter.

You'll probably want to apply the above filters to the mid-fetch-filters setting of FetchHTTP as well. That will prevent FetchHTTP from downloading the content of any non-HTML or unsuccessful documents.
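
Conceptually, the two filters combine into a single predicate: a record is written (or, mid-fetch, the body is downloaded) only when both conditions hold. The sketch below shows the decision logic only; the actual checks are performed by Heritrix's filter classes:

  public class WriteDecision {
      // Sketch of the combined filter logic (illustrative only,
      // not Heritrix's actual filter implementation).
      static boolean shouldWrite(int statusCode, String contentType) {
          boolean successful = (statusCode == 200);            // FetchStatusDecideRule
          boolean isHtml = contentType != null
                  && contentType.matches("text/html.*");       // ContentTypeRegExpFilter
          return successful && isHtml;
      }

      public static void main(String[] args) {
          System.out.println(shouldWrite(200, "text/html; charset=UTF-8")); // true
          System.out.println(shouldWrite(404, "text/html"));                // false
          System.out.println(shouldWrite(200, "image/png"));                // false
      }
  }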

Once you have entered the desired settings, start the job and monitor the crawl.

A.3. Mirroring .html Files Only

Suppose you only want to crawl URLs that match http://foo.org/bar/*.html, and you'd like to save the crawled files in a file/directory format instead of saving them in ARC files. Suppose you also know that you are crawling a web server that is case-sensitive (http://foo.org/bar/abc.html and http://foo.org/bar/ABC.HTML point to two different resources).

You would first need to create a job with the single seed http://foo.org/bar/. You'll need to add the MirrorWriterProcessor on the Modules screen and delete the ARCWriterProcessor. This will store your files in a directory structure that matches the crawled URIs, and the files will be stored in the crawl job's mirror directory.
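
As a rough illustration of that layout (our own sketch of the host-and-path mapping; the real MirrorWriterProcessor also handles query strings, overly long names, and other edge cases), a crawled URI maps to a file path like this:

  import java.net.URI;

  public class MirrorPathDemo {
      // Hypothetical sketch of the host/path layout the mirror writer produces.
      static String mirrorPath(String mirrorDir, String uri) {
          URI u = URI.create(uri);
          return mirrorDir + "/" + u.getHost() + u.getPath();
      }

      public static void main(String[] args) {
          // http://foo.org/bar/abc.html -> mirror/foo.org/bar/abc.html
          System.out.println(mirrorPath("mirror", "http://foo.org/bar/abc.html"));
      }
  }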

Your job should use the DecidingScope with the following set of DecideRules:

  1. RejectDecideRule
  2. SurtPrefixedDecideRule
  3. TooManyHopsDecideRule
  4. PathologicalPathDecideRule
  5. TooManyPathSegmentsDecideRule
  6. NotMatchesFilePatternDecideRule
  7. PrerequisiteAcceptDecideRule

We are using the NotMatchesFilePatternDecideRule so we can reject any URIs that don't end with .html. It's important that this DecideRule be placed immediately before the PrerequisiteAcceptDecideRule; otherwise the DNS and robots.txt prerequisites would be rejected since they won't match the regexp. (In the DecidingScope, the last rule to render a decision wins, so the PrerequisiteAcceptDecideRule can re-accept the prerequisites that the pattern rule rejected.)

On the Settings screen, you'll want to set the following for the NotMatchesFilePatternDecideRule:

  1. decision: REJECT
  2. use-preset-pattern: CUSTOM
  3. regexp: .*(/|\.html)$

Note that the regexp accepts URIs that end with / as well as .html. If we didn't accept the /, the seed URI would be rejected. This also allows us to accept URIs like http://foo.org/bar/dir/, which likely point to index.html. A stricter regexp would be .*\.html$, but you'd need to change your seed URI if you used it. One thing to be aware of: if Heritrix encounters the URI http://foo.org/bar/dir, where dir is a directory, the URI will be rejected since it is missing the terminating slash.
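
As before, you can sanity-check the pattern against a few representative URIs (a hypothetical test with URIs of our own choosing; true means the URI matches the pattern and therefore escapes the REJECT decision):

  import java.util.regex.Pattern;

  public class HtmlPatternTest {
      public static void main(String[] args) {
          Pattern p = Pattern.compile(".*(/|\\.html)$");
          System.out.println(p.matcher("http://foo.org/bar/").matches());         // true:  the seed itself
          System.out.println(p.matcher("http://foo.org/bar/abc.html").matches()); // true:  ends in .html
          System.out.println(p.matcher("http://foo.org/bar/dir/").matches());     // true:  ends in /
          System.out.println(p.matcher("http://foo.org/bar/dir").matches());      // false: missing the slash
          System.out.println(p.matcher("http://foo.org/bar/pic.gif").matches());  // false: not .html
      }
  }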

Finally, you'll need to allow Heritrix to differentiate between abc.html and ABC.HTML. Do this by removing the LowercaseRule under uri-canonicalization-rules on the Submodules screen.
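
To see why, consider what a lowercasing canonicalization rule does to the URI key Heritrix uses for already-seen checks (a simplified illustration, not Heritrix's actual rule):

  public class LowercaseDemo {
      public static void main(String[] args) {
          // With a lowercasing rule in place, both URIs collapse to the same
          // canonical key, so the second would be treated as already seen.
          String a = "http://foo.org/bar/abc.html".toLowerCase();
          String b = "http://foo.org/bar/ABC.HTML".toLowerCase();
          System.out.println(a.equals(b));   // true
      }
  }

With the LowercaseRule removed, the two keys differ, and both resources are fetched.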

Once you have entered the desired settings, start the job and monitor the crawl.