2. Obtaining and building Heritrix

2.1. Obtaining Heritrix

Heritrix can be obtained as a packaged binary or source download from the crawler SourceForge home page, or by checking it out from archive-crawler.svn.sourceforge.net. See the crawler SourceForge svn page for how to fetch from Subversion. The module name to use when checking out Heritrix is ArchiveOpenCrawler, the name Heritrix had before it was called Heritrix.

Note

Anonymous access does not give you the current HEAD but a snapshot that can sometimes be up to 24 hours behind HEAD.

The packaged binary is named heritrix-?.?.?.tar.gz (or heritrix-?.?.?.zip) and the packaged source is named heritrix-?.?.?-src.tar.gz (or heritrix-?.?.?-src.zip), where ?.?.? is the Heritrix release version.
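As a minimal sketch of how the naming scheme above can be used, the following shell function unpacks a downloaded binary tarball for a given version. The function name unpack_heritrix and its arguments are illustrative only and are not part of Heritrix; the tarball is assumed to be in the current directory.

```shell
# Sketch: unpack a packaged Heritrix binary named heritrix-<version>.tar.gz.
# unpack_heritrix is a hypothetical helper, not something shipped with Heritrix.
unpack_heritrix() {
    version="$1"    # the ?.?.? part of the file name
    dest="$2"       # directory to unpack into (created if missing)
    archive="heritrix-${version}.tar.gz"
    [ -f "$archive" ] || { echo "missing $archive" >&2; return 1; }
    mkdir -p "$dest" && tar -xzf "$archive" -C "$dest"
}
```

Call it as unpack_heritrix VERSION DEST_DIR from the directory holding the download, substituting the release version you fetched.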

2.2. Building Heritrix

You can build Heritrix from source using Maven. The Heritrix build has been tested against maven-1.0.2. Do not use Maven 2.x to build Heritrix. See maven.apache.org for how to obtain the binary and set up your Maven environment.

In addition to the base Maven build, if you want to generate the docbook user and developer manuals, you will need to add the Maven sdocbook plugin, which can be found at this page. (If the sdocbook plugin is not present, the build skips the docbook manual generation.) Be careful not to confuse the 'sdocbook' plugin with the similarly named 'docbook' plugin. The latter converts docbook to xdocs, whereas what is wanted is the former, which converts docbook XML to HTML.

Download the plugin jar (as of this writing, maven-sdocbook-plugin-1.4.1.jar) and put it into your Maven plugins directory, usually ${MAVEN_HOME}/plugins/ (in versions of Maven before 1.0.2, plugins are at ${HOME}/.maven/plugins/).
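The copy step can be sketched as a small shell function. This assumes MAVEN_HOME is set and the jar has already been downloaded; the function name install_sdocbook_plugin is made up for illustration.

```shell
# Sketch: copy a downloaded sdocbook plugin jar into ${MAVEN_HOME}/plugins/.
# install_sdocbook_plugin is a hypothetical helper name.
install_sdocbook_plugin() {
    jar="$1"    # path to the downloaded plugin jar
    [ -n "$MAVEN_HOME" ] || { echo "MAVEN_HOME is not set" >&2; return 1; }
    [ -f "$jar" ] || { echo "no such file: $jar" >&2; return 1; }
    mkdir -p "$MAVEN_HOME/plugins" && cp "$jar" "$MAVEN_HOME/plugins/"
}
```

For a pre-1.0.2 Maven, the same idea applies with ${HOME}/.maven/plugins/ as the target directory.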

The sdocbook plugin depends on the jimi jar from Sun, which you will have to download manually and place into your Maven repository (it has a Sun license you must accept, so Maven cannot download it automatically). Download the jimi package and unzip it. Rename the file named JimiProClasses.zip to jimi-1.0.jar and put it into your Maven jars repository (usually .maven/repository/jimi/jars; you may have to create the latter directories manually). Maven will be looking for a jar named jimi-1.0.jar, which is why you have to rename the jimi class zip (jars are effectively zips).
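The rename-and-place step might look like the sketch below. It assumes the repository lives under your home directory at .maven/repository; MAVEN_REPO is a hypothetical override variable introduced here for illustration, as is the function name install_jimi.

```shell
# Sketch: install JimiProClasses.zip under the name Maven looks for.
# MAVEN_REPO and install_jimi are illustrative names, not Maven settings.
install_jimi() {
    jimi_zip="$1"    # path to JimiProClasses.zip from the unzipped jimi package
    repo="${MAVEN_REPO:-$HOME/.maven/repository}"
    [ -f "$jimi_zip" ] || { echo "no such file: $jimi_zip" >&2; return 1; }
    # Create jimi/jars if it does not exist yet, then copy under the new name.
    mkdir -p "$repo/jimi/jars" && cp "$jimi_zip" "$repo/jimi/jars/jimi-1.0.jar"
}
```

Copying rather than moving leaves the unzipped jimi package intact in case the step has to be repeated.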

Note

It may be necessary to alter the sdocbook plugin's default configuration. By default, sdocbook will download the latest version of docbook-xsl. However, sdocbook also hardcodes a specific version number for docbook-xsl in its plugin.properties file, so the two can disagree. If you get an error like "Error while expanding ~/.maven/repository/docbook/zips/docbook-xsl-1.66.1.zip", then you will have to edit sdocbook's properties manually. First determine the version of docbook-xsl that you have; it is in ~/.maven/repository/docbook/zips. Once you have the version number, edit ~/.maven/cache/maven-sdocbook-plugin-1.4/plugin.properties and change the maven.sdocbook.stylesheets.version property to the version that was actually downloaded.

To build a source checkout with Maven:

% cd CHECKOUT_DIR
% $MAVEN_HOME/bin/maven dist

In the target/distribution subdir, you will find packaged source and binary builds. Run $MAVEN_HOME/bin/maven -g for other Maven possibilities.

2.3. Running Heritrix

See the User Manual [Heritrix User Guide] for how to run the built Heritrix.

2.4. Eclipse

The development team uses Eclipse as its development environment. This is of course optional, but for those who want to use Eclipse, the head of the source tree contains Eclipse .project and .classpath configuration files that should make integrating the source checkout into your Eclipse development environment straightforward.

When running directly from checkout directories, rather than from a Maven build, be sure to use a JDK installation (so that JSP pages can compile). You will probably also want to set the 'heritrix.development' property (with the "-Dheritrix.development" VM command-line option) to indicate that certain files are in their development, rather than deployment, locations.

2.5. Integration self test

Run the integration self test on the command line by doing the following:

% $HERITRIX_HOME/bin/heritrix --selftest

This sets the crawler going against itself, in particular against the selftest webapp. When done, it runs an analysis of the produced ARC files and logs and writes a ruling into heritrix_out.log. See the org.archive.crawler.selftest package for more on how the selftest works.
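To read the ruling once the selftest has finished, something like the following sketch can be used. It assumes the ruling is near the end of the log; the function name show_selftest_ruling and the default of 20 lines are arbitrary choices for illustration, and the log location is passed in rather than assumed.

```shell
# Sketch: print the last lines of heritrix_out.log, where the selftest ruling is written.
# show_selftest_ruling is a hypothetical helper name.
show_selftest_ruling() {
    log="${1:-heritrix_out.log}"    # path to the log; defaults to the current directory
    lines="${2:-20}"                # how many trailing lines to show
    [ -f "$log" ] || { echo "no such file: $log" >&2; return 1; }
    tail -n "$lines" "$log"
}
```

If the ruling is not visible in the trailing lines, increase the second argument or open the log directly.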