org.archive.crawler.processor
Class HashCrawlMapper

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.processor.CrawlMapper
                          extended by org.archive.crawler.processor.HashCrawlMapper
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, FetchStatusCodes

public class HashCrawlMapper
extends CrawlMapper

Maps URIs to one of N crawler names by applying a hash to the URI's (possibly-transformed) classKey.

Version:
$Date: 2007-06-19 02:00:24 +0000 (Tue, 19 Jun 2007) $, $Revision: 5215 $
Author:
gojomo
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
static java.lang.String ATTR_CRAWLER_COUNT
          count of crawlers
static java.lang.String ATTR_REDUCE_PATTERN
          regex pattern for reducing classKey
static java.lang.String ATTR_USE_PUBLICSUFFIX_REDUCE
          use publicsuffixes pattern for reducing classKey?
(package private)  long bucketCount
           
static java.lang.Long DEFAULT_CRAWLER_COUNT
           
static java.lang.String DEFAULT_REDUCE_PATTERN
           
static java.lang.Boolean DEFAULT_USE_PUBLICSUFFIX_REDUCE
           
(package private)  java.lang.String reducePattern
           
 
Fields inherited from class org.archive.crawler.processor.CrawlMapper
ATTR_CHECK_OUTLINKS, ATTR_CHECK_URI, ATTR_DIVERSION_DIR, ATTR_LOCAL_NAME, ATTR_MAP_OUTLINK_DECIDE_RULES, ATTR_ROTATION_DIGITS, cache, DEFAULT_CHECK_OUTLINKS, DEFAULT_CHECK_URI, DEFAULT_DIVERSION_DIR, DEFAULT_LOCAL_NAME, DEFAULT_ROTATION_DIGITS, diversionLogs, localName, logGeneration
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
 
Constructor Summary
HashCrawlMapper(java.lang.String name)
          Constructs a HashCrawlMapper with the given processor name.
 
Method Summary
protected  void initialTasks()
          Classes subclassing this one should override this method to perform processor-specific actions.
 void kickUpdate()
           
protected  java.lang.String map(CandidateURI cauri)
          Look up the crawler node name to which the given CandidateURI should be mapped.
static java.lang.String mapString(java.lang.String key, java.lang.String reducePattern, long bucketCount)
           
 
Methods inherited from class org.archive.crawler.processor.CrawlMapper
decideToMapOutlink, divertLog, getDiversionLog, getMapOutlinkDecideRule, innerProcess, updateGeneration
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

ATTR_CRAWLER_COUNT

public static final java.lang.String ATTR_CRAWLER_COUNT
count of crawlers

See Also:
Constant Field Values

DEFAULT_CRAWLER_COUNT

public static final java.lang.Long DEFAULT_CRAWLER_COUNT

ATTR_USE_PUBLICSUFFIX_REDUCE

public static final java.lang.String ATTR_USE_PUBLICSUFFIX_REDUCE
use publicsuffixes pattern for reducing classKey?

See Also:
Constant Field Values

DEFAULT_USE_PUBLICSUFFIX_REDUCE

public static final java.lang.Boolean DEFAULT_USE_PUBLICSUFFIX_REDUCE

ATTR_REDUCE_PATTERN

public static final java.lang.String ATTR_REDUCE_PATTERN
regex pattern for reducing classKey

See Also:
Constant Field Values

DEFAULT_REDUCE_PATTERN

public static final java.lang.String DEFAULT_REDUCE_PATTERN
See Also:
Constant Field Values

bucketCount

long bucketCount

reducePattern

java.lang.String reducePattern

Constructor Detail

HashCrawlMapper

public HashCrawlMapper(java.lang.String name)
Constructs a HashCrawlMapper with the given processor name.

Parameters:
name - Name of this processor.

Method Detail

map

protected java.lang.String map(CandidateURI cauri)
Look up the crawler node name to which the given CandidateURI should be mapped.

Specified by:
map in class CrawlMapper
Parameters:
cauri - CandidateURI to consider
Returns:
String node name which should handle URI

initialTasks

protected void initialTasks()
Description copied from class: Processor
Classes subclassing this one should override this method to perform processor-specific actions.

This method is guaranteed to be called after the crawl is set up, but before any URI processing has occurred.

Overrides:
initialTasks in class CrawlMapper

kickUpdate

public void kickUpdate()
Overrides:
kickUpdate in class Processor

mapString

public static java.lang.String mapString(java.lang.String key,
                                         java.lang.String reducePattern,
                                         long bucketCount)
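
This page gives no description for mapString, so the following is only a minimal sketch of the general technique named in the class description: optionally reduce a key with a regex, hash it, and take the result modulo the bucket count. The class name HashMapperSketch, the method bucketFor, the CRC32 hash, and the example pattern are all assumptions made for illustration; none of them is taken from the Heritrix source, and the real method may reduce and hash differently.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.CRC32;

// Hypothetical illustration only; not the HashCrawlMapper implementation.
public class HashMapperSketch {

    /**
     * Maps a key to a bucket name in the range "0" .. (bucketCount - 1).
     * Assumption: if reducePattern is non-empty and matches, the first
     * match replaces the key before hashing.
     */
    static String bucketFor(String key, String reducePattern, long bucketCount) {
        if (bucketCount <= 0) {
            throw new IllegalArgumentException("bucketCount must be positive");
        }
        if (reducePattern != null && !reducePattern.isEmpty()) {
            Matcher m = Pattern.compile(reducePattern).matcher(key);
            if (m.find()) {
                key = m.group(); // keep only the matched portion of the key
            }
        }
        // CRC32 is a stand-in hash chosen because it is in the JDK;
        // getValue() is never negative, so the modulo result is a valid bucket.
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return Long.toString(crc.getValue() % bucketCount);
    }

    public static void main(String[] args) {
        // Hypothetical pattern: keep the last two dot-separated labels.
        String pattern = "[^.]+\\.[^.]+$";
        System.out.println(bucketFor("www.example.com", pattern, 4));
        System.out.println(bucketFor("mail.example.com", pattern, 4));
    }
}
```

In this sketch both keys reduce to the same string before hashing, so the two printed bucket names are identical. That is the point of reducing a classKey: related hosts land on the same crawler.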


Copyright © 2003-2011 Internet Archive. All Rights Reserved.