org.archive.crawler.processor
Class HashCrawlMapper
java.lang.Object
javax.management.Attribute
org.archive.crawler.settings.Type
org.archive.crawler.settings.ComplexType
org.archive.crawler.settings.ModuleType
org.archive.crawler.framework.Processor
org.archive.crawler.processor.CrawlMapper
org.archive.crawler.processor.HashCrawlMapper
- All Implemented Interfaces:
- java.io.Serializable, javax.management.DynamicMBean, FetchStatusCodes
public class HashCrawlMapper
- extends CrawlMapper
Maps URIs to one of N crawler names by applying a hash to the
URI's (possibly-transformed) classKey.
- Version:
- $Date: 2007-06-19 02:00:24 +0000 (Tue, 19 Jun 2007) $, $Revision: 5215 $
- Author:
- gojomo
- See Also:
- Serialized Form
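The class description above (hash the classKey, then pick one of N crawler names) can be illustrated with a minimal sketch. It is not the Heritrix implementation: the class name BucketSketch, the helper bucketFor, and the use of CRC32 as the hash are all assumptions made for illustration. The only property it demonstrates is that the same key always lands on the same crawler index in the range 0 to N-1.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative only: BucketSketch and bucketFor are hypothetical names,
// and CRC32 is a stand-in for whatever hash the real class applies.
class BucketSketch {

    // Map a classKey to a crawler index in [0, crawlerCount), returned as a string.
    static String bucketFor(String classKey, long crawlerCount) {
        if (crawlerCount <= 0) {
            throw new IllegalArgumentException("crawlerCount must be positive");
        }
        CRC32 crc = new CRC32();
        crc.update(classKey.getBytes(StandardCharsets.UTF_8));
        // floorMod keeps the result non-negative for any hash value.
        long bucket = Math.floorMod(crc.getValue(), crawlerCount);
        return Long.toString(bucket);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"example.com", "archive.org", "example.org"}) {
            System.out.println(key + " -> crawler " + bucketFor(key, 4));
        }
    }
}
```

Because the mapping depends only on the key and the crawler count, every node configured with the same count would compute the same assignment independently, without coordination.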
Fields inherited from class org.archive.crawler.processor.CrawlMapper |
ATTR_CHECK_OUTLINKS, ATTR_CHECK_URI, ATTR_DIVERSION_DIR, ATTR_LOCAL_NAME, ATTR_MAP_OUTLINK_DECIDE_RULES, ATTR_ROTATION_DIGITS, cache, DEFAULT_CHECK_OUTLINKS, DEFAULT_CHECK_URI, DEFAULT_DIVERSION_DIR, DEFAULT_LOCAL_NAME, DEFAULT_ROTATION_DIGITS, diversionLogs, localName, logGeneration |
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes |
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE |
Method Summary |
protected void |
initialTasks()
Classes subclassing this one should override this method to perform
processor-specific actions. |
void |
kickUpdate()
|
protected java.lang.String |
map(CandidateURI cauri)
Look up the crawler node name to which the given CandidateURI
should be mapped. |
static java.lang.String |
mapString(java.lang.String key,
java.lang.String reducePattern,
long bucketCount)
|
Methods inherited from class org.archive.crawler.framework.Processor |
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn |
Methods inherited from class org.archive.crawler.settings.ComplexType |
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute |
Methods inherited from class org.archive.crawler.settings.Type |
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |
Methods inherited from class javax.management.Attribute |
getName, hashCode |
Methods inherited from class java.lang.Object |
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
ATTR_CRAWLER_COUNT
public static final java.lang.String ATTR_CRAWLER_COUNT
- count of crawlers
- See Also:
- Constant Field Values
DEFAULT_CRAWLER_COUNT
public static final java.lang.Long DEFAULT_CRAWLER_COUNT
ATTR_USE_PUBLICSUFFIX_REDUCE
public static final java.lang.String ATTR_USE_PUBLICSUFFIX_REDUCE
- use publicsuffixes pattern for reducing classKey?
- See Also:
- Constant Field Values
DEFAULT_USE_PUBLICSUFFIX_REDUCE
public static final java.lang.Boolean DEFAULT_USE_PUBLICSUFFIX_REDUCE
ATTR_REDUCE_PATTERN
public static final java.lang.String ATTR_REDUCE_PATTERN
- regex pattern for reducing classKey
- See Also:
- Constant Field Values
DEFAULT_REDUCE_PATTERN
public static final java.lang.String DEFAULT_REDUCE_PATTERN
- See Also:
- Constant Field Values
bucketCount
long bucketCount
reducePattern
java.lang.String reducePattern
HashCrawlMapper
public HashCrawlMapper(java.lang.String name)
- Constructor.
- Parameters:
name
- Name of this processor.
map
protected java.lang.String map(CandidateURI cauri)
- Look up the crawler node name to which the given CandidateURI
should be mapped.
- Specified by:
map
in class CrawlMapper
- Parameters:
cauri
- CandidateURI to consider
- Returns:
- String node name which should handle URI
initialTasks
protected void initialTasks()
- Description copied from class:
Processor
- Classes subclassing this one should override this method to perform
processor-specific actions.
This method is guaranteed to be called after the crawl is set up, but
before any URI processing has occurred.
- Overrides:
initialTasks
in class CrawlMapper
kickUpdate
public void kickUpdate()
- Overrides:
kickUpdate
in class Processor
mapString
public static java.lang.String mapString(java.lang.String key,
java.lang.String reducePattern,
long bucketCount)
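This page gives no description for mapString. Judging only from the parameter names and the ATTR_REDUCE_PATTERN field ("regex pattern for reducing classKey"), one plausible reading is that the key is first reduced to the part matched by reducePattern and the reduced key is then hashed into one of bucketCount buckets. The sketch below follows that reading. The class ReduceSketch, its methods, the example pattern, and the use of String.hashCode are hypothetical and are not taken from the Heritrix source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: names and behavior here are assumptions, not the
// documented contract of HashCrawlMapper.mapString.
class ReduceSketch {

    // Reduce the key to the first regex match; keep the key unchanged
    // when the pattern is empty or does not match.
    static String reduce(String key, String reducePattern) {
        if (reducePattern == null || reducePattern.isEmpty()) {
            return key;
        }
        Matcher m = Pattern.compile(reducePattern).matcher(key);
        return m.find() ? m.group() : key;
    }

    // Hash the reduced key into [0, bucketCount) and return it as a string.
    // String.hashCode stands in for the real hash function.
    static String mapReduced(String key, String reducePattern, long bucketCount) {
        String reduced = reduce(key, reducePattern);
        long bucket = Math.floorMod((long) reduced.hashCode(), bucketCount);
        return Long.toString(bucket);
    }

    public static void main(String[] args) {
        // Example pattern (an assumption): keep only the last two host labels.
        String lastTwoLabels = "[^.]+\\.[^.]+$";
        for (String key : new String[] {"www.example.com", "news.example.com"}) {
            System.out.println(key + " -> " + reduce(key, lastTwoLabels)
                    + " -> bucket " + mapReduced(key, lastTwoLabels, 8));
        }
    }
}
```

Under this reading, reducing before hashing means keys that share a reduced form (here, hosts under the same two-label domain) are assigned to the same bucket.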
Copyright © 2003-2011 Internet Archive. All Rights Reserved.