org.archive.crawler.framework
Class WriterPoolProcessor

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.framework.Processor
                      extended by org.archive.crawler.framework.WriterPoolProcessor
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants, FetchStatusCodes, CrawlStatusListener
Direct Known Subclasses:
ARCWriterProcessor, WARCWriterProcessor

public abstract class WriterPoolProcessor
extends Processor
implements CoreAttributeConstants, CrawlStatusListener, FetchStatusCodes

Abstract implementation of a file pool processor. Subclass it to implement a processor for a particular WriterPoolMember type.
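
The subclassing contract can be pictured with a simplified, self-contained sketch. It does not use the Heritrix classes: MiniWriterPoolProcessor and InMemoryProcessor are hypothetical stand-ins that only mirror the three abstract methods listed on this page (getDefaultMaxFileSize, setupPool, innerProcess), a plain String replaces CrawlURI, and the assumption that setupPool is invoked from initialTasks() is illustrative rather than taken from the source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified stand-in for the abstract processor; not the Heritrix class.
abstract class MiniWriterPoolProcessor {
    private final AtomicInteger serialNo = new AtomicInteger();

    // The three methods a concrete subclass has to supply.
    abstract long getDefaultMaxFileSize();
    protected abstract void setupPool(AtomicInteger serialNo);
    protected abstract void innerProcess(String uri);

    // Assumption: the pool is set up once, before any URI is processed.
    void initialTasks() {
        setupPool(serialNo);
    }
}

// Hypothetical subclass that "writes" records to an in-memory list.
class InMemoryProcessor extends MiniWriterPoolProcessor {
    final List<String> written = new ArrayList<>();
    private String currentFile;

    @Override
    long getDefaultMaxFileSize() {
        return 100L * 1024 * 1024; // arbitrary illustrative value
    }

    @Override
    protected void setupPool(AtomicInteger serialNo) {
        // Name the first "file" after the shared serial number.
        currentFile = "file-" + serialNo.getAndIncrement();
    }

    @Override
    protected void innerProcess(String uri) {
        written.add(currentFile + ":" + uri);
    }
}
```

In this sketch, calling initialTasks() and then innerProcess("http://example.com/") leaves one entry in written, prefixed with the file name chosen in setupPool.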

Author:
Parker Thompson, stack
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
protected static java.lang.String ANNOTATION_UNWRITTEN
          CrawlURI annotation indicating no record was written
static java.lang.String ATTR_COMPRESS
          Key to use asking settings for file compression value.
static java.lang.String ATTR_MAX_BYTES_WRITTEN
          Key for the maximum bytes to write attribute.
static java.lang.String ATTR_MAX_SIZE_BYTES
          Key to use asking settings for file max size value.
static java.lang.String ATTR_PATH
          Key to use asking settings for arc path value.
static java.lang.String ATTR_POOL_MAX_ACTIVE
          Key to get maximum pool size.
static java.lang.String ATTR_POOL_MAX_WAIT
          Key to get maximum wait on pool object before we give up and throw IOException.
static java.lang.String ATTR_PREFIX
          Key to use asking settings for file prefix value.
static java.lang.String ATTR_SKIP_IDENTICAL_DIGESTS
          Key for whether to skip writing records of content-digest repeats
static java.lang.String ATTR_SUFFIX
          Key to use asking settings for file suffix value.
static boolean DEFAULT_COMPRESS
          Default as to whether we do compression of files.
 
Fields inherited from class org.archive.crawler.framework.Processor
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definition, definitionMap
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
 
Constructor Summary
WriterPoolProcessor(java.lang.String name)
           
WriterPoolProcessor(java.lang.String name, java.lang.String description)
           
 
Method Summary
protected  java.util.List<java.lang.String> cacheMetadata()
           
protected  void checkBytesWritten()
           
protected  void checkpointRecover()
          Called out of initialTasks() when recovering a checkpoint.
 void crawlCheckpoint(java.io.File checkpointDir)
          Called by CrawlController when checkpointing.
 void crawlEnded(java.lang.String sExitMessage)
          Called when a CrawlController has ended a crawl and is about to exit.
 void crawlEnding(java.lang.String sExitMessage)
          Called when a CrawlController is ending a crawl (for any reason)
 void crawlPaused(java.lang.String statusMessage)
          Called when a CrawlController is actually paused (all threads are idle).
 void crawlPausing(java.lang.String statusMessage)
          Called when a CrawlController is going to be paused.
 void crawlResuming(java.lang.String statusMessage)
          Called when a CrawlController is resuming a crawl that had been paused.
 void crawlStarted(java.lang.String message)
          Called on crawl start.
 java.lang.Object getAttributeUnchecked(java.lang.String name)
          Version of getAttributes that catches and logs exceptions and returns null if the attribute cannot be fetched.
protected  java.lang.String getCheckpointStateFile()
           
abstract  long getDefaultMaxFileSize()
          Default maximum file size.
protected  java.lang.String[] getDefaultPath()
           
protected  java.lang.String getFirstrecordBody(java.io.File orderFile)
          Write the arc metadata body content.
protected  java.lang.String getFirstrecordStylesheet()
           
protected  java.lang.String getHostAddress(CrawlURI curi)
          Return IP address of given URI suitable for recording (as in a classic ARC 5-field header line).
 long getMaxSize()
          Max size we want files to be (bytes).
 long getMaxToWrite()
           
 java.util.List<java.lang.String> getMetadata()
          Returns the list of metadata items to add to the first ARC file metadata record.
 java.util.List<java.io.File> getOutputDirs()
           
protected  WriterPool getPool()
           
 int getPoolMaximumActive()
           
 int getPoolMaximumWait()
           
 java.lang.String getPrefix()
           
protected  java.util.concurrent.atomic.AtomicInteger getSerialNo()
           
 java.lang.String getSuffix()
           
protected  long getTotalBytesWritten()
           
 void initialTasks()
          Classes subclassing this one should override this method to perform processor specific actions.
protected abstract  void innerProcess(CrawlURI curi)
          Writes a CrawlURI and its associated data to store file.
 boolean isCompressed()
           
protected  int loadCheckpointSerialNumber()
           
protected  void saveCheckpointSerialNumber(java.io.File checkpointDir, int serialNo)
           
protected  void setPool(WriterPool pool)
           
protected  void setTotalBytesWritten(long totalBytesWritten)
           
protected abstract  void setupPool(java.util.concurrent.atomic.AtomicInteger serialNo)
          Set up pool of files.
protected  boolean shouldWrite(CrawlURI curi)
          Whether the given CrawlURI should be written to archive files.
 
Methods inherited from class org.archive.crawler.framework.Processor
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, removeElementFromDefinition, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, toString, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName, hashCode
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
 

Field Detail

ATTR_COMPRESS

public static final java.lang.String ATTR_COMPRESS
Key to use asking settings for file compression value.

See Also:
Constant Field Values

DEFAULT_COMPRESS

public static final boolean DEFAULT_COMPRESS
Default as to whether we do compression of files.

See Also:
Constant Field Values

ATTR_PREFIX

public static final java.lang.String ATTR_PREFIX
Key to use asking settings for file prefix value.

See Also:
Constant Field Values

ATTR_PATH

public static final java.lang.String ATTR_PATH
Key to use asking settings for arc path value.

See Also:
Constant Field Values

ATTR_SUFFIX

public static final java.lang.String ATTR_SUFFIX
Key to use asking settings for file suffix value.

See Also:
Constant Field Values

ATTR_MAX_SIZE_BYTES

public static final java.lang.String ATTR_MAX_SIZE_BYTES
Key to use asking settings for file max size value.

See Also:
Constant Field Values

ATTR_POOL_MAX_ACTIVE

public static final java.lang.String ATTR_POOL_MAX_ACTIVE
Key to get maximum pool size. This key is for maximum files active in the pool.

See Also:
Constant Field Values

ATTR_POOL_MAX_WAIT

public static final java.lang.String ATTR_POOL_MAX_WAIT
Key to get maximum wait on pool object before we give up and throw IOException.

See Also:
Constant Field Values
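
The "wait, then give up and throw IOException" behaviour that ATTR_POOL_MAX_WAIT configures can be illustrated with a minimal sketch. It does not reproduce the Heritrix WriterPool: BoundedPoolSketch, borrow and giveBack are hypothetical names, and the pool is modelled with java.util.concurrent.ArrayBlockingQueue purely for illustration.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a writer pool; not the Heritrix WriterPool class.
class BoundedPoolSketch<T> {
    private final BlockingQueue<T> idle;
    private final long maxWaitMillis;

    BoundedPoolSketch(List<T> members, long maxWaitMillis) {
        // The queue capacity plays the role of the "maximum active" setting.
        this.idle = new ArrayBlockingQueue<>(Math.max(1, members.size()), false, members);
        this.maxWaitMillis = maxWaitMillis;
    }

    // Waits up to maxWaitMillis for a free member, then gives up with an IOException.
    T borrow() throws IOException {
        try {
            T member = idle.poll(maxWaitMillis, TimeUnit.MILLISECONDS);
            if (member == null) {
                throw new IOException("No pool member free after " + maxWaitMillis + " ms");
            }
            return member;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IOException("Interrupted while waiting for a pool member", e);
        }
    }

    // Returns a member to the pool so that another caller can borrow it.
    void giveBack(T member) {
        idle.offer(member);
    }
}
```

With a single-member pool, a second borrow() made before giveBack() blocks for the configured wait and then fails with an IOException.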

ATTR_MAX_BYTES_WRITTEN

public static final java.lang.String ATTR_MAX_BYTES_WRITTEN
Key for the maximum bytes to write attribute.

See Also:
Constant Field Values

ATTR_SKIP_IDENTICAL_DIGESTS

public static final java.lang.String ATTR_SKIP_IDENTICAL_DIGESTS
Key for whether to skip writing records of content-digest repeats

See Also:
Constant Field Values

ANNOTATION_UNWRITTEN

protected static final java.lang.String ANNOTATION_UNWRITTEN
CrawlURI annotation indicating no record was written

See Also:
Constant Field Values
Constructor Detail

WriterPoolProcessor

public WriterPoolProcessor(java.lang.String name)
Parameters:
name - Name of this processor.

WriterPoolProcessor

public WriterPoolProcessor(java.lang.String name,
                           java.lang.String description)
Parameters:
name - Name of this processor.
description - Description for this processor.
Method Detail

getDefaultMaxFileSize

public abstract long getDefaultMaxFileSize()
Default maximum file size.


getDefaultPath

protected java.lang.String[] getDefaultPath()

initialTasks

public void initialTasks()
Description copied from class: Processor
Classes subclassing this one should override this method to perform processor specific actions.

This method is guaranteed to be called after the crawl is set up, but before any URI processing has occurred.

Overrides:
initialTasks in class Processor

getSerialNo

protected java.util.concurrent.atomic.AtomicInteger getSerialNo()

setupPool

protected abstract void setupPool(java.util.concurrent.atomic.AtomicInteger serialNo)
Set up pool of files.


innerProcess

protected abstract void innerProcess(CrawlURI curi)
Writes a CrawlURI and its associated data to store file. Currently this method understands the following uri types: dns, http, and https.

Overrides:
innerProcess in class Processor
Parameters:
curi - CrawlURI to process.

checkBytesWritten

protected void checkBytesWritten()

shouldWrite

protected boolean shouldWrite(CrawlURI curi)
Whether the given CrawlURI should be written to archive files. Annotates CrawlURI with a reason for any negative answer.

Parameters:
curi - CrawlURI
Returns:
true if URI should be written; false otherwise
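
The "answer no and annotate the reason" pattern described for shouldWrite can be sketched as follows. Everything here is hypothetical: FetchedItem stands in for CrawlURI, the UNWRITTEN prefix and the reason strings are placeholders rather than the real ANNOTATION_UNWRITTEN value, and the two conditions (a non-positive fetch status, and a repeated content digest when digest skipping is enabled) are illustrative assumptions, not the rules the actual method applies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a CrawlURI carrying only what this sketch needs.
class FetchedItem {
    final int fetchStatus;
    final boolean digestSeenBefore;
    final List<String> annotations = new ArrayList<>();

    FetchedItem(int fetchStatus, boolean digestSeenBefore) {
        this.fetchStatus = fetchStatus;
        this.digestSeenBefore = digestSeenBefore;
    }
}

class ShouldWriteSketch {
    // Placeholder prefix; the real annotation text is not given on this page.
    static final String UNWRITTEN = "unwritten";

    // Returns false and records why whenever the item should not be written.
    static boolean shouldWrite(FetchedItem item, boolean skipIdenticalDigests) {
        if (item.fetchStatus <= 0) {
            item.annotations.add(UNWRITTEN + ":status");
            return false;
        }
        if (skipIdenticalDigests && item.digestSeenBefore) {
            item.annotations.add(UNWRITTEN + ":identicalDigest");
            return false;
        }
        return true; // a positive answer leaves the annotations untouched
    }
}
```

Recording the reason alongside the negative answer lets later stages, such as logging, explain why no record exists for a given URI.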

getHostAddress

protected java.lang.String getHostAddress(CrawlURI curi)
Return IP address of given URI suitable for recording (as in a classic ARC 5-field header line).

Parameters:
curi - CrawlURI
Returns:
String of IP address

getAttributeUnchecked

public java.lang.Object getAttributeUnchecked(java.lang.String name)
Version of getAttributes that catches and logs exceptions and returns null if the attribute cannot be fetched.

Parameters:
name - Attribute name.
Returns:
Attribute or null.
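
The catch, log and return-null contract can be illustrated with a small self-contained sketch. CheckedLookup and UncheckedLookupSketch are hypothetical names introduced here; the sketch uses java.util.logging and does not call the Heritrix settings API.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical lookup that may fail with a checked exception.
interface CheckedLookup {
    Object get(String name) throws Exception;
}

class UncheckedLookupSketch {
    private static final Logger LOGGER =
            Logger.getLogger(UncheckedLookupSketch.class.getName());

    // Returns the attribute value, or null (after logging) if the lookup fails.
    static Object getUnchecked(CheckedLookup source, String name) {
        try {
            return source.get(name);
        } catch (Exception e) {
            LOGGER.log(Level.WARNING, "Failed to fetch attribute " + name, e);
            return null;
        }
    }
}
```

Callers of such a method have to treat null as "not available", since a failed lookup and a missing value look the same.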

getMaxSize

public long getMaxSize()
Max size we want files to be (bytes). Default is ARCConstants.DEFAULT_MAX_ARC_FILE_SIZE. Note that ARC files will usually be bigger than maxSize; they'll be maxSize + length to next boundary.

Returns:
ARC maximum size.
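
Why files usually end up larger than maxSize can be shown with a small simulation. It assumes, as one plausible reading of the note above, that the size limit is checked only between records, so the record that crosses the limit is still written in full. RolloverSketch and fileSizes are hypothetical names, not part of Heritrix.

```java
import java.util.ArrayList;
import java.util.List;

class RolloverSketch {
    // Returns the final size of each file when records are appended in order
    // and a new file is started only once the current one has reached maxSize.
    static List<Long> fileSizes(long maxSize, long[] recordLengths) {
        List<Long> sizes = new ArrayList<>();
        long current = 0;
        for (long length : recordLengths) {
            if (current >= maxSize) {
                // The limit is checked at a record boundary only.
                sizes.add(current);
                current = 0;
            }
            current += length; // a record is never split across files
        }
        if (current > 0) {
            sizes.add(current);
        }
        return sizes;
    }
}
```

For example, fileSizes(100, new long[] {60, 60, 30, 80, 50}) yields [120, 110, 50]: the first two files overshoot the 100-byte limit by the tail of the record that crossed it.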

getPrefix

public java.lang.String getPrefix()

getOutputDirs

public java.util.List<java.io.File> getOutputDirs()

isCompressed

public boolean isCompressed()

getPoolMaximumActive

public int getPoolMaximumActive()
Returns:
Returns the poolMaximumActive.

getPoolMaximumWait

public int getPoolMaximumWait()
Returns:
Returns the poolMaximumWait.

getSuffix

public java.lang.String getSuffix()

getMaxToWrite

public long getMaxToWrite()

crawlEnding

public void crawlEnding(java.lang.String sExitMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is ending a crawl (for any reason)

Specified by:
crawlEnding in interface CrawlStatusListener
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:
CrawlJob

crawlEnded

public void crawlEnded(java.lang.String sExitMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController has ended a crawl and is about to exit.

Specified by:
crawlEnded in interface CrawlStatusListener
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:
CrawlJob

crawlStarted

public void crawlStarted(java.lang.String message)
Description copied from interface: CrawlStatusListener
Called on crawl start.

Specified by:
crawlStarted in interface CrawlStatusListener
Parameters:
message - Start message.

getCheckpointStateFile

protected java.lang.String getCheckpointStateFile()

crawlCheckpoint

public void crawlCheckpoint(java.io.File checkpointDir)
                     throws java.io.IOException
Description copied from interface: CrawlStatusListener
Called by CrawlController when checkpointing.

Specified by:
crawlCheckpoint in interface CrawlStatusListener
Parameters:
checkpointDir - Checkpoint dir. Write checkpoint state here.
Throws:
java.io.IOException

crawlPausing

public void crawlPausing(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is going to be paused.

Specified by:
crawlPausing in interface CrawlStatusListener
Parameters:
statusMessage - Should be STATUS_WAITING_FOR_PAUSE. Passed for convenience

crawlPaused

public void crawlPaused(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is actually paused (all threads are idle).

Specified by:
crawlPaused in interface CrawlStatusListener
Parameters:
statusMessage - Should be CrawlJob.STATUS_PAUSED. Passed for convenience

crawlResuming

public void crawlResuming(java.lang.String statusMessage)
Description copied from interface: CrawlStatusListener
Called when a CrawlController is resuming a crawl that had been paused.

Specified by:
crawlResuming in interface CrawlStatusListener
Parameters:
statusMessage - Should be CrawlJob.STATUS_RUNNING. Passed for convenience

getPool

protected WriterPool getPool()

setPool

protected void setPool(WriterPool pool)

getTotalBytesWritten

protected long getTotalBytesWritten()

setTotalBytesWritten

protected void setTotalBytesWritten(long totalBytesWritten)

checkpointRecover

protected void checkpointRecover()
Called out of initialTasks() when recovering a checkpoint. Restore state.


loadCheckpointSerialNumber

protected int loadCheckpointSerialNumber()
Returns:
Serial number from the checkpoint state file, or -1 if the file is unreadable (callers should check for -1).
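
The -1 sentinel convention can be sketched as below. The on-disk format is an assumption made for illustration (a text file holding one decimal integer); SerialNumberSketch, load and restore are hypothetical names and do not show how Heritrix actually stores its checkpoint state.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicInteger;

class SerialNumberSketch {
    // Reads the serial number, or returns -1 if the file is missing or malformed.
    static int load(Path stateFile) {
        try {
            String text = new String(Files.readAllBytes(stateFile), StandardCharsets.UTF_8);
            return Integer.parseInt(text.trim());
        } catch (IOException | NumberFormatException e) {
            return -1;
        }
    }

    // Caller side: only overwrite the live counter when a real value was read.
    static AtomicInteger restore(Path stateFile, AtomicInteger current) {
        int loaded = load(stateFile);
        if (loaded != -1) {
            current.set(loaded);
        }
        return current;
    }
}
```

Checking for -1 before using the value keeps a failed read from resetting the counter to a bogus serial number.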

saveCheckpointSerialNumber

protected void saveCheckpointSerialNumber(java.io.File checkpointDir,
                                          int serialNo)
                                   throws java.io.IOException
Throws:
java.io.IOException

getMetadata

public java.util.List<java.lang.String> getMetadata()
Returns the list of metadata items to add to the first ARC file metadata record. By default, a stylesheet is applied to the order file; to specify the stylesheet, override getFirstrecordStylesheet(). XML files are obtained from the settings handler; currently the order file is the only XML file. Seeds are NOT added to the metadata.

Returns:
List of strings and/or files to add to arc file as metadata or null.

cacheMetadata

protected java.util.List<java.lang.String> cacheMetadata()

getFirstrecordStylesheet

protected java.lang.String getFirstrecordStylesheet()

getFirstrecordBody

protected java.lang.String getFirstrecordBody(java.io.File orderFile)
Write the ARC metadata body content. It is based on the order XML file, to which other information, such as the machine IP, is added.

Parameters:
orderFile - Order file.
Returns:
String that holds the arc metaheader body.


Copyright © 2003-2011 Internet Archive. All Rights Reserved.