java.lang.Object
javax.management.Attribute
org.archive.crawler.settings.Type
org.archive.crawler.settings.ComplexType
org.archive.crawler.settings.ModuleType
org.archive.crawler.framework.Processor
org.archive.crawler.framework.WriterPoolProcessor
public abstract class WriterPoolProcessor
Abstract implementation of a file pool processor.
Subclass it to provide an implementation for a particular WriterPoolMember type.
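The subclassing contract described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Heritrix code: `SketchWriterPoolProcessor` and `SketchArcProcessor` are hypothetical stand-ins, a plain `String` stands in for `CrawlURI`, a `List<String>` stands in for `WriterPool`, and the file-size value and file-name pattern are invented. Only the three abstract hooks (`getDefaultMaxFileSize`, `setupPool`, `innerProcess`) and `initialTasks` are names taken from this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for the abstract base class: it mirrors the shape of the API
// listed on this page but is NOT the real Heritrix class.
abstract class SketchWriterPoolProcessor {
    private final String name;
    private final AtomicInteger serialNo = new AtomicInteger();
    // Stand-in for WriterPool: a list of file names.
    protected final List<String> pool = new ArrayList<>();

    SketchWriterPoolProcessor(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    // The three abstract hooks a concrete subclass has to supply.
    public abstract long getDefaultMaxFileSize();
    protected abstract void setupPool(AtomicInteger serialNo);
    protected abstract void innerProcess(String uri);

    // Runs once after setup and before any URI is processed.
    public void initialTasks() {
        setupPool(serialNo);
    }

    public void process(String uri) {
        innerProcess(uri);
    }
}

// Hypothetical concrete subclass bound to one kind of writer.
class SketchArcProcessor extends SketchWriterPoolProcessor {
    final List<String> written = new ArrayList<>();

    SketchArcProcessor(String name) {
        super(name);
    }

    @Override
    public long getDefaultMaxFileSize() {
        return 100_000_000L; // assumed value, for illustration only
    }

    @Override
    protected void setupPool(AtomicInteger serialNo) {
        // Name the first "file" from the shared serial number.
        pool.add("sketch-" + serialNo.getAndIncrement() + ".arc");
    }

    @Override
    protected void innerProcess(String uri) {
        // Record which file the URI would have been written to.
        written.add(pool.get(0) + " <- " + uri);
    }
}
```

The base class owns the serial number and the call order, while the subclass decides what a writer is and how a record is written.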
Nested Class Summary |
---|
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType |
---|
ComplexType.MBeanAttributeInfoIterator |
Field Summary | | |
---|---|---|
protected static java.lang.String | ANNOTATION_UNWRITTEN | CrawlURI annotation indicating no record was written. |
static java.lang.String | ATTR_COMPRESS | Key to use asking settings for file compression value. |
static java.lang.String | ATTR_MAX_BYTES_WRITTEN | Key for the maximum bytes to write attribute. |
static java.lang.String | ATTR_MAX_SIZE_BYTES | Key to use asking settings for file max size value. |
static java.lang.String | ATTR_PATH | Key to use asking settings for arc path value. |
static java.lang.String | ATTR_POOL_MAX_ACTIVE | Key to get maximum pool size. |
static java.lang.String | ATTR_POOL_MAX_WAIT | Key to get maximum wait on pool object before we give up and throw IOException. |
static java.lang.String | ATTR_PREFIX | Key to use asking settings for file prefix value. |
static java.lang.String | ATTR_SKIP_IDENTICAL_DIGESTS | Key for whether to skip writing records of content-digest repeats. |
static java.lang.String | ATTR_SUFFIX | Key to use asking settings for file suffix value. |
static boolean | DEFAULT_COMPRESS | Default as to whether we do compression of files. |
Fields inherited from class org.archive.crawler.framework.Processor |
---|
ATTR_DECIDE_RULES, ATTR_ENABLED, attrDecideRules |
Fields inherited from class org.archive.crawler.settings.ComplexType |
---|
definition, definitionMap |
Constructor Summary | |
---|---|
WriterPoolProcessor(java.lang.String name) | |
WriterPoolProcessor(java.lang.String name, java.lang.String description) | |
Method Summary | | |
---|---|---|
protected java.util.List<java.lang.String> | cacheMetadata() | |
protected void | checkBytesWritten() | |
protected void | checkpointRecover() | Called out of initialTasks() when recovering a checkpoint. |
void | crawlCheckpoint(java.io.File checkpointDir) | Called by CrawlController when checkpointing. |
void | crawlEnded(java.lang.String sExitMessage) | Called when a CrawlController has ended a crawl and is about to exit. |
void | crawlEnding(java.lang.String sExitMessage) | Called when a CrawlController is ending a crawl (for any reason). |
void | crawlPaused(java.lang.String statusMessage) | Called when a CrawlController is actually paused (all threads are idle). |
void | crawlPausing(java.lang.String statusMessage) | Called when a CrawlController is going to be paused. |
void | crawlResuming(java.lang.String statusMessage) | Called when a CrawlController is resuming a crawl that had been paused. |
void | crawlStarted(java.lang.String message) | Called on crawl start. |
java.lang.Object | getAttributeUnchecked(java.lang.String name) | Version of getAttributes that catches and logs exceptions and returns null if it fails to fetch the attribute. |
protected java.lang.String | getCheckpointStateFile() | |
abstract long | getDefaultMaxFileSize() | Default maximum file size. |
protected java.lang.String[] | getDefaultPath() | |
protected java.lang.String | getFirstrecordBody(java.io.File orderFile) | Write the arc metadata body content. |
protected java.lang.String | getFirstrecordStylesheet() | |
protected java.lang.String | getHostAddress(CrawlURI curi) | Return IP address of given URI suitable for recording (as in a classic ARC 5-field header line). |
long | getMaxSize() | Max size we want files to be (bytes). |
long | getMaxToWrite() | |
java.util.List<java.lang.String> | getMetadata() | Return the list of metadata strings to add to the first arc file metadata record. |
java.util.List<java.io.File> | getOutputDirs() | |
protected WriterPool | getPool() | |
int | getPoolMaximumActive() | |
int | getPoolMaximumWait() | |
java.lang.String | getPrefix() | |
protected java.util.concurrent.atomic.AtomicInteger | getSerialNo() | |
java.lang.String | getSuffix() | |
protected long | getTotalBytesWritten() | |
void | initialTasks() | Classes subclassing this one should override this method to perform processor specific actions. |
protected abstract void | innerProcess(CrawlURI curi) | Writes a CrawlURI and its associated data to store file. |
boolean | isCompressed() | |
protected int | loadCheckpointSerialNumber() | |
protected void | saveCheckpointSerialNumber(java.io.File checkpointDir, int serialNo) | |
protected void | setPool(WriterPool pool) | |
protected void | setTotalBytesWritten(long totalBytesWritten) | |
protected abstract void | setupPool(java.util.concurrent.atomic.AtomicInteger serialNo) | Set up pool of files. |
protected boolean | shouldWrite(CrawlURI curi) | Whether the given CrawlURI should be written to archive files. |
Methods inherited from class org.archive.crawler.framework.Processor |
---|
checkForInterrupt, finalTasks, getController, getDecideRule, getDefaultNextProcessor, innerRejectProcess, isContentToProcess, isEnabled, isExpectedMimeType, isHttpTransactionContentToProcess, kickUpdate, process, report, rulesAccept, rulesAccept, setDefaultNextProcessor, spawn |
Methods inherited from class org.archive.crawler.settings.ModuleType |
---|
addElement, listUsedFiles |
Methods inherited from class org.archive.crawler.settings.Type |
---|
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |
Methods inherited from class javax.management.Attribute |
---|
getName, hashCode |
Methods inherited from class java.lang.Object |
---|
clone, finalize, getClass, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
public static final java.lang.String ATTR_COMPRESS
public static final boolean DEFAULT_COMPRESS
public static final java.lang.String ATTR_PREFIX
public static final java.lang.String ATTR_PATH
public static final java.lang.String ATTR_SUFFIX
public static final java.lang.String ATTR_MAX_SIZE_BYTES
public static final java.lang.String ATTR_POOL_MAX_ACTIVE
public static final java.lang.String ATTR_POOL_MAX_WAIT
public static final java.lang.String ATTR_MAX_BYTES_WRITTEN
public static final java.lang.String ATTR_SKIP_IDENTICAL_DIGESTS
protected static final java.lang.String ANNOTATION_UNWRITTEN
Constructor Detail |
---|
public WriterPoolProcessor(java.lang.String name)
name - Name of this processor.
public WriterPoolProcessor(java.lang.String name, java.lang.String description)
name - Name of this processor.
description - Description for this processor.
Method Detail |
---|
public abstract long getDefaultMaxFileSize()
protected java.lang.String[] getDefaultPath()
public void initialTasks()
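Description copied from class Processor: Classes subclassing this one should override this method to perform processor-specific actions. This method is guaranteed to be called after the crawl is set up, but before any URI processing has occurred.
Overrides: initialTasks in class Processor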
protected java.util.concurrent.atomic.AtomicInteger getSerialNo()
protected abstract void setupPool(java.util.concurrent.atomic.AtomicInteger serialNo)
protected abstract void innerProcess(CrawlURI curi)
Overrides: innerProcess in class Processor
curi - CrawlURI to process.
protected void checkBytesWritten()
protected boolean shouldWrite(CrawlURI curi)
curi - CrawlURI
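One way a write decision of this kind could combine the skip-identical-digests setting with the unwritten annotation is sketched below. The page does not describe the actual rules, so everything here is an assumption: `ShouldWriteSketch` and its nested `Uri` class are hypothetical stand-ins for the processor and `CrawlURI`, the `"unwritten"` string value is a guess, and the fallback to fetch success is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class ShouldWriteSketch {
    // Assumed value; the page only says this annotation marks "no record written".
    static final String ANNOTATION_UNWRITTEN = "unwritten";

    // Minimal stand-in for CrawlURI, carrying only what the sketch needs.
    static class Uri {
        final boolean fetchSucceeded;
        final boolean identicalDigest;
        final List<String> annotations = new ArrayList<>();

        Uri(boolean fetchSucceeded, boolean identicalDigest) {
            this.fetchSucceeded = fetchSucceeded;
            this.identicalDigest = identicalDigest;
        }
    }

    // Stand-in for the value read via ATTR_SKIP_IDENTICAL_DIGESTS.
    private final boolean skipIdenticalDigests;

    ShouldWriteSketch(boolean skipIdenticalDigests) {
        this.skipIdenticalDigests = skipIdenticalDigests;
    }

    boolean shouldWrite(Uri curi) {
        if (skipIdenticalDigests && curi.identicalDigest) {
            // Leave a trace on the URI so later stages can see nothing was written.
            curi.annotations.add(ANNOTATION_UNWRITTEN);
            return false;
        }
        // Assumed fallback: only successfully fetched content is written.
        return curi.fetchSucceeded;
    }
}
```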
protected java.lang.String getHostAddress(CrawlURI curi)
curi - CrawlURI
public java.lang.Object getAttributeUnchecked(java.lang.String name)
name - Attribute name.
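The catch, log, and return-null pattern that the summary describes for getAttributeUnchecked can be sketched as below. This is an illustration only: `UncheckedAttributeSketch` and its map-backed store are hypothetical, and the checked `getAttribute` here merely stands in for the settings lookup. `javax.management.AttributeNotFoundException` is a real JDK exception, but whether the Heritrix method throws it is an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.management.AttributeNotFoundException;

class UncheckedAttributeSketch {
    private static final Logger LOGGER =
            Logger.getLogger(UncheckedAttributeSketch.class.getName());

    // Hypothetical backing store standing in for the settings framework.
    private final Map<String, Object> attributes = new HashMap<>();

    void put(String name, Object value) {
        attributes.put(name, value);
    }

    // Checked lookup: callers must handle the failure case themselves.
    Object getAttribute(String name) throws AttributeNotFoundException {
        if (!attributes.containsKey(name)) {
            throw new AttributeNotFoundException("No such attribute: " + name);
        }
        return attributes.get(name);
    }

    // Unchecked variant: logs the failure and returns null instead of throwing.
    Object getAttributeUnchecked(String name) {
        try {
            return getAttribute(name);
        } catch (AttributeNotFoundException e) {
            LOGGER.log(Level.WARNING, "Failed to fetch attribute " + name, e);
            return null;
        }
    }
}
```

A caller of the unchecked variant trades the compile-time reminder for brevity, so it has to be prepared for a null result.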
public long getMaxSize()
public java.lang.String getPrefix()
public java.util.List<java.io.File> getOutputDirs()
public boolean isCompressed()
public int getPoolMaximumActive()
public int getPoolMaximumWait()
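The field summary says the maximum wait is how long to wait on a pool object before giving up and throwing IOException. A minimal sketch of that behavior, using only JDK classes, is shown below. `BoundedPoolSketch` is hypothetical and is not the WriterPool implementation; the millisecond unit is also an assumption, since this page does not state the unit of getPoolMaximumWait().

```java
import java.io.IOException;
import java.util.Collection;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class BoundedPoolSketch<T> {
    private final BlockingQueue<T> idle;
    private final long maxWaitMillis; // assumed unit: milliseconds

    // The number of members plays the role of the maximum pool size.
    BoundedPoolSketch(Collection<T> members, long maxWaitMillis) {
        this.idle = new LinkedBlockingQueue<>(members);
        this.maxWaitMillis = maxWaitMillis;
    }

    // Waits up to maxWaitMillis for a free member, then gives up with IOException.
    T borrow() throws IOException {
        try {
            T member = idle.poll(maxWaitMillis, TimeUnit.MILLISECONDS);
            if (member == null) {
                throw new IOException(
                        "Timed out after " + maxWaitMillis + " ms waiting for a pool member");
            }
            return member;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IOException("Interrupted while waiting for a pool member", e);
        }
    }

    // Returns a borrowed member so that another thread can use it.
    void giveBack(T member) {
        idle.add(member);
    }
}
```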
public java.lang.String getSuffix()
public long getMaxToWrite()
public void crawlEnding(java.lang.String sExitMessage)
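Description copied from interface CrawlStatusListener: Called when a CrawlController is ending a crawl (for any reason).
Specified by: crawlEnding in interface CrawlStatusListener
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.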
public void crawlEnded(java.lang.String sExitMessage)
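Description copied from interface CrawlStatusListener: Called when a CrawlController has ended a crawl and is about to exit.
Specified by: crawlEnded in interface CrawlStatusListener
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.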
public void crawlStarted(java.lang.String message)
Description copied from interface CrawlStatusListener: Called on crawl start.
Specified by: crawlStarted in interface CrawlStatusListener
message - Start message.
protected java.lang.String getCheckpointStateFile()
public void crawlCheckpoint(java.io.File checkpointDir) throws java.io.IOException
Description copied from interface CrawlStatusListener: Called by CrawlController when checkpointing.
Specified by: crawlCheckpoint in interface CrawlStatusListener
checkpointDir - Checkpoint dir. Write checkpoint state here.
Throws: java.io.IOException
public void crawlPausing(java.lang.String statusMessage)
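Description copied from interface CrawlStatusListener: Called when a CrawlController is going to be paused.
Specified by: crawlPausing in interface CrawlStatusListener
statusMessage - Should be STATUS_WAITING_FOR_PAUSE. Passed for convenience.
public void crawlPaused(java.lang.String statusMessage)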
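Description copied from interface CrawlStatusListener: Called when a CrawlController is actually paused (all threads are idle).
Specified by: crawlPaused in interface CrawlStatusListener
statusMessage - Should be CrawlJob.STATUS_PAUSED. Passed for convenience.
public void crawlResuming(java.lang.String statusMessage)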
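Description copied from interface CrawlStatusListener: Called when a CrawlController is resuming a crawl that had been paused.
Specified by: crawlResuming in interface CrawlStatusListener
statusMessage - Should be CrawlJob.STATUS_RUNNING. Passed for convenience.
protected WriterPool getPool()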
protected void setPool(WriterPool pool)
protected long getTotalBytesWritten()
protected void setTotalBytesWritten(long totalBytesWritten)
protected void checkpointRecover()
Called out of initialTasks() when recovering a checkpoint. Restore state.
protected int loadCheckpointSerialNumber()
protected void saveCheckpointSerialNumber(java.io.File checkpointDir, int serialNo) throws java.io.IOException
Throws: java.io.IOException
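Saving a serial number at checkpoint time and restoring it on recovery can be sketched with plain file I/O, as below. This is a hedged illustration, not the Heritrix implementation: `SerialCheckpointSketch` is hypothetical, the state file name is invented (the real name would come from getCheckpointStateFile()), and unlike loadCheckpointSerialNumber() the load method here takes the directory as a parameter to stay self-contained.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.concurrent.atomic.AtomicInteger;

class SerialCheckpointSketch {
    // Hypothetical file name, used only by this sketch.
    static final String STATE_FILE = "writer-pool.serialno";

    // Writes the current serial number as text into the checkpoint directory.
    static void saveSerialNumber(File checkpointDir, int serialNo) throws IOException {
        File stateFile = new File(checkpointDir, STATE_FILE);
        Files.write(stateFile.toPath(),
                Integer.toString(serialNo).getBytes(StandardCharsets.UTF_8));
    }

    // Reads the serial number back from the checkpoint directory.
    static int loadSerialNumber(File checkpointDir) throws IOException {
        File stateFile = new File(checkpointDir, STATE_FILE);
        String text = new String(Files.readAllBytes(stateFile.toPath()),
                StandardCharsets.UTF_8);
        return Integer.parseInt(text.trim());
    }

    // Restores an in-memory counter from the saved state.
    static void recover(File checkpointDir, AtomicInteger serialNo) throws IOException {
        serialNo.set(loadSerialNumber(checkpointDir));
    }
}
```

Restoring the counter before any new file is opened keeps file names from colliding with those written before the checkpoint.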
public java.util.List<java.lang.String> getMetadata()
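Returns the list of metadata strings to add to the first arc file metadata record; see also getFirstrecordStylesheet().
Gets XML files from the settings handler. Currently the order file is the only XML file. Seeds are NOT added to the metadata.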
protected java.util.List<java.lang.String> cacheMetadata()
protected java.lang.String getFirstrecordStylesheet()
protected java.lang.String getFirstrecordBody(java.io.File orderFile)
orderFile - Order file.