java.lang.Object
javax.management.Attribute
org.archive.crawler.settings.Type
org.archive.crawler.settings.ComplexType
org.archive.crawler.settings.ModuleType
org.archive.crawler.frontier.HostQueuesFrontier
BdbFrontier.
public class HostQueuesFrontier
A basic mostly breadth-first frontier, which refrains from emitting more than one CrawlURI of the same 'key' (host) at once, and respects minimum-delay and delay-factor specifications for politeness.
There are an arbitrary number of 'KeyedQueues' each representing a certain 'key' class of URIs -- effectively, a single host (by hostname).
KeyedQueues may have an item in-process -- in which case they do not provide any other items for processing. KeyedQueues may also be 'snoozed' -- when they should be kept inactive for a period of time, to either enforce politeness policies or allow a configurable amount of time between error retries.
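The queue states described above can be illustrated with a minimal sketch. It is not the HostQueuesFrontier implementation: the class names `HostQueueSketch` and `MiniFrontierSketch`, the use of plain strings for URIs, and the explicit `now` argument are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a KeyedQueue: one queue of URIs per host key.
class HostQueueSketch {
    final String key;
    final Deque<String> pending = new ArrayDeque<>();
    boolean inProcess = false; // true while one of its URIs is being processed
    long wakeTime = 0L;        // queue is snoozed until this time (millis)

    HostQueueSketch(String key) { this.key = key; }

    // A queue may hand out a URI only if it is idle, awake, and non-empty.
    boolean isReady(long now) {
        return !inProcess && now >= wakeTime && !pending.isEmpty();
    }
}

// Hypothetical miniature frontier: at most one URI per host is out at a time.
class MiniFrontierSketch {
    // LinkedHashMap keeps hosts in first-seen order, so the output is deterministic.
    private final Map<String, HostQueueSketch> queues = new LinkedHashMap<>();

    void schedule(String host, String uri) {
        queues.computeIfAbsent(host, HostQueueSketch::new).pending.add(uri);
    }

    // Returns the next URI from the first ready queue, or null if none is ready.
    String next(long now) {
        for (HostQueueSketch q : queues.values()) {
            if (q.isReady(now)) {
                q.inProcess = true;
                return q.pending.poll();
            }
        }
        return null;
    }

    // Marks the host's in-process URI as done and snoozes its queue.
    void finished(String host, long now, long snoozeMillis) {
        HostQueueSketch q = queues.get(host);
        if (q == null) {
            return; // unknown host: nothing to update
        }
        q.inProcess = false;
        q.wakeTime = now + snoozeMillis;
    }

    public static void main(String[] args) {
        MiniFrontierSketch f = new MiniFrontierSketch();
        f.schedule("a.example", "http://a.example/1");
        f.schedule("a.example", "http://a.example/2");
        System.out.println(f.next(0L)); // first URI of a.example
        System.out.println(f.next(0L)); // a.example is busy, so nothing is ready
    }
}
```

In this sketch a queue with an item in process yields nothing further until `finished` is called, and a snoozed queue yields nothing until its wake time has passed.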
| Nested Class Summary |
|---|
| Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType |
|---|
ComplexType.MBeanAttributeInfoIterator |
| Field Summary | |
|---|---|
protected static java.lang.String |
ACCEPTABLE_FORCE_QUEUE
Deprecated. |
protected UriUniqFilter |
alreadyIncluded
Deprecated. |
static java.lang.String |
ATTR_DELAY_FACTOR
Deprecated. how many multiples of last fetch elapsed time to wait before recontacting same server |
static java.lang.String |
ATTR_FORCE_QUEUE
Deprecated. queue assignment to force onto CrawlURIs; intended to be overridden |
static java.lang.String |
ATTR_HOLD_QUEUES
Deprecated. whether to hold queues INACTIVE until needed for throughput |
static java.lang.String |
ATTR_HOST_QUEUES_MEMORY_CAPACITY
Deprecated. maximum number of items to store in memory atop each keyedqueue; higher means more RAM used per active host, lower means more disk IO |
static java.lang.String |
ATTR_HOST_VALENCE
Deprecated. maximum simultaneous requests in process to a host (queue) |
static java.lang.String |
ATTR_IP_POLITENESS
Deprecated. whether to reassign URIs to IP-address based queues when IP known |
static java.lang.String |
ATTR_MAX_DELAY
Deprecated. never wait more than this long, regardless of multiple |
static java.lang.String |
ATTR_MAX_HOST_BANDWIDTH_USAGE
Deprecated. maximum per-host bandwidth usage |
static java.lang.String |
ATTR_MAX_OVERALL_BANDWIDTH_USAGE
Deprecated. maximum overall bandwidth usage |
static java.lang.String |
ATTR_MAX_RETRIES
Deprecated. maximum times to emit a CrawlURI without final disposition |
static java.lang.String |
ATTR_MIN_DELAY
Deprecated. always wait this long after one completion before recontacting same server, regardless of multiple |
static java.lang.String |
ATTR_PREFERENCE_EMBED_HOPS
Deprecated. number of hops of embeds (ERX) to bump to front of host queue |
static java.lang.String |
ATTR_RETRY_DELAY
Deprecated. for retryable problems, seconds to wait before a retry |
protected CrawlController |
controller
Deprecated. |
protected static java.lang.Float |
DEFAULT_DELAY_FACTOR
Deprecated. |
protected static java.lang.String |
DEFAULT_FORCE_QUEUE
Deprecated. |
protected static java.lang.Boolean |
DEFAULT_HOLD_QUEUES
Deprecated. |
protected static java.lang.Integer |
DEFAULT_HOST_QUEUES_MEMORY_CAPACITY
Deprecated. |
protected static java.lang.Integer |
DEFAULT_HOST_VALENCE
Deprecated. |
protected static java.lang.Boolean |
DEFAULT_IP_POLITENESS
Deprecated. |
protected static java.lang.Integer |
DEFAULT_MAX_DELAY
Deprecated. |
protected static java.lang.Integer |
DEFAULT_MAX_HOST_BANDWIDTH_USAGE
Deprecated. |
protected static java.lang.Integer |
DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE
Deprecated. |
protected static java.lang.Integer |
DEFAULT_MAX_RETRIES
Deprecated. |
protected static java.lang.Integer |
DEFAULT_MIN_DELAY
Deprecated. |
protected static java.lang.Integer |
DEFAULT_PREFERENCE_EMBED_HOPS
Deprecated. |
protected static java.lang.Long |
DEFAULT_RETRY_DELAY
Deprecated. |
(package private) long |
disregardedUriCount
Deprecated. |
(package private) long |
failedFetchCount
Deprecated. |
(package private) java.util.LinkedList |
inactiveClassQueues
Deprecated. All per-class queues that are INACTIVE; will be empty unless 'site-first'/'hold-queues' is set. |
protected static float |
KILO_FACTOR
Deprecated. |
(package private) int |
lastMaxBandwidthKB
Deprecated. |
(package private) static java.lang.String |
LOGNAME_RECOVER
Deprecated. |
protected long |
nextOrdinal
Deprecated. ordinal numbers to assign to created CrawlURIs |
(package private) long |
nextURIEmitTime
Deprecated. |
(package private) long |
processedBytesAfterLastEmittedURI
Deprecated. |
protected QueueAssignmentPolicy |
queueAssignmentPolicy
Deprecated. Policy for assigning CrawlURIs to named queues |
(package private) long |
queuedUriCount
Deprecated. |
protected java.util.LinkedList |
readyClassQueues
Deprecated. All per-class queues whose first item may be handed out (that is, they are READY). |
(package private) java.util.SortedSet |
snoozeQueues
Deprecated. All per-class queues that are on hold until a certain time. |
(package private) long |
succeededFetchCount
Deprecated. |
(package private) long |
totalProcessedBytes
Deprecated. |
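One plausible reading of how the delay-factor, min-delay and max-delay settings listed above combine is sketched here. The helper `PolitenessDelaySketch.computeDelay` is hypothetical, the millisecond units are an assumption, and the values in `main` are illustrative rather than the class defaults.

```java
// Hypothetical helper; not part of the HostQueuesFrontier API.
class PolitenessDelaySketch {

    // Returns how long to wait (assumed milliseconds) before recontacting a server.
    static long computeDelay(long lastFetchMillis, float delayFactor,
                             long minDelayMillis, long maxDelayMillis) {
        // Wait a multiple of the time the last fetch took.
        long wait = (long) (lastFetchMillis * delayFactor);
        // Always wait at least the minimum, regardless of the multiple.
        if (wait < minDelayMillis) {
            wait = minDelayMillis;
        }
        // Never wait more than the maximum, regardless of the multiple.
        if (wait > maxDelayMillis) {
            wait = maxDelayMillis;
        }
        return wait;
    }

    public static void main(String[] args) {
        // Illustrative values only: a 400 ms fetch with a factor of 5.
        System.out.println(computeDelay(400L, 5.0f, 500L, 5000L));
    }
}
```

The maximum is applied last, so in this sketch it wins if a configuration sets the minimum above the maximum.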
| Fields inherited from class org.archive.crawler.settings.ComplexType |
|---|
definitionMap |
| Fields inherited from interface org.archive.crawler.framework.Frontier |
|---|
ATTR_NAME |
| Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants |
|---|
A_ANNOTATIONS, A_CONTENT_TYPE, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_HTML_BASE, A_HTTP_TRANSACTION, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION |
| Constructor Summary | |
|---|---|
HostQueuesFrontier(java.lang.String name)
Deprecated. |
|
HostQueuesFrontier(java.lang.String q,
java.lang.String description)
Deprecated. |
|
| Method Summary | |
|---|---|
void |
batchFlush()
Deprecated. Flush pending URI queues. |
protected void |
batchSchedule(CandidateURI caUri)
Deprecated. |
protected java.lang.String |
canonicalize(UURI uuri)
Deprecated. Canonicalize passed uuri. |
void |
considerIncluded(UURI u)
Deprecated. Notify Frontier that it should consider the given UURI as if already scheduled. |
void |
crawlCheckpoint(java.io.File checkpointDir)
Deprecated. Called by CrawlController when checkpointing. |
void |
crawlEnded(java.lang.String sExitMessage)
Deprecated. Called when a CrawlController has ended a crawl and is about to exit. |
void |
crawlEnding(java.lang.String sExitMessage)
Deprecated. Called when a CrawlController is ending a crawl (for any reason) |
void |
crawlPaused(java.lang.String statusMessage)
Deprecated. Called when a CrawlController is actually paused (all threads are idle). |
void |
crawlPausing(java.lang.String statusMessage)
Deprecated. Called when a CrawlController is going to be paused. |
void |
crawlResuming(java.lang.String statusMessage)
Deprecated. Called when a CrawlController is resuming a crawl that had been paused. |
void |
crawlStarted(java.lang.String message)
Deprecated. Called on crawl start. |
protected UriUniqFilter |
createAlreadyIncluded(java.io.File dir,
java.lang.String filePrefix)
Deprecated. Create a UURISet that will serve as record of already seen URIs. |
void |
deleted(CrawlURI curi)
Deprecated. Notify Frontier that a CrawlURI has been deleted outside of the normal next()/finished() lifecycle. |
long |
deleteURIs(java.lang.String match)
Deprecated. Delete any URI that matches the given regular expression from the list of discovered and pending URIs. |
protected CrawlURI |
dequeueFromReady()
Deprecated. |
protected void |
discardQueue(URIWorkQueue q)
Deprecated. |
long |
discoveredUriCount()
Deprecated. (non-Javadoc) |
protected void |
disregardDisposition(CrawlURI curi)
Deprecated. |
long |
disregardedUriCount()
Deprecated. (non-Javadoc) |
protected long |
earliestWakeTime()
Deprecated. |
protected CrawlURI |
emitCuri(CrawlURI curi)
Deprecated. Prepares a CrawlURI for crawling. |
protected void |
enqueueToKeyed(CrawlURI curi)
Deprecated. Place CrawlURI on the queue for its class (server). |
long |
failedFetchCount()
Deprecated. (non-Javadoc) |
protected void |
failureDisposition(CrawlURI curi)
Deprecated. The CrawlURI has encountered a problem, and will not be retried. |
void |
finished(CrawlURI curi)
Deprecated. Note that the previously emitted CrawlURI has completed its processing (for now). |
protected void |
finishedSuccess(CrawlURI c)
Deprecated. |
long |
finishedUriCount()
Deprecated. (non-Javadoc) |
protected void |
forget(CrawlURI curi)
Deprecated. Forget the given CrawlURI. |
java.lang.String |
getClassKey(CandidateURI cauri)
Deprecated. |
FrontierJournal |
getFrontierJournal()
Deprecated. |
FrontierMarker |
getInitialMarker(java.lang.String regexpr,
boolean inCacheOnly)
Deprecated. Get a URIFrontierMarker initialized with the given regular expression at the 'start' of the Frontier. |
java.lang.String[] |
getReports()
Deprecated. Get an array of report names offered by this Reporter. |
protected CrawlServer |
getServer(CrawlURI curi)
Deprecated. |
java.util.ArrayList |
getURIsList(FrontierMarker marker,
int numberOfMatches,
boolean verbose)
Deprecated. Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached. |
void |
importRecoverLog(java.lang.String pathToLog,
boolean retainFailures)
Deprecated. Recover earlier state by reading a recovery log. |
void |
initialize(CrawlController c)
Deprecated. Initializes the Frontier, given the supplied CrawlController. |
protected void |
innerFinished(CrawlURI curi)
Deprecated. |
protected boolean |
isDisregarded(CrawlURI curi)
Deprecated. |
boolean |
isEmpty()
Deprecated. Store is empty only if all queues are empty and no URIs are in-process |
protected URIWorkQueue |
keyedQueueFor(CrawlURI curi)
Deprecated. Get the KeyedQueue for a CrawlURI. |
void |
kickUpdate()
Deprecated. Notify Frontier that it should consider updating configuration info that may have changed in external files. |
void |
loadSeeds()
Deprecated. Load up the seeds. |
protected boolean |
needsPromptRetry(CrawlURI curi)
Deprecated. Checks if a recently completed CrawlURI that did not finish successfully needs to be retried immediately (processed again as soon as politeness allows.) |
protected boolean |
needsRetrying(CrawlURI curi)
Deprecated. Checks if a recently completed CrawlURI that did not finish successfully needs to be retried (processed again after some time elapses) |
CrawlURI |
next()
Deprecated. Return the next CrawlURI to be processed (and presumably visited/fetched) by a worker thread. |
protected void |
noteInProcess(CrawlURI curi)
Deprecated. Marks a CrawlURI as being in process. |
void |
pause()
Deprecated. Notify Frontier that it should not release any URIs, instead holding all threads, until instructed otherwise. |
long |
queuedUriCount()
Deprecated. (non-Javadoc) |
void |
receive(CandidateURI caUri)
Deprecated. This method is called if the URI has not already been seen. |
void |
reportTo(java.io.PrintWriter writer)
Deprecated. Make a default report to the passed-in Writer. |
void |
reportTo(java.lang.String name,
java.io.PrintWriter writer)
Deprecated. This method compiles a human readable report on the status of the frontier at the time of the call. |
protected void |
reschedule(CrawlURI curi)
Deprecated. Put near top of relevant keyedqueue (but behind anything recently scheduled 'high') |
void |
schedule(CandidateURI caUri)
Deprecated. Arrange for the given CandidateURI to be visited, if it is not already scheduled/completed. |
protected void |
scheduleForRetry(CrawlURI curi)
Deprecated. |
protected boolean |
shouldBeForgotten(CrawlURI curi)
Deprecated. Some URIs, if they recur, deserve another chance at consideration: they might not be too many hops away via another path, or the scope may have been updated to allow them passage. |
java.lang.String |
singleLineReport()
Deprecated. Return a short single-line summary report as a String. |
void |
singleLineReportTo(java.io.PrintWriter writer)
Deprecated. Make a single-line summary report to the passed-in writer |
protected void |
snoozeQueueUntil(URIWorkQueue kq,
long wake)
Deprecated. Snoozes a queue until a fixed point in time has passed. |
void |
start()
Deprecated. Request that Frontier allow crawling to begin. |
long |
succeededFetchCount()
Deprecated. (non-Javadoc) |
protected void |
successDisposition(CrawlURI curi)
Deprecated. The CrawlURI has been successfully crawled, and will be attempted no more. |
void |
terminate()
Deprecated. Notify Frontier that it should end the crawl, giving any worker ToeThread that asks for a next() an EndedException. |
long |
totalBytesWritten()
Deprecated. Total number of bytes contained in all URIs that have been processed. |
void |
unpause()
Deprecated. Resumes the release of URIs to crawl, allowing worker ToeThreads to proceed. |
protected void |
updateScheduling(CrawlURI curi,
URIWorkQueue kq)
Deprecated. Update any scheduling structures with the new information in this CrawlURI. |
protected void |
wakeReadyQueues(long now)
Deprecated. Wake any snoozed queues whose snooze time is up. |
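The contract that the summary gives for start(), pause(), unpause(), terminate() and next() can be sketched with a simple monitor. This is an illustration under assumptions, not the class's code: `FrontierGateSketch` and `EndedSketchException` are hypothetical stand-ins, and the choice to hold URIs until `start()` is called is a guess at the intended behavior.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical stand-in for the EndedException named in the method summary.
class EndedSketchException extends Exception {
    EndedSketchException(String message) { super(message); }
}

// Hypothetical gate; illustrates the pause/unpause/terminate contract only.
class FrontierGateSketch {
    private final Queue<String> uris = new ArrayDeque<>();
    private boolean paused = true;      // assumption: held until start() is called
    private boolean terminated = false;

    synchronized void schedule(String uri) { uris.add(uri); notifyAll(); }
    synchronized void start()     { paused = false; notifyAll(); }
    synchronized void pause()     { paused = true; }
    synchronized void unpause()   { paused = false; notifyAll(); }
    synchronized void terminate() { terminated = true; notifyAll(); }

    // Blocks while paused or empty; throws once the crawl has been terminated.
    synchronized String next() throws InterruptedException, EndedSketchException {
        while (!terminated && (paused || uris.isEmpty())) {
            wait();
        }
        if (terminated) {
            throw new EndedSketchException("crawl terminated");
        }
        return uris.poll();
    }

    public static void main(String[] args) throws Exception {
        FrontierGateSketch gate = new FrontierGateSketch();
        gate.schedule("http://example.org/");
        gate.start();
        System.out.println(gate.next());
        gate.terminate();
        try {
            gate.next();
        } catch (EndedSketchException e) {
            System.out.println("ended: " + e.getMessage());
        }
    }
}
```

Worker threads calling `next()` on a paused gate simply wait, and every waiting or later caller receives the exception after `terminate()`.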
| Methods inherited from class org.archive.crawler.settings.ModuleType |
|---|
addElement, listUsedFiles |
| Methods inherited from class org.archive.crawler.settings.Type |
|---|
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient |
| Methods inherited from class javax.management.Attribute |
|---|
getName |
| Methods inherited from class java.lang.Object |
|---|
clone, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public static final java.lang.String ATTR_DELAY_FACTOR
protected static final java.lang.Float DEFAULT_DELAY_FACTOR
public static final java.lang.String ATTR_MIN_DELAY
protected static final java.lang.Integer DEFAULT_MIN_DELAY
public static final java.lang.String ATTR_MAX_DELAY
protected static final java.lang.Integer DEFAULT_MAX_DELAY
public static final java.lang.String ATTR_MAX_RETRIES
protected static final java.lang.Integer DEFAULT_MAX_RETRIES
public static final java.lang.String ATTR_RETRY_DELAY
protected static final java.lang.Long DEFAULT_RETRY_DELAY
public static final java.lang.String ATTR_HOLD_QUEUES
protected static final java.lang.Boolean DEFAULT_HOLD_QUEUES
public static final java.lang.String ATTR_HOST_VALENCE
protected static final java.lang.Integer DEFAULT_HOST_VALENCE
public static final java.lang.String ATTR_PREFERENCE_EMBED_HOPS
protected static final java.lang.Integer DEFAULT_PREFERENCE_EMBED_HOPS
public static final java.lang.String ATTR_IP_POLITENESS
protected static final java.lang.Boolean DEFAULT_IP_POLITENESS
public static final java.lang.String ATTR_FORCE_QUEUE
protected static final java.lang.String DEFAULT_FORCE_QUEUE
protected static final java.lang.String ACCEPTABLE_FORCE_QUEUE
public static final java.lang.String ATTR_MAX_OVERALL_BANDWIDTH_USAGE
protected static final java.lang.Integer DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE
public static final java.lang.String ATTR_MAX_HOST_BANDWIDTH_USAGE
protected static final java.lang.Integer DEFAULT_MAX_HOST_BANDWIDTH_USAGE
public static final java.lang.String ATTR_HOST_QUEUES_MEMORY_CAPACITY
protected static final java.lang.Integer DEFAULT_HOST_QUEUES_MEMORY_CAPACITY
protected static final float KILO_FACTOR
protected CrawlController controller
protected UriUniqFilter alreadyIncluded
protected long nextOrdinal
protected QueueAssignmentPolicy queueAssignmentPolicy
protected java.util.LinkedList readyClassQueues
java.util.SortedSet snoozeQueues
java.util.LinkedList inactiveClassQueues
long queuedUriCount
long succeededFetchCount
long failedFetchCount
long disregardedUriCount
long totalProcessedBytes
long nextURIEmitTime
long processedBytesAfterLastEmittedURI
int lastMaxBandwidthKB
static final java.lang.String LOGNAME_RECOVER
| Constructor Detail |
|---|
public HostQueuesFrontier(java.lang.String name)
public HostQueuesFrontier(java.lang.String q,
java.lang.String description)
| Method Detail |
|---|
public void initialize(CrawlController c)
throws FatalConfigurationException,
java.io.IOException
initialize in interface Frontier
c - The CrawlController that created the Frontier.
FatalConfigurationException - If provided settings are illegal or
otherwise unusable.
java.io.IOException - If there is a problem reading settings or seeds file
from disk.
Frontier.initialize(org.archive.crawler.framework.CrawlController)
protected UriUniqFilter createAlreadyIncluded(java.io.File dir,
java.lang.String filePrefix)
throws java.io.IOException
dir - Directory where the set's files should be written
filePrefix - Prefix to names of the set's files
java.io.IOException - If problems occur creating files on disk
public void loadSeeds()
loadSeeds in interface Frontier
CrawlController.kickUpdate()
protected void batchSchedule(CandidateURI caUri)
public void batchFlush()
public void schedule(CandidateURI caUri)
schedule in interface Frontier
caUri - The URI to schedule.
Frontier.schedule(org.archive.crawler.datamodel.CandidateURI)
public void receive(CandidateURI caUri)
receive in interface UriUniqFilter.HasUriReceiver
caUri - A URI object that has not been seen before.
public CrawlURI next()
throws java.lang.InterruptedException,
EndedException
next in interface Frontier
java.lang.InterruptedException
EndedException
Frontier.next()
public java.lang.String getClassKey(CandidateURI cauri)
getClassKey in interface Frontier
cauri - CandidateURI to calculate class key for.
protected CrawlServer getServer(CrawlURI curi)
curi -
public void finished(CrawlURI curi)
finished in interface Frontier
curi - The URI that has finished processing.
Frontier.finished(org.archive.crawler.datamodel.CrawlURI)
protected void innerFinished(CrawlURI curi)
protected void disregardDisposition(CrawlURI curi)
protected boolean isDisregarded(CrawlURI curi)
protected void successDisposition(CrawlURI curi)
curi - The CrawlURI
public boolean isEmpty()
isEmpty in interface Frontier
protected void wakeReadyQueues(long now)
now - Current time in millisec.
protected void discardQueue(URIWorkQueue q)
protected CrawlURI dequeueFromReady()
protected CrawlURI emitCuri(CrawlURI curi)
curi - The CrawlURI
noteInProcess(CrawlURI)
protected void noteInProcess(CrawlURI curi)
curi - The CrawlURI to mark.
protected URIWorkQueue keyedQueueFor(CrawlURI curi)
curi - The CrawlURI
protected void enqueueToKeyed(CrawlURI curi)
curi - The CrawlURI
protected long earliestWakeTime()
protected void updateScheduling(CrawlURI curi,
URIWorkQueue kq)
throws javax.management.AttributeNotFoundException
curi - The CrawlURI
kq - A KeyedQueue
javax.management.AttributeNotFoundException
protected void failureDisposition(CrawlURI curi)
curi - The CrawlURI
protected boolean needsPromptRetry(CrawlURI curi)
throws javax.management.AttributeNotFoundException
curi - The CrawlURI to check
javax.management.AttributeNotFoundException - If problems occur trying to read the
maximum number of retries from the settings framework.
protected boolean needsRetrying(CrawlURI curi)
throws javax.management.AttributeNotFoundException
curi - The CrawlURI to check
javax.management.AttributeNotFoundException - If problems occur trying to read the
maximum number of retries from the settings framework.
protected void scheduleForRetry(CrawlURI curi)
throws javax.management.AttributeNotFoundException
javax.management.AttributeNotFoundException
protected void reschedule(CrawlURI curi)
curi - CrawlURI to reschedule.
protected void snoozeQueueUntil(URIWorkQueue kq,
long wake)
kq - A KeyedQueue that we want to snooze
wake - Time (in millisec.) when we want the queue to stop snoozing.
protected boolean shouldBeForgotten(CrawlURI curi)
curi -
protected void forget(CrawlURI curi)
curi - The CrawlURI to forget
public long discoveredUriCount()
discoveredUriCount in interface Frontier
Frontier.discoveredUriCount()
public long queuedUriCount()
queuedUriCount in interface Frontier
Frontier.queuedUriCount()
public long finishedUriCount()
finishedUriCount in interface Frontier
Frontier.finishedUriCount()
public long succeededFetchCount()
succeededFetchCount in interface Frontier
Frontier.succeededFetchCount()
public long failedFetchCount()
failedFetchCount in interface Frontier
Frontier.failedFetchCount()
public long disregardedUriCount()
disregardedUriCount in interface Frontier
Frontier.disregardedUriCount()
public long totalBytesWritten()
Frontier
totalBytesWritten in interface Frontier
public FrontierMarker getInitialMarker(java.lang.String regexpr,
boolean inCacheOnly)
Frontier
Get a URIFrontierMarker initialized with the given
regular expression at the 'start' of the Frontier.
getInitialMarker in interface Frontier
regexpr - The regular expression that URIs within the frontier must
match to be considered within the scope of this marker
inCacheOnly - If set to true, only those URIs within the frontier
that are stored in cache (usually this means in memory
rather than on disk, but that is an implementation
detail) will be considered. Others will be entirely
ignored, as if they don't exist. This is useful for quick
peeks at the top of the URI list.
public java.util.ArrayList getURIsList(FrontierMarker marker,
int numberOfMatches,
boolean verbose)
throws InvalidFrontierMarkerException
Frontier
Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached.
Any encountered URI that has not been successfully crawled, terminally failed, disregarded or is currently being processed is included. As there may be duplicates in the frontier, there may also be duplicates in the report. Thus this includes both discovered and pending URIs.
The list is a set of strings containing the URI strings. If verbose is true the string will include some additional information (path to URI and parent).
The URIFrontierMarker will be advanced to the position at
which its maximum number of matches found is reached. Reusing it for
subsequent calls will thus effectively get the 'next' batch. Making
any changes to the frontier can invalidate the marker.
While the order returned is consistent, it does not have any explicit relation to the likely order in which they may be processed.
Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.
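The marker behavior described above, where reusing an advanced marker returns the 'next' batch, can be sketched as follows. `MarkerSketch` and `PendingListSketch` are hypothetical stand-ins rather than the FrontierMarker and Frontier types, and the full-string regular-expression match is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for a frontier marker: a regex plus a resume position.
class MarkerSketch {
    final Pattern pattern;
    int nextIndex = 0;

    MarkerSketch(String regex) { this.pattern = Pattern.compile(regex); }
}

// Hypothetical stand-in for the frontier's list of pending URIs.
class PendingListSketch {
    private final List<String> pending = new ArrayList<>();

    void add(String uri) { pending.add(uri); }

    MarkerSketch getInitialMarker(String regex) { return new MarkerSketch(regex); }

    // Collects up to numberOfMatches matching URIs and advances the marker.
    List<String> getURIsList(MarkerSketch marker, int numberOfMatches) {
        List<String> batch = new ArrayList<>();
        while (marker.nextIndex < pending.size() && batch.size() < numberOfMatches) {
            String uri = pending.get(marker.nextIndex++);
            if (marker.pattern.matcher(uri).matches()) {
                batch.add(uri);
            }
        }
        return batch;
    }

    public static void main(String[] args) {
        PendingListSketch list = new PendingListSketch();
        list.add("http://a.example/1");
        list.add("http://b.example/1");
        list.add("http://a.example/2");
        MarkerSketch marker = list.getInitialMarker("http://a\\.example/.*");
        System.out.println(list.getURIsList(marker, 1)); // first batch
        System.out.println(list.getURIsList(marker, 1)); // next batch, same marker
    }
}
```

Because the marker stores only a position, adding or removing entries between calls would shift what that position means, which is one way a marker can become invalid.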
getURIsList in interface Frontier
marker - A marker specifying from what position in the Frontier the
list should begin.
numberOfMatches - how many URIs to add at most to the list before returning it
verbose - if set to true the strings returned will contain additional
information about each URI beyond their names.
InvalidFrontierMarkerException - when the
URIFrontierMarker does not match the internal
state of the frontier. Tolerance for this can vary
considerably from one URIFrontier implementation to the next.
FrontierMarker,
Frontier.getInitialMarker(String, boolean)
public long deleteURIs(java.lang.String match)
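Frontier
Delete any URI that matches the given regular expression from the list of discovered and pending URIs.
Any encountered URI that has not been successfully crawled, terminally failed, disregarded or is currently being processed is considered to be a pending URI.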
Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.
deleteURIs in interface Frontier
match - String to match.
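A minimal sketch of regular-expression deletion over a list of pending URIs follows. `DeleteUrisSketch` is a hypothetical helper working on plain strings, and whether the real method requires a full match or a partial match is not stated here, so the full match used below is an assumption.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; operates on a plain list rather than on frontier queues.
class DeleteUrisSketch {

    // Removes every pending URI that fully matches the regex; returns the count removed.
    static long deleteURIs(List<String> pending, String match) {
        Pattern pattern = Pattern.compile(match);
        long deleted = 0;
        for (Iterator<String> it = pending.iterator(); it.hasNext();) {
            if (pattern.matcher(it.next()).matches()) {
                it.remove(); // safe removal while iterating
                deleted++;
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        List<String> pending = new ArrayList<>();
        pending.add("http://example.org/a.jpg");
        pending.add("http://example.org/index.html");
        long removed = deleteURIs(pending, ".*\\.jpg");
        System.out.println(removed + " removed, " + pending.size() + " left");
    }
}
```

As the warning above notes for the real method, the pending list should not be modified by other threads while a deletion pass is running.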
public void deleted(CrawlURI curi)
Frontier
deleted in interface Frontier
curi - Deleted CrawlURI.
public void importRecoverLog(java.lang.String pathToLog,
boolean retainFailures)
throws java.io.IOException
Frontier
Recover earlier state by reading a recovery log.
Some Frontiers are able to write detailed logs that can be loaded after a system crash to recover the state of the Frontier prior to the crash. This method is the one used to achieve this.
importRecoverLog in interface Frontier
pathToLog - The name (with full path) of the recover log.
retainFailures - If true, failures in log should count as
having been included. (If false, failures will be ignored, meaning
the corresponding URIs will be retried in the recovered crawl.)
java.io.IOException - If problems occur reading the recover log.
public void considerIncluded(UURI u)
Frontier
considerIncluded in interface Frontier
u - UURI instance to add to the Already Included set.
public void kickUpdate()
Frontier
kickUpdate in interface Frontier
public void start()
Frontier
start in interface Frontier
public void pause()
Frontier
pause in interface Frontier
public void unpause()
Frontier
unpause in interface Frontier
public void terminate()
Frontier
terminate in interface Frontier
protected void finishedSuccess(CrawlURI c)
protected java.lang.String canonicalize(UURI uuri)
AbstractFrontier.
uuri - Candidate URI to canonicalize.
caUri.
If a problem, no canonicalization is done and the
CandidateURI#getURIString() is returned.
public FrontierJournal getFrontierJournal()
getFrontierJournal in interface Frontier
FrontierJournal that this Frontier is using. May be null if no journaling.
public void crawlEnding(java.lang.String sExitMessage)
CrawlStatusListener
crawlEnding in interface CrawlStatusListener
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
CrawlJob
public void crawlEnded(java.lang.String sExitMessage)
CrawlStatusListener
crawlEnded in interface CrawlStatusListener
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
CrawlJob
public void crawlStarted(java.lang.String message)
CrawlStatusListener
crawlStarted in interface CrawlStatusListener
message - Start message.
public void crawlCheckpoint(java.io.File checkpointDir)
CrawlStatusListener
Called by CrawlController when checkpointing.
crawlCheckpoint in interface CrawlStatusListener
checkpointDir - Checkpoint dir. Write checkpoint state here.
public void crawlPausing(java.lang.String statusMessage)
CrawlStatusListener
crawlPausing in interface CrawlStatusListener
statusMessage - Should be STATUS_WAITING_FOR_PAUSE.
Passed for convenience
public void crawlPaused(java.lang.String statusMessage)
CrawlStatusListener
crawlPaused in interface CrawlStatusListener
statusMessage - Should be CrawlJob.STATUS_PAUSED.
Passed for convenience
public void crawlResuming(java.lang.String statusMessage)
CrawlStatusListener
crawlResuming in interface CrawlStatusListener
statusMessage - Should be CrawlJob.STATUS_RUNNING.
Passed for convenience
public java.lang.String[] getReports()
Reporter
getReports in interface Reporter
public java.lang.String singleLineReport()
Reporter
singleLineReport in interface Reporter
public void reportTo(java.io.PrintWriter writer)
throws java.io.IOException
Reporter
reportTo in interface Reporter
writer - to receive report
java.io.IOException
public void singleLineReportTo(java.io.PrintWriter writer)
throws java.io.IOException
Reporter
singleLineReportTo in interface Reporter
writer - to receive report
java.io.IOException
public void reportTo(java.lang.String name,
java.io.PrintWriter writer)
throws java.io.IOException
reportTo in interface Reporter
writer - to receive report
java.io.IOException