org.archive.crawler.frontier
Class HostQueuesFrontier

java.lang.Object
  extended by javax.management.Attribute
      extended by org.archive.crawler.settings.Type
          extended by org.archive.crawler.settings.ComplexType
              extended by org.archive.crawler.settings.ModuleType
                  extended by org.archive.crawler.frontier.HostQueuesFrontier
All Implemented Interfaces:
java.io.Serializable, javax.management.DynamicMBean, CoreAttributeConstants, FetchStatusCodes, UriUniqFilter.HasUriReceiver, CrawlStatusListener, Frontier, Reporter

Deprecated. As of release 1.4, replaced by BdbFrontier.

public class HostQueuesFrontier
extends ModuleType
implements Frontier, FetchStatusCodes, CoreAttributeConstants, UriUniqFilter.HasUriReceiver, CrawlStatusListener

A basic, mostly breadth-first frontier that refrains from emitting more than one CrawlURI of the same 'key' (host) at once and respects minimum-delay and delay-factor specifications for politeness.

There are an arbitrary number of 'KeyedQueues' each representing a certain 'key' class of URIs -- effectively, a single host (by hostname).

KeyedQueues may have an item in-process -- in which case they do not provide any other items for processing. KeyedQueues may also be 'snoozed' -- when they should be kept inactive for a period of time, to either enforce politeness policies or allow a configurable amount of time between error retries.
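
The queue states described above can be pictured with a small sketch. It is an illustration only, not the Heritrix implementation: the class KeyedQueueSketch, its HostQueue type, and the use of plain strings for URIs are all hypothetical names introduced here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; not the Heritrix implementation. */
final class KeyedQueueSketch {
    /** One queue per key (host), with its in-process and snooze state. */
    static final class HostQueue {
        final Deque<String> pending = new ArrayDeque<>();
        boolean inProcess; // true while one URI from this host is out with a worker
        long wakeTime;     // the queue counts as snoozed while now < wakeTime
    }

    private final Map<String, HostQueue> queues = new LinkedHashMap<>();

    /** Appends a URI to the queue for its host, creating the queue if needed. */
    void schedule(String host, String uri) {
        queues.computeIfAbsent(host, k -> new HostQueue()).pending.addLast(uri);
    }

    /** Returns a URI from the first queue that is neither busy nor snoozed, or null. */
    String next(long now) {
        for (HostQueue q : queues.values()) {
            if (!q.inProcess && now >= q.wakeTime && !q.pending.isEmpty()) {
                q.inProcess = true;
                return q.pending.pollFirst();
            }
        }
        return null;
    }

    /** Clears the in-process flag and snoozes the host's queue until wakeTime. */
    void finished(String host, long wakeTime) {
        HostQueue q = queues.get(host);
        q.inProcess = false;
        q.wakeTime = wakeTime;
    }
}
```

While a host has an item in process, or its wake time has not yet arrived, next() skips that host and moves on to others, which is the behaviour the description attributes to KeyedQueues.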

Author:
Gordon Mohr
See Also:
Serialized Form

Nested Class Summary
 
Nested classes/interfaces inherited from class org.archive.crawler.settings.ComplexType
ComplexType.MBeanAttributeInfoIterator
 
Field Summary
protected static java.lang.String ACCEPTABLE_FORCE_QUEUE
          Deprecated.  
protected  UriUniqFilter alreadyIncluded
          Deprecated.  
static java.lang.String ATTR_DELAY_FACTOR
          Deprecated. how many multiples of last fetch elapsed time to wait before recontacting same server
static java.lang.String ATTR_FORCE_QUEUE
          Deprecated. queue assignment to force onto CrawlURIs; intended to be overridden
static java.lang.String ATTR_HOLD_QUEUES
          Deprecated. whether to hold queues INACTIVE until needed for throughput
static java.lang.String ATTR_HOST_QUEUES_MEMORY_CAPACITY
          Deprecated. maximum number of items to hold in memory atop each KeyedQueue; higher == more RAM used per active host; lower == more disk IO
static java.lang.String ATTR_HOST_VALENCE
          Deprecated. maximum simultaneous requests in process to a host (queue)
static java.lang.String ATTR_IP_POLITENESS
          Deprecated. whether to reassign URIs to IP-address based queues when IP known
static java.lang.String ATTR_MAX_DELAY
          Deprecated. never wait more than this long, regardless of multiple
static java.lang.String ATTR_MAX_HOST_BANDWIDTH_USAGE
          Deprecated. maximum per-host bandwidth usage
static java.lang.String ATTR_MAX_OVERALL_BANDWIDTH_USAGE
          Deprecated. maximum overall bandwidth usage
static java.lang.String ATTR_MAX_RETRIES
          Deprecated. maximum times to emit a CrawlURI without final disposition
static java.lang.String ATTR_MIN_DELAY
          Deprecated. always wait this long after one completion before recontacting same server, regardless of multiple
static java.lang.String ATTR_PREFERENCE_EMBED_HOPS
          Deprecated. number of hops of embeds (ERX) to bump to front of host queue
static java.lang.String ATTR_RETRY_DELAY
          Deprecated. for retryable problems, seconds to wait before a retry
protected  CrawlController controller
          Deprecated.  
protected static java.lang.Float DEFAULT_DELAY_FACTOR
          Deprecated.  
protected static java.lang.String DEFAULT_FORCE_QUEUE
          Deprecated.  
protected static java.lang.Boolean DEFAULT_HOLD_QUEUES
          Deprecated.  
protected static java.lang.Integer DEFAULT_HOST_QUEUES_MEMORY_CAPACITY
          Deprecated.  
protected static java.lang.Integer DEFAULT_HOST_VALENCE
          Deprecated.  
protected static java.lang.Boolean DEFAULT_IP_POLITENESS
          Deprecated.  
protected static java.lang.Integer DEFAULT_MAX_DELAY
          Deprecated.  
protected static java.lang.Integer DEFAULT_MAX_HOST_BANDWIDTH_USAGE
          Deprecated.  
protected static java.lang.Integer DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE
          Deprecated.  
protected static java.lang.Integer DEFAULT_MAX_RETRIES
          Deprecated.  
protected static java.lang.Integer DEFAULT_MIN_DELAY
          Deprecated.  
protected static java.lang.Integer DEFAULT_PREFERENCE_EMBED_HOPS
          Deprecated.  
protected static java.lang.Long DEFAULT_RETRY_DELAY
          Deprecated.  
(package private)  long disregardedUriCount
          Deprecated.  
(package private)  long failedFetchCount
          Deprecated.  
(package private)  java.util.LinkedList inactiveClassQueues
          Deprecated. All per-class queues that are INACTIVE; will be empty unless 'site-first'/'hold-queues' is set.
protected static float KILO_FACTOR
          Deprecated.  
(package private)  int lastMaxBandwidthKB
          Deprecated.  
(package private) static java.lang.String LOGNAME_RECOVER
          Deprecated.  
protected  long nextOrdinal
          Deprecated. ordinal numbers to assign to created CrawlURIs
(package private)  long nextURIEmitTime
          Deprecated.  
(package private)  long processedBytesAfterLastEmittedURI
          Deprecated.  
protected  QueueAssignmentPolicy queueAssignmentPolicy
          Deprecated. Policy for assigning CrawlURIs to named queues
(package private)  long queuedUriCount
          Deprecated.  
protected  java.util.LinkedList readyClassQueues
          Deprecated. All per-class queues whose first item may be handed out (that is, they are READY).
(package private)  java.util.SortedSet snoozeQueues
          Deprecated. All per-class queues that are on hold until a certain time.
(package private)  long succeededFetchCount
          Deprecated.  
(package private)  long totalProcessedBytes
          Deprecated.  
 
Fields inherited from class org.archive.crawler.settings.ComplexType
definitionMap
 
Fields inherited from interface org.archive.crawler.framework.Frontier
ATTR_NAME
 
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_TYPE, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_HTML_BASE, A_HTTP_TRANSACTION, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION
 
Constructor Summary
HostQueuesFrontier(java.lang.String name)
          Deprecated.  
HostQueuesFrontier(java.lang.String q, java.lang.String description)
          Deprecated.  
 
Method Summary
 void batchFlush()
          Deprecated. Flush pending URI queues.
protected  void batchSchedule(CandidateURI caUri)
          Deprecated.  
protected  java.lang.String canonicalize(UURI uuri)
          Deprecated. Canonicalize passed uuri.
 void considerIncluded(UURI u)
          Deprecated. Notify Frontier that it should consider the given UURI as if already scheduled.
 void crawlCheckpoint(java.io.File checkpointDir)
          Deprecated. Called by CrawlController when checkpointing.
 void crawlEnded(java.lang.String sExitMessage)
          Deprecated. Called when a CrawlController has ended a crawl and is about to exit.
 void crawlEnding(java.lang.String sExitMessage)
          Deprecated. Called when a CrawlController is ending a crawl (for any reason)
 void crawlPaused(java.lang.String statusMessage)
          Deprecated. Called when a CrawlController is actually paused (all threads are idle).
 void crawlPausing(java.lang.String statusMessage)
          Deprecated. Called when a CrawlController is going to be paused.
 void crawlResuming(java.lang.String statusMessage)
          Deprecated. Called when a CrawlController is resuming a crawl that had been paused.
 void crawlStarted(java.lang.String message)
          Deprecated. Called on crawl start.
protected  UriUniqFilter createAlreadyIncluded(java.io.File dir, java.lang.String filePrefix)
          Deprecated. Create a UURISet that will serve as record of already seen URIs.
 void deleted(CrawlURI curi)
          Deprecated. Notify Frontier that a CrawlURI has been deleted outside of the normal next()/finished() lifecycle.
 long deleteURIs(java.lang.String match)
          Deprecated. Delete any URI that matches the given regular expression from the list of discovered and pending URIs.
protected  CrawlURI dequeueFromReady()
          Deprecated.  
protected  void discardQueue(URIWorkQueue q)
          Deprecated.  
 long discoveredUriCount()
          Deprecated. Returns the number of discovered URIs.
protected  void disregardDisposition(CrawlURI curi)
          Deprecated.  
 long disregardedUriCount()
          Deprecated. Returns the number of URIs that have been disregarded.
protected  long earliestWakeTime()
          Deprecated.  
protected  CrawlURI emitCuri(CrawlURI curi)
          Deprecated. Prepares a CrawlURI for crawling.
protected  void enqueueToKeyed(CrawlURI curi)
          Deprecated. Place CrawlURI on the queue for its class (server).
 long failedFetchCount()
          Deprecated. Returns the number of URIs that failed to process.
protected  void failureDisposition(CrawlURI curi)
          Deprecated. The CrawlURI has encountered a problem, and will not be retried.
 void finished(CrawlURI curi)
          Deprecated. Note that the previously emitted CrawlURI has completed its processing (for now).
protected  void finishedSuccess(CrawlURI c)
          Deprecated.  
 long finishedUriCount()
          Deprecated. Returns the number of finished URIs.
protected  void forget(CrawlURI curi)
          Deprecated. Forget the given CrawlURI.
 java.lang.String getClassKey(CandidateURI cauri)
          Deprecated.  
 FrontierJournal getFrontierJournal()
          Deprecated.  
 FrontierMarker getInitialMarker(java.lang.String regexpr, boolean inCacheOnly)
          Deprecated. Get a URIFrontierMarker initialized with the given regular expression at the 'start' of the Frontier.
 java.lang.String[] getReports()
          Deprecated. Get an array of report names offered by this Reporter.
protected  CrawlServer getServer(CrawlURI curi)
          Deprecated.  
 java.util.ArrayList getURIsList(FrontierMarker marker, int numberOfMatches, boolean verbose)
          Deprecated. Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached.
 void importRecoverLog(java.lang.String pathToLog, boolean retainFailures)
          Deprecated. Recover earlier state by reading a recovery log.
 void initialize(CrawlController c)
          Deprecated. Initializes the Frontier, given the supplied CrawlController.
protected  void innerFinished(CrawlURI curi)
          Deprecated.  
protected  boolean isDisregarded(CrawlURI curi)
          Deprecated.  
 boolean isEmpty()
          Deprecated. Store is empty only if all queues are empty and no URIs are in-process
protected  URIWorkQueue keyedQueueFor(CrawlURI curi)
          Deprecated. Get the KeyedQueue for a CrawlURI.
 void kickUpdate()
          Deprecated. Notify Frontier that it should consider updating configuration info that may have changed in external files.
 void loadSeeds()
          Deprecated. Load up the seeds.
protected  boolean needsPromptRetry(CrawlURI curi)
          Deprecated. Checks if a recently completed CrawlURI that did not finish successfully needs to be retried immediately (processed again as soon as politeness allows.)
protected  boolean needsRetrying(CrawlURI curi)
          Deprecated. Checks if a recently completed CrawlURI that did not finish successfully needs to be retried (processed again after some time elapses)
 CrawlURI next()
          Deprecated. Return the next CrawlURI to be processed (and presumably visited/fetched) by a worker thread.
protected  void noteInProcess(CrawlURI curi)
          Deprecated. Marks a CrawlURI as being in process.
 void pause()
          Deprecated. Notify Frontier that it should not release any URIs, instead holding all threads, until instructed otherwise.
 long queuedUriCount()
          Deprecated. Returns the number of queued URIs.
 void receive(CandidateURI caUri)
          Deprecated. This method is called if the URI has not already been seen.
 void reportTo(java.io.PrintWriter writer)
          Deprecated. Make a default report to the passed-in Writer.
 void reportTo(java.lang.String name, java.io.PrintWriter writer)
          Deprecated. This method compiles a human readable report on the status of the frontier at the time of the call.
protected  void reschedule(CrawlURI curi)
          Deprecated. Put near top of relevant keyedqueue (but behind anything recently scheduled 'high')
 void schedule(CandidateURI caUri)
          Deprecated. Arrange for the given CandidateURI to be visited, if it is not already scheduled/completed.
protected  void scheduleForRetry(CrawlURI curi)
          Deprecated.  
protected  boolean shouldBeForgotten(CrawlURI curi)
          Deprecated. Some URIs, if they recur, deserve another chance at consideration: they might not be too many hops away via another path, or the scope may have been updated to allow them passage.
 java.lang.String singleLineReport()
          Deprecated. Return a short single-line summary report as a String.
 void singleLineReportTo(java.io.PrintWriter writer)
          Deprecated. Make a single-line summary report to the passed-in writer
protected  void snoozeQueueUntil(URIWorkQueue kq, long wake)
          Deprecated. Snoozes a queue until a fixed point in time has passed.
 void start()
          Deprecated. Request that Frontier allow crawling to begin.
 long succeededFetchCount()
          Deprecated. Returns the number of successfully processed URIs.
protected  void successDisposition(CrawlURI curi)
          Deprecated. The CrawlURI has been successfully crawled, and will be attempted no more.
 void terminate()
          Deprecated. Notify Frontier that it should end the crawl, giving any worker ToeThread that asks for a next() an EndedException.
 long totalBytesWritten()
          Deprecated. Total number of bytes contained in all URIs that have been processed.
 void unpause()
          Deprecated. Resumes the release of URIs to crawl, allowing worker ToeThreads to proceed.
protected  void updateScheduling(CrawlURI curi, URIWorkQueue kq)
          Deprecated. Update any scheduling structures with the new information in this CrawlURI.
protected  void wakeReadyQueues(long now)
          Deprecated. Wake any snoozed queues whose snooze time is up.
 
Methods inherited from class org.archive.crawler.settings.ModuleType
addElement, listUsedFiles
 
Methods inherited from class org.archive.crawler.settings.ComplexType
addElementToDefinition, checkValue, earlyInitialize, getAbsoluteName, getAttribute, getAttribute, getAttribute, getAttributeInfo, getAttributeInfo, getAttributeInfoIterator, getAttributes, getDataContainerRecursive, getDataContainerRecursive, getDefaultValue, getDescription, getElementFromDefinition, getLegalValues, getLocalAttribute, getMBeanInfo, getMBeanInfo, getParent, getPreservedFields, getSettingsHandler, getUncheckedAttribute, getValue, globalSettings, invoke, isInitialized, isOverridden, iterator, setAsOrder, setAttribute, setAttribute, setAttributes, setDescription, setPreservedFields, unsetAttribute
 
Methods inherited from class org.archive.crawler.settings.Type
addConstraint, equals, getConstraints, getLegalValueType, isExpertSetting, isOverrideable, isTransient, setExpertSetting, setLegalValueType, setOverrideable, setTransient
 
Methods inherited from class javax.management.Attribute
getName
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ATTR_DELAY_FACTOR

public static final java.lang.String ATTR_DELAY_FACTOR
Deprecated. 
how many multiples of last fetch elapsed time to wait before recontacting same server

See Also:
Constant Field Values

DEFAULT_DELAY_FACTOR

protected static final java.lang.Float DEFAULT_DELAY_FACTOR
Deprecated. 

ATTR_MIN_DELAY

public static final java.lang.String ATTR_MIN_DELAY
Deprecated. 
always wait this long after one completion before recontacting same server, regardless of multiple

See Also:
Constant Field Values

DEFAULT_MIN_DELAY

protected static final java.lang.Integer DEFAULT_MIN_DELAY
Deprecated. 

ATTR_MAX_DELAY

public static final java.lang.String ATTR_MAX_DELAY
Deprecated. 
never wait more than this long, regardless of multiple

See Also:
Constant Field Values
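
The three delay settings (delay-factor, min-delay, max-delay) read as a clamp on a multiple of the last fetch duration. The following is a minimal sketch of that reading, not the class's actual code: PolitenessDelaySketch and delayMillis are hypothetical names, and treating all values as milliseconds is an assumption, since the units are not stated here.

```java
/** Hypothetical illustration of combining delay-factor, min-delay and max-delay. */
final class PolitenessDelaySketch {
    /**
     * Returns how long to wait before recontacting the same server.
     * All durations are assumed to be in milliseconds.
     */
    static long delayMillis(long lastFetchMillis, float delayFactor,
                            long minDelayMillis, long maxDelayMillis) {
        // Start from a multiple of the time the last fetch took.
        long delay = (long) (lastFetchMillis * delayFactor);
        // Always wait at least the minimum, regardless of the multiple.
        if (delay < minDelayMillis) {
            delay = minDelayMillis;
        }
        // Never wait more than the maximum, regardless of the multiple.
        if (delay > maxDelayMillis) {
            delay = maxDelayMillis;
        }
        return delay;
    }
}
```

With illustrative values of factor 5, minimum 2000 and maximum 30000, a 100 ms fetch yields a 2000 ms wait, a 1000 ms fetch yields 5000 ms, and a 10000 ms fetch is capped at 30000 ms.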

DEFAULT_MAX_DELAY

protected static final java.lang.Integer DEFAULT_MAX_DELAY
Deprecated. 

ATTR_MAX_RETRIES

public static final java.lang.String ATTR_MAX_RETRIES
Deprecated. 
maximum times to emit a CrawlURI without final disposition

See Also:
Constant Field Values

DEFAULT_MAX_RETRIES

protected static final java.lang.Integer DEFAULT_MAX_RETRIES
Deprecated. 

ATTR_RETRY_DELAY

public static final java.lang.String ATTR_RETRY_DELAY
Deprecated. 
for retryable problems, seconds to wait before a retry

See Also:
Constant Field Values

DEFAULT_RETRY_DELAY

protected static final java.lang.Long DEFAULT_RETRY_DELAY
Deprecated. 

ATTR_HOLD_QUEUES

public static final java.lang.String ATTR_HOLD_QUEUES
Deprecated. 
whether to hold queues INACTIVE until needed for throughput

See Also:
Constant Field Values

DEFAULT_HOLD_QUEUES

protected static final java.lang.Boolean DEFAULT_HOLD_QUEUES
Deprecated. 

ATTR_HOST_VALENCE

public static final java.lang.String ATTR_HOST_VALENCE
Deprecated. 
maximum simultaneous requests in process to a host (queue)

See Also:
Constant Field Values

DEFAULT_HOST_VALENCE

protected static final java.lang.Integer DEFAULT_HOST_VALENCE
Deprecated. 

ATTR_PREFERENCE_EMBED_HOPS

public static final java.lang.String ATTR_PREFERENCE_EMBED_HOPS
Deprecated. 
number of hops of embeds (ERX) to bump to front of host queue

See Also:
Constant Field Values

DEFAULT_PREFERENCE_EMBED_HOPS

protected static final java.lang.Integer DEFAULT_PREFERENCE_EMBED_HOPS
Deprecated. 

ATTR_IP_POLITENESS

public static final java.lang.String ATTR_IP_POLITENESS
Deprecated. 
whether to reassign URIs to IP-address based queues when IP known

See Also:
Constant Field Values

DEFAULT_IP_POLITENESS

protected static final java.lang.Boolean DEFAULT_IP_POLITENESS
Deprecated. 

ATTR_FORCE_QUEUE

public static final java.lang.String ATTR_FORCE_QUEUE
Deprecated. 
queue assignment to force onto CrawlURIs; intended to be overridden

See Also:
Constant Field Values

DEFAULT_FORCE_QUEUE

protected static final java.lang.String DEFAULT_FORCE_QUEUE
Deprecated. 
See Also:
Constant Field Values

ACCEPTABLE_FORCE_QUEUE

protected static final java.lang.String ACCEPTABLE_FORCE_QUEUE
Deprecated. 
See Also:
Constant Field Values

ATTR_MAX_OVERALL_BANDWIDTH_USAGE

public static final java.lang.String ATTR_MAX_OVERALL_BANDWIDTH_USAGE
Deprecated. 
maximum overall bandwidth usage

See Also:
Constant Field Values

DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE

protected static final java.lang.Integer DEFAULT_MAX_OVERALL_BANDWIDTH_USAGE
Deprecated. 

ATTR_MAX_HOST_BANDWIDTH_USAGE

public static final java.lang.String ATTR_MAX_HOST_BANDWIDTH_USAGE
Deprecated. 
maximum per-host bandwidth usage

See Also:
Constant Field Values
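
One way to picture a bandwidth cap is as a computed earliest time for the next URI emission. This is only a sketch under stated assumptions: the limit is taken to be in KB per second with 1 KB = 1024 bytes, a limit of 0 is taken to mean "no limit", and BandwidthSketch and nextEmitTime are hypothetical names. None of these details are confirmed by this page.

```java
/** Hypothetical illustration of throttling URI emission to a bandwidth cap. */
final class BandwidthSketch {
    /**
     * Returns the earliest time (in milliseconds) at which the next URI may be
     * emitted so that throughput stays at or below maxKBPerSec.
     */
    static long nextEmitTime(long lastEmitMillis, long bytesSinceLastEmit, int maxKBPerSec) {
        if (maxKBPerSec <= 0) {
            return lastEmitMillis; // assumption: 0 (or less) means unlimited
        }
        long bytesPerSec = maxKBPerSec * 1024L; // assumption: 1 KB = 1024 bytes
        // Time the allowed rate needs to account for the bytes already processed.
        long millisNeeded = (bytesSinceLastEmit * 1000L) / bytesPerSec;
        return lastEmitMillis + millisNeeded;
    }
}
```

Under these assumptions, 2048 bytes processed against a 1 KB/s cap pushes the next emission 2000 ms past the previous one.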

DEFAULT_MAX_HOST_BANDWIDTH_USAGE

protected static final java.lang.Integer DEFAULT_MAX_HOST_BANDWIDTH_USAGE
Deprecated. 

ATTR_HOST_QUEUES_MEMORY_CAPACITY

public static final java.lang.String ATTR_HOST_QUEUES_MEMORY_CAPACITY
Deprecated. 
maximum number of items to hold in memory atop each KeyedQueue; higher == more RAM used per active host; lower == more disk IO

See Also:
Constant Field Values

DEFAULT_HOST_QUEUES_MEMORY_CAPACITY

protected static final java.lang.Integer DEFAULT_HOST_QUEUES_MEMORY_CAPACITY
Deprecated. 

KILO_FACTOR

protected static final float KILO_FACTOR
Deprecated. 
See Also:
Constant Field Values

controller

protected CrawlController controller
Deprecated. 

alreadyIncluded

protected UriUniqFilter alreadyIncluded
Deprecated. 

nextOrdinal

protected long nextOrdinal
Deprecated. 
ordinal numbers to assign to created CrawlURIs


queueAssignmentPolicy

protected QueueAssignmentPolicy queueAssignmentPolicy
Deprecated. 
Policy for assigning CrawlURIs to named queues


readyClassQueues

protected java.util.LinkedList readyClassQueues
Deprecated. 
All per-class queues whose first item may be handed out (that is, they are READY).


snoozeQueues

java.util.SortedSet snoozeQueues
Deprecated. 
All per-class queues that are on hold until a certain time. Holds KeyedQueue instances, sorted by wakeTime.


inactiveClassQueues

java.util.LinkedList inactiveClassQueues
Deprecated. 
All per-class queues that are INACTIVE; will be empty unless 'site-first'/'hold-queues' is set.


queuedUriCount

long queuedUriCount
Deprecated. 

succeededFetchCount

long succeededFetchCount
Deprecated. 

failedFetchCount

long failedFetchCount
Deprecated. 

disregardedUriCount

long disregardedUriCount
Deprecated. 

totalProcessedBytes

long totalProcessedBytes
Deprecated. 

nextURIEmitTime

long nextURIEmitTime
Deprecated. 

processedBytesAfterLastEmittedURI

long processedBytesAfterLastEmittedURI
Deprecated. 

lastMaxBandwidthKB

int lastMaxBandwidthKB
Deprecated. 

LOGNAME_RECOVER

static final java.lang.String LOGNAME_RECOVER
Deprecated. 
See Also:
Constant Field Values
Constructor Detail

HostQueuesFrontier

public HostQueuesFrontier(java.lang.String name)
Deprecated. 

HostQueuesFrontier

public HostQueuesFrontier(java.lang.String q,
                          java.lang.String description)
Deprecated. 
Method Detail

initialize

public void initialize(CrawlController c)
                throws FatalConfigurationException,
                       java.io.IOException
Deprecated. 
Initializes the Frontier, given the supplied CrawlController.

Specified by:
initialize in interface Frontier
Parameters:
c - The CrawlController that created the Frontier.
Throws:
FatalConfigurationException - If provided settings are illegal or otherwise unusable.
java.io.IOException - If there is a problem reading settings or seeds file from disk.
See Also:
Frontier.initialize(org.archive.crawler.framework.CrawlController)

createAlreadyIncluded

protected UriUniqFilter createAlreadyIncluded(java.io.File dir,
                                              java.lang.String filePrefix)
                                       throws java.io.IOException
Deprecated. 
Create a UURISet that will serve as record of already seen URIs.

Parameters:
dir - Directory where the set's files should be written
filePrefix - Prefix to names of the set's files
Returns:
A UURISet that will serve as a record of already seen URIs
Throws:
java.io.IOException - If problems occur creating files on disk

loadSeeds

public void loadSeeds()
Deprecated. 
Load up the seeds. This method is called on initialize and by the CrawlController when it wants to force reloading of configuration.

Specified by:
loadSeeds in interface Frontier
See Also:
CrawlController.kickUpdate()

batchSchedule

protected void batchSchedule(CandidateURI caUri)
Deprecated. 

batchFlush

public void batchFlush()
Deprecated. 
Flush pending URI queues. Used when scheduling URIs from the commandline.


schedule

public void schedule(CandidateURI caUri)
Deprecated. 
Arrange for the given CandidateURI to be visited, if it is not already scheduled/completed.

Specified by:
schedule in interface Frontier
Parameters:
caUri - The URI to schedule.
See Also:
Frontier.schedule(org.archive.crawler.datamodel.CandidateURI)

receive

public void receive(CandidateURI caUri)
Deprecated. 
This method is called if the URI has not already been seen. This method is the implementation of the HasUriReceiver interface.

Specified by:
receive in interface UriUniqFilter.HasUriReceiver
Parameters:
caUri - A URI object that has not been seen before.

next

public CrawlURI next()
              throws java.lang.InterruptedException,
                     EndedException
Deprecated. 
Return the next CrawlURI to be processed (and presumably visited/fetched) by a worker thread. First checks any "Ready" per-host queues, then the global pending queue.

Specified by:
next in interface Frontier
Returns:
The next CrawlURI to be processed, or null if none is available.
Throws:
java.lang.InterruptedException
EndedException
See Also:
Frontier.next()
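
The pairing of next() with finished() can be sketched as a simple worker loop. This is a simplified illustration rather than how ToeThreads are written: MiniFrontier, InMemoryFrontier and drain are hypothetical names, URIs are plain strings, and the loop stops at the first null instead of waiting for more work or handling EndedException.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical illustration of the next()/finished() lifecycle. */
final class WorkerLoopSketch {
    /** Stand-in for the two Frontier calls a worker relies on. */
    interface MiniFrontier {
        String next();             // null when nothing is currently available
        void finished(String uri); // reports that processing is complete (for now)
    }

    /** Trivial in-memory frontier used only to exercise the loop. */
    static final class InMemoryFrontier implements MiniFrontier {
        final Deque<String> queue = new ArrayDeque<>();
        final List<String> done = new ArrayList<>();
        public String next() { return queue.pollFirst(); }
        public void finished(String uri) { done.add(uri); }
    }

    /** Processes URIs until next() returns null; returns how many were handled. */
    static int drain(MiniFrontier frontier) {
        int processed = 0;
        String uri;
        while ((uri = frontier.next()) != null) {
            try {
                // A real worker would fetch and process the URI here.
            } finally {
                // Every emitted URI is reported back, even if processing throws.
                frontier.finished(uri);
            }
            processed++;
        }
        return processed;
    }
}
```

Reporting in a finally block reflects the idea that a host's queue stays blocked while an item is in process, so an unreported URI would hold its host back.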

getClassKey

public java.lang.String getClassKey(CandidateURI cauri)
Deprecated. 
Specified by:
getClassKey in interface Frontier
Parameters:
cauri - CandidateURI to calculate class key for.
Returns:
a String token representing a queue

getServer

protected CrawlServer getServer(CrawlURI curi)
Deprecated. 
Parameters:
curi -
Returns:
the CrawlServer to be associated with this CrawlURI

finished

public void finished(CrawlURI curi)
Deprecated. 
Note that the previously emitted CrawlURI has completed its processing (for now). The CrawlURI may be scheduled to retry, if appropriate, and other related URIs may become eligible for release via the next next() call, as a result of finished().

Specified by:
finished in interface Frontier
Parameters:
curi - The URI that has finished processing.
See Also:
Frontier.finished(org.archive.crawler.datamodel.CrawlURI)

innerFinished

protected void innerFinished(CrawlURI curi)
Deprecated. 

disregardDisposition

protected void disregardDisposition(CrawlURI curi)
Deprecated. 

isDisregarded

protected boolean isDisregarded(CrawlURI curi)
Deprecated. 

successDisposition

protected void successDisposition(CrawlURI curi)
Deprecated. 
The CrawlURI has been successfully crawled, and will be attempted no more.

Parameters:
curi - The CrawlURI

isEmpty

public boolean isEmpty()
Deprecated. 
Store is empty only if all queues are empty and no URIs are in-process

Specified by:
isEmpty in interface Frontier
Returns:
True if queues are empty.

wakeReadyQueues

protected void wakeReadyQueues(long now)
Deprecated. 
Wake any snoozed queues whose snooze time is up.

Parameters:
now - Current time in millisec.
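
Waking snoozed queues is straightforward when they are kept in a set ordered by wake time. The sketch below assumes that layout; SnoozeSketch and its members are hypothetical names, and queues are represented by their key strings rather than by KeyedQueue objects.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical illustration of snoozing and waking per-host queues. */
final class SnoozeSketch {
    static final class Snoozed {
        final String key;
        final long wakeTime;
        Snoozed(String key, long wakeTime) {
            this.key = key;
            this.wakeTime = wakeTime;
        }
    }

    // Ordered by wake time; the key breaks ties so that two queues with the
    // same wake time are both kept in the set.
    private final TreeSet<Snoozed> snoozed = new TreeSet<>(
            Comparator.comparingLong((Snoozed s) -> s.wakeTime).thenComparing(s -> s.key));

    final List<String> ready = new ArrayList<>();

    /** Puts a queue on hold until the given time (milliseconds). */
    void snoozeUntil(String key, long wake) {
        snoozed.add(new Snoozed(key, wake));
    }

    /** Moves every queue whose wake time has arrived to the ready list. */
    void wakeReady(long now) {
        while (!snoozed.isEmpty() && snoozed.first().wakeTime <= now) {
            ready.add(snoozed.pollFirst().key);
        }
    }

    /** Returns the soonest wake time, or Long.MAX_VALUE if nothing is snoozed. */
    long earliestWakeTime() {
        return snoozed.isEmpty() ? Long.MAX_VALUE : snoozed.first().wakeTime;
    }
}
```

Because the set is sorted, the wake pass only inspects the head and stops at the first queue that is still snoozed.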

discardQueue

protected void discardQueue(URIWorkQueue q)
Deprecated. 

dequeueFromReady

protected CrawlURI dequeueFromReady()
Deprecated. 

emitCuri

protected CrawlURI emitCuri(CrawlURI curi)
Deprecated. 
Prepares a CrawlURI for crawling. Also marks it as 'being processed'.

Parameters:
curi - The CrawlURI
Returns:
The CrawlURI
See Also:
noteInProcess(CrawlURI)

noteInProcess

protected void noteInProcess(CrawlURI curi)
Deprecated. 
Marks a CrawlURI as being in process.

Parameters:
curi - The CrawlURI to mark.

keyedQueueFor

protected URIWorkQueue keyedQueueFor(CrawlURI curi)
Deprecated. 
Get the KeyedQueue for a CrawlURI. If it does not exist it will be created.

Parameters:
curi - The CrawlURI
Returns:
The KeyedQueue for the CrawlURI, or null if it does not exist and an exception occurred trying to create it.

enqueueToKeyed

protected void enqueueToKeyed(CrawlURI curi)
Deprecated. 
Place CrawlURI on the queue for its class (server). If KeyedQueue does not exist it will be created. Failure to create the KeyedQueue (due to errors) will cause the method to return without error. The failure to create the KeyedQueue will have been logged.

Parameters:
curi - The CrawlURI

earliestWakeTime

protected long earliestWakeTime()
Deprecated. 

updateScheduling

protected void updateScheduling(CrawlURI curi,
                                URIWorkQueue kq)
                         throws javax.management.AttributeNotFoundException
Deprecated. 
Update any scheduling structures with the new information in this CrawlURI. Chiefly means make necessary arrangements for no other URIs at the same host to be visited within the appropriate politeness window.

Parameters:
curi - The CrawlURI
kq - A KeyedQueue
Throws:
javax.management.AttributeNotFoundException

failureDisposition

protected void failureDisposition(CrawlURI curi)
Deprecated. 
The CrawlURI has encountered a problem, and will not be retried.

Parameters:
curi - The CrawlURI

needsPromptRetry

protected boolean needsPromptRetry(CrawlURI curi)
                            throws javax.management.AttributeNotFoundException
Deprecated. 
Checks if a recently completed CrawlURI that did not finish successfully needs to be retried immediately (processed again as soon as politeness allows.)

Parameters:
curi - The CrawlURI to check
Returns:
True if we need to retry promptly.
Throws:
javax.management.AttributeNotFoundException - If problems occur trying to read the maximum number of retries from the settings framework.

needsRetrying

protected boolean needsRetrying(CrawlURI curi)
                         throws javax.management.AttributeNotFoundException
Deprecated. 
Checks if a recently completed CrawlURI that did not finish successfully needs to be retried (processed again after some time elapses)

Parameters:
curi - The CrawlURI to check
Returns:
True if we need to retry.
Throws:
javax.management.AttributeNotFoundException - If problems occur trying to read the maximum number of retries from the settings framework.
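
A retry check of this kind can be reduced to two questions: was the failure retryable, and has the retry budget been used up. The sketch below is a guess at that shape, not the actual logic: RetrySketch, its Outcome enum, and the attempt counter are hypothetical, and the real method works from CrawlURI state and fetch status codes. The seconds-to-milliseconds conversion follows from the retry delay being given in seconds and snooze wake times in milliseconds.

```java
/** Hypothetical illustration of a retry decision and its wake time. */
final class RetrySketch {
    /** Simplified stand-in for the outcome recorded on a completed URI. */
    enum Outcome { SUCCESS, TRANSIENT_FAILURE, PERMANENT_FAILURE }

    /** True if the failure is retryable and attempts remain under the maximum. */
    static boolean needsRetrying(Outcome outcome, int attempts, int maxRetries) {
        return outcome == Outcome.TRANSIENT_FAILURE && attempts < maxRetries;
    }

    /** Converts a retry delay in seconds into an absolute wake time in milliseconds. */
    static long retryWakeTime(long nowMillis, long retryDelaySeconds) {
        return nowMillis + retryDelaySeconds * 1000L;
    }
}
```

A URI that passes the check would then have its queue snoozed until the computed wake time before being offered again.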

scheduleForRetry

protected void scheduleForRetry(CrawlURI curi)
                         throws javax.management.AttributeNotFoundException
Deprecated. 
Throws:
javax.management.AttributeNotFoundException

reschedule

protected void reschedule(CrawlURI curi)
Deprecated. 
Put near top of relevant keyedqueue (but behind anything recently scheduled 'high')

Parameters:
curi - CrawlURI to reschedule.

snoozeQueueUntil

protected void snoozeQueueUntil(URIWorkQueue kq,
                                long wake)
Deprecated. 
Snoozes a queue until a fixed point in time has passed.

Parameters:
kq - A KeyedQueue that we want to snooze
wake - Time (in millisec.) when we want the queue to stop snoozing.

shouldBeForgotten

protected boolean shouldBeForgotten(CrawlURI curi)
Deprecated. 
Some URIs, if they recur, deserve another chance at consideration: they might not be too many hops away via another path, or the scope may have been updated to allow them passage.

Parameters:
curi -
Returns:
True if curi should be forgotten.

forget

protected void forget(CrawlURI curi)
Deprecated. 
Forget the given CrawlURI. This allows a new instance to be created in the future, if it is reencountered under different circumstances.

Parameters:
curi - The CrawlURI to forget

discoveredUriCount

public long discoveredUriCount()
Deprecated. 

Specified by:
discoveredUriCount in interface Frontier
Returns:
Number of discovered URIs.
See Also:
Frontier.discoveredUriCount()

queuedUriCount

public long queuedUriCount()
Deprecated. 

Specified by:
queuedUriCount in interface Frontier
Returns:
Number of queued URIs.
See Also:
Frontier.queuedUriCount()

finishedUriCount

public long finishedUriCount()
Deprecated. 

Specified by:
finishedUriCount in interface Frontier
Returns:
Number of finished URIs.
See Also:
Frontier.finishedUriCount()

succeededFetchCount

public long succeededFetchCount()
Deprecated. 

Specified by:
succeededFetchCount in interface Frontier
Returns:
Number of successfully processed URIs.
See Also:
Frontier.succeededFetchCount()

failedFetchCount

public long failedFetchCount()
Deprecated. 

Specified by:
failedFetchCount in interface Frontier
Returns:
Number of URIs that failed to process.
See Also:
Frontier.failedFetchCount()

disregardedUriCount

public long disregardedUriCount()
Deprecated. 

Specified by:
disregardedUriCount in interface Frontier
Returns:
The number of URIs that have been disregarded.
See Also:
Frontier.disregardedUriCount()

totalBytesWritten

public long totalBytesWritten()
Deprecated. 
Description copied from interface: Frontier
Total number of bytes contained in all URIs that have been processed.

Specified by:
totalBytesWritten in interface Frontier
Returns:
The total number of bytes in all processed URIs.

getInitialMarker

public FrontierMarker getInitialMarker(java.lang.String regexpr,
                                       boolean inCacheOnly)
Deprecated. 
Description copied from interface: Frontier
Get a FrontierMarker initialized with the given regular expression at the 'start' of the Frontier.

Specified by:
getInitialMarker in interface Frontier
Parameters:
regexpr - The regular expression that URIs within the frontier must match to be considered within the scope of this marker
inCacheOnly - If set to true, only those URIs within the frontier that are stored in cache (usually this means in memory rather than on disk, but that is an implementation detail) will be considered. Others will be entirely ignored, as if they don't exist. This is useful for quick peeks at the top of the URI list.
Returns:
A FrontierMarker that is set for the 'start' of the frontier's URI list.

getURIsList

public java.util.ArrayList getURIsList(FrontierMarker marker,
                                       int numberOfMatches,
                                       boolean verbose)
                                throws InvalidFrontierMarkerException
Deprecated. 
Description copied from interface: Frontier
Returns a list of all uncrawled URIs starting from a specified marker until numberOfMatches is reached.

Any encountered URI that has not been successfully crawled, terminally failed, disregarded or is currently being processed is included. As there may be duplicates in the frontier, there may also be duplicates in the report. Thus this includes both discovered and pending URIs.

The list contains the URI strings. If verbose is true, each string will include some additional information (path to URI and parent).

The FrontierMarker will be advanced to the position at which its maximum number of matches found is reached. Reusing it for subsequent calls will thus effectively get the 'next' batch. Making any changes to the frontier can invalidate the marker.

While the order returned is consistent, it does not have any explicit relation to the likely order in which they may be processed.

Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.

Specified by:
getURIsList in interface Frontier
Parameters:
marker - A marker specifying from what position in the Frontier the list should begin.
numberOfMatches - how many URIs to add at most to the list before returning it
verbose - if set to true the strings returned will contain additional information about each URI beyond their names.
Returns:
a list of all pending URIs falling within the specification of the marker
Throws:
InvalidFrontierMarkerException - when the FrontierMarker does not match the internal state of the frontier. Tolerance for this can vary considerably from one Frontier implementation to the next.
See Also:
FrontierMarker, Frontier.getInitialMarker(String, boolean)
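
The marker-based paging described above can be modelled with a short, self-contained sketch. Everything in it is an assumption made for illustration: PendingListSketch and its nested Marker are hypothetical classes, whole-string regex matching is an arbitrary choice, and IllegalStateException stands in for InvalidFrontierMarkerException. The version counter is one simple way to detect that the frontier changed after the marker was created.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative model of paging through pending URIs with a marker. */
class PendingListSketch {
    /** Hypothetical marker: a compiled pattern plus a resume position. */
    static class Marker {
        final Pattern pattern;
        final int expectedVersion;  // frontier version when the marker was created
        int nextIndex = 0;          // position to resume from on the next call

        Marker(String regexpr, int version) {
            this.pattern = Pattern.compile(regexpr);
            this.expectedVersion = version;
        }
    }

    private final List<String> pending = new ArrayList<>();
    private int version = 0;        // bumped on every change to the pending list

    void schedule(String uri) {
        pending.add(uri);
        version++;
    }

    Marker getInitialMarker(String regexpr) {
        return new Marker(regexpr, version);
    }

    /** Return up to numberOfMatches matching URIs and advance the marker. */
    List<String> getURIsList(Marker marker, int numberOfMatches) {
        if (marker.expectedVersion != version) {
            // Stand-in for InvalidFrontierMarkerException.
            throw new IllegalStateException("marker does not match frontier state");
        }
        List<String> matches = new ArrayList<>();
        while (marker.nextIndex < pending.size() && matches.size() < numberOfMatches) {
            String uri = pending.get(marker.nextIndex++);
            if (marker.pattern.matcher(uri).matches()) {
                matches.add(uri);
            }
        }
        return matches;
    }
}
```

Calling getURIsList repeatedly with the same marker yields successive batches until an empty list is returned; scheduling a new URI in between invalidates the marker in this model.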

deleteURIs

public long deleteURIs(java.lang.String match)
Deprecated. 
Description copied from interface: Frontier
Delete any URI that matches the given regular expression from the list of discovered and pending URIs. This does not prevent them from being rediscovered.

Any encountered URI that has not been successfully crawled, terminally failed, disregarded or is currently being processed is considered to be a pending URI.

Warning: It is unsafe to make changes to the frontier while this method is executing. The crawler should be in a paused state before invoking it.

Specified by:
deleteURIs in interface Frontier
Parameters:
match - String to match.
Returns:
Number of items deleted.
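
As a rough illustration of regex-based deletion, the sketch below removes matching entries from a pending list and returns how many were removed. DeleteSketch is a hypothetical class, and whole-string matching is an assumption; the actual matching semantics of this class are not specified here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative model of deleting pending URIs by regular expression. */
class DeleteSketch {
    private final List<String> pending = new ArrayList<>();

    void schedule(String uri) {
        pending.add(uri);
    }

    int pendingCount() {
        return pending.size();
    }

    /** Delete every pending URI that matches the expression; return the count. */
    long deleteURIs(String match) {
        Pattern pattern = Pattern.compile(match);
        long deleted = 0;
        for (Iterator<String> it = pending.iterator(); it.hasNext();) {
            if (pattern.matcher(it.next()).matches()) {
                it.remove();    // safe removal while iterating
                deleted++;
            }
        }
        return deleted;
    }
}
```

Because nothing else is recorded about the deleted entries in this model, scheduling the same URI again simply adds it back, which mirrors the note that deletion does not prevent rediscovery.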

deleted

public void deleted(CrawlURI curi)
Deprecated. 
Description copied from interface: Frontier
Notify Frontier that a CrawlURI has been deleted outside of the normal next()/finished() lifecycle.

Specified by:
deleted in interface Frontier
Parameters:
curi - Deleted CrawlURI.

importRecoverLog

public void importRecoverLog(java.lang.String pathToLog,
                             boolean retainFailures)
                      throws java.io.IOException
Deprecated. 
Description copied from interface: Frontier
Recover earlier state by reading a recovery log.

Some Frontiers are able to write detailed logs that can be loaded after a system crash to recover the state of the Frontier prior to the crash. This method is the one used to achieve this.

Specified by:
importRecoverLog in interface Frontier
Parameters:
pathToLog - The name (with full path) of the recover log.
retainFailures - If true, failures in log should count as having been included. (If false, failures will be ignored, meaning the corresponding URIs will be retried in the recovered crawl.)
Throws:
java.io.IOException - If problems occur reading the recover log.
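
The effect of the retainFailures flag can be shown with a small sketch. It deliberately uses a made-up line format ("OK <uri>" and "FAIL <uri>") rather than the real recover log format, and takes the lines directly instead of a file path; RecoverSketch and both tags are assumptions for illustration only.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative model of replaying a log into an already-included set. */
class RecoverSketch {
    final Set<String> alreadyIncluded = new LinkedHashSet<>();

    /**
     * Replay log lines. Successes are always counted as included; failures
     * are counted only when retainFailures is true, otherwise they are
     * ignored so the URIs can be retried in the recovered crawl.
     */
    void importLines(List<String> lines, boolean retainFailures) {
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length < 2) {
                continue;   // skip blank or malformed lines
            }
            boolean success = parts[0].equals("OK");      // hypothetical tag
            boolean failure = parts[0].equals("FAIL");    // hypothetical tag
            if (success || (failure && retainFailures)) {
                alreadyIncluded.add(parts[1]);
            }
        }
    }
}
```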

considerIncluded

public void considerIncluded(UURI u)
Deprecated. 
Description copied from interface: Frontier
Notify Frontier that it should consider the given UURI as if already scheduled.

Specified by:
considerIncluded in interface Frontier
Parameters:
u - UURI instance to add to the Already Included set.

kickUpdate

public void kickUpdate()
Deprecated. 
Description copied from interface: Frontier
Notify Frontier that it should consider updating configuration info that may have changed in external files.

Specified by:
kickUpdate in interface Frontier

start

public void start()
Deprecated. 
Description copied from interface: Frontier
Request that Frontier allow crawling to begin. Usually just unpauses Frontier, if paused.

Specified by:
start in interface Frontier

pause

public void pause()
Deprecated. 
Description copied from interface: Frontier
Notify Frontier that it should not release any URIs, instead holding all threads, until instructed otherwise.

Specified by:
pause in interface Frontier

unpause

public void unpause()
Deprecated. 
Description copied from interface: Frontier
Resumes the release of URIs to crawl, allowing worker ToeThreads to proceed.

Specified by:
unpause in interface Frontier

terminate

public void terminate()
Deprecated. 
Description copied from interface: Frontier
Notify Frontier that it should end the crawl, giving any worker ToeThread that asks for a next() an EndedException.

Specified by:
terminate in interface Frontier
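
The pause, unpause and terminate contract described above can be sketched with a monitor-based gate. This is a minimal model, not this class's implementation: FrontierGateSketch is a hypothetical class, URIs are plain strings, and EndedSketchException stands in for the EndedException mentioned above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for the exception handed to workers once the crawl ends. */
class EndedSketchException extends Exception {
    EndedSketchException(String message) {
        super(message);
    }
}

/** Illustrative gate that holds worker threads while paused. */
class FrontierGateSketch {
    private final Deque<String> queue = new ArrayDeque<>();
    private boolean paused = false;
    private boolean terminated = false;

    synchronized void schedule(String uri) {
        queue.addLast(uri);
        notifyAll();            // a waiting worker may now proceed
    }

    synchronized void pause() {
        paused = true;          // workers calling next() will block
    }

    synchronized void unpause() {
        paused = false;
        notifyAll();            // release held workers
    }

    synchronized void terminate() {
        terminated = true;
        notifyAll();            // wake workers so they observe the end
    }

    /** Block until a URI may be released, or throw once the crawl has ended. */
    synchronized String next() throws InterruptedException, EndedSketchException {
        while (true) {
            if (terminated) {
                throw new EndedSketchException("crawl ended");
            }
            if (!paused && !queue.isEmpty()) {
                return queue.removeFirst();
            }
            wait();             // paused or empty: hold this thread
        }
    }
}
```

The loop rechecks all conditions after every wake-up, so spurious wake-ups and a terminate() issued during a pause are both handled.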

finishedSuccess

protected void finishedSuccess(CrawlURI c)
Deprecated. 

canonicalize

protected java.lang.String canonicalize(UURI uuri)
Deprecated. 
Canonicalize passed uuri. It would be sweeter if this canonicalize function was encapsulated by that which it canonicalizes, but because settings change with context -- i.e. there may be overrides in operation for a particular URI -- it's not so easy: each CandidateURI would need a reference to the settings system, which is awkward to pass in. Copied from AbstractFrontier.

Parameters:
uuri - Candidate URI to canonicalize.
Returns:
Canonicalized version of passed uuri. If a problem occurs, no canonicalization is done and the result of CandidateURI#getURIString() is returned.
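
The fall-back behaviour in the Returns clause (return the original string when canonicalization hits a problem) can be sketched with java.net.URI. The specific rules applied here (lowercasing scheme and host, removing dot-segments, dropping the fragment) are assumptions chosen for illustration and are not the crawler's configured canonicalization rules; CanonicalizeSketch is a hypothetical class.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Illustrative canonicalizer that falls back to its input on any problem. */
class CanonicalizeSketch {
    static String canonicalize(String uriString) {
        try {
            URI u = new URI(uriString).normalize();     // removes "." and ".." segments
            if (u.getScheme() == null || u.getHost() == null) {
                return uriString;                       // not a hierarchical server URI
            }
            StringBuilder sb = new StringBuilder();
            sb.append(u.getScheme().toLowerCase(Locale.ROOT)).append("://");
            if (u.getRawUserInfo() != null) {
                sb.append(u.getRawUserInfo()).append('@');
            }
            sb.append(u.getHost().toLowerCase(Locale.ROOT));
            if (u.getPort() != -1) {
                sb.append(':').append(u.getPort());
            }
            if (u.getRawPath() != null) {
                sb.append(u.getRawPath());
            }
            if (u.getRawQuery() != null) {
                sb.append('?').append(u.getRawQuery());
            }
            return sb.toString();                       // fragment intentionally dropped
        } catch (URISyntaxException e) {
            return uriString;                           // problem: no canonicalization done
        }
    }
}
```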

getFrontierJournal

public FrontierJournal getFrontierJournal()
Deprecated. 
Specified by:
getFrontierJournal in interface Frontier
Returns:
Return the instance of FrontierJournal that this Frontier is using. May be null if no journaling.

crawlEnding

public void crawlEnding(java.lang.String sExitMessage)
Deprecated. 
Description copied from interface: CrawlStatusListener
Called when a CrawlController is ending a crawl (for any reason)

Specified by:
crawlEnding in interface CrawlStatusListener
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:
CrawlJob

crawlEnded

public void crawlEnded(java.lang.String sExitMessage)
Deprecated. 
Description copied from interface: CrawlStatusListener
Called when a CrawlController has ended a crawl and is about to exit.

Specified by:
crawlEnded in interface CrawlStatusListener
Parameters:
sExitMessage - Type of exit. Should be one of the STATUS constants defined in CrawlJob.
See Also:
CrawlJob

crawlStarted

public void crawlStarted(java.lang.String message)
Deprecated. 
Description copied from interface: CrawlStatusListener
Called on crawl start.

Specified by:
crawlStarted in interface CrawlStatusListener
Parameters:
message - Start message.

crawlCheckpoint

public void crawlCheckpoint(java.io.File checkpointDir)
Deprecated. 
Description copied from interface: CrawlStatusListener
Called by CrawlController when checkpointing.

Specified by:
crawlCheckpoint in interface CrawlStatusListener
Parameters:
checkpointDir - Checkpoint dir. Write checkpoint state here.

crawlPausing

public void crawlPausing(java.lang.String statusMessage)
Deprecated. 
Description copied from interface: CrawlStatusListener
Called when a CrawlController is going to be paused.

Specified by:
crawlPausing in interface CrawlStatusListener
Parameters:
statusMessage - Should be STATUS_WAITING_FOR_PAUSE. Passed for convenience

crawlPaused

public void crawlPaused(java.lang.String statusMessage)
Deprecated. 
Description copied from interface: CrawlStatusListener
Called when a CrawlController is actually paused (all threads are idle).

Specified by:
crawlPaused in interface CrawlStatusListener
Parameters:
statusMessage - Should be CrawlJob.STATUS_PAUSED. Passed for convenience

crawlResuming

public void crawlResuming(java.lang.String statusMessage)
Deprecated. 
Description copied from interface: CrawlStatusListener
Called when a CrawlController is resuming a crawl that had been paused.

Specified by:
crawlResuming in interface CrawlStatusListener
Parameters:
statusMessage - Should be CrawlJob.STATUS_RUNNING. Passed for convenience

getReports

public java.lang.String[] getReports()
Deprecated. 
Description copied from interface: Reporter
Get an array of report names offered by this Reporter. A name in brackets indicates that a free-form String, in accordance with the informal description inside the brackets, may yield a useful report.

Specified by:
getReports in interface Reporter
Returns:
String array of report names, empty if there is only one report type

singleLineReport

public java.lang.String singleLineReport()
Deprecated. 
Description copied from interface: Reporter
Return a short single-line summary report as a String.

Specified by:
singleLineReport in interface Reporter
Returns:
String single-line summary report

reportTo

public void reportTo(java.io.PrintWriter writer)
              throws java.io.IOException
Deprecated. 
Description copied from interface: Reporter
Make a default report to the passed-in Writer. Should be equivalent to reportTo(null, writer)

Specified by:
reportTo in interface Reporter
Parameters:
writer - to receive report
Throws:
java.io.IOException

singleLineReportTo

public void singleLineReportTo(java.io.PrintWriter writer)
                        throws java.io.IOException
Deprecated. 
Description copied from interface: Reporter
Make a single-line summary report to the passed-in writer

Specified by:
singleLineReportTo in interface Reporter
Parameters:
writer - to receive report
Throws:
java.io.IOException

reportTo

public void reportTo(java.lang.String name,
                     java.io.PrintWriter writer)
              throws java.io.IOException
Deprecated. 
This method compiles a human readable report on the status of the frontier at the time of the call.

Specified by:
reportTo in interface Reporter
Parameters:
name - Name of the report to write; null selects the default report.
writer - to receive report
Throws:
java.io.IOException
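
The relationship between the Reporter methods above (a default report that is equivalent to reportTo(null, writer), plus named reports) can be sketched as follows. ReporterSketch, the report names "summary" and "detail", and the report contents are all hypothetical; only the delegation pattern is taken from the descriptions above.

```java
import java.io.PrintWriter;

/** Illustrative reporter with a default report and two named reports. */
class ReporterSketch {
    private final long queued;

    ReporterSketch(long queued) {
        this.queued = queued;
    }

    /** Hypothetical report names offered by this sketch. */
    String[] getReports() {
        return new String[] { "summary", "detail" };
    }

    String singleLineReport() {
        return "queued=" + queued;
    }

    /** Default report: delegates with a null name. */
    void reportTo(PrintWriter writer) {
        reportTo(null, writer);
    }

    /** Named report; a null name selects the default ("summary") report. */
    void reportTo(String name, PrintWriter writer) {
        if (name == null || name.equals("summary")) {
            writer.println(singleLineReport());
        } else if (name.equals("detail")) {
            writer.println("Frontier report");
            writer.println("  queued URIs: " + queued);
        } else {
            writer.println("unknown report: " + name);
        }
        writer.flush();
    }
}
```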


Copyright © 2003-2005 Internet Archive. All Rights Reserved.