org.archive.crawler.frontier
Class KeyedQueue

java.lang.Object
  extended by org.archive.crawler.frontier.KeyedQueue
All Implemented Interfaces:
java.io.Serializable, URIWorkQueue

public class KeyedQueue
extends java.lang.Object
implements java.io.Serializable, URIWorkQueue

Ordered collection of work items with the same "classKey". The collection itself has a state, which may reflect where it is stored or what can be done with the contained items.

For easy access to several locations in the main collection, it is split between two data structures: a top stack and a bottom queue. (These in turn may be disk-backed.)

Also maintains a collection 'off to the side' of 'frozen' items.
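
The layout described above can be pictured with a small in-memory sketch. It is an illustration under assumptions, not the Heritrix implementation: the class name TwoEndedWorkSketch, the methods pushToTop() and freezeItem(), and the rule that the top stack is drained before the bottom queue are all hypothetical, and plain strings stand in for CrawlURI items.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical illustration only; not the Heritrix implementation. */
class TwoEndedWorkSketch {
    // Items to hand out first (last pushed, first out).
    private final Deque<String> topStack = new ArrayDeque<>();
    // Ordinary backlog, handed out in arrival order.
    private final Deque<String> bottomQueue = new ArrayDeque<>();
    // Items held 'off to the side'; dequeue() never returns them.
    private final List<String> frozen = new ArrayList<>();

    void enqueue(String item) { bottomQueue.addLast(item); }

    void pushToTop(String item) { topStack.push(item); }

    void freezeItem(String item) { frozen.add(item); }

    /** Returns the next available item, or null if none is available. */
    String dequeue() {
        if (!topStack.isEmpty()) {
            return topStack.pop();
        }
        return bottomQueue.pollFirst();
    }

    /** Counts available items only; frozen items are excluded. */
    long length() { return topStack.size() + bottomQueue.size(); }

    int frozenCount() { return frozen.size(); }
}
```

In this sketch a frozen item affects neither length() nor dequeue(), which matches the documented behaviour that length() and isEmpty() ignore frozen items.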

About KeyedQueue states:

All KeyedQueues begin INACTIVE. A call to activate() will render them READY (if not empty of eligible URIs) or EMPTY otherwise.

A noteInProcess() puts the KeyedQueue into the in-process state (BUSY). A matching noteProcessDone() puts the KeyedQueue back into READY or EMPTY.

A freeze() may be issued to any READY or EMPTY queue to put it into FROZEN state. Only an unfreeze() will move a FROZEN queue out of that state, returning it to INACTIVE.

A deactivate() may be issued to any READY or EMPTY queue to put it into INACTIVE state.

A snooze() may be issued to any READY or EMPTY queue to put it into SNOOZED state.

A discard() may be issued to any EMPTY queue to put it into the DISCARDED state. A queue never leaves the discarded state; if a queue of its hostname is needed again, a new one is created.
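
The transitions listed above can be summarised as a small state machine. This is a sketch under assumptions rather than the actual source: QueueStateSketch, setAvailable() and the IllegalStateException thrown on a disallowed call are hypothetical, and the in-process state is modelled as BUSY, the constant name inherited from URIWorkQueue.

```java
/** Hypothetical model of the documented transitions; not the Heritrix source. */
class QueueStateSketch {
    enum State { INACTIVE, READY, EMPTY, BUSY, FROZEN, SNOOZED, DISCARDED }

    private State state = State.INACTIVE; // all queues begin INACTIVE
    private int available;                // eligible items (assumed bookkeeping)

    State state() { return state; }

    void setAvailable(int n) { available = n; }

    private State readyOrEmpty() {
        return available > 0 ? State.READY : State.EMPTY;
    }

    // Reject the call unless the current state is one of the allowed ones.
    private void require(State... allowed) {
        for (State s : allowed) {
            if (s == state) {
                return;
            }
        }
        throw new IllegalStateException("not allowed from " + state);
    }

    void activate()        { require(State.INACTIVE); state = readyOrEmpty(); }
    void deactivate()      { require(State.READY, State.EMPTY); state = State.INACTIVE; }
    void noteInProcess()   { require(State.READY, State.EMPTY); state = State.BUSY; }
    void noteProcessDone() { require(State.BUSY); state = readyOrEmpty(); }
    void freeze()          { require(State.READY, State.EMPTY); state = State.FROZEN; }
    void unfreeze()        { require(State.FROZEN); state = State.INACTIVE; }
    void snooze()          { require(State.READY, State.EMPTY); state = State.SNOOZED; }
    void wake()            { require(State.SNOOZED); state = readyOrEmpty(); }
    void discard()         { require(State.EMPTY); state = State.DISCARDED; }
}
```

Because no method accepts DISCARDED as a starting state, a discarded queue in this model can never be revived, which mirrors the rule that a new queue is created instead.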

Version:
$Date: 2005/04/05 20:27:17 $ $Revision: 1.27 $
Author:
gojomo
See Also:
Serialized Form

Field Summary
(package private)  java.lang.String classKey
          common string 'key' of included items (typically hostname)
(package private)  CrawlServer crawlServer
          Associated CrawlServer instance, held to keep CrawlServer from being cache-flushed
(package private)  TieredQueue innerQ
           
(package private)  java.util.ArrayList inProcessItems
          items in progress
(package private)  int inProcessLoad
           
(package private)  java.lang.Object state
          current state; see above values
(package private)  int valence
          maximum simultaneous plain URIs to allow in-process at a time
(package private)  long wakeTime
          ms time to wake, if snoozed
 
Fields inherited from interface org.archive.crawler.frontier.URIWorkQueue
BUSY, DISCARDED, EMPTY, FROZEN, INACTIVE, READY, SNOOZED
 
Constructor Summary
KeyedQueue(java.lang.String key, CrawlServer server, java.io.File scratchDir, int maxMemLoad)
           
 
Method Summary
 void activate()
          Move queue from INACTIVE to READY or EMPTY state
 boolean checkEmpty()
          Update READY/EMPTY state after preceding queue edit operations.
 void deactivate()
          Move queue from READY or EMPTY state to INACTIVE
 long deleteMatchedItems(org.apache.commons.collections.Predicate matcher)
          Delete items matching the supplied criterion.
 CrawlURI dequeue()
          Remove an item in the default manner
 void discard()
          Move queue from READY or EMPTY to DISCARDED
 void enqueue(CrawlURI curi)
          Add an item in the default manner
 boolean equals(java.lang.Object o)
          The only equals() that matters for KeyedQueues is object equivalence.
 void freeze()
          Move queue from READY or EMPTY state to FROZEN
 java.lang.String getClassKey()
          The 'classKey' identifier common to items in this queue
 java.util.List getInProcessItems()
           
 java.util.Iterator getIterator(boolean inCacheOnly)
          Iterate over all available (non-frozen) items.
 java.lang.String getLastDequeued()
           
 java.lang.String getLastQueued()
           
 java.lang.String getSortFallback()
          A fallback sort criterion, used to ensure a total and consistent ordering when queues are kept in scheduled order
 java.lang.Object getState()
           
 long getWakeTime()
           
 boolean isDiscardable()
          May this KeyedQueue be completely discarded.
 boolean isEmpty()
           
 long length()
           
 void noteInProcess(CrawlURI o)
          Note that the given item is 'in process'; move queue from READY or EMPTY to BUSY and remember in-process item.
 void noteProcessDone(CrawlURI o)
          Note that the given item's processing has completed; forget the in-process item and move queue from BUSY or READY to READY or EMPTY state if necessary
 CrawlURI peek()
           
 void setMaximumMemoryLoad(int i)
           
 void setValence(int v)
          Set 'valence', the number of simultaneous items to allow in process before becoming BUSY
 void setWakeTime(long w)
          Set the time to wake; take care not to change this value while the queue is inside a sorted queue.
 void snooze()
          Move queue from READY or EMPTY state to SNOOZED
 void unfreeze()
          Move queue from FROZEN state to INACTIVE
 void unpeek()
           
 void wake()
          Move queue from SNOOZED state to READY or EMPTY
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

crawlServer

CrawlServer crawlServer
Associated CrawlServer instance, held to keep CrawlServer from being cache-flushed


wakeTime

long wakeTime
ms time to wake, if snoozed


classKey

java.lang.String classKey
common string 'key' of included items (typically hostname)


state

java.lang.Object state
current state; see above values


valence

int valence
maximum simultaneous plain URIs to allow in-process at a time


inProcessItems

java.util.ArrayList inProcessItems
items in progress


inProcessLoad

int inProcessLoad


innerQ

TieredQueue innerQ

Constructor Detail

KeyedQueue

public KeyedQueue(java.lang.String key,
                  CrawlServer server,
                  java.io.File scratchDir,
                  int maxMemLoad)
           throws java.io.IOException
Parameters:
key - A unique identifier used to distinguish files related to this object's disk-based data structures (it will be part of their file names and must therefore be a legal filename).
server - Server instance this queue is for.
scratchDir - Directory where disk based data structures will be created.
maxMemLoad - Maximum number of items to keep in memory
Throws:
java.io.IOException - When it fails to create disk based data structures.
Method Detail

getClassKey

public java.lang.String getClassKey()
The 'classKey' identifier common to items in this queue

Specified by:
getClassKey in interface URIWorkQueue
Returns:
The classKey string.

getState

public java.lang.Object getState()
Specified by:
getState in interface URIWorkQueue
Returns:
The state of this queue.

activate

public void activate()
Move queue from INACTIVE to READY or EMPTY state

Specified by:
activate in interface URIWorkQueue

deactivate

public void deactivate()
Move queue from READY or EMPTY state to INACTIVE

Specified by:
deactivate in interface URIWorkQueue

freeze

public void freeze()
Move queue from READY or EMPTY state to FROZEN

Specified by:
freeze in interface URIWorkQueue

unfreeze

public void unfreeze()
Move queue from FROZEN state to INACTIVE

Specified by:
unfreeze in interface URIWorkQueue

snooze

public void snooze()
Move queue from READY or EMPTY state to SNOOZED

Specified by:
snooze in interface URIWorkQueue

wake

public void wake()
Move queue from SNOOZED state to READY or EMPTY

Specified by:
wake in interface URIWorkQueue

discard

public void discard()
Move queue from READY or EMPTY to DISCARDED

Specified by:
discard in interface URIWorkQueue

noteInProcess

public void noteInProcess(CrawlURI o)
Note that the given item is 'in process'; move queue from READY or EMPTY to BUSY and remember in-process item.

Specified by:
noteInProcess in interface URIWorkQueue
Parameters:
o - the item that is now in process

noteProcessDone

public void noteProcessDone(CrawlURI o)
Note that the given item's processing has completed; forget the in-process item and move queue from BUSY or READY to READY or EMPTY state if necessary

Specified by:
noteProcessDone in interface URIWorkQueue
Parameters:
o - the item whose processing has completed

checkEmpty

public boolean checkEmpty()
Update READY/EMPTY state after preceding queue edit operations.

Specified by:
checkEmpty in interface URIWorkQueue
Returns:
true if state changed, false otherwise

getWakeTime

public long getWakeTime()
Specified by:
getWakeTime in interface URIWorkQueue
Returns:
Time to wake, when snoozed

setWakeTime

public void setWakeTime(long w)
Set the time to wake. Take care not to change this value while the queue is inside a sorted queue.

Specified by:
setWakeTime in interface URIWorkQueue
Parameters:
w - time to wake, when snoozed

getSortFallback

public java.lang.String getSortFallback()
A fallback sort criterion, used to ensure a total and consistent ordering when queues are kept in scheduled order.

Specified by:
getSortFallback in interface URIWorkQueue
Returns:
Fallback sort.
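
The warning on setWakeTime() and the purpose of getSortFallback() can be illustrated together. The sketch below assumes snoozed queues are kept in a java.util.TreeSet ordered by wake time; the names SnoozeOrderSketch, Entry, ORDER and reschedule() are hypothetical and do not come from the Heritrix source.

```java
import java.util.Comparator;
import java.util.TreeSet;

/** Hypothetical illustration; not the Heritrix implementation. */
class SnoozeOrderSketch {
    /** Stand-in for a snoozed queue. */
    static final class Entry {
        long wakeTime;
        final String sortFallback;

        Entry(long wakeTime, String sortFallback) {
            this.wakeTime = wakeTime;
            this.sortFallback = sortFallback;
        }
    }

    // Order by wake time; ties are broken by the fallback string so that
    // two distinct entries with equal wake times never compare as equal.
    static final Comparator<Entry> ORDER = (a, b) -> {
        int c = Long.compare(a.wakeTime, b.wakeTime);
        return c != 0 ? c : a.sortFallback.compareTo(b.sortFallback);
    };

    /** Safe update: remove the entry, change its key, then re-insert it. */
    static void reschedule(TreeSet<Entry> snoozed, Entry e, long newWakeTime) {
        snoozed.remove(e);        // found using the old wake time
        e.wakeTime = newWakeTime; // e is outside the sorted set at this point
        snoozed.add(e);           // re-inserted at its new position
    }
}
```

Changing wakeTime while the entry is still inside the set would leave it filed under its old position, so later lookups and removals could miss it. Without the fallback comparison, a TreeSet would treat two entries with the same wake time as duplicates and keep only one.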

equals

public boolean equals(java.lang.Object o)
The only equals() that matters for KeyedQueues is object equivalence.

Overrides:
equals in class java.lang.Object
See Also:
Object.equals(java.lang.Object)

enqueue

public void enqueue(CrawlURI curi)
Add an item in the default manner

Specified by:
enqueue in interface URIWorkQueue
Parameters:
curi - the item to add
See Also:
Queue.enqueue(java.lang.Object)

isEmpty

public boolean isEmpty()
Specified by:
isEmpty in interface URIWorkQueue
Returns:
Is this KeyedQueue empty of ready-to-try URIs. (NOTE: may still have 'frozen' off-to-side URIs.)
See Also:
Queue.isEmpty()

dequeue

public CrawlURI dequeue()
Remove an item in the default manner

Specified by:
dequeue in interface URIWorkQueue
Returns:
The removed CrawlURI.
See Also:
Queue.dequeue()

length

public long length()
Specified by:
length in interface URIWorkQueue
Returns:
Total number of available items. (Does not include any 'frozen' items.)
See Also:
Queue.length()

getIterator

public java.util.Iterator getIterator(boolean inCacheOnly)
Iterate over all available (non-frozen) items.

Specified by:
getIterator in interface URIWorkQueue
Parameters:
inCacheOnly -
Returns:
Iterator.
See Also:
Queue.getIterator(boolean)

deleteMatchedItems

public long deleteMatchedItems(org.apache.commons.collections.Predicate matcher)
Delete items matching the supplied criterion.

Specified by:
deleteMatchedItems in interface URIWorkQueue
Parameters:
matcher - predicate selecting the items to delete
Returns:
Number of deletes.
See Also:
Queue.deleteMatchedItems(org.apache.commons.collections.Predicate)
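
A minimal sketch of predicate-driven deletion follows. To stay self-contained it uses java.util.function.Predicate and strings in place of org.apache.commons.collections.Predicate (whose single method is evaluate(Object)) and CrawlURI; the class DeleteMatchedSketch is hypothetical, and how the real queue walks its disk-backed parts is not shown here.

```java
import java.util.Collection;
import java.util.Iterator;
import java.util.function.Predicate;

/** Hypothetical illustration; not the Heritrix implementation. */
class DeleteMatchedSketch {
    /** Removes every item accepted by matcher and returns the number removed. */
    static long deleteMatchedItems(Collection<String> items, Predicate<String> matcher) {
        long deleted = 0;
        for (Iterator<String> it = items.iterator(); it.hasNext(); ) {
            if (matcher.test(it.next())) {
                it.remove(); // safe removal while iterating
                deleted++;
            }
        }
        return deleted;
    }
}
```

A caller might pass a matcher such as `uri -> uri.endsWith(".pdf")` to drop every pending PDF from a queue.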

getInProcessItems

public java.util.List getInProcessItems()
Specified by:
getInProcessItems in interface URIWorkQueue
Returns:
The remembered items in process (set with noteInProcess()).

isDiscardable

public boolean isDiscardable()
May this KeyedQueue be completely discarded. It may be discarded only if empty of available and frozen items, and not SNOOZED or FROZEN (which implies state info which would be lost if discarded).

Specified by:
isDiscardable in interface URIWorkQueue
Returns:
True if discardable.
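
The rule above reduces to a single boolean expression. The sketch below is an assumption-labelled restatement, not the actual method body: DiscardableSketch and its parameters are hypothetical, and the enum simply mirrors the state names inherited from URIWorkQueue.

```java
/** Hypothetical restatement of the documented rule; not the Heritrix source. */
class DiscardableSketch {
    enum State { INACTIVE, READY, EMPTY, BUSY, FROZEN, SNOOZED, DISCARDED }

    static boolean isDiscardable(long availableItems, long frozenItems, State state) {
        // No available items and no frozen items may remain.
        boolean noItems = availableItems == 0 && frozenItems == 0;
        // SNOOZED and FROZEN imply state info that would be lost on discard.
        boolean holdsStateInfo = state == State.SNOOZED || state == State.FROZEN;
        return noItems && !holdsStateInfo;
    }
}
```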

setValence

public void setValence(int v)
Description copied from interface: URIWorkQueue
Set 'valence', the number of simultaneous items to allow in process before becoming BUSY

Specified by:
setValence in interface URIWorkQueue
Parameters:
v - the number of simultaneous in-process items to allow
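
One way to picture valence is as a threshold on the in-process load. The sketch below is an assumption, not the Heritrix source: the class ValenceSketch, the default valence of 1, and the exact comparison (load greater than or equal to valence) are all hypothetical readings of the description above.

```java
/** Hypothetical illustration of valence; not the Heritrix implementation. */
class ValenceSketch {
    private int valence = 1;   // assumed default: one item at a time
    private int inProcessLoad; // items currently in process

    void setValence(int v) { valence = v; }

    void noteInProcess() { inProcessLoad++; }

    void noteProcessDone() { inProcessLoad--; }

    /** The queue counts as BUSY once the load reaches the valence. */
    boolean isBusy() { return inProcessLoad >= valence; }
}
```

With a valence of 2, the first noteInProcess() leaves the queue able to hand out another item, and the second makes it busy until one item completes.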

getLastQueued

public java.lang.String getLastQueued()
Specified by:
getLastQueued in interface URIWorkQueue
Returns:
The last enqueued URI; useful for assessing queue state.

getLastDequeued

public java.lang.String getLastDequeued()
Specified by:
getLastDequeued in interface URIWorkQueue
Returns:
The last dequeued URI; useful for assessing queue state.

peek

public CrawlURI peek()

unpeek

public void unpeek()

setMaximumMemoryLoad

public void setMaximumMemoryLoad(int i)
Parameters:
i - maximum number of items to keep in memory


Copyright © 2003-2005 Internet Archive. All Rights Reserved.