org.archive.crawler.datamodel
Class CrawlURI

java.lang.Object
  extended by org.archive.crawler.datamodel.CandidateURI
      extended by org.archive.crawler.datamodel.CrawlURI
All Implemented Interfaces:
java.io.Serializable, CoreAttributeConstants, FetchStatusCodes, Reporter

public class CrawlURI
extends CandidateURI
implements FetchStatusCodes

Represents a candidate URI and the associated state it collects as it is crawled.

Core state is in instance variables, but a flexible attribute list is also available. Use this 'bucket' to carry custom data extracted during processing and other state across CrawlURI processing. See CandidateURI.putString(String, String), CandidateURI.getString(String), etc.
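
For example, a custom processor might stash a value during one processing step and read it back in a later one. The sketch below is illustrative only; the key name and helper methods are hypothetical, and only the documented putString/getString calls are assumed.

    // Sketch: carrying custom state across processing steps via the attribute list.
    // The key "my-extracted-title" is purely illustrative.
    void stashTitle(CrawlURI curi, String title) {
        curi.putString("my-extracted-title", title);    // write during extraction
    }

    String readTitle(CrawlURI curi) {
        return curi.getString("my-extracted-title");    // read back in a later processor
    }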

Author:
Gordon Mohr
See Also:
Serialized Form

Field Summary
(package private)  java.lang.Object holder
           
(package private)  int holderCost
          Spot for an integer cost to be placed by an external facility (the frontier).
(package private)  java.lang.Object holderKey
           
static int MAX_OUTLINKS
          Protection against outlink overflow.
protected  long ordinal
          Monotonically increasing number within a crawl; useful for tending towards breadth-first ordering.
(package private)  java.util.Collection<java.lang.Object> outLinks
          All discovered outbound Links (navlinks, embeds, etc.). May contain Link instances, CandidateURI instances, or both.
static int UNCALCULATED
           
 
Fields inherited from class org.archive.crawler.datamodel.CandidateURI
HIGH, HIGHEST, MEDIUM, NORMAL
 
Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE
 
Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX
 
Constructor Summary
CrawlURI(CandidateURI caUri, long o)
          Create a new instance of CrawlURI from a CandidateURI
CrawlURI(UURI uuri)
          Create a new instance of CrawlURI from a UURI.
 
Method Summary
 void aboutToLog()
          Notify CrawlURI it is about to be logged; opportunity for self-annotation
static void addAlistPersistentMember(java.lang.Object key)
          Add the key of alist items you want to persist across processings.
 void addAnnotation(java.lang.String annotation)
          Add an annotation: an abbreviated indication of something special about this URI that need not be present in every crawl.log line, but should be noted for future reference.
 void addCredentialAvatar(CredentialAvatar ca)
          Add an avatar.
 void addLocalizedError(java.lang.String processorName, java.lang.Throwable ex, java.lang.String message)
          Make note of a non-fatal error, local to a particular Processor, which should be logged somewhere, but allows processing to continue.
 void addOutLink(Link link)
          Add a discovered Link, unless it would exceed the max number to accept.
protected  boolean annotationContains(java.lang.String str2Find)
           
 void clearOutlinks()
           
 void createAndAddLink(java.lang.String url, java.lang.CharSequence context, char hopType)
          Convenience method for creating a Link with the given string and context
 void createAndAddLinkRelativeToBase(java.lang.String url, java.lang.CharSequence context, char hopType)
          Convenience method for creating a Link with the given string and context, relative to a previously set base HREF if available (or relative to the current CrawlURI if no other base has been set)
 void createAndAddLinkRelativeToVia(java.lang.String url, java.lang.CharSequence context, char hopType)
          Convenience method for creating a Link with the given string and context, relative to this CrawlURI's via UURI if available.
 Link createLink(java.lang.String url, java.lang.CharSequence context, char hopType)
          Convenience method for creating a Link discovered at this URI with the given string and context
static java.lang.String fetchStatusCodesToString(int code)
          Takes a status code and converts it into a human readable string.
static CrawlURI from(CandidateURI caUri, long ordinal)
          Make a CrawlURI from the passed CandidateURI.
 java.lang.String getAnnotations()
          Get the annotations set for this uri.
 UURI getBaseURI()
          Get the (HTML) Base URI used for derelativizing internal URIs.
protected  java.lang.String getClassSimpleName(java.lang.Class c)
           
 java.lang.Object getContentDigest()
          Return the retained content-digest value, if any.
 java.lang.String getContentDigestSchemeString()
           
 java.lang.String getContentDigestString()
           
 long getContentLength()
          For completed HTTP transactions, the length of the content-body.
 long getContentSize()
          Get the size in bytes of this URI's recorded content, inclusive of things like protocol headers.
 java.lang.String getContentType()
          Get the content type of this URI.
 java.lang.String getCrawlURIString()
           
 java.util.Set<CredentialAvatar> getCredentialAvatars()
           
 int getDeferrals()
          Get the deferral count.
 int getEmbedHopCount()
          Deprecated.  
 int getFetchAttempts()
          Get the number of attempts at getting the document referenced by this URI.
 long getFetchDuration()
           
 int getFetchStatus()
          Return the overall/fetch status of this CrawlURI for its current trip through the processing loop.
 java.lang.Object getHolder()
          Return the 'holder' for the convenience of an external facility.
 int getHolderCost()
          Return the 'holderCost' for the convenience of an external facility (frontier).
 java.lang.Object getHolderKey()
          Return the 'holderKey' for convenience of an external facility (Frontier).
 HttpRecorder getHttpRecorder()
          Get the http recorder associated with this uri.
 int getLinkHopCount()
          Deprecated.  
 long getOrdinal()
          Get the ordinal (serial number) assigned at creation.
 java.util.Collection<CandidateURI> getOutCandidates()
          Returns discovered candidate URIs.
 java.util.Collection<Link> getOutLinks()
          Returns discovered links.
 java.util.Collection<java.lang.Object> getOutObjects()
          Returns all of the outbound objects.
 st.ata.util.AList getPersistentAList()
           
 java.lang.Object getPrerequisiteUri()
          Get the prerequisite for this URI.
 long getRecordedSize()
          Get size of data recorded (transferred)
 int getThreadNumber()
          Get the number of the ToeThread responsible for processing this uri.
 java.lang.String getUserAgent()
          Get the user agent to use for crawling this URI.
 boolean hasBeenLinkExtracted()
          If true then a link extractor has already claimed this CrawlURI and performed link extraction on the document content.
 boolean hasCredentialAvatars()
           
 boolean hasPrerequisiteUri()
           
 boolean hasRfc2617CredentialAvatar()
           
 void incrementDeferrals()
          Increment the deferral count.
 int incrementFetchAttempts()
          Increment the number of attempts at getting the document referenced by this URI.
 boolean is2XXSuccess()
           
 boolean isHeaderTruncatedFetch()
           
 boolean isHttpTransaction()
          Return true if this is an HTTP transaction.
 boolean isLengthTruncatedFetch()
           
 boolean isPost()
          Returns true if this URI should be fetched by sending an HTTP POST request.
 boolean isPrerequisite()
          Returns true if this CrawlURI is a prerequisite.
 boolean isSuccess()
          Ask this URI if it was a success or not.
 boolean isTimeTruncatedFetch()
           
 boolean isTruncatedFetch()
          TODO: Implement truncation using booleans rather than as this ugly String parse.
 void linkExtractorFinished()
          Note that link extraction has been performed on this CrawlURI.
 void markAsSeed()
          Deprecated.  
 void markPrerequisite(java.lang.String preq, ProcessorChain lastProcessorChain)
          Do all actions associated with setting a CrawlURI as requiring a prerequisite.
 Processor nextProcessor()
          Get the next processor to process this URI.
 ProcessorChain nextProcessorChain()
          Get the processor chain that should be processing this URI after the current chain is finished with it.
 int outlinksSize()
           
 void processingCleanup()
          Clean up after a run through the processing chain.
static boolean removeAlistPersistentMember(java.lang.Object key)
           
 boolean removeCredentialAvatar(CredentialAvatar ca)
          Remove the passed credential avatar from this crawl uri.
 void removeCredentialAvatars()
          Remove all credential avatars from this crawl uri.
 void replaceOutlinks(java.util.Collection<CandidateURI> links)
          Replace the current collection of links with the passed collection.
 void resetDeferrals()
          Reset deferrals counter.
 void resetFetchAttempts()
          Reset fetchAttempts counter.
 void setBaseURI(java.lang.String baseHref)
          Set the (HTML) Base URI used for derelativizing internal URIs.
 void setContentDigest(byte[] digestValue)
          Deprecated. Use setContentDigest(String scheme, byte[])
 void setContentDigest(java.lang.String scheme, byte[] digestValue)
           
 void setContentSize(long l)
          Sets the 'content size' for the URI, which is considered inclusive of all recorded material (such as protocol headers) or even material 'virtually' considered (as in material from a previous fetch confirmed unchanged with a server).
 void setContentType(java.lang.String ct)
          Set a fetched uri's content type.
 void setFetchStatus(int newstatus)
          Set the overall/fetch status of this CrawlURI for its current trip through the processing loop.
 void setHolder(java.lang.Object obj)
          Remember a 'holder' to which some enclosing/queueing facility has assigned this CrawlURI.
 void setHolderCost(int cost)
          Remember a 'holderCost' which some enclosing/queueing facility has assigned this CrawlURI.
 void setHolderKey(java.lang.Object obj)
          Remember a 'holderKey' which some enclosing/queueing facility has assigned this CrawlURI.
 void setHttpRecorder(HttpRecorder httpRecorder)
          Set the http recorder to be associated with this uri.
 void setNextProcessor(Processor processor)
          Set the next processor to process this URI.
 void setNextProcessorChain(ProcessorChain nextProcessorChain)
          Set the next processor chain to process this URI.
 void setPost(boolean b)
          Set whether this URI should be fetched by sending an HTTP POST request.
 void setPrerequisite(boolean prerequisite)
          Set if this CrawlURI is itself a prerequisite URI.
 void setPrerequisiteUri(java.lang.Object link)
          Set a prerequisite for this URI.
 void setThreadNumber(int i)
          Set the number of the ToeThread responsible for processing this uri.
 void setUserAgent(java.lang.String string)
          Set the user agent to use when crawling this URI.
 void skipToProcessor(ProcessorChain processorChain, Processor processor)
          Set which processor should be the next processor to process this uri instead of using the default next processor.
 void skipToProcessorChain(ProcessorChain processorChain)
          Set which processor chain should be processing this uri next.
 void stripToMinimal()
          Remove all attributes set on this uri.
 
Methods inherited from class org.archive.crawler.datamodel.CandidateURI
clearAList, containsKey, createCandidateURI, createCandidateURI, createSeedCandidateURI, flattenVia, forceFetch, fromString, getAList, getCandidateURIString, getClassKey, getInt, getLong, getObject, getPathFromSeed, getReports, getSchedulingDirective, getString, getTransHops, getURIString, getUURI, getVia, getViaContext, inheritFrom, isLocation, isSeed, keys, makeHeritable, makeNonHeritable, needsImmediateScheduling, needsSoonScheduling, putInt, putLong, putObject, putString, readUuri, remove, reportTo, reportTo, sameDomainAs, setAList, setClassKey, setForceFetch, setIsSeed, setPathFromSeed, setSchedulingDirective, setVia, singleLineLegend, singleLineReport, singleLineReportTo, toString
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

UNCALCULATED

public static final int UNCALCULATED
See Also:
Constant Field Values

MAX_OUTLINKS

public static final int MAX_OUTLINKS
Protection against outlink overflow. Change value by setting alternate maximum in heritrix.properties.


ordinal

protected long ordinal
Monotonically increasing number within a crawl; useful for tending towards breadth-first ordering. Will sometimes be truncated to 48 bits, so behavior beyond 281 trillion instantiated CrawlURIs may be buggy.


holder

transient java.lang.Object holder

holderKey

transient java.lang.Object holderKey

holderCost

int holderCost
Spot for an integer cost to be placed by an external facility (the frontier). The cost is truncated to 8 bits at times, so it should not exceed 255.


outLinks

transient java.util.Collection<java.lang.Object> outLinks
All discovered outbound Links (navlinks, embeds, etc.). May contain Link instances, CandidateURI instances, or both. The LinksScoper processor converts Link instances in this collection to CandidateURI instances.

Constructor Detail

CrawlURI

public CrawlURI(UURI uuri)
Create a new instance of CrawlURI from a UURI.

Parameters:
uuri - the UURI to base this CrawlURI on.

CrawlURI

public CrawlURI(CandidateURI caUri,
                long o)
Create a new instance of CrawlURI from a CandidateURI

Parameters:
caUri - the CandidateURI to base this CrawlURI on.
o - Monotonically increasing number within a crawl.
Method Detail

fetchStatusCodesToString

public static java.lang.String fetchStatusCodesToString(int code)
Takes a status code and converts it into a human readable string.

Parameters:
code - the status code
Returns:
a human readable string declaring what the status code is.

getFetchStatus

public int getFetchStatus()
Return the overall/fetch status of this CrawlURI for its current trip through the processing loop.

Returns:
a value from FetchStatusCodes

setFetchStatus

public void setFetchStatus(int newstatus)
Set the overall/fetch status of this CrawlURI for its current trip through the processing loop.

Parameters:
newstatus - a value from FetchStatusCodes

getFetchAttempts

public int getFetchAttempts()
Get the number of attempts at getting the document referenced by this URI.

Returns:
the number of attempts at getting the document referenced by this URI.

incrementFetchAttempts

public int incrementFetchAttempts()
Increment the number of attempts at getting the document referenced by this URI.

Returns:
the number of attempts at getting the document referenced by this URI.

resetFetchAttempts

public void resetFetchAttempts()
Reset fetchAttempts counter.


resetDeferrals

public void resetDeferrals()
Reset deferrals counter.


nextProcessor

public Processor nextProcessor()
Get the next processor to process this URI.

Returns:
the processor that should process this URI next.

nextProcessorChain

public ProcessorChain nextProcessorChain()
Get the processor chain that should be processing this URI after the current chain is finished with it.

Returns:
the next processor chain to process this URI.

setNextProcessor

public void setNextProcessor(Processor processor)
Set the next processor to process this URI.

Parameters:
processor - the next processor to process this URI.

setNextProcessorChain

public void setNextProcessorChain(ProcessorChain nextProcessorChain)
Set the next processor chain to process this URI.

Parameters:
nextProcessorChain - the next processor chain to process this URI.

markPrerequisite

public void markPrerequisite(java.lang.String preq,
                             ProcessorChain lastProcessorChain)
                      throws org.apache.commons.httpclient.URIException
Do all actions associated with setting a CrawlURI as requiring a prerequisite.

Parameters:
lastProcessorChain - Last processor chain reference. This chain is where this CrawlURI goes next.
preq - Object to set a prerequisite.
Throws:
org.apache.commons.httpclient.URIException
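
A typical use is a processor discovering that some other fetch (for example, a site's robots.txt) must happen before this URI can proceed. A minimal sketch, where the prerequisite URI string and the ProcessorChain reference are assumptions supplied by the caller:

    // Sketch: flag a prerequisite so it is handled before this URI continues.
    // 'chainToResumeIn' is assumed to be the chain this CrawlURI re-enters afterwards.
    void requireRobotsTxt(CrawlURI curi, ProcessorChain chainToResumeIn)
            throws org.apache.commons.httpclient.URIException {
        curi.markPrerequisite("http://example.com/robots.txt", chainToResumeIn);
    }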

setPrerequisiteUri

public void setPrerequisiteUri(java.lang.Object link)
Set a prerequisite for this URI.

A prerequisite is a URI that must be crawled before this URI can be crawled.

Parameters:
link - Link to set as prereq.

getPrerequisiteUri

public java.lang.Object getPrerequisiteUri()
Get the prerequisite for this URI.

A prerequisite is a URI that must be crawled before this URI can be crawled.

Returns:
the prerequisite for this URI or null if no prerequisite.

hasPrerequisiteUri

public boolean hasPrerequisiteUri()
Returns:
True if this CrawlURI has a prerequisite.

isPrerequisite

public boolean isPrerequisite()
Returns true if this CrawlURI is a prerequisite.

Returns:
true if this CrawlURI is a prerequisite.

setPrerequisite

public void setPrerequisite(boolean prerequisite)
Set if this CrawlURI is itself a prerequisite URI.

Parameters:
prerequisite - True if this CrawlURI is itself a prerequiste uri.

getCrawlURIString

public java.lang.String getCrawlURIString()
Returns:
This crawl URI as a string wrapped with 'CrawlURI(' + ')'.

getContentType

public java.lang.String getContentType()
Get the content type of this URI.

Returns:
Fetched URI's content type. May be null.

setContentType

public void setContentType(java.lang.String ct)
Set a fetched uri's content type.

Parameters:
ct - Content type. May be null.

setThreadNumber

public void setThreadNumber(int i)
Set the number of the ToeThread responsible for processing this uri.

Parameters:
i - the ToeThread number.

getThreadNumber

public int getThreadNumber()
Get the number of the ToeThread responsible for processing this uri.

Returns:
the ToeThread number.

incrementDeferrals

public void incrementDeferrals()
Increment the deferral count.


getDeferrals

public int getDeferrals()
Get the deferral count.

Returns:
the deferral count.

stripToMinimal

public void stripToMinimal()
Remove all attributes set on this uri.

This methods removes the attribute list.


getContentSize

public long getContentSize()
Get the size in bytes of this URI's recorded content, inclusive of things like protocol headers. It is the responsibility of the classes which fetch the URI to set this value accordingly -- it is not calculated/verified within CrawlURI. This value is consulted in reporting/logging/writing-decisions.

Returns:
contentSize
See Also:
setContentSize(long)

addLocalizedError

public void addLocalizedError(java.lang.String processorName,
                              java.lang.Throwable ex,
                              java.lang.String message)
Make note of a non-fatal error, local to a particular Processor, which should be logged somewhere but allows processing to continue. This is how you add to the local-error log (the 'localized' here means local rather than global, not a Swiss-French version of the error).

Parameters:
processorName - Name of processor the exception was thrown in.
ex - Throwable to log.
message - Extra message to log beyond exception message.
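
For instance, a processor that hits a recoverable problem might record it and carry on. A minimal sketch; the processor name and message are illustrative:

    // Sketch: note a non-fatal, processor-local error without aborting processing.
    void noteParseProblem(CrawlURI curi, Exception e) {
        curi.addLocalizedError("MyExtractor", e, "unparseable stylesheet; continuing");
    }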

getClassSimpleName

protected java.lang.String getClassSimpleName(java.lang.Class c)

addAnnotation

public void addAnnotation(java.lang.String annotation)
Add an annotation: an abbreviated indication of something special about this URI that need not be present in every crawl.log line, but should be noted for future reference.

Parameters:
annotation - the annotation to add; should not contain whitespace or a comma
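
A minimal sketch; the annotation text is illustrative and deliberately avoids whitespace and commas:

    // Sketch: annotate a URI so the note appears alongside its crawl.log line.
    void flagSlowFetch(CrawlURI curi, long elapsedMs) {
        if (elapsedMs > 30000) {
            curi.addAnnotation("slow:" + elapsedMs + "ms");
        }
    }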

isTruncatedFetch

public boolean isTruncatedFetch()
TODO: Implement truncation using booleans rather than as this ugly String parse.

Returns:
True if fetch was truncated.

isLengthTruncatedFetch

public boolean isLengthTruncatedFetch()

isTimeTruncatedFetch

public boolean isTimeTruncatedFetch()

isHeaderTruncatedFetch

public boolean isHeaderTruncatedFetch()

annotationContains

protected boolean annotationContains(java.lang.String str2Find)

getAnnotations

public java.lang.String getAnnotations()
Get the annotations set for this uri.

Returns:
the annotations set for this uri.

getEmbedHopCount

public int getEmbedHopCount()
Deprecated. 

Get the embedded hop count.

Returns:
the embedded hop count.

getLinkHopCount

public int getLinkHopCount()
Deprecated. 

Get the link hop count.

Returns:
the link hop count.

markAsSeed

public void markAsSeed()
Deprecated. 

Mark this uri as being a seed.


getUserAgent

public java.lang.String getUserAgent()
Get the user agent to use for crawling this URI. If null the global setting should be used.

Returns:
user agent or null

setUserAgent

public void setUserAgent(java.lang.String string)
Set the user agent to use when crawling this URI. If not set the global settings should be used.

Parameters:
string - user agent to use

skipToProcessor

public void skipToProcessor(ProcessorChain processorChain,
                            Processor processor)
Set which processor should be the next processor to process this uri instead of using the default next processor.

Parameters:
processorChain - the processor chain to skip to.
processor - the processor in the processor chain to skip to.

skipToProcessorChain

public void skipToProcessorChain(ProcessorChain processorChain)
Set which processor chain should be processing this uri next.

Parameters:
processorChain - the processor chain to skip to.

getContentLength

public long getContentLength()
For completed HTTP transactions, the length of the content-body.

Returns:
For completed HTTP transactions, the length of the content-body.

getRecordedSize

public long getRecordedSize()
Get size of data recorded (transferred)

Returns:
recorded data size

setContentSize

public void setContentSize(long l)
Sets the 'content size' for the URI, which is considered inclusive of all recorded material (such as protocol headers) or even material 'virtually' considered (as in material from a previous fetch confirmed unchanged with a server). (In contrast, content-length matches the HTTP definition, that of the enclosed content-body.) Should be set by a fetcher or other processor as soon as the final size of recorded content is known. Setting to an artificial/incorrect value may affect other reporting/processing.

Parameters:
l - Content size.
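
A fetcher would typically set this once the response has been fully recorded, keeping it distinct from the HTTP content-body length. A minimal sketch; the byte counts are assumed to come from the fetcher's own bookkeeping:

    // Sketch: record total bytes captured (headers + body), as distinct from
    // the content-body length reported by getContentLength().
    void recordSizes(CrawlURI curi, long headerBytes, long bodyBytes) {
        curi.setContentSize(headerBytes + bodyBytes);
    }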

hasBeenLinkExtracted

public boolean hasBeenLinkExtracted()
If true then a link extractor has already claimed this CrawlURI and performed link extraction on the document content. This does not preclude other link extractors that may have an interest in this CrawlURI from also doing link extraction.

There is an onus on link extractors to set this flag if they have run.

Returns:
True if a processor has performed link extraction on this CrawlURI
See Also:
linkExtractorFinished()

linkExtractorFinished

public void linkExtractorFinished()
Note that link extraction has been performed on this CrawlURI. A processor doing link extraction should invoke this method once it has finished its work. It should invoke it even if no links are extracted. It should only invoke this method if the link extraction was performed on the document body (not the HTTP headers, etc.).

See Also:
hasBeenLinkExtracted()
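
Together with hasBeenLinkExtracted(), this supports the usual extractor pattern. A minimal sketch (skipping already-claimed URIs is one common policy, not a requirement; the extraction step itself is elided):

    // Sketch: skip URIs another extractor has already handled, and mark the URI
    // once this extractor has processed the document body.
    void maybeExtract(CrawlURI curi) {
        if (curi.hasBeenLinkExtracted()) {
            return;                      // another extractor already claimed it
        }
        // ... extract links from the document body here ...
        curi.linkExtractorFinished();    // set even if no links were found
    }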

aboutToLog

public void aboutToLog()
Notify CrawlURI it is about to be logged; opportunity for self-annotation


getHttpRecorder

public HttpRecorder getHttpRecorder()
Get the http recorder associated with this uri.

Returns:
Returns the httpRecorder. May be null, but it is set early in FetchHTTP, so there is an issue if it is null.

setHttpRecorder

public void setHttpRecorder(HttpRecorder httpRecorder)
Set the http recorder to be associated with this uri.

Parameters:
httpRecorder - The httpRecorder to set.

isHttpTransaction

public boolean isHttpTransaction()
Return true if this is an HTTP transaction. TODO: Combine this and the isPost() method so that there is one place to find out whether this is HTTP GET, HTTP POST, FTP, or DNS.

Returns:
True if this is an HTTP transaction.

processingCleanup

public void processingCleanup()
Clean up after a run through the processing chain. Called at the end of the processing chain by Frontier#finish. Nulls out any state gathered during processing.


getPersistentAList

public st.ata.util.AList getPersistentAList()

from

public static CrawlURI from(CandidateURI caUri,
                            long ordinal)
Make a CrawlURI from the passed CandidateURI. It is safe to pass a CrawlURI instance; in that case it is simply returned. Otherwise, a new CrawlURI instance is created.

Parameters:
caUri - Candidate URI.
ordinal -
Returns:
A crawlURI made from the passed CandidateURI.

getCredentialAvatars

public java.util.Set<CredentialAvatar> getCredentialAvatars()
Returns:
Credential avatars. Null if none set.

hasCredentialAvatars

public boolean hasCredentialAvatars()
Returns:
True if there are avatars attached to this instance.

addCredentialAvatar

public void addCredentialAvatar(CredentialAvatar ca)
Add an avatar. We do lazy instantiation.

Parameters:
ca - Credential avatar to add to set of avatars.

removeCredentialAvatars

public void removeCredentialAvatars()
Remove all credential avatars from this crawl uri.


removeCredentialAvatar

public boolean removeCredentialAvatar(CredentialAvatar ca)
Remove the passed credential avatar from this crawl uri.

Parameters:
ca - Avatar to remove.
Returns:
True if we removed the passed avatar; false if no operation was performed.

isSuccess

public boolean isSuccess()
Ask this URI if it was a success or not. Only makes sense to call this method after execution of HttpMethod#execute. Regard any status larger than 0 as success, except for the caveat below regarding 401s. Use is2XXSuccess() if looking for a status code in the 200 range.

401s caveat: if any rfc2617 credential data is present and we got a 401, assume it was loaded in FetchHTTP on the expectation that we're to go around the processing chain again. Report this condition as a failure so we get another crack at the processing chain, only this time making use of the loaded credential data.

Returns:
True if this URI has been successfully processed.
See Also:
is2XXSuccess()
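
A post-fetch processor might branch on these two tests roughly as follows (a sketch; the branch bodies are placeholders):

    // Sketch: distinguish "fetch completed" from "2xx response".
    void classify(CrawlURI curi) {
        if (curi.is2XXSuccess()) {
            // 2xx: archive the response body
        } else if (curi.isSuccess()) {
            // fetched, but e.g. a redirect or 404 -- still a completed attempt
        } else {
            // failure, deferral, etc.; the status is a FetchStatusCodes value
            int status = curi.getFetchStatus();
        }
    }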

is2XXSuccess

public boolean is2XXSuccess()
Returns:
True if status code is in the 2xx range.
See Also:
isSuccess()

hasRfc2617CredentialAvatar

public boolean hasRfc2617CredentialAvatar()
Returns:
True if we have an rfc2617 payload.

setPost

public void setPost(boolean b)
Set whether this URI should be fetched by sending an HTTP POST request. Otherwise, an HTTP GET request will be used.

Parameters:
b - True if this curi is to be POST'd; otherwise it will be GET'd.

isPost

public boolean isPost()
Returns true if this URI should be fetched by sending an HTTP POST request. TODO: Combine this and the isHttpTransaction() method so that there is one place to find out whether this is HTTP GET, HTTP POST, FTP, or DNS.

Returns:
True if this CrawlURI instance is to be POSTed.

setContentDigest

public void setContentDigest(byte[] digestValue)
Deprecated. Use setContentDigest(String scheme, byte[])

Set the retained content-digest value (usually SHA1).

Parameters:
digestValue -

setContentDigest

public void setContentDigest(java.lang.String scheme,
                             byte[] digestValue)

getContentDigestSchemeString

public java.lang.String getContentDigestSchemeString()

getContentDigest

public java.lang.Object getContentDigest()
Return the retained content-digest value, if any.

Returns:
Digest value.

getContentDigestString

public java.lang.String getContentDigestString()

setHolder

public void setHolder(java.lang.Object obj)
Remember a 'holder' to which some enclosing/queueing facility has assigned this CrawlURI.

Parameters:
obj -

getHolder

public java.lang.Object getHolder()
Return the 'holder' for the convenience of an external facility.

Returns:
holder

setHolderKey

public void setHolderKey(java.lang.Object obj)
Remember a 'holderKey' which some enclosing/queueing facility has assigned this CrawlURI.

Parameters:
obj -

getHolderKey

public java.lang.Object getHolderKey()
Return the 'holderKey' for convenience of an external facility (Frontier).

Returns:
holderKey

getOrdinal

public long getOrdinal()
Get the ordinal (serial number) assigned at creation.

Returns:
ordinal

getHolderCost

public int getHolderCost()
Return the 'holderCost' for the convenience of an external facility (frontier).

Returns:
value of holderCost

setHolderCost

public void setHolderCost(int cost)
Remember a 'holderCost' which some enclosing/queueing facility has assigned this CrawlURI.

Parameters:
cost - value to remember
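
The holder, holderKey and holderCost slots exist purely for an enclosing frontier's bookkeeping; CrawlURI itself never interprets them. A minimal sketch of how a frontier might use them (the queue object and unit-cost policy are assumptions):

    // Sketch: frontier-side bookkeeping using the holder* slots.
    void rememberPlacement(CrawlURI curi, Object workQueue, String classKey) {
        curi.setHolder(workQueue);    // the queue currently holding this URI
        curi.setHolderKey(classKey);  // e.g. the queue's class key
        curi.setHolderCost(1);        // assumed unit cost; must not exceed 255
    }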

getOutLinks

public java.util.Collection<Link> getOutLinks()
Returns discovered links. The returned collection might be empty if no links were discovered, or if something like LinksScoper promoted the links to CandidateURIs. Elements can be removed from the returned collection, but not added. To add a discovered link, use one of the createAndAdd methods or getOutObjects().

Returns:
Collection of all discovered outbound Links

getOutCandidates

public java.util.Collection<CandidateURI> getOutCandidates()
Returns discovered candidate URIs. The returned collection will be empty until something like LinksScoper promotes discovered Links into CandidateURIs. Elements can be removed from the returned collection, but not added. To add a candidate URI, use replaceOutlinks(Collection) or getOutObjects().

Returns:
Collection of candidate URIs

getOutObjects

public java.util.Collection<java.lang.Object> getOutObjects()
Returns all of the outbound objects. The returned Collection will contain Link instances, or CandidateURI instances, or both.

Returns:
the collection of Links and/or CandidateURIs
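
A scoping or reporting step might walk the discovered objects roughly like this (a sketch; the handling in each branch is a placeholder):

    // Sketch: inspect everything discovered from this URI, whatever its current form.
    void reportDiscoveries(CrawlURI curi) {
        for (Object o : curi.getOutObjects()) {
            if (o instanceof Link) {
                Link link = (Link) o;                  // not yet scoped/promoted
            } else if (o instanceof CandidateURI) {
                CandidateURI cand = (CandidateURI) o;  // already promoted
            }
        }
        System.out.println("outlink count: " + curi.outlinksSize());
    }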

addOutLink

public void addOutLink(Link link)
Add a discovered Link, unless it would exceed the max number to accept. (If so, increment discarded link counter.)

Parameters:
link - the Link to add

clearOutlinks

public void clearOutlinks()

replaceOutlinks

public void replaceOutlinks(java.util.Collection<CandidateURI> links)
Replace the current collection of links with the passed collection. Used by Scopers adjusting the list of links (removing those not in scope and promoting Links to CandidateURIs).

Parameters:
links - Collection of CandidateURIs replacing any previously existing outLinks or outCandidates.

outlinksSize

public int outlinksSize()
Returns:
Count of outlinks.

createLink

public Link createLink(java.lang.String url,
                       java.lang.CharSequence context,
                       char hopType)
                throws org.apache.commons.httpclient.URIException
Convenience method for creating a Link discovered at this URI with the given string and context

Parameters:
url - String to use to create Link
context - CharSequence context to use
hopType -
Returns:
Link.
Throws:
org.apache.commons.httpclient.URIException - if Link UURI cannot be constructed

createAndAddLink

public void createAndAddLink(java.lang.String url,
                             java.lang.CharSequence context,
                             char hopType)
                      throws org.apache.commons.httpclient.URIException
Convenience method for creating a Link with the given string and context

Parameters:
url - String to use to create Link
context - CharSequence context to use
hopType -
Throws:
org.apache.commons.httpclient.URIException - if Link UURI cannot be constructed
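
An extractor typically calls this once per discovered URL. A minimal sketch; the context string and the hop-type character 'L' (taken here to mean a navigational link) are assumptions about the Link conventions:

    // Sketch: record a discovered navigational link. URIException is thrown if
    // the URL cannot be turned into a valid Link UURI.
    void addNavLink(CrawlURI curi, String url)
            throws org.apache.commons.httpclient.URIException {
        curi.createAndAddLink(url, "a/@href", 'L');   // 'L' = navlink hop (assumed)
    }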

createAndAddLinkRelativeToBase

public void createAndAddLinkRelativeToBase(java.lang.String url,
                                           java.lang.CharSequence context,
                                           char hopType)
                                    throws org.apache.commons.httpclient.URIException
Convenience method for creating a Link with the given string and context, relative to a previously set base HREF if available (or relative to the current CrawlURI if no other base has been set)

Parameters:
url - String URL to add as destination of link
context - String context where link was discovered
hopType - char hop-type indicator
Throws:
org.apache.commons.httpclient.URIException

createAndAddLinkRelativeToVia

public void createAndAddLinkRelativeToVia(java.lang.String url,
                                          java.lang.CharSequence context,
                                          char hopType)
                                   throws org.apache.commons.httpclient.URIException
Convenience method for creating a Link with the given string and context, relative to this CrawlURI's via UURI if available. (If a via is not available, falls back to using #createAndAddLinkRelativeToBase.)

Parameters:
url - String URL to add as destination of link
context - String context where link was discovered
hopType - char hop-type indicator
Throws:
org.apache.commons.httpclient.URIException

setBaseURI

public void setBaseURI(java.lang.String baseHref)
                throws org.apache.commons.httpclient.URIException
Set the (HTML) Base URI used for derelativizing internal URIs.

Parameters:
baseHref - String base href to use
Throws:
org.apache.commons.httpclient.URIException - if supplied string cannot be interpreted as URI

getBaseURI

public UURI getBaseURI()
Get the (HTML) Base URI used for derelativizing internal URIs.

Returns:
UURI base URI previously set

addAlistPersistentMember

public static void addAlistPersistentMember(java.lang.Object key)
Add the key of alist items you want to persist across processings.

Parameters:
key - Key to add.
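
For example, a module whose custom attribute should survive processingCleanup() between processing rounds might register its key once at startup (a sketch; the key name is illustrative):

    // Sketch: make a custom alist key persist across processing rounds.
    void registerPersistentKeys() {
        CrawlURI.addAlistPersistentMember("my-extracted-title");
    }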

removeAlistPersistentMember

public static boolean removeAlistPersistentMember(java.lang.Object key)
Parameters:
key - Key to remove.
Returns:
True if list contained the element.

getFetchDuration

public long getFetchDuration()

