CrawlURI (Heritrix 1.15.5-201106092337)

org.archive.crawler.datamodel
Class CrawlURI

java.lang.Object
  org.archive.crawler.datamodel.CandidateURI
      org.archive.crawler.datamodel.CrawlURI

All Implemented Interfaces:: java.io.Serializable, CoreAttributeConstants, FetchStatusCodes, Reporter

public class CrawlURI
extends CandidateURI
implements FetchStatusCodes

Represents a candidate URI and the associated state it collects as it is crawled.

Core state is held in instance variables, but a flexible attribute list is also available. Use this 'bucket' to carry custom data and state extracted during processing from one CrawlURI processing step to the next. See CandidateURI.putString(String, String), CandidateURI.getString(String), etc.

Author:: Gordon Mohr
See Also:: Serialized Form
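
As a rough illustration of the attribute 'bucket' idea, the sketch below carries a custom value under a string key. `AttributeBucketSketch` and the key `my-extractor.title` are hypothetical stand-ins invented for this example. The sketch is backed by a plain `HashMap` and only mirrors the names `putString`, `getString` and `containsKey` listed for CandidateURI. It is not the Heritrix implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the attribute list; NOT the Heritrix implementation.
final class AttributeBucketSketch {
    // Keyed storage carried alongside whatever core fields the object has.
    private final Map<String, Object> attributes = new HashMap<>();

    void putString(String key, String value) {
        attributes.put(key, value);
    }

    String getString(String key) {
        return (String) attributes.get(key);
    }

    boolean containsKey(String key) {
        return attributes.containsKey(key);
    }

    public static void main(String[] args) {
        AttributeBucketSketch curi = new AttributeBucketSketch();
        // An extractor stores a value; a later processor reads it back.
        curi.putString("my-extractor.title", "Example page");
        System.out.println(curi.getString("my-extractor.title"));
    }
}
```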

Field Summary
`(package private) java.lang.Object`	`holder`
`(package private) int`	`holderCost` Spot for an integer cost to be placed by an external facility (frontier).
`(package private) java.lang.Object`	`holderKey`
`static int`	`MAX_OUTLINKS` Protection against outlink overflow.
`protected long`	`ordinal` Monotonically increasing number within a crawl; useful for tending towards breadth-first ordering.
`(package private) java.util.Collection<java.lang.Object>`	`outLinks` All discovered outbound Links (navlinks, embeds, etc.) Can either contain Link instances or CandidateURI instances, or both.
`static int`	`UNCALCULATED`

Fields inherited from class org.archive.crawler.datamodel.CandidateURI
`HIGH, HIGHEST, MEDIUM, NORMAL`

Fields inherited from interface org.archive.crawler.datamodel.FetchStatusCodes
S_BLOCKED_BY_CUSTOM_PROCESSOR, S_BLOCKED_BY_QUOTA, S_BLOCKED_BY_RUNTIME_LIMIT, S_BLOCKED_BY_USER, S_CONNECT_FAILED, S_CONNECT_LOST, S_DEEMED_CHAFF, S_DEEMED_NOT_FOUND, S_DEFERRED, S_DELETED_BY_USER, S_DNS_SUCCESS, S_DOMAIN_PREREQUISITE_FAILURE, S_DOMAIN_UNRESOLVABLE, S_GETBYNAME_SUCCESS, S_OTHER_PREREQUISITE_FAILURE, S_OUT_OF_SCOPE, S_PREREQUISITE_UNSCHEDULABLE_FAILURE, S_PROCESSING_THREAD_KILLED, S_ROBOTS_PRECLUDED, S_ROBOTS_PREREQUISITE_FAILURE, S_RUNTIME_EXCEPTION, S_SERIOUS_ERROR, S_TIMEOUT, S_TOO_MANY_EMBED_HOPS, S_TOO_MANY_LINK_HOPS, S_TOO_MANY_RETRIES, S_UNATTEMPTED, S_UNFETCHABLE_URI, S_UNQUEUEABLE

Fields inherited from interface org.archive.crawler.datamodel.CoreAttributeConstants
A_ANNOTATIONS, A_CONTENT_DIGEST, A_CONTENT_TYPE, A_CREDENTIAL_AVATARS_KEY, A_DELAY_FACTOR, A_DISTANCE_FROM_SEED, A_DNS_FETCH_TIME, A_DNS_SERVER_IP_LABEL, A_ETAG_HEADER, A_FETCH_BEGAN_TIME, A_FETCH_COMPLETED_TIME, A_FETCH_HISTORY, A_FORCE_RETIRE, A_FTP_CONTROL_CONVERSATION, A_FTP_FETCH_STATUS, A_HERITABLE_KEYS, A_HTML_BASE, A_HTTP_BIND_ADDRESS, A_HTTP_PROXY_HOST, A_HTTP_PROXY_PORT, A_HTTP_TRANSACTION, A_LAST_MODIFIED_HEADER, A_LOCALIZED_ERRORS, A_META_ROBOTS, A_MINIMUM_DELAY, A_MIRROR_PATH, A_PREREQUISITE_URI, A_REFERENCE_LENGTH, A_RETRY_DELAY, A_RRECORD_SET_LABEL, A_RUNTIME_EXCEPTION, A_SOURCE_TAG, A_STATUS, A_WRITTEN_TO_WARC, HEADER_TRUNC, LENGTH_TRUNC, TIMER_TRUNC, TRUNC_SUFFIX

Constructor Summary
`CrawlURI(CandidateURI caUri, long o)` Create a new instance of CrawlURI from a `CandidateURI`
`CrawlURI(UURI uuri)` Create a new instance of CrawlURI from a `UURI`.

Method Summary
`void`	`aboutToLog()` Notify CrawlURI it is about to be logged; opportunity for self-annotation
`static void`	`addAlistPersistentMember(java.lang.Object key)` Add the key of alist items you want to persist across processings.
`void`	`addAnnotation(java.lang.String annotation)` Add an annotation: an abbreviated indication of something special about this URI that need not be present in every crawl.log line, but should be noted for future reference.
`void`	`addCredentialAvatar(CredentialAvatar ca)` Add an avatar.
`void`	`addLocalizedError(java.lang.String processorName, java.lang.Throwable ex, java.lang.String message)` Make note of a non-fatal error, local to a particular Processor, which should be logged somewhere, but allows processing to continue.
`void`	`addOutLink(Link link)` Add a discovered Link, unless it would exceed the max number to accept.
`protected boolean`	`annotationContains(java.lang.String str2Find)`
`void`	`clearOutlinks()`
`void`	`createAndAddLink(java.lang.String url, java.lang.CharSequence context, char hopType)` Convenience method for creating a Link with the given string and context
`void`	`createAndAddLinkRelativeToBase(java.lang.String url, java.lang.CharSequence context, char hopType)` Convenience method for creating a Link with the given string and context, relative to a previously set base HREF if available (or relative to the current CrawlURI if no other base has been set)
`void`	`createAndAddLinkRelativeToVia(java.lang.String url, java.lang.CharSequence context, char hopType)` Convenience method for creating a Link with the given string and context, relative to this CrawlURI's via UURI if available.
`Link`	`createLink(java.lang.String url, java.lang.CharSequence context, char hopType)` Convenience method for creating a Link discovered at this URI with the given string and context
`static java.lang.String`	`fetchStatusCodesToString(int code)` Takes a status code and converts it into a human readable string.
`static CrawlURI`	`from(CandidateURI caUri, long ordinal)` Make a `CrawlURI` from the passed `CandidateURI`.
`java.lang.String`	`getAnnotations()` Get the annotations set for this uri.
`UURI`	`getBaseURI()` Get the (HTML) Base URI used for derelativizing internal URIs.
`protected java.lang.String`	`getClassSimpleName(java.lang.Class c)`
`java.lang.Object`	`getContentDigest()` Return the retained content-digest value, if any.
`java.lang.String`	`getContentDigestSchemeString()`
`java.lang.String`	`getContentDigestString()`
`long`	`getContentLength()` For completed HTTP transactions, the length of the content-body.
`long`	`getContentSize()` Get the size in bytes of this URI's recorded content, inclusive of things like protocol headers.
`java.lang.String`	`getContentType()` Get the content type of this URI.
`java.lang.String`	`getCrawlURIString()`
`java.util.Set<CredentialAvatar>`	`getCredentialAvatars()`
`int`	`getDeferrals()` Get the deferral count.
`int`	`getEmbedHopCount()` Deprecated.
`int`	`getFetchAttempts()` Get the number of attempts at getting the document referenced by this URI.
`long`	`getFetchDuration()`
`int`	`getFetchStatus()` Return the overall/fetch status of this CrawlURI for its current trip through the processing loop.
`java.lang.Object`	`getHolder()` Return the 'holder' for the convenience of an external facility.
`int`	`getHolderCost()` Return the 'holderCost' for convenience of external facility (frontier)
`java.lang.Object`	`getHolderKey()` Return the 'holderKey' for convenience of an external facility (Frontier).
`HttpRecorder`	`getHttpRecorder()` Get the http recorder associated with this uri.
`int`	`getLinkHopCount()` Deprecated.
`long`	`getOrdinal()` Get the ordinal (serial number) assigned at creation.
`java.util.Collection<CandidateURI>`	`getOutCandidates()` Returns discovered candidate URIs.
`java.util.Collection<Link>`	`getOutLinks()` Returns discovered links.
`java.util.Collection<java.lang.Object>`	`getOutObjects()` Returns all of the outbound objects.
`st.ata.util.AList`	`getPersistentAList()`
`java.lang.Object`	`getPrerequisiteUri()` Get the prerequisite for this URI.
`long`	`getRecordedSize()` Get size of data recorded (transferred)
`int`	`getThreadNumber()` Get the number of the ToeThread responsible for processing this uri.
`java.lang.String`	`getUserAgent()` Get the user agent to use for crawling this URI.
`boolean`	`hasBeenLinkExtracted()` If true then a link extractor has already claimed this CrawlURI and performed link extraction on the document content.
`boolean`	`hasCredentialAvatars()`
`boolean`	`hasPrerequisiteUri()`
`boolean`	`hasRfc2617CredentialAvatar()`
`void`	`incrementDeferrals()` Increment the deferral count.
`int`	`incrementFetchAttempts()` Increment the number of attempts at getting the document referenced by this URI.
`boolean`	`is2XXSuccess()`
`boolean`	`isHeaderTruncatedFetch()`
`boolean`	`isHttpTransaction()` Return true if this is a http transaction.
`boolean`	`isLengthTruncatedFetch()`
`boolean`	`isPost()` Returns true if this URI should be fetched by sending a HTTP POST request.
`boolean`	`isPrerequisite()` Returns true if this CrawlURI is a prerequisite.
`boolean`	`isSuccess()` Ask this URI if it was a success or not.
`boolean`	`isTimeTruncatedFetch()`
`boolean`	`isTruncatedFetch()` TODO: Implement truncation using booleans rather than as this ugly String parse.
`void`	`linkExtractorFinished()` Note that link extraction has been performed on this CrawlURI.
`void`	`markAsSeed()` Deprecated.
`void`	`markPrerequisite(java.lang.String preq, ProcessorChain lastProcessorChain)` Do all actions associated with setting a `CrawlURI` as requiring a prerequisite.
`Processor`	`nextProcessor()` Get the next processor to process this URI.
`ProcessorChain`	`nextProcessorChain()` Get the processor chain that should be processing this URI after the current chain is finished with it.
`int`	`outlinksSize()`
`void`	`processingCleanup()` Clean up after a run through the processing chain.
`static boolean`	`removeAlistPersistentMember(java.lang.Object key)`
`boolean`	`removeCredentialAvatar(CredentialAvatar ca)` Remove the passed credential avatar from this crawl uri.
`void`	`removeCredentialAvatars()` Remove all credential avatars from this crawl uri.
`void`	`replaceOutlinks(java.util.Collection<CandidateURI> links)` Replace current collection of links w/ passed list.
`void`	`resetDeferrals()` Reset deferrals counter.
`void`	`resetFetchAttempts()` Reset fetchAttempts counter.
`void`	`setBaseURI(java.lang.String baseHref)` Set the (HTML) Base URI used for derelativizing internal URIs.
`void`	`setContentDigest(byte[] digestValue)` Deprecated. Use `setContentDigest(String scheme, byte[])`
`void`	`setContentDigest(java.lang.String scheme, byte[] digestValue)`
`void`	`setContentSize(long l)` Sets the 'content size' for the URI, which is considered inclusive of all recorded material (such as protocol headers) or even material 'virtually' considered (as in material from a previous fetch confirmed unchanged with a server).
`void`	`setContentType(java.lang.String ct)` Set a fetched uri's content type.
`void`	`setFetchStatus(int newstatus)` Set the overall/fetch status of this CrawlURI for its current trip through the processing loop.
`void`	`setHolder(java.lang.Object obj)` Remember a 'holder' to which some enclosing/queueing facility has assigned this CrawlURI.
`void`	`setHolderCost(int cost)` Remember a 'holderCost' which some enclosing/queueing facility has assigned this CrawlURI
`void`	`setHolderKey(java.lang.Object obj)` Remember a 'holderKey' which some enclosing/queueing facility has assigned this CrawlURI.
`void`	`setHttpRecorder(HttpRecorder httpRecorder)` Set the http recorder to be associated with this uri.
`void`	`setNextProcessor(Processor processor)` Set the next processor to process this URI.
`void`	`setNextProcessorChain(ProcessorChain nextProcessorChain)` Set the next processor chain to process this URI.
`void`	`setPost(boolean b)` Set whether this URI should be fetched by sending a HTTP POST request.
`void`	`setPrerequisite(boolean prerequisite)` Set if this CrawlURI is itself a prerequisite URI.
`void`	`setPrerequisiteUri(java.lang.Object link)` Set a prerequisite for this URI.
`void`	`setThreadNumber(int i)` Set the number of the ToeThread responsible for processing this uri.
`void`	`setUserAgent(java.lang.String string)` Set the user agent to use when crawling this URI.
`void`	`skipToProcessor(ProcessorChain processorChain, Processor processor)` Set which processor should be the next processor to process this uri instead of using the default next processor.
`void`	`skipToProcessorChain(ProcessorChain processorChain)` Set which processor chain should be processing this uri next.
`void`	`stripToMinimal()` Remove all attributes set on this uri.

Methods inherited from class org.archive.crawler.datamodel.CandidateURI
clearAList, containsKey, createCandidateURI, createCandidateURI, createSeedCandidateURI, flattenVia, forceFetch, fromString, getAList, getCandidateURIString, getClassKey, getInt, getLong, getObject, getPathFromSeed, getReports, getSchedulingDirective, getString, getTransHops, getURIString, getUURI, getVia, getViaContext, inheritFrom, isLocation, isSeed, keys, makeHeritable, makeNonHeritable, needsImmediateScheduling, needsSoonScheduling, putInt, putLong, putObject, putString, readUuri, remove, reportTo, reportTo, sameDomainAs, setAList, setClassKey, setForceFetch, setIsSeed, setPathFromSeed, setSchedulingDirective, setVia, singleLineLegend, singleLineReport, singleLineReportTo, toString

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait`

Field Detail

UNCALCULATED

public static final int UNCALCULATED

See Also:: Constant Field Values

MAX_OUTLINKS

public static final int MAX_OUTLINKS

Protection against outlink overflow. Change value by setting alternate maximum in heritrix.properties.

ordinal

protected long ordinal

Monotonically increasing number within a crawl; useful for tending towards breadth-first ordering. Will sometimes be truncated to 48 bits, so behavior beyond roughly 281 trillion instantiated CrawlURIs may be buggy.
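
The "281 trillion" figure is the number of distinct values that fit in 48 bits. The sketch below works through that arithmetic. `OrdinalRangeSketch` and its bit mask are assumptions made for illustration, since where and how Heritrix actually truncates the ordinal is not stated here.

```java
// Hypothetical illustration of a 48-bit truncation; not Heritrix code.
final class OrdinalRangeSketch {
    // Mask that keeps only the low 48 bits of a long.
    static final long MASK_48 = (1L << 48) - 1;

    static long truncateTo48Bits(long ordinal) {
        return ordinal & MASK_48;
    }

    public static void main(String[] args) {
        // Number of distinct 48-bit values: 2^48.
        System.out.println(1L << 48);                   // 281474976710656
        // The first ordinal that no longer fits wraps around to zero.
        System.out.println(truncateTo48Bits(1L << 48)); // 0
    }
}
```

Under this assumption, two ordinals that differ by a multiple of 2^48 become indistinguishable after truncation, which is one way the ordering could go wrong.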

holder

transient java.lang.Object holder

holderKey

transient java.lang.Object holderKey

holderCost

int holderCost

Spot for an integer cost to be placed by an external facility (frontier). The cost is truncated to 8 bits at times, so it should not exceed 255.

outLinks

transient java.util.Collection<java.lang.Object> outLinks

All discovered outbound Links (navlinks, embeds, etc.) Can either contain Link instances or CandidateURI instances, or both. The LinksScoper processor converts Link instances in this collection to CandidateURI instances.

Constructor Detail

CrawlURI

public CrawlURI(UURI uuri)

Create a new instance of CrawlURI from a UURI.

Parameters:: uuri - the UURI to base this CrawlURI on.

CrawlURI

public CrawlURI(CandidateURI caUri,
                long o)

Create a new instance of CrawlURI from a CandidateURI

Parameters:: caUri - the CandidateURI to base this CrawlURI on.; o - Monotonically increasing number within a crawl.

Method Detail

fetchStatusCodesToString

public static java.lang.String fetchStatusCodesToString(int code)

Takes a status code and converts it into a human readable string.

Parameters:: code - the status code
Returns:: a human readable string declaring what the status code is.

getFetchStatus

public int getFetchStatus()

Return the overall/fetch status of this CrawlURI for its current trip through the processing loop.

Returns:: a value from FetchStatusCodes

setFetchStatus

public void setFetchStatus(int newstatus)

Set the overall/fetch status of this CrawlURI for its current trip through the processing loop.

Parameters:: newstatus - a value from FetchStatusCodes

getFetchAttempts

public int getFetchAttempts()

Get the number of attempts at getting the document referenced by this URI.

Returns:: the number of attempts at getting the document referenced by this URI.

incrementFetchAttempts

public int incrementFetchAttempts()

Increment the number of attempts at getting the document referenced by this URI.

Returns:: the number of attempts at getting the document referenced by this URI.

resetFetchAttempts

public void resetFetchAttempts()

Reset fetchAttempts counter.

resetDeferrals

public void resetDeferrals()

Reset deferrals counter.

nextProcessor

public Processor nextProcessor()

Get the next processor to process this URI.

Returns:: the processor that should process this URI next.

nextProcessorChain

public ProcessorChain nextProcessorChain()

Get the processor chain that should be processing this URI after the current chain is finished with it.

Returns:: the next processor chain to process this URI.

setNextProcessor

public void setNextProcessor(Processor processor)

Set the next processor to process this URI.

Parameters:: processor - the next processor to process this URI.

setNextProcessorChain

public void setNextProcessorChain(ProcessorChain nextProcessorChain)

Set the next processor chain to process this URI.

Parameters:: nextProcessorChain - the next processor chain to process this URI.

markPrerequisite

public void markPrerequisite(java.lang.String preq,
                             ProcessorChain lastProcessorChain)
                      throws org.apache.commons.httpclient.URIException

Do all actions associated with setting a CrawlURI as requiring a prerequisite.

Parameters:: lastProcessorChain - Last processor chain reference. This chain is where this CrawlURI goes next.; preq - Object to set a prerequisite.
Throws:: org.apache.commons.httpclient.URIException

setPrerequisiteUri

public void setPrerequisiteUri(java.lang.Object link)

Set a prerequisite for this URI.

A prerequisite is a URI that must be crawled before this URI can be crawled.

Parameters:: link - Link to set as prereq.

getPrerequisiteUri

public java.lang.Object getPrerequisiteUri()

Get the prerequisite for this URI.

A prerequisite is a URI that must be crawled before this URI can be crawled.

Returns:: the prerequisite for this URI or null if no prerequisite.

hasPrerequisiteUri

public boolean hasPrerequisiteUri()

Returns:: True if this CrawlURI has a prerequisite.

isPrerequisite

public boolean isPrerequisite()

Returns true if this CrawlURI is a prerequisite.

Returns:: true if this CrawlURI is a prerequisite.

setPrerequisite

public void setPrerequisite(boolean prerequisite)

Set if this CrawlURI is itself a prerequisite URI.

Parameters:: prerequisite - True if this CrawlURI is itself a prerequisite uri.

getCrawlURIString

public java.lang.String getCrawlURIString()

Returns:: This crawl URI as a string wrapped with 'CrawlURI(' + ')'.

getContentType

public java.lang.String getContentType()

Get the content type of this URI.

Returns:: Fetched URI's content type. May be null.

setContentType

public void setContentType(java.lang.String ct)

Set a fetched uri's content type.

Parameters:: ct - Content type. May be null.

setThreadNumber

public void setThreadNumber(int i)

Set the number of the ToeThread responsible for processing this uri.

Parameters:: i - the ToeThread number.

getThreadNumber

public int getThreadNumber()

Get the number of the ToeThread responsible for processing this uri.

Returns:: the ToeThread number.

incrementDeferrals

public void incrementDeferrals()

Increment the deferral count.

getDeferrals

public int getDeferrals()

Get the deferral count.

Returns:: the deferral count.

stripToMinimal

public void stripToMinimal()

Remove all attributes set on this uri.

This method removes the attribute list.

getContentSize

public long getContentSize()

Get the size in bytes of this URI's recorded content, inclusive of things like protocol headers. It is the responsibility of the classes which fetch the URI to set this value accordingly -- it is not calculated/verified within CrawlURI. This value is consulted in reporting/logging/writing-decisions.

Returns:: contentSize
See Also:: setContentSize(long)

addLocalizedError

public void addLocalizedError(java.lang.String processorName,
                              java.lang.Throwable ex,
                              java.lang.String message)

Make note of a non-fatal error, local to a particular Processor, which should be logged somewhere, but allows processing to continue. This is how you add to the local-error log (the 'localized' in the below is making an error local rather than global, not making a swiss-french version of the error.).

Parameters:: processorName - Name of processor the exception was thrown in.; ex - Throwable to log.; message - Extra message to log beyond exception message.

getClassSimpleName

protected java.lang.String getClassSimpleName(java.lang.Class c)

addAnnotation

public void addAnnotation(java.lang.String annotation)

Add an annotation: an abbreviated indication of something special about this URI that need not be present in every crawl.log line, but should be noted for future reference.

Parameters:: annotation - the annotation to add; should not contain whitespace or a comma
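
The restriction on whitespace and commas makes sense if annotations end up in a single delimited field. The sketch below assumes a comma-joined string. `AnnotationSketch` is a hypothetical stand-in, and its input check is an addition for illustration, since the real method may not validate its argument.

```java
// Hypothetical stand-in showing comma-joined annotations; not Heritrix code.
final class AnnotationSketch {
    private String annotations; // null until the first annotation is added

    void addAnnotation(String annotation) {
        // Assumed guard: reject empty values, whitespace and commas,
        // which would corrupt the joined string.
        if (!annotation.matches("[^\\s,]+")) {
            throw new IllegalArgumentException("bad annotation: " + annotation);
        }
        annotations = (annotations == null)
                ? annotation
                : annotations + "," + annotation;
    }

    String getAnnotations() {
        return annotations == null ? "" : annotations;
    }

    boolean annotationContains(String str2Find) {
        return annotations != null && annotations.contains(str2Find);
    }
}
```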

isTruncatedFetch

public boolean isTruncatedFetch()

TODO: Implement truncation using booleans rather than as this ugly String parse.

Returns:: True if fetch was truncated.

isLengthTruncatedFetch

public boolean isLengthTruncatedFetch()

isTimeTruncatedFetch

public boolean isTimeTruncatedFetch()

isHeaderTruncatedFetch

public boolean isHeaderTruncatedFetch()

annotationContains

protected boolean annotationContains(java.lang.String str2Find)

getAnnotations

public java.lang.String getAnnotations()

Get the annotations set for this uri.

Returns:: the annotations set for this uri.

getEmbedHopCount

public int getEmbedHopCount()

Deprecated.

Get the embedded hop count.

Returns:: the embedded hop count.

getLinkHopCount

public int getLinkHopCount()

Deprecated.

Get the link hop count.

Returns:: the link hop count.

markAsSeed

public void markAsSeed()

Deprecated.

Mark this uri as being a seed.

getUserAgent

public java.lang.String getUserAgent()

Get the user agent to use for crawling this URI. If null the global setting should be used.

Returns:: user agent or null

setUserAgent

public void setUserAgent(java.lang.String string)

Set the user agent to use when crawling this URI. If not set the global settings should be used.

Parameters:: string - user agent to use

skipToProcessor

public void skipToProcessor(ProcessorChain processorChain,
                            Processor processor)

Set which processor should be the next processor to process this uri instead of using the default next processor.

Parameters:: processorChain - the processor chain to skip to.; processor - the processor in the processor chain to skip to.

skipToProcessorChain

public void skipToProcessorChain(ProcessorChain processorChain)

Set which processor chain should be processing this uri next.

Parameters:: processorChain - the processor chain to skip to.

getContentLength

public long getContentLength()

For completed HTTP transactions, the length of the content-body.

Returns:: For completed HTTP transactions, the length of the content-body.

getRecordedSize

public long getRecordedSize()

Get size of data recorded (transferred)

Returns:: recorded data size

setContentSize

public void setContentSize(long l)

Sets the 'content size' for the URI, which is considered inclusive of all recorded material (such as protocol headers) or even material 'virtually' considered (as in material from a previous fetch confirmed unchanged with a server). (In contrast, content-length matches the HTTP definition, that of the enclosed content-body.) Should be set by a fetcher or other processor as soon as the final size of recorded content is known. Setting to an artificial/incorrect value may affect other reporting/processing.

Parameters:: l - Content size.

hasBeenLinkExtracted

public boolean hasBeenLinkExtracted()

If true then a link extractor has already claimed this CrawlURI and performed link extraction on the document content. This does not preclude other link extractors that may have an interest in this CrawlURI from also doing link extraction.

There is an onus on link extractors to set this flag if they have run.

Returns:: True if a processor has performed link extraction on this CrawlURI
See Also:: linkExtractorFinished()

linkExtractorFinished

public void linkExtractorFinished()

Note that link extraction has been performed on this CrawlURI. A processor doing link extraction should invoke this method once it has finished its work. It should invoke it even if no links are extracted. It should only invoke this method if the link extraction was performed on the document body (not the HTTP headers etc.).

See Also:: hasBeenLinkExtracted()

aboutToLog

public void aboutToLog()

Notify CrawlURI it is about to be logged; opportunity for self-annotation

getHttpRecorder

public HttpRecorder getHttpRecorder()

Get the http recorder associated with this uri.

Returns:: Returns the httpRecorder. May be null, but it is set early in FetchHTTP, so there is an issue if it is null.

setHttpRecorder

public void setHttpRecorder(HttpRecorder httpRecorder)

Set the http recorder to be associated with this uri.

Parameters:: httpRecorder - The httpRecorder to set.

isHttpTransaction

public boolean isHttpTransaction()

Return true if this is a http transaction. TODO: Compound this and isPost() method so that there is one place to go to find out if get http, post http, ftp, dns.

Returns:: True if this is a http transaction.

processingCleanup

public void processingCleanup()

Clean up after a run through the processing chain. Called at the end of the processing chain by Frontier#finish. Nulls out any state gathered during processing.

getPersistentAList

public st.ata.util.AList getPersistentAList()

from

public static CrawlURI from(CandidateURI caUri,
                            long ordinal)

Make a CrawlURI from the passed CandidateURI. It's safe to pass a CrawlURI instance; in that case it is simply returned as the result. Otherwise, a new CrawlURI instance is created.

Parameters:: caUri - Candidate URI.; ordinal -
Returns:: A crawlURI made from the passed CandidateURI.
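
The pass-through behavior of from() can be pictured with two small stand-in classes. `CandidateUriSketch` and `CrawlUriSketch` are hypothetical and carry only the fields needed for the example. They are a sketch of the described contract, not the Heritrix classes.

```java
// Hypothetical stand-ins for CandidateURI and CrawlURI; not Heritrix code.
class CandidateUriSketch {
    final String uri;

    CandidateUriSketch(String uri) {
        this.uri = uri;
    }
}

final class CrawlUriSketch extends CandidateUriSketch {
    final long ordinal;

    CrawlUriSketch(CandidateUriSketch caUri, long ordinal) {
        super(caUri.uri);
        this.ordinal = ordinal;
    }

    static CrawlUriSketch from(CandidateUriSketch caUri, long ordinal) {
        // Already the richer type: hand it back untouched.
        if (caUri instanceof CrawlUriSketch) {
            return (CrawlUriSketch) caUri;
        }
        // Otherwise wrap the candidate in a new instance.
        return new CrawlUriSketch(caUri, ordinal);
    }
}
```

Note that in this sketch the ordinal argument is ignored when the argument is already a `CrawlUriSketch`, so the instance keeps its original ordinal.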

getCredentialAvatars

public java.util.Set<CredentialAvatar> getCredentialAvatars()

Returns:: Credential avatars. Null if none set.

hasCredentialAvatars

public boolean hasCredentialAvatars()

Returns:: True if there are avatars attached to this instance.

addCredentialAvatar

public void addCredentialAvatar(CredentialAvatar ca)

Add an avatar. We do lazy instantiation.

Parameters:: ca - Credential avatar to add to set of avatars.

removeCredentialAvatars

public void removeCredentialAvatars()

Remove all credential avatars from this crawl uri.

removeCredentialAvatar

public boolean removeCredentialAvatar(CredentialAvatar ca)

Remove the passed credential avatar from this crawl uri.

Parameters:: ca - Avatar to remove.
Returns:: True if we removed passed parameter. False if no operation performed.

isSuccess

public boolean isSuccess()

Ask this URI if it was a success or not. It only makes sense to call this method after execution of HttpMethod#execute. Any status greater than 0 is regarded as success, except for the caveat below regarding 401s. Use is2XXSuccess() if looking for a status code in the 200 range.

401s caveat: If any rfc2617 credential data is present and we got a 401, assume the data was loaded in FetchHTTP on the expectation that we are to go around the processing chain again. This condition is reported as a failure so that we get another crack at the processing chain, this time making use of the loaded credential data.

Returns:: True if this URI has been successfully processed.
See Also:: is2XXSuccess()
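
Read literally, the rule above treats any positive status as a success, including a 404, unless it is a 401 with rfc2617 credential data attached. The sketch below expresses that rule as plain functions. `SuccessCheckSketch` and its parameters are hypothetical, and the 2xx check is an assumed reading of "in the 2xx range".

```java
// Hypothetical restatement of the documented success rule; not Heritrix code.
final class SuccessCheckSketch {
    static final int SC_UNAUTHORIZED = 401;

    static boolean isSuccess(int fetchStatus, boolean hasRfc2617Credential) {
        // A 401 with credentials loaded is reported as a failure so the
        // URI can go around the processing chain again.
        if (fetchStatus == SC_UNAUTHORIZED && hasRfc2617Credential) {
            return false;
        }
        // Otherwise any positive status counts as success.
        return fetchStatus > 0;
    }

    static boolean is2XXSuccess(int fetchStatus) {
        // Integer division: true for 200 through 299 only.
        return fetchStatus / 100 == 2;
    }
}
```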

is2XXSuccess

public boolean is2XXSuccess()

Returns:: True if status code is in the 2xx range.
See Also:: isSuccess()

hasRfc2617CredentialAvatar

public boolean hasRfc2617CredentialAvatar()

Returns:: True if we have an rfc2617 payload.

setPost

public void setPost(boolean b)

Set whether this URI should be fetched by sending a HTTP POST request. Else a HTTP GET request will be used.

Parameters:: b - Set whether this curi is to be POST'd. Else it's to be GET'd.

isPost

public boolean isPost()

Returns true if this URI should be fetched by sending a HTTP POST request. TODO: Compound this and isHttpTransaction() method so that there is one place to go to find out if get http, post http, ftp, dns.

Returns:: True if this CrawlURI instance is to be posted.

setContentDigest

public void setContentDigest(byte[] digestValue)

Deprecated. Use setContentDigest(String scheme, byte[])

Set the retained content-digest value (usu. SHA1).

Parameters:: digestValue -

setContentDigest

public void setContentDigest(java.lang.String scheme,
                             byte[] digestValue)

getContentDigestSchemeString

public java.lang.String getContentDigestSchemeString()

getContentDigest

public java.lang.Object getContentDigest()

Return the retained content-digest value, if any.

Returns:: Digest value.

getContentDigestString

public java.lang.String getContentDigestString()

setHolder

public void setHolder(java.lang.Object obj)

Remember a 'holder' to which some enclosing/queueing facility has assigned this CrawlURI.

Parameters:: obj -

getHolder

public java.lang.Object getHolder()

Return the 'holder' for the convenience of an external facility.

Returns:: holder

setHolderKey

public void setHolderKey(java.lang.Object obj)

Remember a 'holderKey' which some enclosing/queueing facility has assigned this CrawlURI.

Parameters:: obj -

getHolderKey

public java.lang.Object getHolderKey()

Return the 'holderKey' for convenience of an external facility (Frontier).

Returns:: holderKey

getOrdinal

public long getOrdinal()

Get the ordinal (serial number) assigned at creation.

Returns:: ordinal

getHolderCost

public int getHolderCost()

Return the 'holderCost' for convenience of external facility (frontier)

Returns:: value of holderCost

setHolderCost

public void setHolderCost(int cost)

Remember a 'holderCost' which some enclosing/queueing facility has assigned this CrawlURI

Parameters:: cost - value to remember

getOutLinks

public java.util.Collection<Link> getOutLinks()

Returns discovered links. The returned collection might be empty if no links were discovered, or if something like LinksScoper promoted the links to CandidateURIs. Elements can be removed from the returned collection, but not added. To add a discovered link, use one of the createAndAdd methods or getOutObjects().

Returns:: Collection of all discovered outbound Links

getOutCandidates

public java.util.Collection<CandidateURI> getOutCandidates()

Returns discovered candidate URIs. The returned collection will be empty until something like LinksScoper promotes discovered Links into CandidateURIs. Elements can be removed from the returned collection, but not added. To add a candidate URI, use replaceOutlinks(Collection) or getOutObjects().

Returns:: Collection of candidate URIs

getOutObjects

public java.util.Collection<java.lang.Object> getOutObjects()

Returns all of the outbound objects. The returned Collection will contain Link instances, or CandidateURI instances, or both.

Returns:: the collection of Links and/or CandidateURIs
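
The relationship between getOutObjects() and the typed collections from getOutLinks() and getOutCandidates() (removal allowed, addition not) can be pictured as a filtered view over one mixed collection. The following is a minimal sketch under that assumption. TypedView is a hypothetical helper, not a Heritrix class, and String and Integer stand in for Link and CandidateURI.

```java
import java.util.AbstractCollection;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical helper: a live, type-filtered view over a mixed collection.
final class TypedView<T> extends AbstractCollection<T> {
    private final Collection<Object> backing;
    private final Class<T> type;

    TypedView(Collection<Object> backing, Class<T> type) {
        this.backing = backing;
        this.type = type;
    }

    @Override
    public Iterator<T> iterator() {
        final Iterator<Object> it = backing.iterator();
        return new Iterator<T>() {
            private T lookahead;
            private boolean ready;     // lookahead holds a match not yet returned
            private boolean canRemove; // backing iterator still sits on the last returned element

            @Override
            public boolean hasNext() {
                while (!ready && it.hasNext()) {
                    Object o = it.next();
                    canRemove = false; // the backing iterator has moved on
                    if (type.isInstance(o)) {
                        lookahead = type.cast(o);
                        ready = true;
                    }
                }
                return ready;
            }

            @Override
            public T next() {
                if (!hasNext()) throw new NoSuchElementException();
                ready = false;
                canRemove = true;
                return lookahead;
            }

            @Override
            public void remove() {
                if (!canRemove) throw new IllegalStateException("call remove() right after next()");
                it.remove(); // removes from the shared backing collection
                canRemove = false;
            }
        };
    }

    @Override
    public int size() {
        int n = 0;
        for (Iterator<T> i = iterator(); i.hasNext(); i.next()) n++;
        return n;
    }
    // add() is inherited from AbstractCollection and throws UnsupportedOperationException.

    public static void main(String[] args) {
        Collection<Object> out = new ArrayList<>();
        out.add("link-a");
        out.add(42);
        out.add("link-b");
        Collection<String> links = new TypedView<>(out, String.class);
        links.remove("link-a");
        System.out.println(links.size() + " link(s), " + out.size() + " object(s) in total");
    }
}
```

Because the view writes removals through to the backing collection, anything that must add an element has to go through the backing collection itself, which matches the advice above to use getOutObjects() for additions.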

addOutLink

public void addOutLink(Link link)

Add a discovered Link, unless it would exceed the max number to accept. (If so, increment discarded link counter.)

Parameters:: link - the Link to add
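
The cap-and-count behaviour described for addOutLink can be sketched as follows. CappedOutlinks is a hypothetical stand-in, its constructor argument plays the role of MAX_OUTLINKS, and the discarded() accessor is an assumption made for the example; none of these names come from the class documented here.

```java
import java.util.ArrayList;
import java.util.Collection;

// Hypothetical stand-in for the outlink bookkeeping described above.
final class CappedOutlinks<T> {
    private final int maxOutlinks;                          // plays the role of MAX_OUTLINKS
    private final Collection<T> outLinks = new ArrayList<>();
    private int discardedOutlinks;                          // links dropped once the cap is reached

    CappedOutlinks(int maxOutlinks) {
        this.maxOutlinks = maxOutlinks;
    }

    // Accept the link while under the cap; otherwise only count it as discarded.
    void addOutLink(T link) {
        if (outLinks.size() < maxOutlinks) {
            outLinks.add(link);
        } else {
            discardedOutlinks++;
        }
    }

    int outlinksSize() {
        return outLinks.size();
    }

    int discarded() {
        return discardedOutlinks;
    }

    public static void main(String[] args) {
        CappedOutlinks<String> links = new CappedOutlinks<>(2);
        links.addOutLink("http://example.com/1");
        links.addOutLink("http://example.com/2");
        links.addOutLink("http://example.com/3"); // over the cap, counted but not stored
        System.out.println(links.outlinksSize() + " kept, " + links.discarded() + " discarded");
    }
}
```

Counting discarded links instead of silently dropping them keeps the overflow visible to later reporting.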

clearOutlinks

public void clearOutlinks()

replaceOutlinks

public void replaceOutlinks(java.util.Collection<CandidateURI> links)

Replace the current collection of links with the passed collection. Used by Scopers adjusting the list of links (removing those not in scope and promoting Links to CandidateURIs).

Parameters:: links - a collection of CandidateURIs replacing any previously existing outLinks or outCandidates

outlinksSize

public int outlinksSize()

Returns:: Count of outlinks.

createLink

public Link createLink(java.lang.String url,
                       java.lang.CharSequence context,
                       char hopType)
                throws org.apache.commons.httpclient.URIException

Convenience method for creating a Link discovered at this URI with the given string and context.

Parameters:: url - String to use to create Link; context - CharSequence context to use; hopType - char hop-type indicator
Returns:: Link.
Throws:: org.apache.commons.httpclient.URIException - if Link UURI cannot be constructed

createAndAddLink

public void createAndAddLink(java.lang.String url,
                             java.lang.CharSequence context,
                             char hopType)
                      throws org.apache.commons.httpclient.URIException

Convenience method for creating a Link with the given string and context, and adding it to the discovered outlinks.

Parameters:: url - String to use to create Link; context - CharSequence context to use; hopType - char hop-type indicator
Throws:: org.apache.commons.httpclient.URIException - if Link UURI cannot be constructed

createAndAddLinkRelativeToBase

public void createAndAddLinkRelativeToBase(java.lang.String url,
                                           java.lang.CharSequence context,
                                           char hopType)
                                    throws org.apache.commons.httpclient.URIException

Convenience method for creating a Link with the given string and context, relative to a previously set base HREF if available (or relative to the current CrawlURI if no other base has been set).

Parameters:: url - String URL to add as destination of link; context - String context where link was discovered; hopType - char hop-type indicator
Throws:: org.apache.commons.httpclient.URIException

createAndAddLinkRelativeToVia

public void createAndAddLinkRelativeToVia(java.lang.String url,
                                          java.lang.CharSequence context,
                                          char hopType)
                                   throws org.apache.commons.httpclient.URIException

Convenience method for creating a Link with the given string and context, relative to this CrawlURI's via UURI if available. (If a via is not available, falls back to using createAndAddLinkRelativeToBase(String, CharSequence, char).)

Parameters:: url - String URL to add as destination of link; context - String context where link was discovered; hopType - char hop-type indicator
Throws:: org.apache.commons.httpclient.URIException
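
The fallback order described for the two relative variants (via, then a previously set base, then the current URI) can be sketched with java.net.URI. LinkResolver and its method names are hypothetical and exist only for this example. The real methods work with UURI and throw URIException, whereas URI.create throws IllegalArgumentException on a malformed string.

```java
import java.net.URI;

// Hypothetical stand-in showing only the resolution order, not Link creation.
final class LinkResolver {
    private final URI self; // the URI currently being processed
    private URI base;       // optional, e.g. taken from an HTML base href
    private URI via;        // optional, the URI this one was discovered from

    LinkResolver(String self) {
        this.self = URI.create(self);
    }

    void setBase(String baseHref) {
        this.base = URI.create(baseHref);
    }

    void setVia(String viaUri) {
        this.via = URI.create(viaUri);
    }

    // Resolve against the base if one was set, otherwise against the current URI.
    URI relativeToBase(String url) {
        return (base != null ? base : self).resolve(url);
    }

    // Resolve against the via if present, otherwise fall back to the base rule.
    URI relativeToVia(String url) {
        return via != null ? via.resolve(url) : relativeToBase(url);
    }

    public static void main(String[] args) {
        LinkResolver r = new LinkResolver("http://example.com/a/b.html");
        System.out.println(r.relativeToBase("c.png"));
        r.setBase("http://cdn.example.org/assets/");
        System.out.println(r.relativeToBase("img/x.png"));
        r.setVia("http://example.com/dir/page.html");
        System.out.println(r.relativeToVia("style.css"));
    }
}
```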

setBaseURI

public void setBaseURI(java.lang.String baseHref)
                throws org.apache.commons.httpclient.URIException

Set the (HTML) Base URI used for derelativizing internal URIs.

Parameters:: baseHref - String base href to use
Throws:: org.apache.commons.httpclient.URIException - if supplied string cannot be interpreted as URI

getBaseURI

public UURI getBaseURI()

Get the (HTML) Base URI used for derelativizing internal URIs.

Returns:: UURI base URI previously set

addAlistPersistentMember

public static void addAlistPersistentMember(java.lang.Object key)

Add the key of alist items you want to persist across processings.

Parameters:: key - Key to add.

removeAlistPersistentMember

public static boolean removeAlistPersistentMember(java.lang.Object key)

Parameters:: key - Key to remove.
Returns:: True if list contained the element.
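
One way to picture the persistent-key list is as a shared set consulted when per-URI attributes are cleaned up between processings. The sketch below assumes exactly that. AttributeBag, its method names, and the cleanup() step are hypothetical, and the actual cleanup performed by CrawlURI is not described in this section.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an attribute list with registered persistent keys.
final class AttributeBag {
    // Shared by all instances, mirroring the static add/remove methods above.
    // A plain HashSet is not thread-safe; that is acceptable for a sketch.
    private static final Set<Object> PERSISTENT_KEYS = new HashSet<>();

    private final Map<Object, Object> alist = new HashMap<>();

    static void addPersistentKey(Object key) {
        PERSISTENT_KEYS.add(key);
    }

    // Returns true if the key had been registered.
    static boolean removePersistentKey(Object key) {
        return PERSISTENT_KEYS.remove(key);
    }

    void put(Object key, Object value) {
        alist.put(key, value);
    }

    Object get(Object key) {
        return alist.get(key);
    }

    // Assumed cleanup step: drop every attribute whose key is not registered as persistent.
    void cleanup() {
        alist.keySet().retainAll(PERSISTENT_KEYS);
    }

    public static void main(String[] args) {
        AttributeBag.addPersistentKey("etag");
        AttributeBag bag = new AttributeBag();
        bag.put("etag", "abc123");
        bag.put("scratch", "temporary");
        bag.cleanup();
        System.out.println("etag=" + bag.get("etag") + ", scratch=" + bag.get("scratch"));
    }
}
```

Because the key set is static, registering a key affects every instance, so registration is best done once during setup.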

getFetchDuration

public long getFetchDuration()
