java.lang.Object
  org.archive.crawler.datamodel.CandidateURI
    org.archive.crawler.datamodel.CrawlURI
public class CrawlURI
Represents a candidate URI and the associated state it collects as it is crawled.
Core state is in instance variables but a flexible
attribute list is also available. Use this 'bucket' to carry
custom processing extracted data and state across CrawlURI
processing. See CandidateURI.putString(String, String), CandidateURI.getString(String), etc.
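The attribute 'bucket' can be pictured as a keyed map that travels with the URI from one processing step to the next. The following is a minimal, self-contained sketch, not Heritrix code: the class `AttributeBucketSketch`, its `HashMap` backing store, and the key used in `main` are assumptions made for illustration. Only the method names `putString` and `getString` come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the attribute list carried by a CrawlURI.
// A plain HashMap is assumed here purely for illustration.
final class AttributeBucketSketch {
    private final Map<String, Object> attributes = new HashMap<>();

    // Same shape as CandidateURI.putString(String, String).
    void putString(String key, String value) {
        attributes.put(key, value);
    }

    // Same shape as CandidateURI.getString(String).
    // This sketch returns null for a missing key; the real class may differ.
    String getString(String key) {
        return (String) attributes.get(key);
    }

    boolean containsKey(String key) {
        return attributes.containsKey(key);
    }

    public static void main(String[] args) {
        AttributeBucketSketch curi = new AttributeBucketSketch();
        // One processing step records a custom value ...
        curi.putString("my-extractor-note", "found-3-frames");
        // ... and a later step reads it back.
        System.out.println(curi.getString("my-extractor-note"));
    }
}
```

Checking `containsKey` before reading is a cautious pattern when it is not known whether an earlier step stored the value.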
Field Summary | |
---|---|
(package private) java.lang.Object |
holder
|
(package private) int |
holderCost
Spot for an integer cost to be placed by an external facility (frontier). |
(package private) java.lang.Object |
holderKey
|
static int |
MAX_OUTLINKS
Protection against outlink overflow. |
protected long |
ordinal
Monotonically increasing number within a crawl; useful for tending towards breadth-first ordering. |
(package private) java.util.Collection<java.lang.Object> |
outLinks
All discovered outbound Links (navlinks, embeds, etc.) Can either contain Link instances or CandidateURI instances, or both. |
static int |
UNCALCULATED
|
Fields inherited from class org.archive.crawler.datamodel.CandidateURI |
---|
HIGH, HIGHEST, MEDIUM, NORMAL |
Constructor Summary | |
---|---|
CrawlURI(CandidateURI caUri,
long o)
Create a new instance of CrawlURI from a CandidateURI |
|
CrawlURI(UURI uuri)
Create a new instance of CrawlURI from a UURI . |
Method Summary | |
---|---|
void |
aboutToLog()
Notify CrawlURI it is about to be logged; opportunity for self-annotation |
static void |
addAlistPersistentMember(java.lang.Object key)
Add the key of alist items you want to persist across processings. |
void |
addAnnotation(java.lang.String annotation)
Add an annotation: an abbreviated indication of something special about this URI that need not be present in every crawl.log line, but should be noted for future reference. |
void |
addCredentialAvatar(CredentialAvatar ca)
Add an avatar. |
void |
addLocalizedError(java.lang.String processorName,
java.lang.Throwable ex,
java.lang.String message)
Make note of a non-fatal error, local to a particular Processor, which should be logged somewhere, but allows processing to continue. |
void |
addOutLink(Link link)
Add a discovered Link, unless it would exceed the max number to accept. |
protected boolean |
annotationContains(java.lang.String str2Find)
|
void |
clearOutlinks()
|
void |
createAndAddLink(java.lang.String url,
java.lang.CharSequence context,
char hopType)
Convenience method for creating a Link with the given string and context |
void |
createAndAddLinkRelativeToBase(java.lang.String url,
java.lang.CharSequence context,
char hopType)
Convenience method for creating a Link with the given string and context, relative to a previously set base HREF if available (or relative to the current CrawlURI if no other base has been set) |
void |
createAndAddLinkRelativeToVia(java.lang.String url,
java.lang.CharSequence context,
char hopType)
Convenience method for creating a Link with the given string and context, relative to this CrawlURI's via UURI if available. |
Link |
createLink(java.lang.String url,
java.lang.CharSequence context,
char hopType)
Convenience method for creating a Link discovered at this URI with the given string and context |
static java.lang.String |
fetchStatusCodesToString(int code)
Takes a status code and converts it into a human readable string. |
static CrawlURI |
from(CandidateURI caUri,
long ordinal)
Make a CrawlURI from the passed CandidateURI . |
java.lang.String |
getAnnotations()
Get the annotations set for this uri. |
UURI |
getBaseURI()
Get the (HTML) Base URI used for derelativizing internal URIs. |
protected java.lang.String |
getClassSimpleName(java.lang.Class c)
|
java.lang.Object |
getContentDigest()
Return the retained content-digest value, if any. |
java.lang.String |
getContentDigestSchemeString()
|
java.lang.String |
getContentDigestString()
|
long |
getContentLength()
For completed HTTP transactions, the length of the content-body. |
long |
getContentSize()
Get the size in bytes of this URI's recorded content, inclusive of things like protocol headers. |
java.lang.String |
getContentType()
Get the content type of this URI. |
java.lang.String |
getCrawlURIString()
|
java.util.Set<CredentialAvatar> |
getCredentialAvatars()
|
int |
getDeferrals()
Get the deferral count. |
int |
getEmbedHopCount()
Deprecated. |
int |
getFetchAttempts()
Get the number of attempts at getting the document referenced by this URI. |
long |
getFetchDuration()
|
int |
getFetchStatus()
Return the overall/fetch status of this CrawlURI for its current trip through the processing loop. |
java.lang.Object |
getHolder()
Return the 'holder' for the convenience of an external facility. |
int |
getHolderCost()
Return the 'holderCost' for convenience of external facility (frontier) |
java.lang.Object |
getHolderKey()
Return the 'holderKey' for convenience of an external facility (Frontier). |
HttpRecorder |
getHttpRecorder()
Get the http recorder associated with this uri. |
int |
getLinkHopCount()
Deprecated. |
long |
getOrdinal()
Get the ordinal (serial number) assigned at creation. |
java.util.Collection<CandidateURI> |
getOutCandidates()
Returns discovered candidate URIs. |
java.util.Collection<Link> |
getOutLinks()
Returns discovered links. |
java.util.Collection<java.lang.Object> |
getOutObjects()
Returns all of the outbound objects. |
st.ata.util.AList |
getPersistentAList()
|
java.lang.Object |
getPrerequisiteUri()
Get the prerequisite for this URI. |
long |
getRecordedSize()
Get size of data recorded (transferred) |
int |
getThreadNumber()
Get the number of the ToeThread responsible for processing this uri. |
java.lang.String |
getUserAgent()
Get the user agent to use for crawling this URI. |
boolean |
hasBeenLinkExtracted()
If true then a link extractor has already claimed this CrawlURI and performed link extraction on the document content. |
boolean |
hasCredentialAvatars()
|
boolean |
hasPrerequisiteUri()
|
boolean |
hasRfc2617CredentialAvatar()
|
void |
incrementDeferrals()
Increment the deferral count. |
int |
incrementFetchAttempts()
Increment the number of attempts at getting the document referenced by this URI. |
boolean |
is2XXSuccess()
|
boolean |
isHeaderTruncatedFetch()
|
boolean |
isHttpTransaction()
Return true if this is a http transaction. |
boolean |
isLengthTruncatedFetch()
|
boolean |
isPost()
Returns true if this URI should be fetched by sending a HTTP POST request. |
boolean |
isPrerequisite()
Returns true if this CrawlURI is a prerequisite. |
boolean |
isSuccess()
Ask this URI if it was a success or not. |
boolean |
isTimeTruncatedFetch()
|
boolean |
isTruncatedFetch()
TODO: Implement truncation using booleans rather than as this ugly String parse. |
void |
linkExtractorFinished()
Note that link extraction has been performed on this CrawlURI. |
void |
markAsSeed()
Deprecated. |
void |
markPrerequisite(java.lang.String preq,
ProcessorChain lastProcessorChain)
Do all actions associated with setting a CrawlURI as
requiring a prerequisite. |
Processor |
nextProcessor()
Get the next processor to process this URI. |
ProcessorChain |
nextProcessorChain()
Get the processor chain that should be processing this URI after the current chain is finished with it. |
int |
outlinksSize()
|
void |
processingCleanup()
Clean up after a run through the processing chain. |
static boolean |
removeAlistPersistentMember(java.lang.Object key)
|
boolean |
removeCredentialAvatar(CredentialAvatar ca)
Remove the given credential avatar from this crawl uri. |
void |
removeCredentialAvatars()
Remove all credential avatars from this crawl uri. |
void |
replaceOutlinks(java.util.Collection<CandidateURI> links)
Replace the current collection of links with the passed list. |
void |
resetDeferrals()
Reset deferrals counter. |
void |
resetFetchAttempts()
Reset fetchAttempts counter. |
void |
setBaseURI(java.lang.String baseHref)
Set the (HTML) Base URI used for derelativizing internal URIs. |
void |
setContentDigest(byte[] digestValue)
Deprecated. Use setContentDigest(String scheme, byte[]) |
void |
setContentDigest(java.lang.String scheme,
byte[] digestValue)
|
void |
setContentSize(long l)
Sets the 'content size' for the URI, which is considered inclusive of all recorded material (such as protocol headers) or even material 'virtually' considered (as in material from a previous fetch confirmed unchanged with a server). |
void |
setContentType(java.lang.String ct)
Set a fetched uri's content type. |
void |
setFetchStatus(int newstatus)
Set the overall/fetch status of this CrawlURI for its current trip through the processing loop. |
void |
setHolder(java.lang.Object obj)
Remember a 'holder' to which some enclosing/queueing facility has assigned this CrawlURI . |
void |
setHolderCost(int cost)
Remember a 'holderCost' which some enclosing/queueing facility has assigned this CrawlURI |
void |
setHolderKey(java.lang.Object obj)
Remember a 'holderKey' which some enclosing/queueing facility has assigned this CrawlURI . |
void |
setHttpRecorder(HttpRecorder httpRecorder)
Set the http recorder to be associated with this uri. |
void |
setNextProcessor(Processor processor)
Set the next processor to process this URI. |
void |
setNextProcessorChain(ProcessorChain nextProcessorChain)
Set the next processor chain to process this URI. |
void |
setPost(boolean b)
Set whether this URI should be fetched by sending a HTTP POST request. |
void |
setPrerequisite(boolean prerequisite)
Set if this CrawlURI is itself a prerequisite URI. |
void |
setPrerequisiteUri(java.lang.Object link)
Set a prerequisite for this URI. |
void |
setThreadNumber(int i)
Set the number of the ToeThread responsible for processing this uri. |
void |
setUserAgent(java.lang.String string)
Set the user agent to use when crawling this URI. |
void |
skipToProcessor(ProcessorChain processorChain,
Processor processor)
Set which processor should be the next processor to process this uri instead of using the default next processor. |
void |
skipToProcessorChain(ProcessorChain processorChain)
Set which processor chain should be processing this uri next. |
void |
stripToMinimal()
Remove all attributes set on this uri. |
Methods inherited from class org.archive.crawler.datamodel.CandidateURI |
---|
clearAList, containsKey, createCandidateURI, createCandidateURI, createSeedCandidateURI, flattenVia, forceFetch, fromString, getAList, getCandidateURIString, getClassKey, getInt, getLong, getObject, getPathFromSeed, getReports, getSchedulingDirective, getString, getTransHops, getURIString, getUURI, getVia, getViaContext, inheritFrom, isLocation, isSeed, keys, makeHeritable, makeNonHeritable, needsImmediateScheduling, needsSoonScheduling, putInt, putLong, putObject, putString, readUuri, remove, reportTo, reportTo, sameDomainAs, setAList, setClassKey, setForceFetch, setIsSeed, setPathFromSeed, setSchedulingDirective, setVia, singleLineLegend, singleLineReport, singleLineReportTo, toString |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
Field Detail |
---|
public static final int UNCALCULATED
public static final int MAX_OUTLINKS
protected long ordinal
transient java.lang.Object holder
transient java.lang.Object holderKey
int holderCost
transient java.util.Collection<java.lang.Object> outLinks
Constructor Detail |
---|
public CrawlURI(UURI uuri)
Create a new instance of CrawlURI from a UURI.
uuri
- the UURI to base this CrawlURI on.
public CrawlURI(CandidateURI caUri, long o)
Create a new instance of CrawlURI from a CandidateURI.
caUri
- the CandidateURI to base this CrawlURI on.
o
- Monotonically increasing number within a crawl.
Method Detail |
---|
public static java.lang.String fetchStatusCodesToString(int code)
code
- the status code
public int getFetchStatus()
public void setFetchStatus(int newstatus)
newstatus
- a value from FetchStatusCodes
public int getFetchAttempts()
public int incrementFetchAttempts()
public void resetFetchAttempts()
public void resetDeferrals()
public Processor nextProcessor()
public ProcessorChain nextProcessorChain()
public void setNextProcessor(Processor processor)
processor
- the next processor to process this URI.
public void setNextProcessorChain(ProcessorChain nextProcessorChain)
nextProcessorChain
- the next processor chain to process this URI.
public void markPrerequisite(java.lang.String preq, ProcessorChain lastProcessorChain) throws org.apache.commons.httpclient.URIException
Do all actions associated with setting a CrawlURI as requiring a prerequisite.
lastProcessorChain
- Last processor chain reference. This chain is where this CrawlURI goes next.
preq
- Object to set a prerequisite.
org.apache.commons.httpclient.URIException
public void setPrerequisiteUri(java.lang.Object link)
A prerequisite is a URI that must be crawled before this URI can be crawled.
link
- Link to set as prereq.
public java.lang.Object getPrerequisiteUri()
A prerequisite is a URI that must be crawled before this URI can be crawled.
public boolean hasPrerequisiteUri()
public boolean isPrerequisite()
public void setPrerequisite(boolean prerequisite)
prerequisite
- True if this CrawlURI is itself a prerequisite uri.
public java.lang.String getCrawlURIString()
public java.lang.String getContentType()
public void setContentType(java.lang.String ct)
ct
- Content type. May be null.
public void setThreadNumber(int i)
i
- the ToeThread number.
public int getThreadNumber()
public void incrementDeferrals()
public int getDeferrals()
public void stripToMinimal()
This method removes the attribute list.
public long getContentSize()
setContentSize(long)
public void addLocalizedError(java.lang.String processorName, java.lang.Throwable ex, java.lang.String message)
processorName
- Name of the processor in which the exception was thrown.
ex
- Throwable to log.
message
- Extra message to log beyond exception message.
protected java.lang.String getClassSimpleName(java.lang.Class c)
public void addAnnotation(java.lang.String annotation)
annotation
- the annotation to add; should not contain whitespace or a comma
public boolean isTruncatedFetch()
public boolean isLengthTruncatedFetch()
public boolean isTimeTruncatedFetch()
public boolean isHeaderTruncatedFetch()
protected boolean annotationContains(java.lang.String str2Find)
public java.lang.String getAnnotations()
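The method summary describes isTruncatedFetch() as a String parse, and annotationContains(String) looks like the helper behind it. The sketch below shows one way such a scheme could work. It is an assumption-laden illustration rather than the class's actual code: storing annotations as a single comma-separated string is inferred from the rule that an annotation should not contain whitespace or a comma, and the marker tokens `lenTrunc`, `timeTrunc` and `Trunc` are hypothetical.

```java
// Hypothetical stand-in: annotations kept as one comma-separated string.
final class AnnotationSketch {
    private String annotations = "";

    // Appends the annotation, separating entries with a comma.
    void addAnnotation(String annotation) {
        annotations = annotations.isEmpty()
                ? annotation
                : annotations + "," + annotation;
    }

    String getAnnotations() {
        return annotations;
    }

    // Plain substring search over the joined annotation string.
    boolean annotationContains(String str2Find) {
        return annotations.contains(str2Find);
    }

    // The marker tokens below are assumptions made for this sketch.
    boolean isLengthTruncatedFetch() {
        return annotationContains("lenTrunc");
    }

    boolean isTimeTruncatedFetch() {
        return annotationContains("timeTrunc");
    }

    // Any marker sharing the assumed "Trunc" suffix counts as truncation.
    boolean isTruncatedFetch() {
        return annotationContains("Trunc");
    }
}
```

A substring test of this kind is cheap to write but can match unrelated annotations that happen to contain the marker, which is one reason the summary's TODO suggests booleans instead.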
public int getEmbedHopCount()
public int getLinkHopCount()
public void markAsSeed()
public java.lang.String getUserAgent()
public void setUserAgent(java.lang.String string)
string
- user agent to use
public void skipToProcessor(ProcessorChain processorChain, Processor processor)
processorChain
- the processor chain to skip to.
processor
- the processor in the processor chain to skip to.
public void skipToProcessorChain(ProcessorChain processorChain)
processorChain
- the processor chain to skip to.
public long getContentLength()
public long getRecordedSize()
public void setContentSize(long l)
l
- Content size.
public boolean hasBeenLinkExtracted()
There is an onus on link extractors to set this flag if they have run.
linkExtractorFinished()
public void linkExtractorFinished()
hasBeenLinkExtracted()
public void aboutToLog()
public HttpRecorder getHttpRecorder()
public void setHttpRecorder(HttpRecorder httpRecorder)
httpRecorder
- The httpRecorder to set.
public boolean isHttpTransaction()
Return true if this is a http transaction. TODO: Combine this with the isPost() method so that there is one place to go to find out if get http, post http, ftp, dns.
public void processingCleanup()
public st.ata.util.AList getPersistentAList()
public static CrawlURI from(CandidateURI caUri, long ordinal)
Make a CrawlURI from the passed CandidateURI.
It's safe to pass a CrawlURI instance. In this case we just return it as a result. Otherwise, we create a new CrawlURI instance.
caUri
- Candidate URI.
ordinal
-
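The pass-through behaviour of from(CandidateURI, long) can be sketched with two small stand-in classes. `CandidateSketch` and `CrawlSketch` are hypothetical names that only mirror the subclass relationship. The sketch illustrates the described behaviour and is not the actual implementation.

```java
// Hypothetical minimal stand-in for CandidateURI.
class CandidateSketch {
    final String uri;

    CandidateSketch(String uri) {
        this.uri = uri;
    }
}

// Hypothetical minimal stand-in for CrawlURI, a subclass of the candidate.
final class CrawlSketch extends CandidateSketch {
    final long ordinal;

    CrawlSketch(CandidateSketch ca, long ordinal) {
        super(ca.uri);
        this.ordinal = ordinal;
    }

    // Returns the argument itself when it is already a CrawlSketch;
    // otherwise wraps it in a new instance carrying the given ordinal.
    static CrawlSketch from(CandidateSketch ca, long ordinal) {
        if (ca instanceof CrawlSketch) {
            return (CrawlSketch) ca;
        }
        return new CrawlSketch(ca, ordinal);
    }
}
```

In this sketch the ordinal argument is ignored when the input is already a `CrawlSketch`, so the instance keeps the ordinal it was created with.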
public java.util.Set<CredentialAvatar> getCredentialAvatars()
public boolean hasCredentialAvatars()
public void addCredentialAvatar(CredentialAvatar ca)
ca
- Credential avatar to add to set of avatars.
public void removeCredentialAvatars()
public boolean removeCredentialAvatar(CredentialAvatar ca)
ca
- Avatar to remove.
public boolean isSuccess()
Ask this URI if it was a success or not. Use is2XXSuccess() if looking for a status code in the 200 range.
401s caveat: If any RFC 2617 credential data is present and we got a 401, assume it was loaded in FetchHTTP on the expectation that we are to go around the processing chain again. Report this condition as a failure so that we get another crack at the processing chain, this time making use of the loaded credential data.
is2XXSuccess()
public boolean is2XXSuccess()
isSuccess()
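The 401 caveat and the 200-range check can be sketched as two small predicates. This is a hedged illustration only. The text above does not state the general success rule, so treating any positive status as success is an assumption of this sketch, and `SuccessSketch` and its parameter names are hypothetical.

```java
// Hypothetical stand-in for the success checks described above.
final class SuccessSketch {

    // True only for status codes in the 200 range.
    static boolean is2XXSuccess(int status) {
        return status >= 200 && status < 300;
    }

    // ASSUMPTION: any positive status counts as success in this sketch.
    // The one documented exception is modelled explicitly: a 401 with
    // RFC 2617 credential data loaded is reported as a failure so the
    // URI goes around the processing chain again.
    static boolean isSuccess(int status, boolean hasRfc2617CredentialAvatar) {
        if (status == 401 && hasRfc2617CredentialAvatar) {
            return false;
        }
        return status > 0;
    }
}
```

Under these assumptions a 404 is a "success" in the broad sense (a response was received) but not a 2XX success, which is why the two methods are kept separate.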
public boolean hasRfc2617CredentialAvatar()
public void setPost(boolean b)
b
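- Set whether this curi is to be POST'd. Otherwise it is to be GET'd.
public boolean isPost()
Returns true if this URI should be fetched by sending a HTTP POST request. TODO: Combine this with the isHttpTransaction() method so that there is one place to go to find out if get http, post http, ftp, dns.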
public void setContentDigest(byte[] digestValue)
Deprecated. Use setContentDigest(String scheme, byte[])
digestValue
-
public void setContentDigest(java.lang.String scheme, byte[] digestValue)
public java.lang.String getContentDigestSchemeString()
public java.lang.Object getContentDigest()
public java.lang.String getContentDigestString()
public void setHolder(java.lang.Object obj)
obj
-
public java.lang.Object getHolder()
public void setHolderKey(java.lang.Object obj)
obj
-
public java.lang.Object getHolderKey()
public long getOrdinal()
public int getHolderCost()
public void setHolderCost(int cost)
cost
- value to remember
public java.util.Collection<Link> getOutLinks()
Returns discovered links. See getOutObjects().
public java.util.Collection<CandidateURI> getOutCandidates()
Returns discovered candidate URIs. See replaceOutlinks(Collection) or getOutObjects().
public java.util.Collection<java.lang.Object> getOutObjects()
public void addOutLink(Link link)
link
- the Link to add
public void clearOutlinks()
public void replaceOutlinks(java.util.Collection<CandidateURI> links)
links
- a collection of CandidateURIs replacing any previously existing outLinks or outCandidates
public int outlinksSize()
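The outlink bookkeeping above (add unless the cap is reached, clear, replace, size) can be sketched as follows. This is an illustrative stand-in, not the class's code: `OutlinkSketch` is a hypothetical name, links are plain `Object`s, and the `MAX_OUTLINKS` value of 3 is chosen only to keep the example small, since the page does not give the real value.

```java
import java.util.ArrayList;
import java.util.Collection;

// Hypothetical stand-in for the outlink collection of a CrawlURI.
final class OutlinkSketch {
    // Illustrative value only; the real constant's value is not shown here.
    static final int MAX_OUTLINKS = 3;

    private final Collection<Object> outLinks = new ArrayList<>();

    // Adds the link unless doing so would exceed the cap;
    // links beyond the cap are silently dropped in this sketch.
    void addOutLink(Object link) {
        if (outLinks.size() < MAX_OUTLINKS) {
            outLinks.add(link);
        }
    }

    void clearOutlinks() {
        outLinks.clear();
    }

    // Discards whatever was collected and takes over the passed elements.
    // Whether the cap applies on replacement is not stated; it is not applied here.
    void replaceOutlinks(Collection<?> links) {
        outLinks.clear();
        outLinks.addAll(links);
    }

    int outlinksSize() {
        return outLinks.size();
    }
}
```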
public Link createLink(java.lang.String url, java.lang.CharSequence context, char hopType) throws org.apache.commons.httpclient.URIException
url
- String to use to create Link
context
- CharSequence context to use
hopType
-
org.apache.commons.httpclient.URIException
- if Link UURI cannot be constructed
public void createAndAddLink(java.lang.String url, java.lang.CharSequence context, char hopType) throws org.apache.commons.httpclient.URIException
url
- String to use to create Link
context
- CharSequence context to use
hopType
-
org.apache.commons.httpclient.URIException
- if Link UURI cannot be constructed
public void createAndAddLinkRelativeToBase(java.lang.String url, java.lang.CharSequence context, char hopType) throws org.apache.commons.httpclient.URIException
url
- String URL to add as destination of link
context
- String context where link was discovered
hopType
- char hop-type indicator
org.apache.commons.httpclient.URIException
public void createAndAddLinkRelativeToVia(java.lang.String url, java.lang.CharSequence context, char hopType) throws org.apache.commons.httpclient.URIException
url
- String URL to add as destination of link
context
- String context where link was discovered
hopType
- char hop-type indicator
org.apache.commons.httpclient.URIException
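The fallback described for createAndAddLinkRelativeToBase (use a previously set base HREF if there is one, otherwise the URI itself) can be sketched with the JDK's `java.net.URI` standing in for UURI. This is an assumption-laden illustration: `RelativeLinkSketch`, `effectiveBase` and `resolveRelativeToBase` are hypothetical names, and `URI.create` reports malformed input with `IllegalArgumentException` rather than the `URIException` declared above.

```java
import java.net.URI;

// Hypothetical stand-in showing base-relative link resolution.
final class RelativeLinkSketch {
    private final URI self; // the URI of the document being processed
    private URI base;       // stays null until a base HREF is set

    RelativeLinkSketch(String uri) {
        this.self = URI.create(uri);
    }

    // Analogous in shape to setBaseURI(String baseHref).
    void setBaseURI(String baseHref) {
        this.base = URI.create(baseHref);
    }

    // The explicitly set base if present, otherwise the document's own URI.
    URI effectiveBase() {
        return base != null ? base : self;
    }

    // Resolves a possibly relative URL against the effective base.
    URI resolveRelativeToBase(String url) {
        return effectiveBase().resolve(url);
    }

    public static void main(String[] args) {
        RelativeLinkSketch page = new RelativeLinkSketch("http://example.com/a/b.html");
        System.out.println(page.resolveRelativeToBase("c.png"));
        page.setBaseURI("http://example.com/img/");
        System.out.println(page.resolveRelativeToBase("c.png"));
    }
}
```

Resolving against the via URI, as createAndAddLinkRelativeToVia does, would follow the same pattern with a different base.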
public void setBaseURI(java.lang.String baseHref) throws org.apache.commons.httpclient.URIException
baseHref
- String base href to use
org.apache.commons.httpclient.URIException
- if supplied string cannot be interpreted as URI
public UURI getBaseURI()
public static void addAlistPersistentMember(java.lang.Object key)
key
- Key to add.
public static boolean removeAlistPersistentMember(java.lang.Object key)
key
- Key to remove.
public long getFetchDuration()