java.lang.Object
  org.archive.crawler.util.FPMergeUriUniqFilter
public abstract class FPMergeUriUniqFilter extends java.lang.Object implements UriUniqFilter
UriUniqFilter based on merging FP arrays (in memory or from disk). Inspired by the approach in Najork and Heydon, "High-Performance Web Crawling" (2001), section 3.2, "Efficient Duplicate URL Eliminators".
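The merge idea can be illustrated with a minimal sketch, assuming both the known and the pending fingerprints are held as sorted, duplicate-free `long` arrays. The class and method names here are hypothetical and are not part of Heritrix; the sketch only shows why sorted inputs allow duplicates to be detected in a single linear pass.

```java
import java.util.Arrays;

/** Hypothetical illustration of a sorted fingerprint merge; not the Heritrix code. */
class SortedFpMerge {
    /** Returns the sorted union of two sorted, duplicate-free arrays. */
    static long[] merge(long[] known, long[] pending) {
        long[] out = new long[known.length + pending.length];
        int k = 0, p = 0, n = 0;
        while (k < known.length && p < pending.length) {
            if (known[k] < pending[p]) {
                out[n++] = known[k++];          // previously seen fingerprint
            } else if (known[k] > pending[p]) {
                out[n++] = pending[p++];        // genuinely new fingerprint
            } else {
                out[n++] = known[k++];          // duplicate: keep a single copy
                p++;
            }
        }
        while (k < known.length) out[n++] = known[k++];
        while (p < pending.length) out[n++] = pending[p++];
        return Arrays.copyOf(out, n);
    }
}
```

Because each array is walked once, the cost is linear in the combined size, which is what makes batching many pending items into one merge attractive.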
| Nested Class Summary | |
|---|---|
| class | FPMergeUriUniqFilter.PendingItem: Represents a long fingerprint and (possibly) its corresponding CandidateURI, awaiting the next merge in a 'pending' state. |
| Nested classes/interfaces inherited from interface org.archive.crawler.datamodel.UriUniqFilter |
|---|
| UriUniqFilter.HasUriReceiver |
| Field Summary | |
|---|---|
| static int | DEFAULT_MAX_PENDING |
| static long | FLUSH_DELAY_FACTOR |
| protected int | maxPending: size at which to force flush of pending items |
| protected long | mergeDupAtLast |
| protected long | mergeDuplicateCount |
| protected long | nextFlushAllowableAfter: time-based throttle on flush-merge operations |
| protected long | pendDupAtLast |
| protected long | pendDuplicateCount |
| protected java.util.TreeSet<FPMergeUriUniqFilter.PendingItem> | pendingSet: items awaiting merge. TODO: consider only sorting just pre-merge. TODO: consider using a fastutil long->Object class. TODO: consider actually writing items to disk file, as in Najork/Heydon. |
| protected java.io.PrintWriter | profileLog |
| protected ArrayLongFPCache | quickCache: cache of most recently seen FPs |
| protected long | quickDupAtLast |
| protected long | quickDuplicateCount |
| protected UriUniqFilter.HasUriReceiver | receiver |
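
The fields above suggest a staged add path: a quick cache of recent fingerprints, then the pending set, then a forced flush once maxPending is reached. The following is a minimal sketch under those assumptions. The class name, the direct-mapped cache layout, the cache size, and the stubbed `flush()` are all hypothetical; the real ArrayLongFPCache and flush logic may differ.

```java
import java.util.TreeSet;

/** Hypothetical sketch of the add path; not the Heritrix implementation. */
class AddPathSketch {
    static final int CACHE_SLOTS = 1024;   // assumed cache size
    private final long[] quickCache = new long[CACHE_SLOTS];
    private final TreeSet<Long> pendingSet = new TreeSet<>();
    private final int maxPending;
    long quickDuplicateCount;
    long pendDuplicateCount;
    int flushCount;

    AddPathSketch(int maxPending) {
        this.maxPending = maxPending;
    }

    void add(long fp) {
        int slot = (int) Math.floorMod(fp, (long) CACHE_SLOTS);
        // 0 marks an empty slot, so an fp of 0 never counts as a cache hit
        if (fp != 0 && quickCache[slot] == fp) {
            quickDuplicateCount++;           // seen very recently: drop at once
            return;
        }
        quickCache[slot] = fp;
        if (!pendingSet.add(fp)) {
            pendDuplicateCount++;            // already awaiting the next merge
            return;
        }
        if (pendingSet.size() >= maxPending) {
            flush();                         // size-based forced flush
        }
    }

    long pending() {
        return pendingSet.size();
    }

    void flush() {
        // A real flush would merge pendingSet into the full fingerprint list.
        flushCount++;
        pendingSet.clear();
    }
}
```

Duplicates caught by the first two stages never reach the merge, which is presumably why separate quick, pend, and merge duplicate counters are kept.
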
| Constructor Summary | |
|---|---|
| FPMergeUriUniqFilter() | |
| Method Summary | |
|---|---|
| void | add(java.lang.String key, CandidateURI value): Add the given URI, if not already present. |
| void | addForce(java.lang.String key, CandidateURI value): Add the given URI all the way through to the underlying destination, even if already present. |
| protected abstract void | addNewFp(long fp): Add an FP (which may be an old or new FP) to the new complete list. |
| void | addNow(java.lang.String key, CandidateURI value): Immediately add the URI. |
| protected abstract it.unimi.dsi.fastutil.longs.LongIterator | beginFpMerge(): Begin merging pending candidates with the complete list. |
| void | close(): Close down any allocated resources. |
| static long | createFp(java.lang.CharSequence key): Create a fingerprint from the given key. |
| protected abstract void | finishFpMerge(): Complete the merge of candidate and previously-known FPs (closing files/iterators as appropriate). |
| long | flush(): Perform a merge of all 'pending' items to the overall fingerprint list. |
| void | forget(java.lang.String key, CandidateURI value): Forget that the item was seen. |
| void | note(java.lang.String key): Note the item as seen, without passing it through to the receiver. |
| protected void | pend(long fp, CandidateURI value): Place the given FP/CandidateURI pair into the pending set, awaiting a merge to determine if it's actually accepted. |
| long | pending(): Count of items added, but not yet filtered in or out. |
| protected void | profileLog(java.lang.String key) |
| long | requestFlush(): Request that any pending items be added/dropped. |
| void | setDestination(UriUniqFilter.HasUriReceiver receiver): Set the receiver of unique URIs. |
| void | setMaxPending(int max) |
| void | setProfileLog(java.io.File logfile): Set a File to receive a log for replay profiling. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Methods inherited from interface org.archive.crawler.datamodel.UriUniqFilter |
|---|
| count |
| Field Detail |
|---|
protected UriUniqFilter.HasUriReceiver receiver
protected java.io.PrintWriter profileLog
protected long quickDuplicateCount
protected long quickDupAtLast
protected long pendDuplicateCount
protected long pendDupAtLast
protected long mergeDuplicateCount
protected long mergeDupAtLast
protected java.util.TreeSet<FPMergeUriUniqFilter.PendingItem> pendingSet
protected int maxPending
public static final int DEFAULT_MAX_PENDING
protected long nextFlushAllowableAfter
public static final long FLUSH_DELAY_FACTOR
protected ArrayLongFPCache quickCache
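The nextFlushAllowableAfter and FLUSH_DELAY_FACTOR fields point to a time-based throttle on merges. A minimal sketch follows, assuming the delay before the next non-forced flush is proportional to how long the previous flush took. The class name, the factor value of 100, and the exact formula are assumptions; this page gives neither the constant's value nor the calculation.

```java
/** Hypothetical sketch of a duration-proportional flush throttle. */
class FlushThrottleSketch {
    /** Assumed multiplier; the real FLUSH_DELAY_FACTOR value is not stated here. */
    static final long DELAY_FACTOR = 100;

    private long nextFlushAllowableAfter = 0;

    /** True if a non-forced flush may run at the given time. */
    boolean flushAllowed(long nowMillis) {
        return nowMillis > nextFlushAllowableAfter;
    }

    /** Records a flush that ran from startMillis to endMillis. */
    void recordFlush(long startMillis, long endMillis) {
        long duration = endMillis - startMillis;
        // The longer the merge took, the longer the wait before the next one.
        nextFlushAllowableAfter = endMillis + duration * DELAY_FACTOR;
    }
}
```

Scaling the delay by the merge duration keeps the share of time spent merging roughly bounded as the fingerprint list grows. Passing timestamps in, rather than reading the system clock inside the class, keeps the sketch deterministic.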
| Constructor Detail |
|---|
public FPMergeUriUniqFilter()
| Method Detail |
|---|
public void setMaxPending(int max)
public long pending()
Specified by: pending in interface UriUniqFilter
public void setDestination(UriUniqFilter.HasUriReceiver receiver)
Specified by: setDestination in interface UriUniqFilter
Parameters:
    receiver - Object that will be passed items. Must implement the HasUriReceiver interface.
protected void profileLog(java.lang.String key)
public void add(java.lang.String key, CandidateURI value)
Specified by: add in interface UriUniqFilter
Parameters:
    key - Usually a canonicalized version of value. This is the key used for lookups, forgets and insertions on the already-included list.
    value - item to add.
protected void pend(long fp, CandidateURI value)
Parameters:
    fp - long fingerprint
    value - CandidateURI, or null if the fp only needs merging (as when the CandidateURI was already forced in)
public static long createFp(java.lang.CharSequence key)
Parameters:
    key - CharSequence (URI) to fingerprint
public void addNow(java.lang.String key, CandidateURI value)
Specified by: addNow in interface UriUniqFilter
Parameters:
    key - Usually a canonicalized version of uri. This is the key used for lookups, forgets and insertions on the already-included list.
    value - item to add.
public void addForce(java.lang.String key, CandidateURI value)
Specified by: addForce in interface UriUniqFilter
Parameters:
    key - Usually a canonicalized version of uri. This is the key used for lookups, forgets and insertions on the already-included list.
    value - item to add.
public void note(java.lang.String key)
Specified by: note in interface UriUniqFilter
Parameters:
    key - Usually a canonicalized version of a URI. This is the key used for lookups, forgets and insertions on the already-included list.
public void forget(java.lang.String key, CandidateURI value)
Specified by: forget in interface UriUniqFilter
Parameters:
    key - Usually a canonicalized version of a URI. This is the key used for lookups, forgets and insertions on the already-included list.
    value - item to forget.
public long requestFlush()
Specified by: requestFlush in interface UriUniqFilter
public long flush()
protected abstract it.unimi.dsi.fastutil.longs.LongIterator beginFpMerge()
protected abstract void addNewFp(long fp)
Parameters:
    fp - the FP to add
protected abstract void finishFpMerge()
public void close()
Specified by: close in interface UriUniqFilter
public void setProfileLog(java.io.File logfile)
Specified by: setProfileLog in interface UriUniqFilter
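The three abstract methods, beginFpMerge(), addNewFp(long) and finishFpMerge(), split a merge between this class and its subclasses. The sketch below shows one way that division of labour could work. It is a hypothetical stand-alone mirror, not the Heritrix code: `java.util.PrimitiveIterator.OfLong` stands in for fastutil's LongIterator so the example has no external dependencies, the class names are invented, and the assumption that the hooks are called in sorted order is inferred from the method descriptions rather than confirmed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PrimitiveIterator;
import java.util.TreeSet;

/** Hypothetical mirror of the three merge hooks; not the Heritrix class. */
abstract class MergeHooksSketch {
    protected final TreeSet<Long> pendingSet = new TreeSet<>();

    /** Returns the previously known FPs in ascending order. */
    protected abstract PrimitiveIterator.OfLong beginFpMerge();

    /** Receives each FP of the new complete list, old or new, in ascending order. */
    protected abstract void addNewFp(long fp);

    /** Makes the new list current and releases any resources. */
    protected abstract void finishFpMerge();

    /** Drives the hooks and returns the pending FPs that turned out to be new. */
    List<Long> flush() {
        List<Long> accepted = new ArrayList<>();
        PrimitiveIterator.OfLong old = beginFpMerge();
        boolean haveOld = old.hasNext();
        long oldFp = haveOld ? old.nextLong() : 0L;
        for (long fp : pendingSet) {
            while (haveOld && oldFp < fp) {      // carry smaller old FPs forward
                addNewFp(oldFp);
                haveOld = old.hasNext();
                if (haveOld) oldFp = old.nextLong();
            }
            if (haveOld && oldFp == fp) continue; // duplicate: old copy is carried later
            addNewFp(fp);
            accepted.add(fp);
        }
        while (haveOld) {                         // carry the remaining old FPs forward
            addNewFp(oldFp);
            haveOld = old.hasNext();
            if (haveOld) oldFp = old.nextLong();
        }
        finishFpMerge();
        pendingSet.clear();
        return accepted;
    }
}

/** In-memory subclass: the complete list is a sorted long[]. */
class InMemoryMergeSketch extends MergeHooksSketch {
    long[] known = new long[0];
    private long[] next;
    private int count;

    @Override
    protected PrimitiveIterator.OfLong beginFpMerge() {
        next = new long[known.length + pendingSet.size()];
        count = 0;
        return Arrays.stream(known).iterator();
    }

    @Override
    protected void addNewFp(long fp) {
        next[count++] = fp;
    }

    @Override
    protected void finishFpMerge() {
        known = Arrays.copyOf(next, count);
        next = null;
    }
}
```

A disk-backed subclass could follow the same shape, with beginFpMerge() opening the old file, addNewFp() appending to a new file, and finishFpMerge() closing both and swapping them.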