18. Release 0.10.0 - 2004-06-04

Abstract

Release for the second Heritrix workshop, Copenhagen, 06/2004 (first release candidate for 1.0.0). Added site-first prioritization, fixed link extraction of multibyte URIs, added metadata to ARCs as XML, changed the ARC naming template, wrote new user and developer manuals, added basic/digest auth and an HTTP POST/GET login facility, and added help to the UI. Also includes bug fixes.

18.1. Changes

Table 11. Changes

ID      Type  Summary
896769  Add   job report: show 'active' hosts, show more size totals
896772  Add   "Site-first"/'frontline' prioritization
956614  Add   multiple open http connections per host needed
896674  Add   Add help to web UI
964931  Add   When a host last had a completed URI shown in crawl report
958335  Add   Encode multibyte URIs using page charset before queuing
909246  Add   One src for site, help, and readme docs.
936684  Add   identifying ARCs: unique names, header records
930667  Add   Resetting arc file counter for every job.
863318  Add   ARCs need better headers
908507  Add   Specify location of jobs dir
914301  Add   Logging in (HTTP POST, Basic Auth, etc.)
944066  Add   Update dnsjava from 1.5 to 1.6.2 (Fix NPE)
966168  Fix   crawl.log entries without annotations end with a space
966172  Fix   An issue with arc names' date and serial number alignment
957963  Fix   Output of warning message leads to NullPointerExceptions
963965  Fix   Either UURI or ExtractHTML should strip whitespace better
965267  Fix   Maximum documents not enforced
965308  Fix   NPE in path depth filter
934549  Fix   embed/speculative inclusion too loose
962899  Fix   UnsupportedCharsetException handled awkwardly
962892  Fix   UURI accepting/creating unUsable URIs (bad hosts)
860733  Fix   CachingDiskLongFPSet UI availability
954130  Fix   Crawls slow till change a setting
961867  Fix   zero link-hops should work
942627  Fix   multiple robots.txt URLs in the "default" frontier
957941  Fix   NPE in ExtractorHTML#isHtmlExpectedHere
953718  Fix   Unwanted behavior with seed redirection
952636  Fix   Link extraction failing
863315  Fix   Memory issues: Frontier.snoozeQueue
903838  Fix   Transitive scope confusion, may not work as expected
955345  Fix   Wrong stats after deleting URIs from Frontier
952276  Fix   NoSuchElementException in admin/reports/frontier.jsp
952665  Fix   Alert: Authentication scheme(s) not supported
936702  Fix   IP validity: units, TTL vs. setting
951582  Fix   ConcurrentModificationException in DomainScope focus filter
949489  Fix   ConcurrentModificationException terminate job
949551  Fix   Authentication bug
948898  Fix   terminate running crawl == NPE
927940  Fix   java.net.URI parses %20 but getHost null
874220  Fix   NPE in java.net.URI.encode
808270  Fix   java.net.URI chokes on hosts_with_underscores
788277  Fix   Doing separate DNS lookup for same host
910120  Fix   java.net.URI#getHost fails when leading digit
949548  Fix   Constraining java URI class
943373  Fix   Same CrawlServer instance for http & https.
887999  Fix   Broad crawl/ too many open files
926912  Fix   multiple charset headers + long lines
926338  Fix   Corrupted blue image in progress bars
896757  Fix   NPEs in Andy's Th-Fri Crawl + NPE in RIS
922080  Fix   IllegalArgumentEx/ReplayCharSequenceFactory (offset vs. size
935271  Fix   FTP URIs in seeds interpreted as HTTP
945923  Fix   maven rc2 won't make src distribution
947754  Fix   Corrupted arc files on termination of job
931269  Fix   https exception: java.io.IOException: SSL failure
935146  Fix   Excessive ARCWriterPool timeouts:
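
Entry 958335 above ("Encode multibyte URIs using page charset before queuing") names a technique that may be easier to follow with a small illustration. The sketch below is not Heritrix's actual code: the class CharsetUriEscaper and its escape method are hypothetical names introduced only for this example. It assumes that a link extracted from a page is percent-escaped using the bytes of the page's declared charset, so that the server receives the same byte sequence a browser viewing that page would send.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not part of Heritrix.
public class CharsetUriEscaper {

    // Percent-escapes every non-ASCII character in uri using the bytes it
    // has in pageCharset. ASCII characters are copied through unchanged.
    public static String escape(String uri, String pageCharset) {
        Charset cs = Charset.forName(pageCharset);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < uri.length(); ) {
            int cp = uri.codePointAt(i);
            int len = Character.charCount(cp);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                // Encode this one character in the page's charset and
                // emit each resulting byte as %XX.
                byte[] bytes = uri.substring(i, i + len).getBytes(cs);
                for (byte b : bytes) {
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
            i += len;
        }
        return out.toString();
    }
}
```

Under these assumptions, the same link escapes differently depending on the page it was found on: escape("http://example.com/caf\u00e9", "UTF-8") yields http://example.com/caf%C3%A9, while the same call with "ISO-8859-1" yields http://example.com/caf%E9.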