Abstract
Release for the second Heritrix workshop, Copenhagen, 06/2004 (1.0.0 first release candidate). Added site-first prioritization, fixed link extraction of multibyte URIs, added metadata to ARCs as XML, changed the ARC naming template, wrote new user and developer manuals, added basic/digest authentication and an HTTP POST/GET login facility, and added help to the UI. Also includes bug fixes.
Table 11. Changes
ID | Type | Summary |
---|---|---|
896769 | Add | job report: show 'active' hosts, show more size totals |
896772 | Add | "Site-first"/'frontline' prioritization |
956614 | Add | multiple open http connections per host needed |
896674 | Add | Add help to web UI |
964931 | Add | When a host last had a completed URI shown in crawl report |
958335 | Add | Encode multibyte URIs using page charset before queuing |
909246 | Add | One src for site, help, and readme docs. |
936684 | Add | identifying ARCs: unique names, header records |
930667 | Add | Resetting arc file counter for every job. |
863318 | Add | ARCs need better headers |
908507 | Add | Specify location of jobs dir |
914301 | Add | Logging in (HTTP POST, Basic Auth, etc.) |
944066 | Add | Update dnsjava from 1.5 to 1.6.2 (Fix NPE) |
966168 | Fix | crawl.log entries without annotations end with a space |
966172 | Fix | An issue with arc names' date and serial number alignment |
957963 | Fix | Output of warning message leads to NullPointerExceptions |
963965 | Fix | Either UURI or ExtractHTML should strip whitespace better |
965267 | Fix | Maximum documents not enforced |
965308 | Fix | NPE in path depth filter |
934549 | Fix | embed/speculative inclusion too loose |
962899 | Fix | UnsupportedCharsetException handled awkwardly |
962892 | Fix | UURI accepting/creating unUsable URIs (bad hosts) |
860733 | Fix | CachingDiskLongFPSet UI availability |
954130 | Fix | Crawls slow till change a setting |
961867 | Fix | zero link-hops should work |
942627 | Fix | multiple robots.txt URLs in the "default" frontier |
957941 | Fix | NPE in ExtractorHTML#isHtmlExpectedHere |
953718 | Fix | Unwanted behavior with seed redirection |
952636 | Fix | Link extraction failing |
863315 | Fix | Memory issues: Frontier.snoozeQueue |
903838 | Fix | Transitive scope confusion, may not work as expected |
955345 | Fix | Wrong stats after deleting URIs from Frontier |
952276 | Fix | NoSuchElementException in admin/reports/frontier.jsp |
952665 | Fix | Alert: Authentication scheme(s) not supported |
936702 | Fix | IP validity: units, TTL vs. setting |
951582 | Fix | ConcurrentModificationException in DomainScope focus filter |
949489 | Fix | ConcurrentModificationException terminate job |
949551 | Fix | Authentication bug |
948898 | Fix | terminate running crawl == NPE |
927940 | Fix | java.net.URI parses %20 but getHost null |
874220 | Fix | NPE in java.net.URI.encode |
808270 | Fix | java.net.URI chokes on hosts_with_underscores |
788277 | Fix | Doing separate DNS lookup for same host |
910120 | Fix | java.net.URI#getHost fails when leading digit |
949548 | Fix | Constraining java URI class |
943373 | Fix | Same CrawlServer instance for http & https. |
887999 | Fix | Broad crawl/ too many open files |
926912 | Fix | multiple charset headers + long lines |
926338 | Fix | Corrupted blue image in progress bars |
896757 | Fix | NPEs in Andy's Th-Fri Crawl + NPE in RIS |
922080 | Fix | IllegalArgumentEx/ReplayCharSequenceFactory (offset vs. size |
935271 | Fix | FTP URIs in seeds interpreted as HTTP |
945923 | Fix | maven rc2 won't make src distribution |
947754 | Fix | Corrupted arc files on termination of job |
931269 | Fix | https exception: java.io.IOException: SSL failure |
935146 | Fix | Excessive ARCWriterPool timeouts: |
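
Several of the fixes above (808270, 927940, 910120, 949548) concern `java.net.URI` host parsing. As a minimal illustration of the underscore case only, the sketch below shows how the stock JDK class reports a host name containing underscores. The class `UriHostCheck` and its method `hostOf` are hypothetical names used for this example; they are not part of Heritrix and do not show how Heritrix's own UURI class addresses the problem.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only; not Heritrix code.
final class UriHostCheck {

    /** Returns the host that java.net.URI reports, or null if it finds none. */
    static String hostOf(String uri) throws URISyntaxException {
        return new URI(uri).getHost();
    }

    public static void main(String[] args) throws URISyntaxException {
        String plain = "http://www.example.com/index.html";
        String underscored = "http://hosts_with_underscores.example.com/index.html";

        // A conventional host name is parsed as a server-based authority.
        System.out.println(hostOf(plain));

        // An underscore is not legal in a server-based host name, so the URI
        // parses without an exception but getHost() returns null.
        System.out.println(hostOf(underscored));

        // The raw authority is still available for such URIs.
        System.out.println(new URI(underscored).getAuthority());
    }
}
```

Because the URI still parses without throwing, a crawler that relies on `getHost()` alone would see a missing host rather than an error, which is presumably why these reports were grouped under constraining the Java URI class.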