Abstract
Release for the second Heritrix workshop, Copenhagen, 06/2004 (1.0.0 first release candidate). Added site-first prioritization, fixed link extraction of multibyte URIs, added metadata to ARCs as XML, changed the ARC naming template, added new user and developer manuals, added basic/digest auth and an HTTP POST/GET login facility, and added help to the UI. Also includes bug fixes.
Table 11. Changes
| ID | Type | Summary |
|---|---|---|
| 896769 | Add | job report: show 'active' hosts, show more size totals |
| 896772 | Add | "Site-first"/'frontline' prioritization |
| 956614 | Add | multiple open http connections per host needed |
| 896674 | Add | Add help to web UI |
| 964931 | Add | When a host last had a completed URI shown in crawl report |
| 958335 | Add | Encode multibyte URIs using page charset before queuing |
| 909246 | Add | One src for site, help, and readme docs. |
| 936684 | Add | identifying ARCs: unique names, header records |
| 930667 | Add | Resetting arc file counter for every job. |
| 863318 | Add | ARCs need better headers |
| 908507 | Add | Specify location of jobs dir |
| 914301 | Add | Logging in (HTTP POST, Basic Auth, etc.) |
| 944066 | Add | Update dnsjava from 1.5 to 1.6.2 (Fix NPE) |
| 966168 | Fix | crawl.log entries without annotations end with a space |
| 966172 | Fix | An issue with arc names' date and serial number alignment |
| 957963 | Fix | Output of warning message leads to NullPointerExceptions |
| 963965 | Fix | Either UURI or ExtractHTML should strip whitespace better |
| 965267 | Fix | Maximum documents not enforced |
| 965308 | Fix | NPE in path depth filter |
| 934549 | Fix | embed/speculative inclusion too loose |
| 962899 | Fix | UnsupportedCharsetException handled awkwardly |
| 962892 | Fix | UURI accepting/creating unusable URIs (bad hosts) |
| 860733 | Fix | CachingDiskLongFPSet UI availability |
| 954130 | Fix | Crawls slow till change a setting |
| 961867 | Fix | zero link-hops should work |
| 942627 | Fix | multiple robots.txt URLs in the "default" frontier |
| 957941 | Fix | NPE in ExtractorHTML#isHtmlExpectedHere |
| 953718 | Fix | Unwanted behavior with seed redirection |
| 952636 | Fix | Link extraction failing |
| 863315 | Fix | Memory issues: Frontier.snoozeQueue |
| 903838 | Fix | Transitive scope confusion, may not work as expected |
| 955345 | Fix | Wrong stats after deleting URIs from Frontier |
| 952276 | Fix | NoSuchElementException in admin/reports/frontier.jsp |
| 952665 | Fix | Alert: Authentication scheme(s) not supported |
| 936702 | Fix | IP validity: units, TTL vs. setting |
| 951582 | Fix | ConcurrentModificationException in DomainScope focus filter |
| 949489 | Fix | ConcurrentModificationException terminate job |
| 949551 | Fix | Authentication bug |
| 948898 | Fix | terminate running crawl == NPE |
| 927940 | Fix | java.net.URI parses %20 but getHost null |
| 874220 | Fix | NPE in java.net.URI.encode |
| 808270 | Fix | java.net.URI chokes on hosts_with_underscores |
| 788277 | Fix | Doing separate DNS lookup for same host |
| 910120 | Fix | java.net.URI#getHost fails when leading digit |
| 949548 | Fix | Constraining java URI class |
| 943373 | Fix | Same CrawlServer instance for http & https. |
| 887999 | Fix | Broad crawl/ too many open files |
| 926912 | Fix | multiple charset headers + long lines |
| 926338 | Fix | Corrupted blue image in progress bars |
| 896757 | Fix | NPEs in Andy's Th-Fri Crawl + NPE in RIS |
| 922080 | Fix | IllegalArgumentEx/ReplayCharSequenceFactory (offset vs. size) |
| 935271 | Fix | FTP URIs in seeds interpreted as HTTP |
| 945923 | Fix | maven rc2 won't make src distribution |
| 947754 | Fix | Corrupted arc files on termination of job |
| 931269 | Fix | https exception: java.io.IOException: SSL failure |
| 935146 | Fix | Excessive ARCWriterPool timeouts: |