I need a Java application that harvests data from www.uspto.gov. It should use the site's "Browse Dictionary" feature to locate the data and crawl that dictionary, downloading as many articles as possible. The application will run on a server with a large number of IP addresses and will also have access to a database of proxies. It should spread its requests across as many of the server's IPs and as many of the proxies as possible. Request speed must be configurable, and it must be possible to exclude specific IPs and proxies in case they get banned.
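To make the rotation and exclusion requirement concrete, here is a minimal sketch of what I have in mind, not a prescribed design. The class name `EndpointRotator`, its methods, and the use of plain strings to identify an IP or proxy are all hypothetical; a real implementation would load the endpoints and the exclusion list from configuration and the proxy database.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical round-robin pool of outbound endpoints (local IPs or proxies). */
final class EndpointRotator {
    private final List<String> endpoints;
    // Endpoints that must not be used, for example because they were banned.
    private final Set<String> excluded = ConcurrentHashMap.newKeySet();
    // Shared counter so concurrent threads keep cycling through the list.
    private final AtomicLong cursor = new AtomicLong();

    EndpointRotator(List<String> endpoints) {
        this.endpoints = List.copyOf(endpoints);
    }

    /** Marks an endpoint as unusable until it is re-enabled. */
    void exclude(String endpoint) {
        excluded.add(endpoint);
    }

    /** Makes a previously excluded endpoint usable again. */
    void include(String endpoint) {
        excluded.remove(endpoint);
    }

    /** Returns the next usable endpoint, or empty if none is currently usable. */
    Optional<String> next() {
        int n = endpoints.size();
        // Try each position at most once, skipping excluded endpoints.
        for (int i = 0; i < n; i++) {
            int idx = (int) Math.floorMod(cursor.getAndIncrement(), (long) n);
            String candidate = endpoints.get(idx);
            if (!excluded.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

Each worker would call `next()` before a request and then either bind to the returned local IP or route through the returned proxy. Calling `exclude(...)` at runtime takes a banned endpoint out of rotation without a restart.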
The application must also be multi-threaded, with a configurable number of threads.
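As a rough illustration of the threading and request-speed settings, the sketch below uses a fixed-size thread pool and a shared minimum interval between requests. The names `CrawlPool`, `threads`, and `minIntervalMillis` are assumptions made for this example, and the `fetcher` callback stands in for the actual download logic.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical worker pool with a configurable thread count and request spacing. */
final class CrawlPool {
    private final int threads;
    private final long minIntervalNanos;
    private long nextSlot; // earliest time the next request may start; guarded by "this"

    CrawlPool(int threads, long minIntervalMillis) {
        this.threads = threads;
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMillis);
        this.nextSlot = System.nanoTime();
    }

    /** Reserves the next request slot and sleeps until it arrives. */
    private void awaitSlot() throws InterruptedException {
        long wait;
        synchronized (this) {
            long now = System.nanoTime();
            long slot = Math.max(now, nextSlot);
            nextSlot = slot + minIntervalNanos;
            wait = slot - now;
        }
        if (wait > 0) {
            TimeUnit.NANOSECONDS.sleep(wait);
        }
    }

    /** Calls fetcher once per URL; returns true if all work finished before the timeout. */
    boolean run(List<String> urls, Consumer<String> fetcher, long timeoutSeconds) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : urls) {
            pool.submit(() -> {
                try {
                    awaitSlot();         // enforce the configured request speed
                    fetcher.accept(url); // placeholder for the real HTTP fetch
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

For example, `new CrawlPool(8, 250)` would run eight worker threads while starting at most one request every 250 ms across all of them. Both values would come from the application's configuration.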