Information automation systems for gathering and organizing information of interest from large, unstructured repositories (such as the Internet) are known. One such system issues a query to a search engine (known as a deep Web source), accumulates and downloads the results of that engine's search, scores each returned document against the query, and then either accepts and stores the document or rejects and purges it.
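The accept-or-reject step described above can be illustrated with a minimal sketch. The scoring rule (fraction of query terms found in the text), the threshold value, and the names `Document`, `score`, and `harvest` are assumptions made for illustration only; the scoring system actually used by such known systems is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Document:
    url: str
    text: str


def score(query_terms, doc):
    """Hypothetical scoring rule: fraction of query terms present in the text."""
    if not query_terms:
        return 0.0
    text = doc.text.lower()
    hits = sum(1 for term in query_terms if term.lower() in text)
    return hits / len(query_terms)


def harvest(query_terms, results, threshold=0.5):
    """Keep documents scoring at or above the threshold; drop the rest."""
    accepted = []
    for doc in results:
        if score(query_terms, doc) >= threshold:
            accepted.append(doc)  # accepted and stored
        # documents below the threshold are rejected (purged)
    return accepted


results = [
    Document("http://example.com/a", "Deep Web harvesting techniques"),
    Document("http://example.com/b", "Cooking recipes"),
]
kept = harvest(["deep", "web"], results)
```

In this sketch the query is re-applied locally to each downloaded result, so the decision to store a document does not depend on the ranking supplied by the search engine.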
Occasionally, during the configuration process, the source record is not created properly or contains information that needs to be adjusted. The adjustment might be as simple as providing a better title for the source or as complicated as changing the form's field/value pairs in the action string. An editor that permits users to make such adjustments efficiently is desirable.
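As an illustration of the more complicated kind of adjustment, the following sketch changes one field/value pair in an action string. It assumes the action string is an ordinary URL with a query component; the helper name `set_field` and the example URL are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_field(action, field, value):
    """Return the action string with `field` set to `value` (added if absent)."""
    parts = urlsplit(action)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    updated = [(k, value if k == field else v) for k, v in pairs]
    if all(k != field for k, _ in pairs):
        updated.append((field, value))
    return urlunsplit(parts._replace(query=urlencode(updated)))


fixed = set_field("http://example.com/search?q=test&lang=en", "lang", "ja")
```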
Another challenge in present-day information automation systems is internationalization, i.e., the ability to effectively search sources in various languages. Systems for creating database queries using character sets beyond 7-bit ASCII are known. However, significant challenges remain in deep harvesting from deep Web sources in non-Latin-based languages.
One such challenge is to communicate the user's query to a given source using the source's character encoding and within the source's Boolean support capabilities. Previously, all out-bound traffic was encoded as ASCII (ISO-8859). However, nearly all non-Latin-based languages use code points beyond ASCII (ISO-8859), so this approach produced invalid data that was simply rejected.
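The encoding problem can be sketched as follows. The example assumes the query is sent as a percent-encoded URL parameter and that the source expects Shift_JIS; both are assumptions for illustration, and `encode_query` is a hypothetical helper built on the standard `urllib.parse.quote_plus` function.

```python
from urllib.parse import quote_plus, unquote_plus


def encode_query(query, source_encoding):
    """Percent-encode a query using the character encoding the source expects."""
    return quote_plus(query, encoding=source_encoding, errors="strict")


query = "検索"  # Japanese for "search"

as_sjis = encode_query(query, "shift_jis")
as_utf8 = encode_query(query, "utf-8")

# The same text yields different bytes on the wire depending on the encoding,
# so the encoding must match what the source expects.
round_trip = unquote_plus(as_sjis, encoding="shift_jis")

# A Latin-only encoding cannot represent the query at all.
try:
    encode_query(query, "iso-8859-1")
    latin_ok = True
except UnicodeEncodeError:
    latin_ok = False
```

With `errors="strict"` the failure is explicit. A lenient error mode would instead substitute placeholder characters, which corresponds to the invalid out-bound data described above.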
Another challenge involves Boolean searching in foreign languages. Most languages, especially Latin-based languages, permit whitespace between query terms and Boolean operators. Some languages, however, such as Chinese, Japanese, and Korean, are very specific about where a space can appear in a query. The rules even vary from source to source within the same language. Consequently, Boolean searching in these languages cannot be conducted in the same way as in English.
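One way to picture the source-to-source variation is a per-source description of Boolean syntax that controls how a query is assembled. The sketch below is illustrative only: the `BooleanSyntax` fields, the operator strings, and the two example sources are assumptions, not the rules of any actual source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BooleanSyntax:
    and_operator: str = "AND"
    space_around_operator: bool = True  # False for sources that reject spaces


def build_and_query(terms, syntax):
    """Join terms with the source's AND operator, honoring its whitespace rule."""
    if syntax.space_around_operator:
        separator = f" {syntax.and_operator} "
    else:
        separator = syntax.and_operator
    return separator.join(terms)


english_source = BooleanSyntax()  # hypothetical source accepting "a AND b"
cjk_source = BooleanSyntax(and_operator="+", space_around_operator=False)  # hypothetical

q_en = build_and_query(["deep", "web"], english_source)
q_ja = build_and_query(["深層", "検索"], cjk_source)
```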
A further challenge in information automation systems concerns systems that retrieve a copy of a source's HTML page, present that page to the user, ask the user to interact with the page to conduct a search, and capture what the user does for future use. Previously, the technique was to download the HTML page, scan the HTML for all form elements, and then modify the action strings to point back to the query manager. This meant that much of the existing data had to be tucked away in hidden fields in the document and then processed later. A solution that improves the efficiency of this operation is desirable.
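The previous technique can be sketched roughly as shown below, to make clear why data ends up in hidden fields. This is a simplified illustration, not the actual implementation: it uses regular expressions rather than a full HTML parser, handles only double-quoted `action` attributes, and the query manager path and the hidden field name `original_action` are hypothetical.

```python
import re
from html import escape

FORM_TAG = re.compile(r"<form\b[^>]*>", re.IGNORECASE)
ACTION_ATTR = re.compile(r'\baction\s*=\s*"([^"]*)"', re.IGNORECASE)


def redirect_forms(html, manager_url):
    """Point each form at the query manager and stash the original action."""

    def rewrite(match):
        tag = match.group(0)
        found = ACTION_ATTR.search(tag)
        if not found:
            return tag  # leave forms without an explicit action untouched
        original = found.group(1)
        new_tag = (
            tag[: found.start()]
            + f'action="{escape(manager_url)}"'
            + tag[found.end():]
        )
        # The original action is tucked away in a hidden field for later processing.
        hidden = f'<input type="hidden" name="original_action" value="{original}">'
        return new_tag + hidden

    return FORM_TAG.sub(rewrite, html)


page = '<form method="get" action="http://example.com/search"><input name="q"></form>'
rewritten = redirect_forms(page, "/query-manager/capture")
```

Every form on the page needs this treatment, and every value displaced by the rewrite has to be carried along in a hidden field and unpacked later, which is the overhead noted above.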