1. Field of Invention
The present invention relates generally to the field of data retrieval. More specifically, the present invention is related to a business model which provides a fee-based, real-time, intermediary service including a method of extracting data from third party providers, removing existing formatting information and returning the data to the requester in a desired format.
2. Discussion of Prior Art
The proliferation of the Internet and World Wide Web (WWW) has produced a deluge of information, often in formats that the average user finds unmanageable. To assist the user, various search engines have been developed which work through the user's browser to keyword search various indexed data sources. While search results of text Web pages may be easy to manage, search results of structured data are not so easily managed. Typically, database results are returned preformatted in HTML, text or spreadsheet forms. The user, however, has no means of selecting a format not envisioned by the data supplier. The user may, for example, want the data output only in spreadsheet format for direct integration into locally stored table structures. Most users cannot perform such a conversion because of software or hardware limitations, and certainly not in real-time. What is needed is an intermediate service provider through which a user can enhance their data retrieval by customizing the data output without having to create complex algorithms or mapping structures locally on their PC. The following prior art describes various attempts to extract data from database sources located on the Web.
The patent to Schofield (U.S. Pat. No. 5,860,072), assigned to Tandem Computers Incorporated, provides for a Method and Apparatus for Transporting Interface Definition Language-Defined Data Structures Between Heterogenous Systems. Data strings are stored locally in a receiving computer's buffer; thereafter, the data structure is extracted, realigned and stored. Column 4, lines 37-39 suggest an Internet embodiment.
The patent to Horvitz et al. (U.S. Pat. No. 5,864,848), assigned to Microsoft Corporation, provides for a Goal-Driven Information Interpretation and Extraction System. Column 1, lines 47-52 suggest the extraction of data from Internet web pages.
The web page entitled "Visual Design and Cross-Platform Execution" provides a technical overview of the software product "Cambio." Cambio extracts the desired data fields (which can be spread across multiple lines in a text file) and assembles those fields into a flat record of data. These records are presented in the conventional row/column, tabular format (see http://www.datajunction.com/products/cambio_technical.html).
The web page entitled "GlimpseGate" provides for context searching of HTML web documents with data strings (see http://phones.cybercell.net/~hsf/sources/glimpsegate/).
Additional data extractors can be found in the following patents, web pages and articles:
U.S. Pat. No. 5,761,656 to Ben-Shachar; U.S. Pat. No. 5,819,265 to Ravin et al.; U.S. Pat. No. 5,870,746 to Knutson et al.; U.S. Pat. No. 5,881,232 to Cheng et al.; and U.S. Pat. No. 5,892,908 to Hughes et al.
Web sites:
4.1 Overview - http://skwww.enc.iis.sinica.edu.tw/user-manual/node42.html
Help on Citibase Data Extraction - http://biscu.its.yale.edu/SSDA/helpfiles/citihelp.html
HTML Presentation - http://www.fortnet.org/FortNet/HTML/Presentation/stats/
HTML2TEXT v1.51 - http://www.telekabel.nl/sprinter/wieger/html2txt.htm
HTMLess 2.0 - http://elanor.sci.muni.cz/ar/ar407_Sections/news19.html
NeXtract - http://www.nextract.com
Article: SAC Software Agent Corporation Presents The Search Agent - http://www.io.com/~sac/, and article by Lawrence, Steve et al., IEEE Internet Computing, "Context and Page Analysis for Improved Web Search", July-August 1998, pp. 38-46.
Whatever the precise merits, features and advantages of the above cited references, none of them achieves or fulfills the purposes of the present invention, one of which is specifically to provide an E-commerce business model and system including an intermediate service provider through which a user can enhance their WWW data retrieval by customizing the data output in real-time without creating and maintaining complex data mapping algorithms. The prior art shows that both stripping algorithms and Java agents are known; however, neither has been used to dispatch intermediary agents for real-time extraction of structured data from HTML pages accessed by the user and for arbitrary post-processing of third party data.
These and other objectives are achieved by the detailed description that follows.
A data extractor system is provided for the extraction, deformatting, and postformatting of data available on the WWW. The system enables: buffering and streamlining between the user and web data providers; converting the visual presentation of information into data for further processing; translating one data request into a cascade of data requests and pasting the results together; filtering data output; presenting data in a variety of ways different from the original presentation; and optional dataflow between the user's applications and the third-party data providers, thereby bypassing interactive interfaces.
A user, connected to the Internet/Web, contacts an intermediate data service which provides an interface to determine various aspects of the user's query, including output format. The intermediate data service generates a stripping agent, such as a Java program, which is sent to the user's browser to interface with a third party data provider. The Java stripping agent contains the knowledge to strip away the formatting of user interfaces such as HTML, and to reformat, reorganize, filter and present the data in real-time in a user-selected format. The present invention:
1. Embeds all user input in a standardized way in a URL (CGI), hiding from the user various data entry protocols such as post-data, JavaScript data entry forms, etc. This allows the user to:
a. bookmark this URL with predefined input data
b. embed this URL in various user scripts
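As an illustration of item 1, the following minimal sketch shows how all user input might be carried in a single URL that can be bookmarked or called from a script. The host name, the CGI path and the parameter names SERVICE, OUTPUT and TICKERS are hypothetical and do not come from the specification.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint of the intermediate data service (an assumption).
SERVICE_URL = "http://extractor.example.com/cgi-bin/query"

def build_query_url(service, fields, output="ascii"):
    """Encode every user input as query-string parameters of one URL."""
    params = {"SERVICE": service, "OUTPUT": output}
    params.update(fields)
    return SERVICE_URL + "?" + urlencode(params)

def read_query_url(url):
    """Recover the parameters from a bookmarked or scripted URL."""
    return {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}

url = build_query_url("stock_quotes", {"TICKERS": "IBM,MSFT"})
print(url)
print(read_query_url(url))
```

Because the whole request lives in the query string, the same URL works unchanged from a bookmark, a shell script or another application.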
2. Converts the formatted data retrieved from the third party data provider into an ASCII file, one line per result, with tabs separating fields, eliminating all graphics and irrelevant text and leaving only data. This allows:
a. convenient downloading of data into user applications
b. compact results
c. development of embedded applications
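The conversion in item 2 can be sketched as follows, assuming the third party page presents its results as an HTML table. The sketch uses Python's standard html.parser rather than a Java agent, and the sample page is invented for illustration; real provider pages would need site-specific rules.

```python
from html.parser import HTMLParser

class TableStripper(HTMLParser):
    """Collect the text of each <td>/<th> cell, one list per <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside table cells (headings, captions, ads) is dropped.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace so each field is one clean string.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None

def html_to_tsv(html):
    """Return one line per result, with tabs separating fields."""
    parser = TableStripper()
    parser.feed(html)
    parser.close()
    return "\n".join("\t".join(row) for row in parser.rows)

page = """<html><body><img src="logo.gif"><p>Welcome!</p>
<table><tr><td><b>IBM</b></td><td> 101.5 </td></tr>
<tr><td>MSFT</td><td>88.25</td></tr></table></body></html>"""
print(html_to_tsv(page))
```

The image, the greeting and the bold markup are discarded; only the two data rows remain, each as a tab-separated line.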
3. When a third-party site gives a few records at a time and a "next" button, the present invention dispatches an agent that recursively calls the third-party data provider, giving the user a large volume of data in one operation.
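A minimal sketch of the behavior in item 3 follows. It uses a loop rather than literal recursion, and it assumes a hypothetical fetch_page callable that returns the records of one page together with the address of the next page (or None on the last page). The in-memory FAKE_SITE stands in for a real third-party provider.

```python
def collect_all(fetch_page, start_url, max_pages=100):
    """Follow 'next' links until none remains, concatenating the records."""
    records, url, seen = [], start_url, set()
    # The seen set and max_pages guard against cyclic or endless 'next' links.
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_records, url = fetch_page(url)
        records.extend(page_records)
    return records

# Stand-in for a third-party site that returns a few records per page.
FAKE_SITE = {
    "/list?page=1": (["Adams", "Baker"], "/list?page=2"),
    "/list?page=2": (["Clark", "Davis"], "/list?page=3"),
    "/list?page=3": (["Evans"], None),
}

all_records = collect_all(FAKE_SITE.__getitem__, "/list?page=1")
print(all_records)
```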
4. In addition to plain ASCII output by default, the user will be able to parametrically specify additional forms of output:
formatted ASCII (72 characters per line, aligned spaces instead of tabs, one field can continue on several lines)
RTF
HTML tables
PDF
Postscript
And others
The present invention delivers standardized extracted graphic files of spatial data: maps, remote-sensing images, etc.
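The "formatted ASCII" option of item 4 might be produced from the default tab-separated output roughly as sketched below. The column-sizing rule (shrink the widest column until the line fits) and the two-space gap between columns are assumptions made for this example; the specification states only the 72-character limit, space alignment and multi-line fields. The input is assumed to be non-empty.

```python
import textwrap

def format_ascii(tsv, line_width=72, gap=2):
    """Align tab-separated fields with spaces, wrapping long fields."""
    rows = [line.split("\t") for line in tsv.splitlines()]
    ncols = max(len(r) for r in rows)
    rows = [r + [""] * (ncols - len(r)) for r in rows]
    # Start from each column's natural width (at least 1 character).
    widths = [max(1, max(len(r[c]) for r in rows)) for c in range(ncols)]
    budget = line_width - gap * (ncols - 1)
    # Shrink the widest column until a full line fits in line_width.
    while sum(widths) > budget and max(widths) > 1:
        widths[widths.index(max(widths))] -= 1
    out = []
    for r in rows:
        # A field wider than its column continues on following lines.
        cells = [textwrap.wrap(v, w) or [""] for v, w in zip(r, widths)]
        for i in range(max(len(c) for c in cells)):
            parts = [(c[i] if i < len(c) else "").ljust(w)
                     for c, w in zip(cells, widths)]
            out.append((" " * gap).join(parts).rstrip())
    return "\n".join(out)

print(format_ascii("Acme Corp\t" + "long description " * 10))
```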
5. The user can specify a parameter EGREP_SCREEN giving a regular expression to screen the output, or a simplified parameter KEYWORDS_SCREEN. (Note: this is post-processing of results after they are received from third-party providers.)
6. In an alternative embodiment, the intermediate data service subscribes to a variety of pay-per-use services and re-delivers information to paying customers. The end user's convenience, in addition to repackaging, will be that the user does not have to subscribe to many services, just to the intermediate data service (the charge includes a small mark-up, or no mark-up if wholesale rates are obtained).
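The screening step of item 5 could look like the sketch below. Two assumptions are made: Python's re syntax stands in for egrep syntax (the two differ in some details), and KEYWORDS_SCREEN is interpreted as a space-separated list of words that must all appear in a line, ignoring case. The specification does not define either point.

```python
import re

def screen_output(tsv, egrep_screen=None, keywords_screen=None):
    """Keep only the result lines that pass the requested screens."""
    lines = tsv.splitlines()
    if egrep_screen is not None:
        pattern = re.compile(egrep_screen)
        lines = [ln for ln in lines if pattern.search(ln)]
    if keywords_screen is not None:
        # Assumed semantics: every keyword must occur, case-insensitively.
        words = [w.lower() for w in keywords_screen.split()]
        lines = [ln for ln in lines if all(w in ln.lower() for w in words)]
    return "\n".join(lines)

results = "IBM\t101.5\nMSFT\t88.25\nIBX\t7.0"
print(screen_output(results, egrep_screen=r"^IB[MX]\t"))
print(screen_output(results, keywords_screen="msft"))
```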
7. In an alternative embodiment, the system performs merges and joins between data from more than one server.
8. In an alternative embodiment, certain joins will be allowed within the same site, e.g., by traversing pointers to product detail from a product list.
9. In an alternative embodiment, the system includes a virtual conceptual semantic schema of all WWW information accessible by the user via the service and allows the user to specify complex database queries against that schema without knowing which third-party sites need to be accessed or joined to perform the query.
10. The program can employ Java-agent technology, in which the agent performs all the activities at the user site. This reduces traffic on the intermediate data service and also protects the intermediate service provider from possible claims by third-party data providers regarding reselling or storing of their data contrary to license or copyright provisions.
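The joins mentioned in items 7 and 8 can be illustrated with a simple hash join over stripped records. This is a sketch only: the two result sets and the choice of the first column as the join key are invented for the example, and the same function could equally join a product list to product detail pages within one site.

```python
def join_on(left_rows, right_rows, left_key=0, right_key=0):
    """Inner-join two lists of field lists on the given column indexes."""
    index = {}
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)
    joined = []
    for row in left_rows:
        for match in index.get(row[left_key], []):
            # Append the matching row's fields, omitting the repeated key.
            joined.append(row + match[:right_key] + match[right_key + 1:])
    return joined

# Hypothetical stripped results from two different servers.
quotes = [["IBM", "101.5"], ["MSFT", "88.25"]]
names = [["IBM", "International Business Machines"], ["ORCL", "Oracle"]]
print(join_on(quotes, names))
```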
11. The program will allow a number of post-formatting options, including:
audio file produced after adding connecting words to properly delineate fields (it is impossible to produce a meaningful audio file without first stripping output and delimiting fields with connecting words)
smart translation into other languages; the present invention will decide which fields should be translated and which should not, exercising its knowledge of the semantics of the data source.
12. The program is written in such a way that definitions of the third-party web site protocols are outside of the program, in a Knowledge Base, and are easy for low-skilled staff to maintain and change.
13. The intermediate data service maintains a large database of references to data providing sites whose input/output stripping instructions are known.
14. When no parameters are given, the present invention replies with a list of third party services it knows to query, the kind of information they provide, and a list of field names.
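The audio option of item 11 depends on first turning a stripped record into a sentence with connecting words. The sketch below covers only that text step; the field names and the sentence template are hypothetical, and actual speech synthesis is outside its scope.

```python
def to_spoken_text(tsv_line, field_names, template):
    """Insert connecting words around the fields of one stripped record."""
    record = dict(zip(field_names, tsv_line.split("\t")))
    return template.format(**record)

line = "Jane Doe\t555-0100\tBoston"
fields = ["NAME", "PHONE", "CITY"]
template = "{NAME} lives in {CITY}. The phone number is {PHONE}."
print(to_spoken_text(line, fields, template))
```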
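One way to keep site definitions outside the program, as item 12 describes, is to store them as data and have a generic routine interpret them. The sketch below assumes a JSON Knowledge Base whose entries hold a URL template, a record pattern and field names; this layout, the site name and the sample page are illustrative and not taken from the specification.

```python
import json
import re

# In practice this text would live in a separate, editable file.
KNOWLEDGE_BASE_JSON = """
{
  "example_weather": {
    "url_template": "http://weather.example.com/today?city={CITY}",
    "record_pattern": "<li>([^<:]+): ([^<]+)</li>",
    "fields": ["CITY", "FORECAST"]
  }
}
"""

def load_knowledge_base(text):
    """Parse the externally maintained site definitions."""
    return json.loads(text)

def strip_page(site, html):
    """Apply a site's externally defined pattern; one dict per record."""
    pattern = re.compile(site["record_pattern"])
    return [dict(zip(site["fields"], m.groups()))
            for m in pattern.finditer(html)]

kb = load_knowledge_base(KNOWLEDGE_BASE_JSON)
site = kb["example_weather"]
html = "<ul><li>Boston: Rain</li><li>Miami: Sunny</li></ul>"
print(strip_page(site, html))
```

When a provider changes its page layout, only the Knowledge Base entry needs editing; the stripping code stays the same.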
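The no-parameter reply of item 14 might be generated as in the sketch below. The catalog entries and the tab-separated reply layout are assumptions made for illustration; query dispatch itself is deliberately left out.

```python
# Hypothetical catalog of known third party services.
CATALOG = {
    "white_pages": {"provides": "phone directory listings",
                    "fields": ["NAME", "PHONE", "ADDRESS"]},
    "stock_quotes": {"provides": "stock quotes",
                     "fields": ["TICKER", "PRICE"]},
}

def handle_request(params):
    """With no parameters, describe every known service and its fields."""
    if not params:
        return "\n".join(
            "\t".join([name, entry["provides"], ",".join(entry["fields"])])
            for name, entry in sorted(CATALOG.items())
        )
    raise NotImplementedError("query dispatch is outside this sketch")

print(handle_request({}))
```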
15. Examples of services to be supported are:
various white and yellow phone directories
business directories and classification (SIC) - zip2.com
weather services
stock quotes (input: a list of ticker symbols)
public English dictionaries, bilingual dictionaries, and thesauri
web search engines (Dog Metafind; Yahoo!; Infoseek)
geographic text servers (zipcode <-> city, address <-> area code <-> airport code)
online translators
airline schedules and flight info (airline-specific sites)
professional directories: doctors, lawyers
Microsoft aerial photography
maps