1. Field of the Invention
The present invention relates to information retrieval, and the application and deployment architecture for such information retrieval. Specifically, the present invention concerns a multi-tier client/server model for record retrieval wherein optimum record retrieval from a database is achieved based on embedded expert judgments linked to words, phrases, sentences and paragraphs of text; or numbers; or maps, charts, and tables (including spreadsheets); or still pictures and/or graphics; or moving pictures and/or graphics; or audio elements (hereinafter sometimes collectively referred to as the "links" or "Linked Terms," or when any one of the aforementioned elements is used singly, as the "link" or "Linked Term"), contained in documents on a network resource, such as a web site, and incorporating an intuitive graphical user interface (GUI) to correlate through a plurality of frames the retrieved records with records from one remote database or a large collection of remote databases maintained by one company, called a Data Warehouse, plus a means to select various databases or Data Warehouses and a comprehensive selectable index of the linked embedded expert judgments.
2. Background Information
"Pull" Technology
A conventional information retrieval system includes a database of records, a processor for executing searches on the records, and application software that controls how the retrieval system, such as a database management system (DBMS), accepts the search queries, manages the search, and handles the search results. Generally, the database includes records such as text documents, financial or court records, medical files, personnel records, graphical data, technical information, audio and video files or various combinations of such data. Typically, a user enters a password and client billing information, and then initiates the search by finding the appropriate database or groups of databases to search and formulating a proper query that is sent to the DBMS. This process is known as searching by pull technology. To effectively search and retrieve records from the database, the DBMS typically offers a limited variety of search operations, or query models, specifically designed to operate on the underlying records in the database. The query models are coordinated and executed by an application generally referred to as a search engine. For example, a document database, such as a database of court opinions, may be organized with each court opinion as a record with fields for the title of the case, jurisdiction, court and body text. A simple search engine may support a full text searching query model for all the text fields, individual field searching, such as searching by court or jurisdiction, and various Boolean search operations such as and, or, and not. More sophisticated search engines may support the following query models:
1. nested Boolean or natural language searches;
2. grammatical connectors that search for terms in a grammatical relationship such as within the same sentence or paragraph (e.g., "/s", "/p", etc.);
3. proximity connectors that require search terms to appear within a specified number of terms of each other (e.g., "w/5");
4. exclusion terms ("BUT NOT");
5. weighted keyword terms;
6. wildcards;
7. specification of the order in which the database processes the search request (e.g., grouping words in parenthetical expressions);
8. restriction of the search to certain fields, and formulation of a restricted search such as by date, subject, jurisdiction, title, etc.; and
9. combination of the fields of search.
In addition, large commercial database providers, such as BLOOMBERG, DIALOG, LEXIS/NEXIS and WESTLAW typically have thousands of individual databases. These large commercial database providers are Data Warehouses, which comprise an architecture and process where data are extracted from external information providers, then formatted, aggregated, and integrated into a read only database that is optimized for decision making. Users subscribe to the Data Warehouses by monthly or yearly subscription, and then typically pay stratified levels of hourly charges for access to certain databases, or groups of databases.
Drawbacks of Pull Technology
One limitation of existing information retrieval systems, especially among the commercial Data Warehouses, is the burden on the user to first enter client and billing information and passwords to gain access and initiate the search, and then formulate the search query. Typically, the subscription based commercial database services provide password administration and extensive catalogues, both in print and on-line, describing the content and scope of the databases offered, and in some cases, live assistance by telephone by reference librarians who assist the user to find the proper databases. However, the user must remember the password, and spend time finding the proper database by catalogue, on-line access, or phone, or else incur more expensive hourly charges searching through single databases or groups of databases for the appropriate database content and scope.
A second limitation of pull technology is the formulation of the search query. To use the more powerful commercial Data Warehouses effectively, a user must be trained to use all of the aforementioned query models, and have sufficient knowledge of the topic to choose the appropriate keywords or natural language terms. The complexity of the search process compels the commercial Data Warehouses to offer training and keyword help to their subscribers by multiple publications that describe search tips; interactive software based training modules; account representatives who visit the user and train him or her; and customer service and reference librarians available by phone.
A third limitation of pull technology concerns how it is employed on the World Wide Web area of the Internet ("WWW") by such search engines as THE ELECTRIC LIBRARY, EXCITE!, FOUR ONE ONE (411), HOTBOT, INFOSEEK, LINKSTAR, LYCOS, MAGELLAN, ALTA VISTA, OPEN TEXT INDEX, WEB CRAWLER, WWWWORM, and YAHOO!, just to name a few. These search engines' query models are beginning to approach the sophistication and complexity of those of the commercial database companies, but unlike the commercial databases, they offer minimal customer support. Another drawback of the Internet search engines, well documented in the computer business and popular press, is that their search engine algorithms cause multiple irrelevant responses to a query. Other drawbacks of Internet search engines employing pull technology include:
1. The great majority of the Internet search engines have no control over the records in their database. Unlike the commercial Data Warehouses who have an ongoing relationship with the content provider (usually by a license agreement), and who carefully screen, cleanse and format the information provided by their information providers, many Internet search engines sweep through the WWW periodically and automatically, and catalogue web sites as records in their databases. They also permit any web publisher to submit his or her web site as a record entry with little or no prior screening.
2. As a result of little or no screening, and absolutely no contact with the information provider, Internet search engines often provide search results that have multiple "dead ends," the result of links which are often moved or deleted after the search engines have catalogued them. Moreover, the web sites' authors can sometimes manipulate the words on their site and cause the Internet search engines to list their websites higher on the search engine's relevancy lists than other web sites.
3. The search engines' databases include only a fraction of the Internet's content, and even then, the content may be from dubious sources, or sources which are not updated frequently.
4. Where the web sites include embedded search terms in links in documents to existing Internet search engines or current awareness "news" databases, since the words are linked to the free Internet search engines discussed above, the information retrieved, for reasons explained above, is not reliable and users often receive multiple irrelevant responses. Words linked to the current awareness databases receive more useful information, but there is no GUI correlating and synchronizing the records of multiple databases. Typically, those web sites pass authentication information by the QUERY_STRING environment variable. Once placed on the command line by the browser, the viewer can see all passwords and usernames in the authentication argument.
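One drawback noted in this section is that web sites passing authentication information through the QUERY_STRING environment variable expose usernames and passwords to anyone who can see the request URL. The following Python fragment is a minimal sketch of that exposure, not a description of any particular web site. The host name, the parameter names ("user", "pass", "q") and both helper functions are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_get_url(base, username, password, term):
    """Build a GET URL that carries credentials in the query string.

    Everything after the "?" is what a CGI program would receive in the
    QUERY_STRING environment variable. The parameter names are assumptions.
    """
    query = urlencode({"user": username, "pass": password, "q": term})
    return f"{base}?{query}"


def visible_credentials(url):
    """Return the credentials that anyone viewing the URL can read."""
    params = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in params.items()
            if key in ("user", "pass")}


# Hypothetical request: the password travels in plain view in the URL.
url = build_get_url("http://example.com/search", "jdoe", "s3cret", "bond prices")
print(url)
print(visible_credentials(url))
```

Because the credentials are part of the URL itself, they appear wherever the URL appears, which is the weakness the passage describes.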
The considerable logistical and practical drawbacks of pull technology are illustrated in the following example of an investment banker who is responsible for buying bonds for an institutional investor, such as a bank or an insurance company. This hypothetical investment banker, based on an actual person, will be used at different points throughout this patent application to illustrate and support the novelty and unobviousness of the present invention.
Every week, this investment banker must go before a board of executives at his bank and provide them with a list of bonds that he has examined and analyzed and recommends that the bank buy. In order to do his due diligence he must cover in his report five areas of research concerning the bond: 1) compare the bond price to other bond prices (the Bond Comparables); 2) obtain historical data concerning the bond and the company issuing the bond (the Historical Data); 3) obtain the Securities and Exchange Commission filings, such as 10K's and 10Q's, for the company issuing the bond (the SEC Filings); 4) obtain specific information from a wide variety of publications concerning the industry in which the company operates (the Industry Data); and 5) obtain information concerning the historical and anticipated performance of the company's stock (the Stock Data). Furthermore, he has to read various newsletters and white papers issued by investment banks desiring to sell the bonds to him, and which analyze the bonds using the same criteria mentioned above. In order to collect the data, this investment banker must log on and enter password and billing information; find the appropriate databases; and formulate the search and obtain the results in three to five different Data Warehouses, each of which is organized differently from the others and has different methods to enter search queries, and different query models. While pull technology satisfies the demands for the breadth and depth of the search (since the user can formulate his or her own queries, and make unlimited selections of databases to search), it is time consuming, cumbersome and expensive because the user must find the appropriate query formulation and database or databases within which to run the query, sometimes even in different Data Warehouses.
"Push" Technology
In response to the flood of information facing the typical Internet user under the pull model, the complexity of the query statements, and the well documented inability of the Internet search engines to locate and deliver relevant content, software companies developed software agents to push information to users. The push model is also known as webcasting.
Under push, computers sift through large volumes of information, filtering, retrieving and then ranking in order of importance articles of current interest. The user fills out a "profile" (also called a "channel"), that defines a predefined area of interest or activates a filter. This, in turn, causes the webcast search engine to search its own databases, or the databases of others, for content matching the profile or the filters submitted by the user. The user, in order to access the channels and have the content "pushed" to him or her, must download special client software which acts either independently of, or in conjunction with, the user's browser. Alternatively, a user can access a dynamically generated web page on the webcaster's server that lists the found articles. (An example of a dynamically generated web page is "Newspage Direct" by Individual, Inc.)
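The profile-driven filtering and ranking described above can be sketched as follows. This is an illustrative simplification only: the keyword-count scoring rule, the Article type and the function names are assumptions made for this example and are not taken from any of the webcasting products named in this section.

```python
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    body: str


def score(article, profile_keywords):
    """Count how often the profile's keywords occur in the article.

    Plain keyword counting is an assumed stand-in for whatever relevance
    measure a real webcast search engine applies.
    """
    words = (article.title + " " + article.body).lower().split()
    return sum(words.count(keyword.lower()) for keyword in profile_keywords)


def push_matches(articles, profile_keywords):
    """Filter out non-matching articles and rank the rest, best first."""
    scored = [(score(article, profile_keywords), article) for article in articles]
    matched = [(s, article) for s, article in scored if s > 0]
    matched.sort(key=lambda pair: pair[0], reverse=True)
    return [article.title for _, article in matched]


# Hypothetical profile and article pool.
profile = ["bond", "yields"]
pool = [
    Article("Software upgrade released", "A new version ships today"),
    Article("Treasury auction", "Demand for the bond was strong"),
    Article("Bond yields rise", "Corporate bond prices fell as bond yields rose"),
]
print(push_matches(pool, profile))
```

The sketch also shows why a fixed profile serves narrow, stable interests well: an article is delivered only if it matches keywords chosen in advance.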
One early version of the Internet push model, developed by Pointcast Inc., clogged the network behind a company's firewall when large numbers of the company's software agents pulled information from Pointcast's servers on the Internet at or near the same time. Pointcast later alleviated this problem by providing remote servers that could operate behind a company's firewall and request and collect (or cache) information at once or at predetermined times from the Pointcast servers on the Internet. These intermediate servers then pushed the information to employees, which effectively centralized the distribution of information in the Information Services (IS) department.
As mentioned above, all push technology requires that users compile a "profile" to detail their interests. The prior art of delivering the information obtained by the search engine pursuant to the profile is divided into three broad categories: offline browsers, e-mail delivered content providers, and information channels.
The offline browsers typically require a user to complete a profile with predetermined categories, then automatically search the Internet for the information specified in the profile and download the materials to the user's hard drive for viewing at a later time when the user is off the Internet. This first category of products includes: Freeloader by Freeloader, Inc.; Smart Delivery by FirstFloor, Inc.; WebEx by Traveling Software, Inc.; WebRetriever by Folio Inc.; and Web Whacker by ForeFront Group, Inc.
The second category of push products delivers the results of searches performed pursuant to the user's profile directly to the user's e-mail box, and includes: Netscape's Inbox Direct and Microsoft Mail.
The third category of push products arranges the predetermined categories into "channels" and uses filters to allow users to customize their news deliveries from a broad range of proprietary news sources. It is claimed that the results of the searches are pushed or "broadcast" in real time to the viewer. Examples of this type of service include: BackWeb by BackWeb, Inc.; Headliner by Lanacom, Inc.; Incisa by Wayfarer, Inc.; Intermind by Intermind, Inc.; Pointcast by Pointcast, Inc.; and Marimba by Marimba, Inc. However, since the retrieved data is first cached on the service provider's server (e.g., Pointcast's server), and then again on the company's servers behind the firewall, the results of the search are not really "broadcast in real time."
There is a fourth category of push products which do not fall neatly into any of the above three categories of delivery. Citizen 1 by Citizen 1 Software, Inc., is a human organized hierarchical listing of free Internet search engines. The user can then select a number of databases which fall under that category, and run several simultaneous queries in the databases. Digital Bindery by Digital Bindery Company allows users to "subscribe" to web pages as they browse. Once a subscriber, the user will automatically receive via e-mail any updates to the web pages to which the user subscribed.
Webcasting attempts to eliminate the inefficiencies of pull technology, namely the time consuming and unproductive hunt for information through Internet search engines. Instead of an open ended search through many databases linked to the web by various search engines, as is done under the pull model, push substitutes one central secure database which has collected either the content itself, or the links to the content. However, in spite of the name, push, the information provider does not drive the distribution of data. Instead, a client (in a client/server arrangement) contacts the information provider and requests the information. The client then downloads the information in the background, giving the impression that it is broadcast, when in fact, it is only automatically downloaded at a predetermined time.
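The observation that "push" is in practice a client-initiated download at a predetermined time can be illustrated with a small simulation. The sketch below is hypothetical: the Provider and PollingClient classes, their method names, and the use of an integer clock are assumptions made for this example and do not describe any actual webcasting client.

```python
class Provider:
    """Stands in for the information provider's server."""

    def __init__(self):
        self.items = []
        self.requests = 0  # how many times a client has contacted the server

    def publish(self, item):
        # Publishing does NOT contact any client; the item just waits here.
        self.items.append(item)

    def fetch_since(self, index):
        self.requests += 1
        return self.items[index:]


class PollingClient:
    """Stands in for the client software that appears to receive a broadcast."""

    def __init__(self, provider, interval):
        self.provider = provider
        self.interval = interval  # time between scheduled downloads
        self.next_poll = 0
        self.cache = []

    def tick(self, now):
        """Called with the current time; downloads only when a poll is due."""
        if now < self.next_poll:
            return []
        new_items = self.provider.fetch_since(len(self.cache))
        self.cache.extend(new_items)
        self.next_poll = now + self.interval
        return new_items


provider = Provider()
client = PollingClient(provider, interval=60)
provider.publish("morning headlines")
print(client.tick(0))    # first scheduled download
provider.publish("price alert")
print(client.tick(30))   # nothing arrives: the next download is not yet due
print(client.tick(60))   # the alert arrives only at the next scheduled poll
```

In this sketch the provider never initiates contact; every transfer begins with the client's request, and an item published between polls is not seen until the next scheduled download.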
Shortcomings of "Push" Technology
"Push" may be a satisfactory method for serving information to knowledge workers who depend on a constant stream of updated factual information served in narrow categories. Examples of these kinds of workers would be sales representatives who must find new prospects, staff in field offices who must be aware of sudden price changes, information managers who must distribute software upgrades and marketing professionals who must be aware of the new products released by the competition.
However, there is a category of knowledge workers whose information needs are not properly satisfied by push technology. The hypothetical investment banker discussed above is an example of such a knowledge worker. These knowledge workers cannot use "filters" and "profiles" to provide the most relevant information since the information they need cannot easily fit into categories, but rather spans categories. These knowledge workers use information to solve problems that are rarely alike. They need information to solve a problem, but they do not know what they need day to day.
This knowledge worker culls information and sparks creativity by comparisons and contrasts, juxtapositions, and induction and deduction, rather than by looking at raw news reports. The investment banker discussed above usually does not know well in advance what industry or company he will be analyzing. He also does not always know where his research and analyses will take him, or what databases he will use. His decisions are tied to so many variables in the marketplace that his information needs cannot be predetermined by a general form or profile. A further limitation of webcasting is that it has not struck the optimum balance between burdening the viewer with a persistent stream of alerts and alerting the viewer when new information has arrived.
Moreover, since webcasting centralizes the development, control and the administration of "profiles" within an Information Services (IS) department, certain knowledge workers' information needs may not be satisfied by such centralization. IS departments, already strapped for resources to manage mail servers, web servers, Lotus Notes servers and application servers, may not be capable of managing servers that maintain lists of user "profiles" and dispatch software agents into the World Wide Web (WWW). The push model works only if IS departments proactively keep the profile lists current and advertise them internally. Furthermore, there may be enormous legal ramifications, as yet unaddressed, for companies downloading copyrighted material to their internal servers and redistributing it internally, especially if the push purveyor links to other websites or search engines without permission. See "Legal Situation Is Confused on Web Content Protections," New York Times, Jun. 9, 1997, at page D5.
Finally, all the above examples of "push" technology, except for Digital Bindery, require the buying, installation, maintenance and updating of software by both the publisher and the user.
In addition to the above-mentioned disadvantages, both the push and pull models fail to address the need to efficiently, inexpensively, and frequently augment web sites with current or historical data. According to the Mar. 11, 1997 Wall Street Journal, in an article entitled "At Thousands of Web Sites, Time Stands Still": "Nearly five million pages of a total 30 million indexed by AltaVista on the Web haven't been updated at all since early 1996 . . . Some 424,000 pages haven't been refreshed since early 1995--and 75,000 Web pages haven't been touched since before 1994."
Therefore, it is desirable to dynamically augment a static web page containing text, audio, graphics, and/or video information on a network resource with Linked Terms connected to current awareness and/or historical records from expert pre-selected Data Warehouses or single databases, thereby saving the enormous labor and time costs involved in updating web pages.
It is similarly desirable to permit users to choose and narrow their own search criteria through pull technology by clicking on Linked Terms in a written document, and still obtain the benefits of push technology by having current awareness and historical records pushed to update their selections without introducing new protocols or application programming interfaces (APIs) to operate. It is therefore desirable to provide a method and apparatus the use of which does not encumber the user's or publisher's computer system, in the following ways: 1) neither the user nor the publisher has to buy, install, maintain or update software to use the invention; 2) use of the method and apparatus does not require large hard disk and memory allocations by the user; and 3) as a result of "2," use of the method and apparatus does not preclude using other push products simultaneously. This invention can work with any operating system that employs a browser, and can accommodate any binary data type, including FTP repositories, full Java applets and VRML, and any browser plug-in, such as Shockwave applications. Moreover, it can deliver information from a variety of sources, including from the Internet, company databases, groupware and intra- and extranets.
Finally, given the almost exclusive use of current awareness and historical data on databases for research purposes in the prior art, the present invention is unique and unobvious because it is the only invention that updates Linked Terms in any written document, including web pages, with current and/or archived information from databases and Data Warehouses using a proprietary user interface and embedded expert judgment. Updating web pages and written content in this manner effectively transforms raw information into data which can support any point made in any written document. So, for example, if the document is used for marketing purposes, this invention would permit raw information to be used for marketing purposes, etc.
It is also desirable to provide a method and apparatus which, rather than seeking to identify records on a database whose characteristics exactly match what the user types into a query model, embody one or more kinds of expert judgment data for the purpose of selectively retrieving on demand the best fitting or most appropriate records in response to user data entry. Accordingly, it is desirable to provide a query architecture for an information retrieval method and apparatus that utilizes both pull and push technologies, wherein knowledge workers can select their database resources based on the issue they must solve and, once they have selected the database resources, current awareness or historical data can be pushed to them based upon embedded expert judgment concerning the same issue.
It is further desired that the Linked Terms in any document be augmentative and allow for the efficient integration of embedded expert judgment that correlates a user's choice of a Linked Term with optimum data information judgments or designations to identify those data where the fit between the user's choice of a Linked Term and optimum data for that Linked Term is best.