Increased memory and remote electronic data storage capacity offers access to large amounts of data in a very convenient form and physical size. Data may be available on diskette, CD-ROM, magnetic tape, and on line to a centrally located computer and memory storage medium. The challenge remains to extract information from the data simply and efficiently and to have confidence in the result that all relevant items have been uncovered. To focus in on relevant database records, search engines generally use keywords, categorization, segment limitations, Boolean logic, and hit counts. More complex search engines can also employ hierarchical categorization and multifaceted searching.
Keywords are the basis of most searches. A simple keyword search, such as that found in most word processors under the “Find” command, will locate the occurrence of a text string within a document or a record. Misspellings, synonyms, or different tenses of a given text string will not be located. The searcher must be cautious to truncate the text string to a word's root. A search for the text string “graphical”, for instance, will not locate instances of the text string “graphics.” The searcher must also not choose commonly occurring words, as such a search would result in a high number of search results. Keywords are commonly combined with categorization, segment limitations, Boolean logic operators, advanced keywording, date operators and numeric operators to create a more effective search.
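By way of illustration only, the effect of truncating a text string to a word's root can be sketched as follows. The function name `keyword_search`, the sample records, and the choice to ignore letter case are assumptions made for this sketch and are not taken from any particular word processor or search engine.

```python
def keyword_search(records, text_string):
    """Return the records that contain text_string (case ignored by assumption)."""
    needle = text_string.lower()
    return [record for record in records if needle in record.lower()]


# Hypothetical sample records.
records = [
    "A graphical user interface for search",
    "Computer graphics in film",
    "Database indexing methods",
]

# The full word misses the record that uses "graphics".
full_word = keyword_search(records, "graphical")

# The truncated root "graphic" locates both forms.
truncated = keyword_search(records, "graphic")
```

In this sketch the truncated string locates both records, while the full word locates only one.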
Categorization is a technique used to focus the scope of a search. A category is a subset of records. By conducting a search only within this subset of records, fewer irrelevant hits result. Lexis-Nexis™ and Dialog™, two online searchable databases with proprietary search engines, are examples of categorized databases. Prior to conducting a keyword search within the Lexis-Nexis™ or Dialog™ database, the searcher must select from an extensive list of categories. Some categories are broader than others. If the searcher selects an overly broad category, his or her search will result in too many irrelevant hits and the searcher will waste time sorting through the undesired search result records looking for relevant hits. If the searcher selects an overly narrow category, his or her search results will not include some of the desired records. Selection of an appropriate category, therefore, is of vital importance.
Searches can be further focused with the use of segment limitations. “Segments” are similar to categories in that they are domain specific. Category classifications are used to divide multiple records into subsets. Segment classifications are used to divide individual records into specific groupings of information, or “fields”. Using segments, keyword searches can be targeted at certain fields of a record, such as a record's title or author. Search engines distributed by Lexis-Nexis™ and Dialog™, two online searchable database providers, are well-adapted to such targeted searches, often using dozens of segments for each category of records. A news article record, for instance, is typically broken down into separate fields for byline, date, publisher, abstract, and body. To find a news article with the word “elephant” in the title (or headline) using the classical interface of the Lexis-Nexis™ search engine, the following syntax would be needed: “HEADLINE(elephant)”.
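A segment-limited search of the kind described above can be sketched as a keyword search restricted to one field of each record. The record layout, the field names, and the function `segment_search` are hypothetical and do not reproduce the Lexis-Nexis™ or Dialog™ implementations.

```python
def segment_search(records, segment, keyword):
    """Return records whose named segment (field) contains the keyword."""
    needle = keyword.lower()
    return [r for r in records if needle in r.get(segment, "").lower()]


# Hypothetical news article records, each divided into segments.
articles = [
    {
        "headline": "Elephant herd returns to park",
        "byline": "A. Writer",
        "body": "Rangers reported the return on Monday.",
    },
    {
        "headline": "Zoo budget approved",
        "byline": "B. Writer",
        "body": "Funds will support the elephant enclosure.",
    },
]

# Roughly analogous to HEADLINE(elephant): only the headline is examined.
headline_hits = segment_search(articles, "headline", "elephant")

# The same keyword aimed at a different segment finds a different record.
body_hits = segment_search(articles, "body", "elephant")
```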
Using keyword searching may not be very helpful if the user is not familiar with the appropriate standard terminology related to the information they are looking for. Further, there may be many appropriate ways to describe the information sought by the user. A concept expressed by a standard industry term in one industry may be different from a standard industry term in a different industry. A keyword search would require searching all synonyms used in order to ensure a complete and accurate result.
When a user of a searching/retrieval system enters a keyword search query into a system, the query is parsed. Based on the parsed query, a listing of documents relevant to the query is provided to the user. In the prior art, it is also known to use semantic networks when parsing a query. The number of words used to search the database is then expanded by including the corresponding words or associated words identified by the semantic network in the search instructions. This expansion can be based on any one or a combination of using stems or roots of terms, using sound-a-like words, using wildcard words or any other appropriate semantic technique.
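The query expansion described above can be sketched with a small lookup table standing in for a semantic network. The table contents, the function names, and the sample records are assumptions for illustration; a real semantic network would be far larger and could also supply stems, sound-alike words, or wildcard matches.

```python
# Hypothetical semantic network: each term maps to associated terms.
SEMANTIC_NETWORK = {
    "car": {"automobile", "vehicle"},
    "doctor": {"physician"},
}


def expand_query(terms, network):
    """Return the original terms plus any associated terms from the network."""
    expanded = set()
    for term in terms:
        key = term.lower()
        expanded.add(key)
        expanded |= network.get(key, set())
    return expanded


def search_any(records, terms):
    """Return records containing at least one of the terms."""
    return [r for r in records if any(t in r.lower() for t in terms)]


records = ["Automobile repair manual", "Physician directory", "Gardening tips"]

plain = search_any(records, {"car"})
expanded = search_any(records, expand_query(["car"], SEMANTIC_NETWORK))
```

Here the unexpanded query locates nothing, while the expanded query locates the record that uses the synonym.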
Boolean operators, such as “AND”, “OR” and “MINUS”, are used to enhance the capabilities of a search engine. The basic format of Boolean queries is well known in the art and generally takes on the form of “X OR Y”, where X and Y are two distinct keywords. Because search requests are processed by a computer, syntax rules must be strictly followed when drafting a Boolean keyword search. In many search engines the logical operators “AND” and “OR” must be capitalized. Some search engines allow additional syntax that indicates requisite proximity of keywords or hierarchy within a specific Boolean query. Hierarchy within a Boolean query is usually designated with the use of parentheses. The “(A OR B) AND (C OR D)” query, for instance, finds a first set of records containing “A OR B” and a second set of records containing “C OR D”, then finds records included in both the first set and the second set.
Using the Boolean operator “AND” in a search expression such as “X AND Y” will yield records which include both X and Y in the record. Using the Boolean operator “OR” in a search expression such as “X OR Y” will yield records which include either X or Y in the record. Using the Boolean operator “MINUS” in a search expression such as “MINUS X” will yield records which do not include the term X in the record.
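The Boolean operators described above correspond to set operations on the records matching each keyword. The following sketch assumes a hypothetical index mapping each keyword to a set of record numbers; the index contents and function names are illustrative only.

```python
# Hypothetical index: keyword -> set of record numbers containing it.
index = {
    "a": {1, 2},
    "b": {3},
    "c": {2, 3, 4},
    "d": {5},
}
all_records = {1, 2, 3, 4, 5, 6}


def both(x, y):
    """X AND Y: records that include both terms."""
    return x & y


def either(x, y):
    """X OR Y: records that include at least one of the terms."""
    return x | y


def minus(universe, x):
    """MINUS X: records that do not include the term."""
    return universe - x


# (A OR B) AND (C OR D): two sets are built first, then intersected.
first_set = either(index["a"], index["b"])
second_set = either(index["c"], index["d"])
result = both(first_set, second_set)

without_a = minus(all_records, index["a"])
```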
A query that is too narrow will result in fewer than the desired number of records. Correspondingly, a query that is too broad will result in more than the desired number of records. Immediate user feedback on a specific query helps the searcher construct a better subsequent query. Hit count is perhaps the most effective form of feedback for constructing a better query. If a query is too narrow, the hit count will be very low, possibly even zero. If a query is too broad, the hit count will be very high. Hit count information is used with selected viewing of search results to alert the searcher to mistakes, such as incorrect category or segment choice, or otherwise assist the searcher in drafting more effective queries. Hit counts are generally displayed after a given query is executed. Hit counts are more useful when provided for each search term and each combination of search terms. Boolean Representation One, illustrated below in Table I, demonstrates how individual hit counts can be used for the Boolean keyword search for “(cat OR dog) AND (doctor OR veterinarian)”.
TABLE I
Boolean Representation One

In the above example, the hit counts are as follows: in the database the term “cat” is included in 280 records; in the database the term “dog” is included in 494 records; in the database the term “veterinarian” is included in 34 records; in the database the term “doctor” is included in 194 records; in the database the term “cat” or “dog” is included in 774 records; in the database the term “veterinarian” or “doctor” is included in 228 records; and in the database the Boolean query for the Boolean expression “(cat OR dog) AND (doctor OR veterinarian)” results in the location of 4 records. If the Boolean expression is altered by the replacement of “dog” with “cow”, the hit count change ripples through the Boolean expression's representation as shown in Boolean Representation Two, illustrated below in Table II.
TABLE II
Boolean Representation Two

Feedback from individual hit counts gives the searcher access to information normally hidden. Viewing individual hit counts, a searcher is better able to identify search terms that are too specific, too broad, or misspelled.
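Reporting a hit count for each term and each combination of terms can be sketched as follows for a query of the form “(cat OR dog) AND (doctor OR veterinarian)”. The index below is a small hypothetical data set, so its counts do not match the counts recited above, and the function `hit_counts` is an assumption made for this sketch.

```python
def hit_counts(index, group_a, group_b):
    """Count hits for each term, each OR group, and the final AND."""
    counts = {}
    unions = []
    for group in (group_a, group_b):
        union = set()
        for term in group:
            ids = index.get(term, set())
            counts[term] = len(ids)
            union |= ids
        counts[" OR ".join(group)] = len(union)
        unions.append(union)
    counts["AND"] = len(unions[0] & unions[1])
    return counts


# Hypothetical index: term -> set of record numbers containing it.
index = {
    "cat": {1, 2, 3},
    "dog": {3, 4},
    "doctor": {2, 5},
    "veterinarian": {4},
}

counts = hit_counts(index, ["cat", "dog"], ["doctor", "veterinarian"])

# A misspelled term shows up immediately as a zero count.
typo_counts = hit_counts(index, ["cat", "dgo"], ["doctor", "veterinarian"])
```

Because every intermediate count is retained, a term that contributes nothing, such as the misspelled “dgo” in this sketch, is visible even when the final count is nonzero.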
An additional search tool is hierarchical categorization. Instead of classifying records into separate categories, hierarchical categories classify records into both broad groupings and progressively narrower groupings. An example of hierarchical categorization is found in biology, where organisms are organized, from broadest to narrowest, by kingdom, phylum, class, order, family, genus, and species. Hierarchical categorization is commonly used in conventional internet search engines, such as those found at the Yahoo!™ and Altavista™ websites. To find information about a specific topic, a search engine user navigates from a list of broad categories through an increasingly more specific list of categories. Once the first category is selected, a search engine typically displays a lower level screen with another list of alternatives. Such navigation continues down through the various menus of alternatives having decreasing priority levels. At any point of the category navigation, a keyword or Boolean search can be performed upon the records in that category. Search results are only obtained from records located within the category searched. Most search engines only allow searches in one category at a time. To search a second category, the searcher must navigate up the hierarchical category tree and then down to the second category.
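Restricting a keyword search to one branch of a hierarchical category tree can be sketched as follows. Each record is assumed to carry a single category path from broadest to narrowest; the category names, record titles, and the function `search_in_category` are hypothetical.

```python
# Hypothetical records, each assigned to exactly one category path.
records = [
    {"title": "Hiking boots review",
     "category": ("Recreation", "Outdoors", "Hiking")},
    {"title": "Trail running shoes",
     "category": ("Recreation", "Sports", "Running")},
    {"title": "Boots in fashion history",
     "category": ("Arts", "Design", "Fashion")},
]


def search_in_category(records, category_path, keyword):
    """Search only the records located at or below the given category path."""
    path = tuple(category_path)
    needle = keyword.lower()
    return [
        r["title"]
        for r in records
        if r["category"][:len(path)] == path and needle in r["title"].lower()
    ]


broad = search_in_category(records, ("Recreation",), "boots")
narrow = search_in_category(records, ("Recreation", "Sports"), "boots")
other_branch = search_in_category(records, ("Arts",), "boots")
```

In this sketch the record under the “Arts” branch is reachable only by a separate search of that branch, which mirrors the one-category-at-a-time limitation described above.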
The limits of the hierarchical categorization method were addressed in the early 1990's by the Software Technology for Adaptable, Reliable Systems (STARS™) program, which was spearheaded by International Business Machines Corporation and the Boeing Company. One objective of the STARS™ program was to improve the classification system for software so that previously developed software could be reused in new software development efforts. One proposal resulting from the STARS™ program was multifaceted classification. Instead of assigning a record to a single category, multifaceted classification allows a record to belong to multiple categories. The multiple categories become part of a record's description, along with standard information for the record such as the title, the abstract (or keywords), the date, and author. Multifaceted classification improves the likelihood of locating relevant records. First, the searcher can take several different paths to locate the same record. Using the analogy of books in a library, multifaceted classification is able to place a single book on more than one shelf. Second, the multiple categories can be subjected to a Boolean query. Records relating to sports medicine could be found by searching for records included in both the sports category and the medicine category.
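Multifaceted classification can be sketched by letting each record carry a set of categories and applying Boolean logic to those sets. The record titles, category names, and function names below are hypothetical and are not drawn from the STARS™ program.

```python
# Hypothetical records: title -> set of categories the record belongs to.
records = {
    "Treating runner's knee": {"sports", "medicine"},
    "League standings": {"sports"},
    "Vaccine schedule": {"medicine"},
}


def in_all(records, *categories):
    """Records that belong to every listed category (logical AND)."""
    wanted = set(categories)
    return sorted(t for t, cats in records.items() if wanted <= cats)


def in_any(records, *categories):
    """Records that belong to at least one listed category (logical OR)."""
    wanted = set(categories)
    return sorted(t for t, cats in records.items() if wanted & cats)


sports_medicine = in_all(records, "sports", "medicine")
sports_or_medicine = in_any(records, "sports", "medicine")
```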
An internet search engine employing multifaceted classification has been developed by the NCBI (National Center for Biotechnology Information), a division of the NLM (National Library of Medicine) at the NIH (National Institutes of Health) for the PubMed database of bibliographic information. The NCBI search engine includes a hierarchical category tree from which categories can be selected. The NCBI search engine permits the searcher to select multiple categories, called “MeSH Terms”, from the hierarchical category tree. A MeSH Term can be linked by a logical AND or a logical OR with other MeSH Terms to create a Boolean expression. The Boolean expression of MeSH Terms can then be combined with additional terms to create the final query. The NCBI search engine also displays the hit count for each category and for each Boolean combination of categories.
Boolean logic, segment limitations, hit counts, hierarchical categorization, and multifaceted classification help the searcher create more effective queries, but at the cost of increased complexity. Often instruction manuals or a software program's help menu must be consulted to draft a query. Dialog™, for instance, publishes a “Bluebook” that contains detailed lists of segment codes for each of their many databases. Lexis-Nexis™ goes so far as to provide free online access and training seminars for students to overcome their search engine's initial learning curve. Addressing the complexity of search syntax, efforts have been made in the design of search engine software to reduce the amount of knowledge and experience needed to draft queries. The Lexis-Nexis™ search engine, for instance, provides searchers with the option of using a graphical user interface rather than their classical interface.
Most modern computer systems employ a graphical user interface rather than the more basic textual interface. In a graphical user interface, the user can run application programs, manipulate files, and perform most other necessary functions by manipulating images on the computer's display. This manipulation is accomplished by using cursor control keys and other keyboard keys or by using a cursor controlling peripheral device such as a joystick, mouse or trackball. A computer system 400 with a graphical user interface can be implemented as illustrated in FIG. 1. In FIG. 1, the computer system 400 includes a central processor unit (CPU) 401, a main memory 402, a video memory 403, a keyboard 404 for user input, supplemented by a conventional mouse 405 for manipulating graphic images as a cursor control device and a mass storage device 406, all coupled together by a conventional bidirectional system bus 407. The mass storage device 406 may include both fixed and removable media using any one or more of magnetic, optical or magneto-optical storage technology or any other available mass storage technology. The system bus 407 contains an address bus for addressing any portion of the memory 402 and 403. The system bus 407 also includes a data bus for transferring data between and among the CPU 401, the main memory 402, the video memory 403 and the mass storage device 406. Coupled to a port of the video memory 403 is a video multiplex and shifter circuit 408, to which in turn a video amplifier 409 is coupled. The video amplifier 409 drives a monitor or display 410 on which a graphical user interface is displayed. The video multiplex and shifter circuitry 408 and the video amplifier 409 convert pixel data stored in the video memory 403 to raster signals suitable for use by the monitor 410.
Graphical user interfaces for search engines use one or more screens to assist the searcher in the creation of a query. A sample query input screen 100 is illustrated in FIG. 2. The input screen 100 includes several labeled boxes capable of receiving textual inputs, including a client text box 102, a category text box 104, and a query text box 108. The client text box 102 is included such that individual searches can be billed to different clients and/or different projects. The category text box 104 is provided such that the searcher can input the category that will be searched. The query text box 108 is provided for the text of the query that will be executed in the selected category. The date parameters for a search are inputted using three boxes: a date parameter box 116, a start date box 120, and an end date box 122. Queries can be executed using the search button 114, saved using the save query button 124, or closed using the close button 126. The searcher can obtain assistance in use of the input screen 100 by pressing the help button 128.
Buttons in the FIG. 2 input screen 100 prompt the display of additional information. A date parameter selection button 118 is used to display available date parameters, such as: “Date Is”, “Date After”, “Date Before”, “Date Between”, or “No Date Restriction”. To find a record published between 1985 and 1989, the searcher would select “Date Between” for the date parameter box 116, “1985” for the start date box 120, and “1989” for the end date box 122. A segments button 110 displays the segment limitations available for the category selected in the category text box 104. A Boolean operators button 112 displays the Boolean operators available for connecting keywords in the query text box 108, such as: “OR”, “AND”, “NOT”, and the special syntax used for proximity searches. A category walker button 106 is used to choose from a wide selection of available categories.
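The date parameters listed above can be sketched as a single test applied to each record's date. The sketch assumes that dates are compared as whole years and that “Date Between” includes both end points; neither assumption is stated for the input screen 100, and the function `date_matches` is hypothetical.

```python
def date_matches(year, parameter, start=None, end=None):
    """Return True if a record's year satisfies the selected date parameter."""
    if parameter == "No Date Restriction":
        return True
    if parameter == "Date Is":
        return year == start
    if parameter == "Date After":
        return year > start
    if parameter == "Date Before":
        return year < start
    if parameter == "Date Between":
        # Inclusive bounds are an assumption of this sketch.
        return start <= year <= end
    raise ValueError(f"unknown date parameter: {parameter}")


# Hypothetical publication years for five records.
years = [1984, 1985, 1987, 1989, 1990]

between_1985_and_1989 = [
    y for y in years if date_matches(y, "Date Between", 1985, 1989)
]
```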
An exemplary category walker 140 is illustrated in FIG. 3. The category walker 140 includes a category tree viewing area 152, a category scroll bar 154, an OK button 146, a cancel button 148, and a category help button 150. The category viewing area 152 includes a sample hierarchical category tree represented graphically in a folder structure 142 with a root directory 144 and subdirectories arranged in alphabetical order. To navigate through the folder structure 142, the searcher scrolls through the alphabetical list of categories and clicks on the folder icons to view specific subdirectories of the selected category. To select a category, the searcher clicks the folder icon associated with the desired category using the conventional mouse 405 and then clicks the OK button 146. Once the OK button 146 is clicked, the category walker 140 closes and the category tree for the selected category appears in the category text box 104.
Saving a query is different from saving search results. A saved query contains all the information displayed on the input screen 100. When the searcher reopens a saved query, the query is displayed in the input screen 100 in the same manner it was originally displayed. The client information is displayed in the client text box 102, the category is displayed in the category text box 104, the query is displayed in the query text box 108, and any date information is displayed in the date parameter box 116, the start date box 120, and the end date box 122. The saved query displayed in the input screen 100 can be modified by the searcher and saved under another file name. Search result records are typically not saved in a format that can be manipulated by the search engine. Instead, search engines permit lists of search result records to be downloaded and saved as text documents. Individual search results can also be downloaded and saved as text documents. Once downloaded and saved on the searcher's computer, the list of search results can be manipulated using word processing software. To browse through the search results of a previously executed query using a search engine, a searcher must reexecute a saved query.
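The distinction between a saved query and saved search results can be sketched as follows. The saved query holds only the contents of the input screen fields, so the results must be regenerated by executing the query again. The class `SavedQuery`, its field names, and the use of JSON as the storage format are assumptions made for this sketch.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SavedQuery:
    """The fields shown on the input screen; no search results are stored."""
    client: str
    category: str
    query: str
    date_parameter: str = "No Date Restriction"
    start_date: Optional[str] = None
    end_date: Optional[str] = None


def save_query(saved):
    """Serialize the query fields to text (JSON chosen for illustration)."""
    return json.dumps(asdict(saved))


def load_query(text):
    """Restore the query so it can be redisplayed, modified, or re-executed."""
    return SavedQuery(**json.loads(text))


original = SavedQuery(
    client="Client A",
    category="News",
    query="(cat OR dog) AND (doctor OR veterinarian)",
    date_parameter="Date Between",
    start_date="1985",
    end_date="1989",
)

restored = load_query(save_query(original))
```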
The database for a search engine can be local or remote. Local databases are generally marketed in the form of a CD-ROM accompanied by a proprietary search engine tailored for use with the particular data. Once a query is drafted, the proprietary software accesses records from the CD-ROM and displays the relevant records. The CD-ROMs can be used within a stand-alone computer or on a local area network accessible to multiple computers. As a CD-ROM is stored locally, use of the database does not require access to external transmission networks (e.g., telephone lines, ISDN, T-1, or DSL). Because no external data transmission is needed, data from local databases can be retrieved faster and with greater reliability. CD-ROMs are not practical in some applications, however. CD-ROMs hold a limited amount of data, so they are not practical for databases that will not fit on a single CD-ROM (about 650 megabytes). The availability of CD-ROMs is also limited. Only widely used databases are available through normal marketing channels. Purchase of CD-ROMs is also impractical when the same information is available to the public over the internet.
Remote databases are searched in much the same way as a local database, but over a communication line. Remote access allows large databases to be centrally located and maintained, resulting in larger storage capability and lower costs. The remote database's need for transmission of data over a communication line, however, results in slower and less reliable retrieval of database information. Before the widespread growth of the internet, most remote databases were accessed by dial-up modems through a direct connection over a telephone line using proprietary data transfer protocols. With the rapid growth of the internet, more database services are accessible through the internet with a typical internet browser, such as Microsoft Internet Explorer or Netscape Communicator.
Access to remote databases through the internet follows the client-server model. The “client” is the searcher's computer. The server is the hardware and software maintained in a different location by the database provider. FIG. 4 illustrates how the client-server model is typically implemented. FIG. 4 includes a client 170, an internet 172, middleware 176, a first database 184, a second database 186, a third database 188, and a third party web server 199. The client 170 includes a browser 196 and a modem 198. The client 170 is coupled to the middleware 176 through a first communications route 174 traveling through the internet 172. The client 170 is also coupled to the third party web server 199 through a second communications route 197 through the internet 172. The middleware 176 is coupled to the first database 184 through a first database connection 190, to the second database 186 through a second database connection 192, and to a third database 188 through a third database connection 194.
The middleware 176 is used to connect the client 170 to database records. The middleware 176 includes a web server 178, a servlet 180, and a Java database connectivity (JDBC) layer 182. The web server 178 is used to control data transfer sent to and arriving from the client 170. Data is transferred between the client 170 and the middleware 176 using one of several data formats or transfer protocols: hypertext markup language (HTML), simple text markup language (STML), extensible markup language (XML), remote method invocation (RMI), common object request broker architecture (CORBA), or a proprietary protocol. The web server 178 receives data arriving from the client 170 and formats the data for the servlet 180. The web server 178 also receives data from the servlet 180, formats the data for the client 170, and sends the data to the client 170 through the internet 172. The servlet 180 is a Java program that receives the requests from the client 170, collects and processes the information requested, and then sends the information to the client 170 through the web server 178. The servlet 180 accesses databases through the JDBC layer 182. The JDBC layer 182 is a programming interface used to access information contained in the databases.
A search using the client-server model begins with a visit to the database provider's home page and ends with a visit to the third party web server 199. The database provider's home page includes instructions on where and how to conduct a search. The Yahoo!™ home page, for instance, includes a hierarchical category index, a box for entering a query, and a search button. Each of the categories listed on a given page is a hyperlink that will request a web page with subcategories for the selected category. Once satisfied with the category choice, the searcher types in a query and clicks the search button. The search request is formatted by the browser 196 and transmitted by the modem 198 across the internet 172 to the middleware 176. The web server 178 formats the request for the servlet 180, which sends a query to a given database through the JDBC layer 182. Records within the Yahoo!™ databases include a category field, a title field, an abstract field, and a uniform resource locator (URL) field. The query sent to the database includes the category and keywords for the search. The database searches through the category field and keyword field for matching records. Search results are delivered back to the servlet 180 through the JDBC layer 182. The search results include a hit count total and summaries of each hit including the title, a short description, and a URL. After collecting the search results from a given database, the servlet 180 creates a new web page that is sent to the client through the web server 178 and over the internet 172. The client 170 receives the new web page from the internet 172 through the modem 198 and displays the new web page on the browser 196. The new web page displays the title, short description, and URL for each hit. The title of each hit is a hyperlink to the URL of the corresponding web page. Clicking the hyperlink instructs the browser to retrieve the web page found at the selected URL. The web page is then retrieved through the internet 172 from the third party web server 199 and displayed by the browser 196.
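The server-side portion of the search described above, in which a request carrying a category and a keyword is turned into a database query and the results are summarized for a new web page, can be sketched as follows. The sketch is written in Python with an in-memory SQLite database standing in for the servlet 180, the JDBC layer 182, and the databases; it is not a Java servlet. The table layout, the sample rows, the example.com URLs, and the function names are hypothetical, and matching the keyword against the title and abstract fields is an assumption of the sketch.

```python
import sqlite3


def build_database():
    """Create a small in-memory database with hypothetical records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records (category TEXT, title TEXT, abstract TEXT, url TEXT)"
    )
    conn.executemany(
        "INSERT INTO records VALUES (?, ?, ?, ?)",
        [
            ("Science/Biology", "Elephant behavior",
             "Field notes on elephant herds.", "http://example.com/elephants"),
            ("Science/Biology", "Cell division",
             "An overview of mitosis.", "http://example.com/cells"),
            ("Recreation/Travel", "Elephant safaris",
             "Guided tours.", "http://example.com/safaris"),
        ],
    )
    return conn


def handle_search(conn, category, keyword):
    """Query one category for a keyword and summarize the hits."""
    pattern = f"%{keyword}%"
    rows = conn.execute(
        "SELECT title, abstract, url FROM records "
        "WHERE category = ? AND (title LIKE ? OR abstract LIKE ?) "
        "ORDER BY title",
        (category, pattern, pattern),
    ).fetchall()
    return {
        "hit_count": len(rows),
        "hits": [
            {"title": t, "description": a, "url": u} for t, a, u in rows
        ],
    }


conn = build_database()
page_data = handle_search(conn, "Science/Biology", "elephant")
```

The returned dictionary holds the hit count total and, for each hit, the title, short description, and URL from which a results page could be generated.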
The design of internet search pages is constrained by the need to continually retrieve new web pages. Web pages are transmitted to the client 170 as HTML, which contains the text displayed by the browser 196 together with links to the graphics displayed with it. The text and graphics displayed by the browser 196 cannot usually be altered unless a new page is requested and received over the internet. Each click of the computer mouse results in a noticeable delay, the length of which depends on the transmission speed of the modem 198 and the efficiency of the middleware 176. This time delay has resulted in the design of internet search engines that require fewer web pages for the execution of a search. The time delay has also resulted in fewer features, as each new feature requires the download of additional web pages for execution. While simplicity has resulted in fewer time delays, it has also resulted in less compelling graphical user interfaces. Each search looks the same as any other search.