The present invention provides a method for searching an index of structured and unstructured incoming data received from remote locations on a wide area network or global network, e.g. the Internet or an enterprise intranet. More specifically, the invention provides for capturing and enriching data records with metadata or appended data, and accessing the data through the use of a search engine designed for searching unstructured or free-form data.
Such search engines are in common use. Examples presented herein have been specifically tested for use with the familiar Google® search appliance and Internet search engine. However, the teachings herein are adaptable for use with other search appliances and engines, e.g., the open source Lucene search engine licensed by the Apache Software Foundation. Such engines are useful in searching records on the Internet and on private intranets, which business enterprises often configure to enable access to data from diverse locations.
Referring to FIG. 1 of the drawings, Internet search engines, such as Google®'s, typically index information that is ‘unstructured’. Each record has information of a different type than the next record. Such search engines can also index the fields in structured databases, but they treat the data similarly to unstructured text.
To conduct a search, users type a word or a set of keywords and then submit their natural language query to the search engine. The search engine returns a set of results, known as “hits”, each of which contains:
a uniform resource locator (URL) for the source document, which can be either unstructured data (text or word processor document) or structured data (database record);
a snippet (the description of the search result) and a link, for example, to the cached source in the index.
Some search engines return all records found in a search of an index, but others, e.g., Google®, generally return only a subset of the most relevant results (URLs), usually up to 1000 hits. For example, even though a search query may have one million hits, the engine will return only the 1000 most relevant search results. If a user needs additional results, i.e., if the user was not able to find what was sought within the first 1000 results, the user would have to refine the search words by adding or replacing words and submit them as a new query. This limitation is pragmatic, since the expectation is that if a user does not find the desired results within the most relevant hits, it will be more efficient to refine the query than to page through all one million hits.
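By way of illustration only, the result cap described above can be sketched as follows. This is a minimal sketch and not the implementation of any particular search engine; the function name top_hits, the relevance scores, and the example.com URLs are hypothetical and are introduced solely for this example.

```python
import heapq

def top_hits(scored_hits, limit=1000):
    """Return at most `limit` hits, ordered from most to least relevant.

    Each hit is a (relevance_score, url) pair; the scores are assumed to
    have been computed elsewhere by the engine's ranking function.
    """
    return heapq.nlargest(limit, scored_hits, key=lambda hit: hit[0])

# Hypothetical answer set: 5000 matching documents with made-up scores.
matches = [
    (((i * 37) % 5000) / 5000, f"http://example.com/doc/{i}")
    for i in range(5000)
]

returned = top_hits(matches)
print(len(matches), "matches,", len(returned), "returned")
```

In this sketch the remaining 4000 matches are never shown to the user, who can reach them only by submitting a refined query.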
Search engines typically display approximately 10 search results per page. Usability studies indicate that the majority of the users, especially enterprise users, expect to find what they are looking for within the first 3 pages (30 search results). If they do not find it, they resubmit the query. This process is inefficient, because:
The user has no other way to gain insight about what may be in the search results except by reading the snippets of all of the results. Snippets are generated by algorithms. Sometimes they are not understandable. Such snippets can also be misleading.
There is no guarantee that replacing the old results with new results will be more useful, given that the user refines the search without much knowledge about the structure and content of all 1000 previous results.
Unlike unstructured information, structured information has the property that the information is all of the same type, and the components of the information can be identified by tags or field names. The information that is structured may be intended for storage in relational databases for example. For each data element that is described by a ‘fieldname’, there is a ‘fieldvalue’.
Structured databases contain uniformly structured records, each of which has the same named categories of information, referred to as fields, and one or more values for each field in the records. That is, records are each composed of fieldname-value pairs, sometimes herein referred to as tag-value pairs, name-value pairs, or FIELD_Name, Field_Value pairs, such as those shown in Table 1 below.
TABLE 1
Fieldname             Value
ACCIDENTDATE          090106
TYPE_OF_ACCIDENT      auto crash
COUNTY                HUDSON
INJURED               1
NAME_OF_INJURED01     JOHN SMITH
HOSPITAL              Hackensack General
ADMITTING_DOCTOR      ROBERT JONES
Users of a search engine find information by entering a search term, which is usually matched against one or more data values. For example, if a user enters the search term “Smith”, among the “hits” (search engine answer set) would be the sample record shown in Table 1 above.
However, the sample record of Table 1 would be included in the hits no matter which field had the value “Smith”. That is, “Smith” could be the value of the field ADMITTING_DOCTOR, or of the field NAME_OF_INJURED, or of the field COUNTY. Hence a search for hospital records with a patient's name of “Smith” would find records where the patient's name was “Jones” if the doctor's name was “Smith”. Or a search for hospital records with a patient's name of “Smith” would find records where the patient's name was “Adams” if the patient was in an automobile accident in Smith County.
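By way of illustration only, the false hits described above can be reproduced with a few fieldname-value records. This is a minimal sketch under stated assumptions: the three records, the simplified field names, and the functions keyword_hits and field_hits are hypothetical. The field-restricted function is shown only for contrast and is not asserted to be how any existing search engine operates.

```python
# Hypothetical records modeled on Table 1, reduced to three fields each.
records = [
    {"NAME_OF_INJURED": "JOHN SMITH", "ADMITTING_DOCTOR": "ROBERT JONES", "COUNTY": "HUDSON"},
    {"NAME_OF_INJURED": "MARY JONES", "ADMITTING_DOCTOR": "ALAN SMITH", "COUNTY": "HUDSON"},
    {"NAME_OF_INJURED": "PAUL ADAMS", "ADMITTING_DOCTOR": "ROBERT JONES", "COUNTY": "SMITH"},
]

def keyword_hits(rows, term):
    """Field-agnostic match: a record is a hit if ANY field value contains the term."""
    term = term.lower()
    return [row for row in rows if any(term in value.lower() for value in row.values())]

def field_hits(rows, fieldname, term):
    """Field-restricted match: only the value of the named field is examined."""
    term = term.lower()
    return [row for row in rows if term in row.get(fieldname, "").lower()]

print(len(keyword_hits(records, "Smith")))                   # 3
print(len(field_hits(records, "NAME_OF_INJURED", "Smith")))  # 1
```

The field-agnostic search returns all three records, although only the first has a patient named Smith; the other two match on the doctor's name and on the county.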
Even though the number of records having information of interest to a searcher might be very small, the number of hits could occupy many pages, most containing irrelevant information, making it very difficult for the searcher to find what was wanted. Some filtering may, therefore, be appropriate.
Search results are usually displayed in a static form, giving users almost no ability to analyze or manipulate the returned results within the search results page. At most, users can sort the results by relevance or by date, and they can do this only when they are connected to the server. If they are offline, they lose even the ability to sort by relevance or date; hence, storing search results is of little use. These limitations severely constrain the ability of users to efficiently analyze and manipulate search results to make faster and more informed decisions.
While this limitation may not be as obvious when searching completely unstructured data, such as word processing documents, it becomes quickly apparent when users search structured data sources.
An example of such an application would be the search of retail or inventory databases. In both cases the search engine may return hundreds of records within different categories and different price ranges. A mere sequential listing of these records is not very useful. A tabular view would be more appropriate.
Users want to manipulate tabular data as well as transform it in order to make informed decisions. A dynamic tabular view offers the user the ability to sort the data by any of the available categories, such as gender, product category, sub-category, price range, price, color, etc.
In a dynamic table a user can quickly find not only the minimum price, but also the minimum price within each category. A user can also pivot the data, i.e., display product prices by brand and category in order to compare and contrast. An inventory manager can sum the quantities directly in the search results, instead of having to go to other applications to perform this task. The prior art offers no search tool having analytic capabilities and a facility for data transformation within the search results. Prior art search systems fail to make analysis, manipulation and storing of search results meaningful.
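By way of illustration only, the three table operations mentioned above (minimum price within each category, a pivot of price by brand and category, and a sum of quantities) can be sketched over a small set of tabular hits. This is a minimal sketch; the rows, the column names, and the function names are hypothetical and are not drawn from any existing search tool.

```python
from collections import defaultdict

# Hypothetical search hits from a retail database, already in tabular form.
hits = [
    {"brand": "Acme", "category": "shoes", "price": 40.0, "quantity": 3},
    {"brand": "Acme", "category": "shirts", "price": 25.0, "quantity": 10},
    {"brand": "Zenith", "category": "shoes", "price": 55.0, "quantity": 2},
    {"brand": "Zenith", "category": "shirts", "price": 20.0, "quantity": 7},
    {"brand": "Acme", "category": "shoes", "price": 35.0, "quantity": 5},
]

def min_price_by_category(rows):
    """Lowest price found in each category."""
    best = {}
    for row in rows:
        category = row["category"]
        if category not in best or row["price"] < best[category]:
            best[category] = row["price"]
    return best

def pivot_min_price(rows):
    """Pivot: one entry per brand, one column per category, cell = lowest price."""
    table = defaultdict(dict)
    for row in rows:
        cells = table[row["brand"]]
        category = row["category"]
        cells[category] = min(row["price"], cells.get(category, row["price"]))
    return {brand: dict(cells) for brand, cells in table.items()}

def total_quantity(rows):
    """Sum of the quantity column across all rows."""
    return sum(row["quantity"] for row in rows)

print(min_price_by_category(hits))  # {'shoes': 35.0, 'shirts': 20.0}
print(pivot_min_price(hits))
print(total_quantity(hits))         # 27
```

Each function operates only on the rows already returned, so in this sketch the same analysis could be performed on stored results without a connection to the server.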