The present invention is related to the field of data handling and analysis using information handling systems. More specifically, the present invention involves methods and/or systems and/or devices that provide for advanced searching of a structured document set, including expressions for providing statistical analysis of document set search results.
A number of techniques have been previously discussed for searching and analyzing data sets stored in an information processing system. The reader is assumed to have knowledge of SQL and other standard database and analysis packages as well as knowledge of search techniques and methods commonly used on the Internet.
The following patents and publications may be related to the invention or provide background information. Listing of these patents here should not be taken to indicate that a formal search has been completed or that any of these patents constitute prior art.
U.S. Pat. No. 5,924,090 discusses a method and search apparatus for searching a database of records that organizes results of the search into a set of most relevant categories. In response to a search instruction from the user, the search apparatus searches the database, which can include Internet records and other records, to generate a search result list corresponding to a selected set of the records. The search apparatus processes the search result list to dynamically create a set of search result categories with each category associated with a subset of the records in the result list having common characteristics. Categories can be displayed as a plurality of folders. Each record within the database is classified according to various meta-data attributes (e.g., subject, type, source, and language characteristics). Substantially all of the records are automatically classified by a classification system into the proper categories. The classification system automatically determines the various meta-data attributes when such attributes are not available from the source. The technique discussed is directed to using category analysis to further narrow a set of returned documents. The technique is provided for public use at www.northernlight.com. For example, the search for “pamela anderson lee” at that site returns 19,043 total items and indicates 14 category folders. Each of these folders, when selected, displays a set of documents and a set of subfolders. For example, selecting the “Actors and Actresses” folder returns 2,587 documents and indicates 13 additional subfolders for those documents. The subfolder “Lee, Tommy” returns 151 items, and indicates 10 further subfolders. The sub-sub folder “rockcool.com” returns three total items.
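By way of illustration only, the general idea of grouping a search result list into category folders by a shared meta-data attribute can be sketched as follows. This is a minimal sketch and is not the method of the cited patent or website; the function name `categorize`, the record fields, and the sample records are all hypothetical.

```python
from collections import defaultdict

def categorize(results, attribute):
    """Group result records into folders keyed by one meta-data attribute.

    Hypothetical helper for illustration; records lacking the attribute
    are placed in an "Unclassified" folder.
    """
    folders = defaultdict(list)
    for record in results:
        folders[record.get(attribute, "Unclassified")].append(record)
    return dict(folders)

# Hypothetical search result list (invented sample data).
results = [
    {"title": "Doc A", "subject": "Actors and Actresses", "source": "site-one"},
    {"title": "Doc B", "subject": "Actors and Actresses", "source": "site-two"},
    {"title": "Doc C", "subject": "Television", "source": "site-one"},
    {"title": "Doc D", "source": "site-two"},  # no subject attribute available
]

folders = categorize(results, "subject")

# Each folder reports how many documents it holds, much like a folder listing.
folder_counts = {name: len(docs) for name, docs in folders.items()}

# A selected folder can be categorized again to produce subfolders.
subfolders = categorize(folders["Actors and Actresses"], "source")
```

In this sketch, repeated application of the same grouping step to a selected folder yields the nested folder and subfolder arrangement described above.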
Neither the patent nor the website discusses or provides techniques for expressing search strategies using category analysis, or methods or techniques that allow other types of analysis to be performed on returned documents. The references also do not discuss expressing returned documents from an analysis as anything other than category folders and associated documents.
At the present time, querying and data mining/data analysis are generally considered two different fields with two different audiences. Querying is widely used by both data handling professionals and general computer users. Many websites (such as Altavista or Ebay) provide query ability using operators to all users, allowing users to specify document subsets using AND, OR, NOT or similar expressions. However, it has become an increasingly common scenario for users to receive an overwhelming list of results with no additional guidance as to how to further understand, prioritize or explore said results.
Data analysis, in contrast, is generally considered the province of data handling professionals. Analysis packages often require specialized training and use commands and syntax that are specific to a particular package. Data Mining/Data Analysis is generally considered a separate and specialized function apart from accessing data using queries.
The present invention, in particular embodiments, involves methods and/or systems that provide for combinations of traditional search capabilities with generalizable analytical functions. The combination of these two features into a single technology provides particular advantages when dealing with documents or structured documents (such as electronic catalogs, XML documents, text documents, HTML documents, etc.). In various embodiments, the invention includes powerful tools for expressing search/analysis requests, methods and systems for evaluating such expressions, and powerful ways to express results of such query analysis. (In this discussion, documents can be understood generally to refer to any data item or object that can be specified by a query, including such things as files, records, objects, etc. One type of document is a structured document.)
In particular embodiments, the present invention involves a proprietary software system that combines the functionality of a search engine and a data-mining/data-analysis engine. Search engines have traditionally focused on retrieving a list of documents, and data mining engines are generally designed to find statistical patterns amongst sets of data. By integrating these two technologies in a general way allowing for various analyses, in one aspect, the present invention can perform statistical analysis on collections of documents retrieved from a search. The results from these analyses can either be presented to an end user or used to search for another set of documents. In particular embodiments, the invention further allows for the construction of complex loops of search and analysis, creating a powerful new tool for managing information.
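By way of illustration only, one pass through such a loop of search and analysis can be sketched as follows. This is a minimal sketch under stated assumptions and is not the invention's expression syntax or implementation; the catalog data and the helper names `search`, `field_counts`, and `search_by_field` are hypothetical.

```python
from collections import Counter

# Hypothetical structured document set (invented catalog records).
CATALOG = [
    {"id": 1, "text": "red cotton shirt", "brand": "Acme"},
    {"id": 2, "text": "blue cotton shirt", "brand": "Acme"},
    {"id": 3, "text": "red cotton sweater", "brand": "Bolt"},
    {"id": 4, "text": "green cotton socks", "brand": "Acme"},
    {"id": 5, "text": "red silk scarf", "brand": "Corex"},
    {"id": 6, "text": "black leather belt", "brand": "Acme"},
]

def search(docs, term):
    """Search step: return documents whose text contains the term as a word."""
    return [d for d in docs if term in d["text"].split()]

def field_counts(docs, field):
    """Analysis step: count how often each value of a field occurs."""
    return Counter(d[field] for d in docs)

def search_by_field(docs, field, value):
    """Second search step: select documents by an exact field value."""
    return [d for d in docs if d[field] == value]

# 1. Search: retrieve a collection of documents.
cotton_hits = search(CATALOG, "cotton")

# 2. Analyze: compute a simple statistic over the retrieved collection.
brand_stats = field_counts(cotton_hits, "brand")

# 3. Feed the analysis result into another search.
top_brand, _ = brand_stats.most_common(1)[0]
related = search_by_field(CATALOG, "brand", top_brand)
```

Here the statistic in step 2 could be presented directly to an end user, or, as in step 3, used to retrieve a further document set that the original query term alone would not have returned.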
The following discussion describes the functionality of the present invention with reference to specific embodiments and illustrates its use through a set of examples. However, using the teachings provided herein, it will be understood by those of skill in the art, that the methods and apparatus of the present invention could be advantageously used in other situations that call for utilizing data that can be represented in an information processing system. The invention will be better understood with reference to the following drawings and detailed descriptions. In some of the drawings and detailed descriptions below, the present invention is described in terms of specific independent embodiments, including embodiments related to accessing a database of structured documents. This should not be taken to limit the invention, which it will be understood from the teachings provided herein has other applications.
Furthermore, it is well known in the art that logic systems can include a variety of different components and different functions in a modular fashion. Different embodiments of the invention may include different combinations of actions or elements. Furthermore, elements or actions that may be described below as being sub-elements of other elements may be differently grouped in various specific embodiments. It will be clear from the teachings herein to those of skill in the art that in specific embodiments, some action steps may be performed in a different order from the examples presented herein. For purposes of clarity, the invention is described in terms of systems that include many different innovative components and innovative combinations of components. No inference should be taken to limit the invention to combinations containing all of the innovative components listed in any illustrative embodiment. All publications, patents, and patent applications cited herein are hereby incorporated by reference in their entirety for all purposes.