This invention relates to automated search of heterogeneous data sources for desired information, and to the management of the information retrieved during the search.
Data and information are different, but inseparably intertwined. To understand a difference between data and information for the purposes of the following discussion, a simple example will be provided.
The financial page of a newspaper may be thought of as providing data. In particular, a newspaper's financial page may provide, for each of a plurality of stocks in a given market, a closing price and an indicator of the difference between the current closing price and an immediately preceding closing price. To a person attempting to discover the closing price of a particular one of the plurality of stocks, the financial page of the newspaper provides information. To a person attempting to discover whether the given market, as a whole, advanced or declined, the financial page of the newspaper provides data which might be aggregated and analyzed to find the direction of the market.
Taking the example further, the information as to whether the market advanced or declined as a whole may be thought of as data by a person attempting to determine whether there is a cycle of market advances or declines over a period of years. In turn, whether or not a given market advances or declines in a cyclic manner over a period of years may be merely data to yet another person attempting to discover whether there is any kind of link between cyclic markets and something more abstract, such as the number of representatives of a conservative political party elected subsequent to such cycles.
Data, in the abstract, may thus generally be thought of as being at a lower level than information. Whether an item more correctly qualifies as data or information is naturally dependent upon the point of view and on the discovery needs of a person. Because of the dynamic nature of the problems that confront people each day, the terms data and information are often interchangeably used.
A database may be understood to be a collection of data and information stored on a computer system. For the purposes of this discussion, such a general definition will generally be appropriate. Database management systems provide a level of independence between the raw data and programs that might be used to retrieve the data. Data can be retrieved from databases managed by database management systems by issuing appropriate function or procedure calls containing terms in a query definition language. The database management system responds to the terms in a query definition language, such as SQL, by retrieving and returning the stored data that meets the parameters contained in the query.
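By way of illustration only, the retrieval of stored data through a procedure call containing SQL terms might be sketched as follows. The sketch uses Python's standard sqlite3 module; the table name quotes, its columns, and the sample rows are assumptions made for the example and are not taken from any particular system.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database holding a few hypothetical closing prices."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE quotes (symbol TEXT, close REAL)")
    conn.executemany(
        "INSERT INTO quotes VALUES (?, ?)",
        [("AAA", 10.5), ("BBB", 42.0), ("CCC", 7.25)],
    )
    return conn


def closing_prices_above(conn, threshold):
    """Return (symbol, close) rows whose closing price exceeds the threshold.

    The query text carries the parameters; the database management system
    decides how the matching rows are actually located.
    """
    cur = conn.execute(
        "SELECT symbol, close FROM quotes WHERE close > ? ORDER BY symbol",
        (threshold,),
    )
    return cur.fetchall()
```

The calling program states only what it wants (closing prices above a threshold) and remains independent of how the rows are stored.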
Depending on the purpose of the system, different databases may have different levels of usefulness for those seeking to gain information from them. To explain, some database management systems are used to coordinate and to control information for the purpose of supporting online transaction processing (OLTP).
OLTP applications are characterized by many users creating, updating, or retrieving individual records, and so OLTP databases are optimized for transaction updating. The data is stored in a manner that is very useful for handling transactions, but in a form that is much less useful for supporting analysis of the data.
One way to make this data more useful for high-level analysis is to reformat and to aggregate the data in a database specifically arranged for online analysis processing (OLAP). OLAP applications may be used by analysts and managers seeking a higher-level aggregated view of the data, such as total sales by product line, by region, and so forth. An OLAP database may be updated in batch, from multiple sources, and can provide a powerful analytical back-end to multiple user applications. OLAP databases are thus optimized for analysis.
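The difference between individual transaction records and an aggregated view can be sketched briefly. The following is a minimal illustration, assuming hypothetical records with product_line, region, and amount fields; it is not a description of any actual OLAP product.

```python
from collections import defaultdict


def total_sales_by(records, *keys):
    """Aggregate the 'amount' field of transaction records over the given keys.

    records: iterable of dicts (hypothetical OLTP-style rows).
    keys: field names to group by, e.g. "product_line", "region".
    Returns a dict mapping a tuple of key values to the summed amount.
    """
    totals = defaultdict(float)
    for rec in records:
        totals[tuple(rec[k] for k in keys)] += rec["amount"]
    return dict(totals)
```

For example, calling total_sales_by(records, "product_line", "region") yields one total per combination of product line and region, which is the kind of higher-level view an analyst would consult.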
It should be apparent, however, that such databases are in a highly structured format, and that intimate knowledge of the structured format is required to access the data and perform the appropriate analysis.
Not all data is managed by database management systems, and not all data in databases is highly structured. Some data in databases is stored in association with one or more indices. The data in the database is retrieved with reference to the one or more indices.
Oftentimes, the term "database" evokes a sense of structure in the data. However, for the purposes of this discussion, not all databases are structured databases. In particular, a collection of text documents may be thought of as being a database. Much of the institutional knowledge of an organization may be contained in the documents of the organization and not in the structured databases managed by the organization's database management systems. All of this organizational knowledge stored in text documents, while formerly unavailable for search, now is becoming useful as data with the advent of appropriate searching and querying tools.
For example, an organization may have a document management system that coordinates workflow with respect to documents, but also provides an index that can be used to find and retrieve documents across the organization. Likewise, an organization may use a text search processor to access a central database of text documents to find certain documents meeting the parameters of a text search query.
Web pages on the World Wide Web are typically text documents. A collection of such text documents may be thought of as an unstructured database. Thus, for the remainder of this discussion, a structured database will be understood to be one that has a definite structure and, typically, is controlled by a database management system. Likewise, an unstructured database will be understood to be one that is not controlled by a database management system and, typically, is a collection of text documents.
One of the biggest reasons for the importance of the World Wide Web is the advent of tools that make it possible to find and access the text documents that make up the Web pages of the World Wide Web. A brief look at some of the tools available to find information on the World Wide Web will now be undertaken.
A search engine may be thought of as a search database coupled with the tools to generate and search the search database. A search engine may be owned by a Web location service. A Web location service may be thought of as a Web site or a company that provides a way to find and locate Web pages having data that meets the information needs or discovery needs of a user.
Yahoo! is an example of a Web location service. Yahoo! attempts to provide a complete front end for the Internet by providing news, libraries, dictionaries, and other sources in addition to a search engine. Yahoo! emphasizes cataloging, that is, a classification of identified pages into a hierarchical structure. Alta Vista and Excite are Web location services that emphasize providing the most comprehensive search database.
Some Web location services use the search engine technology of other companies, such as Inktomi, to provide a useful location service for Web pages and files while concentrating on providing other, additional services.
Every search engine may be thought of as providing three important elements. These elements include information discovery and search database components, a user search component, and a presentation component.
In particular, the information discovery and search database components of a search engine may obtain information by accepting information sent by persons hoping to gain greater exposure for their Web pages or by gathering the information using software programs designed to locate Web pages, and to store information about the pages and their location. Such software programs may be called Web crawlers, spiders, or robots. For convenience, such software programs may be herein referred to as robots, collectively.
When a robot identifies a new page, the robot may simply store the title of the page and the uniform resource locator (URL). Web pages may include hyper-text markup language (HTML) meta-tags relating to content or keywords, and a robot may also store such information. An additional option is to also store the text of the Web page in part or in its entirety.
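The kind of record a robot might store for a page can be sketched with Python's standard html.parser module. The class name PageSummary, the function summarize, and the choice to keep only the keywords and description meta-tags are assumptions made for this illustration.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collect the title and selected meta-tags of one HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") in ("keywords", "description"):
            # Keep author-supplied keywords or description, if present.
            self.meta[attrs["name"]] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize(url, html):
    """Return the record a hypothetical robot would store for this page."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return {"url": url, "title": parser.title.strip(), "meta": parser.meta}
```

A robot that also stores page text would extend the record with the body content; that option is omitted here for brevity.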
In any event, whatever the robot causes to be stored in the search database is indexed for quick retrieval.
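One common way to index stored page information for quick retrieval is an inverted index, which maps each term to the pages containing it. The following minimal sketch is offered only as an illustration; the text above does not say which indexing scheme any particular search engine uses, and the whitespace tokenization is a simplifying assumption.

```python
from collections import defaultdict


def build_index(pages):
    """Build an inverted index from stored page records.

    pages: iterable of (url, title, text) tuples.
    Returns a mapping from lower-cased term to the set of URLs containing it.
    """
    index = defaultdict(set)
    for url, title, text in pages:
        for term in (title + " " + text).lower().split():
            index[term].add(url)
    return index


def lookup(index, term):
    """Return the URLs containing the term, sorted for a stable order."""
    return sorted(index.get(term.lower(), set()))
```

A lookup then costs one dictionary access per search term rather than a scan of every stored page.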
The user search component of a search engine is the component with which the user enters the parameters of a query. It is conventional for a user to have the ability to type in a few relevant words into a search form. Some user search components of search engines even permit the user to specify whether the words must be in the title of a page, in the URL, in the meta-tags, or anywhere. Such so-called advanced search options also include Boolean operations. Furthermore, search engines typically attempt to take into account approximate spellings, plural variations, and truncation.
The presentation component of a search engine presents the results of a user query to the user. Given the immense size of the World Wide Web, it is possible for a given query to generate millions of results indicating millions of pages that may have potentially relevant data. Most engines find more sites from a typical search query than could ever be processed by a person. Search engines may assign each document "hit" some measure of the relevance of the page to the search query. Such relevance scores may reflect the number of times a search term appears in a page, with adjustments being possible when the search term appears in the title, in the meta-tags, in the beginning of the page, and the like. For example, a document having all of a plurality of search terms might be given a relevance score weighted differently than a document containing fewer than all of the search terms, although with greater frequency.
Some engines allow the user to alter the relevance score by giving different weights to each search word. The weights do not affect the retrieval of data, but do affect the relevance score and, ultimately, the ordering of the results (i.e., the ordering of the "hit list").
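A relevance score of the general kind described, based on term counts with an adjustment for title matches and user-supplied weights, might be sketched as follows. The formula, the title_bonus value, and the page dictionary layout are assumptions chosen for the illustration; actual engines use their own scoring schemes.

```python
def relevance(page, weights, title_bonus=2.0):
    """Score one page against weighted search terms.

    page: dict with "title" and "text" strings (hypothetical layout).
    weights: dict mapping each search term to a user-chosen weight.
    Each body occurrence counts once; each title occurrence counts
    title_bonus times. The weights scale the score only; they do not
    change which pages are retrieved.
    """
    title_words = page["title"].lower().split()
    body_words = page["text"].lower().split()
    score = 0.0
    for term, weight in weights.items():
        t = term.lower()
        score += weight * (body_words.count(t) + title_bonus * title_words.count(t))
    return score
```

Raising the weight of one search word raises the score of every page containing that word, which in turn changes the ordering of the hit list.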
Where relevance scores are substantially the same for a plurality of results, the presentation component of a search engine typically orders those results alphabetically. Along with the URL of the page, the presentation component may provide also a summary of each page. Such a summary may be composed of the title of a document and some text from the beginning of the document, and/or an optional author-specified summary given in a meta-tag.
The results of a web search are typically returned in a list of documents or pages in an order based on the relevance score, together with identifiers relating to the pages, such as a page title, URL, or summary.
Web searching, then, may be thought of as the providing of parameters to a search engine using an interface of the user search component; causing the search engine to conduct a search of one or more indices available to the search engine to determine pages that qualify as matches with or as being relevant to the parameters (i.e., "hits"); evaluating the relevance of the hits to provide a relevance score; and returning to the user a list of results ordered first according to the computed relevance score and then according to alphabetic precedence.
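The two-level ordering of the result list can be expressed as a single sort with a compound key. In this minimal sketch, hits are assumed to be (title, score) pairs, and ties are taken to mean exactly equal scores, which simplifies the "substantially the same" condition described above.

```python
def order_hits(hits):
    """Order hits by descending relevance score, then alphabetically by title.

    hits: list of (title, score) tuples.
    """
    return sorted(hits, key=lambda h: (-h[1], h[0].lower()))
```

The list remains ordered along a single dimension, the relevance score, with the alphabetic rule serving only to break ties.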
The list of results is not particularly helpful in all situations, because the list is typically ordered in only one degree of freedom, namely, the relevance score. Nevertheless, this is the manner in which nearly all search engines present the results of a user query.
Some search engines provide a presentation that is slightly more useful in one sense, and less useful in another sense. In particular, some search engines provide the user the option to group the results by the Web site. Thus, when several pages of one site are hits for a given user query, these several pages are grouped together under a common entry that indicates the identity of the Web site.
Overall, the groups are presented in an order by the relevance of the group, typically determined by the number of pages grouped together to form the group.
The grouping of results by Web site is useful in that it gives the user a better intuitive feel for the overall content of the particular Web site, and provides elementary organization of the data. The grouping is less useful to the extent that it tends to hide results, even particularly relevant results, for groups that are not deemed as relevant as groups having a greater number of pages having hits. To put it another way, if a query resulted in only one highly relevant page of a particular Web site being hit, and resulted in a plurality of marginally relevant pages of another Web site being hit, the grouping approach described above would result in the Web site of marginally relevant pages being presented first, and the Web site with the only one highly relevant page possibly being presented quite far down on the list.
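The drawback just described can be reproduced with a short sketch. Here hits are assumed to be (url, score) pairs, the Web site is taken to be the host portion of the URL, and groups are ordered by the number of pages they contain; these are illustrative assumptions, not the behavior of any named search engine.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_site(hits):
    """Group hits by Web site and order the groups by number of pages.

    hits: list of (url, score) tuples.
    Returns a list of (site, [(url, score), ...]) with larger groups first;
    groups of equal size are ordered by site name.
    """
    groups = defaultdict(list)
    for url, score in hits:
        groups[urlparse(url).netloc].append((url, score))
    return sorted(groups.items(), key=lambda g: (-len(g[1]), g[0]))
```

With one page scoring 0.95 on one site and three pages scoring about 0.2 on another, the three-page site is presented first even though its pages are only marginally relevant.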
The simple list format is of limited use in part because, to the user, it is in a relevance order that cannot be readily understood or appreciated. Furthermore, the list is typically nothing more than a huge amount of data presented serially. The human mind is not inherently capable of coping with so much information.
The existence of substantial numbers of large databases in organizations, and the ability to query large numbers of Web pages on the Internet, have helped bring about the concept of data mining. Data mining, in general, may be understood to mean the extraction of new information from existing data, and also may be understood to include the use of query and analysis tools such as search engines and the like.
Data mining extracts new information from data. Data mining tools are seen by their proponents as doing more than query and analysis tools, more than OLAP tools, and more than statistical techniques like variance analysis. Data mining tools are thought to be useful for helping provide answers to certain kinds of questions.
Whereas simpler query and analysis tools are useful for questions such as, "Is there a cycle of stock prices in Market X?" a data mining tool is what might be used to answer even more abstract relationship questions such as, "What are the factors that determine the period of the stock price cycle of Market X?"
Traditionally, answers to the more complex relationship oriented questions are discovered by a human analyst who starts with a question, assumption, or hypothesis, and attempts to determine whether the data fits a model that embodies the analyst's theory. By testing the model, the analyst eventually and iteratively modifies the model to fit the data and, from the completed model, may arrive at a conclusion. Data mining tools help this process along by facilitating the finding of an appropriate model.
Data mining tools may be said to create analytical models that are predictive and/or descriptive. Predictive models predict future values given a past history, and descriptive models focus more on information about the relationships in the underlying data. Models often tend to be both predictive and descriptive.
Data mining may be thought of as part of a larger iterative process which may be called knowledge discovery. Knowledge discovery may include steps of defining a problem; collecting and ordering data; data mining the data to select a model; testing the model; using the model for making decisions; and monitoring the data and model to detect changes over time.
Although some data mining tools exist to support the data mining of the collected and ordered data, there exists a dearth of tools that are helpful in collecting and ordering data. The best tools available require the data to be highly structured and to be well-behaved. Such tools are closely tied to the highly structured data, and cannot be employed outside their particular tailored environment, that is, where data is not structured and not well-behaved (such as the vast database of Web pages in the World Wide Web).
The tools that are appropriate for handling the unstructured and heterogeneous data are quite primitive, and are insufficient in their retrieval and presentation of results. There is a need for a better tool to collect and order data that is from heterogeneous sources, and that includes unstructured or structured data. Such a tool would have potential for being useful not only for data mining, but also for simply helping persons to find, in the seemingly infinite data universe of the available databases, the information they need to satisfy an information requirement.
Inasmuch as the preferred embodiments of the invention, described below, are implemented in the computer arts, it will be helpful to set forth some background information and definitions before summarizing the invention.
One embodiment of this invention resides in a computer system. Here, the term "computer system" is to be understood to include at least a memory and a processor. In general, the memory will store, at one time or another, at least portions of an executable program code, and the processor will execute one or more of the instructions included in that executable program code. It will be appreciated that the term "executable program code" and the term "software" mean substantially the same thing for the purposes of this description. It is not necessary to the practice of this invention that the memory and the processor be physically located in the same place. That is to say, it is foreseen that the processor and the memory might be in different physical pieces of equipment or even in geographically distinct locations.
Another embodiment of the invention resides in a computer program product, as will now be explained.
On a practical level, the software that enables the computer system to perform the operations described further below in detail may be supplied on any one of a variety of media. Furthermore, the actual implementation of the approach and operations of the invention consists of statements written in a programming language. Such programming language statements, when executed by a computer, cause the computer to act in accordance with the particular content of the statements. Furthermore, the software that enables a computer system to act in accordance with the invention may be provided in any number of forms including, but not limited to, original source code, assembly code, object code, machine language, compressed or encrypted versions of the foregoing, and any and all equivalents.
One of skill in the art will appreciate that "media", or "computer-readable media", as used here, may include a diskette, a tape, a compact disc, an integrated circuit, a ROM, a CD, a cartridge, a remote transmission via a communications circuit, or any other similar medium useable by computers. For example, to supply software for enabling a computer system to operate in accordance with the invention, the supplier might provide a diskette or might transmit the software in some form via satellite transmission, via a direct telephone link, or via the Internet. Thus, the term "computer readable medium" is intended to include all of the foregoing and any other medium by which software may be provided to a computer.
Although the enabling software might be "written on" a diskette, "stored in" an integrated circuit, or "carried over" a communications circuit, it will be appreciated that, for the purposes of this application, the computer usable medium will be referred to as "bearing" the software. Thus, the term "bearing" is intended to encompass the above and all equivalent ways in which software is associated with a computer usable medium.
For the sake of simplicity, therefore, the term "program product" is thus used to refer to a computer useable medium, as defined above, which bears software in any form to enable a computer system to operate according to the above-identified invention.
Thus, the invention is also embodied in a program product bearing software which enables a computer to perform according to the invention.
Although it has been mentioned, above, that a computer program product includes the carrying of software over a communications mode (such as a download over the Internet), it can be useful also to look at such a situation as a particular kind of carrier wave. To be more particular, the invention resides, in one embodiment, also in a carrier wave that carries the software that enables a computer to perform according to the invention. In this sense, it may be said that the carrier wave includes certain code sections corresponding to the various steps involved in the execution of the invention.
It will be appreciated that a carrier wave includes signals or files downloaded not only over the Internet, but also over any network and over any communication medium.
The invention is also embodied in a user interface invocable by an application program. A user interface may be understood to mean any hardware, software, or combination of hardware and software that allows a user to interact with a computer system. For the purposes of this discussion, a user interface will be understood to include one or more user interface objects. User interface objects may include display regions, user activatable regions, and the like.
As is well understood, a display region is a region of a user interface which displays information to the user. A user activatable region is a region of a user interface, such as a button or a menu, which allows the user to take some action with respect to the user interface. It will be appreciated that, depending on the situation, a particular region of a user interface might be both a display region and a user activatable region.
A user interface may be invoked by an application program. When an application program invokes a user interface, it is typically for the purpose of interacting with a user. It is not necessary, however, for the purposes of this invention, that an actual user ever interact with the user interface. It is also not necessary, for the purposes of this invention, that the interaction with the user interface be performed by an actual user. That is to say, it is foreseen that the user interface may have interaction with another program, such as a program created using macro programming language statements that simulate the actions of a user with respect to the user interface.
An application program may be several separate programs, only one program, a module of a program, or even a particular task of a module.
An applications program may be written by an applications programmer. Applications programmers develop applications programs using any of a number of programming languages. During development and design of applications programs, applications programmers may adhere to a programming methodology. A programming methodology is a set of principles by which analysis is performed and by which design decisions are made. Programming methodologies may be referred to as programming paradigms. Examples of widely-known programming paradigms include the top-down, the data-driven, and the object oriented (OO) programming paradigms.
The OO paradigm is based on the object model. One of skill in the art readily understands the object model. For detailed information concerning the object model, a useful book, which herein is incorporated in its entirety by reference, is "Object-oriented Analysis and Design", by Grady Booch (Addison-Wesley Publishing Company).
Recently, object oriented analysis and design (OOAD) and object oriented programming (OOP) have been the focus of great attention. OOAD and OOP are thought to provide advantages with respect to abstraction, encapsulation, modularity, and hierarchy. Furthermore, OOAD is thought to provide for improved software reuse and better adaptability to change.
According to the object model, a software system is modeled as collections of cooperating objects. Individual objects are treated as instances of a particular class. Each class has a place within a hierarchy of classes.
An object is understood to have a unique identity, to have a state, and to exhibit behavior. The behavior of an object relates to the set of operations that may be performed by the object. Such operations are also known, interchangeably, as methods of the object or as member functions of the object.
Member functions of an object are invoked by passing the object an appropriate message.
An object may retain data of interest. Passing the object appropriate messages may invoke a member function of the object to manipulate the data. For example, an object presently might retain an image of the Washington Monument, and might have a member function for rotating an image. Under the object model, when an appropriate message, such as "rotate image 45 degrees", is passed to the object, the rotating member function is invoked and the image is rotated 45 degrees. The image, thus rotated, is retained in this state.
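The image example can be sketched in a few lines. In this minimal illustration the class name ImageObject is hypothetical, and the retained image is represented only by its name and current rotation angle rather than by actual pixel data.

```python
class ImageObject:
    """An object that retains an image and its rotation state."""

    def __init__(self, name, angle=0):
        self.name = name    # which image is retained
        self.angle = angle  # state: current rotation in degrees

    def rotate(self, degrees):
        """Member function invoked by a 'rotate image N degrees' message."""
        self.angle = (self.angle + degrees) % 360
        return self.angle
```

Calling rotate(45) on an object constructed for the Washington Monument image leaves the object in the rotated state, so a second call to rotate(45) results in a total rotation of 90 degrees.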
The invoking of member functions of objects to perform tasks is a central concept of the OO paradigm.
Objects can be related to each other. Two objects might have a client/supplier relationship. Such objects are said to be linked. Two objects might have a hierarchical relationship. For example, one object might represent a finger and another a hand. The hand object may thus be said to be higher in a hierarchy than the finger. Assuming the hand has more than one finger, there might be several finger objects that are so related with the hand object. Hierarchically related objects are said to be aggregated. In particular, the hand object and its finger objects may be referred to as an aggregate, or an aggregation. The finger objects may be referred to as being attributes, or members of the aggregation. The hand object, by virtue of its position at the "top" of the hierarchy in the aggregation, may be referred to as an aggregating object.
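The hand and finger aggregation might be sketched as follows. The class names Hand and Finger follow the example in the text; the constructor arguments and the finger_count member function are assumptions added for illustration.

```python
class Finger:
    """A member of the aggregation."""

    def __init__(self, name):
        self.name = name


class Hand:
    """The aggregating object: it holds its finger objects as attributes."""

    def __init__(self, finger_names):
        self.fingers = [Finger(n) for n in finger_names]

    def finger_count(self):
        return len(self.fingers)
```

The Hand object sits at the top of the hierarchy, and each Finger object is reachable only through the hand that aggregates it.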
An object cannot be considered without regard to its class. Every object, when constructed, receives its structure and behavior from its class. An object may be referred to as a class instance, or as an instance of a class. Classes, in the object model, may be hierarchically related. In particular, the relationship between two classes may be a subclass/superclass relationship. A subclass may inherit the structural and behavioral features of its superclass.
Thus, whenever an object is constructed, it receives important attributes from its class. If that class is a subclass of a particular superclass, the object may receive certain attributes from the superclass as well.
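The subclass/superclass relationship can be illustrated with a short sketch. The class names Document and WebPage and their attributes are hypothetical and were chosen only to fit the subject matter of this discussion.

```python
class Document:
    """Superclass: supplies structure (title) and behavior (describe)."""

    def __init__(self, title):
        self.title = title

    def describe(self):
        return f"{type(self).__name__}: {self.title}"


class WebPage(Document):
    """Subclass: inherits from Document and adds a URL attribute."""

    def __init__(self, title, url):
        super().__init__(title)
        self.url = url
```

A WebPage object, when constructed, receives the url attribute from its own class and the title attribute and describe member function from the superclass.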
Classes, on a practical level, may be supplied in class libraries on any one of a variety of media. Class libraries may be understood to be a kind of software. Thus, the class definitions contained in class libraries also are actually statements written in a programming language that, when executed by a computer, cause the computer to act in accordance with the particular content of the statements. Furthermore, a class library may be provided in any number of forms including, but not limited to, original source code, assembly code, object code, machine language, compressed or encrypted versions of the foregoing, and any and all computer readable equivalents.
One of skill in the art will therefore appreciate that a class library may be embodied in a computer program product as that term has already been defined, above.
The invention has been created with an object of compensating for the above-identified shortcomings and disadvantages of the prior art. This object of the invention includes, at least in part, providing a better tool to collect and order data that is from heterogeneous sources, and that includes unstructured or structured data.
More particularly, the invention resides, in one embodiment, in a method of performing a search using weighted search parameters. The search parameters are combined to produce a search, and the results are retrieved. The results are retrieved from structured and/or unstructured databases, and arranged into a hierarchy.
The hierarchy is a tree with a control node at the root, and ordered levels of nodes corresponding generally to the search parameters. The terminal leaf nodes correspond to actual pages, but the intermediate nodes correspond to only conceptual or meta relationships.
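One way such a hierarchy could be represented is sketched below. This is only an illustrative interpretation and not the claimed implementation: it assumes that the levels are ordered by descending parameter weight, that each intermediate node records whether a page contains the parameter for that level, and that nodes are plain dictionaries. All names are hypothetical.

```python
def build_hierarchy(pages, weighted_params):
    """Arrange retrieved pages into a tree rooted at a control node.

    pages: list of (url, text) tuples.
    weighted_params: dict mapping each search term to its weight.
    Levels follow the terms in order of descending weight. Intermediate
    nodes are conceptual ("term: yes" / "term: no"); only the nodes at
    the deepest level list actual pages.
    """
    ordered = sorted(weighted_params, key=lambda t: (-weighted_params[t], t))
    root = {"label": "control", "children": {}, "pages": []}
    for url, text in pages:
        words = text.lower().split()
        node = root
        for term in ordered:
            label = f"{term}: {'yes' if term.lower() in words else 'no'}"
            node = node["children"].setdefault(
                label, {"label": label, "children": {}, "pages": []}
            )
        node["pages"].append(url)
    return root
```

Under these assumptions, a page containing both "market" and "cycle" is filed under control, then "market: yes", then "cycle: yes", so that results sharing the same pattern of matches appear together rather than in one serial list.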
It will be appreciated that the invention also resides in a computer program product, a user interface, a computer system, and a computer data signal for implementing the method.