This invention relates to the field of data mining, specifically the location of related items in a geometric space for the exposition of the relationships of database items in large databases.
Large databases are becoming commonplace. The INSPEC database on Dialog had almost 4 million records in one file as of November 1990, where records represent scientific articles, books, and papers. The Institute for Scientific Information maintains a database called SCISEARCH, where one file had almost 11 million records as of July 1991. The American Business Directory database had information on over 10 million companies in May 1995. The CLAIMS/U.S. Patents database provides access to over 2 million patents. Countless other private databases exist, storing information such as employee and student records, addresses, customer profiles, and household buying habits.
Even information not stored as a conventional database can have database-like qualities. For example, individual Web pages contain some information; the aggregation of many Web pages contains a large amount of information. Relationships between Web pages are not explicitly stored as in a database, but can be inferred from referencing among the pages. Other examples include financial transactions, not stored as a conventional database but nonetheless representing a large volume of information, with each item of information potentially related to many other items.
Searching and retrieval systems operating with large databases generally allow retrieval of individual items, or retrieval of sets of items related in some way. For example, some databases allow retrieval of a specific item, while others allow searching for items containing certain keywords or topical markers. The large size of the database makes it more likely that a user can successfully find and retrieve the desired items.
The large size of the database, however, also makes it less likely that the user can comprehend the relationships among the many items in the database. The user can find individual items, and can find groups of related items. The user cannot, however, access the structure of the relationships among the items.
The structure of the relationships among items can convey much useful information. For example, a lawyer can use Shepard's to find a linear chain of related cases, but cannot see beyond that chain to deduce how the cases relate to other such chains. Other lines of reasoning and rules of law in different areas can grow from a line of cases, or a line of cases can itself grow out of several preceding themes in the law. While the relationships among cases are usually explicit through case citations, the structure of the relationships cannot be understood using existing search and retrieval tools.
As another example, scientific papers represent the state of research, and often have explicit relationships to other papers through references. Bibliographies and citation lists can help illuminate relationships in a specific area, but are not sufficient to illuminate the ways fields of research grow together, build on each other, or spawn new fields over time.
For databases containing only a few items, a user can read items, analyze relationships, and draw diagrams to deduce the relationships. Databases with more than a few items have much more information embedded in the relationships, but the relationships are too many and too complex for a user to analyze or comprehend from existing search and retrieval tools. Consequently, there is a need for a process that allows a user to comprehend the structure of relationships among items in databases having many items.
The present invention provides a method for locating related items in a geometric space. The method comprises locating the items in a geometric space so that the items' relative locations in the geometric space correspond to the relationships among the items. The translation of arbitrary relationships to geometric relationships can foster more efficient communication of the relationships among the items. The method is especially beneficial for communicating databases with many items, and with non-regular relationship patterns. Examples of such databases include databases containing items such as scientific papers or patents, related by citations or keywords, and databases dominated by relationships rather than by content, such as calling or billing records. A computer system adapted for practice of the present invention can include a processor, a storage subsystem, a display device, and computer software to direct the location and display of the entities.
The method makes use of numeric values as a measure of similarity between each pairing of items. The items are given initial coordinates in the space. An energy is then determined for each item from the item's distance and similarity to other items, and from the density of items assigned coordinates near the item. The distance and similarity component can act to draw items with high similarities close together, while the density component can act to force all items apart. If a terminal condition is not yet reached, then new coordinates can be determined for one or more items, and the energy determination repeated. The iteration can terminate, for example, when the total energy reaches a threshold, when each item's energy is below a threshold, or after a certain amount of time or number of iterations.
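The iteration summarized above can be illustrated with a minimal sketch. The particular energy form (similarity times squared distance, plus a linear density penalty within a fixed radius), the random trial-move update, and every name and parameter below (item_energy, total_energy, layout, radius, step) are assumptions made only for illustration; they are not taken from this description, which leaves the energy function and the coordinate update open.

```python
import math
import random


def item_energy(i, coords, sim, radius=1.0):
    """Energy of item i: an attraction term plus a local density term.

    Assumed form (illustrative only): similarity * squared distance pulls
    similar items together; a penalty for neighbours closer than `radius`
    pushes all items apart.
    """
    xi, yi = coords[i]
    energy = 0.0
    for j, (xj, yj) in enumerate(coords):
        if j == i:
            continue
        d2 = (xi - xj) ** 2 + (yi - yj) ** 2
        energy += sim[i][j] * d2                        # distance/similarity component
        if d2 < radius * radius:
            energy += 1.0 - math.sqrt(d2) / radius      # density component
    return energy


def total_energy(coords, sim, radius=1.0):
    """Sum of the per-item energies."""
    return sum(item_energy(i, coords, sim, radius) for i in range(len(coords)))


def layout(sim, coords, iterations=200, step=0.5, radius=1.0, seed=0):
    """Repeatedly propose new coordinates; keep a move only if it lowers
    the moved item's energy. Terminates after a fixed number of iterations,
    one of the terminal conditions mentioned in the text."""
    rng = random.Random(seed)
    coords = list(coords)  # work on a copy; the caller's list is untouched
    for _ in range(iterations):
        for i in range(len(coords)):
            old = coords[i]
            e_old = item_energy(i, coords, sim, radius)
            coords[i] = (old[0] + rng.uniform(-step, step),
                         old[1] + rng.uniform(-step, step))
            if item_energy(i, coords, sim, radius) >= e_old:
                coords[i] = old  # reject moves that do not lower the energy
        step *= 0.99  # gradually shrink the trial moves
    return coords
```

For example, calling layout with a symmetric similarity matrix and a list of random (x, y) starting coordinates returns a new list of coordinates. Because a move is kept only when it lowers the moved item's energy, and the pairwise terms in this sketch are symmetric, the total energy does not increase from one iteration to the next.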
Advantages and novel features will become apparent to those skilled in the art upon examination of the following description or may be learned by practice of the invention. The objects and advantages of the invention may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.