“Information age” and “knowledge economy” are just two of the terms commonly used to describe the explosion of digital information that characterizes our era. Whatever you call it, there is no question that the volume of information being created is growing at unprecedented rates. Numerous attempts have been made to quantify the rate of new knowledge development, and they have produced various estimates of its exponential growth. A few examples of the kind of statistics often cited are:
Total human knowledge generally doubles every 5-10 years
Scientific knowledge generally doubles every 3-5 years
Medical knowledge generally doubles every 2-8 years
Number of US patents issued has about doubled in the last 7 years
Approximately 1.5 million pages added to the web each day
Worldwide production of original content stored digitally in 1999 would take about 635 thousand to 2.1 million terabytes to store
Regardless of the reliability of these estimates, they all point to an undeniable explosion in new information. Computer technology has made it easier to create and store new information. Both the number and size of the databases used to store this information are growing exponentially.
Despite the rapid growth of available information, human mental capabilities to assimilate and comprehend information have not significantly improved. The explosion of available information and our inability to assimilate it lead to information overload. The vast stores of information make it increasingly difficult to find the right information and even more difficult to make sense of the vast amount of new knowledge that is available.
Workers in the knowledge economy operate in an environment in which they are awash in information but are unable to distill insights. These workers often need to find and understand information related to a specific topic or area of interest so that they can improve their performance and/or decision-making. However, despite the availability of information that could inform and improve their decision-making, there is no practical way to find or assimilate it.
Enormous investments by numerous companies have been made to help information workers find the information “needle” they are seeking in the vast “haystack” of data in which they are searching. The dominant paradigm for information retrieval can be referred to as “Search and Sift”. The “Search and Sift” method invariably begins with a Boolean search that returns a large number of matching search results. The searcher then sifts through the results to find the information they are seeking. Internet users and users of other large databases will be very familiar with this method.
The majority of the investment in the field of information retrieval has been focused on improving the “Search and Sift” process. Examples of improvements include:
Query refinement: Query refinement attempts to determine the intent behind the searcher's query and refine the query in order to capture more of the documents that are relevant to the search or to exclude more irrelevant documents from the result set. An example of query refinement is “synonym expansion”, in which the query terms are augmented to include synonyms of the search terms in the hope of capturing more relevant documents.
Result ranking: A second means of improving the “search and sift” method is result ranking. Result ranking attempts to order the search results based on their relevance to the searcher's intent. Relevance has been estimated by various means, including frequency of use of search terms, location of search terms within the document, and perceived “importance/usefulness” of the documents in the result set. Perhaps the best example of result ranking is Google's PageRank metric, which is based on the number of other web pages that link to the search result page.
Result filtering: A final example of means to improve the “search and sift” method is result filtering. Result filtering attempts to classify the documents in the result set based on some classification scheme. The hope is that this will allow the searcher to narrow down his/her “sifting” to a subset of the result set that is most closely related to the area of interest. Examples of result filtering include Northern Light's “results folders” (see, e.g., FIG. 1), which are based on a fixed taxonomy of document classifications; Vivisimo's document clustering tool, which classifies documents into a hierarchical tree structure (see, e.g., FIG. 2) based on the semantic content of the documents; and Grokker, which classifies documents into a dynamic hierarchical structure similar to Vivisimo, but also provides a visual display of the relative size of each classification using its “bubble display” (see, e.g., FIG. 3).
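The first two improvements above, synonym expansion and ranking by search-term frequency, can be illustrated with a minimal Python sketch. The synonym table, the document texts, and the function names are hypothetical and chosen only for illustration; no particular search engine is claimed to work this way.

```python
from collections import Counter

# Hypothetical synonym table; a real system would draw on a thesaurus.
SYNONYMS = {
    "car": {"automobile", "vehicle"},
    "fast": {"quick", "rapid"},
}

def expand_query(terms):
    """Synonym expansion: add the known synonyms of each query term."""
    expanded = set()
    for term in terms:
        term = term.lower()
        expanded.add(term)
        expanded |= SYNONYMS.get(term, set())
    return expanded

def rank_results(documents, terms):
    """Rank documents by how often the expanded query terms occur in them."""
    query = expand_query(terms)
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())
        score = sum(counts[t] for t in query)
        if score > 0:  # documents with no matching term are not results
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

# Hypothetical document set.
documents = {
    "d1": "a fast car is a quick automobile",
    "d2": "the vehicle was slow",
    "d3": "gardening tips for spring",
}
ranked = rank_results(documents, ["car", "fast"])
```

Without the expansion step, document "d2" would not match the query at all, because it mentions only "vehicle".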
All of these methods are useful improvements on the “search and sift” method; however, they all presume a specific type of information need, namely that the searcher is looking for a specific PIECE of information, and that the information being sought can be found WITHIN the documents in the result set. This kind of information retrieval is aimed at finding answers to questions such as:
Who killed Bobby Kennedy?
What is the world's second tallest mountain?
What is the weather forecast for Palo Alto, Calif. tomorrow?
What is IBM's current stock price?
While the embodiments described herein represent further improvement on the “search and sift” method, their primary contributions are aimed at meeting a different kind of information need. The primary purpose of these embodiments is to assist information users in making sense of search results or large document sets by providing a means for assimilating the patterns of information AMONG the documents in the set. This kind of information is referred to herein as “metadata” because it represents higher level information than is contained in any particular document or record in the database or search result. This kind of information retrieval is aimed at answering questions such as:
How many documents are related to my area of interest, and how quickly is this number growing?
Who are the main authors of information about this topic?
What companies are producing information on this topic?
What is the relationship among companies/authors that are working in this domain?
The described embodiments utilize advanced visualization techniques to reveal the metadata associated with a set of documents or a search result. In order to understand the novel contributions of the present invention, it is useful to review other systems and techniques in this field, in particular within two areas of study: 1) existing methods of presenting metadata, and 2) visualization methodologies used for understanding large data sets.
Existing Methods of Presenting Metadata
Previous efforts to analyze and present metadata related to large data sets can be divided into a number of categories. A brief description of each and examples of the existing state of the art are provided below for the purpose of differentiating the present invention.
Statistical Analysis
One of the simplest and most widely used means of analyzing sets of documents is statistical analysis. Statistical analysis can be as simple as calculating the number of documents in the set by date, author/inventor, author/inventor affiliation, country, classification, or other attribute. It may also include calculation of statistics relevant to the particular type of data being examined. For instance, in the patent data domain, statistics like number of citations, citations/patent/year, time from filing to grant, age of most recent citation, age of most recent academic citation, and other statistics are sometimes calculated. These statistical methods are employed widely, and are in some instances automated in commercial applications such as those offered by Delphion, Micropatent and CHI Research in the patent space and many others in other domains.
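As a minimal sketch of this kind of statistical analysis, the following Python example counts documents by an attribute and computes an average time from filing to grant. The records and field names are hypothetical and do not reflect the data model of any of the commercial tools named above.

```python
from collections import Counter

# Hypothetical patent records; field names are illustrative only.
records = [
    {"id": "P1", "assignee": "Company A", "filed": 1998, "granted": 2000},
    {"id": "P2", "assignee": "Company B", "filed": 1999, "granted": 2002},
    {"id": "P3", "assignee": "Company A", "filed": 2000, "granted": 2002},
]

def count_by(records, attribute):
    """Number of documents for each value of the given attribute."""
    return Counter(r[attribute] for r in records)

def mean_years_to_grant(records):
    """Average time from filing to grant, in whole-year differences."""
    return sum(r["granted"] - r["filed"] for r in records) / len(records)

by_assignee = count_by(records, "assignee")
avg_grant = mean_years_to_grant(records)
```

Such counts summarize the set one attribute at a time; they say nothing about how the individual documents relate to one another.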
Statistical analysis can provide some useful insight into the set of documents under evaluation, but is clearly limited as to the amount of insight that can be obtained. The best-known tools of this type provide textual reports or simple bar charts showing the number of documents with each attribute value (e.g. How many documents by Company A, Company B, Company C, etc.) or the statistics associated with the overall document set (e.g. Average time from filing to grant). They do not provide information about how the various documents are related to each other, and they do not provide a means for interacting with the metadata in a way that allows the user to explore what the various attributes of the documents reveal about the overall document set. It is an objective of one or more embodiments of the present invention to provide a means for users to understand the relationships among groups of documents and to provide a means for deep exploration into the metadata associated with the document set or search result.
Clustering
Another method used for revealing metadata about large sets of documents is clustering. Various tools have been developed that group documents into clusters. Some of these tools separate documents into clusters based on a fixed taxonomy of categories, while others utilize syntactic information within the documents to cluster them into a dynamic set of categories. Two examples of fixed taxonomy clustering tools are the Northern Light search engine and The Brain's <thebrain.com> web search tool. The fixed taxonomy clustering method is accomplished in one of two ways. First, categories may be based on explicit attributes of the documents. For instance, Internet search results can be divided into categories based on their domain extensions such as “.com”, “.net”, “.edu”, or their country domain such as “.sp”, “.ge”, “.jp”, etc. Second, categories may be based on a taxonomy to which documents in the data repository have previously been assigned. This is generally accomplished by manually reviewing documents, or the domains under which those documents fall, and assigning them to one or more categories within the fixed taxonomy.
A second method of clustering documents or search results is based on the creation of a dynamic taxonomy. These clustering techniques use syntactic data within the documents to cluster the document set into smaller groups and “name” those groups based on the words or phrases they have in common. The clustering method essentially creates an automated classification schema that can provide insight into the nature of the documents in the set. This technique has been applied to a wide variety of document types, and various commercial software applications are available which perform this function. Examples of the use of clustering techniques within the domain of patents include the Vivisimo and Themescape tools <micropat.com/static/advanced.htm> that are incorporated into Micropatent's Aureka <micropat.com/static/index.htm> tool set and the Text Clustering tools <delphion.com/products/research/products-cluster> available in Delphion's tool set. Vivisimo's tools can be configured to operate on any set of text documents, as can the semantic analysis tools developed by Inxight <inxight.com/products/smartdiscovery>.
Using these clustering tools, basic metadata about a document set or a search result can be presented. The methods employed by the above referenced tools can automatically display the number of documents in the set or search result that fall into each category, making it possible to more quickly “sift” through the results to find the piece of information that is being sought. They also provide some valuable information about the contents of the document set or search result.
The value of the best known clustering tools is limited in two important ways. First, the metadata provided about the contents of the document set is only as good as the taxonomy into which it is clustered. This is an inherent limitation of both fixed and dynamic taxonomy clustering techniques.
Fixed taxonomies are limited in their usefulness by a number of factors:
The taxonomy is based on the priorities of its creator, not the searcher. The creation of a taxonomy entails making choices about what attributes of the information are most important. For example, the first branches in a taxonomy of bird types could be established in multiple possible ways: migratory versus non-migratory, waterfowl versus landfowl, etc. Often, the priorities of the taxonomist are not aligned with the needs of the information user, thus limiting the value of the clustering metadata provided.
Fixed taxonomies cannot easily be adjusted as the contents of the database evolve. Once a taxonomy has been established and users have begun using it, it becomes rigid and difficult to change. As the contents evolve, there is inevitably a need to add new categories, sub-divide categories, and recombine categories. This makes it difficult to compare results over time. As an example, consider the taxonomy of technologies created by the WIPO known as the International Patent Classification system (IPC). The IPC is now in its seventh edition. In each edition, classes were added, moved, sub-divided and eliminated. However, the millions of patent documents that were filed prior to the revision remain classified under the original classification schema that existed at the time they were granted. This makes the presentation of clustering metadata problematic when based on a fixed taxonomy.
Another issue related to fixed taxonomies is that the documents in the data set typically do not fall into a single classification. This creates a classification problem that has typically been solved by assigning the documents to multiple categories within the taxonomy. This multiple assignment creates a challenge for how to display the clustered results when many documents fall into multiple categories. The typical solutions are to count each document only within a single (primary) classification, or to count the document multiple times, once for each category of classification. Both solutions have problems. The first ignores important information about secondary classifications, and the second represents multiple instances of each document.
The other major limitation of fixed taxonomies is the difficulty in assigning documents to the categories. Typically, this is a manual process that is done either by the author of the document or by a specially trained person or persons who take responsibility for classification. Once again, both options have problems. Author classification suffers from a lack of consistency, while centralized classification is extraordinarily time consuming when large numbers of documents must be classified.
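The two typical counting solutions for multiply classified documents can be compared in a short Python sketch. The document identifiers and class codes are hypothetical, and the first class listed for each document is assumed to be its primary classification.

```python
from collections import Counter

# Hypothetical documents; the first class listed is assumed to be primary.
docs = {
    "D1": ["A01", "B02"],
    "D2": ["B02"],
    "D3": ["B02", "A01", "C03"],
}

def count_primary_only(docs):
    """Count each document once, under its primary classification."""
    return Counter(classes[0] for classes in docs.values())

def count_every_assignment(docs):
    """Count each document once for every class it is assigned to."""
    return Counter(c for classes in docs.values() for c in classes)

primary = count_primary_only(docs)
multiple = count_every_assignment(docs)
```

In this example the primary-only count never shows class "C03", while the multiple count totals six entries for only three documents, which mirrors the two problems described above.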
Dynamic taxonomies have been created in order to overcome some of the limitations of fixed taxonomies. However, they have limitations of their own which diminish their usefulness in providing metadata about a large document set. Some of the challenges associated with dynamic taxonomies are described below:
All dynamic taxonomy systems known by the inventors are based on semantic data. Simply put, the classification of documents is based on the similarity of the words contained in the documents. The problem with this is that all languages are extremely imprecise when it comes to expressing ideas. Any classification of documents based on semantic similarity will suffer from both synonymy (multiple words expressing the same meaning) and polysemy (single words having multiple meanings). Although there is certainly value in semantic clustering, the experience of the inventors shows that the clusters created are suggestive of the contents, but far from precise.
A second linguistic issue associated with semantic clustering is multiple languages. Semantic clustering tools completely fail when documents of different languages are included in the data set. As the trend toward globalization continues, this problem will continue to increase in importance. Some attempts have been made to use multilingual thesauri to allow linguistic comparison of multilingual document sets, but this research is still in its infancy.
A final limitation of dynamic taxonomies is the lack of comparability between clusters from one document set or search result and another. Because the taxonomy is created specifically for the document set, no two taxonomies created for different document sets or different search results can be compared.
Dynamic taxonomies also suffer from the multiple classification problem described above.
The second limitation of the clustering technique is that any taxonomy only describes the document set or search result in relation to a single attribute. Most taxonomies are meant to describe the topics or themes of the documents they categorize. While this information is useful, there is no system known by the inventors that allows users to simultaneously make use of clustering information as well as the variety of other available sources of metadata that describe the document set or search result. It is an objective of one or more embodiments of the present invention to provide users with a way to iteratively or simultaneously make use of the information contained in both fixed and dynamic taxonomies as well as a wide variety of other metadata sources in order to provide a deep level of insight about the document set or search result that meets the specific information needs of the user.
Visualization Methodologies Used for Understanding Large Data Sets
The most advanced methods of obtaining insight into the metadata related to large document sets or search results are the visualization techniques. The field of data visualization has progressed rapidly over the last several years as computer processors have become powerful enough to perform the many millions of calculations required to display complex data relationships. A number of data visualization tools are relevant to consider with respect to the present invention. These can be divided into several categories which will be described below. Relevant examples will also be provided for each.
Hierarchical displays: One visualization method which has been employed is the hierarchical display. In its simplest form, documents or search results are represented in the form of a tree structure similar to the directory structure, which is a well-known metaphor for displaying categorized data. One example of a hierarchical display designed to reveal metadata is Vivisimo's clustering tool described above. Because of the difficulty in displaying and comprehending a large hierarchical structure, several alternative methods have been developed to display these hierarchies. One example is the fisheye lens, which is used to display large hierarchies of patent citations within Micropatent's Aureka tool set. The fisheye display allows users to zoom in on a portion of the hierarchy while still comprehending their position within the overall hierarchy.
Another sophisticated example of a hierarchical display is the Grokker tool developed by Groxis, Inc. and described in U.S. Pat. No. 6,879,332B2. Much like the Vivisimo tools, the Grokker tool clusters documents in a hierarchical structure based on a semantic algorithm. Unlike Vivisimo, the Grokker tool presents information to users in a stylized Marimekko diagram. The Grokker visualization represents the document set in a two-dimensional space with each cluster of documents sized based on the number of documents in the cluster. The space on the screen represents the overall search result. Within this space, clusters of documents are displayed (represented by circles or squares) and labeled based on a common word found within those documents. Within each cluster are further “sub-clusters”, again represented visually and labeled with a keyword. The hierarchy descends until finally the documents themselves are found at the lowest level of the hierarchy.
Each of these leading examples of hierarchical data visualization is based on latent semantic information contained within the documents and as such, suffers from the limitations of semantic analysis as described above in the section describing fixed and dynamic taxonomies.
Spatial visualizations: A second type of visualization used to reveal metadata within a large document set is the spatial visualization. Spatial visualization uses a map metaphor to arrange document records in a two or three-dimensional space. Although the various spatial visualization tools differ somewhat, those known to the inventors follow a similar methodology for creating a map. This method entails four steps:
1) Calculate a semantic vector for each document: for each document in the dataset, calculate a vector to represent the semantic content of the document (typically based on a histogram of word or concept usage).
2) Create a similarity matrix: using the semantic vectors for each document, calculate a similarity metric for each document pair and thereby create a document similarity matrix.
3) Create a two or three-dimensional projection based on the similarity matrix: using principal component analysis or a similar method (e.g. multidimensional scaling), calculate locations for each document in the set such that the distance between documents best reflects the similarity between documents as described by the similarity matrix.
4) Draw a visualization of the information space: using the two or three-dimensional projection, plot the documents as points within a document space.
Some spatial visualization tools take the further step of adding a topographical overlay to the information space to reveal the degree of clustering. Some may even identify and label clustered groups based on words that are common within the cluster.
An example of a spatial visualization tool is the Themescape map, which is part of the patent analysis toolkit developed by Aurigin Systems and is now part of the offering provided by its acquirer The Thomson Corporation through its subsidiary Micropatent. The Themescape visualization tool uses semantic analysis of patent titles, abstracts or full text (at the user's discretion) to create a two-dimensional projection of the information space based on the method described above. As is shown in FIG. 4, Themescape uses a map metaphor and overlays a topography over the information space with mountains representing the most highly clustered portions of the information space. Users of the Themescape map can explore the terrain by searching the information space for company names and other keywords or by selecting document clusters to read or export back into a document list for further review or analysis.
The underlying technology for the Themescape tool came from research performed at the Pacific Northwest National Laboratory, which also has a spatial visualization tool known as SPIRE (Spatial Paradigm for Information Retrieval and Exploration). As is shown in FIG. 5, SPIRE has two visualization analogies. The first, the “Starfield”, shows a plot of documents in three dimensions in a view that looks very much like a starry sky. The second, the “Theme view”, is a topographical metaphor very similar to the implementation in Aurigin's Themescape map.
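The first two steps of this method can be sketched in a few lines of Python. The sketch assumes a plain word-count histogram as the semantic vector and cosine similarity as the similarity metric; both are illustrative choices, not the methods used by any named tool. The projection and drawing steps are omitted, since they would normally rely on a numerical library.

```python
import math
from collections import Counter

def semantic_vector(text):
    """Step 1: a word-usage histogram standing in for the semantic vector."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_matrix(texts):
    """Step 2: pairwise similarity for every document pair."""
    vectors = [semantic_vector(t) for t in texts]
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Hypothetical three-document set.
texts = [
    "network graph nodes links",
    "graph nodes and links",
    "weather forecast tomorrow",
]
matrix = similarity_matrix(texts)
```

The first two documents share three of their four words and score 0.75, while the third shares none and scores 0. A projection step would therefore place the first two close together and the third far from both.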
While quite useful in developing a general understanding of the information contained in a large dataset, the spatial visualization tools known to the inventors base their visualization solely on latent semantic information contained within the documents and as such, suffer from the limitations of semantic analysis as described above in the section describing dynamic taxonomies.
Network visualization: The final visualization technique that is sometimes applied to increase understanding of the metadata associated with large data sets is network visualization. In its simplest form, a network diagram (mathematicians would call this a graph) is simply a set of nodes (typically represented as dots) connected by links (also known as edges or ties). Network graphs are not new; some network concepts date back at least to the ancient Greeks. Social network analysis developed significantly in the 1930s. The development of modern computers with powerful processors has made it possible to create computerized network visualization tools.
The network paradigm is a very valuable method to apply to analysis of large data sets. There are two specific reasons why the network lens is so valuable. First, most visualization tools are designed to draw attention to the entity being analyzed (typically a document, a person or an institution). While network visualizations display information about individual entities as well, they also place significant emphasis on the relationships between and among those entities. The network display shows not just the entities, but the system in which those entities operate. In recent years, various scientific and academic researchers have come to the realization that reductionist analysis (i.e., analysis that focuses on breaking a problem down into its component parts and thoroughly analyzing each component) is limited. Fields like biology, genetics, ecology, sociology, physics, astronomy, information science and many others have all seen advances based on systems analysis. Systems analysis focuses not on the smallest elements (e.g. genes, atoms or perhaps quarks, and bits), but on the interactions between and among those elements. The network tool is by its nature a systems visualization tool. It therefore can lead to entirely different kinds of insight and conclusions than can the other visualization tools within the prior art.
A second reason that network visualization tools are appropriate for analyzing large data sets is that networks have the potential to view the same set of information from a variety of viewpoints. Prior art network visualization systems do not take significant advantage of this fact, but networks have the potential to be transformed from one perspective to another, with each perspective providing a different insight about the data being analyzed. The description of the Network Visualization System below will describe how this can be accomplished in order to dramatically improve the insight that can be gained about large and complex datasets.
First, however, it is necessary to understand the present state of the art in network visualization and to identify some of the key limitations of the existing tools. A variety of computerized network visualization tools exist, including the following:
aiSee <aisee.com>
Cyram NetMiner <netminer.com>
GraphVis <graphvis.org>
IKNOW <spcomm.uiuc.edu/projectsITECLAB/IKNOW/index.html>
InFlow <orgnet.com/inflow3.html>
Krackplot <andrew.cmu.edu/user/krack/krackplot/krackindex.html>
Otter <caida.org/tools/visualization/otter/>
Pajek <vlado.fmf.uni-lj.si/pub/networks/pajek/>
UCINET & NetDraw <analytictech.com>
Visone <visone.de/>
Each one of these tools is capable of creating a network graph. The more advanced packages (e.g. UCINET/NetDraw, NetMiner) provide a range of visualization capabilities such as:
Choosing alternative layout algorithms
Displaying multiple node types
Sizing/coloring/selecting shape of nodes based on the value of an attribute
Displaying multiple link types
Sizing/coloring/selecting line type of links based on the type of link
All of these tools are general-purpose network visualization tools. In other words, they are designed to display network graphs of any data that is structured in such a way that both the nodes and links of the network are defined. Each of these tools uses a particular (and often unique) file format to capture information about nodes and node attributes, and links and link attributes. Node information is captured through a node list where each node is represented by a node record. Node records contain at least one field which is a unique identifier for that node, but can also contain other attribute fields that provide information about the node. Link information is captured through a link list (or link matrix) which at a minimum identifies which two nodes are linked, but may also capture information like link strength, link direction, and link type.
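The node list and link list described above can be sketched as simple Python data structures. The class and field names are hypothetical and do not correspond to the file format of any of the listed tools; the sketch only shows the minimum content such a format has to carry, plus a conversion from link list to link matrix.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str  # unique identifier, the one required field
    attributes: dict = field(default_factory=dict)  # optional attribute fields

@dataclass
class Link:
    source: str  # identifier of the first linked node
    target: str  # identifier of the second linked node
    strength: float = 1.0
    directed: bool = False
    link_type: str = "generic"

def validate(nodes, links):
    """Check that node ids are unique and every link joins known nodes."""
    ids = [n.node_id for n in nodes]
    if len(ids) != len(set(ids)):
        raise ValueError("node identifiers must be unique")
    known = set(ids)
    for link in links:
        if link.source not in known or link.target not in known:
            raise ValueError(f"link refers to unknown node: {link}")
    return True

def link_matrix(nodes, links):
    """Convert a link list into the equivalent link matrix."""
    index = {n.node_id: i for i, n in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0.0] * size for _ in range(size)]
    for link in links:
        i, j = index[link.source], index[link.target]
        matrix[i][j] = link.strength
        if not link.directed:  # undirected links are stored symmetrically
            matrix[j][i] = link.strength
    return matrix

nodes = [Node("a", {"label": "Author A"}), Node("b"), Node("c")]
links = [
    Link("a", "b", strength=2.0),
    Link("b", "c", directed=True, link_type="cites"),
]
ok = validate(nodes, links)
matrix = link_matrix(nodes, links)
```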
Although the tools differ in their details, the process of working with them follows a common pattern, as shown in FIG. 6. A user of any of the known prior art systems gathers data from whatever sources are to be utilized. She then chooses a definition of what entity within the data will represent nodes and what information she will use to create links between the nodes. The data must then be formatted to match the particular file structure of the network visualization tool. In all cases, this requires the user to create a list of nodes and a link-list or link-matrix. Once formatted properly, the files of network data can then be input into the network visualization system and analyzed and visualized. The user can work with the data within the tool and select different layout algorithms or display attributes, and analyze the structure of the network using any provided analytical tools.
If the user would like to develop an alternative visualization of the data using a different definition of nodes and/or links, she must start from the beginning, redefine nodes and links, reformat the data into a node and link-list and re-introduce the new files into the visualization system. The system can then display a network graph based on the new definition of nodes and links. Some of the inherent limitations of these prior art systems include the following:
Database records from an arbitrary data source cannot be visualized directly because they do not contain node and link information that is usable by the system.
The process of accessing and formatting data is not integrated into the network visualization tool.
The user must format data into node/link lists to accommodate the system.
The user must select a stable definition of what constitutes a node and what constitutes a link prior to formatting the data for use in the system.
There is no way to change definitions of nodes and links while working within the network visualization system.
If a new node/link definition is chosen, there is no way to combine or connect the network based on the first definition with the network based on the second definition, even though both networks are based on the same underlying data.
There is no way to specify particularly useful node and link definitions to be used repeatedly with data from a particular source. Each time data from that source is to be visualized, the user must start from the beginning and specify each node and link definition and manipulate the data to accommodate the visualization system.
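The reformatting burden described above can be illustrated with a minimal Python sketch in which the same records are converted into node and link lists under two different definitions. The records, field names, and the two definitions (co-authorship and shared authors between companies) are hypothetical examples; the sketch shows only the manual preparation step that a prior art tool leaves to the user, not any particular tool's behavior.

```python
from itertools import combinations

# Hypothetical document records; field names are assumptions.
records = [
    {"id": "R1", "authors": ["Ann", "Bob"], "company": "Acme"},
    {"id": "R2", "authors": ["Bob", "Cy"], "company": "Acme"},
    {"id": "R3", "authors": ["Ann", "Cy"], "company": "Birch"},
]

def coauthor_network(records):
    """Definition 1: nodes are authors; a link joins co-authors of a record."""
    nodes = sorted({a for r in records for a in r["authors"]})
    links = sorted({
        tuple(sorted(pair))
        for r in records
        for pair in combinations(r["authors"], 2)
    })
    return nodes, links

def company_network(records):
    """Definition 2: nodes are companies; a link joins companies that share an author."""
    nodes = sorted({r["company"] for r in records})
    companies_by_author = {}
    for r in records:
        for author in r["authors"]:
            companies_by_author.setdefault(author, set()).add(r["company"])
    links = sorted({
        tuple(sorted(pair))
        for companies in companies_by_author.values()
        for pair in combinations(sorted(companies), 2)
    })
    return nodes, links

# Each definition needs its own pass over the data and its own output files.
author_nodes, author_links = coauthor_network(records)
company_nodes, company_links = company_network(records)
```

Each definition produces an unrelated pair of lists, so nothing in the output connects the author network to the company network even though both come from the same three records.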