1. Field of the Disclosure
The present embodiments of the invention enable encoding and decoding of networks of any size for rapid storage, query processing, and retrieval, as well as for analog discovery, data mining, and correlation of graph structure and content with semantic properties. The new encoding is called a Cognitive Signature and can be decoded to reproduce an input network. The database used to store all the Cognitive Signatures is called a Cognitive Memory and can be implemented on any relational database management system (RDBMS) that supports a spatial data representation using, for example, Multi-Vantage Point Trees. The method and schema for storage or query processing using Cognitive Memory are presented to provide fast analogical results or exact results to user queries. The embodiments of the invention provide a complexity on the order of n log(n) for recall. Uniquely, the embodiments of the invention enable topic and concept extraction as a natural part of the encoding process by association between the highest k-order complex in the generalized combinatorial maps (GMAP) and the most important underlying semantic properties of the data being encoded.
2. Description of the Related Art
Storage methods for networks (also variously referred to as graphs) have been based on variations of hashing and content-based access, as in image databases, chemical molecular databases and Internet network content databases. The world's most popular algorithm for network-based indexing is the Google™ Page-Rank algorithm, which operates on networks derived from Internet in-and-out links to hub pages of content. Topological or geometric metrics on networks, such as Hosoya's Topological Index and Google's Page-Rank respectively, whether used alone or in combination, are not sufficient to describe the content of images, especially in terms of variances over time, nor do they serve as a tool to express complex, relational, analogy-like queries where brittle matching between networks is undesired. In image processing, for example, graphs provide a good expression of content, but graph-based storage and retrieval becomes hard as the scale, size, resolution, number and fidelity of images, either singly or in sequence as in videos, increase, and this drives up the complexity of graph-based methods.
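As a small illustration of why a single topological metric can underdetermine a network, the following sketch computes the Hosoya index (the number of matchings in a graph, counting the empty matching) for two non-isomorphic graphs that each have four nodes and three edges. The function name `hosoya_index` and the edge-list representation are assumptions made for this illustration and are not part of the disclosure.

```python
def hosoya_index(edges):
    """Count the matchings of a graph given as a list of (u, v) edges.

    Uses the recurrence Z(G) = Z(G - e) + Z(G - u - v) for any edge e = (u, v);
    a graph with no edges has exactly one matching (the empty one).
    """
    edges = list(edges)
    if not edges:
        return 1
    (u, v), rest = edges[0], edges[1:]
    # Matchings that do not use edge (u, v).
    without_edge = hosoya_index(rest)
    # Matchings that use edge (u, v): drop every edge touching u or v.
    remaining = [(a, b) for (a, b) in rest if u not in (a, b) and v not in (a, b)]
    with_edge = hosoya_index(remaining)
    return without_edge + with_edge


# Illustrative graphs (hypothetical data): both have 4 nodes and 3 edges.
star_edges = [(0, 1), (0, 2), (0, 3)]       # node 0 joined to nodes 1, 2, 3
triangle_edges = [(0, 1), (1, 2), (0, 2)]   # triangle on 0, 1, 2; node 3 isolated

z_star = hosoya_index(star_edges)
z_triangle = hosoya_index(triangle_edges)
same_index = z_star == z_triangle           # the index alone cannot tell them apart
```

Both graphs evaluate to the same index even though one is a tree and the other contains a cycle, so a retrieval scheme keyed on this value alone would treat them as indistinguishable.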
In Internet search, Google™'s Page-Rank has been the dominant and most successful network indexing algorithm, yet it fails to capture the analogies between web-sites, as well as context, or even to serve as a means to profile web-site users by content representation. In algorithms such as Page-Rank and other graph algorithms, the main focus is on connected components and on identifying important semantic concepts by the so-called hubs, which represent the maximally connected components that capture the most important underlying concepts.
The majority of other graph based algorithms and their main clustering methods all build on a single, static view of the largest connected components of the graphs or networks formed from the incoming data: whether the data is text (i.e. forming Text Graphs) or images (i.e. segmenting and forming image graphs for visual pattern recognition) or financial networks or molecular or biochemical graphs and networks (for drug design or chemical property assessments).
In addition, for retrieving candidate graphs, currently there are two main approaches in the literature:
(i) Index based approaches such as Levinson's Universal Graph [3], SUBDUE and others [4]; and,
(ii) Vector based approaches such as Attribute Relational Graph “ARG” methods by Petrakis [5].
Methods (i) and (ii) fail when structural variability, complexity, diversity and features differ widely, or when the graphs undergo many dynamic changes. Neither method is well suited to encoding and storing sequences of dynamic changes to the graphs.
Index based approaches maintain a static, often pre-computed, set of hierarchical indexes of member graphs, which is traversed in response to a query. During the traversal, a distance metric between the query graph and the current index element is calculated via the index values and used for retrieval. Vector based approaches consider member graphs as vectors of features, and transform each graph onto a feature space. Usually, vectorization is performed on attributes of the graph. In this process, the structural properties that show how graph attributes are interlinked are neglected.
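The loss of structural information under attribute vectorization can be sketched as follows. This is a minimal illustration under stated assumptions, not the ARG method of Petrakis: the function `attribute_vector`, the label set `LABELS`, and the two example graphs are hypothetical and chosen only to show that graphs with identical attribute vectors can have different structure.

```python
from collections import Counter


def attribute_vector(node_labels, label_order):
    """Map a graph to a feature vector of node-label counts.

    Only node attributes are used; edges are ignored, which is the
    simplification this example is meant to expose.
    """
    counts = Counter(node_labels.values())
    return [counts[label] for label in label_order]


LABELS = ["C", "N", "O"]  # hypothetical attribute vocabulary

# Two graphs with the same labeled nodes but different connectivity.
node_labels = {0: "C", 1: "N", 2: "O", 3: "C"}
chain = {"nodes": node_labels, "edges": {(0, 1), (1, 2), (2, 3)}}
star = {"nodes": node_labels, "edges": {(0, 1), (0, 2), (0, 3)}}

vec_chain = attribute_vector(chain["nodes"], LABELS)
vec_star = attribute_vector(star["nodes"], LABELS)

same_vector = vec_chain == vec_star               # identical in feature space
same_structure = chain["edges"] == star["edges"]  # but the edge sets differ
```

In this sketch the two graphs map to the same point in feature space, so any distance computed on the vectors is zero even though the chain and the star are structurally different.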
Network retrieval in image databases is different from graph retrieval in chemical data banks and is very different from retrieval in text databases or hyperlinked webs, since the semantics are completely different. In some application areas, graph databases perform best when there are similar structures with variations on a theme (such as CAD drawings or other mechanical catalog parts), using a universal graph concept in which every graph is a variation of the universal graph stored in the database. This means that the member graphs are mostly similar with respect to structure, but the number of possible node and edge attributes for each graph would be large. In fact, every modern Object Oriented Relational Database Management System (OORDBMS) can be considered to be an attribute relational graph database. This is because a relational schema has an equivalent Entity Relation (ER) schema graph and hence is considered to be a graph database where member graphs are different instances of its ER schema graph. However, query processing and creation of a high complexity structure-oriented graph storage system has little in common with an OORDBMS, and hence there are no systems commonly available to store and retrieve networks at massive scales, because in most cases the graphs do not share near-identical structures but may be locally different though globally similar (e.g. as in protein structures) or locally similar but globally very different (e.g. as in graphs of texts in linguistic resources). Therefore, a method that accommodates these widely different perspectives is needed.
The “background” description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description which may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present invention.