1. Field of the Invention
The present invention relates to a method and system for constructing compact isomorphism certificates for partially ordered sets (posets), useful for comparing digital objects representable as posets. The present invention specifically relates to a method and system for fast construction of an (Omicron(P),Iota(P)) certificate, a minimal Phi(P) certificate and Phi-ranked symmetry classes for posets, and to their induced usages. Each poset certificate is a canonical representative for a class of isomorphic posets. The present invention provides a pair of one-dimensional keys for encoding a poset, which enables compact storage.
2. Discussion of the Related Art
Many kinds of objects or data, such as genetic maps, multiple sequence alignments or taxonomies, can be usefully represented by posets. Posets can be seen as generalizations of arbitrary graphs and lattices. Graphs are configurations described by vertices or nodes and connections or edges. They are widely used for analyzing physical networks such as transportation networks, computer networks, electrical circuits, and the Internet; in VLSI design, robotics, and compilers; for modeling document structure using XML, programs using data flow graphs, or chemical compounds, where atoms and bonds are modeled as vertices and connections, respectively; and for enumerating structural possibilities, including chemical structures (e.g., hydrocarbons, drug compounds, and polymer networks) and genetic and biochemical networks. Graphs have also been used for analysis of RNA structures. Furthermore, graphs are used to model financial networks, for example, market graphs or the flow of losses across a network of banks. Lattice models form the standard way of valuing stock options. In many cases, the term semi-structured data appears to be accepted for data that is described using basic graph theory.
The term “poset” refers to a partially ordered set with strict partial order. Analyzing posets with strict partial order for isomorphism is no restriction of the general case. A strictly partially ordered set is a pair (V,>), consisting of an ordering relation ‘>’ and a set of objects or vertices V, named “base set”, such that for any two objects x,y in V either (1) x>y, or (2) y>x, or (3) x and y are unrelated. For any objects x,y,z in V the following holds: (1) not x>x, (2) if x>y, then not y>x, and (3) if x>y and y>z, then x>z. Two posets (V1,>1), (V2,>2) are said to be isomorphic if there exists a bijective mapping pi: V1 → V2 such that for all x,y in V1, x >1 y if and only if pi(x) >2 pi(y). An isomorphic mapping of the vertices of a poset onto themselves is an automorphism. When the number of objects and relations of a poset is large, no efficient (non-exponential) method to test whether two arbitrary posets are isomorphic is known. Known methods for checking isomorphism of a pair of posets typically require exponential worst-case time. In a poset database, not only is comparing individual posets difficult, it is also required to perform the comparison operation with all posets in the database. The complexity is twofold: (1) searching for candidate members to match among the huge number of candidates, and (2) checking isomorphism with the retrieved candidates. The total number of poset isomorphism classes grows exponentially with the number of vertices of the poset.
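The definitions above can be illustrated by a minimal sketch, which is not part of any known efficient method. It assumes a poset is given as a collection of vertices and a set of pairs (x, y) meaning x>y; the function names `is_strict_poset` and `are_isomorphic` are hypothetical. The isomorphism test tries every bijection and therefore exhibits exactly the exponential worst-case behavior discussed above.

```python
from itertools import permutations


def is_strict_poset(vertices, relation):
    """Check the three axioms of a strict partial order.

    `relation` is a set of pairs (x, y), each meaning x > y.
    """
    # (1) Irreflexivity: not x > x.
    for x in vertices:
        if (x, x) in relation:
            return False
    for (x, y) in relation:
        # (2) Asymmetry: if x > y, then not y > x.
        if (y, x) in relation:
            return False
        # (3) Transitivity: if x > y and y > z, then x > z.
        for (y2, z) in relation:
            if y2 == y and (x, z) not in relation:
                return False
    return True


def are_isomorphic(v1, r1, v2, r2):
    """Brute-force test: try every bijection pi from V1 to V2."""
    v1, v2 = list(v1), list(v2)
    r1, r2 = set(r1), set(r2)
    if len(v1) != len(v2) or len(r1) != len(r2):
        return False
    for image in permutations(v2):
        pi = dict(zip(v1, image))
        # pi is an isomorphism iff it maps the relation >1 exactly onto >2.
        if {(pi[x], pi[y]) for (x, y) in r1} == r2:
            return True
    return False
```

For n vertices the loop may visit all n! bijections, which is why this sketch is only practical for very small posets.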
Since the number of poset isomorphism classes is exponential in the number of vertices, even a linear time isomorphism algorithm will result in a potentially exponential search if a query object needs to be compared to all represented poset classes. In current systems, an object can be accessed in O(log(|E|)) time, where |E| denotes the number of entries if the data objects are sorted based upon a numerical ordering. This leads to the question whether “certificates” for poset isomorphism classes exist. A function f from a class of objects C to the natural numbers is called a “certificate” with respect to some equivalence relation like isomorphism if and only if it has the following properties: (i) supposing f(P1)=f(P2), then P1 is equivalent to P2; and (ii) supposing P1 is equivalent to P2, then f(P1)=f(P2) for all pairs of objects P1, P2 in C. f is a “deterministic certificate”, if f is a certificate and f is computable in polynomial time. In current systems, no deterministic certificate with respect to isomorphism is known for arbitrary posets. However, there exists a need for deterministic certificates for arbitrary posets.
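The two certificate properties (i) and (ii) can be made concrete with a small sketch. It is an assumption-laden illustration only: the hypothetical function `brute_force_certificate` takes the smallest adjacency bit string over all vertex orderings, which is a certificate in the above sense but needs exponential time, so it is not a deterministic certificate and is not the certificate of the present invention.

```python
from itertools import permutations


def brute_force_certificate(vertices, relation):
    """Map a poset to a natural number that is equal for isomorphic posets.

    `relation` is a set of pairs (x, y), each meaning x > y.
    Exponential time: all n! vertex orderings are examined.
    """
    vertices = list(vertices)
    relation = set(relation)
    best = None
    for order in permutations(vertices):
        # Row-major adjacency bit string of the relation under this ordering.
        bits = "".join(
            "1" if (x, y) in relation else "0" for x in order for y in order
        )
        if best is None or bits < best:
            best = bits
    # A leading 1 preserves the string length, so posets with different
    # numbers of vertices never receive the same number.
    return int("1" + best, 2)
```

Because the minimum is taken over all orderings, relabeling the vertices cannot change the result (property ii), and because the bit string determines the relation up to relabeling, equal results imply isomorphic posets (property i).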
Identifying symmetries is another important application of poset isomorphism. An automorphism is an isomorphic mapping of the vertices onto themselves. The set of all automorphisms of a given poset is the base of the automorphism group of the poset. The automorphisms of a poset define the automorphism partition of its vertices in the following way: two vertices v1,v2 are in the same class if there exists an automorphism which maps v1 onto v2. Poset automorphism may be used as a framework for modeling symmetry. By exploiting symmetries one can often save an exorbitant amount of time when only considering representatives of classes of symmetric but equivalent variables. Hence, symmetry detection and reduction is often crucial to the success of making reasoning and equation solving problems computationally tractable.
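The automorphism partition described above can be sketched for very small posets as follows. This is a brute-force illustration under the assumption that the poset is given as a set of pairs (x, y) meaning x>y; the function name `automorphism_partition` is hypothetical, and the enumeration of all permutations takes exponential time.

```python
from itertools import permutations


def automorphism_partition(vertices, relation):
    """Return the automorphism partition as a set of frozensets.

    Two vertices share a class if some automorphism maps one onto the other.
    """
    vertices = list(vertices)
    relation = set(relation)
    # Every vertex is mapped onto itself by the identity automorphism.
    orbit = {v: {v} for v in vertices}
    for image in permutations(vertices):
        pi = dict(zip(vertices, image))
        # pi is an automorphism iff it maps the relation onto itself.
        if {(pi[x], pi[y]) for (x, y) in relation} == relation:
            for v in vertices:
                orbit[v].add(pi[v])
    return {frozenset(cls) for cls in orbit.values()}
```

For a poset with one top element above three mutually unrelated elements, the three lower elements fall into one class, so a symmetry-aware procedure would only need to consider one representative of them.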
Those skilled in the art are familiar with the graph isomorphism problem, which consists of deciding whether two given graphs are isomorphic, i.e. whether there is a one-to-one mapping (a permutation) from the vertices of one graph to the vertices of the second graph such that the edge connections are respected. As those skilled in the art recognize, the graph isomorphism problem can be efficiently translated into a poset isomorphism problem. In particular, undirected graphs can be considered as bipartite posets, since any graph can be transformed into a bipartite poset by replacing each edge by two directed edges connected with a new vertex. Hence the upper vertices of this bipartite poset, each connected with a pair of directed edges, correspond to the edges of the original graph, and the lower vertices correspond to its vertices.
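The transformation of an undirected graph into a bipartite poset can be sketched as follows. The sketch assumes a simple graph without self-loops, given as a list of vertices and a list of vertex pairs; the function name `graph_to_bipartite_poset` and the tags "v" and "e" are hypothetical choices made for illustration.

```python
def graph_to_bipartite_poset(vertices, edges):
    """Turn an undirected simple graph into a bipartite poset.

    Returns (base_set, relation), where relation holds pairs (x, y)
    meaning x > y. Each edge {u, w} becomes a new upper element that
    lies above the two lower elements representing u and w.
    """
    lower = [("v", v) for v in vertices]
    upper = []
    relation = set()
    for u, w in edges:
        # frozenset makes (u, w) and (w, u) denote the same edge element.
        e = ("e", frozenset((u, w)))
        if e not in upper:
            upper.append(e)
        relation.add((e, ("v", u)))
        relation.add((e, ("v", w)))
    return lower + upper, relation
```

Since no element is both above and below another, the resulting relation is trivially transitive, and every upper element lies above exactly two lower elements.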
Much research has been devoted to the graph isomorphism problem. Apart from the present invention, neither an efficient (polynomial) worst-case algorithm to check whether two arbitrary graphs are isomorphic is known nor has the conjecture been proved that such an algorithm can exist. Certain complexity-theoretic evidence suggests that the graph isomorphism problem is unlikely to be NP-complete. When the class of graphs to be dealt with is restricted, in many cases the resulting less complicated isomorphism problem can be solved by polynomial-time algorithms. For example, if the vertex degrees are bounded by a constant, Luks's group-theoretic algorithm performs isomorphism testing in polynomial time. There are a number of general graph isomorphism algorithms, but there are always cases for which the known algorithms are not efficient or not even able to find any solution at all. Furthermore, no construction is known that would enable a quantum computer to check graph isomorphism in polynomial time. Graph isomorphism is in contrast to subgraph isomorphism, in which the problem is to decide whether there is a subgraph of the first graph that is isomorphic to the second graph. The subgraph isomorphism problem is known to be NP-complete. Despite the previously mentioned advances, there exists a need for an efficient (polynomial) worst-case method and system to determine whether two arbitrary graphs are isomorphic.
The correspondence between poset and graph isomorphism implies that whenever an efficient system for handling classes of isomorphic posets is devised, an efficient system for processing a collection of graphs can be constructed by augmenting the poset encoding system with a pre- and post-processing unit which converts graphs into designated posets and vice versa. There are many problems of theoretical, practical and technical relevance which can be efficiently translated into graph isomorphism problems. A problem is said to be graph isomorphism-complete if it is provably as hard as graph isomorphism. Deciding whether two terms containing otherwise uninterpreted associative, commutative, and associative-commutative function symbols and commutative variable-binding operators are equal is efficiently reducible to determining whether two graphs are isomorphic. The problems of testing two semigroups, two context-free grammars or two finite automata for isomorphism have also been shown to be efficiently reducible to graph isomorphism. In addition, RDF model equality is graph isomorphism-complete.
These and many other practically relevant applications require a fast algorithm for checking isomorphism of two graphs. Checking isomorphism of two graphs is of critical importance in numerous practical areas such as chemistry, bioinformatics, cryptography, switching theory, program analysis, computer aided design, program verification, safety and security checking, computer vision, object recognition, multimedia document management, networking, robot construction, mechanical engineering, etc. In most cases, the task at hand includes both comparing pairs of graphs and searching large databases or even the Internet and retrieving isomorphic (similar) objects. With the ever growing body of data, it will be necessary to perform millions of comparisons within a database. However, a polynomial-time certificate algorithm could be applied once to construct a certificate code(G) for a given graph G, and then, in average time O(|code(G)|*log(N)), the certificate for the given graph could be compared to certificates stored in a database sortable according to the certificates. Thus, there exists a need for a system and method employing a deterministic polynomial-time computable certificate to be used for general graph isomorphism.
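The database lookup described above can be sketched as a sorted index searched by bisection. The sketch assumes that certificates have already been computed by some certificate function and are comparable values such as integers; the class name `CertificateIndex` and the record identifiers are hypothetical.

```python
from bisect import bisect_left, insort


class CertificateIndex:
    """Sorted list of (certificate, record_id) pairs.

    A lookup performs O(log N) comparisons, each of which may cost up to
    O(|code(G)|) for long certificates.
    """

    def __init__(self):
        self._entries = []

    def add(self, certificate, record_id):
        # insort keeps the list sorted; the list shift itself is O(N).
        insort(self._entries, (certificate, record_id))

    def find(self, certificate):
        """Return the ids of all records stored under this certificate."""
        # The 1-tuple (certificate,) sorts before every (certificate, id).
        i = bisect_left(self._entries, (certificate,))
        found = []
        while i < len(self._entries) and self._entries[i][0] == certificate:
            found.append(self._entries[i][1])
            i += 1
        return found
```

With such an index, retrieving all stored objects isomorphic to a query reduces to one certificate computation followed by one bisection, instead of a pairwise isomorphism test against every stored object.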
Following are some specific exemplary applications having a need for a fast algorithm for checking isomorphism of two graphs, including some of the unique issues associated with each application:
(1) Computer Aided Design
In engineering, complex systems are usually described as a configuration of smaller components. A configuration can be viewed as a graph of interconnected components. An inherent difficulty in solving configuration problems is the existence of many structural isomorphisms. One of the motivations for performing part similarity assessment is to locate similar parts or components in a large database or even the Internet. The popularity of 3D CAD systems in the product design and manufacturing industry is resulting in a large number of CAD models being generated. The techniques to assess similarity of 2D models do not extend to 3D models. Similarity assessment in 3D cases is usually carried out by generating shape certificates from the 3D models and then comparing these certificates using suitable distance functions. One technique is to use graphs as shape certificates for 3D models. The comparison can then be carried out by graph isomorphism testing. Techniques based upon graph isomorphism are used to pare the design space for high-level scheduling and interconnect topology synthesis for embedded multiprocessors. Only isomorphically unique graphs are considered. The result is a reduction by orders of magnitude. Instances of the Boolean satisfiability problem (SAT) arise in many areas of design automation, including microprocessor verification and FPGA layout. It has been shown that exploiting symmetries results in vastly reduced SAT solver run times. Symmetry detection may be performed by constructing a graph from the conjunctive normal form and obtaining a generating set of the automorphism group of the graph. Moreover, symmetry-based reduction techniques have been successfully used in the synthesis of logic circuits and DSP algorithms. There exists a need for fast symmetry detection in order to reduce overall computation time.
(2) Machine Vision and Pattern Recognition
Graph isomorphism has been used for object recognition and classification tasks such as those that exist in a robot workspace. A reliable and simple method for detecting isomorphism between pairs of kinematic chains such as robots and machine tools is also desirable. Further applications of graph isomorphism checking include low-level image, facial, character or graphical symbol recognition. It has been argued that better graph isomorphism methods can be used to improve the procedures for properly matching the ridges of fingerprint images. Graph isomorphism is also an essential technique underlying pictographic matching.
(3) Safety and Security Analysis
Methods and algorithms for finding differences between two versions of a binary executable file based on graph isomorphism are known. A method that allows an isomorphism between two differing but similar executables to be constructed iteratively is also known. Such an isomorphism has multiple interesting applications: it allows rapid reverse engineering of security updates, automatic classification of malware and detection of code theft. Systems for safety checking on systems with pointers have been devised, but the bottleneck in these systems is the exponential runtime complexity caused by checking graph isomorphism. Checking properties of state spaces that are generated by model-checking also involves isomorphism testing. Dynamic data structures with pointer links cause extremely difficult verification problems. Automorphism symmetry reduction techniques have been investigated for the verification of software systems with indirect reference chains. There exists a need for a practical framework for the efficient verification of such software systems. Graphs with millions of vertices or more are prevalent in many domains. It has been shown that the correctness of a VLSI circuit layout can be verified by utilizing techniques for solving the classical graph isomorphism problem. These tools currently are not effective enough to verify circuits containing tens of millions of transistors. Propositional logic formulas generated to model verification problems often exhibit considerable structural symmetry: there exist many rearrangements of the variables which yield the same formula. Exploiting symmetries can often reduce the complexity of reasoning. However, it can be expensive to compute the full equivalence class based on known techniques. Accordingly, there exists a need for detecting symmetries using graph isomorphism techniques.
(4) Chemical and Biological Databases
Graph data management has long formed the basis for chemical and biological information retrieval systems. The chemical industries have created extensive databases of both 2D and 3D structures of chemical molecules, and a variety of data mining tools are routinely used to support the discovery of novel pharmaceuticals and agrochemicals. The principal method of representation for a 2D chemical structure or 3D molecule is a labeled graph in which the vertices and edges of the graph represent the atoms and bonds, respectively, of a molecule. Many kinds of data arising in bioinformatics can be usefully represented as graphs. Biopathways are among the most important examples of such graph data. It has been shown that comparing the stoichiometric structure of two reaction systems, a crucial problem in understanding the mechanisms encoded in the genome, is equivalent to the graph isomorphism problem. A chemical and/or biological database can hence be represented by a large number of particular graphs. Structure searching involves an exact-match search of a chemical and/or biological database for a specific query structure. Such a search involves a graph isomorphism search, in which the graph describing the query is checked for isomorphism with the graphs in the database. There exists a need for applying efficient graph isomorphism techniques to conduct this type of search.
(5) Knowledge, Multimedia and Document Management
In many other domains, graphs in various forms provide the underlying concept for modeling objects and processes. For instance, software engineering employs a number of modeling approaches (e.g., computer programs), languages (e.g., UML), and ISO standards based on graphs to capture non-linear aspects of workflows, data flows, data dependencies, project organizations, vector-based graphics, etc. Web computing combines the graph-based approaches from a variety of disciplines: objects, documents, and services are represented using HTML and XML, and metadata is captured using numerous formats based on XML and RDF. The web as a whole is also subject to modeling approaches; web graphs play a vital role in search engine design and operation. In addition, a variety of application domains depend on graph-based models, for instance, linguistic expressions, customer relationship models in e-commerce or market models in financial optimization approaches. Model searching involves an exact-match search of a database for a specific query. Such a search involves a graph isomorphism search, in which the graph describing the query model is checked for isomorphism with the graphs in the database. There exists a need for an efficient method for retrieving isomorphic graphs.
(6) Compact Encoding
In an embedded system, instructions permanently reside, for example, in a ROM or flash memory chip. The cost of storing the instructions on a chip can be as high as the cost of a microprocessor. Compressing code to reduce the amount of memory required is an attractive way to reduce costs. Some RISC processors support aggressive code size reduction even at the cost of significant performance loss. Code compression is the technique of using data compression to reduce the code memory size for memory-limited, embedded computers. The instructions are compressed and stored in the embedded system. At run-time, the compressed instructions are incrementally decompressed and executed. While compressed code has better density, its performance is typically lower because additional effort is required to decompress the instruction stream. It has been shown that the sequence of instructions can be encoded by a data flow graph. Consequently, there exists a need for an improved physical process in which instructions are compressed efficiently and without any redundancies.
(7) Coding Schemes and Storage of Semi-Structured Data
Scientific, governmental and industry consortiums generate standards in the form of digraphs such as the Gene Ontology digraph, the ICD-9 and ICD-10 medical naming convention, or SNOMED. A unified coding scheme for consistent mechanism representation is developed by identifying non-isomorphic contracted graphs, as well as a unique numbering for the higher order links in them, providing a consistent coding scheme for all classes of mechanisms. Data is then associated with these classifications and a complex semi-structured dataset emerges. Scientific work relating to the exploration of the human and other genomes has produced massive data that cross-references complex graphs and structures. Indexing of semi-structured tree data is being addressed by all the major database vendors in one form or another, such as is evident for example in both the DB2 database system from IBM and in Oracle's database system. Unfortunately, relational databases do not effectively support data schemata with multiple hierarchies to manage this sort of data and they lack facilities to access data by hierarchical queries. Numerous approaches to devise database systems to handle semi-structured data objects, which are also represented in other formats such as XML (Extensible Markup Language), are used to attempt to overcome the access problem. Despite these efforts, there still exists a need for a carefully formulated method and system for capturing all relevant members of an isomorphic class of objects and retrieving those data objects that are isomorphic to a requested item.
(8) Efficient Physical Storage and Transfer of Huge Objects
Encoding a small poset structure is simple. Storing and transferring a poset with a massive number of vertices and edges in memory-constrained systems or with limited space on physical devices is an engineering challenge. Models of transportation networks, power grids, computer networks (e.g., the Internet), social networks, and intranets of large corporations are represented as graphs. Another example of a huge graph is the Web graph. A Web graph represents the link structure of the WWW, and can be used to study various properties of the Web. A variety of distinct encoding objectives are considered: (1) Compressing the underlying graph for storage or transmission, up to isomorphism. This setting is useful if only the graph structure itself is to be stored. (2) Compressing the underlying graph for storage or transmission, maintaining the information associated with the vertices of the graph. (3) Compressing the underlying graph in a way that still allows for efficient computation on the compressed form. Matrix representations, which are very common for storing the Web graph, are extremely large and sparse, which makes it impossible to keep them in memory. Efficient compression methods are required and are the subject of ongoing empirical research. There exists a need for a method and system for providing very compact encoding of sparse graphs, not requiring further compression.