1. Field of the Invention
This invention generally relates to methods and systems for analyzing graphs. More specifically, the invention relates to methods and systems for finding frequently occurring subgraphs, or motifs, in one or more graphs.
2. Background Art
Understanding large volumes of data is a key problem in a large number of areas, such as the World Wide Web, bioinformatics and so on. Some of the data in these areas cannot be represented as linear strings, which have been studied extensively and for which a repertoire of sophisticated and efficient algorithms exists. The inherent structure in such data is best represented as graphs. This is particularly important in areas such as bioinformatics or chemistry, since it might lead to an understanding of biological systems from indirect evidence in the data. Thus, automated discovery of “phenomena” is a promising path to take, as is evidenced by the use of motif (substring) discovery in DNA and protein sequences.
A protein network is a graph that encodes primarily protein-protein interactions, and it is important in understanding the computations that happen within a cell. Recurring topologies, or motifs, in such a setting have been interpreted as acting as robust filters in the transcriptional network of Escherichia coli. It has been observed that the conservation of proteins in distinct topological motifs correlates with the interconnectedness and function of that motif, and also depends on the structure of the topology of all the interactions. This indicates that motifs may represent evolutionarily conserved topological units of cellular networks, in accordance with the specific biological functions they perform. This observation is strikingly similar to the hypothesis used in dealing with DNA and protein primary structures.
Topological motifs are also being studied in the context of structural units in RNA and for structural multiple alignments of proteins. As yet another application, consider a typical chemical dataset: a chemical is modeled as a graph with attributes on the vertices and the edges. A vertex represents an atom, and its attribute encodes the atom type; an edge models the bond between the atoms it connects, and its attribute encodes the bond type. In such a database, very frequent common topologies could suggest a relationship to the characteristic of the database. For instance, in a toxicology-related database, the common topologies may indicate carcinogenicity or other toxicity.
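By way of illustration only, the attributed-graph model of a chemical described above can be sketched as follows. The function names, the bond-type strings and the choice of formaldehyde as the example are assumptions made for this sketch and do not appear in the source; the sketch is one possible encoding, not a required one.

```python
from collections import Counter


def make_molecule(atoms, bonds):
    """Build an attributed graph for a chemical (hypothetical helper).

    atoms: list of atom-type strings; the list index is the vertex id.
    bonds: iterable of (u, v, bond_type) triples between distinct vertices.
    Returns (vertices, edges) where vertices maps id -> atom type and
    edges maps frozenset({u, v}) -> bond type.
    """
    vertices = dict(enumerate(atoms))
    edges = {}
    for u, v, bond_type in bonds:
        if u == v or u not in vertices or v not in vertices:
            raise ValueError(f"invalid bond endpoints: {u}, {v}")
        edges[frozenset((u, v))] = bond_type
    return vertices, edges


def labeled_edge_counts(vertices, edges):
    """Count each (atom type, bond type, atom type) combination.

    The two atom types are sorted so that C-O and O-C are counted together.
    """
    counts = Counter()
    for pair, bond_type in edges.items():
        u, v = tuple(pair)
        a, b = sorted((vertices[u], vertices[v]))
        counts[(a, bond_type, b)] += 1
    return counts


# Formaldehyde (H2C=O), chosen only as a small example for this sketch.
formaldehyde = make_molecule(
    ["C", "O", "H", "H"],
    [(0, 1, "double"), (0, 2, "single"), (0, 3, "single")],
)
print(labeled_edge_counts(*formaldehyde))
```

Counting labeled edges in this way is the simplest possible "topology" statistic; a motif in the sense discussed here would involve larger connected subgraphs with matching vertex and edge attributes.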
In the field of machine learning, methods have been proposed to search for subgraph patterns that are considered characteristic and appear frequently; these use an apriori-based algorithm with generalizations from association discovery. In massive data mining, the data is extremely large, of the order of tens of gigabytes. Such data include the World Wide Web, internet traffic and telephone call detail, and are used to discover social networks and web communities, among other characteristics.
In biological data, the size of the database is not as large, yet it is still unsuitable for enumeration schemes. When such a scheme was applied, researchers had to restrict their motifs to small sizes such as three or four.
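To give a rough sense of why enumeration limits the motif size, the following minimal sketch enumerates candidate motifs by brute force. It assumes, purely for illustration, that a candidate is any set of k vertices that induces a connected subgraph; the source does not specify the enumeration scheme actually used, and the function names here are hypothetical.

```python
from itertools import combinations
from math import comb


def is_connected(subset, adjacency):
    """Return True if the vertices in subset induce a connected subgraph."""
    subset = set(subset)
    start = next(iter(subset))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        # Only follow edges that stay inside the chosen vertex subset.
        for neighbor in adjacency[node] & subset:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == subset


def connected_subsets(adjacency, k):
    """Brute force: test every k-vertex subset for connectivity."""
    return [
        subset
        for subset in combinations(sorted(adjacency), k)
        if is_connected(subset, adjacency)
    ]


# A small path graph 0 - 1 - 2 - 3, used only as an example.
adjacency = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(connected_subsets(adjacency, 3))

# Number of k-vertex subsets a brute-force scan must examine
# in a graph with 1000 vertices, for k = 3, 4, 5.
for k in (3, 4, 5):
    print(k, comb(1000, k))
```

Under this assumption, the number of subsets to examine is the binomial coefficient of the vertex count and k, which grows quickly with k even for graphs of modest size.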