1. Technical Field
The present invention relates to the classification of massive graph streams.
2. Discussion of the Related Art
In machine learning, classification is the assignment of a class label to an input object. Classification arises in the context of a number of domains such as chemical and biological data, the web and communication networks. As an example, web query topic classification/categorization deals with assigning a web search query (e.g., an input object) to one or more predefined categories (e.g., a class label), based on the query's topics. For example, users issuing a web query “apple” might expect to see web pages related to the fruit apple, or they might prefer to see products or news related to the computer company. Search result pages can be grouped according to the categories predicted by a query classification algorithm.
Many data domains such as chemical data, biological data and the web are structured as graphs. In the chemical and biological domains, the nodes of the graph are drawn from a modest base of possibilities and it is assumed that the data set is of modest size. On the other hand, the web graph, communication networks, and social networks may be defined on a massive underlying universe of nodes. A graph with more than 10^7 nodes may contain as many as 10^13 edges and thus be considered massive. The nodes may correspond to URL addresses in a web graph, the IP addresses in a communication network, or the user identifiers in a social network. Links between the URL addresses, IP addresses and user identifiers are edges.
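For a sense of scale, the edge count quoted above can be checked with simple arithmetic. The following is a minimal sketch, not part of the described invention: the function names are hypothetical, the graph is assumed to be undirected without self-loops, and the 4-byte node identifier is an assumed storage format chosen only for illustration.

```python
def max_undirected_edges(num_nodes: int) -> int:
    """Number of distinct unordered node pairs, i.e. n * (n - 1) / 2."""
    return num_nodes * (num_nodes - 1) // 2


def edge_list_bytes(num_edges: int, bytes_per_node_id: int = 4) -> int:
    """Bytes needed to store each edge as a pair of fixed-width node ids.

    The 4-byte default is an assumption made for illustration only.
    """
    return num_edges * 2 * bytes_per_node_id


if __name__ == "__main__":
    n = 10**7
    # Upper bound on distinct edges among ten million nodes.
    print(f"distinct node pairs among {n:,} nodes: {max_undirected_edges(n):,}")
    # Raw size of an explicit edge list holding 10^13 edges.
    print(f"bytes for 10^13 edges: {edge_list_bytes(10**13):,}")
```

Under these assumptions, ten million nodes admit roughly 5 x 10^13 distinct node pairs, and an explicit list of 10^13 edges would occupy about 80 terabytes, which is why such a graph cannot simply be held in memory.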
In streaming applications, data generated in some external environment is pushed asynchronously to servers that process this information. Streaming applications are characterized by the ability to process high-volume data streams in a timely and responsive fashion. A massive graph stream may include the communication pattern of users in a social network, the browsing pattern of all the users, or the intrusion traffic on a communication network.
When massive graphs arrive in the form of a stream, the kinds of algorithms that can be used to mine structural information for future analysis are constrained. For example, the stream constraint allows only one pass over the data. Further, the edges of a graph may arrive out of order in the data stream. The massive size of the graph also creates a challenge for effective extraction of information that is relevant to classification. For example, it is difficult to store summary information about the large number of distinct edges in the graph data. Further, since the structural behavior is governed by the combination of a large number of distinct edges, the complexity of the substructure determination problem increases exponentially with substructure cardinality. In such cases, the determination of frequent discriminative subgraphs may be so inefficient in computation and space as to be impractical.
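The one-pass and bounded-memory constraints described above can be illustrated with a short sketch. This is not the classification method of the invention; it is a generic hashed-counter summary chosen only to show what processing an edge stream in a single pass with fixed space looks like. The names `bucket_of` and `summarize_stream`, the bucket count, and the sample edges are all hypothetical.

```python
import zlib
from typing import Iterable, List, Tuple

Edge = Tuple[str, str]  # (source node id, destination node id)


def bucket_of(edge: Edge, num_buckets: int) -> int:
    """Map an edge to a bucket using a deterministic hash (CRC32)."""
    src, dst = edge
    key = f"{src}\x00{dst}".encode("utf-8")
    return zlib.crc32(key) % num_buckets


def summarize_stream(edges: Iterable[Edge], num_buckets: int = 8) -> List[int]:
    """Consume the stream once, keeping only num_buckets counters.

    Memory use is independent of the number of distinct edges. The summary
    is lossy: edges that hash to the same bucket share a counter.
    """
    counts = [0] * num_buckets
    for edge in edges:  # single pass; no edge is stored or revisited
        counts[bucket_of(edge, num_buckets)] += 1
    return counts


if __name__ == "__main__":
    sample = [("u1", "u2"), ("u2", "u3"), ("u1", "u2"), ("u4", "u1")]
    in_order = summarize_stream(iter(sample))
    shuffled = summarize_stream(iter(reversed(sample)))
    # Counting is commutative, so arrival order does not change the summary.
    print(in_order == shuffled)
```

Because the summary is built from commutative counter updates, it does not depend on the order in which edges arrive. The price is that individual edges can no longer be distinguished once they collide in a bucket, which reflects the difficulty of storing summary information about a large number of distinct edges.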