Field of the Invention
The present invention relates generally to mining and learning network data. More specifically, the present invention describes methods and systems for supervised network clustering that use densities associated with nodes and that extract node components from the network by applying thresholds to those densities.
Description of the Related Art
Network data has become increasingly popular because of the proliferation of social and information networks. A significant amount of research has been devoted to the problem of mining and learning network data. In many scenarios, a subset of the nodes in the network may have labels associated with them, and this information can be effectively used for a variety of clustering and classification applications.
In the context of the present invention, a network generally refers to a group of entities connected by links. This is a useful abstraction for many real-world scenarios, such as computer routers, pages on a website, or the participants in a social network. The nodes are the individual entities (e.g., routers, pages, participants), and the links connecting them may be communication links, hyperlinks, or social network friendship links.
The useful properties of such nodes can be captured by labels, which are essentially drawn from a small set of keywords describing the node. For example, in a social network of researchers, the label on a node could correspond to that researcher's topic area of interest. Such labels can provide useful background knowledge for a variety of applications, including directing a clustering process in different ways, depending upon the nature of the underlying application.
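By way of illustration only, and not as a limitation, the network abstraction described above might be represented as in the following minimal Python sketch. The function name build_network, the node names, and the topic labels are hypothetical and are chosen purely for this example; the sketch shows only one possible way to store nodes, links, and labels that are available for a subset of the nodes.

```python
from collections import defaultdict


def build_network(links):
    """Return an undirected adjacency map built from (u, v) link pairs."""
    adjacency = defaultdict(set)
    for u, v in links:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


# Hypothetical social network of researchers (illustrative names only).
links = [
    ("ana", "bo"),
    ("bo", "chen"),
    ("chen", "dara"),
    ("dara", "ana"),
    ("dara", "eli"),
]
network = build_network(links)

# Labels are keywords describing a node; here only two nodes carry one.
labels = {"ana": "databases", "chen": "machine learning"}

# Nodes for which no label is available.
unlabeled = sorted(node for node in network if node not in labels)
```

In this sketch the labels are kept in a mapping separate from the link structure, so that a node without a label is simply absent from the mapping rather than being assigned a placeholder value.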
On the other hand, the available labels are often noisy and incomplete, and are often partially derived from unreliable data sources. Many of the underlying clusters in the network may also not be fully described by such information, and even when the labels for a particular kind of desired category are available, they may cover only an extremely small subset of the nodes. Nevertheless, such noisy, sparse, and incomplete information can still be useful in some parts of the network, and should therefore not be ignored during clustering.