1. Field of the Invention
The present invention relates generally to methods for analyzing relational systems where nodes have local interactions or links, and more particularly to methods for analyzing linked databases.
2. Description of Related Art
The World Wide Web comprises a heterogeneous complex network with potentially billions of nodes and edges that link these nodes or URLs together. The large-scale, time-varying, heterogeneous and unstructured nature of the web makes it a very difficult database from which to extract meaningful and desired information. The web does share a few similarities with conventional linked databases. Conventional linked databases can also be represented as a network comprising different classes of objects that can be characterized as nodes, whereas, in the case of the web, nodes are URLs or specific web sites. Conventional linked databases also include links connecting nodes, and the relationships among objects of linked databases may be regarded as equivalent to the hyperlinks of the web, which are used to link to other web sites. However, the web is very noisy and lacks accurate annotation, which makes its exploration particularly difficult. In a conventional linked database, the nodes as well as the edges are annotated with meta-information, which describes various attributes of both the objects and the nature of their relationships. For example, for an edge or link, such meta-information might include a description of the underlying relationship (e.g., father, son, wife, girlfriend, partner, etc.) and its strength (e.g., frequency of contacts), time stamps describing when such a relationship was established and, if applicable, when it is set to expire, and perhaps even the geographical location of the relationship. In the case of the web, however, such annotation for the nodes and links is lacking and cannot be easily inferred. A web page might link to another page for a variety of reasons that cannot always be deduced from the content of the web page itself.
Similarly, while it is relatively easy to identify the purpose of certain web pages (for example, a manufacturer of a particular product or a corporation usually has a well-organized web page that clearly states its products and services, partners, management team, location, etc.) and create an accurate annotation for them, accurately determining the purpose, objectives, and relevance of most web pages has proven to be a difficult task. Often, the relevance of both the content of a page and its links depends on the type of information that one is interested in. Thus, while the web is a networked information system comprising nodes and links, accurately extracting meta-information for the nodes and edges has proven to be a very difficult problem, and the web remains a difficult system from which to infer relevant information.
Most existing search engines deal with this challenging task of organizing and extracting information from the web by performing three critical tasks: (i) crawling the whole web, (ii) indexing the content of each page by making a list of words and terms that appear in each page along with a relevance index (e.g., where in the text the words appear and in what font size), and (iii) calculating the relevancy, trustworthiness, or importance of a given page, as determined by the link structure of the web. The third task yields a measurement known as the page rank. Page rank attempts to determine how many “important” pages link to a given page, where importance or “page rank” is computed in a self-consistent manner. Thus, for a page to have a high rank, many pages with relatively high rank must link to it. These steps allow search engines to support Boolean searches. All pages that match a query are returned as part of a list, which is sorted based on their page rank and on the strength or relevancy with which the key words in the query appear in the page. Sometimes, engines use fees paid by the owner of a page to determine its location in the sorted list if the query involves commercial products. If a user wants further information, then the user must look up a number of these pages, formulate hypotheses about what is important, and navigate the web by trial and error. For example, if a query is directed to a company's web presence, in the sense of what types of individuals and news organizations are reporting on the company, whom they represent, and whether they are relevant or important to the company, then there are no easy key words to get this information; an exhaustive search with different key words may be required, followed by much manual post-processing in order to infer such information.
Even then, only those individuals or organizations having directly reported on the company may be discovered, and it may be difficult to find other individuals and organizations that are closely related to these direct reporters. Such information is embedded in the underlying network but not accessible via key-words-based searches.
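The self-consistent computation of page rank described above can be illustrated with a minimal sketch. The following is one common formulation (power iteration with a damping factor) and is not taken from any particular search engine; the function name `page_rank`, the damping value of 0.85, the fixed iteration count, and the four-page example graph are all assumptions made for illustration.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Approximate page rank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    damping and iterations are illustrative defaults (assumptions).
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start from a uniform rank and refine it repeatedly, so that each
    # page's rank depends on the ranks of the pages linking to it.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its rank evenly among its outgoing links.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[p] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical graph: three pages link to "c", and "c" links to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = page_rank(links)
```

In this example, page "c" receives the highest rank because three pages link to it, and page "a" outranks "b" and "d" because its single incoming link comes from the highly ranked page "c".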
Conventional search engine technologies support key-words-based search capability, where all web pages satisfying a Boolean query are returned as a sorted list. The list is sorted according to a relevancy score, which, in turn, is computed by combining a number of relevancy factors, including the page rank of a page as determined from the global link structure of the web, the relevancy with which the key words are present in the page, and the amount the related company is willing to pay for its page to be included at the top of the list. This list could be very long and is identical for the same set of key words and for all users. A user usually must explore this list by trial and error, and such exploration is complicated because the user often has only a vague idea of what is being sought.
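A minimal sketch of such a key-words-based search follows. It builds an inverted index, answers a Boolean AND query, and sorts the matches by a score that combines a precomputed page rank with a simple term-frequency measure. The function names, the equal weighting of the two factors (`rank_weight=0.5`), the term-frequency measure itself, and the sample pages and rank values are hypothetical; paid placement is omitted.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def boolean_and_search(index, pages, ranks, query, rank_weight=0.5):
    """Return pages containing every query term, best score first."""
    terms = query.lower().split()
    if not terms:
        return []
    # Boolean AND: keep only pages that appear under every term.
    matches = set.intersection(*(index.get(t, set()) for t in terms))

    def score(url):
        words = pages[url].lower().split()
        # Fraction of the page's words that are query terms.
        term_freq = sum(words.count(t) for t in terms) / len(words)
        # Weighted combination of link-based and content-based relevancy.
        return rank_weight * ranks.get(url, 0.0) + (1 - rank_weight) * term_freq

    return sorted(matches, key=score, reverse=True)


# Hypothetical pages and hypothetical precomputed page rank values.
pages = {
    "p1": "graph search engine",
    "p2": "search engine ranking search",
    "p3": "cooking recipes",
}
ranks = {"p1": 0.6, "p2": 0.3, "p3": 0.1}
index = build_index(pages)
results = boolean_and_search(index, pages, ranks, "search engine")
```

The result is a flat list in which every user issuing the same query receives the same ordering, which is the limitation the passage above describes.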
Conventional search engines flatten the web of relationships and convert the underlying complex network to one-dimensional lists. Relevancies of different documents are determined by the search engine in a linear fashion, and the search results are not organized in a way that makes further exploration more meaningful. All users with the same keywords receive the same set of documents, and any feedback from the user takes the form of trial and error and modifications of Boolean expressions.
Recently, attempts have been made to devise methods for returning pages that are “relevant” to a particular page requested by a user, or for returning pages that are relevant to a query. In order to determine such relevant pages and compute their relevancies, these methods use a combination of page rank and semantic similarities. For example, the exact neighborhood network (n-network) of a relevant page is processed in an attempt to identify pages that are semantically similar in content to the initial page. The primary limitations of these systems include: (i) the n-network of a node can easily become too large to be fetched and processed in a meaningful way, thus restricting the exploration of pages to those that are at most 2 or 3 hops away from the initial node; (ii) the so-called “important” nodes in these networks are determined by an analysis of their degrees, which could be very misleading when it comes to the relevance of a page to the original query; and (iii) there is no reason for all these pages in the n-network to have a common semantic theme, making the processing of contents of these pages difficult and prone to errors. These methods provide incremental extensions of the predominant existing method for organizing information from the web. Such methods provide linear search results, and reduce the complexity of the web by representing it in terms of tables and linear lists. Hence, there is a need for methods to obtain a networked representation of the web that captures the complex informational relationships among the pages, and organizes the information content of a page with respect to the contents of other related web pages.
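The first limitation noted above, the rapid growth of the n-network with the number of hops, can be illustrated with a short sketch. It collects every page reachable within a given number of hops by breadth-first search over outgoing links. The function name `n_hop_neighborhood`, the restriction to outgoing links, and the synthetic graph in which every page links to three new pages are assumptions chosen only to make the growth visible.

```python
from collections import deque


def n_hop_neighborhood(links, start, max_hops):
    """Return a dict of page -> hop distance for pages within max_hops of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        # Do not expand pages that already sit at the hop limit.
        if distance[page] == max_hops:
            continue
        for target in links.get(page, []):
            if target not in distance:
                distance[target] = distance[page] + 1
                queue.append(target)
    return distance


# Synthetic graph (assumption): page i links to three previously unseen pages.
links = {i: [3 * i + 1, 3 * i + 2, 3 * i + 3] for i in range(40)}

# Number of pages within 1, 2, and 3 hops of page 0.
sizes = [len(n_hop_neighborhood(links, 0, hops)) for hops in (1, 2, 3)]
```

With only three outgoing links per page, the neighborhood in this synthetic graph grows from 4 pages at one hop to 13 at two hops and 40 at three hops; real web pages typically carry far more links, so the growth is correspondingly steeper.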