Information retrieval addresses two problems: selecting, from a document pool, the documents that are related to a given query, and ranking the selected documents by relevance. A number of techniques have been developed for this purpose, including term frequency-inverse document frequency (tf-idf). Basically, every document is indexed by the terms it contains, that is, by full-text indexing. Traditionally, every term corresponds to a dimension in a multi-dimensional vector space, and documents are represented as points in this space according to the terms they include. The query is likewise mapped to a point in this space by means of the terms included in the search. The documents that are “close” to the query are then selected, and this closeness is measured as distance in the vector space.
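The vector space approach described above can be illustrated by the following minimal sketch. It is given under stated assumptions and is not a definitive implementation: documents and queries are assumed to be already tokenized into lists of terms, the weighting is raw term count multiplied by log(N/df), and closeness is measured by cosine similarity, which is one common choice among several. The function names `tfidf_vectors`, `cosine` and `rank` are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vectors, idf): one sparse tf-idf vector per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / df[term]) for term in df}
    vectors = [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in docs
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors (assumed closeness measure)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, document index) pairs, closest document first."""
    vectors, idf = tfidf_vectors(docs)
    # The query is mapped into the same space; unknown terms get weight 0.
    q = {term: count * idf.get(term, 0.0) for term, count in Counter(query).items()}
    return sorted(((cosine(q, v), i) for i, v in enumerate(vectors)), reverse=True)


docs = [
    ["citation", "network", "ranking"],
    ["cooking", "pasta", "recipe"],
    ["citation", "context", "search"],
]
ranking = rank(["citation", "context"], docs)
```

In this sketch a document that shares no term with the query receives a score of zero, which is the limitation of purely content-based selection.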
Generally, documents have no relation to one another; they are independent of each other. On the other hand, some document types, such as web pages or scientific abstracts, are by nature connected to other documents in the form of hyperlinks or citations.
The Google search engine, in addition to using relevance based on document content, uses PageRank on the hyperlink network to estimate the rank of a selected document [2]. In the PageRank approach, every document is assigned an importance value called its PageRank. The PageRank of a document increases as the document receives more links from documents with higher PageRank.
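The propagation of importance along links can be sketched as a simple power iteration. The following is an illustrative sketch only: the damping factor of 0.85, the fixed number of iterations, and the uniform redistribution of the rank of documents without outgoing links are assumptions of this example, and the function name `pagerank` is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each document to the list of documents it links to."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every document starts each round with the same base share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A document passes its rank on in equal parts to its targets.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a document without links spreads its rank evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


scores = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
```

Here document C, which receives two links, ends with the highest value, and document A, which is linked only from the highly ranked C, ends above document B, which receives no links.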
Whereas the rank value used by Google is independent of the query, the HITS approach depends on the query [6]. For each query, a set of “hubs” and a set of “authorities” are defined. The hypothesis is that good hubs refer to good authorities, which contain high-quality information, and that good authorities are in turn referred to by good hubs.
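The mutual reinforcement between hubs and authorities can be sketched as follows. This is a minimal illustration, assuming that the link structure passed in has already been restricted to the documents gathered for one query; the Euclidean normalization, the fixed iteration count and the function name `hits` are assumptions of this example.

```python
import math


def hits(links, iterations=50):
    """links maps each document to the documents it links to (query subgraph)."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the documents linking in.
        auth = {node: 0.0 for node in nodes}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of the authority scores of the documents linked to.
        hub = {node: sum(auth[t] for t in links.get(node, [])) for node in nodes}
        # Normalize both score sets so that the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for node in scores:
                scores[node] /= norm
    return hub, auth


hub, auth = hits({"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []})
```

In this example a1, which is linked from both hubs, obtains a higher authority score than a2, and h1, which links to both authorities, obtains a higher hub score than h2.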
It is claimed that not only the document itself but also the documents citing it contain information about the cited document. The part of the citing document that contains the citation is called the “citation context”. It is believed that the citation context contains important information about the cited document [5, 8, 9].
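One possible way of obtaining a citation context is sketched below. The text above does not fix how large the "part" of the citing document is; this example assumes a window of a few words on each side of a citation marker and assumes markers of the single-number form "[2]". Grouped markers such as "[5, 8, 9]" are not handled. The function name `citation_contexts` and the `window` parameter are hypothetical.

```python
import re


def citation_contexts(text, window=5):
    """Return {reference number: [context strings]} for markers such as [2]."""
    tokens = text.split()
    contexts = {}
    for i, token in enumerate(tokens):
        # A marker is a bracketed number, optionally followed by punctuation.
        match = re.fullmatch(r"\[(\d+)\][.,;:]?", token)
        if not match:
            continue
        start = max(0, i - window)
        # Assumed context: up to `window` words before and after the marker.
        words = tokens[start:i] + tokens[i + 1:i + 1 + window]
        contexts.setdefault(int(match.group(1)), []).append(" ".join(words))
    return contexts


text = ("PageRank ranks pages on the hyperlink network [2] "
        "while HITS is query dependent [6].")
found = citation_contexts(text, window=3)
```

With a window of three words, the context recorded for reference 2 is "the hyperlink network while HITS is" and the context recorded for reference 6 is "is query dependent".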
In the state of the art, patent document numbered U.S. Pat. No. 6,457,028 B1 discloses gathering related documents from among documents that are linked to each other, by using the method of co-citation analysis. If a document A links to the documents B and C, then B and C are considered to be relevant to each other. If B and C are linked together not only in A but also in multiple other documents, their relevance to each other is considered to be strengthened. In this approach, only the presence or absence of a link between the documents is utilized; the context information used when citing is not taken into account.
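The co-citation principle can be sketched as a simple pair count. This is an illustration of the general technique only and is not taken from the cited patent document; the function name `cocitation_counts` and the input layout are assumptions of this example.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(links):
    """links maps each citing document to the documents it links to."""
    counts = Counter()
    for targets in links.values():
        # Each unordered pair of documents linked from the same source
        # is counted once for that source.
        for pair in combinations(sorted(set(targets)), 2):
            counts[pair] += 1
    return counts


counts = cocitation_counts({
    "A": ["B", "C"],
    "M": ["C", "B"],
    "N": ["B", "D"],
})
```

Here B and C are linked together from two documents and B and D from one, so the pair (B, C) is considered more strongly related. Only the presence of links enters the count; no citation context is used.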
In the state of the art, patent document numbered WO2006/001906 A works on a text document and forms word groups from the text. These groups are related to each other by a relationship. In this way, a network is obtained in which the word groups are nodes and two groups are connected by an edge if they are related. The nodes of this network are ranked by known techniques such as PageRank and HITS, so that the word groups are also ranked. This ranking is used to determine the keywords that describe the document and to determine its important sentences. In this patent document, a single text document is worked on and the words in the text are used to obtain a network. In the suggested invention, however, the network is entirely different. In the suggested invention, there is a plurality of documents which refer to each other and, in addition to the state of the art operations carried out on the words in a document, such as finding keywords and abstracting the text, the reference context of the referring documents is used. A network is again formed, but in this network each document itself is represented by a node, and a reference given from one document to a second document is represented by an edge. Hence, the obtained network is a directional edge-labeled network. Additionally, the context at the place of reference in the referring document is added to the directional edge as a label of this connection.
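A directional edge-labeled network of the kind described above can be sketched with a very small data structure. The following is a minimal sketch under stated assumptions and is not the implementation of the suggested invention: the class name `CitationNetwork`, its methods, and the choice of storing the label as a list of context strings per edge are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CitationNetwork:
    # (referring document, referred document) -> list of reference contexts.
    # A list is assumed so that a document may refer to another more than once.
    edges: dict = field(default_factory=dict)

    def add_reference(self, source, target, context):
        """Add a directed edge source -> target labeled with its context."""
        self.edges.setdefault((source, target), []).append(context)

    def incoming_contexts(self, target):
        """Collect the labels of all edges that point to `target`."""
        return [
            context
            for (_source, edge_target), contexts in self.edges.items()
            if edge_target == target
            for context in contexts
        ]


network = CitationNetwork()
network.add_reference("D2", "D1", "ranking by link analysis as in D1")
network.add_reference("D3", "D1", "the citation network model of D1")
network.add_reference("D1", "D4", "earlier work on hubs")
```

The edge direction distinguishes the referring document from the referred document, so the contexts gathered for D1 are only those written by the documents that refer to it.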
In the state of the art, patent document numbered US20080071739 A1 focuses on additional information about the relevant documents selected by the search engine. The search engine selects the documents that match the terms of the user's query. When showing them to the user, it gives not only the title and the link of each document but also tries to give brief information about the document in order to help the user. Under normal circumstances, it compiles this short information from the content of the document. In some cases, a text that can be compiled in this way might not be present in the selected document; moreover, in some cases, no text at all might be present in the content. In such cases, the search engine cannot find content from which to return the short information. Additional text information that might help in this respect can be gathered from the documents referring to the selected document, and this patent suggests a method for doing so. The terms that are present at the place of citation in the referring document are compiled as explanatory information for the selected document and are presented to the user. However, in this document, as opposed to the suggested invention, the terms that are present at the place of reference, that is, in the context of the reference, are not used during selection by the search engine. As a result, documents that do not themselves contain the keywords used in the search cannot be selected by the search engine.
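The difference between using reference contexts only for display and using them during selection can be illustrated as follows. This sketch assumes a plain boolean match in which every query term must occur either in the document itself or in a context in which it is referred to; the matching rule, the function name `select` and the input layout are assumptions of this example and do not describe the method of either patent document.

```python
def select(query_terms, contents, references):
    """contents: document -> text; references: (source, target, context) triples."""
    # Index each document by its own terms.
    index = {doc: set(text.lower().split()) for doc, text in contents.items()}
    # Extend the index of each referred document with the terms that occur
    # in the contexts in which other documents refer to it.
    for _source, target, context in references:
        index.setdefault(target, set()).update(context.lower().split())
    wanted = {term.lower() for term in query_terms}
    return sorted(doc for doc, terms in index.items() if wanted <= terms)


contents = {"D1": "", "D2": "ranking web pages"}
references = [("D2", "D1", "a survey of citation analysis")]
hits_with_context = select(["citation"], contents, references)
hits_without_context = select(["citation"], contents, [])
```

In this example D1 has no text of its own, so it is selected for the query "citation" only when the reference context from D2 is taken into account during selection.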
In the state of the art, patent document numbered EP0637805 B1 studies finding the lexical meaning of a word occurring in a text. Two factors make this difficult: a word may have multiple meanings, and the same word may receive different affixes, according to linguistic rules, depending on where it is used. Existing techniques are used for stripping the affixes and reducing the word to its root form. Once the root word is found, the sentence in which the word occurs is also analyzed in order to infer which of the multiple meanings of the word is used in the text. By using this context information, an attempt is made to determine which of the different meanings of the word is intended. Additionally, multi-word combinations formed by the word with its context are also utilized. To use one of the given examples, the term “table” occurring in the phrase “under the table” has a totally different meaning from the term “table” searched for by itself. This patent contains approaches close to that of the suggested invention, namely (i) using the context of the searched word and (ii) using not only words but also word groups. On the other hand, the subject there is to infer in which meaning a word occurring in the target text is used. In the suggested invention, by contrast, there is no single text and no word whose meaning is to be found in that text. Instead, a text belongs to a group of texts that refer to each other, and, so that the text can be found when it is searched for, the context in which the other texts refer to it is used.