When working with a collection of documents, it is often necessary to search for the desired information in the collection. Search results for information of interest may be generated by search engines using keywords entered by a user as a search query. Existing search systems enable users to use simple query languages to find documents that either contain or do not contain the words or word combinations specified by the user.
A search across numerous heterogeneous text resources inevitably encounters the fact that the same event, object, or person is expressed differently in different documents, using various words, expressions, notations, etc. For example, an information extraction system should understand that “Winter Olympics 2014”, “Olympic Games in Sochi”, “Olympics in Sochi”, etc. correspond to the same event, and that “Yuri Gagarin”, “first cosmonaut of the Earth”, and “first Soviet cosmonaut” refer to the same person.
In order to increase the reliability and completeness of such searches and to establish that, for example, two objects from two different documents correspond to the same real world object, identifying features of such objects need to be determined or known. Still, even when the identifying features determined for two objects coincide, those objects can turn out to be different, as with complete namesakes.
Special data representation models, such as the Resource Description Framework (RDF), are used to store information about objects in a collection of documents. RDF is a graph structure representing a set of statements about entities, which are real world objects (such as people, organizations, locations), as well as facts (such as the fact of a person working at a particular organization). Each statement is presented in the form of three data entities (subject, predicate, object) and is called a “triplet”. A plurality of such triplets forms a graph whose nodes correspond to subjects and objects, linked by arcs that correspond to predicates and are directed from subjects to objects. Such RDF graphs can be constructed for a single sentence as well as for an entire document in the collection of documents.
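By way of illustration only, the triplet structure described above can be sketched in a few lines of Python. This is a minimal sketch and not an RDF implementation: the `Triplet` type, the `build_graph` helper, the node identifiers (`person:1`, `org:1`) and the predicate names are all hypothetical and chosen only for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Triplet(NamedTuple):
    """One statement: an arc (predicate) directed from subject to object."""
    subject: str
    predicate: str
    obj: str


def build_graph(triplets):
    """Map each subject node to its outgoing (predicate, object) arcs."""
    graph = defaultdict(list)
    for t in triplets:
        graph[t.subject].append((t.predicate, t.obj))
    return dict(graph)


# Hypothetical statements about two entities and one fact linking them.
triplets = [
    Triplet("person:1", "type", "Person"),
    Triplet("person:1", "name", "Yuri Gagarin"),
    Triplet("org:1", "type", "Organization"),
    Triplet("person:1", "works_at", "org:1"),  # fact: person works at organization
]

graph = build_graph(triplets)
for subject, arcs in graph.items():
    for predicate, obj in arcs:
        print(f"{subject} --{predicate}--> {obj}")
```

Here the entities are nodes, and the fact of a person working at an organization is an arc between two of them. The same structure could hold the statements of a single sentence or of a whole document.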
Each real world object in the collection is associated with one or more features of an RDF graph, and different copies of the same real world object in different documents can be characterized by the same features. Therefore, the task of global identification consists in comparing objects from natural language texts with each other and with real world objects, and in creating an RDF graph and one or more indexes of the document collection in which different objects with identical features are represented as the same object.
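The comparison step can be illustrated with a simplified sketch that groups objects extracted from different documents whenever their chosen identifying features coincide. The function name `identify_globally`, the object identifiers, the feature names and the third object (an invented namesake with a made-up birth year) are assumptions introduced for the example; an actual system would compare far richer features.

```python
from collections import defaultdict


def identify_globally(objects, identifying_keys):
    """Group extracted objects whose identifying features all coincide."""
    groups = defaultdict(list)
    for obj in objects:
        key = tuple(obj["features"].get(k) for k in identifying_keys)
        groups[key].append(obj["id"])
    return list(groups.values())


# Hypothetical objects extracted from three documents.
objects = [
    {"id": "doc1#e1", "features": {"name": "Yuri Gagarin", "birth_year": 1934}},
    {"id": "doc2#e7", "features": {"name": "Yuri Gagarin", "birth_year": 1934}},
    # Invented namesake, used only to show a false merge.
    {"id": "doc3#e2", "features": {"name": "Yuri Gagarin", "birth_year": 1980}},
]

by_name = identify_globally(objects, ["name"])
by_name_and_year = identify_globally(objects, ["name", "birth_year"])

print(by_name)           # all three objects merged into one
print(by_name_and_year)  # the namesake is kept separate
```

Using the name alone merges the namesake with the other two objects, whereas adding a second feature separates them. This shows why the choice of identifying features matters for global identification.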