Field
Implementations of the present invention relate to natural language processing. In particular, implementations relate to comparing documents, which could be written in one or more languages and which contain one or more types of information. Comparing documents may involve estimation, computation and visualization of measures of similarity between any number of documents or other types of electronic files.
Related Art
Many natural language processing tasks require comparing documents to find out how similar or different they are, i.e., estimating or computing a measure of similarity or difference between the documents. Text resources on the Internet and in other sources usually include many copies of the same document, presented in different forms and formats. Document similarity computation is therefore an implicit but mandatory step of many document processing tasks, and it usually relies on statistics and machine learning techniques such as document classification and clustering. In particular, document similarity/difference computation may be required in plagiarism detection, which aims to detect whether a document has been plagiarized from another one. A straightforward approach to this task is to compute a similarity/difference measure between documents, usually based on lexical features such as words and characters. If the measure indicates similarity beyond a certain threshold, the documents are deemed similar, and therefore one document could have been plagiarized from the other. More sophisticated approaches may use other similarity/difference measures, but the concept is the same.
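The straightforward lexical approach described above can be illustrated with a minimal sketch. It assumes word counts as the lexical features and cosine similarity as the measure; neither choice is prescribed by the text. The function names and the threshold value of 0.8 are hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Lowercase the text and count its word tokens (a purely lexical feature)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    ca, cb = word_counts(a), word_counts(b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # at least one text has no word tokens
    return dot / (norm_a * norm_b)


# Hypothetical threshold (an assumption); a real system would tune it on data.
SIMILARITY_THRESHOLD = 0.8


def possibly_plagiarized(a: str, b: str,
                         threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Flag a pair of texts whose lexical similarity reaches the threshold."""
    return cosine_similarity(a, b) >= threshold
```

Because the measure relies only on shared word forms, a translation of a document into another language would typically score near zero against the original, even though the content is the same.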
A related task is duplicate and near-duplicate detection. When constructing linguistic corpora, it makes sense to remove duplicate and near-duplicate documents. As in plagiarism detection, this requires estimating how similar the documents under consideration are. For this task, lexical representations, and therefore lexical similarity/difference measures, are usually sufficient for adequate performance.
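As an illustration of near-duplicate removal with a purely lexical representation, the following sketch compares documents by the Jaccard overlap of their character n-grams ("shingles"). This is one common technique, swapped in here as an assumption rather than taken from the text; the function names, the shingle length of 5, and the threshold of 0.9 are hypothetical.

```python
def shingles(text: str, k: int = 5) -> set[str]:
    """Set of overlapping character k-grams after case/whitespace normalization."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < k:
        # Very short text: treat the whole string as a single shingle.
        return {normalized} if normalized else set()
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # two empty documents are considered identical
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a document only if it is not too similar to one already kept."""
    kept: list[str] = []
    kept_shingles: list[set[str]] = []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```

The pairwise comparison is quadratic in the number of documents, so a sketch like this suits only small collections.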
However, many challenges exist in determining the similarity of documents. For example, computation of cross-language document similarity/difference is in increasing demand for detecting cross-language plagiarism. In this situation, the above-mentioned similarity/difference measure should be able to adequately detect substantially similar documents in different languages. Too often, such detection fails. Besides this task, such a similarity/difference measure could also be used to construct parallel and comparable corpora and to build or enrich machine translation systems.
Most existing document processing systems can deal with documents written in only one language or, rarely, in a few particular, identifiable languages. Such systems are generally unable to compare documents written in different languages because a workable similarity/difference measure between such documents cannot be computed.
Further, many systems are also limited to particular document formats, i.e., some systems cannot analyze some documents without first obtaining a reliable and accurate recognition of their text (as in the case of PDF files, which can require processing by optical character recognition). Moreover, each system usually deals with only one particular type of information or data contained in a document, i.e., only with text-based, audio-based or video-based information. However, many documents, sources or files about a particular topic (e.g., online news) include a variety of types of information and types of documents. For example, two news-oriented documents or sources may contain or reference the same video file but discuss the content of the video differently. In this case, a text-oriented system may conclude that the sources are not similar, and may conclude that the video-oriented material is identical, without being able to adequately process the nuances of such material.
Therefore, there is a substantial opportunity for new methods for more accurately estimating similarity/difference between documents, content, sources and files in different languages and in different formats.