1. Field
Embodiments relate to creating parallel text corpora which may be used, for example, in machine translation, machine learning, search technology, and natural language understanding systems in AI. In particular, embodiments relate to automatic creation of aligned parallel natural-language corpora and tagged parallel corpora. Electronic content for such corpora may be available, for example, on the Internet and in other electronic resources.
2. Background
Technology for generating parallel text corpora exists. A parallel text corpus refers to texts consisting of two or more parts: a text in one language and its translation into another language. A parallel text corpus may contain texts in two or more languages. An aligned parallel text additionally comprises a mapping (correspondence) of a portion of the first text to a portion of its translation, where the portion may be a sentence, a paragraph, or another part of the texts. An example of an aligned parallel text is a translation memory or another database of translations, which can be created, for example, by translation agencies or by individual human translators. Applications such as machine translation, machine learning, search technologies, and natural language understanding systems in AI may employ connected parallel texts and, more importantly, parallel texts that comprise logical relations between sentences, such as referential relations, anaphora, connectors, and the like. Aligned and tagged parallel texts are very useful for these applications as well.
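For illustration only, an aligned parallel text of the kind described above might be held in memory as two lists of segments plus a list of links between them. The sketch below is a minimal assumption-laden example, not a structure taken from any particular translation memory product; the class name AlignedParallelText, its fields, and its methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AlignedParallelText:
    """Hypothetical container: two segmented texts plus a mapping between them."""
    source_lang: str
    target_lang: str
    source_segments: list
    target_segments: list
    # Each link pairs a tuple of source indices with a tuple of target indices,
    # so one-to-one, one-to-many, and many-to-one correspondences are all allowed.
    links: list = field(default_factory=list)

    def link(self, src_idx, tgt_idx):
        """Record that the given source segments correspond to the given target segments."""
        self.links.append((tuple(src_idx), tuple(tgt_idx)))

    def translations_of(self, i):
        """Return every target segment linked to source segment i."""
        found = []
        for src, tgt in self.links:
            if i in src:
                found.extend(self.target_segments[j] for j in tgt)
        return found


# Usage: one English sentence mapped to two German sentences.
apt = AlignedParallelText(
    "en", "de",
    ["It was raining, so we stayed home.", "We watched a film."],
    ["Es regnete.", "Also blieben wir zu Hause.", "Wir sahen einen Film."],
)
apt.link([0], [0, 1])
apt.link([1], [2])
```

Storing index tuples rather than single indices is a design choice made here so that a portion of one text can correspond to several portions of the other.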
Conventional methods of aligning parallel texts are chiefly manual or based on heuristics, for example, aligning formally by sentence boundaries identified by punctuation marks. However, such methods may not be sufficiently precise, because text formatting may complicate assumptions about sentence boundaries and because there are cases in which one sentence is translated into two or more sentences in another language. Additionally, it is desirable to obtain tagged parallel texts in which grammatical, lexical, and even syntactic and semantic features, as well as syntactic relationships and/or semantic relationships, are identified, and in which grammatical and lexical meanings and deep or surface slots may be determined and searched. While U.S. Patent Application Publication No. US20060217963 A1 mentions the use of Interlingua representation in connection with translation memory, it does not provide an effective way to generate and compare such representations, which are described as tree structures.
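The limitation of punctuation-based alignment noted above can be seen in a small sketch. The code below is an illustrative assumption of what such a heuristic might look like, not a reproduction of any specific prior system; the function names split_sentences and align_by_position and the sample sentences are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def align_by_position(source, target):
    """Pair the i-th source sentence with the i-th target sentence.

    Returns the pairs and a flag telling whether both sides had the
    same number of sentences. Surplus sentences on the longer side are dropped.
    """
    src = split_sentences(source)
    tgt = split_sentences(target)
    return list(zip(src, tgt)), len(src) == len(tgt)


# One English sentence is rendered as two German sentences.
english = "It was raining, so we stayed home. We watched a film."
german = "Es regnete. Also blieben wir zu Hause. Wir sahen einen Film."

pairs, counts_match = align_by_position(english, german)
```

In this example the second English sentence is paired with the second German sentence, which actually belongs to the translation of the first English sentence, and the third German sentence is left unaligned. A purely positional heuristic has no way to detect this beyond noticing that the sentence counts differ.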