Billions of documents exist in various natural languages, and human translators are sometimes employed to translate some of these documents accurately from one natural language into another. Because manually translating every document into every natural language would be highly impractical and an inefficient use of resources, documents are generally translated by human translators only when a translation is thought to be needed. However, because few expert human translators are available, translating documents can be expensive and time-consuming. When a non-expert is used, such as a human translator who is not familiar with the technical area in which the document was originally written, the translated document can contain translation errors.
Technology now exists to translate documents automatically. Machine translation is a computational technique that employs software to translate text or speech from one natural language (“source language”) into another (“target language”). Various machine translation techniques exist, including word-for-word translation techniques and corpus translation techniques. When using word-for-word translation techniques, a computing device simply employs a translation dictionary to select, for each word in the source-language document, a translated word in the target language. Because this approach ignores word order, grammar, and context, the result is often unusable and can look like gibberish to a native speaker of the target language.
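The word-for-word technique described above can be illustrated with a minimal sketch. The dictionary contents, the English-to-Spanish language pair, and the function name `translate_word_for_word` are hypothetical and chosen only for illustration; they do not come from any particular translation system.

```python
# Hypothetical toy translation dictionary (English -> Spanish), for illustration only.
TOY_DICTIONARY = {
    "the": "el",
    "white": "blanco",
    "house": "casa",
}


def translate_word_for_word(text, dictionary):
    """Replace each source word with its dictionary entry, keeping the source word order."""
    translated = []
    for word in text.lower().split():
        # Assumption: words missing from the dictionary are passed through unchanged.
        translated.append(dictionary.get(word, word))
    return " ".join(translated)


print(translate_word_for_word("the white house", TOY_DICTIONARY))
# prints "el blanco casa"
```

A fluent Spanish rendering would be "la casa blanca". The sketch gets both the word order and the gender agreement wrong, which is the kind of output a native speaker of the target language may find unusable.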
To translate text, corpus machine translation techniques employ corpora of parallel or comparable text (“corpora”), generally created manually by humans, that map text in the source language to corresponding translated text in the target language. Such techniques can typically produce translations that are superior to those produced by word-for-word translation techniques. A corpus is a set of parallel or comparable text samples. As an example, a legislative document that has been manually translated from a source language into one or more target languages is a corpus of parallel text. Similarly, a novel that has been manually translated from a source language into another language is another corpus. Corpora can be tailored to particular domains or other specific attributes of documents, such as authors, genres, subject matters, and so forth. As examples, technical documents provide different corpora than literary documents; a play authored by Shakespeare provides a different corpus than a novel authored by Jean-Paul Sartre; a romance novel provides a different corpus than a comic strip; and a civil engineering document may provide a different corpus than a computer engineering document. Each target language (or even each source and target language pairing) may be associated with different corpora, and these corpora may be further divided by domains, authors, subject matters, and so forth. Corpus translation techniques may use one or more corpora to statistically “learn” how to translate words or sequences of words (e.g., sentences or phrases) from a source language into a target language.
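As a rough illustration of how a technique might statistically “learn” word translations from a parallel corpus, the following sketch estimates word translation probabilities with a few rounds of expectation-maximization, in the style of the well-known IBM Model 1 alignment model. This is one possible approach chosen for illustration, not necessarily the method any given corpus translation technique uses. The three-sentence English-German corpus and the names `train_word_translation` and `best_translation` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy parallel corpus: (source sentence, target sentence) pairs.
TOY_CORPUS = [
    ("the house", "das haus"),
    ("the book", "das buch"),
    ("a book", "ein buch"),
]


def train_word_translation(parallel_pairs, iterations=10):
    """Estimate P(target word | source word) from sentence pairs using EM."""
    pairs = [(src.lower().split(), tgt.lower().split()) for src, tgt in parallel_pairs]
    # Start with every (target, source) word pair equally likely.
    prob = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: share each target word among the source words in its sentence,
        # in proportion to the current probability estimates.
        for src_words, tgt_words in pairs:
            for t in tgt_words:
                norm = sum(prob[(t, s)] for s in src_words)
                for s in src_words:
                    share = prob[(t, s)] / norm
                    count[(t, s)] += share
                    total[s] += share
        # M-step: renormalize the expected counts into probabilities.
        prob = defaultdict(float, {(t, s): c / total[s] for (t, s), c in count.items()})
    return prob


def best_translation(prob, src_word):
    """Return the most probable target word for a source word."""
    candidates = {t: p for (t, s), p in prob.items() if s == src_word}
    # Assumption: unseen source words are returned unchanged.
    return max(candidates, key=candidates.get) if candidates else src_word


model = train_word_translation(TOY_CORPUS)
print(best_translation(model, "house"))  # prints "haus"
print(best_translation(model, "book"))   # prints "buch"
```

No dictionary is supplied here. The pairing of "house" with "haus" emerges only because "das" is better explained by "the", which co-occurs with it in two sentence pairs. With so little data the estimates are fragile, which hints at why the quality and availability of corpora matter.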
Although corpora can make machine translations more accurate, they can be difficult to obtain. As an example, it may be difficult to locate many corpora in the scientific or technical domains in some languages because few such documents may exist for a given pair of languages. Given the lack of adequate corpora, even corpus translation techniques are not as accurate as they could be.