The subject application relates to statistical machine translation (SMT) in computing systems. While the systems and methods described herein relate to statistical machine translation, it will be appreciated that the described techniques may find application in other translation systems, other statistical mapping applications, and/or other translation methods.
Classical approaches to statistical machine translation (SMT) involve “bi-phrases”, that is, pairs of source language and target language expressions or phrases that form building blocks for constructing a target (i.e., translated) sentence from a source sentence. Conventional approaches for decoding (i.e. translating) with these phrase-based translation models involve dynamic programming techniques, typically employing a left-to-right heuristic beam-search, as described for instance in a paper by Philipp Koehn (Pharaoh: a beam-search decoder for phrase-based statistical machine translation models, in Proc. Conference of the Association for Machine Translation in the Americas (AMTA), 2004). Because they build a translation heuristically from left-to-right, such methods may be overly sensitive to choices that are made early in the search, and not be able to recover easily from early mistakes.
Certain previous authors such as Germann et al. (Fast and Optimal Decoding for Machine Translation. Artificial Intelligence 154, pp 127-143. Elsevier 2004) have noted some analogies between SMT decoding and the Travelling Salesman Problem, but fail to map the SMT decoding problem into a Travelling Salesman Problem. Instead, such approaches map the decoding problem into a linear integer program for solving a certain version of word-based (and not phrase-based) SMT. Such approaches are only able to solve very small translation problems, and do not contemplate phrase-based SMT.
Additionally, such integer programming approaches (which contrast with conventional beam-search approaches) do not map phrase-based SMT directly into a generalized traveling salesman problem (GTSP). Such integer programming approaches, which are very generic, are also often very inefficient. In addition, integer programming formulations are typically suited to cases where exact optimal solutions are requested, which is much more demanding in terms of computing resources than when approximate solutions are sought. By contrast, a single GTSP formulation allows for employing either exact solvers or well-adapted approximate solvers. Additionally, such integer programming approaches only incorporate bigram language models, and not trigram or higher-order n-gram language models.
N-gram language models are a type of probabilistic model for predicting the next item in a sequence. N-grams are used in various areas of statistical natural language processing and genetic sequence analysis. An n-gram is a sub-sequence of n items from a given sequence. The items in question can be phonemes, syllables, letters, words, base pairs, etc.
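As an illustration of the definition above, the following minimal sketch extracts n-grams from a word sequence and derives a raw-count estimate of the probability of the next item. The function names `ngrams` and `next_item_probability` are hypothetical, and the unsmoothed maximum-likelihood estimate is an assumption chosen for brevity; practical language models typically add smoothing.

```python
from collections import Counter


def ngrams(items, n):
    """Return every contiguous sub-sequence of length n, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def next_item_probability(corpus, history, candidate):
    """Unsmoothed estimate of p(candidate | history) from n-gram counts.

    Assumption: plain relative frequencies, with no smoothing or back-off.
    """
    history = tuple(history)
    counts = Counter(ngrams(corpus, len(history) + 1))
    # Total count of n-grams that begin with the given history.
    total = sum(c for gram, c in counts.items() if gram[:-1] == history)
    if total == 0:
        return 0.0
    return counts[history + (candidate,)] / total


words = "the shortest jokes are always the best".split()
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
p_best_after_the = next_item_probability(words, ["the"], "best")
```

In this toy sequence the word "the" is followed once by "shortest" and once by "best", so the bigram estimate splits the probability evenly between the two continuations.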
Additionally, there are several drawbacks to conventional phrase-based beam-search decoders. Because they build a translation from left to right, they tend to show inertia relative to poor choices made at the beginning of the search, when partial candidates are still short and the heuristic estimate of the remainder is weak. If a construct appearing in the middle of the source sentence is strongly constraining, this knowledge cannot be exploited until choices related to constructs further to the left in the source sentence have been made, even though those choices may be less constraining. Because it is necessary to prune the search tree during the search in order to avoid combinatorial explosion, the solution that is found at the end of the search is typically suboptimal, and whether it is actually optimal can never be known.
Statistical Machine Translation (SMT) systems that employ “phrase-based” translation techniques build translations by relying on building blocks, called “biphrases,” such as are used in the following example:
    Les plaisanteries les plus courtes sont toujours les meilleures
    The shortest jokes are always the best
In this example, the biphrases that are employed for producing the translation are the following:
    les—the
    plaisanteries—jokes
    les plus courtes—shortest
    sont toujours—are always
    les meilleures—the best
With regard to biphrases, the following points may be noted. There may be several biphrases competing to translate a given source segment. For instance, a bi-phrase library (e.g., a database) may contain the following entries: les plus courtes—the shortest, sont—are, toujours—always, and so forth. Additionally, the ordering of the target sentence may be different from that of the source. For instance, while plaisanteries appears before les plus courtes in the source sentence, shortest appears after jokes in the target sentence.
In order to translate a given source sentence s, such as les plaisanteries les plus courtes sont toujours les meilleures, classical phrase-based SMT systems use a log-linear model of the form:

$$p(t,a \mid s) = \frac{1}{Z_s} \exp \sum_k \lambda_k h_k(s,a,t)$$

where the $h_k$'s are features, that is, functions of the source string s, of the target string t, and of the alignment a, and where the alignment is a representation of the sequence of biphrases that are used to build t from s. In the example provided herein, the sequence of biphrases is: les—the, les plus courtes—shortest, plaisanteries—jokes, sont toujours—are always, and les meilleures—the best. It will be noted that the order of this sequence is defined by reference to the target side: a bi-phrase b precedes a bi-phrase b′ in the alignment if and only if the target side of b precedes the target side of b′ in the target t. The $\lambda_k$'s are weights and $Z_s$ is a normalization factor that guarantees that $p(t,a \mid s)$ is a proper conditional probability distribution over the pairs (t,a).
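The log-linear model may be illustrated with a minimal sketch. Here each candidate pair (t,a) is represented only by a dictionary of feature values, and the normalization factor is computed over a small explicit list of candidates rather than over all possible pairs; both simplifications, together with the names `log_linear_score` and `normalize` and all numeric values, are assumptions made for illustration.

```python
import math


def log_linear_score(weights, features):
    """Weighted feature sum: sum over k of lambda_k * h_k(s, a, t)."""
    return sum(weights[name] * value for name, value in features.items())


def normalize(weights, candidates):
    """Turn scores into probabilities over an explicit candidate list.

    Assumption: Z_s is summed over `candidates` only, standing in for the
    full space of (t, a) pairs, which cannot be enumerated in practice.
    """
    scores = [log_linear_score(weights, f) for f in candidates]
    top = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(score - top) for score in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Invented weights and feature values for two hypothetical candidates.
weights = {"language_model": 1.0, "phrase_penalty": -0.5}
candidates = [
    {"language_model": -2.0, "phrase_penalty": 5},
    {"language_model": -3.0, "phrase_penalty": 4},
]
probabilities = normalize(weights, candidates)
best_index = max(range(len(candidates)), key=lambda i: probabilities[i])
```

Because the normalization factor is the same for every candidate of a given source sentence, ranking candidates by the weighted feature sum alone selects the same best candidate as ranking by the normalized probability.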
Features that are local to biphrases, namely features that can be computed additively over biphrases participating in the alignment a, include (without being limited to): the forward and reverse conditional probability features $\log p(\tilde{t} \mid \tilde{s})$ and $\log p(\tilde{s} \mid \tilde{t})$, where $\tilde{s}$ is the source side of the bi-phrase and $\tilde{t}$ is the target side, and where these probabilities have been estimated on the basis of a large bilingual training corpus; the so-called “phrase penalty” feature, which is equal to 1 for each bi-phrase in the alignment; and the so-called “word penalty” feature, which counts the number of words in $\tilde{t}$.
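The additive character of these local features may be sketched as follows. The `Biphrase` record, the function `local_features`, and the probability values are hypothetical and serve only to show that each feature is a sum of per-biphrase contributions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Biphrase:
    source: tuple                  # source-side words
    target: tuple                  # target-side words
    p_target_given_source: float   # forward probability (invented value)
    p_source_given_target: float   # reverse probability (invented value)


def local_features(alignment):
    """Accumulate each local feature over the biphrases of an alignment."""
    features = {"forward": 0.0, "reverse": 0.0,
                "phrase_penalty": 0, "word_penalty": 0}
    for b in alignment:
        features["forward"] += math.log(b.p_target_given_source)
        features["reverse"] += math.log(b.p_source_given_target)
        features["phrase_penalty"] += 1              # 1 per biphrase used
        features["word_penalty"] += len(b.target)    # target words produced
    return features


alignment = [
    Biphrase(("les",), ("the",), 0.5, 0.25),
    Biphrase(("sont", "toujours"), ("are", "always"), 0.25, 0.5),
]
features = local_features(alignment)
```

Since every contribution depends on a single biphrase, the totals do not change when the biphrases are reordered, which is what distinguishes these features from order-dependent ones.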
Features that depend on the order in which biphrases appear in the alignment include: the language model feature log p(t), which computes the probability of the target sentence associated with the translation candidate, typically according to an n-gram language model estimated over a large target language corpus; and the distortion feature, which measures how much the sequence of biphrases used in the candidate translation deviates from the “monotonic” order, namely the order that would be imposed if the target sides of the biphrases were sequenced in the same order as their source sides.
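One common way to quantify the distortion is to sum, over consecutive biphrases in the alignment, the distance between where one biphrase ends in the source and where the next one starts. The sketch below uses that definition as an assumption, since no particular formula is given here; the function name `distortion_cost` and the span encoding are likewise hypothetical.

```python
def distortion_cost(source_spans):
    """Sum of source-side jumps between consecutive biphrases.

    source_spans: (start, end) source word indices, end exclusive, listed
    in alignment order, that is, in the order of the target sides.
    Assumption: each jump costs |start - previous_end|.
    """
    cost = 0
    previous_end = 0
    for start, end in source_spans:
        cost += abs(start - previous_end)
        previous_end = end
    return cost


# Source positions: les(0) plaisanteries(1) les plus courtes(2-4)
# sont toujours(5-6) les meilleures(7-8).
# Alignment order from the example: les, les plus courtes, plaisanteries,
# sont toujours, les meilleures.
example_spans = [(0, 1), (2, 5), (1, 2), (5, 7), (7, 9)]
monotonic_spans = sorted(example_spans)

example_cost = distortion_cost(example_spans)
monotonic_cost = distortion_cost(monotonic_spans)
```

Under this definition the monotonic order has zero cost, while the example alignment incurs a positive cost because the biphrase for les plus courtes is used before the biphrase for plaisanteries. A decoder would typically use the negated cost as the feature value so that larger deviations lower the score.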
Once a log-linear model has been defined (which involves a training phase; see e.g. Lopez, A. 2008. Statistical Machine Translation. ACM Comput. Surv. 40, 3 (August 2008), 1-49, incorporated by reference herein), the role of the decoder is to find a pair (t,a) that maximizes the conditional probability p(t,a|s), and to output the corresponding target string t.
Classical systems are based on some variant of a heuristic left-to-right search, that is, they attempt to build a candidate translation (t,a) incrementally, from left to right, extending the current partial translation at each step with a new biphrase, and computing two scores: a score for the known elements of the partial translation so far, and a heuristic estimate of the remaining cost for completing the translation. The variant which is most often used is a form of beam-search, where several partial candidates are maintained in parallel, and candidates for which the current estimate is too low are pruned in favor of candidates that are more promising.
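The left-to-right beam search described above may be sketched in simplified form. This is not the decoder of any particular system: the scoring uses only per-biphrase log scores and a distortion penalty (no language model), the heuristic estimate is a crude per-word bound, and the names `Hypothesis`, `find_options`, and `beam_search`, the parameter defaults, and the probabilities in the toy phrase table are all assumptions.

```python
import heapq
import math
from typing import NamedTuple


class Hypothesis(NamedTuple):
    score: float        # log score of the partial translation so far
    covered: frozenset  # source positions already translated
    last_end: int       # source index just past the last biphrase used
    target: tuple       # target words produced so far, left to right


def find_options(source, phrase_table):
    """List every (start, end, target_words, log_score) that fits the source."""
    options = []
    for start in range(len(source)):
        for end in range(start + 1, len(source) + 1):
            for target, log_score in phrase_table.get(tuple(source[start:end]), []):
                options.append((start, end, tuple(target), log_score))
    return options


def beam_search(source, phrase_table, beam_size=5, distortion_weight=0.1):
    n = len(source)
    options = find_options(source, phrase_table)
    # Heuristic: best per-word log score available at each source position.
    best_at = [-math.inf] * n
    for start, end, _, log_score in options:
        for i in range(start, end):
            best_at[i] = max(best_at[i], log_score / (end - start))

    def estimate(h):
        # Known score so far plus the heuristic for the uncovered remainder.
        return h.score + sum(best_at[i] for i in range(n) if i not in h.covered)

    # stacks[k] holds hypotheses covering exactly k source words.
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append(Hypothesis(0.0, frozenset(), 0, ()))
    for k in range(n):
        # Pruning step: keep only the most promising partial candidates.
        for h in heapq.nlargest(beam_size, stacks[k], key=estimate):
            for start, end, target, log_score in options:
                span = range(start, end)
                if any(i in h.covered for i in span):
                    continue  # this biphrase overlaps words already translated
                score = h.score + log_score - distortion_weight * abs(start - h.last_end)
                stacks[k + end - start].append(
                    Hypothesis(score, h.covered | set(span), end, h.target + target))
    return max(stacks[n], key=lambda h: h.score) if stacks[n] else None


# Toy phrase table with invented probabilities.
phrase_table = {
    ("sont", "toujours"): [(("are", "always"), math.log(0.6))],
    ("sont",): [(("are",), math.log(0.5))],
    ("toujours",): [(("always",), math.log(0.5))],
    ("les", "meilleures"): [(("the", "best"), math.log(0.7))],
}
result = beam_search("sont toujours les meilleures".split(), phrase_table)
```

The pruning step is where the limitation discussed earlier arises: a hypothesis dropped from a stack is never revisited, so a completion that would have scored best overall can be lost when its early partial form looks unpromising.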
Accordingly, there is an unmet need for systems and/or methods that employ phrase-based SMT by modeling the bi-phrases as nodes in a graph and applying a traveling salesman problem solver to the graph, while overcoming the aforementioned deficiencies.