The exemplary embodiment relates to the field of machine translation. It finds particular application in connection with the pruning of bi-phrases stored in a library of bi-phrases, and will be described with particular reference thereto.
Statistical machine translation (SMT) systems aim to find the most probable translation of a source language text in a target language. The system may be trained with bilingual corpora. These are source and target language texts that are mutual translations of each other. Phrase-based SMT systems have been developed which employ libraries (databases) of “bi-phrases,” that is, pairs of the form <source-phrase, target-phrase>, which are learned automatically from the bilingual corpora. When given a new segment of text to translate, these systems search the library to extract all relevant bi-phrases, i.e., items in the library whose source-language phrase matches some portion of the new input. A subset of these matching bi-phrases is then identified, such that each word of the input text is covered by exactly one bi-phrase in the subset, and that the combination of the target-language phrases produces a coherent translation.
Conventionally, SMT systems use contiguous bi-phrases, i.e., each phrase is an uninterrupted sequence of words. Recently, SMT phrase-based systems have been developed which are able to accommodate non-contiguous bi-phrases. (See, for example, Michel Simard, et al., Translating with non-contiguous phrases, Proc. Conf. on Human Language Technology and Empirical Methods in Natural Language Processing, Morristown, N.J., USA (HLT '05), published by the Association for Computational Linguistics, pp. 755-762 (2005) and U.S. Pub. No. 2006/0190241 to Goutte, et al.). In these non-contiguous bi-phrases, one or both of the source phrase and target phrase is non-contiguous, i.e., has one or more gaps, each gap representing a word. For example, in a French to English translation system, one non-contiguous bi-phrase may be <ne ⋄ plus, not ⋄ ⋄ ⋄ anymore>, where each diamond symbol represents a gap of one word. In other systems, the gap symbol may represent a variable number of words with a weighting system which favors particular gap sizes.
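The fixed-width gap convention described above can be illustrated with a minimal sketch. The function name `match_positions`, the token-list representation, and the example sentence are assumptions made for illustration; they are not taken from the cited systems.

```python
GAP = "⋄"  # gap symbol; here each occurrence stands for exactly one arbitrary word


def match_positions(pattern, sentence):
    """Return the start indices at which `pattern` matches `sentence`.

    Both arguments are lists of tokens. A GAP token in the pattern matches
    any single word; every other token must match exactly.
    """
    starts = []
    for i in range(len(sentence) - len(pattern) + 1):
        window = sentence[i:i + len(pattern)]
        if all(p == GAP or p == w for p, w in zip(pattern, window)):
            starts.append(i)
    return starts


sentence = "il ne veut plus danser".split()
print(match_positions(["ne", GAP, "plus"], sentence))  # [1]
print(match_positions(["ne", "plus"], sentence))       # []
```

The contiguous phrase "ne plus" fails to match because "veut" intervenes, whereas the gapped phrase covers the same pattern for any single intervening word.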
While non-contiguous bi-phrases have some advantages over contiguous ones, in particular the ability to generalize better over such linguistic patterns as in French, “ne . . . pas” or in English, “switch . . . off”, they may also lead to larger, more combinatorial, libraries. While the potential number of contiguous phrases in a sentence grows quadratically according to the length of the sentence, the potential number of non-contiguous phrases grows exponentially. It is desirable to control the proliferation of bi-phrases in this situation for several reasons. First, for storage purposes, the size of the bi-phrase library should not be too large. Second, it allows an improvement in translation speed by reducing the number of candidates to be considered during decoding. This is particularly relevant for a non-contiguous decoder, which is more complex than a contiguous one. Third, it helps to improve translation performance by removing “spurious” bi-phrases which do not have good predictive linguistic value.
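The difference in growth rates can be made concrete with a short sketch. It assumes, purely for illustration, that a candidate non-contiguous phrase is any non-empty subset of the word positions in a sentence, with no limit on the number or size of gaps; real systems restrict this space, so the second count is an upper bound.

```python
def contiguous_phrase_count(n):
    """Number of contiguous spans in a sentence of n words: n(n+1)/2."""
    return n * (n + 1) // 2


def position_subset_count(n):
    """Number of non-empty subsets of n word positions: 2^n - 1.

    Assumption: gaps are unrestricted, so every subset is a candidate.
    """
    return 2 ** n - 1


for n in (5, 10, 20):
    print(n, contiguous_phrase_count(n), position_subset_count(n))
```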
In the case of a system based on contiguous bi-phrases, it has been shown that pruning of bi-phrases out of the library can be achieved without negatively impacting the end results, while at the same time improving the speed of decoding (see, for example, Johnson, et al., Improving translation quality by discarding most of the phrasetable, Proc. 2007 Joint Conf. on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 967-975 (2007), hereinafter “Johnson 2007”). In Johnson's approach, the strength of the statistical dependence between the source and the target of the bi-phrase is assessed. Those bi-phrases for which this strength is below a certain threshold can then be pruned from the library.
In existing techniques, filtering may begin with a computation of association scores between source- and target-phrases, based on the p-value associated with a statistical independence test known as the “Fisher Exact Test.” This test assumes that any bi-phrase extracted from a parallel corpus of bi-sentences can be represented by a contingency table, as illustrated in FIG. 1 for the bi-phrase (S,T). In this figure, N is the total number of bi-sentences in a parallel corpus, C(S) is the number of bi-sentences where the phrase S appears on the source side, C(T) is the number of bi-sentences where the phrase T appears on the target side, and C(S,T) is the number of bi-sentences where S and T both appear, on the source and target sides, respectively. C(S) and C(T) are also called marginals, and the four main entries in the contingency table, C(S,T), C(S)−C(S,T), C(T)−C(S,T), and N−C(S)−C(T)+C(S,T), represent a partition of the N corpus bi-sentences into those that contain both S and T, contain S but not T, contain T but not S, and contain neither S nor T. For example, if a parallel corpus includes 10,000 bi-sentences, with C(S)=100, C(T)=120, and C(S,T)=50, then the contingency table can be represented as follows:
C(S,T) = 50          |  C(S) − C(S,T) = 50                |  C(S) = 100
C(T) − C(S,T) = 70   |  N − C(S) − C(T) + C(S,T) = 9830   |  N − C(S) = 9900
C(T) = 120           |  N − C(T) = 9880                   |  N = 10,000
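The four inner cells of such a table follow directly from the quadruple [C(S,T), C(S), C(T), N]. A minimal sketch, with a hypothetical helper name `contingency_cells`:

```python
def contingency_cells(c_st, c_s, c_t, n):
    """Partition n bi-sentences into the four inner cells of the table.

    Returns (both S and T, S but not T, T but not S, neither S nor T).
    """
    both = c_st
    s_only = c_s - c_st
    t_only = c_t - c_st
    neither = n - c_s - c_t + c_st
    return both, s_only, t_only, neither


cells = contingency_cells(c_st=50, c_s=100, c_t=120, n=10_000)
print(cells)       # (50, 50, 70, 9830)
print(sum(cells))  # 10000
```

The cells sum to N, which is a quick consistency check on any set of counts.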
Fisher's Exact Test is a statistical test for association in a table based on the exact hypergeometric distribution of the frequencies within the table. The test computes the probability (“p-value”) that a certain joint event, here the joint occurrence of S and T on the source and target sides of the same bi-sentence, occurs at least as often as observed under a so-called “null hypothesis,” which corresponds to the situation where S and T are actually statistically independent. For S and T in the example above, the question posed under the null hypothesis is: if the C(S) occurrences of S are placed independently at random in the corpus, and similarly the C(T) occurrences of T, what is the probability that the joint occurrences of (S,T) will appear C(S,T) times or more? This probability is called the p-value.
Fisher's Exact Test computes the p-value exactly, using the following formulas:
$$p_{\text{hypergeometric}}(k) = \frac{\binom{C(S)}{k}\,\binom{N - C(S)}{C(T) - k}}{\binom{N}{C(T)}}$$

$$p\text{-value}(C(S,T)) = \sum_{k = C(S,T)}^{\min(C(S),\,C(T))} p_{\text{hypergeometric}}(k)$$
where k is a variable which can assume integer values from k=C(S,T) to min(C(S), C(T)). The p-value is thus a measure which is associated with the contingency table defined by the quadruple [C(S,T), C(S), C(T), N] (or, equivalently, the quadruple [C(S,T), C(S)−C(S,T), C(T)−C(S,T), N−C(S)−C(T)+C(S,T)]). The smaller the p-value, the more significant the “dependence” between S and T is considered to be. For example, when the p-value is close to 0, then it means that the probability of finding as many joint occurrences as C(S,T), given that the marginals are C(S) and C(T), and given that the occurrences of S and of T are placed at random among the N bi-sentences, is close to 0. An association-score relative to a contingency table may be defined as follows:
association_score ≡ −log(p-value)
where log may be the natural logarithm, log_e.
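The hypergeometric sum and the association score can be evaluated directly with exact integer arithmetic. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the natural logarithm is assumed for the score.

```python
from fractions import Fraction
from math import comb, log


def p_hypergeometric(k, c_s, c_t, n):
    """Probability of exactly k joint occurrences given marginals c_s, c_t."""
    return Fraction(comb(c_s, k) * comb(n - c_s, c_t - k), comb(n, c_t))


def p_value(c_st, c_s, c_t, n):
    """Probability of c_st or more joint occurrences under independence."""
    return sum(
        p_hypergeometric(k, c_s, c_t, n)
        for k in range(c_st, min(c_s, c_t) + 1)
    )


def association_score(c_st, c_s, c_t, n):
    """Association score, -log(p-value), using the natural logarithm."""
    return -log(float(p_value(c_st, c_s, c_t, n)))


# Counts from the example contingency table: C(S,T)=50, C(S)=100, C(T)=120.
print(association_score(50, 100, 120, 10_000))  # approximately 166.58

# A singleton-singleton pair in a corpus of 500,000 bi-sentences.
print(p_value(1, 1, 1, 500_000))  # 1/500000
```

Using `Fraction` keeps the sum exact until the final conversion to a float, which avoids underflow in the intermediate binomial terms at the cost of speed on large counts.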
Thus, for example, for the contingency table above, Fisher's exact test can be run in R. The matrix data is input as follows:
data <- c(50, 50, 70, 9830)
To create the matrix in R, enter:
mat <- matrix(data, nrow=2, ncol=2, byrow=TRUE)
To calculate the p-value in R, enter:
ftest = fisher.test(mat, alternative="two.sided")
The algorithm outputs:
p-value = 4.502836e-73
association score = −log(p-value) = 166.584
The association score, which will be referred to herein as β, varies between a minimum and a maximum value (e.g., from 0 to ∞, or some value less than ∞), with high numbers indicating a strong statistical dependence. Pruning can then be used to remove from the library those bi-phrases whose β falls below a certain threshold value.
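Threshold-based pruning of this kind amounts to a simple filter over scored entries. In the sketch below, the library layout, the function name `prune_library`, the scores, and the threshold are all hypothetical and serve only to illustrate the idea.

```python
def prune_library(scores, threshold):
    """Keep only the bi-phrases whose association score is at least `threshold`."""
    return {bp: beta for bp, beta in scores.items() if beta >= threshold}


# Hypothetical association scores, invented purely for illustration.
scores = {
    ("ne ⋄ plus", "not ⋄ ⋄ ⋄ anymore"): 42.0,
    ("maison", "house"): 166.6,
    ("le", "anymore"): 1.3,
}

pruned = prune_library(scores, threshold=10.0)
print(sorted(pruned))
```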
Fisher's Exact Test provides an exact measure of the p-value. Another statistical test of independence is the χ² test. However, although it is more computationally costly, the exact test is more accurate than the χ² test when the counts in the cells of the contingency table are small, which is typically the case for bi-phrase tables.
One problem with using the association score for determining whether to prune a bi-sentence or bi-phrase from a library is that a high nominal association score between S and T does not always indicate dependence. This can be shown by the following example:
A corpus contains N=500,000 bi-sentences. Call a word S (respectively, T) a singleton if it is represented only once in the source (respectively, target) side of the corpus. Assume that 17,379 source singletons and 22,512 target singletons are observed and 19,312 s-s (singleton-singleton) pairs are observed. Each s-s pair has the same contingency table, and hence the same high association score: −log(p-value)=−log(1/500,000). However, if these singletons were placed independently at random among the 500,000 bi-sentences, on the source and target sides respectively (the null hypothesis), it would be expected that around 782.5 (=17,379×22,512/500,000) s-s pairs would be observed.
This means that if around 782 s-s pairs had been observed (rather than 19,312), there would be no confidence that any such pair (S,T) would actually be indicative of a statistical dependence between S and T, despite the high association score of −log(1/500,000).
This indicates a difficulty in interpreting such statistical significance tests. For any given singleton-singleton pair it is true that the probability of it occurring by chance, if the two singletons are actually statistically independent, is 1/500,000. However, it would be wrong to conclude from this that the fraction of the 19,312 observed singleton-singleton pairs that is due to chance is only 1/500,000. This would be the case if s-s pairs were statistically independent from one another, but clearly they are not, and the fraction of unreliable s-s associations is in general much larger.
To remedy this problem, the notion of noise has been introduced (see Robert C. Moore, On log-likelihood-ratios and the significance of rare events, Proceedings of EMNLP 2004 (Barcelona, Spain) (Dekang Lin and Dekai Wu, Eds.), Association for Computational Linguistics, July 2004, pp. 333-340) (hereinafter, “Moore 2004”). Noise is defined by Moore as the ratio between the expected number of s-s pairs under an independence assumption and the actually observed number of s-s pairs.
In the above example:
Noise ≡ 782.5/19,312 ≈ 4%.
If only around 782 s-s pairs had been observed, the Noise would be close to 100%. However, given that 19,312 such pairs were observed, it can be concluded that there is about 0.04 probability that a given s-s pair is due to chance. If instead, the raw association score were used to estimate this probability, the p-value would be 1/500,000=0.000002, that is, a much too optimistic estimate of the dependence.
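The arithmetic behind the noise estimate can be reproduced in a few lines. The function names below are illustrative; the counts are those of the singleton example above.

```python
def expected_pairs(n_source, n_target, n_bisentences):
    """Expected number of co-occurring pairs if items are placed independently."""
    return n_source * n_target / n_bisentences


def noise(expected, observed):
    """Ratio of the expected (chance) pair count to the observed pair count."""
    return expected / observed


expected = expected_pairs(17_379, 22_512, 500_000)
observed = 19_312

print(round(expected, 1))                   # 782.5
print(round(noise(expected, observed), 3))  # 0.041
print(1 / 500_000)                          # 2e-06, the per-pair p-value
```

The last line shows the contrast drawn in the text: the per-pair p-value is several orders of magnitude smaller than the noise estimate for the population of s-s pairs.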
Attempts have been made to extend the concept of noise beyond singletons to words which can occur more than once in the corpus. However, on real bilingual corpora, it is typically the case that noise decreases monotonically with the association score. Specifically, noise increases as association decreases and vice versa. As a result, attempts to use noise thresholds as a basis for pruning bi-phrases from a library are no different from approaches which are based on the association scores.