The present invention relates to analysis of rare events. More specifically, the present invention relates to determining the significance of rare events that occur in, for example, natural language processing systems, such as in the machine translation context, or in any other system that encounters rare events.
There are a wide variety of natural language processing systems which use statistical processing. One such system is a machine translation system. A machine translation system receives a textual input in one language, translates it to a second language, and provides a textual output in the second language. Such systems often use statistical methods to measure the strength of association, particularly lexical associations.
One conventional measure used in natural language processing is referred to as the G2 log-likelihood-ratio statistic. This measure is discussed in greater detail in Dunning, ACCURATE METHODS FOR THE STATISTICS OF SURPRISE AND COINCIDENCE, Computational Linguistics, 19(1):61-74 (1993). Even though this statistic is widely used in natural language processing, its use remains controversial on the grounds that it may be unreliable when applied to rare events.
Another statistic conventionally used in natural language processing is referred to as the Chi-square statistic. This is described in greater detail in Agresti, CATEGORICAL DATA ANALYSIS, John Wiley and Sons, New York, N.Y. (1990). It has been demonstrated that the Chi-square test is valid with smaller sample sizes and more sparse data than the G2 statistic. However, either Chi-square or G2 can be unreliable when expected frequencies of less than five are involved.
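By way of illustration only, the two statistics discussed above can be sketched for a 2x2 contingency table using their standard textbook definitions. The table layout, the function names and the sample counts below are assumptions made for this sketch and do not come from the cited references.

```python
import math

def expected_counts(table):
    """Expected cell counts under independence for a 2x2 table [[a, b], [c, d]].

    Assumed layout: a = both items present, b = first item only,
    c = second item only, d = neither item present.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    return [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]

def chi_square(table):
    """Pearson Chi-square: sum over cells of (observed - expected)^2 / expected."""
    exp = expected_counts(table)
    return sum((table[i][j] - exp[i][j]) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))

def g2(table):
    """G2 log-likelihood-ratio: 2 * sum over cells of observed * ln(observed / expected)."""
    exp = expected_counts(table)
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2.0 * sum(table[i][j] * math.log(table[i][j] / exp[i][j])
                     for i in range(2) for j in range(2) if table[i][j] > 0)

def smallest_expected(table):
    """Smallest expected cell count, for checking against the threshold of five."""
    return min(min(row) for row in expected_counts(table))

# Hypothetical counts, chosen only to exercise the functions.
sample = [[10, 20], [30, 40]]
print(chi_square(sample), g2(sample), smallest_expected(sample))
```

With well-populated cells such as these, the two statistics take similar values. The concern described above arises when the value returned by smallest_expected falls below five.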
A phenomenon referred to as Zipf's Law shows that the problem of rare events invariably arises whenever dealing with individual words. Zipf's Law has various formulations, but they all imply that relatively few words in a language are very common, and most words are relatively rare. This means that no matter how large a corpus is, most of the distinct words in that corpus occur only a small number of times. For example, one corpus includes 500,000 English sentences sampled from the Canadian Hansards data supplied for the bilingual word alignment workshop held at HLT-NAACL 2003 (and referred to in more detail in Mihalcea and Pedersen, AN EVALUATION EXERCISE FOR WORD ALIGNMENT, Proceedings of the HLT-NAACL 2003 workshop, BUILDING AND USING PARALLEL TEXTS: DATA DRIVEN MACHINE TRANSLATION AND BEYOND, pp. 1-6, Edmonton, Alberta (2003)). In that corpus, there are 52,921 distinct word types, of which 60.5 percent occur five or fewer times, and 32.8 percent occur only once.
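By way of illustration only, percentages of the kind quoted above can be obtained by counting how often each distinct word type occurs. The sketch below assumes a corpus that has already been tokenized into a list of strings; the function name and the toy sentence are hypothetical, and the toy input is far too small to reproduce the Hansards figures.

```python
from collections import Counter

def rare_type_shares(tokens, threshold=5):
    """Return (number of distinct types,
               share of types occurring at most `threshold` times,
               share of types occurring exactly once)."""
    counts = Counter(tokens)
    n_types = len(counts)
    if n_types == 0:
        raise ValueError("token list is empty")
    at_most = sum(1 for c in counts.values() if c <= threshold)
    once = sum(1 for c in counts.values() if c == 1)
    return n_types, at_most / n_types, once / n_types

# Toy input, for illustration only.
tokens = "the the the the the the cat sat on the mat".split()
print(rare_type_shares(tokens))
```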
While the G2 statistic has been most often used in natural language processing as a measure of the strength of association between pairs of words, the sparse data problem which renders the G2 statistic unreliable becomes even worse when considering pairs of words. For example, considering the 500,000 French sentences corresponding to the English sentences described above, it can be seen that 19,460,068 English-French word pairs occur in aligned sentences more often than would be expected by chance, given their monolingual frequencies. Of these, 87.9 percent occur together five or fewer times (i.e., they have a joint occurrence frequency of five or less) and 62.4 percent occur together only once.
Moreover, if the expected number of occurrences of these word pairs (which is the criterion used for determining the applicability of Chi-square or G2 significance tests) is considered, it can be seen that 93.2 percent would be expected by chance to have fewer than five occurrences. Thus, any statistical measure that is unreliable for expected frequencies of less than five would be wholly unusable with such data.
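By way of illustration only, the expected number of joint occurrences of a word pair under independence can be sketched as the product of the two marginal counts divided by the sample size. The sketch assumes that each count is the number of aligned sentence pairs containing the word; the function names are hypothetical.

```python
def expected_joint_count(count_x, count_y, n_pairs):
    """Expected co-occurrences by chance: C(x) * C(y) / N."""
    return count_x * count_y / n_pairs

def meets_expected_threshold(count_x, count_y, n_pairs, threshold=5):
    """True if the expected joint count reaches the conventional threshold of five."""
    return expected_joint_count(count_x, count_y, n_pairs) >= threshold

N = 500_000  # size of the aligned corpus discussed above
print(expected_joint_count(1, 1, N))
print(meets_expected_threshold(1581, 1581, N))
print(meets_expected_threshold(1582, 1582, N))
```

Under these assumptions, the product of the two marginal counts must reach 2,500,000 before the expected joint count reaches five in a corpus of 500,000 sentence pairs. Two equally frequent words would each need to occur in roughly 1,582 sentence pairs.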
In the past, a wide variety of statistics have been used to measure the strength of word association. Such statistics include point-wise mutual information, the Dice coefficient, Chi-square, G2 and Fisher's Exact Test. Each of these is described in greater detail in Inkpen and Hirst, ACQUIRING COLLOCATIONS FOR LEXICAL CHOICE BETWEEN NEAR-SYNONYMS, UNSUPERVISED LEXICAL ACQUISITION: Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX), pp. 67-76, Philadelphia, Pa. (2002).
Despite the fact that many of these statistics arise from significance testing, the conventional practice in applying them in natural language processing has been to choose a threshold heuristically for the value of the statistic being used and to discard all the pairs below the threshold. It has been conventionally taught that there is no principled way of choosing these thresholds. See Inkpen and Hirst, p. 70. Indeed, if standard statistical tests are applied in the conventional manner, the results make no sense in the types of natural language processing systems discussed herein.
An example may be helpful in illustrating the deficiencies of the conventional systems. Consider the case of two words that each occur only once in a corpus, but happen to co-occur. Conventional wisdom strongly advises suspicion of any event that occurs only once, yet it is easy to see that applying standard statistical methods to this case tends to suggest that it is highly significant, without using any questionable approximations at all.
The question that significance tests for association (such as Chi-square, G2 and Fisher's Exact Test) are designed to answer is: Given the sample size and the marginal frequencies of the two items in question, what is the probability (or p-value) of seeing by chance as many or more joint occurrences as were observed? In the case of a joint occurrence of two words that each occur only once, this is trivial to calculate.
For instance, suppose an English word and a French word each occur only once in the corpus discussed above of 500,000 aligned sentence pairs of Hansards data, but they happen to occur together. In order to determine the probability that this joint occurrence happened by chance, it can be supposed that the English word occurs in an arbitrary sentence pair. The probability that the French word, purely by chance, would occur in the same sentence pair is clearly 1 in 500,000 or 0.000002. Since it is impossible to have more than one joint occurrence of two words that each have only a single occurrence, 0.000002 is the exact p-value for the question we have asked. One should not conclude on this basis alone that the association between the words is highly certain, yet conventional approaches did exactly that.
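By way of illustration only, the p-value described above can be reproduced with the standard hypergeometric formulation of the one-sided Fisher's Exact Test, computed here in exact rational arithmetic. The function name and argument names are hypothetical, and the sketch assumes that each count is the number of sentence pairs containing the word.

```python
from fractions import Fraction
from math import comb  # requires Python 3.8 or later

def exact_p_value(joint, count_x, count_y, n):
    """Probability of at least `joint` co-occurrences arising by chance,
    given fixed marginal counts count_x and count_y in a sample of size n."""
    max_joint = min(count_x, count_y)
    # Number of ways to place the count_y occurrences of the second item.
    total = comb(n, count_y)
    # Number of placements in which exactly k of them coincide with the first item,
    # summed over every k from the observed joint count upward.
    tail = sum(comb(count_x, k) * comb(n - count_x, count_y - k)
               for k in range(joint, max_joint + 1))
    return Fraction(tail, total)

p = exact_p_value(joint=1, count_x=1, count_y=1, n=500_000)
print(p, float(p))
```

For two words that each occur once and co-occur once, the sum has a single term and the result is exactly 1/500,000, which agrees with the figure of 0.000002 given above.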