The ordinary vocabulary of a language like English contains thousands of phrasal terms. A phrasal term is a multi-word lexical unit, such as a compound noun, a technical term, an idiomatic phrase, or a fixed collocation. The exact number of phrasal terms is difficult to determine because new phrasal terms are coined regularly. Moreover, it is sometimes difficult to determine whether a phrase is a fixed term or a regular, compositional expression. Accurate identification of phrasal terms is important in a variety of contexts, including natural language processing, question answering systems, information retrieval systems, and the like.
Distinguishing factors for the component words of phrasal terms as compared to other lexical units include the following: 1) the component words tend to co-occur more frequently; 2) the component words are more resistant to substitution or paraphrasing; 3) the component words follow fixed syntactic patterns; and 4) the component words display some degree of semantic non-compositionality. However, none of these characteristics are amenable to a simple algorithmic interpretation.
Phrasal terms also vary in length, so any solution must provide a normalization that allows phrases of different lengths to be compared directly. Ideally, the solution would also address two further issues that affect statistical association measures: the assumption that component words are selected independently, and the skewed distributions typical of natural language data.
While numerous term extraction systems have been developed, such systems typically rely on a combination of linguistic knowledge and statistical association measures. Grammatical patterns, such as adjective-noun or noun-noun sequences, are selected and ranked statistically. The resulting ranked list is then either used directly or submitted for manual filtering. Such systems include those described in F. Smadja, “Retrieving collocations from text: Xtract,” Computational Linguistics, 19:143-77 (1993); I. Dagan & K. Church, “Termight: Identifying and translating technical terminology,” ACM International Conference Proceeding Series: Proceedings of the fourth conference on applied natural language processing, Stuttgart, Germany pp. 39-40 (1994); J. S. Justeson & S. M. Katz, “Technical terminology: some linguistic properties and an algorithm for identification in text,” Natural Language Engineering 1:9-27 (1995); B. Daille, “Study and Implementation of Combined Techniques for Automatic Extraction of Terminology,” contained in “The Balancing Act: Combining Symbolic and Statistical Approaches to Language,” J. Klavans & P. Resnik, eds., pp 49-66 (1996); C. Jacquemin, et al., “Expansion of multi-word terms for indexing and retrieval using morphology and syntax,” Proceedings of ACL 1997, Madrid, pp 24-31; C. Jacquemin & E. Tzoukermann, “NLP for Term Variant Extraction: Synergy between Morphology, Lexicon, and Syntax,” Natural Language Processing Information Retrieval, pp 25-74 (1999); and B. Boguraev & C. Kennedy, “Applications of Term Identification Technology: Domain Description and Content Characterization,” Natural Language Engineering 5(1): 17-44 (1999), each of which is incorporated by reference herein in its entirety.
The linguistic filters used in typical term extraction systems have no direct connection with the criteria that define a phrasal term, such as non-compositionality, fixed order, non-substitutability, and the like. Instead, the linguistic filters function to eliminate improbable terms a priori and thus improve precision. An association measure then distinguishes between phrasal terms and plausible non-terms. Various measures have been used, including simple frequency, modified frequency measures, standard statistical significance tests, such as the t-test, the chi-squared test, and log-likelihood, and information-based measures, such as pointwise mutual information. The modified frequency measures may include the c-value defined in K. Frantzi, et al., “Automatic recognition of multi-word terms: the C-Value and NC-Value Method,” International Journal on Digital Libraries 3(2):115-30 (2000) and D. Maynard & S. Ananiadou, “Identifying Terms by Their Family and Friends,” COLING 2000, pp 530-36 (2000), each of which is incorporated by reference herein in its entirety. Statistical significance tests and information-based measures are used in K. W. Church & P. Hanks, “Word association norms, mutual information, and lexicography,” Computational Linguistics 16(1):22-29 (1990) and T. Dunning, “Accurate methods for the statistics of surprise and coincidence,” Computational Linguistics 19:1 (1993), each of which is incorporated herein by reference in its entirety.
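As an illustration of one such association measure, the following minimal Python sketch computes pointwise mutual information for adjacent word pairs. It is not taken from any of the cited systems. The function name `bigram_pmi`, the toy sentence, and the maximum-likelihood probability estimates are assumptions made only for this example.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for each adjacent word pair.

    Hypothetical helper for illustration: probabilities are plain
    relative frequencies, with no smoothing or frequency cutoff.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)        # number of word positions
    n_bi = len(tokens) - 1     # number of adjacent pairs
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        # Probability of the pair if the two words were chosen independently.
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        scores[(w1, w2)] = math.log2(p_pair / p_independent)
    return scores

tokens = "the hot dog vendor sold a hot dog to the dog owner".split()
scores = bigram_pmi(tokens)

# Print the pairs from most to least strongly associated.
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(" ".join(pair), round(score, 3))
```

On this toy input, “hot dog” scores higher than “the dog,” but pairs of words that each occur only once, such as “vendor sold,” score higher still. The score compares the observed pair frequency with the frequency expected if the two words were selected independently.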
However, none of the aforementioned methods provides adequate identification of phrasal terms. Indeed, the above methods generally fare worse than methods employing simple frequency orderings unless grammatical pre-filtering is performed on the input data. One explanation for the low precision of the above-described lexical association measures on unfiltered data is the failure of the underlying statistical assumptions. For example, many of the tests assume a normal distribution, despite the highly skewed nature of natural language frequency distributions. Perhaps even more importantly, statistical and information-based metrics, such as log-likelihood and mutual information, measure significance or informativeness relative to the assumption that the selection of component terms is statistically independent. However, the possibilities for combinations of words are neither random nor independent. Use of linguistic filters such as “attributive adjective followed by noun” or “verb plus modifying prepositional phrase” arguably has the effect of selecting a subset of the language for which the standard null hypothesis (that any word may freely be combined with any other word) may be much more accurate. Additionally, many of the association measures are defined only for bigrams and do not generalize well to phrasal terms of varying length.
Moreover, existing association methods are designed to measure the statistical relationship between word sequences and their component words without regard for alternative sequences. For example, under these association methods, judging “hot dog” to be a phrase necessarily entails judging “the hot,” “eat the hot,” “dog quickly,” “hot dog quickly,” and numerous other overlapping word sequences not to be phrases.
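To make the notion of alternative sequences concrete, the following minimal Python sketch lists the multi-word sequences that overlap a given candidate phrase in a sentence. The function name `overlapping_candidates`, the example sentence, and the three-word length limit are assumptions chosen only for illustration. The sketch does not represent any existing association method.

```python
def overlapping_candidates(tokens, target, max_len=3):
    """List multi-word sequences sharing at least one position with target.

    Hypothetical helper for illustration: considers every sequence of
    2..max_len adjacent words and excludes the target occurrence itself.
    """
    target = tuple(target)
    n = len(target)
    # Half-open spans (start, end) where the target occurs in the tokens.
    target_spans = [
        (i, i + n)
        for i in range(len(tokens) - n + 1)
        if tuple(tokens[i:i + n]) == target
    ]
    candidates = []
    for start in range(len(tokens)):
        for end in range(start + 2, min(start + max_len, len(tokens)) + 1):
            for t_start, t_end in target_spans:
                is_target = (start, end) == (t_start, t_end)
                overlaps = start < t_end and end > t_start
                if overlaps and not is_target:
                    candidates.append(" ".join(tokens[start:end]))
                    break
    return candidates

tokens = "eat the hot dog quickly".split()
rivals = overlapping_candidates(tokens, ["hot", "dog"])
print(rivals)
```

For the sentence “eat the hot dog quickly,” the sketch returns “eat the hot,” “the hot,” “the hot dog,” “hot dog quickly,” and “dog quickly,” each of which competes with “hot dog” for the same word positions.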
What is needed is a method of determining phrasal terms that improves upon the performance of previous lexical association methods.
A need exists for a method of determining phrasal terms using a frequency-based measure.
A further need exists for natural language processing systems, essay evaluation systems, information retrieval systems, and the like which employ such a method.
A still further need exists for evaluating overlapping and alternative word sequences to determine if more than one phrasal term exists in a word sequence.
The present disclosure is directed to solving one or more of the above-listed problems.