A natural language processing system is a computer-implemented software system which intelligently derives meaning and context from an input string of natural language text. "Natural languages" are languages which are spoken by humans (e.g., English, French, Japanese). Computers cannot, without assistance, distinguish linguistic characteristics of natural language text. For instance, a sentence in a natural language text might read as follows:
I saw a bird.
A student of English understands that, within the context of this sentence, the word "I" is a pronoun, the word "saw" is a verb, the word "a" is an adjective, and the word "bird" is a noun. However, in the context of other sentences, the same words might assume different parts of speech. Consider the following sentence:
Use a saw.
The English student recognizes that the word "use" is a verb, the word "a" is an adjective, and the word "saw" is a noun. Notice that the word "saw" is used in the two sentences as different parts of speech, a verb and a noun, which an English-speaking person realizes. To a computer, however, the word "saw" is represented by the same bit stream and hence is identical in both sentences. The computer is as likely to treat the word "saw" as a noun as it is to treat it as a verb, in either sentence. A natural language processing system assists the computer in distinguishing how words are used in different contexts and in applying rules to construct intelligible text.
FIG. 1 shows the general components of a natural language processing system 20 which are typically implemented in software and executed on a computer. The natural language processing system 20 includes a lexical analyzer 22 which converts an input text string into a stream of tokens containing information from the lexicon and the system's morphology component. The lexical analyzer 22 determines the possible parts of speech, person, number and other grammatical features for each token (word). In this example, suppose the input string is the phrase "school finishes." The lexical analyzer 22 might resolve the word "school" as follows:
Word: school
Part of Speech:
    Noun
        Features: third person, singular
    Verb
        Features: plural, infinitive, present tense
    Adjective
        Features: pre-modifies noun
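The lexical record for "school" can be pictured as a small data structure. The following Python sketch is illustrative only: the names LexicalRecord, LEXICON, and lexical_records are hypothetical and do not come from the system described here, and the toy lexicon holds only the single entry from the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LexicalRecord:
    """One possible reading of a token (hypothetical structure)."""
    word: str
    part_of_speech: str
    features: List[str] = field(default_factory=list)


# Toy lexicon containing only the "school" entry from the example.
LEXICON = {
    "school": [
        ("Noun", ["third person", "singular"]),
        ("Verb", ["plural", "infinitive", "present tense"]),
        ("Adjective", ["pre-modifies noun"]),
    ],
}


def lexical_records(word: str) -> List[LexicalRecord]:
    """Return one record per possible part of speech; empty if the word is unknown."""
    entries = LEXICON.get(word.lower(), [])
    return [LexicalRecord(word, pos, list(feats)) for pos, feats in entries]


for record in lexical_records("school"):
    print(record.word, record.part_of_speech, record.features)
```

Producing one record per possible part of speech, rather than a single record per word, leaves the choice among readings to later stages.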
The lexical analyzer 22 uses these components to construct data structures, commonly referred to as lexical records, for each word in the input string text. A parser 24 creates a syntactic analysis for the input string by using the lexical records produced by the lexical analyzer 22, combining lexical records into constituents to form larger constituents until one or more complete trees are produced. The product of the parser 24 is passed to a logic normalizer 26 which places linguistically equivalent sentences (e.g., "John ate an apple" is essentially equivalent to "an apple was eaten by John") in a normalized form. Finally, a sense disambiguator 28 resolves any ambiguities that might be left in the sentence following the parse, syntax, and logic processes. For instance, the sense disambiguator 28 might determine whether the word "school" refers to a building or to an activity that finishes.
This invention particularly concerns problems associated with natural language parsers. Conventional natural language parsers are typically one of two types: "statistical" and "rule-based." A statistical parser, the type that is currently more popular, determines parsing parameters by computing statistics on words used in a small sample portion of a corpus. Once the statistics are computed, the statistical parser relies on them when analyzing the large corpus. This is described below in more detail.
A rule-based parser stores knowledge about the structure of language in the form of linguistic rules. The parser makes use of syntactic and morphological information about individual words found in the dictionary or "lexicon" or derived through morphological processing (organized in the lexical analysis stage). Successful parsing requires that the parser (grammar) have the necessary rules and the lexical analyzer provide all the details needed by the parser to resolve as many ambiguities as it can at that level.
Natural language parsers are said to have "broad coverage" when capable of parsing general natural language text of many different types. To achieve broad coverage, a natural language parser needs a complete lexicon which includes both frequent and seldom-used words. Even the rarest parts of speech should be represented when attempting broad coverage.
Broad-coverage, rule-based natural language parsers have a disadvantage in that they require extensive amounts of dictionary data and rule-writing labor by highly skilled linguists to create, enhance, and maintain the parsers. Manually coding the required information is both time-consuming and error-prone. A standard on-line dictionary represents centuries of hand-coding by skilled lexicographers.
Machine-readable dictionaries (MRDs) are being adapted for use in natural language parsers. MRDs provide the large and complete lexicon needed for broad coverage. Though dictionaries prove useful as sources of comprehensive lexicons for natural language parsers, their completeness introduces ambiguity that is not easily resolved. Resolving ambiguity with regard to parts of speech presents a particularly difficult problem. The American Heritage Dictionary (1992 edition) has approximately 18,500 words with multiple parts of speech, which represents approximately 12% of the total number of entries (inflected forms included). However, these words are often common, well-used words. One researcher studied the Brown Corpus (a well-known, large body of approximately one million words of natural language text drawn from many different subjects) and found that only 11% of the unique words in the Corpus were part-of-speech ambiguous. However, those same words accounted for 48% of the raw text in the Brown Corpus, evidencing that words which are part-of-speech ambiguous tend to be common, well-used words. See DeRose, S. J. 1992. "Probability and Grammatical Category: Collocational Analyses of English and Greek." In For Henry Kučera, eds. A. W. Mackie, T. K. McAuley and C. Simmons, 125-152. Michigan Slavic Publications, University of Michigan.
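The gap between the 11% and 48% figures is the difference between counting unique words (types) and counting running words (tokens). The following Python sketch computes both rates on a toy example; the lexicon, the token list, and the function name ambiguity_rates are assumptions made for illustration and are not drawn from the Brown Corpus or from the cited study.

```python
from collections import Counter

# Hypothetical toy lexicon: word -> set of possible parts of speech.
LEXICON = {
    "i": {"Pronoun"},
    "saw": {"Noun", "Verb"},
    "a": {"Adjective"},
    "bird": {"Noun"},
    "use": {"Noun", "Verb"},
    "the": {"Adjective"},
}


def ambiguity_rates(tokens, lexicon):
    """Return (type_rate, token_rate) of part-of-speech ambiguous words.

    type_rate  : fraction of unique words that have more than one part of speech
    token_rate : fraction of running text made up of those ambiguous words
    """
    counts = Counter(tokens)
    ambiguous = {w for w in counts if len(lexicon.get(w, ())) > 1}
    type_rate = len(ambiguous) / len(counts)
    token_rate = sum(counts[w] for w in ambiguous) / len(tokens)
    return type_rate, token_rate


# Invented token stream in which the ambiguous word "saw" is frequent.
tokens = "i saw a bird i saw the saw use the saw".split()
type_rate, token_rate = ambiguity_rates(tokens, LEXICON)
print(f"ambiguous types: {type_rate:.0%}, ambiguous tokens: {token_rate:.0%}")
```

Because the ambiguous words recur often, the token rate in this toy text exceeds the type rate, which is the same pattern the Brown Corpus figures show at scale.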
It is computationally desirable that the parser be able to choose the most probable parse from the potentially large number of possible parses. Further processing of the input quickly becomes complex and inefficient if more than one parse is considered. To reduce the number of possible parses, it is desirable to develop methods which assist the parser in efficient resolution of part-of-speech ambiguities.
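To see why the number of candidates grows quickly, note that the number of possible part-of-speech assignments for a sentence is the product of the number of parts of speech available to each word. The following Python sketch enumerates the assignments for "Use a saw." It is a simplification under stated assumptions: the toy lexicon and function names are hypothetical, and the count covers part-of-speech assignments only, not parse trees (a grammar would reject some assignments, and one assignment can admit several trees).

```python
from itertools import product
from math import prod

# Hypothetical toy lexicon: word -> possible parts of speech.
LEXICON = {
    "use": ["Noun", "Verb"],
    "a": ["Adjective"],
    "saw": ["Noun", "Verb"],
}


def candidate_assignments(words, lexicon):
    """Enumerate every combination of one part of speech per word."""
    options = [lexicon[w.lower()] for w in words]
    return list(product(*options))


def assignment_count(words, lexicon):
    """Count the combinations without enumerating them."""
    return prod(len(lexicon[w.lower()]) for w in words)


words = ["Use", "a", "saw"]
for assignment in candidate_assignments(words, LEXICON):
    print(list(zip(words, assignment)))
print("candidates:", assignment_count(words, LEXICON))
```

Each additional ambiguous word multiplies the count, so a sentence containing ten two-way ambiguous words already has 1,024 candidate assignments.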
One prior art parsing technique is to use an augmented transition network (ATN). An ATN is similar to a recursive transition network in that it is a directed graph with labeled states and arcs, except that the ATN permits conditions to be satisfied and structure-building actions to be executed to be attached to an arc. ATNs often generate multiple and unlikely parses because they cannot successfully resolve part-of-speech ambiguities. See Church, K. W. 1992. "Current Practice in Part of Speech Tagging and Suggestions for the Future." In For Henry Kučera, eds. A. W. Mackie, T. K. McAuley and C. Simmons, 13-48. Michigan Slavic Publications, University of Michigan. This is most likely true for all broad-coverage rule-based approaches. To achieve broad coverage, a parser must be able to analyze the variety of structures found in real text. When a single sentence contains multiple words which are ambiguous with respect to their part of speech, determining the most probable parse becomes a difficult undertaking. This problem becomes extreme when truly broad-coverage parsing is attempted.
Another prior art technique that has evolved over the last 25 years is to employ statistical models for part-of-speech determination. The statistical models are implemented using statistical parsers. With the statistical approach, a statistical parser is initially operated in a training mode in which it receives input strings that have been annotated by a linguist with tags that specify parts of speech and other characteristics. The statistical parser records statistics reflecting the application of the tags to portions of the input string. After a significant amount of training using tagged input strings, the statistical parser enters a parsing mode in which it receives raw untagged input strings. In the parsing mode, the statistical parser applies the learned statistics assembled during the training mode to build parse trees for the untagged input string.
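The two modes can be illustrated with a deliberately simplified stand-in. The following Python sketch is a unigram "most frequent tag" baseline, not the statistical parser described above: it records only per-word tag counts and assigns tags rather than building parse trees. The class name UnigramTagger and the three hand-tagged training sentences are hypothetical.

```python
from collections import Counter, defaultdict


class UnigramTagger:
    """Simplified stand-in: learns per-word tag counts, ignores context."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # word -> Counter of tags

    def train(self, tagged_sentences):
        """Training mode: record how often each tag is applied to each word."""
        for sentence in tagged_sentences:
            for word, tag in sentence:
                self.counts[word.lower()][tag] += 1

    def probabilities(self, word):
        """Relative frequency of each tag seen for the word; empty if unseen."""
        tag_counts = self.counts.get(word.lower())
        if not tag_counts:
            return {}
        total = sum(tag_counts.values())
        return {tag: n / total for tag, n in tag_counts.items()}

    def tag(self, words):
        """Parsing mode: assign each word its most frequent training tag."""
        result = []
        for word in words:
            probs = self.probabilities(word)
            best = max(probs, key=probs.get) if probs else None
            result.append((word, best))
        return result


# Invented, hand-tagged training input.
training = [
    [("I", "Pronoun"), ("saw", "Verb"), ("a", "Adjective"), ("bird", "Noun")],
    [("Use", "Verb"), ("a", "Adjective"), ("saw", "Noun")],
    [("She", "Pronoun"), ("saw", "Verb"), ("the", "Adjective"), ("school", "Noun")],
]

tagger = UnigramTagger()
tagger.train(training)
print(tagger.probabilities("saw"))
print(tagger.tag(["Use", "a", "saw"]))
```

On this toy data the baseline tags "saw" as a verb even in "Use a saw," because the verb reading is more frequent in training. Per-word statistics alone do not capture context, which is one reason practical statistical approaches condition on more than the word itself.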
Early versions of the statistical parser required a large rule database and a large training corpus to provide adequate statistics for later use in determining parts of speech. Great strides have been made since then in terms of the efficiency, simplicity, and accuracy of tagging algorithms and in the reduction of the rule database. While the size of the rule database is shrinking, the need for large training corpora remains. Statistical approaches usually require a training corpus that has been manually tagged with part-of-speech information.
In an effort to avoid the use of large training corpora, a developer proposed the use of a rule-based parser to derive part-of-speech and rule probabilities from untagged corpora. By incorporating part-of-speech and rule probabilities into the same parser, the speed and accuracy of the parser were improved. This approach is described in Richardson, S. D. 1994. "Bootstrapping Statistical Processing into a Rule-based Natural Language Parser." In Proceedings of the ACL Workshop "Combining symbolic and statistical approaches to language", pp. 96-103. It is also the subject of U.S. patent application Ser. No. 08/265,845, filed Jun. 24, 1994, and PCT Application No. PCT/US95/08245, filed Jun. 26, 1995, which are entitled "Method and System for Bootstrapping Statistical Processing into a Rule-based Natural Language Parser."
The statistical rule-based parser assumes, however, the availability of a large corpus and a fairly comprehensive parser. In the English language, large well-balanced corpora like the Brown Corpus (Kučera and Francis, 1967) and the Lancaster-Oslo/Bergen (LOB) Corpus (Johansson et al., 1978) are suitable. Unfortunately, such corpora are not always available in other languages.
Accordingly, the inventor has developed an improved technique for deriving part-of-speech probabilities without reliance on large well-balanced training corpora.