The present invention relates to the field of natural language processing ("NLP"), and more particularly, to a method and system for organizing and retrieving information from an electronic dictionary.
Computer systems for automatic natural language processing use a variety of subsystems, roughly corresponding to the linguistic fields of morphological, syntactic, and semantic analysis to analyze input text and achieve a level of machine understanding of natural language. Having understood the input text to some level, a computer system can, for example, suggest grammatical and stylistic changes to the input text, answer questions posed in the input text, or effectively store information represented by the input text.
Morphological analysis identifies input words and provides information for each word that a human speaker of the natural language could determine by using a dictionary. Such information might include the syntactic roles that a word can play (e.g., noun or verb) and ways that the word can be modified by adding prefixes or suffixes to generate different, related words. For example, in addition to the word "fish," the dictionary might also list a variety of words related to, and derived from, the word "fish," including "fishes," "fished," "fishing," "fisher," "fisherman," "fishable," "fishability," "fishbowl," "fisherwoman," "fishery," "fishhook," "fishnet," and "fishy."
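The kind of per-word information described above can be pictured as a small lookup table. The following Python sketch is illustrative only (the attribute names and entries are assumptions, not the actual dictionary format of any particular system):

```python
# A minimal sketch of the information morphological analysis retrieves
# per word; the attribute names here are illustrative assumptions.
LEXICON = {
    "fish":    {"parts_of_speech": ["noun", "verb"], "lemma": "fish"},
    "fishes":  {"parts_of_speech": ["noun", "verb"], "lemma": "fish"},
    "fished":  {"parts_of_speech": ["verb"],         "lemma": "fish"},
    "fishing": {"parts_of_speech": ["noun", "verb"], "lemma": "fish"},
    "fisher":  {"parts_of_speech": ["noun"],         "lemma": "fish"},
}

def morphological_lookup(word):
    """Return the dictionary information for a word, as a human
    speaker would by consulting a printed dictionary."""
    return LEXICON.get(word.lower())

print(morphological_lookup("fished"))
# {'parts_of_speech': ['verb'], 'lemma': 'fish'}
```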
Syntactic analysis analyzes each input sentence, using, as a starting point, the information provided by the morphological analysis of input words and the set of syntax rules that define the grammar of the language in which the input sentence was written. The following are sample syntax rules:

(1) sentence = noun phrase + verb phrase
(2) noun phrase = adjective + noun
(3) verb phrase = adverb + verb
Syntactic analysis attempts to find an ordered subset of syntax rules that, when applied to the words of the input sentence, combine groups of words into phrases, and then combine phrases into a complete sentence. For example, consider the input sentence: "Big dogs fiercely bite." Using the three simple rules listed above, syntactic analysis would identify the words "Big" and "dogs" as an adjective and noun, respectively, and apply the second rule to generate the noun phrase "Big dogs." Syntactic analysis would identify the words "fiercely" and "bite" as an adverb and verb, respectively, and apply the third rule to generate the verb phrase "fiercely bite." Finally, syntactic analysis would apply the first rule to form a complete sentence from the previously generated noun phrase and verb phrase. An ordered set of rules and the phrases that result from applying them, including a final complete sentence, is called a parse.
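The rule-application process just described can be sketched as a small rewriting loop. This is a simplified illustration under the assumption that each word has already been tagged with a single part of speech; it is not the actual parsing algorithm of the disclosed system:

```python
# Illustrative bottom-up application of the three sample syntax rules
# to the tag sequence for "Big dogs fiercely bite."
RULES = [
    ("sentence",    ["noun_phrase", "verb_phrase"]),  # rule 1
    ("noun_phrase", ["adjective", "noun"]),           # rule 2
    ("verb_phrase", ["adverb", "verb"]),              # rule 3
]

def parse(tags):
    """Repeatedly replace an adjacent run of constituents matching a
    rule's right-hand side with the rule's left-hand side, until no
    rule applies. Returns the resulting constituent sequence."""
    changed = True
    while changed:
        changed = False
        for label, rhs in RULES:
            n = len(rhs)
            for i in range(len(tags) - n + 1):
                if tags[i:i + n] == rhs:
                    tags = tags[:i] + [label] + tags[i + n:]
                    changed = True
                    break
            if changed:
                break
    return tags

# "Big dogs fiercely bite." -> adjective, noun, adverb, verb
print(parse(["adjective", "noun", "adverb", "verb"]))  # ['sentence']
```

A result of `['sentence']` indicates a complete parse; anything else means the rules could not combine all words into one sentence.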
Some sentences, however, can have several different parses. A classic example sentence for such multiple parses is: "Time flies like an arrow." There are at least three possible parses corresponding to three possible meanings of this sentence. In the first parse, "time" is the subject of the sentence, "flies" is the verb, and "like an arrow" is a prepositional phrase modifying the verb "flies." However, there are at least two unexpected parses as well. In the second parse, "time" is an adjective modifying "flies," "like" is the verb, and "an arrow" is the object of the verb. This parse corresponds to the meaning that flies of a certain type, "time flies," like or are attracted to an arrow. In the third parse, "time" is an imperative verb, "flies" is the object, and "like an arrow" is a prepositional phrase modifying "time." This parse corresponds to a command to time flies as one would time an arrow, perhaps with a stopwatch.
Syntactic analysis is often accomplished by constructing one or more hierarchical trees called syntax parse trees. Each leaf node of the syntax parse tree represents one word of the input sentence. The application of a syntax rule generates an intermediate-level node linked from below to one, two, or occasionally more existing nodes. The existing nodes initially comprise only leaf nodes, but, as syntactic analysis applies syntax rules, the existing nodes comprise both leaf nodes as well as intermediate-level nodes. A single root node of a complete syntax parse tree represents an entire sentence.
Semantic analysis generates a logical form graph that describes the meaning of input text in a deeper way than can be described by a syntax parse tree alone. Semantic analysis first attempts to choose the correct parse, represented by a syntax parse tree, if more than one syntax parse tree was generated by syntactic analysis. The logical form graph corresponding to the correct parse is a first attempt to understand the input text at a level analogous to that achieved by a human speaker of the language.
The logical form graph has nodes and links, but, unlike the syntax parse tree described above, is not hierarchically ordered. The links of the logical form graph are labeled to indicate the relationship between a pair of nodes. For example, semantic analysis may identify a certain noun in a sentence as the deep subject or deep object of a verb. The deep subject of a verb is the doer of the action and the deep object of a verb is the object of the action specified by the verb. The deep subject of an active voice verb may be the syntactic subject of the sentence, and the deep object of an active voice verb may be the syntactic object of the verb. However, the deep subject of a passive voice verb may be expressed in an instrumental clause, and the deep object of a passive voice verb may be the syntactic subject of the sentence. For example, consider the two sentences: (1) "Dogs bite people" and (2) "People are bitten by dogs." The first sentence has an active voice verb, and the second sentence has a passive voice verb. The syntactic subject of the first sentence is "Dogs" and the syntactic object of the verb "bite" is "people." By contrast, the syntactic subject of the second sentence is "People" and the verb phrase "are bitten" is modified by the instrumental clause "by dogs." For both sentences, "dogs" is the deep subject, and "people" is the deep object of the verb or verb phrase of the sentence. Although the syntax parse trees generated by syntactic analysis for sentences 1 and 2, above, will be different, the logical form graphs generated by semantic analysis will be the same, because the underlying meaning of the two sentences is the same.
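The point that active and passive forms yield the same logical form graph can be made concrete by representing a logical form graph as a set of labeled, directed links. This is a simplified sketch, not the actual data structure of the disclosed system:

```python
def make_logical_form(verb, deep_subject, deep_object):
    """Represent a (fragment of a) logical form graph as a set of
    labeled, directed links: (source node, link label, target node)."""
    return {
        (verb, "deep_subject", deep_subject),
        (verb, "deep_object", deep_object),
    }

# Active voice: "Dogs bite people" — the syntactic subject ("dogs")
# is also the deep subject.
active = make_logical_form("bite", "dog", "person")

# Passive voice: "People are bitten by dogs" — the syntactic subject
# ("people") is the deep object; the instrumental clause "by dogs"
# supplies the deep subject.
passive = make_logical_form("bite", "dog", "person")

# Different parse trees, identical logical form graphs.
print(active == passive)  # True
```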
Further semantic processing after generation of the logical form graph may draw on knowledge databases to relate analyzed text to real world concepts in order to achieve still deeper levels of understanding. An example knowledge base would be an on-line encyclopedia, from which more elaborate definitions and contextual information for particular words can be obtained.
In the following, the three natural language processing subsystems (morphological, syntactic, and semantic) are described in the context of processing the sample input text: "The person whom I met was my friend." FIG. 1 is a block diagram illustrating the flow of information between the subsystems of natural language processing. The morphological subsystem 101 receives the input text and outputs an identification of the words and senses for each of the various parts of speech in which each word can be used. The syntactic subsystem 102 receives this information and generates a syntax parse tree by applying syntax rules. The semantic subsystem 103 receives the syntax parse tree and generates a logical form graph.
FIGS. 2-5 display the dictionary information stored on an electronic storage medium that is retrieved for the input words of the sample input text during morphological analysis. FIG. 2 displays the dictionary entries for the input words "the" 201 and "person" 202. Entry 201 comprises the key "the" 203 and a list of attribute/value pairs. The first attribute "Adj" 204 has, as its value, the symbols contained within the braces 205 and 206. These symbols comprise two further attribute/value pairs: (1) "Lemma"/"the" and (2) "Bits"/"Sing Plur Wa6 Det Art B0 Def." A lemma is the basic, uninflected form of a word. The attribute "Lemma" therefore indicates that "the" is the basic, uninflected form of the word represented by this entry in the dictionary. The attribute "Bits" comprises a set of abbreviations representing certain morphological and syntactic information about a word. This information indicates that "the" is: (1) singular; (2) plural; (3) not inflectable; (4) a determiner; (5) an article; (6) an ordinary adjective; and (7) definite. Attribute 204 indicates that the word "the" can serve as an adjective. Attribute 212 indicates that the word "the" can serve as an adverb. Attribute "Senses" 207 represents the various meanings of the word as separate definitions and examples, a portion of which is included in the list of attribute/value pairs between braces 208-209 and between braces 210-211. Additional meanings actually contained in the entry for "the" have been omitted from FIG. 2, as indicated by the parenthesized expression "(more sense records)" 213.
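The nested attribute/value structure of an entry like the one in FIG. 2 might be rendered in memory as nested dictionaries. This is a hypothetical sketch (the sense definitions are elided with placeholder strings since they are not reproduced in the text):

```python
# A hypothetical in-memory rendering of the FIG. 2 entry for "the";
# the attribute names mirror those described in the figure, and the
# sense contents are placeholders.
entry_the = {
    "key": "the",
    "Adj": {
        "Lemma": "the",
        "Bits": ["Sing", "Plur", "Wa6", "Det", "Art", "B0", "Def"],
    },
    "Adv": {
        "Lemma": "the",
    },
    "Senses": [
        {"definition": "...", "example": "..."},
        # (more sense records)
    ],
}

# The parts of speech a word can serve as are the attributes other
# than the key and the sense list.
parts_of_speech = [k for k in entry_the if k not in ("key", "Senses")]
print(parts_of_speech)  # ['Adj', 'Adv']
```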
In the first step of natural language processing, the morphological subsystem recognizes each word and punctuation symbol of the input text as a separate token and constructs an attribute/value record for each token using the dictionary information. The attributes include the token type (e.g., word, punctuation) and the different parts of speech which a word can represent in a natural language sentence.
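The tokenization step described above might look like the following. This is an illustrative sketch assuming a toy lexicon; the record fields are assumptions, not the actual record layout of the disclosed system:

```python
import re

def tokenize(text, lexicon):
    """Recognize each word and punctuation symbol as a separate token
    and build an attribute/value record for it; the fields used here
    are illustrative."""
    records = []
    for token in re.findall(r"\w+|[^\w\s]", text):
        if token.isalpha():
            records.append({
                "token": token,
                "type": "word",
                "parts_of_speech": lexicon.get(token.lower(), []),
            })
        else:
            records.append({"token": token, "type": "punctuation"})
    return records

# Toy lexicon for illustration.
lexicon = {"the": ["adjective", "adverb"], "person": ["noun"]}
recs = tokenize("The person.", lexicon)
print([r["type"] for r in recs])  # ['word', 'word', 'punctuation']
```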
The syntactic subsystem inputs the initial set of attribute/value records for the sample input text, generates from each a syntax parse tree node, and applies syntax rules to these initial nodes to construct higher-level nodes of a possible syntax parse tree that represents the sample input text. A complete syntax parse tree includes a root node, intermediate-level nodes, and leaf nodes. The root node represents the syntactic construct (e.g., declarative sentence) for the sample input text. The intermediate-level nodes represent intermediate syntactic constructs (e.g., verb, noun, or prepositional phrases). The leaf nodes represent the initial set of attribute/value records.
In certain NLP systems, syntax rules are applied in a top-down manner. The syntactic subsystem of the NLP system herein described applies syntax rules to the leaf nodes in a bottom-up manner. That is, the syntactic subsystem attempts to apply syntax rules one at a time to single leaf nodes, to pairs of leaf nodes, and, occasionally, to larger groups of leaf nodes. If a syntax rule requires two leaf nodes upon which to operate, and a pair of leaf nodes both contain attributes that match the requirements specified in the rule, then the rule is applied to them to create a higher-level syntactic construct. For example, the words "my friend" could represent an adjective and a noun, respectively, which can be combined into the higher-level syntactic construct of a noun phrase. A syntax rule corresponding to the grammar rule "noun phrase = adjective + noun" would create an intermediate-level noun phrase node and link the two leaf nodes representing "my" and "friend" to the newly created intermediate-level node. As each new intermediate-level node is created, it is linked to already-existing leaf nodes and intermediate-level nodes, and becomes part of the total set of nodes to which the syntax rules are applied. The process of applying syntax rules to the growing set of nodes continues until either a complete syntax parse tree is generated or no more syntax rules can be applied. A complete syntax parse tree includes all of the words of the input sentence as leaf nodes and represents one possible parse of the sentence.
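The node-combination step for the "my friend" example can be sketched as follows. The `Node` class and rule representation are assumptions made for illustration, not the disclosed system's actual structures:

```python
# A minimal sketch of the bottom-up combination step: a binary rule
# checks the labels of two adjacent nodes and, if they match, creates
# an intermediate-level node linked to both.
class Node:
    def __init__(self, label, children=None, word=None):
        self.label = label            # e.g. "noun_phrase"
        self.children = children or []  # links to lower-level nodes
        self.word = word              # set only on leaf nodes

def apply_binary_rule(rule, left, right):
    """rule = (parent_label, left_label, right_label); returns a new
    intermediate-level node, or None if the pair does not match."""
    parent_label, left_label, right_label = rule
    if left.label == left_label and right.label == right_label:
        return Node(parent_label, children=[left, right])
    return None

# "my friend": adjective + noun -> noun phrase
np_rule = ("noun_phrase", "adjective", "noun")
np = apply_binary_rule(np_rule,
                       Node("adjective", word="my"),
                       Node("noun", word="friend"))
print(np.label, [c.word for c in np.children])
# noun_phrase ['my', 'friend']
```

Once created, such a node would join the pool of nodes available for further rule applications, exactly as the paragraph above describes.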
This bottom-up method of syntax parsing creates many intermediate-level nodes and sub-trees that may never be included in a final, complete syntax parse tree. Moreover, this method of parsing can simultaneously generate more than one complete syntax parse tree.
The syntactic subsystem can conduct an exhaustive search for all possible complete syntax parse trees by continuously applying the rules until no additional rules can be applied. The syntactic subsystem can also try various heuristic approaches to first generate the most probable nodes. After one or a few complete syntax parse trees are generated, the syntactic subsystem typically can terminate the search because the syntax parse tree most likely to be chosen as best representing the input sentence is probably one of the first generated syntax parse trees. If no complete syntax parse trees are generated after a reasonable search, then a fitted parse can be achieved by combining the most promising sub-trees together into a single tree using a root node that is generated by the application of a special aggregation rule.
FIG. 6 illustrates the initial leaf nodes created by the syntactic subsystem for the dictionary entries displayed in FIGS. 2-5. The leaf nodes include two special nodes, 601 and 614, that represent the beginning of the sentence and the period terminating the sentence, respectively. Each of the nodes 602-613 represents a single part of speech that an input word can represent in a sentence. These parts of speech are found as attribute/value pairs in the dictionary entries. For example, leaf nodes 602 and 603 represent the two possible parts of speech for the word "The," which are found as attributes 204 and 212 in FIG. 2.
FIGS. 7-22 show the rule-by-rule construction of the final syntax parse tree by the syntactic subsystem. Each of the figures illustrates the application of a single syntax rule to generate an intermediate-level node that represents a syntactic structure. Only the rules that produce the intermediate-level nodes that comprise the final syntax tree are illustrated. The syntactic subsystem generates many intermediate-level nodes that are never included in the final syntax parse tree.
In FIGS. 7-14, the syntactic subsystem applies unary syntax rules that create intermediate-level nodes that represent simple verb, noun, and adjective phrases. Starting with FIG. 15, the syntactic subsystem begins to apply binary syntax rules that combine simple verb, noun, and adjective phrases into multiple-word syntactic constructs. The syntactic subsystem orders the rules by their likelihood of successful application, and then attempts to apply them one by one until it finds a rule that can be successfully applied to the existing nodes. For example, as shown in FIG. 15, the syntactic subsystem successfully applies a rule that creates a node representing a noun phrase from an adjective phrase and a noun phrase. The rule specifies the characteristics required of the adjective and noun phrases. In this example, the adjective phrase must be a determinate quantifier. By following the pointer from node 1501 back to node 1503, and then accessing morphological information included in node 1503, the syntactic subsystem determines that node 1501 does represent a determinate quantifier. Having located the two nodes 1501 and 1502 that meet the characteristics required by the rule, the syntactic subsystem then applies the rule to create from the two simple phrases 1501 and 1502 an intermediate-level node that represents the noun phrase "my friend." In FIG. 22, the syntactic subsystem generates the final, complete syntax parse tree representing the input sentence by applying a trinary rule that combines the special Begin 1 leaf node 2201, the verb phrase "The person whom I met was my friend" 2202, and the leaf node 2203 that represents the final terminating period to form node 2204 representing the declarative sentence.
The semantic subsystem generates a logical form graph from a complete syntax parse tree. Commonly, the logical form graph is constructed from the nodes of a syntax parse tree, adding to them attributes and new bi-directional links. The logical form graph is a labeled, directed graph. It is a semantic representation of an input sentence. The information obtained for each word by the morphological subsystem is still available through references to the leaf nodes of the syntax parse tree from within nodes of the logical form graph. Both the directions and labels of the links of the logical form graph represent semantic information, including the functional roles for the nodes of the logical form graph. During its analysis, the semantic subsystem adds links and nodes to represent (1) omitted, but implied, words; (2) missing or unclear arguments and adjuncts for verb phrases; and (3) the objects to which prepositional phrases refer.
FIG. 23 illustrates the complete logical form graph generated by the semantic subsystem for the example input sentence. Meaningful labels have been assigned to links 2301-2306 by the semantic subsystem as a product of the successful application of semantic rules. The six nodes 2307-2312, along with the links between them, represent the essential components of the semantic meaning of the sentence. In general, the logical form nodes roughly correspond to input words, but certain words that are unnecessary for conveying semantic meaning, such as "The" and "whom," do not appear in the logical form graph, and the input verbs "met" and "was" appear as their infinitive forms "meet" and "be." The nodes are represented in the computer system as records, and contain additional information not shown in FIG. 23. The fact that the verbs were input in singular past tense form is indicated by additional information within the logical form nodes corresponding to the meaning of the verbs, 2307 and 2310.
The differences between the syntax parse tree and the logical form graph are readily apparent from a comparison of FIG. 23 to FIG. 22. The syntax parse tree displayed in FIG. 22 includes 10 leaf nodes and 16 intermediate-level nodes linked together in a strict hierarchy, whereas the logical form graph displayed in FIG. 23 contains only 6 nodes. Unlike the syntax parse tree, the logical form graph is not hierarchically ordered, as is evident from the two links having opposite directions between nodes 2307 and 2308. In addition, as noted above, the nodes no longer represent the exact form of the input words, but instead represent their meanings.
Further natural language processing steps occur after semantic analysis. They involve combining the logical form graph with additional information obtained from knowledge bases, analyzing groups of sentences, and generally attempting to assemble around each logical form graph a rich contextual environment approximating that in which humans process natural language.
In the above general discussion of the morphological subsystem, the morphological subsystem was described as providing dictionary information for each input word. The morphological subsystem employs an electronic dictionary to find that information. For each input word, the morphological subsystem must find a corresponding entry or entries in the dictionary from which to obtain the information. This process of looking up input words in an electronic dictionary presents several related problems, the solution of which greatly impacts the accuracy and efficiency of the entire NLP system.
The keys of commonly-used dictionaries contain both diacritical marks and, in the case of proper nouns, upper-case letters. For example, in an English-language dictionary, there is a separate entry for the verb "resume," without accent marks, and for the noun "résumé," with accent marks. As another example, the English-language dictionary commonly contains two entries having the key "polish," representing the noun "polish" and the verb "polish," as well as two entries with the key "Polish," representing the proper noun "Polish" and the proper adjective "Polish."
Unfortunately, the cases and diacritical markings of letters in input text may not match the cases and diacritical markings of the dictionary keys that correspond to them, greatly complicating the task of finding dictionary entries during morphological analysis. For example, in input text with all upper-case letters, as well as in input text from electronic mail messages, diacritical marks are generally removed. A capitalized word lacking diacritical marks may possibly represent any of a number of lower-case normal forms. For example, the French words "élève," which means "student," and "élevé," which means "raised," both have the capitalized form "ELEVE." If capitalized text is being processed, and the French dictionary has lower-case entries, it is not clear which lower-case entry should be chosen to describe the input word "ELEVE."
Because entries in common dictionaries are generally in lower-case form, and because the case of the letters of an input word is often determined by the word's occurrence as the first word of a sentence or the word's occurrence in a title, rather than by the morphological function of the word, a morphological subsystem might first change the letters of input words to all lower case before attempting to match the word to keys in a dictionary. The process of changing all the letters to lower case is a particular type of case normalization. Removing all diacritical marks from the letters of an input word is an example of another type of normalization. The process of normalization substitutes certain letters for others in input words in order to remove unwanted distinctions between words. By normalizing to all lower case, the input words "Polish" and "polish" both become the normalized word "polish."
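Both types of normalization (lower-casing and diacritic removal) can be combined in a few lines. The sketch below uses Unicode canonical decomposition to strip combining marks; this is one plausible implementation, not necessarily the method used by the disclosed system:

```python
import unicodedata

def normalize(word):
    """Case normalization plus diacritic removal: decompose the word
    into base characters and combining marks (NFD), drop the combining
    marks, and lower-case the result."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed
                       if not unicodedata.combining(c))
    return stripped.lower()

print(normalize("Polish"))  # polish
print(normalize("polish"))  # polish
print(normalize("élevé"))   # eleve
```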
Although case normalization makes it easier for the morphological subsystem to find dictionary keys matching a word that, only because of its occurrence as the first word of a sentence, has its first letter capitalized, case normalization may cause a loss of morphological distinction based on capitalization. For example, a sentence in a book might read: "I told him to polish his shoes." Alternatively, it might read: "'Polish your shoes,' I told him." Perhaps the title of the book is "POLISH YOUR SHOES!" The normalized word for "polish," "Polish," and "POLISH" in the three sentences is "polish." However, consider the sentence: "The Polish government announced new elections today." If the word "Polish" is normalized to "polish" prior to subsequent analysis, the morphological distinction between "Polish" and "polish" is lost. In this last case, the capitalization of the word "Polish" indicates its morphological difference from the word "polish," and not its position in a sentence or a title.
The underlying problem for both loss of diacritical marks and loss of case distinction is the inefficiency of dictionary lookup caused by the need to search an electronic dictionary for multiple entries for each input word. For the French-language example given above, there is quite a large number of possible dictionary entries corresponding to the input word "ELEVE," including every possible combination of unmarked and marked letters "e" in the first, third, and fifth positions of the word. There are four lower-case letters that correspond to the upper-case letter "E": "e," "è," "ê," and "é." There are therefore 4³, or 64, different possible combinations of these four lower-case letters within the input word "ELEVE." Even if various orthographic and phonologic rules are used to eliminate certain combinations that cannot occur in the French language, 36 valid combinations remain. Dictionary lookups are expensive. Each lookup may involve one or more disk accesses. In the English-language example given above, the input word "Polish" would always require four lookups: two for the two separate entries having the key "polish," and two for the two separate entries having the key "Polish." Of course, if the morphological subsystem fails to exhaustively search for all entries related to an input word by change in case or by the addition of possibly omitted diacritical marks, it may provide an erroneous result to the syntactic and semantic subsystems, leading to an incorrect parse and logical form graph.
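The combinatorial blow-up described above (4³ = 64 candidate spellings for "ELEVE") can be verified with a short enumeration. This sketch simplifies by treating only the letter "E"; it illustrates the cost of naive lookup, not any method of the disclosed system:

```python
from itertools import product

# The four lower-case forms that the upper-case letter "E" may
# correspond to in French.
E_FORMS = ["e", "è", "ê", "é"]

def candidate_forms(word):
    """Generate every lower-case spelling obtained by replacing each
    'E' in the input word with one of its possible forms."""
    options = [E_FORMS if ch == "E" else [ch.lower()] for ch in word]
    return ["".join(combo) for combo in product(*options)]

candidates = candidate_forms("ELEVE")
print(len(candidates))             # 64
print("élève" in candidates)       # True
print("élevé" in candidates)       # True
```

Each of these 64 candidates would, in the worst case, require its own dictionary lookup, and possibly its own disk access.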
Prior art electronic dictionaries and morphological analysis subsystems failed to handle the problem of normalization of capitalized input words. A need for a method for efficiently finding all the entries in an electronic dictionary that correspond to an input word from which diacritical marks have been stripped because of transfer through electronic mail, or that correspond to an upper-case input word, has been recognized in the art of natural language processing.
The present invention is directed to a method and system for locating information in an electronic dictionary. The system creates the electronic dictionary by first generating a normalized form from the canonical form of each word to be stored in the dictionary. The canonical, or conventional, form of a word uses the appropriate upper- and lower-case letters and the appropriate diacritical marks. The canonical form of a word is the form in which the word would appear as a key for an entry in a conventional printed dictionary. The normalized form of a word has all lower-case letters and no diacritical marks. For example, "Polish" is the canonical form of the word relating to Poland, and "polish" is the canonical form of the word relating to "wax." However, the normalized form of both words is "polish." The system then stores an entry in the electronic dictionary for each unique normalized form of a word (e.g., "polish"). Each entry has a key and a record. The key is set to the normalized form of the word. For each canonical form of a word whose normalized form equals the unique normalized form, the system stores a sub-record within the record. The sub-record contains information relating to the canonical form of the word, such as the definition of that word and its part of speech. Continuing with the same example, the key for one entry would be "polish" and that entry would contain sub-records for "polish" and "Polish." To locate the information, the system receives an input word (e.g., "POLISH") and generates a normalized form of the input word. The system then searches the electronic dictionary for an entry with a key that matches the normalized form of the input word. The found entry contains a sub-record with information relating to each corresponding canonical form of the word.
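The build-and-lookup scheme described above can be sketched as follows. The record layout and the `normalize` implementation are assumptions made for illustration; the key idea, one entry per unique normalized form with one sub-record per canonical form, follows the description above:

```python
import unicodedata

def normalize(word):
    """Lower-case the word and strip its diacritical marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed
                   if not unicodedata.combining(c)).lower()

def build_dictionary(canonical_entries):
    """Store one entry per unique normalized form; each entry's record
    holds a sub-record for every canonical form that normalizes to it."""
    dictionary = {}
    for canonical, info in canonical_entries:
        dictionary.setdefault(normalize(canonical), []).append(
            {"canonical": canonical, **info})
    return dictionary

def lookup(dictionary, input_word):
    """A single lookup retrieves all sub-records, regardless of the
    input word's capitalization or missing diacritical marks."""
    return dictionary.get(normalize(input_word), [])

d = build_dictionary([
    ("polish", {"part_of_speech": "noun"}),
    ("polish", {"part_of_speech": "verb"}),
    ("Polish", {"part_of_speech": "proper noun"}),
    ("Polish", {"part_of_speech": "proper adjective"}),
])
print([r["canonical"] for r in lookup(d, "POLISH")])
# ['polish', 'polish', 'Polish', 'Polish']
```

One lookup on the key "polish" now retrieves all four sub-records, where the naive scheme required four separate lookups.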
By organizing the electronic dictionary according to normalized forms, the information relating to an input word, regardless of the presence or absence of capitalization and diacritical marks, can be found by searching for only one entry.