The present invention generally relates to the field of computerized analysis, processing and storage of natural language text, and more specifically to a method for distinguishing insignificant from significant distinctions of upper and lower case letters in a number of input word types from a natural language text.
When analyzing, processing and storing natural language text, several problems arise pertaining to the case of letters in the text. For example, when storing the word types of a large text in a database, the question arises whether a difference only in the case of a letter is relevant when distinguishing word types.
Known systems for analyzing, processing and storing word types have two general approaches to handling case distinctions. The two approaches are: (1) to obliterate all distinctions of upper and lower case in unique word types (case insensitivity), or (2) to preserve all case distinctions (case sensitivity). The first approach results in smaller inventories of word types at the cost of loss of the information conveyed by case distinctions, and the second approach results in retention of case information at the cost of larger inventories of word types.
An object of the present invention is to overcome the problem of loss of information associated with case insensitivity and the problem of large inventories of word types associated with case sensitivity, respectively, whilst at the same time maintaining the advantages of these two approaches. This object is achieved by a method for automatically distinguishing significant from insignificant variants of upper and lower case in a number of input word types according to the accompanying claims.
The invention is based on the recognition that local information, such as the occurrence and location of upper case letters in word types, together with global information, such as the occurrence of word types that only differ with respect to the case of one or more letters, can be used to determine whether the distinction of case of the letter is significant or not.
According to one aspect of the invention, a method for automatically distinguishing significant from insignificant distinctions of upper and lower case in a number of input word types by means of a computer is provided. According to the method, an input word type is assigned to one of a number of disjoint local groups based on the case, and position, of the letters that make up the word type. Furthermore, said input word type is assigned to one of a number of disjoint global groups, based on which local groups the case variants of the input word type are assigned to. Finally, cases are normalized for said input word type in accordance with predetermined rules associated with the global group said input word type is assigned to.
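By way of a non-limiting illustration, the three steps above may be sketched in Python as follows. The local group names (`lower`, `upper`, `capitalized`, `mixed`), the single entry in `RULES`, and the use of `str.lower()` to identify case variants are assumptions made for this sketch only; the actual groups and rules are defined by the claims and the detailed description.

```python
from collections import defaultdict


def local_group(word):
    """Step 1: assign a word type to one disjoint local group, using only
    the case and position of its own letters (hypothetical group names)."""
    letters = [c for c in word if c.isalpha()]
    if all(c.islower() for c in letters):      # also covers words without letters
        return "lower"
    if all(c.isupper() for c in letters):
        return "upper"
    if letters[0].isupper() and all(c.islower() for c in letters[1:]):
        return "capitalized"
    return "mixed"


# Hypothetical rule table: global group -> local group of the normal form.
# A global group is the set of local groups found among the case variants.
# Global groups without an entry are treated as significant (no normalization).
RULES = {
    frozenset({"lower", "capitalized"}): "lower",
}


def normalize_case(word_types):
    """Steps 2 and 3: derive each word type's global group from its case
    variants, then normalize according to the rule for that global group."""
    by_key = defaultdict(dict)                 # lowercased key -> {local group: variant}
    for w in word_types:
        by_key[w.lower()][local_group(w)] = w
    result = {}
    for w in word_types:
        members = by_key[w.lower()]
        target = RULES.get(frozenset(members))  # dict keys form the global group
        result[w] = members[target] if target else w
    return result


print(normalize_case(["The", "the", "NATO", "Paris"]))
# {'The': 'the', 'the': 'the', 'NATO': 'NATO', 'Paris': 'Paris'}
```

In this sketch "The" is normalized to "the" because both variants occur, whereas "NATO" and "Paris" have no case variants and are therefore preserved.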
According to this aspect of the invention, a large number of word types that have been identified in a very large text database are input to a computer. The word types are input as they appear in the text database, i.e. the cases of the letters of the word types are maintained. Thus, two word tokens in the text database that are identical except for the case of one or more letters will be input as two different word types, whereas two word tokens in the text database that are identical also in terms of the case of the letters will be input as one word type. The method, which is performed fully automatically by means of a computer, then makes use of both local information and global information regarding cases of the word types. The local information is the cases and positions of the letters that make up the word types, such as the case of the initial letter and the case of non-initial letters. As for the global information, the fact that there are word types that differ from each other only with respect to the case of one or more letters is used inventively. These word types are case variants of a common word type. It is recognized that, by determining what different case variants there are for one common word type, it is possible to determine with a reasonable level of certainty if the case difference between the case variants is significant or not and, if it is not, to which case variant the case variants should be normalized. The term assigned in “assigned to a number of disjoint local groups” and “assigned to a number of disjoint global groups” should be interpreted broadly so that it does not only cover an actual grouping of the input word types, but also a more theoretical recognition that there are different types of word types in terms of the local and global properties of concern.
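As an illustration only, the relation between word tokens, word types and case variants described above may be sketched as follows. The function names `word_types` and `case_variants` are hypothetical, and treating two word types as case variants when their `str.lower()` forms coincide is a simplifying assumption of this sketch.

```python
from collections import Counter, defaultdict


def word_types(tokens):
    """Collapse word tokens into word types with their cases maintained.
    Tokens that are identical, including case, become one word type."""
    return Counter(tokens)


def case_variants(types):
    """Group word types that differ only in the case of one or more letters."""
    groups = defaultdict(set)
    for w in types:
        groups[w.lower()].add(w)   # assumption: lowercasing identifies variants
    return dict(groups)


tokens = ["The", "cat", "saw", "the", "cat"]
types = word_types(tokens)         # "The" and "the" remain distinct word types
variants = case_variants(types)    # ...but are grouped as variants of one common type
```

Here "The" and "the" are input as two word types, while the two occurrences of "cat" are input as a single word type.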
Furthermore, the predetermined rules also include rules that detect when no normalization is to be done, which happens when the cases of letters in the word types are considered to be significant. In this way, the cases are preserved for those input word types that do not have any case variants, and for those input word types that have case variants for which the case difference is considered to be significant, whereas the cases are normalized for input word types for which the case difference is considered to be insignificant. An advantage of this method is that the number of word types that, for example, should be stored in a database, is decreased. At the same time, the information conveyed by the case is preserved when the case is considered to be significant. Thus, the size of the database is reduced, which lowers the cost of the database and increases the speed of lookup in the database.
The method is general, language independent, and applicable to character sets of languages for which standard orthography distinguishes upper and lower case of letters. The method has applications in indexing and lookup procedures in systems for information retrieval, and in lexical analysis components of systems for text analysis.
In one embodiment of the method according to the invention, the case variants of an input word type are normalized to a given case variant that is predetermined for the global group of the input word type. Thus, for each global group there is one case form that is considered to be the normal form, and all case variants of a word type of a given global group are normalized to that normal form. This is based on the recognition that different types of word types, such as names, acronyms, nouns etc., will occur in a certain set of case variants in a natural language text, and that the set of case variants of a word type that is found in a large natural language text is indicative of what type of word type the word type is.
In another embodiment of the method according to the invention, each input word type is associated with a frequency that indicates the number of occurrences of the input word type in the natural language text. The case variants of an input word type are then normalized in accordance with predetermined rules associated with (a) the global group that the input word type is assigned to, and (b) the frequency of the case variants of the input word type. Thus, in this embodiment the additional information regarding the number of times each word type has occurred in the natural language text is used in the determination of whether and how an input word type should be normalized. For example, information regarding the frequency of each case variant of a word type may indicate that the default normalization associated with the global group of the case variants should not be applied. Thus, even though there is one form in terms of cases that is considered to be the normal form to which all case variants of a word type should be normalized, this should not be done in some cases. For example, this could be the case when a case variant that is considered to be the normal form has a frequency that is significantly smaller than the frequency of another case variant. This is based on the recognition that, even if the set of case variants that a word type has in a natural language text indicates which type of word type the word type is, there are exceptions to this. These exceptions can be identified by also considering the frequency of each case variant. This enhances the performance of the method in terms of the correctness of the normalization.
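The frequency-based exception described above may, purely as an example, be sketched as follows. The function name, the `ratio` threshold of 10, and the choice to preserve all case variants when the exception applies are assumptions of this sketch; the text above only states that the default normalization should not be applied when the normal form is significantly less frequent than another case variant.

```python
def normalize_with_frequency(variant_freqs, normal_form, ratio=10.0):
    """Map each case variant to its normalized form.

    variant_freqs: dict of case variant -> number of occurrences in the text.
    normal_form:   the default normal form for the variants' global group.
    ratio:         hypothetical threshold for "significantly smaller".
    """
    normal_count = variant_freqs.get(normal_form, 0)
    strongest_other = max(
        (n for v, n in variant_freqs.items() if v != normal_form),
        default=0,
    )
    if strongest_other > ratio * normal_count:
        # Exception: the normal form is rare compared with another variant,
        # so the default normalization is not applied (variants preserved).
        return {v: v for v in variant_freqs}
    return {v: normal_form for v in variant_freqs}


common = normalize_with_frequency({"the": 950, "The": 50}, "the")
name = normalize_with_frequency({"paris": 2, "Paris": 400}, "paris")
```

With these hypothetical counts, "The" is normalized to "the", whereas "Paris" and "paris" are left unchanged because the nominal normal form "paris" is far less frequent than "Paris".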
In yet another embodiment of the method according to the invention, each input word type is associated with a sentence position that indicates whether the input word type occurred in a sentence internal position and/or in a sentence initial position in the natural language text. The case variants of an input word type are then normalized in accordance with predetermined rules referring to the global group of the input word type and to the sentence positions of the case variants of said input word type. Also in this embodiment, information regarding each specific group of case variants can be taken into account when determining whether and how an input word type should be normalized. For example, information regarding the sentence position of each case variant of a word type may indicate that the default normalization associated with the global group of the case variants should not be applied. Thus, even though there is one case form that is considered to be the normal form to which all case variants of a word type should be normalized, this should not be done in some cases. For example, when one case variant with an upper case initial letter and another case variant with a lower case initial letter both appear in internal positions of sentences in the natural language text, this indicates that the case difference is significant and that no normalization should be done. This is based on the recognition that, even if the set of case variants of a word type indicates which kind of word type the word type is, there are exceptions to this. These exceptions can be identified by also considering in which sentence positions each case variant has occurred. This enhances the performance of the method in terms of preserving significant case differences.
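The sentence-position exception given as an example above may be sketched as follows. The function name and the representation of sentence positions as sets containing the strings "initial" and/or "internal" are assumptions of this sketch, which implements only the one example rule stated above.

```python
def normalize_with_position(variant_positions, normal_form):
    """Map each case variant to its normalized form.

    variant_positions: dict of case variant -> set of sentence positions
                       ("initial" and/or "internal") in which it occurred.
    normal_form:       the default normal form for the variants' global group.
    """
    upper_internal = any(
        v[:1].isupper() and "internal" in pos
        for v, pos in variant_positions.items()
    )
    lower_internal = any(
        v[:1].islower() and "internal" in pos
        for v, pos in variant_positions.items()
    )
    if upper_internal and lower_internal:
        # Both initial-upper and initial-lower variants occur sentence
        # internally: the case difference is treated as significant.
        return {v: v for v in variant_positions}
    return {v: normal_form for v in variant_positions}


article = normalize_with_position(
    {"The": {"initial"}, "the": {"internal"}}, "the"
)
name = normalize_with_position(
    {"Bill": {"initial", "internal"}, "bill": {"internal"}}, "bill"
)
```

In this example "The" occurs only sentence initially and is normalized to "the", whereas "Bill" and "bill" both occur sentence internally and are therefore kept as distinct word types.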