The present invention generally relates to the area of automatic analysis of text, and more specifically to a method for automatic determination of whether a word type is a solid compound word.
The recognition of compound words and their constituents is essential in many applications of automatic text analysis, since it is needed for determining the word class of a compound word, matching the compound word with other words, etc. The recognition of compound words and their constituents is especially important in languages such as German, Swedish and Dutch, where compound words can be generated that do not include a blank space between their constituents, so-called solid compound words. A solid compound word is thus a solid string of characters where the constituents of the solid compound word may or may not be separated by hyphens. One reason why it is important to recognize these solid compound words and their constituents is that the number of compound words that can be generated is immense, and thus it is virtually impossible to store all possible compound words. Thus, in order to facilitate a correct analysis of solid compound words, for example within text analysis, these solid compound words and their constituents need to be recognized. For languages in which constituents of compound words are separated by blank spaces, so-called open compound words, recognition of compound words is not as difficult and their decomposition into constituents is an easy task.
A problem in known methods for recognition of compound words and their constituents is that these often lead to many different possible segmentations of one single compound word. In these cases there is no methodology to identify the most probable segmentation of the word. Furthermore, the known methods only give the ultimate constituents of a compound word and not their structural relations (modifier-head relations).
An important reason why it is difficult to recognize solid compound words and their constituents is that there are no regular rules governing whether or not there is a joining element, and which joining element should be used, when a solid compound word is formed. Thus, solid compound words and their constituents cannot be recognized by just identifying joining elements. On the other hand, because the number of solid compound words that can be generated is so immense, there is no possibility of storing all possible combinations of constituents. Even if a large number of known solid compound words were stored, the risk of encountering an unknown solid compound word would still be very high. Furthermore, the generation of compound words without spaces between their constituents may give rise to segmentation ambiguities that are not possible to resolve using known automatic methods.
The object of the invention is to provide a method, for automatic recognition of whether a word type is a solid compound word, that is not subject to the foregoing problems associated with existing methods for this task. Thus, a method for automatic determination of whether a word type is a solid compound word or not is provided, which method significantly reduces the number of stored solid compound words needed and which gives a deterministic result.
The present invention is based on the recognition that solid compound words can be divided into groups according to their word class and that, by using an iterative and hierarchical method, solid compound words can be recognized using a smaller volume of stored information, with a deterministic result. Furthermore, the present invention is based on the recognition that, by storing, for the majority of the solid compound words, only a prefix and a suffix and not every possible combination of them, the amount of information that needs to be stored can be reduced significantly.
According to an aspect of the invention a method is provided for automatic determination whether a word type is a solid compound word or not. In the method a word type is looked up in an electronically stored list of known word types. The list comprises an indication for each known word type of whether it is a known solid compound word or not. If said word type is in the list of known word types, it is determined whether the word type is a known solid compound word or not by look-up in the list of known word types. If the word type is not in the list of known word types, the word type is exhaustively divided into a prefix and a suffix and the prefix and the suffix are looked up in an electronically stored list of known prefixes of solid compound words of a word class, and an electronically stored list of known suffixes of solid compound words of said word class, respectively. This look-up is done for all possible divisions of the word type into a prefix and a suffix. If a prefix, associated with a division, is in the list of known prefixes of solid compound words of the word class and a suffix, associated with the same division, is in the list of known suffixes of solid compound words of the word class, then it is determined that the word type is a solid compound word of the word class. If the word type is not a known word type and if it has not been determined to be a solid compound word, the look-up of prefixes and suffixes is then repeated for a new word class. This is repeated until the word type has been determined to be a solid compound word of a given word class, or until all of the word classes to be tested have been tested.
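The steps of this aspect can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the claimed implementation: the names `KNOWN_WORD_TYPES`, `AFFIX_LISTS` and `classify` are hypothetical, the Swedish entries are toy data chosen for the example, and hyphens and joining elements are not handled.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical list of known word types: word type -> is it a known solid compound word?
KNOWN_WORD_TYPES: Dict[str, bool] = {
    "fotboll": True,   # known solid compound word
    "bollen": False,   # known word type that is not a compound
}

# Hypothetical per-word-class lists: word class -> (known prefixes, known suffixes).
# Dictionary insertion order is the order in which the word classes are tested.
AFFIX_LISTS: Dict[str, Tuple[Set[str], Set[str]]] = {
    "verb": ({"sam"}, {"arbeta"}),
    "noun": ({"hand", "fot"}, {"boll", "duk"}),
}


def classify(word: str) -> Tuple[bool, Optional[str]]:
    """Return (is_solid_compound, word_class).

    word_class is None when the answer comes from the list of known
    word types or when no word class matched.
    """
    # Step 1: known word types are answered by look-up only.
    if word in KNOWN_WORD_TYPES:
        return KNOWN_WORD_TYPES[word], None

    # Step 2: one word class at a time, try every division into prefix + suffix.
    for word_class, (prefixes, suffixes) in AFFIX_LISTS.items():
        for i in range(1, len(word)):
            if word[:i] in prefixes and word[i:] in suffixes:
                return True, word_class  # stop at the first matching word class

    # Step 3: all word classes tested without a match.
    return False, None
```

With this toy data, `classify("handboll")` finds the division `hand` + `boll` in the noun lists even though `handboll` itself is stored nowhere, while `classify("bollen")` is answered directly from the list of known word types and never reaches the prefix and suffix look-up.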
By first looking up the word type in a list of known word types and determining whether the word type is a known word type, known word types will not have to be subjected to the subsequent analysis of the method. This is advantageous, since it eliminates the risk of a known word type that is not a solid compound word being erroneously identified as a solid compound word in the later analysis.
As for the further analysis that is done for word types that are not known word types, this analysis is divided into an analysis for each one of a number of different word classes. By looking up prefixes and suffixes in lists comprising known prefixes and suffixes, respectively, associated with one word class at a time, the fact that solid compound words are created according to different rules for different word classes can be utilized. Together with the fact that the look-up and determination will only be done for a word class as long as the word has not been determined to be a solid compound word of another word class, this will decrease the risk of a word type being erroneously determined to be a solid compound word of one word class when it is in fact a solid compound word of another word class. Furthermore, this will eliminate the risk of a word type being erroneously determined to be a solid compound word of two or more word classes simultaneously.
Furthermore, by doing the look-up of prefixes and suffixes in lists comprising known prefixes and suffixes, respectively, the amount of information that needs to be stored in a full form word list is reduced in relation to the alternative where all possible combinations of prefixes and suffixes are stored.
In one embodiment of the method according to the invention the list of known word types further comprises an indication for each known solid compound word of its main division point, i.e. the point between two characters in the word type that divides the word type into its main constituents. For example, for a solid compound word that has two main constituents that are non-compound words, the main division point is simply between these two constituents, whereas for a compound word that has two main constituents of which one is a compound word and the other is not, the main division point will be between the compound word and the non-compound word. In this embodiment it is determined, when the word type is found in said electronically stored list of known word types, whether the word type is a known solid compound word or not in accordance with the indication in said electronically stored list of known word types. If the word type is a known solid compound word, its main division point is found in the list of known word types. Furthermore, if the word type has been determined to be a solid compound word of a word class, its main division point is determined to be between the prefix and the suffix that have been found in the list of known prefixes and the list of known suffixes, respectively, for this word class. By storing the main division point for known solid compound words, and known prefixes and suffixes for solid compound words of different word classes, the determining of the main division point gives an unambiguous result.
In another embodiment of the invention, the determination of a main division point, and thus the main constituents of a solid compound word, is extended with the determination of the binary division points internal to the main constituents. Thus, if a word type has been found to be a solid compound word, and its main division point has been determined, the method is repeated for the main constituents of the solid compound word. In this way, it will be determined whether the main constituents in turn are solid compound words. This is preferably done recursively until all of the found constituents of the word type are non-compound words. The result will then not only give all of the constituents of the word type, but also their structural relations (modifier-head relations).
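The recursive determination of division points can be illustrated with a short Python sketch. It is an assumption-laden illustration only: `KNOWN`, `CLASSES`, `main_division_point` and `decompose` are hypothetical names, the entries are toy data, and joining elements are deliberately left out.

```python
from typing import Dict, List, Optional, Set, Tuple, Union

# Hypothetical list of known word types:
# word type -> main division point (character index), or None if not a compound.
KNOWN: Dict[str, Optional[int]] = {
    "fotboll": 3,   # fot | boll
    "plan": None,   # known non-compound word
}

# Hypothetical (prefixes, suffixes) lists, one pair per word class, in test order.
CLASSES: List[Tuple[Set[str], Set[str]]] = [
    ({"skol"}, {"fotboll"}),
]

# A constituent is either a non-compound word or a (modifier, head) pair.
Tree = Union[str, Tuple["Tree", "Tree"]]


def main_division_point(word: str) -> Optional[int]:
    """Return the main division point of `word`, or None if it is not a compound."""
    if word in KNOWN:
        return KNOWN[word]
    for prefixes, suffixes in CLASSES:
        for i in range(1, len(word)):
            if word[:i] in prefixes and word[i:] in suffixes:
                return i
    return None


def decompose(word: str) -> Tree:
    """Recursively split `word` at its main division point into a binary tree."""
    point = main_division_point(word)
    if point is None:
        return word  # non-compound word: a leaf of the tree
    return (decompose(word[:point]), decompose(word[point:]))
```

With this toy data, `decompose("skolfotboll")` returns `("skol", ("fot", "boll"))`. The nesting of the tuples records the structural relations: `skol` modifies the compound `fotboll`, whose own constituents are `fot` and `boll`.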
Furthermore, in one embodiment, whenever a word type that is not in the list of known word types is determined to be a solid compound word, the electronically stored list of known word types is updated with said word type, an indication that said word type is a known solid compound word, and an indication of where the word type has its main division point. This is advantageous since look-up in the list of known word types is much faster than decomposition and look-up of prefixes and suffixes. Furthermore, since only the compound words that have actually been observed are stored in the list of known word types, the list will still include far fewer word types than if all possible combinations of prefixes and suffixes were stored in it.
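A minimal Python sketch of this update step is given below, assuming a single word class and an in-memory dictionary in place of the electronically stored list. The names `known`, `PREFIXES`, `SUFFIXES` and `lookup_and_learn` are hypothetical.

```python
from typing import Dict, Optional, Set

# Hypothetical list of known word types:
# word type -> main division point, or None if it is not a solid compound word.
known: Dict[str, Optional[int]] = {"boll": None}

# Hypothetical prefix and suffix lists for a single word class.
PREFIXES: Set[str] = {"hand"}
SUFFIXES: Set[str] = {"boll"}


def lookup_and_learn(word: str) -> Optional[int]:
    """Return the main division point of `word`, or None if it is not a compound.

    Newly recognized compounds are added to `known`, so that later
    occurrences are resolved by a single look-up.
    """
    if word in known:
        return known[word]  # fast path: plain look-up
    for i in range(1, len(word)):
        if word[:i] in PREFIXES and word[i:] in SUFFIXES:
            known[word] = i  # store the observed compound and its division point
            return i
    return None  # unknown word types that are not compounds are not stored
```

In this sketch only compounds that have actually been encountered are added, so the dictionary grows with the observed vocabulary rather than with the product of the prefix and suffix lists.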
In yet another embodiment of the method according to the invention, the steps of looking up prefixes and suffixes, and of determining that a word type is a solid compound word of a word class, are performed for word classes with more restrictive combinatorial properties before they are performed for word classes with less restrictive combinatorial properties. Furthermore, when a word type is determined to be a solid compound word of a given word class it will not be subjected to any further analysis. Thus, the risk of a solid compound word of a word class with more restrictive combinatorial properties being erroneously determined to be a solid compound word of a word class with less restrictive combinatorial properties is reduced. Preferably, these steps are performed first for solid compound names, then for solid compound verbs, and finally for other solid compound words, such as compound nouns, adjectives, and participles.
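The effect of the ordering can be shown with a small Python sketch. The list contents are invented toy data and the names `ORDERED_CLASSES` and `first_matching_class` are hypothetical. The point is only that a word type whose parts occur in the lists of several word classes is assigned to the first, most restrictive class that matches.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical (word class, prefixes, suffixes) entries, ordered from the most
# restrictive to the least restrictive combinatorial properties.
ORDERED_CLASSES: List[Tuple[str, Set[str], Set[str]]] = [
    ("name", {"stock"}, {"holm"}),
    ("verb", {"sam"}, {"arbeta"}),
    ("other", {"stock", "hand"}, {"holm", "boll"}),
]


def first_matching_class(word: str) -> Optional[str]:
    """Return the first word class, in list order, for which `word` is a compound."""
    for word_class, prefixes, suffixes in ORDERED_CLASSES:
        for i in range(1, len(word)):
            if word[:i] in prefixes and word[i:] in suffixes:
                return word_class  # no further word classes are tested
    return None
```

Here `first_matching_class("stockholm")` returns `"name"` even though the same division would also match the lists of the `"other"` class, because the name lists are consulted first and the analysis stops at the first match.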