Many natural language applications require one or more linguistic dictionaries; examples include word processing, linguistic searching, information extraction, information retrieval, spelling aids and query correction for search engines.
Building a linguistic dictionary is an enormous task. For example, the English language has over 616,500 word forms. Storing each of these word forms on a permanent storage medium of a computer requires a high volume of disk storage, which is expensive and often undesirable from both a user's and a developer's perspective. Further, locating a word quickly requires an efficient retrieval mechanism. Thus the selection of a suitable data structure for the organization, storage and retrieval of the word forms and/or phrases that form a language is critical.
Many data structures exist which allow data to be stored and retrieved in a structured way. These range from arrays and linked lists to tree-based data structures comprising a number of nodes and associated child nodes.
One type of data structure which is well suited to the storage and retrieval of linguistic data is a trie data structure. The term “trie” stems from the word “retrieval”. Trie structures are multi-way tree structures which are useful for storing strings over an alphabet, and are used to store large dictionaries of words. The alphabet used in a trie structure can be defined for the given application, for example, {0, 1} for binary files, {the 256 extended ASCII characters}, {a, b, c . . . x, y, z}, or another form of alphabet such as Unicode, which represents symbols of most world languages.
The concept of a trie data structure is that all strings with a common prefix propagate from a common node. A node has at most as many child nodes as there are characters in the alphabet, plus a terminator. A string can be followed from the root to a leaf, where a terminator ends the string. For example, an English-language dictionary can be stored in this way. A trie-based dictionary has the advantage that the data is compressed due to the common entries for prefixes (word constituents that can be concatenated at the beginning of a word) and possibly postfixes (word constituents that can be concatenated at the end of a word).
An example of a trie-based structure 100 is shown in FIG. 1. The trie-based structure 100 stores four words: “do”, “did”, “don't” and “didn't”. The trie-based structure 100 is a multi-way tree structure with a root node 101 from which child nodes 102 to 108 extend. In turn, each child node 102 to 108 can become a parent node with child nodes of its own. The nodes in the trie-based structure 100 represent characters in an alphabet, and a string of characters is represented by following a route down the trie from the root node 101 to a leaf node 104. Leaf nodes are provided by terminators 104, 105, 107, each of which marks the end of a recognized string of characters.
An example of one such route is illustrated, starting from the parent node 101, which comprises the letter ‘d’, down to the child node 104 representing the letter ‘o’. Following this route derives the string ‘do’. Because the string ‘do’ is a recognized word, the child node 104 is also a terminator node, denoting that the string ‘do’ is a recognizable word.
Similarly, the following recognized words are shown in the trie-based dictionary 100: “don't”, “did”, “didn't”. Where each valid word ends, a terminator node 104, 105, 107 is provided. The terminator node is referred to as a gloss node where the root-to-terminal path string is a valid dictionary entry.
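As a minimal sketch of the structure described above, the four words can be stored in a trie built from nested Python dictionaries. The names `TERMINATOR`, `insert`, `contains` and `count_character_nodes` are illustrative assumptions, as is the choice of `"$"` as the terminator symbol; the sketch does not reproduce the node numbering of FIG. 1.

```python
# Hypothetical sketch: a dict-based trie with an explicit terminator key.
TERMINATOR = "$"  # assumed marker; any symbol outside the alphabet would do


def insert(root, word):
    """Add a word, reusing the nodes of any prefix already in the trie."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[TERMINATOR] = {}  # terminator (gloss) node: the path is a valid entry


def contains(root, word):
    """Follow the route from the root; succeed only if a terminator is found."""
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return TERMINATOR in node


def count_character_nodes(node):
    """Count character nodes, excluding the root and the terminators."""
    return sum(1 + count_character_nodes(child)
               for ch, child in node.items() if ch != TERMINATOR)


root = {}
for w in ["do", "did", "don't", "didn't"]:
    insert(root, w)

print(contains(root, "do"), contains(root, "don"))  # True False
print(count_character_nodes(root))                  # 10
```

The four words contain 16 characters in total, but the trie needs only 10 character nodes because the prefixes "d", "do" and "did" are stored once. The string "don" is a route in the trie but has no terminator, so it is not reported as a dictionary entry.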
This type of trie-based linguistic dictionary provides low computational complexity whilst performing simple or approximate lookups.
A problem occurs in developing text processing applications for languages such as Arabic. Arabic is a highly inflected language with a complex morphology, and thus building a trie-based dictionary of a reasonable size with attached morphology information is a challenging problem.
Attaching morphology information, including part of speech, vocalized forms and normalized forms, to a list of words (surface forms) makes it difficult to achieve a dictionary contraction ratio in a trie-based structure that is comparable to the contraction ratio achievable when the trie-based structure comprises only surface forms.
Thus, building trie-based linguistic dictionaries for languages such as Arabic results in a large dictionary size. This in turn may hamper their use in industrial applications.
Existing methods for contracting trie-based linguistic dictionaries are mainly implemented by first creating a list of surface forms (possibly provided with additional information such as annotations and lemmas), next converting the list into a letter tree with common prefixes being factored out, and finally, performing contraction where common postfixes are factored out, thus arriving at a normalized form of the surface form. A normalized form accounts for typographic and inflectional variations in a word, such as the different inflection variants of the same word. For example, “table” is the normalized form of “tables”. Attaching the normalized forms of words to their corresponding surface forms is not an efficient method of storage because both the surface form and the normalized form must be stored together.
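The prefix and postfix factoring steps can be pictured with a small sketch. It assumes a dict-based letter tree with a `"$"` terminator and merges subtrees that accept the same set of postfixes, a generic technique for building a minimal acyclic automaton. The function names and the `registry` dictionary are hypothetical, and the sketch is not claimed to be the procedure used by any particular existing method.

```python
TERMINATOR = "$"  # assumed terminator symbol


def build_trie(words):
    """Letter tree: common prefixes are factored out by construction."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[TERMINATOR] = {}
    return root


def count_tree_nodes(node):
    """Number of nodes in the plain tree, including the given node."""
    return 1 + sum(count_tree_nodes(child) for child in node.values())


def merge_postfixes(node, registry):
    """Return a shared node for this subtree.

    Children are merged first, so two subtrees that accept the same
    postfixes end up with identical keys and only one copy is kept.
    """
    merged = {ch: merge_postfixes(child, registry)
              for ch, child in node.items()}
    # Children are already shared representatives held in the registry,
    # so their identity is a stable stand-in for their content.
    key = tuple(sorted((ch, id(child)) for ch, child in merged.items()))
    return registry.setdefault(key, merged)


trie = build_trie(["do", "did", "don't", "didn't"])
registry = {}
contracted = merge_postfixes(trie, registry)

print(count_tree_nodes(trie))  # 15 nodes, including root and terminators
print(len(registry))           # 8 distinct nodes after postfix merging
```

In this example the subtrees below "do" and "did" are identical (each ends a word or continues with "n't"), so they collapse into a single shared node, reducing 15 nodes to 8.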
A known method for contracting linguistic dictionaries with normalized forms is a cut-and-paste technique, which describes how many characters must be removed from the end of the surface form and which characters must then be added to arrive at the normalized form (postfix contraction). An example of this is shown in Example 1.
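The cut-and-paste idea can also be sketched in a few lines of code. The functions `encode_cut_and_paste` and `decode_cut_and_paste` are hypothetical names, and the sketch assumes that the code is derived from the longest common prefix of the two forms; the word pairs are illustrative only.

```python
def encode_cut_and_paste(surface, normalized):
    """Return (cut, paste) for a surface form and its normalized form.

    cut   -- number of characters to remove from the end of the surface form
    paste -- characters to append afterwards
    """
    common = 0
    limit = min(len(surface), len(normalized))
    while common < limit and surface[common] == normalized[common]:
        common += 1
    return len(surface) - common, normalized[common:]


def decode_cut_and_paste(surface, cut, paste):
    """Rebuild the normalized form from the surface form and its code."""
    return surface[:len(surface) - cut] + paste


print(encode_cut_and_paste("tables", "table"))  # (1, '')
print(encode_cut_and_paste("cities", "city"))   # (3, 'y')
print(decode_cut_and_paste("cities", 3, "y"))   # city
```

Many surface forms share the same short code (for instance, regular plurals all map to `(1, '')`), which is why storing the code instead of the full normalized form can contract the dictionary.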