1. Field of the Invention
The field of the present invention relates to a method, system, program, and data structure for a dense array storing character strings that provides improved compression over current array structures for storing character strings.
2. Description of the Related Art
Many operations on text require or benefit from some type of dictionary. A "dictionary" generally comprises an abstract list of character strings to represent and the data structure to store that list of character strings. A "word" is one of the individual character strings stored in the dictionary. An "alphabet" refers to the repertoire of characters needed to represent all of the words in the dictionary. For instance, a dictionary of English words has an alphabet consisting of the letters of the Latin alphabet, plus some punctuation marks, such as the hyphen and apostrophe, that can appear in the middle of a word. In French, the alphabet would also include certain accented letters used in French; in Arabic, the alphabet would consist of the Arabic alphabet, plus various Arabic punctuation marks, but not the letters from the Latin alphabet.
Dictionaries are used for spell-checking, hyphenation, and other types of morphological analysis. However, storing a long list of words (the English language, for example, has some 30,000 words in common use, out of a total vocabulary of well over 100,000 words) can consume a substantial amount of storage space and be quite cumbersome to search for a particular word.
Two general schemes to store lists of objects in computer memory include arrays and linked data structures. An array is a list where all of the items are the same size and stored contiguously in a single range of memory. Accessing a single item is direct and fast, because the address of the item can be calculated easily from the address of the array and the item's index number. A linked list, on the other hand, is a structure where each item in the list is stored in an independent range of memory along with the address of the next item in the list. The memory block containing the item data and the pointer to the next item is called a "node." Accessing items in the list can require searching numerous nodes. Typically, a search starts at the first node in the list and follows the links until the desired item is reached. This can be especially time consuming if the linked list has a substantial number of nodes, such as the number of nodes needed to represent the words in a dictionary.
One family of linked data structures is called "trees." In a tree, each node may point to two or more further nodes. The nodes a node points to are called its "children." Every node in the tree is pointed to by only one other node (its "parent"), except for the "root node," through which the entire tree is accessed. A node with no children is called a "leaf." One of the most common abstract representations of a tree data structure is called a "trie." Each node in a trie represents only a single character of the word of which it is a part. Each node can have an arbitrary number of children (in the dictionary application, the maximum number of children is the number of characters in the alphabet plus one). This approach allows redundancy to be squeezed out of a list of words because all words starting with the same letters share the nodes representing those letters.
FIG. 1 illustrates a trie implementing a tree data structure such that any node can point to any number of additional nodes. The trie in FIG. 1 stores the characters in the phrase "Now is the time for all good men to come to the aid of their country." Each child of the root node holds the first character of a word in the phrase; each child of such a node holds the second character of a word beginning with that node's character. Thus, the i children of each jth node include the (j+1)th character in the i strings having the previous j characters at the nodes on branches contiguously connecting the root node to the jth node. An empty node indicates an end of a string or word. Examples of strings that share the same nodes include "come" and "country," which share the nodes for "co"; likewise, all of the words that start with "t" are descendants of the "t" node.
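The prefix sharing described above can be illustrated with a minimal Python sketch. It is not the structure of FIG. 1 itself: the nested-dictionary layout, the lowercase spelling of the phrase, the `"#"` end-of-word marker, and the names `build_trie` and `contains` are all assumptions made for illustration.

```python
PHRASE = "now is the time for all good men to come to the aid of their country"
END = "#"  # hypothetical end-of-word marker, standing in for the empty node


def build_trie(words):
    """Build a trie as nested dicts: each key is a character, each value a child node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            # Reuse the child for this character if a previous word created it.
            node = node.setdefault(ch, {})
        node[END] = {}  # an empty child marks the end of a word
    return root


def contains(root, word):
    """Return True if word was stored as a complete word."""
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


trie = build_trie(PHRASE.split())
print(sorted(trie["c"]["o"]))   # children shared by "come" and "country"
print(contains(trie, "come"), contains(trie, "co"))
```

Because "come" and "country" walk through the same "c" and "o" nodes, those two characters are stored once, which is the redundancy reduction the text describes.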
FIG. 2 illustrates a trie represented as a binary tree also storing the characters of the phrase "Now is the time for all good men to come to the aid of their country." Each node in the trie is represented by one or more nodes in the binary tree. The root node of the trie, for example, is represented by the chain of nodes going down the right-hand side of the diagram (the nodes here have been rearranged into alphabetical order, a technique that can speed up the search). The "a" node in FIG. 1 is represented by the "i" and "l" nodes in the upper left-hand corner of FIG. 2. To search the trie represented as a binary tree as shown in FIG. 2, the algorithm would follow the below steps:
1. Begin at the root node ("a" in FIG. 2). Compare the first letter of the word to the letter in the root node. If there is a match, proceed to step 3.
2. If there isn't a match, follow the node's right link and match the letter against that node's letter. If there isn't a match, repeat this step. Proceed until either a match is found or the current node has no right link. If there is no right link, the algorithm terminates and the word isn't in the dictionary. Otherwise, continue to step 3.
3. If the matching node's letter is the "not in the alphabet" token, the algorithm terminates and the word is present in the dictionary.
4. Otherwise, follow the node's left link and advance to the next letter in the word. (If there are no more letters in the word, use the "not in the alphabet" token as the next letter.) Compare this new letter to the current node's letter and go back to step 2.
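The four steps above can be sketched in Python as follows. This is an illustrative reading of the steps, not the implementation behind FIG. 2: the `Node` class, the `"#"` character used as the "not in the alphabet" token, the lowercase phrase, and the helper names `insert_chain`, `build`, and `search` are assumptions.

```python
PHRASE = "now is the time for all good men to come to the aid of their country"
END = "#"  # assumed stand-in for the "not in the alphabet" token


class Node:
    def __init__(self, letter):
        self.letter = letter
        self.left = None    # first node for the next character position
        self.right = None   # next alternative at the same character position


def insert_chain(head, letters, i):
    """Insert letters[i:] into the right-linked chain at head; return the chain's head."""
    ch = letters[i]
    prev, node = None, head
    while node is not None and node.letter < ch:   # keep the chain in sorted order
        prev, node = node, node.right
    if node is None or node.letter != ch:
        new = Node(ch)
        new.right = node
        if prev is None:
            head = new
        else:
            prev.right = new
        node = new
    if i + 1 < len(letters):
        node.left = insert_chain(node.left, letters, i + 1)
    return head


def build(words):
    root = None
    for word in words:
        root = insert_chain(root, list(word) + [END], 0)
    return root


def search(root, word):
    letters = list(word) + [END]   # step 4: the token follows the last letter
    node, i = root, 0
    while True:
        # Steps 1 and 2: follow right links until the current letter matches.
        while node is not None and node.letter != letters[i]:
            node = node.right
        if node is None:
            return False           # no right link left: word is absent
        if node.letter == END:
            return True            # step 3: matched the token, word is present
        node = node.left           # step 4: descend and advance one letter
        i += 1


root = build(PHRASE.split())
print(root.letter, search(root, "aid"), search(root, "com"))
```

With the chains kept in sorted order, the root of this sketch is the "a" node and its left link leads to "i" followed by "l", which agrees with the description of FIG. 2 in the text.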
Another type of data structure that may be used to store strings of characters, such as the words in a dictionary, is an array or matrix. In fact, the tree data structures discussed above may be converted to an array data structure. FIG. 3 illustrates an array in which the phrase "Now is the time for all good men to come to the aid of their country" is stored. Each row is a node in the trie (the root node is row 0). Each column contains the link fields of all nodes corresponding to the letter at its head (e.g., the "o" column contains all of the link fields corresponding to the letter "o"). A period represents an empty link (e.g., the period in the "b" column of row 0 means that there are no words in the dictionary starting with "b", and the period in the "d" column of row 15 means there are no words in the dictionary beginning with "ad"). In this way, the periods represent the absence of branches in FIG. 1. Internally, the periods are represented by the value 0, since nothing can loop back to the root node. The # sign at the top of the last column is the column for the "not in the alphabet" character. A negative one (−1) in this column is a link to the "end of word" node. In this way, nodes and characters of strings may be ascertained from the array shown in FIG. 3.
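A rough Python sketch of such an array follows. It uses the conventions stated above (row 0 is the root, 0 is an empty link, −1 in the "#" column links to the end-of-word node), but the row numbering is simply the order in which nodes are created and is not expected to match FIG. 3. The names `build_array` and `lookup` and the lowercase phrase are assumptions.

```python
import string

PHRASE = "now is the time for all good men to come to the aid of their country"
COLUMNS = string.ascii_lowercase + "#"          # one column per letter, plus "#"
COL = {ch: i for i, ch in enumerate(COLUMNS)}
EMPTY, END_OF_WORD = 0, -1


def build_array(words):
    """Return a list of rows; each cell holds the row number of the child node."""
    rows = [[EMPTY] * len(COLUMNS)]             # row 0 is the root node
    for word in words:
        r = 0
        for ch in word:
            c = COL[ch]
            if rows[r][c] == EMPTY:             # no branch yet: allocate a new row
                rows.append([EMPTY] * len(COLUMNS))
                rows[r][c] = len(rows) - 1
            r = rows[r][c]
        rows[r][COL["#"]] = END_OF_WORD
    return rows


def lookup(rows, word):
    """Follow one cell per character; no sibling nodes are scanned."""
    r = 0
    for ch in word:
        if ch not in string.ascii_lowercase:
            return False
        nxt = rows[r][COL[ch]]
        if nxt == EMPTY:
            return False
        r = nxt
    return rows[r][COL["#"]] == END_OF_WORD


rows = build_array(PHRASE.split())
empty = sum(cell == EMPTY for row in rows for cell in row)
total = len(rows) * len(COLUMNS)
print(lookup(rows, "country"), lookup(rows, "count"), empty, total)
```

Each character of the search word costs one indexed access, and counting the empty cells shows how much of the array is unused.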
Arrays provide faster searching for strings than trees. With trees, the nodes of the trees must be traversed to find a node matching the first character of the subject string. However, with arrays, the first character is instantly located in row zero in the column corresponding to the first character. Subsequent characters or dependent nodes can readily be determined from the array without having to traverse, and access, non-matching nodes. Although arrays are faster to search, they are not as efficient at storing data as trees are, as there may be numerous empty cells in the array. As can be seen, the tries in FIGS. 1 and 2 have no unused nodes or cells, whereas the array in FIG. 3 has more empty cells than non-empty cells. In fact, the array in FIG. 3 is referred to as a sparse array because many array cells are empty.
Both search speed and storage requirements increase in importance as more words are included in the data structure, as is the case with word dictionaries used for looking up words or spell checking. For instance, with a dictionary, the number of bytes the data structure consumes is particularly important because the dictionary is most efficiently processed when the entire dictionary is loaded into volatile or temporary memory. There is thus a need in the art for a method, system, and program for providing an improved data structure for storing data and, in particular, storing lists of words.
To overcome the limitations in the prior art described above, preferred embodiments disclose a method, system, and program for generating a data structure in computer memory for storing strings. Each string includes at least one character from a set of characters. An arrangement of nodes is determined to store the characters such that the arrangement of the nodes is capable of defining a tree structure. An array data structure is generated to store the nodes. The array includes a row for each node and a column for each character in the set of characters. A non-empty cell identifies a node for the character indicated by the column of the cell; the content of the cell indicates the row that holds the descendant nodes of that node. The array data structure is processed to eliminate at least one row in the array data structure to reduce a number of bytes needed to represent the array data structure. In this way, the array data structure following the processing requires fewer bytes of storage space than before the processing.
In further embodiments, rows are eliminated by determining whether any two rows in the array data structure have a same value in every column. One of the two rows having the same value in every column is deleted. In this way duplicate rows are deleted. An index may be provided including an entry identifying the row in the array data structure including the content for the duplicate deleted row.
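As a hedged illustration of duplicate-row elimination, the Python sketch below makes a single pass over a small hand-built array with columns "a", "b", and "#". The names `drop_duplicate_rows` and `lookup` are hypothetical, and the choice to leave cell contents as original row numbers that are translated through the index is an assumption; the text does not specify these details.

```python
COL = {"a": 0, "b": 1, "#": 2}
EMPTY, END_OF_WORD = 0, -1

# Hand-built array for the words "ab" and "bb"; rows 3 and 4 are identical.
rows = [
    [1, 2, 0],     # row 0 (root): "a" -> row 1, "b" -> row 2
    [0, 3, 0],     # row 1: "b" -> row 3
    [0, 4, 0],     # row 2: "b" -> row 4
    [0, 0, -1],    # row 3: end of word
    [0, 0, -1],    # row 4: end of word (duplicate of row 3)
]


def drop_duplicate_rows(rows):
    """Return (compact, index); index[r] is where original row r is now stored."""
    seen = {}
    compact, index = [], []
    for row in rows:
        key = tuple(row)
        if key not in seen:            # first time this exact row content appears
            seen[key] = len(compact)
            compact.append(row)
        index.append(seen[key])        # duplicates point at the surviving copy
    return compact, index


def lookup(compact, index, word):
    r = 0                              # cells still hold original row numbers
    for ch in word:
        if ch not in ("a", "b"):
            return False
        nxt = compact[index[r]][COL[ch]]
        if nxt == EMPTY:
            return False
        r = nxt
    return compact[index[r]][COL["#"]] == END_OF_WORD


compact, index = drop_duplicate_rows(rows)
print(len(rows), len(compact), index)
```

In this sketch the index costs one small integer per original row, while each deleted row saves a full row of cells. Rows 1 and 2 differ only in the row numbers they hold, so a single pass does not treat them as duplicates.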
In still further embodiments, rows are eliminated by determining whether any two rows in the array data structure are capable of being merged. If so, the contents of each column in one of the rows are copied into the same column of the other of the two rows capable of being merged. The row whose contents were copied to the other row may then be deleted from the array. An index may be provided to identify the row in the array including the content of the deleted row, and a table may be provided indicating the columns or cells in the row in the array including the descendant nodes for the deleted row.
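One possible reading of row merging is sketched below in Python. The text does not state when two rows are "capable of being merged," so the sketch assumes they are mergeable when no column is non-empty in both. The greedy first-fit order and the names `can_merge`, `merge_rows`, and `cell` are likewise assumptions made for illustration.

```python
EMPTY = 0

# Same hand-built array as a three-column example: columns "a", "b", "#".
rows = [
    [1, 2, 0],
    [0, 3, 0],
    [0, 4, 0],
    [0, 0, -1],
    [0, 0, -1],
]


def can_merge(a, b):
    """Assumed criterion: no column is occupied in both rows."""
    return all(x == EMPTY or y == EMPTY for x, y in zip(a, b))


def merge_rows(rows):
    merged = []   # rows that remain in the array
    index = []    # index[r]: which remaining row holds original row r
    table = []    # table[r]: columns of that row that belong to original row r
    for row in rows:
        table.append({c for c, v in enumerate(row) if v != EMPTY})
        for p, target in enumerate(merged):
            if can_merge(target, row):
                for c, v in enumerate(row):
                    if v != EMPTY:
                        target[c] = v          # copy into the same column
                index.append(p)
                break
        else:
            merged.append(list(row))           # nothing fits: keep as its own row
            index.append(len(merged) - 1)
    return merged, index, table


def cell(merged, index, table, row, col):
    """Read original cell (row, col), ignoring cells owned by other merged rows."""
    return merged[index[row]][col] if col in table[row] else EMPTY


merged, index, table = merge_rows(rows)
print(len(rows), len(merged), index)
```

The table is what keeps a lookup from mistaking another row's cell for its own: a cell is honored only if its column is listed for the original row being read.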
Preferred embodiments provide a method, system, and program for deleting rows in an array data structure storing nodes of strings to reduce the storage space the array must consume to represent all the nodes of the strings and the connections between the nodes. Preferred embodiments allow the contents of a row to be merged into another row, and the merged row deleted to further reduce the storage space utilized by the array data structure. In such case, different data structures are used to determine the row and cells including the descendant nodes in the merged row. By providing data structures that do not utilize significant space, such as the index and table, rows can be deleted to conserve space in the array and the data structures would indicate the location of the descendant nodes for the deleted row in the array.
By reducing the storage space arrays consume, preferred embodiments increase the benefits of arrays over tree and trie data structures, as the arrays the preferred embodiments provide are significantly more efficient in their use of storage space and allow for comparable or faster searching than tree data structures. Further, as the size of the data structure increases, i.e., for larger dictionaries, the array of the preferred embodiments allows for significantly faster search times than a tree data structure. The search time of the preferred embodiment array data structure remains constant as the size of the dictionary increases, whereas the search time for a tree data structure increases as the size of the dictionary increases.