Computerized textual analysis is commonly performed on computerized data representing text. Such textual analysis includes information retrieval and information extraction, among other types of textual analysis. Within computerized textual analysis, dictionaries are needed to obtain grammar and definitional information regarding words and phrases included in the text.
One type of dictionary is commonly known as an exact-match dictionary. Words and phrases within text are exactly matched, on a character-by-character basis, to entries within the exact-match dictionary. In response, information is provided regarding these words and phrases, such as their grammar and their definitions. Such exact-match dictionaries are commonly implemented as trie structures, which are ordered tree data structures, or as comparable structures such as finite-state automata, common hash maps, and so on.
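As an illustration of character-by-character matching against a trie, the following is a minimal sketch only. The class names, method names, and sample entries are hypothetical and are not taken from any of the dictionaries discussed in this section.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # maps one character to the next node
        self.info = None    # entry information; None if no entry ends here


class ExactMatchDictionary:
    """Hypothetical exact-match dictionary backed by a trie."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, phrase, info):
        # Walk the trie one character at a time, creating nodes as needed.
        node = self.root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
        node.info = info

    def lookup(self, phrase):
        # Every character must match exactly; otherwise there is no entry.
        node = self.root
        for ch in phrase:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.info


dictionary = ExactMatchDictionary()
dictionary.add("run", {"grammar": "verb", "definition": "to move quickly on foot"})
dictionary.add("running shoe", {"grammar": "noun phrase", "definition": "a shoe designed for running"})
```

In this sketch, a lookup succeeds only when the whole word or phrase matches an entry. A string that is merely a prefix of an entry, or that differs in a single character such as letter case, yields no information.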
Exact-match dictionaries are commonly employed within computerized textual analysis. Examples include those available within the LanguageWare® software or platform available from International Business Machines, Inc., of Armonk, N.Y. Another example is the dictionary employed within the ChaSen Japanese language morphological textual analysis system, available from the Nara Institute of Science and Technology, located in the Takayama District in the Nara Prefecture of Kansai Science City, and which maintains an Internet web site at http://www.naist.jp/index_en.html. Both of these types of exact-match dictionaries can be used within computerized textual analysis.
Another type of dictionary is commonly known as a regular-expression dictionary. Rather than exactly matching words and phrases within text, as in an exact-match dictionary, a regular-expression dictionary employs strings that describe or match a set of strings, according to certain syntax rules. For instance, a date may be referenced in a variety of different formats, such as Jan. 1, 1970, 1 Jan. 1970, and so on, and therefore resists exact matching as in an exact-match dictionary. Regular-expression dictionaries are thus employed where an expression follows a regular pattern but can be written in a variety of different notations. Such expressions include dates, currency amounts, telephone numbers, Internet uniform resource identifiers (URIs) such as uniform resource locators (URLs), and chemical symbols.
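By way of illustration only, a regular-expression dictionary for two of the date notations mentioned above might be sketched as follows, here using the Python standard `re` module. The entry layout, the `lookup` function, and the information returned are assumptions made for this example.

```python
import re

# Hypothetical entries: each pairs a pattern with the information to return.
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\."
ENTRIES = [
    (re.compile(MONTH + r" \d{1,2}, \d{4}"),
     {"type": "date", "notation": "month day, year"}),
    (re.compile(r"\d{1,2} " + MONTH + r" \d{4}"),
     {"type": "date", "notation": "day month year"}),
]


def lookup(text):
    """Return the information of the first entry whose pattern matches all of text."""
    for pattern, info in ENTRIES:
        if pattern.fullmatch(text):
            return info
    return None
```

Under these assumptions, a single entry covers every date written in its notation, whereas an exact-match dictionary would need a separate entry for each individual date.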
Regular-expression dictionaries can be implemented by using existing regular-expression libraries. Examples of regular-expression libraries include the Java® programming language regular-expression matching library, available from Sun Microsystems, Inc., of Santa Clara, Calif., and which maintains a web site at http://java.sun.com. Another example is the International Components for Unicode for Java (ICU4J) programming language library, available from International Business Machines, Inc. However, these and other regular-expression libraries cannot by themselves be employed within computerized textual analysis.
Exact-match dictionaries can be employed within computerized textual analysis for a variety of reasons. They provide a framework that allows for the construction and operation of a programming language library, such as a Java® programming language library. For instance, they provide bindings to Java® programming language classes, and allow entries to be added to and deleted from the dictionaries via appropriate application-programming interfaces (APIs). They further provide resource management mechanisms, including loading and unloading of the dictionaries.
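The entry-management and resource-management capabilities described above can be pictured with the following minimal sketch. It is not the interface of any of the dictionaries named in this section: the class name, the method names, and the JSON file format are all assumptions chosen for illustration, and a plain hash map stands in for the underlying dictionary structure.

```python
import json
import os
import tempfile


class ManagedDictionary:
    """Hypothetical exact-match dictionary with entry and resource management."""

    def __init__(self):
        self._entries = None  # None while no dictionary is loaded

    def load(self, path):
        # Resource management: read the dictionary from a file into memory.
        with open(path, encoding="utf-8") as f:
            self._entries = json.load(f)

    def unload(self):
        # Resource management: release the in-memory dictionary.
        self._entries = None

    def add_entry(self, phrase, info):
        self._require_loaded()
        self._entries[phrase] = info

    def delete_entry(self, phrase):
        self._require_loaded()
        self._entries.pop(phrase, None)

    def lookup(self, phrase):
        self._require_loaded()
        return self._entries.get(phrase)

    def _require_loaded(self):
        if self._entries is None:
            raise RuntimeError("no dictionary is loaded")


# Usage: write a small dictionary file, then load, edit, query, and unload it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dictionary.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"text": {"grammar": "noun"}}, f)

    managed = ManagedDictionary()
    managed.load(path)
    managed.add_entry("analyze", {"grammar": "verb"})
    managed.delete_entry("text")
    found = managed.lookup("analyze")
    removed = managed.lookup("text")
    managed.unload()
```

In this sketch, any lookup attempted after `unload` raises an error, which reflects the idea that a dictionary must be loaded as a resource before it can be consulted.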
By comparison, existing regular-expression libraries, such as the libraries noted above, cannot by themselves be employed within computerized textual analysis, because they lack the capabilities that are found within exact-match dictionaries. As a result, computerized textual analysis suffers, because it cannot retrieve information regarding regular expressions commonly found within text. For this and other reasons, therefore, there is a need for the present invention.