U.S. Pat. No. 5,748,805 discloses a technique that provides translations for selected words in a source document. An undecoded document image is segmented into image units, and significant image units such as words are identified based on image characteristics or hand markings. For example, a user could mark difficult or unknown words in a document. The significant image units are then decoded by optical character recognition (OCR) techniques, and the decoded words can then be used to access translations in a data base. A copy of the document is then printed with translations in the margins opposite the significant words.
U.S. Pat. No. 5,748,805 also discloses a technique applicable in a reading machine for the blind. The user can designate key words on a key word list. The user designated key words can then be found in the document, such as by OCR techniques, and regions around the key word can be identified as significant. Significant words can be decoded using OCR techniques and supplemental data can be retrieved and provided in a Braille version or speech synthesized version to the user.
Karttunen, L., "Directed Replacement", in Proceedings of the 34.sup.th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, Calif., 1996, pp. 108-116, discloses a family of directed replace operators, written "UPPER @.fwdarw.&gt;LOWER", that can be implemented in finite state transducers (FSTs). A transducer is disclosed that transduces an input string from left to right, making only the longest possible replacement at each point. This can be thought of as a procedure that copies the input until it finds an instance of UPPER; then the procedure selects the longest matching substring, which is rewritten as LOWER, and proceeds from the end of that substring without considering any other alternatives. The result is that the input string is parsed into a sequence of substrings that are either copied to the output unchanged or replaced by some other strings. Applications include tokenizing transducers that insert an end of token mark at the end of each multiword token in a list (which may include overlapping expressions), filtering transducers that divide an input stream into regions that are passed on and regions that are ignored, and marking transducers that mark instances of a regular language without otherwise changing text.
Chanod, J.-P., and Tapanainen, P., "A Lexical Interface for Finite-State Syntax", Technical Report MLTT-025, Rank Xerox Research Centre, Grenoble Laboratory, 1996, pp. 1-22, describe a lexical interface as implemented and used for development of a French constraint grammar. The lexical interface analyzes sentences. Tokenization uses (1) a multiword expression lexicon that encodes basic multiword expressions (MWEs), numerical or time expressions, and general regular expressions and (2) a tokenizing automation that recognizes simple tokens without considering whether they belong to the language. Morphological analysis is performed using several ordered lexical transducers, referred to as a lexicon stack, with the first transducer to recognize a token determining its lexical interpretation. The top lexicon in the stack is the multiword expression lexicon described above; second are one or more real lexicons; third is a guesser that accepts anything and always gives an analysis; and, lastly, the default lexicon matches any string to a noun.
During tokenization, a sentence string is read from the left, and the longest substring matched either in the tokenizing automation or the multiword expression lexicon is identified as a token; the process iterates with the rest of the sentence. Tokens matched in the automation are then looked up in the lexicon stack to obtain morpho-syntactic information, ignoring the top lexicon; tokens matched in the top lexicon do not require further lookup, as the lexicon is a transducer that provides the relevant morpho-syntactic information. The lexicons produce a finite-state network for each token, and the analyses are concatenated to form a finite-state network with all possible readings for a sentence.
These techniques are further described in Chanod, J.-P., and Tapanainen, P., "A Non-deterministic Tokeniser for Finite-State Parsing", in ECAI'96, 12.sup.th European Conference on Artificial Intelligence, Workshop on Extended Finite-State Models of Language, 1996, pp. 10 and 11 and Chanod, J.-P., and Tapanainen, P., "A Robust Finite-State Parser for French", ESSLLI'96 Workshop on Robust Parsing, Aug. 12-16, 1996, Prague, Czech Republic.
The invention addresses problems that arise in automatically finding selected words, multi-word expressions, or other strings of characters in text. Automatic string finding is useful in a variety of applications that use information about the presence, locations, and frequencies of selected strings in text. For example, an application could insert a translation, a hypertext link, or another annotation in association with each selected string that occurs in a text.
If an automatic string finding system were to use elaborate linguistic tools such as general lexicons, part-of-speech disambiguators, and syntactic analyzers, the system would be memory- and computation-intensive and, therefore, slow and expensive. In addition, unless it were made even more complex, the system would be specific to one language, and possibly to a single domain of terminology. These problems are collectively referred to herein as the "complexity problems".
The invention is based on the discovery of new techniques for finding character strings in text. The new techniques alleviate the complexity problems, providing simpler automatic string finding. The techniques automatically search a text to find character strings in the text that match any of a list of selected strings. In doing so, the techniques perform a series of iterations, each with a starting point in the text. Each iteration determines whether its starting point is followed by a character string that matches any of the selected strings and that ends at a probable string ending. Then, the iteration finds the next iteration's starting point at a probable string beginning.
Each iteration thus performs an operation that determines whether a starting point in the text is followed by a matching character string ending at a probable string ending. Each iteration also performs an operation that finds a probable string beginning as a starting point for the next iteration. Both of these operations can be performed very rapidly with tools that are already available or can be easily constructed. Therefore, the new techniques can be readily implemented to obtain a simple, fast automatic string finder.
To find probable string beginnings and endings, for example, the new techniques can perform tokenizing, such as with a finite state tokenizer, to find probable token breaks such as word separators. Where each of the selected strings begins and ends at a token break, as would be true for words and multiword expressions, tokenization will find probable string beginnings and endings.
To determine whether a starting point is followed by one of the selected character strings, the new techniques can further use a finite state data structure, such as a finite state transducer (FST) or a finite state automation (FSA), as a lexicon of the selected strings. The finite state data structure can include acceptance data that are accessed after matching an acceptable character string. The acceptance data thus indicate that the string is acceptable. An FST could have a level that accepts only selected character strings, and an FSA can similarly accept only selected character strings.
The techniques can access the finite state data structure and the tokenizer with the string of characters that follows a starting point, seeking to match each character in turn. Based on any failure to match a character and on any acceptance data that are accessed, the techniques can determine whether the starting point is followed by an acceptable character string that ends at a probable string ending.
When a character string matches one of the selected strings and ends at a probable string ending, the new techniques can determine whether to perform an operation relating to the matching string. For example, the new techniques can determine to perform such an operation only for the longest matching string that begins at the starting point.
If the finite state data structure is an FST, the new techniques can provide the character string following the starting point to a string matching level of the FST. The FST can, if the string is acceptable, provide information output relating to the string at an information output level. For example, the information output can be information useful for annotating the accepted string or for performing another operation relating to the accepted string, or it can be access data for accessing such information.
Alternatively, the new techniques can provide the character string following the starting point to an FSA. If the character string is accepted by the FSA, the techniques can obtain a counterpart number to which the character string maps. The new techniques can then use the counterpart number to access information relating to the accepted string.
Operations that could be performed include, for example, obtaining an annotation, such as a translation, a hyperlink or other handle for accessing related information, any interactive aid for accessing related documents, or other information associated with the matching string. If the associated information is an annotation, the operation can also insert the annotation into the text.
The new techniques can further be implemented in a system that includes text data defining a text, list data defining a list of selected character strings, and a processor that performs iterations as described above to automatically search the text to find character strings that match any of the list.
The new techniques can also be implemented in an article of manufacture for use in a system that includes text data defining a text and a storage medium access device. The article can include a storage medium and list data and instruction data stored by the storage medium. The system's processor, in executing the instructions indicated by the instruction data, performs iterations as above.
The new techniques can also be implemented in a method of operating a first machine to transfer data to a second over a network, with the transferred data including list data and instruction data as described above.
The new techniques are advantageous because they can be implemented with simple, inexpensive linguistic resources such as a tokenizer and a lexicon in the form of an FST or FSA. In addition, the new techniques lend themselves to fast operation, because tokenizing is a quick preliminary stage and lexicon lookup is also fast. The new techniques are especially fast when compared with string finding techniques that depend on morphological analysis. The new techniques also can be readily adapted to new domains of terminology or to new languages, simply by producing a tokenizer and a lexicon of selected words.
The new techniques are also advantageous because they only search for selected strings, and therefore will rarely find a string that is not of interest. And the computationally simple implementations of the new techniques will rarely fail to find an occurrence of one of the selected strings.
The following description, the drawings, and the claims further set forth these and other aspects, objects, features, and advantages of the invention.