A. Field of the Invention
The present invention relates generally to information processing and, more particularly, to identifying multi-word text sequences that are semantically meaningful.
B. Description of Related Art
In some text processing applications, it can be advantageous to process multiple words in a sequence as a single semantically meaningful unit. For example, the author of the phrase “Labrador retriever” intends to refer to a specific type of dog. If this phrase were present in a search query, such as a query input to an Internet search engine, it may be desirable to process the phrase as a single semantic unit rather than as the two separate words “Labrador” and “retriever.”
Applications other than search engines may also benefit from knowledge of semantic units. Named entity learning, segmentation in languages that do not separate words with spaces (e.g., Japanese and Chinese), and article summarization are examples of applications that may use semantic units.
Thus, there is a need in the art for automatically recognizing semantic units within one or more textual documents.