The invention relates to methods and systems for computer processing of natural languages.
Commercial interest in computer-based human language processing has been steadily increasing in recent years. Globalization and the widespread use of the Internet are driving the development of automated translation technology, while progress in robotics and software engineering is fueling growth in the area of human-machine interfaces and automated document processing.
Language processing applications often use computer-readable linguistic knowledge bases (LKBs) containing information on the lexicon, grammar, and linguistic structure of natural languages, as well as on linguistic correspondences and/or other relations between natural languages.
Tasks such as automated translation have long been considered difficult because of the diversity, inherent ambiguity, context-sensitivity, and redundancy of language, and also because of the implied (non-explicit) content of human communication. Multiword expressions (e.g., idiomatic phrases) represent a particular challenge. Phrases that express the same notion or message in different languages may not share equivalent words or syntactic structure. The meaning of some phrases, such as “to kick the bucket” or “to pass the buck”, may differ substantially from the combined meanings of their individual words.
Common approaches to natural language processing include dictionary-based, example-based, and corpus-based methods. Dictionary-based methods involve the creation of lexical knowledge bases. Example-based methods aim to create large collections of example phrases, and to match incoming text to the stored examples. Corpus-based methods often employ statistical models of relationships between words and other linguistic features.
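By way of illustration only, the example-based approach described above may be sketched as follows. The sketch is a minimal one under stated assumptions and does not represent any particular prior system: the stored examples, the plain-language glosses, the function name `best_example`, and the similarity threshold of 0.6 are all hypothetical, and word-level sequence similarity from the Python standard library stands in for whatever matching criterion a real system would use.

```python
from difflib import SequenceMatcher

# Hypothetical example store: stored phrase -> plain-language gloss.
EXAMPLES = {
    "to kick the bucket": "to die",
    "to pass the buck": "to shift responsibility to someone else",
}

def best_example(text, examples=EXAMPLES, threshold=0.6):
    """Return (stored_phrase, gloss, score) for the closest stored example,
    or None if no example reaches the (assumed) similarity threshold."""
    words = text.lower().split()
    best = None
    for phrase, gloss in examples.items():
        # Compare word sequences rather than characters, so that a match
        # reflects shared words in the same order.
        score = SequenceMatcher(None, words, phrase.split()).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (phrase, gloss, score)
    return best

if __name__ == "__main__":
    print(best_example("kick the bucket"))
    print(best_example("completely unrelated words"))
```

In this sketch the incoming text "kick the bucket" is matched to the stored idiom as a whole, so its gloss is retrieved without interpreting the individual words, whereas text sharing no words with any stored example yields no match.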