To ease the burden of sifting through the enormous volume of electronically available information, modern computer systems and other machines are often used to extract meaningful content from stores of information and to organize the content for a human operator. Many information dispensing services employ some sort of language analyzer for this purpose.
Machine-implemented language analyzers are usually one of two general types: referential analyzers and mathematical analyzers. Referential analyzers (also called semantic analyzers) typically use a combination of syntactic analysis and definitional analysis to identify significant phrases in a document. Syntactic analysis is used to parse paragraphs, sentences or other sequences of words into phrases and to remove conceptually insignificant terms, such as conjunctions, articles and prepositions. Definitional analysis involves identifying significant phrases by reference to the dictionary definitions of the terms constituting each phrase. Typically, numeric weights are assigned to the words in a phrase according to their definitional significance, and the average, sum or some other combination of the weights is used to represent the definitional significance of the phrase. Because the definitional significance of the phrase is expressed as a numeric value (sometimes called a “relevance code”), numeric thresholds can often be used to discriminate between significant and insignificant phrases according to application needs.
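The weighting and thresholding described above can be illustrated with a minimal sketch. It is not taken from any particular analyzer: the word-weight table, the stop-word list, the function names and the 0.5 threshold are all hypothetical, and splitting text at stop words and punctuation is only a crude stand-in for real syntactic analysis.

```python
import re

# Hypothetical word-weight database (assumption); a real referential analyzer
# would consult a much larger, dictionary-derived store.
WORD_WEIGHTS = {"neural": 0.9, "network": 0.8, "training": 0.7, "good": 0.2, "thing": 0.1}

# Conceptually insignificant terms: conjunctions, articles and prepositions.
STOP_WORDS = {"a", "an", "the", "and", "or", "but", "of", "in", "on", "for", "to", "with"}


def split_phrases(text):
    """Crude stand-in for syntactic analysis: break text at stop words and punctuation."""
    phrases, current = [], []
    for token in re.findall(r"[a-z]+|[.,;:!?]", text.lower()):
        if token in STOP_WORDS or not token.isalpha():
            if current:
                phrases.append(current)
                current = []
        else:
            current.append(token)
    if current:
        phrases.append(current)
    return phrases


def relevance_code(phrase, default_weight=0.0):
    """Average the per-word weights; note that each word costs one lookup."""
    return sum(WORD_WEIGHTS.get(word, default_weight) for word in phrase) / len(phrase)


def significant_phrases(text, threshold=0.5):
    """Keep only the phrases whose relevance code meets the threshold."""
    return [" ".join(p) for p in split_phrases(text) if relevance_code(p) >= threshold]


if __name__ == "__main__":
    print(significant_phrases("The training of a neural network and a good thing."))
```

With this toy table, "training" and "neural network" pass the threshold while "good thing" does not. Using the sum instead of the average would favor longer phrases, so the choice of combination is an application decision.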
Referential analyzers suffer from a number of disadvantages, due mostly to their reliance on the definitional significance of words. First, some sort of database of words and their respective numeric weights is usually required. The database consumes memory and makes for relatively slow linguistic analysis because a separate database search is usually required for each word in a phrase. Another disadvantage of referential analyzers is that they are language dependent, requiring a different database of words for each language as well as specialized databases for different industries and fields. This places a significant burden on developers of referential analyzers and limits the applicability of systems that incorporate referential analyzers to the particular languages for which word databases are provided.
Mathematical analyzers perform linguistic analysis by measuring the relative frequency of occurrence of stemmed words. A stemmed word is a word that has been reduced to its root form by removing inflectional elements (e.g., indications of plurality, tense, case and so forth) and by truncating declensional and conjugative forms of the word. Groups of stemmed words that occur more frequently than other stemmed words are considered to be significant phrases.
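As a rough illustration of frequency-based analysis, the following sketch stems each word through a lookup table and then counts groups of adjacent stemmed words. The stem table, the function names, the use of adjacent pairs as "groups" and the rule of comparing each count against a multiple of the mean count are all assumptions made for the example, not a description of any specific analyzer.

```python
import re
from collections import Counter

# Hypothetical stem database (assumption) mapping inflected forms to root forms.
STEM_DB = {
    "analyzers": "analyzer", "analyzer": "analyzer",
    "languages": "language", "language": "language",
    "parsed": "parse", "parses": "parse", "parsing": "parse",
    "documents": "document", "document": "document",
}


def stem(word):
    """One database lookup per word; unknown words pass through unchanged."""
    return STEM_DB.get(word, word)


def frequent_stem_groups(text, group_size=2, min_ratio=2.0):
    """Return groups of adjacent stems that occur at least min_ratio times the mean group count."""
    stems = [stem(w) for w in re.findall(r"[a-z]+", text.lower())]
    groups = Counter(
        tuple(stems[i:i + group_size]) for i in range(len(stems) - group_size + 1)
    )
    if not groups:
        return []
    mean = sum(groups.values()) / len(groups)
    return [(" ".join(g), n) for g, n in groups.most_common() if n >= min_ratio * mean]


if __name__ == "__main__":
    sample = (
        "Language analyzers parse documents. A language analyzer parses a document. "
        "The language analyzer parsed the documents."
    )
    print(frequent_stem_groups(sample))
```

On the sample text, "language analyzer" and "analyzer parse" each occur three times and stand out against the other pairs, which occur once. The example depends entirely on the stem table covering the inflected forms it encounters.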
Unfortunately, mathematical analyzers suffer from many of the same disadvantages as referential analyzers. A database of stemmed words and their various inflected forms is usually required. As with referential analyzers, the database consumes memory and makes for relatively slow linguistic analysis because a separate database search is usually required for each word in a phrase to determine whether there is a corresponding stemmed word. Mathematical analyzers are also language dependent and require a different database of words for each language. As with referential analyzers, the language dependence of mathematical analyzers places a significant burden on developers of mathematical analyzers and limits the applicability of systems that incorporate mathematical analyzers to the particular languages for which stemmed word databases are provided.