Increases in computer storage capacity, transmission rates and processing speed mean that many large and important collections of data are now available electronically, for example via bulletin boards, mail, and on-line texts, documents and directories. While many of the technological barriers to information access and display have been removed, the human/system interface problem of locating what one really needs in these collections remains. Methods for storing, organizing and accessing this information range from electronic analogs of familiar paper-based techniques, such as tables of contents or indices, to richer associative connections that are feasible only with computers, such as hypertext and full-content addressability. While these techniques may provide retrieval benefits over the prior paper-based techniques, many advantages of electronic storage remain unrealized. Most systems still require a user or provider of information to specify explicit relationships and links between data objects or text objects, which makes the systems tedious to use or to apply to large, heterogeneous computer information files whose content may be unfamiliar to the user.
The retrieval of information using keyword matching is considered here as one standard approach whose difficulties and deficiencies are representative of conventional approaches. This technique depends on matching individual words in a user's request with individual words in the total database of textual material. Text objects that contain one or more words in common with those in the user's query are returned as relevant. Keyword-based retrieval systems of this kind are, however, far from ideal. Many objects relevant to the query may be missed, and unrelated objects are often retrieved.
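The keyword-matching technique described above can be illustrated with a minimal sketch. The function names `tokenize` and `keyword_match`, the tokenization rule (lowercased alphabetic words), and the three sample text objects are assumptions made for illustration only; they are not taken from any particular retrieval system.

```python
import re


def tokenize(text):
    """Lowercase a string and return the set of alphabetic words it contains."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_match(query, documents):
    """Return the ids of text objects sharing at least one word with the query."""
    query_words = tokenize(query)
    return [doc_id for doc_id, text in documents.items()
            if query_words & tokenize(text)]


# Hypothetical text objects, invented for this illustration.
documents = {
    "d1": "Loans and savings accounts offered by the bank",
    "d2": "Erosion along the river bank after spring floods",
    "d3": "Interest rates at credit unions and other lenders",
}

# A searcher interested in financial institutions asks for "bank".
hits = keyword_match("bank", documents)
print(hits)
```

In this sketch the query returns `d1` and `d2`. The text object `d2` is retrieved although it concerns a river bank, and `d3` is missed although it concerns financial institutions, because it shares no word with the query.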
The fundamental deficiency of current information retrieval methods is that the words a searcher uses are often not the same as those by which the information sought has been indexed. There are actually two aspects to the problem. First, there is a tremendous diversity in the words people use to describe the same object or concept; this is called synonymy. Users in different contexts, or with different needs, knowledge or linguistic habits, will describe the same information using different terms. For example, it has been demonstrated that any two people choose the same main keyword for a single, well-known object less than 20% of the time on average. Indeed, this variability is much greater than commonly believed, and it places strict, low limits on the expected performance of word-matching systems.
The second aspect relates to polysemy, a word having more than one distinct meaning. In different contexts, or when used by different people, the same word takes on varying referential significance (e.g., "bank" in river bank versus "bank" in a savings bank). Thus, the use of a term in a search query does not necessarily mean that a text object containing or labeled by the same term is of interest.
Because human word use is characterized by extensive synonymy and polysemy, straightforward term-matching schemes have serious shortcomings: relevant materials will be missed because different people describe the same topic using different words and, because the same word can have different meanings, irrelevant material will be retrieved. The basic problem may be simply summarized by stating that people want to access information based on meaning, but the words they select do not adequately express intended meaning. Previous attempts to improve standard word searching and overcome the diversity in human word usage have involved: restricting the allowable vocabulary and training intermediaries to generate indexing and search keys; hand-crafting thesauri to provide synonyms; or constructing explicit models of the relevant domain knowledge. Not only are these methods expert-labor intensive, but they are often not very successful.
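One of the prior remedies mentioned above, a hand-crafted thesaurus, can be sketched as query expansion layered on top of word matching. The `THESAURUS` entries, the function names `expand` and `search`, and the sample text objects are all hypothetical and chosen only to illustrate the idea; a practical thesaurus would be far larger and compiled by domain experts.

```python
import re

# Hypothetical hand-built thesaurus: each entry had to be supplied by a person.
THESAURUS = {
    "car": {"automobile", "vehicle"},
    "bank": {"lender"},
}


def tokenize(text):
    """Lowercase a string and return the set of alphabetic words it contains."""
    return set(re.findall(r"[a-z]+", text.lower()))


def expand(query_words, thesaurus):
    """Add the listed synonyms of every query word to the query."""
    expanded = set(query_words)
    for word in query_words:
        expanded |= thesaurus.get(word, set())
    return expanded


def search(query, documents, thesaurus=None):
    """Word-match the query against the documents, optionally after expansion."""
    words = tokenize(query)
    if thesaurus is not None:
        words = expand(words, thesaurus)
    return [doc_id for doc_id, text in documents.items()
            if words & tokenize(text)]


# Hypothetical text objects, invented for this illustration.
documents = {
    "d1": "Automobile maintenance schedule",
    "d2": "Erosion along the river bank",
    "d3": "Used car prices",
}

plain_hits = search("car", documents)                 # word matching only
expanded_hits = search("car", documents, THESAURUS)   # with synonym expansion
bank_hits = search("bank", documents, THESAURUS)      # polysemous query term
print(plain_hits, expanded_hits, bank_hits)
```

Under these assumptions, expansion recovers `d1` for the query "car" because someone listed "automobile" as a synonym in advance, which reflects the expert labor involved. The query "bank" still retrieves the river-bank text object `d2`, since adding synonyms does nothing to separate the distinct meanings of a word.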