Information Retrieval (IR) is the science of helping a user find text or other media in a large collection of documents. The user usually does this by entering a query. A search engine takes the query and evaluates it against the collection of documents. Usually, the result of this evaluation is a single numeric score for each document, and the document with the highest or lowest score is the first document retrieved. Multiple documents can be retrieved and sorted according to this score, allowing the user to see a number of possible matches for what they were looking for.
There are two major types of queries in IR—structured and unstructured. In a structured query, the query must obey a predefined syntax known as a query language. SQL is one of the most widely used query languages. Query languages depend on a pre-defined structured representation of the data, which the user must specify. Since it is extremely difficult to form a consistent and sophisticated representation of the data from natural language, applying query languages to text search is a very difficult task.
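As a rough illustration of why a structured query depends on a predefined representation of the data, the following sketch uses Python's built-in sqlite3 module. The table name, column names, and rows are hypothetical and chosen only for this example; the point is that the query can only refer to fields that were defined in advance.

```python
import sqlite3

# In-memory database; the schema below is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents ("
    "id INTEGER PRIMARY KEY, author TEXT, year INTEGER, title TEXT)"
)
conn.executemany(
    "INSERT INTO documents (author, year, title) VALUES (?, ?, ?)",
    [
        ("Smith", 1998, "Early Indexing Methods"),
        ("Smith", 2004, "Vector Space Retrieval"),
        ("Jones", 2004, "Query Languages in Practice"),
    ],
)

# The query must follow SQL syntax and name columns that exist in the schema.
rows = conn.execute(
    "SELECT title FROM documents WHERE author = ? AND year >= ? ORDER BY year",
    ("Smith", 2000),
).fetchall()
titles = [row[0] for row in rows]
print(titles)
conn.close()
```

Free text has no such columns to query against, which is the difficulty the paragraph above describes.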
Many modern search engines, such as Bing, Google, or AltaVista, use unstructured queries, where both documents and queries are represented as a mathematical structure built from a concatenation of words. One of the most commonly used structures is the vector, where each element in the vector is a function of a word's frequency in the document, and distance metrics between these vectors are used to measure how similar the query and the document are. This approach is often referred to as “bag of words.”
One of the major limitations of this model is that it does not take into account the order of words in a sentence. Whether the user typed in “George Bush likes broccoli” or “broccoli likes George Bush,” the results would be the same. However, the two sentences express different logical relationships: in the first sentence, “George Bush” is the subject and “broccoli” is the object, and in the second sentence, these roles are reversed. Although some systems, like Watson and Lexis-Nexis, have rudimentary accommodation for these relationships, the dominant vector space model can only handle them in a very unstable and brittle fashion. This is because each word must be indexed not only according to its lexical identity, but also by its role in the sentence, i.e. “George Bush as subject,” “George Bush as object,” etc. With so many combinations of words and roles, the size of the vectors grows rapidly. Moreover, if the query sentence is “The President likes broccoli,” the term “President as subject” will not match “George Bush as subject.”
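The limitation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization, raw word counts, and cosine similarity as the comparison; the function names and the hand-assigned role labels are hypothetical and are not taken from any particular search engine.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each word."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

# Word order is discarded, so the two sentences get identical vectors.
s1 = bag_of_words("George Bush likes broccoli")
s2 = bag_of_words("broccoli likes George Bush")
same_bag = s1 == s2
order_similarity = cosine_similarity(s1, s2)

# Role-tagged terms (labels assigned by hand for this example) separate the
# two sentences, but a paraphrased subject no longer matches at all.
doc_roles = Counter({"george bush|subject": 1, "likes|predicate": 1, "broccoli|object": 1})
query_roles = Counter({"the president|subject": 1, "likes|predicate": 1, "broccoli|object": 1})
role_similarity = cosine_similarity(doc_roles, query_roles)

print(same_bag, order_similarity, role_similarity)
```

In this sketch the two reordered sentences are indistinguishable, while the role-tagged query loses part of its match only because “the president” and “george bush” are different index terms.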
Some systems attempt to arrive at a logical representation of the sentence by looking at words as they appear in order. While this approach is valid for simple sentences, it fails with complex sentences. For example, in the sentence “Dari, the language of the elite in Afghanistan, is a dialect of Modern Persian,” word order alone cannot tell the user that “Dari” is the subject and “is a dialect” is the predicate. In fact, some systems, due to their text-cleaning processes, may decide that “Afghanistan” is the subject and “is Modern Persian” is the object. Moreover, if a word were added to the sentence, it could completely throw off the comparison.
Currently, users searching for documents must either accept a high recall (large number of relevant documents returned) with low precision (low proportion of results are relevant) using “bag of words” approaches, or low recall (few documents returned) with high precision (high proportion of results are relevant) provided by relational approaches. The first option may provide the desired documents, but the desired documents may be buried in a haystack of irrelevant material that can take a lot of time to review. The second option may provide relevant results, but some other relevant results may be missed if the query is not correctly structured relative to the way the data is stored. The desire is to achieve high recall and high precision.
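The trade-off between recall and precision can be made concrete with a small calculation. The sketch below assumes a hypothetical collection in which four documents are relevant; the document identifiers and the two result lists are invented for illustration only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a list of retrieved document ids."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ground truth: four relevant documents in the collection.
relevant = {"d1", "d2", "d3", "d4"}

# A broad result list finds every relevant document but buries them
# among irrelevant ones.
broad = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]

# A narrow result list returns only relevant material but misses most of it.
narrow = ["d1"]

broad_precision, broad_recall = precision_recall(broad, relevant)
narrow_precision, narrow_recall = precision_recall(narrow, relevant)
print(broad_precision, broad_recall)
print(narrow_precision, narrow_recall)
```

Here the broad list stands in for the high-recall, low-precision case and the narrow list for the low-recall, high-precision case.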
The bag-of-words approach may be improved by using latent semantic indexing (LSI) techniques. In LSI, a document is represented as a vector of real numbers, where each element corresponds to a word. A zero in an element means that the word is not present in the document, and a nonzero value means that the word is present. The magnitude of this value is usually a function of the word's frequency in the document, typically a count of that word, normalized in some way. In LSI, a mathematical approach called Singular Value Decomposition (SVD) is used to transform the vector space and effectively reduce the dimensionality of the document vectors, while preserving many of the meaningful characteristics of the documents in terms of the words used. A distance metric between vectors, such as Euclidean distance, indicates how different two documents are from one another in terms of the words used. In a search engine, a query vector is compared with one or more document vectors, and the documents whose vectors minimize this distance are returned.
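A minimal sketch of this pipeline is given below, assuming NumPy is available. The three documents, the query, the use of raw unnormalized counts, and the choice of k = 2 latent dimensions are all assumptions made for illustration; a practical system would normally apply some normalization and use a far larger collection.

```python
import numpy as np

# Hypothetical toy collection and query.
docs = [
    "dari is a dialect of persian",
    "persian is a language of iran",
    "broccoli is a green vegetable",
]
query = "dialect of persian"

# Vocabulary: one vector element per distinct word in the collection.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}

def count_vector(text):
    """Raw word counts over the vocabulary (no normalization here)."""
    vec = np.zeros(len(vocab))
    for word in text.split():
        if word in index:
            vec[index[word]] += 1.0
    return vec

# Document-by-term matrix: one row per document.
A = np.array([count_vector(doc) for doc in docs])

# SVD: A = U * diag(s) * Vt. Keeping the first k right singular vectors
# gives a k-dimensional latent space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Vk = Vt[:k].T                      # terms x k projection matrix

doc_latent = A @ Vk                # documents in the reduced space
query_latent = count_vector(query) @ Vk

# Euclidean distance from the query to each document; smallest first.
distances = np.linalg.norm(doc_latent - query_latent, axis=1)
ranking = np.argsort(distances)
for i in ranking:
    print(round(float(distances[i]), 3), docs[i])
```

Projecting the query with the same matrix as the documents places both in the same reduced space, so the distances are comparable. Note that every vector here still describes a whole document or query as an unordered set of words.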
The fundamental unit of data in LSI is the document. Thus, the nuances of language present in sentences (both query sentences and target sentences) are ignored. LSI does not utilize a representation of a sentence that is syntactic and semantic. That is, it does not provide a hierarchical representation of dependencies among parts of the sentence.