I. Technical Field
The present disclosure generally relates to the field of search assistance technologies, such as query-recommendation systems. More particularly, the disclosure relates to computerized systems and methods for determining the similarity between terms, phrases, or documents in order to provide relevant search results based on a query.
II. Background Information
The advent of the Internet has resulted in collections of networked computer systems through which users can access vast amounts of information. The information accessible through the Internet is stored in electronic files (e.g., documents) under control of the computer systems. With advances in storage technology, the amount of information stored on each computer system has increased dramatically. Due to this increasing volume of information, as well as the sheer number of documents being stored on computer systems, it is becoming more difficult than ever to locate information that is relevant to a particular subject.
To locate documents that are relevant to a particular subject, a user may conduct a search using an information retrieval system, typically referred to as a search engine. Search engines attempt to locate and index as many of the documents provided by as many computer systems of the Internet as possible. In the past, search engines would typically perform a Boolean search based on terms entered by a user and return every document containing all of those terms, without any relevancy ranking of the search results.
More recently, some search engines have permitted users to perform a search and to filter the results according to algorithms that implement a ranking system, where the ranking assists a user in identifying relevant documents. Query-recommendation systems and filtering, navigational and visualization technologies such as de-duplication, classified displays, and clustered displays have also been provided to assist users in finding and identifying relevant documents pertaining to their search terms. Clustering technologies, for example, present users with search results that are organized in clusters. The user can then select clusters deemed relevant to a search, thus significantly reducing the amount of information for a user to sort through.
Search engines that are based on Salton's Vector Space Model implement another method to filter search results. The Vector Space Model represents each document essentially as a “bag of words” and creates a histogram, or vector, of terms by frequency of occurrence, with no particular attention given to the order of the terms. In matrix notation, each document is a vector containing primitive data types, such as strings or numbers, that represent term-frequency counts, and the document collection is a term-by-document (TxD) matrix. Relevancy scores can be computed by performing matrix multiplication operations on the TxD matrix, and the search engine can then rank documents based on these relevancy scores.
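By way of illustration only, the bag-of-words representation and the matrix-based scoring described above might be sketched as follows. The function names, the whitespace tokenizer, the sample documents, and the use of a plain dot product as the relevancy score are assumptions made for this sketch and are not taken from any particular search engine.

```python
from collections import Counter


def build_term_document_matrix(docs):
    """Return (vocab, matrix), where matrix[i][j] is the count of term i in document j."""
    # Assumption: lowercase whitespace tokenization stands in for real text analysis.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def score_query(query, vocab, matrix):
    """Multiply the query's term-frequency vector by the TxD matrix (one score per document)."""
    q = Counter(query.lower().split())
    qvec = [q[term] for term in vocab]
    n_docs = len(matrix[0]) if matrix else 0
    return [
        sum(qvec[i] * matrix[i][j] for i in range(len(vocab)))
        for j in range(n_docs)
    ]


# Hypothetical three-document collection.
docs = ["apple banana apple", "banana cherry", "cherry cherry apple"]
vocab, tdm = build_term_document_matrix(docs)
scores = score_query("apple cherry", vocab, tdm)

# Rank document indices from highest to lowest relevancy score.
ranking = sorted(range(len(docs)), key=lambda j: scores[j], reverse=True)
```

In this sketch, word order within a document has no effect on its column in the matrix, which reflects the “bag of words” property. Practical systems commonly apply term weighting and length normalization rather than raw counts.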
However, performing matrix operations for similarity, especially on large document collections, can be computationally expensive. The computational complexity of multiplication over an m×n TxD matrix is mn². For example, for a document collection containing 1,000 documents and 5,000 unique terms, the computational complexity immediately runs to the order of 10⁹. At this scale, the computational time for matrix operations can extend to minutes or hours, even on modern supercomputers.
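The arithmetic behind this example can be checked with a short sketch. It assumes that m is the number of terms (rows), that n is the number of documents (columns), and that the multiplication in question is the document-by-document similarity product of the matrix with its own transpose. The text does not state which product is meant, so this reading and the function names are assumptions.

```python
def estimated_multiplications(m, n):
    """Closed-form m * n**2 estimate of scalar multiplications."""
    return m * n ** 2


def counted_multiplications(m, n):
    """Count multiplications in a naive n-by-n similarity product over m terms."""
    count = 0
    for _doc_a in range(n):          # each document ...
        for _doc_b in range(n):      # ... paired with each document
            for _term in range(m):   # one multiplication per shared term slot
                count += 1
    return count


# Small sanity check that the naive loop matches the closed form.
small_check = counted_multiplications(5, 3) == estimated_multiplications(5, 3)

# Figures from the example: 5,000 unique terms and 1,000 documents.
cost = estimated_multiplications(5000, 1000)
print(f"{cost:.1e}")  # prints 5.0e+09
```

Under these assumptions the estimate is 5 × 10⁹ scalar multiplications, which agrees with the order of magnitude given above.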
Accordingly, vector space search engines suffer from drawbacks that limit their efficiency and usefulness. Therefore, there is a need for improved search systems and methods for determining relevancy of documents which can yield results in a more efficient manner.