Conventional search tools return a list of documents in response to a search query. The documents in the list may be ranked according to their relevance to the search query. For example, highly relevant documents may be ranked higher than, and may be displayed in the list above, documents of lesser relevance. This allows a user to quickly and conveniently identify the most relevant documents retrieved in response to the query.
Some conventional search tools allow a user to perform a query using natural language. For example, LexisNexis® uses Freestyle™ to enable users to submit query terms associated with a case or legal concept. The search tool then returns a ranked list of legal documents matching the query terms. The search tool may rank the legal documents based upon the number of times the query terms appear in each legal document. For example, the term “patent” may occur in a first document 50 times, and may occur in a second, similarly sized document 10 times. If the user entered a query for “patent,” the search tool would deem the first document to be more relevant than the second document because it includes the term “patent” more times. In this instance, term frequency and document size are used to determine ranking, and the search tool would therefore assign the first document a higher ranking than the second document.
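The frequency-based ranking described above can be sketched as follows. This is a minimal illustration only, not the method of any particular search tool: the function names, the sample documents, and the choice to divide the term count by document length (one way of taking document size into account) are all assumptions made for this example.

```python
import re


def term_count(text: str, term: str) -> int:
    """Count whole-word, case-insensitive occurrences of `term` in `text`."""
    pattern = rf"\b{re.escape(term)}\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def rank_by_term_frequency(documents: dict[str, str], term: str) -> list[tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least relevant.

    The score is the term count divided by the number of words in the
    document. This normalization is an assumption of the sketch.
    """
    scored = []
    for name, text in documents.items():
        length = len(text.split())
        score = term_count(text, term) / length if length else 0.0
        scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical documents of equal size (100 words each), mirroring the
# 50-occurrence and 10-occurrence example in the text.
documents = {
    "first": "patent " * 50 + "filler " * 50,
    "second": "patent " * 10 + "filler " * 90,
}

ranking = rank_by_term_frequency(documents, "patent")
print(ranking)
```

Under these assumptions, the first document is ranked above the second, consistent with the example in the text.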
With more complex queries, search tools may use word vectors when comparing a query with a document. Generally, a vector can be represented as a line segment with a direction and a magnitude. In a two-dimensional space, a two-dimensional vector V=[x, y] can be graphed with a start point at the origin (0,0) of the graph and an endpoint at a coordinate (x, y) of the graph. A similarity between any two vectors in the two-dimensional space can be determined by calculating the cosine of the angle θ between the two vectors. The smaller the angle, the closer the cosine is to 1 and the more similar the two vectors are.
However, vectors can theoretically be defined across any number of dimensions n, such that V=[x, y, . . . n]. While it is not possible to graphically model vectors of more than three dimensions, it is still possible to perform mathematical operations on these multidimensional vectors. For example, it is possible to determine an angle θ between two vectors that are defined over more than three dimensions, and to determine the similarity between those two vectors by calculating the cosine of the angle θ.
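For illustration only, the cosine calculation described above can be written out for two vectors of n dimensions using the standard dot-product identity. The symbols U, V, u_i, and v_i are notation assumed here; they do not appear in the description above.

```latex
\cos\theta
  = \frac{U \cdot V}{\lVert U \rVert \, \lVert V \rVert}
  = \frac{\sum_{i=1}^{n} u_i v_i}
         {\sqrt{\sum_{i=1}^{n} u_i^{2}} \; \sqrt{\sum_{i=1}^{n} v_i^{2}}}
```

When every component is a non-negative count, the result lies between 0 (no shared dimensions) and 1 (same direction).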
Word vectors can be used to model any string of words, such as a document or a natural language query. The vectors can be defined according to a number of concepts in the English language. For example, if a modern thesaurus includes 1000 concepts, then each word vector would include 1000 dimensions. In other words, V=[x, y, . . . n] where n=1000. Each dimension in the vector would correspond to a unique one of the 1000 concepts, and the number in any particular dimension of the vector would be the number of times that the concept corresponding to that dimension occurred in the query or document.
The following example shows a comparison between a document and a query using word vectors. The concepts from this example can also apply to a comparison between any two sets of words, such as between two documents. Table 1 illustrates an exemplary set of concepts along with words related to each concept.
TABLE 1
Concept Definitions

Concept No.    Words
1              the, a
2              attractive, nice, beautiful
3              rose, carnation, pansy
4              white, pink, purple
Table 2 illustrates an exemplary set of word strings, along with words included in each word string.
TABLE 2
Documents

Word String    Text
Document       the nice, attractive white rose
Query          the beautiful carnation
Table 3 illustrates a vectorization of the document and the query from Table 2 using the concepts from Table 1.
TABLE 3
Vectorization

Word String    Vector          Categorization
Document       [1, 2, 1, 1]    [the; nice, attractive; rose; white]
Query          [1, 1, 1, 0]    [the; beautiful; carnation; null]
The dimensions of the vectors in Table 3 correspond to the concepts set forth in Table 1, such that dimension 1 of each vector corresponds to concept 1, dimension 2 corresponds to concept 2, and so on. Accordingly, the document includes one term from concept 1 (“the”), and so a “1” is assigned to dimension 1 of its vector. The document includes two terms from concept 2 (“nice” and “attractive”), and so a “2” is assigned to dimension 2 of its vector. The remaining dimensions in the document vector, as well as the dimensions for the query vector, are filled in this manner. For example, the query includes no term from concept 4, and so a “0” is assigned to dimension 4 of its vector.
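The vectorization of Table 3 can be reproduced with a short sketch. The concept sets and word strings are taken from Tables 1 and 2; the names CONCEPTS and vectorize, and the simple lowercase tokenization, are assumptions made for this illustration.

```python
import re

# Concepts 1 through 4 from Table 1, in order. Position in this list
# corresponds to the dimension of the resulting vector.
CONCEPTS = [
    {"the", "a"},
    {"attractive", "nice", "beautiful"},
    {"rose", "carnation", "pansy"},
    {"white", "pink", "purple"},
]


def vectorize(text, concepts=CONCEPTS):
    """Count, for each concept, how many words of `text` belong to it."""
    # Assumed tokenization: lowercase the text and keep alphabetic runs only.
    words = re.findall(r"[a-z]+", text.lower())
    return [sum(word in concept for word in words) for concept in concepts]


document_vector = vectorize("the nice, attractive white rose")
query_vector = vectorize("the beautiful carnation")

print(document_vector)  # [1, 2, 1, 1]
print(query_vector)     # [1, 1, 1, 0]
```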
Once the document vector and query vector are calculated in this example, it is possible to mathematically determine the angle θ between them. Therefore, it is also possible to determine the similarity between the query and the document by calculating the cosine of the angle θ between their respective word vectors. This similarity value can be compared with the similarity value of the same query with a different document. In this way, the search tool may rank the documents depending on their similarity with respect to the query. Phrase vectors may also be used in addition to, or instead of, word vectors.
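The similarity comparison and ranking described above can be sketched as follows. The query vector and the first document vector are those of Table 3; the second document vector is hypothetical (it would correspond to a word string such as “the pink pansy”) and is included only so that there is something to rank against. The function and variable names are assumptions of this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # The angle is undefined for a zero vector; treat it as no similarity.
        return 0.0
    return dot / (norm_u * norm_v)


query_vector = [1, 1, 1, 0]  # from Table 3

document_vectors = {
    "table_3_document": [1, 2, 1, 1],       # from Table 3
    "hypothetical_document": [1, 0, 1, 1],  # assumed for illustration
}

# Rank documents from most to least similar to the query.
ranking = sorted(
    ((name, cosine_similarity(query_vector, vector))
     for name, vector in document_vectors.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

For the Table 3 document, the cosine is 4/√21, or approximately 0.873; for the hypothetical document it is 2/3, or approximately 0.667. The Table 3 document would therefore be ranked first under this sketch.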
This technique may not be the best indicator of relevance. For one thing, it relies fundamentally on the frequency of terms within a particular concept. It also ignores other factors that may be important in determining relevance and ranking.
Accordingly, there is a need to improve the ranking of search results in response to a query.