1. Field of the Invention
This invention pertains in general to information retrieval, and more specifically to multi-way nested searching and retrieval of information.
2. Description of the Related Art
Information retrieval systems face several daunting problems in delivering highly relevant and highly inclusive content in response to a user's query. For example, in conducting a search, the information retrieval system may have difficulties with synonymy, where different words are used to refer to the same thing (e.g., automobile vs. car), and polysemy, where there are different meanings for the same word (e.g., “Berkeley” can mean the University of California, Berkeley or the city of Berkeley in California). Further, spelling errors, abbreviations (e.g., NY for New York), and word concatenations (e.g., MS Office vs. MSOffice) in both the search queries entered by a user and in the documents being queried can cause additional problems in the delivery of relevant search results. Information retrieval systems also have problems with partial matches (e.g., a document may contain “TOYOTA®,” whereas the query might be “TOYOTA TERCEL®”), incomplete queries, complex meanings that extend beyond the words entered in queries, accounting for the relative significance of a user's query in a document, and the implicit preferences of a user that the user did not specify in his query but that must be inferred by the information retrieval system. These are just a few of the many challenges that must be overcome to obtain highly relevant search results.
A number of different approaches have been used for attempting to solve some of the problems delineated above, including keyword searching or Boolean queries, concept tagging and conceptual searches, automatic classification/categorization, entity extraction using natural language parsing, and the like. However, these all have drawbacks. Keyword searches and Boolean queries can be very difficult for many users to construct effectively and do not address some of the most basic full-text search problems, including synonymy, polysemy, spelling errors, abbreviations, concatenations, and partial matches. To address the enormous problems surrounding keyword searching and Boolean queries, a commonly accepted practice is to tag documents with “concepts,” i.e., map documents into a “concept space,” and then map the query into the same “concept space” to find search results. However, it is difficult to accurately extract concepts from documents with a high degree of precision and recall. There are several algorithms currently used to automatically categorize a document into a taxonomy of concepts, but they commonly have a low degree of accuracy, produce very poor results, have no ability to match an input query of a few words to concepts in the taxonomy, and require a significant amount of training in order for the classifiers to work properly. Extracting concepts from text using natural language parsing (NLP) techniques is another commonly used method, but it is language dependent and does not work well when there is ambiguity in the text or when the content does not have any grammatical structure. While entity extraction using NLP is useful for finding (potentially) new concepts, it is generally not sufficient for finding existing, or known, concepts.
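The exact-match limitation of keyword and Boolean queries described above can be illustrated with a minimal sketch. The function names, the tokenization rule, and the four sample documents are assumptions made purely for illustration; they are not taken from any particular search engine.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def boolean_and_search(query, documents):
    """Return ids of documents that contain every query token exactly."""
    query_tokens = tokenize(query)
    return [doc_id for doc_id, text in documents.items()
            if query_tokens <= tokenize(text)]

# Hypothetical sample corpus (illustrative only).
documents = {
    "d1": "Used car for sale in New York",
    "d2": "Automobile for sale in NY",
    "d3": "Secretary proficient in MSOffice",
    "d4": "Secretary proficient in MS Office",
}

car_hits = boolean_and_search("car", documents)           # misses d2 (synonymy)
office_hits = boolean_and_search("MS Office", documents)  # misses d3 (concatenation)
ny_hits = boolean_and_search("New York", documents)       # misses d2 (abbreviation)
```

In this sketch each query retrieves only the document that uses the same surface form, even though a second document in each pair refers to the same thing.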
Further, traditional collaborative filtering engines do not work well when the number of searchable items is increasing and changing constantly, the number of users is much less than the number of items, and very few of the items have been seen and rated in the past.
Another technique used to solve some of the above problems is latent semantic indexing. One mechanism for latent semantic indexing includes using key words or concepts and mapping them into a two-dimensional list of concepts. The set of words is thereby mapped into a reduced space, and documents are indexed according to that space, so that an input word might match one or more of these concepts. The technique can therefore produce a similarity measure between the input string and documents. However, these techniques do not allow a high level of precision in the correlation between concepts, nor do they incorporate human knowledge or allow for human editorial control in the linking between concepts. In addition, these techniques do not provide mechanisms for combining multiple search results to obtain a set of the most relevant results.
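For reference, the mapping into a reduced space is commonly formulated with a truncated singular value decomposition. The expressions below give that standard textbook form and are not necessarily the specific mechanism referred to above. The symbols are notation assumed here: A is a term-by-document matrix, k is the number of retained dimensions, q is a query's term vector, and the reduced representation of document j is the j-th row of V_k.

```latex
A \approx A_k = U_k \Sigma_k V_k^{\mathsf{T}}, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\mathsf{T}} q, \qquad
\operatorname{sim}(q, d_j) =
  \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

Under this formulation, the similarity measure between an input string and a document is the cosine of the angle between their reduced-space vectors.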
In addition, in some cases it can be valuable to a user to conduct a whole document search, that is, a search based on a particular document (rather than a search query) to find a matching document. When conducting this type of document-to-document search, the problems listed above regarding spelling errors, abbreviations, concatenations, etc. are compounded since the search engine must now be able to deal with potential problems or discrepancies (e.g., spelling errors) in two documents instead of one (i.e., problems in both the document being searched on and the documents being searched for). For example, a job candidate interested in searching through a database of job postings might be interested in conducting a search based on his résumé, rather than a search based on a list of search terms he selected. In this case, the user would provide his résumé to the search engine, and he would expect to receive search results including a number of job postings matching the information in his résumé. If his résumé states that he has a law degree and includes a listing of skills in patent prosecution in the medical device field, he should receive job postings for patent attorneys at medical device companies, etc.
However, there are many problems with current techniques for these types of whole document searches. Two documents that should be a match might not be matched by a search engine if different terms are used to describe the same thing in those two documents. For example, a search based on a résumé listing Java™ as a skill might not return a job posting listing “object-oriented programming” skills as a requirement unless that search engine is able to recognize that Java™ falls under that category of skills. Techniques for duplicate detection allow some types of similar document searches, but these techniques allow for few variations in the words. For example, a job posting for a secretarial position requiring proficiency in using “MS Office” might not be matched with a résumé listing experience in “MSOffice” due to a failure to match these differently-written terms that mean the same thing. Further, not all words or sections of a document have the same importance, which can greatly affect search results. For example, in a job search based on a résumé, the job candidate's prior job titles might be relevant to the search since job postings will likely list the job titles associated with the position, but a job candidate's listing of prior companies worked at or schools attended is less important since the job posting is unlikely to list certain schools or companies as a requirement for prior experience. A machine learning technique could be used to do a similar document search where the system accepts feedback from users about the results received, allowing the system to make an inference as to what types of information are more or less relevant. However, this technique will not work unless a user conducts enough searches such that a sufficient amount of feedback can be received to allow the learning process to be effective. 
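The two issues raised above, differently written terms and unequal importance of document sections, can be made concrete with a small sketch. Everything in it is an assumption for illustration: the section names, the weight values, the adjacent-token concatenation rule, and the sample résumé and posting are hypothetical and do not represent any existing system's method.

```python
import re

# Hypothetical section weights (assumed values, for illustration only).
SECTION_WEIGHTS = {"title": 3.0, "skills": 2.0, "company": 0.5}

def normalize(text):
    """Return lowercase tokens plus concatenations of adjacent tokens,
    so that "MS Office" also yields "msoffice"."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    joined = {a + b for a, b in zip(tokens, tokens[1:])}
    return set(tokens) | joined

def weighted_match(resume, posting_text):
    """Sum, over resume sections, the section weight times the number
    of normalized terms shared with the posting."""
    posting_terms = normalize(posting_text)
    score = 0.0
    for section, text in resume.items():
        overlap = normalize(text) & posting_terms
        score += SECTION_WEIGHTS.get(section, 1.0) * len(overlap)
    return score

# Hypothetical sample documents.
resume = {
    "title": "Legal Secretary",
    "skills": "MSOffice",
    "company": "Acme Corp",
}
posting = "Secretary wanted; must be proficient in MS Office"

score = weighted_match(resume, posting)
```

Here the job title and the skill both contribute to the score, with "MSOffice" matching "MS Office" only because of the concatenation step, while the company name contributes nothing. Fixed weights of this kind are a simplification; choosing them well is part of the problem the passage describes.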
In addition, current techniques do not provide an easy way for a user to apply both whole document searches and search queries involving key words to obtain even more targeted search results. Thus, current techniques for similar document searching have a number of deficiencies that can greatly affect the relevance of the search results obtained and the effectiveness of the search.