Any discussion of the background art throughout the specification should in no way be considered as an admission that such background art is prior art, nor that such background art is widely known or forms part of the common general knowledge in the field.
Traditional text search methods take a very small amount of input text, i.e. a few words (keywords), and on that basis search huge databases and attempt to display documents relevant to the keywords as an output. This type of search method is very useful when there is little subject matter available to the searcher and the aim is to find subject matter as fast as possible. This type of searching, i.e. text or keyword to relevant document(s), is the most commonly performed search on the internet and perhaps in the world today. A search on www.google.com using the keywords “Britney Spears” is a typical example. The user (searcher) initiates such a search with very little information, so the relevance of the returned documents with respect to the keywords is often an estimate based on the statistically most desired outcome: the keywords themselves produce a huge number of document matches, yet the input text does not contain enough information to inherently order those matches by relevance to the particular desires of the user/searcher.
As the number of documents in the database to be searched becomes large and the amount of input text becomes small, the relevance of the documents in the search results becomes impossible to determine without additional information (i.e. information that is not contained in the initial input text or search query). In the case of internet search engines such as Google™, YAHOO™, Microsoft BING™ and others, the developers of the search algorithms have found ways to improve the relevance of the search results, most notably through the PageRank algorithm of Google™, which essentially uses hypertext link structures to form a popularity index of billions of documents and millions of search terms.
Popularity works well for internet ‘text to document’ searching, since the popularity methodology finds appropriate information relevant to the input search query in the vast majority of cases. However, this type of searching is less useful for document-to-document searching, as the input and output requirements are vastly different. Document-to-document searching is initiated with far more input text and, given the increased input information, generally carries a greater expectation of relevant output results. To date, Google™ limits the number of input terms in the search query to 50 terms or 2048 characters. A Google™ search generally (though not always) finds fewer results as more information is added to the search query, because additional input text terms are used to exclude (prune) as many documents as possible from the search results. This is not a useful approach for document-to-document searching, since the only document likely to match a particular document when the text contained therein is used as the input search terms is that document itself.
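The pruning behaviour described above can be illustrated with a minimal sketch. It assumes a simple conjunctive (AND) keyword match over a hypothetical three-document corpus; the function name `conjunctive_search` and the sample data are illustrative assumptions and do not represent the implementation of any actual search engine.

```python
def conjunctive_search(documents, query_terms):
    """Return ids of documents that contain every query term (AND semantics)."""
    wanted = {term.lower() for term in query_terms}
    hits = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if wanted <= words:  # every query term must be present
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical corpus, for illustration only.
DOCS = {
    "d1": "senior software engineer with database experience",
    "d2": "software engineer for embedded systems",
    "d3": "database administrator with software background",
}

few_terms = conjunctive_search(DOCS, ["software"])
more_terms = conjunctive_search(DOCS, ["software", "database"])
# Using the full text of d1 as the query leaves only d1 itself.
whole_doc = conjunctive_search(DOCS, DOCS["d1"].split())
```

Under these assumptions each added term can only shrink the result set, so a query built from an entire document tends to return nothing other than that document.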
Other traditional search methods use technology based on matching meta information. The meta information is essentially a group of labels (or tags) applied to each document, which allow documents to be aligned in different dimensions. An example concerning job searching is a candidate looking for a job with the two meta fields Location=“Los Angeles” and Job Type=“full time”. All documents without these meta matches are excluded. The specific nature of meta tags allows databases to be searched very quickly, as the database is searching for a match (or non-match) in a field instead of a match across a full document, which allows many documents to be excluded from the search before the full text contents are examined. However, meta searching has several disadvantages, most notably that these tags must be created for every document in the database. This is usually done manually as part of the database input process, which is extremely time consuming and also prevents batch importing of data, although techniques such as Latent Semantic Indexing (LSI) are becoming more popular due to their ability to semantically determine appropriate tags. The second most notable disadvantage is the lack of cross-compatibility between different databases. Often each database provider uses different conventions for each meta field, making searching across different platforms virtually impossible. In some cases meta tags are produced automatically, but in many cases this is either simply not practical, highly limiting, or results in a large number of errors in the information assigned to meta tags for documents in the database.
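A minimal sketch of meta-field matching, using the job-search example above, is given below. The record layout, the field names `location`, `job_type` and `city`, and the function `meta_filter` are assumptions made purely for illustration.

```python
def meta_filter(records, **required):
    """Keep only the ids of records whose meta fields equal every required value."""
    return [
        record["id"]
        for record in records
        if all(record["meta"].get(key) == value for key, value in required.items())
    ]


# Hypothetical job records. job-4 comes from a provider that labels the
# location field "city" instead of "location".
JOBS = [
    {"id": "job-1", "meta": {"location": "Los Angeles", "job_type": "full time"}},
    {"id": "job-2", "meta": {"location": "Los Angeles", "job_type": "part time"}},
    {"id": "job-3", "meta": {"location": "Chicago", "job_type": "full time"}},
    {"id": "job-4", "meta": {"city": "Los Angeles", "job_type": "full time"}},
]

matches = meta_filter(JOBS, location="Los Angeles", job_type="full time")
```

In this sketch only job-1 is returned. The record job-4 describes an equally suitable position but is excluded because its provider uses a different field convention, which illustrates the cross-compatibility difficulty noted above.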
Unlike the text-to-document searches and meta searching mentioned above, document-to-document searches involve additional complexity in the input processing requirements and therefore need different methodologies for calculating the relevance of documents in the database with respect to the input document. In particular, the aim of document-to-document searches is not to find new information (as with text-to-document or meta tag searches), but rather to find the most similar documents, or documents containing the most relevant information. The applications for this type of document searching are numerous and include research, job-candidate matching, legal case matching, patent portfolio management, and many others. In all these cases the searcher begins with at least one document, which is a large amount of information in comparison to text-to-document and meta tag searching as outlined above.
There are several examples of document-to-document searching applications. For example Iparadigms LLC, USA have developed a document searching engine for the detection of plagiarism in student and academic works. This technology looks for identical word strings in reference documents stored in a database which match an input text portion, or portions of the input text, which may be for example an essay or paper submitted by a student as part of a course of study. This type of search is very useful for finding very similar pieces of content (i.e. similar wording), but breaks down when trying to find documents with similar content using different wording.
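The general idea of matching identical word strings can be sketched as follows. This is a minimal illustration that assumes overlapping three-word sequences are compared; it is not a description of the actual Iparadigms technology, and the function names and sample sentences are hypothetical.

```python
def word_ngrams(text, n=3):
    """Return the set of all runs of n consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_fraction(submission, reference, n=3):
    """Fraction of the submission's word n-grams that also occur in the reference."""
    submitted = word_ngrams(submission, n)
    if not submitted:
        return 0.0
    return len(submitted & word_ngrams(reference, n)) / len(submitted)


# Hypothetical sample texts.
reference = "the quick brown fox jumps over the lazy dog"
copied = "the quick brown fox jumps over the fence"
reworded = "a fast auburn fox leaps above a sleepy hound"

copied_score = shared_fraction(copied, reference)
reworded_score = shared_fraction(reworded, reference)
```

The copied text shares most of its word sequences with the reference, whereas the reworded text, which conveys similar content in different wording, shares none. This reflects the limitation described above.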
Furthermore, Burning Glass Technologies, USA have developed technology specifically for the human resources industry. The Burning Glass technology identifies successful candidates for a given position and then looks for candidates with similarities to previous candidates who have been successful in jobs with similar selection criteria. This type of matching uses hidden Markov models, and is often very useful technology, but such models have the disadvantage that they must rely on the identification of previous successes to predict new successes. This inherently requires repetition of the same job description, so the technology is largely only useful for large companies refilling similar positions. This technology is also not very useful outside of job searching, as most other document-to-document searches are not repeated, evaluated and repeated again. As such, Burning Glass aims at company/institution based integrations instead of a more global approach to matching, as the search technology relies on repetition and established definitions of success, which in general work better inside a closed system.
In other examples of search methodologies, patent matching technologies, such as those of Patent Café Inc, USA, employ Latent Semantic Analysis (LSA) techniques to help with patent searching, portfolio analysis, patent strength, etc. This methodology looks at text terms and uses inverse weighting based on population scores (how rare each term is) to give scores to terms to find a match, for example as described in U.S. Pat. No. 4,839,853. However, LSA techniques are limited by how well the system is initially set up, and rely primarily on inverse word population analysis, which can be unreliable in many applications. Also, LSA techniques generally cannot be adapted in real time as a result of user interactions with the results they produce, i.e. these techniques are largely rigid and are slow or unable to adapt as the information in the database(s) changes or in response to external input, e.g. from a user and/or additional/external information source(s). LSA also becomes extremely computationally intensive as the number of terms in the input becomes large, as LSA usually uses a two-dimensional matrix with terms on one axis and documents on the other. This produces a semantic vector identifying each document in what is called the “term space”. As the number of terms and/or the number of documents becomes large, approximations are required to reduce the computational load. This reduction is typically done by grouping semantically similar terms (terms that occur in many of the same documents) into higher-level groupings to reduce the term space. Unfortunately, however, this simplification has several drawbacks, mainly with regard to a) very rare terms that do not fit into any groupings, b) words with multiple meanings (polysemy), which fool the groupings, and c) multiple words with similar meanings (synonyms). The reliance on approximations can produce much poorer results when any of these contextual issues affects key search terms.
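The term-document matrix and inverse weighting described above can be sketched as follows. This is a simplified illustration only: it assumes a logarithmic inverse document frequency as the rarity score and omits the dimensionality-reduction step by which LSA groups related terms. All function names and sample documents are hypothetical and are not taken from U.S. Pat. No. 4,839,853.

```python
import math
from collections import Counter


def build_term_document_matrix(docs):
    """Rows are terms, columns are documents, cells are raw term counts."""
    counts = [Counter(text.lower().split()) for text in docs]
    terms = sorted(set().union(*counts))
    matrix = {term: [count[term] for count in counts] for term in terms}
    return terms, matrix


def inverse_document_frequency(matrix, n_docs):
    """Rarer terms (present in fewer documents) receive higher weights."""
    return {
        term: math.log(n_docs / sum(1 for cell in row if cell > 0))
        for term, row in matrix.items()
    }


def weighted_vector(doc_index, terms, matrix, idf):
    """Vector locating one document in the term space."""
    return [matrix[term][doc_index] * idf[term] for term in terms]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical documents, for illustration only.
DOCS = ["patent claim search engine", "patent portfolio search", "job candidate search"]
terms, matrix = build_term_document_matrix(DOCS)
idf = inverse_document_frequency(matrix, len(DOCS))
vectors = [weighted_vector(i, terms, matrix, idf) for i in range(len(DOCS))]
```

Here the term "search" occurs in every document and so receives a weight of zero, while rarer terms dominate the comparison. The matrix has one row per distinct term and one column per document, which indicates how the computation grows as either number becomes large.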
Another search methodology involves a process of receiving a query, identifying phrases within the query, identifying possible extension(s) of the phrases in the query, and searching a database of documents for coincidences between phrases in the documents and the phrase extensions identified from the query. Such a method is disclosed in US patent application No. 20060031195. This method appears to have many similarities with autocomplete functions, for example as used by Google™ to predict extensions to a query of a few terms based on the popularity of previous search queries, in order to narrow the search beyond that which could be achieved from the initial query. However, such methods are more suited to input queries of only a few terms and will have difficulties when the number of input query terms becomes large (10 or more), which would likely place extremely large computational loads on the identification of phrases and phrase extensions.
Therefore a need exists for a new approach to text searching, particularly involving whole document to document searching applications where the input document comprises a large number of input terms.