Improving mechanisms for searching and extracting information from vast numbers of documents, such as those available via the World Wide Web (“web”) or on large enterprise computer networks, has increasingly been an area of focus and research. This is especially due to the continued growth in the number of computer users, in the services and applications offered, for example, on the web, and in the amount of information being added to the web, enterprise computer networks and other databases. Documents include, for example, web pages of various formats, such as HTML, XML and XHTML, Portable Document Format (PDF) files, and word processor, database, and application program document files.
A search engine or search engine program is a widely used mechanism for allowing users to search vast numbers of documents for information. Generally, a search engine provides a user interface that includes a query field. A user can enter in the query field one or more search terms describing the desired information. In response to a query, the search engine attempts to locate, rank and sort matching documents and then return the search results for display, usually as a list of relevant documents. In some cases, such as a web search engine, a hyperlink to each document is included in the list. The list can also contain an excerpt of text from each document, meant to summarize the linked content and to aid the user in assessing the relevance of each search result.
Typically, search engines match the search terms to the documents as a whole. If the user is interested in specific information, for example, “sharks”, but a particular document about “beaches around the world” has only one sentence about sharks, it is unlikely that the search engine would return the document. Such a document is likely to score very low for the query “sharks”, if it is scored at all, because the document as a whole is not “about” sharks. Yet the specific sentence within this document may be what the user is looking for. Search engines typically do not allow users to detect and extract such details within documents, nor do conventional search engines generally allow the user to specify portions of the documents to be returned in the search results.
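The effect described above can be illustrated with a minimal sketch. It assumes a deliberately simple relevance measure, the relative frequency of the query term among all words, which is not the scoring used by any particular search engine. The sample text and the helper names `tokenize`, `term_score` and `split_sentences` are hypothetical and serve only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_score(text, term):
    """Relative frequency of `term` among all tokens in `text` (assumed measure)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)


def split_sentences(text):
    """Naive sentence splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


# Hypothetical document that is "about" beaches, with one sentence on sharks.
beaches_doc = (
    "Beaches around the world offer sand, surf and sun. "
    "Some have white sand and calm water. "
    "Others are known for strong waves and surfing. "
    "Swimmers at a few beaches should watch for sharks. "
    "Many beaches also have hotels and restaurants nearby."
)

# Score of the document as a whole for the query term.
doc_score = term_score(beaches_doc, "sharks")

# Score of each sentence taken on its own, keeping the best match.
sentence_scores = [(term_score(s, "sharks"), s) for s in split_sentences(beaches_doc)]
best_score, best_sentence = max(sentence_scores)

print(f"whole document: {doc_score:.3f}")
print(f"best sentence:  {best_score:.3f} -> {best_sentence}")
```

Under this assumed measure, the single relevant sentence scores well above the document that contains it, which is why matching against the document as a whole tends to bury such a sentence.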
Methods of complex searching exist, but most of them are term based. For example, an advanced Boolean search and an advanced fuzzy text search are both based on terms. With a term search, each document is converted into a list of terms with relative rankings, and the document's terms are compared with the defined search terms or with similar terms having similar rankings. The problem with such term searches is that words can have multiple meanings, that synonyms may have been used, and that the typical algorithms are not good at determining the importance of terms that are used infrequently. Further, it is difficult to search for a concept using term-based searching methods. For example, to find all documents that mention the concept “California cities”, each city must be enumerated as a search term.
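The “California cities” limitation can be illustrated with a minimal sketch of a term-based Boolean search over an inverted index. The three sample documents and the helper names `build_index`, `search_all` and `search_any` are hypothetical, and the index is a simplification that ignores term rankings and multi-word names.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search_all(index, terms):
    """Boolean AND: documents containing every term."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def search_any(index, terms):
    """Boolean OR: documents containing at least one term."""
    hits = set()
    for t in terms:
        hits |= index.get(t.lower(), set())
    return hits


# Hypothetical corpus: two documents mention California cities by name only.
docs = {
    "d1": "Sacramento is the state capital.",
    "d2": "Fresno lies in the Central Valley.",
    "d3": "Paris is the capital of France.",
}
index = build_index(docs)

# Searching for the concept by its literal words finds nothing,
# because no document contains the terms "california" and "cities".
literal_hits = search_all(index, ["california", "cities"])

# The concept is only reachable by enumerating each city as a term.
enumerated_hits = search_any(index, ["sacramento", "fresno"])

print(sorted(literal_hits))
print(sorted(enumerated_hits))
```

In this sketch the index knows only which terms occur in which documents, so nothing connects “Sacramento” or “Fresno” to the concept of a California city unless the user supplies each name.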