This section introduces aspects that may be helpful in facilitating a better understanding of embodiments of the present invention. Accordingly, the statements of this section are to be read in this light and are not to be understood as admissions about what is in the prior art or what is not in the prior art.
Most individuals are familiar with manual searching for books, magazines or documents in a library or similar setting. Searching, in its most rudimentary form, often simply involves a researcher seeking a specific book written by a particular author by perusing the library stacks by category type and utilizing alphabetical order or some other organizational scheme to locate the specific book.
Searching for documents stored electronically often involves searching within a specific database via names or key words/search terms. When a researcher must independently search each database, he will only uncover documents stored in the selected database that relate to the search terms, and he will not uncover any related documents stored in other databases. This creates an organizational problem in that different researchers may search different databases while attempting to find the same type of documents. In other words, two researchers may each expect a given document to be contained in a different database because of their own notions of how the searched-for document should be categorized. As a result, one or both researchers may fail to discover the document because they did not classify it in the same manner as the creator of the database and therefore did not search the database deemed appropriate by that creator.
With the advent of the Internet, millions of documents are available through Internet search engines. An electronic document is a cohesive body of text that is electronically accessible (e.g., a patent document, a news article, a legal case, a medical journal article or a webpage). Often, a group of documents is contained within a single source, dataset, collection or database. Most individuals are familiar with the process of searching for relevant documents within a document collection via key words and search terms. A researcher types the key words/search terms into the search engine to locate related documents and then sifts through the document results to determine which documents are most relevant.
If the researcher is satisfied with the results he obtains via the key word search, he can print or save the documents and complete the search. However, often the researcher is not satisfied with the initial results and the query (i.e. key words or search terms) must be modified to obtain potentially better results. After a number of searches are performed, the researcher often collects and organizes the results by printing the documents or saving the documents into a folder. The problem with this searching methodology is twofold. First, the results of the search are dependent on the researcher's selection of key words. The researcher may not select the best key words or may not be able to obtain the best results by simply using a few words (i.e., search terms) and may obtain no results by using too many terms. Second, the document results saved or printed are not “living” documents in that they represent how the document appeared when the document was saved or printed. They are not dynamic and capable of being updated and then viewed at a later date without further researcher involvement. The document results are also a snapshot of the search conducted at a given point in time and any documents added to the dataset after the search will not be included in the search results.
Keyword searching is still quite analogous to manually investigating a collection of printed documents. Software essentially just helps to perform that job more efficiently. The advent of the search engine was a cornerstone in the evolution of information research, but a search engine simply finds documents that contain some specific words.
Advanced search engines such as Google are forgiving in the sense that they can yield results that do not literally match the keywords, and they allow the researcher to utilize natural language. Search engines such as Google also utilize “PageRank,” which may skew the results of any given search. PageRank is a link analysis algorithm that assigns a value to each element of a set of documents to indicate that document's relative importance within the set. The value assigned to a document/webpage on the World Wide Web is defined recursively: it is calculated from the number and the PageRank of all webpages that link to the document, the theory being that a document linked to by many webpages with high PageRank values is itself worthy of a high PageRank.
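The recursive definition described above can be illustrated with a short sketch. The code below is a simplified power-iteration form of the link analysis idea and is not the implementation used by any particular search engine. The function name `page_rank`, the damping value of 0.85, the fixed iteration count, and the even redistribution of rank from pages with no outgoing links are all assumptions made for illustration.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Approximate link-based importance scores by repeated redistribution.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative assumptions.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with equal importance for every page.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly across its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: "a" and "b" both link to "c", and "c" links to "a".
example_ranks = page_rank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this hypothetical example, page "c" is linked to by two pages and ends up with the highest value, while page "b", which no page links to, ends up with the lowest.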
Semantics also plays a role in natural language queries. “Unimportant” words such as “the” and “it” are discarded, while the “important” words and synonyms of those “important” words are actually searched. This may ultimately create a huge index that still needs to be manually inspected by the researcher.
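A minimal sketch of this kind of query processing is given below. The stop word list, the synonym table, and the function name `expand_query` are hypothetical and are chosen only to show how discarding “unimportant” words and adding synonyms enlarges the set of terms that is actually searched.

```python
import re

# Hypothetical list of "unimportant" words to discard.
STOP_WORDS = {"the", "it", "a", "an", "of", "in"}

# Hypothetical synonym table for "important" words.
SYNONYMS = {"car": {"automobile", "vehicle"}}


def expand_query(query):
    """Drop stop words, then add any known synonyms of the remaining words."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    important = [word for word in words if word not in STOP_WORDS]
    expanded = set(important)
    for word in important:
        expanded |= SYNONYMS.get(word, set())
    return expanded
```

Under these assumptions, a one-word query such as “the car” is searched as three terms, which suggests how the volume of matching material can grow quickly.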
Other database search engines (e.g., search engines for Wikipedia and the United States Patent and Trademark Office) utilize the familiar “Boolean keyword search,” which is very literal and has its own distinct value and applicability. If a researcher types in too many keywords, no matches appear. If a researcher types in too few keywords, the results are too numerous and vary widely. If a researcher is unsatisfied with the results, he must rework the query by adding complex operators (e.g., some combination of “AND”, “OR”, “NOT”, and/or parentheses).
If a researcher is unfamiliar with the nuances of the Boolean keyword search system, he may not properly utilize the Boolean operators and may not structure the query in the proper manner to obtain the most desirable results. Moreover, a Boolean search is traditionally unforgiving in that the search terms entered are either present or they are not present in the selected range (e.g. in the entire document or in the same sentence as one another).
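The literal, all-or-nothing character of a Boolean search can be sketched as follows. This example covers only an “AND” of plain terms evaluated over a selected range (the entire document or a single sentence). The function names, the simple tokenizer, and the sentence-splitting rule are assumptions for illustration and do not reflect the query syntax of any particular search system.

```python
import re


def tokenize(text):
    """Return the set of lowercase words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches_all(text, terms, scope="document"):
    """Return True only if every term is present within the selected range.

    scope="document": all terms appear somewhere in the text.
    scope="sentence": all terms appear together in at least one sentence.
    """
    wanted = {term.lower() for term in terms}
    if scope == "document":
        return wanted <= tokenize(text)
    if scope == "sentence":
        # Assumption: sentences end at '.', '!' or '?'.
        sentences = re.split(r"[.!?]+", text)
        return any(wanted <= tokenize(sentence) for sentence in sentences)
    raise ValueError("scope must be 'document' or 'sentence'")


sample = "The horse threw a shoe. The farrier arrived later."
in_document = matches_all(sample, ["horse", "farrier"])
in_sentence = matches_all(sample, ["horse", "farrier"], scope="sentence")
```

Here `in_document` is true and `in_sentence` is false, because the two terms never share a sentence. Adding a single term that is absent from the text, such as “saddle”, causes the match to fail entirely, which reflects the unforgiving behavior described above.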
Key word searching also may be difficult to perform in certain situations because a given word can have different meanings (e.g., China the country and china the kitchenware), which produces a large number of varying search results that must be perused by a researcher.
Traditional search solutions do not allow for electronic searching for documents utilizing an entire document or documents as the search criteria, or utilizing portions of a document supplemented with key words entered by a researcher as the search criteria. For example, if one were to copy an entire document and paste it into a Google, Bing or Yahoo search, one would get an error message because these search engines are not designed to search on entire documents. There are a few search engines that do semantic searches of entire documents, such as Text Wise. However, these prior art full document semantic search engines are suboptimal because they utilize logic-based systems that rely on proximity searches for words (e.g., is the word “horse” within two words of the word “shoe”), on Boolean logic (e.g., AND, OR, AND NOT), and on attempts to understand the meaning of words by associating the words with other words using logic (e.g., the word “china” may be related to kitchenware if the word “porcelain” or “plate” is also used in the same document). This type of prior art semantic search using logic, Boolean logic and proximity is computationally difficult, and it increases both the time and money required to perform searches and to index groups of documents to be searched.
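The two logic-based checks mentioned above, word proximity and word association, can be sketched as follows. The function names are hypothetical, and “within two words” is assumed here to mean that the positions of the two words differ by at most two. The sketch is meant only to show the kind of rule such systems evaluate and is not drawn from any specific product.

```python
import re


def words_of(text):
    """Return the lowercase words of the text in order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def within_proximity(text, first, second, max_distance=2):
    """Return True if some occurrence of first lies within max_distance words of second."""
    words = words_of(text)
    first_positions = [i for i, word in enumerate(words) if word == first.lower()]
    second_positions = [i for i, word in enumerate(words) if word == second.lower()]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


def china_suggests_kitchenware(text, cue_words=("porcelain", "plate")):
    """Return True if 'china' appears in the same document as one of the cue words."""
    present = set(words_of(text))
    return "china" in present and any(cue in present for cue in cue_words)
```

For instance, `within_proximity("The horse lost a shoe", "horse", "shoe")` is false under the assumed definition because the words are three positions apart, whereas it is true for the text “a horse shoe”. Each such rule must be evaluated against every candidate document, which gives some sense of why this style of search is costly.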
Other solutions also do not permit collection, storage and sharing of the documents found during this type of searching in a portable and dynamic manner.
The prior art searching technology simply allows a researcher to enter some keywords for searching that may yield a set of documents that at least come close to the type of documents sought. Upon reviewing these documents, if a researcher discovers some words in a related document that help him develop his search criteria, the prior art solutions require him to enter those key words from that related document as search terms to try to locate additional relevant documents. The context of the language preceding and following those key words from the related document is lost when a new key word search is performed using this traditional searching technique. The prior art does not allow the researcher to leverage the entirety of that particular related document as the criteria for the next search.
In many document collections, the highest quality search criterion is actually the entire text of one of the documents in the database. A real document in the collection (or a new one that the researcher types in full) contains much more useful information than what a researcher typically types as keywords. The natural language of the document and all of its inherent properties tend to shine through if analyzed with appropriate algorithms. When the text of an entire document, or large portions of that text, is used as the search criteria, the related documents returned are those most similar to or most closely related to the original document or portions thereof. In “complexity theory” this phenomenon is known as “emergence.” Emergence is the key to a natural stepping-stone in the evolution of information research from a “search engine” to a “discovery engine.”
A researcher conducting a document search, such as a patent search, could leverage a “discovery engine” as opposed to a “search engine” to obtain superior results. In this type of search, the researcher already has a full description of the patent/document. The description can be submitted as the search criteria and the top related documents can be returned. Some of the results may look very relevant and the researcher can hold/identify these documents to enable him to return to them later. The researcher also can identify others to ignore so they do not show up as results again. If one of the documents discovered looks extremely relevant, the researcher can perform a further search using that entire relevant document as the search criteria to view the top related documents to that relevant document. The search criteria are effectively changing each time a search is performed without having to rework a query manually each time based on search results.
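The workflow described above can be sketched in a few lines. This section does not specify how relatedness is computed, so the example below substitutes a common general-purpose technique, TF-IDF weighting with cosine similarity, purely as a stand-in. The function names, the `ignored` parameter, and the sample collection are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return the lowercase words of the text in order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(collection):
    """Weight each term by how rare it is across the collection (smoothed)."""
    n = len(collection)
    doc_freq = Counter()
    for text in collection.values():
        doc_freq.update(set(tokenize(text)))
    return {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}


def vectorize(text, idf):
    """Turn a full text into a sparse term-weight vector."""
    counts = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def discover(query_text, collection, ignored=(), top_k=3):
    """Rank documents by similarity to an entire query text, skipping ignored ones."""
    idf = build_idf(collection)
    query_vector = vectorize(query_text, idf)
    scored = [
        (cosine(query_vector, vectorize(text, idf)), doc_id)
        for doc_id, text in collection.items()
        if doc_id not in ignored
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:top_k] if score > 0.0]


# Hypothetical collection used only to exercise the sketch.
collection = {
    "d1": "A horse shoe is nailed to the hoof of a horse by a farrier.",
    "d2": "Porcelain china plates are fired in a kiln.",
    "d3": "The farrier trims the hoof before fitting each horse shoe.",
}

# First search: the full text of d1 is the search criteria.
first_results = discover(collection["d1"], collection, ignored={"d1"})

# Follow-up search: the most relevant result becomes the new search criteria.
next_results = discover(collection[first_results[0]], collection, ignored={first_results[0]})
```

In this hypothetical run, the first search ranks "d3" ahead of "d2", and the follow-up search that uses the full text of "d3" returns "d1". No keyword query is written or reworked at either step, and documents placed in `ignored` do not reappear in later results.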
Hence, there is a need for a device and methodology that efficiently, reliably and affordably permit a user to utilize the text of an entire document as the search criteria and/or to utilize an entire document along with supplemental text supplied by a researcher and/or multiple documents or subsections of documents as the search criteria and/or any combination of these potential search criteria. There is also a need for a device and methodology that permit a user to collect, store and share the collected/related documents from a search with other users and to further permit any individual to conduct an updated search for any newly added documents in a dataset based on the same search criteria.