1. Field of the Invention
The present invention generally relates to the indexing and retrieval of information, and particularly to information retrieval from networks such as the World Wide Web (WWW).
Prior to turning to the conventional techniques and systems for information retrieval, some basic principles in this field will be described hereinbelow.
First, a metadata relationship must be defined, which determines the significance of the search space. The specific relationship utilized in the present invention is a text matching procedure similar to the matching procedure used in Web search engines today such as Yahoo!, Google, IBM's Clever, etc. Nevertheless, the method of the invention described hereinbelow is not restricted to this implementation, and the utilization of any other metadata relationship does not deviate from the spirit of this invention.
The metadata can be described as an additional block of information which is stored with the indexed data block, which contains information about the data which is contained in the block.
For example, a metadata block with the text “picture of sail boat” attached to a Joint Photographic Experts Group (JPEG) file (binary representation of a photograph) will be extremely helpful in retrieving the photograph when a user of the database posts a query like “retrieve pictures of a sail boat”.
Without the metadata information, it would be more difficult to retrieve the picture. It would be necessary to construct a “picture template” which describes the basic features of a sailboat, and then employ sophisticated pattern matching techniques in order to recognize a sailboat from the binary representation.
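The role of the metadata block in the sail boat example can be sketched as follows. This is a minimal illustration only, assuming a hypothetical `DataBlock` structure and a hypothetical `match_metadata` helper that performs plain word matching; neither name comes from the present description, and an actual matching procedure may differ.

```python
from dataclasses import dataclass


@dataclass
class DataBlock:
    """A stored data block with an attached metadata text (hypothetical structure)."""
    name: str
    payload: bytes   # binary data, e.g. the bytes of a JPEG photograph
    metadata: str    # free text describing the payload


def match_metadata(blocks, query):
    """Return the names of blocks whose metadata contains every word of the query."""
    words = query.lower().split()
    return [
        block.name
        for block in blocks
        if all(word in block.metadata.lower().split() for word in words)
    ]


# Illustrative data only; the payload bytes are placeholders.
blocks = [
    DataBlock("boat.jpg", b"\xff\xd8", "picture of sail boat"),
    DataBlock("car.jpg", b"\xff\xd8", "picture of red car"),
]

print(match_metadata(blocks, "sail boat"))
```

The binary payload is never inspected: the photograph is found purely through the text stored alongside it, which is what makes the metadata block useful.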
Some metadata information can be contained in the stored data block itself, and not in an additional metadata block. For example, Web pages written in HTML (Hypertext Markup Language) contain tags (special text, defined by the HTML language) and text which are rich in metadata information.
For example, the text: “<TITLE>Pictures of Sailboats</TITLE>” can be used to find a Web page which has “links” to pictures of sailboats. A link is a special tag in the HTML language which references another data block. Links are of special significance in the organization of the World Wide Web, and there are several techniques which study the patterns with which data blocks stored on the Web are linked to each other.
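As a hedged illustration of metadata carried inside the data block itself, the following sketch reads the title text and the link targets out of a small HTML page using Python's standard `html.parser` module. The class name `TitleAndLinks`, the helper `extract_title_and_links`, and the sample page are assumptions made for this example only.

```python
from html.parser import HTMLParser


class TitleAndLinks(HTMLParser):
    """Collects the <TITLE> text and the HREF target of every <A> tag."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title_and_links(html):
    """Return (title text, list of link targets) for one HTML page."""
    parser = TitleAndLinks()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links


# Hypothetical page used only for illustration.
page = (
    "<HTML><HEAD><TITLE>Pictures of Sailboats</TITLE></HEAD>"
    '<BODY><A HREF="sloop.jpg">Sloop</A> <A HREF="ketch.jpg">Ketch</A></BODY></HTML>'
)

print(extract_title_and_links(page))
```

The title supplies text for matching against a query, while the collected link targets are the raw material for the linkage patterns discussed in this section.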
For example, a web searching technique utilized in search engines such as Google (e.g., see www.google.com) and IBM's Clever (e.g., see “Enhanced Hypertext Categorization using Hyperlinks”, Proceedings of the ACM SIGMOD, Seattle, Wash., 1998) gives special value to data blocks which are pointed to by several other data blocks. These “convergence” blocks are called “authorities”.
Another important linkage pattern is defined when a single block contains several links to other blocks which are related to “the same subject”. A “subject”, in the context of the present application, is a specific metadata relationship which relates data to a segment of text which describes the subject.
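The notion of an “authority” can be sketched by counting, for each data block, how many distinct other blocks point to it. This is only a simplified illustration and not the ranking method actually used by any named search engine; the function names, the `min_inlinks` threshold, and the sample link graph are assumptions.

```python
from collections import Counter


def inlink_counts(links):
    """Count distinct incoming links per page.

    `links` maps each page to the list of pages it links to.
    """
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):      # a source counts once per target
            if target != source:         # ignore self-links
                counts[target] += 1
    return counts


def authorities(links, min_inlinks=2):
    """Return pages with at least `min_inlinks` incoming links, most linked first."""
    counts = inlink_counts(links)
    candidates = [page for page, count in counts.items() if count >= min_inlinks]
    return sorted(candidates, key=lambda page: (-counts[page], page))


# Hypothetical link graph: pages A, B and C all point to P.
links = {
    "A": ["P"],
    "B": ["P", "Q"],
    "C": ["P", "Q"],
    "P": [],
}

print(authorities(links))
```

Here `P` is pointed to by three pages and `Q` by two, so both qualify as “convergence” blocks under the assumed threshold, with `P` ranked first.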
2. Description of the Related Art
Turning now to the conventional techniques, the definition of subject relationships is of primary importance in the construction of World-Wide Web (“Web”) directories. However, prior to the present invention, there has been no efficient, reliable method for determining where a user may be interested in going and no efficient way to present the user with information without there existing a certain latency in presenting pages or documents.
For example, a well known search engine (e.g., the Yahoo! search engine) utilizes human specialists to sift through the Web maze to organize its directory. However, this search engine is problematic in that it is a manually-compiled Internet directory which uses human experts to read a document to determine a relationship and associations between the documents and then group them by interest. As known, Yahoo! also has a search engine facility in which a user can enter a word and a search is performed to find relevant documents (e.g., documents including the entered word). Yahoo! employs conventional techniques in which a matrix is built (e.g., a “term-by-document” matrix) including rows (e.g., terms starting with, for example, the letter “A” and so forth, similarly to words in a dictionary) and columns (e.g., indicating the percentage that the words occur in any given document).
Thus, for example, assuming a term(s) of interest is “IBM”, a search would be conducted throughout a number of documents, and the number of occurrences (e.g., hits) found for “IBM” in each of a number of documents, would be reflected in the score for that document (e.g., if a document had 50 occurrences of “IBM”, then it would have a relatively high score as compared to a document having only two (2) occurrences).
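The occurrence-based scoring described above can be sketched as follows. This is a minimal illustration under the assumption that the score is simply the raw number of whole-word occurrences; the function names and the sample documents are hypothetical.

```python
import re


def term_count(text, term):
    """Count whole-word, case-insensitive occurrences of `term` in `text`."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def rank_by_term(documents, term):
    """Return (document name, hit count) pairs with at least one hit, highest count first."""
    scores = {name: term_count(text, term) for name, text in documents.items()}
    hits = [(name, score) for name, score in scores.items() if score > 0]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Hypothetical documents used only for illustration.
documents = {
    "d1": "IBM builds computers. IBM also does research at IBM labs.",
    "d2": "A history of IBM.",
    "d3": "Sail boats.",
}

print(rank_by_term(documents, "IBM"))
```

The document with three occurrences is ranked ahead of the document with one, and the document with no occurrence is omitted from the result.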
However, attempting to relate “IBM” to “computers” is more difficult. That is, Yahoo! does not provide a facility for determining such a relationship. Instead, a Boolean search (e.g., “IBM” and “computers” must be linked by the term “and”) must be performed. This is cumbersome.
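A Boolean search of the kind mentioned above can be sketched with an inverted index and a set intersection. The helper names `build_index` and `boolean_and`, as well as the sample documents, are assumptions made for this illustration and do not describe the internals of any particular search engine.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def boolean_and(index, *terms):
    """Return the documents that contain every one of the given terms."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical documents used only for illustration.
documents = {
    "d1": "IBM makes computers",
    "d2": "IBM research",
    "d3": "computers for sale",
}
index = build_index(documents)

print(boolean_and(index, "IBM", "computers"))
```

Only documents containing both words are returned. The search has no notion that the two terms are related to one another, which is the limitation noted above.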
A second technique is found in the “Google” search engine. Google is a new approach which attempts to find links between items. Hence, Google does not merely scan a page looking for terms. Instead, the Google directory is built automatically by an autonomous process, called a “Web Crawler”, which recognizes the specific metadata relationships described above. Thus, Google finds/counts the number of links coming in for a certain page and if Google sees a page which is pointed to by many other pages, then Google considers such a page as an “authority” on the subject of interest and ranks that page higher. For example, assuming a researcher publishes a very good paper on a topic and the paper is referenced/cited by many other authors in their papers, such a “very good” paper would be an “authority”, and thus the papers would have to link to a page having the very good paper. Thus, Google would find all such pages having such a link to the very good paper, and would rank the page having the paper higher.
A third approach is IBM's Clever, which utilizes both of the techniques described above for Yahoo! and Google and, in addition, has the capability of detecting a “directory”, which is a page that has several links to other pages and in which the degree of that page is very high. Hence, extending the example above, a compilation of all papers on a subject can be found, and many links may be found to other references in that subject.
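Detection of a “directory” page can be sketched, in a deliberately simplified form, as selecting pages whose number of outgoing links is high. The threshold `min_outlinks`, the function name, and the sample graph are assumptions; an actual system may weigh further criteria.

```python
def directories(links, min_outlinks=3):
    """Return pages linking to at least `min_outlinks` distinct other pages.

    `links` maps each page to the list of pages it links to.
    """
    out_degree = {
        page: len(set(targets) - {page})   # distinct targets, self-links ignored
        for page, targets in links.items()
    }
    candidates = [page for page, degree in out_degree.items() if degree >= min_outlinks]
    return sorted(candidates, key=lambda page: (-out_degree[page], page))


# Hypothetical graph: "survey" compiles links to four papers on one subject.
links = {
    "survey": ["p1", "p2", "p3", "p4"],
    "p1": ["p2"],
    "p2": [],
}

print(directories(links))
```

In this sample, only the `survey` page has enough outgoing links to be treated as a directory.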
Thus, these conventional directories are utilized by users of the directory service in order to retrieve information which is related to a certain subject. Most of the directories today are utilized according to the following procedure which in the present application is referred to as a “traditional Web Navigation”, as shown in FIG. 1 and described below.
The term “navigation” refers to the order in which the user retrieves a document. This procedure is important to the present invention, because the invention describes a method for information organization which makes possible a navigation pattern very distinct from the traditional Web Navigation, and much more powerful.
Turning to the conventional navigation technique, as shown in the method 100 of FIG. 1, in step 105, the user will provide the engine with a search string, which may contain text used in the metadata relationship and also logical operators (such as the logical AND operator in the case of a Boolean search).
In step 110, the search engine will then return a list of links to Web pages which are related to the search criteria. As noted above, this list may be ordered utilizing “search scores” obtained from some other criteria derived from the metadata, as explained above.
In step 115, the user can then browse this list, which typically contains the page titles and excerpts from the page where words contained in the search criteria were found. Then, in steps 120 or 125, the user will browse this list and select the link which may contain the desired information, or even lead to the desired information.
The term lead is here of special significance. For example, sometimes articles posted by news services, e-mail notes, and even chat records are returned as the result of a search. Now, the user may select to follow a link to one of such documents because of the possibility that the document in turn may contain a link to another document which has the right information (step 130).
Sometimes, the user may have to follow several of these links, until either the information is found (step 135) or the user comes to a “dead end” (e.g., steps 140, 145, 150, 155). A “dead end” in the Web navigation process occurs when the user follows a link to a document which is not relevant to his search and that contains no other links which are relevant to the search (steps 140–155).
When the user encounters such a dead end, the user has the choice of “backing up” (e.g., step 150 of going back) to the previous page, or to any of the other previously visited pages. The previously visited pages are collectively called “the search history”. Then, the user can choose other links contained in pages in the search history to traverse. When no more interesting links are left in the search history, the user may go back to the original list of links returned by the search engine and select a new starting point for the traversal (e.g., step 115).
The user iterates on this process until either the information is found or the search list is exhausted. If the search list is exhausted, the user may resort to trying other search criteria (e.g., step 120) which either describe the subject or are related to the subject that is being searched. The navigation process is then repeated. Hence, the conventional navigation technique of FIG. 1 is performed, but is inconvenient to the user due to backing up, etc.
That is, many times the user is searching for information which cannot be exactly defined by exact search criteria, and as a result too many results are returned (in the range of thousands). In this case, the conventional navigation pattern described above will make it very hard to find the desired information, as shown in FIG. 2.
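The conventional navigation described above, including backing up from dead ends, can be sketched as a depth-first traversal that records every page load. This is an illustrative model only: the function `navigate`, its parameters, and the sample link structure are assumptions, and the sketch remembers visited pages perfectly, which an actual user may not.

```python
def navigate(search_results, links, is_target):
    """Follow links depth-first from each search result, backing up at dead ends.

    Returns (found page or None, trace), where `trace` lists every page load,
    including reloads of earlier pages caused by backing up.
    """
    visited = set()
    trace = []

    def follow(page):
        trace.append(page)
        visited.add(page)
        if is_target(page):
            return page
        for next_page in links.get(page, []):
            if next_page in visited:
                continue
            found = follow(next_page)
            if found is not None:
                return found
            trace.append(page)   # dead end: back up to this page
        return None

    for start in search_results:     # return to the result list for a new start
        if start in visited:
            continue
        found = follow(start)
        if found is not None:
            return found, trace
    return None, trace


# Hypothetical example: the desired page "E" is reachable only from result "B".
links = {"A": ["C", "D"], "B": ["E"], "C": [], "D": [], "E": []}
found, trace = navigate(["A", "B"], links, lambda page: page == "E")

print(found, trace)
```

In this sample the user loads seven pages to reach a document that is two links away from the result list, because each dead end under the first result forces a return to an earlier page.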
That is, FIG. 2 illustrates the traditional navigation pattern resulting from the conventional web navigation in which finding the most relevant document is somewhat cumbersome and difficult.
As shown in FIG. 2, on the search result page, the search results are ordered according to their search score, with the highest being shown on the left hand side and sliding to the lowest across the page to the right hand side. L1–L12 are links and D1–D10 are documents. As shown, finding the most relevant document D10 is time consuming.
As evident from FIG. 2, a user must always traverse links to search pages. That is, a common problem is that after a search is input and the results are returned, the user goes through each page (document) one-by-one. However, if the user loses the list by, for example, traversing through a plurality of pages by following links on each page, then the user must back up and return to a top page (link). Thus, for example, after traversing D6, the user must return to the top (the search results page) and then go to link L2. It is noted that document D5 will be accessed twice, once by traversing the links under link L1 and again by traversing the links under link L2. The user then returns to the top and accesses link L3 and so forth, until document D10 is finally found. Thus, the conventional web navigation pattern is slow and time-consuming.
Thus, prior to the invention, there has been no satisfactory method in which to find and navigate data in Web pages, databases, etc.