1. Field of the Invention
The field of the invention is data processing, or, more specifically, methods, systems, and products for indexing information in a distributed data processing system.
2. Description of Related Art
An example from current art of a large distributed data processing system is the World Wide Web. Search engines on the web are essentially massive full-text indexes of millions of web pages. These search engines are specialized software programs that receive search query messages from users or from users' browsers, where the search query messages comprise keywords or search terms. Search engines formulate, or ‘parse,’ the query messages into database queries against web search databases comprising massive search indexes.
The web includes many web sites comprising many millions of web pages, each of which is a document specially structured in a markup language, such as, for example, HTML, WML, HDML, and so on, to support hyperlinking over a data communications protocol, such as, for example, HTTP, WAP, HDTP, and so on. The search indexes for the search engines are created by software robots called ‘spiders’ or ‘crawlers’ that survey the web and retrieve documents for indexing. The indexing itself is often carried out by another software engine that takes as its input the pages gathered by spiders, extracts keywords according to some algorithm, and creates index entries based upon the keywords and the URLs identifying the indexed documents.
That is, spiders gather documents into a documents database, identifying the documents to be gathered from a URL list in the documents database or through hyperlinks in the documents themselves or through other methods. Spiders take as their inputs the entire web and produce as outputs documents to be indexed. Indexing engines take as their inputs documents to be indexed and produce as their outputs search indexes. Search engines take as inputs search indexes and search request messages bearing search terms and produce as their outputs search result messages for return to requesting users' browsers.
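The division of labor just described, in which an indexing engine turns gathered documents into a search index and a search engine answers queries against that index, can be sketched as follows. This is a minimal illustration only, not the implementation of any particular search engine. The names DOCUMENTS, extract_keywords, build_index, and search, the example URLs, and the simple keyword rule (lowercased alphanumeric tokens) are all assumptions introduced here for illustration.

```python
import re
from collections import defaultdict

# Hypothetical documents database: URL -> page text already gathered by a spider.
DOCUMENTS = {
    "http://example.com/a": "Entity beans in EJB can use XML descriptors",
    "http://example.com/b": "XML parsing basics",
}

def extract_keywords(text):
    """Assumed keyword rule: lowercase the text and keep alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(documents):
    """Indexing engine: map each keyword to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for keyword in extract_keywords(text):
            index[keyword].add(url)
    return index

def search(index, query):
    """Search engine: parse a '+'-separated query and return the URLs
    that contain every search term, in sorted order."""
    terms = [t for t in query.lower().split("+") if t]
    if not terms:
        return []
    matches = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(matches)

index = build_index(DOCUMENTS)
print(search(index, "ejb+xml"))  # prints ['http://example.com/a']
```

Note that each index entry in this sketch records only a keyword and a URL; nothing about where in the document the keyword appeared is retained.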
In current art, search engines return search results matching search terms from search requests with no indication of where on a page the search terms were located. A search for the terms “ejb+xml+bmp” therefore can and often does return results in which those terms appear in an advertisement or a navigation panel in a document whose actual content has nothing to do with the search terms. This is true even though the specially structured documents comprising the web all contain indications of the structure of the documents themselves, because the documents do not indicate the meaning of their structure. That is, the fact that search terms appear in an HTML table, form, or frame does not indicate whether the table, form, or frame is an advertisement, a navigation panel, or actual content. With no specification of the meaning, that is, the semantics, of the structure, indexing engines are unable to include the semantics in the search indexes, and search engines are therefore unable to distinguish semantics or support search queries on the basis of semantics. There are ongoing needs for improvement, therefore, in searching and indexing documents in large distributed data processing systems like the web.
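The limitation described above can be illustrated with a short sketch that uses the HTMLParser class from Python's standard library. The page content, the TextExtractor class, and the runs attribute are hypothetical and are introduced only for this example. The sketch records, for each run of text, the markup element that encloses it.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Records each non-empty text run together with its enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, innermost last
        self.runs = []   # (enclosing tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            enclosing = self.stack[-1] if self.stack else None
            self.runs.append((enclosing, text))

# Hypothetical page: the table is an advertisement, the paragraph is the content.
PAGE = """<html><body>
<table><tr><td>Learn EJB XML BMP today</td></tr></table>
<p>Recipes for apple pie</p>
</body></html>"""

parser = TextExtractor()
parser.feed(PAGE)
parser.close()
print(parser.runs)
# prints [('td', 'Learn EJB XML BMP today'), ('p', 'Recipes for apple pie')]
```

The structure is recoverable: the search terms are known to sit in a table cell. Nothing in the markup, however, states that the table is an advertisement rather than actual content, so an indexing engine working from this output has no semantic information to record in the search index.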