The present invention relates to indexing structures to facilitate computerized searches through data. More specifically, the present invention relates to a method and an apparatus for generating an index to facilitate searching through data within a document based upon a predefined index stylesheet associated with the document that contains instructions for creating an index for the document.
The explosive growth of the Internet has been strongly tied to the development of search engines that allow users to rapidly search through large volumes of textual data from thousands and even millions of different web sites. A user who is interested in a particular topic merely has to enter a number of keywords into a search engine in order to receive links to different web pages containing the keywords.
Search engines typically create an "index" of documents (such as web pages) that are available on the World Wide Web. An index typically stores individual words (or other meaning-carrying textual strings) in a more compact and easily searchable form known as "tokens."
The process of building an effective index can be greatly complicated by the fact that documents can exist in a wide variety of different forms which need to be indexed differently. For example, an efficient index for a technical paper might contain the abstract and title of the technical paper, but not the body of the technical paper, whereas an efficient index for a television schedule might contain ratings for individual television programs.
The process of creating an index is also complicated by the fact that for common document formats, such as the Hypertext Markup Language (HTML) or the Extensible Markup Language (XML), much of the important information for search purposes is stored within attribute fields, and is not within the normal text of the document.
Furthermore, the structure of a document may change over time, which can require the structure of the index to change. For example, suppose the structure of a product catalog is updated to include consumer reviews for individual products. This change may require the index to change to include these consumer reviews.
Existing systems create indexes for documents using ad hoc rules. For example, one ad hoc rule is to create an index for all textual information that is not within attribute fields. Unfortunately, such ad hoc rules often include much unimportant information in the index, and often exclude important information.
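The shortcoming of such an ad hoc rule can be illustrated with a minimal sketch. The television-schedule fragment, its element and attribute names, and the function `adhoc_index_terms` are assumptions made up for this illustration; they are not taken from any existing system.

```python
import xml.etree.ElementTree as ET

# Hypothetical television-schedule fragment; the rating is stored in an attribute.
DOC = '<schedule><program rating="PG" channel="7">Evening News</program></schedule>'

def adhoc_index_terms(xml_text):
    """Ad hoc rule: collect every word of element text and ignore all attributes."""
    root = ET.fromstring(xml_text)
    terms = []
    for text in root.itertext():
        terms.extend(text.lower().split())
    return terms

terms = adhoc_index_terms(DOC)
print(terms)  # ['evening', 'news']
```

Under this rule the rating "PG" never reaches the index, so a search for programs by rating would find nothing, even though the rating is arguably the most useful field for a television schedule.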
A similar problem exists in converting the document into tokens (tokenizing the document) during the index creation process. During the index creation process, relevant portions of a document are converted into tokens associated with individual meaning-carrying units of text, such as wordforms or numbers. In the English language, wordforms are typically delineated by white spaces and punctuation marks. Hence, the tokenizing process is relatively easy. In contrast, languages such as Japanese have no such delineation. Consequently, the tokenization process depends on contextual information and can be very complicated.
The tokenization process can also be domain dependent. For example, periods within an email address, such as "person.dept@companyx.com" are linking elements, whereas periods within other textual information typically delineate word and sentence boundaries.
Hence, the tokenization process varies between languages and between domains.
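The domain dependence described above can be sketched as follows. This is only an illustration: the two regular expressions are deliberately simplified assumptions (the email pattern in particular is far from a complete address grammar), and the function names are hypothetical.

```python
import re

# Runs of letters and digits; everything else is treated as a boundary.
WORD = re.compile(r"[A-Za-z0-9]+")
# Deliberately simplified email pattern, for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")

def tokenize_plain(text):
    """Treat every period as a boundary, as in ordinary English prose."""
    return [t.lower() for t in WORD.findall(text)]

def tokenize_email_aware(text):
    """Keep each email address as one token; tokenize the rest as plain text."""
    tokens = []
    pos = 0
    for match in EMAIL.finditer(text):
        tokens.extend(tokenize_plain(text[pos:match.start()]))
        tokens.append(match.group().lower())
        pos = match.end()
    tokens.extend(tokenize_plain(text[pos:]))
    return tokens

sample = "Contact person.dept@companyx.com. Thanks."
print(tokenize_plain(sample))
# ['contact', 'person', 'dept', 'companyx', 'com', 'thanks']
print(tokenize_email_aware(sample))
# ['contact', 'person.dept@companyx.com', 'thanks']
```

The same input yields different tokens depending on which rules apply, which is why a single fixed tokenizer is unlikely to suit every language and domain.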
One embodiment of the present invention provides a system that generates an index to facilitate searching through text within a document based upon an index stylesheet associated with the document. The system operates by receiving a document to be indexed and then parsing the document to produce a parsed document. The system also retrieves instructions for creating the index for the document from an index stylesheet associated with the document. The system creates the index for the document by transforming the parsed document in a manner that is specified by the instructions retrieved from the index stylesheet.
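The flow described above (parse the document, retrieve instructions from the index stylesheet, transform the parsed document into an index) can be sketched roughly as follows. The stylesheet format shown here, including the `<skip>` and `<attribute>` elements, is invented purely for illustration and is not the format used by the invention; the product-catalog document and all function names are likewise hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical index stylesheet; the element names are assumptions for this sketch.
STYLESHEET = """
<index-stylesheet>
  <skip element="legal"/>
  <attribute element="product" name="sku"/>
</index-stylesheet>
"""

# Hypothetical document to be indexed.
DOCUMENT = """
<catalog>
  <product sku="A100">
    <name>Desk Lamp</name>
    <legal>Terms and conditions apply</legal>
  </product>
</catalog>
"""

def load_instructions(stylesheet_text):
    """Retrieve the indexing instructions from the stylesheet."""
    root = ET.fromstring(stylesheet_text.strip())
    skip = {e.get("element") for e in root.findall("skip")}
    attrs = {(e.get("element"), e.get("name")) for e in root.findall("attribute")}
    return skip, attrs

def tokenize(text):
    """Split text into lowercase alphanumeric tokens (a simple English-style rule)."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text or "")]

def build_index(document_text, stylesheet_text):
    """Parse the document, then transform it as the stylesheet instructs."""
    skip, attrs = load_instructions(stylesheet_text)
    parsed = ET.fromstring(document_text.strip())
    index = defaultdict(set)  # token -> set of locations where it occurs

    def visit(node):
        if node.tag in skip:
            return  # omit this section and everything beneath it
        for name, value in node.attrib.items():
            if (node.tag, name) in attrs:
                for tok in tokenize(value):
                    index[tok].add(node.tag + "@" + name)
        for tok in tokenize(node.text):
            index[tok].add(node.tag)
        for child in node:
            visit(child)
            # Text following a child element belongs to the enclosing node.
            for tok in tokenize(child.tail):
                index[tok].add(node.tag)

    visit(parsed)
    return dict(index)

index = build_index(DOCUMENT, STYLESHEET)
print(sorted(index))  # ['a100', 'desk', 'lamp']
```

Because the rules live in the stylesheet rather than in the indexing code, changing what is indexed (for example, adding another attribute) would only require editing the stylesheet in this sketch.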
In one embodiment of the present invention, retrieving the index stylesheet involves retrieving the index stylesheet across a network from a remote address.
In one embodiment of the present invention, the index stylesheet is appended to the document.
In one embodiment of the present invention, the system additionally makes the index available to a search engine so that the search engine is able to scan through the index.
In one embodiment of the present invention, the index stylesheet specifies sections of the document to skip in creating the index for the document.
In one embodiment of the present invention, the index stylesheet specifies attributes of the document that are to be included in the index.
In one embodiment of the present invention, the system receives additional documents to be indexed, and creates indexes for the additional documents using the index stylesheet.
In one embodiment of the present invention, creating the index for the document involves tokenizing the document by partitioning text within the document into individual meaning-carrying units of text.
In one embodiment of the present invention, prior to receiving the document, the system downloads and parses an index configuration file which specifies the index stylesheet to be used in creating the index.
In one embodiment of the present invention, the system receives the document from a client at an indexing server that creates the index for the client.