1. Field of the Invention
The field of the invention relates to information retrieval systems. More particularly, the field of the invention relates to generating index information for data objects.
2. Description of the Related Technology
Information retrieval (IR) systems index documents by searching for keywords that are contained within the documents. Typically, the searches are not performed on the documents themselves. Instead, words are extracted from the document and are then indexed in separate data structures optimized for searching.
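By way of illustration only, the separate search-optimized data structure described above is often an inverted index, which maps each word to the documents that contain it. The following is a minimal sketch under that assumption. The names `tokenize`, `build_index`, and `search`, and the sample documents, are hypothetical and are not taken from any particular IR system.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each extracted word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents, used only to exercise the sketch.
documents = {
    "doc1": "Secure documents present a special problem.",
    "doc2": "Index information is stored in separate data structures.",
    "doc3": "Secure software restricts access to index information.",
}
index = build_index(documents)
```

In this sketch a query is answered from the index alone. The document text is needed only while the index is being built, which is why full access to the document contents matters at indexing time.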
However, secure documents, such as documents that are protected by digital rights management (DRM) software, present a special problem for IR systems. Traditionally, IR systems rely upon having full access to the contents of a document to prepare the index information for that document. For example, IR systems that index HyperText Markup Language (HTML) documents on the Internet typically open each HTML document via its Uniform Resource Locator (URL), then download, parse, and index the entire document.
Secure software, however, does not permit this kind of unrestricted access. Access is restricted to those applications that are both authorized and trusted by the secure software. For security reasons, all other applications are prevented from accessing the protected document.
One way to solve this problem is to retrofit all pre-existing IR systems so that they are “rights enabled.” This solution permits IR systems to communicate directly with secure software to obtain the document source. However, this approach makes a number of unrealistic assumptions, including: (i) that it is possible to retrofit legacy IR systems such that they would comply with the secure software's security requirements; (ii) that all secure system providers would be willing or able to make the necessary changes in a timely manner; and (iii) that it is possible to establish the necessary trust relationships between every secure provider, copyright holder, and IR system provider. This approach has attendant flaws and there is a need for a better solution.
Another problem with preparing index information for IR systems is that each IR system uses different indexing algorithms for organizing and storing information. IR systems often analyze the header of the electronic document when selecting the index information for the electronic document. The header includes meta-information regarding the content of the document. However, not all IR systems retrieve the same keywords from the electronic document when selecting the index information. For example, some IR systems remove duplicative words from the metatag information, while others do not. As a further example, some IR systems recognize phrases, while others do not. Accordingly, it is difficult to prepare index information that is ideally suited for use with more than one IR system.
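The differences noted above can be made concrete with a short sketch. It assumes, purely for illustration, that the metatag information is a comma-separated keyword list and that an IR system's behavior can be reduced to two switches. The function `select_keywords`, its parameters, and the sample keyword string are hypothetical.

```python
def select_keywords(meta_keywords, remove_duplicates, recognize_phrases):
    """Derive index terms from a comma-separated metatag keyword string."""
    entries = [e.strip().lower() for e in meta_keywords.split(",") if e.strip()]
    terms = []
    for entry in entries:
        if recognize_phrases:
            # Keep a multi-word entry together as a single term.
            terms.append(entry)
        else:
            # Break the entry into individual words.
            terms.extend(entry.split())
    if remove_duplicates:
        # dict.fromkeys drops repeats while keeping first-occurrence order.
        terms = list(dict.fromkeys(terms))
    return terms


# Hypothetical metatag content, used only to exercise the sketch.
meta = "digital rights, rights management, digital rights"

# A system that removes duplicates and recognizes phrases.
system_a = select_keywords(meta, remove_duplicates=True, recognize_phrases=True)
# A system that does neither.
system_b = select_keywords(meta, remove_duplicates=False, recognize_phrases=False)
```

Under these assumptions the first system retains two phrase terms, while the second retains six single-word terms that include repeats, so the same metatag information yields different index terms in each system.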
Thus, there is a need for a system for providing index information to IR systems. The system should be able to provide information to the IR systems that is almost as usable as the original document source. Preferably, the system should not require the modification of any legacy IR systems. Furthermore, it should be difficult to reconstruct the original document source (or any reasonable facsimile thereof) from the provided index information. In addition, the system should be able to automatically customize the index information for an electronic document on an IR system-by-IR system basis.