The number of documents contained in computer-based information retrieval systems is growing rapidly. Networks bring together large collections of documents, and the increased amount of data makes the retrieval process more difficult. The development of efficient and effective document retrieval techniques is critical to managing the increasing number of documents available in electronic form.
A complicating factor in many information retrieval systems is that many documents are merely different representations of the same content (e.g., a Microsoft Word document can be formatted as PDF, as plain text, as HTML, etc.). Similarly, the same data could be stored in an Oracle database or in an Excel spreadsheet, and the same article could be stored in both English and French.
Another complicating factor arises when documents are revised: a storage system may hold several versions of the same document. In addition, documents may be componentized, so that the same paragraph, slide, or figure appears in multiple documents. A further factor is that some documents need not be stored at all. Instead, such virtual documents can be generated upon demand by programs.
A primary issue in the retrieval of electronic documents is filtering the vast amount of information available so that a user can obtain information of interest quickly and efficiently, and receive that information in an acceptable format. To assist in searching information available on the Internet, a number of search techniques have been devised to find information requested by the user.
Systems for storing and querying XML data have been implemented. For example, goxml.com facilitates the search of XML data stores, and includes the ability to perform transformations on result documents. However, goxml.com does not support the concept of metadata, nor does it support any chaining of transformations. Other systems, such as xyleme.com, support the searching of XML data on the web, but support neither transformations nor metadata. In addition to these systems for storing and querying XML data, there are many other systems for storing and querying electronic documents in a variety of formats. None of the existing systems appears to integrate transformation and querying capabilities for both metadata and content, nor does any appear to support the creation of transformation plans.
It would be desirable for a system to integrate transformation and querying capabilities for both metadata and content, and to support the creation of transformation plans. It would also be desirable for a system to take both search and transformation costs into account when creating a transformation plan.
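By way of illustration only, the notion of a transformation plan that accounts for both search and transformation costs can be sketched as a shortest-path problem over a graph of document formats. The sketch below is hypothetical and is not taken from any existing system: the format names, the cost values, and the functions cheapest_chain and best_plan are all assumptions introduced for the example. Each candidate document carries an assumed search cost, and a chain of transformations to the requested format is chosen so that the combined cost is minimized.

```python
import heapq

# Hypothetical per-step transformation costs between formats.
# The formats and the numbers are illustrative assumptions only.
TRANSFORMS = {
    "doc": {"pdf": 4.0, "html": 2.0},
    "html": {"txt": 1.0, "pdf": 3.0},
    "pdf": {"txt": 5.0},
    "txt": {},
}


def cheapest_chain(source, target, transforms=TRANSFORMS):
    """Return (cost, [formats]) for the cheapest chain of transformations
    from source to target, or None if no chain exists (Dijkstra's algorithm)."""
    queue = [(0.0, source, [source])]
    settled = set()
    while queue:
        cost, fmt, path = heapq.heappop(queue)
        if fmt == target:
            return cost, path
        if fmt in settled:
            continue
        settled.add(fmt)
        for nxt, step_cost in transforms.get(fmt, {}).items():
            if nxt not in settled:
                heapq.heappush(queue, (cost + step_cost, nxt, path + [nxt]))
    return None


def best_plan(candidates, target, transforms=TRANSFORMS):
    """Pick the candidate whose search cost plus transformation cost is lowest.

    candidates: list of (doc_id, stored_format, search_cost) tuples.
    Returns (total_cost, doc_id, [formats]) or None if nothing can be converted.
    """
    best = None
    for doc_id, fmt, search_cost in candidates:
        chain = cheapest_chain(fmt, target, transforms)
        if chain is None:
            continue  # this candidate cannot reach the requested format
        total = search_cost + chain[0]
        if best is None or total < best[0]:
            best = (total, doc_id, chain[1])
    return best


if __name__ == "__main__":
    # Two stored representations of the same content, requested as plain text.
    candidates = [("a", "doc", 1.0), ("b", "pdf", 0.5)]
    print(best_plan(candidates, "txt"))
```

In this sketch, the document that is cheaper to find is not necessarily the one selected: a candidate with a higher search cost may still yield the better plan if a shorter or cheaper chain of transformations leads from its stored format to the requested one.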