In many technical fields, up to 80 percent of the mission-critical information exists in heterogeneous or unstructured formats, such as spreadsheets, word processing documents, PDF files, Web pages and other presentation formats (collectively referred to as “documents” herein). These semi-structured and unstructured documents are scattered across many domains, and the fraction of documents in such forms is probably increasing as the variety of formats increases. Traditional approaches to data management and integration, such as data warehousing and customized point-to-point communication connections between specific applications and backend databases, are expensive, time-consuming and risky to implement, and will probably provide a decreasing fraction of a total solution, if indeed a total solution can ever be implemented.
Most commercial off-the-shelf (COTS) tools available today for database querying are web-based technologies that will retrieve only the content of data stored in particular formats. Most COTS tools are limited to storing, retrieving and querying data in a flat file system. Queries of arbitrary-format (or unstructured) documents cannot be implemented. Further, support for complex queries that span both context and content keyword searches is either inefficient or non-existent.
What is needed is a document database framework for managing and searching within the database that is robust and flexible, that makes effective use of an XML formalism, and that can be applied to unstructured and semi-structured documents in the database. Preferably, the system should work with most proprietary and non-proprietary database integration software. Preferably, the system should allow use of simple queries and hierarchical queries.
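By way of illustration only, the distinction between a content keyword search and a combined context-and-content search over an XML representation of a document can be sketched as follows. The element names (`document`, `section`, `context`, `content`), the sample text and the `query` helper are hypothetical and are not taken from any particular system; the sketch assumes only Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML rendering of one document: each <section> carries a
# heading (the "context") and body text (the "content"). Sections may nest.
SAMPLE = """
<document name="report.doc">
  <section>
    <context>Budget</context>
    <content>Travel costs rose in the second quarter.</content>
    <section>
      <context>Budget Risks</context>
      <content>Hardware prices may increase.</content>
    </section>
  </section>
  <section>
    <context>Schedule</context>
    <content>Travel to the test site is planned for May.</content>
  </section>
</document>
"""


def query(root, context_kw=None, content_kw=None):
    """Return (context, content) pairs for sections matching every keyword given.

    Matching is a case-insensitive substring test; a keyword left as None
    places no restriction on that field.
    """
    hits = []
    # iter() walks nested sections too, so the hierarchy is searched in full.
    for section in root.iter("section"):
        # findtext() with a bare tag name looks at direct children only,
        # so a parent section does not pick up its subsection's heading.
        context = section.findtext("context", default="")
        content = section.findtext("content", default="")
        if context_kw and context_kw.lower() not in context.lower():
            continue
        if content_kw and content_kw.lower() not in content.lower():
            continue
        hits.append((context, content))
    return hits


root = ET.fromstring(SAMPLE)

# Simple query: content keyword only.
content_only = query(root, content_kw="travel")

# Combined query: the keyword must appear under a matching heading.
combined = query(root, context_kw="budget", content_kw="travel")
```

In this sketch the simple query matches the text under both the "Budget" and "Schedule" headings, while the combined query narrows the result to the "Budget" section alone. A flat keyword search has no access to the heading and so cannot make that distinction.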