1. Field of the Invention
This invention relates to content indexing of XML documents within a stream of one or more XML documents.
2. Description of the Related Art
In processing XML or other forms of hierarchical data, including SGML, JSON, and the like, different areas of optimization are developing. The first is the dividing up of Extensible Markup Language (XML) data in an XML document for storage and later retrieval in conventional relational databases, hierarchical databases such as XML databases, and/or hybrid database systems. In this area, optimization focuses on storing parts of the XML document so as to facilitate locating and retrieving the data of the XML document. The second area focuses on optimizing queries for XML data by re-writing the XML query and/or adjusting query execution plans such that the requested data is located and retrieved as efficiently as possible. A third area seeks to optimize how an entire XML document can be marked, tagged, or otherwise identified as having content that will merit future retrieval of the entire XML document. Operations in this third area may be referred to as subject indexing, tagging, cataloging, indexing, content indexing (as used herein), or search engine indexing of content in an XML document. Such operations are distinct from the creation and maintenance of indexes in a database system.
Optimizations in relation to databases benefit from the ability to expend time and overhead processing and manipulating an XML document or XML data query once, in exchange for optimization benefits that accrue over time due to the large collection of documents and high query request rates. In contrast, content indexing takes place during the update or storage of an XML document, and so the impact of content indexing on performance should be minimal. Unfortunately, conventional solutions in the area of content indexing have used XPath processors that load the entire XML document into memory as a Document Object Model (DOM). This requires significant processing resources and introduces delay while waiting for the DOM instance to be generated. Furthermore, the set of XML documents that will be the subject of content indexing is not known in advance, and thus the performance of conventional techniques is unpredictable. Typically, although the set of XML documents that will be the subject of content indexing is not known in advance, it is known that the XML documents are generally very large, often many tens of megabytes each. Consequently, the inefficient use of memory and processing resources by conventional content indexing solutions has prompted a search for more efficient solutions.
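By way of illustration only, the difference between building a complete in-memory tree and processing a document as a stream of parse events can be sketched as follows. This is a minimal sketch under stated assumptions and does not represent any particular conventional product or any claimed method. It uses the Python standard library module xml.etree.ElementTree, which supports only a subset of XPath and is not a W3C DOM implementation, as a stand-in for a tree-building XPath processor. The sample document, the element name "customer", and the function names index_with_tree and index_with_stream are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are illustrative only.
SAMPLE = (
    b"<orders>"
    b"<order><customer>Acme</customer></order>"
    b"<order><customer>Globex</customer></order>"
    b"</orders>"
)


def index_with_tree(data: bytes, path: str) -> list[str]:
    """Tree-building style: parse the whole document, then evaluate a path."""
    # The complete tree exists in memory before any value can be extracted.
    root = ET.parse(io.BytesIO(data)).getroot()
    return [element.text for element in root.findall(path)]


def index_with_stream(data: bytes, tag: str) -> list[str]:
    """Event-based style: inspect each element as its end tag is reached."""
    values = []
    for _event, element in ET.iterparse(io.BytesIO(data), events=("end",)):
        if element.tag == tag:
            values.append(element.text)
        # Drop this element's text, attributes, and children once seen.
        element.clear()
    return values


if __name__ == "__main__":
    print(index_with_tree(SAMPLE, ".//customer"))
    print(index_with_stream(SAMPLE, "customer"))
```

In the first function no value is available until the whole document has been parsed into a tree, which is the cost described above for documents of many tens of megabytes. The second function extracts the same values while the document is still being read and discards element content as it goes.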