There are several widely used indexing techniques for structured documents, each best suited to specific applications. Web applications and services may store HTML web pages, SOA transactions and various metadata in XML databases, such as Apache Xindice™ and MonetDB™ databases. The elements in the XML documents may be addressed via protocols such as XPointer™ or via query languages such as XQuery™. The applications using XML documents may include XML comparison tools such as Altova DiffDog™, search engine web page indexing, HTML navigation, semantic web applications and other suitable applications. There are similar proprietary techniques for indexing Macromedia Flash™ and Adobe Acrobat™ files, MS Office™ documents, e-books and other suitable structured documents.
The systems and methods addressing XML documents may be modified by one skilled in the art to address other types of structured and semi-structured documents.
The conventional indexing of structured documents addresses folders and web sites as trees, and files and web pages as leaves. Typically there is no segmentation below the document level. The indexing techniques for structured documents may be specifically built for the applications of interest. For example, search engines use search indices, inverted indices and suffix trees, which may be useful for search across multiple documents, but may not include section recognition and document hierarchy information. As another example, the XPointer™ framework forms a basis for identifying XML nodes, including a positional element addressing scheme, a scheme for namespaces, and a scheme for XPath™-based addressing. As a further example, XyDelta™ assigns a unique identifier to each node and supports XML difference detection and encoding. It is non-trivial to derive an XML indexing method that provides sufficient performance for multiple applications.
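The difference between positional element addressing and path-based addressing mentioned above may be illustrated by the following minimal sketch. It is not an XPointer™ implementation; it uses Python's standard xml.etree.ElementTree module, which supports only a limited XPath subset. The sample document and the helper name resolve_child_sequence are hypothetical and introduced for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
DOC = """<html>
  <head><title>Sample</title></head>
  <body>
    <div id="intro"><p>First paragraph</p><p>Second paragraph</p></div>
    <div id="main"><p>Third paragraph</p></div>
  </body>
</html>"""


def resolve_child_sequence(root, steps):
    """Follow 1-based child positions down from the root element.

    For example, [2, 1, 2] means: second child of the root, then its
    first child, then that element's second child.
    """
    node = root
    for step in steps:
        node = list(node)[step - 1]
    return node


root = ET.fromstring(DOC)

# Positional addressing: body -> first div -> second p.
by_position = resolve_child_sequence(root, [2, 1, 2])

# Path-based addressing of the same element via an attribute predicate.
by_path = root.find(".//div[@id='intro']/p[2]")

print(by_position.text)
print(by_position is by_path)
```

Both forms resolve to the same element in this sample. The positional form depends on sibling order and may break when siblings are inserted, whereas the path-based form depends on the id attribute remaining stable.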
In this patent we present a multiresolution indexing method for structured documents, developed to enable search within a document, contextual marking, incremental updates, granular proxy and storage of XML documents, and transcoding. Moreover, we describe how various applications may benefit from using the indexing system and methods described in this patent.
The system, methods and applications described in this patent overcome the deficiencies of conventional XML indexing techniques for search, visual marking and incremental update applications, as more fully set forth herein.
FIG. 1 illustrates a prior art system for document tree representation.
The document root node 101 is a parent to several nodes, including document head node 102 and document body node 103.
The document head node 102 may contain document metadata, including title, keywords, style sheets, scripts and other metadata applicable to the scope of the whole document.
The document body node 103 may contain the object nodes 104 displayed on the client's screen, including layers, tables, images, hyperlinks, forms, frames, ActiveX objects or any other suitable objects.
Object nodes 104 may recursively contain other object nodes 104, attribute nodes 105, text nodes 106, scripts or other suitable XML elements.
Attribute nodes 105 may contain object attributes and metadata, including style, name, event processing, user defined metadata and other suitable metadata.
Text nodes 106 may contain text and spaces.
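The tree representation described above may be sketched as a recursive traversal that labels each item as an object, attribute or text node. This is an illustrative sketch only, using Python's standard xml.etree.ElementTree module. The sample page and the function name describe are hypothetical, and for simplicity the head and body nodes are labeled as ordinary object nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, used only for illustration.
PAGE = (
    "<html><head><title>Demo</title></head>"
    '<body><a href="x.html" name="link1">Go</a></body></html>'
)


def describe(node, depth=0, out=None):
    """Collect (depth, kind, value) tuples for a node and its subtree."""
    out = [] if out is None else out
    out.append((depth, "object", node.tag))
    # Attributes of the element are reported one level below it.
    for name, value in node.attrib.items():
        out.append((depth + 1, "attribute", f"{name}={value}"))
    # Only leading text is reported; tail text after child elements
    # is ignored in this simplified sketch.
    if node.text and node.text.strip():
        out.append((depth + 1, "text", node.text.strip()))
    # Object nodes may recursively contain other object nodes.
    for child in node:
        describe(child, depth + 1, out)
    return out


nodes = describe(ET.fromstring(PAGE))
for depth, kind, value in nodes:
    print("  " * depth + f"{kind}: {value}")
```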
Metadata storage 107 may reside inside the document, or outside the document and be linked to the document. For example, search engines may keep web sites as graphs or trees, with documents as tree leaves and the index of the content of each document as metadata.
The user data 108 may include user comments, tagging, voting, page views and other suitable user-originated metadata relevant to the document.
Search indices 109 may be implemented as an A-Z book-style index or any other suitable search engine indexing method, and may include connections between keywords and the document. For example, for each keyword a list of documents containing the keyword may be kept. The connection is made at the level of the full document, or of a version of the document.
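A document-level keyword index of the kind described above may be sketched as follows. This is a minimal illustration rather than the indexing method of any particular search engine; the function name build_index, the tokenization rule and the sample documents are assumptions made for the example.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each keyword to the set of document identifiers containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumed tokenization: lowercase runs of letters and digits.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


# Hypothetical documents, used only for illustration.
docs = {
    "a.html": "XML indexing for search",
    "b.html": "Incremental XML updates",
}
index = build_index(docs)

print(sorted(index["xml"]))
print(sorted(index["search"]))
```

The index records only which documents contain a keyword. It does not record where in the document the keyword occurs, which reflects the absence of segmentation below the document level noted above.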
Document history 110 may include document versions, update history, statistics history or other suitable history.