In the latter half of the twentieth century, there began a phenomenon known as the information revolution. While the information revolution is a historical development broader in scope than any one event or machine, no single device has come to represent the information revolution more than the digital electronic computer. The development of computer systems has surely been a revolution. Each year, computer systems grow faster, store more data, and provide more applications to their users.
A modern computer system has the capability to store mountains of data, and using readily available networks can access even more data. Unfortunately, the capability of digital systems to store and access data has generally outpaced the ability of users to organize or understand the data. Considerable effort has been and continues to be devoted to developing improved techniques for organizing, searching, collating and presenting meaningful information to users from the voluminous data potentially available.
One of these developments has been the content management system (CMS). A CMS is a computer program or set of programs for managing a repository of documents containing heterogeneous data, usually on behalf of multiple users. The data in the repository may include text files, spreadsheets, structured tables, graphics, still digital images, digital video, audio, and so forth. XML documents, which have the capability to incorporate heterogeneous content and metadata, are exemplary candidates for storage and management using a CMS, although a CMS is not necessarily limited to XML documents. The CMS allows users to add, modify, copy and delete data in the repository, and/or to perform other operations with respect to the data. The CMS typically includes tools for indexing, searching and retrieval of data in the repository, as well as for organizing and formatting output of data from the repository.
A CMS typically includes rules which enforce data integrity. These rules are used to process files whenever a document (which may contain one or multiple files) is checked into or out of the repository. If a rule is satisfied, the CMS may perform subsequent processing of the content. Known content management systems may include rules related to bursting, linking and synchronization. Bursting rules govern how a document is burst, or broken into individual chunks, when the document is checked into the repository. Once a document has been burst into chunks, the individual chunks may be reused in other documents and by other authors. Linking rules are used for importing and associating objects related to a CMS document based on particular elements or attributes from the document as specified by the rules. For example, an XML document that references external images can take advantage of linking rules so that relationships between the XML content and the external images are automatically created when the document is checked into the repository. Another kind of linking rule governs what content in a repository a user may link to in a document that will be subsequently checked into the repository. Synchronization rules govern synchronization between content and metadata related to the content. For example, a synchronization rule may specify that whenever a specified CMS attribute is changed, a particular piece of XML in the content should be automatically updated with that attribute's value.
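By way of illustration only, a bursting rule and a synchronization rule of the kind described above might be sketched as follows. This is a minimal sketch using the Python standard library; the function names `burst` and `synchronize`, the choice of a tag name as the bursting criterion, and the sample element names are assumptions made for this example and are not taken from any particular CMS.

```python
import xml.etree.ElementTree as ET


def burst(xml_text, chunk_tag):
    """Hypothetical bursting rule: return one serialized chunk for each
    direct child of the root element whose tag equals chunk_tag."""
    root = ET.fromstring(xml_text)
    chunks = []
    for elem in root.findall(chunk_tag):
        elem.tail = None  # drop whitespace that follows the element in the parent
        chunks.append(ET.tostring(elem, encoding="unicode"))
    return chunks


def synchronize(xml_text, target_path, attribute_value):
    """Hypothetical synchronization rule: copy a changed CMS attribute value
    into the element of the content located by target_path."""
    root = ET.fromstring(xml_text)
    target = root.find(target_path)
    if target is not None:
        target.text = attribute_value
    return ET.tostring(root, encoding="unicode")


# Sample document; the element names are illustrative only.
SAMPLE = (
    '<doc><title>Old</title>'
    '<section id="a"><p>One</p></section>'
    '<section id="b"><p>Two</p></section></doc>'
)

if __name__ == "__main__":
    for chunk in burst(SAMPLE, "section"):
        print(chunk)
    print(synchronize(SAMPLE, "title", "New"))
```

In this sketch each chunk is returned as an independent XML string, so that it could be stored and reused separately from the document it came from; an actual CMS would also record how the chunks relate to the parent document.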
Much of the data in a typical CMS is not directly searchable using conventional search and retrieval techniques. For example, it is difficult or impossible to search raw digital images. Typically, a CMS repository is organized as a collection of files and a structured relational database of metadata. The individual files contain the raw data provided by the users. The relational database contains metadata which characterizes the files. For example, the database may include a database table, in which each entry (row) in the table corresponds to a single one of the files in the repository, and the fields contain parameters describing the respective files. This metadata could include anything which might be useful in categorizing and retrieving files from the repository, such as a type of data in the file, file size, title, date of creation, date of most recent edit, version, author, subject, keywords, references to other files, etc. Content in the CMS is typically located by searching the metadata using any of various conventional search techniques. It may also be possible to search selective types of data in the raw data files; for example, it may be possible to search text files for selective strings of text characters.
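The arrangement described above, in which one table row describes one file and content is located by searching metadata, can be illustrated with a small sketch. The table name `file_metadata`, its columns, the helper names and the sample records are all assumptions chosen for this example; an actual CMS schema would likely differ.

```python
import sqlite3


def build_metadata_db(records):
    """Create an in-memory metadata table holding one row per repository file."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE file_metadata (
            path       TEXT PRIMARY KEY,
            data_type  TEXT,
            size_bytes INTEGER,
            title      TEXT,
            author     TEXT,
            keywords   TEXT
        )
        """
    )
    conn.executemany(
        "INSERT INTO file_metadata VALUES (?, ?, ?, ?, ?, ?)", records
    )
    conn.commit()
    return conn


def find_files(conn, author=None, keyword=None):
    """Locate files by searching the metadata, never the raw file contents."""
    clauses, params = [], []
    if author is not None:
        clauses.append("author = ?")
        params.append(author)
    if keyword is not None:
        clauses.append("keywords LIKE ?")
        params.append(f"%{keyword}%")
    sql = "SELECT path FROM file_metadata"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY path"
    return [row[0] for row in conn.execute(sql, params)]


# Illustrative records only; none of these files exist.
RECORDS = [
    ("docs/report.xml", "xml", 8192, "Annual report", "bob", "finance,annual"),
    ("images/logo.png", "image", 20480, "Company logo", "alice", "branding,logo"),
    ("video/intro.mp4", "video", 1048576, "Intro video", "alice", "branding,video"),
]

if __name__ == "__main__":
    conn = build_metadata_db(RECORDS)
    print(find_files(conn, keyword="branding"))
```

Note that the image and video files in this sketch are found through their keywords even though their raw contents are never examined, which is the point of keeping the metadata in a separate, searchable table.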
Maintenance of the metadata in a CMS is a significant task. Metadata may be manually entered, but clearly it is desirable to reduce or avoid manual entry where practicable. Some CMSs have content parsers which parse documents as they are checked into or out of the repository and extract selective metadata for inclusion in the database. In some cases, parsers can also extract metadata for inclusion in a content document, such as in selective tags of an XML file. Such parsers can significantly reduce the burden of maintaining the metadata and/or content documents, although they do not necessarily eliminate the need for manual entry.
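A content parser of the general kind described above might, in its simplest form, resemble the following sketch. The function name `extract_metadata`, the mapping of metadata fields to element paths, and the sample element names are hypothetical and serve only to illustrate extraction at check-in.

```python
import xml.etree.ElementTree as ET


def extract_metadata(xml_text, field_paths):
    """Hypothetical check-in parser: for each metadata field, read the text of
    the first element matching the configured path. Fields whose element is
    missing or empty are omitted, leaving them for manual entry."""
    root = ET.fromstring(xml_text)
    metadata = {}
    for field, path in field_paths.items():
        elem = root.find(path)
        if elem is not None and elem.text and elem.text.strip():
            metadata[field] = elem.text.strip()
    return metadata


# Illustrative configuration: metadata field name -> path within the document.
FIELD_PATHS = {
    "title": "head/title",
    "author": "head/author",
    "subject": "head/subject",
}

SAMPLE = (
    "<doc><head><title> Pump Manual </title><author>J. Smith</author></head>"
    "<body><p>Text</p></body></doc>"
)

if __name__ == "__main__":
    print(extract_metadata(SAMPLE, FIELD_PATHS))
```

Because the sample document has no `subject` element, that field is absent from the result, which corresponds to the residual manual entry mentioned above.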
One deficiency of current parsers which extract metadata, not necessarily recognized, is that they do not account for the fact that, for certain types of documents, the context of the tags or other data can change within a single document being parsed. For example, MICROSOFT WORD™ supports documents which are zipped files containing potentially multiple XML documents, arranged in multiple folders having a pre-defined hierarchical structure. Each XML file has some pre-defined purpose. For example, one file might define page layout, another the fonts, another the headers and footers, and so forth. At least one will contain the body of the text. Some might contain exclusively metadata, such as document creator, date and so forth. Because the document may contain multiple folders, each folder containing potentially multiple XML files, the same tag name can be used in different XML files to mean something slightly different, depending on the context. It would be desirable to provide greater automated parsing capability in these circumstances.
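The ambiguity described above can be made concrete with a small sketch. The archive layout, part names (`meta/core.xml`, `body/document.xml`), tag names and rule table below are hypothetical simplifications and do not reproduce the actual structure of any word processor format; keying each rule on the part path as well as the tag name is merely one conceivable way of taking context into account, offered for illustration only.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def make_archive(parts):
    """Build an in-memory zip archive from a {part_path: xml_text} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, xml_text in parts.items():
            zf.writestr(name, xml_text)
    return buf.getvalue()


def extract_by_context(archive_bytes, rules):
    """Walk every XML part in the archive and apply rules keyed on the pair
    (part path, tag name), so that the same tag can map to different
    metadata fields depending on the file in which it appears."""
    metadata = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for name in zf.namelist():
            if not name.endswith(".xml"):
                continue
            root = ET.fromstring(zf.read(name))
            for elem in root.iter():
                field = rules.get((name, elem.tag))
                if field and elem.text and elem.text.strip():
                    metadata.setdefault(field, []).append(elem.text.strip())
    return metadata


# Hypothetical parts: <title> means the document title in one file and a
# section heading in the other.
PARTS = {
    "meta/core.xml": "<props><title>Annual Report</title></props>",
    "body/document.xml": "<doc><title>Introduction</title><title>Results</title></doc>",
}

RULES = {
    ("meta/core.xml", "title"): "document_title",
    ("body/document.xml", "title"): "section_heading",
}

if __name__ == "__main__":
    print(extract_by_context(make_archive(PARTS), RULES))
```

A parser that keyed its rules on the tag name alone would be unable to separate the three `title` elements in this example, which is the circumstance in which greater automated parsing capability would be desirable.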