The pervasive use of internet technologies for accessing all kinds of data sources, together with the increasing size and complexity of internet systems, poses major challenges for providers of information technology infrastructure. The information to be exchanged must be produced, validated, stored, retrieved, analysed, formatted, and delivered while observing high availability and performance requirements.
As the volume of data increases, it becomes insufficient to provide automated support only for the delivery of information to the user, which is often done via standard protocols such as HTTP using standard software such as web servers and web browsers. The data creation process must be supported in its entirety. For example, an online magazine requires that its content, which might consist of text documents, pictures, sound tracks, or video streams, be properly gathered and administered. Web content management systems (WCMSs) address the need to produce larger and more complex web sites more quickly and with higher quality.
Large web sites are often developed collaboratively by several people whose access has to be coordinated and controlled. WCMSs usually do this by offering exclusive locks on individual documents and by verifying proper authorization. Furthermore, it is necessary to separate content and layout of the web site, since different people have specialised roles and responsibilities with respect to the web site development or operation, e.g., text editor, designer, programmer, and administrator. A WCMS therefore tries to structure the information so that different roles can work as independently as possible, e.g., allowing a text editor to focus on producing text without bothering with layout. The content is not just meant for access by human users but is also the data on which import, export, and personalization services operate.
The actual web site is often generated from a content database using templates which select and combine the content. For example, navigation bars are computed from the current position in the navigation hierarchy, a centre pane receives text articles, and a side bar features related content.
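The generation step described above can be illustrated with a minimal sketch. All names here (`CONTENT`, `render_page`, `navigation_bar`) are hypothetical; a real WCMS would query a content database and use a full template engine, but the principle of computing navigation from position and filling the centre pane with content is the same.

```python
# Hypothetical sketch: generating a page by combining a content store
# with a template. CONTENT stands in for the content database.

CONTENT = {
    "home": {"title": "Welcome", "body": "Latest issue out now."},
    "about": {"title": "About", "body": "Who we are."},
}

def navigation_bar(current):
    # Compute the navigation bar from the current position:
    # the active page is highlighted, all others are plain links.
    items = []
    for page_id, page in CONTENT.items():
        label = page["title"]
        items.append(f"<b>{label}</b>" if page_id == current else label)
    return " | ".join(items)

def render_page(page_id):
    # The template selects and combines content: navigation on top,
    # the article content in the centre pane.
    page = CONTENT[page_id]
    return (f"<nav>{navigation_bar(page_id)}</nav>\n"
            f"<main><h1>{page['title']}</h1><p>{page['body']}</p></main>")
```

Because layout lives only in `render_page`, a text editor can change `CONTENT` without touching presentation, which is exactly the separation of content and layout motivated above.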
Because material published on a web site immediately goes public, quality assurance is important. To exploit the web's potential for up-to-date information, publication should be as fast as possible. On the other hand, published material should adhere to certain quality standards and, at a minimum, contain no spelling mistakes, dangling links, or broken HTML.
In a content management system an explicit content schema may be used to model the content data to be handled by the WCMS. However, a content schema is almost impossible to get right on the first attempt during the development of the web site. Furthermore, the schema is not totally fixed over time: organisational or technical considerations can suggest improvements and extensions to the content schema. Therefore, changing application requirements make it necessary to change the schema even when the web site is already in production and content data has been accumulated.
Because the content data itself is a valuable asset, it is very expensive to throw away existing data and to start the data collection from scratch after modifying the content schema. In response to a schema modification, portions of the data already accumulated may be converted automatically to the new schema, but sometimes human intervention is required to adapt content data to the new schema. This process is slow, so inconsistent intermediate states need to be managed persistently by the system. During these inconsistent periods some of the automated parts of the WCMS will not be fully functional due to the mismatch between schema and data. This may interrupt the entire web publishing process and halt the web site delivery operation.
Data migration strategies which, after a schema modification, convert the entire existing data set to the new content schema are no solution to the evolving schema development process that is typical of large web site development. Many people are involved in this development process, and modifications are often applied to the schema that may cause conflicts on the content data. Furthermore, it is likely that some changes to the content schema or the content data are undone later in the process, which is only possible when the data is kept in its original form as long as possible. When applied automatically to the entire data set, migration operations may cause irreparable data loss and thus prevent the restoration of the original content data. In addition, converting the entire content data of a large web site to a new schema is very expensive and requires considerable computational effort. This becomes even more important when the site development or operation is an evolution-like process requiring frequent releases of content schema and data.
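One alternative to converting everything eagerly is lazy, on-demand migration. The following sketch is illustrative only (the record format, the version field `_v`, and the rename from `writer` to `author` are assumptions, not from the text): records carry a schema version, are converted only when read, and the stored original is left intact so that a schema change can still be undone later.

```python
# Hypothetical lazy-migration sketch: the stored data keeps its original
# form; conversion to the new schema happens per record, on access.

def migrate_1_to_2(rec):
    # Assumed example change: schema version 2 renames 'writer' to 'author'.
    rec = dict(rec)  # never mutate the stored original
    rec["author"] = rec.pop("writer", "")
    rec["_v"] = 2
    return rec

MIGRATIONS = {1: migrate_1_to_2}  # from_version -> converter to next version

def read_record(store, key, target_version=2):
    # Return a view of the record in the target schema version,
    # applying converters step by step; the store is not modified.
    rec = dict(store[key])
    while rec.get("_v", 1) < target_version:
        rec = MIGRATIONS[rec.get("_v", 1)](rec)
    return rec

store = {"a1": {"_v": 1, "title": "Editorial", "writer": "Jo"}}
```

Because `store` still holds the version-1 record, undoing the schema change costs nothing, and the expensive bulk conversion of the entire data set is avoided; the price is that readers must tolerate mixed-version data, i.e. the persistent inconsistent states discussed above.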
Furthermore, schema evolution operations may introduce inconsistencies within the content schema or between content and schema. These inconsistencies need to be detected efficiently and reliably in order to maintain the availability and quality requirements of online publishing.
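A minimal form of such detection is checking each content record against the declared schema. The sketch below assumes a very simple schema representation (a mapping from field name to expected type); a real WCMS schema language would be richer, but the kinds of inconsistencies reported here, namely missing, ill-typed, and unknown fields, are exactly those that arise when schema and data drift apart.

```python
# Hedged sketch of schema/content consistency checking; the SCHEMA
# format and field names are assumptions for illustration.

SCHEMA = {"title": str, "body": str, "published": bool}

def check_record(rec):
    # Return a list of human-readable inconsistencies for one record;
    # an empty list means the record conforms to the schema.
    problems = []
    for field, ftype in SCHEMA.items():
        if field not in rec:
            problems.append(f"missing field '{field}'")
        elif not isinstance(rec[field], ftype):
            problems.append(f"field '{field}' is not {ftype.__name__}")
    for field in rec:
        if field not in SCHEMA:
            problems.append(f"unknown field '{field}'")
    return problems
```

Run over the whole content store, such a check yields a report of exactly which records block which automated parts of the WCMS, instead of letting the mismatch surface as a failure during publishing.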