Content Management (CM) and OnLine Analytical Processing (OLAP) are two separate fields in information management. Although both fields study models, concepts, and algorithms for managing large amounts of complex data, they started with very different applications as their major technology drivers. CM focuses on uniform repositories for all types of information, document/record management and archiving, collaboration, integrated middleware, etc., while OLAP is driven by financial reporting, marketing, budgeting, forecasting, and so on. Consequently, the two fields emphasize very different aspects of information management: information capture, storage, retention, and collaboration on the CM side, and data consistency, clear aggregation semantics, and efficiency on the OLAP side.
Various advanced applications that have recently emerged impose user and business needs that require the benefits of both CM and OLAP technologies. Digital libraries are becoming very rich content repositories containing documents along with metadata and/or annotations stored in semistructured data formats. Intranet data, wikis, and blogs are further examples of this trend. These are just a subset of modern applications where traditional information retrieval techniques (e.g., keyword or faceted search) are not enough, since advanced analysis and understanding of the stored information are required. In addition, application areas such as customer support, product and market research, or health care, which involve both structured and unstructured information, are mission-critical and likewise require both CM and OLAP functionality.
However, applying both techniques together is not straightforward. First, the user models for CM and OLAP are dramatically different. In CM (as in information retrieval), a user is a human with cognitive capabilities. CM queries are best-effort formulations of a user's information needs, and they support an interactive process of data exploration, query rephrasing, and guidance towards the final query result. In OLAP, a user is more like an application programmer using an API to access the data. OLAP employs a multidimensional data model that allows for complex analytical queries, which are precise and return exact query results as fast as possible. The OLAP query result is typically a matrix or pivot table with OLAP dimensions in rows and columns, and measures in cells. In summary, CM query processing depends on ranking, while OLAP query processing is an aggregation task based on testing logical predicates and grouping the qualifying data.
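The contrast between ranking and aggregation can be made concrete with a small sketch. The following Python fragment is illustrative only: the toy collection `DOCS`, the functions `rank` and `pivot`, the term-frequency scoring, and the attribute names (`product`, `region`, `cost`) are all assumptions introduced here, not part of any CM or OLAP system discussed in this paper.

```python
from collections import defaultdict

# Hypothetical toy collection: free text plus structured attributes.
DOCS = [
    {"id": 1, "text": "printer jams when printing duplex",
     "product": "printer", "region": "EU", "cost": 40.0},
    {"id": 2, "text": "printer driver crash",
     "product": "printer", "region": "US", "cost": 25.0},
    {"id": 3, "text": "scanner driver missing",
     "product": "scanner", "region": "EU", "cost": 10.0},
    {"id": 4, "text": "printer printer toner low",
     "product": "printer", "region": "EU", "cost": 5.0},
]


def rank(docs, query, k):
    """CM-style query: best-effort scoring, returns the top-k document ids.

    The score is a plain query-term frequency, a stand-in for a real
    relevance function.
    """
    terms = query.lower().split()
    scored = []
    for doc in docs:
        words = doc["text"].lower().split()
        score = sum(words.count(term) for term in terms)
        if score > 0:
            scored.append((score, doc["id"]))
    # Highest score first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def pivot(docs, row_dim, col_dim, measure, predicate=lambda doc: True):
    """OLAP-style query: exact predicate test, grouping, and summation.

    Returns a mapping from (row value, column value) to the summed measure,
    i.e. the cells of a pivot table.
    """
    cells = defaultdict(float)
    for doc in docs:
        if predicate(doc):
            cells[(doc[row_dim], doc[col_dim])] += doc[measure]
    return dict(cells)


top_hits = rank(DOCS, "printer driver", k=2)
cost_table = pivot(DOCS, "product", "region", "cost")
```

Here `rank` returns an ordered, truncated list whose content depends on the scoring function, whereas `pivot` returns one exact cell per combination of dimension values and covers every qualifying document.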
Additionally, navigating and dynamically analyzing large amounts of content effectively is known to be a difficult problem. Keyword and semantic search help alleviate this problem to some extent by returning only the top-k relevant documents that contain the search keywords. While this may be a satisfactory result for short hit lists and for information retrieval, it is not acceptable when thousands of documents qualify for the search keywords and the user is interested in all of them, i.e., when ranking is not effective. The user wants to understand the entire hit list and to look at the result from many different angles and at different levels of detail. Traditional OLAP techniques appear well suited to such a problem, since they are known to be effective in analyzing and navigating through large amounts of structured data. Unfortunately, unstructured data does not lend itself well to traditional OLAP-style analysis.