Content management software builds, organizes, manages, and stores collections of digital works in any medium or format. Content management refers to the process of handling various types of structured and unstructured information, including images and documents that may contain billing data, customer service information, or other types of content. Content management further refers to the process of capturing, storing, sorting, codifying, integrating, updating, and protecting any and all information. Studies estimate that more than 75% of enterprise data is unstructured and document-related (Lyman, Peter, et al., “How Much Information, 2000”, http://www.sims.berkeley.edu/how-much-info).
Key technologies in the content management market include document management, web content management, digital asset management, and records management. Typical users of content management are in document-heavy industries in which document management is essential, often for regulatory or compliance reasons. Content comprises many different forms of unstructured data requiring management: business documents, dynamic web content, records, and rich media. Business documents comprise contracts, invoices, forms, and e-mail; they facilitate internal back-office processes and enable direct external communication with customers, partners, and suppliers. Dynamic web content comprises business data in relational databases and personalized information. Records management is typically driven by government and industry regulations to effectively document processes, audit trails, and data retention. Rich media comprises digital audio and video. Rich media is rapidly transforming areas of training, education, marketing, and customer relationship management in many industries.
The notion of relating document management to workflow has been prevalent for several decades, and many document management systems incorporate this feature. One conventional method presents tools and methods to address problems in integrated document and workflow management with a case study involving offer processing for a machine tool company (Morschheuser, S., et al., “Integrated document and workflow management applied to the offer processing of a machine tool company”, In Proceedings of Conference on Organizational Computing Systems, 1995). This conventional method provides a process definition language designed to make a document-oriented tool with a workflow engine more efficient.
Another conventional approach uses active document properties to extend document management applications (Dourish, P., et al., “Extending document management systems with user-specific active properties”, In ACM Transactions on Information Systems (TOIS), Volume 18 Issue 2, 2000). This conventional approach avoids traditional hierarchical storage mechanisms, reflects document categorizations meaningful to user tasks, and provides a means to integrate the perspectives of one or more individuals within a uniform interaction framework. Property-based document management systems are augmented with the notion of active properties that carry executable code to enable the provision of document-based services on a property infrastructure.
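The notion of an active property can be illustrated with a minimal sketch. The names used here (Document, ActiveProperty, on_read, find) are hypothetical and are not taken from the cited system; the sketch assumes only that a property is a named value attached to a document, that it may carry a callable run when the document is read, and that retrieval is by property rather than by position in a folder hierarchy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ActiveProperty:
    """A named property that may carry executable code (hypothetical model)."""
    name: str
    value: str
    # Optional hook run when the document is read; an assumption for illustration.
    on_read: Optional[Callable[["Document"], None]] = None


@dataclass
class Document:
    content: str
    properties: Dict[str, ActiveProperty] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def attach(self, prop: ActiveProperty) -> None:
        self.properties[prop.name] = prop

    def read(self) -> str:
        # Reading triggers every attached property that carries code.
        for prop in self.properties.values():
            if prop.on_read is not None:
                prop.on_read(self)
        return self.content


def find(docs: List[Document], name: str, value: str) -> List[Document]:
    """Property-based retrieval: no folder hierarchy is consulted."""
    return [d for d in docs
            if name in d.properties and d.properties[name].value == value]


def audit(doc: Document) -> None:
    # Example document-based service: record each read on the document.
    doc.log.append("read")


offer = Document("Offer for milling machine")
offer.attach(ActiveProperty("project", "offer-1995"))
offer.attach(ActiveProperty("audited", "yes", on_read=audit))
memo = Document("Lunch memo")
memo.attach(ActiveProperty("project", "misc"))

matches = find([offer, memo], "project", "offer-1995")
text = offer.read()
```

In this sketch the audit service is supplied entirely by a property attached to one document, so different users could attach different behavior to the same document without changing how it is stored.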
Yet another conventional system captures essentially freely structured documents such as those typically used in the office domain (Mattos, N. M., et al., “An approach to integrated office document processing and management”, In ACM SIGOIS Bulletin, Proceedings of the Conference on Office Information Systems, Volume 11 Issue 2-3, 1990). This conventional system facilitates the handling of documents containing information. Analyzed documents are stored in a document management system that is connected to several different subsequent services and serves as a rudimentary workflow.
FileNet presents a workflow engine in conjunction with document technologies to automate production and ad hoc business processes (Whelan, D., “FileNet integrated document management database usage and issues”, In ACM SIGMOD Record, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, Volume 27 Issue 2, 1998).
Most conventional document management systems are supported by a relational model. In terms of relevant relational modeling research, formal modeling of relational schemas originated with an emphasis on runtime aspects such as query expression (Andries, M., et al., “A hybrid query language for the extended entity relationship model”, In Journal of Visual Languages and Computing, 8(1), 1997, Special Issue on Visual Query Systems; and Angelaccio, M., et al., “QBD*: A Fully Visual Query System”, Journal on Visual Languages and Computing, 1(2), 255-273, 1990), query result display, and navigation through the stored data. Collectively, these tasks are referred to as Visual Query Systems (VQS) (Catarci, T., et al., “Visual Query Systems for Databases: A Survey”, Technical Report SI/RR-95/17, Dipartimento di Scienze dell'Informazione, Universita' di Roma “La Sapienza”, 1995).
In comparison, conventional systems have placed relatively little focus on the interface provided by the tools used to define and manipulate data models and database schemas. Commercial database modeling products such as Rational tools provide visual data modeling profiles that integrate into the broader software development cycle (Gornik, D., “UML Data Modeling Profile”, IBM Rational Software Whitepaper TP 162 05/02, 2003). These profiles are generally geared to UML (Unified Modeling Language) modeling of relational databases. The OPOSSUM system, developed at the University of Wisconsin, allows a database schema to be edited through manipulation of the schema's visualization (Haber, E. M., et al., “OPOSSUM: A Flexible Schema Visualization and Editing Tool,” In Proceedings of the 1994 ACM CHI Conference, Boston, Mass., April 1994; and Haber, E. M., et al., “Opossum: Desk-Top Schema Management through Customizable Visualization,” In Proceedings of the 21st International VLDB Conference, pages 527-538, Zurich, Switzerland, September 1995).
Document management systems typically encompass some aspect of document understanding and classification to support the business process. The general problem of classifying machine-printed documents into genres has been explored for cases where visual layout is a critical factor in recognizing fine-grained genres because document content features are similar. One conventional method for document management uses layout structure detected from scanned binary images of the document pages, using no optical character recognition (OCR) results but instead using attributed relational graphs (Bagdanov, A. D., et al., “Fine-Grained Document Genre Classification Using First Order Random Graphs”, In Proceedings of ICDAR 01).
Another conventional system utilizes learning techniques on layout based on “logical closeness”, where a directed weight graph is used to represent document layout (Li, X., et al., “A Document Classification and Extraction System with Learning Ability”, In Proceedings of ICDAR 99). Yet another conventional system uses document classification based on visual similarity (Hu, J., et al., “Document Image Layout Comparison and Classification”, In Proceedings of ICDAR 99). In this conventional system, interval encoding is introduced to capture elements of spatial layout. This conventional system proposes a hidden Markov model based page layout classification system that is trainable and extensible based on this spatial feature.
A further conventional system utilizes user-directed “rapid capture” of portions of a scanned image, including tools to ease accessing, editing, and dispatch to a desired destination such as an archive, application, or webpage (Simske, S. J., et al., “Editing and authoring: User-directed analysis of scanned images”, In Proceedings of the 2003 ACM symposium on Document Engineering, 2003). These tools utilize user-directed zoning analysis, known as “click and select”, and statistics-based region classification. “Click and select” incorporates a bottom-up zoning analysis engine. Statistics-based region classification allows rapid reconfiguration of regions.
Although these conventional technologies have proven to be useful, it would be desirable to present additional improvements. The lifecycle of document management applications typically involves these phases: (a) ingest or capture of content; (b) management (including search, retrieval, and workflow); (c) fulfillment at the end of the business process; and (d) archival for compliance or regulatory reasons. The ingest or capture phase typically creates metadata associated with incoming documents and associates each document with a schema defined in a content management system. The metadata associated with a schema enables the management phase to search the repository effectively in the context of the business process and workflow. After any management or transactions associated with the process have been completed, fulfillment activities may be triggered, such as notifications or integration with other systems such as accounting, payables, and records. If the documents need to be retained for a fixed period of time for audit reasons, they may be archived in offline storage.
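The four lifecycle phases described above can be sketched as a small in-memory model. Everything in the sketch is a hypothetical illustration rather than a description of any particular product: the SCHEMAS table, the Record and Repository classes, the "invoice" document type, its required fields, and the seven-year retention period are all assumed for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Optional

# Hypothetical schema table: required metadata fields per document type.
SCHEMAS: Dict[str, List[str]] = {
    "invoice": ["customer", "amount"],
}


@dataclass(eq=False)  # records are compared by identity in this sketch
class Record:
    doc_type: str
    metadata: Dict[str, str]
    status: str = "ingested"
    retain_until: Optional[date] = None


class Repository:
    def __init__(self) -> None:
        self.active: List[Record] = []
        self.archive: List[Record] = []
        self.notifications: List[str] = []

    def ingest(self, doc_type: str, metadata: Dict[str, str]) -> Record:
        # (a) Ingest: bind the document to a schema and check its metadata.
        missing = [f for f in SCHEMAS[doc_type] if f not in metadata]
        if missing:
            raise ValueError(f"missing metadata fields: {missing}")
        record = Record(doc_type, dict(metadata))
        self.active.append(record)
        return record

    def search(self, **criteria: str) -> List[Record]:
        # (b) Management: search the repository by schema metadata.
        return [r for r in self.active
                if all(r.metadata.get(k) == v for k, v in criteria.items())]

    def fulfill(self, record: Record) -> None:
        # (c) Fulfillment: trigger a downstream notification (assumed target).
        record.status = "fulfilled"
        self.notifications.append(
            f"notify accounting: {record.metadata['customer']}")

    def archive_record(self, record: Record, today: date, years: int) -> None:
        # (d) Archival: move the record out of the active set for retention.
        record.retain_until = today + timedelta(days=365 * years)
        record.status = "archived"
        self.active.remove(record)
        self.archive.append(record)


repo = Repository()
inv = repo.ingest("invoice", {"customer": "ACME", "amount": "120.00"})
found = repo.search(customer="ACME")
repo.fulfill(inv)
repo.archive_record(inv, date(2024, 1, 1), years=7)
```

Because the metadata is validated against the schema at ingest time in this sketch, the later search and fulfillment steps can rely on the required fields being present.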
Conventional document management systems manage the ingest phase in separate capture subsystems that allow the specification of the metadata in separate environments. Data that the conventional document management system should manage are located in many different places, such as different branches of a business or a field office as opposed to a main office. The documents are subsequently “released” into the content management system. Since these capture subsystems are often decoupled from the overall content management system, the extracted metadata is only loosely tied to the schema and business process. As a result, there is frequently a manual step associated with the actual assignment of metadata and association with the specific schema or process, which reduces the efficiency of the overall process. For example, data that a business requires are typically collected and processed manually, often in a batch. Further, the ingest phase often has no linkage with the fulfillment or triggering of business processes after the management phase.
What is therefore needed is a system, a service, a computer program product, and an associated method for automatically, dynamically, and selectively composing and managing data and documents. The need for such a solution has heretofore remained unsatisfied.