The present invention relates generally to the Internet, database, and content management systems for the storage of files and data objects. Particularly, this invention is directed to a system and method for the efficient management of access and control over files and data linked to a database system and stored externally in a file system or another object repository. More specifically, the present invention relates to a scalable eContent system and associated method for managing the same.
In the past few years, the World Wide Web (“the web”) has become a very important medium for information sharing and distribution. The number of business transactions taking place on the web has also been constantly increasing. This has resulted in significant changes in the way organizations communicate with their customers, employees and business partners.
Although companies still communicate information with sound, pictures, video, and the written word, the size and content of the information and the access frequency to the information are increasing faster than ever before. In addition, companies are making content available and accessible to employees, customers, and even the general public from anywhere on the web, for competitive, contractual, financial, and legal reasons.
Corporate communications span multiple environments, employ diverse platforms and protocols, innumerable applications, WANs, LANs, extranets, intranets, and Virtual Private Networks (VPN). How to efficiently manage the fast growing information content and consistently keep up with the constant modifications to the information content have become major challenges to the information technology (IT) organizations in both large and small companies. Web applications providing gateways to corporate resources access and revise information from multiple legacy systems and repositories.
Traditionally, content management systems are accessed by a small number of clients (or employees) within an enterprise and only a small subset of the business information is stored on-line in these systems. Supporting large numbers of client connections is not a major concern in such an environment. However, when information is accessible to all employees, or even to customers and partners, through the web, client connection scalability becomes yet another major challenge to the IT professional.
In a content management system, three types of information are stored: primary content (also referred to as Data or Object), User Metadata, and System Metadata. Semi-structured and unstructured data, such as text files, images, web pages, and video clips, constitute the primary content in a content management system. Descriptions of, and information about, the stored primary content, which are normally provided by the users, are referred to as user metadata.
The information created by the content management systems for access control, storage management, and content tracking/reference is referred to as system metadata. In contrast to the primary content, both user metadata and system metadata are well structured. Content management systems, in general, use a database system as a persistent repository for storing both user metadata and system metadata.
While semi-structured and unstructured data can also be stored in database tables, in practice they are normally stored outside of databases for performance and accessibility reasons, among which are the following:
(a) the size of these types of data tends to be very large and the conventional database systems are not designed to handle them efficiently; and
(b) storing primary content (e.g. files) in a database makes it very difficult for applications to access the content through a native application program interface (API).
Consequently, content management systems normally store and manage primary content and metadata separately. This leads to a distributed system architecture where the system storing metadata for search and access control becomes the master, commonly referred to as the library server (LS), while one or more systems storing the primary content become the slaves, or object servers (OS). An object server is also known as a resource manager (RM).
The current generation of conventional content management products is based on a self-contained, closed system architecture, and as such these products have poor extensibility. The new generation of content management systems adopts an open system architecture that promises to fix the extensibility problem and to improve system performance. However, such content management systems continue to be based on a distributed system architecture that will face many of the same issues encountered in traditional distributed (database) systems.
Conventional content management systems provide a set of functions for content (data and metadata) creation, content management, and content distribution that enable users to manage data, system metadata, and user metadata. These conventional content management systems suffer from numerous shortcomings, among which are the following:
(a) Lack of Scalability and Extensibility:
Since a computer system has limited processing power and storage capacity, a content management system needs to have an architecture that is scalable so as to support future business/content growth. Three areas of scalability of particular interest to content management users are: primary content, metadata, and client/user connections.
For scalability in total content size and in the number of objects, conventional content management systems allow a library server to manage objects stored in multiple distributed object servers. When the total size of the primary content saturates or exceeds the capacity of an object server, an additional object server is installed. On the other hand, current content management systems cannot gracefully handle a significant increase in either metadata size or the number of client connections without a major architecture overhaul.
When the size of the metadata outgrows the capacity of a library server, multiple library servers must be installed, each managing a subset of the objects. Like all distributed systems, this type of space partition has a major problem, namely location transparency to clients. Since each library server is an autonomous server and has no knowledge of remote objects stored in other library servers, clients are forced to keep track of each library server's storage content. In addition, if a client needs to search information in all the library servers, the client is forced to establish individual connections to the library servers and to manage the merging of the results from multiple library servers.
Client connection scalability is also a significant limitation of the conventional content management architecture. When the number of users exceeds the capacity and/or processing power of a content management system, one can either limit the number of concurrent client connections or employ a more powerful computer system. Limiting client connections reduces productivity and limits company growth potential, making it very undesirable. Replacing an existing computer system with a new one may require unloading data from the old machine and reloading the data into the new machine, which is a cumbersome task as the size of the data is normally huge. In addition, powerful machines are much more expensive and not always available.
The other alternative for providing client connection scalability is to install multiple servers with replicated content (for example, the Yahoo web servers) and a middle-tier server that is responsible for routing requests to one of the replicated servers. Replicating content, however, incurs the additional problem and complexity of synchronizing the replicas.
(b) Poor Atomicity and Referential Integrity:
When an object is inserted or updated in a content management system, reference to and description of the object will also need to be created or updated in order to provide data consistency and avoid a referential integrity (RI) problem. Since objects are stored in an object server while metadata (reference, description, etc.) are stored in a library server, a distributed two-phase commit protocol is needed with the conventional architecture to guarantee consistency and preserve referential integrity.
While a distributed two-phase commit protocol has been studied extensively and is well understood, it is a complex protocol to implement, especially in the failure handling aspect, and has a very negative impact on performance. Furthermore, since the client is the one driving the insert/update logic, it is the best place to also drive the transaction (commit/rollback) logic. Most likely, a client machine is less reliable than either a library server or an object server machine. Thus, it would be undesirable, in terms of reliability, to drive the transaction logic at the client machine.
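The coordination burden described above can be illustrated with a minimal in-memory sketch of a two-phase commit. This is an illustration only, not the protocol of any particular content management product: the `Participant` class, its `fail_on_prepare` flag, and the `two_phase_commit` function are hypothetical names, the sketch assumes one change per participant per transaction, and it omits the logging, timeouts, and recovery that make a real implementation complex.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Hypothetical stand-in for a library server or an object server."""
    name: str
    fail_on_prepare: bool = False
    committed: dict = field(default_factory=dict)
    staged: dict = field(default_factory=dict)

    def prepare(self, txn_id, key, value):
        # Phase 1: stage the change and vote; False means "abort".
        if self.fail_on_prepare:
            return False
        self.staged[txn_id] = (key, value)
        return True

    def commit(self, txn_id):
        # Phase 2 (commit): apply the staged change.
        key, value = self.staged.pop(txn_id)
        self.committed[key] = value

    def rollback(self, txn_id):
        # Phase 2 (abort): discard anything staged for this transaction.
        self.staged.pop(txn_id, None)


def two_phase_commit(txn_id, changes):
    """changes: list of (participant, key, value). Returns True if committed."""
    # Phase 1: every participant must vote yes.
    votes = [p.prepare(txn_id, k, v) for p, k, v in changes]
    if all(votes):
        # Phase 2: all voted yes, so tell every participant to commit.
        for p, _, _ in changes:
            p.commit(txn_id)
        return True
    # Phase 2: at least one "no" vote, so every participant rolls back.
    for p, _, _ in changes:
        p.rollback(txn_id)
    return False
```

In this sketch the caller of `two_phase_commit` plays the coordinator role, which in the conventional architecture falls to the client; if that caller fails between the two phases, participants are left holding staged changes, which is the reliability concern raised above.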
(c) Difficulty Integrating Existing Content:
A migration path is an important issue for users/customers of any computer software system. A content management system needs to provide an easy and flexible migration path for users who have large amounts of data in file systems or in some form of resource manager (e.g. a video server). Conventional content management systems cannot manage objects stored in a file system directly because, for example, they do not support access to objects via standard file system calls.
(d) Heterogeneous Communication and Access Protocol:
When multiple library servers and object servers are installed, a library server will use one set of protocols to communicate with other library servers but a different set of protocols to communicate with object servers, and vice versa. Since a client needs to access both library servers and object servers, the client will also need to use two different sets of protocols to access both the library server and the object server. This not only makes the implementation and installation of the content management software more complex, but also results in a complicated runtime environment.
(e) Limited Accessibility and Availability:
Access from any remote location is not a design goal of conventional content management architectures. To enable web browser access, a web enabler component such as an enterprise portal (EP) is added on top of the content management system. Since the enterprise portal acts as a broker/intermediary in such a design, content management functions and content cannot be fully and/or directly explored by thin clients such as web browsers.
It is a feature of the present invention to present an architecture and associated method for a scalable content management (also referred to herein as “eContent management”) system that satisfies the need for high scalability, pervasive accessibility, enhanced integrity, and extensibility. More specifically, the invention proposes a scalable eContent management architecture (“SeCMA”) for managing both enterprise and web contents. This architecture includes an integrated and scalable eContent (or content) manager (SeCM) as a building block. The building block can be used to build a personal content manager on a small home PC, or a high powered and highly scalable web and/or enterprise content server by installing the scalable content manager on each computer in a cluster of high power computers.
The scalable content manager on a single computer manages both data in a local file system (or a resource manager) and its associated metadata stored in a local database, which will greatly simplify both content manager logic and client application logic. The proposed architecture enables users to add scalable content manager nodes as needed, which allows users to easily scale up a scalable content manager system, in both data size and user connection, as business grows.
The proposed SeCM architecture will also include a tightly integrated http server that allows clients, such as web browsers, direct access to the content management system from anywhere on the web. In addition, the architecture enables users to convert an existing system (e.g. a filesystem) into a scalable content management system by installing a copy of the scalable content manager to the existing system. With the present scalable content management architecture, a multi-node content management system will appear to be a single content management system to users, thus providing location transparency.
The present architecture enables users to selectively scale a content management system as needed. It also provides a single system view to users of the content management system when metadata and objects are stored in multiple computer nodes. The location of an object and its associated metadata are transparent to the client. The need to run complex, distributed transaction protocols is thus eliminated.
The scalable content management architecture of the present invention also provides direct access for web browsers, enabling users to access the content (or eContent) directly from any client with an Internet connection. In addition, the scalable content management architecture provides an extensible architecture that enables users to integrate new content, and to migrate existing content with ease and flexibility.
The scalable content management system of the present invention provides several functions, among which are the following:
(a) Integrated Data and Metadata Functions:
The scalable content management system provides integrated data (primary content) and metadata management functions, and therefore the logic for implementing the functions as well as the application logic are significantly simplified. In addition, data and its associated metadata are collocated and under the control of a single SeCM. Thus update atomicity and referential integrity for managed data and metadata are maintained and enforced by a single scalable content manager, which does not require a complex distributed commit protocol across distributed library server and object server nodes as in traditional CM systems. In addition, a client is relieved of the responsibility to coordinate the updates to the library server and the object server. Also, mapping a logical reference to a physical object reference for object access, and access control for both data and its metadata, are done by a single scalable content manager, which does not require any coordination of access control and/or shared secrets between the library server and the object server as in the traditional approach.
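As a rough illustration of collocated data and metadata, the following sketch keeps object files in a local directory and their metadata in a local SQLite database, so a single component performs both writes. The `LocalContentManager` class, its table layout, and the `.bin` file naming are assumptions made for this example; they are not taken from the described system, and the sketch does not attempt full crash safety or input validation.

```python
import os
import sqlite3
import tempfile


class LocalContentManager:
    """Hypothetical single-node manager: files on disk, metadata in SQLite."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)
        self.db = sqlite3.connect(os.path.join(root, "metadata.db"))
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS objects ("
            "object_id TEXT PRIMARY KEY, path TEXT NOT NULL, title TEXT)"
        )
        self.db.commit()

    def put(self, object_id, data, title):
        final_path = os.path.join(self.root, object_id + ".bin")
        # Write the content to a temporary file in the same directory first.
        fd, tmp_path = tempfile.mkstemp(dir=self.root)
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
            with self.db:  # commits on success, rolls back on exception
                self.db.execute(
                    "INSERT INTO objects (object_id, path, title) VALUES (?, ?, ?)",
                    (object_id, final_path, title),
                )
                # Publish the file only after the metadata insert succeeded.
                os.replace(tmp_path, final_path)
        except Exception:
            # Clean up the temporary file if it was never published.
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise

    def get(self, object_id):
        # Resolve the logical id to a physical path, then read the object.
        row = self.db.execute(
            "SELECT path, title FROM objects WHERE object_id = ?", (object_id,)
        ).fetchone()
        if row is None:
            return None
        path, title = row
        with open(path, "rb") as f:
            return title, f.read()
```

Because one process owns both the file and its metadata row, a failed metadata insert (for example, a duplicate `object_id`) leaves the existing object untouched without any cross-server coordination.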
(b) Enhanced Scalability and Manageability:
A single scalable content manager will provide the functions needed for content creation, storage, search, management, and distribution. For a company with a relatively small amount of data/content, a content management system with a single scalable content manager (SeCM) can be installed on a single computer system. As the content (both object and metadata) size grows, additional scalable content managers can be installed on new computer systems. This provides a smooth path for scaling up when company business grows.
When a new scalable content manager is installed on a new machine, users may opt to redistribute some of the existing content from existing scalable content managers to the newly added scalable content manager, or may elect to leave the existing content intact and store only the new content to the new scalable content manager. In the event that a new scalable content manager is added to support growth in the number of users, content may have to be redistributed to the newly installed scalable content manager in order to balance the workload among all scalable content managers in a content management system.
(c) Single System View (Location Transparency):
Scalable content management systems communicate amongst themselves in order to provide a single system view to clients. A client can connect to any scalable content management system node to perform content creation, search, and update. It does not need to know which library server node to connect to for searching or updating an object and its metadata. By providing location transparency to clients, client application logic is also simplified.
(d) Global Metadata Index:
When a client requests to access metadata and objects stored in remote scalable content managers, the local scalable content manager that the client is connected to divides the request into sub-requests and broadcasts them to one or more remote scalable content managers. The local scalable content manager then collects and integrates replies/results from remote ones and delivers the results back to the client. When the number of installed scalable content managers in a content management system is small, it may be acceptable to always broadcast a request to all the scalable content manager nodes, even when only one or two scalable content managers store the objects and metadata requested by the client. When the number of installed scalable content managers is large, global metadata indexes would be provided to pin-point the specific scalable content manager or managers for servicing the sub-requests. This will reduce the number of messages sent and processed for a request, and thus improves the system performance.
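The difference between broadcasting and index-directed routing can be sketched as follows. The `Node` class, `build_global_index`, and `federated_search` are hypothetical names chosen for this example, and the in-memory dictionary is only a stand-in for whatever form a global metadata index takes in practice.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    """Hypothetical content manager node holding local metadata rows."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows  # list of dicts, e.g. {"id": "d1", "author": "lee"}

    def search(self, field, value):
        return [r for r in self.rows if r.get(field) == value]


def build_global_index(nodes, field):
    """Map each value of `field` to the set of node names that store it."""
    index = {}
    for node in nodes:
        for row in node.rows:
            index.setdefault(row.get(field), set()).add(node.name)
    return index


def federated_search(nodes, field, value, global_index=None):
    """Return (merged results, names of the nodes that were contacted)."""
    if global_index is None:
        targets = list(nodes)  # no index: broadcast to every node
    else:
        wanted = global_index.get(value, set())
        targets = [n for n in nodes if n.name in wanted]
    if not targets:
        return [], []
    # Send the sub-requests in parallel and collect the partial results.
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        partials = list(pool.map(lambda n: n.search(field, value), targets))
    # Integrate the replies into a single, deterministically ordered result.
    merged = sorted((r for part in partials for r in part), key=lambda r: r["id"])
    return merged, [n.name for n in targets]
```

With three nodes of which two hold matching rows, the broadcast form contacts all three, while the indexed form contacts only the two that can answer; both return the same merged result.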
(e) Accessibility:
Besides being an integrated scalable content management system, the scalable content management system of the present invention also includes one or more http daemons (httpd). Each httpd will make requests to the local scalable content manager through a set of servlets and/or executable programs. New servlets or programs can be installed to provide new or additional functions as needs arise. Clients can access any of the data using a browser from anywhere on the internet. This provides direct access to the scalable content management system from any browser.
(f) Integration of Existing Content:
Integrating existing content (e.g. files) into a scalable content management system does not require that existing data stored in the file system be extracted and reloaded, as it does in conventional content management systems. The scalable content management system provides the flexibility and functionality for integrating existing data in the file system efficiently. With the scalable content management system, existing data in the file system can be brought under the control of the scalable content management system incrementally, or as needed. Users need only specify the names of the files (or file system) and create user metadata for the files to be managed by the scalable content management system.
In addition, system administrators can add the SeCM software to an existing system to convert it into a scalable content management system with all the desirable properties provided by the present invention. For example, assume a company has several machines managing files. To convert these machines into a scalable content management system, the company can install the scalable content manager on each machine, index all the files, then link together the scalable content managers and create shared catalogs that describe all the files and their locations (i.e., indexing), provide web browser access to all the data, and also allow existing applications to run using the original filesystem API.
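In-place integration of existing files can be sketched as a cataloging pass that records each file's path and user metadata without moving or copying the file. The `open_catalog` and `register_existing` functions, the `managed_files` table, and the `describe` callback are assumptions made for this illustration, and SQLite is used only as a convenient local database.

```python
import os
import sqlite3


def open_catalog(db_path=":memory:"):
    """Open (or create) a hypothetical local catalog of managed files."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS managed_files ("
        "path TEXT PRIMARY KEY, size_bytes INTEGER, description TEXT)"
    )
    return db


def register_existing(db, root, describe=lambda path: ""):
    """Catalog files under `root` in place; returns the number newly added."""
    added = 0
    with db:  # one transaction for the whole pass
        for dirpath, _dirs, filenames in os.walk(root):
            for name in filenames:
                path = os.path.abspath(os.path.join(dirpath, name))
                # The file stays where it is; only its metadata is recorded.
                cur = db.execute(
                    "INSERT OR IGNORE INTO managed_files VALUES (?, ?, ?)",
                    (path, os.path.getsize(path), describe(path)),
                )
                added += cur.rowcount  # 0 when the path was already cataloged
    return added
```

Because already-cataloged paths are skipped, the pass can be rerun or pointed at additional directories over time, which matches the incremental adoption described above; existing applications continue to open the files through the ordinary file system API.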