The invention relates to a federation of clusters dispersed over different locations for enterprise data management.
A cluster is a group of servers and other resources working together to provide services for clients. The servers are referred to as nodes. Nodes typically consist of one or more instruction processors (generally referred to as CPUs), disks, memory, power supplies, motherboards, expansion slots, and interface boards. In a master-slave design, one node of the cluster is called the master, or primary server, and the others are called slaves, or secondary servers. The master and slave nodes are connected to the same networks, through which they communicate with each other and with clients. Both kinds of nodes run compatible versions of software.
A federation is a loosely coupled affiliation of enterprise data access and management systems that adhere to certain standards of interoperability. Members of a federation are interconnected via a computer network, and may be geographically decentralized. Since such members of a federation remain autonomous, a federated system is a good alternative to the sometimes daunting task of merging together several disparate components. The interoperability resolves problems due to variations in hardware, operating systems, software, data access, data representation, and semantics of interface modules and commands. There are various implementations of federation. A federated database system is a type of database management system, which transparently integrates multiple autonomous database systems into a single federated database. A data grid, which enables users to collaborate, by processing and sharing data across heterogeneous systems, utilizes federation-based data access.
McLeod and Heimbigner published one of the first papers to define federated database architecture. The paper is entitled “A Federated Architecture for Information Management”, ACM Transactions on Information Systems (TOIS), Volume 3, Issue 3 (July 1985), pages 253-278. The paper describes a federated database architecture in which a collection of independent database systems is united into a loosely coupled federation in order to share and exchange information. The federation consists of components (of which there may be any number) and a single federal dictionary. The components represent individual users, applications, workstations, or other components in an office information system. The federal dictionary is a specialized component that maintains the topology of the federation and oversees the entry of new components. Each component in the federation controls its interactions with other components by means of an export schema and an import schema. The export schema specifies the information that a component will share with other components, while the import schema specifies the non-local information that a component wishes to manipulate. The federated architecture provides mechanisms for sharing data, for sharing transactions, for combining information from several components, and for coordinating activities among autonomous components.
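The relationship between components, export schemas, import schemas, and the federal dictionary can be illustrated with a small sketch. This is a toy model only, not the mechanism defined in the cited paper: the class names `Component` and `FederalDictionary`, the `fetch` method, and the choice to route lookups through the dictionary are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One autonomous member of the federation (hypothetical model)."""
    name: str
    local_data: dict = field(default_factory=dict)
    export_schema: set = field(default_factory=set)    # keys this component shares
    import_schema: dict = field(default_factory=dict)  # key -> name of exporting component


class FederalDictionary:
    """Tracks the federation's topology and admits new components."""

    def __init__(self):
        self.components = {}

    def register(self, component):
        # Oversee the entry of a new component into the federation.
        if component.name in self.components:
            raise ValueError(f"{component.name} is already a member")
        self.components[component.name] = component

    def fetch(self, requester, key):
        """Resolve a non-local key listed in the requester's import schema."""
        source_name = requester.import_schema.get(key)
        if source_name is None:
            raise KeyError(f"{key!r} is not in {requester.name}'s import schema")
        source = self.components[source_name]
        # The exporting component stays autonomous: only exported keys are shared.
        if key not in source.export_schema:
            raise PermissionError(f"{source.name} does not export {key!r}")
        return source.local_data[key]
```

In this sketch a component that imports a key the owner has not exported is refused, which mirrors the idea that each component controls what it shares.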
The idea of using data grid federation to simplify management of globally distributed data came into existence in the late nineties. Data grids provide the interoperability mechanisms needed to interact with legacy storage systems and legacy applications. A logical name space is needed to identify files, resources, and users. In addition, there is a need to provide consistent management of state information about each file within the distributed environment. These capabilities enable data virtualization, the ability to manage data independently of the chosen storage repositories. Data federation has been used to provide applications with standardized access to integrated views of data. Data grid federation has been utilized for data publication, data preservation, and data collection across different research institutions and enterprises. There are three types of data grid federation. The first is the peer-to-peer grid, in which each node in the grid is aware of every other node and can access the data attached to that node. Nodes can access data without having any knowledge of where that data is physically located. The second is the hierarchical grid, which follows the master-slave paradigm: all files in slave data grids are replicated from the master grid. The third is the replication grid, in which two independent data grids serve as back-up sites for each other.
Enterprise data management is the development and execution of policies, practices, and procedures that properly manage enterprise data. Some aspects of data management are security and risk management, legal discovery, Storage Resource Management (SRM), Information Lifecycle Management (ILM), and content-based archiving. In addition, some companies have their own internal management policies. Another aspect of data management is data auditing. Data auditing allows enterprises to validate compliance with federal regulations and ensures that data management objectives are being met.
Security and risk management is concerned with the discovery of sensitive data such as Social Security numbers (SSNs), credit card numbers, banking information, tax information, and anything else that can be used to facilitate identity theft. It is also concerned with enforcement of corporate policies for protection of confidential data, protection of data that contains customer phrases and numeric patterns, and compliance with federal regulations. Federal regulations related to security and risk management include the FRCP (Federal Rules of Civil Procedure) and NPI (Non-Public Information) regulations. Legal discovery refers to any process in which data is sought, located, secured, and searched with the intent of using it as evidence in a civil or criminal legal case. Legal discovery, when applied to electronic data, is called e-Discovery. SRM is the process of optimizing the efficiency and speed with which the available storage space is utilized. ILM is a sustainable storage strategy that balances the cost of storing and managing information with its business value. It provides a practical methodology for aligning storage costs with business priorities. ILM has objectives similar to those of SRM and is considered an extension of SRM. Content-based archiving identifies files to be archived based on business value or regulatory requirements. It enables policy-driven, automated file migration to archives based on file metadata and content, enables intelligent file retrieval, and locks and isolates files in a permanent archive when that is required by federal regulations.
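As an illustration of discovering sensitive data by numeric pattern, the following minimal sketch scans text for strings shaped like SSNs and credit card numbers. The regular expressions, the use of the Luhn checksum to filter card candidates, and the function names are assumptions chosen for the example; a production scanner would need much broader patterns and validation.

```python
import re

# Assumed patterns: a dashed nine-digit SSN and a 13-16 digit card number
# whose digits may be separated by single spaces or hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")


def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_sensitive(text):
    """Return (kind, matched_text) pairs for SSN-like and card-like strings."""
    findings = [("ssn", m.group()) for m in SSN_PATTERN.finditer(text)]
    for m in CARD_PATTERN.finditer(text):
        candidate = m.group().strip(" -")
        if luhn_valid(candidate):
            findings.append(("card", candidate))
    return findings
```

The checksum step is there to reduce false positives: a random run of sixteen digits usually fails it, while well-formed card numbers pass.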
Data management is based on data classification, sometimes referred to as categorization. Categorization of data is based on metadata or full text search. Categorization rules specify how data is classified into different groups. For instance, document categorization could be based on who owns the documents, their size, and their content. Metadata consists of information that characterizes data and is sometimes referred to as “data about data”. Data categorization methods based on metadata group data according to information extracted from its metadata. A few examples of such information are the time a document was last accessed, its owner, its type, and its size. Categorization based on full text utilizes search technology. Full text search is used to identify documents that contain specific terms, phrases, or a combination of both. The result of the search is used to categorize data.
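Categorization rules of this kind can be sketched as named predicates over a document's metadata and text. The `Document` fields, the thresholds, the category names, and the naive substring matching below are all assumptions for illustration; a real system would typically query a full text index rather than scan strings.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Document:
    owner: str
    doc_type: str
    size_bytes: int
    last_accessed: datetime
    text: str = ""


def contains_terms(doc, terms):
    """Naive full-text match: true if any term or phrase occurs in the text."""
    lowered = doc.text.lower()
    return any(term.lower() in lowered for term in terms)


def build_rules(now):
    """Return hypothetical (category, predicate) categorization rules."""
    return [
        # Metadata-based rules.
        ("stale", lambda d: now - d.last_accessed > timedelta(days=365)),
        ("large", lambda d: d.size_bytes > 100 * 1024 * 1024),
        ("spreadsheet", lambda d: d.doc_type == "xlsx"),
        # Full-text rule.
        ("confidential", lambda d: contains_terms(d, ["confidential", "trade secret"])),
    ]


def categorize(doc, rules):
    """Return the set of categories whose rule matches the document."""
    return {category for category, matches in rules if matches(doc)}
```

A document can fall into several categories at once, which is why `categorize` returns a set rather than a single label.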
In addition to categorization, data management involves formulation of policies to be applied to classified data. For example, policies could include encrypting sensitive data, auditing data, retaining data, archiving data, deleting data, modifying data access, and modifying read and write permissions. There is a long list of data management policies. Policies could be enterprise-wide policies to be applied to all organizations across an enterprise, or departmental-level policies to be applied to specific departments within an enterprise.
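The distinction between enterprise-wide and departmental policies can be shown with a short sketch that maps categories to actions. The `Policy` record, the action names, and the sample policy list are hypothetical and serve only to illustrate the scoping idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    category: str                     # category the policy applies to
    action: str                       # e.g. "encrypt", "archive", "delete"
    department: Optional[str] = None  # None marks an enterprise-wide policy


def actions_for(categories, department, policies):
    """List the actions that apply to a document with the given categories."""
    return [
        p.action
        for p in policies
        if p.category in categories and p.department in (None, department)
    ]


# Sample policies (assumed): two enterprise-wide, one specific to marketing.
POLICIES = [
    Policy("confidential", "encrypt"),
    Policy("stale", "archive"),
    Policy("stale", "delete", department="marketing"),
]
```

Here a stale document in any department is archived, while the additional delete action applies only within the marketing department.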
Part of data management is creation of data management reports. Reports could cover storage utilization, data integrity, duplicated data and results of executing compliance and internal policies. In some implementations, categorization rules, policies, results of data management and report definition files are stored in a database. Report definition files contain instructions that describe report layout for reports generated from a database.
Enterprise data is stored in different devices attached to a network. Data exists in file servers, email servers, portals, web sites, databases, archives, and other applications. There are three types of enterprise data. Structured data is a set of one or more data records with a fixed field structure that is defined by an external data model, for instance a database. Semi-structured data is a set of one or more data records with a variable field structure that can be defined by an external data model, for instance emails and instant messages. Unstructured data is a set of one or more data records that have no externally defined field structure; it comprises the majority of documents or files (Microsoft Word, Excel, PowerPoint, etc.). Unstructured data currently accounts for 80 percent of a company's overall data. It is desirable for companies to have transparent enterprise-wide management methods for all data types.
In companies that have data stored in different geographical locations, in some cases on different continents, many aspects of data management are done locally. Some applications provide global data management systems that span different geographical locations. Such applications only deal with data stored in databases, ignoring unstructured data, which contains the bulk of an organization's information.
There is a need for transparency in managing all data, dispersed over different geographical locations and stored in different devices. The data managed should include all three types (structured, unstructured, and semi-structured) to ensure full implementation of enterprise-wide management policies for regulatory compliance, security and risk management, legal discovery, SRM, ILM, and content-based archiving. Protection for sensitive issues such as data privacy, confidentiality, and identity theft should cover all data in all locations.