The invention relates to a high availability and scalable cluster server for enterprise data management.
A high availability cluster server is a server that continues to function even after a failure of system hardware or software. The usual way of providing high availability is to duplicate system components, so that if one component becomes unavailable, another can be used instead. A scalable cluster server is a server whose performance and workload capacity can be increased by adding more hardware or software resources.
A cluster is a group of servers and other resources that act like a single system and enable high availability and load balancing. The servers are referred to as nodes. Nodes typically consist of one or more instruction processors (generally referred to as CPUs), disks, memory, power supplies, motherboards, expansion slots, and interface boards. In a master-slave design, one node of the system cluster is called the primary, or master, server and the others are called the secondary, or slave, servers. The primary and secondary nodes have similar hardware, run the same operating system, have the same patches installed, support the same binary executables, and have identical or very similar configurations. The primary and secondary nodes are connected to the same networks, through which they communicate with each other and with devices connected to the network. Both kinds of nodes run compatible versions of software. Some high availability systems support virtual network interfaces, where more than one IP (Internet Protocol) address is assigned to the same physical port. Services are associated with the virtual network interface and with the computing resources needed to perform the services. The virtual IP address does not connect a client with a particular physical server; it connects the client with a particular service running on a particular physical server.
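The relationship between a virtual IP address, the service behind it, and the physical node that currently serves it can be sketched as follows. This is a minimal illustration only: the `Node` and `Cluster` classes, the `route` method, and the rule of promoting the first healthy node are assumptions made for the example, not a description of any particular cluster product.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A physical server in the cluster (hypothetical model)."""
    name: str
    healthy: bool = True


@dataclass
class Cluster:
    """Master-slave cluster reached through a single virtual IP address."""
    virtual_ip: str
    nodes: list            # the first entry starts as the primary (master)
    active_index: int = 0  # index of the node currently serving requests

    def active_node(self) -> Node:
        return self.nodes[self.active_index]

    def failover(self) -> Node:
        """Keep the active node if it is healthy, else promote a healthy one."""
        if self.active_node().healthy:
            return self.active_node()
        for index, node in enumerate(self.nodes):
            if node.healthy:
                self.active_index = index
                return node
        raise RuntimeError("no healthy node available")

    def route(self, request: str) -> str:
        """Clients address the virtual IP; the cluster picks the physical node."""
        node = self.failover()
        return f"{request} -> {self.virtual_ip} -> {node.name}"
```

In this sketch the client always uses the same virtual address, so a failure of the primary node changes only the physical node named at the end of the route.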
In some cases, disks are directly attached to a node. This is referred to as Direct Attached Storage (DAS). In other cases, a Storage Area Network (SAN), which is a high-speed special-purpose network or sub-network, interconnects different storage devices with the nodes.
Enterprise data management is the development and execution of policies, practices and procedures that properly manage enterprise data. Some aspects of data management are: security and risk management, legal discovery, Storage Resource Management (SRM), Information Lifecycle Management (ILM) and content-based archiving. In addition, some companies have their own internal management policies. Another aspect of data management is data auditing. Data auditing allows enterprises to validate compliance with federal regulations and ensures that data management objectives are being met. One of the challenges for data management products is to provide solutions to different aspects of data management in one platform. This is due to the various, and sometimes conflicting, requirements of different aspects of enterprise data management.
Security and risk management is concerned with discovery of sensitive data like Social Security number (SSN), credit card number, banking information, tax information and anything that can be used to facilitate identity theft. It is also concerned with enforcement of corporate policies for protection of confidential data, protection of data that contains customer phrases and numeric patterns and compliance with federal regulations. Some of the federal regulations that are related to security and risk management are: FRCP (Federal Rules of Civil Procedure), NPI (Non-Public Information) regulation, PII (Personally Identifiable Information) regulation, FERPA (Family Educational Rights and Privacy Act), GLBA (Gramm-Leach-Bliley Act), HIPAA (Health Insurance Portability and Accountability Act), SOX (Sarbanes-Oxley Act) and the U.S. Securities and Exchange Commission's (SEC's) Regulation.
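Discovery of sensitive numeric patterns of the kind mentioned above can be sketched with regular expressions. The example below is an illustration under stated assumptions: the two patterns, the function names `luhn_valid` and `find_sensitive`, and the use of the Luhn checksum to reduce false credit card matches are choices made for this sketch, and a production scanner would need broader patterns and validation.

```python
import re

# Assumed formats: SSN written as ddd-dd-dddd; card numbers of 13 to 16
# digits, optionally separated by single spaces or hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(char) for char in number if char.isdigit()]
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return len(digits) >= 13 and total % 10 == 0


def find_sensitive(text: str) -> dict:
    """Collect SSN-like strings and Luhn-valid card-like strings from text."""
    cards = [
        match.group()
        for match in CARD_PATTERN.finditer(text)
        if luhn_valid(match.group())
    ]
    return {"ssn": SSN_PATTERN.findall(text), "credit_card": cards}
```

A document whose scan returns non-empty lists could then be placed in a sensitive category and made subject to the corresponding protection policies.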
Legal discovery refers to any process in which data is sought, located, secured, and searched with the intent of using it as evidence in a civil or criminal legal case. Legal discovery, when applied to electronic data, is called e-Discovery. E-Discovery involves electronic evidence protection, legal hold, and chain of custody. A legal hold, sometimes referred to as a litigation hold, is a process that an organization uses to preserve all forms of relevant information when litigation is reasonably anticipated. Chain of custody logs and documents how the data was gathered, analyzed, and preserved. Federal regulations detail what, how and when electronic data must be produced, including production as part of the pre-trial process.
SRM is the process of optimizing the efficiency and speed with which the available storage space is utilized. Among other things, it involves removing duplicate, contraband and undesirable files; retaining and deleting data; moving or removing files based on metadata and content; and automated file migration through integrated storage tiering. Tiered storage is the assignment of different categories of data to different types of storage media in order to reduce total storage cost.
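One of the SRM tasks listed above, finding duplicate files, can be sketched by grouping files on a hash of their content. This is a minimal sketch, assuming that byte-identical content is what counts as a duplicate; the function names `file_digest` and `find_duplicates` and the choice of SHA-256 are assumptions of the example.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=65536):
    """Hash a file in chunks so large files are not loaded into memory at once."""
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def find_duplicates(root):
    """Return groups of paths under `root` whose contents are byte-identical."""
    by_digest = defaultdict(list)
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(directory, name)
            by_digest[file_digest(path)].append(path)
    # Only digests shared by more than one file indicate duplicates.
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Each returned group lists the copies of one file, from which a policy could keep one copy and remove or migrate the rest.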
ILM is a sustainable storage strategy that balances the cost of storing and managing information with its business value. It provides a practical methodology for aligning storage costs with business priorities. ILM has similar objectives to SRM and is considered an extension of SRM.
Content-based archiving identifies files to be archived based on business value or regulatory requirements. It enables policy-driven, automated file migration to archives based on file metadata and content, enables intelligent file retrieval, and locks and isolates files in a permanent archive when that is required by federal regulations.
There are inter-dependencies between some components of enterprise data management. In some cases, the components share the same requirements. In other cases, they have conflicting requirements. For instance, an SRM policy may decide that a category of data should be deleted because it has not been accessed for a long time. At the same time, legal discovery may impose a litigation hold on the same data because of its relevance to litigation.
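The conflict in this example can be made concrete with a small sketch. The names `Document`, `srm_decision`, and `resolve`, the 365-day staleness threshold, and in particular the rule that a litigation hold takes precedence over an SRM deletion are assumptions made for illustration; the text above identifies the conflict but does not prescribe how it is resolved.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    DELETE = "delete"
    RETAIN = "retain"


@dataclass(frozen=True)
class Document:
    name: str
    days_since_access: int
    on_legal_hold: bool = False


def srm_decision(doc, stale_after_days=365):
    """SRM view only: delete data that has not been accessed for a long time."""
    if doc.days_since_access > stale_after_days:
        return Action.DELETE
    return Action.RETAIN


def resolve(doc, stale_after_days=365):
    """Combine both views; assumed rule: a litigation hold always wins."""
    if doc.on_legal_hold:
        return Action.RETAIN
    return srm_decision(doc, stale_after_days)
```

Here the same stale document is deleted when it is not under hold and retained when it is, even though the SRM rule alone would delete it in both cases.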
One of the challenges to data management is the exponential growth of enterprise data. Many companies now have more than 1 petabyte of stored data. Another challenge is the diversity of devices on which data resides. Data exists in file servers, email servers, portals, web sites, databases, archives, and in other applications. Another problem domain is the proliferation of data into the fringes of the enterprise network, namely laptops and remote users.
Data management is based on data classification, sometimes referred to as categorization. Categorization of data is based on metadata or full text search. Categorization rules specify how data is classified into different groups. For instance, document categorization could be based on who owns the documents, their size and their content. Metadata consists of information that characterizes data. Sometimes it is referred to as “data about data”. Data categorization methods based on metadata group data according to information extracted from its metadata. A few examples of such information are: the time a document was last accessed, its owner, its type and its size. There are many methods for accessing and extracting information from metadata. Some methods utilize file system utilities. File system utilities can only extract file system attributes. Document parsers, sometimes called filters, are used for extracting metadata from documents, such as Microsoft Word, Microsoft PowerPoint and PDF files. The three top commercial parsers being used now are: Stellent, KeyView and IFilter. Some software developers write their own parsers or use open source parsers such as Apache POI. Classification based on full text utilizes search technology. Full text search is used to identify documents that contain specific terms, phrases or a combination of both. The result of the search is used to categorize data. One of the widely used open source search engines is Lucene.
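Metadata-based categorization of the kind described above can be sketched as a list of named rules evaluated against file system attributes. The rule names, the thresholds, and the functions `file_metadata` and `categorize` are hypothetical; the sketch reads only attributes that the file system itself exposes, so it corresponds to the file system utility approach rather than to a document parser.

```python
import os
import time


def file_metadata(path):
    """Extract a few file system attributes for one file."""
    info = os.stat(path)
    return {
        "path": path,
        "size": info.st_size,
        "extension": os.path.splitext(path)[1].lower(),
        "days_since_access": (time.time() - info.st_atime) / 86400,
    }


# Each rule pairs a category name with a predicate over the metadata.
# The thresholds are illustrative assumptions.
RULES = [
    ("large", lambda m: m["size"] > 100 * 1024 * 1024),
    ("stale", lambda m: m["days_since_access"] > 365),
    ("office_document", lambda m: m["extension"] in {".doc", ".ppt", ".pdf"}),
]


def categorize(metadata, rules=RULES):
    """Return the names of all categories whose rule matches the metadata."""
    return [name for name, predicate in rules if predicate(metadata)]
```

A file can fall into several categories at once, which is why `categorize` returns a list rather than a single label.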
In addition to categorization, data management involves formulation of policies to be applied to classified data. For example, policies could be encrypting sensitive data, auditing data, retaining data, archiving data, deleting data, modifying data access and modifying read and write permissions. Different policies could be grouped to form a top-level enterprise-wide policy or a departmental policy.
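Grouping policies into a top-level policy can be sketched as merging maps from data category to actions. The category names, the action names, and the functions `merge_policies` and `actions_for` are assumptions introduced for this example only.

```python
def merge_policies(*policy_sets):
    """Combine several category -> actions maps into one top-level policy."""
    merged = {}
    for policy_set in policy_sets:
        for category, actions in policy_set.items():
            bucket = merged.setdefault(category, [])
            for action in actions:
                if action not in bucket:  # keep each action once, in order
                    bucket.append(action)
    return merged


def actions_for(categories, policy):
    """List the actions that apply to data carrying the given categories."""
    result = []
    for category in categories:
        for action in policy.get(category, []):
            if action not in result:
                result.append(action)
    return result


# Hypothetical individual policies grouped into an enterprise-wide policy.
SECURITY_POLICY = {"sensitive": ["encrypt", "audit"]}
RETENTION_POLICY = {"sensitive": ["retain"], "stale": ["archive"]}
ENTERPRISE_POLICY = merge_policies(SECURITY_POLICY, RETENTION_POLICY)
```

A departmental policy could be built the same way from a different selection of individual policies.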
Part of data management is creation of data management reports. Reports could cover storage utilization, data integrity, duplicated data and results of executing compliance and internal policies. In some implementations, classification rules, policies, results of data analysis and report definition files are stored in a database. Report definition files contain instructions that describe report layout for the reports generated from the database.
Enterprise data is stored in different devices dispersed across the network. To perform data analysis, one can manually enter the location of the data, which is daunting when many devices are connected to the network. Alternatively, one can use a method or a combination of methods for automated data discovery. Some methods utilize Internet Protocol (IP) port scanners like nmap and Advanced IP Scanner. IP scanners determine the services and devices available in the network and the type of each data source. Then, a crawler is used to retrieve data. The type of crawler used depends on the data source accessible through a network port. If the data source is a network file server, file system crawlers are used to recursively scan the directory structure of the file system and retrieve files. If the data source is a database, then database crawlers that utilize JDBC or LDBC are used. JDBC stands for Java Database Connectivity. LDBC stands for Liberty Database Connectivity. If the data source is an email server, crawlers that use Messaging Application Programming Interface (MAPI) or connectors are used. MAPI is a Microsoft interface for components that connect to Microsoft Exchange. An example of an email connector is Novell's Connector for Microsoft Exchange. Some enterprise data is stored in corporate web portals (corporate intranets). Web crawlers are used to automatically traverse a corporate intranet by retrieving a document and recursively retrieving all documents that it references. Web crawlers are also called spiders or web robots. Crawlers for archives depend on the type of the archive. For instance, Network Data Management Protocol (NDMP) compatible archives are crawled with NDMP-based crawlers. NDMP is an open standard protocol for enterprise-wide backup of heterogeneous network-attached storage. Some software vendors provide interface modules to help in writing connectors to their archives.
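The step from port scan results to the choice of crawler can be sketched as a lookup from open port to crawler type. The sketch assumes that the scan results are already available as a map from host to open ports, and it relies only on well-known default port assignments; the table `CRAWLER_BY_PORT`, the crawler type labels, and the function `plan_crawl` are hypothetical, and a real deployment would also need to handle services on non-default ports.

```python
# Well-known default ports mapped to the kind of crawler that would be used.
CRAWLER_BY_PORT = {
    445: "file_system",   # SMB/CIFS file shares
    2049: "file_system",  # NFS
    1521: "database",     # Oracle listener
    3306: "database",     # MySQL
    5432: "database",     # PostgreSQL
    25: "email",          # SMTP
    143: "email",         # IMAP
    80: "web",            # HTTP
    443: "web",           # HTTPS
    10000: "archive",     # NDMP
}


def plan_crawl(scan_results):
    """Map scan results (host -> open ports) to sorted (host, crawler) pairs."""
    plan = set()
    for host, ports in scan_results.items():
        for port in ports:
            crawler = CRAWLER_BY_PORT.get(port)
            if crawler is not None:  # ignore ports with no known data source
                plan.add((host, crawler))
    return sorted(plan)
```

A host that exposes several ports of the same kind, such as both SMB and NFS, appears once in the plan for that crawler type.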