Data can be an abstract term. In the context of computing environments and systems, data can generally encompass all forms of information storable in a computer readable medium (e.g., memory, hard disk). Data, and in particular, one or more instances of data can also be referred to as data object(s). As is generally known in the art, a data object can, for example, be an actual instance of data, a class, a type, or a particular form of data, and so on.
Generally, one important aspect of computing and computing systems is storage of data. Today, there is an ever increasing need to manage storage of data in computing environments. Databases provide a very good example of a computing environment or system where the storage of data can be crucial. As such, to provide an example, databases are discussed below in greater detail.
The term database can also refer to a collection of data and/or data structures typically stored in a digital form. Data can be stored in a database for various reasons and to serve various entities or “users.” Generally, data stored in the database can be used by one or more “database users.” A user of a database can, for example, be a person, a database administrator, a computer application designed to interact with a database, etc. A very simple database or database system can, for example, be provided on a Personal Computer (PC) by storing data (e.g., contact information) on a Hard Disk and executing a computer program that allows access to the data. The executable computer program can be referred to as a database program, or a database management program. The executable computer program can, for example, retrieve and display data (e.g., a list of names with their phone numbers) based on a request submitted by a person (e.g., show me the phone numbers of all my friends in Ohio).
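As a minimal sketch of the very simple database program described above, the following Python example uses the standard `sqlite3` module with an in-memory database. The `contacts` table, its columns, the sample rows, and the `find_phone_numbers` helper are assumptions made purely for illustration.

```python
import sqlite3


def find_phone_numbers(conn, state):
    """Return (name, phone) pairs for contacts in the given state, sorted by name."""
    cur = conn.execute(
        "SELECT name, phone FROM contacts WHERE state = ? ORDER BY name",
        (state,),
    )
    return cur.fetchall()


# Hypothetical contact data; a real program would keep this on a hard disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, phone TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("Alice", "555-0101", "Ohio"),
        ("Bob", "555-0102", "Texas"),
        ("Carol", "555-0103", "Ohio"),
    ],
)

# "Show me the phone numbers of all my friends in Ohio."
ohio_friends = find_phone_numbers(conn, "Ohio")
print(ohio_friends)
```

Here the request made by the person is reduced to a single parameterized query, and the program merely retrieves and displays the matching rows.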
Generally, database systems are much more complex than the example noted above. In addition, databases have evolved over the years and are used in various businesses and organizations (e.g., banks, retail stores, governmental agencies, universities). Today, databases can be very complex. Some databases can support several users simultaneously and allow them to make very complex queries (e.g., give me the names of all customers under the age of thirty-five (35) in Ohio who have bought all the items in a given list of items in the past month and have also bought a ticket for a baseball game and purchased a baseball hat in the past 10 years).
Typically, a Database Manager (DBM) or a Database Management System (DBMS) is provided for relatively large and/or complex databases. As known in the art, a DBMS can effectively manage the database or data stored in a database, and serve as an interface for the users of the database. For example, a DBMS can be provided as an executable computer program (or software) product as is also known in the art.
It should also be noted that a database can be organized in accordance with a Data Model. Some notable Data Models include a Relational Model, an Entity-Relationship Model, and an Object Model. The design and maintenance of a complex database can require highly specialized knowledge and skills on the part of database application programmers, DBMS developers/programmers, database administrators (DBAs), etc. To assist in the design and maintenance of a complex database, various tools can be provided, either as part of the DBMS or as free-standing (stand-alone) software products. These tools can include specialized Database languages (e.g., Data Description Languages, Data Manipulation Languages, Query Languages). Database languages can be specific to one data model or to one DBMS type. One widely supported language is Structured Query Language (SQL), which was developed, by and large, for the Relational Model and can combine the roles of a Data Description Language, a Data Manipulation Language, and a Query Language.
Today, databases have become prevalent in virtually all aspects of business and personal life. Moreover, usage of various forms of databases is likely to continue to grow even more rapidly and widely across all aspects of commerce and of social and personal activities. Generally, databases and the DBMSs that manage them can be very large and extremely complex, partly in order to support an ever increasing need to store and analyze data. Typically, larger databases are used by larger organizations, larger user communities, or larger device populations. Larger databases can be supported by relatively larger capacities, including computing capacity (e.g., processor and memory), to allow them to perform many tasks and/or complex tasks effectively at the same time (or in parallel). On the other hand, smaller database systems are also available today and can be used by smaller organizations. In contrast to larger databases, smaller databases can operate with less capacity.
A current popular type of database is the relational database with a Relational Database Management System (RDBMS), which can include relational tables (also referred to as relations) made up of rows and columns (also referred to as tuples and attributes). In a relational database, each row represents an occurrence of an entity defined by a table, with an entity, for example, being a person, place, thing, or another object about which the table includes information.
One important objective of databases, and in particular of a DBMS, is to optimize the performance of queries for access and manipulation of data stored in the database. Given a target environment, an “optimal” query plan can be selected as the best option by a database optimizer (or optimizer). Ideally, an optimal query plan is a plan with the lowest cost (e.g., lowest response time, lowest CPU and/or I/O processing cost, lowest network processing cost). The response time can be the amount of time it takes to complete the execution of a database operation, including a database request (e.g., a database query), in a given system. In this context, a “workload” can be a set of requests, which may include queries or utilities (such as a data loader), that have some common characteristics, for example, application, source of request, type of query, priority, response time goals, etc.
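The idea of selecting the plan with the lowest cost can be sketched as follows. This is only an illustration under stated assumptions: the `QueryPlan` class, the plan names, the cost figures, and the simple additive cost formula are all hypothetical, and actual optimizers estimate and weigh costs in far more elaborate ways.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryPlan:
    """A candidate plan with hypothetical, pre-estimated cost components."""
    name: str
    cpu_cost: float
    io_cost: float
    network_cost: float

    def total_cost(self) -> float:
        # Assumption: costs are expressed in one common unit and simply added.
        return self.cpu_cost + self.io_cost + self.network_cost


def choose_plan(plans):
    """Return the candidate plan with the lowest total estimated cost."""
    if not plans:
        raise ValueError("at least one candidate plan is required")
    return min(plans, key=QueryPlan.total_cost)


candidates = [
    QueryPlan("full_table_scan", cpu_cost=40.0, io_cost=120.0, network_cost=5.0),
    QueryPlan("index_lookup", cpu_cost=10.0, io_cost=15.0, network_cost=5.0),
    QueryPlan("hash_join_first", cpu_cost=25.0, io_cost=60.0, network_cost=20.0),
]

best = choose_plan(candidates)
print(best.name, best.total_cost())
```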
As a prominent example of database systems, traditional Enterprise Data Warehousing has focused on having a single large environment that can maintain all of the data required (“One Version of the Truth”) in tandem with sufficient processing and I/O capability to satisfy a myriad of different workloads. However, the business requirements that an Enterprise Data Warehouse (EDW) typically needs to satisfy have evolved in at least two distinct ways, namely High Availability and Alternate Data Processing.
High Availability is traditionally satisfied by having a second, ideally equivalent, environment that can be made available in case of a failure of the first system (Active-Standby) or that can operate in tandem with the first system (Active-Active). Both approaches require Data Synchronization. The latter approach can also help with responsiveness by providing more processing capability when both systems are available, as requests (queries) can be directed or load balanced across the systems.
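The routing aspect of an Active-Active arrangement can be sketched as a simple round-robin router that skips a system marked as failed. The `RequestRouter` class and the system names are hypothetical, and the sketch deliberately leaves out Data Synchronization, which either approach would still require.

```python
from itertools import cycle


class RequestRouter:
    """Round-robin router over a fixed set of systems, skipping failed ones."""

    def __init__(self, systems):
        self.available = {name: True for name in systems}
        self._order = cycle(systems)
        self._count = len(systems)

    def mark_failed(self, name):
        self.available[name] = False

    def mark_recovered(self, name):
        self.available[name] = True

    def route(self):
        """Return the name of the next available system for a request."""
        # Try each system at most once per call before giving up.
        for _ in range(self._count):
            name = next(self._order)
            if self.available[name]:
                return name
        raise RuntimeError("no system is available")


router = RequestRouter(["primary", "secondary"])

# With both systems available, requests alternate between them.
both_up = [router.route() for _ in range(4)]

# After a failure, every request is directed to the surviving system.
router.mark_failed("primary")
after_failure = [router.route() for _ in range(3)]
print(both_up, after_failure)
```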
Alternate Data Processing can generally refer to the ability to issue a request (query) against data that does not necessarily conform to the Relational Model employed by databases. Such data could, for example, be semi-structured (Key, Value pairs), pure text, encoded sensor data, etc., and the processing operations conducted against it might be relational, procedural, functional, or mapper or reducer based. (Map Reduce is a technique that can be applied to this type of alternate data to essentially turn it into a result set form by mapping input data against some pre-determined structure and reducing the resulting output to a final set by applying a selection algorithm.) Two examples of these Alternate Data Processing (ADP) environments are Aster Data and Hadoop based environments, as generally known in the art, where Aster Data can combine a parallel database approach as a means to store the data with a SQL wrapped Map Reduce capability (SQL-MR) to provide for ADP, and Hadoop can combine a distributed file system with a Map Reduce framework to provide for ADP.
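The Map Reduce technique described above can be illustrated with a small, single-process sketch. It is not the Hadoop or SQL-MR API; the record layout (`sensor_id, reading`), the function names, and the choice of "maximum reading per sensor" as the selection algorithm are assumptions made only for this example.

```python
from collections import defaultdict


def map_record(line):
    """Map one raw text record onto the (sensor_id, reading) structure.

    Records that do not fit the pre-determined structure yield nothing.
    """
    parts = line.split(",")
    if len(parts) != 2:
        return []
    sensor_id, raw_value = parts[0].strip(), parts[1].strip()
    try:
        return [(sensor_id, float(raw_value))]
    except ValueError:
        return []


def reduce_readings(sensor_id, readings):
    """Reduce all readings for one sensor to a single row (here, the maximum)."""
    return (sensor_id, max(readings))


def map_reduce(lines):
    """Run the map step, group the output by key, then run the reduce step."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_record(line):
            grouped[key].append(value)
    return sorted(reduce_readings(key, values) for key, values in grouped.items())


# Hypothetical raw sensor data, including records that do not conform.
raw_lines = ["s1, 20.5", "s2, 18.0", "garbage", "s1, 22.1", "s2, n/a"]
result_set = map_reduce(raw_lines)
print(result_set)
```

In a distributed framework, the map and reduce steps would run in parallel across many nodes; the sketch only shows how unstructured input is turned into a result set.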
In the context of these differing database environments, a given piece of information might exist in different data formats. For example, the full web click trail associated with a web-based purchase might be stored as file data within a Hadoop system and could indicate the username (User ID) of the purchaser and the product purchasing details (Product ID and Purchase Price). A Map-Reduce function could be applied to that data in order to find all Products purchased by a given user and their purchase prices. The full web click trail, or just the final web transaction log, could be stored within an Aster Data record, again providing access to the User ID, Product ID and Purchase Price through a SQL-MR operation. An EDW, such as Teradata, could hold the web purchases in a relational “Sales” table which can be queried through SQL (select ProductID, SalesPrice from Sales where UserId=?).
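To illustrate how the same information can be reached from two different data formats, the following sketch extracts a user's purchases once from raw click-trail lines and once from a relational “Sales” table using the query quoted above. The click-trail line layout, the sample values, and the helper names are hypothetical, and `sqlite3` merely stands in for an EDW.

```python
import sqlite3

# Hypothetical click-trail lines: "user_id|event|product_id|price".
CLICK_TRAIL = [
    "u1|view|p10|",
    "u1|purchase|p10|19.99",
    "u2|purchase|p11|5.00",
    "u1|purchase|p12|42.50",
]


def purchases_from_click_trail(lines, user_id):
    """Scan raw file data and keep (product_id, price) for one user's purchases."""
    results = []
    for line in lines:
        fields = line.split("|")
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed layout
        uid, event, product_id, price = fields
        if uid == user_id and event == "purchase":
            results.append((product_id, float(price)))
    return sorted(results)


def purchases_from_sales_table(conn, user_id):
    """Run the relational query from the text against a Sales table."""
    cur = conn.execute(
        "SELECT ProductID, SalesPrice FROM Sales WHERE UserId = ? ORDER BY ProductID",
        (user_id,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (UserId TEXT, ProductID TEXT, SalesPrice REAL)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [("u1", "p10", 19.99), ("u2", "p11", 5.00), ("u1", "p12", 42.50)],
)

from_files = purchases_from_click_trail(CLICK_TRAIL, "u1")
from_table = purchases_from_sales_table(conn, "u1")
print(from_files == from_table)
```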
In view of the foregoing, database systems and environments, including Traditional Enterprise Data Warehousing and Alternate Data Processing (ADP), are highly useful.