Computer systems include both “structured” and “unstructured” data. Components of structured data are stored in a way that enables a computer to unambiguously identify the meanings of those components. One example of structured data is a database table having a set of fields (such as “name,” “address,” and “telephone number”) in which different types of data are stored. Because the data in such a table are divided into fields, each of which stores only data of a particular type, a computer can identify data of each type unambiguously. For example, a database system can determine whether the table includes any data representing a person named “John Smith” by searching only the “name” field of each record in the database. As this example illustrates, the fact that the table is structured facilitates performing automated processing operations (such as searching) on that data. In fact, structured data can typically only be accessed and processed through software applications, not viewed directly.
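As a minimal sketch of the field-restricted search described above, the following uses Python's built-in `sqlite3` module. The table name `contacts`, the sample rows, and the helper `has_person` are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical in-memory table with the three fields named in the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, address TEXT, telephone TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("John Smith", "12 Elm Street", "555-0100"),
        # The address below contains the text "John Smith" but is not a name.
        ("Jane Doe", "34 John Smith Avenue", "555-0101"),
    ],
)


def has_person(conn, full_name):
    """Return True if any record's name field equals full_name."""
    row = conn.execute(
        "SELECT 1 FROM contacts WHERE name = ? LIMIT 1", (full_name,)
    ).fetchone()
    return row is not None


print(has_person(conn, "John Smith"))
print(has_person(conn, "Jane Smith"))
```

Because the query inspects only the `name` column, the second record's address, which happens to contain the same words, is never mistaken for a person's name.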
In contrast, components of unstructured data are not stored in a manner which unambiguously specifies their meaning to a computer. A word processing document containing a business memorandum is an example of unstructured data. Although such a document may include header text which specifies components of the document, such as “Subject:”, “From:”, “To:”, and “Date:”, such text is not easily processable by a computer even though its meaning is easily understood by a human. The reason is that the text “Subject:” is stored as text in a manner that does not distinguish it from any other text in the same document. As a result, a computer cannot easily discern that the text following the text “Subject:” refers to the subject of the memo, since the computer does not understand the meaning of words in human languages. This makes unstructured data more costly and time-consuming to process than structured data.
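The difficulty can be illustrated with a short sketch. The memo text and the helper `guess_subjects` below are hypothetical; the helper applies a naive pattern match, which is one plausible heuristic and not a method described in this document.

```python
import re

# Hypothetical memo: the header lines and the body are all plain text.
memo = """\
To: Project Team
From: J. Smith
Subject: Q3 schedule
Date: 1 March

For context, here is the note I received last week:
Subject: budget freeze
Please plan accordingly.
"""


def guess_subjects(text):
    """Return the text following every line that begins with 'Subject:'."""
    pattern = re.compile(r"^Subject:(.*)$", flags=re.MULTILINE)
    return [match.group(1).strip() for match in pattern.finditer(text)]


print(guess_subjects(memo))  # ['Q3 schedule', 'budget freeze']
```

The heuristic returns two candidates because nothing in the stored text distinguishes the memo's own header from a line quoted in its body. A human reader resolves the ambiguity at a glance; the program cannot do so without additional rules.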
This is a significant problem because unstructured data make up 85% of all digitally-stored data, and 80% of all data stored by businesses. Unstructured data represent a wide variety of commonly-used data, such as word processing documents, email messages, web pages, and metadata stored by applications such as file systems and backup systems. As even these simple examples illustrate, data may be “structured” for one purpose but “unstructured” for another. For example, a user-created spreadsheet may be structured in the sense that it is divided into columns for storing different types of data, but unstructured from the perspective of the document management system (DMS) used by the user's enterprise, if that DMS is not programmed to understand the structure of the user's spreadsheet and therefore cannot search through or otherwise process the data in the spreadsheet intelligently. An email message may be considered semi-structured, because it contains both headers, which provide structure to the message, and unstructured text. Some data, such as metadata, may come in both structured and unstructured forms.
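The semi-structured nature of an email message can be sketched with the `email` package in Python's standard library. The addresses and message text below are hypothetical sample data.

```python
from email import message_from_string

# Hypothetical message: headers, a blank line, then free-form body text.
raw = """\
From: alice@example.com
To: bob@example.com
Subject: Build server migration

Moving the build server on Friday cut our nightly build time in half.
"""

msg = message_from_string(raw)

# The headers are structured: each value can be retrieved by field name.
structured = {"from": msg["From"], "to": msg["To"], "subject": msg["Subject"]}

# The body is unstructured: it comes back as a single block of text.
unstructured = msg.get_payload()

print(structured)
print(unstructured)
```

The parser can answer "who sent this message?" directly from the headers, but it offers no comparable way to ask what the body says about build times.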
The information stored in unstructured data often is highly valuable. For example, the email messages sent between members of a project team may contain insights into the development of the project over time, such as which decisions led to increased efficiency. Yet such insights remain undiscovered if they are prohibitively expensive or time-consuming to extract from the unstructured data in which they are stored. Today's businesses often continue to rely on manual human analysis of unstructured data (such as human review of the project emails just mentioned), aided by search engines, to extract insights from unstructured data. Such manual analysis, which does not differ fundamentally from reviewing the same data on paper, is tedious, time-consuming, and prone to error. Furthermore, the amount of effort required to perform such analyses often leads businesses to not even attempt to perform them.
As a result, vast amounts of valuable information stored in unstructured data remain untapped. Furthermore, the amount of unstructured data stored by today's businesses is growing at a rate of 70% year over year. Therefore we should expect the value of untapped information stored in the form of unstructured data to grow commensurately.
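To make the growth figure concrete, the sketch below compounds a 70% annual growth rate over several years. The starting volume of 100 TB and the function name `projected_size_tb` are assumptions made only for illustration, as is the premise that the rate stays constant.

```python
def projected_size_tb(initial_tb, years, annual_growth=0.70):
    """Compound a constant annual growth rate over a number of years."""
    return initial_tb * (1 + annual_growth) ** years


# Hypothetical starting point of 100 TB of unstructured data.
for year in range(4):
    print(year, round(projected_size_tb(100, year), 1))
```

At a constant 70% rate, the volume grows by a factor of 1.7 each year, or roughly fivefold (1.7 cubed is 4.913) over three years.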
In addition, unstructured data are highly distributed and dynamic. Documents stored in a single corporate file system may be distributed across multiple servers located in multiple facilities. The location of any individual file may change from day to day, often automatically and without the knowledge of the system's users, such as when the system's network-attached storage subsystem is upgraded or relocated. The distributed and dynamic nature of unstructured data poses challenges for, and must be taken into account by, anyone who strives to extract value from unstructured data.
Unstructured data pose particular problems for managers of information technology (IT) systems. The manager of an IT system at a modern enterprise may have responsibility for ensuring that only authorized personnel have access to components of the system, managing the storage capacity of the system to ensure that it does not run out of storage space, performing chargeback (i.e., charging each division/department of the enterprise for its share of use of the enterprise's data), and keeping all components of the system in good working order to minimize downtime. Performing these and other functions successfully requires the IT manager to have accurate and up-to-date information about the state of the IT system. Yet IT managers often do not have such information because it is too difficult and expensive to extract it.
IT managers, however, cannot simply decline to perform these functions. Auditing requirements, such as those imposed by the Sarbanes-Oxley Act, may require IT managers to track and provide certain information stored in the IT system. Corporate policies may require that certain specified information be backed up, retained for a certain period of time, or destroyed after a certain period of time, despite the difficulty of locating such information and applying the correct procedures to it. Litigation may impose requirements on IT managers, such as the need to produce documents satisfying specified criteria or to prove that reasonable steps have been taken to secure data.
Consider, for purposes of example, an IT environment in a modern enterprise which contains 300 terabytes (TB) of data. For purposes of comparison, consider that approximately 5,000,000 typical word processing files can be stored in 1 TB of storage. Such an IT environment also includes a large number of devices of many types, such as desktop and laptop computers, mobile computing devices (such as cellular telephones and personal digital assistants (PDAs)), printers, monitors, networking devices, and network-attached storage devices. Such an IT environment also includes a large number of software programs of many types, such as operating systems, word processing software, database management systems, backup applications, and network security applications. Different versions of the same device and/or application may exist simultaneously in the same IT environment.
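The scale implied by these two figures can be worked out directly. The short calculation below uses only the numbers given above; the variable names are illustrative.

```python
FILES_PER_TB = 5_000_000  # approximate word processing files per terabyte
ENVIRONMENT_TB = 300      # size of the example IT environment

# Number of typical word processing files that 300 TB could hold.
equivalent_files = FILES_PER_TB * ENVIRONMENT_TB

print(f"{equivalent_files:,}")  # 1,500,000,000
```

In other words, the example environment holds the equivalent of roughly 1.5 billion typical word processing files.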
Although it would be difficult enough for an IT manager to manage such an environment due to its sheer size and complexity, the management task is further complicated by additional factors. For example, the IT environment is not static; it changes over time. Users of the system move temporarily (such as when they log in to the system from a satellite office rather than their home office) and permanently (such as when they relocate to a new home office). Devices, software, and data also move within the system. Sometimes it is desirable for such changes in location to be hidden from end users. For example, if the data stored in a file system directory moves from one physical hard disk drive to another, such a change should be hidden from software applications so that they do not break and need to be reprogrammed. In other situations, however, changes to the IT system should be visible to end users. For example, if an employee moves from one department to another, it may be desirable to prohibit that employee from accessing equipment (such as servers and printers) in the employee's old department. Ensuring that the IT system functions smoothly in the face of such changes to the system itself poses a significant challenge to IT managers.
What is needed, therefore, are improved techniques for managing data in an IT environment.