The present invention relates to management and utilization of data, including unstructured data. Unstructured data generally represent data that do not have a common schema and are not effectively managed by a conventional database management system. For example, data contained in email messages, HTML files, XML files, MS Office files, etc. may represent part of the unstructured data of an organization. Unstructured data may represent the majority of data of a typical organization.
Organizations today face various challenges related to data/information management. For example, increased digitized content, retention of data due to regulatory requirements, the prevalence of productivity tools, the availability of data on communication networks, and other factors have been driving rapid growth of data volumes in organizations. In response to the rapid data growth, most organizations have been expanding data storage. However, most organizations have had difficulties efficiently, effectively, and economically managing and utilizing data stored in data storage, especially unstructured data.
Unstructured data are typically scattered across networks and practically invisible to the database management systems of organizations. At the same time, unstructured data may contain data that are crucial to the operation, reputation, interests, and even existence of an organization. In an example, an organization may need to timely find a certain piece of information in unstructured data for litigation support. In another example, an organization may need to timely identify privacy data in unstructured data for protection of customer privacy and security. In another example, an organization may need to timely identify data pertaining to design concepts in unstructured data for protection of intellectual property. The failure of an organization to timely identify, find, and/or retrieve necessary information from unstructured data may result in significant damage to the organization and related parties.
Some techniques have been employed for managing data. However, the existing techniques have various disadvantages.
For example, to prevent the accumulation of unstructured data, an organization may store data in secure, closely monitored databases and may have strict procedures and policies governing how users (e.g., employees) handle and store data. However, the procedures and policies may impose a significant burden on users, and therefore may reduce the productivity and efficiency of the users. Further, there may be no systematic way to validate that the procedures and policies are followed. As a result, the organization may still have a significant amount of unstructured data that cannot be efficiently and effectively utilized.
In another example, an organization may deploy search engines for finding information in unstructured data. However, the deployment of the search engines may typically require customization of search parameters, and therefore may require a significant amount of consultant hours and a long lead time to implement. Changes to the search parameters may be costly and time-consuming. The searches may involve a substantial number of manual processes (e.g., coding), and the searches may not be efficient enough to timely deliver useful results.