Particular embodiments relate generally to data management and more specifically to adaptively classifying unstructured data.
Data management is significant for all organizations. Growing organizations witness exponential growth in their data reserves. Organizations use databases to manage their data reserves. The databases help users organize and manage the data, and also enable the users to access the data whenever required. The databases enable users to input search queries, for example, Structured Query Language (SQL) queries, and help them retrieve the required data from the databases. Generally, organizations have several databases to collect and store data. Alternatively, the organizations can also have a centralized database to collect and store data.
Unstructured data (or unstructured information) includes information that either does not have a data structure or has one that is not easily usable by traditional computer programs. Unstructured data contrasts with structured data such as, for example, data stored in fielded form in databases or annotated (using tags, metadata, etc.) in documents. Examples of unstructured data include, but are not limited to, text files such as Microsoft Word documents, Portable Document Format (PDF) files, and email records; image files such as Joint Photographic Experts Group (JPEG) files, Tagged Image File Format (TIFF) files, and Graphics Interchange Format (GIF) files; audio files such as MP3 files, Waveform Audio File Format (WAV) files, and Windows Media files; and video files such as Moving Picture Experts Group (MPEG-4) files. Market research reveals that unstructured data accounted for 6 petabytes of capacity in 2007, and is expected to grow at an annual rate of 54% to 27.5 petabytes by 2010.
Data retrieval from unstructured data can be difficult since there may not be identification attributes such as tags or metadata associated with the unstructured data.
In addition, users generally prefer to retrieve information based on the content and context of the information instead of by the explicit names of files. Information retrieval by name may not be very helpful in accessing unstructured data, such as in cases where it is desirable to use imprecise queries in which names and paths to the files are not specified. Examples of such search queries may include finding all documents that are related to Joe Smith, or finding all images that contain an image of a car. It does not help that naming conventions for files often have no relationship to the content or the context. Moreover, most unstructured data is not tagged or classified by creators or users, since doing so consumes a considerable amount of time. Additionally, the classification that creators and users do perform is not consistent.
However, various methods to solve the problem of information retrieval from unstructured data are available. One of the methods for retrieving information from unstructured data is applicable when the underlying structure of the data model or the context of the data is well known. In such a case, the data can be parsed and, subsequently, entered into a database. Thereafter, information retrieval can be achieved through standard SQL queries on the database.
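The parse-then-query approach described above can be illustrated with a minimal sketch. It assumes a hypothetical record layout ("name: ...; dept: ...") that is known in advance; the table name, field names, sample records, and helper functions are illustrative assumptions and are not taken from any particular embodiment.

```python
import sqlite3

# Hypothetical raw records whose layout is assumed to be known in advance.
RAW_RECORDS = [
    "name: Joe Smith; dept: Sales",
    "name: Ann Lee; dept: Legal",
]

def parse_record(line):
    """Split a 'key: value; key: value' line into a dict of fields."""
    fields = {}
    for part in line.split(";"):
        key, value = part.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields

def load_records(lines):
    """Parse each line and enter it into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
    rows = [(r["name"], r["dept"]) for r in map(parse_record, lines)]
    conn.executemany("INSERT INTO employees VALUES (?, ?)", rows)
    return conn

def find_by_dept(conn, dept):
    """Retrieve names with a standard parameterized SQL query."""
    cur = conn.execute(
        "SELECT name FROM employees WHERE dept = ? ORDER BY name", (dept,)
    )
    return [row[0] for row in cur.fetchall()]
```

For example, `find_by_dept(load_records(RAW_RECORDS), "Sales")` returns the names parsed from the Sales records. The sketch works only because the record layout is fixed; a record that does not follow the assumed layout would fail to parse, which reflects the limitation that this method requires a well-known underlying structure.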
Another method for retrieving information from unstructured data is applicable when the underlying structure of the data model is not known but a specific document can be well characterized by a set of key words. In such a case, a search using explicit keywords (or tags) can be carried out for information retrieval. These methods are imprecise and may result in search results for a query that are not satisfactory to a user. If the user is not satisfied, the user can change the search terms used. This, however, does not address the fact that the results for the initial search query were not satisfactory.
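A search by explicit keywords can likewise be sketched in a few lines. The file names, tags, and function names below are hypothetical and serve only to illustrate the general technique of an exact-match keyword index; they are not drawn from any particular embodiment.

```python
from collections import defaultdict

# Hypothetical documents, each characterized by a set of explicit keywords.
DOCUMENT_TAGS = {
    "report.doc": {"sales", "joe smith"},
    "photo.jpg": {"car"},
    "memo.pdf": {"joe smith", "legal"},
}

def build_index(document_tags):
    """Map each lower-cased keyword to the set of documents carrying it."""
    index = defaultdict(set)
    for doc, tags in document_tags.items():
        for tag in tags:
            index[tag.lower()].add(doc)
    return index

def search(index, keyword):
    """Return the sorted documents whose keywords exactly match the query."""
    return sorted(index.get(keyword.lower(), set()))
```

With this index, a query for "Joe Smith" finds both documents tagged with that name, while a query for "automobile" finds nothing even though an image tagged "car" exists. The exact-match behavior illustrates why results from keyword-based retrieval may not be satisfactory to a user.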
The methods mentioned above, however, can become cumbersome in several situations. First, it may not be known how to characterize the context of the information retrieval query, especially when the knowledge sought is implicit. Second, the rules of classification specific to a business may not be specifiable, that is, when a narrow classification rule is required as opposed to a broad rule specified by regulatory compliance needs. Third, the needs of the business may change over time and, therefore, the classification of information will also need to be revised to reflect the evolving nature of the organization.