Particular embodiments relate to electronic archive data management and, more specifically, to a data management system configured to classify, analyze, and query data maintained in unstructured formats such as file systems, web logs, wikis, email text, image, audio, video, and other multimedia data.
Various methods of managing collections of data (e.g., databases) have been developed since data was first stored in electronic form, so as to enable efficient retrieval and extraction of desired information. From initial systems and applications that simply collected data in one or more database files to present-day sophisticated database management systems (DBMS), different solutions have been developed to meet different requirements. Early solutions may have had the advantage of simplicity but became obsolete for a variety of reasons, such as the need to store large, even vast, quantities of data, a desire for more sophisticated search and/or retrieval techniques (e.g., based on relationships between data), and the need to store different types of data (e.g., audio, video, and the like). Later approaches have concentrated on populating databases using automated techniques. Such techniques, of which federated searches, web crawlers, and content extraction engines are examples, often act as mere agents for adding data to a database in specific formats or for solving specific problems. The databases created as a result of the action of such agents are often extremely structured and specific to the format types and issues of the data being added.
In short, today's database management systems have been designed to manage structured data, typically along a single dimension, very effectively. However, today's database management schemes have still not evolved toward managing data that are multi-dimensional in nature. Moreover, when the structure of the data is not known, as in retained unstructured data archives or repositories, existing database systems cannot be applied.