The present invention relates to crawling/enumerating/classifying data for processing, management, and/or utilization of the data.
Organizations today face various challenges related to data/information management. For example, increased digitized content, retention of data due to regulatory requirements, the prevalence of productivity tools, the availability of data on communication networks, and other factors have been driving rapid growth of data volumes in datastores of organizations.
In addition to the tremendous data volumes, a substantial portion of data stored in the datastores may have heterogeneous attributes. Such data may not be effectively manageable utilizing a common schema and/or a database management system.
Further, the datastores may automatically create filesystem snapshots for files/data in order to avoid version skew when backing up volatile data sets. The filesystem snapshots may create duplicated versions of the same files, and may further expand the data volumes.
Existing data enumeration/classification mechanisms, or crawlers, may not be able to efficiently, effectively, and economically crawl through data with heterogeneous attributes. Further, existing crawlers may be able to operate for only hours at a time and may not be able to enumerate/classify a large amount of data. Further, crawlers may not be able to continuously enumerate/classify a large amount of data without interruptions caused by factors such as network problems, power shutdowns, maintenance of filers, etc.
As a result, most organizations have had difficulties enumerating/classifying data stored in datastores for efficient, effective, and economical utilization of the data.