Knowing what kinds of data are stored on computers and how sensitive those data are can help ensure the security of the data in an organization. Traditionally, data classification has been done primarily by hand by system administrators, e.g., by labeling data on a scale from the most sensitive (e.g., “Top Secret”) to the least sensitive (e.g., “Unclassified”). However, manual labeling is not feasible for a large organization that holds billions of data files. Recently, technologies have been developed for automated inspection of data content for the purpose of data loss prevention. However, these methods suffer from several major limitations. For example, crawling and classifying a huge number of files consumes substantial computing power and significantly degrades system performance; direct access to the computers is required to scan the data content, which is challenging for an organization with many heterogeneous systems; building data classification systems for a large number of categories is very time consuming; the classifiers are domain dependent and need to be retrained for each new domain; and content inspection is not permitted in some cases where data privacy and security are a concern, so these techniques cannot be applied at all.