Deduplication reduces the amount of storage needed to back up data in a client system. Sufficient deduplication backup storage must be available to support ongoing backup regimens, growth in file size and file count, unexpected or unusually large files, equipment failure and data retention needs. It is difficult to size a new deduplication backup system, and equally difficult to estimate when an existing deduplication backup system will run out of capacity. The required deduplication capacity is not linearly proportional to the amount of data being backed up, since the amount of data reduction achieved by deduplication may vary considerably. Often, deduplication storage capacity is estimated manually. One known estimating tool, the EMC Avamar™ CATTOOL, applies a modified client and runs an actual or simulated deduplication against a sample consisting of some fraction of the total data on a customer system. The tool produces a log file that can then be used to determine the data commonality or deduplication ratio of this sample. Accurate use of this tool relies on customers identifying representative data, which is time-consuming and which they may or may not do correctly. The consequences of inaccurately predicting or allocating deduplication storage capacity, or of failing to arrange a timely upgrade of such capacity, can include system downtime.
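The idea of measuring a deduplication ratio on a sample can be illustrated with a minimal sketch. It is not the implementation of the tool named above, whose internals are not described here. The function name `estimate_dedup_ratio`, the fixed-size chunking and the use of SHA-256 digests are all assumptions chosen for illustration.

```python
import hashlib
from typing import Iterable


def estimate_dedup_ratio(samples: Iterable[bytes], chunk_size: int = 4096) -> float:
    """Return logical bytes divided by unique-chunk bytes for the sampled data.

    Hypothetical helper: splits each sample into fixed-size chunks and
    counts a chunk as stored only the first time its digest is seen.
    """
    seen = set()   # digests of chunks already "stored"
    logical = 0    # bytes presented for backup
    stored = 0     # bytes that would actually be kept
    for data in samples:
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            logical += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    # An empty sample has nothing to reduce, so report a neutral ratio.
    return logical / stored if stored else 1.0


# Four 4096-byte chunks are presented, but only two are distinct.
redundant = [b"A" * 8192, b"A" * 4096 + b"B" * 4096]
print(estimate_dedup_ratio(redundant))  # 2.0
```

Two samples of the same size can yield very different ratios under such a measurement, which is one way to see why required capacity does not scale linearly with the amount of data and why the choice of a representative sample matters.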