Data objects, such as files and directories, are typically organized in memory or on media using a file system. Companies that design and build file systems are constantly evaluating existing file systems in order to optimize and improve upon them. However, rather than use a customer's working or active file systems, these companies will typically generate and study a file system model, also known as “simulating a file system” or “creating a file system simulation,” in order to design a new file system that will fulfill the needs of the customer. The file system model is designed to simulate how the customer's working file system might behave, using synthetic data instead of real data. Afterwards, a “real” file system can be designed based upon the evaluation of the file system model.
Accurate file system models are built using information such as the size, number and types of objects to be stored on the file system. The file system model will also take into account when and how often data objects are created, deleted or edited over time on the file system (i.e., file system events). A person having skill in the art will appreciate that generating a file system model includes creating the synthetic file system's data objects, or content, that will comprise the file system model. Therefore, gathering information on how file system data objects change over time is important to the process of generating the file system model.
There are a number of ways to gather information about file system data and file system events in order to generate a file system model. Some of these methods are described in co-pending U.S. patent application Ser. No. 12/344,107, entitled, “SYSTEM AND METHOD FOR MODELING DATA CHANGE OVER TIME,” which is incorporated in full herein. However, the process of actually generating the model and its associated data is often tedious, time-consuming and resource-intensive. Since a file system is generally built to function over time, it is important to know the state of the file system at selected points in time, or “timepoints.” As a result, it is not uncommon to simulate a file system at multiple timepoints. Typically, the information gathered that describes common file system events for a file system will be used to generate an initial user-defined data model. The user-defined data model is used as a foundation for generating future file system models at different timepoints. Because generating even a single file system model may be time-consuming and resource-intensive, these issues become more pronounced when generating file system models for multiple timepoints.
Generating a file system model and its synthetic data typically follows one of two approaches: dynamic data generation or trace-driven simulation. Dynamic data generation involves generating data for a complete file system model for each selected timepoint. Dynamic data generation uses an algorithm that simulates how the data will change over time, as shown in FIG. 1. When the simulation is run, i.e., at “run-time,” the dynamic generation system will take a previous file system model, such as user-defined data model 101, and using data generator 123, will apply changes to the user-defined data model 101 to generate content for a new subsequent file system model. As a result, file system content 191 will comprise synthetic data for the new file system model.
Synthetic data will be generated for each timepoint at run-time based upon the user-defined models for those timepoints. Therefore, if the changes are logged daily, then the simulated timepoints may correspond to a certain day. For example, Day 0 may correspond to how the file system data appears initially. Day 1 may correspond to how the file system data appears the day after Day 0 by simulating files that may have been created since Day 0. Day 2 may correspond to how the file system data appears following Day 1 by simulating how some files may have been edited, and Day 3 may simulate files that may have been deleted. This iterative process may continue until the timepoint selected by the user is reached.
Dynamic data generation simulates changes (file creation, edits, deletions, etc.) on a file system model at run-time by creating a complete file system model for the first timepoint, Day 0, then simulating the changes to create a new file system model for each subsequent timepoint. If one wishes to see how the file system data may appear on Day 30, then the simulation will require data generation from Day 0 to Day 30, resulting in the creation of 31 file system models. Each file system model for each timepoint must then be stored. As a result, dynamic data generation can be extremely time-consuming and resource-intensive. One will appreciate that the more content generated for each timepoint's respective file system model, or the greater the number of file system events that occur between each timepoint, the longer and more complicated the simulation will be.
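The sequential cost described above can be illustrated with a minimal Python sketch. It is not the data generator 123 of FIG. 1. The function names (`generate_day0`, `next_day`, `simulate_until`), the dictionary representation of a model, and the rule of one creation, one edit and one deletion per day are all assumptions made for illustration only.

```python
import random

def generate_day0(num_files, rng):
    """Hypothetical Day 0 model: maps a file name to a synthetic size in bytes."""
    return {f"file_{i}": rng.randint(1, 4096) for i in range(num_files)}

def next_day(model, day, rng):
    """Build a complete new model for `day` from the previous day's model.

    Assumed change rule (illustrative only): one file is created, one
    existing file is edited, and one existing file is deleted per day.
    """
    new_model = dict(model)                                   # full copy of the prior model
    new_model[f"created_day{day}"] = rng.randint(1, 4096)     # creation
    edited = rng.choice(sorted(model))
    new_model[edited] = rng.randint(1, 4096)                  # edit changes the size
    deleted = rng.choice(sorted(model))
    del new_model[deleted]                                    # deletion
    return new_model

def simulate_until(target_day, num_files=10, seed=0):
    """Generate and keep one complete model per timepoint, Day 0..target_day."""
    rng = random.Random(seed)
    models = [generate_day0(num_files, rng)]
    for day in range(1, target_day + 1):
        # Day N cannot be produced until Day N-1 exists.
        models.append(next_day(models[-1], day, rng))
    return models

models = simulate_until(30)
print(len(models))  # 31 models are generated and stored to reach Day 30
```

Even in this toy form, the loop in `simulate_until` cannot skip ahead: each model is derived from the one before it, and every intermediate model is retained.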
Trace-driven simulation involves applying an algorithm, or “trace,” that summarizes simulated user activity at set timepoints. In other words, the trace describes what may be done to various files in the simulated file system. The trace for each file in a file system is generated prior to run-time. At run-time, the trace is applied to previously-generated synthetic data, and the resulting changed synthetic data is created. This is shown in FIG. 2, in which an initial user-defined data model 201 is passed through a trace generator 211. Trace generator 211 applies a trace algorithm for each of the synthetic data files in data model 201, resulting in trace files 215. In other words, each trace file 215 will correspond to a file system object in a file system model that will be generated at run-time.
Using the example above, the trace may describe that on Day 0, the file system is empty; on Day 1, files A, B and C were created; on Day 2, file D was created and file A was edited; on Day 3, file B was deleted and file C was edited, etc. The trace, therefore, describes a sequential process of file system events. A file system model is not created until run-time, at which point the trace interpreter 222 will read the respective trace files 215 and will create file system model 291. In order to simulate the file system model on Day 3, the trace for Day 2 must be known. Similarly, in order to simulate the file system model for Day 2, the trace for Day 1 must be known, and so on. A large file system model will have a large number of trace files. Therefore, the trace-driven simulation process can also be very time-consuming and resource-intensive.
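The Day 0 to Day 3 example above can be replayed with a short Python sketch. It is a simplification, not the trace interpreter 222 of FIG. 2: the trace is assumed to be a single in-memory list of (day, operation, file name) tuples rather than one trace file per object, and the names `TRACE` and `interpret_trace` are hypothetical.

```python
# Assumed trace format: (day, operation, file name). Day 0 has no events,
# so the file system starts empty.
TRACE = [
    (1, "create", "A"), (1, "create", "B"), (1, "create", "C"),
    (2, "create", "D"), (2, "edit", "A"),
    (3, "delete", "B"), (3, "edit", "C"),
]

def interpret_trace(trace, target_day):
    """Replay events in day order, up to and including target_day.

    The model maps a file name to a version counter (1 on creation,
    incremented on each edit); this representation is an assumption.
    """
    model = {}
    for day, op, name in sorted(trace, key=lambda event: event[0]):
        if day > target_day:
            break  # later events do not affect the requested timepoint
        if op == "create":
            model[name] = 1
        elif op == "edit":
            model[name] += 1
        elif op == "delete":
            del model[name]
        else:
            raise ValueError(f"unknown operation: {op}")
    return model

print(interpret_trace(TRACE, 3))  # {'A': 2, 'C': 2, 'D': 1}
```

The Day 3 result depends on every earlier event: file A carries the Day 2 edit, and file B is absent only because its Day 1 creation and Day 3 deletion were both replayed in order.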
As illustrated above, a key limitation of dynamic data generation and trace-driven simulations is that they are sequential processes. In other words, in order to simulate a file system as it might appear on any arbitrary Day X, the present file system model simulators require that the simulation be run from Day 0 until Day X. This requires a significant amount of time and processing power in order to perform each simulation, especially when simulating long-term file system models. In addition, storing each sequentially created file system model consumes storage space and can further handicap simulation performance.
File system models are often used to design systems for performing backup operations on a customer's file system. In most cases, backup operations are performed daily. Therefore, designing and configuring a system for backing up a file system requires simulating the changes to the file system using daily timepoints. Generating a file system model for each daily timepoint that simulates how the file system changes between those timepoints will help determine the optimal backup and recovery application for the customer, as well as the customer's storage requirements. However, because present methods for file system simulation are sequential, a file system model must be simulated for a first timepoint before a file system model can be simulated for the subsequent timepoint. As previously discussed, this requires a significant amount of time, processing power and memory.
What is therefore needed is a more efficient way to simulate a file system such that a file system model can be created for any selected point in time.