The present invention is generally directed to database content correction. More specifically, the present invention is directed to detecting and correcting differences between a data feed file and a corresponding representation of the data feed file stored in a database.
Data feeds are records of data that are transmitted to a database machine to be stored in a database. A database machine is any computer device, such as a PC or a network server, that has a database. For example, data feeds can contain detailed records of network conversations, which are explicit exchanges of data between two or more network endpoints. As a further example, a data feed containing detailed Internet Protocol (IP) traffic records can be collected for IP traffic analysis. Data feeds can be transported to a database machine in the form of data streams. When data feeds are transported to a database machine, the data feeds are sampled and the sampled data feeds are stored as data feed files in a file system of the database machine. The data feed files are distinguished from one another by a filename, which can include a source of the data feed and a source timestamp corresponding to a time at which the data feed was generated. The data feed files are then imported into a database and stored as a set of records. The information contained in the filename of a data feed file can be used to identify the set of records that represents the data feed file. As used herein, the term “database file” refers to a set of records in the database. For a particular data feed file, a corresponding database file is the set of records stored in the database that represents the contents of that data feed file.
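By way of illustration only, the following Python sketch shows how the source and the source timestamp might be recovered from a filename so that the corresponding database file can be identified. The filename layout `<source>_<YYYYMMDDHHMMSS>.dat`, the `FeedFileKey` type, and the `parse_feed_filename` function are assumptions made for this example and are not taken from any particular system.

```python
from datetime import datetime
from pathlib import Path
from typing import NamedTuple


class FeedFileKey(NamedTuple):
    """Hypothetical key identifying the set of records for one data feed file."""
    source: str
    source_timestamp: datetime


def parse_feed_filename(filename: str) -> FeedFileKey:
    """Split an assumed '<source>_<YYYYMMDDHHMMSS>.dat' filename into its parts."""
    stem = Path(filename).stem                   # drop the extension
    source, _, stamp = stem.rpartition("_")      # split at the last underscore
    if not source:
        raise ValueError(f"no source/timestamp separator in {filename!r}")
    # strptime raises ValueError if the timestamp does not match the layout.
    return FeedFileKey(source, datetime.strptime(stamp, "%Y%m%d%H%M%S"))
```

Under these assumptions, the returned key could serve as the lookup value for the records that make up the corresponding database file.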
When an original data feed received at a database machine is sampled and stored as a data feed file, enough information from the original data feed is also stored to allow the original data feed to be re-sampled. If there is any problem with a database file, such as errors being detected therein, the stored information from the original data feed is re-sampled and stored in the file system as a new version of the data feed file. Furthermore, it may be necessary to re-sample the original data feed in order to preserve a greater level of detail when subtle problems arise. For example, if the original data feed is network traffic data, the network traffic data may be re-sampled to preserve greater detail at a certain stage of a denial of service attack. The new version of the data feed file has the same filename as the previous version of the data feed file, but may contain different data.
When a new version of a data feed file previously stored in a database is stored in the file system of a database machine, the new version of the data feed file is assigned a file system timestamp corresponding to a time at which the new version of the data feed file is stored in the file system. The database machine periodically scans the file system for new files (i.e., files having a file system timestamp more recent than the time of the previous scan). When a scan finds a new version of a file previously stored in the database, the entire previously stored database file is deleted, and the entire new data feed file is imported to be stored in the database, even if only a small fraction of the data feed file differs from the corresponding database file previously stored in the database.
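The conventional scan-and-reload behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than an actual implementation: the `records` table, its columns, the one-record-per-line file layout, and the function names `find_new_files` and `full_reload` are all hypothetical, and SQLite stands in for whatever database is actually used.

```python
import os
import sqlite3
from pathlib import Path


def find_new_files(feed_dir, last_scan_time):
    """Return names of files whose file system timestamp (modification time,
    in seconds since the epoch) is more recent than the previous scan."""
    return sorted(
        entry.name
        for entry in os.scandir(feed_dir)
        if entry.is_file() and entry.stat().st_mtime > last_scan_time
    )


def full_reload(conn, feed_dir, filename):
    """Delete the entire database file for `filename`, then import the whole
    data feed file again, regardless of how little of it changed."""
    lines = (Path(feed_dir) / filename).read_text().splitlines()
    with conn:  # run the delete and the re-import as a single transaction
        conn.execute("DELETE FROM records WHERE feed_file = ?", (filename,))
        conn.executemany(
            "INSERT INTO records (feed_file, line_no, payload) VALUES (?, ?, ?)",
            [(filename, i, line) for i, line in enumerate(lines)],
        )
    return len(lines)  # number of records written
```

In this sketch, every record of the file is deleted and rewritten even when a single line differs, which is the source of the inefficiency for very large files.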
Typically, data feed files (and the corresponding database files) are very large. Therefore, deleting and re-loading large files having mostly the same data is inefficient and can lead to database downtime. Accordingly, it is desirable to detect and correct differences between a data feed file and a corresponding stored database file while minimizing database downtime.