Computer-based systems are often used to collect and save data from hardware devices or sensors. Such hardware devices or sensors may be configured to generate data values at a specific rate (e.g. 5 values per second), with each data point having both a value and an associated timestamp. Such data is hereafter referred to as “time-series data”. Examples of such series include, but are not limited to, continuous voltage readings from a voltmeter or RF (Radio Frequency) voltage values from an RF sensor.
In semiconductor manufacturing environments, time-series data can be important for process control systems. In a simple scenario, a semiconductor manufacturing process control system can monitor data and perform certain actions when data values exceed predefined thresholds. In more complex scenarios, process control systems for semiconductor manufacturing processes can utilize the timestamp linked to each data value to calculate derivatives, and perform actions based upon the rate of rise or drop in a value of a process parameter measured from a sensor.
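The two scenarios above can be illustrated with a minimal sketch. It assumes samples arrive as (timestamp in seconds, value) pairs; the names `rates_of_change`, `find_alarms`, `value_limit`, and `rate_limit` are hypothetical and chosen only for illustration, and the finite-difference derivative is one simple choice among several.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, measured value)


def rates_of_change(samples: List[Sample]) -> List[Sample]:
    """Finite-difference rate between consecutive samples.

    Returns (timestamp of the later sample, rate in value units per second).
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        rates.append((t1, (v1 - v0) / dt))
    return rates


def find_alarms(samples: List[Sample], value_limit: float, rate_limit: float):
    """Return timestamps where the value or its rate of change exceeds a limit."""
    # Simple scenario: the value itself exceeds a predefined threshold.
    value_alarms = [t for t, v in samples if v > value_limit]
    # More complex scenario: the rate of rise or drop exceeds a threshold.
    rate_alarms = [t for t, r in rates_of_change(samples) if abs(r) > rate_limit]
    return value_alarms, rate_alarms
```

Because the time difference appears in the denominator of each rate, any error in a timestamp carries directly into the computed derivative.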
Time-series data is also important in the development of process control algorithms by permitting observation of a current state of a semiconductor manufacturing process tool at a specific point in time, thereby allowing relational questions to be answered. For example, time-series data allows questions to be asked such as: 1) what was the current value reported from all sensors/devices at time “May 23, 2004 01:00:01.125”?, or 2) when the value from Device A reached 1.5 Volts, what was the measured value from Device B?
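As a rough sketch of how the second question might be answered, the example below finds the first time Device A reaches a level and then looks up the most recent Device B value at or before that time. The helper names `value_at` and `first_time_reaching`, the sample data, and the "most recent value" rule are all assumptions made for illustration.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, measured value)


def value_at(series: List[Sample], t: float) -> Optional[float]:
    """Most recent value reported at or before time t, or None if there is none."""
    times = [ts for ts, _ in series]
    i = bisect_right(times, t)  # number of samples with timestamp <= t
    return series[i - 1][1] if i else None


def first_time_reaching(series: List[Sample], level: float) -> Optional[float]:
    """Timestamp of the first sample whose value is at or above level."""
    for ts, value in series:
        if value >= level:
            return ts
    return None


# Hypothetical readings; the two devices report at offset times.
device_a = [(0.0, 1.2), (1.0, 1.4), (2.0, 1.5), (3.0, 1.7)]
device_b = [(0.3, 10.0), (1.3, 11.0), (2.3, 12.0)]

t_reached = first_time_reaching(device_a, 1.5)
b_then = value_at(device_b, t_reached) if t_reached is not None else None
```

The answer is only as good as the timestamps: if either series is stamped inaccurately, the lookup may select the wrong Device B sample.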
Computer systems can typically capture time-series data as received from various hardware devices, and then store the data in a database. As the computer systems receive each discrete value in the data series, they attach a timestamp and store the value and timestamp together in the database.
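A minimal sketch of this capture step is shown below, assuming an in-memory SQLite database and a hypothetical `readings` table. The function names and the injectable `clock` parameter are illustrative assumptions, not part of any particular system.

```python
import sqlite3
import time
from typing import Callable


def open_store() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical readings table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (device TEXT, ts REAL, value REAL)")
    return conn


def record(conn: sqlite3.Connection, device: str, value: float,
           clock: Callable[[], float] = time.time) -> float:
    """Attach a timestamp at the moment of receipt and store it with the value."""
    ts = clock()  # reference clock is read on receipt, not when the value was generated
    conn.execute("INSERT INTO readings VALUES (?, ?, ?)", (device, ts, value))
    return ts
```

Note that the timestamp here reflects when the value was received, which is not necessarily when the device generated it.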
At least two problems may arise in the management and/or analysis of such time-series data. A first issue relates to the accuracy of the timestamp in view of intrinsic lag times and drift. Specifically, as the timestamp may be used to generate a derivative value, the accuracy of a timestamp is just as important as the data value itself. However, it may be difficult to generate an accurate timestamp representing a moment in time when each data point was generated.
For example, some devices contain an internal clock generating a timestamp along with the data value. Other devices, however, contain only a simple sampling timer allowing generation of data values at the specified rate, without reference to an absolute time.
In the case of simple devices containing only sampling timers, the computer system responsible for capturing the data usually creates timestamps by looking at some reference clock (e.g. the computer system's own clock), when each data point is received. However, for a number of reasons, the computer system cannot simply use the current timestamp from the reference clock each time it receives a data point.
First, there exists an inherent lag between the time the data is generated by the device, and the time the data is received by the computer system. One example of such a time lag is the network delay. The unpredictable variation in the duration of this time lag precludes a simple solution to this problem, such as subtracting a fixed number of milliseconds from the timestamp of every data point.
Drift is a second reason that a computer system cannot simply use the current timestamp from the reference clock each time it receives a data point from a device containing only a sampling timer. Specifically, the clock of the device will generally exhibit an inherent degree of drift relative to the reference clock of the computer. For example, if the device is configured to report one data value per second, it may in fact report one data value every 0.998 seconds relative to the reference clock. Such drift can degrade the accuracy of the timestamp component of time-series data.
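To give a sense of scale, the sketch below works through the figures in the example above. It assumes, purely for illustration, that timestamps were reconstructed from the nominal one-second period; the function name `accumulated_drift` is hypothetical.

```python
def accumulated_drift(nominal_period_s: float, actual_period_s: float,
                      n_samples: int) -> float:
    """Seconds by which nominal-rate timestamps lead the reference clock
    after n_samples samples."""
    return n_samples * (nominal_period_s - actual_period_s)


# One value nominally per second, actually every 0.998 s by the reference clock:
# each sample contributes about 0.002 s of error, so 3600 samples give about 7.2 s.
drift_after_3600 = accumulated_drift(1.0, 0.998, 3600)
```

An error of 2 milliseconds per sample is small in isolation, but it grows linearly with the number of samples.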
A second problem which may arise in the management and/or analysis of time-series data relates to the interval of sampling from multiple data sources. Specifically, data from multiple sources may be received at different intervals. FIG. 1 shows a simplified schematic diagram illustrating the flow of data from two different sensor devices. Sensor Device A may report data every second, while Sensor Device B reports data every 5 seconds. Even if two devices report data at the same interval, their data production may be out-of-synchronization (i.e. there may be a slight offset between them). This is shown in FIG. 2.
Given this potential lack of synchronization, it may prove difficult to produce a unified view of data values from all sources in order to fully represent the data extant at any one moment in time. FIGS. 3 and 4 show simplified schematic diagrams depicting creation of a unified view of data (the “combined report”), using data from the out-of-synchronization sources of FIG. 2. Without the unified view presented in FIGS. 3-4, it is difficult to answer relational questions regarding data acquired from different sources.
Many data collection systems simply write data points into a database in their raw form (i.e. with their originally assigned timestamps). Extra processing of the data is thus necessary in order to create the unified view. As shown in FIGS. 3 and 4, whenever such a unified view is to be constructed, data values may need to be shifted, interpolated, and/or duplicated in order to generate a unified table. For voluminous data sets and/or complex queries, such processing can consume large amounts of time. In addition, such data processing would need to be repeated for each query.
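The kind of extra processing described above can be sketched as follows. This example builds a unified table by duplication only, carrying each source's last known value forward to every timestamp at which any source reported. The function name `combined_report` and the row layout are assumptions for illustration and are not taken from the figures; shifting and interpolation are not shown.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, measured value)


def combined_report(sources: Dict[str, List[Sample]]) -> List[Dict[str, Optional[float]]]:
    """Build one row per distinct timestamp, with a column for every source.

    Each series must be sorted by timestamp. A source that has not yet
    reported at a given time gets None in that row.
    """
    names = sorted(sources)
    times_by_name = {n: [ts for ts, _ in sources[n]] for n in names}
    all_times = sorted({ts for series in sources.values() for ts, _ in series})

    rows = []
    for t in all_times:
        row: Dict[str, Optional[float]] = {"time": t}
        for n in names:
            i = bisect_right(times_by_name[n], t)  # samples with timestamp <= t
            row[n] = sources[n][i - 1][1] if i else None  # duplicate last value
        rows.append(row)
    return rows
```

Run against raw stored data, a routine like this has to scan every source for every query, which illustrates why the per-query cost grows with the size of the data set.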
Accordingly, there is a need in the art for new approaches and techniques for managing timestamp data.