A time series is a time-ordered sequence of data. A financial time series is a time-ordered sequence of financial data, typically a sequence of quotes for a financial instrument. These data should be homogeneous. Homogeneity means that the quotes of a series are of the same type and from the same market; they may differ in the origins of the contributors, but should not differ in important parameters such as the maturity (of interest rates, . . . ) or the moneyness (of options or their implied volatilities). The filter user is responsible for ensuring data homogeneity.
All quotes have the following general structure:

1. A time stamp. This is the time of data collection, the arrival time of the real-time quote in the collector's environment. Time stamps increase monotonically over the time series. There may also be other time stamps (e.g. the reported time of the original production of the quote). Such a secondary time stamp is, however, treated as side information (see item 4 below) rather than as a primary time stamp.

2. Information on the quote level. There are different types of level information, as markets and sources differ in nature and in organization. Some level information can be termed "price"; other information, such as transaction volume figures, cannot. Some non-prices, such as implied volatility quotes, can be treated as prices with bid and ask quotes. A neutral term such as "level" or "filtered variable" is therefore preferred to "price". In the case of options, the price might first be converted to an implied volatility, which then becomes the filtered variable. Different quoting types require different filtering approaches, as discussed below.

3. Information on the origin of the quote: information provider, name of exchange or bank, city, country, time zone, . . . . In the filtering algorithm, we only need one function to compare two origins. This is used when judging the independence and credibility of quotes, as explained in later sections. A further analysis of bank names or other IDs is not needed.

4. Side information: everything that does not fall into one of the three categories above, e.g. a second time stamp. This is ignored by filtering.
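The four-part quote structure above can be sketched as a small data type. This is a minimal illustration, not the authors' implementation; all names (`Quote`, `same_origin`, the field names) are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Quote:
    # 1. Primary time stamp: arrival time in the collector's environment
    #    (monotonically increasing over the series).
    time_stamp: float
    # 2. Level information; a neutral "level", not necessarily a price.
    bid: Optional[float] = None
    ask: Optional[float] = None
    # 3. Origin of the quote (provider, bank/exchange, location, ...).
    origin: str = ""
    # 4. Side information, e.g. a secondary time stamp; ignored by filtering.
    side_info: dict = field(default_factory=dict)

def same_origin(a: Quote, b: Quote) -> bool:
    # The single origin-comparison function the filtering algorithm needs,
    # used when judging the independence and credibility of quotes.
    return a.origin == b.origin
```

Note that the filter never parses the origin string further; it only ever compares two origins for equality, as stated in item 3.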
The information on the quote levels is organized in different structures depending on the market and the source. Some important cases are listed here:

- Single-valued quotes: each quote has only one value describing its level. Example: stock indices.

- Bid-ask quotes: each quote has a bid value and an ask value. Example: foreign exchange (FX) spot rates.

- Bid or ask quotes: each quote has a bid or an ask value, often in unpredictable sequence. This can be regarded as two different single-valued time series. Example: quotes on some exchanges.

- Bid or ask or transaction quotes: each quote has a bid value or an ask value or a transaction value. Again, this can be regarded as three different single-valued time series. Example: the data stream from the major short-term interest rate futures exchanges also includes transaction data.

- Middle quotes: in certain cases, we only obtain a time series of middle quotes, which are treated as single-valued quotes. The case of receiving only transaction quotes (no bid, no ask) is technically identical. Transaction volume figures, for example, are also treated as single-valued quotes.

- OHLC quotes: open/high/low/close. An OHLC filter can be built in analogy to the bid-ask filter, with some tests of the whole quote followed by quote splitting, as explained below.

We recognize a data error as being present if a piece of quoted data does not conform to the real situation of the market. We have to identify a price quote as a data error if it is neither a correctly reported transaction price nor a possible transaction price at the reported time. In the case of indicative prices, however, we have to tolerate a certain transmission time delay.
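The "bid or ask or transaction" case above amounts to demultiplexing one mixed stream into three single-valued series. A minimal sketch, assuming a hypothetical representation of the raw stream as `(time_stamp, kind, value)` tuples:

```python
def split_stream(stream):
    # Regard a mixed bid/ask/transaction stream as three different
    # single-valued time series, as described in the text.
    series = {"bid": [], "ask": [], "transaction": []}
    for time_stamp, kind, value in stream:
        series[kind].append((time_stamp, value))
    return series

mixed = [
    (1.0, "bid", 99.50),
    (1.2, "ask", 99.55),
    (1.5, "transaction", 99.52),
    (2.0, "bid", 99.51),
]
series = split_stream(mixed)  # three independent single-valued series
```

Each resulting series can then be filtered by the same machinery that handles any other single-valued time series.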
There are many causes for data errors. The errors can be separated into two classes:

1. Human errors: errors directly caused by human data contributors, for different reasons:
   (a) unintentional errors, e.g. typing errors;
   (b) intentional errors, e.g. dummy quotes produced just for technical testing.

2. System errors: errors caused by computer systems, their interactions, and their failures.

Strictly speaking, system errors are also human errors, because human operators have the ultimate responsibility for the correct operation of computer systems. However, the distance between the data error and the responsible person is much larger for system errors.
In many cases, it is impossible to find the exact reason for the data error even if the quote is very aberrant. The task of the filter is to identify such outliers, whatever the reason.
Sometimes the cause of the error can be guessed from the particular behavior of the bad quotes. This knowledge of the error mechanism can help to improve filtering and, in some cases, correct the bad quotes.
Examples of some of the errors to be expected are as follows:

1. Decimal errors: failure to change a "big" decimal digit of the quote. Example: a bid price of 1.3498 is followed by a true quote of 1.3505, but the published, bad quote is 1.3405. This error is most damaging if the quoting software uses a cache memory somewhere: the wrong decimal digit may stay in the cache and cause a long series of bad quotes. For Reuters page data, this was a dominant error type around 1988! Nowadays, this error type seems to be rare.

2. "Test" quotes: some data contributors occasionally send test quotes to the system, usually at times when the market is not liquid. These test quotes can cause a lot of damage because they may look plausible to the filter, at least initially. Two important examples:
   - "Early morning test": a contributor sends a bad quote very early in the morning in order to test whether the connection to the data distributor (e.g. Reuters) is operational. If the market is inactive overnight, no trader would take this test quote seriously, but for the filter such a quote may be a major challenge. The filter has to be very critical toward the first quotes after a data gap.
   - Monotonic series: some contributors test the performance and the time delay of their data connection by sending a long series of linearly increasing quotes at inactive times such as overnight or during a weekend. This is hard for the filter to detect because the quote-to-quote changes look plausible; only the long-run monotonic behavior reveals the fake nature of this data.

3. Repeated quotes: some contributors let their computers repeat the last quote at more or less regular time intervals. This is harmless if it happens in moderation, and in some markets with a high granularity of quoting (such as Eurofutures), repeated quote values are quite natural. However, some contributors repeat old quotes thousands of times at high frequency, thereby obstructing the filtering of the few good quotes produced by other, more reasonable contributors.

4. Quote copying: some contributors employ computers to copy and re-send the quotes of other contributors, just to show a strong presence on the data feed. They thereby decrease the data quality, but there is no reason for a filter to remove copied quotes that are on a correct level. Some contributors run programs that produce slightly modified copies by adding a small random correction to each quote. Such slightly varying copied quotes are damaging because they obstruct the clear identification of fake monotonic or repeated series made by other contributors.

5. Scaling problems: quoting conventions may differ or be officially redefined in some markets. Some contributors may quote the value of 100 units, others the value of 1 unit. The filter may run into this problem "by surprise" unless a very active filter user anticipates all scale changes in advance and preprocesses the data accordingly.
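The decimal-error example (item 1) can be made concrete with a toy plausibility test on the log-change between consecutive quotes. The threshold value and function name below are illustrative assumptions, not part of the algorithm described in the text:

```python
import math

def plausible(prev_level, new_level, max_abs_log_change=0.005):
    # Toy check: flag a quote whose log-change from the previous valid
    # quote is too large. The 0.005 threshold is an arbitrary
    # illustration value, not a recommended setting.
    return abs(math.log(new_level / prev_level)) <= max_abs_log_change

# The example from the text: previous bid 1.3498, true next quote
# 1.3505, bad quote 1.3405 (the "big" decimal digit was not changed).
print(plausible(1.3498, 1.3505))  # tiny change: passes
print(plausible(1.3498, 1.3405))  # ~0.7% jump: flagged
```

A real filter cannot rely on a single fixed threshold like this; as the guidelines below make clear, it must judge plausibility against statistics that adapt to the series itself.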
Filtering of high-frequency time-series data is a demanding, often underestimated task. It is complicated by:

- the variety of possible errors and their causes;
- the variety of statistical properties of the filtered variables (distribution functions, conditional behavior, non-stationarity, and structural breaks);
- the variety of data sources and contributors of differing reliability;
- the irregularity of time intervals (sparse or dense data, sometimes long data gaps);
- the complexity and variety of the quoted information: transaction prices, indicative prices, FX forward premia (where negative values are allowed), interest rates, prices and other variables from derivative markets, transaction volumes, . . . ; bid-ask quotes vs. single-valued quotes;
- the necessity of real-time filtering: producing instant filter results before seeing any successor quote.
There are different possible approaches to filtering. Some guidelines determine our approach:

- Plausibility: with rare exceptions (e.g. the decimal error), we do not know the real cause of data errors. Therefore we judge the validity or credibility of a quote according to its plausibility, given the statistical properties of the series.

- We need a whole neighborhood of quotes for judging the credibility of a quote: a filtering window. A comparison to only the "last valid" quote of the series is not enough. The filtering window can grow and shrink with data quality and with the requirements for arriving at a good filtering decision.

- The statistical properties of the series needed to measure the plausibility of a quote are determined inside the filtering algorithm rather than being hand-configured. The filter is thus adaptive.

- Quotes with complex structures (i.e. bid-ask or open/high/low/close) are split into scalar variables to be filtered separately. These filtered variables may be derived from the raw variables, e.g. the logarithm of a bid price or the bid-ask spread. Quote splitting keeps the algorithm modular and easy to oversee. Some special error types may also be analyzed for full quotes before splitting.

- Numerical methods with convergence problems (such as nonlinear minimization) are not used; such methods would likely run into trouble because the filter is exposed to very different situations. The chosen algorithm produces unambiguous results.

- The filter needs high execution speed; computing all filtering results from scratch with every new quote would be inefficient. The chosen algorithm is iterative: when a new quote is considered, the filtering information obtained from the previous quotes is reused; only a minimal number of computations concerning the new quote is added.

- The filter has two modes: real-time and historical.
Thanks to the filtering window technique, both modes can be supported by the same filter run. In historical filtering, the final validation of a quote is delayed until some successor quotes have been seen.
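The guidelines above (a filtering window, adaptivity, iterative updates) can be illustrated with a deliberately simplified sketch. Everything here, including the class name, window size, and deviation threshold, is an assumption for illustration; the actual algorithm is far richer (credibility weights, growing and shrinking windows, delayed historical validation):

```python
from collections import deque
import math

class WindowFilter:
    # A minimal sketch of an adaptive, iterative filtering window:
    # plausibility is judged against statistics of recent accepted
    # quotes, not against a single "last valid" quote.

    def __init__(self, window_size=20, max_deviations=4.0):
        self.window = deque(maxlen=window_size)  # recent accepted levels
        self.max_deviations = max_deviations

    def accept(self, level):
        # Bootstrap phase: with too little history, accept and learn.
        if len(self.window) < 5:
            self.window.append(level)
            return True
        mean = sum(self.window) / len(self.window)
        var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
        std = max(math.sqrt(var), 1e-6)  # floor to avoid a zero spread
        ok = abs(level - mean) <= self.max_deviations * std
        if ok:
            # Iterative update: only the new quote is added; old quotes
            # fall out of the bounded window automatically, so no result
            # is ever recomputed from scratch.
            self.window.append(level)
        return ok
```

The statistics are computed from the window itself, so the filter adapts to the series without hand-configured levels; a rejected quote is also kept out of the window, so a burst of bad quotes does not corrupt the statistics.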