Many applications today generate queries to retrieve data from a wide variety of sources (e.g., for big data, analytics, and reporting). Typically, if one of the items returned by a query is corrupted, the query as a whole fails. However, when identifying trends across huge volumes of data, the accuracy of the data retrieval may matter less than how quickly the data is processed. In this case, it may not matter if a small sample of the data is corrupted.
One such scenario is identifying market trends, where data velocity (i.e., the speed at which data is processed) is more important than achieving 100% data retrieval. An investment company may want to adjust its portfolio based on the stock trends of individual investors, in which case the speed at which it identifies such trends is important because the stock value may start to shift drastically. Another example is sports betting, where a gambling company may want to adjust wagering odds based on the number of wagers being made.
In general, for some situations, users do not want to throw away their query results if 99% of the data is valid, especially if they're pulling data from a large number of sources which may not provide guaranteed reliability. This may be important when other factors, such as data velocity, are of a higher priority to the user.
Structured Query Language (SQL) is a programming language for querying a database. However, there are also Not Only SQL (NoSQL) databases that may be document stores.
A NoSQL database may store a document, such as a Binary JavaScript® Object Notation (BSON) document, that is constructed from data points from multiple tables. (JavaScript is a registered trademark of Oracle Corporation in the United States and/or other countries.) MongoDB is one example of a NoSQL database. If one of those data points is bad, then the entire BSON document may be treated as bad and considered unusable, and MongoDB may return an exception. BSON documents may be corrupted for a number of reasons, such as: the data was originally valid but was overwritten by a stray pointer; the database was corrupted by a disk error or by an unclean shutdown without journaling; a byte was corrupted on the network or by a broken network component; or the corruption occurs when dealing with collections in a way that may result in a segmentation fault.
MongoDB provides a validate function to double-check that the structure of the BSON object is properly formed and a repair function to fix the BSON document if needed, but both functions add time to the lookup process, which adds performance overhead and reduces data velocity.
Another example is structured Large Object (LOB) data. In some cases, the LOB data may have multiple field definitions for the same data buffer, and, in some situations (e.g., where a packed decimal and a character field overlap), there may be invalid data. That is, it is possible that some character strings will equate to an invalid packed decimal value. A BLOB is a Binary LOB and a CLOB is a Character LOB. A BLOB may be an mp3 file, a picture or a JavaScript® Object Notation (JSON) document, while a CLOB may be a JSON document or Extensible Markup Language (XML) type of document.
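The overlap problem described above can be illustrated with a minimal Python sketch. It assumes a common packed decimal layout (two binary-coded decimal digits per byte, with the low nibble of the last byte holding the sign) and a common sign convention (0xB and 0xD negative, the other values from 0xA to 0xF positive). The function name `unpack_packed_decimal` and the `scale` parameter are hypothetical and are used only for illustration.

```python
from decimal import Decimal


def unpack_packed_decimal(buf: bytes, scale: int = 0) -> Decimal:
    """Decode a packed decimal field (assumed layout: BCD digits, sign in last nibble)."""
    if not buf:
        raise ValueError("empty packed decimal field")

    # Split every byte into its high and low nibble.
    nibbles = []
    for byte in buf:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)

    # All nibbles except the last are digits; the last one is the sign.
    *digits, sign = nibbles
    if any(d > 9 for d in digits):
        raise ValueError(f"invalid digit nibble in {buf.hex()}")
    if sign < 0xA:
        raise ValueError(f"invalid sign nibble in {buf.hex()}")

    value = int("".join(str(d) for d in digits))
    if sign in (0xB, 0xD):  # assumed negative sign codes
        value = -value
    return Decimal(value).scaleb(-scale)


# A well-formed field: digits 1 2 3 4 5 followed by the positive sign 0xC.
print(unpack_packed_decimal(bytes([0x12, 0x34, 0x5C]), scale=2))

# The same buffer positions read as the character string "AB" (0x41 0x42)
# end in nibble 0x2, which is not a valid sign, so decoding fails.
try:
    unpack_packed_decimal(b"AB")
except ValueError as exc:
    print("invalid packed decimal:", exc)
```

In this sketch, a buffer that is perfectly valid under the character field definition raises an error under the packed decimal definition, which is the kind of invalid data that can cause a query over such LOB data to fail.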
In certain conventional systems, an application provides customized error handling to catch faulty data and filter it out of the result set. This solution requires affinity with how the data source reports an error, so it cannot be used in off-the-shelf tools, which typically require zero affinity with the data sources from which they pull data.
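As a rough sketch of this conventional approach, the following Python fragment filters faulty items out of a result set with application-level error handling. It is not taken from any particular product: JSON stands in for the data source format, and the function name `fetch_valid_rows` and the sample rows are hypothetical.

```python
import json
from typing import Iterable, Iterator


def fetch_valid_rows(raw_rows: Iterable[bytes]) -> Iterator[dict]:
    """Yield only the rows that decode cleanly, skipping corrupted ones."""
    for raw in raw_rows:
        try:
            yield json.loads(raw)
        except (json.JSONDecodeError, UnicodeDecodeError):
            # The application must know exactly which errors this
            # particular source format raises for faulty data.
            continue


rows = [
    b'{"ticker": "ABC", "qty": 10}',
    b'{"ticker": "XYZ", "qty"',  # truncated, i.e. corrupted
    b'{"ticker": "DEF", "qty": 5}',
]
valid = list(fetch_valid_rows(rows))
print(len(valid), "of", len(rows), "rows kept")
```

The `except` clause is where the affinity appears: it names the specific errors that this one format produces, so the same handler would have to be rewritten for each additional data source.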