The present disclosure relates generally to job-processing computer systems and in particular to fully or partially automated invalidation of job output data in such a system.
Large-scale data analysis can help businesses achieve various goals. For example, online services continue to proliferate, including social networking services, online content management services, file-sharing services, and the like. Many of these services are supported by large-scale computing systems including server farms, network storage systems and the like. Making sure the resources are matched to current and future user demand is an important part of efficiently managing an online service, and analysis of large quantities of data can help the service provider understand and anticipate trends in demand. As another example, sales or advertising businesses may amass customer data such as who visits which websites, for how long, and whether or what they purchase. Making use of this data, e.g., to develop or revise a marketing strategy, requires processing large quantities of data.
Analytics systems have emerged as a tool to help businesses analyze large quantities of data. Such systems can provide automated processes for generating reports that provide various windows into customer behavior. For instance, in a typical online-service environment, collected data (such as user activity logs) can be stored in a data warehouse. The collected data can be processed using distributed computing techniques to generate reports that can be delivered to executives, system managers, and/or others within the online-service provider's organization. These reports can guide decision processes, such as adding new capacity or developing or enhancing service offerings.
Analytics systems depend on data and data availability. Data for reports can come from “raw” data, such as logs of user activity, or previously processed data and can be processed through a series of processing jobs defined by an operator of the analytics system (also referred to as an “analyst”). For instance, a report on system usage patterns may be created by running a first job to process a log of user activity to identify all user identifiers that logged in at least once in a 24-hour period, then running a second job to group the user identifiers according to some metric of interest (e.g., demographics, geographic location, etc.). Such a report would depend on availability of user login and logout data (the activity log). If the login/logout data is not available, the report can be delayed, be incorrect or fail to be created.