Content sharing platforms that provide content, such as video data, typically use performance metrics to drive business decisions. These metrics include, but are not limited to, content item views (such as video views), watch time, and channel subscriptions. In addition, a content sharing platform may seek to analyze the usage metrics along a variety of dimensions, such as geography, time, and video.
A common source of usage metrics in content sharing platforms is usage logs. The usage logs consist of raw event records that are often closely tied to the system that writes the log (e.g., containing Hypertext Transfer Protocol (HTTP) request strings with little to no immediate semantics). The usage logs tend to contain billions of events for a single date and, in some cases, also contain sensitive data, such as cookies or Internet Protocol (IP) addresses.
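As an illustration of how little meaning a raw record carries on its own, the following minimal sketch parses one log line into a structured event. The line layout (timestamp, IP address, quoted request string), the `/watch` path, and the `v` query parameter are assumptions made for this example and are not a format described above.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical raw log line: timestamp, client IP, quoted HTTP request string.
RAW_LINE = '2024-01-01T00:00:05Z 203.0.113.7 "GET /watch?v=abc123&t=42 HTTP/1.1"'


def parse_raw_event(line):
    """Turn one raw log line into a dictionary with named fields."""
    # Split off the first two space-separated fields; the rest is the request.
    timestamp, ip, request = line.split(" ", 2)
    method, target, _version = request.strip('"').split(" ")
    url = urlsplit(target)
    params = parse_qs(url.query)
    return {
        "timestamp": timestamp,
        "ip": ip,  # sensitive field; a real pipeline would likely restrict it
        "method": method,
        "path": url.path,
        # Assumed convention: the "v" parameter identifies the video.
        "video_id": params.get("v", [None])[0],
    }


event = parse_raw_event(RAW_LINE)
print(event)
```

Only after this kind of interpretation does the record say which video was requested; every pipeline that reads the raw logs directly has to repeat the step.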
Given the widely different needs for analysis of usage metrics, multiple independent log processing pipelines have emerged on content sharing platforms, with each pipeline going directly from the raw logs to a specific set of reports. This approach typically leads to metric inconsistencies because of discrepancies in the processing logic of the individual pipelines. Because the pipelines tend to be written by different teams at different times and using different technologies, aligning the definitions by providing common classification libraries may not be a viable solution. Also, unifying all of the pipelines into a single system that provides all reports can be prohibitive in terms of resources and engineering time.
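The inconsistency can be seen in a small sketch in which two hypothetical pipelines count views from the same events using different definitions. The event layout, both definitions, and the 30-second threshold are assumptions chosen for illustration only.

```python
# Hypothetical parsed events: (video_id, seconds_watched).
EVENTS = [("a", 0), ("a", 12), ("a", 45), ("b", 3), ("b", 31)]


def views_pipeline_one(events):
    """Assumed definition: every playback request counts as a view."""
    counts = {}
    for video_id, _seconds in events:
        counts[video_id] = counts.get(video_id, 0) + 1
    return counts


def views_pipeline_two(events, min_seconds=30):
    """Assumed definition: only playbacks of at least min_seconds count."""
    counts = {}
    for video_id, seconds in events:
        if seconds >= min_seconds:
            counts[video_id] = counts.get(video_id, 0) + 1
    return counts


report_one = views_pipeline_one(EVENTS)
report_two = views_pipeline_two(EVENTS)
print(report_one, report_two)
```

Both reports are labeled "views" and draw on the same logs, yet they disagree for every video, which is the kind of discrepancy that arises when classification logic is duplicated across pipelines.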