Many online services generate large amounts of “activity” information. Typically, activity information includes “user activity” information and “system activity” information. User activity information includes information reflecting users' online interaction with the service such as, for example, logins, page views, clicks, “likes”, sharing, recommendations, comments, search queries, etc. System activity information includes system operational metrics collected for servers and/or virtualized machine instances supporting the online service such as, for example, call stack traces, error messages, faults, exceptions, CPU utilization, memory usage, network throughput metrics, disk utilization, etc.
Traditionally, online services have leveraged activity information as a component of service analytics to track user engagement, system utilization, and other usage and performance characteristics of the service. Often, the analytics involve batch processing of activity information. More recently, online services use activity information in real time directly in end-user features. For example, an Internet search engine service may use activity information to provide more relevant search results, an online shopping service may use activity information to provide more relevant product recommendations, an online advertising service may use activity information to provide more targeted advertisements and promotions, and a social networking service may use activity information to provide a newsfeed feature.
Because of the large numbers of users, many online services generate large volumes of activity information, sometimes within a short period of time. For example, the Netflix service (available from Netflix, Inc. of Los Gatos, Calif.) which, among other services, provides an Internet streaming media service to millions of subscribers, has been known to generate activity information for up to 80 billion online user events per day. These user events include subscriber membership changes, subscription changes, streaming media playback events, user preference changes, among others.
In order to reliably collect large amounts of activity information from applications of an online service that generate it (producer applications) and provide it in a timely manner to applications of the online service that use it (consumer applications), many online services implement a data pipeline to reliably and efficiently “move” the activity information generated by the producer applications to the consumer applications. In this description, the term “application” is used to refer to computer-implemented functionality of an online service. Typically, an application is implemented in software executing as one or more computer processes on one or more computing devices (e.g., one or more servers in a data center environment). Thus, an online service may be viewed as a collection of one or more applications, each of which may provide a different portion of the functionality and support for the online service, but which collectively provide the overall functionality and support for the online service. For example, some applications of an online service may provide end-user functionality and other applications may provide site performance and usage analytics to service operators.
As typically implemented, a data pipeline is a collection of computer systems designed to facilitate message passing and brokering of large amounts of activity information from producer applications to consumer applications for batch or real-time processing. In some cases, the data pipeline includes a distributed commit log for durably storing recent activity information obtained from producer applications and also includes a message brokering system (e.g., a queuing system or a publish-subscribe system) for providing stored activity information to consumer applications in a timely manner.
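By way of illustration only, the commit log and brokering arrangement described above can be sketched as follows. This is a minimal in-memory sketch, not an implementation of any particular data pipeline: the names CommitLog, Consumer, append, read_from, and poll are hypothetical, and a real commit log would be distributed and durable rather than a Python list.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CommitLog:
    """In-memory stand-in for a durable, append-only commit log (hypothetical)."""
    records: list = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Store a record and return its offset in the log."""
        self.records.append(record)
        return len(self.records) - 1

    def read_from(self, offset: int) -> list:
        """Return every record at or after the given offset."""
        return self.records[offset:]


@dataclass
class Consumer:
    """Tracks how far one consumer application has read (hypothetical)."""
    name: str
    next_offset: int = 0

    def poll(self, log: CommitLog) -> list:
        """Read all unread records and advance this consumer's offset."""
        batch = log.read_from(self.next_offset)
        self.next_offset += len(batch)
        return batch


log = CommitLog()
log.append({"type": "login", "user": "u1"})
log.append({"type": "page_view", "user": "u1"})

analytics = Consumer("analytics")
newsfeed = Consumer("newsfeed")

first_batch = analytics.poll(log)    # the two records appended so far
log.append({"type": "click", "user": "u2"})
second_batch = analytics.poll(log)   # only the record appended since the last poll
newsfeed_batch = newsfeed.poll(log)  # all three records, read independently
```

Because each consumer keeps its own offset, a batch-oriented consumer and a real-time consumer can read the same stored activity information at different rates without interfering with one another.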
Often, different pieces of activity information that pass through a data pipeline from producer applications to consumer applications have different data formats. For example, one producer application may generate activity information in the form of log lines and another producer application may generate activity information in the form of highly-structured markup-language (e.g., XML) documents. On a more fine-grained level, values in a piece of activity information can have different data formats. For example, one producer application may generate activity information in which calendar date values are formatted using a two-character sequence to designate the calendar year (e.g., “14”) and another producer application may generate activity information in which a four-character sequence is used (e.g., “2014”). More generally, activity information generated by producer applications may not conform to a single or a small number of known data formats and different pieces of activity information can have different formats. Further, the data format of activity information generated by a producer application may change over time. Moreover, it may be a design goal of the data pipeline to allow producer applications to generate activity information in whatever data formats the human software developers of the producer applications deem appropriate, as opposed to imposing or prescribing data formats that activity information generated by the producer applications must adhere to.
As different pieces of activity information can have different data formats, which can change over time, a challenge in implementing a data pipeline is to provide activity information to consumer applications in the data format they expect. Historically, this challenge has been addressed by ad-hoc communications between software developers of producer and consumer applications. For example, a software developer may design or configure the producer application to generate activity information in a particular custom log line format in which calendar date values use a four-character sequence to represent the calendar year. The producer application software developer may communicate this format to software developers of consumer applications as the format they should expect for activity information received from the producer application. The software developers of the consumer applications may then design or configure the consumer applications to expect this data format. If the producer application is subsequently re-designed or re-configured to generate activity information in a different format (e.g., to use a two-character sequence to represent the calendar year), the producer application software developer must remember to communicate the format change to the software developers of the consumer applications. In the worst case, an uncommunicated format change causes a consumer application to fail or otherwise not provide expected functionality because the consumer application is not designed or configured to expect activity information from the producer application in the new data format.
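The failure mode described above can be illustrated with a short sketch. The log line layout, the constants OLD_LINE and NEW_LINE, and the function parse_event_date are all hypothetical and chosen only for illustration; the sketch assumes a consumer application that parses the leading date of each log line with Python's standard datetime.strptime.

```python
from datetime import datetime

# Hypothetical log lines; the field layout is an assumption for illustration.
OLD_LINE = "2014-03-09 subscriber_enrolled user=42"  # four-character year
NEW_LINE = "14-03-09 subscriber_enrolled user=42"    # two-character year


def parse_event_date(log_line: str) -> datetime:
    """Consumer-side parser written against the four-character-year format."""
    date_text = log_line.split(" ", 1)[0]
    return datetime.strptime(date_text, "%Y-%m-%d")


# The format the consumer was designed for parses as expected.
old_date = parse_event_date(OLD_LINE)

# After an uncommunicated change to a two-character year, parsing fails.
try:
    parse_event_date(NEW_LINE)
    new_format_failed = False
except ValueError:
    # "%Y" requires a four-digit year, so "14-03-09" does not match.
    new_format_failed = True
```

In this sketch the consumer fails loudly with an exception. A consumer that silently accepted the two-character year and interpreted it incorrectly could be harder to diagnose.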
Another problem with uncommunicated data format changes to activity information is that such changes can “break” rules used by computer systems in the data pipeline to route activity information obtained from producer applications to the consumer applications. For example, an online service may include an application that is configured to automatically send an e-mail to new subscribers welcoming them to the service. To do so, a “binding rule”, or other pre-arranged criteria for routing activity information accepted from producer applications toward consumer applications, may be registered with an activity information routing or message brokering system in the data pipeline. The binding rule may express that the greeting application would like to receive certain activity information generated by a subscriber management application when a new subscriber enrolls in the service. When the routing or message brokering system obtains activity information from the subscriber management application satisfying the binding rule, it will provide the activity information to the greeting application. For example, the routing or message brokering system may place the activity information in a message queue from which the greeting application reads the activity information. However, if the subscriber management application is re-designed or re-configured to generate activity information such that the generated activity information no longer satisfies the binding rule, then the greeting application may no longer be notified when a new subscriber is enrolled. As a result, the new subscriber may not receive the e-mail welcoming them to the online service.
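The binding rule scenario above can be sketched as follows. This is an illustrative sketch only: the classes BindingRule and Broker, the methods bind and publish, and the field names "source", "event", and "event_type" are hypothetical and do not correspond to the interface of any particular routing or message brokering system.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class BindingRule:
    """Routes matching activity information to one consumer's queue (hypothetical)."""
    matches: Callable[[dict], bool]
    queue: deque = field(default_factory=deque)


class Broker:
    """Minimal routing system that evaluates every registered binding rule."""

    def __init__(self) -> None:
        self.rules = []

    def bind(self, rule: BindingRule) -> None:
        self.rules.append(rule)

    def publish(self, event: dict) -> int:
        """Deliver the event to every matching queue; return the match count."""
        delivered = 0
        for rule in self.rules:
            if rule.matches(event):
                rule.queue.append(event)
                delivered += 1
        return delivered


broker = Broker()

# The greeting application asks for enrollment events from subscriber management.
greeting_rule = BindingRule(
    matches=lambda e: e.get("source") == "subscriber_mgmt"
    and e.get("event") == "subscriber_enrolled"
)
broker.bind(greeting_rule)

# Original format: the event satisfies the binding rule and is queued.
delivered_before = broker.publish(
    {"source": "subscriber_mgmt", "event": "subscriber_enrolled", "user": 42}
)

# Uncommunicated change: the producer renames the field and its value.
delivered_after = broker.publish(
    {"source": "subscriber_mgmt", "event_type": "enrolled", "user": 43}
)
```

After the format change, publish reports zero deliveries and raises no error, so the greeting application's queue simply stops receiving enrollment events, which is why this kind of breakage can go unnoticed.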
Even where communication between software developers of producer and consumer applications with respect to activity information data format changes is consistent, such communication may be considered inefficient or cumbersome.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.