The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves may also correspond to implementations of the claimed technology.
The technology disclosed relates to maintaining throughput of a stream processing framework while increasing processing load. In particular, it relates to defining a container over at least one worker node that has a plurality of workers, with one worker utilizing a whole core within a worker node, and queuing data from one or more incoming near real-time (NRT) data streams in multiple pipelines that run in the container and have connections to at least one common resource external to the container. It further relates to concurrently executing the pipelines at a number of workers as batches, and limiting simultaneous connections to the common resource to the number of workers by providing a shared connection to a set of batches running on a same worker regardless of the pipelines to which the batches in the set belong.
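As a non-authoritative illustration of the shared-connection idea described above, the following minimal Python sketch has each worker lazily open one connection per external resource and reuse it for every batch it runs, regardless of the pipeline the batch belongs to. The names `Connection`, `Worker`, and `run`, the resource name "message-bus", and the simple rotating assignment of batches to workers are all assumptions made for this sketch and are not taken from the disclosure.

```python
from dataclasses import dataclass, field


class Connection:
    """Hypothetical stand-in for a client connection to a common external resource."""

    opened = 0  # counts connections opened, for illustration only

    def __init__(self, resource: str):
        self.resource = resource
        Connection.opened += 1

    def send(self, payload):
        # A real client would perform I/O here; the sketch just echoes the payload.
        return (self.resource, payload)


@dataclass
class Worker:
    """Hypothetical worker that owns at most one connection per resource."""

    worker_id: int
    connections: dict = field(default_factory=dict, repr=False)

    def connection(self, resource: str) -> Connection:
        # Open the connection on first use, then share it across all batches.
        if resource not in self.connections:
            self.connections[resource] = Connection(resource)
        return self.connections[resource]

    def run_batch(self, pipeline_id: str, batch: list, resource: str) -> list:
        conn = self.connection(resource)  # same object for every pipeline
        return [conn.send((pipeline_id, event)) for event in batch]


def run(num_workers: int, pipelines: dict, resource: str = "message-bus"):
    """Execute every batch of every pipeline on some worker.

    The rotating assignment below is an arbitrary placeholder policy,
    not the scheduling scheme of the disclosure.
    """
    workers = [Worker(i) for i in range(num_workers)]
    results = []
    slot = 0
    for pipeline_id, batches in pipelines.items():
        for batch in batches:
            worker = workers[slot % num_workers]
            results.extend(worker.run_batch(pipeline_id, batch, resource))
            slot += 1
    return workers, results
```

In this sketch, running ten pipelines on four workers opens four connections to the resource rather than ten, because the number of connections is bounded by the number of workers and not by the number of pipelines.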
The technology disclosed reduces the amount of dedicated hardware and clients required to connect multiple pipelines in a container to common resources. For example, if a thousand pipelines are processed over a hundred worker nodes in a container, then each worker node needs at least ten connections to each relevant container resource, such as a message bus (like Apache Kafka™), an output queue or sink (like Apache Kafka™), a persistence store (like Apache Cassandra™) and a global service registry (like Zookeeper™). Thus, in total, for such a container, a thousand connections need to be made to each of the different relevant resources.
The technology disclosed solves this technical problem by allowing the multiple pipelines in a container to connect to relevant resources using common connections, thereby substantially reducing the number of simultaneous connections to relevant container resources.
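The arithmetic behind the example above can be sketched as follows. The function name `connection_counts` is hypothetical, and the sketch assumes, for simplicity, one worker per worker node and one dedicated connection per pipeline per resource in the unshared case; the shared figures are upper bounds that follow from limiting connections to the number of workers.

```python
import math


def connection_counts(num_pipelines: int, num_workers: int, num_resources: int) -> dict:
    """Compare dedicated per-pipeline connections with per-worker shared ones.

    Assumes pipelines are spread evenly across workers (an illustrative
    simplification, not a statement about the disclosed scheduler).
    """
    return {
        # Without sharing: every pipeline holds its own connection.
        "dedicated_per_worker": math.ceil(num_pipelines / num_workers),
        "dedicated_per_resource": num_pipelines,
        "dedicated_total": num_pipelines * num_resources,
        # With sharing: at most one connection per worker per resource.
        "shared_per_resource": num_workers,
        "shared_total": num_workers * num_resources,
    }
```

For a thousand pipelines, a hundred workers, and the four resources named above, this gives ten dedicated connections per worker and a thousand per resource without sharing, versus at most a hundred per resource with one shared connection per worker.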
For many analytic solutions, batch processing systems are not sufficient for providing real-time results because of their loading and processing requirements: it can take hours to run batch jobs. As a result, analytics on events can only be generated long after the events have occurred. In contrast, the shortcoming of stream processing analytics systems is that they do not always provide the level of accuracy and completeness that batch processing systems provide. The technology disclosed uses a combination of batch and stream processing modes to deliver contextual responses to complex analytics queries with low latency on a real-time basis.
In today's world, we are dealing with huge data volumes, popularly referred to as “Big Data”. Web applications that serve and manage millions of Internet users, such as Facebook™, Instagram™, Twitter™, banking websites, or even online retail shops, such as Amazon.com™ or eBay™, are faced with the challenge of ingesting high volumes of data as fast as possible so that the end users can be provided with a real-time experience.
Another major contributor to Big Data is a concept and paradigm called “Internet of Things” (IoT). IoT is about a pervasive presence in the environment of a variety of things/objects that, through wireless and wired connections, are able to interact with each other and cooperate with other things/objects to create new applications/services. These applications/services are in areas like smart cities (regions), smart car and mobility, smart home and assisted living, smart industries, public safety, energy and environmental protection, agriculture and tourism.
Currently, there is a need to make such IoT applications/services more accessible to non-experts. Until now, non-experts who have highly valuable non-technical domain knowledge have cheered from the sidelines of the IoT ecosystem because of the IoT ecosystem's reliance on tech-heavy products that require substantial programming experience. Thus, it has become imperative to increase the non-experts' ability to independently combine and harness Big Data computing and analytics without reliance on expensive technical consultants.
Stream processing is quickly becoming a crucial component of Big Data processing solutions for enterprises, with many popular open-source stream processing systems available today, including Apache Storm™, Apache Spark™, Apache Samza™, Apache Flink™, and others. Many of these stream processing solutions offer default schedulers that evenly distribute processing tasks between the available computation resources using a round-robin strategy. However, such a strategy is not cost-effective because substantial computation time and resources are lost during assignment and re-assignment of tasks to the correct sequence of computation resources in the stream processing system, thereby introducing significant latency in the system.
Also, an opportunity arises to provide systems and methods that use simple, easily codable, declarative-language-based solutions to execute Big Data computing and analytics tasks.
Further, an opportunity arises to provide systems and methods that use a combination of concurrent and multiplexed processing schemes to adapt to the varying computational requirements and availability in a stream processing system with little performance loss or added complexity. Increased revenue, higher user retention, improved user engagement and experience may result.