The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for enabling horizontal decision tree learning from extremely high rate data streams.
Big data is a term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set.
Stream computing is a critical topic within big data. Stream computing is affected by the velocity, volume, veracity, and variety of data. Stream computing applications must address low processing latency, high-speed data flow, fine-grained data granularity, and potentially unlimited data size. Scalability plays a key role in stream computing systems and depends on the capability for distributed computing and parallelism.
InfoSphere® Streams is a big data and stream computing system by International Business Machines Corporation. InfoSphere® Streams is an advanced analytic platform that allows user-developed applications to quickly ingest, analyze, and correlate information as it arrives from thousands of real-time sources. The solution can handle very high data throughput rates, up to millions of events or messages per second.
The Internet of Things (IoT) is the network of physical objects or “things” embedded with electronics, software, sensors, and connectivity that enable these objects to achieve greater value and service by exchanging data with the manufacturer, operator, other connected devices, or the cloud. Each thing is uniquely identifiable through its embedded computing system but is able to interoperate within the existing Internet infrastructure. IoT produces a large amount of data to be processed in real time or in batch mode.
Decision tree induction is one of the most popular and important algorithms in large scale machine learning, in both batch-mode and streaming-mode big data systems. Parallelism is well studied in streaming scenarios, but each existing solution has a scalability limitation.
The Streaming Parallel Decision Tree (SPDT) algorithm is an attempt to address high data arrival rates. SPDT distributes the computation of a compressed representation of the data (a histogram) but performs the model update centrally, which is a bottleneck. SPDT cannot scale out because of the high cost of this centralized model update computation.
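The division of work described for SPDT can be illustrated with a minimal sketch. It is not SPDT itself: fixed-width bins stand in for SPDT's actual histogram procedure, and the names `NUM_BINS`, `LOW`, `HIGH`, `worker_histogram`, and `central_merge` are hypothetical. The sketch only shows that the histogram step can run independently per data shard, while the merge that feeds the model update runs on a single node.

```python
from collections import Counter

NUM_BINS = 4           # hypothetical bin count
LOW, HIGH = 0.0, 8.0   # hypothetical, assumed-known feature range


def bin_index(value):
    """Map a feature value to a fixed-width bin, clamping to the range."""
    width = (HIGH - LOW) / NUM_BINS
    idx = int((value - LOW) / width)
    return min(max(idx, 0), NUM_BINS - 1)


def worker_histogram(shard):
    """Distributed step: compress one shard of (value, label) pairs into counts."""
    hist = Counter()
    for value, label in shard:
        hist[(bin_index(value), label)] += 1
    return hist


def central_merge(histograms):
    """Centralized step: one node sums every worker's histogram."""
    merged = Counter()
    for hist in histograms:
        merged.update(hist)  # Counter.update adds counts
    return merged


# Two shards, as if held by two workers.
shards = [
    [(0.5, "a"), (1.5, "a"), (6.5, "b")],
    [(2.5, "a"), (7.5, "b"), (5.0, "b")],
]
merged = central_merge(worker_histogram(s) for s in shards)
```

In this sketch every worker's output must pass through `central_merge` before any split decision can be made, which is the single point that limits scale-out in the arrangement described above.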
Scalable Advanced Massive Online Analysis (SAMOA) is a framework for mining big data streams. SAMOA uses a Vertical Hoeffding Tree (VHT) for classification. VHT is a distributed streaming version of decision trees tailored for sparse data. SAMOA distributes the model update computation for a single instance, that is, across the attributes of that instance. SAMOA does not exploit instance-level parallelism; therefore, it cannot handle a high data arrival rate. Massive Online Analysis (MOA) provides a streaming decision tree that is not scalable, because MOA uses sequential data input and sequential model update computation.
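For context on the Hoeffding tree family named above, the following sketch shows the standard Hoeffding bound that such trees conventionally use to decide when enough instances have been seen to split a node. The text above does not spell out this test, so the function names and the parameters `value_range`, `delta`, and `n` are illustrative assumptions rather than part of SAMOA or MOA.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Standard Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    value_range: range R of the split metric (assumed, e.g. 1.0)
    delta: allowed probability of choosing the wrong attribute
    n: number of instances observed at the node
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split when the best attribute leads the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)
```

Because the bound shrinks only as `n` grows, a learner that consumes instances one at a time reaches a confident split no faster than its input rate allows, which is consistent with the arrival-rate limitation described above.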