Some current business intelligence systems generate reports based on predefined questions. Some other business intelligence visualization products ask users to build dashboards by dragging and dropping dimensions, measures, filters, aggregations, and/or the like, e.g., to formulate answers to their questions. Both categories of systems tend to focus on highly-structured data, usually obtainable from databases, data warehouses, and/or other set stores. These systems thus tend to focus on analysis of historical data, as opposed to real-time data.
Modern analytical business users are increasingly interested in more than just standard reporting questions/inquisitive assertions like “show me the sales of each item per quarter.” Indeed, modern analytical business users are increasingly interested in asking interactive and complex time-based analytical questions in real-time. But with the advent of Big Data (e.g., a collection of data that is so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications), for example, the data sizes and data velocities (relating to the rate at which data is generated) to be dealt with are growing exponentially. Businesses thus may need to look at more and more data, more and more quickly, to remain competitive.
In this regard, there are many business users whose businesses depend on reactions to real-time data and, thus, business decisions, operations, etc., may be driven in large part by temporal and potentially ever-changing data. One example relates to the growing Internet-of-Things (IoT) industry. The IoT is based on the idea of “everything” being connected, especially when it comes to uniquely identifiable embedded computing-like devices within the existing Internet infrastructure. Just as mobile devices are connected, the IoT industry posits that (otherwise) ordinary, everyday consumer products and infrastructure, such as cars, refrigerators, homes, roads, human health sensors, etc., soon will be interconnected. In brief, the IoT is expected to offer advanced connectivity of devices, systems, and services that goes beyond machine-to-machine communications, while covering a variety of protocols, domains, and applications. It will be appreciated that there is a vast number of potential data producers, and that the data produced may be generated quickly and in large amounts, and may change frequently.
Because operational data generally is fast moving, temporal data (also sometimes called streaming data) is different from the historical information from which reports may be more easily generated. Operational data analysts and operational business users oftentimes have “business questions” that are time window based (e.g., with such time windows varying anywhere from a few seconds, to a few minutes, to a few hours, to a few days or weeks, or even beyond). Such business questions may be “comparative” or “superlative” in nature, and they may be formulated as natural language queries or the like.
In this regard, comparative questions may use comparative adverbs such as “faster”, “slower”, “bigger”, “smaller”, etc., and oftentimes may be limited to just the top or bottom results (e.g., the top three to ten results). An example comparative question is, “Which online products are selling better than other products in the same category over the last three hours?”. In this example, “better” is the comparative part of the question, and the question as a whole may translate to, “Which products have sales in the last three hours that are greater than the average sales of products in the same category?”. In this example, “greater than the average sales” is what “better” means.
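By way of illustration only, the translation in this example might be sketched in Python as follows. The event fields (`product`, `category`, `amount`, `ts`) and the function name `selling_better` are assumptions introduced for the sketch, and the reading of “average sales” as the mean of per-product totals within a category is one possible interpretation rather than a prescribed one.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def selling_better(events, now, window=timedelta(hours=3)):
    """Return products whose windowed sales exceed their category average.

    Each event is a dict with hypothetical keys:
    product, category, amount, ts.
    """
    cutoff = now - window
    # (category, product) -> total sales inside the time window
    sales = defaultdict(float)
    for e in events:
        if cutoff <= e["ts"] <= now:
            sales[(e["category"], e["product"])] += e["amount"]

    # Average of the per-product totals within each category.
    per_category = defaultdict(list)
    for (category, _), total in sales.items():
        per_category[category].append(total)
    averages = {c: sum(v) / len(v) for c, v in per_category.items()}

    return sorted(
        product
        for (category, product), total in sales.items()
        if total > averages[category]
    )


now = datetime(2024, 1, 1, 12, 0)
events = [
    {"product": "A", "category": "toys", "amount": 50.0,
     "ts": now - timedelta(hours=1)},
    {"product": "B", "category": "toys", "amount": 10.0,
     "ts": now - timedelta(hours=2)},
    {"product": "C", "category": "toys", "amount": 30.0,
     "ts": now - timedelta(hours=5)},  # outside the three-hour window
    {"product": "D", "category": "books", "amount": 20.0,
     "ts": now - timedelta(minutes=30)},
]
result = selling_better(events, now)
```

In this sketch, products with no sales inside the window do not contribute to the category average; whether they should count as zero is a design choice that depends on the semantics intended by the question.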
Superlative analytic style questions oftentimes also are asked. An example superlative question is, “What are the products that have the lowest production line defects per hour in the last 24 hours?”. In this example, “lowest” translates to an instruction to “calculate the number of product line defects each hour and show the products with the minimum defects in the hour for the last 24 hours.”
Comparative analytics tend to use terms like the ones identified above, which show a change in a measure compared to other elements over or within a period of time. By contrast, superlative analytics oftentimes involves looking for the “best”, “worst”, “slowest”, “fastest”, “top”, “bottom”, or other element(s), and doing so typically translates to calculating the maximum or minimum of a measure or aggregation of a measure, depending on semantics of the question. As an example, “best sales revenue” may translate to maximum sales revenue, and “best production line performance” may translate to minimum number of defects.
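The dependence of a superlative on the semantics of the measure might be sketched in Python as follows. The configuration table `HIGHER_IS_BETTER`, the measure names, and the helper functions are hypothetical and are introduced only to illustrate how “best” may resolve to a maximum for one measure and a minimum for another.

```python
# Hypothetical configuration: is a larger value of the measure desirable?
HIGHER_IS_BETTER = {"sales_revenue": True, "defects": False}


def resolve_superlative(term, measure):
    """Map a superlative term and a measure name to builtin max or min."""
    higher_is_better = HIGHER_IS_BETTER[measure]
    if term == "best":
        return max if higher_is_better else min
    if term == "worst":
        return min if higher_is_better else max
    raise ValueError(f"unsupported superlative: {term!r}")


def answer(term, measure, totals):
    """Return the element whose aggregated measure is the chosen extreme.

    totals maps each element (e.g., a product) to its aggregated measure.
    """
    pick = resolve_superlative(term, measure)
    return pick(totals, key=totals.get)


revenue = {"widget": 1200.0, "gadget": 950.0}
defects = {"line_1": 7, "line_2": 3}
best_seller = answer("best", "sales_revenue", revenue)
best_line = answer("best", "defects", defects)
```

Here `best_seller` is the element with the maximum revenue, whereas `best_line` is the element with the minimum number of defects, mirroring the two examples given above.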
Performing analytics unfortunately is quite complicated in practice. In addition to complications in obtaining the answer to a question, it oftentimes is difficult to determine what question is being asked. For instance, transforming comparative and superlative terms like “better” and “best” into concrete queries may require an understanding of complex grammar rules, exact time aggregation and partitioning, knowledge of the context in which the question is being asked, etc. In a related vein, business users oftentimes have an idea about what to ask, but nonetheless are forced to learn about specifically provided query possibilities, programmed capabilities, mechanisms, and the like, and thus must oftentimes concentrate on the “meta-question” of how to formulate a valid question in order to get at the results of interest, as opposed to simply asking the question in the first place.
Adding real-time data to the mix can compound the problem, because there oftentimes is little known about the data until it is actually streaming in, and because there oftentimes is more analytical value to be derived from the real-time data itself than from its structure. Indeed, many business users do not know what they want to see (e.g., what questions should be asked) until the data is actually flowing. For instance, because little may be known about typical values of attributes, how such values are interrelated, what attributes are present and what their types are, etc., it can be quite challenging to even know what questions to start asking. These challenges exist in addition to the fact that such data may be arriving in large amounts and at very high rates.
There are some current search-based and well-structured systems that enable a basic form of question asking and answering. As alluded to above, another state-of-the-art approach involves drag-and-drop style analytics systems that leverage well-defined and structured data that is stored in a database style system and is generally historical. Such systems typically require users to have (or to try to develop) expertise in turning a natural language question into an answer by selecting dimensions, measures, filters, aggregations, etc. It is typical for complex questions to be answerable only by pre-constructed queries created by a data analyst or developer.
Yet as will be appreciated from the above, in operational environments, for example, questions oftentimes have a temporal component and are comparative or superlative in nature. To be able to effectively ask and answer relevant questions, it would be desirable for systems to understand terms such as, for example, “best”, “better”, “worst”, “worse”, “growing”, “slowing”, etc., as they relate to the data. It also would be beneficial to enable such high-level terminology to be understood, captured, and stored for use in subsequent questions and queries, e.g., where real-time data from a plurality of disparate sources is involved.
Unfortunately, however, some current systems generally work on historical and highly-structured database data, oftentimes where structural information and/or information about the data itself must be known in advance. With respect to the latter, some current systems also unfortunately are search oriented (e.g., in that they sometimes require knowledge of column names and/or the like to be able to form a valid search on them) and thus can require a technical understanding of stored data and/or stored data formats. Some other current systems do not generate questions and instead only allow users to ask free-form questions that are limited in nature and do not (for example) support complex temporal, superlative, and comparative analytics.
Certain example embodiments address these and/or other concerns. For instance, certain example embodiments relate to techniques for deriving, and generating, questions and answers on multi-source, real-time and historical data, without requiring a priori knowledge of the data stream and/or the historical data source. Certain example embodiments additionally or alternatively dynamically generate complex temporal business questions and answers, e.g., from real-time and/or historical data, while supporting comparative and superlative questioning. Certain example embodiments thus enable users to analyze real-time data while it is still being generated in addition to, or in place of, merely analyzing historical or static data, without requiring detailed knowledge about the data sources, programming languages, questions to be asked, anticipated answers, etc.
One aspect of certain example embodiments relates to analyzing real-time data to a certain degree in order to automatically identify the data fields and/or enabling a user (e.g., using an interactive GUI frontend) to formulate natural language questions/requests to inspect the real-time data. Certain example embodiments further help to formulate the correct questions and, where the natural language is too vague, a per use case configurable template may be used to more precisely translate the questions, or parts thereof, into a formalized (e.g., mathematical) query (e.g., a SQL or RAQL query).
Another aspect of certain example embodiments relates to the intelligent analysis of parts of real-time and/or other data to help users quickly and easily formulate reasonable queries on the real-time and/or other data, e.g., using configurable building blocks derived from the data itself. Certain example embodiments take the thus-formulated natural language queries and translate them into more formalized queries that are executable on the data. In this regard, certain example embodiments reduce (and sometimes completely eliminate) the need for business users to undertake a meta-inquiry to learn about how questions should be asked in order to yield relevant results, and instead simply provide a templatized or parameterized list of possible and sensible natural language questions.
In certain example embodiments, a method of forming a natural language query template is provided. A sample of real-time events is obtained from a data stream. From the events in the sample, measures and dimensions associated therewith are identified. The identified measures and dimensions are classified as belonging to one or more distinct measures and one or more distinct dimensions, respectively. At least one of the one or more distinct measures and/or at least one of the one or more distinct dimensions are selected for inclusion in the natural language query template, with the natural language query template including natural language expressions and templated fields, and with at least one of the templated fields enabling user selection of one of a comparative and a superlative. The at least one selected distinct measure and/or the at least one selected distinct dimension is/are arranged in the natural language query template as user-selectable options in at least some of the templated fields. The natural language query template with specified user-selectable options is transformable into a formalized query executable on the data stream.
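A minimal Python sketch of such a method is given below, under several stated assumptions. The heuristic that numeric fields are measures and string fields are dimensions, the template wording, the function names, the `events` stream name, the `ts` timestamp column, and the generic SQL-like output are all hypothetical choices made for illustration; the sketch also offers only superlative options and does not reflect any particular query language.

```python
def classify_fields(sample):
    """Split field names seen in sampled events into measures and dimensions.

    Assumed heuristic: numeric values indicate a measure, strings a dimension.
    Using sets collapses repeated fields into distinct names.
    """
    measures, dimensions = set(), set()
    for event in sample:
        for name, value in event.items():
            if isinstance(value, bool):
                continue  # bool is a subclass of int; ignored in this sketch
            if isinstance(value, (int, float)):
                measures.add(name)
            elif isinstance(value, str):
                dimensions.add(name)
    return sorted(measures), sorted(dimensions)


def build_template(measures, dimensions):
    """Build a template: fixed natural language text plus selectable fields."""
    return {
        "text": ("Which {dimension} has the {superlative} {measure} "
                 "in the last {hours} hours?"),
        "options": {
            "dimension": dimensions,
            "superlative": ["highest", "lowest"],
            "measure": measures,
        },
    }


def to_query(template, dimension, superlative, measure, hours,
             stream="events"):
    """Translate a filled-in template into a SQL-like query string."""
    chosen = {"dimension": dimension, "superlative": superlative,
              "measure": measure}
    for field, choice in chosen.items():
        if choice not in template["options"][field]:
            raise ValueError(f"{choice!r} is not a valid {field}")
    order = "DESC" if superlative == "highest" else "ASC"
    return (
        f"SELECT {dimension}, SUM({measure}) AS total FROM {stream} "
        f"WHERE ts >= NOW() - INTERVAL '{int(hours)}' HOUR "
        f"GROUP BY {dimension} ORDER BY total {order} LIMIT 1"
    )


sample = [
    {"product": "widget", "region": "EU", "amount": 19.5, "quantity": 2},
    {"product": "gadget", "region": "US", "amount": 5.0, "quantity": 1},
]
measures, dimensions = classify_fields(sample)
template = build_template(measures, dimensions)
question = template["text"].format(
    dimension="product", superlative="highest", measure="amount", hours=3)
query = to_query(template, "product", "highest", "amount", 3)
```

Restricting each templated field to options derived from the sampled events means that a user can only compose questions that refer to fields actually present in the data, which is why the sketch validates every choice before emitting a query.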
In certain example embodiments, an event processing system is provided. An event channel is configured to receive real-time events from one or more computing systems. A non-transitory computer readable storage medium is provided. Processing resources including at least one processor and a memory are configured to control the system to at least: obtain a sample of real-time events from the event channel; identify, from the events in the sample, measures and dimensions associated therewith; classify the identified measures and dimensions as belonging to one or more distinct measures and one or more distinct dimensions, respectively; select at least one of the one or more distinct measures and/or at least one of the one or more distinct dimensions for inclusion in a natural language query template that includes natural language expressions and templated fields, with at least one of the templated fields enabling user selection of one of a comparative and a superlative; arrange the at least one selected distinct measure and/or the at least one selected distinct dimension in the natural language query template as user-selectable options in at least some of the templated fields; responsive to the arranging, store to the non-transitory computer readable storage medium the natural language query template in association with the arranged at least one selected distinct measure and/or the arranged at least one selected distinct dimension; and enable the natural language query template with specified user-selectable options to be transformed into a formalized query executable on the event channel.
Non-transitory computer readable storage mediums tangibly storing instructions for performing the above-summarized and/or other approaches also are provided by certain example embodiments, as well as corresponding computer programs.
These features, aspects, advantages, and example embodiments may be used separately and/or applied in various combinations to achieve yet further embodiments of this invention.