1. Technical Field
The present invention relates to automatic composition of information processing flows for user specified processing goals, and a computer-user interaction process for specifying the processing goals.
2. Discussion of the Related Art
In this section we first discuss current approaches to information processing that make use of components composed into flow graphs. We consider four different application areas: web 2.0 situational applications (i.e., mashups), info 2.0 information processing, traditional extract transform load (ETL) workflows, and stream processing.
At the end of this section we review existing work on automatic composition of flow graphs.
Web 2.0
Situational applications are an emerging trend in software development. For example, a number of software systems referred to as situational application platforms have been developed over the past few years. These systems allow end users to compose applications by combining reusable components. These systems often rely on a visual user interface to support this composition.
An overview of currently available situational application platforms can be found in D. Hinchcliffe, “A bumper crop of new mashup platforms”, http://blogs.zdnet.com/Hinchcliffe/?p=111&tag=nl.e622.
The following situational application platforms make use of flow graphs to process information: 1) Apatar (http://www.apatar.com/for_structured_data_mashups.html), 2) Yahoo Pipes (http://pipes.yahoo.com), and 3) RSSBus (http://rssbus.com/).
A flow is a connected graph of configurable source components, aggregator components and transformation components, all of which read and produce information, for example, RSS or Atom feeds. A description of RSS and Atom feeds can be found in D. Johnson, “RSS and Atom in Action: Web 2.0 Building Blocks”, Manning Publications, Jul. 31, 2006, ISBN 1932394494.
Visual or text-based ways of selecting, configuring and connecting components are usually provided by the platform. For example, Yahoo Pipes provides a browser-based pipe composer GUI where users can select and configure components (i.e., modules) and establish connections therebetween. As an example, the modules for configuration and composition provided as part of Yahoo Pipes are as follows:
1) Sources (all producing RSS feeds): Fetch Feed (from a specified URL that is assumed to point to an RSS feed), Fetch Data (from a specified URL, in XML or JSON form), Flickr (with specified tags and location preferences), Google Base (with specified location, category and keywords), Yahoo! Local (finding keyword X within N miles of keyword Y), Yahoo! Search (given search string);
2) User inputs: Date input (produces date), Location input (produces location), Number input (produces number), Text input (produces text), URL input (produces url);
3) Operators: Content analysis, Count, Filter, For Each: Annotate, For Each: Replace, Location Extractor, Regex, Rename, Sort, Split, BabelFish, Truncate, Union, Unique;
4) URL: URL Builder;
5) String: String Concatenate;
6) Date: Date Builder, Date Formatter;
7) Location: Location Builder; and
8) Number: Simple Math.
A flow graph (i.e., a pipe in Yahoo Pipes) is formed as a selection of the above modules, where each module may appear once, more than once or not at all. Each module included in the pipe can be individually configured. The modules forming the pipe must be connected. The connections are established between producing and receiving endpoints of the same type. For example, an output of the URL Builder module (which is of type url) can be connected to the input parameter URL of the Fetch Feed module, requiring url type.
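The type-matching rule for connections described above can be illustrated with a minimal sketch. The `Module` class, its field names and the `can_connect` helper are hypothetical names introduced here for illustration only; they are not part of Yahoo Pipes or of any published API.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical stand-in for a pipe module with typed endpoints."""
    name: str
    # Maps each input parameter name to the type it requires.
    inputs: dict = field(default_factory=dict)
    # Type of the single value this module produces.
    output_type: str = "feed"


def can_connect(producer: Module, consumer: Module, param: str) -> bool:
    """A connection is allowed only when the producer's output type
    equals the type required by the consumer's input parameter."""
    return consumer.inputs.get(param) == producer.output_type


# Assumed example modules, modeled on the module names listed in the text.
url_builder = Module("URL Builder", output_type="url")
text_input = Module("Text input", output_type="text")
fetch_feed = Module("Fetch Feed", inputs={"URL": "url"}, output_type="feed")

url_ok = can_connect(url_builder, fetch_feed, "URL")   # url -> url
text_ok = can_connect(text_input, fetch_feed, "URL")   # text -> url
```

Under these assumptions, the URL Builder output is accepted by the URL parameter of Fetch Feed, while the Text input output is rejected because its type differs.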
The Yahoo pipe editor (at http://pipes.yahoo.com) is shown in FIG. 1. The modules included in the list on the left can be dragged with a mouse and dropped onto the composition pane in the center. When dropped, the modules expand to provide editing controls for specifying parameters.
As shown in FIG. 1, the parameters can be specified by entering strings (e.g., “5” in “Find” field of “Flickr” module) or by connecting modules that produce data of compatible types (e.g., “images of” field of “Flickr” module is connected to “Image of (text)” module). In addition, parameter values can be provided by users via input modules (e.g., “Near (location)” and “Image of (text)”).
Editors like the Yahoo Pipes editor simplify flow composition for expert users who have a deep understanding of the modules and their parameterization. However, visual editors can be confusing to a broader audience that does not have good knowledge of the modules and their capabilities. In addition, even for expert users, the manual pipe composition process can become tedious if it must be repeated to process different sources using the same flow graph with minor differences (such as including format adaptor modules required for connecting to different types of sources).
Info 2.0
While Web 2.0 approaches tend to focus on data that is available on the web and represented in formats like RSS, similar approaches have been used to process data not represented as RSS or Atom feeds. The DAMIA service developed by IBM is one such example. This service is currently available for evaluation on the Internet at the URL http://services.alphaworks.ibm.com/damia/.
The DAMIA service consists of a browser-based Web application for assembling, modifying and previewing mashups, services for handling storage and retrieval of data feeds created within the enterprise as well as on the Internet, a repository for sharing and storing feeds or information created by DAMIA, and services for managing feeds and information about mashups, search capabilities, and tools for tagging and rating mashups.
As with Web 2.0 flow graphs in Yahoo Pipes, flow graphs in DAMIA are constructed using a visual editor.
Extract Transform Load (ETL)
IBM Websphere Datastage (http://www.ibm.com/software/data/integration/datastatge/) is an example of an ETL tool. It provides a visual development environment to construct ETL processes, and includes an engine for real-time operation of the processes. In general, ETL processes can be implemented in any programming language, but specialized tools like Datastage simplify the implementation by using visual development environments, and provide automatic scalability for workflows in those environments. The specialized tools specify the processes as information flow graphs that extract data from data sources, e.g., databases, transform the data using transformation operators and finally load the resulting data into result databases. Generally, extract, transform and load operators can be viewed as components, and ETL processes as flow graphs of those components. At that conceptual level, the composition of ETL flow graphs presents problems similar to those of composing Web 2.0 and Info 2.0 flow graphs.
Stream Processing
The IBM System S research project in the area of stream processing has focused on distributed processing of high-rate data streams of unstructured information. While the performance requirements of stream processing are significantly different from those of Info 2.0, the flow graphs are very similar. A description of System S and the stream processing core (SPC) of the system can be found in Navendu Jain, Lisa Amini, Henrique Andrade, Richard King, Yoonho Park, Philippe Selo and Chitra Venkatramani, “Design, Implementation, and Evaluation of the Linear Road Benchmark on the Stream Processing Core”, Proceedings of ACM SIGMOD 2006.
Automatic Composition
Automated planning can be used to create composite applications in compositional architectures such as web services and stream processing. The applications are processing graphs composed of smaller modular components such as service invocations, data processing operators, or other (smaller) processing graphs.
In many scenarios the components are service invocations (such as web service invocations or an invocation of a method of a Java class), and can be described in terms of their data effects and preconditions. In particular, we assume that a description (such as WSDL or Java object code with optional metadata annotations) of each service specifies the input requirements of the service (such as data type, semantics, access control labels, etc.). We refer to these input requirements as preconditions of service invocation, or simply preconditions. The description also specifies the effects of the service, describing the outputs of the service, including information such as data type, semantics, etc. In general, a component description may describe outputs as a function of inputs, so that the description of the output can only be fully determined once the specific inputs of the component have been determined. In practical implementations the invocations can be synchronous, such as subroutine or RPC calls, or asynchronous, such as asynchronous procedure calls, message exchange or message flow.
Under these assumptions, an automated planner can then be used to automatically assemble processing graphs based on a user-provided description of the desired output of the application. The descriptions of the components are provided to the planner in the form of a domain description. The planner can also take into account the specification of available primal inputs to the workflow if not all inputs are available for a particular planning request.
The planner composes a workflow by connecting components, starting from the primal inputs. It evaluates possible combinations of components by computing descriptions of component outputs and comparing them to the preconditions of components connected to those outputs. More than one component input can be connected to one component output or one primal input. Logically, this amounts to sending multiple copies of the data produced by the component output, with one copy sent to each of the inputs. In a practical implementation, these do not have to be copies, and it is possible to pass data by reference instead of by value. The process terminates when an output of a component (or a set of outputs taken together) satisfies the condition specified in the user requirement. All conditions are evaluated at plan time, before any applications are deployed or executed.
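The forward composition described above can be sketched as a breadth-first search over data descriptions. This is a deliberately simplified illustration, not the planner of any cited work: it assumes each component has a single input, that data descriptions are plain sets of tags, and that a component's output description is its input description plus the tags it adds. The names `Component`, `compose` and all tags are hypothetical.

```python
from collections import deque
from typing import NamedTuple


class Component(NamedTuple):
    name: str
    requires: frozenset  # precondition: tags the input must carry
    adds: frozenset      # effect: tags added to the input description


def compose(components, primal_inputs, goal):
    """Search forward from the primal inputs until some output
    description satisfies the goal; return the chain of names."""
    goal = frozenset(goal)
    queue, seen = deque(), set()
    for name, tags in primal_inputs.items():
        desc = frozenset(tags)
        queue.append((desc, [name]))
        seen.add(desc)
    while queue:
        desc, path = queue.popleft()
        if goal <= desc:              # goal condition checked at plan time
            return path
        for c in components:
            if c.requires <= desc:    # precondition satisfied by this output
                out = desc | c.adds   # output described as function of input
                if out not in seen:
                    seen.add(out)
                    queue.append((out, path + [c.name]))
    return None                       # no composition satisfies the goal


# Assumed domain description and request, for illustration only.
domain = [
    Component("FetchFeed", frozenset({"url"}), frozenset({"feed"})),
    Component("Translate", frozenset({"feed"}), frozenset({"english"})),
    Component("Sort", frozenset({"feed"}), frozenset({"sorted"})),
]
plan = compose(domain, {"NewsURL": {"url"}}, {"feed", "english", "sorted"})
no_plan = compose(domain, {"NewsURL": {"url"}}, {"video"})
```

A full planner would build a graph rather than a chain, since one output may feed several inputs and a component may take several inputs; the chain form is used here only to keep the sketch short.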
If multiple alternative compositional applications can be constructed and shown to satisfy the same request, the planner may use heuristics and utility functions to rank the alternatives and select preferred plans.
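Ranking alternatives with a utility function can be sketched as follows. The cost table, the component names and the choice of negative total cost as the utility are assumptions made for illustration; the text does not specify which heuristics or utility functions a planner uses.

```python
# Hypothetical per-component costs (lower is better).
COST = {
    "FetchFeed": 2.0,
    "CachedFeed": 0.5,
    "Translate": 3.0,
    "QuickTranslate": 1.5,
    "Sort": 1.0,
}


def utility(plan):
    """Assumed utility: the negative of the total component cost."""
    return -sum(COST[name] for name in plan)


def rank_plans(plans):
    """Order alternative plans from most to least preferred."""
    return sorted(plans, key=utility, reverse=True)


# Three alternative compositions assumed to satisfy the same request.
alternatives = [
    ["FetchFeed", "Translate", "Sort"],
    ["CachedFeed", "Translate", "Sort"],
    ["CachedFeed", "QuickTranslate", "Sort"],
]
ranked = rank_plans(alternatives)
preferred = ranked[0]
```

Any function that maps a plan to a comparable score could replace `utility` here, for example one that weighs output quality against resource use.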
The application, once composed, is deployed in an execution environment and can be executed one or more times.
Examples of a planner and an execution environment are described in Zhen Liu, Anand Ranganathan and Anton Riabov, “A Planning Approach for Message-Oriented Semantic Web Service Composition”, in AAAI-2007.
Similar work has been done in the contexts of Stream Processing, Web Services and Grid Computing.
Although existing planning methods can achieve goal-based composition, they offer no convenient way to assist a user in specifying goals. A user may therefore be unaware of the vocabulary used to describe system capabilities, and may have to invest time in learning that vocabulary, which may itself be evolving.
Faceted Search
Faceted search methods use tags to define the scope of user interaction with a system. However, faceted search is limited to searching over existing information represented, for example, as documents, web pages or feeds.
One notable example of a faceted search interface is FLAMENCO search (http://flamenco.berkeley.edu/). An overview of interfaces for managing faceted search is presented in Marti Hearst, “Design Recommendations for Hierarchical Faceted Search Interfaces”, ACM SIGIR Workshop on Faceted Search, August, 2006.
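The way tags scope a faceted search over existing items can be illustrated with a short sketch. The item names, tags and the `faceted_search` function are hypothetical and do not reflect the FLAMENCO implementation; the sketch only shows that selecting tags narrows a fixed collection and yields counts for further refinement.

```python
from collections import Counter


def faceted_search(items, selected):
    """Return the items carrying every selected tag, plus counts of the
    remaining tags that could be used to refine the search further."""
    selected = set(selected)
    matches = {name: tags for name, tags in items.items() if selected <= tags}
    refinements = Counter(
        tag for tags in matches.values() for tag in tags if tag not in selected
    )
    return sorted(matches), dict(refinements)


# Assumed collection of existing feeds and their tags.
feeds = {
    "feed-a": {"news", "sports"},
    "feed-b": {"news", "finance"},
    "feed-c": {"sports"},
}
found, facets = faceted_search(feeds, {"news"})
```

Note that the search can only return items already present in `feeds`, which mirrors the limitation stated above: faceted search selects among existing information and does not compose new processing flows.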