Many data-intensive applications require the enrichment of information from original data sources. That enriched information must be obtained through a knowledge generation process that may include initial data collection from different sources, data normalization and aggregation, and final data enrichment. Techniques currently used for obtaining such enriched information are manual and appear to lack integrated methods for automation. The methods used in current general knowledge generation processes are frequently content-specific, tailored to the particular data sets at hand.
Furthermore, the interfaces used in data collection (or content extraction), data aggregation, and data enrichment methods are idiosyncratic. Manual processes and content-specific application methods are therefore limited in their ability to integrate the tasks of extracting content, aggregating data, and enriching data across many different sources and various kinds of intermediate content. As a result, the overall knowledge generation process is difficult to automate.
Advanced structured information representation and processing technologies, particularly XML (Extensible Markup Language)-related technologies, have become important in streamlining the knowledge generation process. Specification-based methods provide more flexible ways to integrate the processing of data from complex and varied data sources in the knowledge generation process. For example, in “XML as a Unifying Framework for Inductive Databases” (book chapter from “XML Data Management: Native XML and XML-Enabled Database Systems,” A. Chaudhri, A. Rashid, R. Zicari (eds.) (Addison-Wesley 2003)), Rosa Meo and Giuseppe Psaila propose an XML data model called XDM for inductive databases to support mining-type data enrichment tasks. In “An XML Based Environment in Support of Overall KDD Process,” P. Alcamo, F. Domenichini and F. Turni propose an XML-based environment to support an overall KDD (Knowledge Discovery in Databases) process. Those methods are restricted to a particular type of data enrichment task. PMML (Predictive Model Markup Language) (at www.dmg.org) provides an XML data model that describes various predictive data mining algorithms for the purpose of exchanging mining results and models.