The data ingestion and conversion process is generally known as data mining, and the creation of robust systems to handle this problem is the subject of much research, and has spawned the creation of many specialized languages (e.g., Perl) intended to make this process easier. Unfortunately, while there have been some advances, the truth of the matter is that none of these ‘mining’ languages really provides anything more than a string manipulation library embedded into the language syntax itself. In other words, such languages are nothing more than shorthand for the equivalent operations written as a series of calls to a powerful subroutine library. A prerequisite for any complex data processing application, specifically a system capable of processing and analyzing disparate data sources, is a system that can convert the structured, semi-structured, and un-structured information sources into their equivalent representation in the target ontology, thereby unifying all sources and allowing cross-source analysis.
For example, in a current generation data-extraction script, the code involved in the extraction basically works its way through the text from beginning to end trying to recognize delimiting tokens and once having done so to extract any text within the delimiters and then assign it to the output data structure. When there is a one-to-one match between source data and target representation, this is a simple and effective strategy. As we widen the gap between the two, however, such as by introducing multiple inconsistent sources, increasing the complexity of the source, nesting information in the source to multiple levels, cross referencing arbitrarily to other items within the source, and distributing and interspersing the information necessary to determine an output item within a source, the situation rapidly becomes completely unmanageable by this technique, and highly vulnerable to the slightest change in source format or target data model. This mismatch is at the heart of all problems involving the need for multiple different systems to intercommunicate meaningful information, and makes conventional attempts to mine such information prohibitively expensive to create and maintain. Unfortunately for conventional mining techniques, much of the most valuable information that might be used to create truly intelligent systems comes from publishers of various types. Publishing houses make their money from the information that they aggregate, and thus are not in the least bit interested in making such information available in a form that is susceptible to standard data mining techniques. Furthermore, most publishers deliberately introduce inconsistencies and errors into their data in order both to detect intellectual property rights violations by others, and to make automated extraction as difficult as possible. Each publisher, and indeed each title from any given publisher, uses different formats, and has an arrangement that is custom tailored to the needs of whatever the publication is. The result is that we are faced with a variety of source formats on CD-ROMs, databases, web sites, and other legacy systems that completely stymie standard techniques for acquisition and integration. Very few truly useful sources are available in a nice neat tagged form such as XML and thus to rely on markup languages such as XML to aid in data extraction is a woefully inadequate approach in real-world situations.
One of the basic problems that makes the extraction process difficult is that the control-flow based program that is doing the extraction has no connection to the data itself (which is simply input) and must therefore invest huge amounts of effort extracting and keeping track of its ‘state’ in order to know what it should do with information at any given time. What is needed, then, is a system in which the content of the data itself actually determines the order of execution of statements in the mining language and automatically keeps track of the current state. In such a system, whenever an action was required of the extraction code, the data would ‘tell’ it to take that action, and all of the complexity would melt away. Assuming such a system is further tied to a target system ontology, the mining problem would become quite simple. Ideally, such a solution would tie the mining process to compiler theory, since that is most powerful formalized framework available for mapping source textual content into defined actions and state in a rigorous and extensible manner. It would also be desirable to have an interpreted language that is tied to the target ontology (totally different from the source format), and for which the order of statement execution could be driven by source data content.