1. Field of the Invention
The present invention relates to a production and preprocessing system for the analysis of large-scale data.
2. Prior Art
In recent years, as the entire human genome has been determined, an enormous amount of sequence information, experimental data and document information has accumulated for use in genome analysis projects for humans and various other organisms. Going forward, elucidating not only the sequences of genes but also their functions will enable therapy that targets individual genes and that is reflected in diagnosis, drug development and the like. At some medical institutions, individual gene analysis has already begun, using gene analysis technologies such as gene diagnosis systems and DNA chips. Moreover, wide application of such analysis technologies to novel industries is also expected.
The work of acquiring knowledge useful to humans from a large amount of data, for example, elucidating gene functions from an integrated database of genes, is referred to as data mining. Heretofore, association rules, decision trees, clustering, neural networks, genetic algorithms and the like have been researched as analysis algorithms for carrying out data mining. Each of these methods has been evaluated reasonably favorably and is recognized as a useful algorithm. However, it is almost impossible in practice to apply data accumulated in large amounts to any of these analysis algorithms as it is. An analysis algorithm may be unable to directly access data stored in an RDBMS. Moreover, the required data structure may differ from one analysis algorithm to another, and the data itself may not be as clean or regular as expected. The data must therefore be preprocessed before analysis, and such preprocessing is said to account for 60% of the cost of the entire data mining process.
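Purely by way of illustration, the mismatch between stored data and the structures required by different analysis algorithms may be sketched as follows. The sketch is not taken from any cited system. The table `expression`, its columns, and the functions `load_rows`, `to_transactions` and `to_matrix` are hypothetical names assumed for this example, and an in-memory SQLite database stands in for an RDBMS. The same rows are reshaped once into per-sample item sets, as association-rule mining might require, and once into a dense matrix, as clustering might require, with missing values handled in each case.

```python
import sqlite3


def load_rows(conn):
    """Fetch (sample, gene, level) rows from the hypothetical 'expression' table."""
    return conn.execute("SELECT sample, gene, level FROM expression").fetchall()


def to_transactions(rows, threshold):
    """Shape for association-rule mining: per sample, the set of genes whose
    level is at or above the threshold. Missing (NULL) levels are skipped."""
    transactions = {}
    for sample, gene, level in rows:
        transactions.setdefault(sample, set())
        if level is not None and level >= threshold:
            transactions[sample].add(gene)
    return transactions


def to_matrix(rows, fill=0.0):
    """Shape for clustering: a dense sample-by-gene matrix in which missing
    levels are replaced by an assumed fill value."""
    samples = sorted({row[0] for row in rows})
    genes = sorted({row[1] for row in rows})
    known = {(s, g): lvl for s, g, lvl in rows if lvl is not None}
    matrix = [[known.get((s, g), fill) for g in genes] for s in samples]
    return samples, genes, matrix


def demo():
    """Build a tiny in-memory database and produce both shapes from it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE expression (sample TEXT, gene TEXT, level REAL)")
    conn.executemany(
        "INSERT INTO expression VALUES (?, ?, ?)",
        [
            ("s1", "geneA", 2.5),
            ("s1", "geneB", 0.1),
            ("s2", "geneA", None),  # a missing measurement
            ("s2", "geneB", 3.0),
        ],
    )
    rows = load_rows(conn)
    conn.close()
    return to_transactions(rows, threshold=1.0), to_matrix(rows)
```

In this sketch each target algorithm needs its own hand-written conversion, which is the kind of per-algorithm effort that makes preprocessing costly.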
Since there is not yet a standard definition of which part of the entire process constitutes preprocessing, preprocessing has been researched in various forms. In databases, data query languages, typified by SQL, are used extensively to manipulate data. Similarly, at the World Wide Web Consortium (W3C), which provides the Extensible Markup Language (XML) (refer to http://www.w3.org/XML/), various studies have been made in order to realize data manipulation using a data query language. The object of the research described above is to provide means for manipulating data, not to automate the manipulation itself. The usefulness of XML has been recognized in various fields. For example, in the field of bioinformatics, XML has been evaluated as follows: although XML, being self-descriptive, has limited ability to express semantics, ontologies will come to be described in XML owing to the descriptive power of its grammar, the reliability of its structure, its ease of handling, its degree of adoption and the like.
With regard to methods for navigating a tree structure, a tool has been proposed by IBM Japan Co., Ltd. and others (see the gazette of Japanese Patent Laid-Open No. 2000-194466). For an object tree, when navigation moves to a non-leaf node, this tool displays only the path from the point moved to up to the root of the tree structure, together with the complete subtree below that point. Although the method is a good interface for locating target information in an object tree that is asymmetric and has a complicated structure, it cannot dynamically transform a data aggregate or a data structure in response to a request from a user.
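The display rule described above may be sketched as follows, purely as an illustration and not as the implementation disclosed in the cited gazette. The class `Node` and the functions `path_to_root`, `subtree` and `visible` are hypothetical names assumed for this example. For a node moved to, the sketch returns the path from the root to that node and the names in the complete subtree below it, and every other part of the tree is omitted.

```python
class Node:
    """A tree node that knows its parent and its children (hypothetical type)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def path_to_root(node):
    """Names from the given node up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node.name)
        node = node.parent
    return path


def subtree(node):
    """Names of the node and all of its descendants, in pre-order."""
    names = [node.name]
    for child in node.children:
        names.extend(subtree(child))
    return names


def visible(node):
    """What is displayed after moving to 'node': the root-to-node path and
    the complete subtree rooted at 'node'. Siblings and their subtrees are hidden."""
    return list(reversed(path_to_root(node))), subtree(node)
```

For a tree in which `root` has children `a` and `b`, and `a` has children `a1` and `a2`, moving to `a` would display the path `root`, `a` and the subtree `a`, `a1`, `a2`, while `b` would not be displayed. The sketch only selects what to show; it does not restructure the underlying data, which is the limitation noted above.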