1. Field of the Invention
The present invention relates to a division program, a combination program, and an information processing technique, and to a technique that can be applied effectively to the processing of a structured document, e.g., an XML (Extensible Markup Language) document.
2. Description of the Related Art
In recent years, diverse systems, such as those of individuals, enterprises, and government offices, have been connected through the Internet so that Web services, EDI (electronic data interchange), and EC (electronic commerce) can be carried out through collaboration among the systems.
This requires a wide range of information to be exchanged. XML has therefore attracted attention as a common format for information exchange, because it can express structured data flexibly and is well suited to computer processing for data exchange and data processing.
XML originates from SGML (Standard Generalized Markup Language), which was standardized by the ISO (International Organization for Standardization) in 1986. The W3C (World Wide Web Consortium) established the basic specification XML 1.0 in February 1998 for the purpose of easy use on the Internet. HTML (HyperText Markup Language), a language for creating Web pages, has fixed tags specialized for display and therefore cannot meet the requirement that a computer process information based on tag information. XML allows a user to define tags freely, has a language structure that gives meaning to character strings within a document, and thereby allows information processing by a computer.
When an operation such as search, update, or delete is applied to an XML document, the document is handled by expanding it into a tree structure (Document Object Model: DOM) by means of standard API (Application Programming Interface) software. Expansion into a DOM tree, however, requires a working memory of as much as five to ten times the size of the original data and also expands items that are never used, which makes it time-consuming.
The following describes problems of the conventional techniques concerning XML described above.
(1) Regarding the XML
The terminology used here follows the XML standard. A character string enclosed by a pair of “<” and “>” is called a tag, a “<character string>” is called a start tag, a “</character string>” is called an end tag, an entire character string from a start tag to an end tag is called an element, a character string sandwiched between a start tag and an end tag is called an element content, a name of an element described within a tag is called an element name (or a tag name), and additional information for an element is called an attribute.
A structured document describes its data structure by embedding tags within the document itself. Because the data structure is embedded in the document as tags, the document remains flexible and extensible with respect to the addition, deletion, and modification of data items. In addition, giving the tags meaningful names makes the data easy for a person to read.
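The terms above can be illustrated with a short sketch using Python's standard xml.etree.ElementTree module. The sample element and the names "item", "id", and "pen" are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <item id="42"> is the start tag, </item> is the end tag,
# and the whole string from start tag to end tag is one element.
text = '<item id="42">pen</item>'

elem = ET.fromstring(text)

element_name = elem.tag        # element name (tag name)
attributes = elem.attrib       # attributes: additional information for the element
element_content = elem.text    # element content between the start and end tags

print(element_name, attributes, element_content)
```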
(2) Standard API Handling of XML Documents
Two standard interface (API: Application Programming Interface) specifications, namely DOM (Document Object Model) and SAX (Simple API for XML), have been established for handling an XML document, which is a representative structured document.
SAX is characterized by small memory consumption, generally high speed, and output in time-series order, and is thus suitable for simple reference processing.
DOM, on the other hand, is characterized by large memory consumption and generally low speed, but programming is easy even for complex processing because the elements of a document are expanded into a hierarchical tree. DOM is usually used for updating an XML document.
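The contrast between the two interfaces can be sketched with the SAX and DOM implementations in the Python standard library (xml.sax and xml.dom.minidom). This is a minimal illustration only; the sample document, the element name "order", and the function names are assumptions made for the example.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document used only for this illustration.
XML_TEXT = '<orders><order id="1">pen</order><order id="2">ink</order></orders>'


class OrderCounter(xml.sax.ContentHandler):
    """SAX handler: receives events in document order and keeps only a counter."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "order":
            self.count += 1


def count_orders_sax(text):
    """Reference-only processing: no tree is kept in memory."""
    handler = OrderCounter()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.count


def update_order_dom(text, order_id, new_content):
    """Update processing: the whole document is expanded into a tree first."""
    doc = minidom.parseString(text)
    for node in doc.getElementsByTagName("order"):
        if node.getAttribute("id") == order_id:
            node.firstChild.data = new_content  # rewrite the text node in place
    return doc.documentElement.toxml()


print(count_orders_sax(XML_TEXT))
print(update_order_dom(XML_TEXT, "2", "paper"))
```

The SAX version never holds more than the current event, which matches its suitability for referencing, whereas the DOM version can navigate to and rewrite any node at the cost of holding the full tree.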
(3) Conversion of a Large Capacity XML Document
The XSLT conversion provided as standard in an XML environment is used to convert the form of an XML document. The XSLT conversion, however, consumes a large amount of memory, about ten times the file size, and it is therefore hard to convert a large-capacity XML document of 50 MB or more. Accordingly, the countermeasures (i), (ii), and (iii) described below have conventionally been taken. While countermeasure (i) is the least cumbersome, it has been difficult to apply to a document having a complex structure.
(i) Division and conversion of a file: a conceivable method of converting a large-capacity XML document is to divide the file into a plurality of files, convert each of them, and then combine the converted files. An XML document having a complex data structure, however, must be divided at the most convenient dividing positions, and the division therefore has to rely on manual work.
(ii) Conversion by streaming processing: (a) a conversion program is written for the standard API SAX (Simple API for XML), which requires a new program for each conversion; or (b) STX (Streaming Transformations for XML) is used (e.g., refer to non-patent document 1). Method (b) is non-standard and requires a standard style sheet to be rewritten to match a special specification. Since it is single-pass stream processing, it also has the shortcoming that the data handling available for conversion is constrained.
(iii) Use of an RDB: a large-capacity XML document is first stored in an RDB (Relational Database), processed in the RDB, and then extracted as a converted XML document. This method requires handling of an RDB and creation of a new program, and is hence cumbersome.
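Countermeasure (i) can be sketched as follows for the simplest case, in which every child of the root element is an independent record so that any boundary between children is a valid dividing position. This is an illustrative assumption and not the method of the present invention; the functions divide, convert, and combine, the element names "table", "record", and "row", and the use of a plain Python function in place of an XSLT style sheet are all hypothetical.

```python
import xml.etree.ElementTree as ET


def divide(root, chunk_size):
    """Split the root's children into smaller documents sharing the root's tag."""
    children = list(root)
    parts = []
    for start in range(0, len(children), chunk_size):
        part = ET.Element(root.tag, root.attrib)
        part.extend(children[start:start + chunk_size])
        parts.append(part)
    return parts


def convert(part):
    """Stand-in for a per-file conversion: emit each child as a <row> element."""
    out = ET.Element(part.tag, part.attrib)
    for child in part:
        row = ET.SubElement(out, "row", child.attrib)
        row.text = child.text
    return out


def combine(parts):
    """Concatenate the children of the converted parts under a single root."""
    merged = ET.Element(parts[0].tag, parts[0].attrib)
    for part in parts:
        merged.extend(list(part))
    return merged


source = ET.fromstring(
    "<table><record>a</record><record>b</record><record>c</record></table>"
)
parts = divide(source, chunk_size=2)
result = combine([convert(p) for p in parts])
print(ET.tostring(result, encoding="unicode"))
```

The sketch works only because the sample is flat. Once records nest or refer to one another, a fixed chunk size no longer yields safe dividing positions, which is the difficulty noted for countermeasure (i).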
Although XML is a flexible form of data expression, it has the shortcoming that processing it consumes a large volume of memory.
As a countermeasure to the above problem, patent document 1 discloses a technique that, when an XML document is processed by DOM, develops a partial tree with the analyzed elements as nodes and deletes unnecessary nodes when a prescribed stopping condition occurs as a trigger, thereby allowing the analysis processing to continue without running short of memory.
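A loosely analogous idea, building only part of the tree and discarding nodes once they are no longer needed, can be sketched with the iterparse function of Python's xml.etree.ElementTree. This is not the implementation of patent document 1; the choice of "end of a record element" as the stopping condition, the element names "ledger", "record", and "amount", and the function name sum_amounts are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET


def sum_amounts(stream, record_tag="record"):
    """Sum <amount> values while keeping at most one record subtree in memory."""
    total = 0
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        # Assumed stopping condition: a record element has been fully parsed.
        if event == "end" and elem.tag == record_tag:
            total += int(elem.findtext("amount"))
            root.clear()  # delete the nodes that have already been processed
    return total


# Hypothetical sample data.
data = io.BytesIO(
    b"<ledger><record><amount>3</amount></record>"
    b"<record><amount>4</amount></record></ledger>"
)
print(sum_amounts(data))
```

Which element marks the point at which nodes can be deleted depends on the processing content, so the condition has to be chosen anew for each kind of processing.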
The technique of patent document 1, however, requires changing the operating specification of DOM itself and determining individual stopping conditions according to the processing content, and hence lacks versatility.
[Non-patent document 1] “Streaming Transformations for XML (STX) Version 1.0”, searched on Dec. 8, 2005; Internet <http://stx.sourceforge.net/>
[Patent document 1] Laid-open Japanese Patent Application Publication No. 2005-11183