Nowadays, as a general-purpose markup language recommended by the World Wide Web Consortium (W3C), eXtensible Markup Language (XML) is widely used in various applications, such as Web Services and databases. XML defines a common grammar for representing data with simple, human-readable markup and can appear, for example, as a configuration file or a database file. For many XML-based applications, especially databases and Web Services, response time is a critical performance criterion. Different applications have different response-time requirements. For example, in an Online Transaction Processing (OLTP) system of a large bank, the response time is usually required to be 100 ms or less, and a longer response time causes discomfort to users.
The response time of an XML-based application consists of many parts, of which the time for XML parsing is unavoidable. Because XML parsing involves many time-consuming operations, such as character encoding conversion, tokenization, well-formedness checking, and Document Type Definition (DTD)/XML Schema validation, it becomes a performance bottleneck in many XML-based applications and accounts for a major part of the response time. Moreover, some applications use huge XML documents. For example, in life science and content management, XML documents of several megabytes (MB) are very common, and in some cases XML documents of gigabytes (GB) are needed. Such large XML documents further degrade parsing performance. Generally, parsing a huge XML document takes dozens of seconds, which is usually unacceptable.
Technology Challenges
Over the years, processor technology has come a long way, from single-threaded processing to the latest multicore technology. The evolution of processor technology can be summarized as: single-threaded processor technology -> symmetric multiprocessing (SMP) technology using dual Celeron processors -> simultaneous multithreading (SMT), i.e. Hyper-Threading (HT), technology -> multicore processing technology.
Although today's commodity processors are equipped with multiple cores, which makes parallelism possible, most applications are not capable of exploiting them. The traditional approach of sequential application development needs to evolve further. There is a clear need for a paradigm shift from the sequential approach to a parallel approach, and the development of new tools and frameworks for parallel processing is inevitable.
Performance Challenges
With globalization, the demand for data processing has increased significantly. Business organizations face huge challenges in coping with the processing of high volumes of transactions. Technological advancement has also raised the expectations of consumers. The need of the hour is to provide information not only faster but also concisely and accurately. This has opened up opportunities for parallel processing design. It is difficult to imagine how life would have been without Google's MapReduce technology and Yahoo's Hadoop framework. The MapReduce framework plays an important role in parallel computation. Huge XML files, however, are very difficult to process in parallel and are therefore not suitable where a high level of parallel processing is needed. Nevertheless, XML is a popular standard for data representation and is widely used.
Several inventions have been made in this domain; some of those known to us are described below:
US Publication 20090006944 discloses a method and system for parsing a markup language document, wherein the method comprises: pre-splitting a body of the markup language document into a plurality of parts; scanning each of the plurality of parts, wherein while each of the parts is scanned, the scanning of the part is stopped only when a specific mark is found, and then a stop point at which the scanning is stopped is recorded; splitting the body of the markup language document into a plurality of fragments using the respective stop points; parsing the plurality of fragments in parallel and producing parsing results for the respective fragments; and combining the parsing results for the respective fragments to form a parsing result for the markup language document. However, the combined space consumption of the fragments is roughly the same as, or greater than, that of the original huge XML document.
The method of this publication employs an XML splitting and scanning technique that requires the system to scan through each part of the original XML document to identify predefined marks. This necessitates frequent access to each part of the XML document, which is a time-consuming task, and the method would seldom be able to exploit the advantages of parallel processing. SDML, by contrast, is written once and read many times, which means that once the SDML is created there is no need to apply any pre-splitting rules to the original XML document again.
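The pre-split, scan, split, parse and combine steps described in US Publication 20090006944 can be illustrated with a minimal Python sketch. It is not the implementation disclosed in that publication. It assumes, purely for illustration, that the document body is a flat sequence of non-nested <item> elements, that the "specific mark" is the literal start tag "<item>", and that the mark never occurs inside comments or CDATA sections. The names MARK, find_stop_points, split_at, parse_fragment and parse_in_parallel are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET

# Assumption: the "specific mark" is the start tag of a top-level record.
MARK = "<item>"


def find_stop_points(body, num_parts):
    """Pre-split the body into equal parts, then scan forward from the start
    of each part (except the first) until MARK is found; record that offset."""
    part_len = max(1, len(body) // num_parts)
    stops = []
    for i in range(1, num_parts):
        stop = body.find(MARK, i * part_len)
        if stop == -1:
            break  # no further mark: the rest belongs to the last fragment
        if not stops or stop > stops[-1]:
            stops.append(stop)  # skip duplicates when one record spans parts
    return stops


def split_at(body, stops):
    """Split the body into fragments at the recorded stop points."""
    bounds = [0] + stops + [len(body)]
    return [body[a:b] for a, b in zip(bounds, bounds[1:])]


def parse_fragment(fragment):
    """Parse one fragment; a synthetic root makes it a well-formed document."""
    root = ET.fromstring("<fragment>" + fragment + "</fragment>")
    return [item.text for item in root.iter("item")]


def parse_in_parallel(body, num_parts=4):
    """Parse the fragments concurrently and combine the results in order."""
    fragments = split_at(body, find_stop_points(body, num_parts))
    with ThreadPoolExecutor(max_workers=num_parts) as pool:
        partial = list(pool.map(parse_fragment, fragments))
    return [text for part in partial for text in part]


if __name__ == "__main__":
    body = "".join("<item>%d</item>" % i for i in range(10))
    print(parse_in_parallel(body, num_parts=3))
```

Even in this simplified form, every part boundary requires a scan of the original document text before any fragment can be handed to a worker. A thread pool is used only to keep the sketch short; in CPython, a process pool would likely be needed to obtain an actual multicore speed-up for CPU-bound parsing.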
US Publication 20090089658 discloses a method of parsing a hierarchically organized data document, comprising the steps of: preparsing the data document to determine a logical tree structure; automatically dividing the data document into a plurality of sections, in dependence on the logical tree structure, each section comprising at least a beginning of a logical section of the logical tree structure, with sufficient context to resolve any ambiguities; and automatically distributing the plurality of sections to a plurality of processors for concurrent parsing of the sections of the data structure. However, this method also works directly on the XML document rather than converting it into another, simpler format whose converted parts could be distributed among multiple cores, and even across servers. Such a conversion would save the time of the preparsing process for logical division into a plurality of sections, which is otherwise incurred each time an XML document is loaded for processing. SDML is written once and read many times, which means that once the SDML is created there is no need to apply any pre-splitting rules to the original XML document again.
None of the above-mentioned prior art recognizes the potential of converting a huge XML document into an intermediate format that is written once and read many times and that can be processed with a high degree of parallelism, not only within a multicore server but also across multicore servers.
In order to solve the above-mentioned problems, the present invention proposes a system and method for converting a huge XML document into a format and structure which can be processed with a high degree of parallelism to achieve high processing performance in a multicore environment.
Other features and advantages of the present invention will be explained in the following description of the invention having reference to the appended drawings.