1. Field of the Invention
The present invention generally relates to the parsing of document data described in a structured language and the extraction of information from the parsed document data.
2. Description of the Related Art
As a result of the widespread use and development of the Internet and various Web services in recent years, structured languages, such as the Extensible Markup Language (XML), have been gaining attention as one of the most useful means of storing or communicating information for various applications. For example, Japanese Laid-Open Patent Application No. 2004-46817 discloses a technique that employs a structured document format, such as XML, for the transmission of commands and the reception of response data during the exchange of data between a data storage unit and a computer.
Currently, two major technologies have been proposed for the parsing of document data described in XML. One is the object-model-based DOM (Document Object Model), which parses document data described in XML and retains the resultant data in memory as a tree structure. DOM provides easy access to the XML structured information via navigation through the nodes of the tree structure (see W3C Recommendation: Document Object Model Level 3 Core, 7 Apr. 2004: http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407).
The other major parsing technology is the event-based SAX (Simple API for XML), which is more lightweight than DOM. SAX enables data to be processed as it is acquired and therefore enables the handling of partial documents. SAX, however, requires a substantial execution time when handling large documents.
DOM and SAX may be implemented as an application programming interface (API) and utilized by an upper-level host application.
FIG. 11A shows an example of an XML file data structure in which an employee list is stored as XML document data. Data such as the name, age, and sex of each employee is stored in parent nodes. More detailed information, such as the past projects that the individual employee has been involved with, his or her past promotions and awards, monthly salary records, etc., may be stored in child nodes of the parent node.
When DOM is used, the entire XML file is first parsed and the information about all of the employees is retained in memory, after which access to all of the nodes is granted at once. However, when the number of employees contained in the file is very large, DOM takes a considerable time before the entire file is parsed and becomes accessible.
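By way of illustration only, the DOM behavior described above may be sketched with the DOM implementation in the Python standard library. The sample document, its element and attribute names, and the function name employee_names_dom are assumptions made for this sketch and do not reproduce the structure of FIG. 11A.

```python
from xml.dom.minidom import parseString

# Hypothetical employee list. The element and attribute names are
# assumptions for illustration, not the structure shown in FIG. 11A.
EMPLOYEE_XML = """<employees>
  <employee name="Alice" age="34" sex="F">
    <project>Alpha</project>
    <award>Excellence</award>
  </employee>
  <employee name="Bob" age="41" sex="M">
    <project>Beta</project>
  </employee>
</employees>"""


def employee_names_dom(xml_text):
    """Return every employee name using a DOM tree."""
    # The whole document, including all child nodes, is parsed and held
    # in memory before any single node becomes accessible.
    document = parseString(xml_text)
    employees = document.getElementsByTagName("employee")
    return [employee.getAttribute("name") for employee in employees]


print(employee_names_dom(EMPLOYEE_XML))
```

Even though only the names are requested in this sketch, the project and award child nodes are parsed and stored as well, which is the cost noted above for files containing a very large number of employees.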
On the other hand, SAX parses the XML document data sequentially from its beginning, notifying the host application about events such as the detection of an element start tag or an element end tag. In the aforementioned example, SAX parses the nodes of the employee list sequentially from the beginning of the file. As soon as data about a predetermined employee is parsed, access is granted to the information about the predetermined employee. FIG. 11B shows the sequence of parsing the XML file of FIG. 11A by SAX. As shown, SAX processes the XML file data sequentially from the beginning.
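By way of illustration only, the event notification described above may be sketched with the SAX implementation in the Python standard library. The sample document, its element and attribute names, the class EmployeeHandler, and the function parse_employees_sax are assumptions made for this sketch and do not reproduce FIG. 11A or FIG. 11B.

```python
import xml.sax

# Hypothetical employee list. The element and attribute names are
# assumptions for illustration, not the structure shown in FIG. 11A.
EMPLOYEE_XML = """<employees>
  <employee name="Alice" age="34" sex="F">
    <project>Alpha</project>
  </employee>
  <employee name="Bob" age="41" sex="M">
    <project>Beta</project>
  </employee>
</employees>"""


class EmployeeHandler(xml.sax.ContentHandler):
    """Plays the role of the host application that receives SAX events."""

    def __init__(self):
        super().__init__()
        self.events = []  # start/end tag events in document order
        self.names = []   # employee names, available as soon as each is parsed

    def startElement(self, name, attrs):
        # Called when an element start tag is detected.
        self.events.append(("start", name))
        if name == "employee":
            self.names.append(attrs.getValue("name"))

    def endElement(self, name):
        # Called when an element end tag is detected.
        self.events.append(("end", name))


def parse_employees_sax(xml_text):
    """Parse sequentially from the beginning and return the handler."""
    handler = EmployeeHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler


handler = parse_employees_sax(EMPLOYEE_XML)
print(handler.names)
print(handler.events[:3])
```

In this sketch the handler receives a start event for every element, including each project child node, in the order in which the elements appear in the document; no node is skipped or given priority over another.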
However, in the aforementioned related art, the nodes in the XML document data are handled flatly, without considering the importance of individual nodes in the XML document data. In other words, in the aforementioned related art, each node is presumed to be a uniform node having a predetermined name, possibly containing several child nodes, and having basically the same methods and properties.
Meanwhile, there is a demand to quickly grasp the overall picture of stored data, such as the names of all of the employees in the above example, rather than its detailed data. However, neither DOM nor SAX can satisfy such a demand. Another demand is to process XML document data that is input in a format other than a file or stream format, such as a live stream format. Such a demand, however, has not been sufficiently addressed by the related art such as DOM or SAX.