Patent Literature 1 discloses one example of a data extraction system, which extracts desired information from a Web page, taking a structured document as the search target. According to Patent Literature 1, the data extraction system comprises a communication device, a central processing device, a data extraction unit, a data extraction reconstructing unit, extraction data, and extraction basic data.
In the above-described structure, the data extraction reconstructing unit obtains a Web page by using the communication device and compares it with the previously obtained Web page to determine whether the HTML (HyperText Markup Language) structure has changed. When a change has occurred, the unit obtains the Web page having the new HTML structure by referring to the URL (Uniform Resource Locator) described in the extraction basic data. The data extraction reconstructing unit then searches the new HTML for each value of the extraction basic data and reconstructs a data extraction program using the tags that precede and follow the value.
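The flow described for Patent Literature 1 can be illustrated with a minimal sketch. It is not the implementation of Patent Literature 1. The names `tag_sequence`, `structure_changed`, and `rebuild_extractor`, the sample pages, and the choice to compare only the sequence of start tags are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the names of start tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def structure_changed(old_html, new_html):
    # Assumption: "structure" is approximated by the start-tag sequence.
    return tag_sequence(old_html) != tag_sequence(new_html)


def rebuild_extractor(new_html, known_value):
    """Build a pattern from the tags directly before and after a known value."""
    pos = new_html.find(known_value)
    if pos < 0:
        return None
    tag_start = new_html.rfind("<", 0, pos)
    tag_end = new_html.find(">", pos + len(known_value))
    if tag_start < 0 or tag_end < 0:
        return None
    prefix = new_html[tag_start:pos]
    suffix = new_html[pos + len(known_value):tag_end + 1]
    return re.compile(re.escape(prefix) + r"(.*?)" + re.escape(suffix), re.DOTALL)


# Hypothetical pages: the value 1200 plays the role of extraction basic data.
OLD_PAGE = "<table><tr><td>1200</td></tr></table>"
NEW_PAGE = '<div><span id="price">1200</span></div>'
LATER_PAGE = '<div><span id="price">1350</span></div>'

if structure_changed(OLD_PAGE, NEW_PAGE):
    extractor = rebuild_extractor(NEW_PAGE, "1200")
    latest_value = extractor.search(LATER_PAGE).group(1)
```

Because the rebuilt pattern depends only on the tags adjacent to the known value, it carries no information about where in the page that value is expected to appear.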
On the other hand, Patent Literature 2 discloses a device for designating an element of a structured document, which relieves a user of the burden of setting the elements to be used as an index when a structured document is searched. According to the related art disclosed in Patent Literature 2, the element designation device stores filter data in advance. The filter data has a path expression that designates an element and designation information that defines designation or designation-release for the element specified by the path expression. The device obtains a matching element from a structured document based on the path expression, obtains the descriptor associated with that path expression, and designates the element or releases its designation accordingly.
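The filter data described for Patent Literature 2 can be pictured as an ordered list of path expressions, each paired with a designate or release flag. The following is a minimal sketch under that assumption, using the limited XPath subset of Python's `xml.etree.ElementTree`. The `FILTERS` list, the `designated_elements` function, and the sample document are hypothetical and are not taken from Patent Literature 2.

```python
import xml.etree.ElementTree as ET

# Hypothetical filter data: (path expression, True = designate, False = release).
FILTERS = [
    (".//item", True),                  # designate every item element
    (".//item[@draft='yes']", False),   # then release items marked as drafts
]


def designated_elements(root, filters):
    """Apply each filter in order and return the elements left designated."""
    selected = []
    for path, designate in filters:
        for elem in root.findall(path):
            if designate and elem not in selected:
                selected.append(elem)
            elif not designate and elem in selected:
                selected.remove(elem)
    return selected


DOCUMENT = "<doc><item>a</item><item draft='yes'>b</item><item>c</item></doc>"
root = ET.fromstring(DOCUMENT)
index_texts = [elem.text for elem in designated_elements(root, FILTERS)]
```

In this sketch the path expressions themselves still have to be written by hand, which is the point the later discussion of Patent Literature 2 turns on.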
Patent Literature 1: Japanese Patent Laying-Open No. 2005-301437.
Patent Literature 2: Japanese Patent Laying-Open No. 2008-84128.
The first problem of the related art disclosed in the above-described Patent Literature 1 is that a search expression designating a target element cannot be automatically generated for a Web page whose contents change according to the system state or the number of elements to be displayed.
The reasons are that, even when output by the same system, structured documents have slightly different structures depending on the system state or the number of elements to be displayed, and that it cannot be automatically determined from a single Web page which element should serve as the reference for searching for the element to be detected.
The second problem is that a structure change of a Web page whose contents change cannot be automatically detected. The reasons are that, because structured documents output by the same system have slightly different structures depending on the system state or the number of elements to be displayed, they fail to coincide completely even when no structure change has occurred, and that it cannot be determined from a single Web page which elements should be checked for the absence of change.
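The false detection described above can be reproduced with a small sketch. Two hypothetical pages are assumed to come from the same system and the same template, differing only in the number of rows displayed. The function `tag_outline` and the page contents are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def tag_outline(markup):
    """Return the element names of a well-formed page in document order."""
    return [elem.tag for elem in ET.fromstring(markup).iter()]


# Same hypothetical template, rendered with two jobs and with three jobs.
PAGE_TWO_JOBS = (
    "<html><body><table>"
    "<tr><td>job-1</td></tr><tr><td>job-2</td></tr>"
    "</table></body></html>"
)
PAGE_THREE_JOBS = (
    "<html><body><table>"
    "<tr><td>job-1</td></tr><tr><td>job-2</td></tr><tr><td>job-3</td></tr>"
    "</table></body></html>"
)

# An exact comparison reports a change although the template is unchanged.
naive_change_detected = tag_outline(PAGE_TWO_JOBS) != tag_outline(PAGE_THREE_JOBS)
```

Here `naive_change_detected` is true only because one more row is displayed, so an exact comparison cannot separate a genuine structure change from ordinary variation in content.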
The third problem is that an erroneous element might be designated. The reason is that, because only the description in the vicinity of the search target is compared and detected, it is impossible to verify whether the detected element is the appropriate one when, for example, the number of elements changes or a similar character string is added elsewhere.
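A short sketch shows how matching on the immediate vicinity alone can designate the wrong element. The pattern, the function `extract_status`, and both pages are hypothetical. The second page is assumed to gain a new section that happens to contain a similar character string ahead of the original target.

```python
import re

# Pattern built only from the text adjacent to the target value.
PATTERN = re.compile(r"<td>Status</td><td>(.*?)</td>")


def extract_status(html):
    match = PATTERN.search(html)
    return match.group(1) if match else None


# Original page: a single "Status" row, belonging to the server.
ORIGINAL_PAGE = (
    "<h2>Server</h2><table><tr><td>Status</td><td>running</td></tr></table>"
)

# Later page: a backup section with its own "Status" row is added first.
CHANGED_PAGE = (
    "<h2>Backup</h2><table><tr><td>Status</td><td>stopped</td></tr></table>"
    "<h2>Server</h2><table><tr><td>Status</td><td>running</td></tr></table>"
)

before = extract_status(ORIGINAL_PAGE)   # value of the server row
after = extract_status(CHANGED_PAGE)     # value of the backup row
```

The pattern still matches on the changed page, so nothing signals that `after` now holds the status of a different component.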
On the other hand, according to the related art disclosed in Patent Literature 2, designation and designation-release of elements held by a structured document are executed automatically using element designation correspondence information (filter data). This makes element designation more flexible than designation by a structure path expression alone, and relieves a user of the burden of designating elements included in a structured document. There remains, however, the problem that a search expression for designating a target element cannot be automatically generated for a Web page whose contents change according to the system state or the number of elements to be displayed.
The reason is that the related art of Patent Literature 2 does not address automatic generation of a search expression to which a condition common to a plurality of structured documents is added, or from which the other elements are deleted.