With the rapid development of the Internet, it has become the most important platform for distributing information. Given the explosive growth of information online, however, quickly and efficiently obtaining the information that users want has become a problem that needs to be addressed. Conventional search engines help people find web pages through keyword search, but they provide only links to relevant pages, and users must still browse those pages manually to find the information they want. Moreover, because accurate queries cannot be customized, many search results are not what the users want, so accurate and specialized search results cannot be provided. An ideal method is to query the Internet as if it were an information source such as a database. Web page information extraction has emerged for this purpose. It obtains web pages of interest from different information sources, then extracts the information that is of interest to users and stores it in a database, so that the users can perform queries, searches, data mining or data analysis on that information. The objective of web page information extraction is to extract the textual information of a web page and express it as structured data, thereby converting text that is hard to process into structured data that is easy to process and analyze.
A web page is a semi-structured document written in Hyper Text Mark-up Language (HTML) and represented by the Document Object Model (DOM). Valuable information is commonly stored in a backend database and presented to a user through a fixed page template. A web page is actually a file, and what is presented to users is normally the content after it has been interpreted by a browser. Selecting “view source” from a browser menu shows the actual content of the web page in a text editor such as Notepad. As can be seen there, a web page is a text file that uses a variety of tags (e.g., for headers, font, color, size, etc.) to describe elements of the page such as text, images, tables and sound. These tags delimit the text content to be displayed and introduce structural information into the document. Based on these tags, a document can be represented as a tree structure, which is referred to as a DOM structure. Web page information is extracted by locating the position of the content to be extracted in the DOM structure. A common extraction process includes: obtaining position information of the content to be extracted from a sample page, and, for a dataset of web pages that use the same template, extracting the content using that position information. The accuracy of the position information directly determines the quality of the extraction. Because web pages are updated rapidly, the DOM structure is complicated and changes frequently, which easily invalidates the position information and results in positioning failure or extraction of incorrect information. A web page information extraction system therefore seeks accurate and robust (“robust” meaning “strong”, “sturdy” or “steady”) positioning of web page content.
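The common extraction process described above can be sketched as follows. This is a minimal illustration only: the two page snippets, the constant PRICE_POSITION and the function extract are hypothetical names invented for the example, and the pages are written as well-formed markup so that Python's standard xml.etree.ElementTree parser can build the tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page and a second page built from the same template.
SAMPLE_PAGE = """<html><body>
<div><h1>Shop</h1></div>
<div><span>Widget</span><span>9.99</span></div>
</body></html>"""

OTHER_PAGE = """<html><body>
<div><h1>Shop</h1></div>
<div><span>Gadget</span><span>24.50</span></div>
</body></html>"""

# Position information obtained from the sample page (assumed for this sketch):
# the second <span> inside the second <div> under <body>.
PRICE_POSITION = "./body/div[2]/span[2]"


def extract(page_source, position):
    """Parse a page into a tree and return the text found at the given position."""
    root = ET.fromstring(page_source)
    node = root.find(position)
    return node.text if node is not None else None


print(extract(SAMPLE_PAGE, PRICE_POSITION))  # 9.99
print(extract(OTHER_PAGE, PRICE_POSITION))   # 24.50
```

The same position string works on both pages only because they share one template; the extraction is exactly as reliable as that position information.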
Existing technologies include a method of automatically generating XPATH to extract web page information. (XPATH is a language for finding information in an XML document; it selects nodes or node sets in the document using path expressions.) The method includes: a user selecting content for extraction from a web page; a process recording the position of the selected content in the DOM structure; automatically generating an XPATH path that runs level by level from the DOM root node down to the target node and includes only tag name information and shift information; and using the XPATH to obtain information from the set of web pages to be extracted. Because the automatically generated XPATH generally records only tag names and shifts, its positioning information is oversimplified and cannot follow the ever-changing web page structure. Once the content of a web page is updated and elements on the XPATH path change, problems arise such as failing to locate the content or locating content that was not intended for extraction. Moreover, because the recorded information is oversimplified, the XPATH cannot be used to identify repeated structures, so additional computations are required to identify and extract them.
When implementing the present disclosure, the inventors discovered at least the following problems in existing technologies. Web page information extraction generally uses a semi-automatic method that locates the information to be extracted by analyzing the page structure. Since web page information changes dynamically and is updated in real time, the position information becomes invalid once the content of a web page is updated and its structure changes, leading to extraction failures or inaccurate extraction results.
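Both failure modes can be reproduced with a small sketch. The recorded position and the three page variants below are hypothetical and exist only to illustrate the effect; the pages are written as well-formed markup so that Python's standard xml.etree.ElementTree parser accepts them.

```python
import xml.etree.ElementTree as ET

# Hypothetical position recorded from the page before it was updated.
RECORDED_POSITION = "./body/div[2]/span[2]"

BEFORE_UPDATE = (
    "<html><body>"
    "<div><h1>Shop</h1></div>"
    "<div><span>Widget</span><span>9.99</span></div>"
    "</body></html>"
)

# Update 1: a banner <div> is inserted, shifting the product <div> to third place.
UPDATED_WITH_BANNER = (
    "<html><body>"
    "<div><h1>Shop</h1></div>"
    "<div><span>Sale</span><span>Ends Friday</span></div>"
    "<div><span>Widget</span><span>9.99</span></div>"
    "</body></html>"
)

# Update 2: the content is wrapped in a new <section> element.
UPDATED_WITH_WRAPPER = (
    "<html><body><section>"
    "<div><h1>Shop</h1></div>"
    "<div><span>Widget</span><span>9.99</span></div>"
    "</section></body></html>"
)


def extract(page_source, position=RECORDED_POSITION):
    """Return the text at the recorded position, or None if nothing is there."""
    node = ET.fromstring(page_source).find(position)
    return node.text if node is not None else None


print(extract(BEFORE_UPDATE))         # 9.99 (intended content)
print(extract(UPDATED_WITH_BANNER))   # Ends Friday (inaccurate extraction result)
print(extract(UPDATED_WITH_WRAPPER))  # None (extraction failure)
```

In the second case the extraction appears to succeed, so the incorrect result would go unnoticed without some independent check on the extracted value.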
Furthermore, existing technologies cannot adequately solve the problem of identifying repeated structures. In the automatic XPATH generation method, the generated XPATH cannot itself identify repeated structures, and additional computations are required to identify and extract them.
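The limitation can be illustrated with a hypothetical product list. In the sketch below, a path made only of tag names and shifts matches a single record, and reaching the other records requires an extra step. The helper relax_step is one possible form of such an extra step, invented for this example and not taken from any existing system; the caller has to supply by hand which step repeats, since the path itself does not say.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page containing a repeated structure: one <li> per product record.
PAGE = """<html><body><ul>
<li><span>Widget</span><span>9.99</span></li>
<li><span>Gadget</span><span>24.50</span></li>
<li><span>Gizmo</span><span>3.75</span></li>
</ul></body></html>"""

# Positional path as it might be recorded for the first product name.
POSITIONAL_PATH = "./body/ul[1]/li[1]/span[1]"


def relax_step(path, tag):
    """Drop the shift from the first step with the given tag so it matches every sibling."""
    return re.sub(rf"/{re.escape(tag)}\[\d+\]", f"/{tag}", path, count=1)


root = ET.fromstring(PAGE)

# The recorded path reaches exactly one record.
single = [node.text for node in root.findall(POSITIONAL_PATH)]

# Additional computation: something outside the path must decide that <li> repeats.
repeated_path = relax_step(POSITIONAL_PATH, "li")
all_names = [node.text for node in root.findall(repeated_path)]

print(single)         # ['Widget']
print(repeated_path)  # ./body/ul[1]/li/span[1]
print(all_names)      # ['Widget', 'Gadget', 'Gizmo']
```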