Webpages are generally designed for terminals with relatively large screens, and their content is often rich and complexly structured. Users, however, increasingly browse webpages on mobile terminals with smaller screens. To improve the readability of a webpage displayed on a mobile terminal, it is necessary to extract page information from the webpage and display the extracted page information on the mobile terminal.
The related art provides a webpage information extraction method based on a DOM (Document Object Model) tree structure. The method includes obtaining a webpage and performing lexical analysis on the webpage to obtain each word contained in the webpage. Each word is parsed to obtain each node included in the webpage, and the nodes are assembled into a DOM tree by script analysis. A DOM-based data recognition algorithm is then applied to identify the page information in the DOM tree that is relevant to the target information, and the identified page information is displayed.
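For illustration only, the DOM-tree-based flow described above can be sketched as follows. This is a minimal sketch, not the implementation used by any particular related-art system: it relies on Python's standard `html.parser` module for tokenizing, and the names `Node`, `DomBuilder`, `build_dom`, and `find_relevant`, as well as the simple keyword match standing in for the data recognition algorithm, are hypothetical and chosen for this example. Note that the whole page is parsed and every node is visited before any relevant information is found.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial list, enough for this sketch).
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class Node:
    """One node of the simplified DOM tree (hypothetical helper type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text directly inside this element


class DomBuilder(HTMLParser):
    """Turns the tag and text tokens emitted by HTMLParser into a tree."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def build_dom(html):
    """Parse the entire page and return the root of the tree."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_relevant(node, keyword):
    """Visit every node and collect those whose own text mentions the keyword.

    The case-insensitive keyword match is an assumed stand-in for the
    DOM-based data recognition algorithm.
    """
    hits = []
    if keyword.lower() in node.text.lower():
        hits.append(node)
    for child in node.children:
        hits.extend(find_relevant(child, keyword))
    return hits


page = (
    "<html><body><div id='nav'>Home | About</div>"
    "<div><h1>Price list</h1><p>Price: 42 USD</p></div></body></html>"
)
hits = find_relevant(build_dom(page), "price")
print([n.tag for n in hits])  # ['h1', 'p']
```

In this sketch the navigation block is tokenized, turned into a node, and traversed even though it contributes nothing to the result.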
However, according to the present disclosure, the above technology may have certain problems. For example, the DOM-tree-based webpage information extraction method requires analyzing the entire contents of the webpage, which often include a large number of words and nodes that are irrelevant to the target information. As a result, the efficiency of webpage information extraction is low.
The disclosed method and device for extracting webpage information are directed to solving one or more of the problems set forth above, as well as other problems.