Web page information extraction (e.g., web crawling) is the retrieval of web page data and the subsequent extraction and separation of useful data by means of programmatic analysis. For example, writing a program that extracts a particular news headline from the news channel of a particular website is a form of web page information extraction. At present, information extraction falls into two primary types: rules-based extraction, in which the rules are either formulated manually or obtained through learning; and extraction using machine learning methods.
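The headline example above can be sketched as a manually formulated rule. This is a minimal illustration only: the rule (the headline sits in an `<h1>` element whose class is `headline`), the names `HeadlineParser` and `extract_headline`, and the sample page are all assumptions made for the example, and the page is taken as already retrieved.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Applies one hand-written rule: take the first <h1 class="headline">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self._parts = []
        self.headline = None

    def handle_starttag(self, tag, attrs):
        # Hypothetical rule: the headline is an <h1> with class "headline".
        if (tag == "h1" and self.headline is None
                and dict(attrs).get("class") == "headline"):
            self._inside = True

    def handle_data(self, data):
        # Collect text only while inside the matching element.
        if self._inside:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h1" and self._inside:
            self._inside = False
            self.headline = "".join(self._parts).strip()


def extract_headline(html):
    """Return the headline text, or None if the rule does not match."""
    parser = HeadlineParser()
    parser.feed(html)
    return parser.headline


# Hypothetical page content standing in for a retrieved news page.
SAMPLE_PAGE = """
<html><body>
  <div class="nav">Home | News</div>
  <h1 class="headline">City opens new library</h1>
  <p>Body text of the article.</p>
</body></html>
"""

print(extract_headline(SAMPLE_PAGE))  # City opens new library
```

The rule separates the useful data (the headline) from the navigation and body text, but it only works for pages that follow this one assumed structure.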
Web page information extraction is one part of the work of a search engine. As the internet has developed, the scale of information on it has expanded continuously. Because the data on the internet comes from a large number of different websites, and page structures vary greatly from one website to another, search engines have been unable to develop a universal extractor capable of analyzing web pages from different websites.
For this reason, the earliest search engines, and particularly vertical search engines (specialized search engines targeting certain fields of knowledge), resolved this problem by using many targeted extractors, i.e., each extractor was targeted at extracting web page information from a certain website or from pages having a certain type of page structure. However, because this method required multiple targeted extractors to be maintained, maintenance was difficult; moreover, adding a new website or type of website required developing new targeted extractors, which made development costs very high.
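The targeted-extractor arrangement can be pictured as a table that maps each supported website to its own hand-written extractor. The sketch below is illustrative only: the host names, the two extractor functions, and the page structures they assume are hypothetical and do not describe any particular search engine.

```python
import re
from urllib.parse import urlparse


def extract_site_a(html):
    # Hypothetical rule for site A: the title sits in <h1>...</h1>.
    match = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.DOTALL)
    return match.group(1).strip() if match else None


def extract_site_b(html):
    # Hypothetical rule for site B: the title sits in <div id="title">...</div>.
    match = re.search(r'<div id="title">(.*?)</div>', html, re.DOTALL)
    return match.group(1).strip() if match else None


# One entry per supported website. Supporting a new site means writing
# and maintaining another extractor function and registering it here.
EXTRACTORS = {
    "news.site-a.example": extract_site_a,
    "www.site-b.example": extract_site_b,
}


def extract(url, html):
    """Dispatch to the extractor registered for the URL's host."""
    host = urlparse(url).hostname
    extractor = EXTRACTORS.get(host)
    if extractor is None:
        raise LookupError(f"no targeted extractor for {host}")
    return extractor(html)
```

Calling `extract` for a host that has no entry raises `LookupError`, which mirrors the cost described above: each additional website requires new development work before its pages can be processed.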
Subsequently, people began to search for schemes capable of automatically generating extractors. For example, the Locoy Spider is an information extraction tool based primarily on regular expressions; it includes functions such as information capturing, extraction, and publication, and uses regular expressions configured by the user to realize customized capturing and extraction.
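Extraction driven by user-configured regular expressions can be sketched generically as follows. The rule format shown (a field name mapped to a pattern with one capturing group), the function `extract_fields`, and the sample page are assumptions made for illustration; they are not the actual configuration syntax of the Locoy Spider or of any other tool.

```python
import re

# Hypothetical user-supplied configuration: each field name maps to a
# regular expression with exactly one capturing group.
USER_RULES = {
    "title": r"<title>(.*?)</title>",
    "author": r'<span class="author">(.*?)</span>',
    "date": r"(\d{4}-\d{2}-\d{2})",
}


def extract_fields(html, rules):
    """Apply each configured pattern; keep the first match per field."""
    result = {}
    for field, pattern in rules.items():
        match = re.search(pattern, html, re.DOTALL)
        result[field] = match.group(1).strip() if match else None
    return result


# Hypothetical page content standing in for a captured web page.
PAGE = (
    "<html><head><title>Market update</title></head>"
    '<body><span class="author">J. Doe</span>'
    "<p>Published 2020-01-15</p></body></html>"
)

print(extract_fields(PAGE, USER_RULES))
# {'title': 'Market update', 'author': 'J. Doe', 'date': '2020-01-15'}
```

The extraction logic itself is generic, but every pattern has to be written by someone who knows both regular expression syntax and the markup of the target page.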
However, an information extraction method based solely on regular expressions still requires the regular expressions to be configured manually; its level of automation remains low, and it does not adequately support high-volume web page extraction. Moreover, users must have mastered regular expressions and must also have a substantial understanding of web page structure, so the technical threshold is relatively high for non-professionals.