Main body extraction plays an increasingly important role in fields such as search engines and mobile reading. Techniques commonly used in main body extraction include rule-based, DOM (Document Object Model) tree-based, mark window-based, and maximum text block-based methods. These methods all need to exclude non-body text in a webpage, such as advertisements and website statements. FIG. 1a is a schematic diagram of a code segment for a website statement, and FIG. 1b shows how the code segment of FIG. 1a is actually displayed in a webpage. Such website statements are very common in webpages, are of little value to a reader, and need to be excluded during main body extraction. However, effectively recognizing such non-body text remains a challenge.
In the prior art, non-body text recognition is primarily performed on the basis of garbage keyword density. Recognizing non-body text based on garbage keywords requires a dictionary composed of garbage keywords, and the dictionary must be constantly updated. A new garbage keyword can only be added to the dictionary after a problem has been found. Such a method therefore suffers from a serious lag, and the lag becomes even more prominent when facing the huge amount of data of the whole internet.
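For illustration only, the prior-art garbage keyword density approach might be sketched as follows. The dictionary contents, the definition of density as the fraction of tokens found in the dictionary, the threshold value, and the function names are all assumptions made for this sketch and are not taken from any particular prior-art system.

```python
import re

# Hypothetical garbage keyword dictionary; in practice it would be
# maintained by hand and extended whenever a missed case is discovered.
GARBAGE_KEYWORDS = {"copyright", "disclaimer", "reserved", "advertisement"}


def garbage_keyword_density(text, keywords=GARBAGE_KEYWORDS):
    """Return the fraction of word tokens that are garbage keywords.

    Assumption: density = dictionary hits / total alphabetic tokens.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in keywords)
    return hits / len(tokens)


def is_non_body(text, threshold=0.2, keywords=GARBAGE_KEYWORDS):
    """Classify a text block as non-body when its density reaches the
    (assumed) threshold."""
    return garbage_keyword_density(text, keywords) >= threshold
```

Under these assumptions, a block such as "Copyright 2012 Example Site. All rights reserved. Disclaimer." is flagged as non-body, whereas a statement phrased with words that are not yet in the dictionary passes undetected until someone notices the problem and adds the missing keyword, which is the lag described above.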