It is sometimes desired to extract text from a mark-up document. However, it is difficult to distinguish meaningful or desired text from the extraneous text that such documents frequently contain.
For example, it may be desired to extract text from a web page, wherein the meaningful text is the main text of the web page and the extraneous text is other text, such as text forming one or more accompanying advertisements, decorations, navigation information, or a header or footer of the web page.
It is an object of embodiments of the invention to at least mitigate one or more of the problems of the prior art.