Various methods and systems to filter undesired content from online content are possible, and particularly, methods and systems may allow a viewer to receive desired online content while unobtrusively removing undesired parts.
The Internet represents a very valuable resource containing a large quantity of information and opportunity. Nevertheless, the Internet is uncontrolled and can also be a source of undesired content. Many users or Internet providers desire to be protected from undesired content that popularizes pornography, drugs, occultism, sects, gambling games, terrorism, hate propaganda, blasphemy, and the like. In order to allow access to desired content while shielding a user from undesired content, Internet filters have been developed.
Early Internet filters were generally based on the filtering of electronic addresses (Uniform Resource Locators, “URLs”). Software compared a website address with addresses contained in a prohibited site database (a black list) and prevented access to sites known to include undesired content. Such a methodology depends on the completeness of the prohibited site database. No one has ever compiled a complete indexed database that would make it possible to determine acceptable sites for any user. Furthermore, the number of web pages published grows exponentially, making it more and more difficult to keep URL databases up to date. In addition, URL-based filtering either completely blocks or completely allows a URL and all associated content. Often a single URL may include both valuable information and undesired content. URL-based filtering is not sufficiently specific to allow a user access to the valuable information while blocking the undesired content.
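The all-or-nothing character of URL-based filtering can be illustrated with a minimal Python sketch. The names BLACK_LIST, is_blocked and fetch_decision are hypothetical and chosen only for illustration; the sketch assumes the black list is a simple set of host names and does not represent any particular prior art product.

```python
from urllib.parse import urlsplit

# Hypothetical black list; a real prohibited-site database would be far larger
# and, as noted above, could never be complete.
BLACK_LIST = {"www.badguys.com"}


def is_blocked(url: str) -> bool:
    """Return True when the host of the given URL appears in the black list."""
    # urlsplit only recognizes the host when a "//" separator is present,
    # so a bare address such as "www.badguys.com" is given one first.
    if "//" not in url:
        url = "//" + url
    host = (urlsplit(url).hostname or "").lower()
    return host in BLACK_LIST


def fetch_decision(url: str) -> str:
    """The decision covers the whole address: every page is blocked or allowed."""
    return "BLOCK" if is_blocked(url) else "ALLOW"
```

In this sketch, any page served from a listed host is blocked regardless of how much of its content is useful, and any host missing from the list is allowed regardless of what it serves.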
FIG. 1a is a screenshot of an example of an on-line presentation 10 which is a simple web page. Presentation 10 includes a free text block 12 which is a structure including three elements, paragraphs 11a, 11b, and 11c. Presentation 10 also contains a list title 19, and a list 14 containing ten elements, list items 17a, 17b, 17c, 17d, 17e, 17f, 17g, 17h, 17i, and 17j. Presentation 10 also contains a title 16. Inside presentation 10 there is also undesired content 20a in free text block 12 in paragraph 11a and other undesired content 20b inside of list 14 in item 17g. A URL source address 22 www.badguys.com of presentation 10 is shown in the address bar.
The HTML text source code for presentation 10 is illustrated in FIG. 1b. The HTML text source contains title 16. The beginning of title 16 is marked by a title start tag 15 and the end of title 16 is marked by a title end tag 15′.
The HTML source code contains free text block 12 with three paragraphs of text 11a-c. Each of paragraphs 11a and 11b begins with a start group tag <div> and ends with an end group tag </div>.
The last paragraph 11c begins with a start group tag <div> but ends with a line break tag <br> marking the beginning of list title 19. After list title 19 the HTML text source contains list 14. The beginning of list 14 is marked by a list start tag 13 and the end of list 14 is marked by a list end tag 13′. Inside of list 14 are found ten elements, list items 17a-j. In list item 17g is found undesired content 20b. After list 14 is found the end group tag </div> of the group that started at the beginning of paragraph 11c.
Referring to FIG. 2, a screenshot of the result of a first prior art Internet content filter acting upon presentation 10 is illustrated. The prior art system of FIG. 2 blocks all content from any address in a black list. Thus, because URL source address 22 www.badguys.com is black listed, presentation 10 is entirely blocked and in its place a substitute presentation 210 having a substitute title 216 from a substitute URL source address 222 is rendered. Substitute presentation 210 is obtrusive and has prevented a user from accessing any of the useful information of presentation 10.
More recently, content based filtering has been introduced. In content-based filtering a viewing object is analyzed for evidence of inappropriate content. If inappropriate content is found, the content is blocked.
For example, United States Patent Application 2007/0214263 teaches analysis of an HTML page and its associated links and a decision to allow or block the page based on the identified content. The blocking of entire HTML pages is undesirable as such blocking prevents access to both useful and undesired content of the page.
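A page-level allow-or-block decision of this general kind can be sketched in Python as follows. This is a simplified illustration only: it uses plain keyword matching on the visible text, which is an assumption made for brevity and is not the analysis taught by the cited application. UNDESIRED_TERMS, TextExtractor and page_allowed are hypothetical names.

```python
from html.parser import HTMLParser

# Hypothetical list of undesired terms, for illustration only.
UNDESIRED_TERMS = {"gambling"}


class TextExtractor(HTMLParser):
    """Collects the text found between the tags of an HTML page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def page_allowed(html: str) -> bool:
    """Return False for the whole page if any undesired term appears in its text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.chunks).lower().split()
    # Trailing punctuation is stripped so that "gambling!" still matches.
    return not any(word.strip(".,;:!?") in UNDESIRED_TERMS for word in words)
```

Because the function returns a single verdict for the page, one offending paragraph is enough to block every other paragraph on the same page.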
United States Patent Application 2003/0126267 further allows blocking of undesired items inside an electronic media object (for example blocking or blurring of an objectionable picture or removal of objectionable words and their replacement by some neutral character).
Prior art blocking of undesired content is illustrated in FIG. 3. Presentation 10 is replaced by a sanitized presentation 310 which includes free text 312, list 314 and a title 316. Free text 312 is similar to free text 12 except that undesired content 20a has been blocked by inserting blocking characters 320a. Similarly, list 314 is similar to list 14 except that undesired content 20b has been blocked by inserting blocking characters 320b. URL source address 22 www.badguys.com and title 16 of presentation 10 are still displayed. Thus, the prior art content blocking system removes undesired content without accounting for or adjusting the structure of the presentation. In the resulting sanitized presentation, the content of the presentation no longer fits the structure of the presentation. The result is that the remaining structural items (in the example of FIG. 3, paragraph 11a and list item 17g) are unsightly and unnecessary, and may even include further undesired content associated with the removed content (in the example of FIG. 3, undesired content 20a,b).
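The effect of inserting blocking characters without regard to structure can be sketched in Python as follows. UNDESIRED_TERMS and mask_undesired are hypothetical names, the "#" mask character is an assumption, and the sketch matches terms in the raw HTML text for brevity (so it could also alter tag or attribute text), which a practical filter would need to avoid.

```python
import re

# Hypothetical list of undesired terms, for illustration only.
UNDESIRED_TERMS = ["gambling"]


def mask_undesired(html: str, mask_char: str = "#") -> str:
    """Overwrite each undesired term with blocking characters of equal length.

    The surrounding structure (paragraph or list item) is left in place,
    together with any text associated with the masked term.
    """
    pattern = re.compile(
        "|".join(re.escape(term) for term in UNDESIRED_TERMS),
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: mask_char * len(match.group(0)), html)
```

For example, mask_undesired("<li>gambling site, great odds</li>") returns "<li>######## site, great odds</li>": the list item still occupies its place in the list and still carries the text associated with the masked term.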
Blocking of part of a presentation (by erasing or obscuring) is obtrusive and unsightly. Furthermore, in many applications, such blocking is not effective. For example, a school may desire to filter out predatory advances, links or search results. Just removing objectionable words may leave the links active and endanger students or even increase the danger by arousing their curiosity and encouraging them to actually visit the source of the blocked content to see what they are missing. Alternatively, one may indiscriminately black out a zone of the screen around an undesired object (e.g., an undesired picture or word) in order to also block associated content. If the blocked zone is large then this results in obscuring a lot of potentially valuable content. If the blocked zone is small then there is a substantial risk that related undesired content will not be blocked.
The above limitations of the prior art are particularly severe for data sources containing a large variety of content from different sources, for example Web 2.0-based technologies (e.g., Facebook) and the like (e.g., Wikipedia, search engines). In such applications, content from unrelated sources is organized together in a single webpage. It is therefore desirable, on the one hand, to remove objectionable content along with associated data and, on the other hand, to leave unaffected any data that is not associated with undesired content.
Therefore it is desirable to have an unobtrusive filter that removes undesired content and associated data without disturbing desired content and its presentation.