Various methods and systems for identifying the content of a presentation are possible. In particular, methods and systems may unobtrusively identify and remove undesired content in real time while allowing a viewer to receive desired content online.
The Internet represents a very valuable resource containing a large quantity of information and vast opportunities. Nevertheless, the Internet is uncontrolled and can also be a source of undesired content. Many Internet users and providers desire to be protected from undesired content that popularizes pornography, drugs, occultism, sects, gambling games, terrorism, hate, blasphemy, spam, junk mail and the like. In order to allow access to desired content while shielding a user from undesired content, Internet filters have been developed.
Early Internet filters were generally based on the filtering of electronic addresses (Uniform Resource Locators, “URLs”). Software compared a website address with addresses contained in a website database (a black list) and prevented access to websites known to include undesired content. Such a methodology depends on the completeness of the prohibited website database. No one has ever compiled a complete indexed database that would make it possible to determine acceptable websites for any user.
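By way of illustration only, the address-based filtering described above can be sketched as follows. The sketch is not taken from any particular prior art filter; the function name `is_blocked` and the host names in the black list are hypothetical placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical black list; the host names are placeholders, not real data.
BLACKLIST = {"blocked.example", "casino.example"}

def is_blocked(url, blacklist=BLACKLIST):
    """Return True if the URL's host, or any parent domain, is on the black list."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Check "a.b.c", then "b.c", then "c".
    return any(".".join(labels[i:]) in blacklist for i in range(len(labels)))

print(is_blocked("http://www.casino.example/page"))  # listed domain
print(is_blocked("http://news.example/"))            # unlisted domain
```

The decision depends only on the address. A website that is missing from the database is allowed regardless of what it contains, which is why the completeness of the database is the limiting factor.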
Data-mining technologies have been applied to tackle the task of classifying the Internet and protecting users from undesired content. Identifying undesired content in a presentation can be a challenging task. On the one hand, content analysis needs to be general enough to recognize and remove undesired content that may take a large number of different forms. On the other hand, the filter must be specific enough to differentiate undesired content from various contents that the user may desire. Traditional filtering techniques such as text content analysis and data mining are limited in the current state of the art.
Text content analysis and the related field of text mining are used for automatic classification of presentations based on their textual content. Mining applications work in the background to build a large database of information and classification data. With the exponential growth of the Internet, performing off-line content analysis and blocking all undesired URL addresses in advance has become an unmanageable task even with the best data-mining technology. In addition, URL-based filtering either completely blocks or completely allows a URL and all associated content. Often a single URL may include both valuable information and undesired content. URL-based filtering is not sufficiently specific to allow access to the desired content while blocking access to the undesired content. Furthermore, off-line techniques cannot classify password-protected websites, which are not accessible to anonymous web-crawling classification applications.
Therefore, there is recent interest in real-time content filtering that can keep up with the demands of real-time applications, such as those that deliver web pages over the Internet. Such applications usually have stringent time constraints: a person browsing the Internet may be annoyed by a delay of a few seconds, or even a single second, when requesting a webpage.
An example of the use of content filtering to classify an unknown text is US published patent application 2002/0107926 to Lee (Lee '926). Lee '926 teaches analyzing incoming emails and routing them to a receiver based on their textual content. When a new email comes in, the system extracts keywords from the text (“detects words”) and checks the keywords in a decision tree in order to classify the text and route the email. Lee '926 does not disclose how to extract keywords from the text or where and how the results are stored. In the application of Lee '926 (email routing), a delay of a few tens of seconds or even minutes is not critical, because the email is a message sent to an anonymous server and there is no particular recipient who requested or is waiting for the email. The decision tree classification scheme of Lee '926 is useful for a limited population of texts (for example, an email pertaining to one of a few known possible matters). The decision tree classification scheme of Lee '926 is not configured to analyze complex logical rules.
Decision trees may be used for more complicated classification schemes calling up one or a few rules for actions. For example, U.S. Pat. No. 7,539,658 to Perazolo et al. (Perazolo '658) uses a decision tree to classify an event and choose a set of action rules. To work efficiently, the system of Perazolo '658 must limit either the number of rules or the number of attributes tested because there is a trade-off in efficiency between the number of rules to be evaluated and the number of keywords to be considered.
To reliably classify text, content analysis needs to be very flexible. This requires sensitivity to a large number of keywords (tokens) and a large rules base to classify the text on the basis of various nuances in use of the keywords and their number. Therefore, Perazolo '658, which cannot efficiently evaluate a large number of keywords and rules simultaneously, is not suitable for real-time content analysis of unfiltered text.
According to the teachings of both Perazolo '658 and Lee '926, the input to the decision trees is a plurality of attributes (keywords, tokens). The tokens are assumed to be all known and available at the beginning of the process. Thus, both Lee '926 and Perazolo '658 implicitly require detection of tokens by known prior art methods. These prior art methods often include comparing an extracted string to a dictionary of keywords. When the keyword dictionary is large, the search becomes time-consuming, even for a relatively small text.
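By way of illustration only, the cost of comparing extracted strings to a keyword dictionary can be sketched as follows. The sketch assumes an unsorted dictionary that is scanned linearly, which is only one possible prior art arrangement; the function name `count_keywords_linear` and the sample keywords are hypothetical.

```python
import re

def count_keywords_linear(text, dictionary):
    """Count dictionary keywords by comparing each extracted word to every entry.

    Returns the per-keyword counts and the total number of string comparisons.
    """
    counts = {}
    comparisons = 0
    for word in re.findall(r"[a-z]+", text.lower()):
        for keyword in dictionary:  # linear scan of the dictionary
            comparisons += 1
            if word == keyword:
                counts[keyword] = counts.get(keyword, 0) + 1
                break
    return counts, comparisons

counts, comparisons = count_keywords_linear(
    "poker and casino poker", ["casino", "poker", "lottery"]
)
print(counts, comparisons)
```

In the worst case the number of comparisons is the number of words in the text multiplied by the number of dictionary entries, so the work grows with the dictionary even when the text is short.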
An alternative prior art method for extracting keywords from a text for further content analysis is to convert the text into a suffix tree. Converting a large text into a tree and then quantifying a large set of phrases in the tree requires significant memory and time. For example, U.S. Pat. No. 7,822,743 to Henkin et al. (Henkin '743) teaches both on- and off-line content analysis. In the off-line mode, without strict time constraints, Henkin '743 teaches use of suffix tree analysis, but for online applications (where time and memory limitations may be significant) Henkin '743 relies on a more limited grammar-based analysis.
Thus, prior art content analysis and keyword extraction technologies, such as linear dictionaries, suffix trees, and the technologies of Henkin '743, Perazolo '658 and Lee '926, are not suited to reliably differentiating between desired and undesired content on-line without obtrusive delays and within reasonable constraints of memory and processing power. There is therefore a need for a fast, efficient content analysis system for real-time classification of desired and undesired on-line content.
Recently, in the field of virus detection, U.S. Pat. No. 6,980,992 to Hursey et al. (Hursey '992) disclosed a method for combining virus signatures into a tree structure for real-time detection of virus strings. The detection tree approach of Hursey '992 is particularly suited to virus detection, wherein a virus can be positively identified by the detection of a single long string that will almost never occur except in the virus. Detection of that single string is therefore sufficient to identify the presence of a virus. The methodology of Hursey '992 is not sufficient for textual content analysis, because understanding the underlying content of a presentation requires analysis of context and not merely identification of a single predetermined pattern. In particular, text strings (called keywords) are often short, and a given keyword may occur in texts having different contents. Therefore, to identify content it is often important to know the incidence of a large number of different keywords. This means tracking the number of times each particular keyword occurs and judging the relationship of associated keywords in the text.
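By way of illustration only, the difference between detecting a single signature and tracking the incidence of many keywords can be sketched with a simple character tree (trie). The sketch is neither the method of Hursey '992 nor a definitive implementation of any filter; the names `build_trie`, `count_keywords` and `END`, and the sample keywords, are hypothetical.

```python
END = ""  # end-of-keyword marker; never equal to a single text character

def build_trie(keywords):
    """Combine all keywords into one character tree."""
    root = {}
    for kw in keywords:
        node = root
        for ch in kw:
            node = node.setdefault(ch, {})
        node[END] = kw
    return root

def count_keywords(text, trie):
    """Count every occurrence of every keyword, including overlapping ones."""
    counts = {}
    for start in range(len(text)):
        node = trie
        i = start
        # Walk the tree for as long as the text keeps matching some keyword.
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
            if END in node:
                kw = node[END]
                counts[kw] = counts.get(kw, 0) + 1
    return counts

trie = build_trie(["bet", "betting", "casino"])
print(count_keywords("betting at the casino, bet now", trie))
```

A signature scanner can stop at the first match. The sketch instead continues through the whole text and keeps a count for each keyword, because the classification depends on how many times each keyword occurs and on which keywords occur together. Matching here is by substring, so "bet" is also counted inside "betting".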
Thus, none of the above-cited prior art is suited to detecting keywords in a text and performing content analysis on the text in real time. Therefore, it is desirable to have an unobtrusive filter that can reliably analyze content in real time. The filter should evaluate a large number of keywords and rules in a short period of time for real-time application.