The term web filtering describes the process by which companies restrict or monitor their employees' internet use. Web filtering is achieved by a number of simple and complex means; the key types are described below.
Firms without any web filtering solution in place can still read users' URL requests from logs stored on their “firewall”. This enables at least some review of access to take place, albeit after access to the website has occurred. Such review is labor intensive and inadequate in providing substantive proof that a particular user was responsible for the inappropriate access.
Black lists in software list undesirable web addresses and prevent access to those sites. White lists list acceptable web addresses and are often used to restrict access to only those sites contained on the white list. The scale of the internet is such that maintaining these lists is a challenge, and it is very frustrating for users who have a genuine reason or need to access a site but must first seek approval and have it included in the allowed URLs.
In more sophisticated solutions, black and white lists are used to list “exceptions to the rule” for users. For example, a user may not be allowed access to travel sites but is provided access to low cost airline websites to book flights.
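The list-based approach with exceptions can be illustrated with a short sketch. It is a minimal illustration only, not a description of any particular product: the host names, the category table and the function name `is_allowed` are all hypothetical, and the order of precedence (white list first, then black list, then category rule) is an assumption.

```python
from urllib.parse import urlparse

# Hypothetical example data; none of these hosts come from a real product.
BLOCKED_CATEGORIES = {"travel"}                 # the general rule for this user
HOST_CATEGORY = {
    "bigtravel.example": "travel",
    "cheapair.example": "travel",
}
WHITE_LIST = {"cheapair.example"}               # exceptions that are always allowed
BLACK_LIST = {"badsite.example"}                # exceptions that are always denied


def is_allowed(url: str) -> bool:
    """Apply list exceptions first, then fall back to the category rule."""
    host = (urlparse(url).hostname or "").lower()
    if host in WHITE_LIST:
        return True
    if host in BLACK_LIST:
        return False
    # Hosts with no known category fall through and are allowed.
    return HOST_CATEGORY.get(host) not in BLOCKED_CATEGORIES
```

With this data, a request to `cheapair.example` is allowed even though travel sites in general are blocked, while `bigtravel.example` is denied.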
One well known web filtering technique is to provide a database of thousands or millions of web addresses. By manually examining each of these web sites and categorizing the content on the web pages, it is possible to create “user profiles” whereby various levels of staff get controlled access to the internet. Together with reporting, this means that one user may get no access to the internet, a call center worker may have access to only a single web site, a secretary may have access to travel sites but not shopping sites, and a manager may have more general access with the exception of inappropriate (e.g. illegal, violent, racist or pornographic) sites.
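The following sketch shows one way such user profiles might be represented on top of a categorized URL database. Every name in it (`Profile`, `URL_DATABASE`, `PROFILES`, `profile_allows`) and every host is an assumption made for illustration; real products will differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical access profile for one level of staff."""
    allow_all: bool = False                      # general access by default
    allowed_categories: frozenset = frozenset()
    blocked_categories: frozenset = frozenset()
    allowed_hosts: frozenset = frozenset()       # individually permitted sites


# Hypothetical categorized URL database (host -> category).
URL_DATABASE = {
    "flights.example": "travel",
    "shop.example": "shopping",
    "adult.example": "pornography",
    "crm.example": "business",
}

PROFILES = {
    "no_access": Profile(),
    "call_center": Profile(allowed_hosts=frozenset({"crm.example"})),
    "secretary": Profile(allowed_categories=frozenset({"travel"})),
    "manager": Profile(
        allow_all=True,
        blocked_categories=frozenset({"illegal", "violence", "racism", "pornography"}),
    ),
}


def profile_allows(role: str, host: str) -> bool:
    """Decide access for a role from the category of the requested host."""
    profile = PROFILES[role]
    if host in profile.allowed_hosts:
        return True
    category = URL_DATABASE.get(host)            # None when the host is unlisted
    if category in profile.blocked_categories:
        return False
    if profile.allow_all:
        return True
    return category in profile.allowed_categories
```

Note that an unlisted host has no category, so a profile with general access lets it through regardless of its content.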
The demand for this type of approach has created a small number of database builders who typically invest heavily to maintain and grow their URL lists. Many thousands of URLs can be initially “harvested” using a variety of means, but manual categorization is still predominantly used to ensure accuracy.
Manual categorization requires each URL reviewer to read the content and look at the images on the website, decide what kind of web site it is, and categorize it in their database accordingly. The accuracy of this approach is variable: it is relatively easy to spot a pornography web site, for example, but less easy to identify anonymous proxy sites.
With limited time available to categorize each site, any web site that deliberately seeks to mislead a casual reviewer (e.g. a cookery site which, when examined thoroughly, turns out in fact to be pornography) can easily succeed in having the inappropriate address categorized as legitimate. Misclassifications are extremely frustrating for users and are a source of conflict between suppliers and clients.
A typical URL classifier will review some hundreds of web addresses daily; manual categorization typically covers 500 web sites per person per day. However, the Internet is said to be growing by something in the region of 7.5M new or re-named web sites each day across the world. At that rate, 15,000 classifiers (7,500,000 / 500) would be required to classify these websites. The cost of employing such a large number of staff would be considerable.
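The staffing estimate follows directly from the two figures quoted above. The short calculation below reproduces it; the function name `classifiers_needed` is a label chosen here for illustration.

```python
def classifiers_needed(new_sites_per_day: int, sites_per_person_per_day: int) -> int:
    """Whole number of reviewers needed to keep pace, rounded up."""
    # Integer ceiling division avoids floating-point rounding.
    return -(-new_sites_per_day // sites_per_person_per_day)


NEW_SITES_PER_DAY = 7_500_000          # growth figure quoted in the text
SITES_PER_PERSON_PER_DAY = 500         # manual review rate quoted in the text

required_staff = classifiers_needed(NEW_SITES_PER_DAY, SITES_PER_PERSON_PER_DAY)
print(required_staff)
```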
Companies that offer web filtering services advertise them on the basis of the size of their database of categorized sites. It is usual for these databases to contain 15 to 17 million categorized sites. In the context of the overall web, these numbers are inconsequential, and where a user requests a URL not listed in the database, the user will be given access to the web site whatever its content. This is embarrassing for suppliers in this field and, in certain markets (e.g. youth, schools etc.), it is unacceptable.
In addition, web filtering suppliers are not particularly motivated to re-check URLs previously categorized. A significant volume of sites either re-name themselves or cease being operational, and deleting these from URL lists risks admitting to promoting a database which appears to be growing only slightly, to be static, or even to be declining. Therefore the claims of web filtering suppliers about database sizes require further inspection before any conclusion can be reached about their capability.
Image scanning is a useful (but not foolproof) means of blocking pornography from a client network. However, it tends to be expensive and is not a stand-alone solution to managing broad internet access policy.
Keyword Request Analysis examines the keywords being requested by the user, either within the web address being sought or within a search engine query. Implementations vary in sophistication. It is possible to let users tailor keyword handling so that vertical market needs are addressed. An example would be a Building College whose previous system stopped students searching for wire strippers, road hardcore etc. Systems can be tuned by users to allow these specific phrases but prevent searching for the word “strippers” alone.
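A minimal sketch of this kind of tunable keyword request check is given below. It assumes the requested keywords can be read from the host, path and query string of the URL; the word lists and the names `request_text` and `is_request_allowed` are hypothetical.

```python
import re
from urllib.parse import parse_qs, urlparse

# Hypothetical lists, tuned for the building college example.
BLOCKED_WORDS = {"strippers", "hardcore"}
ALLOWED_PHRASES = {"wire strippers", "road hardcore"}


def request_text(url: str) -> str:
    """Collect host, path and decoded query values as one lowercase string."""
    parsed = urlparse(url)
    query_values = [v for values in parse_qs(parsed.query).values() for v in values]
    return " ".join([parsed.netloc, parsed.path, *query_values]).lower()


def is_request_allowed(url: str) -> bool:
    """Allow listed phrases, then reject any remaining blocked word."""
    tokens = re.findall(r"[a-z]+", request_text(url))
    text = " " + " ".join(tokens) + " "
    for phrase in ALLOWED_PHRASES:
        # Remove whole-word occurrences of each permitted phrase.
        text = text.replace(" " + phrase + " ", " ")
    return not (set(text.split()) & BLOCKED_WORDS)
```

Here a search for "wire strippers" passes because the permitted phrase is removed before the blocked-word check, whereas a search for "strippers" alone is refused.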
Some products use “in line” page scanning. This typically works where the user has asked for a web address. Before the site is delivered to the user, the text on the web site is scored against pre-listed words (e.g. “gamble”, “poker”, “pontoons”). If the score adds up to a number above a certain threshold, the user can be denied access to the web site. The advantage of this approach is that the web site does not need to have been previously identified and categorized. The disadvantage is that the scanning is only effective where the words listed are scored: if offensive words or expressions, in whichever language, are not listed, the site will gain a low score and be allowed onto the user's PC. Additionally, keyword scanning (using the gambling example above) tends to be able to differentiate, say, a gambling web site from a “help from gambling” web site only if “good” words or combinations are listed and negate the score gained by the offensive words.
If the “good” words are not listed or used sufficiently often, perfectly reasonable sites (e.g. Health Education) can be wrongly blocked.
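The scoring scheme described above can be sketched as follows. The word weights, the threshold and the names `page_score` and `should_block` are assumptions chosen for illustration; only the words “gamble”, “poker” and “pontoons” come from the text.

```python
import re

# Hypothetical weights: offensive words add to the score, "good" words negate it.
BAD_WORD_SCORES = {"gamble": 5, "poker": 5, "pontoons": 3}
GOOD_WORD_SCORES = {"addiction": -6, "helpline": -6, "counselling": -6}
BLOCK_THRESHOLD = 10


def page_score(text: str) -> int:
    """Sum the weights of every listed word found in the page text."""
    weights = {**BAD_WORD_SCORES, **GOOD_WORD_SCORES}
    return sum(weights.get(word, 0) for word in re.findall(r"[a-z]+", text.lower()))


def should_block(text: str) -> bool:
    """Deny access only when the score is above the threshold."""
    return page_score(text) > BLOCK_THRESHOLD


casino = "Gamble tonight: poker tables and more poker"
help_site = "Gamble too much? Call our addiction helpline about poker"
unlisted = "Wetten und Gluecksspiel"   # gambling terms absent from the word list
```

In this sketch the casino text scores 15 and is blocked, the help text scores -2 because the listed “good” words offset the offensive ones, and the text using unlisted words scores 0 and is allowed, which is the weakness noted above.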
Finally, the scanning of web pages for keywords is most effective across a restricted range of web sites. Pornography is a good example, where much of the language used will be slang and unique to more offensive or inappropriate sites. Keyword scanning will not, for example, be able to tell whether a sports story is on a sports site or a news site.
Accordingly, there are a number of problems associated with current web filtering technologies.
Database products fall further behind every day as a result of the sheer scale of growth in the internet. No supplier can invest in enough resources to match the rate at which the internet grows and changes.
“English language” products with (predominantly) English language URL lists are proving ineffective at detecting inappropriate sites within a multicultural society.
Black and white lists, URL lists, Keyword Request review and keyword page scanning all depend upon previously scoped defence mechanisms. Where a web site does not fit in with previously scoped words, phrases, exceptions etc., the web filtering fails to be effective.
Keyword page scanning is most effective on web sites with an emphasis on language unique to a particular type of site, e.g. slang words on porn sites. The ability to identify mainstream web sites by keyword scanning is much more limited and, in some cases, completely ineffective.
It is an object of the present invention to provide an improved method of website categorization and filtering.
In accordance with a first aspect of the present invention there is provided a method for configuring website categorization software to automatically categorize websites from a predetermined category of website, the method comprising the steps of:
(i) identifying a website from the predetermined category;
(ii) reading a markup language description of the website;
(iii) extracting page content information from the markup language description;
(iv) analyzing the page content information;
(v) repeating steps (i) to (iv) for n websites in the predetermined category; and
(vi) creating a profile for the website category based upon a combination of the results of analyzing the page content information of n websites.
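Steps (ii) to (vi) above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the claimed method itself: the markup is assumed to be HTML already retrieved as strings, the "analysis" is assumed to be relative word frequency, and the "combination" is assumed to be a simple average over the n websites. The names `TextExtractor`, `extract_page_content`, `analyze` and `create_profile` are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from markup, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_page_content(markup: str) -> str:
    """Step (iii): extract page content information from the markup."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


def analyze(content: str) -> dict:
    """Step (iv): assumed analysis, the relative frequency of each word."""
    words = re.findall(r"[a-z]+", content.lower())
    total = len(words) or 1
    return {word: count / total for word, count in Counter(words).items()}


def create_profile(markups: list) -> dict:
    """Steps (v) and (vi): average the per-site results over n websites."""
    combined = {}
    for markup in markups:
        for word, freq in analyze(extract_page_content(markup)).items():
            combined[word] = combined.get(word, 0.0) + freq
    n = len(markups) or 1
    return {word: total / n for word, total in combined.items()}


# Hypothetical markup for two websites from the same predetermined category.
pages = [
    "<html><body><h1>Poker</h1><p>poker odds</p><script>var x=1;</script></body></html>",
    "<p>Poker bonus</p>",
]
profile = create_profile(pages)
```

The resulting profile weights “poker” most heavily because it dominates both pages, while script content contributes nothing. Any other analysis of the page content could be substituted in `analyze` without changing the overall loop.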