With the rapid global expansion of the World Wide Web, the Internet has become a primary channel through which people disseminate and obtain information, news in particular. Text and pictures can easily be obtained on a computer screen via the Internet. At the same time, the quantity, styles, and distribution channels of news content on the Internet are increasing. Email, Internet newsgroups, forums, and websites have all made the Internet an important media outlet.
The information on the Internet is vast and complex. It includes much good, progressive, and useful information, but also much controversial material such as pornography, racism, and false information. The Internet is rapidly becoming a battleground for ideas. Moreover, because of the anonymity that browsing the Internet affords, more and more people are willing to express themselves through this channel. The rapid explosion of public opinion on the Internet might gradually become a threat to social security in the form of a “content threat”.
A network monitoring system could exert effective control over this complex Internet information. However, most traditional network monitoring systems are ineffective against the “hide and seek” tactic used by certain undesired URLs, whose contents are often deleted and restored repeatedly. Therefore, it is desirable to have a new web information detecting system with high accuracy.
There are a number of web information detecting methods currently deployed in various countries.
1. One detection method mainly utilizes XMLHTTP-based techniques and properties to obtain information from the server. From the status code of the returned response, such a method can determine whether the web page has been deleted. However, this method only indicates whether the URL has been deleted; it does not indicate whether the contents of the original URL have been deleted or changed. Such a method can therefore be relatively inaccurate.
2. Another detection method obtains the status code from the HTTP response. Whether the URL has been deleted is determined from a status code of 200 (OK) or 404 (Not Found). Such a method cannot examine the contents of the web page; it can determine only the deletion status of the URL. The accuracy of this detection method is relatively low.
3. Yet another detection method has also been proposed, in which the domain name is resolved into an IP address and the URL is checked for deletion, specifically by determining whether the socket connection is normal. This method is also inadequate if the contents have already been deleted, because the server itself may still be reachable.
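The status-code and socket checks described in the list above can be sketched roughly as follows. This is only an illustrative sketch using the Python standard library, not the implementation of any particular deployed system. The function names, the timeout values, and the assumption that 404 is the status code treated as "deleted" (the `deleted_codes` parameter) are all hypothetical choices made for the example.

```python
import socket
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def status_code(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is still available.
        return err.code


def url_deleted_by_status(url: str, deleted_codes=(404,)) -> bool:
    """Status-code check: report 'deleted' only when the code is in `deleted_codes`.

    Assumption: 404 (Not Found) is the code that signals deletion.
    The response body is never inspected.
    """
    return status_code(url) in deleted_codes


def host_reachable(url: str, timeout: float = 5.0) -> bool:
    """Socket check: resolve the host name to an IP address and try a TCP connection."""
    parts = urlsplit(url)
    if not parts.hostname:
        return False
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        ip = socket.gethostbyname(parts.hostname)
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures (socket.gaierror), refused connections, and timeouts.
        return False
```

Neither function reads the page body. A page whose text has been blanked or replaced but still answers with 200 would be reported as present by both checks.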
The existing methods above for detecting web information generally have low accuracy. Most of them rely on the returned status code to determine whether the URL in question has been deleted. Not only do they have difficulty detecting whether a URL exists, but they also cannot determine whether the content has been erased or changed.
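The gap described here can be illustrated with a small sketch in which two snapshots of the same URL carry the same status code but different bodies. The `Snapshot` type, the function names, and the use of a SHA-256 digest as a content fingerprint are assumptions made purely for illustration. They are not taken from any of the methods above and are not presented as the detection method of this document.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record of one fetch of a URL: status code plus raw body."""
    status: int
    body: bytes


def status_says_unchanged(old: Snapshot, new: Snapshot) -> bool:
    """What a status-code-only monitor sees: identical codes look like 'no change'."""
    return old.status == new.status


def body_fingerprint(snap: Snapshot) -> str:
    """Digest of the body, used here only to compare two snapshots."""
    return hashlib.sha256(snap.body).hexdigest()


def content_changed(old: Snapshot, new: Snapshot) -> bool:
    """True when the bodies differ, regardless of the status codes."""
    return body_fingerprint(old) != body_fingerprint(new)


# A page whose text was erased while the URL keeps answering 200.
before = Snapshot(status=200, body=b"<p>original article</p>")
after = Snapshot(status=200, body=b"")
```

With these two snapshots, `status_says_unchanged(before, after)` is true while `content_changed(before, after)` is also true, which is exactly the case a return-code check cannot distinguish.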