Webpage content indexing systems, commonly known as “crawlers”, require high-bandwidth connections and usually reside on a static IP address or subnet. Because requests from such a system can therefore be recognized by their source address, websites containing malicious content or malicious software can serve the indexing system different content from that served to ordinary users, thereby spoofing or misrepresenting the website content and potentially confusing users and/or the webpage content indexing system. Such websites can also potentially exploit a ranking mechanism employed by a content indexing system, for example by obtaining a higher search result ranking than the website deserves when a user performs an Internet search.
As used herein a “threat” includes malicious software, also known as “malware” or “pestware”, which includes software that is included or inserted in a part of a processing system for a harmful purpose. The term threat should be read to include possible, potential and actual threats. Types of malware can include, but are not limited to, malicious libraries, viruses, worms, Trojans, adware, malicious active content and denial of service attacks. In the case of invasion of privacy for the purposes of fraud or theft of identity, malicious software that passively observes the use of a computer is known as “spyware”.
A hash function (i.e. a message digest function, e.g. MD5) can be used for many purposes, for example to establish whether a file transmitted over a network has been tampered with or contains transmission errors. A hash function uses a mathematical rule which, when applied to a file, generates a hash value, i.e. a number, usually between 128 and 512 bits in length. This number is then transmitted with the file to a recipient, who can reapply the mathematical rule to the file and compare the resulting number with the original number. If the two numbers differ, the file has been altered or corrupted in transit.
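The comparison described above can be sketched as follows. This is a minimal illustration only, not a prescribed implementation: it assumes Python's standard `hashlib` module and uses SHA-256 (a 256-bit hash value) in place of the MD5 example named above. The function names `digest` and `is_intact` and the sample byte strings are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Apply the hash rule to the data and return the hash value as hex text."""
    # Assumption: SHA-256 is used here instead of MD5.
    return hashlib.sha256(data).hexdigest()


def is_intact(received: bytes, transmitted_digest: str) -> bool:
    """Recompute the hash value at the recipient and compare it with the one sent."""
    return digest(received) == transmitted_digest


# Sender side: compute the hash value and transmit it alongside the file.
original = b"example file contents"
sent_digest = digest(original)

# Recipient side: an unmodified file matches, a modified file does not.
ok = is_intact(original, sent_digest)
tampered = is_intact(b"example file c0ntents", sent_digest)
```

In this sketch `ok` evaluates to true and `tampered` to false, since changing a single byte of the file changes the recomputed hash value.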
A crawler, which may also be termed a robot or a spider, is a program that automatically explores the World Wide Web by retrieving a document and recursively retrieving at least some of the documents referenced within that document. Different algorithms are used to select which particular references to follow, depending on the purpose of the program. Crawlers can be used to build an index of referenced documents or may simply seek to validate references in a document.
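By way of a hedged illustration, the traversal performed by a crawler may be sketched as below. The sketch assumes a small in-memory mapping, `DOCUMENTS`, standing in for the web, and uses an iterative breadth-first queue rather than literal recursion; both are choices made for the example. The names `crawl`, `fetch_references` and `should_follow` are hypothetical, with `should_follow` representing whichever selection algorithm suits the purpose of the program.

```python
from collections import deque

# Hypothetical stand-in for the web: each document maps to the references it contains.
DOCUMENTS = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}


def crawl(start, fetch_references, should_follow=lambda ref: True, limit=100):
    """Retrieve `start`, then the documents it references, and so on.

    Returns the documents in the order they were retrieved.
    """
    visited = []
    seen = {start}          # avoids retrieving the same document twice
    queue = deque([start])
    while queue and len(visited) < limit:
        doc = queue.popleft()
        visited.append(doc)
        for ref in fetch_references(doc):
            # The selection rule decides which references are followed.
            if ref not in seen and should_follow(ref):
                seen.add(ref)
                queue.append(ref)
    return visited


order = crawl("a", lambda doc: DOCUMENTS.get(doc, []))
filtered = crawl("a", lambda doc: DOCUMENTS.get(doc, []), lambda ref: ref != "b")
```

Here `order` visits every reachable document once despite the cycle between "a" and "c", while `filtered` shows how a different selection rule narrows the set of documents retrieved.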
An index can be used to allow relatively quick searching based on, for example, text, keyword, or a variety of other search mechanisms, to locate documents in a database. Particular properties of documents may be indexed in a database to facilitate retrieval and/or searching. The action of updating the index is commonly referred to as indexing.
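One common form of index, offered here only as an illustrative assumption, is an inverted index that maps each keyword to the documents containing it, so that a search does not need to scan every document. The names `build_index` and `search` and the sample documents are hypothetical, and words are split on whitespace purely for simplicity.

```python
from collections import defaultdict


def build_index(documents):
    """Index each document: map every lower-cased word to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, keyword):
    """Look up a keyword and return the matching document ids in sorted order."""
    return sorted(index.get(keyword.lower(), set()))


docs = {
    1: "crawler retrieves documents",
    2: "index of documents",
    3: "Hash value",
}
index = build_index(docs)
hits = search(index, "documents")
```

In this sketch `hits` contains the identifiers of documents 1 and 2. Adding a new document would require only that its words be added to `index`, which corresponds to the indexing action described above.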
A cryptographic hash is a mathematical function used to map values from a large domain into a smaller domain. A cryptographic hash is normally a one-way function as it is computationally infeasible to find any input which maps to a known output. A cryptographic hash is normally collision-free as it is computationally infeasible to locate any two distinct inputs which map to produce the same output.
In a networked information or data communications system, a user has access to one or more terminals which are capable of requesting and/or receiving information or data from local or remote information sources. In such a communications system, a terminal may be a type of processing system, computer or computerized device, personal computer (PC), mobile, cellular or satellite telephone, mobile data terminal, portable computer, Personal Digital Assistant (PDA), pager, thin client, or any other similar type of digital electronic device. The capability of such a terminal to request and/or receive information or data can be provided by software, hardware and/or firmware. A terminal may include or be associated with other devices, for example a local data storage device such as a hard disk drive or solid state drive.
An information source can include a server, or any type of terminal, that may be associated with one or more storage devices that are able to store information or data, for example in one or more databases residing on a storage device. The exchange of information (i.e. the request and/or receipt of information or data) between a terminal and an information source, or other terminal(s), is facilitated by a communication means. The communication means can be realized by physical cables (for example a metallic cable such as a telephone line), semi-conducting cables, electromagnetic signals (for example radio-frequency signals or infra-red signals), optical fibre cables, satellite links or any other such medium or combination thereof connected to a network infrastructure.
There is a need for a method, system and/or computer program product which addresses or at least ameliorates one or more problems inherent in the prior art.
The reference in this specification to any prior publication (or information derived from the prior publication), or to any matter which is known, is not, and should not be taken as an acknowledgment or admission or any form of suggestion that the prior publication (or information derived from the prior publication) or known matter forms part of the common general knowledge in the field of endeavor to which this specification relates.