This invention relates to a method of filtering sections of a data stream, in particular where the data stream comprises end point identifier data which has already been extracted from a more general data stream.
Specific examples of this are SPAM filtering based on email addresses, or extraction of web data based on URLs; however, the method of filtering is applicable to any set of data defined by an end point identifier.
In the example of email, lookup against a dictionary of target addresses is important in SPAM filtering for rejecting mail from known SPAM agents. However, the process of email address lookup against a target database can prove to be a significant performance bottleneck. The principal reason for this bottleneck is the processing overhead associated with checking whether each email address extracted from a data sample is in a database of target email addresses. In reality, the probability of obtaining a hit on the database with an arbitrary email address is less than 1%. Consequently, more than 99% of the lookup effort is spent rejecting addresses that are not in the database.
In accordance with the present invention, a method of filtering sections of a data stream comprises determining a set of characters of interest; testing each section of the data stream for the presence of one or more of the set of characters of interest; and extracting sections in which at least one of the characters is present.
The present invention reduces the number of occasions when a look up must be carried out by excluding from the look up stage those sections of the data stream which do not satisfy a minimum coincidence with characters in an end point identifier.
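Purely by way of illustration, the basic test may be sketched in Python as follows. The function name `prefilter`, the sample sections and the chosen characters of interest are hypothetical and are assumed only for this example; each section is taken to be a string already extracted from the data stream.

```python
def prefilter(sections, chars_of_interest):
    """Extract sections containing at least one character of interest."""
    wanted = set(chars_of_interest)
    # isdisjoint() is True when the section shares no character with the set,
    # so such sections are excluded before any look up is attempted.
    return [s for s in sections if not wanted.isdisjoint(s)]


# Hypothetical sample data: only the first section contains 'x' or 'z'.
sections = ["alice@example.com", "bob@test.org", "1234"]
kept = prefilter(sections, "xz")
print(kept)  # ['alice@example.com']
```

In this sketch a section is rejected by a simple membership test on its characters, which is intended to be cheaper than a look up against the full target database.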
Preferably, the method further comprises determining a further set of characters of interest; testing each section of the data stream for at least one character from the further set of characters; and extracting sections in which at least one of the characters from the further set of characters is also present in the section.
The method can be continued through several iterations by setting additional character sets.
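As a minimal sketch of such an iteration, and assuming that a section is extracted only if it contains at least one character from every set, the test may be written as follows. The name `prefilter_all` and the example character sets are hypothetical.

```python
def prefilter_all(sections, char_sets):
    """Extract sections that contain a character from each of the given sets."""
    sets = [set(cs) for cs in char_sets]
    return [
        s for s in sections
        # Every set must contribute at least one character to the section.
        if all(not cs.isdisjoint(s) for cs in sets)
    ]


# Hypothetical sample data: both sections contain 'x', but only the first
# also contains 'l' or 'p'.
sections = ["alice@example.com", "max@test.org", "1234"]
print(prefilter_all(sections, ["xz", "lp"]))  # ['alice@example.com']
```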
Preferably, the method comprises testing for the one or more of the set of characters in a predetermined order.
By requiring the characters to appear in a particular order, fewer incorrect extractions are made, but at a cost of an increased memory requirement whilst the extraction processing is undertaken.
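One possible reading of the ordered test, given here only as an assumption, is that the characters of interest must occur in the section in the stated order, although not necessarily adjacent to one another. Under that assumption a sketch is as follows; the name `contains_in_order` is hypothetical.

```python
def contains_in_order(section, ordered_chars):
    """Return True if ordered_chars occur in section in the given order."""
    it = iter(section)
    # 'ch in it' advances the iterator until ch is found, so each subsequent
    # character is searched for only in the remainder of the section.
    return all(ch in it for ch in ordered_chars)


print(contains_in_order("alice@example.com", "@."))  # True
print(contains_in_order("a.b@c", "@."))              # False: no '.' after '@'
```

The iterator holds the current position within the section between tests, which is the state that must be retained whilst the section is processed.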
Preferably, a skip function is applied, so that only predetermined characters in each section are tested against the set of characters of interest.
This allows testing of a specific character, such as the first or last character, without first testing all the characters leading up to that one.
Preferably, the first and last characters of a section are compared with the first and last characters of the set of characters of interest and the section extracted if there is a match.
This reduces the likelihood of an incorrect match, using well defined test characters.
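By way of example only, the first and last character test may be sketched as below. It is assumed here that the characters of interest are derived from the first and last characters of each target identifier; the name `first_last_match` and the target values are hypothetical.

```python
def first_last_match(section, targets):
    """Test only the first and last characters of a section, skipping the rest."""
    if not section:
        return False
    # Pairs of (first, last) characters taken from each target identifier.
    pairs = {(t[0], t[-1]) for t in targets if t}
    return (section[0], section[-1]) in pairs


targets = ["spam@bad.com", "junk@evil.org"]
print(first_last_match("sam@ok.com", targets))         # True: passed to look up
print(first_last_match("alice@example.com", targets))  # False: rejected
```

The first call shows that a section which is not a target may still be extracted; such sections are expected to be discarded at the subsequent look up.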
Preferably, the method comprises determining additional sets of characters of interest and testing for one or more of the set of characters in more than one set.
By testing for different character sets in parallel, throughput is increased, although some sections which are not of interest may be extracted as a result.
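One way in which several character sets might be tested together, offered as an assumption rather than as the described implementation, is to assign each set one bit and to accumulate a bit mask in a single pass over the section. The names `build_mask_table` and `sets_hit` are hypothetical.

```python
def build_mask_table(char_sets):
    """Map each character to a bit mask of the sets in which it appears."""
    table = {}
    for i, cs in enumerate(char_sets):
        for ch in cs:
            table[ch] = table.get(ch, 0) | (1 << i)
    return table


def sets_hit(section, table):
    """Return a bit mask with bit i set if the section hits character set i."""
    mask = 0
    for ch in section:
        mask |= table.get(ch, 0)
    return mask


# Hypothetical sets: set 0 is "xz", set 1 is "@".
table = build_mask_table(["xz", "@"])
print(sets_hit("alice@example.com", table))  # 3: both sets hit
print(sets_hit("bob@test.org", table))       # 2: only set 1 hit
print(sets_hit("1234", table))               # 0: no set hit
```

A section could then be extracted if the resulting mask is non-zero, or only if particular bits are set, depending on how strict the filter is required to be.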
Preferably, the section comprises an end point identifier, such as a domain name; an email address; a uniform resource locator; a telephone number; or a date and time.
The end point identifiers are not limited to these types, although they tend to be the most commonly searched ones, and the invention is equally applicable to filtering other types of end point identifier.
Preferably, the extracted sections are stored in a store.
Preferably, the extracted sections are input to a look up table and compared with specific stored end user identifiers; wherein sections which match the specific end user identifiers are stored and those which do not match are discarded.
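The combination of the filter and the look up stage may be sketched as follows, with a Python `set` standing in for the look up table. The name `filter_and_lookup`, the sample sections and the single character of interest are hypothetical, and it is assumed that the characters of interest are chosen so that every stored identifier contains at least one of them.

```python
def filter_and_lookup(sections, chars_of_interest, targets):
    """Filter sections by character, then look up the survivors.

    Returns the matching sections and the number of look ups performed.
    """
    wanted = set(chars_of_interest)
    target_set = set(targets)  # stand-in for the look up table
    stored = []
    lookups = 0
    for s in sections:
        if wanted.isdisjoint(s):
            continue  # excluded before the look up stage
        lookups += 1
        if s in target_set:
            stored.append(s)  # match: stored
        # no match: discarded
    return stored, lookups


sections = ["spam@bad.com", "alice@example.org", "bob@test.org"]
stored, lookups = filter_and_lookup(sections, "p", ["spam@bad.com"])
print(stored, lookups)  # ['spam@bad.com'] 2
```

In this example three sections are presented but only two look ups are carried out, the third section having been excluded by the character test.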