1. Field of the Invention
The invention relates to data manipulation and categorization in general, and specifically to processing of textual data for categorization, content identification and authentication.
2. Background Information
With the advent of the electronic age and the Internet as a useful means for communication and storage of data, there is a need for systems for determining whether a given document was authored by a certain person, whether a given document is in a particular language, or what type of material a given document deals with. This need is not well addressed by present methods of textual analysis. At best, it is currently possible to analyze a given document using phrase or keyword searches and then have a human review the results of that analysis in an attempt to determine the document's authorship, content, or language. What is needed is a methodology that produces a result that can be more readily analyzed by a computer without human intervention. Additionally, what is needed is a methodology that can consider frequency of character utilization, keyword searches, and frequency of occurrence of phrases all at once rather than discretely.
The present invention provides a method and apparatus for content identification and categorization of data. In one embodiment, a Burrows-Wheeler Transform is performed on a document of textual data to produce a set of transformed textual data. The transformed textual data is divided into a set of one or more intervals. The transformed textual data of the set of intervals is then further transformed to produce a pattern map. The pattern map is compared to a reference pattern map, thereby producing an indication of whether the subject textual data is of a type corresponding to the reference pattern map.
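By way of illustration only, the sequence of steps described above might be sketched in Python as follows. The text above does not specify how the pattern map is derived from the intervals or how two pattern maps are compared, so this sketch assumes a per-interval relative character frequency for the pattern map and a cosine similarity with a fixed threshold for the comparison. All function names, the sentinel character, the interval count, and the threshold value are hypothetical choices made for the example and are not taken from the disclosure.

```python
from collections import Counter
from math import sqrt

SENTINEL = "\x00"  # assumed end-of-text marker; must not occur in the input


def bwt(text):
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations."""
    if SENTINEL in text:
        raise ValueError("input must not contain the sentinel character")
    s = text + SENTINEL
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def split_intervals(data, n_intervals):
    """Divide data into at most n_intervals contiguous, near-equal slices."""
    size = max(1, -(-len(data) // n_intervals))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def pattern_map(text, n_intervals=4):
    """Assumed pattern map: relative character frequency within each interval."""
    transformed = bwt(text)
    pmap = {}
    for idx, chunk in enumerate(split_intervals(transformed, n_intervals)):
        for ch, count in Counter(chunk).items():
            pmap[(idx, ch)] = count / len(chunk)
    return pmap


def similarity(a, b):
    """Cosine similarity between two pattern maps (an assumed comparison)."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def matches(subject, reference, threshold=0.8):
    """Indicate whether the subject text resembles the reference text."""
    return similarity(pattern_map(subject), pattern_map(reference)) >= threshold
```

In this sketch, a call such as matches(subject_text, reference_text) returns True when the two pattern maps are sufficiently alike under the assumed measure. The rotation-sorting transform is chosen for readability and is practical only for short inputs; a longer document would call for a suffix-array-based construction.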