The amount of data being transmitted electronically is ever increasing. Electronic data may be in any language, may have been generated using any type of word-processor, may or may not be in a format that can be executed by a computer, may be uncompressed or compressed using any type of compression scheme, and so on. To automatically process electronic data properly, a truly automated data processing system must be able to identify the type of data contained in a received electronic file.
U.S. Pat. No. 5,062,143, entitled "TRIGRAM-BASED METHOD OF LANGUAGE IDENTIFICATION," discloses a method of using 3-grams to identify the language of a received document. U.S. Pat. No. 5,062,143 is limited by requiring the use of a fixed length n-gram. In addition, U.S. Pat. No. 5,062,143 does not attempt to further distinguish any data files from one another as does the present invention. Also, U.S. Pat. No. 5,062,143 does not include n-gram weighting as does the present invention. Furthermore, U.S. Pat. No. 5,062,143 results in one similarity determination and does not make multiple determinations as does the present invention. U.S. Pat. No. 5,062,143 is hereby incorporated by reference into the specification of the present invention.
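For illustration only, the general idea of identifying a language from fixed-length 3-grams can be sketched as follows. This is a minimal sketch under stated assumptions and is not the procedure claimed in U.S. Pat. No. 5,062,143: the function names, the lowercasing step, the use of relative frequencies, and the dot-product similarity are all hypothetical choices made for the example.

```python
from collections import Counter


def trigram_profile(text):
    """Relative frequencies of overlapping character 3-grams (assumed representation)."""
    text = text.lower()
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def similarity(profile_a, profile_b):
    """Dot product of two sparse 3-gram profiles (an assumed similarity measure)."""
    return sum(v * profile_b.get(gram, 0.0) for gram, v in profile_a.items())


def identify_language(text, reference_profiles):
    """Return the label of the reference profile most similar to the text."""
    profile = trigram_profile(text)
    return max(reference_profiles,
               key=lambda lang: similarity(profile, reference_profiles[lang]))


# Hypothetical reference samples; a real system would use much larger corpora.
references = {
    "english": trigram_profile(
        "the quick brown fox jumps over the lazy dog and the cat sat on the mat"),
    "spanish": trigram_profile(
        "el rapido zorro marron salta sobre el perro perezoso y el gato se sento en la alfombra"),
}
guess = identify_language("the dog and the cat", references)
```

Because every n-gram in this sketch has length three and the comparison yields a single best match, it also shows the two limitations noted above: a fixed n-gram length and a single similarity determination.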
U.S. Pat. No. 5,371,807, entitled "METHOD AND APPARATUS FOR TEXT CLASSIFICATION," discloses a device for and a method of classifying text as being similar to one or more known classes, and requires the received text to be in a natural language. The text is then parsed into a list of recognizable keywords. The keywords may be words, phrases, or regular expressions. From the list of keywords, certain facts about the received text are deduced. The keyword list of the received text is then compared to a list of text classes to see which class of text the received text most resembles. A similarity score for the received text is generated. From the similarity score, the received text is determined to most resemble appropriate classes. If the similarity score for the received text is above a user-definable threshold, the received text is determined to be of the class the received text most resembles. U.S. Pat. No. 5,371,807 may not be able to process text received in multiple languages, text from different word-processors, compressed text, executable code, or non-textual data as can the present invention, which is based on n-gram selection and not on keyword selection. U.S. Pat. No. 5,371,807 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,418,951, entitled "METHOD OF RETRIEVING DOCUMENTS THAT CONCERN THE SAME TOPIC," discloses a method of using an n-gram of a certain fixed length to characterize received data and known data. The commonality between the various files is then removed to further refine the characterization of each file. The refined characterization of the received file is then compared to the stored files to determine which of the stored files the received file is most similar to. U.S. Pat. No. 5,418,951 is limited by requiring fixed length n-grams. Beyond the removal of commonality, U.S. Pat. No. 5,418,951 does not attempt to further distinguish any data files from one another as does the present invention. U.S. Pat. No. 5,418,951 does not include n-gram weighting as does the present invention. Furthermore, U.S. Pat. No. 5,418,951 results in one similarity determination and does not make multiple determinations as does the present invention. U.S. Pat. No. 5,418,951 is hereby incorporated by reference into the specification of the present invention.
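The idea of removing commonality before comparing fixed-length n-gram characterizations can be illustrated with a short sketch. It assumes, hypothetically, that commonality is approximated by the mean n-gram vector of all files and that similarity is the cosine of the centered vectors; neither assumption is taken from U.S. Pat. No. 5,418,951, and all names are illustrative.

```python
from collections import Counter
from math import sqrt


def ngram_vector(text, n=3):
    """Relative frequencies of overlapping character n-grams of one fixed length n."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def remove_commonality(vectors):
    """Subtract the mean vector (assumed stand-in for what the files have in common)."""
    keys = set().union(*vectors)
    mean = {k: sum(v.get(k, 0.0) for v in vectors) / len(vectors) for k in keys}
    return [{k: v.get(k, 0.0) - mean[k] for k in keys} for v in vectors]


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(x * b.get(k, 0.0) for k, x in a.items())
    norm_a = sqrt(sum(x * x for x in a.values()))
    norm_b = sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical files: the first is the received file, the others are stored files.
files = ["the cat sat on the mat",
         "the cat sat on the rug",
         "stock prices fell sharply today"]
centered = remove_commonality([ngram_vector(f) for f in files])
sim_related = cosine(centered[0], centered[1])
sim_unrelated = cosine(centered[0], centered[2])
```

In this sketch the received file scores higher against the stored file that shares most of its n-grams than against the unrelated one, and the result is again a single similarity determination per pair of files.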
U.S. Pat. No. 5,463,773, entitled "BUILDING OF A DOCUMENT CLASSIFICATION TREE BY RECURSIVE OPTIMIZATION OF KEYWORD SELECTION FUNCTION," discloses a method of classifying documents based on keyword selection. The document classification method of U.S. Pat. No. 5,463,773 may not be optimal if received documents are in different languages. Also, a document classification method based on keywords may not work properly on non-textual data, compressed files, or executable code. The present invention, which is not based on keyword selection, overcomes all of these potential problems. U.S. Pat. No. 5,463,773 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,526,443, entitled "METHOD AND APPARATUS FOR HIGHLIGHTING AND CATEGORIZING DOCUMENTS USING CODED WORD TOKENS," discloses a device for and a method of identifying the topic of a received document by converting the words in a received document to abstract coded character tokens. Certain tokens are then removed based on a list of stop tokens. Numbers are included on the stop token list. The topic identification method of U.S. Pat. No. 5,526,443 may not be optimal for processing compressed documents, executable code, or non-textual documents as can the present invention, which does not use tokens or previously constructed stop lists. U.S. Pat. No. 5,526,443 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,548,507, entitled "LANGUAGE IDENTIFICATION PROCESS USING CODED LANGUAGE WORDS," discloses a method of identifying the language of a received document by comparing the words in the received document to a plurality of preconceived word frequency tables for various languages. Each language table includes preselected words in that language and a number associated with each word that represents the normalized frequency of occurrence of that word in that language. The words in the received document are compared to the words in each language table. If a word matches, the document receives the normalized frequency of occurrence value for that word. For each language table, the normalized frequency of occurrence numbers for each word in the received document that matches a word in the language table are added up to obtain a score that indicates how similar the received document is to the language of the language table. The language for which the received document receives the highest score is determined to be the language of the received document. The language identification method of U.S. Pat. No. 5,548,507 may not be optimal for processing compressed documents, executable code, or non-textual documents as can the present invention, which does not use preconceived normalized frequency of occurrence values. Such values may not be able to be generated for anything other than natural language documents. U.S. Pat. No. 5,548,507 is hereby incorporated by reference into the specification of the present invention.
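The table-based scoring described above can be sketched roughly as follows. This is an illustrative sketch only: the function names, the whitespace tokenization, and the frequency values in the tables are hypothetical and are not taken from U.S. Pat. No. 5,548,507.

```python
def score_against_table(words, table):
    """Sum the normalized frequency value of every document word found in the table."""
    return sum(table.get(word, 0.0) for word in words)


def identify_by_word_tables(text, tables):
    """Score the text against each language table and return (best language, all scores)."""
    words = text.lower().split()  # assumed tokenization: split on whitespace
    scores = {lang: score_against_table(words, table) for lang, table in tables.items()}
    return max(scores, key=scores.get), scores


# Hypothetical tables; the frequency values are invented for the example.
tables = {
    "english": {"the": 0.07, "of": 0.04, "and": 0.03},
    "french": {"le": 0.05, "de": 0.06, "et": 0.03},
}
best, scores = identify_by_word_tables("the history of the city and the river", tables)
```

The sketch makes the dependence on preconceived tables explicit: input that contains none of the tabulated words, such as compressed data or executable code, scores zero against every table and cannot be distinguished.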
U.S. Pat. No. 5,706,365, entitled "SYSTEM AND METHOD FOR PORTABLE DOCUMENT INDEXING USING N-GRAM WORD DECOMPOSITION," discloses a device for and a method of identifying documents that contain the n-grams of a natural language search query that has been parsed into a list of fixed length n-grams. The document retrieval method of U.S. Pat. No. 5,706,365 is not a method of identifying the type of data in an electronic file using n-grams as is the present invention, but a method of using n-grams to locate other documents that contain those n-grams. U.S. Pat. No. 5,706,365 is hereby incorporated by reference into the specification of the present invention.
U.S. Pat. No. 5,717,914, entitled "METHOD FOR CATEGORIZING DOCUMENTS INTO SUBJECTS USING RELEVANCE NORMALIZATION FOR DOCUMENTS RETRIEVED FROM AN INFORMATION RETRIEVAL SYSTEM IN RESPONSE TO A QUERY," discloses a method of storing a received document into a database having a plurality of document classes. Each received document is compared against a preconceived word list that is representative of one of the possible classes in the database. The class of the word list that compares most favorably to the received document is the class that the received document will be stored in. The storage method of U.S. Pat. No. 5,717,914 may not be optimal for processing compressed documents, executable code, or non-textual documents for which it may be impossible to generate a preconceived word list. The present invention can identify these types of data without having to generate a preconceived word list. U.S. Pat. No. 5,717,914 is hereby incorporated by reference into the specification of the present invention.
In an article entitled "Improving the retrieval of information from external sources," published in 1991 in Behavior Research Methods, Instruments, & Computers, pages 229-236, Susan T. Dumais discloses a statistical method called latent semantic indexing. This method does not rotate the information to any intuitively meaningful orientation as does the present invention. Furthermore, the method proposed by Dumais is more complex and, therefore, does not perform as well as the method of the present invention. Also, the specific steps of the present method and the equations used to characterize the data are different from the steps and equations used by Dumais.
In an article entitled "Using Linear Algebra For Intelligent Information Retrieval," published December 1995 in SIAM Review, Vol. 37, No. 4, pages 573-595, Michael W. Berry et al. disclose a statistical method that combines latent semantic indexing with singular value decomposition. Since the method of Berry et al. is based on latent semantic indexing, it does not rotate the information to any intuitively meaningful orientation as does the present invention. Furthermore, the method of Berry et al. is more complex and, therefore, does not perform as well as the method of the present invention. Also, the specific steps of the present method and the equations used to characterize the data are different from the steps and equations disclosed by Berry et al.