File format identification and validation may be used for data security. For example, when a file is transmitted electronically, the receiving end may identify the file type, which may aid in determining whether the file is free of harmful or intrusive software, including computer viruses, worms, Trojan horses, ransomware, spyware, adware, scareware, and other malicious programs, which can take the form of executable code, scripts, active content, and other software. A variety of methods for verifying the file format using a database are known in the art.
One method to determine the file format is to verify in the database a correspondence between the file name suffix (for example, “.doc”) and the file type (for example, a Microsoft Word file). This may be effective for popular file formats, but given the number of possible file name suffixes, the method may not be sophisticated enough to identify files from obscure software programs. Additionally, the file may not be saved with a file name suffix. Another method is to leverage the Multipurpose Internet Mail Extensions (MIME) standard to verify the given file format. For example, a set of MIME headers may be inserted at the beginning of the data transmission to indicate to the electronic device how the file should be opened or viewed. Public databases listing file types under the basic MIME standard are typically available.
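The suffix-based and MIME-based approaches described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `SUFFIX_DB`, `type_from_suffix`, `mime_from_name`, and `mime_from_headers` are hypothetical names introduced here, and the small in-memory table stands in for the database. The MIME lookups rely on Python's standard `mimetypes` and `email` modules.

```python
import mimetypes
from email import message_from_string
from pathlib import PurePath

# Hypothetical stand-in for the suffix database (assumption, not from the source).
SUFFIX_DB = {
    ".doc": "Microsoft Word file",
    ".pdf": "PDF document",
    ".png": "PNG image",
}


def type_from_suffix(filename):
    """Return the file type recorded for the file name suffix, or None."""
    suffix = PurePath(filename).suffix.lower()
    if not suffix:
        # The file was saved without a suffix, so this method cannot decide.
        return None
    # Suffixes missing from the database (obscure formats) also yield None.
    return SUFFIX_DB.get(suffix)


def mime_from_name(filename):
    """Guess a MIME type from the file name using the standard mapping."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime


def mime_from_headers(raw_message):
    """Read the MIME type declared in the headers that precede the data."""
    message = message_from_string(raw_message)
    return message.get_content_type()


if __name__ == "__main__":
    print(type_from_suffix("report.doc"))
    print(type_from_suffix("report"))
    print(mime_from_name("picture.png"))
    print(mime_from_headers("Content-Type: application/msword\n\n<file data>"))
```

Both lookups trust information supplied alongside the file (its name or its declared headers) rather than the file contents, which is why a missing or misleading suffix defeats them.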
Signature-based file type verification mechanisms may also be used to determine the file format. Such a mechanism performs a pattern match between a certain number of bytes in a part of the file and the entries of a signature database. A file signature is data used to identify or verify the contents of a file. In particular, it may refer to a “magic number,” which is generally a short sequence of bytes placed at the beginning of the file and used to identify the format of the file. In use, the magic number is looked up in a database to identify and verify the file format. For example, the magic number in the header of the file may be analyzed, and if the magic number corresponds to a pre-stored known file type, then the file format is determined to be the file format that corresponds to the magic number.
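The magic number lookup described above can be sketched as follows. This is an illustrative sketch under stated assumptions: `SIGNATURE_DB` and `identify_by_magic` are hypothetical names, and the short list stands in for a much larger signature database. The byte sequences shown are the well-known leading bytes of the listed formats.

```python
# Hypothetical stand-in for the signature database (assumption): each entry
# pairs the leading bytes of a format with a description of that format.
SIGNATURE_DB = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP-based container"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (e.g. legacy .doc)"),
]


def identify_by_magic(data):
    """Return the format whose magic number starts the data, or None."""
    for magic, file_format in SIGNATURE_DB:
        # Pattern match between the first bytes of the file and the database.
        if data.startswith(magic):
            return file_format
    return None


if __name__ == "__main__":
    print(identify_by_magic(b"%PDF-1.7\n..."))
    print(identify_by_magic(b"plain text with no signature"))
```

In practice the first few bytes of a file would be read from disk and passed to `identify_by_magic`; a format absent from the database is simply reported as unknown.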
Many databases, some of which may be public, exist for the purpose of file format verification. For example, a crowd-sourced machine learning system may be used to determine the file format from a binary signature. Such a system relies on community users to provide training samples. Unfortunately, it may be easily manipulated by an arbitrary user who creates a crafted sample set and thereby mis-trains the system. In another example, an open-source project may use an abstraction layer on top of the signature-based mechanism, performing byte pattern matching by consulting a database.
Because these conventional systems and methods rely on databases, the databases need to be kept up to date with a vast amount of data in order to cover file formats from a variety of software systems and applications. Moreover, a signature such as the magic number may be purposely modified, and therefore the security and trustworthiness of the file cannot be ensured.
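The weakness of relying on a magic number alone can be illustrated with a short sketch. The names `PNG_MAGIC` and `looks_like_png` are hypothetical, and the payload is a harmless placeholder; the point is only that a check limited to the leading bytes accepts any content that has been given a matching prefix.

```python
# Well-known leading bytes of a PNG file.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data):
    """Signature-only check: inspects nothing beyond the leading bytes."""
    return data.startswith(PNG_MAGIC)


# Placeholder content that is not an image at all.
script = b"#!/bin/sh\necho not an image\n"

# Purposely modified file: the PNG magic number is prepended to the script.
forged = PNG_MAGIC + script

if __name__ == "__main__":
    print(looks_like_png(script))
    print(looks_like_png(forged))
```

The forged data passes the signature check even though nothing after the first eight bytes is valid image data.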