Many tools today purport to discover new security vulnerabilities in binary software. However, these tools often yield false positives and, most importantly, frequently fail to find many types of vulnerabilities. Many vulnerabilities in software products are publicly announced, and given a binary file, it may be desirable to identify the publicly known vulnerabilities in that binary. This is typically accomplished by manually searching a vulnerability database using the package name and version number of the binary file. However, package names and version numbers are not readily visible in many types of binary files, and there are no known techniques for automatically extracting program version information from binary files for the purpose of performing a lookup in a vulnerability database.
It is commonly known in the field that binary executables and libraries typically contain American Standard Code for Information Interchange (ASCII) text. The GNU strings utility, part of the binutils package since at least 1991, is designed to extract such text from binary data. It is a common convention for most command-line based software that runs on UNIX-like platforms including macOS and Linux to contain a message with the package or product name and the software version number somewhere in the binary executable. These messages are often, but not always, displayed to the user when the relevant binary is invoked with a special flag. Similarly, many software products embed this information in a binary along with a copyright message.
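By way of illustration only, the kind of text extraction performed by a strings-style utility can be sketched in a few lines of Python. The function name `extract_strings`, the minimum run length of four characters, and the sample byte sequence are assumptions made for this sketch; it approximates the idea of scanning for runs of printable ASCII and is not a reimplementation of the GNU utility.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return every run of printable ASCII bytes at least min_len long.

    Printable is taken here to mean bytes 0x20 through 0x7e, which is an
    assumption of this sketch rather than the exact rule of any tool.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# Hypothetical fragment of a binary containing an embedded version banner.
blob = b"\x7fELF\x02\x01\x00\x00OpenSSL 0.9.8b 04 May 2006\x00\x01\x02ab\x00"

for text in extract_strings(blob):
    print(text)
```

In this sample only the embedded banner survives the length filter; the shorter runs are discarded as probable noise.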
Regular expressions (regexes), which were invented in 1951, are commonly used to parse text strings and match substrings conforming to certain patterns. Regex pattern matching has been implemented in a variety of programming languages including C, C++, Perl, Python, Java, JavaScript, PHP and others, and is routinely used by programmers for text parsing. There have been attempts in the past to use regexes to parse version numbers out of arbitrary text strings. However, because the set of possible input data to a regular expression is very large, and because what constitutes a version string is ambiguous, these regex-based approaches yield imperfect results. For example, 0.9.8b may be a valid version string for one software product, whereas 0.9.8beta is not. Alternatively, the inverse may be true for another product: 0.9.8beta is valid but 0.9.8b is not. These two requirements are contradictory; a single regex cannot properly match version numbers for both products.
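The contradiction can be made concrete with a short Python sketch. The two patterns below, `PRODUCT_A` and `PRODUCT_B`, are hypothetical and written only for this illustration: the first accepts a single trailing letter, the second accepts a named pre-release suffix. Neither is taken from any existing tool.

```python
import re

# Hypothetical rule for a product whose versions may end in ONE letter,
# such as 0.9.8b. The lookarounds reject partial matches inside longer tokens.
PRODUCT_A = re.compile(r"(?<![\w.])\d+(?:\.\d+)+[a-z]?(?![\w.])")

# Hypothetical rule for a product whose versions may end in a named
# pre-release tag, such as 0.9.8beta.
PRODUCT_B = re.compile(r"(?<![\w.])\d+(?:\.\d+)+(?:alpha|beta|rc\d*)?(?![\w.])")

banner_a = "OpenSSL 0.9.8b 04 May 2006"
banner_b = "libfoo 0.9.8beta"

# Each pattern works on the product it was written for...
print(PRODUCT_A.findall(banner_a))
print(PRODUCT_B.findall(banner_b))

# ...and finds nothing when applied to the other product's banner.
print(PRODUCT_A.findall(banner_b))
print(PRODUCT_B.findall(banner_a))
```

Loosening either pattern until it accepts both forms would also make it accept strings that are invalid for one of the two products, which is the imperfection described above.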
Numerous tools have been created to cross-reference a binary or a list of binaries with a vulnerability database to yield a set of known vulnerabilities in a list of software. These tools have invariably required that the package name and version be known before any cross-referencing can occur. For example, cve-check-tool (https://github.com/clearlinux/cve-check-tool), created by Intel, can find vulnerabilities in Linux packages installed on a desktop operating system by cross-referencing a pre-determined list of installed packages from a supported package manager with a vulnerability database. Other, similar tools can find known vulnerabilities in a software binary if the package name and version number are already known. For example, U.S. Patent Publication No. 2014/0082733A1 describes a system that can find known vulnerabilities in software assets provided that the asset has already been identified.
These known tools are deterministic; they do not support uncertainty in either the package name or the version number. That is, if a package name or version is not quite correct, the tools fail to find vulnerabilities. Furthermore, the tools do not analyze individual executables directly but rather they analyze information about executables, such as metadata.
Other known systems resolve a set of vulnerabilities to a list of packages where the names in the package list do not exactly match the names used in the vulnerability records. For example, U.S. Pat. No. 10,089,473 describes matching vulnerabilities to a pre-existing list of software while tolerating imprecision in the software names. It approaches the problem using lexical distance measurements and a form of fuzzy matching on the Common Platform Enumeration (CPE) name. However, the method described therein does not start with a binary, and the starting list used contains substantially more information than can typically be gleaned from a binary.
For example, binaries do not typically include a vendor name, product name and version in a parseable format. Specifically, it is typically only a <name, version> pair that can be readily parsed out. Furthermore, versions in a binary are often only extractable with imprecision that is not found in a manifest of installed products.
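As a rough illustration of fuzzy matching on names in general, and not of the method of the cited patent, the following Python sketch compares a name recovered from a binary against a small list of product names using the similarity ratio from the standard `difflib` module. The list `KNOWN_PRODUCTS`, the function `closest_products`, and the cutoff of 0.6 are assumptions chosen for this example.

```python
import difflib

# Hypothetical product names as they might appear in a vulnerability database.
KNOWN_PRODUCTS = ["openssl", "openssh", "busybox", "zlib", "libpng"]


def closest_products(candidate: str, cutoff: float = 0.6) -> list[str]:
    """Return up to three known names similar to candidate, best match first.

    difflib's ratio is used here as a stand-in for a lexical distance
    measurement; other metrics could be substituted.
    """
    return difflib.get_close_matches(
        candidate.lower(), KNOWN_PRODUCTS, n=3, cutoff=cutoff
    )


# A name recovered with slight imprecision still resolves to a known product.
print(closest_products("open-ssl"))

# A name with no resemblance to any known product yields no candidates.
print(closest_products("qqqq"))
```

Because only a <name, version> pair is available, a sketch like this has no vendor field with which to disambiguate similar names, so several candidates may be returned for a single input.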
Most importantly, related work has not solved the problem of generating a list of software to cross-reference with a database. This is a non-trivial task, and a proper solution is a significant contribution to the field of vulnerability mapping. To date, there have not been any successful attempts to extract package names and version numbers from a binary file without any additional outside context, to select candidates for cross-reference either individually or collectively, and then to cross-reference the information with a database using a fuzzy-matching technique that mitigates the potential errors caused by applying a regex to an arbitrary text string. Such an approach would be useful in discovering known vulnerabilities in software on platforms where a list of installed packages is not available, as is commonly the case on Linux-based firmware images.