There is often a need to operate on a list of small strings, such as domain names, as a single set of data loaded into memory. For example, some operations on domain names require a list of all domain names, or at least a large subset of them. However, because the number of domain names is large (on the order of 100 million), an operation that accesses such a list may be limited by available memory. Thus, it becomes important to reduce the memory requirement of such an operation by compressing the domain names within the list.
General-purpose compression algorithms that are effective with larger documents or files (such as the LZ family of compression algorithms) may be less effective with small strings, and may even produce “compressed” outputs that are larger than their inputs. Accordingly, to effectively reduce the size of the domain name list and the corresponding memory required to retain the domain names in memory, the inventors developed a new compression scheme tailored to small strings, which takes advantage of unique features of small strings, and of domain names in particular.
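The size increase noted above can be observed with a short Python sketch. It uses the standard-library `zlib` module (DEFLATE, an LZ77-based method) purely as a stand-in for a general-purpose compressor; the sample names and the helper `compressed_size` are illustrative assumptions and are not part of the disclosure.

```python
import zlib

# Hypothetical sample strings; any short domain names would do.
names = ["example.com", "example.org", "example.net"]

def compressed_size(text: str) -> int:
    """Return the size in bytes of the zlib output for one string."""
    return len(zlib.compress(text.encode("ascii"), 9))

for name in names:
    # Print the original length next to the "compressed" length.
    print(name, len(name), compressed_size(name))
```

When each string is compressed on its own, the fixed header and checksum of the zlib format, together with the lack of repeated substrings in so short an input, may outweigh any savings.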
Domain names are typically limited to letters (A-Z, not case-sensitive), digits (0-9), and hyphens (-), for a total of 37 possible characters. Domain names also typically contain 63 or fewer characters. Thus, the set of characters required to represent domain names is limited. Other sets of small strings may possess characteristics similar to those of domain names that limit the number of characters required to fully represent the small strings. Small strings may be defined as strings with limited length and/or a limited character set forming the strings. This is in contrast to, for example, lengthy data that may require a large character set for representation, such as large, complex documents or high-quality photographs. Examples of small strings may include domain names and physical addresses, which may be strings with limited length; DNA sequences, which may be strings with a limited character set; and phone numbers, which may be strings with both limited length and a limited character set.
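To illustrate why a limited character set matters, note that 37 symbols can be distinguished with 6 bits each, since 2^5 = 32 < 37 <= 64 = 2^6, whereas a conventional single-byte encoding spends 8 bits per character. The following Python sketch packs a string into fixed 6-bit codes. It is only an illustration of this arithmetic under stated assumptions and is not the compression scheme of the disclosure; the names `ALPHABET`, `pack`, and `unpack` are hypothetical, and the string length is assumed to be stored separately.

```python
import math
import string

# Hypothetical alphabet: the 37 characters named in the text.
ALPHABET = string.ascii_lowercase + string.digits + "-"
BITS_PER_CHAR = math.ceil(math.log2(len(ALPHABET)))  # 6 bits for 37 symbols

def pack(text: str) -> bytes:
    """Pack a string into fixed-width codes, left-aligned in whole bytes."""
    value = 0
    for ch in text.lower():
        value = (value << BITS_PER_CHAR) | ALPHABET.index(ch)
    n_bits = BITS_PER_CHAR * len(text)
    n_bytes = (n_bits + 7) // 8
    value <<= n_bytes * 8 - n_bits  # pad the final byte with zero bits
    return value.to_bytes(n_bytes, "big")

def unpack(data: bytes, length: int) -> str:
    """Recover `length` characters from bytes produced by pack()."""
    value = int.from_bytes(data, "big")
    total_bits = len(data) * 8
    mask = (1 << BITS_PER_CHAR) - 1
    chars = []
    for i in range(length):
        shift = total_bits - BITS_PER_CHAR * (i + 1)
        chars.append(ALPHABET[(value >> shift) & mask])
    return "".join(chars)

packed = pack("example-domain")
print(len("example-domain"), len(packed))
print(unpack(packed, len("example-domain")))
```

Under these assumptions, a 14-character string occupies 84 bits, which rounds up to 11 bytes rather than 14.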
Accordingly, it is an object of embodiments of the disclosure to provide methods, systems, and non-transitory computer-readable storage media storing programs for compressing a set of small strings. Other objects and advantages of embodiments of the disclosure may be apparent in view of the description of exemplary embodiments below.