1. Field of the Invention
The present invention is generally related to compressing strings in a manner that allows for searching of a specific compressed string in the set of compressed strings. More specifically, the present invention is related to coding uniform resource locator (URL) strings for Internet accessibility analysis.
2. Description of the Related Art
The prior art discloses various methods for compressing text strings. For example, compressing code in a ZIP file does not allow random access for searching purposes: compressing one hundred strings into a ZIP file does not allow a search for the string at line thirty without first decompressing the archive. Further, the prior art discloses delivering data to customers using a one-way hash function to provide lookup capability for strings and for associating information about those strings (in this case categories, confidences, and reputations).
One example of a one-way hash function is an MD5 hash, which processes a variable-length string, file, or URL into a fixed-length output of 128 bits. Traditional MD5 string hashes provide a good balance between collision avoidance and length, but require an exact match between the input string and the hashed string to find it in the database. A slight variation in the input string causes a large variation in the resultant MD5 hash.
MD5 hash is well known in the prior art. Traditional hashing works by generating a specific hash value for a given string. An example is instructive:
MD5(“google.com”)=1d5920f4b44b27a802bd77c4f0536f5a
If just one character is added to the input URL, the output hash is radically different:
MD5(“google.com/”)=98f1c71b82281a60a7766c3355d575e6
Imagine a client looking up google.com in a database containing a series of hashes. If the client is off by just one character, a completely different hash is produced, and therefore google.com and its associated metadata (in this case classifications) will not be found in the database. Many applications of this technique exhibit “temporal locality,” the effect of many references to the same or similar strings over a short period of time. Therefore, if similar strings are “close” to each other in memory, modern computer systems can benefit from various caching systems to maximize spatial and/or temporal locality. Unfortunately, a side effect of hash functions, which uniformly distribute the hash keys of even very similar strings, is that they reduce the ability of modern computer systems to speed up access using common caching mechanisms such as disk controller caches, virtual memory paging, or the reading and storing of cache lines during memory reads.
URL                        MD5 hash
google.com/ig              9b35374eeef4881fbbe97f2d0cf01958
google.com/search=”art”    1ba6af01fb70cac2bbbc9d2c794cb693
google.com/finance/djia    14d36eab3f4e7601e39f203322bde406
https://google.com         c7b920f57e553df2bb68272f61570210
https:/google.com          91da3280cd9485ca6d0c77098a6ce507
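The sensitivity of MD5 to small input changes can be reproduced in a few lines of code. The following sketch uses Python's standard hashlib module; the helper name md5_hex is an illustrative choice, not part of any prior art system:

```python
import hashlib

def md5_hex(s: str) -> str:
    """Return the 32-character hexadecimal MD5 digest of a string."""
    return hashlib.md5(s.encode("utf-8")).hexdigest()

# Adding a single trailing slash produces a radically different digest.
print(md5_hex("google.com"))   # 1d5920f4b44b27a802bd77c4f0536f5a
print(md5_hex("google.com/"))  # a completely unrelated 128-bit value
```

Because the digests share no structure, a database keyed on md5_hex("google.com") cannot answer a lookup for the near-identical string "google.com/".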
Because there is no way to determine the input strings for a given one-way hash value, a traditional way of solving this problem is to require the client to “test” many variations of the input string until a match is found. This is a way to achieve a “Longest Common Prefix” or LCP search over a set of strings using one-way hash functions. Strings can be broken down from most specific (longest) to least specific (shortest), and the iterative reduction lookups accomplish the LCP search, providing a method to test for more or less specific matches of a string in this list.
For example, using the hashes in the table above:
URL                        LCP test
google.com/finance/djia    1 (exact match not found)
google.com/finance         2 (more general match not found)
google.com                 3 (most general and last test; matches where MD5 hash = 1d5920f4b44b27a802bd77c4f0536f5a)
In general, using LCP requires the client to perform many searches for each string, and no feedback is given as to how close the matched string is to the exact original string (although the client may use various methods to deduce this, such as string length comparisons, the number of LCP tests performed before a match is found, etc.).
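The iterative-reduction lookup described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the in-memory dictionary standing in for the server-side hash database, the rule of trimming one path segment per test, and the function names are all assumptions made for the example:

```python
import hashlib

def md5_hex(s: str) -> str:
    return hashlib.md5(s.encode("utf-8")).hexdigest()

# Hypothetical database mapping MD5 digests to metadata (classifications).
DATABASE = {
    md5_hex("google.com"): {"category": "Search Engines"},
}

def lcp_lookup(url: str):
    """Test progressively shorter prefixes of the URL (trimming one '/'
    segment per test) until a digest matches the database; return the
    matched prefix, the number of tests performed, and the metadata."""
    candidate = url
    tests = 0
    while candidate:
        tests += 1
        meta = DATABASE.get(md5_hex(candidate))
        if meta is not None:
            return candidate, tests, meta
        if "/" not in candidate:
            break  # nothing more general to try
        candidate = candidate.rsplit("/", 1)[0]
    return None

print(lcp_lookup("google.com/finance/djia"))
# Matches the most general prefix "google.com" on the third test,
# mirroring the three-row table above.
```

Note that the test count returned here is one of the signals a client could use to estimate how specific the match is.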
The prior art discloses many references pertaining to compression algorithms and/or search algorithms. For example, Hailpern et al., U.S. Pat. No. 7,383,299 for a System And Method For Providing Service For Searching Web Site Addresses discloses searching for an incorrectly spelled URL using fuzzy logic.
Tarquini, U.S. Pat. No. 7,472,167 for a System And Method For Uniform Resource Locator Filtering discloses URL filtering by determining a hash value for a specific URL and then searching a lexical search tree data structure to determine if a match is found indicating that the URL is hostile.
Davis, U.S. Pat. No. 7,443,841 for a Longest Prefix Matching (LPM) Using A Fixed Comparison Hash Table discloses forwarding Internet Protocol (IP) packets by hashing a portion of a fixed length key to obtain a hash value required for obtaining routing information for forwarding the IP packet.
Agarwal, U.S. Pat. No. 7,487,169 for a Method For Finding The Longest Common Subsequences Between Files With Applications To Differential Compression discloses finding the longest matching substrings between a number of potentially large datasets by hashing sections of files to detect occurrences of substrings and building suffix arrays to find the longest matches.
Kimura, U.S. Pat. No. 5,933,104 for a Method And System For Compression And Decompression Using Variable-Sized Offset And Length Fields discloses an improvement of the LZRW1 algorithm that identifies a pattern of data by calculating a hash value for the pattern and encoding the pattern of data for compressing data.
The prior art discloses various compression algorithms. The LZRW1 algorithm uses the single pass literal/copy mechanism of the LZ77 class of algorithms to compress an uncompressed data sequence into a compressed data sequence. Bytes of data in the uncompressed data sequence are either directly incorporated into a compressed data sequence as a string (i.e., as “literal items”) or, alternatively, are encoded as a pointer to a matching set of data that has already been incorporated into the compressed data sequence (i.e., as “copy items”). The copy items are encoded by offset and length values that require fewer bits than the bytes of data. The offset specifies the offset of the string being coded relative to its previous occurrence. For example, if a string of three characters occurred six bytes before the occurrence that is being encoded, the offset is six. The length field specifies the length of the matching data sequence in bytes. Compression is realized by representing as much of the uncompressed data sequence as possible as copy items. Literal items are incorporated into the compressed data sequence only when a match of three or more bytes cannot be found.
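The literal/copy mechanism described above can be illustrated with a simplified LZ77-style coder. This sketch is not the actual LZRW1 algorithm (which uses a hash table to find matches in a single pass and packs items into bit fields); it merely demonstrates the literal-item/copy-item distinction, the offset and length encoding, and the three-byte minimum match, with tokens represented as Python tuples for clarity:

```python
MIN_MATCH = 3  # literal items are used when no match of 3+ bytes exists

def lz77_compress(data: bytes, window: int = 4095, max_len: int = 18):
    """Encode data as tokens: ('L', byte) for literal items and
    ('C', offset, length) for copy items pointing back into the window."""
    tokens = []
    i = 0
    while i < len(data):
        best_off, best_len = 0, 0
        # Search the sliding window for the longest earlier match.
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        if best_len >= MIN_MATCH:
            tokens.append(("C", best_off, best_len))  # copy item
            i += best_len
        else:
            tokens.append(("L", data[i]))  # literal item
            i += 1
    return tokens

def lz77_decompress(tokens) -> bytes:
    out = bytearray()
    for t in tokens:
        if t[0] == "L":
            out.append(t[1])
        else:
            _, off, length = t
            for _ in range(length):  # byte-by-byte copy handles overlap
                out.append(out[-off])
    return bytes(out)
```

For the example in the text, a three-character string repeated six bytes after its first occurrence would be emitted as the copy item ("C", 6, 3).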
The LZ1 data compression process is based on the principle that a repeated sequence of characters can be replaced by a reference to an earlier occurrence of the sequence, i.e., matching sequences. The reference, e.g., a pointer, typically includes an indication of the position of the earlier occurrence, e.g., expressed as a byte offset from the start of the repeated sequence, and the number of characters, i.e., the matched length, that are repeated. Typically, the references are represented as “<offset, length>” pairs in accordance with conventional LZ1 coding. In contrast, LZ2 compression parses a stream of input data characters into coded values based on an adaptively growing look-up table or dictionary that is produced during the compression. That is, LZ2 does not find matches on any byte boundary and with any length as in LZ1 coding, but instead when a dictionary word is matched by a source string, a new word is added to the dictionary which consists of the matched word plus the following source string byte. In accordance with LZ2 coding, matches are coded as pointers or indexes to the words in the dictionary.
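The LZ2-style dictionary growth described above is commonly illustrated with the LZW variant of the algorithm. The following sketch is a textbook illustration, not any particular patented implementation; it assumes byte input and a dictionary seeded with all 256 single-byte words, and emits dictionary indexes as plain integers rather than packed codes:

```python
def lzw_compress(data: bytes):
    """Emit dictionary indexes; whenever a dictionary word is matched,
    add that word plus the following source byte as a new entry."""
    dictionary = {bytes([b]): b for b in range(256)}
    next_code = 256
    word = b""
    codes = []
    for byte in data:
        candidate = word + bytes([byte])
        if candidate in dictionary:
            word = candidate  # keep extending the current match
        else:
            codes.append(dictionary[word])
            dictionary[candidate] = next_code  # matched word + next byte
            next_code += 1
            word = bytes([byte])
    if word:
        codes.append(dictionary[word])
    return codes

def lzw_decompress(codes) -> bytes:
    dictionary = {b: bytes([b]) for b in range(256)}
    next_code = 256
    prev = dictionary[codes[0]]
    out = bytearray(prev)
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            entry = prev + prev[:1]  # code defined but not yet received
        out.extend(entry)
        dictionary[next_code] = prev + entry[:1]
        next_code += 1
        prev = entry
    return bytes(out)
```

In contrast to the LZ1 sketch, matches here are coded as indexes into the adaptively growing dictionary rather than as <offset, length> pairs.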
The definitions for terms used throughout this document are set forth below.
FTP or File Transfer Protocol is a protocol for moving files over the Internet from one computer to another.
HyperText Markup Language (HTML) is a method of mixing text and other content with layout and appearance commands in a text file, so that a browser can generate a displayed image from the file.
Hypertext Transfer Protocol (HTTP) is a set of conventions for controlling the transfer of information via the Internet from a Web server computer to a client computer, and also from a client computer to a Web server.
Internet is the worldwide, decentralized totality of server computers and data-transmission paths which can supply information to a connected and browser-equipped client computer, and can receive and forward information entered from the client computer.
JavaScript is an object-based programming language. JavaScript is an interpreted language, not a compiled language. JavaScript is generally designed for writing software routines that operate within a client computer on the Internet. Generally, the software routines are downloaded to the client computer at the beginning of the interactive session, if they are not already cached on the client computer. JavaScript is discussed in greater detail below.
List Search Algorithm is an algorithm used to find a particular element of a list of elements and includes linear search algorithms, binary search algorithms, interpolation search algorithms, and others.
Metadata is generally defined as data about data.
Parser is a component of a compiler that analyzes a sequence of tokens to determine its grammatical structure with respect to a given formal grammar. Parsing transforms input text into a data structure, usually a tree, which is suitable for later processing and which captures the implied hierarchy of the input. XML Parsers ensure that an XML document follows the rules of XML markup syntax correctly.
String is a sequence of characters (numbers, letters, symbols and/or the like).
URL or Uniform Resource Locator is an address on the World Wide Web.
Web-Browser is a complex software program, resident in a client computer, that is capable of loading and displaying text and images and exhibiting behaviors as encoded in HTML (HyperText Markup Language) from the Internet, and also from the client computer's memory. Major browsers include MICROSOFT INTERNET EXPLORER, NETSCAPE, APPLE SAFARI, MOZILLA FIREFOX, and OPERA.
Web-Server is a computer able to simultaneously manage many Internet information-exchange processes at the same time. Normally, server computers are more powerful than client computers, and are administratively and/or geographically centralized. An interactive-form information-collection process generally is controlled from a server computer, to which the sponsor of the process has access. Servers usually contain one or more processors (CPUs), memories, storage devices and network interface cards. Servers typically store the HTML documents and/or execute code that generates Web-pages that are sent to clients upon request.
World Wide Web Consortium (W3C) is an unofficial standards body which creates and oversees the development of web technologies and the application of those technologies.
XHTML (Extensible Hypertext Markup Language) is a language for describing the content of hypertext documents intended to be viewed or read in a browser.
XML (Extensible Markup Language) is a W3C standard for text document markup, and it is not a language but a set of rules for creating other markup languages.
The prior art fails to provide a solution to these problems and others.