1. Technical Field
The present invention relates generally to network communications over TCP/IP and more particularly to connecting low bandwidth services between local area networks (LANs) and ameliorating packet fragmentation.
2. Description of Related Art
It is known that virtual private networks (VPNs) allow remote employees access to an enterprise's information systems. VPNs are also used to connect remote offices to headquarters for time-critical enterprise resource management operations.
The communication network typically comprises a public network (e.g., the Internet). The connections to the communication network from the branch office and the central office typically cause a bandwidth bottleneck for exchanging the data over the communication network. The exchange of the data between the branch office and the central office, in the aggregate, will usually be limited to the bandwidth of the slowest link in the communication network, and is further aggravated by the latency that VPN encryption and decryption impose.
For example, one of the routers connects to the communication network by a T1 line, which provides a bandwidth of approximately 1.544 Megabits/second (Mbps). The other, router 170, connects to the communication network by a T3 line, which provides a bandwidth of approximately 45 Mbps. Even though the communication network may provide an internal bandwidth greater than 1.544 Mbps or 45 Mbps, the available bandwidth between the branch office and the central office is limited to 1.544 Mbps (i.e., the T1 connection).
Moreover, many applications do not perform well over the communication network due to the limited available bandwidth. Developers generally optimize the applications for performance over a local area network (LAN), which typically provides a bandwidth ranging from 10 Mbps to Gigabit/second (Gbps) speeds. The developers of the applications assume small latency and high bandwidth across the LAN between the applications and the data. However, the latency across the communication network typically will be 100 times that across the LAN, and the bandwidth of the communication network will be 1/100th that of the LAN.
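The bottleneck arithmetic above can be illustrated with a minimal sketch. The function names and the 100-megabyte file size are hypothetical, chosen only for illustration, and the sketch ignores protocol overhead, latency, and encryption cost.

```python
# Link rates taken from the text, in megabits per second.
T1_MBPS = 1.544
T3_MBPS = 45.0

def path_bandwidth_mbps(*link_rates_mbps: float) -> float:
    """End-to-end bandwidth is bounded by the slowest link on the path."""
    return min(link_rates_mbps)

def transfer_seconds(size_megabytes: float, bandwidth_mbps: float) -> float:
    """Idealized transfer time: 8 bits per octet, no overhead or latency."""
    return size_megabytes * 8 / bandwidth_mbps

bottleneck = path_bandwidth_mbps(T1_MBPS, T3_MBPS)
seconds = transfer_seconds(100, bottleneck)  # hypothetical 100 MB file
```

Under these assumptions the T3 link contributes nothing to the result: the 100-megabyte transfer takes roughly 518 seconds, exactly as if both ends were connected by T1 lines.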
Connecting a branch office to headquarters is likely to involve tying two local area networks to routers which are connected by a wide area network. This requires traversing a number of gateways controlled by different parties. The IP protocol was designed for use on a wide variety of transmission links. Although the maximum length of an IP datagram is 64K octets, most transmission links enforce a smaller maximum packet size, called the MTU (Maximum Transmission Unit). The MTU and the default packet size depend on the type of the transmission link. For Ethernet (LAN), the maximum packet size is 1500 octets. For token ring and FDDI, it is 4096 octets.
The design of IP accommodates MTU differences by allowing routers to fragment IP datagrams as necessary. The receiving station is responsible for reassembling the fragments back into the original full size IP datagram. Because IP packets are routed independently of each other, different packets between the same end hosts can take different routes with varying MTU sizes. Because the sender lacks end-to-end information about these MTUs, intermediate routers can quickly find themselves receiving oversized packets that they must still route somehow. The IP protocol provides a convenient solution: IP fragmentation, a mechanism in which a single inbound IP datagram is split into two or more outbound IP datagrams. The worst impact of IP fragmentation is in router-to-router communication. If a router-to-router IP packet is fragmented somewhere in the path, the receiving router has to reassemble the original packet, resulting in significantly reduced switching performance.
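The fragmentation and reassembly described above can be sketched as follows. This is a simplified model, not a packet-level implementation: it assumes a 20-octet IP header with no options, models only the fragment offset (in 8-octet units) and the more-fragments flag, and uses hypothetical names throughout.

```python
from dataclasses import dataclass

IP_HEADER_LEN = 20  # assumption: header carries no IP options

@dataclass
class Fragment:
    offset_units: int     # offset of this fragment's data, in 8-octet units
    more_fragments: bool  # True on every fragment except the last
    payload: bytes

def fragment(payload: bytes, mtu: int) -> list[Fragment]:
    """Split a datagram payload so each fragment fits within the link MTU."""
    # Data per fragment must be a multiple of 8 octets (except the last).
    max_data = (mtu - IP_HEADER_LEN) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any payload")
    fragments = []
    for start in range(0, len(payload), max_data):
        chunk = payload[start:start + max_data]
        is_last = start + len(chunk) >= len(payload)
        fragments.append(Fragment(start // 8, not is_last, chunk))
    return fragments

def reassemble(fragments: list[Fragment]) -> bytes:
    """Receiver side: order fragments by offset and join their payloads."""
    ordered = sorted(fragments, key=lambda f: f.offset_units)
    return b"".join(f.payload for f in ordered)

# A 4000-octet payload crossing a 1500-octet Ethernet link.
frags = fragment(bytes(4000), mtu=1500)
```

With these numbers each fragment carries at most 1480 octets of data, so the payload becomes three fragments at offsets 0, 185, and 370. The receiving side must buffer and sort all three before it can forward the datagram, which is the reassembly cost the text attributes to router-to-router traffic.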
An additional problem with deployment of VPNs is that there is latency introduced by the encryption and decryption of transmissions. Because of the encryption of traffic, the same files transmitted twice will not look the same and this prevents conventional caching strategies.
For example, in a centralized server implementation having multiple branches, computers in each of the multiple branch offices make requests over the VPN to central servers for the organization's data. The data transmitted by the central servers in response to the requests quickly saturate the available bandwidth of the central office's connection to the communication network, further decreasing application performance and data access at the multiple branch offices. This is particularly troublesome for entities which span multiple timezones as congestion can dominate the work day.
It is also known that mechanisms for caching improve application performance and data access. A cache is generally used to reduce the latency of the communication network forming the VPN (i.e., because the request is satisfied from the local cache) and to reduce network traffic over the VPN (i.e., because responses are served locally, the amount of bandwidth used is reduced).
Webpage caching, for example, is the caching of web documents (i.e., HTML pages, images, etc.) in order to reduce web site access times and bandwidth usage. Web caching typically stores local copies of the requested web documents. The web cache satisfies subsequent requests for the web documents if the requests meet certain predetermined conditions.
One problem with web caching is that the Time to Live parameter is generally not easily changed. Thus a web cache is difficult to manage and cannot be conveniently purged or updated. Every browser can have a slightly different version of a document. Another problem is that the web cache stores entire objects (such as documents) and cache hits are binary: either a perfect match or a miss. Even where only small changes are made to the documents, the web cache cannot use the cached copy of the documents to reduce network traffic.
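The all-or-nothing behavior of a whole-object cache can be illustrated with a minimal sketch. The class and method names are hypothetical, and keying the cache by a SHA-256 digest of the entire object is an assumption made for illustration; real web caches are typically keyed by URL with validation rules.

```python
import hashlib

class WholeObjectCache:
    """Toy cache that stores and matches entire objects only."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def octets_to_send(self, obj: bytes) -> int:
        """Return how many octets must cross the network to deliver obj."""
        key = hashlib.sha256(obj).hexdigest()
        if key in self._store:
            return 0            # perfect match: served locally
        self._store[key] = obj  # miss: fetch and store the whole object
        return len(obj)

cache = WholeObjectCache()
document = b"a" * 10_000
first = cache.octets_to_send(document)    # miss, full transfer
second = cache.octets_to_send(document)   # hit, nothing sent
edited = document[:-1] + b"b"             # one octet changed
third = cache.octets_to_send(edited)      # miss again, full transfer
```

Although the edited document shares 9,999 of its 10,000 octets with the cached copy, the third request is a complete miss and the entire object is transferred again.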
It is also known that randomly chosen polynomials are used to “fingerprint” bit-strings. This method, first published by Michael O. Rabin (Center for Research in Computing Technology, Harvard University, Report TR-15-81, 1981), is applied to produce a very simple string matching algorithm and a procedure for securing files against unauthorized changes. The method is provably efficient and highly reliable. However, it is also known that the Rabin fingerprinting scheme is not as secure as more expensive cryptographic hash functions.
It is known that the Rabin-Karp algorithm is a string searching algorithm created by Michael O. Rabin and Richard M. Karp in 1987 that uses hashing to find a substring in a text. It is used for multiple pattern matching rather than single pattern matching; its running time performance is considered a reason that it is not more widely used. However, it has the advantage of being able to find any one of k strings or fewer in a predictable time regardless of the magnitude of k.
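A polynomial fingerprint of this kind can be sketched as the remainder of the bit-string, read as a polynomial over GF(2), divided by a fixed irreducible polynomial. The sketch below is illustrative only: it assumes the degree-8 polynomial x^8 + x^4 + x^3 + x + 1 (a well-known irreducible polynomial) instead of a randomly chosen one, and a practical scheme would use a much higher degree. The function name is hypothetical.

```python
def polynomial_fingerprint(data: bytes, poly: int = 0x11B) -> int:
    """Remainder of data (as a GF(2) polynomial) modulo poly.

    poly is given as a bit mask; 0x11B encodes x^8 + x^4 + x^3 + x + 1.
    The default is an assumption for illustration, not a random choice.
    """
    degree = poly.bit_length() - 1
    remainder = 0
    for byte in data:
        for shift in range(7, -1, -1):          # most significant bit first
            bit = (byte >> shift) & 1
            remainder = (remainder << 1) | bit  # multiply by x, add next bit
            if remainder >> degree:             # degree reached: reduce
                remainder ^= poly
    return remainder

original = b"branch office"
flipped = bytes([original[0] ^ 1]) + original[1:]  # flip a single bit
fp_original = polynomial_fingerprint(original)
fp_flipped = polynomial_fingerprint(flipped)
```

Because the reduction is linear over GF(2) and the polynomial has a nonzero constant term, flipping any single bit always changes the fingerprint. Longer, unrelated inputs can still collide, and collisions become more likely the smaller the degree, which is consistent with the text's remark that such fingerprints are weaker than cryptographic hash functions.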
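The multiple-pattern use of Rabin-Karp mentioned above can be sketched with a rolling hash. This is a minimal sketch under stated assumptions: all patterns have the same length, the hash is a simple polynomial hash with an arbitrarily chosen base and modulus, and the function name is hypothetical.

```python
def rabin_karp_multi(text: str, patterns: list[str],
                     base: int = 256,
                     mod: int = 1_000_000_007) -> list[tuple[int, str]]:
    """Return (index, pattern) for each occurrence of any pattern in text."""
    if not patterns:
        return []
    m = len(patterns[0])
    if any(len(p) != m for p in patterns):
        raise ValueError("this sketch assumes equal-length patterns")
    if m == 0 or m > len(text):
        return []

    def poly_hash(s: str) -> int:
        value = 0
        for ch in s:
            value = (value * base + ord(ch)) % mod
        return value

    # Hash every pattern once; lookup cost is independent of the pattern count.
    targets: dict[int, set[str]] = {}
    for p in patterns:
        targets.setdefault(poly_hash(p), set()).add(p)

    high = pow(base, m - 1, mod)  # weight of the character leaving the window
    window = poly_hash(text[:m])
    matches = []
    for i in range(len(text) - m + 1):
        if window in targets:
            candidate = text[i:i + m]
            if candidate in targets[window]:  # confirm, ruling out collisions
                matches.append((i, candidate))
        if i + m < len(text):
            # Slide the window: drop text[i], append text[i + m].
            window = ((window - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return matches

found = rabin_karp_multi("abracadabra", ["abr", "cad"])
```

Each window position costs one hash update and one dictionary lookup however many patterns are loaded, which is the property the text describes as finding any one of k strings in a time that does not depend on k.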
References:
U.S. Pat. Nos. 5,511,159; 5,627,748; 5,778,231; and 5,953,006.
U. Manber, “Finding Similar Files In a Large File System,” Proc. 1994 Winter Usenix Technical Conference, January 1994, pp. 1-10.
B. S. Baker, “Parameterized Pattern Matching: Algorithms and Applications,” J. Comput. Syst. Sci. 52(1), February 1996, pp. 28-42.
B. S. Baker, “Parameterized Duplication In Strings: Algorithms and an Application to Software Maintenance,” SIAM J. Computing, 26(5), October 1997, pp. 1343-1362.
E. W. Myers, “An O(ND) Difference Algorithm and Its Variations,” Algorithmica, 1, 1986, pp. 251-266.
B. S. Baker, “On Finding Duplication and Near-duplication in Large Software Systems,” Second Working Conference on Reverse Engineering, 1995, pp. 86-95.
H. L. Berghel and D. L. Sallach, “Measurements of Program Similarity in Identical Task Environments,” SIGPLAN Notices, 9(8), August 1984, pp. 65-76.
S. Brin, J. Davis, and H. Garcia-Molina, “Copy Detection Mechanisms For Digital Documents,” Proceedings of the ACM Special Interest Group on Management of Data (SIGMOD), 1995, pp. 1-21.
A. Broder, S. Glassman, M. Manasse, and G. Zweig, “Syntactic Clustering of the Web,” Proceedings of the Sixth International World Wide Web Conference, April 1997, pp. 391-404.
K. W. Church and J. I. Helfman, “Dotplot: A Program For Exploring Self-similarity In Millions of Lines of Text and Code,” Journal of Computational and Graphical Statistics, 2(2), June 1993, pp. 153-174.
N. Heintze, “Scalable Document Fingerprinting,” Proceedings of the Second USENIX Workshop on Electronic Commerce, Nov. 18-21, 1996, pp. 1-10.
S. Horwitz, “Identifying the Semantic and Textual Differences Between Two Versions of a Program,” Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 1990, pp. 234-245.
H. T. Jankowitz, “Detecting Plagiarism in Student PASCAL Programs,” Computer Journal, 31(1), 1988, pp. 1-8.
J. H. Johnson, “Substring Matching For Clone Detection and Change Tracking,” Proc. International Conf. on Software Maintenance, 1994, pp. 1-7.
PocketSoft, .RTPatch Professional, Feb. 23, 1998.
T. Proebsting and S. A. Watterson, “Krakatoa: Decompilation in Java (Does Bytecode Reveal Source?),” USENIX Conference on Object-oriented Technologies and Systems, June 1997, pp. 1-13.
N. Shivakumar and H. Garcia-Molina, “Building a Scalable and Accurate Copy Detection Mechanism,” Proceedings of 1st ACM International Conference on Digital Libraries (DL'96), March 1996, pp. 1-9.
“On Finding Duplication in Strings and Software,” technical report, AT&T Bell Laboratories, February 1993.
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, “The Rabin-Karp algorithm,” Introduction to Algorithms (2nd ed.), Cambridge, Mass.: MIT Press, Sep. 1, 2001, pp. 911-916. ISBN 978-0262032933.
It is also known that a hash function is any well-defined procedure or mathematical function which converts a large, possibly variable-sized amount of data into a small datum. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes. In the present patent application we define a variable-sized amount of data converted to a hash as a data paragraph. A circuit is disclosed for selecting data paragraphs from a data object. That portion of a data object which is below the minimum size of a data paragraph is defined as a remainder.
It is known that hash functions are used to speed up table lookup or data comparison tasks—such as finding items in a database, detecting duplicated or similar records in a large file, finding similar stretches in DNA sequences, and so on.
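The terms defined above (data paragraph, remainder) and the use of hashes to detect duplicated data can be illustrated together. This is a minimal sketch under loud assumptions: it cuts paragraphs at a fixed, hypothetical size of 64 octets, which is not the selection method of the disclosed circuit, and it uses SHA-256 merely as a convenient hash function. All names are hypothetical.

```python
import hashlib

PARAGRAPH_SIZE = 64  # hypothetical fixed size, for illustration only

def split_paragraphs(data: bytes,
                     size: int = PARAGRAPH_SIZE) -> tuple[list[bytes], bytes]:
    """Split a data object into full-size paragraphs plus a remainder."""
    cut = len(data) - len(data) % size
    paragraphs = [data[i:i + size] for i in range(0, cut, size)]
    return paragraphs, data[cut:]  # remainder is shorter than one paragraph

def find_duplicates(data: bytes,
                    size: int = PARAGRAPH_SIZE) -> list[tuple[int, int]]:
    """Return (first_index, repeat_index) pairs for repeated paragraphs."""
    paragraphs, _remainder = split_paragraphs(data, size)
    seen: dict[bytes, int] = {}
    duplicates = []
    for index, paragraph in enumerate(paragraphs):
        digest = hashlib.sha256(paragraph).digest()  # small, fixed-size datum
        if digest in seen:
            duplicates.append((seen[digest], index))
        else:
            seen[digest] = index
    return duplicates

data_object = b"A" * 64 + b"B" * 64 + b"A" * 64 + b"xyz"
paragraphs, remainder = split_paragraphs(data_object)
repeats = find_duplicates(data_object)
```

Here the data object yields three paragraphs and a three-octet remainder, and the table lookup reports that the third paragraph repeats the first. Comparing short digests in a dictionary avoids comparing every paragraph against every other paragraph octet by octet.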
Thus it can be appreciated that connecting branch offices with enterprise applications presents bandwidth, security, and data integrity problems which are aggravated by virtual private networks. What is needed is a way to address VPN fragmentation and data duplication and to enable low latency and high responsiveness for users who must work remotely from their central datacenter and applications over an encrypted, low bandwidth link.