1. Field of the Invention
The present invention relates generally to computer networks and, more particularly, to transferring files among computers over the Internet using the Hypertext Transfer Protocol.
2. Related Art
The Hypertext Transfer Protocol (HTTP) is an application-level messaging protocol used for distributed, collaborative information systems. As is known, HTTP has been in use on the Internet (the global network of all computers interconnected via Transmission Control Protocol/Internet Protocol, TCP/IP) since 1990. The principal application in which HTTP has been used is the World Wide Web global information initiative, providing support for hypermedia, electronic commerce, distributed applications, etc. A wide variety of information about HTTP is publicly known. Indeed, there are many documents, known as Internet RFCs (Requests for Comments), which specify a great deal of information about HTTP. For example, RFC 2068, RFC 2296, RFC 2227, RFC 2295, RFC 2145, RFC 2109, RFC 2069, RFC 1945, and RFC 2518 are just a few of the published documents that specify various features and aspects of HTTP. Each of the foregoing, publicly available documents is hereby incorporated by reference in its entirety.
As is further known, the HTTP protocol may be conducted securely across the public Internet using Secure Sockets Layer (SSL) technology. Among other features, this security scheme provides encryption, and hence privacy, to HTTP messages as they traverse the network. When used with HTTP server digital certificates (as is typically the case), HTTP server authentication is also supported for the client. HTTP conducted in this manner is referred to in the art as HTTPS. Unless otherwise specified, as used herein, the term HTTP should be understood to encompass both HTTPS and the unsecured form.
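By way of illustration only, the following Python sketch (which assumes the standard `ssl` module and is not part of any system described herein) shows how an HTTPS client obtains the properties noted above: the TCP connection is wrapped in an SSL/TLS layer that both encrypts traffic and verifies the server's digital certificate.

```python
import ssl

# Default client-side TLS context, as commonly used for HTTPS:
# traffic is encrypted, the server's certificate is verified against
# trusted root certificates, and the server's host name is checked
# against the name in the certificate.
context = ssl.create_default_context()

# Server authentication via the server's digital certificate is
# mandatory under the default settings.
assert context.verify_mode == ssl.CERT_REQUIRED
assert context.check_hostname is True

# An HTTP exchange conducted over a socket wrapped via
# context.wrap_socket(sock, server_hostname=...) constitutes HTTPS.
```

The sketch shows only how the security layer is established; the HTTP messages themselves are unchanged by it.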
Likewise, there are various versions of HTTP, including versions HTTP 1.0, HTTP 1.1, HTTP-NG, etc. Unless otherwise specified, as used herein, the term HTTP should be understood to be version-independent.
As is known, HTTP is based on a request/response paradigm. An HTTP client computer program establishes a connection with an HTTP server program, executing on a server computer known to the client by its Internet Protocol (IP) address or Domain Name System (DNS) name. (In the latter case, the HTTP client first uses DNS to translate the DNS name into an IP address.) The HTTP client then sends an HTTP request message. The HTTP request message includes: a method, specifying the general operation the server is requested to perform; a Uniform Resource Identifier (URI), specifying the resource against which the server is requested to perform the method; and an HTTP protocol version. The request message may further include various request headers (specifying additional attributes and modifiers to the request) and a request body (specifying arbitrary, application-defined information to be used in handling the request).
The server processes the request and sends an HTTP response message in reply. The HTTP response message includes the HTTP protocol version and an overall success or error code. The response message may further include various response headers (specifying additional informative attributes concerning the server, its handling of the request, or the response) and a response body (providing arbitrary, application-defined information pursuant to the transaction).
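The anatomy of these two messages can be sketched in Python as follows. This is an illustrative reduction only (the host name, URI, and response content are hypothetical), showing the request line, headers, and body of a request, and the status line, headers, and body of a response.

```python
def build_request(method, uri, host, headers=None, body=b""):
    """Assemble an HTTP request: request line, headers, blank line, body."""
    lines = [f"{method} {uri} HTTP/1.0", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body

def parse_response(raw):
    """Split an HTTP response into (version, status code, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("ascii").split("\r\n")
    version, code, _reason = lines[0].split(" ", 2)
    headers = dict(line.split(": ", 1) for line in lines[1:])
    return version, int(code), headers, body

request = build_request("GET", "/index.html", "www.example.com")
version, code, headers, body = parse_response(
    b"HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
)
```

Real HTTP implementations add many more headers and handle persistent connections, chunked transfers, and the like; the sketch shows only the message structure described above.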
In practice, an HTTP proxy program, executing on another server computer, frequently plays a role in relaying the HTTP request and response messages, as well. HTTP proxies are used when the HTTP client and server are separated by an Internet firewall (a partitioning of computers on the network into multiple domains, such that the client and server computers reside in different, non-communicating domains). In this case, rather than connecting directly to the HTTP server on the server computer, an HTTP client will instead connect to an HTTP proxy using the proxy's IP address or DNS name. The client will send to the proxy a modified HTTP request message, containing the IP address or DNS name of the desired HTTP server. The proxy, which is allowed to communicate across the firewall, will then connect to the HTTP server on the client's behalf, and relay the HTTP request message to the server. In turn, the proxy will likewise relay the HTTP response message from the server back to the client.
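The modification to the request message can be illustrated by the request line alone. In the following Python sketch (host name hypothetical), a client connecting directly to the server names only the resource, while a client connecting through a proxy sends the absolute URL, from which the proxy learns which origin server to contact on the client's behalf.

```python
def request_line(method, uri, via_proxy, scheme="http",
                 host="www.example.com"):
    """Build the first line of an HTTP request message.

    When the client talks to a proxy, the request target is the full
    URL of the desired HTTP server; the proxy extracts the host name
    from it, connects across the firewall, and relays the request.
    """
    target = f"{scheme}://{host}{uri}" if via_proxy else uri
    return f"{method} {target} HTTP/1.0"

direct = request_line("GET", "/index.html", via_proxy=False)
proxied = request_line("GET", "/index.html", via_proxy=True)
```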
In a typical HTTP interaction, the HTTP client computer is executing a Web browser program, and the URI refers to a Web page (frequently authored using Hypertext Markup Language, HTML). The HTTP request is thus a request for the Web server program to fetch the indicated page (from the server computer operating system's file system), and download it to the Web browser for display to the user. The HTTP response body includes the file content for the requested page.
In another typical HTTP interaction, the URI may refer to an HTTP server extension program (authored using any of a wide variety of Web server extension frameworks, such as CGI, FastCGI, server plugins, servlets, etc). The HTTP request message may further include request headers and/or a request body containing arguments to the HTTP server extension program. The HTTP request is thus a request for the HTTP server program to invoke the HTTP server extension program with the given arguments, and return an HTTP response from the program. Web form handling, frequently used in Web commerce, is an example of this kind of HTTP interaction.
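The arguments to such a server extension program are often carried in the request body using form encoding. The following Python sketch (field names hypothetical; the standard `urllib.parse` module is assumed) shows how a client would prepare such a request body and the headers describing it, exactly as a Web browser encodes a submitted form.

```python
from urllib.parse import urlencode

# Encode hypothetical form fields as an
# application/x-www-form-urlencoded request body.
form_fields = {"customer_id": "A-42", "action": "upload"}
body = urlencode(form_fields).encode("ascii")

# Request headers describing the body to the server extension framework.
headers = {
    "Content-Type": "application/x-www-form-urlencoded",
    "Content-Length": str(len(body)),
}
```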
Although these are typical HTTP interactions, it is important to note that utilization of HTTP is not fundamentally restricted to these examples. As is known, the HTTP protocol itself allows extension request and response headers to include arbitrary, application-defined values. Furthermore, both HTTP request and response bodies are permitted to include arbitrary, application-defined content of any size. HTTP clients, likewise, are not restricted to being Web browsers; any program can be constructed as an HTTP client. Finally, HTTP proxy and server implementations generally observe all of these allowances.
It is sometimes desirable to exchange files of various kinds, on a regular or ad-hoc basis, between one party and various other independent parties, who have a relationship of some kind with the first. For example, a business providing computer support services may desire to receive defect log files from its customers' contracted computers, for purposes of centralized troubleshooting. Similarly, another department of the same business may desire to receive inventory files of installed software from its customers' contracted computers, for purposes of software update consultation. A third department of the business may desire to receive hardware utilization metrics from its customers' contracted computers, for purposes of scalable billing. In all three cases, various files need to be transferred to the business from its customers.
Transfers in the opposite direction are sometimes desirable, as well. For example, the second department of the business mentioned above may desire to allow recommended software updates to be downloaded to its customers' contracted computers, following a consultation engagement.
FIG. 1 illustrates these scenarios. In FIG. 1, the party providing the services is referred to herein as Supplier 120. The parties relating with Supplier 120, such that they need to upload and download files to/from Supplier 120, are referred to herein as Customers 110n (where n=A, B, . . . ). Thus the term, Supplier/Customer Problem, refers herein to the general problem of providing a distributed computer architecture which most cost-effectively enables file transfers of the kind just described.
There are several aspects of the Supplier/Customer Problem that influence solution design. One such aspect is the recognition that Suppliers and Customers are typically independent organizations. They do not share their proprietary computer networks, which are segregated from the Internet by firewalls. Although Suppliers 120 can be presumed, by the nature of their business, to have a continuous Internet server (e.g., Web server) presence, some Customers 110n may not have continuous Internet connections (although the pervasiveness of Internet Service Providers, ISPs, gives Customers without such connections a means to periodically gain client access to Internet 140 as needed).
A related recognition is that the Customer computer on which a file is to reside (referred to herein as the Customer Repository 130n) is not necessarily directly connected to Internet 140, even though the Customer network may be so connected (e.g., via a proxy across a Customer firewall 150n). Similarly, the Supplier computer on which a file is to reside (referred to herein as Supplier Repository 130n) is likewise not necessarily accessible directly from Internet 140. Although Supplier 120 may have an Internet presence, Supplier Repository 130n may be sealed from Internet 140 via Supplier firewall 160.
A third aspect of the Supplier/Customer Problem is security. Suppliers 120 often desire to transfer files only to and from authentic, authorized Customers 110n. Similarly, Customers 110n often desire assurance that files are only being transferred to and from authentic Suppliers 120. Furthermore, some security-conscious Customers 110n will desire that their file contents be private (i.e., encrypted) across the file transfer.
A fourth aspect of the Supplier/Customer Problem is repository flexibility. Suppliers 120 sometimes must handle multiple relationships with multiple sets of Customers 110n, where the relationships are best handled out of different departments within Supplier 120 (as in the example given above). In these cases, it is frequently desired to have multiple Supplier Repositories 130n, to better organize the Supplier's internal business processes. For example, one department within a Supplier 120 could desire to use one particular Repository worldwide. Another department, pursuant to a different business relationship with a different (but possibly overlapping) set of Customers, could desire to use three other Repositories, one for European business, one for Asian business, and one for Americas business. The layout of specific Supplier Repositories is referred to herein as the Supplier Repository “topology”. Furthermore, should the Supplier Repository topology change over time, it is often desired to limit the extent of such changes so that Customers 110n are unaffected. For example, if the aforementioned department with three Repositories re-organized to use just one in the future, it would be desired to keep this re-organization internal to the Supplier. In that way, Customers' Repository software, performing file transfers against Supplier Repositories 130n, would remain unchanged, resulting in a greatly lowered maintenance cost.
A fifth aspect of the Supplier/Customer Problem is immediate delivery feedback. It is usually desired that Customer Repositories be able to readily detect when file uploads/downloads against Supplier Repositories fail. This is necessary to minimize the costs of supporting the solution.
A sixth aspect of the Supplier/Customer Problem is file scalability. Depending on the application, files to be uploaded/downloaded will have varying lengths and contents. Some may be very short, while others may be millions of bytes in length. Some may encode textual characters, while others may be binary.
Finally, a last aspect of the Supplier/Customer Problem is infrastructure re-use. In order to be maximally cost-effective, it is desired to implement a solution which addresses all the above aspects, and yet uses a minimum of new technologies. Instead, existing network protocols, computers, software and architecture should be leveraged. Doing so serves to minimize implementation cost within both the Supplier and Customer organizations. This is most important with respect to the cost of implementation on the Customer side since, while there is but one Supplier 120, there are many Customers 110n, each particularly sensitive to cost.
Thus a solution to the Supplier/Customer Problem must address all of the preceding issues.
Various techniques are known in the art for exchanging files across computer networks. Several of these have been applied to the Supplier/Customer Problem in the past. The most applicable prior art utilizes the Internet as the fundamental substrate across which Supplier/Customer file transfers occur. The Internet is indeed an attractive component of a Supplier/Customer Problem solution. This is because (as discussed above) in recent times Suppliers 120 can be presumed to have a continuous Internet server presence, while even those Customers 110n who do not can cheaply obtain an on-demand, client-only presence through any of a large number of ISPs. Thus a Supplier/Customer Problem solution utilizing the Internet, with the Customer as a file upload/download client and the Supplier as a server, is clearly promising. However, prior art solutions based upon the Internet still fall short of addressing all of the specific issues described above regarding the Supplier/Customer Problem.
For example, the File Transfer Protocol (FTP) has often been used in the past to perform file transfers across a simple client/server architecture, as illustrated in FIG. 2A. As is known, there is a wide variety of information publicly available concerning FTP, including such Internet RFCs as RFC 2228, RFC 959, RFC 783, and RFC 765. The foregoing documents are incorporated herein by reference in their entireties.
A typical FTP session involves an FTP client computer program and an FTP server program executing on a server computer 210n whose IP address or DNS name are known to the client computer 220 (FIG. 2A). (As in the HTTP case, an FTP proxy 230 may also be involved, to relay the FTP messages and data across an Internet firewall 240.) Upon receiving a new connection request, FTP server 210n requests that FTP client 220 authenticate itself. FTP server implementations generally support user ID/password-based authentication, pursuant to whatever underlying user account scheme is provided by the server computer's operating system. A typical UNIX-based FTP server, for example, would authenticate clients based on a user ID/password pair resolved against the ‘passwd’ database. A typical Microsoft Windows/NT-based FTP server would instead authenticate clients based on a user ID/password pair resolved against the NT domain. Upon authentication, various FTP commands are supported in the FTP protocol by client 220, proxy 230 and server 210n to upload and download files to/from the filesystem of FTP server 210n. 
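The command dialogue of such a session can be sketched as follows. This is an illustrative reduction only (Python; the user ID, password, and file name are hypothetical, and server responses are omitted), showing the sequence of FTP protocol commands a client would send to authenticate and upload one file.

```python
def ftp_upload_dialogue(user, password, filename):
    """Return the sequence of FTP commands for a typical file upload."""
    return [
        f"USER {user}",      # server requests authentication first
        f"PASS {password}",  # resolved against the server OS account scheme
        "TYPE I",            # binary ("image") transfer mode
        "PASV",              # ask the server for a data connection port
        f"STOR {filename}",  # upload file content over the data connection
        "QUIT",              # end the session
    ]

dialogue = ftp_upload_dialogue("custA", "secret", "defect.log")
```

A download would substitute a RETR command for the STOR command; the authentication steps are identical.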
This overall FTP architecture presents numerous problems from the standpoint of the Supplier/Customer Problem. First, FTP proxy utilization for FTP client/server communication across firewalls is not currently as widespread as with other prior-art protocols (to be discussed below). This makes the widespread applicability of an FTP-based solution to Customers problematic, since Customers must either have FTP proxy access across their firewall to the Internet, or have no firewall at all.
Second, the security scheme used by known FTP servers requires that Customers have user accounts on the Supplier FTP server computer 210n. User account maintenance on computer operating systems typically incurs significant operating costs, especially when large numbers of user accounts are concerned. Furthermore, the presence of user accounts on Internet-accessible computers often creates security concerns, requiring further operating investment to monitor. Thus known FTP servers' authentication frameworks are not conducive to the security and cost-effectiveness considerations in the Supplier/Customer Problem. At a minimum, the solution must easily allow for other sorts of authentication/authorization frameworks, besides those rooted in the server computer operating system's user account scheme.
A related security concern with FTP is the lack of support within known FTP servers for private file transfers. This again limits the applicability of the FTP architecture for solving the Supplier/Customer Problem, since some sensitive Customers will require encryption during the upload/download.
Third, the limitation in all known FTP servers to just those files accessible to the server computer operating system's file system is problematic. As was discussed earlier, it is often desired to be able to distribute files on multiple, internal Supplier Repositories which are not directly accessible from the Internet. It is also desirable to hide changes in the Supplier Repository topology from Customers. If FTP server 210n is placed outside the Supplier's firewall (FIG. 2A), FTP server 210n is not inherently able to cross the firewall and access remote, internal Supplier Repositories. Instead, known FTP servers only have access to files available via their server computer operating system's file system.
To handle this limitation, some prior-art FTP-based systems use modified architectures, illustrated in FIGS. 2B and 2C. In one such prior-art modified FTP architecture (FIG. 2B), a distributed-file-system solution, such as Network File System (NFS), marries the publicly-accessible FTP server computer file system with the various internal Supplier Repository file systems. Supplier Repository topology is hidden from Customer Repositories by the use of virtual path schemes supported by the distributed file system technology employed. But known distributed file systems were not designed for secure cross-firewall access, and so many security concerns apply with this approach. In addition, Supplier Repository topology cannot change without also necessitating re-configuration of the Supplier firewall. This coupling increases the cost of the solution.
In another such prior-art modified FTP architecture (FIG. 2C), rather than having the FTP server outside the firewall, a conventional FTP proxy 260 is used by the Supplier to proxy Customer FTP traffic from outside Supplier firewall 250 to the internal Supplier Repositories 210n, each of which hosts an FTP server program. However, this architecture shares the previous problem of coupling change in the Supplier Repository topology with reconfiguration of the Supplier firewall. Worse, the Customer Repository FTP clients are still required to know the particular topology of the Supplier Repository computers: their DNS names, and an understanding of which Supplier Repository handles which kind of file, which Customer geographic location, etc. This makes it infeasible to mask from Customers any change in the Supplier Repositories.
As an alternative to FTP-based prior-art systems, Internet electronic mail (e-mail) has been used in the past to address the Supplier/Customer Problem, as illustrated in FIG. 3. As is known, these solutions use such e-mail architectural components as Simple Mail Transfer Protocol (SMTP), the Internet mail message format, and the popular ‘sendmail’ e-mail client/server program. Again, there are many documents discussing these components, including Internet RFC 822, RFC 821, and other documents. These publicly available documents are hereby incorporated by reference in their entireties.
As is further known, in a typical e-mail delivery, an SMTP client computer program (e.g., executing on Customer Repository client computer 305) connects to an SMTP server program executing on a server computer whose IP address or DNS name are known to the client. The server computer may be the ultimate destination of the e-mail message (e.g., one of Supplier Repositories 360n), or it may be a mail exchange (e.g., Customer mail exchange network 310) for use when a direct connection to the destination computer is not possible (e.g., prevented by Internet firewalls 340 and 350 or a down computer) or desirable (for example, due to bandwidth concerns). If the server is a mail exchange, its SMTP server program will relay the e-mail message on to the ultimate destination (e.g., one of Supplier Repositories 360n), or perhaps to another mail exchange (e.g., global-area mail exchange network 330 and thence Supplier mail exchange network 320), until finally the destination is attained. The receiving SMTP server program stores the message body to the server computer's file system or a database. In this way, prior-art solutions to the Supplier/Customer Problem perform file upload from Customer to Supplier Repository. File download is achieved via e-mail sent from Supplier to Customer Repository.
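By way of illustration only, the following Python sketch (which assumes the standard `email` module; the addresses, subject, and file content are hypothetical) shows the kind of Internet mail message such a prior-art solution would construct to carry a file from a Customer Repository to a Supplier Repository.

```python
from email.message import EmailMessage

# Construct a hypothetical message carrying one file as an attachment.
msg = EmailMessage()
msg["From"] = "repository@customer.example.com"
msg["To"] = "intake@supplier.example.com"
msg["Subject"] = "defect log upload"

# The file content travels as an application/octet-stream body part;
# the receiving SMTP server stores it on delivery.
msg.add_attachment(
    b"...binary defect log content...",
    maintype="application",
    subtype="octet-stream",
    filename="defect.log",
)
```

Note that constructing and sending such a message says nothing about whether it arrives; as discussed below, delivery is asynchronous.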
E-mail may also be carried by transport protocols other than SMTP (although when travelling across the public Internet 330, SMTP is generally used). SMTP client programs are likewise very diverse, as are server programs and stores (some, for example, store messages to databases). But, as is known, all share an essential feature of e-mail: it is asynchronous, meaning the client program disconnects without ascertaining the delivery success or failure.
This characteristic alone makes the e-mail architecture a poor choice for solving the Supplier/Customer Problem, where immediate delivery feedback is desired. All known e-mail-based solutions to the Supplier/Customer Problem in the art either forgo delivery feedback entirely, or incorporate complicated methodologies for discerning delivery success/failure from so-called "bounced" messages. Compared to Supplier/Customer Problem solutions where immediate delivery feedback comes for free (as it does, for example, with FTP and HTTP, due to the synchronous nature of those protocols and architectures), any such methodologies are relatively costly to construct and support.
Furthermore, e-mail is not designed for scalable file size and content. For example, depending on the size of a message, and the mail exchanges across which the message travels, a message may be split into multiple messages because the original was too large. As is known, re-assembly is generally not automatic, due to the wide heterogeneity in today's e-mail systems. Similarly, depending on the content of a message and the exchanges across which it travels, a message may be encoded as it travels. As with re-assembly, decoding is frequently not automatic. Thus it is a further cost to construct and support a Repository server that reassembles the pieces and/or decodes the message. As with immediate delivery feedback, above, when other Internet application protocols and architectures (such as FTP and HTTP) do not have such issues, there is little justification for e-mail.
As a final alternative to both FTP- and e-mail-based prior-art systems, HTTP has been used with a simple client/server architecture to perform file transfers in prior art, as illustrated in FIG. 4A. As is known, the HTTP protocol supports the GET method, in which the URI is taken by the HTTP server program to be the name of a resource whose content is to be downloaded, in the HTTP response body, to the HTTP client program. In particular, the resource named in the URI can be a file to be downloaded. Known HTTP server programs require the file to be accessible via the HTTP server computer operating system's file system.
The HTTP protocol also supports the PUT method, in which the URI is taken by the HTTP server program to be the name to assign to a new file whose content is to be uploaded, in the HTTP request body, from the HTTP client program. Known HTTP server programs require the location in which the file is placed to be accessible via the HTTP server computer operating system's file system.
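The two methods can be sketched as follows (Python; the server name and file path are hypothetical). The sketch builds only the raw request messages a file-transfer client would send: a GET message requesting download of a named file, and a PUT message carrying file content of arbitrary size in its request body.

```python
def get_request(uri, host):
    """GET: ask the server to download the named resource to the client."""
    return f"GET {uri} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")

def put_request(uri, host, content):
    """PUT: upload `content` to the server under the name given by the URI."""
    head = (f"PUT {uri} HTTP/1.0\r\nHost: {host}\r\n"
            f"Content-Length: {len(content)}\r\n\r\n")
    return head.encode("ascii") + content

download = get_request("/logs/defect.log",
                       "repository.supplier.example.com")
upload = put_request("/logs/defect.log",
                     "repository.supplier.example.com",
                     b"...file content of any size or encoding...")
```

Because the file content travels in the request or response body, which HTTP permits to be arbitrary data of any size, no special encoding or splitting is required.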
In known prior solutions to the Supplier/Customer Problem, the GET and PUT methods have been used to download and upload files, respectively, from and to HTTP server computers. HTTP proxies are used in those cases where the HTTP client and server are segregated. Indeed, there are several positive aspects to utilization of HTTP for file transfer. Because HTTP request and response bodies (in which the file contents are placed for transfer) support arbitrary data, file size and content are not an issue. Because the HTTP architecture is synchronous, immediate delivery feedback is a given. And because HTTP client/server support for encryption via HTTPS is widespread, a ready privacy option is available.
But as a solution for the Supplier/Customer Problem, this simple HTTP architecture has major shortcomings. First, the PUT method is not implemented in some HTTP client, server and proxy programs, limiting the widespread applicability of the simple HTTP architecture to file download only.
Second, the simple HTTP architecture, like the FTP architecture, lacks support for various Customer authentication and authorization schemes. HTTP server programs commonly provide built-in support for these security features to only the same extent that FTP servers do: via the underlying user account structure of the server computer's operating system. Thus the same problems apply to the simple HTTP architecture's capacity to support authentication and authorization as apply to the FTP architecture.
Finally, the simple HTTP architecture, like the FTP architecture, lacks support for multiple, internal Supplier Repositories whose topology is shielded from Customer Repositories. Like FTP server programs, known HTTP server programs access files via their server computer operating system's file system. To address this limitation, the simple HTTP architecture can be extended using distributed file systems (FIG. 4B) or an HTTP proxy (FIG. 4C) on the Supplier side, just as the FTP architecture could be extended (as discussed above). But these extensions still suffer from the same problems as they did in the FTP case.
As a result, there is a need for an improved system for uploading and downloading files between distributed, segregated Supplier and Customer Repositories, such that all of the issues disclosed above are resolved.