1. Field of the Invention
The present invention pertains to a method and a system of accessing data, for example using an HTTP-based protocol.
2. Discussion of Related Art
An application program (also referred to herein as “an application”), a client program (also referred to herein as “a client”) and a server program (also referred to herein as “a server”) are computer software programs that run on a computer system (i.e., one or more computer systems having one or more processors). A computer system is a physical computer device such as, but not limited to, a desktop computer, a server computer, or a laptop computer; or a virtual computer device, for example a Virtual Machine (VM). A computer system runs an operating system. An operating system provides access to one or more files stored on that computer system using a collection of computer software modules, also known as function calls. For example, an operating system such as Linux can implement the published POSIX standard to provide a collection of function calls to open, close, read or write files.
Interchange of computer data using a client program and a server program is a well-known technology. A client program communicates with a server program using a communication protocol over a network, for example a LAN, a WAN or the Internet. Examples of communication protocols are TCP, UDP, HTTP, HTTPS, socket-based communication, and HTTP 1.1 WebDAV. A client sends a request for data to a server. Based on that request, the server sends data in response.
The client program and the server program may be running on the same computer or separate computers. A client program may be running on one or more computers. A server program may be running on one or more computers. The computers running clients and servers are connected to each other in some form over the network.
Server and client programs follow some type of communication protocol to be able to understand each other. A client asks a server about its capabilities. The server then responds with a list of services it offers. The client may utilize the services to fulfill its goals by making additional requests to the server.
The HTTP protocol is a popular and well-known standard for communicating over a computer network, for example a LAN, a WAN, the Internet or the World Wide Web (WWW). A current HTTP protocol version is HTTP 1.1, which is described in IETF RFC 2616. An extension to the HTTP 1.1 protocol is HTTP 1.1 WebDAV. This protocol is described in IETF RFC 4918.
The HTTP 1.1 WebDAV protocol in its simplest form allows a computer to read from and write to web resources on a remote storage device using the WWW. A web resource can be a file. The protocol also supports the equivalent of hierarchical folder listings, file and folder metadata reporting, file and folder deletion, and similar features that traditional file-based file systems (for example, Portable Operating System Interface or POSIX-based file systems) offer, all of it over the WWW. In addition, the protocol supports file versioning over the WWW. For example, the protocol allows client programs to connect to remote storage solutions over the WWW and provision data at the remote location as if it were a network-mounted POSIX file system.
For example, the HTTP protocol supports the OPTIONS request, which enables the server to provide a list of the WebDAV commands that it supports and how it supports them. The WebDAV protocol requires some requests to be implemented; the implementation of other WebDAV requests is optional. The PROPFIND request is used to retrieve properties and metadata from a resource. It is equivalent to getting properties and metadata about a file and getting a hierarchical directory or folder list. The MKCOL request is used to create collections. For example, a collection can be a directory or folder. The HTTP GET request is used to retrieve a complete or partial resource, for example a file, from a remote location on the WWW. The HTTP PUT request is used to store a complete or partial resource, for example a file, to a remote location on the WWW. The COPY request duplicates a resource, for example a file. A detailed description of the HTTP 1.1 protocol can be found in IETF RFC 2616, and a detailed description of the HTTP 1.1 WebDAV protocol can be found in IETF RFC 4918.
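As an illustration of the requests listed above, the following minimal Python sketch serializes a byte-range GET request and a PROPFIND request as raw HTTP 1.1 messages without sending them over a network. The helper names (build_request, range_get, propfind), the host dav.example.com and the resource paths are hypothetical and are used only for illustration; the Range and Depth headers and the propfind/allprop XML body are the ones defined by the HTTP 1.1 and WebDAV specifications.

```python
def build_request(method, path, host, headers=None, body=b""):
    """Serialize an HTTP/1.1 request as bytes (no network I/O is performed)."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    if body:
        lines.append(f"Content-Length: {len(body)}")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii") + body


def range_get(path, host, offset, length):
    """GET for exactly `length` bytes starting at `offset`.

    The Range header uses inclusive first and last byte positions.
    """
    last = offset + length - 1
    return build_request("GET", path, host, {"Range": f"bytes={offset}-{last}"})


def propfind(path, host, depth="1"):
    """PROPFIND asking for all properties; Depth 1 also lists direct children."""
    body = (
        b'<?xml version="1.0" encoding="utf-8"?>'
        b'<D:propfind xmlns:D="DAV:"><D:allprop/></D:propfind>'
    )
    headers = {"Depth": depth, "Content-Type": "application/xml"}
    return build_request("PROPFIND", path, host, headers, body)
```

For example, range_get("/data/big.bin", "dav.example.com", 0, 8192) produces a request for the first 8 Kilobytes of a resource, and propfind("/data/", "dav.example.com") produces the equivalent of a folder listing request.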
A storage cache or the method of storage caching is defined as a computer program component that transparently stores data on a storage device with relatively faster access or in computer memory so that future requests for the same data from a storage device with relatively slower access may be retrieved from the cache and delivered faster. If requested data is contained in the cache, the request can be fulfilled by simply reading from the cache, which can be relatively fast. Otherwise, the data can be fetched from the storage device containing the data, which can be relatively slower. A cache can be a portion of volatile computer memory (e.g. RAM), or non-volatile computer storage (e.g., solid state disk (SSD), hard disk, storage area network (SAN), or network attached storage system (NAS)).
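The caching behavior described above can be sketched as a read-through cache. This is a minimal illustration only: the class name ReadThroughCache and the slow_read callable, which stands in for the slower storage device, are hypothetical, and the sketch ignores eviction and capacity limits.

```python
class ReadThroughCache:
    """Keeps copies of previously read data in faster storage (here, a dict)."""

    def __init__(self, slow_read):
        self._slow_read = slow_read  # callable standing in for the slower device
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            # Data is in the cache: fulfil the request from the faster storage.
            self.hits += 1
            return self._store[key]
        # Otherwise fetch from the slower device and remember the result.
        self.misses += 1
        value = self._slow_read(key)
        self._store[key] = value
        return value
```

Only the first request for a given key reaches the slower device; later requests for the same key are answered from the cache.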
A WebDAV client is a client program, which is a computer software program that runs on a computer system and uses the WebDAV protocol for communication. The WebDAV client communicates with a WebDAV server (i.e., a server using a WebDAV protocol for communication). The WebDAV client implements a software abstraction layer between conventional file input/output operations, for example POSIX function calls, implemented by an operating system on a computer system, and a WebDAV server. The WebDAV server is a computer software program that implements one or more versions of the WebDAV protocol.
A conventional implementation of a PROPFIND request may utilize storage caching of information on web resources. In this configuration, when a WebDAV client wants data about a resource, for example a file, hosted on a WebDAV server, the WebDAV client retains PROPFIND responses locally to avoid sending redundant requests to the WebDAV server when asking for data on the same resource again. In some instances, a WebDAV client implementation anticipates that application programs running on the same computer system as the WebDAV client will ask for information on additional web resources, for example files, and may choose to pre-emptively request data on those additional web resources and store it in a local cache.
A conventional implementation of an HTTP GET request can be illustrated by the case where an application makes a read request for a specific number of bytes from a specific part of a file that is being served by a WebDAV server. The WebDAV client runs on a computer system. An application program may also be running on the same computer system. The application program issues a read request to read a portion of a file. The file is located on a WebDAV server. As a result, the WebDAV client running on the computer system receives the read request. Instead of sending the read request to the WebDAV server, the WebDAV client first looks for this data in a local cache that the WebDAV client maintains. If the WebDAV client does not find the data in its local cache, the WebDAV client prepares to send a request to the WebDAV server. Instead of making the exact WebDAV GET byte-range request to retrieve only the data that is requested, the WebDAV client requests more data; for example, the WebDAV client may send a request for the entire file to the server. This act of reading data that is not originally requested is performed in anticipation that the requesting application program and other following application programs may request other parts of the file. For example, if an application program asks for the first 8 Kilobytes from a 1024 Megabyte file, a conventional WebDAV client will send a WebDAV GET request to retrieve the entire 1024 Megabyte file. The WebDAV client would then store the 1024 Megabyte file in a local cache and only deliver the first 8 Kilobytes of this data that were originally requested by the application program. When the WebDAV client receives a subsequent read request for this file from an application program (i.e., the same application program or another application program on the same computer system), the WebDAV client on that computer system reads from the locally cached copy of the file to retrieve the requested data instead of making another WebDAV GET request to the server.
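The conventional read path described above can be sketched as follows. The classes FakeServer and ConventionalClient are hypothetical stand-ins that exist only for this illustration; the "network" is simulated by a counter of bytes sent, and no real WebDAV library is used.

```python
class FakeServer:
    """Stands in for a WebDAV server and counts the bytes it sends."""

    def __init__(self, files):
        self.files = files  # path -> bytes
        self.bytes_sent = 0

    def get(self, path, offset=0, length=None):
        data = self.files[path]
        end = len(data) if length is None else offset + length
        chunk = data[offset:end]
        self.bytes_sent += len(chunk)
        return chunk


class ConventionalClient:
    """On a cache miss, downloads the whole file, then serves reads locally."""

    def __init__(self, server):
        self.server = server
        self.cache = {}  # path -> complete file contents

    def read(self, path, offset, length):
        if path not in self.cache:
            # Not the exact byte range: the entire file is requested.
            self.cache[path] = self.server.get(path)
        # Only the originally requested bytes are delivered to the application.
        return self.cache[path][offset:offset + length]
```

With a file scaled down to 1 Megabyte for illustration, a first read of 8 Kilobytes causes the server to send the full 1 Megabyte, and a second read of a different part of the same file causes no further transfer.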
This conventional method, in which a WebDAV client manages read requests from an application and only issues a GET request when data is not cached, is applicable when a file is of a reasonable size, for example a few megabytes or a few gigabytes. In this case, the WebDAV client bears a one-time expense of downloading the entire file, and subsequent requests do not require further network access to the server. If the file is relatively large, for example from several gigabytes to a few terabytes, the one-time expense (or investment) can be relatively high, as this may require a relatively large bandwidth and/or a relatively long period of time to download the large file. The return on investment (ROI) may be even lower if the file is not needed any longer after the initial read. There are several other situations, some of which are discussed in this application, where the process of explicit caching by a WebDAV client can be detrimental to system performance.
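The one-time expense discussed above can be estimated with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the link speed of 1 gigabit per second is an assumption that does not come from this description, and protocol overhead and latency are ignored.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Idealized time to move size_bytes over a link of the given speed."""
    return size_bytes * 8 / link_bits_per_second


def useful_fraction(requested_bytes, file_bytes):
    """Share of a whole-file download that the application actually asked for."""
    return requested_bytes / file_bytes


# Assumed figures: a 1 terabyte file over a 1 gigabit per second link.
one_terabyte = 10**12
link = 10**9
download_time = transfer_seconds(one_terabyte, link)  # 8000.0 seconds

# From the earlier example: 8 Kilobytes requested from a 1024 Megabyte file.
fraction = useful_fraction(8 * 1024, 1024 * 1024 * 1024)  # 1 / 131072
```

Under these assumptions the whole-file download takes more than two hours, and in the 8 Kilobyte example less than one part in one hundred thousand of the transferred data was originally requested.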
For example, consider a case where an application program acting as a WebDAV server implements a server program by using some of the methods described in the Provisional Patent Application No. 61/733,228, filed on Dec. 4, 2012, and entitled “METHOD AND SYSTEM FOR STORAGE AND DISSEMINATION OF DATA FILES AND VIRTUAL DERIVED DATA FILES”, the entire content of which is hereby incorporated by reference, and that a computer software program acting as a WebDAV client implements the client program using some of the methods described in the 61/733,228 application. For example, as disclosed in the 61/733,228 application, a data file is defined as one or more bytes that exist in computer memory or on a computer storage device, such as a hard disk or a clustered storage device. A data file can be exposed to a computer program via a well-known interface, for example, an Object Storage Solution interface, or a POSIX file-system interface. One or more data files are referred to as a collection of data files. The symbol D is used to indicate the one or more data files or collection of data files. A data file that is physically stored on a storage device is referred to as a first data file. A data file that is virtually presented to a consumer as if it were stored on a storage device but is not actually stored on a storage device, and is derived from the first data file, is referred to as a second virtual derived data file. A client program entrusts a server program with a first data file D that is of a known data type TD. The goal for the client is to access the first data file D at a future time. A client program may also have to access additional data of the same or different data types that are derived from data file D at a future time. If the server program does not provide such derived data or the ability to create such derived data, the client program would have to look for alternative services.
If such services are not available, the client program has to generate the derived data by itself. Hence, a method for storing and retrieving data files on a storage device is provided. In one embodiment, the method allows for defining a data file virtualization policy that provides a client program with the ability to send the client program's intent to access the data stored in a first data file, as well as an intent of the client program to access data files of other data types that are derived from the first data file D. A data file virtualization policy is defined as the intent, by a client program, of accessing a first data file D, as well as derived data files D1, D2 . . . DN. One or more derived data files D1, D2 . . . DN are derived from the first data file D, and are virtual. The term virtual implies that one or more data files D1 . . . N do not physically exist on a storage device. The term virtual further implies that a directory listing of data files D1 . . . N is available to the client program. The client program believes that the data files exist on the server-side storage device. The term virtual further implies that a derived data file DJ (where 1≦J≦N) is generated by the server program by reading the first data file D wholly or partially, dynamically, on-demand, when a client requests that specific derived data file. A data file virtualization policy is denoted herein as PD or PD(1 . . . N). A client program sends a first data file D (i.e., one or more first data files D) to a server program accompanied by a virtualization policy PD that corresponds to each first data file D in the one or more data files D. The one or more data files D are of the same data type TD. When a server program receives the one or more first data files and PD from a client program, the server program stores the one or more first data files and PD on a storage device.
Using either a database or a known structure or protocol, the server program associates the one or more first data files with virtualization policy PD. For example, a known protocol would be for the server to save the one or more first data files and PD into the same server-side storage generated UUID for each file in the one or more first data files, into PD. In this example, a first data file D would be one of the files in the one or more first data files. A client program sends a first data file D and an associated first data file virtualization policy PD(1 . . . N) to a server program using a computer network. D can be one data file of data type TD or it can be more than one data file of the same data type TD. A client program does not know what derived data types are supported by a server program for a data file of data type TD. Therefore, a client program may request a server program to send back a list of supported derived data types. The client program may also request a list of supported parameters for each supported derived data type. The parameters allow a client program to control the output of the derived data that will be subsequently generated on-demand by the server program. Once the list of supported parameters is known, a client program can announce its intent to request all supported derived data types or only a subset of supported derived data types, at a future time.
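The notions of a first data file D, a virtualization policy PD and virtual derived data files can be illustrated with the following minimal sketch. It is not the method of the 61/733,228 application. All names here (VirtualizationPolicy, Server, DERIVATIONS, the "upper" and "reversed" derived types, and the convention of naming a derived file by appending its type to the name of D) are hypothetical and chosen only so that the example is self-contained.

```python
from dataclasses import dataclass, field

# Hypothetical derivations; the description names no concrete derived types.
DERIVATIONS = {
    "upper": lambda data: data.upper(),
    "reversed": lambda data: data[::-1],
}


@dataclass
class VirtualizationPolicy:
    derived_types: list  # the client's declared intent to access D1 .. DN


@dataclass
class Server:
    files: dict = field(default_factory=dict)     # physically stored first data files
    policies: dict = field(default_factory=dict)  # name of D -> its policy PD

    def supported_types(self):
        """Lets a client discover which derived data types are supported."""
        return sorted(DERIVATIONS)

    def store(self, name, data, policy):
        unknown = set(policy.derived_types) - set(DERIVATIONS)
        if unknown:
            raise ValueError(f"unsupported derived types: {sorted(unknown)}")
        self.files[name] = data
        self.policies[name] = policy

    def listing(self):
        """Directory listing that also shows the virtual derived files."""
        names = []
        for name, policy in self.policies.items():
            names.append(name)
            names.extend(f"{name}.{kind}" for kind in policy.derived_types)
        return names

    def read(self, name, offset=0, length=None):
        if name in self.files:
            data = self.files[name]
        else:
            base, _, kind = name.rpartition(".")
            if base not in self.files or kind not in self.policies[base].derived_types:
                raise FileNotFoundError(name)
            # Generated on demand from D; the derived file is never stored.
            data = DERIVATIONS[kind](self.files[base])
        end = len(data) if length is None else offset + length
        return data[offset:end]
```

In this sketch, a client that stores D with a policy naming the "upper" derived type sees both D and the virtual derived file in the listing, although only D exists in the server's storage. For simplicity the sketch derives the whole file before slicing; a real server could read D only partially.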
In this case, the conventional method to implement a WebDAV GET request will not be efficient when one or more WebDAV GET requests are sent by a WebDAV client to the WebDAV server requesting parts of a virtual derived data file. For example, if multiple application programs running on multiple computer systems simultaneously request different parts of the same virtual derived data file, each read request is sent to a corresponding WebDAV client running on the same computer system as the application program. If the WebDAV client uses a conventional method of implementing a WebDAV GET request, it can be shown that the WebDAV client program places unnecessary compute load on the WebDAV server program, which must generate parts of the virtual derived data file not originally requested by the client-side application program making the original read request. In one-time processing methods, such as some analytics algorithms, for example object tracking methods, the original application program making the read request does not need more of the virtual derived data file than originally requested. Even if, in this case, the original application needs more of the virtual derived data file, it may be nearly impossible to predict what portion of that very large virtual derived data file the application may subsequently need.
Each WebDAV client program sends one or more WebDAV GET requests for a larger byte-range than originally requested, including, but not limited to, the whole file, from a virtual derived data file. More than one computer system having implementations of the WebDAV server may receive the multiple WebDAV GET requests for large portions of the same virtual derived data file. Each server computer system would then try to generate this virtual derived data file, causing the server-side computer system or systems to generate more bytes than required. These bytes from the file are retained in the WebDAV client's computer memory, and cached by the WebDAV client on a computer storage device for future use. However, the application program making the original read request to the WebDAV client may not require that data.
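The extra server-side work described above can be quantified with a small sketch. The function generated_bytes is hypothetical, and it assumes that each client has its own cache and that each GET request is served independently, so that no generated bytes are shared between requests.

```python
def generated_bytes(file_size, requests, whole_file):
    """Total bytes the server side must generate for one read per client.

    requests is a list of (offset, length) pairs, one per client.
    whole_file=True models the conventional client that asks for the entire
    virtual derived data file; False models exact byte-range requests.
    """
    if whole_file:
        return file_size * len(requests)
    return sum(
        min(length, max(file_size - offset, 0)) for offset, length in requests
    )
```

For example, with four clients that each read 8 Kilobytes from different parts of a 1024 Megabyte virtual derived data file, the conventional approach has the server side generate four times 1024 Megabytes, whereas exact byte-range requests require only 32 Kilobytes to be generated.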
If a client program issues multiple WebDAV GET requests to the server, each request asking for parts of the first data file that will not be accessed again for a period of time, the conventional method of implementing a WebDAV GET request will not be efficient. When the first data file, from which the virtual derived data file is derived, is a large data file, such as several gigabytes (e.g. 100 GB) to many terabytes (e.g. 3 TB), it may be inefficient to transfer a large portion of the file from the server to local storage attached to the computer system running the client program.
Therefore, there are several scenarios where the conventional method of implementing a WebDAV GET request may not be efficient. As can be appreciated, the example provided above is only one of many possible scenarios in which the conventional method of implementing a WebDAV GET request may not be efficient.