Distributed File Systems are known in the computer-related arts and typically support conventional hierarchical file organizations. For instance, a user or an application can create directories and store files inside the created directories. Distributed File Systems oftentimes are set up to have master/slave architectures and oftentimes can provide high data bandwidth and scalability by using many nodes in a single cluster where every single instance can support many files. For instance, hundreds of nodes may be provided, and each cluster may support millions of files.
A Distributed File System additionally or alternatively may be tuned to support large files, e.g., that sometimes may be gigabytes or more in size. Very large files may be reliably stored across machines in a large cluster, for instance. Internal to the Distributed File System, a file may be split into one or more blocks, and these blocks may be stored in multiple nodes in the cluster.
Replicated data may be stored in multiple nodes in some implementations, and this design can in some cases help in reliability and/or performance of the overall Distributed File System.
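By way of illustration only, the block splitting and replication described above might be sketched as follows. The function names, the fixed block size, and the round-robin placement policy are assumptions made for this example and do not describe how any particular Distributed File System places its blocks.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Split data into consecutive fixed-size blocks (the last may be shorter)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block index to `replication` distinct nodes.

    Hypothetical policy: simple round-robin over the node list.
    """
    if not 1 <= replication <= len(nodes):
        raise ValueError("replication must be between 1 and the number of nodes")
    placement: Dict[int, List[str]] = {}
    for block_index in range(num_blocks):
        placement[block_index] = [
            nodes[(block_index + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(b"abcdefghij", block_size=4)
    print(blocks)
    print(place_replicas(len(blocks), ["node-1", "node-2", "node-3"], replication=2))
```

Because each block is held by more than one node in this sketch, the loss of any single node leaves at least one copy of every block available.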
Distributed File Systems oftentimes are designed to favor batch processing over interactive use by users, e.g., placing the emphasis on high throughput for data accessions, etc., rather than low latency for data accessions. To facilitate the processing of many files and/or large files, Distributed File Systems oftentimes are designed to handle network failures, failures of nodes in a cluster, and/or the like.
The webMethods ActiveTransfer, commercially available from the assignee, is an integrated managed file transfer solution that in essence provides a single point of control for all file transfer activities, both inside and outside an extended enterprise. ActiveTransfer enables organizations to exchange information securely over a computer-mediated network such as the Internet and supports a variety of communication protocols, including HTTP, HTTPS, FTP, FTPS (SSL), SFTP (SSH), SCP, WebDAV, and WebDAVs protocols. It also supports a variety of file system architectures such as, for example, FAT32, NTFS, etc. ActiveTransfer thus enables users to interact with a variety of different business partners, while offering data security solutions including support for the world's most stringent encryption standards (such as, for example, SSL and integrated PGP). For instance, a user can apply global and/or per-user IP address restrictions, apply policies that restrict file transfer activities during specific days of the week and/or time of day, etc. ActiveTransfer thus can help in bringing together B2B, application support, managed file transfer, and/or other activities, for example, in a service-oriented platform.
Although ActiveTransfer is useful in many scenarios, the inventors have recognized that the current ActiveTransfer implementation could be improved in certain ways. For example, the architecture underlying ActiveTransfer stores files in a File System, which can limit the size and/or amount of data being processed. Although the File System could be implemented over a cluster of nodes, a large degree of homogeneity likely would be needed, e.g., as all nodes in the cluster would have to use the shared file system (e.g., using mapped network drives).
End users also may in some instances have to implement custom logic for processing files, thereby potentially requiring a high degree of skill, time, and in-depth knowledge about the files being processed, the systems being used, etc. And because files typically will not be broken into pieces, individual clients oftentimes will have to process large files sequentially, thereby creating the potential for networking bottlenecks. In a similar vein, end users themselves may have to manage the space in the File System, make backups for possible recovery purposes, etc.
The inventors have, however, recognized that ActiveTransfer can be integrated with certain Distributed File System (DFS) techniques, e.g., to help address these and/or other issues. For instance, integrating ActiveTransfer with a DFS architecture could be useful in allowing end users to upload files using ActiveTransfer and have these files processed in batches. It will be appreciated that the write-once-read-many access model for files would be useful in this example scenario.
Certain example embodiments thus relate to the integration of an integrated managed file transfer solution (e.g., ActiveTransfer and/or the like) with a Distributed File System (e.g., HDFS, Amazon S3, and/or the like), optionally with a map reduce framework (e.g., Hadoop and/or the like). Among other things, map reduce frameworks provide programming models for processing large data sets with a parallel, distributed algorithm on a cluster. The presence of the DFS (optionally with a map reduce framework) advantageously can aid in: the processing of vast amounts of data in-parallel on large clusters in a reliable and fault-tolerant manner, the storage of very large amounts of data (e.g., terabytes, petabytes, etc.), the high-throughput accessing of such information, and/or the like. Files and/or portions of files may in some instances be stored in a redundant fashion, e.g., across multiple machines, to help deal with potential network and/or node failures, high availability (e.g., as a general matter and potentially to parallel applications), etc.
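For purposes of illustration, the map reduce programming model mentioned above can be sketched in a few lines. This is a single-process toy, assuming a word-count job and using a thread pool only to suggest that map tasks are independent of one another; it is not the API of Hadoop or of any other framework, and all function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def map_chunk(chunk: str) -> List[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped: List[List[Tuple[str, int]]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_group(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: combine the values collected for one key."""
    return key, sum(values)


def run_job(chunks: List[str]) -> Dict[str, int]:
    # Each chunk is mapped independently, so the map step can run concurrently.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_chunk, chunks))
    groups = shuffle(mapped)
    return dict(reduce_group(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_job(["a b a", "b c"]))
```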
ActiveTransfer itself may be thought of as a virtualization layer for the (possibly secure) transfer of files. The implementation of ActiveTransfer as a Virtual File System (VFS) may involve “adapters” to aid in file input/output (I/O) processing during runtime. That is, in order to handle “virtual files,” a decoupling layer may be defined separate from the physical implementation, and an adapter that can take care of the particular application programming interface (API) calls may be provided.
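The decoupling layer and adapter arrangement described above might, as one non-limiting sketch, look like the following. The class names, the mount-prefix scheme, and the in-memory adapter are assumptions introduced for this example and are not taken from the ActiveTransfer product.

```python
from abc import ABC, abstractmethod
from typing import Dict, Tuple


class FileAdapter(ABC):
    """Hypothetical adapter interface hiding one physical storage API."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class InMemoryAdapter(FileAdapter):
    """Stand-in backend; a real adapter would issue FTP, SFTP, or similar calls."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        return self._files[path]

    def write(self, path: str, data: bytes) -> None:
        self._files[path] = data


class VirtualFileSystem:
    """Decoupling layer: maps virtual folder prefixes to adapters."""

    def __init__(self) -> None:
        self._mounts: Dict[str, FileAdapter] = {}

    def mount(self, prefix: str, adapter: FileAdapter) -> None:
        self._mounts[prefix.rstrip("/")] = adapter

    def _resolve(self, virtual_path: str) -> Tuple[FileAdapter, str]:
        # The longest matching prefix wins, so nested mounts behave as expected.
        for prefix in sorted(self._mounts, key=len, reverse=True):
            if virtual_path.startswith(prefix + "/"):
                return self._mounts[prefix], virtual_path[len(prefix) + 1:]
        raise FileNotFoundError(virtual_path)

    def read(self, virtual_path: str) -> bytes:
        adapter, path = self._resolve(virtual_path)
        return adapter.read(path)

    def write(self, virtual_path: str, data: bytes) -> None:
        adapter, path = self._resolve(virtual_path)
        adapter.write(path, data)
```

A caller in this sketch works only with virtual paths such as `/partners/acme/in.csv`; which adapter performs the physical I/O is decided by the mount table at runtime.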
Yet when an integrated managed file transfer solution such as ActiveTransfer is integrated with a more complex distributed file system, further processing of the metadata of a file may be needed. For example, when working with a more complex framework for distributed processing of large data sets across clusters of computers using simple programming models and/or a more distributed, scalable, and portable file system, it may not be sufficient to adapt a particular API at execution time, e.g., by providing a runtime adapter. Although it is possible to pre-configure or hard-code constants as configuration data, etc., the application of fixed values reduces flexibility of the file system being implemented and can detract from some of the benefits provided by the more complex file system.
The inventors thus have recognized that the use of a File Metadata Handler or the like may in certain example embodiments facilitate the storing of files in the DFS and/or the processing of such files, e.g., in a more dynamic manner. The File Metadata Handler of certain example embodiments may be able to work with, or take the place of, a more conventional file handler adapter, e.g., to provide possibly required pre-processing and/or additional file processing at runtime.
For example, in a DFS, files may be split across nodes physically based on size. However, files sometimes may additionally or alternatively be split logically. Logical splitting may, however, require information about how the data in the file is stored. A File Metadata Handler according to certain example embodiments may help in maintaining this information, e.g., so that it can be provided to the map reduce framework, etc.
As another example, the processing of Map-Reduce Tasks may require knowledge of how the data in a given file is stored. For instance, Map-Reduce Tasks may need to know whether there are multiple columns in a row, e.g., to decide which column(s) should be processed and which column(s) can be ignored. Similarly, if there is any validation, data transformation, etc., to be performed, the map reduce framework may need to be provided with such information. A File Metadata Handler according to certain example embodiments may help in maintaining this information, as well.
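As a hedged illustration of how such metadata could drive a map task, the sketch below selects columns, validates values, and applies transformations according to caller-supplied metadata. It assumes CSV input with a header row; the function name and the shape of the `columns`, `validators`, and `transforms` arguments are hypothetical.

```python
import csv
import io
from typing import Callable, Dict, Iterator, List


def map_rows(
    text: str,
    columns: List[str],
    validators: Dict[str, Callable[[str], bool]],
    transforms: Dict[str, Callable[[str], object]],
) -> Iterator[Dict[str, object]]:
    """Yield one processed record per valid row.

    columns:    which columns to keep (all others are ignored)
    validators: per-column checks; a row failing any check is skipped
    transforms: per-column conversions applied after validation
    """
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        selected = {name: row[name] for name in columns}
        valid = all(
            validators.get(name, lambda value: True)(value)
            for name, value in selected.items()
        )
        if not valid:
            continue
        yield {
            name: transforms.get(name, lambda value: value)(value)
            for name, value in selected.items()
        }


if __name__ == "__main__":
    sample = "id,name,amount\n1,alice,10\n2,bob,abc\n"
    for record in map_rows(sample, ["id", "amount"], {"amount": str.isdigit}, {"amount": int}):
        print(record)
```

In this sketch the `name` column is never read by the task, and the second row is dropped because its `amount` fails validation.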
As still another example, certain file formats can require custom definitions as to how logical file splits in a Distributed File System are to take place. For instance, a custom definition may need to be provided for EDI-type data, e.g., such that data for each block in an EDI file can be processed by a Map-Reduce Task in a sequential manner. A File Metadata Handler according to certain example embodiments could help in defining transaction boundaries in and/or across files.
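A logical split along transaction boundaries might be sketched as follows. The example assumes segment-oriented data in which each transaction starts with an `ST` segment and ends with an `SE` segment, with `~` terminating segments and `*` separating elements (loosely modeled on common X12 conventions). These tags and delimiters are parameters of the sketch and would, in practice, come from the file metadata rather than being fixed.

```python
from typing import List, Optional


def split_transactions(
    payload: str,
    start_tag: str = "ST",
    end_tag: str = "SE",
    terminator: str = "~",
    separator: str = "*",
) -> List[List[str]]:
    """Group segments into whole transactions so none is cut mid-way.

    Segments outside any start/end pair (e.g. interchange envelopes)
    are ignored by this sketch.
    """
    transactions: List[List[str]] = []
    current: Optional[List[str]] = None
    for raw in payload.split(terminator):
        segment = raw.strip()
        if not segment:
            continue
        tag = segment.split(separator, 1)[0]
        if tag == start_tag:
            current = [segment]          # open a new transaction
        elif current is not None:
            current.append(segment)
            if tag == end_tag:
                transactions.append(current)  # close and emit it
                current = None
    return transactions


if __name__ == "__main__":
    data = "ISA*00~ST*850*0001~BEG*00~SE*3*0001~ST*850*0002~BEG*01~SE*3*0002~IEA*1~"
    for transaction in split_transactions(data):
        print(transaction)
```

Each returned list can then be handed to a single task and processed sequentially, which a purely size-based physical split could not guarantee.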
One aspect of certain example embodiments relates to storing files in a Distributed File System where an integrated managed file transfer solution (such as ActiveTransfer) acts as an interface, and enabling an end user to define metadata linked with the files being processed via the integrated managed file transfer solution.
Another aspect of certain example embodiments relates to a File Metadata Handler that helps provide file metadata required for processing files such as, for example, the customized and/or automatic logical splitting of files in a Distributed File System.
Another aspect of certain example embodiments relates to a File Metadata Handler that helps provide file metadata required for Map-Reduce Tasks so that processing of data (e.g., validation, transformation, skipping of columns in a row, etc.) can be customized at runtime.
In certain example embodiments, there is provided a method for enabling the accessing and/or processing of data stored in one or more computer systems in connection with a file metadata handler that has built-in support for a plurality of different file formats. A request to create a file metadata definition for the file metadata handler is received, with the file metadata definition being configurable for a first file format, and with the first file format being a file format for which the file metadata handler at least initially does not have built-in support. Files in the first file format have associated metadata that needs to be processed in order to process such files and/or files in the first file format have associated metadata for which processing logic is required to be defined in order to process such files. The received file metadata definition is linked with an identifier that identifies files of the first file format. Format information that includes information about elements that are available to files having the first file format is received. The file metadata definition is updated with received transaction boundary information, data validation information, information concerning data transformations to be applied to the elements, and/or linkages between encrypted protocols and decryption tools, if any, for the first file format. The updated file metadata definition is stored in a non-transitory computer readable storage medium, in enabling the file metadata handler to access and/or process data in the first file format.
According to certain example embodiments, the File Metadata Handler may help an end user to define or specify information such as, for example, file envelope format (e.g., headers, body, and payload if any), file format (XML, Text, JSON, CSV, etc.), elements that define transaction boundaries, validation of the data, conditions and/or logic for data transformation and processing, support for processing of encrypted data, etc. This may help in some cases provide a link for the files in the first file format that have associated metadata that needs to be processed in order to process such files and/or for which processing logic is required to be defined in order to process such files.
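One possible shape for such a file metadata definition, and for the create, link, update, and store steps, is sketched below. All class, field, and method names are hypothetical, the identifier is assumed to be a file-name suffix, and JSON serialization stands in for whatever persistent store an implementation might actually use.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class FileMetadataDefinition:
    identifier: str                       # links the definition to files, e.g. ".edi"
    file_format: str                      # e.g. "XML", "Text", "JSON", "CSV"
    elements: List[str] = field(default_factory=list)
    transaction_boundaries: Dict[str, str] = field(default_factory=dict)
    validations: Dict[str, str] = field(default_factory=dict)
    transformations: Dict[str, str] = field(default_factory=dict)
    decryption_tool: Optional[str] = None  # set only for encrypted data


class FileMetadataHandler:
    def __init__(self, built_in_formats: Iterable[str]) -> None:
        self._built_in = set(built_in_formats)
        self._definitions: Dict[str, FileMetadataDefinition] = {}

    def create_definition(self, identifier: str, file_format: str) -> FileMetadataDefinition:
        """Create a definition for a format that lacks built-in support."""
        if file_format in self._built_in:
            raise ValueError(f"{file_format} already has built-in support")
        definition = FileMetadataDefinition(identifier, file_format)
        self._definitions[identifier] = definition
        return definition

    def update_definition(self, identifier: str, **changes: object) -> None:
        """Update known fields of a stored definition; unknown names are rejected."""
        definition = self._definitions[identifier]
        for name, value in changes.items():
            if not hasattr(definition, name):
                raise AttributeError(name)
            setattr(definition, name, value)

    def get(self, identifier: str) -> FileMetadataDefinition:
        return self._definitions[identifier]

    def to_json(self) -> str:
        """Serialize all definitions so they can be written to storage."""
        return json.dumps({key: asdict(value) for key, value in self._definitions.items()})
```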
In certain example embodiments, a method of processing files in connection with a server is provided. A listing of objects is provided to a client computing device from a plurality of different computer systems that have different respective file systems and/or are accessible via different respective transport protocols, with the listing organizing the objects in a common, integrated view of virtual folders. A request to process an object that is stored remote from the client computing device is received from the client computing device. A determination is made as to whether the server has built-in support for processing the object, based on the file system to be accessed and the transport protocol to be used in processing the request to process the object. In response to a determination that the server has built-in support for processing the object, the request to process the object is processed using the built-in support of the server. In response to a determination that the server does not have built-in support for processing the object: a file format for the object is determined based on how the object is named; a file metadata definition for the determined file format is accessed from a data store, with the file metadata definition including format information about elements that are available in the determined file format, as well as (a) transaction boundary information, (b) data validation information, (c) information concerning data transformations to be applied to the elements, and/or (d) linkages between encrypted protocols and decryption tools; and the request to process the object is processed using the accessed file metadata definition. The server does not have built-in support for processing objects stored on one or more distributed file systems, and the data store includes file metadata definitions for each said distributed file system.
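The decision flow of this method can be summarized in a short sketch. It assumes that built-in support is represented as a set of (file system, protocol) pairs and that the file format is inferred from the file-name extension. Both representations, along with the function name and the returned strings, are illustrative assumptions only.

```python
import os
from typing import Dict, Set, Tuple


def process_request(
    object_name: str,
    file_system: str,
    protocol: str,
    built_in: Set[Tuple[str, str]],
    definitions: Dict[str, Dict[str, object]],
) -> str:
    """Return a label describing which processing path handled the request."""
    # Path 1: the server natively supports this file system and protocol.
    if (file_system, protocol) in built_in:
        return f"built-in:{file_system}/{protocol}"

    # Path 2: infer the format from the object's name, then look up
    # the matching file metadata definition in the data store.
    extension = os.path.splitext(object_name)[1].lower()
    definition = definitions.get(extension)
    if definition is None:
        raise LookupError(f"no file metadata definition for {extension!r}")
    return f"metadata:{definition['file_format']}"


if __name__ == "__main__":
    built_in = {("NTFS", "SFTP"), ("FAT32", "FTP")}
    definitions = {".edi": {"file_format": "EDI"}}
    print(process_request("report.txt", "NTFS", "SFTP", built_in, definitions))
    print(process_request("orders.EDI", "HDFS", "HTTP", built_in, definitions))
```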
In certain example embodiments, there are provided non-transitory computer readable storage mediums tangibly storing instructions that, when executed by at least one processor of a computer system, perform the above-described and/or other methods.
Similarly, in certain example embodiments, a computer system including a processor, a memory, and a non-transitory computer readable medium, is configured to execute computer functions for accomplishing the above-described and/or other methods. For instance, in certain example embodiments, a data processing system is provided. A managed file transfer solution server includes at least one processor and a memory, with the managed file transfer solution being configured to provide a single virtual folder based view of data stored on a plurality of different computer systems and receive requests to process data in connection with the plurality of different computer systems. The different computer systems support at least FTP and HTTP protocols, as well as at least one distributed file system. A file metadata handler is configured to store user-configurable definitions, with the definitions providing metadata information needed to access at least some of the computer systems and/or the data stored therein. At least one user-configurable definition is provided for the at least one distributed file system, and the at least one user-configurable definition provided for the at least one distributed file system includes metadata needed to access data stored on the at least one distributed file system. At least some of the definitions stored by the file metadata handler are accessible in dependence on a naming convention associated with data being requested.
These aspects and example embodiments may be used separately and/or applied in various combinations to achieve yet further embodiments of this invention.