Many medium- or large-scale organizations have legacy systems containing tens to hundreds of thousands of VSAM or sequential data files that are accessed and maintained by COBOL or PL/1 programs. With these types of files, the layout of the contents is neither specified in the file nor self-evident. Rather, the layout is separately defined in some COBOL copybook or PL/1 “include” file. Unfortunately, these kinds of file systems lack any kind of metadata catalog or repository, so there is no definitive link between any data-containing file and the layout(s) that can be used to interpret the data-containing file contents, which effectively makes the contents very difficult to access.
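By way of illustration only, the following minimal Python sketch shows why the missing link matters. The record bytes, field names, offsets, and lengths are all hypothetical, and the record is assumed to be ASCII text for simplicity (actual mainframe files would typically be EBCDIC and may contain packed or binary fields). The same 20-byte record decodes without error under two unrelated layouts, so nothing in the data itself indicates which layout is the correct one.

```python
# Hypothetical layouts, each a list of (field name, offset, length),
# standing in for what a COBOL copybook or PL/1 include file would define.
CUSTOMER_LAYOUT = [("cust_id", 0, 6), ("name", 6, 10), ("balance", 16, 4)]
ORDER_LAYOUT = [("order_no", 0, 4), ("sku", 4, 8), ("qty", 12, 8)]


def decode(record: bytes, layout):
    """Slice a fixed-width record into named fields according to a layout."""
    text = record.decode("ascii")  # assumption: ASCII rather than EBCDIC
    return {name: text[off:off + length].strip() for name, off, length in layout}


# A hypothetical 20-byte record; the file carries no hint of its own layout.
record = b"001234JANE DOE  0150"

as_customer = decode(record, CUSTOMER_LAYOUT)
as_order = decode(record, ORDER_LAYOUT)

print(as_customer)  # fields line up with sensible values
print(as_order)     # also decodes, but the fields are meaningless
```

Both calls succeed; only a reader who already knows that the record is a customer record can tell that the second decoding is wrong.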
Nevertheless, such data-containing files can potentially contain information that is valuable to the organization. However, because their contents are not easily accessible and the current cost of identifying what may be contained in even a single file is relatively high, most organizations ignore these files as potential sources of insight. Compounding the problem, there is no easy way to discern which files contain nothing of interest, which contain information that must be retained due to regulatory requirements, and which may contain sensitive or valuable information, so the sheer number of files makes even attempting to find out prohibitive. While data virtualization tools like IBM® InfoSphere™ Classic Federation Server for z/OS® are useful in performing an iterative mapping process to attempt to match such files with layouts in cases where only a few files need to be matched, that process is tedious and time-consuming, and it does not scale to cases where even hundreds, let alone thousands or hundreds of thousands, of such files are involved, a not uncommon situation for some organizations.
Thus, there is an ongoing technological problem of matching format-defining data structures with data-containing structures when large numbers of both are involved.