1. Field of Invention
The present invention relates to a method of modeling, storing and transferring information in a neutral form in a computer. The present invention reduces the complexity and effort presently involved in the modeling and storage of information while creating a storage format which enables complete parallel processing of data. Furthermore, it enables not only direct integration of different models and their stored data without remodeling or reloading, but also dynamic evolution of those models and their data after some earlier implementation.
2. Description of Prior Art
Since their introduction about 40 years ago, computers have increasingly been used as mechanisms for the storage of information. FIG. 1 summarizes the prior art for the computer storage, retrieval and transfer of information. Despite exponentially rising demand and all of the effort that has been expended during the past several decades to develop and apply methods for data modeling, storage and transfer, the bulk of organized information is still handled outside of databases. This has been due to the complexity, inflexibility and cost of modeling, storage and transfer of organized data that is inherent in prior art techniques.
Whether or not they employed data models, prior art techniques have, so far as is known, been unable to achieve any practical success in data transfer without a user being intimately familiar with both the structure and the definitions of the data involved, and also with the particular application programs and interface languages employed for its storage. The result has been an inherent complexity in data modeling, storage and transfer, putting data organization and access beyond the reach of many potential information workers who generate and seek to exchange, interpret and use their data.
Users have been forced to apply the skills of information technology specialists for data organization and information management where they can afford or justify the costs. Otherwise, users have been forced to limit arbitrarily the scale of their work where they cannot afford such specialists. At the current complexity and escalating costs of prior art methods, only a small portion of the generated data that exists is processed into database form.
Methods available in the prior art for the storage and transfer of un-modeled collections of information have required a great deal of a-priori knowledge on the part of the user, both of the structure and definitions of the data and of the application programs involved. Spreadsheets have been one common method employed for such storage. Their usefulness is restricted to collections of data of limited complexity and size. Storage and transfer have typically been required to be of a complete rather than a partial spreadsheet file. Un-modeled information has typically been stored directly in an application program format, e.g., an Excel® or Lotus® spreadsheet file.
Although the information can be downloaded to ASCII or other primitive standard formats for transfer purposes, this is done with significant additional loss of the already limited data definition. Larger and more complex collections of information typically have been stored in the format of the specialized application programs within which they are generated, e.g., accounting programs, simulation modeling programs, or engineering calculation programs. Here too, retrieval and transfer typically have been limited to complete rather than partial files.
Often, such information has been merely organized into streams of ASCII or comparable format. This has imposed severe limitations on the storage of structure and content of the data as opposed to merely the data values. Transfer of any and all such organized information, including spreadsheets, thus has relied heavily upon a user's knowledge in advance of the structure and content of the information. Typically, a user has operated the same application interface program that generated the data to reduce the degree of difficulty involved in the many interpretive aspects of its transfer. The result is a severely restricted exchange of information both within and between data user organizations.
As shown in FIG. 1, data modeling has represented an increase in the degree of organization from a collection of un-modeled organized information. The clear advantage has been that portions of a collection of information can be transferred rather than only the entire file. However, this advantage has been achieved with a major increase in the complexity and cost of the data management.
Data modeling for a collection of information involves a combination of (a) organization of the data values; (b) some system or technique for definition of each data value; and (c) a system for structuring the relationships among various data values for storage in a manner which is capable of supporting accurate retrieval and transfer of targeted subsets of the information. Many modeling methods have been proposed, and of those only a limited number have proven useful enough to receive widespread use. Examples of proposed methods include: entity relationship (ER); Nijssen's information analysis method (NIAM); IDEF1X, a graphical language; EXPRESS, a product information model; and object oriented model (OOM). Discussions of these methods are available in many publications. Chapter 2, pp. 12-30 of Schenck, "Information Modeling: The EXPRESS Way," is representative. Only ER and OOM are, so far as is known, presently in widespread use.
All of these prior art modeling methods have imposed a-priori relationships and some form of hierarchy both for the organization and the storage of data. The particulars of such a hierarchy have varied with the data modeling method and with the individual modeler. In each case, however, the hierarchy selected has been incorporated into a structure of the information for storage and has been different and specific to each major modeling-based database and its associated software products.
Thus the storage of information in relational and object oriented databases has been both program dependent and a function of the particular form of relationships and hierarchy chosen during modeling. As a result, the retrieval and transfer of database information have required a significant knowledge of both the hierarchy imposed on the data and the particular data storage structure used by the database product employed. These have been major limitations of the current technology which have become particularly evident to users involved in the retrieval and transfer of information.
Efforts have been made to reduce the severity of these problems through the development of a standard database software program interface language, Structured Query Language (SQL). This effort has met with only limited success, in part because of the rapidly changing features of individual database software programs and the data itself.
Both the known data modeling techniques and the Structured Query Language (SQL) interface scheme also have involved a complex set of rules and conventions whose understanding and practice are well beyond the reach of typical data users. See, for example, Date, "A Guide to the SQL Standard," Appendix A, pp. 137-152. As a result, specialists in database management have been required for data modeling, which has given rise to high total database management costs and complicated communication problems between data users and data modeling specialists. The combination of these limitations has resulted in limited application of data modeling for the storage and transfer of organized data.
Each of the prior art data modeling and storage techniques has involved extensive integrated organization of the data and its structure, which has severely limited the capacity for parallel processing in data storage as well as in data retrieval and transfer. Major relational and object oriented database vendors have developed complex optimization routines to guide the selection of pathways through these integrated structures. These have provided limited mapped segments that can be stored and retrieved in parallel. These segments have tended to be short relative to the overall retrieval process, prompting numerous merges and analyses of intermediate results in order to determine the next correct stage of the storage or retrieval.
All of this effort has consumed processor capacity, reducing the amount of computer capacity available for direct data storage or retrieval work. The more parallel paths that were created, the more complex were the paralleling, merging and analysis efforts. The net effect was rapidly rising complexity and a declining percentage of useful storage or retrieval work produced from each parallel storage or retrieval path that was added.
As a result of such inherent diminishing returns, the prior art was able to achieve only limited gains in storage or retrieval performance through parallel processing. The limited gains that were possible came only with major increases in the complexity of overall database storage, retrieval and transfer operations.
The storage, retrieval and transfer limitations imposed by database program dependent data hierarchies and structures have been particularly evident in large, complex databases involving diverse interactions. Such complexities were common to the engineering and manufacturing life cycle where there has been a great need to share data across applications, across vendor platforms and between contractors, suppliers and customers. As a result, efforts have been made to define a neutral form for data modeling, storage and transfer which would be independent of both the application program from which it was taken and of any application to which it would be applied. After many years of effort, the International Organization for Standardization (ISO) published ISO 10303, Product Data Representation and Exchange. This set a standard for both a neutral form (ISO STEP Neutral Form) and a modeling language (EXPRESS) through which data was to be organized for incorporation into the neutral form for storage and transfer.
Severe practical limitations have so far led to minimal implementation of ISO 10303. ISO STEP and EXPRESS do nothing to reduce the complexity inherent in prior art data modeling and storage. In fact, they have seemingly increased the complexity by adding a layer of requirements on top of those which have already existed. The overhead associated with constructing an ISO STEP neutral form was high, typically between 10 and 20 times the size of the raw data file. Furthermore, there was no technique for exchanging EXPRESS modeling information together with a neutral form file so that the neutral form information could be immediately interpreted upon receipt of the transfer. While eliminating direct application program dependence, the construction and interpretation of the neutral form file has also still been hierarchically dependent on the EXPRESS modeling form of hierarchical representation.
ISO 10303 could offer some potential for the effective use of parallel processing of ISO compliant data. The ISO STEP neutral form took the form of groups of the data and their related information, each of which was organized into a separate and independent record in the neutral form file. Because there was no dependency between groups of neutral form records, each could be processed completely in parallel.
U.S. Pat. No. 4,864,497 attempted to address the issue of practical neutral or common data structures for the storage and access of information by multiple application programs. However, the approach taken preserved data hierarchy and integration as the premise for data modeling. It also adopted a convention for neutral file definition that apparently did not reduce the complexity or rigidity present in prior art data structure technology.
Prior art methods for addressing and accounting for relationships among data have been inherently complex, rigid, and highly application program dependent. As shown in FIG. 2, a hierarchy and integration of structures consistently have been employed in prior art data structure technology for the storage, retrieval and transfer of information having any significant complexity. At sufficiently low levels of complexity, non-integrated, non-hierarchical methods have been employed, but they have been limited to application dependent program formats or low level formats such as ASCII, the limitations of which have already been discussed above.
Hierarchy and integration in data structure, including all known forms of relational and object oriented technology, have involved the creation of a top-down logical network to describe all of the inter-relationships and precedence orders associated with the flow of information to be data modeled. The information described by this network has then been used to construct data models which have defined integrated structures for data storage. Since there was a direct relationship between the logical network, its corresponding data model, and the integrated structure for the storage of actual data, changes in the logical network necessarily changed the data model, and in turn the integrated storage of the data.
Since the networks involved were typically complex, single changes in the network could precipitate extensive changes in data modeling and storage. Any errors in logic introduced during a change, either in the logic network or data model itself or in the protocols used for storage, could result in faulty storage which produced errors in data retrieval and transfer. Any changes in the data model or data storage protocols made after actual data storage had begun created significant problems for future access to such data.
Prior art technology has sought first to define the largest possible universe within which to construct its information network and second to preserve the resulting information network and its related data model and data structure for as long as possible once it has been established. It has been forced to do so because of various factors. These have included: the complexity and interlocking relationships between the hierarchy and integrated structure of the data model; the corresponding complex structure of data storage and the difficulty in changing data hierarchy and structure once it has been established; and the inability to first create and then to merge a series of separate, more localized information networks, data models and data structures into one final composite or universal set. See, for example, Teorey, "Database Modeling and Design: The Fundamental Principles," Chapter 3.
In effect, large amounts of time, often man-months to man-years, have had to be spent gathering information to define a single internally consistent universal information network. This has then been translated into a single rigid fixed element data model and in turn into a rigid, fixed element data storage structure. Once the resulting database was placed in service, there was an almost unyielding resistance to change it. Since the work environment was actually changing daily, the result was that a data model and data storage structure were well out-of-date the day they went into service. However, they remained in service unchanged for extended periods of time because of the great complexity and cost associated with changing them.
U.S. Pat. No. 5,303,367 represented an apparent effort to overcome at least a portion of the severe limitations in the prior art technology. However, the concept of hierarchical and integrated networking, modeling and storage of information was retained completely. The complexity underlying the limitations of prior art data structure technology was thus maintained. This complexity was increased by introducing the concept of a networking structure inversion, which had to be carried out each time a new element of structure was added to the system. In addition, a format for data storage that provided all of the information necessary for use in retrieval and transfer without a-priori knowledge of the data was apparently lacking.
Recently, the concept of data warehouses has become a topic of interest. Data warehouses are one medium for the co-mingling of data from different databases. Prior art data structure technology did not permit, as far as is known, the co-mingling of different data models or their associated data without creation of a new model that encompassed each of the different models and their content. Also, substantial effort or "manual cleansing" of data structures and data information was typically required for legacy or new databases in order to incorporate the information into another database. Included in cleansing operations was the rationalization of a wide variety of different but equivalent data formats or configurations. This was done so that all of the values reported in the consolidated field name or set of field names conformed to the same format and configuration. These cleansing operations represented one of the largest expenditures of time and money in the creation of data warehouses.
With the present invention and in the preferred method for its implementation, multiple collections of data items are organized and stored in a computer-based environment as sets of information in a neutral, non-hierarchical form. Optionally, application of the neutral and non-hierarchical form can be limited to the organization of the collection, allowing for the use of alternative means for the storage of the collection itself.
According to the present invention, neutral form means that each data item in each collection has specified and associated with it sufficient information, expressed in a particular universal and invariant form, to define itself, to distinguish itself from all other data items, and to recognize any other data item to which it is related. Such information provides a later method for recognizing and sequentially assembling selected related data items, both those that exist within the original data item collection and those that traverse two or more data item collections that have been organized and stored together. According to the present invention, non-hierarchical form means that the data items in a collection are organized with no priority among themselves or among those of any other collection with which they are stored. Non-hierarchical form further means that organization of data items in a collection is achieved without knowledge of, or the establishment of relationships with, data items either in previously organized and stored collections or in collections that will be organized and stored together at a later time. Other methods heretofore have required the organization and structure of all such relationships to be recognized, understood and defined at the time of storage.
According to the method of the present invention, all data items from any number of collections, once neutrally and non-hierarchically organized and stored together, coexist in parallel together, at once completely distinguishable and separately accessible, yet at the same time totally aware of and sequentially relatable to all other data items to which they are related. One can thus find any sequentially related subset of data items from among the total assembly of those present in storage without having knowledge of or having taken any steps to establish the sequential relationships with any such data items during the organization and storage of data items.
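The retrieval behavior described above can be illustrated with a minimal Python sketch. It is not the claimed method: the flat `STORE` list, the sample values, and the rule that two items are related when they share a group or carry the same reference and value are all assumptions made for illustration.

```python
# Hypothetical flat store: (collection, item group, reference, value).
# The two collections were loaded separately; neither declares a link to the other.
STORE = [
    ("orders", "o1", "order number", "7001"),
    ("orders", "o1", "part number", "P-17"),
    ("inventory", "i9", "part number", "P-17"),
    ("inventory", "i9", "bin location", "A4"),
    ("inventory", "i3", "part number", "P-88"),
]


def related_items(store, reference, value):
    """Collect every item reachable from a starting (reference, value) pair.

    Assumption: items are related when they belong to the same
    (collection, item group) or carry the same reference and value.
    """
    seen_groups = set()
    seen_pairs = set()
    frontier = [(reference, value)]
    result = []
    while frontier:
        pair = frontier.pop()
        if pair in seen_pairs:
            continue
        seen_pairs.add(pair)
        for collection, group, ref, val in store:
            if (ref, val) != pair or (collection, group) in seen_groups:
                continue
            seen_groups.add((collection, group))
            # Pull in the whole group and follow each of its values onward.
            for row in store:
                if row[:2] == (collection, group):
                    result.append(row)
                    frontier.append((row[2], row[3]))
    return result


found = related_items(STORE, "order number", "7001")
for row in found:
    print(row)
```

Starting from an order number, the search reaches the inventory record through the shared part number, although no relationship between the two collections was declared when either was stored.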
According to the present invention, the universe of a particular scope of information to be modeled in a computer can be represented as a collection consisting of any number of distinct scope segments. Each of these scope segments (referred to later as instance segments) can be modeled separately as individual sets of information.
By formulating such individual scope segment models and their corresponding sets of information in a particular and neutral form, individual segments from the same universal scope of information are automatically and dynamically linked. Through their dynamic linking these independent scope segment models and their corresponding sets of information function as the equivalent of a single model and set of information for the universe of a particular scope of information. The neutral expression of the information also provides an effective medium for the transfer of information.
The method of the present invention that accomplishes these advantages comprises three components: (1) the organization of information into instances; (2) expression of instances of information in neutral form; and (3) universal typing and neutral form expression of data items in an instance together with the dictionary information on all properties of those data items for their transfer.
An individual instance is characterized as one segment of the universe of that instance. Individual segments of the same universe of an instance are intrinsically and dynamically related. A computer environment is characterized as a universe of instances, typically spanning more than one instance universe. Certain portions of instances from different instance universes are intrinsically dynamically linked if they share certain information in common. The neutral form expression of instance information is employed to store individual data items in a set of information in a manner which at once enables both the complete isolation and therefore parallel processing of individual instance data sets and the recognition and exploitation of intrinsic dynamic links between different instance data sets for the retrieval and transfer of targeted subsets of diverse sets of information stored in the same computer environment.
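As a rough illustration of isolation combined with later linking, the Python sketch below processes each instance data set independently and only afterwards looks for shared information. The data set names, the sample values, and the simplification that a link exists whenever two data sets share any value are assumptions, not details taken from the method itself. Threads are used only to show that no data set depends on another.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical instance data sets, each a list of (reference, value) items.
INSTANCE_DATA_SETS = {
    "pump-101": [("tag number", "P-101"), ("rated flow", "40")],
    "pump-102": [("tag number", "P-102"), ("rated flow", "55")],
    "motor-7": [("drives", "P-101"), ("power", "15")],
}


def index_data_set(entry):
    """Index one instance data set in isolation; no other set is consulted."""
    name, items = entry
    return name, {value for _, value in items}


def shared_values(indexes):
    """After independent processing, report values that two data sets share."""
    links = []
    names = sorted(indexes)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            common = indexes[first] & indexes[second]
            if common:
                links.append((first, second, sorted(common)))
    return links


# Each data set is handled by its own task because none depends on another.
with ThreadPoolExecutor() as pool:
    indexes = dict(pool.map(index_data_set, INSTANCE_DATA_SETS.items()))

links = shared_values(indexes)
print(links)
```

Matching on values alone is deliberately crude. A fuller sketch would compare the structural tag as well as the value before treating two items as linked.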
The neutral form is achieved by assigning to each data item in an instance data set a generalized structural tag comprised of three elements: (1) a data type; (2) a data reference; and (3) a data organization. The meanings and properties of each of these elements for a particular data item are stored in a dictionary system whose frame of reference is the computer environment within which the data item in neutral form is stored.
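A minimal Python sketch of such a tagged data item follows. The three tag elements come from the description above, while the class names, the codes such as `T1`, and the dictionary contents are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuralTag:
    data_type: str          # code looked up in the data type dictionary
    data_reference: str     # code looked up in the data reference dictionary
    data_organization: str  # code looked up in the data organization dictionary


@dataclass(frozen=True)
class NeutralItem:
    tag: StructuralTag
    value: str


# Hypothetical dictionary system for one computer environment.
DICTIONARY = {
    "type": {"T1": "decimal number"},
    "reference": {"R1": "pump discharge pressure, kPa"},
    "organization": {"O1": "single value per instance"},
}


def describe(item, dictionary):
    """Resolve each element of the structural tag to its dictionary meaning."""
    return {
        "value": item.value,
        "type": dictionary["type"][item.tag.data_type],
        "reference": dictionary["reference"][item.tag.data_reference],
        "organization": dictionary["organization"][item.tag.data_organization],
    }


item = NeutralItem(StructuralTag("T1", "R1", "O1"), "415.2")
print(describe(item, DICTIONARY))
```

Because the item carries only codes, the meanings can be revised in the dictionary without rewriting stored items, which is one plausible reading of why the dictionary is scoped to the environment.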
All structural tags employed within a specific computer environment are defined to be internally consistent and unambiguous. The properties of the structural tag and the discrete data values stored in neutral form are used to target and retrieve instance data sets from an environment. Universal data typing is defined to enable simultaneous storage of the data items from both retrieved instance data sets and from the associated records in the environment's dictionary system.
Using this universal data typing, dictionary data items employed in the structural tags of retrieved data items are themselves expressed in neutral form. The retrieved instance data items and their associated dictionary data items, now expressed in the same neutral form, are combined into a single file for transfer. This transfer file is completely self-contained, including not only the data items of interest, but all of the information required to understand and to interpret each such data item.
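One way to picture such a transfer file is the Python sketch below. The record layout, the codes, the reserved `DICT_TAG`, and the use of one JSON record per line are all assumptions made for illustration. The description above does not prescribe a serialization.

```python
import json

# Hypothetical dictionary records of one environment, keyed by tag code.
DICTIONARY = {
    "T1": "decimal number",
    "T2": "text",
    "R1": "pump discharge pressure, kPa",
    "R2": "pump serial number",
    "O1": "single value per instance",
}

# Hypothetical universal tag reserved for dictionary records themselves.
DICT_TAG = ("T-DICT", "R-DICT", "O-DICT")


def make_record(tag, value):
    """Express any item, data or dictionary, in the same record shape."""
    data_type, data_reference, data_organization = tag
    return {
        "type": data_type,
        "reference": data_reference,
        "organization": data_organization,
        "value": value,
    }


def build_transfer(items, dictionary):
    """Return retrieved items plus only the dictionary entries their tags use."""
    records = []
    needed = []
    for tag, value in items:
        records.append(make_record(tag, value))
        for code in tag:
            if code not in needed:
                needed.append(code)
    for code in needed:
        entry = {"code": code, "meaning": dictionary[code]}
        records.append(make_record(DICT_TAG, entry))
    return records


retrieved = [(("T1", "R1", "O1"), "415.2")]
transfer = build_transfer(retrieved, DICTIONARY)
transfer_text = "\n".join(json.dumps(record) for record in transfer)
print(transfer_text)
```

Only the dictionary entries that the retrieved item actually references travel with it, so a recipient can interpret the item without access to the sender's full dictionary.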
The method of the present invention not only overcomes the limitations of the prior art but it also introduces significant capabilities heretofore not available. It establishes a method for organizing and storing sets of information which achieves:
relationships between an unlimited number of different and individually modeled instance segments of the same universe of an instance, as well as between certain instance segments of different instance universes, without a-priori knowledge of their existence, their structure or their content and without remodeling or reloading of the data for individual instance segments;
practical form for expression of individual instance data items in the instance data sets that comprise an instance segment which (a) contains all of the information required to understand and interpret each data item; and (b) yet is independent both of the source and the ultimate application of the information;
independence of the individual instance data sets of an instance segment which enables complete parallel processing of information for storage as well as for retrieval and transfer;
practical non-hierarchical and non-integrated format for the structuring of instance data sets that comprise an instance segment which (a) is independent of the size and complexity of a set of information; (b) yet accounts for all relevant relationships among data items comprising each instance data set; and (c) allows dynamic versioning without loss of connection to information based on a previous version of data structure;
reconciliation of independently defined instance structures and instance data sets, enabling the direct co-mingling and integration of unlimited numbers of different sets of information;
reconciliation of different legacy instance data set structures and data values with reduced needs for manual cleansing and reorganization;
parallel processor organization and storage of individual instance data sets across diverse sets of information encompassing different instance universes and different instance segments of any particular instance universe;
parallel processor retrieval of all or any portion of an unlimited number of different but related sets of information without any requirement for a-priori knowledge of the formats, structures, or contents or relationships among the various sets of information;
transfer of any retrieved information complete with all the properties and relationships required to understand and interpret the information without any requirement for a-priori knowledge of its format, structure, or content.