Computerized relational databases are used to form information systems which model real world issues and are composed of objects, the relationships (i.e., facts) between those objects, and the constraints and rules which govern these relationships and objects. Objects are physical or logical entities, capable of being uniquely identified. In this respect, objects are said to be essentially noun-like. Facts define the manner in which objects interact with one another, and are essentially verbs or are verb-like. Constraints modify or constrain the inter-relationships between objects and facts, and as such are analogous to adverbs and pronouns. As the use of information systems increases and the design of such systems advances, so increases the complexity of the real world issues they are expected to accurately model.
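By way of illustration only, the three kinds of building block described above can be sketched as small data structures. The class names (ObjectType, FactType, UniquenessConstraint) and the sample data are assumptions made for this sketch; they are not drawn from any particular tool set or from FORML itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectType:
    """A noun-like entity that can be uniquely identified, e.g. Person."""
    name: str


@dataclass(frozen=True)
class FactType:
    """A verb-like relationship between two object types."""
    subject: ObjectType
    predicate: str
    obj: ObjectType

    def reading(self) -> str:
        # Render the fact type as a natural language-like sentence.
        return f"{self.subject.name} {self.predicate} {self.obj.name}"


@dataclass(frozen=True)
class UniquenessConstraint:
    """Hypothetical constraint: each subject takes part in at most one fact."""
    fact_type: FactType

    def holds(self, facts) -> bool:
        # facts is a list of (subject value, object value) pairs.
        subjects = [subject for subject, _ in facts]
        return len(subjects) == len(set(subjects))


person = ObjectType("Person")
address = ObjectType("Address")
lives_at = FactType(person, "lives at", address)
one_address_each = UniquenessConstraint(lives_at)

print(lives_at.reading())
print(one_address_each.holds([("Smith", "1 Main St"), ("Jones", "2 Elm St")]))
print(one_address_each.holds([("Smith", "1 Main St"), ("Smith", "9 Oak Ave")]))
```

In this sketch the constraint is kept separate from the fact type it restricts, mirroring the distinction drawn above between facts and the rules that govern them.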
In creating an information system, a user needs to accurately transform the real world model, also known as the external view of the data, to its actual physical implementation, using a particular database language on a particular computer system. This implementation is also called the physical view. In order to realize the power inherent in relational databases, it must be made possible for someone with no computing background or education to design and implement information management systems and query meaningful data from them without having to learn a specific computer language.
The physical view of an information system is expressed in one of a number of database design languages. Examples of database design languages well known to those skilled in the art include Structured Query Language (SQL) and Microsoft Access. These database design languages are well adapted to carry out the storage and subsequent retrieval of data stored therein, but the languages themselves are both unnatural and highly technology specific. This means that database design languages are not typically used or understood by the end users of the information systems the languages model. The use of these design languages is a largely intuitive process practiced by database analysts who are familiar with the internal complexities of such languages.
The transformation of an information system from its external view to its physical view is time consuming and, although formalized, remains something of an art form. In order to assist database analysts in modeling data for information system design, several Computer Aided Software Engineering (CASE) tool sets have been developed, and are well known to those skilled in the art.
Prior art CASE tool sets were generally based upon entity-relationship modeling (ER). ER models, while providing a useful means of summarizing the main features of an application, are typically incapable of expressing many constraints and derivation rules that commonly occur in that application. An overview of ER-based tools may be found in Ovum (1992) and Reiner (1992). A state-of-the-art example is discussed in Czejdo et al. (IEEE Computer, March 1990, pp. 26-37).
In order to capture much more of the detail of an application, object-role modelling (ORM), also known as fact-oriented modeling, was developed. Well known prior art versions of ORM include Natural-Language Information Analysis Method (NIAM), Binary-Relationship Modelling (BRM), Natural Object Role Modelling (NORM), and Predicator Set Model (PSM). One version of ORM, Formal Object Role Modelling (FORM), is based on extensions to NIAM and has an associated language (FORML) with both graphical and textual forms (Halpin and Orlowska, 1992). FORM and FORML were developed in part by one of the inventors of the present invention.
The use of symbol-driven CASE tool sets provides a powerful instrument for conceptualizing the model of a given information system, but their use is not intuitively obvious to the untrained user. For such a user, being able to model information systems using a language with which the user is already facile is a more powerful approach. FORML provides the user with a natural language-like command set, and is thus readily learned.
Several CASE tool sets for object-role modeling exist. Among those known by persons skilled in the art are RIDL (Detroyer et al., 1988; Detroyer 1989; Nienhuys-Cheng 1990), GIST (Shoval et al., 1988) and IAST (Control Data, 1982). RIDL is currently marketed by Intellibase. These ORM-based CASE tool sets generally conform only to a binary-only version of ORM, although RIDL has recently added support for fact types of higher arity. In general, these systems are based upon the explicit "drawing" of symbols on a diagram. Users of these tool sets typically specify their information systems by placing symbols directly on diagrams. In the typical CASE tool set, a different tool is used for each type of symbol used. The emphasis in these tool sets is on the notation of the symbols and what they mean, not the underlying semantics of the language upon which the notation rests.
An "optimal normal form" method for mapping from ORM to normalized relational tables was introduced in NIAM in the 1970's. This method ignored certain cases and provided a very incomplete specification of the methodology for constraint mapping. A significant extension of NIAM, capable of completely mapping any conceptual schema expressed in the graphic version of FORML to a redundancy-free, relational schema, was introduced as RMAP (Relational Mapping, Ritson and Halpin, 1992). RMAP differs from other mapping methods, such as RIDL-M, by enabling a wider variety of constraints; e.g., n-ary subset, equality, exclusion, closure and ring constraints.
Database professionals using ORM-based CASE tool sets are markedly more productive than similar workers without them. A tool set which contains a mapping schema such as RMAP is even more powerful, and results in further productivity. FORML-based tool sets which implement RMAP represent the current state of the art with respect to ORM-based tool sets. Given FORML's graphical and textual language forms, the potential exists to combine the power, flexibility and precision of ORM-based CASE tool sets with the ease and rapidity of use of graphical user interfaces common in modern computer systems. This will have the effect not only of further increasing the productivity of CASE tool sets in the hands of computer professionals, but will place these powerful software engineering tools in the hands of heretofore naive users as well.
While prior art natural language CASE tools do fulfill some of the promise of their basic concept, they lack the power of the symbol driven systems to model complex databases with facility. Until the present invention, there existed no CASE tool set for database design which combined the power, flexibility and accuracy of ORM using natural language-like constructs with a graphical user interface to translate the natural language-like constructs into ORM symbology and automatically map the conceptual schema so formed into a relational schema for implementation on a number of SQL-like database languages. The present invention effects a six-fold reduction in the number of user operations necessary to draw symbols on ORM-based diagrams by allowing users to type information in an approximately natural language. Users can think about the semantics of information and not waste time laboring on symbol drawing, which dampens the semantic thought process.
In addition to the ER and ORM-based prior art tool sets previously discussed, there have been efforts by other workers to automate the process of database specification using different methodologies. Some of the more pertinent attempts are described below.
U.S. Pat. No. 4,688,196 to Thompson et al. teaches a natural language interface generating system which allows a naive user to create and query a database based on a system of menu-driven interfaces. As the user addresses command words, in a natural language, to the interface generating system, it provides a menu of words which could legally follow each word as it is input. The menu is provided by referencing pre-defined, resident files. Thompson calls these files grammars and lexicons. The commands input by the user are translated by the system, which then provides an automatic interactive system to generate the required interface in the following manner. After the database is loaded in, the interface generating system poses a series of questions to the user's technical expert. In response to these questions, the user or his expert must identify which tables in the database are to be used; which attributes of particular tables are key attributes; what the various connections are between the various tables in the database; and what natural language connecting phrases will describe those relations.
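The general idea of offering a menu of words which may legally follow the input so far can be sketched as below. This is a minimal illustration only and does not reproduce the Thompson system: the GRAMMAR table, its vocabulary and the function name next_word_menu are all hypothetical, and the table stands in for the pre-defined, resident grammar and lexicon files.

```python
# Hypothetical resident grammar: each word maps to the words allowed to follow it.
GRAMMAR = {
    "<start>": ["find", "count"],
    "find": ["persons", "addresses"],
    "count": ["persons", "addresses"],
    "persons": ["whose"],
    "addresses": ["whose"],
    "whose": ["name", "city"],
    "name": ["is"],
    "city": ["is"],
}


def next_word_menu(words_so_far):
    """Return the menu of words that may legally follow the input so far."""
    state = "<start>"
    for word in words_so_far:
        # Reject any word that the grammar does not allow at this point.
        if word not in GRAMMAR.get(state, []):
            raise ValueError(f"{word!r} cannot follow {state!r}")
        state = word
    return GRAMMAR.get(state, [])


print(next_word_menu([]))
print(next_word_menu(["find", "persons"]))
```

Every word the user may enter must already appear in the resident table, which is the dependence on pre-defined files noted above.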
U.S. Pat. No. 4,939,689 to Davis et al. teaches a system for the creation of database structures and subsequent querying of those structures by use of a text driven outliner system. The Davis system uses another form of resident dictionary table, which is again previously defined. In Davis, the user inputs a textual outline which defines the format of the database. This outline is then used to create data entry screens to facilitate data entry.
After creating database information systems (and assuming the data to populate those systems has been input), the information system must be accurately queried. Efforts by others skilled in the present art teach two broad strategies to enable the naive user to form queries.
The first prior art solution to the query generation problem is through the use of natural language parsers. This methodology takes a query which is input in a desired natural language such as English or Japanese, and parses the query into its component parts. Each component of the query is then used to form the translation of the original natural language query into a database language query. Until the present invention, this was typically accomplished by some form of resident database or dictionary file which translated the parsed command words and phrases into their respective equivalents in the database design language.
European Patent Application EP 0522591A2, filed 10 Jul., 1992 by Takanashi et al., teaches a system typical of this "parse and look up" strategy, whereby a natural language query is entered and parsed into its constituent parts. The parser uses both a resident grammar table and a resident terminology dictionary to translate the meaning of individual command words and phrases into the database design language. The difficulty with fully implementing this solution is the richness and power (i.e., the size and variable structure) of most natural languages. Each possible word and many phrases must have a corresponding entry in the resident tables to make the system truly useful. If this is not done, the power of the natural language interface is substantially weakened in that a command will not be understood by the system.
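A "parse and look up" translator of this general kind can be sketched as follows. The sketch is an assumption-laden illustration, not the Takanashi system: the TERMINOLOGY dictionary, the table and column names (person, age, address) and the function translate are hypothetical.

```python
# Hypothetical resident terminology dictionary:
# natural-language phrase -> (column, table) in the physical schema.
TERMINOLOGY = {
    "how old is": ("age", "person"),
    "where lives": ("address", "person"),
}


def translate(query):
    """Translate a natural-language query into parameterized SQL via lookup."""
    original_words = query.rstrip("?").split()
    words = [word.lower() for word in original_words]
    # Try the longest leading phrase first, then progressively shorter ones.
    for cut in range(len(words) - 1, 0, -1):
        phrase = " ".join(words[:cut])
        if phrase in TERMINOLOGY:
            column, table = TERMINOLOGY[phrase]
            argument = " ".join(original_words[cut:])
            return f"SELECT {column} FROM {table} WHERE name = ?", (argument,)
    # No resident entry matches: the command is not understood.
    raise LookupError(f"no dictionary entry for: {query!r}")


print(translate("How old is Smith?"))

try:
    translate("What is the age of Smith?")
except LookupError as error:
    print(error)
```

The second query asks the same question as the first, yet fails because its phrasing has no entry in the resident dictionary. This illustrates why each possible word and phrase needs a corresponding entry.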
The cost, both monetary and in computer overhead, of creating and maintaining a large, full-time resident natural language interface to any substantial information system is prohibitive. Furthermore, end users are still required to know the types of questions and keywords the parser and resident dictionary files will understand. This is because the resident table methodology does not fully account for the relationships between data objects and the constraints on those objects. For example, if a user wants to know Mr. Smith's age, it is not sufficient to ask "How old is Smith?" since Smith might be a person or the Smith Tower. Instead the user must type "How old is the person called Smith?". As a result, the learning curve for using natural language parsers is still extremely high.
The second solution to the query generation problem in the prior art is through the use of query tools. Query tools are based on the physical structures of the database and not the information contained therein. Information can be broadly categorized as a set of interacting conceptual objects, i.e. things you want to store--e.g., Person, Address, etc. Facts are relationships between objects--e.g. a Person lives at an address. When information is stored in a database, it is represented as a set of physical structures, e.g. tables. Absent considerable database expertise on the part of an end user, the physical representation of the data is invariably unintelligible to him or her. To enable, therefore, such a naive user to query data based on the physical structure it is stored in will require a significant training effort to ensure understanding of these physical structures.
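The gap between a conceptual fact and its physical representation can be illustrated with a small in-memory SQLite database. The table and column names (psn_adr, psn_nm, adr_txt) are hypothetical and were chosen to resemble the terse identifiers often found in physical schemas.

```python
import sqlite3

# Conceptual fact: "Person Smith lives at Address 1 Main St".
# Physical structure: a row in a table with abbreviated, hypothetical names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE psn_adr (psn_nm TEXT PRIMARY KEY, adr_txt TEXT)")
conn.execute("INSERT INTO psn_adr VALUES (?, ?)", ("Smith", "1 Main St"))

# To ask "Where does Smith live?" the user must already know the table
# name, both column names, and which column identifies the person.
row = conn.execute(
    "SELECT adr_txt FROM psn_adr WHERE psn_nm = ?", ("Smith",)
).fetchone()
print(row[0])
```

Nothing in the physical structure states that the row records where a person lives; that meaning exists only at the conceptual level.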
In formulating a query using either a natural language parser or a physical structure query tool, one final issue remains. The user can never be sure that the query which is ultimately formed by either process is actually phrased correctly. When querying physical structures, absent significant training, the naive user doesn't understand the manner in which the data was stored. When using a natural language parser, the same problem arises due to the ambiguity inherent in that natural language. If, for instance, a user asked "How old is Smith?", and the computer answers "55", the answer may be for the person Smith, or the Smith Tower. This is reminiscent of the experience of a reporter who telegraphed Cary Grant's agent, asking about Mr. Grant's age. The reporter, sensitive to the cost per word of sending a telegraph, queried "HOW OLD CARY GRANT?". The actor, when the telegraph was inadvertently delivered to him, replied, again by telegraph, "OLD CARY GRANT JUST FINE". Clearly, unless the syntax of the query is correct, a naive user may retrieve an uncertain answer or an answer to an unintended query.
A common design feature of prior art CASE tools as previously discussed is the use of a pre-defined table or tables both to effect the translation of natural language inputs and to specify the exact nature of the data objects, facts and constraints as well as the interrelationships therebetween. As discussed, this methodology is costly, inefficient and not fully effective.
A further design feature of CASE tools currently in use for information system specification is their use of symbols instead of a natural language. A symbology-driven CASE tool set is at once imprecise and cumbersome, requiring several steps to perform the transformation from a chart of symbols to a database specification in a computer language.
There is therefore a need for apparatus that allows users to specify and create an information system using natural language or natural language-like commands, which will precisely specify the system's objects, facts and constraints without ambiguity or excessive overhead. This means should be capable of graphical depiction to define the interrelationships among the data elements in an unambiguous manner. The information used to create the system should be useable to define both the structure of the database itself as well as subsequent queries to that database once it is completed. There is another need for a means for a naive user to be able to specify these queries to the system, again using natural language-like commands which are not bound by previously entered definitions in a translation table. There is yet another need for a means for ensuring that any query which is created for the purpose of accessing the information system will, precisely and again without ambiguity, convey the user's intended question and return a correct, unambiguous answer.