One of the main purposes of computer systems is to manage information. This management of information is performed internally by data management systems. Generally, data management systems may be divided into two categories: 1) Database management systems; and 2) Text search and retrieval systems.
The first type of data management system imports data into internal representations and manipulates those representations so that the data may be located and modified. When required, these systems generate a suitable representation of the data which can be read by humans or used by another system. This category of data management system includes hierarchical, network, relational and object-oriented database management systems, and knowledge based management systems.
Within hierarchical, network and relational databases, information about an entity (a transaction, a stock item, a person, a company, an address etc.) is usually referred to as a "record" (although sometimes a record may contain information about many entities). Within each record the various "attributes" of the entity are usually classified into "fields".
Within object-oriented database management systems and knowledge based management systems these basic units may have other names such as "object", and the information regarding the object may have names such as "slot" or "member". Each of the attribute fields/slots has a format which can be, for example, integer, real number, boolean, character, etc.; other fields/slots are themselves records/objects. Some fields/slots have specific formats (e.g., date, time), while yet others are free-format text.
Once the database has been constructed, it may be used to perform the following operations:
Add a record/object
Locate and change a record/object
Locate and delete a record/object
Retrieve information
PA1 Cross checking and validating the data PA1 Integrating the data with database systems PA1 Sorting and classifying the text data PA1 an attribute type identifier (for the classification of an attribute of the free-format data which is associated with that component node); PA1 a pointer to the beginning of a sub-string within the text object's text string (i.e. beginning of the element associated with the component node). PA1 an integer containing the character length of the element sub-string (of the data). PA1 zero, one or more other component nodes (nested within this component node or otherwise associated with the component nodes so that the other component nodes can be accessed via the component node) preferably stored as an array; PA1 a matching weight (to indicate the relative importance of this element when performing comparisons with other text objects); PA1 a boolean variable indicating whether this attribute type identifier is a low level matching element; and PA1 depending on time/space considerations, one or more values to assist in the matching process. (See section on "text string operations" below for more details.) PA1 a parsing priority value (giving a notional "priority" to the elements of the free-format data associated with the component node so that a priority may be allocated and used in the determination of the best interpretation of free-format text when ambiguities exist). PA1 "76 Box Rd Townsville QLD" PA1 "PO Box 92 Geelong VIC" PA1 "39 Main St Box Hill NSW" PA1 "8 Box Ave Devonport TAS" PA1 "76 Box Rd Townsville QLD" PA1 "110 Box St Parramatta NSW"
These operations will be referred to as "normal database operations".
Storing information about an entity in fields/slots is suitable for many types of data. There are, however, some types of data which do not have a suitable standard structure. The best example of data which does not have a standard structure is "address" data. As most databases store people's address information in one, two or three free-format fields, performing normal database operations on individual attributes of the address is very difficult. Note that the term "attribute" is used in this specification to refer to a property of an "element" of data.
For example, the free-format data "12 Pitt Street, NORTH SYDNEY" has a number of "elements". Each element has an associated "attribute". An attribute of the element "NORTH" is that it is a "geographical indicator". An attribute of the element "12" is that it is a "number". Note that the "low level" elements correspond to the "tokens" of the data, i.e., the element "NORTH" is a token of the data. The data also includes higher level elements, however; e.g., "NORTH SYDNEY" is an element which includes two tokens, and this element has the attribute of being a "town". An attribute of the entire data "12 Pitt Street, NORTH SYDNEY", i.e., the total "element", is that it is an "address". An alternative term for element is "component".
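The nesting of elements and attributes described above can be sketched as a simple recursive data structure. This is only an illustration; the class name and the attribute labels not stated in the text (such as "street name" and "street type") are assumptions, not part of any particular system:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Element:
    """A piece of free-format text together with its classified attribute."""
    text: str
    attribute: str
    children: List["Element"] = field(default_factory=list)

# Low-level elements correspond to the tokens of the data;
# higher-level elements group one or more tokens.
town = Element("NORTH SYDNEY", "town", children=[
    Element("NORTH", "geographical indicator"),
    Element("SYDNEY", "town name"),
])
address = Element("12 Pitt Street, NORTH SYDNEY", "address", children=[
    Element("12", "number"),
    Element("Pitt", "street name"),
    Element("Street", "street type"),
    town,
])

print(address.attribute)  # prints: address
```

The total element ("address") is itself just another element, so the same operations can, in principle, be applied at any level of the hierarchy.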
Providing each element of this free-format data with its own field for the associated attribute would increase the size and complexity of the database quite significantly, even for this simple example of addresses. Where a database includes information on people together with their addresses, for example, in order to avoid complexity, and particularly with older databases, address data may be stored in a single field labelled "address". This field contains the address in free-format form, and it is therefore not possible with present database technology to perform normal database operations on individual elements of the address: those elements cannot be accessed separately (although the total combination of elements which makes up the address can, of course, be accessed as a whole, as "address").
This problem is to some extent addressed by the practice of database scrubbing/cleansing. This field of commercial endeavour applies parsing processes to free-format text with the objective of creating new database fields for the attributes of the free-format text and entering completely standardised data into those fields. This standardising of data includes converting all spelling variations into one consistent set (e.g., "Street" → "St"). The above example would produce the following:
House Number | Street Name | Street Type | City
12           | Pitt        | St          | North Sydney
The new database fields are then used to perform normal database operations. An entire industry is devoted to this field, applying large, complex and expensive software packages to take information stored in databases, analyse and process the information to produce new databases including more fields for the attributes of the information records, thus providing more flexibility for operations which can be applied to the records.
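A cleansing pass of the kind just described can be sketched as follows. This is a deliberately tiny parser, assuming a very restricted "number street-name street-type, city" format; the function and field names are illustrative, and real cleansing packages are far more elaborate:

```python
import re

# Convert spelling variations into one consistent set, e.g. "Street" -> "St".
STREET_TYPES = {"street": "St", "st": "St", "road": "Rd", "rd": "Rd",
                "avenue": "Ave", "ave": "Ave"}

def cleanse(address: str) -> dict:
    """Split a free-format address into separate, standardised database fields."""
    # Expected shape: "<number> <street name> <street type>, <city>"
    m = re.match(r"\s*(\d+)\s+(.+?)\s+(\w+)\s*,\s*(.+)", address)
    if not m:
        raise ValueError(f"unrecognised address format: {address!r}")
    number, name, st_type, city = m.groups()
    return {
        "house_number": number,
        "street_name": name,
        "street_type": STREET_TYPES.get(st_type.lower(), st_type),
        "city": city.title(),
    }

print(cleanse("12 Pitt Street, NORTH SYDNEY"))
```

Once each attribute has been placed in its own field, normal database operations (sorting, matching, validation) can be applied to the individual fields; this is exactly the step that creates the field-per-attribute structure discussed below.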
Much has been written about the field of database cleansing/scrubbing (see, e.g., "Dealing with Dirty Data", DBMS Magazine, September 1996). The process is expensive: a complete cleansing operation for a large database can cost millions of dollars, as it is so time consuming and the software packages that have been developed to cleanse databases are very complex. And it is still limited by the fundamental requirement that, to perform database operations on an element, the element must have a field to itself.
This brings us to the second major problem which afflicts present methods of storing computerised information in commercial databases. Practically all commercial data is stored in hierarchical or relational databases, or in flat data files, whose structure is fixed at the time of design; but information by its very nature is complex and can have an almost infinite number of different attributes. To create a database containing fields for each and every attribute of each and every type of information is simply not practical, if not totally impossible; the cost of any attempt to produce a database containing fields for all the types of information available to humanity would be prohibitive.
Even a fairly trivial (although very important) example illustrates the scale of the problem. Consider international addresses, i.e., addresses the world over. Although four or five free-format fields can contain any address, a database table which has a data field for every possible attribute of all international addresses would contain hundreds, if not thousands, of data fields. England has counties, the USA and Australia have states, Japan has districts and different orderings of address elements, and so on.
The field of database cleansing/scrubbing is therefore a partial solution at best. It still requires the same fundamental database structure of a field for each data attribute. One can build more and more complex databases, but this problem can never be completely resolved, and it limits the computerised handling of information significantly.
Natural language processing systems are known that employ "Semantic Grammars" to encode semantic information into a syntactic grammar. These systems are mainly used to provide a natural language interface to other systems such as a database management system. The following description comes from a book by Patterson, D. W., "Artificial Intelligence and Expert Systems":
". . . They use context-free rewrite rules with non-terminal semantic constituents. The constituents are categories or metasymbols such as attribute, object, present (as in display or print), and ship, rather than NP (Noun Phrase), VP (Verb Phrase), N (Noun), V (Verb), and so on. . . . Semantic grammars have proven to be successful in limited applications including LIFER, a data base query system distributed by the (US) Navy . . . and a tutorial system named SOPHIE which is used to teach the debugging of circuit faults.
Rewrite rules in these systems essentially take the forms
S → What is <OUTPUT-PROPERTY> of <CIRCUIT-PART>?
OUTPUT-PROPERTY → the <OUTPUT-PROP>
OUTPUT-PROPERTY → <OUTPUT-PROP>
CIRCUIT-PART → C23
CIRCUIT-PART → D12
OUTPUT-PROP → voltage
OUTPUT-PROP → current
In the LIFER system, there are rules to handle numerous forms of wh-queries such as
What is the name of the carrier nearest to New York?
Who commands the Kennedy?
etc.
These sentences are analysed and words matched to metasymbols contained in lexicon entries. For example, the input statement `Print the length of the Enterprise` would fit with the LIFER top grammar rule (LTG) of the form
<LTG> → <PRESENT> the <ATTRIBUTE> of <SHIP>
where print matches <PRESENT>, length matches <ATTRIBUTE>, and the Enterprise matches <SHIP>. Other typical lexicon entries that can match <ATTRIBUTE> include CLASS, COMMANDER, FUEL, TYPE, BEAM, LENGTH, and so on."
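The kind of semantic-grammar matching used by such systems can be sketched very simply. The rule shape and the metasymbol names (PRESENT, ATTRIBUTE, SHIP) follow the LIFER example from Patterson's description; the lexicon contents and function name here are assumptions for illustration, not the actual LIFER implementation:

```python
# Minimal sketch of matching a sentence against the semantic-grammar rule
#   <LTG> -> <PRESENT> the <ATTRIBUTE> of <SHIP>
LEXICON = {
    "PRESENT": {"print", "display", "show"},
    "ATTRIBUTE": {"length", "class", "commander", "fuel", "type", "beam"},
    "SHIP": {"the enterprise", "the kennedy"},
}

def match_ltg(sentence: str):
    """Return the metasymbol bindings if the sentence fits the rule, else None."""
    words = sentence.lower().rstrip(".?!").split()
    # Expected shape: PRESENT, "the", ATTRIBUTE, "of", SHIP (two words)
    if len(words) != 6 or words[1] != "the" or words[3] != "of":
        return None
    present, attribute = words[0], words[2]
    ship = " ".join(words[4:6])
    if (present in LEXICON["PRESENT"]
            and attribute in LEXICON["ATTRIBUTE"]
            and ship in LEXICON["SHIP"]):
        return {"PRESENT": present, "ATTRIBUTE": attribute, "SHIP": ship}
    return None

print(match_ltg("Print the length of the Enterprise"))
```

Note that the words are matched against semantic categories (ship, attribute) rather than purely syntactic ones (noun, verb), which is the defining feature of a semantic grammar.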
These types of systems receive information in structured or free-format form and convert it to their own representations.
Although the interface is flexible, the database to which these systems interface has a fixed structure, and the systems are unable to perform changes on the original (human readable) data.
Indeed there are many prior art systems which provide "Natural Language" interfaces to structured databases. All of these systems provide translation from "Natural Language" into some form of structured data and suffer from the same problems described above.
Refer to U.S. Pat. No. 4,787,035, Bourne, D. "META-INTERPRETER" and U.S. Pat. No. 5,454,106, Burns, L., Malhotra, A., "Database retrieval system using natural language for presenting understood components of . . . " for examples of such systems.
As discussed earlier, one type of database management system is the knowledge based management system (KBMS).
These systems employ the concept of attribute "slots" on an object. Slots provide or change information regarding the object, either directly on the stored values or indirectly through procedures. A simple example will illustrate the concept: a "Square" object has two attribute slots, "Length" and "Area". The "Area" slot does not need to store a value because its value can be calculated by squaring the "Length" value.
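The "Square" example above can be sketched as follows; the class is hypothetical and not the API of any specific KBMS, but it shows the distinction between a stored slot and a slot computed through a procedure:

```python
# Sketch of a stored slot vs. a computed slot (hypothetical class,
# not a specific KBMS interface).
class Square:
    def __init__(self, length: float):
        self.length = length      # "Length" slot: value stored directly

    @property
    def area(self) -> float:
        # "Area" slot: no stored value; computed indirectly through
        # a procedure by squaring the "Length" value.
        return self.length ** 2

sq = Square(4.0)
print(sq.area)  # prints: 16.0
```

Changing the "Length" slot automatically changes what the "Area" slot reports, since the latter is derived rather than stored.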
Although these types of systems do not require fixed database structures, they do, however, need to transform the original data into internal data representations, which must then be put through a very processing-intensive "language generation" step to produce information that is understandable by humans. If these types of systems were required to maintain the original data for use by other systems and humans, even a small change would require the whole text string to be regenerated.
The text search and retrieval category of data management system does not import the data but builds searchable indices which point to the original data. This category includes: document storage & retrieval systems; and Internet search engines.
These types of systems have been very successful because they leave the original information in human readable form. This basic principle means that, unlike the prior art database systems described above, the underlying data can very easily be shared with many systems of this type. Another reason for their success is that improvements in technology can be implemented without requiring conversion of the original data. Data conversion is not only extremely expensive, but is also a major source of data errors.
There are, however, major drawbacks in using this type of system to manage data, compared with the database systems described above. The major limitation is that the data cannot be manipulated: it cannot be modified; it must remain as it is. Other database functions which are very difficult to perform include:
Cross checking and validating the data
Integrating the data with database systems
Sorting and classifying the text data
From these limitations, we can see that this category of data management system is suited to unstructured data which does not need to be changed.
In text search and retrieval systems, it is known to process a document collection to identify specific attributes of each document, such as its "subject" topic. The types of documents which have been processed by this type of system include books, newspapers, reports, manuals and e-mail messages.
Most of these types of systems, however, only look for individual words to match and do not consider words in context. Some others identify words that are nouns but do not classify the type of noun. Both are unsuitable for data such as address data, which contains a large proportion of proper nouns.
Further, the original data cannot be changed within context.
For more information regarding this area, refer to the works published by Gerald Salton.
Note that the term "text object" as used in the following description should not be confused with the term "text object" as used elsewhere to describe software techniques which assist in the storage and transfer of pieces of text data between computer systems by encapsulating the text string. Techniques which have used the term "text object" range from the "String" object employed within Apple Computer's operating systems (where the object contains a leading two byte "length" value followed by the text string) to the "Compound String" object employed by the X Window System (where the object encapsulates multiple encodings, language translations and font styles of one piece of information).
In the system to be described, each "component node" of a text object preferably comprises:
an attribute type identifier (for the classification of an attribute of the free-format data which is associated with that component node);
a pointer to the beginning of a sub-string within the text object's text string (i.e., the beginning of the element associated with the component node);
an integer containing the character length of the element sub-string (of the data);
zero, one or more other component nodes (nested within this component node or otherwise associated with the component node so that the other component nodes can be accessed via the component node), preferably stored as an array;
a matching weight (to indicate the relative importance of this element when performing comparisons with other text objects);
a boolean variable indicating whether this attribute type identifier is a low level matching element;
depending on time/space considerations, one or more values to assist in the matching process (see the section on "text string operations" below for more details); and
a parsing priority value (giving a notional "priority" to the elements of the free-format data associated with the component node so that a priority may be allocated and used in the determination of the best interpretation of free-format text when ambiguities exist).
Such ambiguities arise, for example, in free-format strings such as:
"76 Box Rd Townsville QLD"
"PO Box 92 Geelong VIC"
"39 Main St Box Hill NSW"
"8 Box Ave Devonport TAS"
"110 Box St Parramatta NSW"