Previously, there have been two major approaches to accessing data stored in electronic format. The first, known as information retrieval, operates on a strict string search. Accordingly, when a user enters a query in the form of a keyword, the entire database is searched for a string which matches the keyword. Such a system suffers from the drawback that it may well miss relevant entries should the form of the word in the database differ slightly from the form of the keyword. This problem can be overcome by using a stemming technique, in which the keyword is truncated and a global word ending is added. However, this in turn suffers from the drawback that numerous irrelevant records, which merely include similar keywords, may then be located.
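By way of illustration only, the two drawbacks noted above can be sketched in Python. The records, the function names `exact_search` and `truncated_search`, and the choice of a four-character stem are assumptions made for this example and do not describe any particular retrieval system.

```python
import re

# Hypothetical records, invented for illustration only.
RECORDS = [
    "Plumbers available for emergency call-outs",
    "Plumbing and heating supplies",
    "Plum jam and other preserves",
]


def exact_search(keyword, records):
    """Strict string search: the keyword must appear as a whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [r for r in records if pattern.search(r)]


def truncated_search(keyword, stem_length, records):
    """Truncate the keyword, then accept any word ending after the stem."""
    stem = keyword[:stem_length]
    pattern = re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)
    return [r for r in records if pattern.search(r)]


# The strict search misses "Plumbers" because the word form differs.
exact_hits = exact_search("plumber", RECORDS)

# Truncating to "plum" finds the plumbing records, but also the jam.
stem_hits = truncated_search("plumber", 4, RECORDS)
```

With these sample records the strict search finds nothing, whereas the truncated search returns all three records, including the irrelevant entry for plum jam.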
In the second approach, known as knowledge representation, all the information from the database must be precoded using a special knowledge representation language to form a new database. This requires an operator to scan and analyse the data, placing relevant information in different knowledge representation fields. Once this has been completed, users can access the information by entering queries in a knowledge representation language. Such a language uses logic and theorem proving and is therefore not immediately accessible to users without specialised knowledge. In addition, knowledge representation approaches suffer from the drawback that the databases are hard to create initially and, once created, even harder to change.
Both of the above-mentioned techniques are, in any case, unsuitable for use with data stored in a semi-structured format. A semi-structured database is a database in which some of the data is stored in specific fields which denote the type of data, whereas the remainder of the data is simply stored under a general field, such as a free text field.
Databases of this form are generally created either by scanning in hardcopy records having predetermined formats, or by having an operator enter data manually. However, because of the versatility of free text fields, the data entered may vary in content and style. Whilst this reduces restrictions on the data that can be entered, making the database easier to create, it does mean that the different types of data stored cannot be determined by identifying the field in which the data is stored. Examples of cases where data is stored in such a semi-structured format include the Yellow Pages® directory, Exchange and Mart, Loot, and The British National Formulary.
Thus, for example, in the Yellow Pages® directory, the headings of the various sections will be stored in records that are designated as heading fields. Each individual advert (hereinafter referred to as an item) will include a name field and a free text field. A name entry is stored in the name field, whereas a free text entry, such as a description of the company's products or services, an address entry and a telephone number entry will all be stored in the same free text field.
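The layout just described might be modelled, purely as a sketch, along the following lines. The class names `Heading` and `Item`, the attribute names and the sample entry are hypothetical and are not taken from any actual directory.

```python
from dataclasses import dataclass


@dataclass
class Heading:
    text: str  # stored in a record designated as a heading field


@dataclass
class Item:
    name: str       # name field: the type of data is known
    free_text: str  # description, address and telephone number together


# Hypothetical directory content, invented for illustration only.
directory = [
    Heading("Plumbers"),
    Item(
        "A. Smith & Son",
        "Emergency repairs, 1 High Street, Anytown, tel 01234 567890",
    ),
]

item = directory[1]

# Only two fields exist for an item, so the field name alone cannot tell
# the description, the address and the telephone number apart.
fields = list(vars(item).keys())
```

Here `fields` contains only `name` and `free_text`; nothing in the structure marks where the address or the telephone number begins within the free text.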
Accordingly, if information retrieval were applied to the Yellow Pages® directory, a search for a keyword would search through all the headings, company names and free text. As the type of data is not accounted for, a heading may be located as a relevant result, when in fact the items associated with that heading are the results required. On the other hand, searching the database using a knowledge representation technique would require that the database be translated into a separate knowledge representation database, which could then be searched using knowledge representation techniques. The original Yellow Pages® data would then be redundant, although if it were updated a new knowledge representation database would be required.
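The heading problem described above can be illustrated with a minimal sketch. The records, the `(field_type, text)` representation and the function name `field_blind_search` are assumptions made for this example only.

```python
# Hypothetical records as (field_type, text) pairs, invented for illustration.
RECORDS = [
    ("heading", "Plumbers"),
    ("name", "A. Smith & Son"),
    ("free_text", "Emergency repairs, 1 High Street, Anytown"),
    ("heading", "Restaurants"),
    ("name", "The Copper Kettle"),
    ("free_text", "Lunches and teas, 2 Mill Lane, Anytown"),
]


def field_blind_search(keyword, records):
    """Match the keyword against every record, ignoring its field type."""
    needle = keyword.lower()
    return [(kind, text) for kind, text in records if needle in text.lower()]


# Only the heading itself matches; the item listed under it is not returned.
hits = field_blind_search("plumbers", RECORDS)
```

In this example the only result is the heading record, although the entry for A. Smith & Son listed under that heading is what the user actually requires.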