1. Field of the Invention
The present invention relates to methods and/or systems for selecting data sets, which finds particular application in selecting documents for instance from an information base such as that accessible using the Internet.
2. Background of Related Art
The World Wide Web, carried over the Internet, is a known communications system based on a plurality of separate communications networks connected together. It provides a rich source of information from many different providers, but this very richness creates a problem in accessing specific information, as there is no central monitoring and control.
In 1982, the volume of scientific, corporate and technical information was doubling every 5 years. By 1988, it was doubling every 2.2 years and by 1992 every 1.6 years. With the expansion of the Internet and other networks, the rate of growth will continue to rise. Key to the viability of such networks will be the ability to manage the information and provide users with the information they want, when they want it.
The present invention, however, is not concerned with providing another tool for searching systems such as the World Wide Web (W3): there are already many of these. They are being added to frequently, with ever increasing coverage of the Web and sophistication of search engines.
Instead, embodiments of the present invention relate to the following problem: having found useful information on W3, how can it be stored for easy retrieval and how can other users likely to be interested in the information be identified and informed?
More specifically, the applicant's co-pending application PCT/GB96/00132 provides an information retrieval agent, known as a JASPER agent, that is used for identifying and retrieving information from distributed information systems such as the W3.
It uses techniques, such as hierarchical agglomerative clustering, to define relationships between various sources of information existing on W3. However, inaccuracies can arise within these defined relationships. This can result in documents having dissimilar subject matter being clustered together. The nature of the clustering technique is that one inaccurately clustered document can then multiply into several.
According to a first aspect of the present invention there is provided apparatus for determining a measure of similarity between at least a first and a second data set, said apparatus comprising:
i) input means for receiving at least said first and second data sets;
ii) processing means for identifying a set of keywords in at least the first of the data sets, the processing means having access to at least one rule set and identifying the set of keywords by use of said at least one rule set, the processing means further determining said measure of similarity; and
iii) output means to output said measure of similarity,
wherein said rule set includes a rule concerning relative location of data items in a respective data set, and wherein said processing means determines the measure of similarity by comparing at least one set of keywords, identified by said processing means in the first data set, with a set of keywords comprising or derived from said second data set.
Embodiments of the present invention enable two or more keywords within a data set to be associated with each other, for example keywords that form a phrase, with the result that the accuracy in comparison of similarity of data sets may be improved.
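The effect of associating keywords into phrases can be illustrated with a minimal sketch. The keyword sets below are invented for illustration, and the Jaccard overlap used as the measure of similarity is an assumption, since no particular measure is prescribed here.

```python
def jaccard(a, b):
    """Share of keywords common to both sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical keyword sets for two documents on different subjects.
# Treated as isolated words, both documents contain "network".
doc1_words = {"neural", "network", "training"}
doc2_words = {"network", "security", "training"}

# The same documents once adjacent words are associated into phrases.
doc1_phrases = {"neural network", "training"}
doc2_phrases = {"network security", "training"}

word_score = jaccard(doc1_words, doc2_words)
phrase_score = jaccard(doc1_phrases, doc2_phrases)

print(f"single words: {word_score:.2f}")
print(f"with phrases: {phrase_score:.2f}")
```

In this example the shared word "network" no longer counts as common subject matter once it is bound to its neighbour, so the phrase-based score is lower for these two dissimilar documents.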
Preferably, the apparatus further comprises information retrieval means and a data store, said first data set comprising data retrieved from an information base by said information retrieval means and said second data set comprising a set of keywords stored in said data store. For instance, the set of keywords may have been provided by a user, or stored in a user profile.
The rule set may provide means to identify adjacent items in the data set which can be treated together, as a single keyword. This entails not only location information but also, for instance, a grammatical test on adjacent items such as one or more of the following:
1) a noun followed by a noun or a predetermined set of indicia;
2) a verb followed by a noun or a predetermined set of indicia;
3) an adjective followed by a noun or a predetermined set of indicia; and
4) a predetermined set of indicia followed by a noun or a verb or a further predetermined set of indicia.
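The grammatical tests listed above can be sketched as a table of permitted tag pairs applied to adjacent items. This is only an illustration: the tag names, the assumption that items arrive already tagged, and the greedy left-to-right pairing are all hypothetical choices rather than details taken from the description.

```python
# Hypothetical tag names; no particular tag set is specified.
NOUN, VERB, ADJ, INDICIA, OTHER = "N", "V", "ADJ", "IND", "O"

# (tag of first item, tag of second item) pairs that may be treated
# together, mirroring tests 1) to 4) in the list above.
MERGE_RULES = {
    (NOUN, NOUN), (NOUN, INDICIA),
    (VERB, NOUN), (VERB, INDICIA),
    (ADJ, NOUN), (ADJ, INDICIA),
    (INDICIA, NOUN), (INDICIA, VERB), (INDICIA, INDICIA),
}

def merge_adjacent(tagged):
    """Greedily join adjacent items whose tag pair matches a rule."""
    merged = []
    i = 0
    while i < len(tagged):
        word, tag = tagged[i]
        if i + 1 < len(tagged) and (tag, tagged[i + 1][1]) in MERGE_RULES:
            next_word = tagged[i + 1][0]
            merged.append(f"{word} {next_word}")  # treated as one keyword
            i += 2
        else:
            merged.append(word)
            i += 1
    return merged

example = [("information", NOUN), ("retrieval", NOUN),
           ("is", OTHER), ("useful", ADJ)]
print(merge_adjacent(example))
```

Here "information" and "retrieval" satisfy the noun-followed-by-noun test and are joined, while the remaining items are left as they are.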
According to a second aspect of the present invention there is provided a method of determining a level of similarity between first and second data sets, wherein said method comprises the steps of:
i) applying identifying tags to selected data items in at least the first of the data sets, in accordance with at least a first rule;
ii) identifying a set of potential keywords by reference to either the presence or the absence of said identifying tags;
iii) selecting sets of two or more potential keywords which are adjacent by applying at least a second rule;
iv) classifying each selected set of potential keywords as a single keyword;
v) generating a set of keywords which comprises each classified set of potential keywords as a single keyword, together with the remaining keywords from the identified set of potential keywords; and
vi) comparing the generated set of keywords with a set of keywords either comprising or derived from the second data set.
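Steps i) to vi) can be walked through in a short sketch. Everything specific in it is an assumption made for illustration: the toy lexicon standing in for the first rule, the "STOP" tag whose absence marks a potential keyword, the three tag pairs standing in for the second rule, and the Jaccard overlap used for the comparison.

```python
# Hypothetical toy lexicon: word -> identifying tag.
LEXICON = {
    "information": "N", "retrieval": "N", "agent": "N", "documents": "N",
    "finds": "V", "useful": "ADJ",
    "the": "STOP", "a": "STOP", "on": "STOP",
}
# Hypothetical second rule: tag pairs that may form a single keyword.
PAIR_RULES = {("N", "N"), ("V", "N"), ("ADJ", "N")}

def extract_keywords(text):
    words = text.lower().split()
    # Step i: apply identifying tags according to the first rule.
    tags = [LEXICON.get(w) for w in words]
    # Step ii: potential keywords are items without the STOP tag.
    candidate = [t != "STOP" for t in tags]
    keywords = set()
    i = 0
    while i < len(words):
        if not candidate[i]:
            i += 1
            continue
        # Steps iii and iv: adjacent candidates matching the second
        # rule are classified as a single keyword.
        if (i + 1 < len(words) and candidate[i + 1]
                and (tags[i], tags[i + 1]) in PAIR_RULES):
            keywords.add(f"{words[i]} {words[i + 1]}")
            i += 2
        else:
            # Step v: remaining potential keywords are kept as they are.
            keywords.add(words[i])
            i += 1
    return keywords

def similarity(first, second):
    # Step vi: compare the two keyword sets (Jaccard overlap, assumed).
    union = first | second
    return len(first & second) / len(union) if union else 0.0

doc = extract_keywords(
    "The agent finds useful documents on information retrieval")
profile = {"information retrieval", "agent"}  # e.g. from a user profile
print(sorted(doc), similarity(doc, profile))
```

The second keyword set is supplied directly here, in the manner of a stored user profile; it could equally be derived from a second data set by calling `extract_keywords` on it.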
For instance, said first rule may advantageously relate at least in part to the grammatical category of the data items.
Said at least a second rule may comprise one or more rules from the following set:
1) a noun followed by a noun or a predetermined set of indicia;
2) a verb followed by a noun or a predetermined set of indicia;
3) an adjective followed by a noun or a predetermined set of indicia; and
4) a predetermined set of indicia followed by a noun or a verb or a further predetermined set of indicia.
Identifying associated keywords within documents and other forms of information located on W3 and other information bases improves the accuracy of the relationships defined between these documents and other forms of information, compared with prior art systems and methods.