1. Field of the Invention
This invention lies in the field of methods and apparatus for data management and retrieval and finds particular application to the field of methods and apparatus for identifying key data items within data sets.
2. Related Art
Recent advances in technology, such as CD-ROMs, Intranets and the World Wide Web have provided a vast increase in the volume of information resources that are available in electronic format.
A problem associated with these increasing information resources is that of locating and identifying data sets (e.g. magazine articles, news articles, technical disclosures and other information) of interest to the individual user of these systems.
Information retrieval tools such as search engines and Web guides are one means for assisting users to locate data sets of interest. Proactive tools and services (e.g. news groups, broadcast services such as the POINTCAST™ system available on the Internet at www.pointcast.com or tools like the JASPER agent detailed in the applicant's co-pending international patent application PCT GB96/00132 (U.S. application Ser. No. 08/875,091 filed Jul. 22, 1997, now U.S. Pat. No. 5,931,907), the subject matter of which is incorporated herein by reference) may also be used to identify information that may be of interest to individual users.
In order for these information retrieval and management tools to be effective, either a summary or a set of key words is often identified for any data set located by the tool, so that users can form an impression of the subject matter of the data set by reviewing this set of key words or by reviewing the summary.
Summarising tools typically use the key words that occur within a data set as a means of generating a summary. Key words are typically identified by stripping out conjunctive words such as “and” and “with”, and other so-called low value words such as “it”, “are”, “they” etc., none of which tends to be indicative of the subject matter of the data set being investigated by the summarising tool.
Increasingly, key words and key phrases are also being used by information retrieval and management tools as a means of indicating a user's preference for different types of information. Such techniques are known as “profiling” and the profiles can be generated automatically by a tool in response to a user indicating that a data set is of interest, for example by bookmarking a Web page or by downloading data from a Web page.
Advanced profiling tools also use similarity matrices and clustering techniques to identify data sets of relevance to a user's profile. The JASPER tool, referred to above, is an example of such a tool that uses profiling techniques for this purpose.
In the Applicant's co-pending European patent application number EP 97306878.6 (corresponding to U.S. application 09/155,172 filed Sep. 22, 1998), the subject matter of which is incorporated herein by reference, a means of identifying key terms consisting of several consecutive words is disclosed. These key terms are used as well as individual key words within a similarity matrix. This enables terms such as “Information Technology” and “World Wide Web” to be recognised as terms in their own right rather than as two or three separate key words.
However, these techniques for identifying key words and phrases are less than optimal because they eliminate conjunctive words and other low value words in order to identify the key words and phrases of a particular data set. They identify only phrases that consist entirely of high value words, such as “information retrieval”. Yet conjunctive terms often provide a great deal of contextual information.
For example, in the English language, the phrase “bread and butter” has two meanings. The first relates to food and the second relates to a person's livelihood or a person's means of survival. Similarly, in the English language, the term “bread and water” again relates to food and also has a second meaning that is often used to imply hardship.
An information retrieval or management tool that eliminates all conjunctive words during the process of identifying key words and phrases in a block of text would reduce the phrases “bread and butter” and “bread and water” to a list of key words consisting of “bread”, “butter”, “water”. In such a list, the second meanings of hardship and a person's livelihood are lost.
A further problem is that names such as “Bank of England”, “Stratford on Avon” or terms such as “black and white”, “on and off” are reduced to their constituent, higher value words, thus altering the information returned by the tool.
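By way of illustration only, the loss described above can be reproduced with a minimal sketch of a conventional key word extractor. The stop list LOW_VALUE_WORDS and the function name strip_low_value_words are hypothetical and are not taken from any particular prior art tool.

```python
# Hypothetical stop list, chosen only to illustrate the prior art approach.
LOW_VALUE_WORDS = {"and", "with", "it", "are", "they", "of", "on", "the", "a"}


def strip_low_value_words(text):
    """Return the distinct high value words of `text` in order of first appearance."""
    key_words = []
    for word in text.lower().split():
        if word not in LOW_VALUE_WORDS and word not in key_words:
            key_words.append(word)
    return key_words


# Both idiomatic phrases collapse into three unrelated single words.
prior_art_key_words = strip_low_value_words("bread and butter bread and water")
print(prior_art_key_words)
```

The resulting list carries no trace of either phrase, so neither of the second meanings can be recovered from it.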
According to a first aspect of the present invention there is provided an apparatus for managing data sets, having: an input means for receiving data sets as input; means adapted to identify, within a said data set, a first set of words comprising one or more word groups of one or more words, conforming to a predetermined distribution pattern within said data set, wherein said words in said word groups occur consecutively in the data set; means adapted to identify, within said first set, a sub-set of words comprising one or more of said word groups, conforming to a second predetermined distribution pattern within said data set; means adapted to eliminate said sub-set of words from said first set thereby forming a set of key terms of said data set; and output means for outputting at least one said key term.
According to a second aspect of the present invention there is provided a method of managing data sets, including the steps of:
1) receiving a data set as input;
2) identifying a first set of words conforming to a first distribution pattern within said data set, said first set comprising one or more word groups of one or more words, wherein said words in said word groups occur consecutively in the data set;
3) identifying a sub-set of word groups in said first set, said sub-set conforming to a second distribution pattern within said data-set;
4) eliminating said sub-set from said first set thereby identifying a set of key terms;
5) outputting said key terms.
Thus embodiments of the present invention identify, within a received data set, a first set of word groups of one or more words according to a first pattern within the data set and then identify a second pattern of word groups from within the first set. The key terms are those groups of one or more words within the first set that do not conform to the second pattern.
The approach of identifying, within the data set, patterns of word groups, enables key terms to be extracted without first eliminating low value words. This has the advantage that conjunctive words and other low value words can be retained within the data set so that terms such as “on and off”, “bread and water” and “chief of staff” can be identified as key terms in their own right.
This improves the quality of the key terms extracted and also allows key terms of arbitrary length to be identified.
Preferably said first distribution pattern requires that each word group in the first set occurs more than once in said data set and preferably said second distribution pattern requires that each word group in the sub-set comprises a word or a string of words that occurs within a larger word group in the first set.
Thus embodiments of the present invention pick out any repeated words and phrases, and then eliminate any word or phrase already contained in a longer one. For instance, if a document refers to “Internet search engines” more than once, the whole phrase will become a key term but “Internet” and “search engine” on their own would be eliminated, as would “search” and “engine” as single words.
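The preferred distribution patterns may be sketched as follows. This is an illustrative sketch under stated assumptions rather than the claimed implementation: the data set is assumed to be already split into lower-case words, and the names repeated_word_groups, is_contained, extract_key_terms and the optional max_len bound are hypothetical.

```python
from collections import Counter


def repeated_word_groups(words, max_len=None):
    """First pattern: every group of consecutive words occurring more than once."""
    max_len = max_len or len(words)
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return {group: count for group, count in counts.items() if count > 1}


def is_contained(short, longer):
    """True if `short` occurs as a consecutive run inside the larger group `longer`."""
    n = len(short)
    if len(longer) <= n:
        return False
    return any(longer[i:i + n] == short for i in range(len(longer) - n + 1))


def extract_key_terms(words, max_len=None):
    """Eliminate groups matching the second pattern; the remainder are key terms."""
    first_set = repeated_word_groups(words, max_len)
    sub_set = {g for g in first_set
               if any(is_contained(g, other) for other in first_set)}
    return {" ".join(g) for g in first_set if g not in sub_set}


sample = ("internet search engines index pages and "
          "internet search engines rank pages").split()
sample_key_terms = extract_key_terms(sample)
print(sorted(sample_key_terms))
```

In this sample the repeated single words and two-word groups inside “internet search engines” are eliminated, while “pages”, which is repeated but not contained in any larger repeated group, survives as a key term. No stop list is consulted at any point.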
Preferably said first aspect includes means for modifying said word groups, adapted to remove low value words occurring before the first high value word in a word group and adapted to remove low value words occurring after the last high value word in a word group. In the trivial case of a word group composed of a single, low value word, the word group itself will be eliminated.
Preferably said second aspect includes the step of:
6) removing any low value word occurring before the first high value word in a word group and removing any low value word occurring after the last high value word in a word group.
Removing low value words from the beginning and end of word groups improves the quality of the word groups returned by the key term extractor.
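A minimal sketch of this trimming step is given below, assuming a word group is held as a list of words. The stop list TRIM_LOW_VALUE_WORDS and the function name trim_low_value_words are hypothetical. Low value words lying between high value words are deliberately left in place.

```python
# Hypothetical stop list for illustration only.
TRIM_LOW_VALUE_WORDS = {"and", "with", "it", "are", "they", "of", "on", "the", "a"}


def trim_low_value_words(group):
    """Drop low value words before the first and after the last high value word."""
    start, end = 0, len(group)
    while start < end and group[start] in TRIM_LOW_VALUE_WORDS:
        start += 1
    while end > start and group[end - 1] in TRIM_LOW_VALUE_WORDS:
        end -= 1
    return group[start:end]


trimmed = trim_low_value_words(["the", "bank", "of", "england", "and"])
print(trimmed)

# Trivial case: a group made up only of low value words is eliminated entirely.
print(trim_low_value_words(["and"]))
```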
Preferably the first aspect includes means for weighting each said word group in said first set according to how frequently each said word group occurs in said first set and means for modifying said weighting of at least a first word group in proportion to a weighting of a second word group in said sub-set and means for selecting said key terms for output in dependence upon said weightings.
Preferably the second aspect includes the steps of:
9) weighting each word group in said first set according to how frequently each said word group occurs in said first set;
10) modifying said weightings of at least a first word group in said first set in proportion to a weighting of a second word group in said sub-set;
11) selecting said key terms for output in dependence upon said weightings.
Weighting word groups according to their frequency of occurrence provides a mechanism for ordering the identified key terms.
Modifying weightings according to the weighting of terms in the sub-set enables terms eliminated from the first set to influence the weightings of those terms that remain and of which the eliminated terms form sub-strings. In this way, a sub-string that occurs frequently within the data set may have an appropriate influence on the identification of key terms.
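One possible weighting scheme is sketched below. The text above requires only that a remaining word group's weighting be modified in proportion to the weighting of an eliminated sub-string; the additive rule and the proportionality constant share=0.5 used here are assumptions made purely for illustration, as are the function names.

```python
def contains(longer, short):
    """True if `short` occurs as a consecutive run inside the larger group `longer`."""
    n = len(short)
    if len(longer) <= n:
        return False
    return any(longer[i:i + n] == short for i in range(len(longer) - n + 1))


def adjust_weights(first_set, sub_set, share=0.5):
    """Weight surviving groups by frequency, then credit them with a share of
    the weight of each eliminated group they contain (assumed rule)."""
    weights = {g: float(c) for g, c in first_set.items() if g not in sub_set}
    for kept in weights:
        for dropped in sub_set:
            if contains(kept, dropped):
                weights[kept] += share * first_set[dropped]
    return weights


# Hypothetical frequencies: "search" is frequent but is eliminated as a sub-string.
first_set = {("search",): 5, ("internet", "search", "engines"): 2, ("pages",): 3}
sub_set = {("search",)}

weights = adjust_weights(first_set, sub_set)
ranked = sorted(weights, key=weights.get, reverse=True)
print(ranked)
```

With these figures the three-word group would rank below “pages” on raw frequency alone (2 against 3), but the contribution from the frequently occurring sub-string “search” raises its weighting to 4.5 and places it first.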
An assumption is made that those key terms occurring most frequently are most relevant to the information content of the data set.
Preferably the first aspect includes means for modifying any word in any word group, adapted to remove any prefix and adapted to remove any suffix from a word to form a stemmed word.
Preferably the second aspect includes the step of:
7) modifying any word in any said word group by removing a prefix or suffix from the word thereby forming a stemmed word.
The removal of prefixes and suffixes allows each word to be reduced to a neutral form so that weightings independent of prefixes and suffixes can be calculated.
Thus words that are repeated but with different prefixes and/or suffixes are accounted for as repeat occurrences of the same word.
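The effect of stemming on occurrence counts can be illustrated with the following sketch. The suffix list and the function name stem are hypothetical, only suffixes are handled, and the rule is far cruder than a practical stemming algorithm; it is shown only to make the counting behaviour concrete.

```python
from collections import Counter

# Hypothetical suffix list, tried in the order given.
SUFFIXES = ("ing", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


words = ["search", "searches", "searching", "index", "indexes", "indexing"]
stem_counts = Counter(stem(w) for w in words)
print(stem_counts)
```

Without stemming each of the six words would occur once and none would qualify as repeated; after stemming, “search” and “index” each occur three times.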
Preferably the first aspect includes means for storing said prefix or suffix in association with said stemmed word thereby enabling said prefix or suffix to be restored to said stemmed word.
Preferably the second aspect includes the step of:
8) storing said removed prefix or suffix in association with said stemmed word thereby enabling said prefix or suffix to be restored to said stemmed word.
Restoring prefixes and suffixes to stemmed words improves the quality of key terms forming output of embodiments of the present invention.
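Storing the removed suffix alongside the stemmed word might be sketched as follows. The names stem_with_memory and restore, and the suffix list, are hypothetical, and prefixes are omitted for brevity.

```python
def stem_with_memory(word, suffixes=("ing", "es", "s")):
    """Return (stemmed word, removed suffix); the suffix is "" if nothing was removed."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)], suffix
    return word, ""


def restore(stemmed, suffix):
    """Reattach the stored suffix to recover the word as it appeared in the data set."""
    return stemmed + suffix


original = ["internet", "search", "engines"]
pairs = [stem_with_memory(w) for w in original]

stems_only = " ".join(stemmed for stemmed, _ in pairs)
restored = " ".join(restore(stemmed, suffix) for stemmed, suffix in pairs)
print(stems_only)
print(restored)
```

The stemmed form “internet search engin” is suitable for counting but is poor as output, whereas the restored form reads as it did in the data set.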