1. Field of the Invention
The present invention relates to an information abstracting method, an information abstracting apparatus, and a weighting method, which can be used when extracting prescribed keywords from a plurality of character string data sets divided into prescribed units, such as data provided by teletext services, and also relates to a teletext broadcast receiving apparatus.
2. Description of the Related Art
In recent years, with the advent of the multimedia age, a large variety of information has come to be provided not only in the form of packaged media such as CD-ROMs but also through communications networks, commercial broadcasts, and the like. Such information includes textual information, provided by electronic books, teletext broadcasts, etc., in addition to video and voice information. Textual information is made up of character codes, such as the ASCII code and the JIS code, that can be readily processed by computer. However, for human beings, textual information poses problems in that the amount of information that can be displayed at a time is small, and in that it takes a long time to grasp the main points in it as compared to image information. These problems will become increasingly important as the amount of information grows with the advance of the information society. One possible approach to these problems is to develop techniques for automatically interpreting the content of a document and rendering it into an easy-to-understand form. One such technique is natural language processing, which is studied in the research field of artificial intelligence. For practical implementation, however, there are many problems yet to be overcome, such as the need for large amounts of dictionary and grammatical information and the difficulty in reducing the probability of erroneously interpreting the content of text to a practical level, and so far there are few practical applications.
On the other hand, in recent years, receivers designed to receive teletext broadcasts that are transmitted as character codes over the air have been developed and made commercially available, and textual information provided for homes has been rapidly increasing in volume. In teletext broadcasting, large numbers of programs are provided, and since the information provided is in the form of text, the user can obtain information by reading text displayed on a television screen. This in turn presents a problem in that to grasp the whole content of information the user has to read a large number of characters, turn over the pages in sequence, and so on. In fact, in the case of news or the like, the user usually has no previous knowledge of what will be provided as information, and therefore is not aware of what information is of interest to him. It is therefore difficult for the user to extract only the information that he needs; as a result, the user has to select the necessary information by himself after looking through the whole content. This means considerable time has to be spent in getting the necessary information, which has been a barrier to increasing the number of users who enjoy teletext broadcasts. There has therefore been an increasing need for an information abstracting facility that abstracts the main points of information in teletext and displays only those main points. Some teletext channels broadcast an abstract of major news, but there are still quite a few problems to be overcome: for example, the abstract itself consists of several pages, and the content, format, and length of an abstract differ from one broadcast station to another.
Among information abstracting techniques for document data that have so far been put to practical use is the keyword extraction technique. Intended for scientific papers and the like, this technique involves calculating the frequencies of occurrence of technical terms and the like used in a paper and selecting keywords of high frequency of occurrence to produce an abstract of the paper. The reason that such a technique has been put to practical use is that it is intended for documents, such as papers in specific fields, where the number of frequently used terms is more or less limited. For such fields, it is relatively easy to prepare a dictionary of terms to be extracted as keywords. The keywords automatically extracted using this technique are appended to each paper and used for the sorting out and indexing of the papers.
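The dictionary-based keyword extraction described above can be illustrated with a minimal sketch. The function name extract_keywords, the whitespace tokenization, and the sample terms are assumptions made for illustration only and are not taken from any particular prior art system.

```python
from collections import Counter


def extract_keywords(text, term_dictionary, top_n=3):
    """Return the top_n dictionary terms that occur most often in text.

    Hypothetical helper: tokenization is a plain whitespace split, and
    only words found in the prepared term dictionary are counted.
    """
    words = text.lower().split()
    counts = Counter(word for word in words if word in term_dictionary)
    return [term for term, _ in counts.most_common(top_n)]


# Illustrative data: a tiny "paper" and a hand-made dictionary of terms.
paper = "laser diode laser cavity laser diode output"
terms = {"laser", "diode", "cavity"}
print(extract_keywords(paper, terms, top_n=2))  # ['laser', 'diode']
```

The sketch works only because the set of terms is prepared in advance, which is practical for a narrow technical field but, as discussed below, not for news text.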
However, if the above-described keyword extraction technique is applied to the abstracting of a teletext broadcast program, it simply extracts keywords of high frequency of occurrence from the keywords appearing in the program. The result is the extraction of many keywords relating to similar things, and an abstract constructed from such keywords will be redundant. Furthermore, in the case of teletext broadcasts, there often arises a need to extract topics common to a plurality of programs as timely information, besides an abstract of the content of a particular program. For example, when news programs are being broadcast on a plurality of channels, a need may occur to abstract information so that common topics can be extracted as the current trend from the news programs on the different channels. In such cases, a technique is necessary that distinguishes the keywords repeatedly appearing within the same program from the keywords appearing across the plurality of programs. Furthermore, if one attempts to apply the conventional natural language processing technique to news programs, for example, it is not possible to prepare an appropriate terminology dictionary in advance, since news programs tend to contain very many proper nouns. Accordingly, the prior art techniques cannot be applied as they are. Hence, there is a need for an information abstracting technique that can handle information trends and that does not require a terminology dictionary.
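The distinction drawn above, between a keyword repeated within one program and a keyword appearing across several programs, can be made concrete with a short sketch. It is an illustration of the problem only, not the method of the invention; the function name keyword_statistics and the sample keyword lists are hypothetical.

```python
from collections import Counter


def keyword_statistics(programs):
    """Return (total, spread) counters for a list of keyword lists.

    total[k]  = number of occurrences of k over all programs
    spread[k] = number of programs in which k occurs at least once
    """
    total = Counter()
    spread = Counter()
    for keywords in programs:
        total.update(keywords)
        spread.update(set(keywords))  # count each program once per keyword
    return total, spread


# Illustrative data: three programs, already split into keywords.
programs = [
    ["election", "election", "election", "weather"],
    ["election", "strike"],
    ["strike", "weather"],
]
total, spread = keyword_statistics(programs)
for word in sorted(total):
    print(word, total[word], spread[word])
# election 4 2
# strike 2 2
# weather 2 2
```

In this example "election" has the highest raw frequency only because it is repeated within one program; all three keywords appear in the same number of programs. A simple frequency count cannot tell these two situations apart.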
Moreover, when using keywords as an abstract of information, if keyword extraction is performed simply on the basis of the frequency of occurrence, there arises the problem that many keywords relating to similar things are extracted and relations between keywords are not clear, with the result that the information obtained is short on substance for the number of keywords extracted. For example, when keywords were actually extracted in order of frequency from seven teletext news programs being broadcast in the same time slot, the results shown in FIG. 15 were obtained, which lists the seven most frequent keywords. Parts (a) and (b) in FIG. 15 show the results of experiments conducted on news received at different times. As can be seen from these results, keyword extraction based simply on the frequency of occurrence has a problem as an information abstract. For example, in part (a), the first keyword {character pullout} and the sixth keyword {character pullout}{character pullout} both relate to the same topic, but this cannot be recognized from the simple listing of keywords shown in FIG. 15. A technique is therefore needed that avoids doubly extracting keywords having similar meanings and that explicitly indicates association between keywords in an information abstract. One approach to this need is to prepare dictionary information describing keyword meanings and keyword associations, as practiced in the conventional natural language processing technique; however, when practical problems are considered, preparing a large volume of dictionary information presents a problem in terms of cost, and the dictionary information to be prepared in advance must be reduced as much as possible. Furthermore, in the case of teletext news programs, preparing a dictionary in advance is itself difficult since a large number of proper nouns are used.
It is therefore necessary to provide a technique that automatically deduces association between keywords without using a dictionary and that abstracts information by also taking association between keywords into account.
The above-mentioned techniques are necessary not only for teletext broadcasts but for character string data in general provided in the form of character codes. For example, in the case of scientific papers also, there may arise a need to provide an abstract of a common topic in a scientific society, not just an abstract of each individual paper. Furthermore, in electronic mail equipment used in communications networks, these techniques will become necessary when extracting recent topics, etc. that are frequently discussed in all electronic mail messages.
Further, when extracting keywords based on the frequency of occurrence without using a dictionary, since a thesaurus cannot be prepared in advance, it becomes necessary to process keywords that describe the same thing but are expressed differently. For example, the name of an athlete may be written as {character pullout} at one time and as {character pullout} at other times. For such keywords expressed differently but used with the same meaning, the frequency of occurrence of each keyword must be added together to count the frequency. It is therefore imperative to develop a technique that treats different keywords, such as {character pullout} and {character pullout}, as similar keywords without using a dictionary and that calculates the frequency of occurrence by taking into account the frequency of occurrence of such similar keywords.
Such differently expressed keywords are frequently used, for example, in teletext broadcasts in which data are provided from a plurality of broadcast stations with different people producing data at different stations. In particular, in the case of a broadcast, such as news, dealing with events happening in the real world, the problem is that, unlike the case of scientific topics, there is no predefined terminology. Accordingly, for a keyword describing the same event, different expressions may be used from one information source to another. Therefore, for application to teletext broadcasts, it is an important task to develop a technique for processing differently expressed keywords.
For calculating similarity between differently expressed keywords without using a dictionary, one possible method is to use the number of characters that two keywords have in common, or the proportion of such characters. For example, in the case of {character pullout} and {character pullout}, the two characters {character pullout} are common, that is, two of the four characters are the same between the keywords. In other words, not less than half the character string matches, so that these keywords can be considered similar. However, calculating the similarity simply based on the number of common characters has a problem: in the case of keywords {character pullout}, {character pullout}, and {character pullout}, which have the two characters {character pullout} in common, for example, the similarity among them will be the same. It is therefore imperative to devise a method of calculation that provides a large similarity between {character pullout} and {character pullout} but a small similarity between {character pullout} and {character pullout}.
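The weakness of the common-character measure can be reproduced with a small sketch. Latin letters stand in for the original characters, which are not available here, and dividing by the length of the longer keyword is an assumption; the text above does not fix how the proportion is taken. The function name common_char_similarity is hypothetical.

```python
from collections import Counter


def common_char_similarity(a, b):
    """Proportion of characters that keywords a and b have in common.

    Shared characters are counted as a multiset intersection and divided
    by the length of the longer keyword (an assumed normalization).
    """
    if not a or not b:
        return 0.0
    shared = sum((Counter(a) & Counter(b)).values())
    return shared / max(len(a), len(b))


# Three stand-in keywords that all contain the two characters "AB".
print(common_char_similarity("ABCD", "ABEF"))  # 0.5
print(common_char_similarity("ABCD", "GHAB"))  # 0.5
print(common_char_similarity("ABEF", "GHAB"))  # 0.5
```

Every pair receives the same value, so a measure of this kind cannot rank one pair as more similar than another, which is the shortcoming the paragraph above points out.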
Processing differently expressed keywords becomes necessary especially in teletext broadcasts, etc. in which data are provided from a plurality of broadcast stations with different people producing data at different stations. Specifically, in the case of a broadcast dealing with events happening in the real world, such as news, the problem is that, unlike the case of scientific topics, there is no predefined terminology. Furthermore, in the case of names of conferences, names of persons, names of companies, etc., similar keywords may be used to represent topics having no relations with each other. Therefore, for application to teletext broadcasts, it is an important task to develop a technique for processing differently expressed keywords.
Furthermore, when extracting keywords from data and producing an abstract on the basis of the frequency of occurrence of the extracted keywords, articles such as "a" and "the", prepositions, etc. in the English language, for example, are very frequent keywords. It is therefore necessary to remove such frequent keywords which do not have significance in representing topics.
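As a rough illustration of the problem, the following sketch counts word frequencies in an English sentence with and without a small exclusion list. The list STOP_WORDS and the function frequent_keywords are hypothetical and chosen only for this example; they are not presented as the technique of the invention.

```python
from collections import Counter

# Assumed, hand-picked list of words that carry no topic information.
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "on", "to", "and"})


def frequent_keywords(text, top_n=3, stop_words=STOP_WORDS):
    """Return the top_n most frequent words in text, skipping stop_words."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in stop_words)
    return [w for w, _ in counts.most_common(top_n)]


text = "The summit opened in Tokyo. The summit ended on the island."
print(frequent_keywords(text, top_n=1, stop_words=frozenset()))  # ['the']
print(frequent_keywords(text, top_n=1))                          # ['summit']
```

Without the exclusion list the article "the" is the most frequent word; with it, the topical word "summit" comes first.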
In teletext broadcasts also, there may be a situation where English sentences are quoted. In such a situation, there is a possibility that the frequency of keywords, such as prepositions and articles, which are not significant in representing a topic, may become large. It is therefore imperative to develop a technique that does not output such keywords as an information abstract.
In view of the above-outlined problems with the prior art, it is an object of the present invention to provide an information abstracting method, an information abstracting apparatus, a weighting method, and a teletext broadcast receiving apparatus, which can be used to extract more appropriate keywords from data than the prior art can.