1. Field of the Invention
The present invention relates to an information abstracting method, an information abstracting apparatus, and a weighting method, which can be used when extracting prescribed keywords from a plurality of character string data sets divided into prescribed units, such as data provided by teletext services, and also relates to a teletext broadcast receiving apparatus.
2. Description of the Related Art
In recent years, with the advent of the multimedia age, a large variety of information has come to be provided not only in the form of packaged media such as CD-ROMs but also through communications networks, commercial broadcasts, and the like. Such information includes textual information, provided by electronic books, teletext broadcasts, etc., in addition to video and voice information. Textual information is made up of character codes, such as the ASCII code and the JIS code, that can be readily processed by computer. However, for human beings, textual information poses problems in that the amount of information that can be displayed at a time is small, and in that it takes a long time to grasp the main points in it as compared to image information. These problems will become an important concern as the amount of information increases with the advance of the information society. One possible approach to these problems is to develop techniques for automatically interpreting the content of a document and rendering it into an easy-to-understand form. An example is the study of natural language processing in the research field of artificial intelligence. For practical implementation, however, there are many problems yet to be overcome, such as the need for large volumes of dictionary and grammatical information and the difficulty of reducing the probability of erroneously interpreting the content of text to a practical level, and so far there are few practical applications.
On the other hand, in recent years, receivers designed to receive teletext broadcasts that are transmitted as character codes over the air have been developed and made commercially available, and textual information provided for homes has been rapidly increasing in volume. In teletext broadcasting, large numbers of programs are provided, and since the information provided is in the form of text, the user can obtain information by reading text displayed on a television screen. This in turn presents a problem in that, to grasp the whole content of the information, the user has to read a large number of characters, turn over the pages in sequence, and so on. In fact, in the case of news or the like, the user usually has no previous knowledge of what will be provided as information, and therefore is not aware of what information is of interest to him. It is therefore difficult for the user to extract only the information that he needs; as a result, the user has to select the necessary information by himself after looking through the whole content. This means considerable time has to be spent in getting the necessary information, which has been a barrier to increasing the number of users who enjoy teletext broadcasts. There is therefore a growing need for an information abstracting facility that abstracts the main points of information in teletext and displays only those main points. Some teletext channels broadcast an abstract of major news, but there are still quite a few problems to be overcome: for example, the abstract itself consists of several pages, and the content, format, and length of an abstract differ from one broadcast station to another.
Among information abstracting techniques for document data that have so far been put to practical use is the keyword extraction technique. Intended for scientific papers and the like, this technique involves calculating the frequencies of occurrence of technical terms and the like used in a paper and selecting keywords of high frequency of occurrence to produce an abstract of the paper. The reason that such a technique has been put to practical use is that it is intended for documents, such as papers in specific fields, where the number of frequently used terms is more or less limited. For such fields, it is relatively easy to prepare a dictionary of terms to be extracted as keywords. The keywords automatically extracted using this technique are appended to each paper and used for the sorting out and indexing of the papers.
However, if the above-described keyword extraction technique is applied to the abstracting of a teletext broadcast program, it simply extracts keywords of high frequency of occurrence from the keywords appearing in the program. The result is the extraction of many keywords relating to similar things, and an abstract constructed from such keywords will be redundant. Furthermore, in the case of teletext broadcasts, there often arises a need to extract topics common to a plurality of programs as timely information, besides an abstract of the content of a particular program. For example, when news programs are being broadcast on a plurality of channels, a need may occur to abstract information so that common topics can be extracted as the current trend from the news programs on the different channels. In such cases, a technique is necessary that distinguishes the keywords repeatedly appearing within the same program from the keywords appearing across the plurality of programs. Furthermore, if it is attempted to apply the conventional natural language processing technique to news programs, for example, it is not possible to prepare an appropriate terminology dictionary in advance, since news programs tend to contain very many proper nouns. Accordingly, the prior art techniques cannot be applied as they are. Hence, there is a need for an information abstracting technique that can handle information trends and that does not require a terminology dictionary.
Moreover, when using keywords as an abstract of information, if keyword extraction is performed simply on the basis of the frequency of occurrence, there arises the problem that many keywords relating to similar things are extracted and relations between keywords are not clear, with the result that the information obtained is short on substance for the number of keywords extracted. For example, when keywords were actually extracted in order of frequency from seven teletext news programs being broadcast in the same time slot, the results shown in FIG. 15 were obtained, which shows the seven most frequent keywords. Parts (a) and (b) in FIG. 15 show the results of experiments conducted on news received at different times. As can be seen from these results, keyword extraction based simply on the frequency of occurrence has a problem as an information abstract. For example, in part (a), the first keyword and the sixth keyword both relate to the same topic, but this cannot be recognized from the simple listing of keywords shown in FIG. 15. A technique is therefore needed that avoids doubly extracting keywords having similar meanings and that explicitly indicates association between keywords in an information abstract. One approach to this need is to prepare dictionary information describing keyword meanings and keyword associations, as practiced in the conventional natural language processing technique; however, when practical problems are considered, preparing a large volume of dictionary information presents a problem in terms of cost, and the dictionary information to be prepared in advance must be reduced as much as possible. Furthermore, in the case of teletext news programs, preparing a dictionary in advance is itself difficult since a large number of proper nouns are used.
It is therefore necessary to provide a technique that automatically deduces association between keywords without using a dictionary and that abstracts information by also taking association between keywords into account.
The above-mentioned techniques are necessary not only for teletext broadcasts but for character string data in general provided in the form of character codes. For example, in the case of scientific papers also, there may arise a need to provide an abstract of a common topic in a scientific society, not just an abstract of each individual paper. Furthermore, in electronic mail equipment used in communications networks, these techniques will become necessary when extracting recent topics, etc. that are frequently discussed in all electronic mail messages.
Further, when extracting keywords based on the frequency of occurrence without using a dictionary, since a thesaurus cannot be prepared in advance, it becomes necessary to process keywords that describe the same thing but are expressed differently. For example, the name of an athlete may be written in one form at one time and in a different form at other times. For such keywords expressed differently but used with the same meaning, the frequencies of occurrence of the individual keywords must be added together to count the frequency. It is therefore imperative to develop a technique that treats such differently written keywords as similar keywords without using a dictionary and that calculates the frequency of occurrence by taking into account the frequency of occurrence of such similar keywords.
Such differently expressed keywords are frequently used, for example, in teletext broadcasts in which data are provided from a plurality of broadcast stations with different people producing data at different stations. In particular, in the case of a broadcast, such as news, dealing with events happening in the real world, the problem is that, unlike the case of scientific topics, there is no predefined terminology. Accordingly, for a keyword describing the same event, different expressions may be used from one information source to another. Therefore, for application to teletext broadcasts, it is an important task to develop a technique for processing differently expressed keywords.
For calculating similarity between differently expressed keywords without using a dictionary, one method is to use the number of characters common to two keywords, or the proportion of such characters. For example, when two four-character keywords have two characters in common, two of the four characters are the same between the keywords. In other words, not less than half the character string matches, so that these keywords can be considered similar. However, calculating the similarity simply on the basis of the number of common characters has a problem: when three keywords have the same two characters in common, for example, the similarity between any two of them becomes the same. It is therefore imperative to devise a method of calculation that provides a large similarity between keywords that describe the same thing but a small similarity between keywords that merely share characters.
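The limitation described above can be illustrated with a minimal sketch. The function name `common_char_ratio` and the placeholder keywords are assumptions made purely for illustration; they stand in for the keywords of the original example and are not taken from it.

```python
from collections import Counter


def common_char_ratio(a: str, b: str) -> float:
    """Proportion of characters that two keywords have in common.

    Characters are compared as multisets, so a repeated character is
    only counted as often as it appears in both keywords.
    """
    if not a or not b:
        return 0.0
    overlap = sum((Counter(a) & Counter(b)).values())
    return overlap / max(len(a), len(b))


# Hypothetical four-character keywords that all share the characters "ab".
k1, k2, k3 = "abcd", "abxy", "pqab"

# Every pair shares two of four characters, so every pair looks equally
# similar, whether or not the keywords describe the same thing.
ratios = [common_char_ratio(k1, k2),
          common_char_ratio(k1, k3),
          common_char_ratio(k2, k3)]
```

Because the measure looks only at shared characters, it cannot separate keywords that name the same thing from keywords that merely overlap, which is the problem the paragraph above raises.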
Processing differently expressed keywords becomes necessary especially in teletext broadcasts, etc., in which data are provided from a plurality of broadcast stations with different people producing data at different stations. Specifically, in the case of a broadcast dealing with events happening in the real world, such as news, the problem is that, unlike the case of scientific topics, there is no predefined terminology. Furthermore, in the case of names of conferences, names of persons, names of companies, etc., similar keywords may be used to represent topics having no relation to each other. Therefore, for application to teletext broadcasts, it is an important task to develop a technique for processing differently expressed keywords.
Furthermore, when extracting keywords from data and producing an abstract on the basis of the frequency of occurrence of the extracted keywords, articles such as “a” and “the”, prepositions, etc. in the English language, for example, are very frequent keywords. It is therefore necessary to remove such frequent keywords which do not have significance in representing topics.
In teletext broadcasts also, there may be a situation where English sentences are quoted. In such a situation, there is a possibility that the frequency of keywords, such as prepositions and articles, which are not significant in representing a topic, may become large. It is therefore imperative to develop a technique that does not output such keywords as an information abstract.
In view of the above-outlined problems with the prior art, it is an object of the present invention to provide an information abstracting method, an information abstracting apparatus, a weighting method, and a teletext broadcast receiving apparatus, which can be used to extract more appropriate keywords from data than the prior art can.
A first invention provides an information abstracting apparatus which comprises input means for accepting an input of character string data divided into prescribed units, with each individual character represented by a character code;
keyword extracting means for extracting a keyword for each of said prescribed units from the character string data input from said input means;
weighting means for weighting said extracted keyword by taking into account a state of occurrence, in the other prescribed units, of keywords that are identical to said extracted keyword;
keyword selecting means for selecting at least one keyword from said extracted keywords on the basis of the weighted result; and
output means for outputting said selected keyword as an information abstract relating to said character string data.
A second invention provides a teletext broadcast receiving apparatus which comprises teletext broadcast receiving means for receiving a teletext broadcast;
channel storing means for storing a plurality of channels of prescribed programs;
keyword extracting means for extracting a keyword from each of said prescribed programs received by said teletext broadcast receiving means on said channels stored in said channel storing means;
weighting means for weighting said extracted keyword by taking into account a state of occurrence, in the other prescribed programs, of keywords that are identical to said extracted keyword;
keyword selecting means for selecting keywords from said extracted keywords on the basis of the weighted result; and
display means for displaying all or part of said selected keywords as an information abstract relating to said teletext broadcast.
A third invention provides an information abstracting apparatus which comprises input means for accepting an input of character string data divided into prescribed units each subdivided into prescribed paragraphs, with each individual character represented by a character code;
keyword extracting means for extracting a keyword for each paragraph in each of said prescribed units from the character string data input from said input means;
keyword associating means for generating a keyword association by associating one keyword with another among keywords obtained from the same paragraph;
weighting means for weighting said extracted keyword by taking into account a state of occurrence, in the other prescribed units, of keywords that are identical to said extracted keyword, and for weighting said generated keyword association by taking into account a state of occurrence, in the other prescribed paragraphs, of keyword associations that are identical to said generated keyword association;
selecting means for selecting keywords and keyword associations from said extracted keywords and said generated keyword associations on the basis of the weighted results; and
output means for outputting said selected keywords and keyword associations as an information abstract relating to said character string data.
A fourth invention provides a teletext broadcast receiving apparatus which comprises teletext broadcast receiving means for receiving a teletext broadcast;
channel storing means for storing a plurality of channels of prescribed programs;
keyword extracting means for extracting a keyword from each of said prescribed programs received by said teletext broadcast receiving means on said channels stored in said channel storing means;
keyword associating means for generating a keyword association by associating one keyword with another among keywords obtained from the same paragraph in the same program;
weighting means for weighting said extracted keyword by taking into account a state of occurrence, in the other prescribed programs, of keywords that are identical to said extracted keyword, and for weighting said generated keyword association by taking into account a state of occurrence, in the other prescribed paragraphs, of keyword associations that are identical to said generated keyword association;
selecting means for selecting keywords and keyword associations from said extracted keywords and said generated keyword associations on the basis of the weighted results; and
display means for displaying all or part of said selected keywords and keyword associations as an information abstract relating to said teletext broadcast.
A fifth invention provides an information abstracting apparatus which comprises input means for accepting an input of character string data divided into prescribed units, with each individual character represented by a character code;
keyword extracting means for extracting a keyword for each of said prescribed units from said input character string data;
similarity calculating means for calculating similarity between keywords thus extracted;
weighting means for weighting said extracted keyword by taking into account a state of occurrence, in the other prescribed units, of keywords that are identical or similar to said extracted keyword;
keyword selecting means for selecting keywords from said extracted keywords on the basis of the weighted result; and
output means for outputting said selected keywords as an information abstract relating to said character string data.
A sixth invention provides a teletext broadcast receiving apparatus which comprises teletext broadcast receiving means for receiving a teletext broadcast;
channel storing means for storing a plurality of channels of prescribed programs;
keyword extracting means for extracting a keyword from each of said prescribed programs received by said teletext broadcast receiving means on said channels stored in said channel storing means;
similarity calculating means for calculating similarity between keywords thus extracted;
weighting means for weighting said extracted keyword by taking into account a state of occurrence, in the other prescribed programs, of keywords that are identical or similar to said extracted keyword;
keyword selecting means for selecting keywords from said extracted keywords on the basis of the weighted result; and
display means for displaying all or part of said selected keywords as an information abstract relating to said teletext broadcast.
A seventh invention provides an information abstracting apparatus which comprises input means for accepting an input of character string data divided into prescribed units each subdivided into prescribed paragraphs, with each individual character represented by a character code;
keyword extracting means for extracting a keyword for each paragraph in each of said prescribed units from said character string data input from said input means;
keyword associating means for generating a keyword association by associating one keyword with another among keywords obtained from the same paragraph;
similarity calculating means for calculating similarity between keywords thus extracted, on the basis of a plurality of factors including said keyword association;
weighting means for weighting said extracted keyword by taking into account a state of occurrence, in the other prescribed units, of keywords that are identical or similar to said extracted keyword, and for weighting said generated keyword association by taking into account a state of occurrence, in the other prescribed paragraphs, of keyword associations that are identical to said generated keyword association;
selecting means for selecting keywords and keyword associations from said extracted keywords and said generated keyword associations on the basis of the weighted results; and
output means for outputting said selected keywords and keyword associations as an information abstract relating to said character string data.
An eighth invention provides a teletext broadcast receiving apparatus which comprises teletext broadcast receiving means for receiving a teletext broadcast;
channel storing means for storing a plurality of channels of prescribed programs;
keyword extracting means for extracting a keyword from each of said prescribed programs received by said teletext broadcast receiving means on said channels stored in said channel storing means;
keyword associating means for generating a keyword association by associating one keyword with another among keywords obtained from the same paragraph in the same program;
similarity calculating means for calculating similarity between keywords thus extracted, on the basis of a plurality of factors including said keyword association;
weighting means for weighting said extracted keyword by taking into account a state of occurrence, in the other prescribed programs, of keywords that are identical or similar to said extracted keyword, and for weighting said generated keyword association by taking into account a state of occurrence, in the other prescribed paragraphs, of keyword associations that are identical to said generated keyword association;
selecting means for selecting keywords and keyword associations from said extracted keywords and said generated keyword associations on the basis of the weighted results; and
display means for displaying all or part of said selected keywords and keyword associations as an information abstract relating to said teletext broadcast.
A ninth invention provides an information abstracting apparatus which comprises exception keyword storing means for prestoring keywords which are not processed as keywords, wherein
when extracting a keyword in each prescribed unit from the character string data input from input means, any keyword identical to a keyword stored in the exception keyword storing means is excluded from the group of keywords to be extracted.
A 10th invention provides a teletext broadcast receiving apparatus which comprises exception keyword storing means for prestoring keywords which are not processed as keywords, wherein
when extracting a keyword from each of the programs received by the teletext broadcast receiving means on the channels stored in channel storing means, any keyword identical to a keyword stored in the exception keyword storing means is excluded from the group of keywords to be extracted.
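The exclusion of exception keywords described for the ninth and 10th inventions can be sketched as follows. This is only an illustration under stated assumptions: the contents of `EXCEPTION_KEYWORDS`, the function name `extract_keywords`, and the whitespace-based splitting are all hypothetical, since this passage does not specify how candidate keywords are obtained.

```python
# Hypothetical contents of the exception keyword store.
EXCEPTION_KEYWORDS = {"a", "an", "the", "of", "in", "on", "to"}


def extract_keywords(unit_text: str, exceptions=EXCEPTION_KEYWORDS) -> list[str]:
    """Return candidate keywords from one unit, skipping exception keywords.

    Assumption: candidates are whitespace-separated words with surrounding
    punctuation stripped and case folded.
    """
    words = (w.strip(".,;:!?\"'()").lower() for w in unit_text.split())
    return [w for w in words if w and w not in exceptions]


keywords = extract_keywords("The price of oil rose in the morning.")
```

Filtering at extraction time means the excluded words never reach the weighting step, so they cannot be selected however often they occur.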
According to the first invention, an input of character string data divided, for example, into prescribed units, with each individual character represented by a character code, is accepted,
a keyword is extracted for each prescribed unit from the input character string data,
the extracted keyword is weighted by taking into account the state of occurrence in the other prescribed units of keywords that are identical to the extracted keyword,
keywords are selected on the basis of the weighted result, and
the selected keywords are output as an information abstract relating to the character string data.
Accordingly, keywords used in common among many units, for example, are preferentially selected as an information abstract. This means that keywords appearing in topics common to many units are selected and extracted as abstracted information representing the general trend in the character string data.
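The steps of the first invention can be sketched as below. The passage does not give a weighting formula, so the sketch assumes the simplest one consistent with it: a keyword's score is the number of units in which it occurs. The function names and the sample keywords are hypothetical.

```python
from collections import Counter


def score_keywords(units: list[list[str]]) -> Counter:
    """Score each keyword by the number of units in which it occurs.

    Repetitions inside a single unit do not raise the score (assumption).
    """
    scores = Counter()
    for keywords in units:
        scores.update(set(keywords))
    return scores


def select_keywords(units: list[list[str]], top_n: int = 3) -> list[str]:
    """Select the top_n keywords by score, breaking ties alphabetically."""
    ranked = sorted(score_keywords(units).items(), key=lambda kv: (-kv[1], kv[0]))
    return [keyword for keyword, _ in ranked[:top_n]]


# Hypothetical keywords already extracted from three units.
units = [
    ["budget", "budget", "budget", "election"],
    ["election", "storm"],
    ["election", "storm", "traffic"],
]
abstract = select_keywords(units, top_n=2)
```

Here "budget" is the most frequent word overall, but it occurs in only one unit, so the keywords shared across units are selected ahead of it.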
According to the second invention, the teletext broadcast receiving means receives a teletext broadcast,
the channel storing means stores a plurality of channels of prescribed programs,
the keyword extracting means extracts a keyword from each of the prescribed programs received by the teletext broadcast receiving means on the channels stored in the channel storing means,
the weighting means weights the extracted keyword by taking into account the state of occurrence in the other prescribed programs of keywords that are identical to the extracted keyword,
the keyword selecting means selects keywords on the basis of the weighted result, and
the display means displays all or part of the selected keywords as an information abstract relating to the teletext broadcast.
In this invention, a score is calculated for each keyword in such a manner that a higher score is given, for example, to a keyword appearing in a larger number of programs. As a result, from the teletext programs being broadcast on the channels stored in the channel storing means, keywords in a topic raised in common among many programs, for example, can be extracted as an information abstract. That is, from ever-changing contents such as programs broadcast as teletext, the trend in the latest information is extracted and displayed as an information abstract.
According to the third invention, an input of character string data divided into prescribed units each subdivided into prescribed paragraphs, with each individual character represented by a character code, is accepted,
a keyword is extracted for each paragraph in each prescribed unit from the input character string data,
a keyword association is generated by associating one keyword with another among keywords obtained from the same paragraph,
the extracted keyword is weighted by taking into account the state of occurrence in the other prescribed units of keywords that are identical to the extracted keyword, and also, the generated keyword association is weighted by taking into account the state of occurrence in the other prescribed paragraphs of keyword associations that are identical to the generated keyword association,
keywords and keyword associations are selected on the basis of the weighted results, and
the selected keywords and keyword associations are output as an information abstract relating to the character string data.
In this invention, a score is calculated for each keyword and for each keyword association in such a manner that a higher score is given, for example, to a keyword appearing in a larger number of units or a keyword association appearing in a larger number of paragraphs. Based on the thus calculated scores, keywords and keyword associations are selected and extracted as an information abstract. As a result, as compared to a case, for example, where keywords having high frequency of occurrence are simply selected, pairs of keywords having closer association and frequently occurring together can be extracted; when displaying an information abstract, if there are many keywords closely associated with each other, only representative keywords are displayed, and when displaying a plurality of associated keywords, the keywords are displayed by associating one keyword with another, thereby preventing the content of the information abstract from becoming redundant or difficult to understand.
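A minimal sketch of the scoring described for the third invention follows. It assumes that a keyword association is an unordered pair of distinct keywords from the same paragraph, that a keyword's score is the number of units containing it, and that an association's score is the number of paragraphs containing it. These choices, and all names in the code, are illustrative assumptions rather than details given in this passage.

```python
from collections import Counter
from itertools import combinations


def score_associations(units):
    """Score keywords per unit and keyword associations per paragraph.

    `units` is a list of units, each a list of paragraphs, each a list of
    keywords. Returns (keyword_scores, association_scores).
    """
    keyword_scores = Counter()
    association_scores = Counter()
    for paragraphs in units:
        unit_keywords = set()
        for paragraph in paragraphs:
            distinct = sorted(set(paragraph))
            unit_keywords.update(distinct)
            # Sorted input gives each pair one canonical (a, b) ordering.
            association_scores.update(combinations(distinct, 2))
        keyword_scores.update(unit_keywords)
    return keyword_scores, association_scores


# Hypothetical input: two units, the first with two paragraphs.
units = [
    [["minister", "summit"], ["storm"]],
    [["summit", "minister", "trade"]],
]
keyword_scores, association_scores = score_associations(units)
```

The pair ("minister", "summit") scores highest because it recurs in two paragraphs, so the two keywords could be presented together rather than as separate entries.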
According to the fourth invention, the teletext broadcast receiving means receives a teletext broadcast,
the channel storing means stores a plurality of channels of prescribed programs,
the keyword extracting means extracts a keyword from each of the prescribed programs received by the teletext broadcast receiving means on the channels stored in the channel storing means,
the keyword associating means generates a keyword association by associating one keyword with another among keywords obtained from the same paragraph in the same program,
the weighting means weights the extracted keyword by taking into account the state of occurrence in the other prescribed programs of keywords that are identical to the extracted keyword, and also weights the generated keyword association by taking into account the state of occurrence in the other prescribed paragraphs of keyword associations that are identical to the generated keyword association,
the selecting means selects keywords and keyword associations on the basis of the weighted results, and
the display means displays all or part of the selected keywords and keyword associations as an information abstract relating to the teletext broadcast.
In this invention, a score is calculated for each keyword and for each keyword association in such a manner that a higher score is given, for example, to a keyword appearing in a larger number of programs or a keyword association appearing in a larger number of paragraphs. Based on the thus calculated scores, keywords and keyword associations are selected and extracted as an information abstract. As a result, as compared to a case, for example, where keywords having high frequency of occurrence are simply selected, pairs of keywords having closer association and frequently occurring together can be extracted by associating one keyword with another. Especially, in teletext, when information obtained from the same information source, such as news, is broadcast on different channels, a plurality of common keywords occur in different programs to describe the same event; therefore, by removing redundant keywords from the abstract in accordance with keyword associations, and by clarifying their associations, an abstract rich in content can be presented for viewing. This makes it possible to grasp the main points in a large volume of information in a short time.
According to the fifth invention, an input of character string data divided into prescribed units, with each individual character represented by a character code, is accepted,
a keyword is extracted for each prescribed unit from the input character string data,
similarity between keywords thus extracted is calculated,
the extracted keyword is weighted by taking into account the state of occurrence in the other prescribed units of keywords that are identical or similar to the extracted keyword,
keywords are selected on the basis of the weighted result, and
the selected keywords are output as an information abstract relating to the character string data.
In this way, keywords which do not exactly match, for example, are treated as similar keywords, and higher scores are given as the number of similar keywords extracted increases, such keywords being displayed as an information abstract.
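The effect of counting similar keywords together can be sketched as follows. This passage does not define the similarity calculation, so the sketch substitutes the ratio from Python's `difflib.SequenceMatcher` as a stand-in; the threshold of 0.8, the function names, and the sample keywords are likewise assumptions.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Stand-in similarity test; not the calculation used by the invention."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def score_with_similarity(units: list[list[str]], threshold: float = 0.8) -> dict:
    """Score each keyword by the number of units that contain the keyword
    itself or a keyword similar to it."""
    vocabulary = sorted({kw for unit in units for kw in unit})
    return {
        kw: sum(
            1 for unit in units
            if any(similar(kw, other, threshold) for other in set(unit))
        )
        for kw in vocabulary
    }


# Hypothetical units in which the same term is spelled in two ways.
units = [["organisation"], ["organization"], ["storm"]]
scores = score_with_similarity(units)
```

Each spelling occurs in only one unit, but both receive a score of 2 because the other spelling is treated as an occurrence of a similar keyword.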
According to the sixth invention, the teletext broadcast receiving means receives a teletext broadcast,
the channel storing means stores a plurality of channels of prescribed programs,
the keyword extracting means extracts a keyword from each of the prescribed programs received by the teletext broadcast receiving means on the channels stored in the channel storing means,
the similarity calculating means calculates similarity between keywords thus extracted,
the weighting means weights the extracted keyword by taking into account the state of occurrence in the other prescribed programs of keywords that are identical or similar to the extracted keyword,
the keyword selecting means selects keywords on the basis of the weighted result, and
the display means displays all or part of the selected keywords as an information abstract relating to the teletext broadcast.
In this way, the frequencies of keywords which are similar to each other but are differently expressed in different programs, for example, are added together.
According to the seventh invention, an input of character string data divided into prescribed units each subdivided into prescribed paragraphs, with each individual character represented by a character code, is accepted,
a keyword is extracted for each paragraph in each prescribed unit from the input character string,
a keyword association is generated by associating one keyword with another among keywords obtained from the same paragraph,
similarity between keywords thus extracted is calculated on the basis of a plurality of factors including the keyword association,
the extracted keyword is weighted by taking into account the state of occurrence in the other prescribed units of keywords that are identical or similar to the extracted keyword, and also, the generated keyword association is weighted by taking into account the state of occurrence in the other prescribed paragraphs of keyword associations that are identical to the generated keyword association,
keywords and keyword associations are selected on the basis of the weighted results, and
the selected keywords and keyword associations are output as an information abstract relating to the character string data.
In this way, the scores of keywords which are similar to each other but are differently expressed in different programs, for example, are added together. By adding the scores in this manner, a high score is given to a keyword that is significant in expressing the information abstract even if the keyword is expressed differently.
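The generation and weighting of keyword associations can be sketched as follows. This is an illustration under stated assumptions, not the method of the invention: an association is taken to be an unordered pair of distinct keywords from the same paragraph, its weight is taken to be the number of paragraphs in which the identical pair occurs, and the names `associations_in` and `weight_associations` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def associations_in(paragraph_keywords):
    """All unordered pairs of distinct keywords from one paragraph."""
    unique = sorted(set(paragraph_keywords))
    return set(combinations(unique, 2))


def weight_associations(units):
    """Weight each association by the number of paragraphs, across all
    units, in which the identical association occurs."""
    weights = Counter()
    for unit in units:
        for paragraph in unit:
            weights.update(associations_in(paragraph))
    return weights


# Hypothetical input: each unit is a list of paragraphs,
# and each paragraph is a list of extracted keywords.
units = [
    [["summit", "tokyo", "trade"], ["rain", "tokyo"]],
    [["summit", "trade"]],
]
weights = weight_associations(units)
print(weights.most_common(1))
```

The pair ("summit", "trade") occurs in two paragraphs and therefore receives the highest weight, while every other pair occurs only once.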
According to the eighth invention, the teletext broadcast receiving means receives a teletext broadcast,
the channel storing means stores a plurality of channels of prescribed programs,
the keyword extracting means extracts a keyword from each of the prescribed programs received by the teletext broadcast receiving means on the channels stored in the channel storing means,
the keyword associating means generates a keyword association by associating one keyword with another among keywords obtained from the same paragraph in the same program,
the similarity calculating means calculates similarity between keywords thus extracted, on the basis of a plurality of factors including the keyword association,
the weighting means weights the extracted keyword by taking into account the state of occurrence in the other prescribed programs of keywords that are identical or similar to the extracted keyword, and also weights the generated keyword association by taking into account the state of occurrence in the other prescribed paragraphs of keyword associations that are identical to the generated keyword association,
the selecting means selects keywords and keyword associations on the basis of the weighted results, and
the display means displays all or part of the selected keywords and keyword associations as an information abstract relating to the teletext broadcast.
In this way, for keywords expressed differently in different programs, for example, similarity is calculated using their associated keywords, and if similar keywords are extracted, their scores are added together. In particular, for keywords that are similar in expression but used in totally different topics, the associated keywords show no similarity, so that the resulting similarity is reduced. This ensures accurate calculation of similarity.
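A minimal sketch of a similarity calculated from a plurality of factors is given below. The two factors used (a prefix-based similarity of expression and a Jaccard overlap of associated keywords), the equal weighting of 0.5, and all function names are assumptions made for illustration; the text above does not fix the factors or how they are combined.

```python
def expression_similarity(a: str, b: str) -> float:
    """Common-prefix length divided by the longer length (assumed measure)."""
    if not a or not b:
        return 0.0
    common = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        common += 1
    return common / max(len(a), len(b))


def association_similarity(assoc_a: set, assoc_b: set) -> float:
    """Jaccard overlap between the sets of associated keywords."""
    union = assoc_a | assoc_b
    return len(assoc_a & assoc_b) / len(union) if union else 0.0


def combined_similarity(a, assoc_a, b, assoc_b, expression_weight=0.5):
    """Weighted sum of the two factors (the weighting is an assumption)."""
    return (
        expression_weight * expression_similarity(a, b)
        + (1 - expression_weight) * association_similarity(assoc_a, assoc_b)
    )


# Hypothetical keywords with the keywords associated with each of them.
same_topic = combined_similarity(
    "Olympics", {"Atlanta", "medal"},
    "Olympic Games", {"Atlanta", "medal", "marathon"},
)
other_topic = combined_similarity(
    "Olympics", {"Atlanta", "medal"},
    "Olympic Airways", {"flight", "strike"},
)
print(same_topic > other_topic)
```

Both comparisons involve keywords that share the prefix "Olympic", but only the first pair shares associated keywords, so its combined similarity is the higher of the two.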
According to the ninth invention, when extracting a keyword in each prescribed unit from the character string data input from the input means, any keyword identical to a keyword stored in the exception keyword storing means is excluded from the group of keywords to be extracted.
In this way, when data written in English is input, for example, prestoring articles, prepositions, etc. in the exception keyword storing means prevents these words from being included in the group of keywords displayed as an information abstract.
According to the 10th invention, when extracting a keyword from each of the programs received by the teletext broadcast receiving means on the channels stored in the channel storing means, any keyword identical to a keyword stored in the exception keyword storing means is excluded from the group of keywords to be extracted.
In this way, when English sentences, for example, are included in a teletext program, keywords such as articles and prepositions, which are not significant in describing a topic, are not included in the information abstract.
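The exclusion of exception keywords described for the ninth and 10th inventions can be sketched as follows. The contents of `EXCEPTION_KEYWORDS`, the word-by-word extraction with a regular expression, and the case-insensitive comparison are assumptions for this example only; the text above specifies only that keywords identical to stored exception keywords are excluded.

```python
import re

# Hypothetical contents of the exception keyword store:
# a few English articles and prepositions.
EXCEPTION_KEYWORDS = frozenset(
    {"a", "an", "the", "of", "in", "on", "at", "to", "for", "with", "by"}
)


def extract_keywords(text, exceptions=EXCEPTION_KEYWORDS):
    """Split text into alphabetic words and drop any word that matches
    a stored exception keyword (compared case-insensitively)."""
    words = re.findall(r"[A-Za-z]+", text)
    return [word for word in words if word.lower() not in exceptions]


keywords = extract_keywords("The summit of the leaders in Tokyo")
print(keywords)
```

With the assumed exception list, the words "The", "of", "the", and "in" are dropped, leaving "summit", "leaders", and "Tokyo" as candidates for the information abstract.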