The present invention relates to computational linguistics. In particular, the present invention relates to a method of automatically filtering searches of large, untagged, heterogeneous collections of machine-readable texts using text genre.
The word "genre" usually functions as a literary substitute for "kind of text." Text genre differs from the related concepts of text topic and document genre. Text genre and text topic are not wholly independent. Distinct text genres like newspaper stories, novels and scientific articles tend to deal with different ranges of topics; however, topical commonalities within each of these text genres are very broad and abstract. Additionally, any extensive collection of texts relating to a single topic almost always includes works of more than one text genre, so that the formal similarities between them are limited to the presence of lexical items. While text genre as a concept is independent of document genre, the two genre types grow up in close historical association with dense functional interdependencies. For example, a single text genre may be associated with several document genres. A short story may appear in a magazine or an anthology, and a novel can be published serially in parts, reissued as a hardcover and later as a paperback. Similarly, a document genre like a newspaper may contain several text genres, like features, columns, advice-to-the-lovelorn, and crossword puzzles. These text genres might not read as they do if they did not appear in a newspaper, which licenses the use of context-dependent words like "yesterday" and "local". By virtue of their close association, material features of document genres often signal text genre. For example, a newspaper may use one font for the headlines of "hard news" and another for the headlines of analysis; a periodical may signal its topical content via paper stock; business and personal letters can be distinguished based upon page layout; and so on. It is because digitization eliminates these material clues as to text and document genres that it is often difficult to retrieve relevant texts from heterogeneous digital text collections.
The boundaries between textual genres mirror the divisions of social life into distinct roles and activities: between public and private, generalist and specialist, work and recreation, etc. Genres provide the context that makes documents interpretable, and for this reason genre, no less than content, shapes the user's conception of relevance. For example, a researcher seeking information about supercolliders or Napoleon will care as much about text genre as content: she will want to know not just what the source says, but whether that source appears in a scholarly journal or in a popular magazine.
Until recently, work on information retrieval and text classification has focused almost exclusively on the identification of topic, rather than on text genre. Two reasons explain this neglect. First, the traditional print-based document world did not perceive a need for genre classification because in this world genres are clearly marked, either intrinsically or by institutional and contextual features. A scientist looking in a library for an article about cold fusion need not worry about how to restrict his search to journal articles, which are catalogued and shelved so as to keep them distinct from popular science magazines. Second, early information retrieval work with on-line text databases focused on small, relatively homogeneous databases in which text genre was externally controlled, like encyclopedia or newspaper databases. The creation of large, heterogeneous text databases, in which the lines between text genres are often unmarked, highlights the importance of genre classification of texts. Topic-based search tools alone cannot adequately winnow the domain of a reader's interest when searching a large heterogeneous database.
Applications of genre classification are not limited to the field of information retrieval. Several linguistic technologies could also profit from its application. Both automatic part-of-speech taggers and sense taggers could benefit from genre classification because it is well known that the distribution of word senses varies enormously according to genre.
Discussions of literary classification stretch back to Aristotle. The literature on genre is rich with classificatory schemes and systems, some of which might be analyzed as simple attribute systems. These discussions tend to be vague and to focus exclusively on literary forms like the eclogue or the novel, and, to a lesser extent, on paraliterary forms like the newspaper crime report or the love letter. Classification discussions tend to ignore unliterary textual types such as annual reports, Email communications, and scientific abstracts. Moreover, none of these discussions make an effort to tie the abstract dimensions along which genres are distinguished to any formal features of the texts.
The only linguistic research specifically concerned with quantificational methods of genre classification of texts is that of Douglas Biber. His work includes: "Spoken and Written Textual Dimensions in English: Resolving the Contradictory Findings," Language, 62(2):384-413, 1986; Variation Across Speech and Writing, Cambridge University Press, 1988; "The Multidimensional Approach to Linguistic Analyses of Genre Variation: An Overview of Methodology and Findings," Computers and the Humanities, 26(5-6):331-347, 1992; "Using Register-Diversified Corpora for General Language Studies," in Using Large Corpora, pp. 179-202 (Susan Armstrong ed.) (1994); and, with Edward Finegan, "Drift and the Evolution of English Style: A History of Three Genres," Language, 65(1):93-124, 1989. Biber's work is descriptive, aimed at differentiating text genres functionally according to the types of linguistic features that each tends to exploit. He begins with a corpus that has been hand-divided into a number of distinct genres, such as "academic prose" and "general fiction." He then ranks these genres along several textual "dimensions" or factors, typically three or five. Biber individuates his factors by applying factor analysis to a set of linguistic features, most of them syntactic or lexical. These features include, for example, past-tense verbs, past participial clauses and "wh-" questions. He then assigns to his factors general meanings or functions by abstracting over the discourse functions that linguists have assigned to the individual components of each factor; e.g., an "informative vs. involved" dimension, a "narrative vs. non-narrative" dimension, and so on. Note that these factors are not individuated according to their usefulness in classifying individual texts according to genre.
The score that any text receives on a given factor or set of factors may not be greatly informative as to its genre because there is considerable overlap between genres with regard to any individual factor.
Jussi Karlgren and Douglass Cutting describe their effort to apply some of Biber's results to automatic categorization of genre in "Recognizing Text Genres with Simple Metrics Using Discriminant Analysis," in Proceedings of Coling '94, Volume II, pp. 1071-1075, August 1994. They too begin with a corpus of hand-classified texts, the Brown corpus. The people who organized the Brown corpus describe their classifications as generic, but the fit between the texts and the genres a sophisticated reader would recognize is only approximate. Karlgren and Cutting use either lexical or distributional features: the lexical features include first-person pronoun count and present-tense verb count, while the distributional features include long-word count and characters-per-word average. They do not use punctuational or character-level features. Using discriminant analysis, the authors classify the texts into various numbers of categories. When Karlgren and Cutting used a number of functions equal to the number of categories assigned by hand, the fit between the automatically derived and hand-classified categories was 51.6%. They improved performance by reducing the number of functions and reconfiguring the categories of the corpus. Karlgren and Cutting observe that it is not clear that such methods will be useful for information retrieval purposes, stating: "The problem with using automatically derived categories is that even if they are in a sense real, meaning that they are supported by the data, they may be difficult to explain for the unenthusiastic layman if the aim is to use the technique in retrieval tools." Additionally, it is not clear to what extent the idiosyncratic "genres" of the Brown corpus coincide with the categories that users find relevant for information retrieval tasks.
Geoffrey Nunberg and Patrizia Violi suggest that genre recognition will be important for information retrieval and natural language processing tasks in "Text, Form and Genre," in Proceedings of OED '92, pp. 118-122, October 1992. These authors propose that text genre can be treated in terms of attributes, rather than classes; however, they offer no concrete proposal as to how identification can be accomplished.
An advantage of the present invention is that it enables automatic filtering of information retrieval results according to text genre at a relatively small computational cost by using untagged texts. The use of cues that are string recognizable eliminates the need for tagged texts. According to the present invention, texts are classified using publicly recognized genre types that are each associated with a characteristic set of principles of interpretation, rather than automatically derived text genres. This increases the utility of genre classifications produced using the present invention in applications directed at the lay public. The utility of the present invention to the lay public is further increased because it can recognize the full range of textual genre types, including unliterary forms such as annual reports, Email communications and scientific abstracts.
The method of the present invention for automatically identifying the text genre of a machine-readable, untagged text provides these and other advantages. Briefly described, the processor-implemented method begins with a computer user indicating a desired topic for each text retrieved. Next, for each of the retrieved texts having the desired topic, the processor generates a cue vector that represents occurrences in the text of a first set of nonstructural, surface cues, which are easily computable from the text. Afterward, the processor classifies each retrieved text according to text genre using the text's cue vector and a weighting vector associated with each text genre. The processor then uses the text genres to determine an order of presentation to the computer user of the retrieved texts.
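The steps just summarized (cue vector, per-genre weighting vector, classification, ordering) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from this description: the four surface cues, the genre names, the weight values, the use of a dot product with a highest-score rule to combine the cue vector with each weighting vector, and the function names cue_vector, classify and order_by_genre.

```python
import re

# Hypothetical first-person word list used by one of the illustrative cues.
FIRST_PERSON = {"i", "me", "my", "we", "our"}


def cue_vector(text):
    """Count string-recognizable surface cues in untagged text (illustrative cues only)."""
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)
    n_chars = max(len(text), 1)
    return [
        text.count("!") / n_chars,                               # exclamation marks per character
        text.count("?") / n_chars,                               # question marks per character
        sum(w.lower() in FIRST_PERSON for w in words) / n_words,  # first-person word rate
        sum(len(w) for w in words) / n_words / 10.0,             # scaled average word length
    ]


def classify(text, weights):
    """Assign the genre whose weighting vector gives the highest dot product (assumed rule)."""
    cues = cue_vector(text)
    scores = {
        genre: sum(w * c for w, c in zip(vector, cues))
        for genre, vector in weights.items()
    }
    return max(scores, key=scores.get)


def order_by_genre(texts, weights, preferred):
    """Present texts of the preferred genre first; the stable sort keeps the original order otherwise."""
    return sorted(texts, key=lambda t: classify(t, weights) != preferred)


# Hypothetical weighting vectors, one per genre; real weights would be learned from labeled texts.
WEIGHTS = {
    "editorial": [50.0, 50.0, 5.0, 0.0],
    "scientific": [0.0, 0.0, -5.0, 2.0],
}

retrieved = [
    "I think we must act now! Why do we wait?",
    "Measurements indicate substantial thermal conductivity variation.",
]
for text in order_by_genre(retrieved, WEIGHTS, preferred="scientific"):
    print(classify(text, WEIGHTS), "|", text)
```

Because every cue is computed directly from the character string, the sketch needs no tagger or parser, which matches the untagged-text setting described above.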
Other objects, features, and advantages of the present invention will be apparent from the accompanying drawings and detailed description that follows.