The present invention relates to a text analysis method, particularly for finding acronyms and variants of acronyms of reference terms in a text.
A very large proportion of all database content is available in unstructured form, most of it as text. The Internet, as the largest distributed text database, is assumed to comprise approximately one billion static websites and approximately five hundred billion dynamically generated websites. The volume of data stored online is estimated to be roughly one thousand petabytes and is still increasing. Automated text mining methods are required to handle such an information load and to analyze the data.
Text mining generally refers to an automated process of extracting information from a text. Text mining typically involves structuring the input text, deriving patterns within the structured data and, finally, evaluating and interpreting the output. Typical text mining tasks include, for instance, categorization and clustering of text, extraction of concepts, production of taxonomies, sentiment analysis and the like.
To find acronyms of reference terms, prior art methods can be applied to formal texts such as books, scientific publications, patent documents, well-managed websites and the like, which use accurate, formal language.
An acronym is an artificial word or sequence of letters which usually includes one or more characters taken from the words of a word group, such as “IT” for “information technology”. In formal texts, standard acronyms of compound terms are typically formed by using only initial characters. Such acronyms are usually introduced with a definition when used for the first time, which helps to identify them.
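The formation rule for standard acronyms can be illustrated with a minimal sketch. The function name `initial_acronym` is hypothetical and is introduced here only for illustration; the sketch assumes that the words of the compound term are separated by whitespace.

```python
def initial_acronym(term: str) -> str:
    """Form a standard acronym from the initial character of each word."""
    # Assumption: words of the compound term are separated by whitespace.
    return "".join(word[0].upper() for word in term.split())


print(initial_acronym("information technology"))  # IT
```

Nonstandard acronyms, which may take more than one character per word or omit words entirely, are not covered by such a rule.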
In contrast, acronyms in informal texts are frequently used without definition. One reason is that informal texts are written for a closed group of people who typically share a common understanding of the content, so that definitions of acronyms are considered unnecessary. Moreover, in informal texts the formation rules for building acronyms can be relaxed, resulting in nonstandard acronyms which are variants of classical acronyms or which can even consist of a multitude of words used contextually for a nonstandard purpose.
Y. Park and R. J. Byrd describe a method for finding acronyms and their definitions in “Hybrid Text Mining for Finding Abbreviations and Their Definitions”, Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, 2001. Acronyms and their respective definitions are identified by using commonly applied rules for forming abbreviations, text markings and cue words. Candidates for acronyms are identified by heuristics such as “the initial character is either a letter or a digit”, “the potential abbreviation is at least two characters long”, “the candidate includes at least one upper-case character”, or “the candidate cannot be the initial word of a sentence.” Additionally, the character string must not be a member of, for instance, an official dictionary, a list of names, a manually constructed list of stop words or the like.
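The candidate heuristics listed above can be sketched as a simple filter. This is an illustrative sketch only and not the implementation of the cited publication; the names `is_candidate` and `find_candidates`, the whitespace tokenization, and the tiny word lists are assumptions made for the example.

```python
# Hypothetical stand-ins for an official dictionary, a list of names
# and a manually constructed stop word list (all stored in lower case).
DICTIONARY = {"the", "and", "text", "mining", "department", "uses"}
NAMES = {"byrd", "park"}
STOP_WORDS = {"ok"}


def is_candidate(token: str, is_sentence_initial: bool) -> bool:
    """Apply the acronym candidate heuristics to a single token."""
    if len(token) < 2:                       # at least two characters long
        return False
    if not (token[0].isalpha() or token[0].isdigit()):
        return False                         # initial character: letter or digit
    if not any(ch.isupper() for ch in token):
        return False                         # at least one upper-case character
    if is_sentence_initial:                  # not the first word of a sentence
        return False
    lowered = token.lower()
    # Must not be a known word, a name or a stop word.
    return lowered not in DICTIONARY | NAMES | STOP_WORDS


def find_candidates(sentence: str) -> list[str]:
    """Return acronym candidates from one sentence (naive tokenization)."""
    tokens = [t.strip(".,;:!?()\"'") for t in sentence.split()]
    return [
        tok for i, tok in enumerate(tokens)
        if tok and is_candidate(tok, is_sentence_initial=(i == 0))
    ]


print(find_candidates("The IT department uses NLP and Text mining."))
# ['IT', 'NLP']
```

In this sketch “The” is rejected as the sentence-initial word and “Text” is rejected because it is a dictionary word, while “IT” and “NLP” pass all checks.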
Groups using informal texts can be, for instance, call center agents attending to service requests and answers related to product failures. Documents provided by such groups (call transcripts or summaries) are typically prepared hastily and under time pressure. Such informal texts are characterized by a high rate of irregularities such as misspellings, typing errors, idiosyncratic abbreviations, grammatically inaccurate or incomplete sentences in note form and the like.
Prior art text mining methods frequently fail to find acronyms or variants of acronyms of reference terms in such informal texts. Prior art solutions either require exact string matching between the acronym and the reference term or, if fuzzy string matching is allowed, produce more false matches the more strongly the variant deviates from the reference term. When a text is analyzed for acronyms and variants that deviate substantially from a classical acronym, in particular abbreviations, a user has to be involved. The user usually has to examine a subset of the text corpus manually to detect variants that differ from the reference terms by more than a minor spelling deviation, and the variants found have to be collected in a list of known acronyms, which is then used for analyzing the text. The results have to be checked manually and possibly refined. Thus, manual interaction by the user is necessary not only once but for every new text corpus, because new authors may abbreviate the same terms in largely different ways.
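The trade-off between exact and fuzzy string matching can be illustrated with a small sketch. It assumes that fuzzy matching is implemented as a Levenshtein edit distance threshold, which is only one possible choice and is not stated in the text; the reference term “monitor”, the token list and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_matches(tokens: list[str], reference: str, max_distance: int) -> list[str]:
    """Return tokens within max_distance edits of the reference term."""
    return [t for t in tokens if edit_distance(t, reference) <= max_distance]


# Hypothetical tokens: an exact hit, a misspelling, an abbreviation,
# and two unrelated words.
tokens = ["monitor", "moniter", "mon", "motor", "month"]

print(fuzzy_matches(tokens, "monitor", 0))  # ['monitor']
print(fuzzy_matches(tokens, "monitor", 1))  # ['monitor', 'moniter']
print(fuzzy_matches(tokens, "monitor", 4))  # all five tokens
```

A threshold of 0 corresponds to exact matching and misses both the misspelling and the abbreviation. A threshold large enough to capture the abbreviation “mon” (distance 4) also admits the unrelated words “motor” and “month”, which is the growth in false matches described above.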