1. Technical Field
The present invention pertains to text analysis and processing systems. In particular, the present invention pertains to a system that identifies acronyms and extracts the appropriate acronym expansion from text.
2. Discussion of Related Art
An acronym is a word that is formed from the initial letter or letters of each component of a compound term (e.g., NATO, RADAR, SNAFU, etc.), while an abbreviation is a shortened form of a written word or phrase that is used or substituted for the whole word (e.g., “amt” is an abbreviation for amount). Acronyms and abbreviations tend to overlap and are frequently used in daily verbal discourse, in written documents and in electronic documents and web pages on the Internet. In certain communities (e.g., military, engineering, medicine, etc.), numerous acronyms are employed constantly. For example, a page of a military document commonly includes in excess of ten acronyms.
Acronyms may present challenges to readers in several ways. In particular, individuals unfamiliar with a certain acronym tend to have difficulty understanding the acronym and using it in their own vocabulary. For example, commonly known acronyms, such as “LASER” and “CDROM”, are widely understood, while infrequently used or subject-specific acronyms may be difficult for readers to understand (e.g., “AABFS” for Amphibious Assault Bulk Fuel System). Further, individuals preparing and/or compiling information for customers (e.g., librarians, technical writers, etc.) are aware of acronyms and typically provide convenient ways to search for and access an acronym expansion. Systems that provide these types of services in digital and electronic media are commonly referred to as “digital libraries” and “document databases”. In order to be effective, a digital library should recognize acronyms and the corresponding expansions during a search. Acronym lists with corresponding expansions may be prepared manually; however, doing so is prone to errors and becomes prohibitive due to the effort required.
The related art has attempted to overcome these problems by providing various systems for acronym expansion. For example, the AcronymFinder system enables access to a manually compiled list of acronyms on the Internet. This system receives manual submissions of acronyms and corresponding expansions to update the list. The compiled list (e.g., in excess of 150,000 acronyms) is available for embedding in applications.
The Acronym Finding Program (AFP) is an early acronym extraction system designed primarily for an optical character recognition (OCR) environment. This system utilizes a few simple heuristics for acronym identification and expansion. For an example of this type of system, reference is made to Taghva et al., “Recognizing Acronyms and Definitions”, Proceedings of the Fourth International Conference on Document Analysis and Recognition, pp. 191–198, 1999, Los Alamitos, Calif.: IEEE Computer Society, the disclosure of which is incorporated herein by reference in its entirety.
A further system, TLA, is derived from AFP and uses five heuristics. This system achieved 68% recall and 91% precision on a set of computer science technical reports. For an example of this type of system and performance, reference is made to Yeates, “Automatic Extraction of Acronyms from Text”, Proceedings of the Third New Zealand Computer Science Research Student's Conference, pp. 117–124, 1999, Hamilton, New Zealand, the disclosure of which is incorporated herein by reference in its entirety.
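By way of illustration only, recall and precision figures of the kind quoted above may be computed by comparing the acronym and expansion pairs produced by a system against a manually verified list. The following is a minimal sketch under that assumption; the function name and the sample pairs are hypothetical and are not taken from the cited reference.

```python
def precision_recall(extracted, verified):
    """Return (precision, recall) for two collections of (acronym, expansion) pairs."""
    extracted, verified = set(extracted), set(verified)
    correct = len(extracted & verified)  # pairs the system got exactly right
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(verified) if verified else 0.0
    return precision, recall


# Hypothetical manually verified pairs for a small document set.
verified = {
    ("NATO", "North Atlantic Treaty Organization"),
    ("OCR", "optical character recognition"),
    ("AABFS", "Amphibious Assault Bulk Fuel System"),
    ("AFP", "Acronym Finding Program"),
}

# Hypothetical system output: two correct pairs, one wrong expansion, two misses.
extracted = {
    ("NATO", "North Atlantic Treaty Organization"),
    ("OCR", "optical character recognition"),
    ("AFP", "Acronym Finder Program"),
}

p, r = precision_recall(extracted, verified)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```

In this sketch, an omitted acronym lowers recall, while an incorrect expansion lowers precision and, because the correct pair is then missing, recall as well.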
Another acronym extraction system employs text compression algorithms. This system uses zero-order compression models to extract acronym expansions, where the model parameter settings are learned using an encoded training set. For an example of this type of system, reference is made to Yeates et al., “Using Compression to Identify Acronyms in Text”, Proceedings of the IEEE Data Compression Conference, pp. 582–589, 2000, Los Alamitos, Calif.: IEEE Computer Society, the disclosure of which is incorporated herein by reference in its entirety.
Yet another acronym extraction system exploits the duality of patterns and relations. The system is seeded with extraction patterns for acronym expansion relations. After an initial set of extractions has been obtained, the extracted instances are utilized to learn new patterns, and the process repeats until convergence. For examples of this type of system, reference is made to U.S. Pat. No. 6,385,629 (Sundaresan et al.) and to Yi et al., “Mining the Web for Acronyms Using the Duality of Patterns and Relations”, Proceedings of the ACM CIKM '99 Second Workshop on Web Information and Data Management, pp. 48–52, Kansas City, Mo., the disclosures of which are incorporated herein by reference in their entireties.
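By way of illustration only, the seed, extract, learn and repeat cycle described above may be sketched as follows. This is a simplified assumption of how such a loop could be arranged and is not the method of the cited patent or paper. The pattern representation (an order flag plus the literal text between expansion and acronym), the regular expressions, the function names and the sample text are all hypothetical.

```python
import re

# Hypothetical slot definitions: an expansion is a run of capitalized words,
# an acronym is two or more capital letters.
EXPANSION = r"(?P<E>[A-Z][a-z]+(?: [A-Z][a-z]+)+)"
ACRONYM = r"(?P<A>[A-Z]{2,})"


def compile_pattern(order, middle):
    """Build a regex from a pattern: 'EA' = expansion first, 'AE' = acronym first."""
    left, right = (EXPANSION, ACRONYM) if order == "EA" else (ACRONYM, EXPANSION)
    return re.compile(left + re.escape(middle) + right)


def extract(text, patterns):
    """Apply every known pattern and collect (acronym, expansion) pairs."""
    pairs = set()
    for order, middle in patterns:
        for m in compile_pattern(order, middle).finditer(text):
            pairs.add((m.group("A"), m.group("E")))
    return pairs


def learn_patterns(text, pairs, max_gap=15):
    """Derive patterns from the short text found between known pair members."""
    learned = set()
    gap = r"(.{1,%d}?)" % max_gap
    for acronym, expansion in pairs:
        a, e = re.escape(acronym), re.escape(expansion)
        for m in re.finditer(e + gap + a, text):
            learned.add(("EA", m.group(1)))
        for m in re.finditer(a + gap + e, text):
            learned.add(("AE", m.group(1)))
    return learned


def bootstrap(text, seed_patterns, max_rounds=10):
    """Alternate extraction and pattern learning until nothing new is found."""
    patterns, pairs = set(seed_patterns), set()
    for _ in range(max_rounds):
        new_pairs = extract(text, patterns) - pairs
        pairs |= new_pairs
        new_patterns = learn_patterns(text, pairs) - patterns
        patterns |= new_patterns
        if not new_pairs and not new_patterns:
            break  # convergence: no new pairs and no new patterns
    return pairs, patterns


# Hypothetical sample text and a single seed pattern: "Expansion (ACRONYM".
TEXT = ("Members of the North Atlantic Treaty Organization (NATO) met today. "
        "NATO stands for North Atlantic Treaty Organization. "
        "OCR stands for Optical Character Recognition.")
SEED = {("EA", " (")}

pairs, patterns = bootstrap(TEXT, SEED)
```

In this sketch the seed pattern only finds the NATO pair. That pair then yields the learned pattern “ stands for ”, which in the next round finds the OCR pair that the seed alone would have missed.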
Still another system for acronym expansion is a heuristic extractor and server commonly referred to as “Acrophile”. This system includes three versions of varying capability that use acronym identification and expansion extraction rules. For an example of this type of system, reference is made to Larkey et al., “Acrophile: An Automated Acronym Extractor and Server”, Proceedings of the ACM Digital Libraries Conference, pp. 205–214, 2000, the disclosure of which is incorporated herein by reference in its entirety.
The related art systems described above suffer from several disadvantages. In particular, the AcronymFinder system is highly inefficient because the list is generated from manual submissions. Further, the list is typically generic and static and may not suit or be tailored to the various needs of particular organizations. Although the other above-described systems extract acronyms and corresponding expansions automatically, the results produced by these systems have limited accuracy. This tends to frustrate readers since the systems may omit acronyms within text or provide incorrect expansions for the acronyms, thereby requiring the reader to perform the additional task of ascertaining the correct expansion in another manner (e.g., manually). Thus, there exists a need in the art for a system that processes electronic text and documents and produces acronyms and corresponding expansions with a high degree of accuracy.