To extract valuable information automatically from electronic documents, systems must understand the information those documents contain. Electronic documents, and technical documents in particular, use a huge number of abbreviations, with or without their corresponding definitions. As the number of electronic documents grows, so does the use of abbreviations, along with the variety of ways in which abbreviations are formed. The ability to find abbreviations and their correct definitions is therefore very important for natural language understanding systems used in information retrieval, glossary extraction and speech recognition.
For example, if a person searches the Internet for documents about the “World Wide Web”, a search engine that matches only literal strings will retrieve only documents containing the exact string “World Wide Web”. If the system could recognize abbreviations and find their definitions, it could also return documents containing “WWW” and “W3”.
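The difference between the two search behaviors can be illustrated with a minimal sketch. The table `ABBREVIATIONS`, the functions `expand_query` and `search`, and the sample documents are all hypothetical and exist only for illustration; the naive substring matching stands in for whatever matching a real search engine would use.

```python
# Hypothetical abbreviation table; a real system would have to build this
# from documents rather than hard-code it.
ABBREVIATIONS = {"World Wide Web": ["WWW", "W3"]}


def expand_query(query):
    """Return the query followed by any known abbreviations of it."""
    return [query] + ABBREVIATIONS.get(query, [])


def search(query, documents):
    """Return documents containing the query or one of its abbreviations."""
    terms = expand_query(query)
    return [doc for doc in documents if any(term in doc for term in terms)]


docs = [
    "History of the World Wide Web",
    "The W3 Consortium publishes standards",
    "WWW usage statistics",
    "Gardening tips",
]

# Exact-string search finds only the first document.
exact = [doc for doc in docs if "World Wide Web" in doc]
# Abbreviation-aware search also finds the "W3" and "WWW" documents.
expanded = search("World Wide Web", docs)
print(len(exact), len(expanded))
```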
Many acronym dictionaries have been compiled by hand and published in books and on the web. Examples include AcronymFinder (http://www.acronymfinder.com), The World Wide Web Acronym and Abbreviation Server (http://www.ucc.ie/info/net/acronyms/acro.html), Hanford Acronym and Abbreviation Directory (http://www.hanford.gov/acronym) and Telecom Acronym Reference (http://www.tiaonline.org/resources/acronym.cfm).
Acronym and definition lists published in dictionaries appear to have high accuracy because they were deliberately collected and compiled by experts. However, keeping such dictionaries up to date is difficult and time-consuming work. The accuracy of web-based lists varies widely. The problem with many web-based acronym lists is that they allow people to submit new acronyms and their definitions freely. If the list administrators do not examine those candidates very carefully, or if they do not know the relevant domains well, incorrect pairs of acronyms and definitions could be added.
There has been some work on automatically recognizing acronyms and their definitions in text. AFP (Acronym Finding Program) and TLA (Three Letter Acronym) are early attempts to automatically find acronyms and their definitions in free-format text. See Taghva, Kazem and Jeff Gilbreth. Recognizing Acronyms and their Definitions, Technical Report 95-03, Information Science Research Institute, University of Nevada, Las Vegas, June 1995 and Yeates, Stuart. Automatic Extraction of Acronyms from text. In Proceedings of the Third New Zealand Computer Science Research Students' Conference, pp. 117-124, 1999.
AFP was developed as a post-processing system to improve the text output of optical character recognition devices. It treats an uppercase word of 3 to 10 characters as an acronym and tries to match the letters of the acronym against the first letters of a sequence of words to find a probable definition. TLA is very similar to AFP, but it considers the first three letters of each word in a possible definition.
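The first-letter matching described for AFP can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the published AFP algorithm: it assumes an "upper-cased word" is a token of 3 to 10 letters A-Z, it only tries runs of consecutive words with exactly one word per acronym letter, and the names `find_candidates` and `match_initials` are hypothetical.

```python
import re

# Assumption: a candidate acronym is a token of 3 to 10 uppercase ASCII letters.
ACRONYM_RE = re.compile(r"\b[A-Z]{3,10}\b")


def find_candidates(text):
    """Return uppercase tokens of 3 to 10 letters as acronym candidates."""
    return ACRONYM_RE.findall(text)


def match_initials(acronym, words):
    """Return the first run of consecutive words whose initials spell the acronym."""
    n = len(acronym)
    for start in range(len(words) - n + 1):
        window = words[start:start + n]
        # Compare the first letter of each word with the acronym's letters.
        initials = "".join(word[0].upper() for word in window)
        if initials == acronym:
            return " ".join(window)
    return None  # no probable definition found


text = "The Acronym Finding Program (AFP) was run on OCR output."
words = re.findall(r"[A-Za-z]+", text)
for acronym in find_candidates(text):
    print(acronym, "->", match_initials(acronym, words))
```

In this sample, "AFP" is matched to "Acronym Finding Program", while "OCR" has no definition in the text and yields no match. A TLA-style variant would compare against the first three letters of each word rather than only the first.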
More advanced methods are found in Sundaresan and Yi's work and the Acrophile system at UMass Amherst. See Larkey, Leah, Paul Ogilvie, Andrew Price and Brenden Tamilio. “Acrophile: An Automated Acronym Extractor and Server”, In Proceedings of the ACM Digital Libraries conference, pp. 205-214, 2000 and Sundaresan, Neel and Jeonghee Yi. Mining the Web for Relations, In The Ninth International World Wide Web Conference, 2000. http://www9.org/w9cdrom/363/363.html.
Sundaresan and Yi identify acronyms and their definitions in documents on the web. They start with a base set of acronym-definition pairs, in which each definition occurs in a structural relationship to its acronym in a web page, and then crawl the web to look for new acronym-definition pairs. Their system treats uppercase words as acronyms and matches the letters of an acronym with the first characters of words. The Acrophile system handles a wider range of acronym patterns than other work. It allows a single interior “/” or “-”, interior lowercase letters and one non-final digit in an acronym.
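The pattern features listed for Acrophile can be expressed as a small token check. The sketch below is one possible reading of that sentence, not the Acrophile implementation: the function name `looks_like_acronym`, the minimum of two uppercase letters, and the restriction to ASCII characters are all assumptions made for illustration.

```python
import string

UPPER = set(string.ascii_uppercase)
LOWER = set(string.ascii_lowercase)
DIGITS = set(string.digits)
SEPARATORS = {"/", "-"}
ALLOWED = UPPER | LOWER | DIGITS | SEPARATORS


def looks_like_acronym(token):
    """Check a token against the pattern features described above (assumed reading)."""
    last = len(token) - 1
    if last < 1 or any(ch not in ALLOWED for ch in token):
        return False
    sep_pos = [i for i, ch in enumerate(token) if ch in SEPARATORS]
    digit_pos = [i for i, ch in enumerate(token) if ch in DIGITS]
    lower_pos = [i for i, ch in enumerate(token) if ch in LOWER]
    upper_count = sum(ch in UPPER for ch in token)
    return (
        # At most one "/" or "-", and only in an interior position.
        len(sep_pos) <= 1 and all(0 < i < last for i in sep_pos)
        # At most one digit, and not as the final character.
        and len(digit_pos) <= 1 and all(i != last for i in digit_pos)
        # Lowercase letters only in interior positions.
        and all(0 < i < last for i in lower_pos)
        # Assumption: require at least two uppercase letters.
        and upper_count >= 2
    )


for token in ["DoD", "AFL-CIO", "I/O", "W3C", "W3", "A-B-C", "Web"]:
    print(token, looks_like_acronym(token))
```

Under this reading, "DoD", "AFL-CIO", "I/O" and "W3C" are accepted, while "W3" (final digit), "A-B-C" (two separators) and "Web" (final lowercase letter) are rejected.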
Acronym processors have been embedded in a couple of different applications, as described in U.S. Pat. Nos. 5,161,105 and 5,634,084. U.S. Pat. No. 5,161,105, inventors Kugimiya, et al., entitled “Machine translation apparatus having a process function for proper nouns with acronyms”, has a device for determining whether a word string in a sentence is a proper noun with an acronym and a device for examining whether the number of first letters of each of a certain number of words corresponds to the number of letters of the acronym. If these words are registered in a dictionary, it outputs the corresponding terms after the words are translated. When the words are not registered in the dictionary, it outputs the words without translating them. U.S. Pat. No. 5,634,084, inventors Malsheen, et al., entitled “Abbreviation and acronym/initialism expansion procedures for a text to speech reader”, has an acronym/initialism expanding procedure that identifies acronyms and initialisms in a text message, parses pronounceable syllables within the identified words, and generates a substitute string comprising a sequence of units, each unit selected from the set consisting of a letter, number, pronounceable syllable and multiple letter identifier.