Automatic extraction of terminology from text is important for a variety of activities that involve the processing of natural language. It is an especially pressing need for the writers and translators of technical manuals, for whom it can help maintain the consistency and correctness of translation and usage and decrease the cost of their activities.
Among the uses to which the identification of technical terminology can be put are the following:
The identification of terms requiring translations in a bilingual terminology dictionary for humans or in the automated dictionaries of a computerized natural language translation system.
The identification of new terms requiring definition in a glossary of a technical document or in a dictionary.
The identification of the terms in a text document which should be used for indexing that document in a computerized information retrieval system.
The identification of domain-specific concepts in a domain for use in a knowledge-representation system that models that domain.
The identification of additional entries for lexicons for natural language parsers in order to improve the performance of those parsers for a variety of applications.
The identification of terms to be used in algorithms for determining the topic of a text document.
A technical term is a word string that has a particular meaning in a domain. A multi-word technical term is a term that consists of more than one word. A technical term can be a common noun phrase such as "central processing unit" or "market share". It may also be a proper noun phrase such as "United States Patent Office" or "New York Stock Exchange".
U.S. Pat. No. 4,566,295 to K. Toth describes an improvement in stenographic systems using word frequencies.
U.S. Pat. No. 4,625,295 to J. T. Skinner describes a hardware means for locating predefined characters, words, or combinations of words.
U.S. Pat. No. 4,744,050 to Hirosawa et al. describes a method of determining the most frequently used phrases in a text, whereas this invention is concerned with noun phrases that occur more than once in a text.
U.S. Pat. No. 4,813,010 to T. Okamoto et al. deals with the extraction of hierarchical structure in a document as indicated by section headings. Word, phrase, and symbol frequencies are used, with higher frequency forms being preferred candidates for inclusion in headings. Okamoto uses actual frequency information rather than the simple repetition information used by the invention of this application. Simple repetition information indicates only whether a word string appears at least a minimum number of times in a text file.
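By way of illustration only, the distinction between actual frequency information and simple repetition information may be sketched as follows in Python. The function names, the tokenization rule, and the default minimum of two occurrences are assumptions made for this sketch; they are not taken from the application or from Okamoto.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())


def word_string_counts(tokens, n):
    """Actual frequency information: a count for every n-word string."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def is_repeated(text, word_string, min_count=2):
    """Simple repetition information: a yes/no answer only.

    Returns True if word_string appears at least min_count times in text.
    The threshold of 2 is a hypothetical default chosen for illustration.
    """
    target = tuple(tokenize(word_string))
    if not target:
        return False
    counts = word_string_counts(tokenize(text), len(target))
    return counts[target] >= min_count


sample = (
    "The central processing unit fetches instructions. "
    "A central processing unit may have several cores. "
    "Market share grew."
)
repeated_cpu = is_repeated(sample, "central processing unit")
repeated_share = is_repeated(sample, "market share")
```

In this sketch, word_string_counts retains the full count for each word string, whereas is_repeated discards the count and reports only whether the minimum was reached.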
U.S. Pat. No. 4,868,750 to Kucera et al. describes a means for determining grammatical tags for sequences of words in text. Kucera is concerned with a method of grammatical tagging, while the applicants' invention merely uses grammatical tagging. Applicants' invention could use the grammatical tagging of Kucera; however, his tagging method is not preferred.
U.S. Pat. No. 4,888,730 to McRae et al. teaches replacing frequently used words with their synonyms.