The present invention relates generally to a terminology extraction method applicable to a cloud computing environment, and more particularly, but not by way of limitation, to a system, method, and computer program product for automatically extracting domain-specific terminology with an unsupervised approach.
Conventionally, automatic terminology recognition/extraction (ATR) approaches extract candidate terms from a given corpus, rank them using an ATR ranking technique, sort the candidate terms according to their ranking scores, and finally select either the top N terms or the terms with ranking scores above a certain threshold as the terminology.
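The conventional flow described above can be sketched as follows. This is a minimal illustration rather than any particular ATR implementation: the function name `select_terminology`, its parameters, and the toy score table are hypothetical, and candidate extraction is assumed to have already produced a list of candidate terms.

```python
from typing import Callable, List, Optional


def select_terminology(
    candidates: List[str],
    score_fn: Callable[[str], float],
    top_n: Optional[int] = None,
    threshold: Optional[float] = None,
) -> List[str]:
    """Rank candidate terms, sort by score, and keep the best ones.

    `score_fn` stands in for any ATR ranking technique (hypothetical hook).
    """
    # Rank: attach a score to every candidate term.
    scored = [(term, score_fn(term)) for term in candidates]
    # Sort: highest ranking score first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Select: keep terms above the threshold and/or the top N terms.
    if threshold is not None:
        scored = [(t, s) for t, s in scored if s > threshold]
    if top_n is not None:
        scored = scored[:top_n]
    return [term for term, _ in scored]


# Toy scores, invented purely for illustration.
toy_scores = {"battery": 3.0, "lead-acid battery": 2.0, "the": 0.1}
top_terms = select_terminology(list(toy_scores), toy_scores.get, top_n=2)
print(top_terms)  # ['battery', 'lead-acid battery']
```

Keeping the ranking technique behind a single scoring callable reflects the point that the extract, rank, sort, and select steps stay the same regardless of which ranking technique is plugged in.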
The conventional c-value technique has limitations. It is biased towards terms with more tokens (i.e., a term with more words/tokens has a higher probability of being ranked above a term with fewer words/tokens). It is also designed for the recognition of multi-word terms and hence fails to extract domain-specific single-word terms. The Term Frequency-Inverse Document Frequency (TF-IDF) technique has limitations as well. If a term occurs in almost all of the documents of the corpus, the IDF score for that term is zero or close to zero. This is problematic because domain-specific terms are sometimes also common across documents. Further, assume that a term A is related to another term B and shares common tokens with it (e.g., “battery”, “lead-acid battery”, and “battery recycling process” share the token “battery”). If the Term Frequency (TF) of A is high, the TF of B is low, and both terms have almost the same IDF, A will be ranked higher while B will be ranked at the bottom. This is not desirable. Intuitively, the high ranking of one term implies that any strongly related term is likely to be of similar importance in the domain even if it occurs infrequently. TF-IDF does not take this into consideration.
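The limitations above can be illustrated with a small sketch. The following assumptions apply: the corpus is a hypothetical toy corpus in which each document is already a list of candidate terms, IDF is taken as the natural logarithm of (number of documents / document frequency) with no smoothing, and the c-value is reduced to its commonly cited form for non-nested candidates, log2(number of tokens) multiplied by frequency. All function names are illustrative.

```python
import math
from typing import List

# Hypothetical toy corpus: each document is a list of candidate terms.
docs: List[List[str]] = [
    ["battery", "lead-acid battery", "lead-acid battery", "lead-acid battery"],
    ["battery", "charger"],
    ["battery", "battery recycling process"],
]


def tf(term: str, corpus: List[List[str]]) -> int:
    """Total number of occurrences of the term across the corpus."""
    return sum(doc.count(term) for doc in corpus)


def idf(term: str, corpus: List[List[str]]) -> float:
    """Unsmoothed IDF: log(N / df). Zero when the term is in every document."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0


def tf_idf(term: str, corpus: List[List[str]]) -> float:
    return tf(term, corpus) * idf(term, corpus)


def c_value_non_nested(term: str, freq: int) -> float:
    """Simplified c-value for a term not nested in longer candidates."""
    return math.log2(len(term.split())) * freq


# "battery" occurs in every document, so its IDF (and TF-IDF) is zero.
print(tf_idf("battery", docs))  # 0.0

# Same IDF, different TF: the related but infrequent term ranks far lower.
print(tf_idf("lead-acid battery", docs) > tf_idf("battery recycling process", docs))  # True

# c-value: a single-word term scores zero, and longer terms score higher
# than shorter ones at equal frequency.
print(c_value_non_nested("battery", 5))  # 0.0
print(c_value_non_nested("battery recycling process", 5) > c_value_non_nested("lead-acid battery", 5))  # True
```

In this toy corpus the domain-specific term “battery” receives a TF-IDF score of zero, and “battery recycling process” is ranked well below “lead-acid battery” even though the two share a token and have identical IDF values.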