Enterprises and/or other types of entities often create, collect and/or otherwise use natural language documents in the course of their operations.
It is often desired to generate keywords for some or all of the natural language documents created, collected and/or otherwise used by an enterprise.
One technique for generating keywords for natural language documents is referred to as keyword extraction. Keyword extraction is widely used in information retrieval, topic detection, automatic tagging of documents and many other tools and solutions.
One of the most popular approaches to extracting keywords from natural language text is the tf-idf approach. A paper that discusses the tf-idf approach is Lott, Brian, “Survey of Keyword Extraction Techniques”, which can be found at http://www.cs.unm.edu/~pdevineni/papers/Lott.pdf.
The tf-idf approach is based on word statistics, which are computed both at the document level and at the corpus (a set of documents) level. The tf-idf approach essentially makes two assumptions. The first assumption is that terms that appear more frequently in a document are more important in the document than terms that appear less frequently in the document. The number of occurrences of a term t in a document d is referred to as the term frequency and is denoted as:

$$\mathrm{tf}_d(t) \qquad (1)$$
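By way of illustration only, the term frequency of definition (1) might be computed as in the following minimal Python sketch. The function name `term_frequency` and the tokenization rule (lowercasing and splitting on runs of the letters a to z) are assumptions made for this example; the tf-idf approach itself does not prescribe how a document is split into terms.

```python
import re
from collections import Counter


def term_frequency(document: str) -> Counter:
    """Return tf_d(t) for every term t in one document d."""
    # Assumption: a "term" is a lowercase run of the letters a-z.
    tokens = re.findall(r"[a-z]+", document.lower())
    # Counter maps each term to its number of occurrences in the document.
    return Counter(tokens)


tf = term_frequency("The software update fixed the software bug.")
print(tf["software"])  # 2
print(tf["update"])    # 1
```

A term that does not occur in the document has a term frequency of zero, which `Counter` returns by default.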
The second assumption is on the corpus level. This assumption states that terms that occur in fewer documents in the corpus are more important than terms that occur in more documents in the corpus. For example, the word “the” occurs in almost all online CNN newspaper articles for the year 2013, and as might be predicted based on the second assumption, the word “the” is less important than terms that occur in fewer of the online CNN newspaper articles for the year 2013.
The importance of a word in a corpus is sometimes referred to herein as its semantic load. In view of the above, it may be said that the term “the” does not carry a significant semantic load in the online CNN newspaper articles for the year 2013.
On the other hand, the word “software” occurs in fewer of the online CNN articles for the year 2013. Thus, it may be said that the word “software” carries more semantic load (than the word “the” carries) in the online CNN articles for the year 2013.
The second assumption in the tf-idf approach is formalized by determining an inverse document frequency (idf) of a term in a corpus, based on the following definition:
$$\mathrm{idf}_C(t) = \log \frac{|C|}{1 + |\{d \in C : t \text{ is in } d\}|} \qquad (2)$$

where C refers to the corpus,
t refers to the term, and
d refers to a document in the corpus.
In the definition set forth above, it can be seen that the numerator will have a value equal to the number of documents in the corpus. The denominator will have a value equal to 1 + the number of corpus documents that include the term. Thus, for a corpus of a given size, the inverse document frequency, idf, of a term in the corpus will decrease as the number of documents that are in the corpus and include that term increases.
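The inverse document frequency described above might be sketched in Python as follows. This is an illustrative sketch only: the function name `inverse_document_frequency`, the representation of each document as a set of terms, the four-document toy corpus, and the use of the natural logarithm are all assumptions made for the example (the definition above does not fix the base of the logarithm). A non-empty corpus is also assumed.

```python
import math


def inverse_document_frequency(term, corpus):
    """Return log(|C| / (1 + number of documents in C that contain term))."""
    # Count the documents in the corpus that include the term.
    containing = sum(1 for doc_terms in corpus if term in doc_terms)
    # Assumption: natural logarithm; corpus must contain at least one document.
    return math.log(len(corpus) / (1 + containing))


# Hypothetical toy corpus: each document is represented by its set of terms.
corpus = [
    {"the", "software", "release"},
    {"the", "election", "results"},
    {"the", "weather", "forecast"},
    {"the", "match", "score"},
]

idf_the = inverse_document_frequency("the", corpus)            # log(4 / 5)
idf_software = inverse_document_frequency("software", corpus)  # log(4 / 2)
print(idf_software > idf_the)  # True
```

Because of the 1 added in the denominator, a term that occurs in every document of the corpus, such as “the” in this toy corpus, receives a slightly negative value, log(4/5), rather than zero.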
After determining the term frequency, tf, of a term t in a document d, as well as the inverse document frequency, idf, of the term t in the corpus C, a tf-idf value of the term t in the document d in the corpus C may be determined based on the following definition:

$$\text{tf-idf}_{d,C}(t) = \mathrm{tf}_d(t) \cdot \mathrm{idf}_C(t) \qquad (3)$$

where C refers to the corpus,
t refers to the term, and
d refers to the document.
In the definition set forth above, the tf-idf value of a term t in a document d and a corpus C is equal to the product of the term frequency, tf, of the term in the document d and the inverse document frequency, idf, of the term t in the corpus C.
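As a worked illustration with hypothetical numbers, assume a corpus of 1000 documents, a natural logarithm, and a document d in which a first term occurs 5 times and a second term occurs 50 times. Assume further that the first term is included in 9 documents of the corpus and the second term is included in 999 documents of the corpus. The definitions above then give:

$$
\begin{aligned}
\text{first term:}\quad & \mathrm{idf}_C(t) = \log\frac{1000}{1 + 9} = \log 100 \approx 4.605, & \text{tf-idf}_{d,C}(t) &\approx 5 \cdot 4.605 \approx 23.03 \\
\text{second term:}\quad & \mathrm{idf}_C(t) = \log\frac{1000}{1 + 999} = \log 1 = 0, & \text{tf-idf}_{d,C}(t) &= 50 \cdot 0 = 0
\end{aligned}
$$

Although the second term occurs ten times as often in the document, its presence in nearly every document of the corpus reduces its tf-idf value to zero in this example.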
The keywords chosen for a document d will typically be the terms that have a high tf-idf value for the document d and corpus C.
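The selection of keywords described above might be sketched as follows. The sketch is illustrative only and rests on several assumptions that are not part of the tf-idf approach itself: the helper names `tokenize` and `top_keywords`, the tokenization rule, the natural logarithm, the parameter k (the number of keywords to return), the alphabetical tie-breaking, and the four-sentence toy corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: a "term" is a lowercase run of the letters a-z.
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(doc_index, documents, k=3):
    """Return the k terms of documents[doc_index] with the highest tf-idf."""
    token_lists = [tokenize(text) for text in documents]
    term_sets = [set(tokens) for tokens in token_lists]
    tf = Counter(token_lists[doc_index])  # term frequency in the chosen document
    n_docs = len(documents)
    scores = {}
    for term, count in tf.items():
        # Number of corpus documents that include the term.
        df = sum(1 for terms in term_sets if term in terms)
        # tf-idf = tf * log(|C| / (1 + df)), natural logarithm assumed.
        scores[term] = count * math.log(n_docs / (1 + df))
    # Highest score first; ties are broken alphabetically (an arbitrary choice).
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Hypothetical toy corpus.
documents = [
    "The software vendor shipped the software patch.",
    "The senate passed the budget.",
    "The storm closed the airport.",
    "The team won the final.",
]
print(top_keywords(0, documents, k=1))  # ['software']
```

In this toy corpus the word “the” occurs twice in the first document, exactly as often as “software”, but because it is included in every document it receives the lowest tf-idf value and is not selected as a keyword.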