1. Field of the Invention
The present invention generally relates to a method of automated labeling of unlabeled text data and, more particularly, to a method that assigns labels without manual intervention and can also be used to extract relevant features for a keyword search of the data.
2. Background Description
Very often, organizations have large quantities of machine-readable text documents to which they would like to assign labels for such purposes as developing a categorizer for new texts, enabling the retrieval of old texts, and the like. These text documents could be various electronic documents, including, among other things, Web pages (from the World Wide Web (WWW) portion of the Internet, or simply "the Web"), electronic mail (i.e., e-mail), and collections of Frequently Asked Questions (FAQs). Current solutions to labeling such text documents usually involve a large amount of costly manual labor and cannot be completely automated (i.e., they require manual intervention).
It is therefore an object of the invention to provide a method of automatically labeling unlabeled text data that is independent of human intervention but does not preclude manual intervention.
It is another object of the present invention to provide a method to extract relevant features of unlabeled text data for a keyword search; that is, an automatic method of adding appropriate linguistic variants as part of an indexing mechanism.
According to the invention, there is provided a method of automated labeling of unlabeled text data. A document collection is established as a reference answer set. A label, e.g., the URL of a Web page, is attached to each document. Members of the answer set are converted to vectors representing centroids of clusters of documents. Unlabeled text data are categorized relative to the centroids by a nearest neighbor algorithm. Then, a supervised machine learning algorithm is trained on the newly labeled data, and a categorization classifier (e.g., a rule based classifier) classifies the data for each cluster. Alternatively, a feature extraction algorithm may be run on the classes generated by the step of categorizing, and search features that index the unlabeled text data may be output.
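The categorizing step described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the term-frequency vectors, the cosine similarity measure, the function names (`vectorize`, `cosine`, `label_documents`), and the example.com URLs and texts are all assumptions chosen for illustration. Each reference document is treated as the centroid of its own cluster, and each unlabeled text receives the label of the nearest centroid.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a sparse term-frequency vector (assumed representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse vectors (assumed similarity measure)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def label_documents(answer_set, unlabeled):
    """Assign each unlabeled text the label of its nearest centroid."""
    # Each member of the reference answer set becomes one centroid.
    centroids = {label: vectorize(text) for label, text in answer_set.items()}
    labeled = []
    for text in unlabeled:
        vec = vectorize(text)
        best = max(centroids, key=lambda label: cosine(vec, centroids[label]))
        labeled.append((text, best))
    return labeled


# Hypothetical reference answer set: label (here a URL) -> document text.
answer_set = {
    "http://example.com/printers": "printer driver install paper jam toner",
    "http://example.com/network": "network router wifi connection password",
}
unlabeled = [
    "my printer has a paper jam",
    "cannot join the wifi network",
]
result = label_documents(answer_set, unlabeled)
for text, label in result:
    print(label, "<-", text)
```

The resulting (text, label) pairs could then serve as training data for a supervised learner, or as the classes on which a feature extraction algorithm is run.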
Although the invention contemplates a fully automated process of categorizing unlabeled text data or extracting relevant features from the unlabeled text data for keyword search, human intervention may optionally be used to further refine the process. For example, the automated categorizations might be manually checked and updated by shifting documents from one cluster to another, and the data thereafter re-categorized using a nearest neighbor algorithm. These steps would then be iterated until the process stabilizes or some iteration parameter is reached. Also, the document collection established as the reference answer set might be manually augmented and/or edited with additional information useful to the categorization process, e.g., synonyms of words occurring in the documents.
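The optional refinement loop can be sketched as follows, under stated assumptions. The sketch models only the automated half of each round: centroids are rebuilt from the current cluster memberships and every document is re-categorized by nearest centroid, repeating until the assignments stop changing or a hypothetical `max_iters` limit is reached. Any manual shifting of documents is assumed to have happened before `refine` is called. The vector representation, the cosine measure, and all names and sample texts are illustrative, not taken from the invention.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Sparse term-frequency vector (assumed representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Mean of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: weight / len(vectors) for term, weight in total.items()}


def refine(assignments, max_iters=10):
    """Re-categorize until assignments stabilize or max_iters is reached.

    assignments maps each document text to its current label.
    """
    vecs = {text: vectorize(text) for text in assignments}
    for _ in range(max_iters):
        # Rebuild one centroid per label from its current members.
        members = {}
        for text, label in assignments.items():
            members.setdefault(label, []).append(vecs[text])
        centroids = {label: centroid(vs) for label, vs in members.items()}
        # Nearest-centroid re-categorization of every document.
        updated = {
            text: max(centroids, key=lambda lab: cosine(vecs[text], centroids[lab]))
            for text in assignments
        }
        if updated == assignments:
            break  # the process has stabilized
        assignments = updated
    return assignments


# Hypothetical starting point: one document sits in the wrong cluster.
start = {
    "printer paper jam": "printers",
    "toner for printer": "printers",
    "wifi router password": "network",
    "router wifi connection": "network",
    "printer toner low": "network",
}
refined = refine(start)
print(refined["printer toner low"])
```

In this small example the misplaced document moves to the "printers" cluster on the first pass and the loop stops on the second, since no assignment changes.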
The method of this invention may use information from several disparate and separate sources, such as a Web site, a database of Frequently Asked Questions (FAQs), and/or databases of other document collections, as the reference answer set. Sets of related Universal Resource Locators (URLs) could also be used in the categorization process.