Typically, corpora collected manually or automatically are analyzed with machine learning methods to generate classifier models for a specific class, which are then used in information extraction, knowledge mining, and other natural language processing applications.
In task-oriented or domain-oriented natural language processing applications, such as domain-specific information extraction and named entity recognition, collecting corpora with extensive coverage and tagging the collected corpora are important factors in improving recognition accuracy.
Some methods exist for automatically collecting and tagging corpora. In these methods, corpora are collected from the web or other resources by a search engine based on a set of sample seeds. However, the coverage of the resulting corpus depends entirely on the limited set of initial sample seeds. Therefore, a richer corpus needs to be collected based on more sample seeds.
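
The dependence on the initial seeds can be illustrated with a minimal sketch. It is not taken from any particular existing method: the in-memory `DOCUMENTS` list stands in for search engine results, and the function name `collect_and_tag`, the `LOC` label, and the inline tag format are all assumptions made for illustration.

```python
import re

# Hypothetical stand-in for pages returned by a search engine.
DOCUMENTS = [
    "Tokyo is the capital of Japan.",
    "She flew to Paris last week.",
    "Osaka is famous for its food.",
    "The meeting was moved to Friday.",
]

def collect_and_tag(seeds, documents, label="LOC"):
    """Return tagged copies of the documents that mention at least one seed."""
    corpus = []
    for doc in documents:
        tagged = doc
        for seed in seeds:
            # Whole-word match so a seed does not fire inside a longer word.
            pattern = r"\b" + re.escape(seed) + r"\b"
            tagged = re.sub(pattern, f"<{label}>{seed}</{label}>", tagged)
        # Keep the document only if some seed was found and tagged.
        if tagged != doc:
            corpus.append(tagged)
    return corpus

corpus = collect_and_tag(["Tokyo", "Paris"], DOCUMENTS)
for sentence in corpus:
    print(sentence)
```

With the seeds "Tokyo" and "Paris", the sentence about Osaka is never collected, although it belongs to the same class. In this sketch, coverage grows only when the seed list grows.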