Web search engines provide search tools for entering text strings to search for documents on the Internet. Such text-based search tools are not well suited for finding forms. The difficulty is partly due to the fact that forms related to many different topics share similar text, so a user searching for a particular form must review many potential results from these different topics. For example, forms related to employment, medicine, and athletic activities can all include text such as “name,” “address,” “phone,” “registration,” “medicine,” “physician,” etc. Searching for a particular form can thus be time consuming and burdensome for a user. The user may be required to try multiple search text strings and/or search through many results to find the particular form of interest.
Existing document classification techniques do not accurately classify documents to facilitate searching for forms by topic. This is because such techniques typically rely on only the words in the document. One such technique is automatic text classification using supervised learning, in which pre-defined category labels are assigned to documents based on the likelihood suggested by a training set of labeled documents. Words in the documents are treated as features of the documents, and these features are used to categorize the documents. However, this technique does not adequately categorize forms because forms in multiple categories usually share a large number of common words and thus common features. Words alone are a poor criterion for categorizing forms.
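By way of illustration only, the word-based supervised classification described above can be sketched as a small multinomial naive Bayes classifier. The sketch is not taken from any particular existing system. The category names, the training texts, and the function names (`train`, `log_scores`, `classify`) are hypothetical and chosen solely for this example, and naive Bayes is assumed here as one common way of assigning labels based on likelihood.

```python
import math
from collections import Counter


def tokenize(text):
    """Split text into lowercase words; each word is treated as a feature."""
    return text.lower().split()


def train(labeled_docs):
    """Count words per category from (label, text) training pairs."""
    word_counts = {}
    doc_counts = Counter()
    vocab = set()
    for label, text in labeled_docs:
        doc_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            vocab.add(word)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}


def log_scores(model, text):
    """Return the log-likelihood score of each category for the given text."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    scores = {}
    for label, counts in model["word_counts"].items():
        score = math.log(model["doc_counts"][label] / total_docs)
        total_words = sum(counts.values())
        for word in tokenize(text):
            if word in model["vocab"]:
                # Add-one smoothing so unseen words do not zero out a category.
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
        scores[label] = score
    return scores


def classify(model, text):
    """Assign the category with the highest score."""
    scores = log_scores(model, text)
    return max(scores, key=scores.get)


# Hypothetical training set: one tiny "form" per category.
TRAINING = [
    ("employment", "name address phone employer position salary"),
    ("medical", "name address phone physician medicine allergies"),
    ("athletics", "name address phone team coach registration"),
]
MODEL = train(TRAINING)

if __name__ == "__main__":
    # Distinctive words separate the categories.
    print(classify(MODEL, "physician medicine"))
    # Words shared by every form give each category the same score.
    print(log_scores(MODEL, "name address phone"))
```

In this sketch, a query made up only of words common to every form, such as “name,” “address,” and “phone,” receives an identical score in all three categories, so the classifier has no basis for choosing among them. This reflects the shortcoming noted above for features based on words alone.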