The present application relates to information management, and more particularly, to technologies for topic identification in natural language contents, and for searching, ranking, and classification of such contents.
In the information age, more and more individuals and organizations are faced with the problem of information overload. Accurate and efficient methods for information access, including collection, storage, organization, search, and retrieval, are the key to successful information management.
Much of this information is contained in natural language contents in the form of documents. Various theoretical and practical attempts have been made to organize natural language contents, and to determine the amount and relevancy of the information they contain, for efficient access to such information. The existing techniques, including various search engines and document classification systems, however, are often not sufficiently accurate in identifying the quantity and focus of the information in the content, and thus often cannot effectively serve the information needs of their users. There is still a need for accurate, efficient, and automated technologies to search, rank, and organize large amounts of natural language contents based on the focus and quantity of information they contain.
One particular challenge in information management is to efficiently organize the so-called “unstructured data”. Usually, a document collection in its natural state is unorganized, or in a so-called unstructured state. Such document collections, in the general sense, include Web pages scattered over the Internet, various documents in a company or other organization, and even documents on many personal computers, as well as emails. Information in the unstructured document data is usually accessed by sending queries to an information retrieval system, such as a search engine or index server, that returns the documents believed to be relevant to the query.
The problem with using queries to access unknown data is that the user may not always know what kind of information actually exists in the document collection, or which keywords are the most effective for retrieving the most relevant information. Often, time is wasted before the needed information is found.
One approach to providing easy access to unstructured document data is to manually organize a document collection into a directory or category structure, from which users can browse the collection according to the topics or areas of interest for their information needs. This approach, however, usually carries a huge labor cost associated with manually building and maintaining such directories or category systems.