The advent of the World Wide Web has created new challenges for information processing. The amount of information is growing exponentially, as is the number of internet users. There are already billions of web pages and hundreds of millions of users. The internet now provides far more information than a human being can handle. In addition, more and more low-quality and redundant information is posted on the internet, which makes useful information even harder to find. Without an efficient way to help people handle internet/intranet information, more and more money and time will be wasted on the information highway.
Also, with advances in computer, network, storage and internet/intranet technologies, vast amounts of information have become readily available throughout the world. Indeed, more and more businesses, individuals and institutions rely on computer-accessible information on a daily basis. However, as the total amount of accessible information increases, finding useful information becomes increasingly difficult.
Currently, there are three major ways to find internet/intranet information: high-quality human-maintained directories such as Yahoo!, search engines, and knowledge-based search. Human-maintained directories cover popular topics effectively but are subjective, expensive to build and maintain, and slow to improve. Also, a directory can cover only a limited number of topics.
Search engines first use special software, referred to as ‘robots’, ‘spiders’ or ‘crawlers’, to go out and retrieve web pages. The web pages are parsed to generate keywords that index the pages. The indexes are then stored in a database with a rank for each web page. The rank reflects the relevance of the web page to certain keywords. When an internet user enters a query with one or more keywords, the search engine retrieves the web pages in the database that match the keywords.
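The parse, index, rank and retrieve steps described above can be sketched in a few lines of Python. This is a minimal illustration only, not the method of any particular search engine: the names `tokenize`, `build_index` and `search` are hypothetical, crawling is replaced by an in-memory dictionary of page text, and the rank is assumed to be a simple keyword count.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Parse text into lowercase keywords (assumed rule: runs of letters and digits)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build an inverted index: keyword -> {page URL: occurrences of the keyword}.

    `pages` maps a URL to the page text, standing in for what a crawler retrieved.
    The occurrence count serves as the (assumed) rank of the page for that keyword.
    """
    index = defaultdict(dict)
    for url, text in pages.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][url] = count
    return index


def search(index, query):
    """Return the URLs that contain every query keyword, highest rank first."""
    words = tokenize(query)
    if not words:
        return []
    # Keep only pages that match all of the query keywords.
    candidates = set(index.get(words[0], {}))
    for word in words[1:]:
        candidates &= set(index.get(word, {}))
    # Score each page by summing its per-keyword ranks; break ties by URL.
    scored = [(sum(index[w][url] for w in words), url) for url in candidates]
    return [url for _, url in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]
```

A query is answered purely by keyword lookup in the index, so a page that does not contain the query words is never returned, however relevant its content may be.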
Automated search engines usually return too many low-quality matches. Most internet portals provide both directory and search engine services for user queries. Although search engine technologies have improved over the past several years, people still find internet information search frustrating. Often, the wanted information cannot be found, or finding it takes too much time. The search results are a list of web pages that have to be scanned to find useful information. There may be millions of web pages for some commonly used words. Also, the search results are the same for different people, and for the same person at different times. Lastly, a search can find only the web pages that contain the query words; new knowledge that is not explicitly contained in web pages cannot be found.
Knowledge-based search, such as Ask Jeeves, is another way to search internet information. However, like the directories, such a knowledge base is built by people, so it is also very costly and difficult to update and maintain, and it cannot be very big. Also, the knowledge covered by such a system is quite limited.
For structured data such as a relational database, a knowledge base can be built using data mining methods. Such data mining methods can be implemented with standard classification, clustering or machine learning algorithms. However, internet/intranet web pages and about 80% of corporate information are stored as unstructured text documents such as e-mail, news articles, and technical and patent portfolios. To extract knowledge from text data, complicated text mining and learning algorithms are required.
Currently, most text mining research and development still resembles data mining, relying on algorithms and approaches such as standard clustering, classification, prediction and decision tree algorithms. However, the problems that text mining has to solve are quite different from those of data mining. Firstly, in data mining, the samples usually have a fixed feature set. In most cases, all samples have the same number of features. In text mining, it is hard to define features for text, or the feature set is huge if each word is considered a feature. Secondly, it is hard to define what knowledge is for text. In data mining, knowledge is considered to be the training results of classification, prediction, regression or other functions. In text mining, however, these methods cannot provide enough information for a user's query, or the retrieved information may not be what the user wants. Also, text mining needs a very large amount of text to extract reasonable knowledge. In addition, accuracy and speed are very important for real applications of text mining.
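The feature-set problem noted above can be made concrete with a small Python sketch. It assumes the simplest possible representation, in which each distinct word is one feature (often called a bag of words); the function name `bag_of_words` and the tokenization rule are illustrative assumptions rather than part of any method described here.

```python
import re


def bag_of_words(documents):
    """Turn each document into a word-count vector with one feature per distinct word.

    Returns the sorted vocabulary and one count vector per document.
    Words are assumed to be runs of ASCII letters, compared case-insensitively.
    """
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    vocabulary = sorted({word for words in tokenized for word in words})
    position = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for words in tokenized:
        vector = [0] * len(vocabulary)
        for word in words:
            vector[position[word]] += 1
        vectors.append(vector)
    return vocabulary, vectors
```

Unlike a relational table with a fixed set of columns, the number of features here equals the vocabulary size, so it grows as documents are added and most entries of each vector are zero.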
Text mining is a relatively new research area with many challenging problems. However, conditions also favor the development of text mining technologies and methods. Firstly, it is easy to collect a huge amount of text from the World Wide Web for analysis. Secondly, many terms have already been manually classified on the web and can be used directly. Also, internet/intranet users can benefit directly from new text mining methods.