The present invention relates generally to information processing for searching and more particularly to processing documents to be integrated into a database for a search engine.
Information on the Web is growing at an astronomical rate. The publicly indexable Web alone contains more than 800 million pages of information, encompassing about 6 terabytes of text data on over 3 million servers. Though it is usually free to get information from the Web, finding the information of interest is difficult. To respond quickly to a question, a good search engine typically depends on a good database of pre-processed information. In other words, processing information for a search engine is a very important task.
Existing search engines use different techniques to process information. Some companies deploy hundreds of human editors to manually categorize the documents. After the documents are correctly categorized, search engines can quickly find the appropriate responses for a question. Such a human-intensive approach is expensive and difficult to scale. In the long run, this approach may not be able to keep up with the information growth.
There are companies that give all relevant responses to a searcher indiscriminately. They prepare the documents through key word matching techniques. They have very powerful crawlers that keep searching for information, and then provide the searchers with all documents having the same key words as in the searcher's question. There are at least two problems with such techniques. First, huge crawlers mean lots of results. If you go to such companies to search for a topic, you might get thousands of hits. The searcher has to go through all of the responses to find an answer. The second problem is that many responses are totally irrelevant to the question. For example, suppose your question is on fixing windows, as in windows and doors. Responses might include fixing Microsoft Windows!
To reduce the number of responses for a searcher, some companies process information by prioritizing documents based on the number of sites linked to them. This approach makes it difficult for a searcher to gain access to sites not commonly accessed.
There are also companies that turn the tables. The more a site is willing to pay them, the more frequently the site will appear in their searches. They process the information by prioritizing it based on how much the information's owner pays them. Again, such information processing techniques do not address users' need to quickly identify the relevant information from the huge number of Web pages.
Another weakness in existing information processing techniques is that not only do they provide many irrelevant responses, they are typically unable to provide responses that are related to your question without literally matching its words. For example, if your question is on butter, responses typically would not include margarine.
Information processing also depends on the types of questions a search engine can respond to. A trend in Web searching is the desire to search in natural language, such as in plain English. As the Web moves into every sector of society, a large part of the population does not feel comfortable searching by search words. It is unnatural. If the search engine depends on certain grammatical rules in a natural language, information processing for searching typically has to follow similar grammatical rules.
No matter whether the search engine is in natural language or in key words, the challenge remains. Information for a search engine has to be processed so that the engine can quickly access the growing wealth of information, and more appropriately respond to an inquiry.
It should be apparent from the foregoing that there is still a need to process information to be integrated into a database for a search engine so that the engine can quickly identify appropriate responses when the amount of information is huge and when the information is growing at an astronomical rate.
The present invention provides methods and apparatus to automatically process information to be appropriately integrated into a database for searching and retrieval. It is applicable even if the amount of information is large and is growing at a fast pace. Also, due to the invention, responses to searches are very relevant. The invention is suitable for both natural-language searches and key word searches. Web documents are used to illustrate the invention.
One embodiment first determines the context or domain of a document. Then, domain-specific phrases in the document are automatically extracted based on grammar and dictionaries. From these phrases, categories in a category hierarchy are identified, and the document is linked to the categories. Later, when a question asks for information related to these phrases, the corresponding categories in the hierarchy are found, and the document is retrieved to answer the question.
In the invention, there can be three different types of dictionaries: a common dictionary, a negative dictionary and a domain-specific dictionary. The negative dictionary includes phrases that should be ignored, while the domain-specific dictionary includes phrases specific to the domain. In one embodiment, the common dictionary includes phrases commonly used by the general public, and phrases in the domain-specific dictionary.
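As a minimal illustrative sketch, the three dictionaries might be held as plain sets of phrases. The class name `Dictionaries`, the field name `general`, and all sample phrases are assumptions made for illustration only; the description does not prescribe any particular data structure.

```python
from dataclasses import dataclass, field


@dataclass
class Dictionaries:
    """Hypothetical container for the three dictionary types."""
    domain_specific: set[str] = field(default_factory=set)
    negative: set[str] = field(default_factory=set)
    # Assumed name: phrases commonly used by the general public.
    general: set[str] = field(default_factory=set)

    @property
    def common(self) -> set[str]:
        # In the embodiment described, the common dictionary holds the
        # general-public phrases plus the domain-specific phrases.
        return self.general | self.domain_specific


# Sample phrases are invented for illustration.
dicts = Dictionaries(
    domain_specific={"double glazing", "window frame"},
    negative={"click here"},
    general={"click here", "home", "repair"},
)
```

Deriving the common dictionary as a union keeps it consistent whenever a phrase is later added to the domain-specific dictionary.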
The domain-specific phrases can be linked together by a category hierarchy. The hierarchy can be a structure that connects categories together, with each category having one or more phrases. The phrases can be grouped together under a category if they belong to the same concept, or if they are equivalent. Categories are grouped together in the hierarchy if they are under the same concept or if they are related categories. Categories can also be grouped together under a broader category if they have some type of order relationship.
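One way to picture such a hierarchy is as a tree of category nodes, each carrying its own set of phrases. The sketch below is an assumption-laden illustration: the `Category` class, its methods, and the sample categories are hypothetical, and a tree is only one of the structures the description would allow.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """Hypothetical node in a category hierarchy."""
    name: str
    phrases: set[str] = field(default_factory=set)
    children: list["Category"] = field(default_factory=list)

    def add_child(self, child: "Category") -> "Category":
        self.children.append(child)
        return child

    def find(self, phrase: str):
        """Depth-first search for the category that holds the phrase."""
        if phrase in self.phrases:
            return self
        for child in self.children:
            hit = child.find(phrase)
            if hit is not None:
                return hit
        return None


# Illustrative data: equivalent phrases share a category, and related
# categories (butter, margarine) sit under one broader category.
root = Category("food")
spreads = root.add_child(Category("spreads"))
spreads.add_child(Category("butter", {"butter", "unsalted butter"}))
spreads.add_child(Category("margarine", {"margarine", "oleo"}))
```

Because butter and margarine are siblings under a broader category, a question about one could be widened to the other by walking up to the shared parent.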
In one embodiment, the document is automatically processed by first identifying every phrase in the document, based on the common dictionary. The identified phrases that have entries in the negative dictionary are ignored. For the remaining phrases, those with entries in the domain-specific dictionary are extracted. Any remaining phrases are new ones.
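The three-step filtering described above can be sketched as follows. The tokenizer, the greedy longest-match scan, the `max_words` limit, and the function names are all assumptions chosen for illustration; the description only states that phrases are identified based on the common dictionary, not how the scan is performed.

```python
import re


def identify_phrases(text, common, max_words=3):
    """Greedy longest-match scan for phrases in the common dictionary."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    found, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in common:
                found.append(candidate)
                i += n
                break
        else:
            i += 1  # token does not start any known phrase
    return found


def classify_phrases(text, common, negative, domain):
    """Return (domain-specific phrases, new phrases) for a document."""
    identified = identify_phrases(text, common)
    kept = [p for p in identified if p not in negative]
    domain_phrases = [p for p in kept if p in domain]
    new_phrases = [p for p in kept if p not in domain]
    return domain_phrases, new_phrases


# Sample dictionaries and text are invented for illustration.
domain = {"window frame", "double glazing"}
negative = {"click here"}
common = domain | negative | {"sash cord"}
text = "Click here to learn how double glazing fits a window frame with a sash cord."
domain_phrases, new_phrases = classify_phrases(text, common, negative, domain)
```

In this sample, "click here" is dropped by the negative dictionary, and "sash cord" is left over as a new phrase because it appears in neither the negative nor the domain-specific dictionary.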
Each of the identified domain-specific phrases can be matched with phrases in the categorization hierarchy. When there is a match, the corresponding document, or the URL of the document, is linked to that phrase in the categorization hierarchy.
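A minimal sketch of this linking step, assuming the phrases of the hierarchy are available as a flat set and the links are kept in a dictionary from phrase to document URLs. The function name `link_document`, the sample phrases, and the `example.com` URLs are hypothetical.

```python
from collections import defaultdict


def link_document(url, domain_phrases, hierarchy_phrases, links):
    """Link the URL to every matching hierarchy phrase.

    Returns the phrases that had no match in the hierarchy.
    """
    unmatched = []
    for phrase in domain_phrases:
        if phrase in hierarchy_phrases:
            links[phrase].add(url)
        else:
            unmatched.append(phrase)
    return unmatched


# Illustrative data only.
hierarchy_phrases = {"double glazing", "window frame"}
links = defaultdict(set)  # hierarchy phrase -> URLs of linked documents

unmatched = link_document(
    "https://example.com/a",
    ["double glazing", "storm window"],
    hierarchy_phrases,
    links,
)
link_document("https://example.com/b", ["double glazing"], hierarchy_phrases, links)
```

Storing URLs rather than whole documents keeps the link table small; the document itself can be fetched when a question is answered.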
The new phrases can be referred to a human editor. If the new phrases are irrelevant, they are included in the negative dictionary. The next time the same new phrases arise from another document, they will not be considered. However, if the new phrases are relevant, they can be added into the domain-specific dictionary. Recommendations can be given to the editor as to where to incorporate new phrases into the existing categorization hierarchy. The editor would try to link the new phrases, with the document, to existing categories. If that cannot be done, the editor may create new categories in the hierarchy. If too many documents are linked to one category, the editor may also be notified to create new categories or sub-categories. Such systematic and orderly growth of the categorization hierarchy is very useful for information organization and information retrieval.
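The editor's decisions and the overload notification might be sketched as below. The function names and the numeric threshold `max_docs` are assumptions; the description says only that the editor may be notified when "too many" documents are linked to one category, without fixing a number.

```python
def apply_editor_decision(phrase, relevant, negative, domain):
    """Record the editor's verdict on a new phrase."""
    if relevant:
        domain.add(phrase)    # extracted as domain-specific from now on
    else:
        negative.add(phrase)  # ignored when it arises in later documents


def overloaded_categories(category_links, max_docs):
    """Categories with more linked documents than the assumed threshold."""
    return sorted(
        name for name, urls in category_links.items() if len(urls) > max_docs
    )


# Illustrative data only.
negative, domain = {"click here"}, {"window frame"}
apply_editor_decision("sash cord", True, negative, domain)
apply_editor_decision("site map", False, negative, domain)

category_links = {
    "frames": {"https://example.com/a", "https://example.com/b", "https://example.com/c"},
    "glazing": {"https://example.com/d"},
}
to_split = overloaded_categories(category_links, max_docs=2)
```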
In one embodiment, a question is transformed to one or more frequently-asked-question formats, which are linked to one or more phrases or categories in the hierarchy. To respond to the question, the documents linked to those phrases can be retrieved to be presented to the user.
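The description does not say how a question is transformed into a frequently-asked-question format. Purely as an illustration, the sketch below assumes a small table of regular-expression templates, each mapping a question onto a named format and a phrase, with the phrase then looked up among the linked documents. The templates, format names, function names, and URLs are all hypothetical.

```python
import re

# Assumed templates: (pattern, frequently-asked-question format).
FAQ_FORMATS = [
    (re.compile(r"^how (?:do|can) i (?:fix|repair) (?:a |an |the )?(?P<x>.+)$"),
     "how to repair X"),
    (re.compile(r"^what is (?:a |an |the )?(?P<x>.+)$"),
     "what is X"),
]


def to_faq_format(question):
    """Return (format name, phrase), or None if no template matches."""
    q = question.lower().strip().rstrip("?").strip()
    for pattern, name in FAQ_FORMATS:
        match = pattern.match(q)
        if match:
            return name, match.group("x")
    return None


def answer(question, phrase_links):
    """Retrieve the documents linked to the phrase in the question."""
    parsed = to_faq_format(question)
    if parsed is None:
        return set()
    _, phrase = parsed
    return phrase_links.get(phrase, set())


# Illustrative link table: hierarchy phrase -> document URLs.
phrase_links = {"window frame": {"https://example.com/frames"}}
docs = answer("How do I fix a window frame?", phrase_links)
```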
Through the categorization hierarchy, new documents or information are much better organized. This will significantly reduce the amount of time required to identify relevant information to respond to questions. Also, since the categorization process is domain specific, information is organized more logically, leading to highly relevant responses to questions.
The invention is also applicable to human learning. The editor can be a student, and the categorization hierarchy can be her knowledge filing system. If a document or phrases are in an area she has learnt before, they can be automatically and systematically filed to her system. New information or phrases, automatically identified, can be referred to her to be learned. After learning, suggestions can be made to her as to where to file the information in her existing filing system. In other words, she can link the information to what she has learnt before. Such systematic and logical learning approaches significantly help her organize new information, which, in turn, enhances knowledge retrieval in times of need.
Other aspects and advantages of the present invention will become apparent from the following detailed description, which, when taken in conjunction with the accompanying drawings, illustrates by way of example the principles of the invention.