1. Field of the Invention
The present invention relates to the collection of documents from a network, and more specifically to a document collection apparatus for efficiently collecting documents for each specific use.
2. Description of the Related Art
A retrieval system for documents available over a network such as an intranet or the WWW is realized by a document collection apparatus (robot, spider, or crawler) that collects documents from the network, and a retrieval engine that generates a keyword index for the collected documents.
The document collection apparatus starts collecting documents from a given URL (Uniform Resource Locator) group, which serves as the starting point of the collection. It then repeats, a predetermined number of times, the process of collecting uncollected documents that are referenced by collected documents, using the information about references among documents (for example, an anchor or a hyperlink) to select the candidates to be collected next. In this way, a document collection crawler periodically collects documents from several tens of millions to several hundreds of millions of URLs. A URL is a notation for specifying the location of, and the method of accessing, information available over a network.
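The collection loop described above can be sketched as follows. This is only a minimal illustration, not the method of any particular crawler: the names `collect`, `fetch_links`, and `max_documents`, the example URLs, and the in-memory `LINKS` table that stands in for a network are all assumptions introduced here.

```python
from collections import deque

# Hypothetical link table standing in for a network:
# each URL maps to the URLs referenced by that document.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}

def collect(seed_urls, fetch_links, max_documents):
    """Collect up to max_documents URLs, starting from the seed URL group."""
    queue = deque(seed_urls)   # candidates to be collected next
    seen = set(seed_urls)      # URLs already queued or collected
    collected = []
    while queue and len(collected) < max_documents:
        url = queue.popleft()
        collected.append(url)
        # Follow the references (anchors, hyperlinks) found in the document.
        for target in fetch_links(url):
            if target not in seen:      # only uncollected documents are queued
                seen.add(target)
                queue.append(target)
    return collected

order = collect(["http://example.com/"], lambda u: LINKS.get(u, []), max_documents=3)
print(order)
```

Here the predetermined number of repetitions is modeled as a cap on the number of collected documents; a real apparatus would fetch each document over the network instead of reading a table.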
Recently, the number of documents available over networks has been increasing rapidly. In January 2000, Inktomi and others announced a survey result that the number of unique documents on the Internet had reached one billion. In July 2000, Cyveillance in the U.S. announced a survey result that the size of the Internet was about 2.1 billion documents, and that it was estimated to double in 2001.
If documents are collected from a billion URLs, it takes about three years to collect all of them even at a rate of a million URLs a day (about 10 URLs, or 40 Kbytes, per second). By the time the collection is complete, the information in the documents collected in the early days has become obsolete. Therefore, an intelligent document collection apparatus that efficiently collects only the information significant for each use has long been demanded.
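The estimate above can be checked with a short calculation. The sketch below uses only the two figures given in the text (one billion URLs, one million URLs per day); the variable names are illustrative.

```python
TOTAL_URLS = 1_000_000_000     # one billion URLs (figure from the text)
URLS_PER_DAY = 1_000_000       # one million URLs per day (figure from the text)
SECONDS_PER_DAY = 24 * 60 * 60

days = TOTAL_URLS / URLS_PER_DAY                   # days for one full pass
years = days / 365                                 # the same duration in years
urls_per_second = URLS_PER_DAY / SECONDS_PER_DAY   # sustained collection rate

print(f"{days:.0f} days, about {years:.1f} years")
print(f"{urls_per_second:.1f} URLs per second")
```

One full pass takes 1000 days, or roughly 2.7 years, at a sustained rate of about 11.6 URLs per second, which the text rounds to three years and about 10 URLs per second.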
Document collection apparatuses for collecting documents by priority for a specified use are listed below.
For example, the invention disclosed by Japanese Patent Publication No. 9-311802 collects new information by priority.
Documents that are considered to be similar in contents are collected based on the following concepts.
a) The scope of the collection is limited by the number of hierarchical levels.
For example, as in the invention disclosed by Japanese Patent Publication No. 9-218876, cross-referenced documents are considered to be similar in contents, but to have no semantic relation when they differ in hierarchical level. Therefore, documents are collected with the collection scope limited by the number of hierarchical levels.
b) Only documents semantically similar to one another are collected.
For example, as in the invention disclosed by Japanese Patent Publication No. 10-105572, the semantic similarity is computed by matching the contents of documents, and only semantically similar documents are collected from among the referenced documents.
c) Only documents having appropriate character strings in referenced documents are collected.
For example, as in the inventions disclosed by Japanese Patent Publication No. 10-260979 and No. 2000-9011, it is determined, based on the referencing expression in a referenced document (for example, the contents of an anchor tag in HTML), whether or not the document referenced by that expression is to be collected next.
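The three concepts a) through c) above can each be pictured as a small test applied to a candidate link. The sketch below is illustrative only and does not reproduce the cited publications: the function names are hypothetical, treating a hierarchical level as the link distance from the starting URL is an assumption, and the word-overlap (Jaccard) score is merely a stand-in for whatever semantic comparison an actual apparatus performs.

```python
import re

def within_depth(depth, max_depth):
    """(a) Accept a link only within the allowed number of hierarchical levels.

    Assumption: a level is counted as link distance from the starting URL.
    """
    return depth <= max_depth

def keyword_similarity(text_a, text_b):
    """(b) Word-set overlap (Jaccard), a stand-in for a semantic comparison."""
    words_a = set(re.findall(r"[a-z0-9]+", text_a.lower()))
    words_b = set(re.findall(r"[a-z0-9]+", text_b.lower()))
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def anchor_matches(anchor_text, wanted_terms):
    """(c) Accept a link when its anchor text contains any wanted term."""
    lowered = anchor_text.lower()
    return any(term.lower() in lowered for term in wanted_terms)

print(within_depth(2, 3), within_depth(4, 3))
print(keyword_similarity("network document crawler", "document crawler index"))
print(anchor_matches("Quarterly sales report", ["sales"]))
print(anchor_matches("click here", ["sales"]))
```

In the usage lines, the two sample texts share two of four distinct words, so the similarity score is 0.5; a threshold on that score would decide whether the referenced document is collected.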
Generally, more popular documents are collected by priority.
A more frequently referenced document, that is, a document referenced by a large number of documents, is considered to be popular. By collecting documents in order from the most frequently referenced document in the collected document group, popular documents can be collected by priority.
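Ordering by reference frequency can be sketched as counting, for each URL, how many collected documents reference it. This is a minimal illustration under stated assumptions: the function name `order_by_reference_count` and the document identifiers are hypothetical, and the choices to count each referring document once and to ignore self-references are made here, not taken from the text.

```python
from collections import Counter

def order_by_reference_count(links):
    """Return referenced URLs, most frequently referenced first.

    links maps each collected document's URL to the URLs it references.
    """
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):    # count each referring document once
            if target != source:       # ignore self-references
                counts[target] += 1
    # Descending count, then URL, so that ties are ordered deterministically.
    return sorted(counts, key=lambda url: (-counts[url], url))

links = {
    "doc1": ["doc3", "doc4"],
    "doc2": ["doc3"],
    "doc3": ["doc4", "doc3"],
    "doc4": ["doc3"],
}
ranking = order_by_reference_count(links)
print(ranking)
```

In this example doc3 is referenced by three other documents and doc4 by two, so doc3 would be collected first.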
However, the concepts of the above mentioned conventional technology are insufficient for collecting the documents requested on a portal site of a community such as an enterprise. For example, the requirements of a portal site in an enterprise, that is, a corporate portal, include the following conditions.
A large number of documents generated inside and outside a company in real time are automatically collected.
A semantic analysis and categorization are automatically performed.
Documents are collected, and a categorization result is fed to an appropriate position (depending on a user) on the screen.
In collecting documents, the enormous number of documents inside and outside the company must not be collected at random. Instead, documents must be selected from the viewpoint of their relation to a job from inside documents. Having a relation to a job is different from having specific semantic contents or being significant in general. For example, in an intranet community of an enterprise of a certain scale, the contents of documents are semantically diversified. In addition, among outside (for example, Internet) documents, information about hobbies, for example, is popular but is not always significant to a corporate portal.
However, the conventional concepts for collecting documents (for example, obtaining the latest information by priority, obtaining information in a specified field by priority, and obtaining popular information by priority) have the problem that documents that are significant in general but not significant to the community, such as information about hobbies, can be collected.
In addition, for example, when documents are collected by the above mentioned conventional methods of collecting only semantically similar documents, each concept has the following problems.
Simply limiting the number of hierarchical levels requires only a simple process, but does not guarantee that semantically similar documents are collected by priority, or that important documents are collected without fail.
In the system of checking the contents of documents to determine whether or not they are semantically similar to one another, keywords are normally extracted by analyzing the text of a document through natural language processing, and the analysis is based on the similarity of the extracted keywords. Therefore, the process takes a long time. Actually, only about 100 documents can be processed. Processing several billion documents one by one therefore cannot practically be completed. Even assuming that the process can be completed, the precision is only about 70% to 80%. Since the process largely depends on the type of language, it is necessary to have a determination tool for each language.
Even when it is determined based on the referencing expression whether or not documents are to be collected, the character string used in the referencing expression often contains fixed words and phrases (familiar expressions) such as ‘home page’, ‘return to top’, and ‘click here’, and does not always indicate the semantic contents of the referenced document.