1. Field of the Invention
The present invention relates to a document processing system which stores a plurality of items of document data and relation information representing the relations between the items of document data, a control method thereof, a program, and a storage medium.
2. Description of the Related Art
Advanced, low-cost storage techniques enable accumulation and management of a larger amount of document data than was possible using conventional techniques. File servers, document management systems, and groupware for implementing such functions have become widespread; information processing apparatuses such as PCs continue to advance; and various devices including a copying machine, printer, image scanner, facsimile apparatus, digital camera, document storage, and MFP (Multi-Function Peripheral) having image transmission and reception functions can now be connected to a network. In a network environment on the customer side, a large amount of document data is exchanged between information processing apparatuses and various office machines. A storage infrastructure capable of proactively storing document traffic distributed throughout an office network is coming into practical use.
An example of such a storage infrastructure is a multi-function image processing apparatus disclosed in Japanese Patent No. 3486452 (reference 1). This apparatus connects at least two image output apparatuses so that a copy of a necessary image is reliably made without troubling a user. The apparatus monitors the processing parameters of an image processing job, and determines whether an activated job satisfies a predetermined condition. When a job determined to satisfy the condition is executed, the apparatus sends image data to an original output destination and to another image output apparatus (e.g., an image file). One goal of this storage infrastructure is to audit security in order to prevent leakage of confidential information. Another goal is to reuse existing assets efficiently by minimizing the re-execution of document processing similar to that already performed.
A storage infrastructure which proactively stores document traffic distributed throughout an office network stores not only document content data but also various types of additional information, that is, metadata related to documents. For example, relation information between two documents and history information on the lifecycle of a document are stored as metadata in association with a given document. Examples of related documents are grouped documents belonging to the same category, documents of old and revised editions, application data and a snapshot document created during printing, similar documents, documents containing the same page, and documents containing similar images. Metadata pertaining to the lifecycle of a document includes, for example, information on the contents of processing executed for the document, parameters, time, apparatus used, location, and the operator of the processing.
Japanese Patent Laid-Open No. 2004-78735 (reference 2) discloses a filing system which implements some document management functions in a document-handling apparatus (e.g., a printer, scanner, copying machine, FAX, projector, or digital camera). Every time the system handles a document, it transmits the document information to a document management server together with additional information on the person who handled the document.
In the field of electronic document data files, file formats are used which express metadata associated with document content data together with the document data itself. In OpenDocument Format (ISO/IEC 26300) and Office Open XML (Ecma-376), the document file format represents metadata as an XML document.
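As an illustration of such an XML metadata representation, the following is a minimal sketch in Python that reads a few fields from an OpenDocument-style metadata document. The sample XML, its field values, and the helper name read_metadata are hypothetical and are not taken from any of the cited references; the namespace URIs are those used by OpenDocument Format. In an actual OpenDocument file, this XML is stored as the meta.xml member of the document package.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by OpenDocument metadata.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "dc": "http://purl.org/dc/elements/1.1/",
    "meta": "urn:oasis:names:tc:opendocument:xmlns:meta:1.0",
}

# Hypothetical sample content; all values are invented for illustration.
SAMPLE_META_XML = """
<office:document-meta
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0">
  <office:meta>
    <dc:title>Quarterly report</dc:title>
    <dc:creator>Example Author</dc:creator>
    <meta:creation-date>2007-01-15T09:30:00</meta:creation-date>
    <meta:keyword>sales</meta:keyword>
    <meta:keyword>draft</meta:keyword>
  </office:meta>
</office:document-meta>
"""


def read_metadata(xml_text):
    """Return selected metadata fields from an OpenDocument-style XML string."""
    root = ET.fromstring(xml_text.strip())
    meta = root.find("office:meta", NS)

    def text_of(path):
        # Return the text of the first matching element, or None if absent.
        element = meta.find(path, NS)
        return element.text if element is not None else None

    return {
        "title": text_of("dc:title"),
        "creator": text_of("dc:creator"),
        "created": text_of("meta:creation-date"),
        "keywords": [k.text for k in meta.findall("meta:keyword", NS)],
    }


info = read_metadata(SAMPLE_META_XML)
print(info)
```

Because the metadata travels inside the file itself, it is available only while the document remains in electronic form.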
Japanese Patent Laid-Open No. 09-91301 (reference 3) discloses a document information management system which builds the continuity and relation of information between the digital world of a computer and a paper document. According to this technique, a paper document is embedded in the document information management system in the digital world. The system allows direct access to the digital world via a paper document. Further, the system implements a hypertext using a paper document. The system adds selection information to description information recorded at an arbitrary position on paper to search for and output a desired relation information file (electronized document). The paper also records link information for searching for the relation information file.
A technique well known as PageRank® is described in U.S. Pat. No. 6,285,999 (reference 5) and in Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd, “The PageRank Citation Ranking: Bringing Order to the Web”, 1998, http://www-db.stanford.edu/~backrub/pageranksub.ps (reference 6). This technique exploits the vast link structure of the Web. A link from one page to another is regarded as a supporting vote, and the importance of a page is determined from this “poll”, that is, from the sum of all votes it receives. Not only the number of votes, that is, the number of links to the page, is taken into consideration; the page that cast each vote is also analyzed. A vote cast by a page of higher “importance” is evaluated more highly, and the page receiving it is regarded as more “important”.
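The voting idea described above can be sketched in Python as a simple iterative computation. This is a simplified illustration under stated assumptions, not the method as claimed in reference 5: the damping factor of 0.85, the fixed iteration count, the even redistribution of rank from pages without outgoing links, and the four-page example graph are all assumptions made for this sketch.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance from a link graph.

    links maps each page to the list of pages it links to.
    damping and iterations are assumed values chosen for this sketch.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with equal importance for every page.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to,
                # so a vote from a more important page carries more weight.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # vote evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph: A, B, and D link to C, and C links to A.
example_links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(example_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this example graph, page C receives the most votes and ranks highest. Page A receives only one vote, but that vote comes from the highly ranked page C, so A ranks above B and D, which receive no votes at all.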
The volume of stored documents that are considered critical office resources is expected to greatly increase. Creation and processing of documents are basic office tasks, and the number of documents keeps growing. It is difficult to organize a large number of dynamically accumulated documents in a tree-structured classification such as by category. An improved method is required for efficiently searching for a desired document among many accumulated and unorganized documents. As such search methods, in addition to Internet search services, full text search and content search within an office network (also known as enterprise search) are becoming popular.
To efficiently search for a desired document among a large number of accumulated documents, it is important to use not only document data but also a variety of metadata associated with the document and relations with other documents. A more advanced and practical search function can be provided if metadata reflecting a user activity in the office, such as processing executed for a document by a user, can be used as a search key.
The range of applications will expand widely if a plurality of documents and metadata are treated as nodes, and a semantic network formed by the relations between documents and the relations between metadata is exploited as a kind of knowledge representation. The network of documents and metadata is usable for so-called data mining and business intelligence upon classification, analysis, and modification. The network expresses a document and the action of an office worker in association with the document. A so-called “Wisdom of Crowds” or “Collective Intelligence” can be derived and exploited by integrating the network by statistical processing. Note that the “Wisdom of Crowds” has received attention as a factor which characterizes the “Web 2.0” trend on the Internet. Applying the “Wisdom of Crowds” to an intranet as well is expected to greatly increase the productivity of an office as a whole.
However, once printed on paper or sent by facsimile, an online document electronically linked to the semantic network, or an electronic document of a file format containing metadata, loses its metadata and its relation data to other documents. That is, a document which is offline from the network, such as a paper document or facsimile document, is disconnected from its metadata and from the semantically related network.
According to reference 3 described above, link information is recorded on paper to search for a relation information file in the digital world. However, in processing such as paper scanning or facsimile reception, an offline document and metadata associated with the processing cannot be linked again to an online semantic network. In other words, an online document in a storage infrastructure and a paper document having undergone the processing (and metadata associated with the processing) cannot be accumulated and managed in association with each other.