Information can generally be divided into structured data and unstructured data. According to statistics, unstructured data, which mainly include text documents and streaming media, constitute more than 70% of all information. Structured data have a comparatively simple structure, i.e., a two-dimensional table structure, and are typically processed by a database management system (DBMS). This technique has been under development since the 1970s and flourished in the 1990s; the research, development, and application of techniques for processing structured data are quite advanced at present. Unstructured data, by contrast, do not have any fixed data structure; hence unstructured data processing is very complicated.
Different applications in different system environments have different storage needs. For example, an application whose data are saved on a hard disk needs to reduce the seek times of the disk head to improve performance, while an embedded application does not need to do so because its data are saved in system memory. Similarly, a DBMS provided by the same manufacturer may use different storage formats on different platforms. Hence document storage standards affect the portability and scalability of the system.
In the prior art, the document format that offers the best openness and interchangeability is the PDF format from Adobe. However, even though the PDF format has in practice become a worldwide standard for document distribution and exchange, different applications cannot exchange PDF documents with one another, i.e., PDF documents provide no interoperability. Moreover, both Adobe Acrobat and Microsoft Office can process only one document at a time and can neither manage multiple documents nor operate with docbases.
In addition, the existing techniques are significantly flawed with respect to document information security. Currently, the most widely used documents, e.g., Word documents and PDF documents, rely on data encryption or password authentication for data security control and lack any systematic identity authentication mechanism. Privilege control cannot be applied to a part of a document but only to the whole document. Encryption and signature of logical data are also limited, i.e., encryption and signature cannot be applied to arbitrary logical data. Likewise, a content management system, while providing a satisfactory identity authentication mechanism, is separate from the document processing system and cannot be integrated with it at the core level. Therefore the content management system can only provide management down to the document level, and a document is beyond the security control of the content management system while the document is in use. Fundamental security control cannot be achieved in this way. Furthermore, security and document processing are usually handled by separate modules, which may easily lead to security breaches.