Information can generally be divided into structured data and unstructured data. According to statistics, unstructured data, mainly text documents and streaming media, constitute more than 70% of all information. Structured data have a comparatively simple structure, namely a two-dimensional table, and are typically processed by a database management system (DBMS). This technique has been under development since the 1970s and flourished in the 1990s; the research, development and application of techniques for processing structured data are now quite advanced. Unstructured data have no fixed data structure; hence unstructured data processing is very complicated.
Various unstructured document processing applications are popular among users, and many different document formats are in use at present. Existing document editing applications include, for example, Microsoft Word, WPS, Yongzhong Office (a branch of Open Office) and Red Office (another branch of Open Office). A contents management application usually has to handle 200 to 300 constantly updated document formats, which causes great difficulty for application developers. Document interoperability, digital contents extraction and format compatibility are becoming the focus of the industry, and the following problems need solutions:
(1) Documents are not universal.
Users can exchange documents processed with the same application, but cannot exchange documents processed with different applications, which causes information blockage.
(2) Access interfaces are not unified and data compatibility costs are high.
Since the document formats provided by different document processing applications are not compatible with each other, a document processing application that needs to parse an incompatible document must either use a component of the other application (if that application provides a corresponding interface) or spend excessive research resources in the software development stage to parse the document format from scratch.
(3) Information security is poor.
The security control measures for a written document are quite limited, consisting mainly of data encryption and password authentication, and information leaks cause widespread damage to companies every year.
(4) Processing works only on a single document; multi-document management is lacking.
A person may have a large number of documents on a computer, but no efficient organization and management measure is provided for multiple documents, and it is difficult to share resources such as font/typeface files and full text indexes.
(5) Layer techniques are insufficient.
Some applications, e.g., Adobe Photoshop and Microsoft Word, have to some extent introduced the concept of layers, yet the functions and management of layers are too rudimentary to meet practical demands.
(6) Search methods are limited.
The massive amount of information on present networks results in a huge number of search results for any search keyword. While the full text search technique has solved the problem of recall ratio, precision ratio has become the major concern. However, the prior art does not fully utilize all available information to improve the precision ratio. For example, the font or size of characters may be used for determining the importance of the characters, but both are ignored by present search techniques.
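The idea in item (6), that character size could indicate importance, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the TextRun type, the score and rank functions, the 12-point body size and the linear weighting are hypothetical and do not describe any existing search engine or the method of this document.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """A run of text with one font size (hypothetical model of a document)."""
    text: str
    font_size: float  # in points


def score(runs, keyword, body_size=12.0):
    """Sum keyword hits, weighting each hit by font size relative to body text.

    The linear weight (font_size / body_size) is an assumed heuristic.
    """
    keyword = keyword.lower()
    total = 0.0
    for run in runs:
        # Count whole-word matches after splitting on whitespace.
        hits = run.text.lower().split().count(keyword)
        total += hits * (run.font_size / body_size)
    return total


def rank(documents, keyword):
    """Return names of documents containing the keyword, best score first."""
    scored = [(score(runs, keyword), name) for name, runs in documents.items()]
    scored.sort(key=lambda pair: -pair[0])
    return [name for s, name in scored if s > 0]


docs = {
    "a": [TextRun("Security overview", 24.0), TextRun("general notes", 12.0)],
    "b": [TextRun("Notes", 24.0), TextRun("security is mentioned here", 12.0)],
}
print(rank(docs, "security"))
```

Both sample documents contain the keyword exactly once, so a plain full text search cannot tell them apart; under the assumed weighting, the document that uses the keyword in a large heading is ranked first.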
Large companies are all working to make their own document format the standard format in the market and standardization organizations are also leaning toward the creation of a universal document format standard. Nevertheless, a document format, whether a proprietary document format (e.g., .doc format) or an open document format (e.g., .PDF format), leads to problems as follows:
(a) Repeated Research and Development and Inconsistent Performance
Different applications that adopt the same document format standard have to find their own ways to render and generate documents conforming to that standard, which results in repeated research and development. Furthermore, the rendering components developed for some applications provide full-scale functions while others provide only basic functions, and some applications support a new version of the document format standard while others support only an old version. Hence, different applications may present the same document in different page layouts, and some applications may even encounter rendering errors and consequently be unable to open the document.
(b) Barrier to Innovation
The software industry is known for its ongoing innovation; however, when a new function is added, descriptive information about the function needs to be combined with the corresponding standard. A new format can only be introduced when the standard is revised. A fixed storage format makes technical innovation less competitive.
(c) Impaired Search Performance
For massive information, more indexes need to be added so as to enhance search performance, yet it is hard for a fixed storage format to allow more indexes.
(d) Impaired Transplantability and Scalability
Different applications in different system environments have different storage needs. For example, an application needs to reduce the seek times of a disk head to improve performance when the data are saved on a hard disk, while an embedded application does not need to do so because its data are saved in the system memory. Likewise, a DBMS provided by the same manufacturer may use different storage formats on different platforms. Hence document storage standards affect the transplantability and scalability of the system.
In the prior art, the document format that provides the best performance for openness and interchangeability is the PDF format from Adobe Acrobat. However, even though the PDF format has actually become a standard for document distribution and exchange worldwide, different applications cannot exchange PDF documents, i.e., PDF documents provide no interoperability. Moreover, both Adobe Acrobat and Microsoft Office can process only one document at a time and can neither manage multiple documents nor operate with docbases.
In addition, the existing techniques are significantly flawed with respect to document information security. Currently, the most widely used documents, e.g., Word documents and PDF documents, adopt data encryption or password authentication for data security control without any systematic identity authentication mechanism. Privilege control cannot be applied to a part of a document but only to the whole document. The encryption and signature of logic data are limited, i.e., encryption and signature cannot be applied to arbitrary logic data. A contents management system, in turn, provides a satisfactory identity authentication mechanism but is separate from the document processing system and cannot be integrated with the document processing system on the core unit. Therefore the contents management system can only provide management down to the document level, and the document is beyond the security control of the contents management system while the document is in use. Essential security control cannot be achieved in this way. Moreover, security and document processing are usually handled by separate modules, which may easily cause security breaches.