The present invention relates to information processing technology, and in particular to a method and system for processing a text.
Over the past 20 or more years, with the rapid development of electronic technology, the performance and capacity of computer networks, particularly the Internet, have increased explosively. Every day, users employ computers to process and edit various kinds of information, producing a great number of electronic texts. These electronic texts (hereinafter referred to as texts) comprise texts stored in the form of documents and texts stored in a database in the form of records and tables. The information in these texts is an important asset for a person or an enterprise. Storing and processing these texts in electronic forms such as documents or records makes it convenient for users to reuse the information therein and improves work efficiency.
However, in some cases a text may be damaged (for example, a text document is damaged), which hinders reuse of the information in the text and wastes the resources (such as time) invested in creating it. A text may be damaged for various reasons, for example communication failure or storage medium fault. In addition, faults in the operating system (OS) or in the applications that process documents may also damage a document. Document damage can be reduced to a minimum, but it cannot be prevented completely.
When a document is damaged, the user typically wishes to retrieve the undamaged part of the document, thereby avoiding the need to re-edit the whole document content. For a word processing document, the text is generally the most important content, whereas the format and other non-text information are relatively insignificant. Thus, it is important to restore the text in the document. Text is typically stored in a document in the form of character codes according to a predetermined character set.
A character set refers to a set of specific characters and is categorized as either a single-byte character set (a single-byte coded character set) or a multi-byte character set (a double-byte or multi-byte coded character set). Single-byte character sets mainly comprise coded character sets such as ASCII and Latin-1, which are mainly used for alphabetic languages such as English. Multi-byte character sets mainly comprise coded character sets such as GB2312, GBK, GB18030, Shift-JIS and ISO 2022, which are mainly used for Chinese, Japanese, Korean and the like. In Microsoft Windows, the core is coded in UTF-16, whose code units are two bytes wide, while the outer-layer applications vary with the language settings (locale) in use. For example, the Chinese version of Windows may use GB2312 codes or GB18030 codes.
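By way of illustration only, the difference between single-byte and multi-byte coded character sets can be observed by encoding the same character under several of the codes named above. The following is a minimal Python sketch; the sample characters and the helper name encoded_length are assumptions chosen for the illustration and do not appear in the original description.

```python
from typing import Optional

# Illustrative sample characters (assumed for this sketch, not from the source).
SAMPLES = {"Latin letter": "A", "Chinese character": "中"}

# Python codec names corresponding to some of the character sets named above.
ENCODINGS = ("ascii", "latin-1", "gb2312", "gbk", "gb18030", "shift_jis", "utf-16-le")


def encoded_length(text: str, encoding: str) -> Optional[int]:
    """Return the number of bytes used to encode text, or None if the
    character set cannot represent it."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    for label, char in SAMPLES.items():
        for enc in ENCODINGS:
            size = encoded_length(char, enc)
            shown = "not representable" if size is None else f"{size} byte(s)"
            print(f"{label} {char!r} in {enc}: {shown}")
```

In this sketch, a single-byte character set such as ASCII cannot represent the Chinese character at all, whereas the multi-byte character sets encode the Latin letter in one byte and the Chinese character in two.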
A database for storing information may also use different codes. During installation, the database system software may set a default code, but when individual databases are set up, the code can be designated as required by the user. For example, after the database system software is installed, two databases are established, one for processing employee information and one for processing machine equipment information. The employee information may use GB2312 codes or UTF-16 codes so as to support Chinese, or Shift-JIS codes so as to support Japanese. The machine equipment information may use ASCII codes, because information such as the names and IP addresses of the machine equipment consists entirely of ASCII characters.
One important cause of document damage is loss of bytes. For a document stored on a hard disk or USB flash drive, unexpected factors such as vibration or a complex operating environment may damage some bytes of the document, thereby causing loss of bytes.
Copying a document between databases, especially between databases using different codes, may also cause loss of bytes. For example, if data overflow occurs in the copy buffer, bytes may be lost. The cooperative servers of a multinational company are required to support English, Chinese, Japanese and Korean users in different countries, so the documents therein use single-byte codes as well as double-byte or multi-byte codes. During synchronization or backup between such cooperative server systems, an improper method will cause overflow, and bytes may thus be lost. For example, in a database using multi-byte codes, each character is coded in 1 to 3 bytes. When a string of characters is copied, a 512-byte buffer area is used. When the buffer area is full, the last character may not be completely copied because of a program design problem or a memory allocation problem, i.e. bytes are lost. For example, consider a string consisting of "ABC" followed by two Chinese characters. If A, B and C are each single-byte coded, they occupy 3 bytes in total; the two Chinese characters are three-byte coded and thus each occupy three bytes, so the string requires 3 + 2 × 3 = 9 bytes. If the encoded string is stored in the last 8 bytes of the buffer area, the last byte of the last character will be lost.
Format conversion between different document formats or data formats, especially conversion of content that includes text between systems or applications using different codes, may also cause loss of bytes.
Owing to the development of hardware and software technologies and to many years of using computers to process various kinds of information, some enterprise users have accumulated a variety of documents based on different software and hardware systems. Because the scenarios to be handled are complex, loss of bytes in the text occurs frequently when these accumulated documents are reused.
Thus, it is necessary to adopt technical processing measures for the above scenarios so as to restore the damaged text as far as possible. Moreover, a mechanism is needed to detect whether a given processing operation is safe for the text.