In general, one or more articles include, without limitation, paper documents such as newspapers and magazines, and electronic documents, for example Portable Document Format (PDF) files, word-processing documents, Adobe documents, printed documents, brochures, images, scanned documents, books, etc. The one or more articles may contain scripts having, without limitation, texts, characters, words, images, symbols, and letters in one or more languages. Some of these languages, including but not limited to Indic languages, Hebrew, Thai, and Tibetan, have complex scripts, i.e. Unicode complex scripts. The complex scripts include characters, texts, words, and symbols whose look, shape, and appearance are different from, and more complex than, those of the Latin script, whose characters are straight and simple in form. Particularly, complex scripts differ from the Latin script in terms of the interpretation and shape of characters, texts, symbols, etc. In these complex scripts, the look, shape, position, and attachment of a glyph depend on the order of the characters and also on the contextual position of the characters in the text (i.e. which character precedes and/or follows it). Typically, a glyph is an elemental symbol within an agreed set of symbols, intended to represent a readable character for the purposes of writing, and intended to express thoughts, ideas, and concepts.
Conventionally, the complex scripts may require complex transformations and processing between text input and text output in order to render and display the complex script with a proper layout on a display unit of a computing device. Particularly, the rendering order of the complex scripts is different from the writing order of the complex scripts. In such a case, the complex scripts are reordered to render the original texts. However, the reordering of complex script text is contextual, so there may be multiple possible original texts when re-using and/or retrieving the actual original text. Particularly, multiple glyphs are required to be mapped to a single character or to a sequence of multiple characters. A mapping table that maps each glyph to a single character or to a sequence of multiple characters is difficult to maintain, since the same glyph cannot have multiple character mappings. Additionally, in the conventional method, rendering the complex scripts requires high processing power and resources, such as fonts, script-specific rules, and layout and font engines. Furthermore, such a way of rendering may result in data loss, where some texts and/or characters are lost while rendering. In some scenarios, the complex scripts are displayed properly but do not enable the user to reuse, retrieve, or parse the script; for example, read-only text may be generated on which no operation can be performed.
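The difficulty with the mapping table can be illustrated by a minimal Python sketch. The glyph identifiers (`g_ka`, `g_qa`, `g_half_sa`) and the table contents are hypothetical and chosen only for illustration; real fonts assign their own glyph identifiers. The sketch relies on one standard fact: the Devanagari letter QA (U+0958) is canonically equivalent to KA (U+0915) followed by NUKTA (U+093C), so a font may plausibly draw both with the same glyph.

```python
from collections import defaultdict

# Hypothetical forward table: character sequence -> glyph identifier.
FORWARD = {
    "\u0915": "g_ka",              # KA
    "\u0915\u093C": "g_qa",        # KA + NUKTA
    "\u0958": "g_qa",              # precomposed QA, same visual shape
    "\u0938\u094D": "g_half_sa",   # SA + Halant (half form)
}

def invert(forward):
    """Build the glyph -> character-sequence table needed for text extraction."""
    inverse = defaultdict(list)
    for chars, glyph in forward.items():
        inverse[glyph].append(chars)
    return dict(inverse)

def ambiguous_glyphs(forward):
    """Return the glyphs that map back to more than one character sequence."""
    return {g: seqs for g, seqs in invert(forward).items() if len(seqs) > 1}

if __name__ == "__main__":
    for glyph, seqs in ambiguous_glyphs(FORWARD).items():
        readable = [" ".join(f"U+{ord(c):04X}" for c in s) for s in seqs]
        print(glyph, "<-", readable)
```

In this sketch the glyph `g_qa` maps back to two different character sequences, so a table that allows only one character mapping per glyph cannot say which original text was intended.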
Consider a scenario in which incorrect text or data is extracted and displayed in a target file when a user wants to use a partial or complete portion of text by copying it from a source PDF file into another text editor such as Notepad or WordPad. For example, incorrect text is rendered when a reviewer quotes a portion of text from the source PDF document in a review document by a copy-paste operation.
For example, FIG. 1 illustrates an example of rendering an Indic script character for display. In FIG. 1, the appearance of an Indic script character changes due to the presence of a symbol, namely “Halant”, after the Indic script character and the presence of a consonant after the Halant. That is, in FIG. 1, an SA character 100 having a Unicode value of (U+0938) changes its shape due to a Halant character 102 having a Unicode value of (U+094D) after the SA character 100 and the presence of a TA character 104 having a Unicode value of (U+0924) after the Halant character 102. Specifically, the right vertical line of the SA character 100 is removed to form the SATA character 106.
Another example of Indic script characters combining to form a single ligature is illustrated in FIG. 2. As illustrated in FIG. 2, three initial characters, a JA character 200 having a Unicode value of (U+091C), a Halant character 202 having a Unicode value of (U+094D), and a NYA character 204 having a Unicode value of (U+091E), generate a single GYA character 206. To generate the GYA character 206, the right vertical line of the JA character 200 is removed due to the presence of the Halant character 202 to form a Half JA character 208. Then, the Half JA character 208 is combined with the NYA character 204 to form the final GYA character 206. However, the placement of such a Halant character and/or NYA character may change while displaying or rendering on the display unit due to a lack of reordering.
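The character sequences described for FIG. 1 and FIG. 2 can be inspected with a short Python sketch that uses only the standard `unicodedata` module. The names `SATA`, `GYA`, and `describe` are introduced here for illustration. Note that the Unicode character database names the Halant character "VIRAMA".

```python
import unicodedata

# Code point sequences as given in the text for FIG. 1 and FIG. 2.
SATA = "\u0938\u094D\u0924"  # SA, Halant, TA
GYA = "\u091C\u094D\u091E"   # JA, Halant, NYA

def describe(text):
    """List each stored character as (code point, Unicode name)."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

if __name__ == "__main__":
    for label, seq in (("SATA", SATA), ("GYA", GYA)):
        print(label, "is stored as", len(seq), "characters:")
        for code_point, name in describe(seq):
            print("   ", code_point, name)
```

Each sequence is stored as three characters even though a shaping engine would typically display it as one conjunct form, which is why the displayed glyphs alone do not directly reveal the stored text.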
Another example, showing an alteration of the position of a dependent mark, is illustrated in FIG. 3. As illustrated in FIG. 3, a character 300, a consonant, having a Unicode value of (U+0915) and a character 302, a dependent vowel, having a Unicode value of (U+0941) are combined to form the character 304 having a glyph 306. However, the glyph 306 is not in the correct position, and thus the glyph 306 is repositioned horizontally along an axis 308 (x axis) to form the final character 310.
In some scenarios, an incorrect output document is rendered when a user wants to convert the source file from PDF to some other format.
In some scenarios, incorrect search results are returned, in terms of the occurrence or position of a given keyword, when a user searches for text (a keyword) within the source file and navigates through the results. For example, incorrect locations or positions of a keyword are provided when the user wants to navigate to the location where the search keyword exists within the PDF document, or when the user wants to find the set of PDF files that match given search criteria using a search engine.
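One way such a search failure can arise is sketched below in Python. The example is an assumption made for illustration and is not taken from the figures: it uses the Devanagari vowel sign I (U+093F), which is displayed to the left of its consonant although it is stored after it. If the text is extracted in display order rather than writing order, a keyword typed in writing order is not found. The names `logical`, `visual`, and `find_keyword` are hypothetical.

```python
KA = "\u0915"        # DEVANAGARI LETTER KA
VOWEL_I = "\u093F"   # DEVANAGARI VOWEL SIGN I, drawn to the left of the consonant

logical = KA + VOWEL_I   # writing (storage) order, as a user would type the keyword
visual = VOWEL_I + KA    # display order, as a naive extractor might emit it

def find_keyword(text, keyword):
    """Return the index of keyword in text, or -1 if it does not occur."""
    return text.find(keyword)

if __name__ == "__main__":
    print("search in writing-order text:", find_keyword(logical, logical))
    print("search in display-order text:", find_keyword(visual, logical))
```

The first search succeeds at index 0, while the second returns -1 even though both strings would look the same on screen.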
In one conventional method, only the Unicode value of each character is stored. For example, consider the text: “kA k^mA [] or []”, where the Unicode value stream of k, A, ^ and m is stored. For example, say the values k=0xC95; A=0xCBE; ^=0xCCD; m=0xCAE are stored in the form 0xC95 0xCBE 0xC95 0xCCD 0xCAE 0xCBE. However, in the conventional method, no rendering information, i.e. the glyph stream values of the corresponding characters, is stored. Also, such rendering depends on the user application and settings.
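The Unicode-only storage described above can be reproduced with a minimal Python sketch. The transliteration table uses the values given in the text; the names `TRANSLIT` and `to_code_points` are hypothetical, and the space in the example text is simply skipped, as in the stream shown above.

```python
# Transliteration symbols and Unicode values as given in the text.
TRANSLIT = {"k": 0x0C95, "A": 0x0CBE, "^": 0x0CCD, "m": 0x0CAE}

def to_code_points(translit_text):
    """Convert the transliterated text to its stored Unicode value stream."""
    return [TRANSLIT[ch] for ch in translit_text if ch in TRANSLIT]

if __name__ == "__main__":
    stream = to_code_points("kA k^mA")
    print(" ".join(f"0x{cp:X}" for cp in stream))
    # The stored stream holds characters only: no glyph identifiers and no
    # glyph positions, so the rendered form must be recomputed by each application.
    print("".join(chr(cp) for cp in stream))
```

Because only the character values are kept, two applications with different fonts or shaping rules may display the same stream differently.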