The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for improvement of searches using historic code points associated with characters.
Search engines provide a way for users to find information pertaining to a specified set of words, usually referred to as a query string. When a user enters a query string into a search engine, the search engine examines an associated index and provides a listing of best-matching data according to predetermined criteria, usually with a short summary containing a title of a document and sometimes parts of the text associated with the best-matching data. The associated index is built from the information stored with the data and the method by which the information is indexed. Most search engines support the use of the Boolean operators AND, OR and NOT to further specify the search query. Boolean operators are for literal searches that allow the user to refine and extend the terms of the search. The search engine looks for the words or phrases exactly as entered.
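As a rough illustration of the index lookup and Boolean operators described above, the following minimal sketch builds a toy inverted index and evaluates AND, OR and NOT as set operations. The documents, the `build_index` helper, and the whitespace tokenization are hypothetical simplifications, not the behavior of any particular search engine.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Hypothetical document collection, for illustration only.
docs = {
    1: "cloud storage pricing",
    2: "cloud computing basics",
    3: "local storage basics",
}
index = build_index(docs)

cloud = index.get("cloud", set())
storage = index.get("storage", set())

cloud_and_storage = cloud & storage   # AND: documents containing both words
cloud_or_storage = cloud | storage    # OR: documents containing either word
cloud_not_storage = cloud - storage   # NOT: "cloud" but not "storage"

print(cloud_and_storage, cloud_or_storage, cloud_not_storage)
```

Because the lookup matches words exactly as entered, a query word stored under a different sequence of code points would not be found by such an index.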
However, conventional search engines make an assumption that each character in a query string has a unique code point for a given encoding, and thus, each query string would have a unique set of code points. In character encoding terminology, a code point or code position is any of a set of numerical values that make up a code space or code page. For example, American Standard Code for Information Interchange (ASCII) comprises 128 code points in the hexadecimal range 0 to 7F, extended ASCII comprises 256 code points in the hexadecimal range 0 to FF, and Unicode® comprises 1,114,112 code points in the hexadecimal range 0 to 10FFFF. The Unicode® code space is divided into seventeen planes (the basic multilingual plane and 16 supplementary planes), each with 65,536 (=2^16) code points. Thus the total size of the Unicode® code space is 17×65,536=1,114,112. One example of a mapping between characters and code points is as follows: the word “cloud” is equivalent to “U+0063 U+006C U+006F U+0075 U+0064” in Unicode®.
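The character-to-code-point mapping and the code space arithmetic above can be checked with a short sketch. The helper name `to_code_points` is an illustrative assumption; `ord` is the built-in Python function that returns a character's code point.

```python
def to_code_points(text):
    """Render each character of text in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

PLANES = 17                  # basic multilingual plane + 16 supplementary planes
POINTS_PER_PLANE = 2 ** 16   # 65,536 code points per plane

code_space_size = PLANES * POINTS_PER_PLANE
highest_code_point = code_space_size - 1

print(to_code_points("cloud"))   # U+0063 U+006C U+006F U+0075 U+0064
print(code_space_size)           # 1114112
print(f"{highest_code_point:X}") # 10FFFF
```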
The notion of a code point is used for abstraction, to distinguish both the number from its encoding as a sequence of bits and the abstract character from a particular graphical representation (glyph).
These distinctions matter because one may wish to encode a particular code space in different ways or to display a character via different glyphs.
For Unicode®, the particular sequence of bits is called a code unit. In the UCS-4 encoding, each code point is encoded as a 4-byte (octet) binary number, a fixed-width scheme that is simple but space-inefficient. In the UTF-8 encoding, each code point is encoded as a sequence of one to four bytes, a variable-width scheme that is more space-efficient but more complex, and that is backward-compatible with ASCII. Code points are normally assigned to abstract characters. An abstract character is not a graphical glyph but a unit of textual data. The precise appearance of the character depends on the font. However, code points may also be left reserved for future assignment (most of the Unicode® code space is unassigned), or given other designated functions.
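The difference between the fixed-width and variable-width encodings can be seen in a minimal sketch. It assumes Python's `utf-32-be` codec as a stand-in for UCS-4 (both use four bytes per code point); the helper name `encoded_sizes` and the sample characters are illustrative choices.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-32 byte count) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-32-be"))

# Sample code points: U+0063, U+00E9, U+20AC, U+1F600.
samples = ["c", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    utf8_len, utf32_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} byte(s), UTF-32 = {utf32_len} bytes")

# Backward compatibility: ASCII text yields identical bytes under UTF-8.
ascii_compatible = "cloud".encode("utf-8") == "cloud".encode("ascii")
print(ascii_compatible)
```

The four-byte encoding uses the same width for every sample, while the UTF-8 width grows with the magnitude of the code point.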
A Unicode® text file is not necessarily merely a sequence of code points encoded into 4-byte blocks. Instead, an encoding scheme is used to serialize a sequence of code points into a sequence of bytes. A number of such encoding schemes exist, and these trade off space efficiency against ease of encoding. A variable number of bytes can be used for each character. For example, UTF-8 maintains some compatibility with ASCII. Encoding schemes also take into account endianness, that is, the ordering (such as big-endian or little-endian) of individually addressable sub-components within the representation of a larger data item as stored in external memory. An encoding scheme may also have the property of being a self-synchronizing code, meaning that character boundaries can be found without having to read from the beginning of the string.
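Endianness and self-synchronization can both be sketched briefly. The example assumes UTF-16 to show byte order and UTF-8 to show self-synchronization; the helper `char_start` is a hypothetical name, and it relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```python
# Endianness: the same code point U+0063 serialized in two byte orders.
big_endian = "c".encode("utf-16-be")      # high-order byte first
little_endian = "c".encode("utf-16-le")   # low-order byte first
print(big_endian.hex(), little_endian.hex())   # 0063 6300

def char_start(data, i):
    """Return the index of the first byte of the UTF-8 character covering data[i]."""
    # Step backward over continuation bytes (10xxxxxx) to reach a lead byte.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

# "a", U+20AC (three bytes in UTF-8), "b" -> byte indices 0, 1-3, 4.
data = "a\u20acb".encode("utf-8")
print(char_start(data, 3))   # 1: index 3 is inside the three-byte character
print(char_start(data, 4))   # 4: index 4 is already a character boundary
```

The boundary search inspects only the bytes near the given position, which is what allows a reader to resynchronize without scanning from the start of the string.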