Text processing is an important part of systems based on human language. Tokenization is the process of splitting a stream of input text into words, phrases, symbols, or other meaningful elements known as tokens. The resulting list of tokens becomes input for further processing, such as morphological analysis, parsing, text mining, indexing, and searching. Tokenization is useful both in linguistics, for text segmentation, and in computer science, for lexical analysis before interpretation or compilation.
Programming languages may provide split functions for tokenizing a given string into basic linguistic components. Such split functions may utilize a regular expression (e.g., “regex” or “regexp”), which is written in a formal language and can be interpreted by a regular expression processor. Some programming languages, such as Perl, have fully integrated regular expressions into their syntax. Other programming languages, such as C, C++, Java, and Python, instead provide access to regular expressions only through libraries.
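By way of illustration only, a regular-expression-based split function of the kind described above may be sketched in Python using the standard `re` library. The function name `split_tokens` and the pattern `\W+` (one or more non-word characters) are assumptions chosen for this example and are not taken from any particular system.

```python
import re

def split_tokens(text):
    """Split text on runs of non-word characters (hypothetical helper)."""
    # re.split can return empty strings at the edges of the input
    # (e.g. after a trailing period), so those are filtered out.
    return [token for token in re.split(r"\W+", text) if token]

print(split_tokens("Tokenization splits text, into tokens."))
# ['Tokenization', 'splits', 'text', 'into', 'tokens']
```

Here the regular expression is supplied to a library routine rather than being part of the language syntax, consistent with how Python exposes regular expressions.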
In such split functions, for example, tokenization may be achieved by identifying a set of separators known as delimiters. For identifying basic language words, delimiters may include white spaces and/or punctuation marks. However, using delimiters for tokenization may be overly complicated and inefficient with respect to storage and processing time, since the input text has to be parsed for each of the delimiters. In addition, current processes for tokenization may not be adapted for use with intermixed input text written in two or more languages.
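The delimiter-based approach described above may be sketched as follows. This is a deliberately naive illustration under stated assumptions, not a description of any specific existing implementation: the function name `tokenize_with_delimiters` and the sample delimiter list are hypothetical. The sketch makes one pass over the text for each delimiter, which illustrates the repeated parsing noted above.

```python
def tokenize_with_delimiters(text, delimiters):
    """Naive delimiter-based tokenizer (hypothetical, for illustration)."""
    pieces = [text]
    # One full pass over the current pieces for every delimiter,
    # building a new intermediate list each time.
    for delimiter in delimiters:
        next_pieces = []
        for piece in pieces:
            next_pieces.extend(piece.split(delimiter))
        pieces = next_pieces
    # Discard the empty strings left behind by adjacent delimiters.
    return [piece for piece in pieces if piece]

print(tokenize_with_delimiters("Hello, world! How are you?", [" ", ",", "!", "?"]))
# ['Hello', 'world', 'How', 'are', 'you']

# Intermixed text: the Japanese run contains no listed delimiter,
# so it is returned as a single unsegmented token.
print(tokenize_with_delimiters("I visited 東京タワー yesterday", [" "]))
# ['I', 'visited', '東京タワー', 'yesterday']
```

The second call shows the limitation for intermixed text: a delimiter set suited to one language may leave text in another language unsegmented.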
Accordingly, there exists a need in the art to overcome the deficiencies and limitations described hereinabove.