The Internet enables a user of a computer system to identify and communicate with millions of other computer systems located around the world. The entry and exchange of strings of characters between computer systems is integral to network communication. For example, a client computer system may identify each of these other computer systems using a unique numeric identifier called an Internet Protocol (“IP”) address. When a communication is sent from a client computer system to a destination computer system, the client computer system may specify the IP address of the destination computer system to facilitate routing of the communication. The Domain Name System (“DNS”) was developed so that users need not remember the numeric addresses of computers on the Internet. DNS resolves a unique alphanumeric domain name that is associated with a destination computer into the IP address for that computer. Thus, a user who wants to visit the Verisign website need only remember the domain name “verisign.com” rather than the IP address of the Verisign web server, such as 65.205.249.60. Strings of characters are likewise utilized in search requests, text messages, email communication, social media applications, and the like.
Currently, when a string of characters is entered, the string often does not include markers, e.g., spaces, that identify the words it contains. For example, a computer system or user may enter the string of characters “thisisadomain.com.” Because a domain name cannot contain spaces, the string includes no markers that identify its words. To process the string of characters, for example, for searching or domain name suggestion, the string must first be segmented into its component words, e.g., “this is a domain.” To address this, text segmentation or tokenization algorithms have been utilized to identify the words.
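For illustration only, a dictionary-based segmentation of the kind described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular existing algorithm: the toy word list VOCABULARY and the helper names first_label and segment are hypothetical, and the rule of preferring the segmentation with the fewest words is just one simple heuristic.

```python
from typing import List, Optional, Set

# Hypothetical toy vocabulary (an assumption for this sketch); a practical
# system would load a much larger word list.
VOCABULARY: Set[str] = {"this", "is", "a", "domain", "do", "main", "his"}


def first_label(name: str) -> str:
    """Return the lowercased text before the first dot.

    Assumes a name of the form 'label.tld', e.g. 'thisisadomain.com'.
    """
    return name.split(".", 1)[0].lower()


def segment(text: str, vocabulary: Set[str]) -> Optional[List[str]]:
    """Split text into vocabulary words, preferring the fewest words.

    Returns None when no sequence of vocabulary words covers the whole text.
    """
    n = len(text)
    # best[i] holds the shortest known segmentation of text[:i], if any.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            word = text[start:end]
            if prefix is None or word not in vocabulary:
                continue
            candidate = prefix + [word]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[n]


words = segment(first_label("thisisadomain.com"), VOCABULARY)
print(words)  # ['this', 'is', 'a', 'domain']
```

With this vocabulary, “this is a do main” is also a valid covering of the string; the fewest-words rule is what selects “this is a domain.”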
Currently, most segmentation or tokenization algorithms rely on dictionaries, “word fitness,” character co-occurrence, and other linguistic properties. These algorithms, however, may not be able to accurately identify the words in a string of characters. For example, if the string of characters “choosespain.com” is entered, linguistic-based algorithms may identify the words as “chooses” and “pain” or as “choose” and “spain” with equal likelihood, leaving no basis for preferring one segmentation over the other. Thus, there is a need for systems and methods that address and enhance the segmentation of strings of characters.
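The ambiguity in the “choosespain” example can be made concrete with a short sketch that enumerates every way a string can be covered by dictionary words. The four-word VOCABULARY and the function name all_segmentations are hypothetical and chosen only for illustration; the point is that a purely dictionary-based approach returns both candidates and offers no way to rank them.

```python
from typing import List, Set

# Hypothetical toy vocabulary (an assumption for this sketch).
VOCABULARY: Set[str] = {"choose", "chooses", "pain", "spain"}


def all_segmentations(text: str, vocabulary: Set[str]) -> List[List[str]]:
    """Return every way to split text into a sequence of vocabulary words.

    Plain recursion is used for clarity; it can take exponential time on
    long inputs, so it is suitable only for short strings like this one.
    """
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results: List[List[str]] = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in vocabulary:
            for rest in all_segmentations(text[end:], vocabulary):
                results.append([word] + rest)
    return results


candidates = all_segmentations("choosespain", VOCABULARY)
print(candidates)  # [['choose', 'spain'], ['chooses', 'pain']]
```

Both candidates consist of two valid dictionary words, so a simple heuristic such as preferring the fewest words cannot break the tie.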