As computing and communication technologies grow more sophisticated, the Internet plays a vital role in both our work and our private lives, and a huge number of software applications have emerged, and will continue to emerge, on the Internet. In particular, in keeping with the globalization of the world economy, Internet technologies have improved significantly over the past decade, and the amount of information exchanged on the Internet has increased dramatically. Under these circumstances, it has become a significant challenge to develop software applications that efficiently and promptly extract valid or relevant information from the information available on the Internet.
Natural language processing, as the term is used here, overlaps with information retrieval: it refers to the process of organizing the available information and capturing the relevant information therein according to certain user requirements. In some situations, natural language processing focuses on capturing the relevant information from the available information, whether organized or not; this narrower process is sometimes called an information search process or an information seeking process.
Different natural language processing methods are applied to information content that is organized in a given format. For example, a typical method searches for literature data in the information content using conventional retrieval tools based on bibliographies, abstracts and/or indexes. The distinct nature, characteristics and process of each retrieval tool allow the information search to be conducted from different perspectives. As a specific example, the information search may be carried out in either chronological order or reverse chronological order. A search in chronological order is sometimes costly and inefficient, whereas a search in reverse chronological order processes more recent information first, often leading to better search performance and search results. In addition to the typical method described here, a retrospective method is also applied to some information content. Information search in the retrospective method is highly targeted, because the method focuses on tracing and searching for the references given by an existing bibliography.
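The difference between the two search orders can be illustrated with a minimal sketch. The `Record` type, the `search` function, the `limit` parameter and the sample records are all hypothetical and exist only for illustration; they do not correspond to any particular retrieval tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A hypothetical bibliographic entry: publication year and title."""
    year: int
    title: str


def search(records, keyword, newest_first=False, limit=2):
    """Scan records in date order and return the first `limit` matches.

    newest_first=False scans in chronological order;
    newest_first=True scans in reverse chronological order.
    """
    ordered = sorted(records, key=lambda r: r.year, reverse=newest_first)
    hits = []
    for record in ordered:
        if keyword.lower() in record.title.lower():
            hits.append(record)
            if len(hits) == limit:
                break  # stop scanning once enough matches are found
    return hits


# Illustrative sample data (made up for this sketch).
RECORDS = [
    Record(1998, "Dialect identification with word lists"),
    Record(2005, "Indexing methods for literature data"),
    Record(2011, "Dialect identification for short texts"),
    Record(2016, "Neural dialect identification"),
]

if __name__ == "__main__":
    print([r.year for r in search(RECORDS, "dialect")])
    print([r.year for r in search(RECORDS, "dialect", newest_first=True)])
```

With the same keyword and the same limit, the chronological scan returns the oldest matching records, while the reverse chronological scan returns the most recent ones, which is the sense in which the latter gives priority to recent information.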
One particular field of natural language processing is concerned with recognizing the language in which specific information content is composed and/or compiled. Language recognition can become a significant challenge when two or more similar languages (such as Cantonese and Mandarin, or Southern Fujian dialect and Mandarin) may be in use. Such languages generally share a certain amount of common vocabulary, which makes it difficult for machines to determine which of them was actually used to compose the relevant information content. For example, many words are used in both Cantonese and Mandarin, so it is difficult to compile a purely Cantonese vocabulary list. In particular, as the cultures of Hong Kong and Taiwan continue to spread to mainland China, Mandarin Chinese borrows many Cantonese words, which further blurs the difference between the two languages and causes many errors in identifying whether information content is written in Mandarin Chinese or in Cantonese.
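The effect of shared vocabulary can be sketched with a naive word-list classifier. This is only an illustration of the difficulty, not a method described in this document; the tiny word lists and the `identify` function are assumptions chosen for the example, and real vocabulary lists would be far larger and less clean.

```python
# Tiny illustrative word lists (assumed for this sketch, not exhaustive).
CANTONESE = {"我", "你", "好", "人", "嘅", "唔", "佢", "冇"}
MANDARIN = {"我", "你", "好", "人", "的", "不", "他", "没"}

# Words found in both lists give no evidence for either language.
SHARED = CANTONESE & MANDARIN
CANTONESE_ONLY = CANTONESE - MANDARIN
MANDARIN_ONLY = MANDARIN - CANTONESE


def identify(tokens):
    """Guess the language by counting tokens unique to each word list.

    Returns "cantonese", "mandarin", or "ambiguous" when the counts tie,
    which includes the case where every token is shared or unknown.
    """
    cantonese_votes = sum(token in CANTONESE_ONLY for token in tokens)
    mandarin_votes = sum(token in MANDARIN_ONLY for token in tokens)
    if cantonese_votes > mandarin_votes:
        return "cantonese"
    if mandarin_votes > cantonese_votes:
        return "mandarin"
    return "ambiguous"


if __name__ == "__main__":
    print(identify(["佢", "唔", "好"]))  # contains Cantonese-only words
    print(identify(["他", "不", "好"]))  # contains Mandarin-only words
    print(identify(["我", "好"]))        # only shared words
```

In this sketch, every word that moves from the language-specific sets into `SHARED`, for instance through borrowing, shrinks the evidence available to the classifier and makes the "ambiguous" outcome more likely.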