1. Field of the Invention
The present invention relates generally to a computer-implemented method and system of text analysis for enabling automated text processing.
2. Description of Related Art
In their pursuit of greater efficiency and profitability, businesses are increasingly replacing manual systems with computer technology and automated systems. Improvements in text processing technology and the development of electronic networks such as the Internet now readily permit spoken, handwritten, and scanned text to be recognized, processed, and stored as computer-accessible data. In view of the high cost of manually processing the content of such text, it is desirable to use computer technology to automatically derive knowledge therefrom. However, the nearly infinite variety of written and spoken text has proved to be an obstacle to the development of automated systems for analyzing content and deriving information from text.
Prior art technologies for automated text analysis can generally be categorized as “upstream technologies” that address the complexities of language (such as linguistics and natural language processing), and “downstream technologies” that are directed to enabling computers to handle knowledge (such as machine understanding and artificial intelligence). These different technologies are usually applied in isolation from each other. This isolation has inhibited the overall potential for automated text analysis.
The prior art language and text analysis systems typically include a database module and a processing module. The database module contains definitions and/or semantic information corresponding to individual words. The processing module customarily performs a variety of processes upon the language or text to provide a simplified representation of the text that can be processed by a computer. Language or text analysis systems of this type are used by search engines and other information retrieval systems.
Certain prior art processing modules provide each word with a semantic tag and are therefore referred to as “taggers”. Processing modules can also be used to decompose a stream of text into individual sentences, fragments, and words. These individual words are sometimes referred to as “tokens,” and this analysis step is referred to as “tokenization”. Following tokenization, the stream of text can be subjected to further semantic or linguistic processing such as identification of basic units of grammar, subdivision into corresponding fragments, and application of higher level algorithms.
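By way of illustration only, the decomposition of a text stream into sentences and tokens described above can be sketched as follows. The function names and the splitting rules are assumptions chosen for brevity; they are deliberately naive and do not represent any particular prior art tokenizer.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Return words and individual punctuation marks as separate tokens."""
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", sentence)


sentences = split_sentences("The service was slow. I left!")
tokens = [tokenize(s) for s in sentences]
```

A rule of this kind would mis-split abbreviations such as "Dr. Smith", which is one reason practical systems rely on more elaborate processing.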
The prior art language and text analysis systems are subject to several known disadvantages. Some prior systems require symbolic representations of the words and/or tokens. Many prior systems are characterized by excessive and unnecessary levels of processing. Furthermore, to analyze text, many prior art systems require an understanding of the precise meaning of the language or text.
The prior art language and text analysis systems cannot readily be configured to automatically determine such information as the relevance, weight or quantification of language or text. In particular, the prior art systems are not effective in deriving such information in correlation with the underlying purpose of the text. A system capable of doing so would be particularly suitable and advantageous in automatically processing data acquired in response to a particular inquiry, such as survey results.
It would be an advantage to provide a method and system for automatically analyzing text. It would be a further advantage if such method and system were available to automatically convert text into a format that could be further automatically processed to derive information regarding text content.
The invention is a method and system for text analysis. In the invention, a computer is used to analyze, parse, and manipulate natural language text according to a series of specific steps. Text is decomposed into small, homogeneous segments that can be readily correlated to one another, to quantitative data, or to a knowledge database. The invention thereby enables the automated processing, analysis, and comparison of differing text streams to derive information and conclusions therefrom, and/or to build new or add to existing knowledge databases.
In the preferred embodiment of the present invention, the words of an input text are labeled with semantic tags. In one embodiment, the input text is acquired from a response to one or more input requests or prompts, such as a survey. A series of operations is then performed on the semantically labeled input text. These operations can include splitting text, translating idioms, combining text, editing word tags, deleting unnecessary or superfluous words, identifying phrases or combinations, and rearranging expressions.
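A minimal sketch of such a sequence of operations is given below, assuming a toy lexicon, a single idiom, and a hypothetical tag set (DET, NOUN, VERB, ADJ, UNK). None of these names or tables is taken from the invention; they serve only to show tagging followed by idiom translation and deletion of superfluous words.

```python
# Hypothetical lexicon, idiom table, and tag names, chosen for illustration.
LEXICON = {"the": "DET", "a": "DET", "staff": "NOUN", "was": "VERB"}
IDIOMS = {("top", "notch"): ("excellent", "ADJ")}
SUPERFLUOUS_TAGS = {"DET"}


def tag(words):
    """Label each word with a tag; unknown words receive 'UNK'."""
    return [(w, LEXICON.get(w, "UNK")) for w in words]


def translate_idioms(tagged):
    """Replace each known idiom with a single tagged word."""
    out, i = [], 0
    while i < len(tagged):
        for idiom, replacement in IDIOMS.items():
            words = tuple(w for w, _ in tagged[i:i + len(idiom)])
            if words == idiom:
                out.append(replacement)
                i += len(idiom)
                break
        else:  # no idiom starts at position i
            out.append(tagged[i])
            i += 1
    return out


def delete_superfluous(tagged):
    """Drop words whose tag marks them as unnecessary for analysis."""
    return [(w, t) for w, t in tagged if t not in SUPERFLUOUS_TAGS]


def analyze(text):
    tagged = tag(text.lower().split())
    return delete_superfluous(translate_idioms(tagged))


segment = analyze("The staff was top notch")
```

In this sketch each operation takes and returns a list of (word, tag) pairs, so operations can be reordered or extended without changing the others.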
Text fragments are portions of the input text that are obtained as an output of any intermediate step of the present invention. The combination of words that is generated at the completion of the text analysis is a segment that can then be further processed, for example, by a computer to derive statistical information, to generate a report, or to build a knowledge database.
In another embodiment, the various operations to be performed upon the text portions comprise the steps of searching the text for particular combinations of words and/or tags, and changing the combination according to a corresponding prescription or rule. In yet another embodiment, the step of providing each word with a semantic tag can be accomplished using a commercially available tagging program, such as CLAWS, developed by the University of Lancaster, England.
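The search-and-change step described above can be illustrated by the following sketch, which scans tagged text for a given sequence of tags and rewrites each match according to a rule. The function `apply_rule`, the rule `merge_negation`, and the tags NEG and ADJ are hypothetical and are not drawn from CLAWS or from any specific embodiment.

```python
def apply_rule(tagged, tag_pattern, rewrite):
    """Replace every run of words whose tags equal tag_pattern.

    tagged      -- list of (word, tag) pairs
    tag_pattern -- non-empty tuple of tags to search for
    rewrite     -- function mapping the matched pairs to replacement pairs
    """
    if not tag_pattern:
        raise ValueError("tag_pattern must not be empty")
    n = len(tag_pattern)
    out, i = [], 0
    while i < len(tagged):
        window = tagged[i:i + n]
        if tuple(t for _, t in window) == tag_pattern:
            out.extend(rewrite(window))
            i += n
        else:
            out.append(tagged[i])
            i += 1
    return out


def merge_negation(window):
    """Example rule: combine a negation and an adjective into one word."""
    (neg, _), (adj, _) = window
    return [(f"{neg}_{adj}", "ADJ")]


tagged = [("staff", "NOUN"), ("was", "VERB"), ("not", "NEG"), ("helpful", "ADJ")]
result = apply_rule(tagged, ("NEG", "ADJ"), merge_negation)
```

Rules that match on words as well as tags could be supported by comparing the full (word, tag) pairs instead of the tags alone.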
In a further embodiment, an initial preparation step can be performed upon the roughly separable text portions; this initial step can be done prior to the other recited steps, such as the step of providing words with semantic tags. This initial preparation step may include spell-checking, character replacement, parsing the roughly separable text into smaller preliminary fragments and/or a variety of other cleaning operations. This step may have, as one purpose, the effect of formatting the stream of text to fit a set of prescribed parameters for a commercially available tagging program.
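One possible form of such a preparation step is sketched below, assuming a small character replacement table, a small spelling correction table, and a simple fragment splitting rule. All three are illustrative assumptions; the actual parameters required by a given tagging program are not specified here.

```python
import re

# Hypothetical tables, chosen for illustration only.
CHAR_REPLACEMENTS = {"\u2019": "'", "\u201c": '"', "\u201d": '"'}
SPELLING_FIXES = {"recieve": "receive", "teh": "the"}


def prepare(text):
    """Clean raw text and split it into preliminary fragments."""
    # Character replacement: map typographic characters to plain ones.
    for old, new in CHAR_REPLACEMENTS.items():
        text = text.replace(old, new)
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Naive spell-check: only whole words without attached punctuation
    # are corrected, and corrected words come out in lower case.
    words = [SPELLING_FIXES.get(w.lower(), w) for w in text.split(" ")]
    text = " ".join(words)
    # Split at semicolons and after sentence-ending punctuation.
    fragments = re.split(r";|(?<=[.!?])\s", text)
    return [f.strip() for f in fragments if f.strip()]


fragments = prepare("Teh staff  was great; I didn\u2019t recieve a bill.")
```

The resulting fragments can then be passed, one at a time, to the tagging step.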