The present invention relates to an apparatus, a method, and a program for analyzing a document including a text-based visual representation such as so-called ASCII art.
Large amounts of data are analyzed to extract various pieces of useful information. One technique of this type is to analyze an electronic document (text data) in order to evaluate topics or matters of concern. In general, natural language processing such as morphological analysis or syntax analysis is performed to analyze a document.
On the other hand, an electronic document acquired from the Internet or the like may contain, in addition to normal text, a text-based visual representation called ASCII art or text art. In such a visual representation, information is conveyed by the appearance of the arrangement of characters or symbols, and the individual characters or symbols used carry little meaning in themselves. Processing designed for general document analysis may therefore fail to extract appropriate information from such a representation. Consequently, the text-based visual representation has conventionally been separated from the electronic document, and natural language processing has been performed only on the remaining parts (the text other than the visual representation) in order to analyze the content.
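The separation step described above can be illustrated with a minimal sketch. The conventional technique is not specified in detail here, so the following is only one assumed heuristic: a line is treated as part of a visual representation when most of its non-space characters are neither letters nor digits. The function names `symbol_ratio` and `split_text_and_art` and the threshold of 0.5 are hypothetical and chosen for illustration only.

```python
def symbol_ratio(line):
    """Fraction of non-space characters that are neither letters nor digits."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if not c.isalnum()) / len(chars)


def split_text_and_art(document, threshold=0.5):
    """Split a document into (text_lines, art_lines).

    Assumption: a line whose symbol ratio is at or above `threshold`
    belongs to a text-based visual representation. The threshold is
    an illustrative value, not one taken from any known method.
    """
    text_lines, art_lines = [], []
    for line in document.splitlines():
        if symbol_ratio(line) >= threshold:
            art_lines.append(line)
        else:
            text_lines.append(line)
    return text_lines, art_lines
```

Only `text_lines` would then be passed to natural language processing such as morphological analysis. A line-level heuristic of this kind can misclassify symbol-heavy ordinary text, so it should be read as a sketch of the idea rather than a practical separator.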
Information has also been extracted from the text-based visual representation itself. For example, in one conventional technique, a dictionary is prepared in advance in which each character/symbol string registered as a visual representation is associated with the content (meaning) that it represents, and the dictionary is used to extract information from the visual representation part of the electronic document.
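The dictionary-based technique can be sketched as follows. The entries in `VISUAL_DICTIONARY` and the function name `extract_visual_meanings` are hypothetical examples introduced for illustration; an actual dictionary would be prepared separately and would be far larger.

```python
# Hypothetical dictionary: visual-representation string -> meaning.
VISUAL_DICTIONARY = {
    "(^_^)": "smile",
    "(T_T)": "crying",
    "m(_ _)m": "apology",
}


def extract_visual_meanings(text, dictionary=VISUAL_DICTIONARY):
    """Return (position, string, meaning) for each dictionary entry found in text."""
    hits = []
    for entry, meaning in dictionary.items():
        start = 0
        while True:
            pos = text.find(entry, start)
            if pos == -1:
                break
            hits.append((pos, entry, meaning))
            # Continue searching after this occurrence.
            start = pos + len(entry)
    # Report hits in order of appearance in the text.
    return sorted(hits)
```

Because matching is exact, a string that is not listed in advance, such as a variant of a registered entry, yields no result in this sketch.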