With the rapid growth of the Internet and World Wide Web, there has been an equally rapid growth in the number of textual listings available. As used herein, a textual listing is a comparatively short text (typically around 10 words in length) that often, but not always, has a highly idiosyncratic and/or ungrammatical style. Such listings are usually informally produced and therefore often contain typos or rely heavily on abbreviations. Typical examples include classified ads, product listings, tenders, etc., as shown in Table 1 below.
TABLE 1
DOMAIN | EXAMPLE
Rental housing advertisement | 2x2 Quiet Cozy Charming Single Family Home - 1515 Martin Avenue
Used sports goods | Nike Baseball Gloves Black 12″ LHT
Deal information for electronics goods | Panasonic 32″ 1080p LCD TV - $329 @ Best-Buy
As the number of textual listings is expected to continue growing, it is understood that improvement in a machine's reasoning capability will be strongly tied to the ability to extract information from such listings. For example, consider an online shopping site listing a wide variety of information about offered merchandise; detecting the brands, styles and features that are frequently mentioned in the postings would allow a company to design a better marketing strategy. To this end, it is known in the art to develop so-called semantic models in which symbols (e.g., words or tokens) are stored along with information about what those symbols mean in the “real world.” In effect, using such semantic models, machines are able to understand the data being processed and, therefore, perform such processing more efficiently, more accurately and with less human intervention.
Most information extraction techniques developed for formal texts, however, are inapplicable to textual listings because of the informal and idiosyncratic styles of such listings. To address these challenges, several approaches have been proposed that apply machine learning algorithms or an external knowledge base. These approaches, however, commonly require human supervision to produce training data or to build the knowledge base. An example of such a system 100 is illustrated in FIG. 1. As shown, an information extraction component 102 operates upon a text corpus 108 based on a form of a semantic model comprising dictionaries 104 and rules 106 that are generated through user input 110, i.e., using manual assessment of at least some portion of the text corpus 108. For example, the dictionaries 104 typically include semantic data for specific words (e.g., “Companies=Samsung, LG, Sony, Apple . . . ”) whereas the rules 106 set forth specific patterns associated with the information of interest (e.g., “Company=‘manufactured by ______’, ‘______ is a company’, ‘companies including ______’,” etc.). Being manually generated through analysis of the text corpus 108, such dictionaries and rules are expensive to develop. Substantially multiplying this expense is the fact that these efforts must be repeated for each new domain or set of information to be analyzed.
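By way of illustration only, the kind of manually built dictionaries 104 and rules 106 described above can be sketched in Python as follows. The names DICTIONARIES, RULES and extract, the regular expressions, and the sample sentences are all hypothetical and chosen solely for this sketch; they are not taken from the system 100 of FIG. 1.

```python
import re

# Hypothetical hand-built dictionary: semantic class -> known member words.
DICTIONARIES = {
    "Company": {"samsung", "lg", "sony", "apple"},
}

# Hypothetical hand-written rules: semantic class -> patterns whose single
# captured word is taken to be an instance of that class.
RULES = {
    "Company": [
        re.compile(r"manufactured by (\w+)", re.IGNORECASE),
        re.compile(r"(\w+) is a company", re.IGNORECASE),
        re.compile(r"companies including (\w+)", re.IGNORECASE),
    ],
}


def extract(text):
    """Return a sorted list of (semantic_class, matched_word) pairs found in text."""
    found = set()
    # Dictionary lookup: tag any token that is a known member of a class.
    for token in re.findall(r"\w+", text):
        for label, words in DICTIONARIES.items():
            if token.lower() in words:
                found.add((label, token))
    # Rule matching: tag the word captured by each hand-written pattern.
    for label, patterns in RULES.items():
        for pattern in patterns:
            for match in pattern.finditer(text):
                found.add((label, match.group(1)))
    return sorted(found)


formal = extract("This TV is manufactured by Panasonic and sold next to Sony sets.")
listing = extract('Panasonic 32" 1080p LCD TV - $329 @ Best-Buy')
```

In this sketch, the formal sentence yields both “Panasonic” (through a rule) and “Sony” (through the dictionary), whereas the listing from Table 1 yields nothing, because it contains neither a dictionary entry nor any of the grammatical patterns the rules rely upon. Covering such listings would require further manual additions to the dictionaries and rules for each new domain.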
Thus, it would be preferable to provide techniques that permit the rapid and accurate development of semantic models based on textual listings, while minimizing the need for human input in the development of such semantic models.