A. Technical Field
The present invention relates to creation of structured data from plain text, and more particularly, to creation of structured data from plain text based on attributes or parameters of a web-site's content or products.
B. Background of the Invention
In recent years, the Internet has grown at an explosive pace. More and more information, goods, and services are being offered over the Internet. This increase in the data available over the Internet has made it increasingly important that users be able to search through vast amounts of material to find information that is relevant to their interests and queries.
The search problem can be described on at least two levels: searching across multiple web-sites; and searching within a given site. The first level of search is often addressed by “search engines” such as Google™ or Alta Vista™ or directories such as Yahoo™. The second level, which is specific to the content of a site, is typically handled by combinations of search engines and databases. This approach has not been entirely successful in providing users with efficient access to a site's content.
The problem in searching a website or other information-technology based service is composed of two subproblems: first, indexing or categorizing the corpora (body of material) to be searched (i.e., content synthesis), and second, interpreting a search request and executing it over the corpora (i.e., content retrieval). In general, the corpora to be searched typically consist of unstructured information (text descriptions) of items. For e-commerce web-sites, the corpora may be the catalog of the items available through that web-site. For example, the catalog entry for an item might well be the sentence “aqua cashmere v-neck, available in small, medium, large, and extra large.” Such an entry cannot be retrieved by item type or attribute, since the facts that v-neck is a style of sweater, cashmere a form of wool, and aqua a shade of blue, are unknown to current catalogs or search engines. In order to retrieve the information that this item is available, by item type and/or attribute, this description must be converted into an attributed, categorized description. In this example, such an attributed, categorized description may include properly categorizing the item as a sweater, extracting the various attributes, and tagging their values. An example of such a description is illustrated in Table 1.
TABLE 1
Item    | Style  | Color | Material | Sizes
Sweater | v-neck | Aqua  | Cashmere | S, M, L, XL
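By way of illustration only, the attributed, categorized description of Table 1 may be sketched as a structured record that can be retrieved by item type and attribute. The names CatalogItem, BROADER_TERM, and matches are hypothetical and do not appear in any existing catalog system; the mapping of aqua to blue and cashmere to wool stands in for the background knowledge that a plain-text entry lacks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogItem:
    """One attributed, categorized catalog entry (the columns of Table 1)."""
    item: str
    style: str
    color: str
    material: str
    sizes: tuple


# Assumed background knowledge: narrower term -> broader term.
BROADER_TERM = {"aqua": "blue", "cashmere": "wool"}


def matches(record, item, color, material):
    """Return True if the record is the requested item type and its color
    and material equal the requested values or are narrower forms of them."""
    def fits(value, wanted):
        value = value.lower()
        return value == wanted or BROADER_TERM.get(value) == wanted

    return (record.item.lower() == item
            and fits(record.color, color)
            and fits(record.material, material))


# The entry "aqua cashmere v-neck, available in small, medium, large,
# and extra large" after conversion to structured form.
TABLE_1 = CatalogItem("Sweater", "v-neck", "Aqua", "Cashmere",
                      ("S", "M", "L", "XL"))
```

Under these assumptions, a request for a blue wool sweater, matches(TABLE_1, "sweater", "blue", "wool"), succeeds even though none of the words “blue,” “wool,” or “sweater” appears in the original text description.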
Current technology permits such representations in databases. Further, for many standard items, numeric codes are assigned to make the job of search and representation easier. One such code is the UN Standard Products and Services Code (UN/SPSC), which assigns a standard 8-digit code to any human product or service.
However, while the taxonomies and the technology to represent the taxonomies may exist, conventional systems are unable to generate the taxonomic and attributed representation for an object from its textual description. This leads to the first of the two problems outlined above: the content synthesis problem. More specifically, that is the problem of how to convert plain text into structured objects suitable for automated search and other computational services.
The second problem is one of retrieving data successfully; once the data has been created and attributed, it must be accessible. E-commerce and parametric content sites are faced with a unique challenge, since they must offer search solutions that expose only those products, contents or services that exactly match a customer's specifications. Today, more than 50% of visitors use search as their preferred method for finding desired goods and services. However, as e-commerce web sites continue to offer their customers unmatched variety, category-based navigation of e-commerce sites (“virtual aisles”) has become increasingly complex and inadequate. In particular, many web-sites that offer a large catalog of products are often unable to find products with precise or highly parameterized specifications, and instead require the user to review dozens of products that potentially match these specifications.
A few statistics help to emphasize the importance of good searching ability. An important metric that measures the conversion rate of visitors to e-commerce sites into buyers is the book-to-look ratio. The industry average is that only 27 visitors in 1,000 make a purchase. The biggest contributor to this abysmal ratio is failed search. Forrester Research reports that 92% of all e-commerce searches fail. Major sites report that 80% of customers leave the site after a single failed search. Therefore, improving the search capability on a site directly increases revenue through increased customer acquisition, retention, and sales.
While all web-sites experience some form of these search problems to some extent, the problem is particularly acute for web-sites with a deep and rich variety of content or products. Examples are electronic procurement networks, financial sites, sporting goods stores, grocery sites, clothing sites, electronics, software, and computer sites, among many others. Another class of sites with a deep search problem comprises those carrying highly configurable products, such as travel and automotive sites. Ironically, as a rule of thumb, the more a web-site has to offer, the greater the risk that customers will leave the site because of a failed search.
When a customer physically enters a large department store, she can ask a clerk where she can find what she is looking for. The clerk's “search” is flexible in that he can understand the customer's question almost no matter how it is worded. Moreover, the clerk's “search” is generally accurate since the clerk can often specifically identify a product, or initial set of products, that the customer needs. Searches on web sites need to be equally flexible and accurate. In order for that to happen, a visitor's request must be understood not only in terms of the products, but also in terms of the request's parameters or characteristics. However, conventional information retrieval systems for web-site content have been unable to achieve this.
Some of the methods conventionally used to find goods and services on web sites, and some problems with these conventional methods, are outlined below:
1. Keyword-based search: In this method, users type a set of words or phrases describing what they want into a text box, typically on the main page of the site. A program on the site then takes each individual word entered (sometimes discarding “noise” words such as prepositions and conjunctions), and searches through all pages and product descriptions to find items containing any combination of the words. This method, when given an English sentence or phrase, returns either far too many results or too few. For example, if a customer requests, “show me men's blue wool sweaters,” the search could be unsuccessful for the following reasons. It would either return only those pages that contain all the words in this request, or return any page that contained any single word in the search. In the former case, no items would be found, though there might be many products with those characteristics for sale. For instance, it is possible that an aqua cashmere cardigan would not be matched, since its description contains none of the keywords. In the latter case, a large number of items would be found, most of which would be of no interest to the customer. For example, blue wool slacks may be incorrectly matched, since the description contains the keywords “blue” and “wool.” Some keyword-based searches weight results based on how many keywords are matched.
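The two failure modes described above may be illustrated with a minimal sketch. The three-entry catalog, the noise-word list, and the function names search_all and search_any are assumptions made for illustration only and do not correspond to any particular site's search program.

```python
# Assumed noise words; real systems typically discard many more.
NOISE_WORDS = {"show", "me"}

# Hypothetical plain-text product descriptions.
CATALOG = [
    "aqua cashmere cardigan",
    "blue wool slacks",
    "red cotton shirt",
]


def tokenize(text):
    """Lowercase, drop possessive 's, split on whitespace, remove noise words."""
    words = text.lower().replace("'s", "").split()
    return {w for w in words if w not in NOISE_WORDS}


def search_all(query, catalog):
    """Return descriptions containing every keyword of the query."""
    words = tokenize(query)
    return [d for d in catalog if words <= tokenize(d)]


def search_any(query, catalog):
    """Return descriptions containing at least one keyword of the query."""
    words = tokenize(query)
    return [d for d in catalog if words & tokenize(d)]


QUERY = "show me men's blue wool sweaters"
```

With this data, search_all(QUERY, CATALOG) finds nothing, while search_any(QUERY, CATALOG) returns only the blue wool slacks. In neither case is the aqua cashmere cardigan retrieved, since literal keyword matching has no knowledge that aqua is a shade of blue or that cashmere is a form of wool.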
Keyword-based approaches are widely used in medical transcription applications, database access, voice-mail control and web search. Virtually all commercial natural-language interface products use this approach. In this approach, certain words are regarded as meaningful, and the remainder as meaningless “glue” words. Thus, for example, in the sentence “show all books written by Squigglesby” the words “show,” “book,” and “written” may be regarded as keywords, the word “by” as a meaningless glue word, and the word “Squigglesby” as an argument. The query would then be formed on the theory that a book author named Squigglesby was being requested.
In such systems, keywords are generally some of the common nouns, verbs, adverbs and adjectives, and arguments are proper nouns and numbers. There are exceptions, however. Prepositions are usually regarded as glue words, but in some circumstances and in some systems are regarded as keywords. Generally, this is due to the human tendency to omit words in sentences, known in the argot as “ellipses.” The sentence “Show all books by Squigglesby” is an example of this, where the verb “written” is excluded. In order to cope with this, some keyword-based systems make “by” a keyword.
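The division of a sentence into keywords, glue words, and arguments may be sketched as follows. The word lists and the function name classify are hypothetical; in particular, treating “all” as a glue word is an assumption made only for this example.

```python
# Assumed word lists for a book-lookup application.
KEYWORDS = {"show", "book", "books", "written"}
GLUE_WORDS = {"all", "by", "the", "a", "of"}


def classify(sentence, keywords=KEYWORDS, glue=GLUE_WORDS):
    """Assign each word to keywords, glue words, or arguments.

    Words in neither list (proper nouns, numbers) are treated as arguments.
    Keywords take precedence over glue words when a word is in both sets.
    """
    result = {"keywords": [], "glue": [], "arguments": []}
    for word in sentence.split():
        token = word.strip(".,?!")
        lowered = token.lower()
        if lowered in keywords:
            result["keywords"].append(lowered)
        elif lowered in glue:
            result["glue"].append(lowered)
        else:
            result["arguments"].append(token)
    return result


full = classify("show all books written by Squigglesby")

# Coping with the omitted verb by promoting "by" to a keyword.
elided = classify("Show all books by Squigglesby",
                  keywords=KEYWORDS | {"by"})
```

In the second call the preposition is no longer discarded, but its meaning is still fixed in advance: the sketch has no way to decide from context whether “by” denotes an author, a publisher, or something else.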
There are a few specialized cases of, or variations on, keyword searches. Database approaches are an example of a widely used variant on keyword-based approaches. In these systems, the database developer associates keywords or identifiers with specific database fields (columns in specific tables). Various words, specifically interrogative pronouns and adjectives, some verbs, and some prepositions, have fixed meanings to the database query program. All other words can be available as keywords for a template-based recognition system. In response to a user's sentence, the interface system may match the user's sentence to a template set constructed from the database developer's information about database structure and identifiers, and its built-in interpretation of its hardwired keywords. A Structured Query Language (SQL) statement would then be generated which encodes the meaning of the user's sentence, as interpreted by the interface system.
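A minimal sketch of such template matching and SQL generation is given below. The single template, the FIELD_MAP association of identifiers with database fields, and the books table are all hypothetical. The table and column names come only from the developer-supplied map, and the user's value is passed as a bound parameter rather than spliced into the statement.

```python
import re
import sqlite3

# Hypothetical developer-supplied association: identifier -> (table, column).
FIELD_MAP = {
    "author": ("books", "author"),
    "publisher": ("books", "publisher"),
}

# One hypothetical template: "show all books by <identifier> <value>".
TEMPLATE = re.compile(r"^show all books by (author|publisher) (.+)$",
                      re.IGNORECASE)


def to_sql(sentence):
    """Match the sentence against the template and build a parameterized
    SQL statement, or return None if no template matches."""
    m = TEMPLATE.match(sentence.strip())
    if m is None:
        return None
    table, column = FIELD_MAP[m.group(1).lower()]
    return f"SELECT title FROM {table} WHERE {column} = ?", (m.group(2),)


# Small in-memory database used only to exercise the generated statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT, publisher TEXT)")
conn.execute("INSERT INTO books VALUES ('Knots', 'Squigglesby', 'Acme')")

sql, params = to_sql("show all books by author Squigglesby")
rows = conn.execute(sql, params).fetchall()
```

A sentence that fits no template, such as “who wrote Knots,” yields no statement at all, which reflects how rigidly such an interface depends on its predefined templates.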
Another example of a specialization of the keyword-based approach is a catalog-based approach. Catalogs are databases of products and services. A “category” is the name of a table: the attributes of the category are some columns of the table. In this approach, a question is first searched by a category word, and then the remainder of the question is used as keywords to search for matching items within the category. For example, “blue woolen sweater” would first search for “blue” “woolen” and “sweater” as keywords indicating a category, and then (assuming “sweater” succeeded as a category keyword and the others did not), for “blue” and “woolen” as keywords within the sweater category. The difficulty with this approach is that cross-category queries fail, since no individual category is available to match in such cases. Further, parameters that are not present in the product descriptions in the category are not used.
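The two-stage lookup described above may be sketched as follows. The catalog contents and the function name catalog_search are assumptions made for illustration.

```python
# Hypothetical catalog: category word -> plain-text item descriptions.
CATALOG = {
    "sweater": ["blue woolen sweater", "aqua cashmere v-neck sweater"],
    "slacks": ["blue woolen slacks"],
}


def catalog_search(question, catalog=CATALOG):
    """First find a category word, then match the remaining words as
    keywords against the descriptions within that category."""
    words = question.lower().split()
    category = next((w for w in words if w in catalog), None)
    if category is None:
        # No single category matched, so the query fails outright.
        return []
    rest = [w for w in words if w != category]
    return [d for d in catalog[category]
            if all(w in d.split() for w in rest)]
```

Here catalog_search("blue woolen sweater") finds the one matching sweater, but catalog_search("blue woolen clothing") returns nothing, because no individual category matches, even though items in two categories fit the request.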
Some of the central limitations of keyword-based systems are described below:
Meanings of words are fixed, independent of context. In keyword-based systems, keywords have fixed semantics. This is a distinct departure from the use of normal language by humans. Words in natural language derive their meaning through a combination of “symbol” (the word itself) and “context” (the surrounding text and background knowledge). The most glaring example is prepositions in the presence of ellipses. For instance, “by” can indicate the subject of almost any transitive verb, as well as physical proximity or an object or method used to accomplish a particular task. Another example of meaning dependent on context is that “green” can refer to a color, a state of freshness or newness, or, disparagingly, to inexperience. A quick glance at any page of any dictionary will show that most words have multiple, and often unrelated, meanings, and context is what disambiguates them. Contrary to this nuanced usage of words, in general, keyword-based approaches choose one single meaning for each word, and apply that meaning consistently in all searches. This problem is fundamentally unfixable in these systems: in order to attach a contextual semantic to a word, strong parsing technology is required and a means must be found of specifying a word in context, sufficient for a program to understand the contextual meaning.
Strongly tied to an application. Since the meanings of words must be fixed so strongly, these systems have the interface strongly tied to (and, in general, inseparable from) the application. There is no toolkit comparable to the popular Graphical User Interface (“GUI”) toolkits to form a keyword-based natural-language interface to an arbitrary application.
Missed meanings attached to glue words, especially prepositions. An assumption behind keyword-based approaches is that glue words carry no meaning or semantic content. Unfortunately, in practice there are very few words whose meanings are always unimportant. The words chosen as glue words are those whose meaning is most context-dependent, and thus their semantic content is largely missed.
High error rates, non-robust. Since meanings are attached to words independent of context, meanings can often be guessed wrong. For example, one vendor in this space, Linguistic Technology Corporation, distributes a product (“EnglishWizard”) that permits database users to ask questions of a database. A demonstration is given with a database of purchasers, employees, sales, and products. In this example database, numbers always refer to the number of employees. This produces a sequence where, when a user asks “who purchased exactly two items,” the answer is “no one.” However, when a user asks how many items a particular individual purchased, the answer is “two.” The reason for the discrepancy could be that EnglishWizard did not really understand the question. Instead, the first user question was mapped to a question about employees since it included a number in it.
2. FREE-FORM KEYWORD SEARCH: This category replaces keywords with previously-asked questions and the “right” answers, and returns the answers to the typed-in question. Examples of such systems are described in detail in U.S. Pat. No. 5,309,359, entitled “Method and Apparatus for Generating and Utilizing Annotations to Facilitate Computer Text Retrieval,” issued on May 3, 1994 to Katz, et al., and U.S. Pat. No. 5,404,295, entitled “Method and Apparatus for Utilizing Annotations to Facilitate Computer Retrieval of Database Material,” issued on Apr. 4, 1995 to Katz, et al. In systems employing free-form keyword searching, questions and answers are stored as sets. The question is typically stored in a canonical form, and a rewrite engine attempts to rewrite the user question into this form. If the user question maps into a pre-determined question for which the answer is known, then the answer is returned by the system. Such an approach is used by http://www.AskJeeves.com for Web searching applications, and for lookups of frequently-asked questions (FAQs).
Such systems have several limitations, including the following:
A relatively small number of questions can be answered: The number of questions that can be answered is linearly proportional to the number of questions stored; thus, this method can only be used when it is acceptable to have a relatively small number of questions that can be answered by the system.
Cannot directly answer a user's question: Since such a system processes a user question in toto, and does not attempt to parse it or extract information from the parts, it cannot be used where the solution to the user question requires the use of a parameter value that can be extracted from the question. In sum, the system can merely point the user at a page where his question can be answered; it cannot directly answer the user question.
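The rewrite-and-lookup behavior of a free-form keyword search, and the second limitation in particular, may be illustrated with the following sketch. The stored question, the rewrite patterns, and the function names canonicalize and answer are hypothetical and are not taken from the cited patents or from any deployed system.

```python
import re

# Hypothetical stored question/answer set, keyed by canonical question.
ANSWERS = {"what is the return policy": "See the returns page."}

# Hypothetical rewrite rules mapping phrasings to a canonical question.
REWRITES = [
    (re.compile(r"^(how do i|how can i) return (an item|something)$"),
     "what is the return policy"),
    (re.compile(r"^can i return (an item|something)$"),
     "what is the return policy"),
]


def canonicalize(question):
    """Normalize case, punctuation, and spacing, then apply the first
    rewrite rule that matches the whole question."""
    q = re.sub(r"[^a-z0-9 ]", "", question.lower())
    q = re.sub(r"\s+", " ", q).strip()
    for pattern, canonical in REWRITES:
        if pattern.match(q):
            return canonical
    return q


def answer(question):
    """Return the stored answer for the canonical form, or None."""
    return ANSWERS.get(canonicalize(question))
```

In this sketch, answer("How do I return an item?") returns the stored answer, whereas answer("Can I return item 4417?") returns None: the question is handled as a whole, so the item number is never extracted and cannot be used to look anything up.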
3. UNDERSTANDING-BASED SEARCHES: Systems incorporating understanding-based searches attempt to understand the actual meaning of a user's request, including social and background information. An example of such a system is Wilensky's UNIX-based Help system, UC. UC had built into it a simple understanding of a user's global goals. Wilensky explained that a consequence of not having such a deep understanding was that the system might offer advice, which literally addressed the user's immediate question in a way that conflicted with the user's global goals. A specific example is that a request for more disk space might result in the removal of all the user's files—an action that met the immediate request, but probably not in a way that the user would find appropriate.
Understanding-based systems are generally confined to conversational partners, help systems, and simple translation programs. In general, it should be noted that the underlying application is quite trivial; in fact, the interface is the application. Various specialized systems have also been built to parse specific classes of documents. A good example is Junglee's resume-parser. Researchers in this area have now largely abandoned this approach. Indeed, the academic consensus is that full understanding is “AI-complete”: a problem that requires a human's full contextual and social understanding.
There have been multiple previous attempts to use natural language as a tool for controlling search and computer programs. One example of these is Terry Winograd's “Planner” system, which was described in his 1972 doctoral thesis. Winograd developed an abstract domain for his program, called the “Blocks World.” The domain consisted of a set of abstract three-dimensional solids, called “blocks,” and a set of “places” on which the blocks could rest. Various blocks could also rest on top of other blocks. Planner would accept a variety of natural language commands corresponding to the desired states of the system (e.g., “Put the pyramid on top of the small cube”), and would then execute the appropriate actions to achieve the desired state of the system. Winograd's system accepted only a highly stylized form of English, and its natural-language abilities were entirely restricted to the blocks' domain. The emphasis in the system was on deducing the appropriate sequence of actions to achieve the desired goal, not on the understanding and parsing of unrestricted English.
A variety of programs emerged in the 1980's to permit English-language queries over databases. EasyAsk offers a representative program. In this system, the organization or schema of the database is used as a framework for the questions to be asked. The tables of the database are regarded as the objects of the application, the columns their attributes, and the vocabulary for each attribute the words within the column. Words that do not appear within the columns, including particularly prepositions, are regarded as “noise” words and discarded in query processing.
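The schema-driven interpretation described above may be sketched as follows. This is an illustrative assumption about how such a vocabulary could be derived from column contents, not a description of EasyAsk's actual implementation; the table contents and the function names build_vocabulary and interpret are hypothetical.

```python
# Hypothetical database: table name -> rows (column name -> value).
TABLES = {
    "books": [
        {"title": "Knots", "author": "Squigglesby", "publisher": "Acme"},
    ],
}


def build_vocabulary(tables):
    """Map each word found in a column to the (table, column) pairs
    in which it occurs."""
    vocab = {}
    for table, rows in tables.items():
        for row in rows:
            for column, value in row.items():
                for word in str(value).lower().split():
                    vocab.setdefault(word, set()).add((table, column))
    return vocab


def interpret(question, vocab):
    """Keep only words that occur in some column; all other words,
    including prepositions, are discarded as noise."""
    return {w: vocab[w] for w in question.lower().split() if w in vocab}


VOCAB = build_vocabulary(TABLES)
parsed = interpret("books by Squigglesby", VOCAB)
```

Under these assumptions only “squigglesby” survives, associated with the author column; the word “by” is dropped, along with whatever it might have indicated about the relationship being requested.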
Such understanding-based systems have a variety of problems, including the following:
Ignored vital relationships: Database schemas are designed for rapid processing of database queries, not for conveying semantic information regarding the databases. Relationships between database tables are indicated by importing indicators from one table into another (called “foreign keys”). Using the relationships in the schema as a framework for questions ignores some vital relationships, namely those that are not explicitly indicated by key importation.
Lost semantic information: Prepositions and other “noise” words often carry significant semantic information, which is context-dependent. For example, in a database for books, authors, and publishers, the preposition “by” may indicate either a publisher or an author, and may indicate the act of publishing or authoring a book.
In addition to the problems described above with respect to some of the different approaches that currently exist for retrieving data, all of the above approaches share the limitation that the Natural Language (“NL”) interface for each application must be handcrafted; there is no separation between the NL parser and interface, and the application itself. Further, development of the interface often consumes more effort than that devoted to the application itself. None of the currently existing approaches to NL interfaces is portable across applications and platforms. There is no NL toolkit analogous to the Windows API/Java AWT for GUIs, nor a concrete method for mapping constructs in NL to constructs in software programs.
Thus, there exists a need for a system and method for creating structured parametric data from plain text, both for purposes of content synthesis and for purposes of data retrieval. Further, such a system should be portable across applications and platforms. In addition, such a system should be able to support searches on any relevant criteria which may be of interest to a web-site's visitors, and by any arbitrary range of values on any parameter. Further, there exists a need for a system which updates seamlessly, invisibly, and rapidly to accommodate a change, when a web-site adds or modifies the products it offers.