A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
A computer program listing appendix is included in the attached CD-R created on Dec. 12, 2000, labeled "Creation of Structured Data from Plain Text," and including the following files: CommodityProperty.nml (13 KB), DefaultSeg14Result.xml (2 KB), ElectricalProperty.nml (16 KB), Example.txt, Grammar.txt, INML.xml (5 KB), MeasurementProperty.nml (22 KB), Output.txt (3 KB), PeriodProperty.nml (6 KB), PhysicalProperty.nml (36 KB), ReservedNameProperty.nml (6 KB), Seg14.nml (30 KB), Seg14Phrasing.nml (71 KB), UsageProperty.nml (7 KB), and Utility.nml (6 KB). These files are incorporated by reference herein.
A. Technical Field
The present invention relates to creation of structured data from plain text, and more particularly, to creation of structured data from plain text based on attributes or parameters of a web-site's content or products.
B. Background of the Invention
In recent years, the Internet has grown at an explosive pace. More and more information, goods, and services are being offered over the Internet. This increase in the data available over the Internet has made it increasingly important that users be able to search through vast amounts of material to find information that is relevant to their interests and queries.
The search problem can be described on at least two levels: searching across multiple web-sites, and searching within a given site. The first level of search is often addressed by "search engines" such as Google™ or Alta Vista™, or directories such as Yahoo™. The second level, which is specific to the content of a site, is typically handled by combinations of search engines and databases. This approach has not been entirely successful in providing users with efficient access to a site's content.
The problem in searching a website or other information-technology based service is composed of two subproblems: first, indexing or categorizing the corpora (body of material) to be searched (i.e., content synthesis), and second, interpreting a search request and executing it over the corpora (i.e., content retrieval). In general, the corpora to be searched typically consist of unstructured information (text descriptions) of items. For e-commerce web-sites, the corpora may be the catalog of the items available through that web-site. For example, the catalog entry for an item might well be the sentence "aqua cashmere v-neck, available in small, medium, large, and extra large." Such an entry cannot be retrieved by item type or attribute, since the facts that v-neck is a style of sweater, cashmere a form of wool, and aqua a shade of blue, are unknown to current catalogs or search engines. In order to retrieve the information that this item is available, by item type and/or attribute, this description must be converted into an attributed, categorized description. In this example, such an attributed, categorized description may include properly categorizing the item as a sweater, extracting the various attributes, and tagging their values. An example of such a description is illustrated in Table 1.
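As a rough illustration of the conversion described above (and not a reproduction of Table 1), the following sketch shows one way the sweater entry might look once it has been categorized and attributed. The `CatalogItem` class, the attribute names, and the two lookup tables are hypothetical and were chosen only for this example.

```python
from dataclasses import dataclass, field

# Hypothetical background knowledge that a plain-text catalog lacks.
STYLE_TO_CATEGORY = {"v-neck": "sweater"}
BROADER_TERM = {"cashmere": "wool", "aqua": "blue"}


@dataclass
class CatalogItem:
    category: str
    attributes: dict = field(default_factory=dict)

    def matches(self, category, **wanted):
        """True if the item is in `category` and every wanted value equals
        the stored value, its broader term, or one of a list of values."""
        if category != self.category:
            return False
        for name, value in wanted.items():
            stored = self.attributes.get(name)
            if isinstance(stored, list):
                if value not in stored:
                    return False
            elif value not in (stored, BROADER_TERM.get(stored)):
                return False
        return True


# "aqua cashmere v-neck, available in small, medium, large, and extra large"
item = CatalogItem(
    category=STYLE_TO_CATEGORY["v-neck"],
    attributes={
        "style": "v-neck",
        "material": "cashmere",
        "color": "aqua",
        "sizes": ["small", "medium", "large", "extra large"],
    },
)
```

With the description in this form, a request such as `item.matches("sweater", color="blue", material="wool")` can succeed even though none of those three words appears in the original catalog sentence.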
Current technology permits such representations in databases. Further, for many standard items, numeric codes are assigned to make the job of search and representation easier. One such code is the UN Standard Products and Services Code (UN/SPSC), which assigns a standard 8-digit code to any human product or service.
However, while the taxonomies and the technology to represent the taxonomies may exist, conventional systems are unable to generate the taxonomic and attributed representation for an object from its textual description. This leads to the first of the two problems outlined above: the content synthesis problem. More specifically, that is the problem of how to convert plain text into structured objects suitable for automated search and other computational services.
The second problem is one of retrieving data successfully; once the data has been created and attributed, it must be accessible. E-commerce and parametric content sites are faced with a unique challenge, since they must offer search solutions that expose only those products, contents or services that exactly match a customer's specifications. Today, more than 50% of visitors use search as their preferred method for finding desired goods and services. However, as e-commerce web sites continue to offer their customers unmatched variety, category-based navigation of these sites ("virtual aisles") has become increasingly complex and inadequate. In particular, many web-sites that offer a large catalog of products are often unable to find products with precise or highly parameterized specifications, and instead require the user to review dozens of products that potentially match these specifications.
A few statistics help to emphasize the importance of good searching ability. An important metric that measures the conversion rate of visitors to e-commerce sites into buyers is the book-to-look ratio. The industry average is that only 27 visitors in 1,000 make a purchase. The biggest contributor to this abysmal ratio is failed search. Forrester Research reports that 92% of all e-commerce searches fail. Major sites report that 80% of customers leave the site after a single failed search. Therefore, improving the search capability on a site directly increases revenue through increased customer acquisition, retention, and sales.
While all web-sites experience some form of these search problems to some extent, the problem is particularly acute for web-sites with a deep and rich variety of content or products. Examples are electronic procurement networks, financial sites, sporting goods stores, grocery sites, clothing sites, electronics, software, and computer sites, among many others. Another class of sites with a deep search problem comprises those carrying highly configurable products, such as travel and automotive sites. Ironically, as a rule of thumb, the more a web-site has to offer, the greater the risk that customers will leave the site because of a failed search.
When a customer physically enters a large department store, she can ask a clerk where she can find what she is looking for. The clerk's "search" is flexible in that he can understand the customer's question almost no matter how it is worded. Moreover, the clerk's "search" is generally accurate since the clerk can often specifically identify a product, or initial set of products, that the customer needs. Searches on web sites need to be equally flexible and accurate. In order for that to happen, a visitor's request must be understood not only in terms of the products, but also in terms of the request's parameters or characteristics. However, conventional information retrieval systems for web-site content have been unable to achieve this.
Some of the conventional methods used to find goods and services on web sites, and some problems with these methods, are outlined below:
1. Keyword-based search: In this method, users type a set of words or phrases describing what they want into a text box, typically on the main page of the site. A program on the site then takes each individual word entered (sometimes discarding "noise" words such as prepositions and conjunctions), and searches through all pages and product descriptions to find items containing any combination of the words. This method, when given an English sentence or phrase, returns either far too many results or too few. For example, if a customer requests, "show me men's blue wool sweaters," the search could be unsuccessful for the following reasons. It would either return only those pages that contain all the words in this request, or return any page that contained any single word in the search. In the former case, no items would be found, though there might be many products with those characteristics for sale. For instance, it is possible that aqua cashmere cardigan would not be matched, since it contains none of the keywords. In the latter case, a large number of items would be found, most of which would be of no interest to the customer. For example, blue wool slacks may be incorrectly matched, since it contains the keywords "blue" and "wool." Some keyword-based searches weight results based on how many keywords are matched.
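The two failure modes described above can be reproduced with a minimal sketch. The three-item catalog, the noise-word list, and the `search` function are assumptions made for illustration; they do not describe any particular commercial search engine.

```python
# Hypothetical noise words and catalog, for illustration only.
NOISE_WORDS = {"show", "me", "a", "the", "of", "and", "in"}

CATALOG = [
    "aqua cashmere cardigan",
    "blue wool slacks",
    "red cotton shirt",
]


def keywords(query):
    """Lowercase the query, strip simple punctuation, drop noise words."""
    words = (w.strip(",.") for w in query.lower().split())
    return {w for w in words if w and w not in NOISE_WORDS}


def search(query, require_all):
    """Return descriptions containing all (or any) of the query keywords."""
    terms = keywords(query)
    combine = all if require_all else any
    return [d for d in CATALOG if combine(t in d.split() for t in terms)]


query = "show me men's blue wool sweaters"
too_few = search(query, require_all=True)    # nothing contains every keyword
too_many = search(query, require_all=False)  # slacks match on "blue"/"wool"
```

Requiring every keyword finds nothing, while accepting any keyword returns the slacks; in neither mode is the aqua cashmere cardigan found, because the program has no way of knowing that aqua is a blue or that cashmere is a wool.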
Keyword-based approaches are widely used in medical transcription applications, database access, voice-mail control and web search. Virtually all commercial natural-language interface products use this approach. In this approach, certain words are regarded as meaningful, and the remainder as meaningless "glue" words. Thus, for example, in the sentence "show all books written by Squigglesby" the words "show," "book," and "written" may be regarded as keywords, the word "by" as a meaningless glue word, and the word "Squigglesby" as an argument. The query would then be formed on the theory that a book author named Squigglesby was being requested.
In such systems, keywords are generally some of the common nouns, verbs, adverbs and adjectives, and arguments are proper nouns and numbers. There are exceptions, however. Prepositions are usually regarded as glue words, but in some circumstances and in some systems are regarded as keywords. Generally, this is due to the human tendency to omit words in sentences, known in the argot as "ellipses." The sentence "Show all books by Squigglesby" is an example of this, where the verb "written" is excluded. In order to cope with this, some keyword-based systems make "by" a keyword.
There are a few specialized cases of, or variations on, keyword searches. Database approaches are an example of a widely used variant on keyword-based approaches. In these systems, the database developer associates keywords or identifiers with specific database fields (columns in specific tables). Various words, specifically interrogative pronouns and adjectives, some verbs, and some prepositions, have fixed meanings to the database query program. All other words can be available as keywords for a template-based recognition system. In response to a user's sentence, the interface system may match the user's sentence to a template set constructed from the database developer's information about database structure and identifiers, and its built-in interpretation of its hardwired keywords. A Structured Query Language (SQL) statement would then be generated which encodes the meaning of the user's sentence, as interpreted by the interface system.
Another example of a specialization of the keyword-based approach is a catalog-based approach. Catalogs are databases of products and services. A "category" is the name of a table: the attributes of the category are some columns of the table. In this approach, a question is first searched by a category word, and then the remainder of the question is used as keywords to search for matching items within the category. For example, "blue woolen sweater" would first search for "blue," "woolen," and "sweater" as keywords indicating a category, and then (assuming "sweater" succeeded as a category keyword and the others did not), for "blue" and "woolen" as keywords within the sweater category. The difficulty with this approach is that cross-category queries fail, since no individual category is available to match in such cases. Further, parameters that are not present in the product descriptions in the category are not used.
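The two-stage lookup just described, and the way it fails for cross-category queries, can be sketched as follows. The catalog contents and the `catalog_search` function are hypothetical and serve only to make the control flow concrete.

```python
# Hypothetical catalog: each key is a category (a table name) and each
# value lists the product descriptions stored in that category.
CATALOG = {
    "sweater": ["blue woolen sweater with crew neck", "red cotton sweater"],
    "slacks": ["blue woolen slacks"],
}


def catalog_search(query):
    words = query.lower().split()
    # Stage 1: try every word as a category keyword; keep the first hit.
    category = next((w for w in words if w in CATALOG), None)
    if category is None:
        return []  # no category word: a cross-category query fails here
    # Stage 2: use the remaining words as keywords inside that category.
    rest = [w for w in words if w != category]
    return [d for d in CATALOG[category] if all(k in d.split() for k in rest)]
```

Here `catalog_search("blue woolen sweater")` finds the crew-neck sweater, whereas `catalog_search("blue woolen")` returns nothing, even though both categories contain a blue woolen item.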
Some of the central limitations of keyword-based systems are described below:
Meanings of words are fixed, independent of context. In keyword-based systems, keywords have fixed semantics. This is a distinct departure from the use of normal language by humans. Words in natural language derive their meaning through a combination of "symbol" (the word itself) and "context" (the surrounding text and background knowledge). The most glaring example is prepositions in the presence of ellipses. For instance, "by" can indicate the subject of almost any transitive verb, as well as physical proximity or indicating an object or method to use to accomplish a particular task. Another example of meaning dependent on context is that "green" can refer to a color, a state of freshness or newness, or, disparagingly, to inexperience. A quick glance at any page of any dictionary will show that most words have multiple, and often unrelated, meanings, and context is what disambiguates them. Contrary to this nuanced usage of words, in general, keyword-based approaches choose one single meaning for each word, and apply that meaning consistently in all searches. This problem is fundamentally unfixable in these systems: in order to attach a contextual semantic to a word, strong parsing technology is required and a means must be found of specifying a word in context, sufficient for a program to understand the contextual meaning.
Strongly tied to an application. Since the meanings of words must be fixed so strongly, these systems have the interface strongly tied to (and, in general, inseparable from) the application. There is no toolkit comparable to the popular Graphical User Interface ("GUI") toolkits to form a keyword-based natural-language interface to an arbitrary application.
Missed meanings attached to glue words, especially prepositions. An assumption behind keyword-based approaches is that glue words carry no meaning or semantic content. Unfortunately, in practice there are very few words whose meanings are always unimportant. The words chosen as glue words are those whose meaning is most context-dependent, and thus their semantic content is largely missed.
High error rates, non-robust. Since meanings are attached to words independent of context, meanings can often be guessed wrong. For example, one vendor in this space, Linguistic Technology Corporation, distributes a product ("EnglishWizard") that permits database users to ask questions of a database. A demonstration is given with a database of purchasers, employees, sales, and products. In this example database, numbers always refer to the number of employees. This produces a sequence where, when a user asks "who purchased exactly two items," the answer is "no one." However, when a user asks how many items a particular individual purchased, the answer is "two." The reason for the discrepancy could be that EnglishWizard did not really understand the question. Instead, the first user question was mapped to a question about employees since it included a number in it.
2. FREE-FORM KEYWORD SEARCH: This category replaces keywords with previously-asked questions and the "right" answers, and returns the answers to the typed-in question. Examples of such systems are described in detail in U.S. Pat. No. 5,309,359, entitled "Method and Apparatus for Generating and Utilizing Annotations to Facilitate Computer Text Retrieval," issued on May 3, 1994 to Katz, et al., and U.S. Pat. No. 5,404,295, entitled "Method and Apparatus for Utilizing Annotations to Facilitate Computer Retrieval of Database Material," issued on Apr. 4, 1995 to Katz, et al. In systems employing free-form keyword searching, questions and answers are stored as sets. The question is typically stored in a canonical form, and a rewrite engine attempts to rewrite the user question into this form. If the user question maps into a pre-determined question for which the answer is known, then the answer is returned by the system. Such an approach is used by http://www.AskJeeves.com for Web searching applications, and for lookups of frequently-asked questions (FAQs).
Such systems have several limitations, including the following:
A relatively small number of questions can be answered: The number of questions that can be answered is linearly proportional to the number of questions stored; thus, this method can only be used when it is acceptable to have a relatively small number of questions that can be answered by the system.
Cannot directly answer a user's question: Since such a system processes a user question in toto, and does not attempt to parse it or extract information from the parts, it cannot be used where the solution to the user question requires the use of a parameter value that can be extracted from the question. In sum, the system can merely point the user at a page where his question can be answered; it cannot directly answer the user question.
3. UNDERSTANDING-BASED SEARCHES: Systems incorporating understanding-based searches attempt to understand the actual meaning of a user's request, including social and background information. An example of such a system is Wilensky's UNIX-based Help system, UC. UC had built into it a simple understanding of a user's global goals. Wilensky explained that a consequence of not having such a deep understanding was that the system might offer advice, which literally addressed the user's immediate question in a way that conflicted with the user's global goals. A specific example is that a request for more disk space might result in the removal of all the user's files, an action that met the immediate request, but probably not in a way that the user would find appropriate.
Understanding-based systems are generally confined to conversational partners, help systems, and simple translation programs. In general, it should be noted that the underlying application is quite trivial; in fact, the interface is the application. Various specialized systems have also been built, to parse specific classes of documents. A good example is Junglee's resume-parser. Researchers in this area have now largely abandoned this approach. Indeed, the academic consensus is that full understanding is "AI-complete": a problem that requires a human's full contextual and social understanding.
There have been multiple previous attempts to use natural language as a tool for controlling search and computer programs. One example of these is Terry Winograd's "Planner" system, which was described in his 1972 doctoral thesis. Winograd developed an abstract domain for his program, called the "Blocks World." The domain consisted of a set of abstract three-dimensional solids, called "blocks," and a set of "places" on which the blocks could rest. Various blocks could also rest on top of other blocks. Planner would accept a variety of natural language commands corresponding to the desired states of the system (e.g., "Put the pyramid on top of the small cube"), and would then execute the appropriate actions to achieve the desired state of the system. Winograd's system accepted only a highly stylized form of English, and its natural-language abilities were entirely restricted to the blocks' domain. The emphasis in the system was on deducing the appropriate sequence of actions to achieve the desired goal, not on the understanding and parsing of unrestricted English.
A variety of programs emerged in the 1980's to permit English-language queries over databases. EasyAsk offers a representative program. In this system, the organization or schema of the database is used as a framework for the questions to be asked. The tables of the database are regarded as the objects of the application, the columns their attributes, and the vocabulary for each attribute the words within the column. Words that do not appear within the columns, including particularly prepositions, are regarded as "noise" words and discarded in query processing.
Such understanding-based systems have a variety of problems, including the following:
Ignored vital relationships: Database schemas are designed for rapid processing of database queries, not semantic information regarding the databases. Relationships between database tables are indicated by importing indicators from one table into another (called "foreign keys"). Using the relationships in the schema as a framework for questions ignores some vital relationships (since the relationship is not explicitly indicated by key importation).
Lost semantic information: Prepositions and other "noise" words often carry significant semantic information, which is context-dependent. For example, in a database for books, authors, and publishers, the preposition "by" may indicate either a publisher or an author, and may indicate the act of publishing or authoring a book.
In addition to the problems described above with respect to some of the different approaches that currently exist for retrieving data, all of the above approaches share the limitation that the Natural Language ("NL") interface for each application must be handcrafted; there is no separation between the NL parser and interface, and the application itself. Further, development of the interface often consumes more effort than that devoted to the application itself. None of the currently existing approaches to NL interfaces is portable across applications and platforms. There is no NL toolkit analogous to the Windows API/Java AWT for GUIs, nor a concrete method for mapping constructs in NL to constructs in software programs.
Thus, there exists a need for a system and method for creating structured parametric data from plain text, both for purposes of content synthesis and for purposes of data retrieval. Further, such a system should be portable across applications and platforms. In addition, such a system should be able to support searches on any relevant criteria which may be of interest to a web-site's visitors, and by any arbitrary range of values on any parameter. Further, there exists a need for a system which updates seamlessly, invisibly, and rapidly to accommodate a change, when a web-site adds or modifies the products it offers.
The present invention provides a system, method, and an architecture for receiving unstructured text, and converting it to structured data. In one embodiment, this is done by mapping the grammatical parse of a sentence into an instance tree of application domain objects. In addition, the present invention is portable across different application domains.
A system in accordance with the present invention can be used for creating structured data from plain text, to allow for the efficient storing of this structured data in a database. For example, from the free text description of a number of products, the structured data (which could be an extracted object and its attributes) can be used to create individual entries in a product database, and thus create content for an e-commerce website or web market. Alternately, or in addition, such a system can be used for creating structured data from a plain text query, for using this structured data to retrieve relevant data from a database. For example, a user's free text query can be converted to a database query that corresponds to the objects of the database and their attributes. Such a system overcomes the limitations of conventional search engines by accepting free form text, and mapping it accurately into a structured search query.
The present invention recognizes that understanding natural language is neither required nor desired in generating structured data; rather, what is desired is the ability to map natural language onto program structure. Further, there is a natural relationship between the parse of the sentence as expressed in a parse tree and a component tree in a program. Thus, the natural language sentence is understood as instructions to build a component tree. A content engine takes in a natural language sentence and produces a program component tree. The component tree is then further simplified before it is passed to a program for execution.
As mentioned above, a system in accordance with the present invention can be used across various applications. In the various embodiments of the present invention, the meaning of a word is dependent only on the application and the role of the word in the sentence. Thus, the definition of a word is largely the province of the application developer. Briefly, words act as identifiers for components. A word in a sentence serves as an identifier for program objects. As discussed above, many words in English or other natural languages have multiple meanings with the meanings dependent upon context. Similarly, for the present invention, a word may be used as an identifier for multiple objects.
In one embodiment, the present invention transforms an English sentence into a set of software objects that are subsequently passed to the given application for execution. One of the advantages of this approach is the ability to attach a natural language interface to any software application with minimal developer effort. The objects of the application domain are captured, in one embodiment, by using the Natural Markup Language ("NML"). The resulting interface is robust and intuitive, as the user now interacts with an application by entering normal English sentences, which are then executed by the program. In addition, an application enhanced with the present invention significantly augments the functionality available to a user.
When given a plain text sentence in a natural language, a system in accordance with one embodiment of the present invention performs the following steps:
(i) A parsing algorithm applies a formal context-free grammar for the natural language to derive all parses of a given sentence. For purposes of discussion, English is used as an example of the natural language of the plain text. However, it is to be noted that the present invention may be used for any natural language. In one embodiment, all parses of the sentence are derived in the time taken to derive a single parse (e.g., concurrently). Preferably, all parses are stored in a single data structure whose size is dramatically smaller than the number of individual parse trees, often just a constant factor larger than the size taken to store a single parse tree. It is to be noted that, in one embodiment, the correct map of a sentence is only known after all possible parses have been attempted.
(ii) A mapping algorithm then uses the structure of each parse tree for a given sentence to attempt to derive an object representation of the sentence within the domain of interest based on the application-specific NML model. In other words, the mapping algorithm maps each parse outputted by the parser, into an instance tree of objects. In one embodiment, this is done by generating instance trees, mapping each parse onto an instance tree, pruning the instance trees generated, and then using a best-match algorithm on the pruned trees to select the best match.
(iii) A reduced form of the NML object description instance is created as an instance of a Domain Markup Language ("DML"). This DML is passed to the application program for execution.
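Steps (i) through (iii) can be sketched, under heavy simplifying assumptions, as follows. The three-rule toy grammar, the lexicon, the `ATTRIBUTES` domain model, and the dictionary-based output are all hypothetical stand-ins; the actual NML and DML formats are those of the appendix files and are not reproduced here. The sketch also enumerates the parse trees explicitly, rather than packing them into the single compact data structure described in step (i).

```python
from functools import lru_cache

# Hypothetical lexicon: one word may identify several objects (senses).
LEXICON = {
    "blue": [("ADJ", ("color", "blue"))],
    "wool": [("ADJ", ("material", "wool")), ("N", ("Commodity", "wool"))],
    "sweater": [("N", ("Garment", "sweater"))],
}
# Hypothetical domain model: the attributes each object class accepts.
ATTRIBUTES = {"Garment": {"color", "material"}, "Commodity": set()}


def all_parses(words):
    """Step (i): every parse under the toy grammar NP -> N | ADJ NP | NP NP."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def np(i, j):
        trees = []
        senses = LEXICON.get(words[i], [])
        if j - i == 1:
            trees += [("N", s) for tag, s in senses if tag == "N"]
        else:
            for tag, s in senses:
                if tag == "ADJ":
                    trees += [("MOD", s, rest) for rest in np(i + 1, j)]
            for k in range(i + 1, j):
                trees += [("COMPOUND", l, r) for l in np(i, k) for r in np(k, j)]
        return tuple(trees)

    return list(np(0, len(words)))


def map_parse(tree):
    """Step (ii): map a parse onto an instance tree, or None if it cannot be."""
    if tree[0] == "N":
        cls, name = tree[1]
        return {"class": cls, "name": name, "attrs": {}}
    if tree[0] == "MOD":
        (attr, value), head = tree[1], map_parse(tree[2])
        if head is None or attr not in ATTRIBUTES[head["class"]]:
            return None  # pruned: the object does not accept this attribute
        head["attrs"][attr] = value
        return head
    return None  # COMPOUND has no counterpart in this toy domain model


def to_dml(words):
    """Step (iii): choose the best surviving instance and reduce it."""
    instances = [m for m in map(map_parse, all_parses(words)) if m is not None]
    if not instances:
        return None
    best = max(instances, key=lambda m: len(m["attrs"]))
    return {best["name"]: best["attrs"]}
```

For the phrase "blue wool sweater" the toy grammar yields three parses, because "wool" may be read as a material adjective or as a commodity noun. Only the reading in which both "blue" and "wool" modify the sweater maps onto the domain model, so `to_dml("blue wool sweater".split())` reduces to a single sweater object carrying a color and a material.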
The features and advantages described in this summary and the following detailed description are not all-inclusive, and particularly, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings, specification, and claims hereof. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter, resort to the claims being necessary to determine such inventive subject matter.