Field of the Invention
The present invention relates most generally to methods of extracting information from text, and more particularly to a natural language processing method of analyzing Internet content for advertising purposes, and still more particularly to a method of opportunistic natural language processing in which only rules with the least likely elements present are evaluated.
Background Discussion
For nearly half a century, Artificial Intelligence (AI) research has unsuccessfully sought an approach leading to “understanding” a natural language (NL) as well as an average five-year-old child understands that language. So far, that quest has been fruitless, and the present invention does not shed any additional light on that quest.
The very first NL processing program, the ELIZA program written by Dr. Joseph Weizenbaum in 1966, showed that much was possible without any understanding at all, simply by looking for the presence of particular words and phrases. In 1976, Dr. Weizenbaum wrote the book Computer Power and Human Reason: From Judgment to Calculation, which carefully laid out the hard upper bounds of approaches such as those used in ELIZA. Those upper bounds were surpassed, though only slightly, with the later introduction of Fuzzy Logic. Still, no path is seen by which approaches such as those used in ELIZA could ever equal the capabilities of a five-year-old, regardless of how much future development and computer power might be applied.
Seeing that normal human understanding would forever remain out of reach with the ELIZA approach, most AI researchers worked on other methods. They have failed to produce anything of significant value, despite nearly half a century of creative work at major universities.
The Internet is long past due for a technological makeover. Google was the last big technological advance, allowing users to break free of the chains of hyperlinks and the Internet Yellow Pages. The next significant advance will allow users to state their problems in ordinary English, rather than having to instruct a search engine with a carefully constructed search string, and even without understanding their problem sufficiently to formulate a question. All the users will need to do is mention their present challenge/problem/irritation somewhere, and if there is a good solution, they will soon receive a message from another Internet user providing a solution and explaining how to overcome their challenge. This is a much more effective marketing mechanism than the presently used advertisements. Of course there will remain plenty of need for search engines, especially in areas that are already well understood. But this is a big step and it raises the question: Where is this headed?
Science fiction authors have been speculating on the destiny of the Internet since times long before the Internet even existed. A mind-controlled version of the “Internet” was presented in the movie, Forbidden Planet, where mere thoughts were enough to produce anything, including monsters that killed the neighbors.
In the book The Fall of Colossus, the first sequel in the trilogy that began with the novel Colossus (filmed as Colossus: The Forbin Project), people around the world had Internet-like access to an omniscient, artificially intelligent mind having great power.
The new Hulu.com series The Booth at the End explores the idea of an advanced intelligence providing advice to people. The advice leads people to perform seemingly unrelated actions that induce a butterfly effect, resulting in the realization of their wishes or leading to some other result that satisfies (or kills) them.
In the Future Tense article, Could the Internet Ever “Wake Up”?, published by the Technology section of Slate.com, authors Christof Koch and Robert Sawyer examine the prospect of the Internet becoming conscious.
And in the AI computer program the present inventor wrote behind DrEliza.com, statements of symptoms made by patients with chronic illnesses are analyzed. The program often asks questions of patients so as to “drill down” into their conditions. DrEliza.com then speculates on cause-and-effect chain(s) leading to the described illness. This is believed to be the only Natural Language Processing (NLP) program that does something of intrinsic value.
The present inventor's evolving vision is built on the Internet as it is, in its present condition, but with a very special AI synthetic “user,” who lurks unseen while watching everything, and sends custom-formatted notifications whenever anyone posts content that shows an opportunity to produce a valuable response. This could easily accomplish everything from selling products better than present advertising methods to displaying new cures for people's illnesses to solving political problems. People registered to receive notifications of certain content would be notified, but most of the people on the Internet would be entirely unaware of the AI synthetic “user,” other than perhaps noticing that every time they post about a problem, someone quickly contacts them with a good solution to their problem.
Google once offered the ability to store a search string, and sent notifications whenever new content appeared on the Internet that seemingly met the search criteria. This was unable to accomplish the purposes of the present invention for many reasons, but it is an interesting prior art system because it is the only art known to the present inventor in which a computer watched the Internet for interesting content, and then sent timely notifications when the content was found. The present invention will do the same, only with a much greater ability to eliminate false negatives and positives, so that it can be a practical and commercially valuable component in an automated marketing system.
The Internet as a Sales Tool: Long before computers existed people sought ways to identify prospective “customers” for ideas and products, ranging from religions to cookware, cigarettes to ships. They have also sought various ways to reach their prospective “customers” once they were identified. Relatively recently, activity was driven by mailing lists, sometimes accompanied by additional information to assist in selecting who to contact. Many companies had dedicated and expensive departments devoted to telephone and direct mail advertising. Of course, unwilling prospects quickly adapted to this by discarding “junk mail” without even bothering to open it.
Early pre-Internet computers held databases of prospects and bulk mail letter writing programs to process those databases into messages to send to prospects. This tipped the economic balance in favor of direct mail over telephone advertising, but only those people who appeared in some database could be reached. When computers replaced people in producing the messages, the vast majority of “junk mail” was directed to people uninterested in reading it, so the junk mail was quickly discarded upon receipt.
The dawn of the Internet brought email, which was infinitely less expensive than “snail mail,” soon followed by vendors who would send out millions of email solicitations. Here, even a 1/10,000 sales rate was seen as a “success.” However, email programs soon received additional features to discard junk emails, so bulk email programs now rarely even reach the people to whom emails are sent.
Then search engines like Google appeared, and people could search for prospects based on what people posted on the Internet, though dozens of “hits” would have to be examined to find a genuine prospect, and then a search for contact information would have to be conducted. Done well, this could lead to personally-produced emails that would reach valid prospects. However, this process was expensive, except in the case of high-value products for reliably identifiable prospects. Even this was problematic because by this time most people skipped over any emails that looked to be selling something.
This caused a massive shift in Internet advertising. Firstly, bulk email is identifiable by email providers like Google, Yahoo, and Microsoft by the appearance of many emails from the same place carrying similar content. These are now summarily discarded as “spam” without the intended recipients ever even seeing them. Secondly, bulk emails are easily recognizable as spam, so even if one evades spam filters, most people don't even bother reading it.
Advertising then shifted to placing ads on nearly every web page in an attempt to evade the spam filters. However, as of this writing, the latest iterations of spam filters now eliminate ads from web pages using strategies such as reading every web page twice, comparing the two copies, and eliminating anything that changes between the readings. They also access a central database of advertisers to eliminate all displays from identified Internet advertisers.
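The read-twice strategy described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a page has already been split into a list of text blocks; the function name stable_blocks and the sample blocks are hypothetical and are not taken from any actual filter.

```python
def stable_blocks(first_read, second_read):
    """Keep only the blocks that appear in both reads, in first-read order.

    Anything that differs between the two reads (for example, a rotating
    advertisement) is dropped. Assumption: the page is already split into
    comparable blocks; real filters are considerably more involved.
    """
    seen_again = set(second_read)
    return [block for block in first_read if block in seen_again]


# Hypothetical sample data: the same page fetched twice.
first = ["<h1>News</h1>", "<p>Story text</p>", "<div>Ad: buy shoes</div>"]
second = ["<h1>News</h1>", "<p>Story text</p>", "<div>Ad: buy a car</div>"]
kept = stable_blocks(first, second)
```

Here only the heading and the story paragraph survive, because the advertisement block changed between the two reads.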
The present invention completely sidesteps this seemingly endless cat-and-mouse game of spam and filters. It provides an automated way to transform the Internet into something characterized by mutually and reciprocally desired exchanges, rather than by a cesspool of spam that crowds and pollutes the Internet. Of course, as with any powerful technology, mishandling and inappropriate application could produce undesirable results.
The Internet as a Problem Solving Tool: The advent of search engines, like Google.com, transformed the Internet into the most powerful research tool ever known. However, search engines display only “relevant” information in response to explicit and intentionally devised searches expressed in search strings. They do not answer questions or solve problems. However, now that the Internet has become congested with both commercial information and drivel, it has become nearly impossible to find the information needed to answer questions and solve problems.
The first commercial-quality question answering program on the Internet is http://WolframAlpha.com. In creating this program, it was quickly discovered that separate subroutines were needed for each type of question. While (at present) the program can easily relate different currencies, it is unable to answer any chemistry-related questions, to take an example. Hence, WolframAlpha seems to be broadly inapplicable to answering questions, except within its own fairly restricted domains, where it admittedly performs quite well.
The first problem solving program on the Internet is http://DrEliza.com, written by the present inventor. The present invention emerged as a rewrite of DrEliza was being contemplated. DrEliza is designed to diagnose chronic illnesses in sufficient detail to lead to known cures. DrEliza works not by “understanding” what people are saying, but rather by recognizing embedded statements of symptoms within unstructured statements about health conditions.
DrEliza.com is believed to be the only serious program designed to solve problems, rather than answer questions.
One essential capability that DrEliza.com provides, and that other systems have missed, is the ability to recognize “statements of ignorance” embedded in people's explanations. In the vast majority of situations, a person's explanation of a problem contains subtle indicators of something he or she does not know and that, had it been known, would already have led to a solution. Of course users don't know what they don't know; but another person (or computer) that recognizes what a user needs to know can easily discern these “tells” and act upon them. Experts in all areas are quite familiar with recognizing statements of ignorance of important concepts in their respective areas of expertise. For example, any patent agent or patent attorney, when listening to someone expressing a need for a device to solve a particular problem, knows to suggest that he or she search USPTO.gov or a comparable database of patents and patent applications to determine what devices and methods have already been disclosed to do the job.
Selling Products: “Selling products” is fundamentally the process of explaining how something solves someone's problems. In a very real sense, DrEliza.com is “selling” its solutions to people's chronic health problems, because if they don't “buy” the advice, the advice cannot be of any help. The challenge is in getting a prospective customer to state a problem in a way that a particular product can be related to the particular problem. This has remained an unsolved problem on the Internet, leading to the obnoxious ads that now litter nearly all Internet sites.
However, analyzing much/most/all of what a person writes for the subtle “tells” indicating the particular products or kinds of products he might be interested in acquiring is quite within present-day Natural Language Processing (NLP) methods. There is just one problem, and it's a significant one at that: present methods advanced enough to do the job are orders of magnitude too slow for any practical use.
To illustrate, putting DrEliza.com on the Internet required turning off much of its analysis to enable it to respond to a screen full of text within a few seconds. Keeping up with the entire Internet would require several orders of magnitude more speed, and this is exactly what the present invention provides.
A system to sell a list of specific products would comprise much less than 1% of a great all-knowing Internet foreseen by some authors. The structure would be pretty much the same, but the tables would only have to contain just enough entries to recognize needs for the particular products, and not for any of the remainder of all human knowledge. For example, all text not written in the first person, and everything lacking a verb indicating a problem, could be summarily ignored.
It is one thing to produce a correct answer and quite another thing to convince a user that an answer is indeed correct. WolframAlpha “sells” its answers in two ways: (1) instead of simply declaring the answer, it embeds the answer into a sentence stating precisely what it is an answer to, and (2) it displays exactly how it parsed the question and developed several possible answers, only one of which was chosen for display.
DrEliza “sells” its diagnoses by identifying the sub-conditions that appear to lead to the presenting symptoms, and asking about other symptoms that those same sub-conditions would probably produce. Being asked about subtle symptoms never mentioned is enough to convince most users that DrEliza is on the right track. Further, DrEliza displays probabilities of its diagnoses that increase toward 100% (or decrease toward 0%) as the user answers various questions. In most cases the potential sub-conditions can be easily tested for. Where the sub-conditions are reversible, users often achieve permanent cures rather than temporary treatments.
Review of Various Relevant Prior Art Components, Their Deficiencies in Solving the Problem, and Modifications Needed: The technology for producing form letters is well known, along with various enhancements to evade spam filters. The most salient modification needed for the present invention includes the ability to respond to input metadata by varying the format or including the metadata into the text. The present inventor considers this obvious to anyone skilled in the art.
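As one possible illustration of a form letter that responds to input metadata, the following sketch varies its greeting depending on whether a name is available and includes the remaining metadata in the text. The field names (name, problem, product) and the letter wording are hypothetical, chosen only for the example.

```python
from string import Template

# Hypothetical letter body; the placeholders are filled from metadata.
LETTER = Template("$greeting\n\nYou mentioned $problem. $product may help.")


def render(meta):
    """Build a letter from a metadata dict (hypothetical field names)."""
    # Vary the format: use the name when present, else a generic opening.
    name = meta.get("name")
    greeting = f"Dear {name}," if name else "Hello,"
    # Include the remaining metadata directly in the text.
    return LETTER.substitute(
        greeting=greeting,
        problem=meta["problem"],
        product=meta["product"],
    )


letter = render({"name": "Pat", "problem": "poor gas mileage",
                 "product": "A hybrid car"})
```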
Technologies for “crawling” the Internet are similarly well known, and there are currently a number of both commercial and “shareware” programs for accomplishing this. However, some high-use sites like Facebook.com and Twitter.com now intercept and shut down web crawlers and provide their own proprietary interfaces. These interfaces either throttle access down to lower speeds and/or charge for usage of these interfaces.
An as-yet unaddressed and thus unresolved legal question is whether denying access to web crawlers may constitute a criminal antitrust and/or restraint of trade violation. However, these violations are not presently on the U.S. Justice Department's priority list of laws actually enforced, leaving Internet companies in a situation where they must “self-help” to defend themselves from such activity.
The expedient of distributing the crawling of these sites over the many individual representatives of the various manufacturers offering products easily subverts all presently known methods of detecting this sort of crawling activity. Each representative's computer would only have to access a few pages from each of these sites and forward them to a central site for aggregation and analysis. The central site could then crawl sites like Facebook and Twitter without being detected.
Crawlers have long used multiple computers to increase performance, but not other people's computers that are widely distributed, each with their own assortment of cookies. This will hide the nature of the crawlers.
A new method of parsing the natural language content of the Internet, faster than prior art methods to make it practical to “understand” everything on the Internet enough to sell products, is the “glue” technology that makes the present invention possible.
Prior Art Natural Language Processing: Many textbooks have been written about Natural Language Processing (NLP), keeping pace with its evolution and development since the original ELIZA program by Prof. Joseph Weizenbaum in 1966. NLP methods now vary widely depending on the goals of the program, e.g., searching, identifying content domain, full “understanding,” identifying symptoms, transcribing speech, translating text to other languages, and so forth. To date, what has been wanting is a way to identify which of many known potential needs, wants, problems, symptoms, pains, expenses, etc., a user has stated or implied, such that it can then be determined which products might best satisfy those needs and wants. In most cases the needs and wants will be expressed in first-person statements, notable exceptions being statements related to prospective purchases for Christmas, birthdays, etc., when the text author may be shopping for gifts. It is also necessary to ascertain the pertinent time (e.g., has this come and gone, is it anticipated in the future, etc.) and multiple negation (a language phenomenon largely restricted to English). In short, what is needed to handle complex marketing requirements is relatively simple in comparison to some other NLP applications, such as language translation.
The methods embodied in ELIZA live on in present-day chatbots—experimental programs that mimic human conversation. They also live on in DrEliza.com, written by the present inventor to find solutions for chronic illness problems.
One thing that nearly everyone notices when using NL processing programs, such as chatbots and language translators, is the time it takes for the programs to prepare responses. It takes a lot of computation to process natural language. Even on modern multi-gigahertz processors, most NL processing programs are “trimmed” to eliminate non-essential computations to make them run fast enough not to be objectionably slow, albeit at some cost in accuracy and at the cost of eliminating some less-important analysis.
It is the primary goal of the present invention to greatly accelerate NL processing speed so that Internet providers, like Google.com and Yahoo.com, can more intelligently respond to their users, speeding processing sufficiently so that rules can be incorporated to perform as well as is practical, without significant regard for the amount of processing needed to operate that fast.
The ELIZA approach is good at solving problems, but is not suitable for answering questions. Sales involve problem solving of a special nature, where the only solutions available are the products that are available for sale. In problem solving, the only thing a computer must do is recognize particular problem situations where the computer can render good advice, while ignoring all other input. To illustrate, to recognize a situation where it may be possible to sell a hybrid electric car:
Statement Qualifying a Prospect
Is this about the author, e.g., written in the first person?
Has the author recognized that he/she has a problem, e.g., with words like “can't,” “cannot,” “won't,” “will not,” etc.?
Has the author mentioned a problem that is addressed by a represented product, e.g., “poor gas mileage”?
Can the author afford the represented product, e.g., has he mentioned “my job”?
Is this a current problem, e.g., is it written in the present imperfect tense?

Statement Disqualifying a Prospect
Does the author already have an equal or better solution in hand, e.g., mentioning “my Toyota Prius”?
Has the author said something suggesting that he doesn't have the money to purchase, e.g., mentioning “my food stamps”?
Is the author already aware of the product, e.g., by mentioning it by name?
Is the author aware of problems that this product has, e.g., mentioning the “high cost of hybrid cars”?
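The qualifying and disqualifying questions for the hybrid car example might be approximated as follows. This is only a rough sketch under stated assumptions: the regular expressions, the dictionary names, and the function is_prospect are hypothetical, only a subset of the questions is covered (the tense check and the product-name check are omitted), and simple phrase matching stands in for the deeper analysis the text describes.

```python
import re

# Hypothetical patterns, one per qualifying question (a subset only).
QUALIFIERS = {
    "first_person": r"\b(i|my|me|i'm)\b",
    "recognized_problem": r"\b(can't|cannot|won't|will not)\b",
    "addressed_problem": r"\bpoor gas mileage\b",
    "can_afford": r"\bmy job\b",
}

# Hypothetical patterns, one per disqualifying question (a subset only).
DISQUALIFIERS = {
    "has_solution": r"\bmy (\w+ )?prius\b",
    "no_money": r"\bmy food stamps\b",
    "knows_drawbacks": r"\bhigh cost of hybrid cars\b",
}


def is_prospect(text):
    """Return True if every qualifier matches and no disqualifier does."""
    lowered = text.lower()
    if any(re.search(p, lowered) for p in DISQUALIFIERS.values()):
        return False
    return all(re.search(p, lowered) for p in QUALIFIERS.values())
```

For instance, is_prospect("I can't afford the poor gas mileage on the drive to my job.") is accepted, while any posting that mentions "my Prius" is rejected regardless of what else it says.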
This is a much deeper analysis than that provided by prior art methods that simply look for the use of certain words to trigger the display of advertisements without regard for the context in which they are used.
Simplistic NL processing was done crudely in the original ELIZA, and elegantly in DrEliza.com. There is a barrier common to nearly all NLP development efforts that limits NLP performance: the more statement elements recognized, the more time is needed to test for all of them, wherever they might occur. NLP projects have therefore had to strike a compromise between functionality and speed that falls several orders of magnitude short of functioning as a useful component in analyzing the entire Internet in real time.
The present invention provides a way to sidestep this barrier, so that much deeper analysis can be performed without significantly slowing the program. Program speed is nearly unaffected by the number of statements that can be recognized. Hence, it becomes possible to deeply analyze the semantics of statements to determine what is being said and implied, while handling the data nearly as fast as it can be brought into memory. The potential applicability of this new methodology goes far beyond the presently used methods.
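One way to see how speed could be nearly independent of the number of recognizable statements is to file each rule under its least likely word, so that a rule is examined only when that word actually appears in the input. The sketch below is an illustration under assumptions, not the inventive method itself: the word frequencies, the rule contents, and the names build_index and match are all hypothetical, and each rule is reduced to a set of required words.

```python
from collections import defaultdict

# Hypothetical relative word frequencies (higher means more common).
WORD_FREQ = {"i": 1000, "my": 900, "poor": 50, "gas": 40,
             "headache": 8, "chronic": 6, "mileage": 5}

# Hypothetical rules: each rule is the set of words it requires.
RULES = {
    "hybrid_car": {"my", "poor", "gas", "mileage"},
    "headache_remedy": {"i", "chronic", "headache"},
}


def build_index(rules, freq):
    """File each rule under its least frequent required word."""
    index = defaultdict(list)
    for name, words in rules.items():
        rarest = min(words, key=lambda w: freq.get(w, 0))
        index[rarest].append(name)
    return index


def match(text, rules, index):
    """Evaluate only rules whose rarest word is present in the text."""
    present = set(text.lower().split())
    hits, evaluated = [], 0
    for word in present:
        for name in index.get(word, ()):
            evaluated += 1  # count how many rules were actually tested
            if rules[name] <= present:
                hits.append(name)
    return hits, evaluated


index = build_index(RULES, WORD_FREQ)
hits, evaluated = match("my car gets poor gas mileage", RULES, index)
```

In this sketch the work per posting depends on how many of its words happen to be the rarest word of some rule, not on the total number of rules, so adding rules keyed on words that are absent from the text costs nothing at match time.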
There are other prospective methods for doing the searching part of this NLP, e.g., Google.com could conceivably do this if it could accept very complex search strings that would be thousands of characters long, rather than the present 10-word limit. The methods used at Google.com are simply too inefficient to ever build a business model around performing the same things the present invention seeks to perform, even if Google were to open up the Google.com interface enough to allow some such queries. Google.com is also missing other essential capabilities, e.g., metadata extraction, most notably the extraction of contact information.
The performance barrier has constrained NLP for the last 40 years, as various developers have sought various ways to process more efficiently without really addressing the fundamental limitations of this barrier. An example is the method commonly used in some automatic spelling correction programs, where chains of misspellings having the same first two letters are maintained in order to shorten the searching. This runs fast for spelling correction, but not for anything else. Google has attempted to resolve the problem simply by throwing large buildings full of computers at the barrier. That it has taken 40 years to conceive a way past this barrier should resolve any issues concerning “obviousness” under the patent laws.
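The spelling correction shortcut mentioned above, in which misspellings sharing the same first two letters are chained together, can be sketched as follows. The misspelling table and the function names are hypothetical, and actual spelling correctors may organize their chains differently.

```python
from collections import defaultdict

# Hypothetical table of known misspellings and their corrections.
MISSPELLINGS = {
    "recieve": "receive",
    "reciept": "receipt",
    "teh": "the",
    "definately": "definitely",
}


def build_chains(table):
    """Group known misspellings into chains keyed by first two letters."""
    chains = defaultdict(list)
    for wrong, right in table.items():
        chains[wrong[:2]].append((wrong, right))
    return chains


def correct(word, chains):
    """Search only the chain that shares the word's first two letters."""
    for wrong, right in chains.get(word[:2], ()):
        if wrong == word:
            return right
    return word  # not a known misspelling; leave unchanged


chains = build_chains(MISSPELLINGS)
fixed = correct("reciept", chains)
```

Each lookup scans only the short chain for one two-letter prefix rather than the whole table, which is why the approach is fast for spelling correction but does not generalize to recognizing multi-word statements.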
The natural language processing part of the present invention functions similarly to state of the art “chatbots”—programs that carry on a running natural language conversation with users. These programs are sometimes created as part of “Turing test competitions” to determine which best mimic actual human conversationalists. Some of the principles used in the present invention are similar, but the objectives and the advantages are different from those of chatbots. Specifically, chatbots are designed to mimic and maximize conversation, but the present invention, like its predecessor DrEliza.com, is designed to minimize conversation while providing maximum informational value. The present invention will thus produce no response to the vast majority of Internet postings.
Moreover, chatbots are not designed or expected to be useful. In contrast, the entire purpose of this invention is to be more useful and more valuable than competing advertising methods.
Still further, chatbots are designed to emulate human conversation. However, when responding to intercepted email contents and other private conversations, this can be very disadvantageous, as it causes users to suspect that their email conversations are not private. When responding to intercepted private communications, the output must not appear to be human-sourced. It would be advantageous, therefore, to be able to customize a generic-looking advertisement to the specific intercepted conversations.
And further still, chatbots may require a second or more to process a simple statement and produce a response, whereas the presently inventive system must analyze much more deeply, process much longer postings, and produce responses without consuming more than a few milliseconds of time in order to keep up with the Internet. This requires 3-4 orders of magnitude more speed.
Finally, like most AI programs, NLP programs are almost always table driven.
The foregoing background discussion, including consideration of the known prior art systems and methods, reflects the current state of the art of which the present inventor is aware. Reference to, and discussion of, these known systems and methods is intended to aid in discharging Applicant's acknowledged duty of candor in disclosing information that may be relevant to the examination of claims to the present invention. However, it is respectfully submitted that none of the background art discloses, teaches, suggests, shows, or otherwise renders obvious, either singly or when considered in combination, the inventive method described herein.