1. Field of the Invention
This invention relates generally to methods and systems for information retrieval, processing and storage, and more particularly to methods and systems for finding, transforming and storing facts about a particular domain from unstructured and semi-structured documents written in a natural language.
2. Description of the Related Art
The transformation of information from one form to another was and still is a formidable task. The major problem is that information is generated in the first place for communication with human beings. This purpose has both allowed and forced the use of loosely structured or purely unstructured methods of information presentation. A typical example is a newspaper article. Sometimes the information is presented in a somewhat more structured form, as in a company's press release or in an SEC Form 10-K. But even in the latter case the majority of the information is presented in plain language (e.g. English). With the information explosion, particularly on the Internet, the need for aggregation and automatic analysis of the virtually infinite amount of information available to the public has become apparent and urgent. The fundamental problem with this analysis lies in the very fact that the information is originated by human beings to be consumed by human beings. So, to perform aggregation and automatic analysis of this information, a computer needs to transform or translate semi-structured or completely unstructured text into a structured form. But to do that one needs to create a machine that can understand natural language, a task that is still far beyond the grasp of the AI community. Furthermore, to understand something means not only to recognize grammatical constructs, which is a difficult and expensive task by itself, but also to create a semantic and pragmatic model of the subject in question.
A number of scientists and businesses tried to solve this problem by creating a statistically generated ontology of a subject area and generating tools to navigate the Internet and other sources of information using this ontology and key words. Some of them went even further and generated the “relevance” index to prioritize pieces of information (e.g. web pages) by their “importance” and “relevance” to the question [e.g. Google].
The fundamental problem with this approach is that it still does not perform the task at hand—“analyze and organize the sea of information pieces into a well managed and easily accessible structure”.
Transformation of the information contained in the billions of unstructured and semi-structured documents that are now available in electronic form into a structured format constitutes one of the most challenging tasks in computer science and industry. The Internet created a perception that everything one needs to know is at one's fingertips. Search engines strengthen this perception. But the reality is that existing systems like Google, Yahoo and others have two major drawbacks: (a) they provide only answers to isolated questions without any aggregation, so there is no way to ask a question like “How many CRM companies hired a chief privacy officer in the last two years?”, and (b) the relevancy of the results is only between 10% and 20% on average, the rest being false positives, for non-specific questions like “Who is the IT director at Wells Fargo bank?” or “Which actors were nominated for both an Oscar and a Golden Globe last year?” Such questions require a system that collects facts, presents them in a structured format, and stores them in a data repository that can be queried using an SQL-type language.
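As an illustration of the kind of aggregate question that a structured repository can answer and a keyword search cannot, the following is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample rows are hypothetical and are introduced only for this example; the source does not specify a schema.

```python
import sqlite3


def build_demo_repository():
    """Create an in-memory repository with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT, industry TEXT);
        CREATE TABLE hire (company_id INTEGER, person TEXT, title TEXT, hired_on TEXT);
        """
    )
    # Made-up sample rows, for illustration only.
    conn.executemany(
        "INSERT INTO company VALUES (?, ?, ?)",
        [(1, "Alpha CRM", "CRM"), (2, "Beta CRM", "CRM"), (3, "Gamma Bank", "Banking")],
    )
    conn.executemany(
        "INSERT INTO hire VALUES (?, ?, ?, ?)",
        [
            (1, "A. Jones", "chief privacy officer", "2004-06-01"),
            (2, "B. Smith", "chief privacy officer", "2001-03-15"),
            (3, "C. Brown", "chief privacy officer", "2004-09-30"),
        ],
    )
    return conn


def count_recent_hires(conn, industry, title, since):
    """Count distinct companies in `industry` that hired a `title` on or after `since`.

    `since` is an ISO date string (YYYY-MM-DD); ISO dates compare correctly as text.
    """
    row = conn.execute(
        """
        SELECT COUNT(DISTINCT c.id)
        FROM company AS c
        JOIN hire AS h ON h.company_id = c.id
        WHERE c.industry = ? AND h.title = ? AND h.hired_on >= ?
        """,
        (industry, title, since),
    ).fetchone()
    return row[0]
```

With the sample rows above, count_recent_hires(build_demo_repository(), "CRM", "chief privacy officer", "2003-01-01") counts only the one CRM company whose hire falls on or after the cutoff date.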
The following metaphor can be applied. Keyword search can be viewed as a process of sending scouts to find a number of objects that resemble what one is looking for. The system that converts unstructured data into a structured repository becomes an oracle that does not look for answers but just has the information ready.
The Internet has been generated by the efforts of millions of people. This endeavor could not be achieved without a flexible platform and language. HTML provided such a language and with its loose standards has been embraced worldwide. But this flexibility is a mixed blessing. It allows for unlimited capabilities to organize data on a web page, but at the same time makes its analysis a formidable task. Though there is no theoretical possibility to create an algorithm to analyze page structure of an arbitrary web page, the fact that the ultimate goal of a page is to be read by a human being makes the problem practically solvable.
A major challenge of the information retrieval field is that it deals with unstructured sources. Furthermore, these sources are created for human, not machine, consumption. The documents are organized to match the human cognitive process, which is based on conventions and habits inherent in multi-sense, multi-oracle perception.
Examples of multi-sense perception include the conventions that dictate the position of a date in a newspaper (usually on the top line of a page, sometimes on the bottom line, or in a particular frame close to the top of the page) or the continuation of an article in the next column, taking into account a picture or horizontal line that divides the page real estate into areas. Examples of multi-oracle perception mechanisms include the ways in which companies describe their customers: it can be a press release, a list of use cases, a list of logos, or simply a list of names on a page called “Our customers”.
With the increase in throughput, Internet pages become more and more complex in structure. They now include images, sounds, videos, Flash content, complex layouts, dynamic client-side scripting, etc. This complexity makes the extraction of units such as an article quite problematic. The problem is aggravated by the lack of standards and the level of creativity of webmasters. Some hope can be placed in emerging semi-structured data feed standards like RSS, but web pages that mimic the centuries-old tradition of presenting news on a page for human eyes are here to stay.
The problem of extracting the main content and discarding all other elements present on a web page constitutes a formidable challenge. At the moment the status quo is that automatic systems that “scrape” articles from different web sites for consolidation or analysis use so-called templates. Templates are formal descriptions of the way in which the webmaster of a particular newspaper presents information on the web. Templates pose three major challenges. Firstly, one needs to maintain many thousands of them. Secondly, they have to be updated on a regular basis due to ever-changing page structures, new advertisements, and the like. Because newspapers do not give notice of these changes, the maintenance of templates requires constant checking. And thirdly, it is quite difficult to describe an article accurately, especially its body, since each article has different attributes such as the number of embedded pictures, the length of the title, the length of the body, etc.
Temporal information is critical for determination of relevancy of facts extracted from a document. There are two problems to be addressed. One is to extract time stamp(s) and another one is to attribute the time stamp(s) to the extracted facts. The second problem is closely related to the recognition of HTML document layout including determination of individual frames, articles, lists, digests etc. The time stamp extraction process should be supplemented with the verification procedure and strong garbage model to minimize false positive results.
A timestamp can be either explicit or implicit. An explicit timestamp is typical for press releases, newspaper articles and other publications. An implicit timestamp is typical for information posted on companies' websites, where it is assumed that the information is current. For example, executive bios and lists of partners typically have implicit timestamps. The date of a document with an implicit timestamp is defined as the time interval during which a particular fact was or is valid.
Implicit timestamp extraction is straightforward. When a fact is extracted from a particular page for the first time, the lower bound of the time interval is set to the date of retrieval: the fact can be assumed to have been valid at least on the day of retrieval and possibly earlier. At the same time, the upper bound of the time interval is also set to the date of retrieval, since the fact is known to be valid on that day. As the crawler revisits the page and finds the page and its facts unchanged, the upper bound of the time interval is advanced to the date of the visit, because the fact continues to hold on that date.
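The interval bookkeeping described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the names ValidityInterval and observe, and the representation of a fact as a hashable key, are assumptions made for the example. What should happen when a fact disappears from a page is not described here and is left out of the sketch.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ValidityInterval:
    lower: date  # first date on which the fact was seen
    upper: date  # latest date on which the fact was still seen


def observe(intervals, fact, visit_date):
    """Record that `fact` was found on the page on `visit_date`.

    `intervals` maps a fact (any hashable key, hypothetical) to its interval.
    """
    interval = intervals.get(fact)
    if interval is None:
        # First extraction: both bounds are set to the date of retrieval.
        intervals[fact] = ValidityInterval(visit_date, visit_date)
    elif visit_date > interval.upper:
        # Revisit with the fact unchanged: advance only the upper bound.
        interval.upper = visit_date
    return intervals[fact]
```

For example, calling observe for the same fact on two successive crawl dates leaves the lower bound at the first date and moves the upper bound to the second.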
Explicit timestamps are much harder to extract. There are three major challenges: (1) multi-document nature of a web page; (2) no uniform rule of placing timestamps and (3) false clues. Typical examples of a multi-document page are a publication front page in a form of a digest or a digest of company's press releases.
In the case of a newspaper, the convention is that the top of the page contains today's date, and all articles are presumed to be timestamped with this date. The situation with a web page is much more complex, since with the development of convenient tools for web page design people have become quite creative. Nevertheless, the overall purpose of a web page, which is to distribute information in a way convenient to the reader, keeps the layout of a page from becoming completely wild. That is even more applicable to business-related articles, where the goal is to produce easily scannable documents for busy business readers. In most cases the timestamp of an article is positioned at the top of the document, while the documents on the page are positioned in sequential order from the perspective of the HTML tags.
The variety of ways in which documents created by humans represent the same facts demands that a system that recognizes and extracts those facts be a hybrid one. That is why homogeneous mechanisms cannot function properly in an open world and thus rely on constant tuning or on a focus on a well-defined domain.
For a long time the main thrust in the Information Retrieval field was building mechanisms to deal with the ever-growing amount of available information. With the explosion of the Internet the problem of scalability became critical. For keyword-based search systems scalability is straightforward. For a fact extraction system like Business Information Network the problem of scalability is significantly more complex. That is because facts about the same object occur in different documents, and thus must be collected separately but used together to infer additional facts, to verify or refute each other, and to build a representative description of the object.
The original premise of Information Retrieval was to create mechanisms that retrieve relevant documents with as few false negatives (missed documents) and false positives (non-relevant documents) as possible. All existing search engines are based on that premise, with the emphasis on keeping false negatives low. The relevancy of search results (that is, their false positive rate) is a very delicate subject, which all search vendors try to avoid. As a matter of fact, independent studies showed that a typical keyword search by a business person, like “Wells Fargo”+“IT Director”, generates up to a thousand URL links, of which just 10% are relevant, and even those are scattered throughout the results; the probability of seeing a relevant link on the first page of search results (the first 10 links) is practically the same as the probability of seeing it on the 90th page (links 891 to 900). As opposed to search engines, a system that provides answers simply cannot afford a high false positive rate. The system becomes useless (unreliable) if the false positive rate exceeds a single-digit percentage. To provide that level of quality the system should employ special protective measures to verify the facts stored in its repository.
The URL-based (static) Internet currently consists of more than 8 billion pages and grows at a rate of 4 million pages per day. These figures do not include the so-called Deep Web, the dynamically generated request-response web pages, which is one order of magnitude larger than the static Internet. The enormous size of the search space presents a significant difficulty for crawlers, since it requires hundreds of thousands of computers and connections of hundreds of gigabits per second. There is a very short list of companies, such as Google, Microsoft, Yahoo and Ask Jeeves, that can afford to crawl the entire Internet space (static pages only). And if the task is to provide a user with a keyword index to any page on the Internet, that is the price to pay. But for many tasks that is neither necessary nor sufficient.
If one looks at the problem of using the Internet as a source of answers to a particular set of questions and/or as a source of information for a particular application, the desire is to look only at “relevant” pages and never even visit the others. The problem is how to find these pages without crawling the entire Internet. One solution is to use search portals like Google to narrow the list of potentially relevant pages using keyword search. That approach assumes advance knowledge of the keywords that are used in the relevant pages. It also assumes that a third-party (Google et al.) database can be used for massive numbers of keyword requests. In addition, the number of pages to be extracted and analyzed can significantly exceed the number of relevant pages.
The static Internet constitutes just a small fraction of all documents available on the Web. The deep, or dynamic, web constitutes a significant challenge for web crawlers. The connections between web pages are generated dynamically. DHTML forms are used to define the question. The page that is rendered does not exist in advance; it is generated after the request for it is made. The content is typically contained in a server database, and the page is usually a mix of predefined templates (text, graphics, voice, video, etc.) and the results of dynamically generated database queries. Airline web sites provide a very good example of the ratio between the static pages on a web site and the information available about flights. Online dictionaries show an even more dramatic ratio between the size of the surface web and the deep web, where the deep web part constitutes 99.99% while the static web part is a mere 0.01%.
Since the main issue in dealing with the dynamic web is that the answer is rendered only to the rightfully presented question, a mechanism that deals with the Deep Web should be able to recognize what type of questions should be asked and how they should be asked, and then be able to generate all possible questions and analyze all the answers. At the moment Deep Web is not tackled by the search vendors and continues to be a strong challenge.
Typical examples are travel web sites and job boards. Furthermore, now practically any company website contains forms, e.g. to present the list of press releases. The major problem is to find out what questions to ask to retrieve the information from the databases, and how to obtain all of it.
NLP parsing is a field that was created in the 1960's by N. Chomsky's pioneering work on formal grammars for natural languages. Since that time a number of researchers have tried to create efficient mechanisms to parse a sentence written in a natural language. There are two problems associated with this task. Firstly, no formal grammar of a natural language exists, and there are no indications that one will ever be created, due to the fundamentally “non-formal” nature of natural language. Secondly, sentences quite often either do not allow for full parsing at all or can be parsed in many different ways. The result is that none of the known general parsers is acceptable from a practical standpoint. They are extremely slow and produce either too many results or none.
Dictionaries play an important role in fact verification. The main problem, though, is how to build them. Usually some form of bootstrapping is used that starts with the building of initial dictionaries. Then an iterative process uses the dictionaries to verify new facts, and these new facts help to grow the dictionaries, which in turn allow more facts to be extracted, and so on. This general approach, though, can generate a lot of false results, and specific mechanisms should be built to avoid that.
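The iterative loop described above can be sketched with a generic pattern-and-term bootstrapping scheme. This is an illustrative assumption, not the mechanism of the invention: observations are taken to be (pattern, term) pairs, a pattern becomes trusted once it has been seen with enough known terms, and a term enters the dictionary only once enough trusted patterns support it. The function name and the min_support threshold are hypothetical; the threshold stands in for the protective mechanisms that the text says are needed against false results.

```python
from collections import defaultdict


def bootstrap(seed_terms, observations, min_support=2, max_rounds=10):
    """Grow a term dictionary from `seed_terms` using (pattern, term) pairs.

    Returns the final set of terms and the set of trusted patterns.
    """
    terms = set(seed_terms)
    patterns = set()
    for _ in range(max_rounds):
        # Step 1: trust patterns that co-occur with enough known terms.
        terms_by_pattern = defaultdict(set)
        for pattern, term in observations:
            if term in terms:
                terms_by_pattern[pattern].add(term)
        new_patterns = {
            p for p, ts in terms_by_pattern.items() if len(ts) >= min_support
        } - patterns
        patterns |= new_patterns

        # Step 2: accept terms that are supported by enough trusted patterns.
        patterns_by_term = defaultdict(set)
        for pattern, term in observations:
            if pattern in patterns:
                patterns_by_term[term].add(pattern)
        new_terms = {
            t for t, ps in patterns_by_term.items() if len(ps) >= min_support
        } - terms
        terms |= new_terms

        if not new_patterns and not new_terms:
            break  # Nothing changed in this round, so the loop has converged.
    return terms, patterns
```

Raising min_support trades coverage for fewer false entries: a candidate seen with only one trusted pattern is left out of the dictionary.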
At the same time even if the parser quickly generated grammatical structure of a sentence it does not mean that the sentence contains any useful information for a particular application. Semantic and pragmatic levels of a system are usually responsible for determination of relevancy.
One of the most difficult problems in fact extraction in Information Retrieval is the identification of objects, their attributes and the relationships between objects. A typical information system contains a pre-defined set of objects. Examples are abundant. A dictionary is a classic example, with the objects being words chosen by the editors of the dictionary. In business information systems like Hoover's the objects include a pre-defined list of companies. But if the system is built automatically, the decision whether a particular sequence of words represents a new object is much more difficult. It is especially tricky in systems that analyze a large number of new documents on a daily basis, which places significant restrictions on the time spent on the analysis.
Thus, when a knowledge agent extracts a potential object, relationship or attribute, the stricter its grammar is, the fewer false positives it produces. On the other hand, the strictness of a grammar limits its applicability. The success of recursive verification depends on the level of heterogeneity of the knowledge agents and on the presence of documents that describe the same objects using different grammatical constructs. The latter is quite typical for the Internet, while heterogeneity depends on the system design.
An information system built from unstructured sources has to deal with the problem that objects and the facts about them come from disparate documents. That makes identifying objects and establishing equivalence between them a formidable task. Thus, if a web page containing an article describes a company as IBM while another one mentions International Business Machines, the facts from both articles should somehow be attributed to the blue chip company that is traded on the New York Stock Exchange under the ticker IBM, has IRS number 130871985 and is headquartered in Armonk, N.Y. To make such a determination possible, special mechanisms should be developed.
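One simple way to picture the equivalence problem is an alias table that maps every known surface name to a single canonical identifier, so that facts from different documents accumulate under one object. The sketch below is an assumption for illustration only: the class name EntityRegistry, the normalize helper and the identifier format "NYSE:IBM" are hypothetical, and a lookup table of this kind is far weaker than the special mechanisms the text calls for.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


class EntityRegistry:
    """Maps surface names to canonical identifiers and groups facts by identifier."""

    def __init__(self):
        self._alias_to_id = {}
        self.facts = defaultdict(list)

    def register(self, entity_id, aliases):
        for alias in aliases:
            self._alias_to_id[normalize(alias)] = entity_id

    def resolve(self, name):
        # Returns None when the name is not a known alias of any object.
        return self._alias_to_id.get(normalize(name))

    def add_fact(self, name, fact):
        entity_id = self.resolve(name)
        if entity_id is None:
            return False  # Unknown name: leave the decision to other mechanisms.
        self.facts[entity_id].append(fact)
        return True
```

After registering "IBM" and "International Business Machines" under one identifier, a fact extracted from either article lands in the same list, while a fact about an unregistered name is rejected rather than guessed at.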
A major challenge in fact extraction from a written document comes from the descriptive nature of any document. While describing a fact, the document uses the names of objects, not the objects themselves. Thus, fact extraction faces the classic problem of instance versus denotatum. There is no universal solution available for that problem. On the other hand, since the purpose of business-related documents is to communicate a message, there are rules that the writers of these documents follow. For example, inside one document two different companies are not called by the same name (e.g. Aspect Communications and Aspect Lab will not both be referred to simply as Aspect if both are described in the same document, while the word Aspect can be used extensively in a document describing just Aspect Communications). Another important rule is based on the fact that the object should be well defined; otherwise the message is confusing. In the case of a company there is usually a paragraph describing the details of the company, such as the “About” section in a press release, or information about the company's location or its URL. Similar narrowing mechanisms are used for people. For example, a person is mentioned in the following way: “ . . . ”, said John Smith, vice president of operations at XYZ.com. Again, if the mechanisms are applied to a narrower domain, the object identification procedures are easier to deal with than in the more general case.
Another challenge with such a system is that it should have mechanisms to go back on its decision on some equivalence without destroying others. To provide object identification and equivalence the inference mechanisms should be incorporated into the system.
One of the most common ways to introduce a person in an article is through the mentioning of the person's name, work affiliation and his/her quotes. This is how news articles and press releases are usually written. This “communication standard” constitutes one of the main sources of Business Information Network-related facts.
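As a rough illustration of how this quotation convention can be exploited, the sketch below applies a single regular expression to sentences of the form: quoted text, then "said", a capitalized name, a title, the word "at" and a company name. The pattern and the function name are assumptions made for this example; real articles vary far more than one expression can cover, so this shows only the shape of the idea.

```python
import re

# Hypothetical pattern for: "<quote>", said <Name>, <title> at <Company>
QUOTE_ATTRIBUTION = re.compile(
    r'["\u201c](?P<quote>[^"\u201d]+)["\u201d],?\s+said\s+'
    r'(?P<name>[A-Z][\w.-]*(?:\s+[A-Z][\w.-]*)+),\s+'
    r'(?P<title>[^,]+?)\s+at\s+'
    r'(?P<company>[A-Z][\w&.-]*(?:\s+[A-Z][\w&.-]*)*)'
)


def extract_attribution(sentence):
    """Return the quote, person name, title and company, or None if no match."""
    match = QUOTE_ATTRIBUTION.search(sentence)
    if match is None:
        return None
    return {
        # Trim surrounding spaces and a comma placed inside the closing quote.
        "quote": match.group("quote").strip().rstrip(","),
        "name": match.group("name"),
        "title": match.group("title").strip(),
        # Drop sentence-final punctuation captured with the company name.
        "company": match.group("company").rstrip(".,"),
    }
```

Applied to a sentence of the form "...", said John Smith, vice president of operations at XYZ.com., the function yields the name John Smith, the title vice president of operations and the company XYZ.com; a sentence without a quotation returns None.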
Quantitative information plays a very significant role in Information Retrieval. In the majority of unstructured documents the quantitative information takes the form of numbers associated with a particular countable object. These numbers represent important pieces of information that give detail to the facts described in the document. We call these numbers VINs, Very Important Numbers. Examples of VINs in the case of business facts are: the number of employees in a company, the number of customer representatives, the percentage of the budget spent on a particular business activity, the number of call centers, the number of different locations, the age of a person, his/her salary, etc. If an information system has VINs in it, its usability is significantly higher. VINs always represent the most valuable part of any market analysis, lead verification, or sales call. The VINs of countable objects constitute a significant pool of information that helps to make the right business decisions.
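A minimal sketch of VIN extraction, under the assumption that a VIN appears as a number immediately followed by one of a fixed list of countable objects, might look as follows. The list COUNTABLE_OBJECTS, the pattern and the function name are hypothetical, and the sketch deliberately ignores percentages, ages, salaries and any phrasing where the number and the object are separated.

```python
import re

# Hypothetical list of countable objects, taken from the examples in the text.
COUNTABLE_OBJECTS = ("employees", "customer representatives", "call centers", "locations")

VIN_PATTERN = re.compile(
    r"\b(?P<number>\d{1,3}(?:,\d{3})+|\d+)\s+(?P<object>"
    + "|".join(re.escape(obj) for obj in COUNTABLE_OBJECTS)
    + r")\b",
    re.IGNORECASE,
)


def extract_vins(text):
    """Return (number, countable object) pairs found in `text`, in order."""
    return [
        (int(m.group("number").replace(",", "")), m.group("object").lower())
        for m in VIN_PATTERN.finditer(text)
    ]
```

Given a made-up sentence such as "The company has 1,200 employees, 3 call centers and 15 locations.", the function returns one pair per number, with thousands separators removed.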
Extraction of entities and their relationships from a text, such as a news article or product description, is done using local grammars and an island parsing approach. The problem with local grammars is that they are domain dependent and have to be built practically from scratch for a new domain. The challenge is to build mechanisms that can automatically enhance the grammar rules without introducing false positive results.
For a long time information systems vendors built systems that had one kind of object. Examples are telephone directories of people, yellow pages, etc., where the objects are individuals and businesses respectively. Practically the same principle is used by the business information systems offered by D&B, Hoovers and others. Social networking systems existing on the market today typically apply the concept of a relationship to one type of object: people. Since business is done with people and companies together, Business Information Network's knowledge about the relationships between people, between people and companies, and between companies brings adequacy and sophistication to a completely different level. Questions like “which company from my prospect list recently employed a CIO that worked for one of my customers over the last 3 years” are completely beyond the capabilities of existing systems. Two examples of the new level of information that can be used once the Business Information Network database is built are the Implicit Social Network and the Customer Alumni Network, as introduced in this invention.
In any market economy the livelihood of a company depends on its relationships with the outside world, its internal infrastructure, its employees and its vital activity parameters, such as cash flow and profit. Short of reading people's minds and perusing proprietary documents, the Internet provides the best view of all these factors that describe a company and its place in the economy. Knowing these facts is useful in many areas; for example, it empowers sales and business development people. These facts can significantly improve their business and increase the effectiveness of the economy at large. As previously discussed, because companies are interested in promoting themselves, they willingly publish a lot of information, and the Internet has made this easier for both the publishers and the receivers of this information. The problem is how to extract the relevant facts from the billions of web pages that exist today, and from the tens of billions of pages that will populate the Internet in the not-so-distant future.
Thus there is a clear need for methods and systems, for particular domains, that extract facts from billions of unstructured documents. There is a further need for methods and systems that address the problem of efficiently finding and extracting facts about a particular subject domain from semi-structured and unstructured documents. There is yet another need for methods and systems that provide efficient finding and extraction of facts about a particular subject domain, that infer new facts from the extracted facts, and that provide ways of verifying the facts. There is yet another need for methods and systems that provide efficient finding and extraction of facts about a particular subject domain and that create an oracle that uses a structured fact representation and can become a source of knowledge about the domain to be effectively queried.