Currently almost all the real-world information that is stored on the internet is stored within documents: web pages or other files containing natural language. These documents are held on millions of computers and, where they are connected by hypertext links, are linked according to the whims of their individual authors. The documents are in a large variety of different formats and written in thousands of different natural languages. This information is unstructured.
This information is also designed for human eyes. Although natural language understanding has always been a major research area in Artificial Intelligence, computers are not capable of understanding natural language to any great extent. As a consequence, a human user wanting to find something out using the internet has to first locate a document that might have the answer and then read it. To locate the document, the only practical current technique is keyword searching.
In order to find information using keyword searching, the human user first hopes that a page or document exists which answers the question, hopes again that it has been indexed by a search engine, and then tries to imagine what distinctive words will appear in it. If any of the guessed words are wrong, or the page has not been indexed by the search engine, the user will not find the page. If the combination of words requested is contained on too many other pages, the page may be listed but the user will then have to read manually through hundreds or thousands of similar documents before finding the knowledge required.
In addition, there is a certain arbitrariness about the words being used. Searching for general information on a person or product with a unique, distinctive name has a high probability of success, but if the search is for someone with a common name, or for information on something where the name also means something else (searching in English for the Japanese board-game “Go” is a very good example), the search will fail, or an extraordinary amount of extra human effort will be needed to locate the information. Furthermore, different ways of describing the same thing mean that several different queries often need to be made or the search may fail. For example, a search for information on “Abraham Lincoln” is likely to produce a different list of documents from a search based on “President Lincoln” or “Abe Lincoln”.
Certain other types of queries are also extremely hard to answer with keyword searching. One class of examples is queries for any type of information which is dynamic. An extreme example would be the local time in a specific international city. This changes every second, so no web page indexing technique is going to be able to tell you this information at the moment of the query. Another example of a dynamic query would be to ask what the market capitalization of a company is at the current time. The answer to this depends on the precise share price of the company involved. A further example would be trying to discover the current age or marital status of a celebrity. Pages containing this information, if they were ever true, are only true at the time they were written. Search engines collect all the documents on the web and have little understanding of which contain out-of-date information. Some of these issues can be addressed with custom programming for the specific type of query at issue (e.g. adding stock quote programming to the search engine and checking for ticker symbols), but keyword indexing of documents can provide no general solution.
Another problem may be that the knowledge is conceptualised in a way that is different from the way that it is described on the web page. For example, if one is trying to locate bi-monthly magazines with a search engine, one is unlikely to turn up any examples where they are described as being published “every two months”. Another example would be trying to find all hotels within two kilometers of a specific geographical location. It is extremely unlikely that any description of the hotel will be expressed in exactly that form, so any keyword searching for this will fail. That is, because search engines do not generally understand the knowledge within a document, they cannot infer new knowledge from what is said.
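By way of illustration only, the hotel example can be sketched as follows. If each hotel's location were available as machine-readable coordinates rather than as free text, the "within two kilometers" answer could be computed instead of matched against words. The hotel identifiers, the coordinates and the function names below are hypothetical and chosen purely for the sketch; the great-circle (haversine) formula is one common way of estimating distance and is an assumption, not something prescribed by this description.

```python
import math

# Hypothetical hotel records: identifier -> (latitude, longitude) in degrees.
HOTELS = {
    "hotel_a": (51.5050, -0.1200),
    "hotel_b": (51.5500, -0.1200),
}

def haversine_km(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance in kilometers between two points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def hotels_within(lat, lon, radius_km, hotels=HOTELS):
    """Return identifiers of hotels no further than radius_km from (lat, lon)."""
    return sorted(
        name for name, (h_lat, h_lon) in hotels.items()
        if haversine_km(lat, lon, h_lat, h_lon) <= radius_km
    )

# Usage: which hypothetical hotels lie within 2 km of the reference point?
nearby = hotels_within(51.5000, -0.1200, 2.0)
```

No page needs to contain the phrase "within two kilometers" for this to work; the result is inferred from the stored coordinates.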
Another problem with natural language is that keyword searching is language specific. Automatic translation between languages is essentially an unsolved problem in Artificial Intelligence and the state of the art produces very poor results. As a consequence the web is largely partitioned by the languages used to write the pages. Someone searching in (say) Hungarian only truly has access to the knowledge stored in that part of the web which is written in the same language.
Even if a document is found that appears to answer the question, the user may not know how much faith to place in the veracity of what is asserted. The facts asserted within this document may be incorrect or out of date. No general scheme exists on the web for assessing how much confidence can be placed in the veracity of any information contained in a web page. The page could contain errors and even the authorship of the document may not be clear.
An example of a prior art search-engine interaction illustrating some of these problems is shown in FIG. 1. The user has typed a very simple question about a popular musician in the search box (102) and the search engine has responded with a list of documents (104). The web contains a very strong bias towards contemporary people, especially celebrities, and there is no shortage of information on the web which would allow a perfect system to answer this question. In fact there are many thousands of web pages with information in them suitable for answering it. However, the list of documents bears very little similarity to what is being asked and the user would have to experiment further and read through a number of documents to get an answer.
The disadvantages of keyword searching are even more extreme when the user is not human but rather an automated system such as another computer. The software within a website or other automated system needs the knowledge it requires for its processing in a form it can process. In almost all cases, documents found with keyword searching are not sufficiently processable to provide what is needed. As a consequence, almost all the world's computer systems have all the knowledge they need stored in a local database in a local format. For example, automated scheduling systems wanting to know whether a particular date is a national holiday access a custom-written routine to provide this information; they do not simply consult the internet to find out the answer.
Knowledge in structured form is knowledge stored in a form designed to be directly processable by a computer. It is designed to be read and processed automatically. Structured form means that it is not stored as natural language. It is knowledge stored in a pre-determined format readable and processable by the computer. Knowledge in structured form will include identifiers which denote objects in the real world, and examples will include assertions of information about these identified objects. An example of such an assertion would be the assertion that an identified relationship exists between two or more identified objects or that a named attribute applies to an identified object. (Individual instances of structured knowledge are referred to herein as “facts”.)
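As a minimal sketch of what such facts might look like in practice, the following represents each fact as an assertion that a relationship holds between two identified objects. The three-part layout, the class name `Fact`, the identifier spellings and the helper function are all assumptions made for illustration; no particular representation is implied by the description above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    """One assertion: a relationship holding between two identified objects."""
    subject: str   # identifier denoting a real-world object
    relation: str  # identifier denoting the relationship or attribute
    obj: str       # identifier of the second object, or the attribute value

# A tiny illustrative store of facts (identifiers are hypothetical).
FACTS = {
    Fact("abraham_lincoln", "is_a", "human_being"),
    Fact("abraham_lincoln", "born_in", "kentucky"),
    Fact("kentucky", "is_a", "us_state"),
}

def objects_related(subject, relation, store=FACTS):
    """Return every object that `subject` is asserted to have `relation` with."""
    return sorted(
        f.obj for f in store
        if f.subject == subject and f.relation == relation
    )

# Usage: a question is answered by matching identifiers, not by reading text.
birthplaces = objects_related("abraham_lincoln", "born_in")
```

Because every part of a fact is an identifier rather than a phrase in some natural language, the same store can be queried without any dependence on how a human author happened to word the information.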
To fully understand the potential advantages of embodiments of the present invention it is also important to understand some issues relating to the broadness or narrowness of the domain of knowledge being represented. Knowledge stored in (say) a company's employee relational database may be in structured form but is in an extremely narrow domain. The representation is entirely local and only meets the needs of the narrow computer application which accesses it. Typically data stored in a computer system is designed to be used by, and can only be fully exploited by, the software within that system. In contrast, general knowledge is knowledge falling within an extremely wide domain. General knowledge stored in structured form represents general knowledge in such a way that it combines at least some of the universal meaningfulness advantages of natural language with the machine-processing advantages of other computer data. However, there are very significant difficulties to overcome to achieve this.
General knowledge in structured form has a variety of uses by a computer, including direct answering of natural language questions, and assistance with other forms of natural language processing (such as mining data from documents). It can even assist with keyword searching. For example, returning to the Abraham Lincoln example above, if the structured knowledge exists that the strings “Abe Lincoln” and “President Abraham Lincoln” both denote the same unique entity, a search engine using such a knowledge base could return documents containing either term when only one was entered by the user.
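The keyword-search assistance just described can be sketched roughly as follows, assuming a hypothetical table that maps name strings to a single entity identifier. The table contents, the function names and the simple substring matching are illustrative assumptions, not a description of how any real search engine works.

```python
# Hypothetical structured knowledge: each string denotes one entity identifier.
DENOTES = {
    "abe lincoln": "abraham_lincoln",
    "abraham lincoln": "abraham_lincoln",
    "president abraham lincoln": "abraham_lincoln",
}

def expand_query(query, denotes=DENOTES):
    """Return every known string denoting the same entity as the query."""
    key = query.strip().lower()
    entity = denotes.get(key)
    if entity is None:
        return [key]  # nothing known about this string: search for it as typed
    return sorted(s for s, e in denotes.items() if e == entity)

def search(query, documents):
    """Return documents that contain any string denoting the queried entity."""
    terms = expand_query(query)
    return [doc for doc in documents if any(t in doc.lower() for t in terms)]

# Usage: one query term retrieves documents that use a different name.
docs = [
    "Abe Lincoln was born in 1809.",
    "President Abraham Lincoln delivered the Gettysburg Address.",
    "Go is a board game.",
]
hits = search("Abe Lincoln", docs)
```

Here the user typed only one form of the name, yet both documents about the same entity are returned, while the unrelated document is not.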
Building a large database of general structured knowledge presents serious difficulties. There are considerable difficulties in designing a knowledge representation method that is sufficiently expressive to represent a wide range of knowledge yet also sufficiently elementary in form to allow effective automated processing (such as inference and query responses). Building a knowledge base by hand (i.e. using direct human interaction as the source of the knowledge) is slow, so to build the largest possible knowledge base in a reasonable time requires a large number of people contributing.
One way to enable people to contribute is to select, hire and train salaried staff and then pay them to add this knowledge. Training them would typically require educating them about the underlying knowledge representation syntax and teaching them about what is already in the knowledge base.
However, opening up the process to the largest number of people (such as general users of the internet) requires making at least some of the knowledge addition process accessible to untrained users.
Enabling untrained users to add general knowledge in structured form to a knowledge base presents a number of very significant problems.
First, these users are unlikely to know anything of the underlying knowledge representation technology, so if untrained users are genuinely to be used, they will ideally need to be able to assert facts in a way that is natural to them and distinct from the knowledge representation format.
Secondly, these users are untrusted and potentially malicious. For this reason it is not desirable simply to add all knowledge asserted by such users permanently to the published knowledge base. Methods are therefore desirable to distinguish between true and untrue facts and to retain true facts while removing (or never publishing) untrue facts.
Thirdly, adding knowledge should desirably not require any previous knowledge of what is already in the knowledge base. If prior familiarity with the ontology or other facts that are already in the knowledge base is required, untrained users will find it more difficult to add knowledge.
All of the above issues, both with knowledge representation generally and with the knowledge addition process, are directly addressed in various embodiments of the present invention.