Improvements in communications and storage technology over the last few decades have resulted in huge volumes of data being available in various electronic formats. For example, in 2012 one internet search engine estimated that there were billions of indexed web pages available on the internet totalling around 5 million terabytes of data. The true amount may be even greater than this, since some web pages are not indexed by any search engine.
The size of resources such as the internet makes it impossible for any meaningful searching to be carried out without the assistance of a computer. However, many such resources have been set up to be easily intelligible by a human but not by a computer. The situation is thus that, given an initial question, a human can search and extract highly relevant information from a resource like the internet, but at far too slow a rate to be of use, while a computer can rapidly extract large volumes of information from such a resource but cannot easily determine the relevance of the information it has extracted in relation to the question.
As a result, humans have been required to pose a question, mentally deconstruct this question into pertinent keywords, use a computer to perform a search of a resource such as the internet based on those keywords, and then review the search results manually to extract the answer to their question.
However, this solution is not ideal, because people are used to posing questions in so-called ‘natural language’. For example, the simple question ‘where did you go today?’ is a natural language question that can easily be answered by a human, yet a computer may struggle to extract meaning from it in order to provide an appropriate answer. Conversely, a question phrased in terms readily intelligible by a computer (often termed a ‘query’) is difficult for a human to understand, making it difficult for a human to pose such a query in the first place. Thus, search results obtained using a basic ‘keyword’ method may be suboptimal.
In addition, many resources such as web pages, journal articles, textbooks, newspapers, magazines, patent specifications and blogs are written in natural language for human consumption. The amount of data in these resources is enormous, but they remain difficult for computers to make use of due to their being written in natural language.
The field of natural language processing (NLP) attempts to bridge the gap between human and machine by providing methods and algorithms that enable computers to derive meaning from natural language. In particular, NLP algorithms may translate a natural language ‘question’ into a ‘query’ suited to interrogating a fact database, or a natural language statement into a fact suited to storage in a fact database.
One application of NLP algorithms is in the field of Question Answering (QA). In a process that may be referred to as ‘query mapping’, a natural language question posed by a user can be translated by an NLP algorithm into a query that is understandable by a computer. The computer can then rapidly interrogate a fact database to gather information relevant to answering the user's question and present this information to the user, typically sorted according to relevance.
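The query mapping process described above can be sketched minimally as follows. This is an illustrative example only, assuming a hypothetical fact database of (subject, relation, object) triples and a single hand-written question pattern; a real QA system would use far more sophisticated mapping techniques.

```python
import re

# A hypothetical fact database of (subject, relation, object) triples.
FACTS = [
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
]

# Illustrative question patterns, each paired with the relation it maps to.
PATTERNS = [
    (re.compile(r"what is the capital of (\w+)\??", re.I), "capital_of"),
]

def map_question_to_query(question):
    """Translate a natural language question into a structured (relation, object) query."""
    for pattern, relation in PATTERNS:
        match = pattern.match(question)
        if match:
            return (relation, match.group(1))
    return None

def answer(question):
    """Map the question to a query, then interrogate the fact database with it."""
    query = map_question_to_query(question)
    if query is None:
        return None
    relation, obj = query
    return [s for (s, r, o) in FACTS if r == relation and o == obj]

print(answer("What is the capital of France?"))  # ['Paris']
```

Note how the structured query ('capital_of', 'France') returns a single definitive answer, rather than a ranked list of documents a user would have to review manually.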
Often, the translation of the natural language question into a query will result in a far more refined search of the fact database, such that the set of results returned may be more pertinent to the user's question. In some optimal cases, the query mapping process will result in a query that returns only a single, definitive answer from the fact database. The process of query mapping thus reduces the burden of work on the user, at least because they will not have to wade through large volumes of potentially irrelevant information in order to find an answer to their question.
Another application of NLP algorithms is in the field of fact extraction. Fact extraction is the process of transforming natural language statements into structured facts. A computer may parse a body of text, sometimes referred to as a ‘corpus’ or ‘text corpus’, and use NLP algorithms to extract facts from this corpus. The extracted facts may be stored in a fact database, which may then be interrogated to answer questions. NLP algorithms thus find application in both the extraction of facts from a corpus into a fact database and also the mapping of natural language questions into queries suitable for interrogating a fact database.
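The fact extraction process can likewise be sketched in miniature. Again this is an assumed, illustrative approach: a single regular expression stands in for a full NLP parse, and the corpus and relation name are invented for the example.

```python
import re

# A toy corpus of natural language statements.
CORPUS = "Paris is the capital of France. Berlin is the capital of Germany."

# A single illustrative extraction pattern; a real system would parse
# the text rather than rely on surface patterns.
PATTERN = re.compile(r"(\w+) is the capital of (\w+)")

def extract_facts(corpus):
    """Transform natural language statements into structured (subject, relation, object) facts."""
    facts = []
    for subject, obj in PATTERN.findall(corpus):
        facts.append((subject, "capital_of", obj))
    return facts

print(extract_facts(CORPUS))
# [('Paris', 'capital_of', 'France'), ('Berlin', 'capital_of', 'Germany')]
```

The extracted triples could then populate the same kind of fact database that query mapping interrogates, closing the loop between the two applications.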
Current NLP algorithms suffer from the problem that, as natural language sentences or questions increase in complexity, there is a combinatorial increase in the number of mappings required to extract a fact from the sentence, or a query from the question. This translates into an increase in the time taken for fact extraction or query mapping, such that the NLP algorithm may not be able to complete its task in a reasonable time frame, or in some cases at all. The NLP algorithm may be allocated additional computing resources to reduce the time taken, but this is clearly a stop-gap that fails for an arbitrarily complex sentence. In addition, in some circumstances the available computing resources may be limited, such that it is not possible to increase the resources allocated to the NLP algorithm.
Thus, it is clear that a need exists for improved natural language processing systems and methods that can reliably extract facts and/or map queries from arbitrarily complex natural language sentences or questions, in a time frame that is acceptable to a user and without commandeering prohibitively large amounts of processing resources.