This invention relates to software that interfaces to information access platforms.
A search engine is a software program used for search and retrieval in database systems. The search engine often determines the searching capabilities available to a user. A Web search engine is often an interactive tool to help people locate information available over the World Wide Web (WWW). Web search engines are actually databases that contain references to thousands of resources. There are many search engines available on the Web, from companies such as Alta Vista, Yahoo, Northern Light and Lycos.
In an aspect, the invention features a method of parsing an ambiguous grammar including pre-compiling the grammar into a binary format, parsing a query, outputting a graph by combining the parsed query and the binary format of the grammar, and outputting a frame representation of potential parses in the graph. The parsing includes permitting multiple successors within any given grammar state, representing multiple grammar states as a non-deterministic state machine with a priority queue, and using epsilon transitions to represent right-hand side grammar rules that recurse into a non-terminal. The parsing further includes following a finite state automaton corresponding to a regular expression as part of the non-deterministic state machine and matching stack symbols to the terminals in the query being parsed. The parsing may also include placing anchors in the grammar to indicate that all other putative parses that have not reached the anchor should be eliminated.
Embodiments of the invention may have one or more of the following advantages.
The parser has the capability of processing large, ambiguous grammars efficiently by using a graph rather than "pure" words. The parser allows an information access process to take grammar files and an incoming query and determine the query's structure. Generally, the parser pre-compiles the grammar into a binary format. The parser then accepts a query as input text, processes the query, and outputs a graph.
The parser improves upon an LR parser by adding the ability to handle ambiguous grammars efficiently and by permitting the system developer to include regular expressions on the right-hand side of grammar rules. In the "standard" LR parser, an ambiguous grammar would produce a conflict during the generation of LR tables. An ambiguous grammar is one that can interpret the same sequence of words as two or more different parse trees. Regular expressions are commonly used to represent patterns of alternative and/or optional words.
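The definition of an ambiguous grammar can be illustrated with a minimal sketch. The toy grammar, the symbol names `Q` and `term`, and the helper functions below are assumptions chosen for illustration and do not come from the described parser. The sketch simply enumerates every parse tree for a word sequence by brute force.

```python
# Hypothetical toy grammar: each non-terminal maps to its right-hand sides.
GRAMMAR = {
    "Q": [("Q", "and", "Q"), ("Q", "or", "Q"), ("term",)],
}

def parses(symbol, words):
    """Return every parse tree that derives the tuple `words` from `symbol`."""
    if symbol not in GRAMMAR:  # terminal: must match exactly one word
        return [symbol] if words == (symbol,) else []
    trees = []
    for rhs in GRAMMAR[symbol]:
        for children in expand(rhs, words):
            trees.append((symbol,) + children)
    return trees

def expand(rhs, words):
    """Return every way to divide `words` among the symbols of `rhs`."""
    if not rhs:
        return [()] if not words else []
    head, rest = rhs[0], rhs[1:]
    results = []
    # Every symbol consumes at least one word (the toy grammar has no empty rules).
    for i in range(1, len(words) - len(rest) + 1):
        for tree in parses(head, words[:i]):
            for tail in expand(rest, words[i:]):
                results.append((tree,) + tail)
    return results

query = tuple("term and term or term".split())
trees = parses("Q", query)  # one tree groups "and" first, the other groups "or" first
```

For this query the enumeration finds two distinct trees, which is the situation that would surface as a conflict when LR tables are generated for such a grammar.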
In traditional LR parsing, a state machine, typically represented as a set of states along with transitions between the states, is used together with a last-in first-out (LIFO) stack. The state machine is deterministic, that is, the top symbol on the stack combined with the current state specifies conclusively what the next state should be. Ambiguity is not supported in traditional LR parsing because of the deterministic nature of the state machine.
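As a point of comparison, a deterministic table-driven parser can be sketched as follows. This is a textbook-style illustration, not the tables of the described system: the grammar `S -> ( S ) | x`, the hand-built `ACTION` and `GOTO` tables, and the function name `lr_parse` are all assumptions. Each table cell holds at most one action, which is what makes the machine deterministic.

```python
# Hypothetical grammar:  S -> ( S )  |  x
# ACTION maps (state, next token) to exactly one action.
ACTION = {
    (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
    (1, "$"): ("accept",),
    (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
    (3, ")"): ("reduce", "S", 1), (3, "$"): ("reduce", "S", 1),  # S -> x
    (4, ")"): ("shift", 5),
    (5, ")"): ("reduce", "S", 3), (5, "$"): ("reduce", "S", 3),  # S -> ( S )
}
# GOTO maps (state exposed after a reduction, non-terminal) to the next state.
GOTO = {(0, "S"): 1, (2, "S"): 4}

def lr_parse(tokens):
    """Return True if `tokens` is accepted by the tables above."""
    stack = [0]                       # LIFO stack of states
    tokens = list(tokens) + ["$"]     # "$" marks the end of input
    pos = 0
    while True:
        action = ACTION.get((stack[-1], tokens[pos]))
        if action is None:            # no entry: syntax error
            return False
        if action[0] == "shift":
            stack.append(action[1])
            pos += 1
        elif action[0] == "reduce":
            _, lhs, length = action
            del stack[-length:]       # pop one state per right-hand side symbol
            stack.append(GOTO[(stack[-1], lhs)])
        else:                         # accept
            return True
```

Because every lookup yields a single action or none, there is no place in these tables to record two competing interpretations of the same input.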
To support ambiguity, the parser permits non-determinism in the state machine, that is, in any given state with any given top stack symbol, more than one successor state is permitted. This non-determinism is supported with the use of a priority queue structure representing multiple states under consideration by the parser. A priority queue is a data structure that maintains a list of items sorted by a numeric score and permits efficient additions to and deletions from the queue. Because the parser is permitted to be simultaneously in multiple states, the parser tracks multiple stacks, one associated with each current state.
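The interplay of a priority queue with multiple stacks can be sketched with a simplified non-deterministic shift-reduce recognizer. Several things here are assumptions made for illustration: the toy grammar `E -> E + E | n`, the use of the step count as the numeric score, and the omission of explicit grammar states (each queue entry carries only a position and its own stack). The described parser may score and represent its states quite differently.

```python
import heapq

# Hypothetical ambiguous grammar:  E -> E + E  |  n
RULES = [("E", ("E", "+", "E")), ("E", ("n",))]

def count_parses(tokens, start="E"):
    """Count the accepting paths, one per distinct parse of `tokens`."""
    tokens = tuple(tokens)
    # Each entry is (score, position in the query, stack); lowest score pops first.
    queue = [(0, 0, ())]
    accepted = 0
    while queue:
        score, pos, stack = heapq.heappop(queue)
        if pos == len(tokens) and stack == (start,):
            accepted += 1
            continue
        # Successor: shift the next token onto this entry's own stack.
        if pos < len(tokens):
            heapq.heappush(queue, (score + 1, pos + 1, stack + (tokens[pos],)))
        # Further successors: one per rule whose right-hand side is on top of the stack.
        for lhs, rhs in RULES:
            if stack[-len(rhs):] == rhs:
                reduced = stack[:-len(rhs)] + (lhs,)
                heapq.heappush(queue, (score + 1, pos, reduced))
    return accepted
```

With the stack `E + E` and more input remaining, both a shift and a reduce are pushed, so the two groupings of `n + n + n` are pursued side by side, each with its own stack.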
In order to improve the efficiency of the state diagrams, the parser makes use of empty transitions that are known as "epsilon" transitions. Without such transitions, a deterministic state diagram can increase exponentially in size. The increase occurs because multiple parses may lead to a common rule in the grammar, but in a deterministic state diagram, because the state representing the common rule needs to track which of numerous possible ancestors was used, there needs to be one state for each possible ancestor. However, because the parser supports ambiguity via a non-deterministic state diagram, the multiple ancestors can be tracked via the priority queue/stack tree mechanism. Thus, a common rule can be collapsed into a single state in the non-deterministic state diagram rather than replicated multiple times. In general, performing this compression in an optimal fashion is difficult. However, a large amount of compression can be achieved by inserting an epsilon whenever the right-hand side of a grammar rule recurses into a non-terminal. This has the effect of causing all occurrences of the same non-terminal in different right-hand sides to be collapsed in the non-deterministic state diagram.
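The collapsing effect can be sketched with a small network-based recognizer in which each non-terminal has exactly one sub-network and every reference to it is an input-free (epsilon) jump. The grammar, the `NETWORKS` and `FINAL` tables, and the function `accepts` are hypothetical and only approximate the idea; a plain worklist stands in for the priority queue.

```python
# Hypothetical grammar:
#   QUERY -> "find" NP | "show" NP
#   NP    -> "flights" | "cheap" NP
# One network per non-terminal; arcs are (label, next state).
NETWORKS = {
    "QUERY": {0: [("find", 1), ("show", 1)], 1: [("NP", 2)], 2: []},
    "NP":    {0: [("flights", 1), ("cheap", 2)], 2: [("NP", 1)], 1: []},
}
FINAL = {"QUERY": 2, "NP": 1}

def accepts(words, start="QUERY"):
    """Return True if `words` can be derived from `start`."""
    words = tuple(words)
    # A configuration is (network, state, position, stack of return points).
    pending = [(start, 0, 0, ())]
    while pending:
        net, state, pos, stack = pending.pop()
        if state == FINAL[net]:
            if stack:
                # Epsilon return: resume in whichever ancestor made the jump.
                ret_net, ret_state = stack[-1]
                pending.append((ret_net, ret_state, pos, stack[:-1]))
            elif pos == len(words):
                return True
        for label, nxt in NETWORKS[net][state]:
            if label in NETWORKS:
                # Epsilon jump into the shared sub-network; no word is consumed.
                # (A left-recursive grammar would loop here; the toy one is not.)
                pending.append((label, 0, pos, stack + ((net, nxt),)))
            elif pos < len(words) and words[pos] == label:
                pending.append((net, nxt, pos + 1, stack))
    return False
```

The `NP` network exists once, although it is reached from `QUERY` and from itself. The ancestor is recorded on the stack rather than in a dedicated copy of the states.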
The parser also includes support for regular expressions on the right-hand side of context-free grammar rules. Regular expressions can always be expressed as context-free rules, but it is tedious for grammar developers to perform this manual expansion, which increases both the effort required to author a grammar and the chance of human error. One implementation of this extension would be to compile the regular expressions into context-free rules mechanically and integrate these rules into the larger set of grammar rules. This can be accomplished by converting the regular expressions into finite state automata through generally known techniques and then letting a new non-terminal represent each state in the automata. However, this approach results in great inefficiency during parsing because of the large number of newly created states. Also, this expansion results in parse trees that no longer correspond to the original, unexpanded grammar, which increases the amount of effort required by the grammar developer to identify and correct errors during development.
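The mechanical expansion described above can be sketched as follows. The automaton for the pattern `please? (show | list)` is built by hand here, and the function name, the `N` prefix for generated non-terminals, and the rule representation are assumptions for illustration. The sketch shows only the step in which each automaton state becomes a new non-terminal.

```python
def automaton_to_rules(transitions, finals, prefix="N"):
    """Turn a finite state automaton into right-linear context-free rules.

    transitions: dict mapping (state, word) -> next state
    finals:      set of accepting states
    Returns a list of (left-hand side, right-hand side tuple) rules.
    """
    rules = []
    for (state, word), nxt in sorted(transitions.items()):
        # A transition state --word--> nxt becomes  N<state> -> word N<nxt>
        rules.append((f"{prefix}{state}", (word, f"{prefix}{nxt}")))
    for state in sorted(finals):
        # An accepting state may stop here:  N<state> -> (empty)
        rules.append((f"{prefix}{state}", ()))
    return rules

# Hand-built automaton for the pattern:  please? (show | list)
TRANSITIONS = {
    (0, "please"): 1,
    (0, "show"): 2, (0, "list"): 2,
    (1, "show"): 2, (1, "list"): 2,
}
FINALS = {2}

rules = automaton_to_rules(TRANSITIONS, FINALS)
```

Even this one-line pattern yields three generated non-terminals (`N0`, `N1`, `N2`) and six rules, none of which appear in the grammar as the developer wrote it.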