Parsing can be thought of as a process of finding structure in text or other ordered sequences of items. A grammar or other rules that may have been used to generate the text may be available and can be used to find the structure; alternatively, the grammar itself may need to be found. However, parsing unstructured text is a very challenging task because of the variety of ways in which people can express themselves using text and the natural ambiguity of language.
Because large amounts of information are available only in the form of unstructured text (such as most information available on the Internet), accurate, fast and cost-effective ways of parsing that unstructured text are needed to enable automated systems, such as information retrieval systems, document classification systems, machine translation systems and other systems, to make use of that information.
Some previous parsing approaches have involved manually writing large amounts of machine learning code, which is time-consuming and bug-prone, and the resulting code is difficult to understand and maintain.
Classifiers trained using large amounts of labeled training data may be used to extract information from unstructured text. However, obtaining the labeled training data is typically expensive and time-consuming and, once trained, the classifier does not adapt to changes in the use of language, such as new words or phrases.
Some previous parsing approaches have used regular expressions to analyze text. However, regular expressions are difficult for novice users to use and are defined using a limited language, so they do not allow rich and complex parsing processes to be defined. For example, the parse cannot depend on non-textual cues.
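The limitation regarding non-textual cues may be illustrated by the following minimal sketch, which is given by way of example only and does not form part of any known system. The pattern `YEAR`, the `bold` flag and the two helper functions are hypothetical names introduced here for illustration; the sketch assumes each item of text carries a formatting flag alongside its characters.

```python
import re

# Hypothetical pattern: a four-digit number in the range 1800-2099, treated as a year.
YEAR = re.compile(r"\b(1[89]|20)\d{2}\b")

# Hypothetical items, each carrying a non-textual cue (whether the text is bold).
tokens = [
    {"text": "Founded 1998", "bold": True},
    {"text": "Model 2000 ships today", "bold": False},
]


def years_by_regex(items):
    # The regular expression sees only characters, so the formatting cue is ignored.
    return [m.group(0) for t in items for m in YEAR.finditer(t["text"])]


def years_with_cue(items):
    # Hand-written logic outside the regular expression is needed to use the cue.
    return [m.group(0) for t in items if t["bold"] for m in YEAR.finditer(t["text"])]
```

In this sketch the regular expression alone returns both four-digit numbers, including the product number, whereas conditioning on the formatting flag requires additional code that the regular expression language itself cannot express.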
The embodiments described below are not limited to implementations which solve any or all of the disadvantages of known systems for parsing text and other ordered sequences of items.