Documents and unstructured data often contain various entities that a user would like to readily identify, such as formulae, words, phrases, or other terms. For example, a chemist might want to know all the chemical compounds referred to in a particular reference, such as an issued patent. One way of obtaining this information would be to manually read through the reference while marking or making a note of all the compounds appearing in that reference. Another way would be to have a computer analyze the text and compare that text against a library of chemical formulae and names. While such an automated approach might take less time, it is not necessarily more accurate. Furthermore, depending on how the entities of interest were tagged, the automated process might not be scalable.
What is needed is a scalable solution that allows for the rapid analysis of text in order to extract entities that are meaningful to a user, especially a solution that is retargetable to new corpora. Such a solution would ideally be applicable to different kinds of entities, such as formulae and text-based words and phrases, thereby greatly improving the process of extracting structure from documents or unstructured data.