The internet provides access to a wealth of information. Documents created by authors all over the world are freely available for reading, indexing, and extraction of information. This incredible diversity of fact and opinion makes the internet the ultimate information source.
However, this same diversity creates a considerable challenge when extracting information. Information may be presented in a variety of formats, languages, and layouts. A human user may (or may not) be able to decipher individual documents to gather the information contained therein, but these differences may confuse or mislead an automated extraction system, resulting in information of little or no value. Handling documents of such varied formats is therefore a formidable obstacle to creating an automated extraction system.