As the commercial importance of the Internet grows, it is becoming increasingly important to be able to reliably extract information from any web page. Because the Internet is extremely broad, it contains web pages that span a wide variety of domains and have a wide variety of structures. Information extraction tools should be able to extract information from any web page in which a user may be interested.
Current extraction systems do not allow users to direct the information extraction process. Existing information extraction systems tend to focus on extracting information from web pages within a given domain. The information is collected using web crawlers or similar technology so that a user can later search the extracted information, an approach often referred to as vertical search. These systems require extensive training using machine learning techniques to become useful for a given domain, and are unable to process web pages outside of that domain. Thus, these systems do not allow users to extract information from any web page that they may find interesting.
Other approaches to information extraction focus on techniques that allow information to be extracted from a source with a uniform and known structure (often classed as wrapper generation techniques). These approaches require the structure to be known in advance and fixed. Therefore, although these techniques can be applied to a specific web site with a uniform and known structure, they fail when the structure is either non-uniform or unknown. Thus, such systems are also not capable of extracting information from any web page of interest to a user.
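The limitation described above can be illustrated with a minimal sketch. It assumes a hand-written wrapper for a single hypothetical page template; the pattern, the function name, and both sample pages are invented for illustration and do not represent any particular wrapper generation system, which would typically induce such rules from example pages rather than hard-code them.

```python
import re

# Hypothetical wrapper: a rule tied to one known, fixed page template.
# It assumes the price always appears in <td class="price">...</td>.
PRICE_WRAPPER = re.compile(r'<td class="price">([^<]+)</td>')


def extract_price(html):
    """Return the price text if the page matches the known template, else None."""
    match = PRICE_WRAPPER.search(html)
    return match.group(1).strip() if match else None


# A page that follows the template the wrapper was written for.
known_page = (
    '<table><tr><td class="name">Widget</td>'
    '<td class="price">$9.99</td></tr></table>'
)

# A page carrying the same information in a different, unknown structure.
other_page = '<div><span id="cost">$9.99</span></div>'

print(extract_price(known_page))
print(extract_price(other_page))
```

The wrapper recovers the price from the page whose structure it was written for, but returns nothing for the second page even though the same information is present, which is the failure mode described for non-uniform or unknown structures.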
What is needed are methods, systems and computer-readable media that allow users to collect, organize and share interesting information they find anywhere on the web, by automatically extracting desired information from any web page. The information should be extracted in a general way, such that it can, for example, form a summary of the page from which it was extracted. The extraction technique should not be restricted to web pages, but should be applicable to any formatted input, regardless of the format.