While systems exist today that enable users to store information from a variety of sources in a standard repository, these systems are limited in that they can extract and summarize only explicitly available information, e.g., file creation date, web page uniform resource locator (URL), source website, and type of document. Any additional information generally must be entered manually by the user, creating a massive data entry burden.
With the increasing commercial importance of the Internet, it is becoming more and more important to be able to reliably extract information from any web page. Because the Internet is extremely broad, it contains web pages from a wide variety of domains and with a wide variety of structures. Information extraction tools should be able to extract information from any web page in which a user may be interested.
For example, a new area of personal web creation (social bookmarking) has recently emerged. The systems in this area typically store a pointer to a web page together with a local cached copy of the web page. The limited information stored in this format is significantly less useful than relevant features automatically extracted from web pages of interest to the user would be. Such features could be, e.g., automatically summarized and compared.
What is needed are methods, systems and computer readable media that allow automatic extraction of attributes from any formatted data source. The information should be extracted in a general way, such that it can be stored and used, e.g., to form a summary of the data source. The extraction technique should not be restricted to web pages, but should be applicable to any formatted input, regardless of the format.