In 2008, Americans consumed information for approximately 1.3 trillion hours, or an average of almost 12 hours per day per person (Global Information Industry Center, University of California at San Diego, January 2010). Consumption totaled 3.6 zettabytes (3.6×10²¹ bytes) and 10,845 trillion (10,845×10¹²) words, corresponding to 100,500 words and 34 gigabytes for an average person on an average day. This information came from more than twenty different sources, from newspapers and books through to online media, social media, satellite radio, and Internet video, although the traditional media of radio and TV still dominated daily consumption.
Computers and the Internet have had major effects on some aspects of information consumption. In the past, information consumption was overwhelmingly passive, with the telephone being the only interactive medium. However, with computers, a full third of words and more than half of digital data are now received interactively. Reading, which had been in decline due to the growth of television, tripled from 1980 to 2008, because it is the overwhelmingly preferred way to receive words on the Internet. At the same time, portable electronic devices and the Internet have resulted in a large portion of the population, in the United States for example, becoming active generators of information throughout their daily lives as well as active consumers, augmenting their passive consumption. Social media such as Facebook™ and Twitter™, blogs, website comment sections, and search portals such as Bing™ and Yahoo™ have all contributed in different ways to the active generation of information by individuals, which augments that generated by enterprises, news organizations, Governments, and marketing organizations.
Globally, the roughly 27 million computer servers active in 2008 processed 9.57 zettabytes of information (Global Information Industry Center, University of California at San Diego, April 2011). This study also estimated that enterprise server workloads are doubling about every two years and, whilst a substantial portion of this information is highly transient, that the overall amount of information created, used, and retained is growing steadily.
The exploding growth in stored collections of numbers, images and other data represents one facet of information management for organizations, enterprises, Governments, and individuals. However, even what was once considered “mere data” becomes more important when it is actively processed by servers into meaningful information delivered for an ever-increasing number of uses. Overall, the 27 million computer servers were estimated to provide an average of 3 terabytes of information per year to each of the estimated 3.18 billion workers in the world's labor force.
Increasingly, a corporation's competitiveness hinges on its ability to employ innovative search techniques that help users discover data and obtain useful results. In some instances automatically offering recommendations for subsequent searches or extracting related information are beneficial. To gain some insight into the magnitude of the problem consider the following:
in 2009 around 3.7 million new domains were registered each month and as of June 2011 this had increased to approximately 4.5 million per month;
approximately 45% of Internet users are under 25;
there are approximately 600 million wired and 1,200 million wireless broadband subscriptions globally;
approximately 85% of wireless handsets shipped globally in 2011 included a web browser;
there are approximately 2.1 billion Internet users globally with approximately 2.4 billion social networking accounts;
there are approximately 800 million users on Facebook™ and approximately 225 million Twitter™ accounts;
there are approximately 250 million tweets per day and approximately 250 million Facebook activities;
there are approximately 3 billion Google™ searches and 300 million Yahoo™ searches per day.
Accordingly, it would be evident that users face an overwhelming barrage of information (content) that must be filtered, processed, analysed, reviewed, consolidated, and distributed or acted upon. For example, a market researcher seeking to determine the perception of a particular product may wish to rapidly collate sentiment from reviews sourced from websites, press articles, and social media. However, existing sentiment filtering approaches simply detect co-occurrences of a keyword with positive and negative terms. Accordingly, content containing the phrase “Last night I drove to see Terminator 3 in my new Fiat 500, after eating at Stonewall's, the truffle bison burger was great” would be interpreted as positive feedback, even though the positive term relates to the food rather than to either the film “Terminator 3” or the vehicle “Fiat 500.” Accordingly, it would be beneficial for sentiment analysis of content to be contextually aware.
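The limitation of such keyword co-occurrence approaches can be sketched as follows. This is an illustrative reconstruction of the naive method described above, not any particular commercial implementation; the polarity word lists are assumptions for the example. Sentiment is assigned to a keyword whenever any polarity term appears anywhere in the same text, with no regard for what that term actually modifies:

```python
# Illustrative sketch of naive keyword-plus-polarity sentiment filtering.
# The POSITIVE/NEGATIVE lexicons are hypothetical, minimal examples.
POSITIVE = {"great", "excellent", "love"}
NEGATIVE = {"terrible", "awful", "hate"}

def naive_sentiment(text: str, keyword: str) -> str:
    """Return 'positive'/'negative'/'neutral' for keyword, or 'absent'."""
    words = [w.strip(".,!?'\"").lower() for w in text.split()]
    if keyword.lower() not in " ".join(words):
        return "absent"
    # Count polarity terms anywhere in the text -- no syntactic attachment.
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

review = ("Last night I drove to see Terminator 3 in my new Fiat 500, "
          "after eating at Stonewall's, the truffle bison burger was great")
print(naive_sentiment(review, "Terminator 3"))  # misattributed: 'positive'
print(naive_sentiment(review, "Fiat 500"))      # misattributed: 'positive'
```

Because "great" co-occurs with every keyword in the sentence, the film and the vehicle are both wrongly scored positive, exactly the failure mode motivating contextually aware sentiment analysis.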
Similarly, a search by a user using the terms “Barack Obama Afghanistan” with Google™, run on May 2, 2012, returned approximately 324 million “hits” in a fraction of a second. These are displayed, by default in the absence of other filters set by the user, in an order determined by rules executed by Google™ servers relating to factors including, but not limited to, the match to user-entered keywords and the number of times a particular webpage or item of content has been opened. However, within this search the same content may legitimately be reproduced multiple times across different sources, may be partially plagiarized into other sources, or the same event may be presented through different content on other websites. Accordingly, distinct occurrences of Barack Obama visiting Afghanistan, or different aspects of a single visit, may become buried in overwhelming reporting of his most recent visit or of repeated strategic photo opportunities staged during the visit in a campaign.
Accordingly, it would be beneficial for the user to be able to retrieve a collection of multiple items of content, commonly referred to as documents, that mention one or more concepts or interests, and to have them automatically clustered into cohesive groups relating to the same concepts or interests. Each cohesive group (or cluster) so formed consists of one or more documents from the original collection that describe the same concept or interest, even where the documents employ different vocabularies. Moreover, even when a user identifies an item of content of interest, for example a review of a product, the salient text may be buried within a large amount of other content, or the item of content may be formatted for display upon laptops, tablet PCs, etc., whereas the user is accessing the content on a portable electronic device such as a smartphone or a portable gaming console.
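One simple way to form such cohesive groups can be sketched with term weighting and a similarity threshold. The sketch below is an assumption-laden illustration, not the disclosed method: it uses TF-IDF weights, cosine similarity, a hypothetical threshold of 0.2, and a greedy single-pass grouping strategy, all chosen only for the example:

```python
# Minimal pure-Python sketch of grouping documents by lexical similarity.
# TF-IDF weighting, cosine similarity, and the greedy threshold strategy
# are illustrative assumptions, not the patent's method.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))  # document freq
    n = len(docs)
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({t: tf[t] * math.log((1 + n) / (1 + df[t])) for t in tf})
    return vecs

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(docs, threshold=0.2):
    """Greedy single pass: join each document to the first cluster whose
    representative document is similar enough, else start a new cluster."""
    vecs = tfidf_vectors(docs)
    clusters = []  # list of lists of document indices
    for i, v in enumerate(vecs):
        for c in clusters:
            if cosine(v, vecs[c[0]]) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

docs = ["obama visits afghanistan troops",
        "president obama afghanistan visit troops",
        "new fiat 500 car review"]
print(cluster(docs))  # the two Afghanistan reports group together
```

Note that purely lexical weighting only partially addresses the vocabulary-mismatch problem described above; documents describing the same concept with entirely disjoint vocabularies would still require semantic techniques to group.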
Accordingly, it would be beneficial for the user to be able to access the salient text contained in one or more items of content, based on learned semantic and content-structure cues, so that extraneous elements of the item of content are removed. It would likewise be beneficial to provide a tool for automatically inducing content scrapers that filter content down to what is necessary, or that extract the core text for viewing on constrained-screen devices or for vocalizing through a screen reader. Automated summarization or text simplification may also form extensions of the scraper.
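A common baseline for this kind of salient-text extraction is a text-density heuristic. The sketch below is an illustrative assumption, not the learned scraper contemplated above: it splits markup into blocks and keeps only blocks whose ratio of text length to tag count exceeds a hypothetical threshold, which tends to discard link-heavy navigation while retaining body prose:

```python
# Illustrative text-density heuristic for salient-text extraction.
# The block-splitting tag set and min_density threshold are assumptions.
import re

def salient_blocks(html, min_density=10.0):
    """Return text blocks whose chars-per-tag density passes the threshold."""
    # Split on common block-level tags (coarse, for illustration only).
    blocks = re.split(r"(?i)</?(?:div|p|td|article|section)[^>]*>", html)
    kept = []
    for b in blocks:
        tags = len(re.findall(r"<[^>]+>", b))          # remaining inline tags
        text = re.sub(r"<[^>]+>", " ", b).strip()      # strip tags, keep text
        if text:
            density = len(text) / (tags + 1)           # chars per tag
            if density >= min_density:
                kept.append(text)
    return kept

page = ('<div><a href="#">Home</a> <a href="#">Login</a> '
        '<a href="#">Share</a></div>'
        '<p>The truffle bison burger was great and the service at the '
        'restaurant exceeded expectations.</p>')
print(salient_blocks(page))  # navigation dropped, review text kept
```

A learned scraper would replace the fixed threshold and tag list with cues induced from examples, but even this crude density filter shows how link-dense boilerplate can be separated from the core text a constrained-screen device or screen reader should present.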
Other aspects and features of the present invention will become apparent to those ordinarily skilled in the art upon review of the following description of specific embodiments of the invention in conjunction with the accompanying figures.