The internet is one of the primary sources of information in modern life. However, valuable, useful and accurate information coexists on the web with misleading or inaccurate information. Some sources of information are more trusted and others less trusted, and still other sources cannot readily be identified as trusted or not trusted. General web-based searching can return information that is harmful or misleading. The use of non-credible sources of information as a basis for decisions can have a severe impact in fields such as politics, health, finance and many others. For instance, in the 2008 U.S. presidential campaign of Barack Obama, misleading information connecting the future president to a Muslim faith organization resulted in substantial confusion among voters. Various other instances of false or misleading reports emanating from the internet have been documented, and have had consequences affecting lives and decisions. In more everyday and personal applications, information obtained from the internet serves as a basis for decision making in insurance underwriting processes, credit and lending decisions, mergers and acquisitions, fraud detection, hiring decisions and many others. Credibility assessments are therefore of increasing importance, both in building the judgment needed to discern properly between different sources of information and in addressing contradictions in information from various sources.
Prior art approaches to this problem have attempted to reduce web spam by developing credibility-based link analysis algorithms such as those used in common search engines. Common examples include the PageRank algorithm developed and used by Google™, the TrustRank algorithm developed by Stanford University and Yahoo!™, and the HITS algorithm, which was a precursor to the PageRank algorithm. Each of these prior art approaches relies on the assumption that the quality of a web page is correlated to the quality of its links, and returns, in response to a search query, a ranked list of web pages. Spammers have created several ways to take advantage of how search engines operate, such as “hijacking” trusted web pages and building “honeypots”, or groups of legitimate-appearing web pages, to induce trusted pages to link to them. Recent studies (such as (i) D. Fetterly, M. Manasse, and M. Najork, Spam, damn spam, and statistics, WebDB, 2004, and (ii) Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen, Combating Web spam with TrustRank, VLDB, 2004) suggest that 26% of web content is spam. On top of this, there is some amount of inaccurate or mistrusted information that cannot be properly described as spam.
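By way of illustration only, the link-based assumption underlying these prior art approaches can be sketched as a simplified PageRank-style iteration. The sketch below is not the implementation used by any search engine; the function names, the damping value of 0.85, the handling of pages without outgoing links, and the example graphs (including the hypothetical "farm" pages controlled by a spammer) are all assumptions made for this example. It shows how a score derived purely from link structure can be raised by adding links, without any assessment of the reliability of the page's content.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank-style power iteration (illustrative sketch).

    links maps each page name to the list of pages it links to.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score evenly to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # score evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


def ranked_list(links):
    """Return page names ordered from highest to lowest score."""
    scores = pagerank(links)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical graph: three ordinary pages and one page with no inbound links.
honest_graph = {
    "news": ["wiki"],
    "wiki": ["news"],
    "blog": ["news", "wiki"],
    "spam": [],
}

# Same graph after a spammer adds five hypothetical pages linking to "spam".
farmed_graph = dict(honest_graph)
for i in range(5):
    farmed_graph[f"farm{i}"] = ["spam"]

score_before = pagerank(honest_graph)["spam"]
score_after = pagerank(farmed_graph)["spam"]
```

In this sketch, `score_after` exceeds `score_before` even though nothing about the content of the "spam" page has changed, which is the kind of weakness that manipulation of link structure exploits.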
As is evident, prior art approaches have been suitable for ranking web pages and providing a list of hits in response to a search request. They are inadequate, however, for assessing the reliability of the information, the reliability of the links to other sources on web pages, or the reliability of events being described, with sufficient confidence to permit decision-makers to rely on this information without a significant due diligence burden.