Generally, “credibility” is defined as the quality, capability, or power to elicit belief or trust. Because credibility is thus necessarily dependent upon the subjective determinations of others, the process of determining the credibility of someone or something (referred to hereinafter as an entity) is likewise often a highly subjective process. Additionally, the effort required to accurately assess credibility is typically significant, as it requires gathering data from a relatively large number of people knowledgeable about the entity in question.
The relatively recent development of the Internet and World Wide Web has led to a commensurate explosion in the availability of textual documents authored by entities of every conceivable type. Given the ubiquity and relative ease of accessing such text, interest has increased in techniques (typically falling within the general categories of natural language processing and/or machine learning) for automatically processing documents in order to “understand” what information they may expressly or inherently convey. Only recently have developers of such techniques turned to the task of assessing the credibility of a document. As used herein, a document may comprise a distinct, uniquely identified collection of text, such as a word processing document, advertising copy, a web page, a web log entry, etc., or portions thereof.
For example, techniques have been developed in which the importance or credibility of a document is determined based at least in part upon the credibility of its source, e.g., its author or publisher. Obviously, for such techniques to work, data concerning the reliability of the source must be available or, at the very least, readily obtainable, which may not always be the case. Additionally, given the myriad influences that go into the development of a source's reputation for credibility, it is not unreasonable to assume that a source's credibility will not always correlate precisely with the credibility of the document.
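A source-based technique of this kind might be sketched as follows. This is an illustrative example only, not taken from any particular prior-art system: the scores, the blending weight, and the function name are all hypothetical, and real systems would derive these values from reputation data rather than supply them directly.

```python
def document_credibility(source_score, content_score, source_weight=0.5):
    """Blend a source's reputation score with a content-derived score.

    Both scores are assumed to lie in [0.0, 1.0]. The source_weight
    parameter (hypothetical) controls how strongly the source's
    reputation dominates the combined result.
    """
    if not 0.0 <= source_weight <= 1.0:
        raise ValueError("source_weight must be in [0, 1]")
    return source_weight * source_score + (1 - source_weight) * content_score

# A document from a highly reputable source with middling content signals
# inherits much of that source's credibility under this scheme:
score = document_credibility(source_score=0.9, content_score=0.5)
# → 0.7
```

Note that the sketch makes the dependency on source data explicit: if `source_score` is unavailable, the technique has nothing to blend, which is precisely the limitation described above.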
In another technique, the credibility of a topic or concept over time is determined by comparing the frequency with which an expression of that topic or concept is detected in a corpus of documents against the frequency with which a related expression of that topic or concept (e.g., a negative or inverse expression of the topic or concept in question) is detected in the documents. The intuition in this technique is that the frequency with which a concept is repeated may serve as a form of proxy for its credibility. For example, over time, the expression “global warming is real” may occur with increasing frequency as compared to the related expression “global warming is a hoax,” with the resulting inference that the concept of “global warming is real” is becoming increasingly credible. However, this technique may likewise suffer from accuracy problems to the extent that text, particularly in the context of the Internet and/or World Wide Web, is often reproduced for reasons other than a subjective belief or trust in its semantic content. As a result, the frequency numbers could be easily skewed, thus resulting in an equally skewed credibility determination.
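The frequency-comparison technique described above can be sketched as a simple ratio computed per time period. This is a minimal illustration, not a reproduction of any specific system: the corpus, the time periods, and the matching-by-substring approach are all simplifying assumptions.

```python
def credibility_trend(docs_by_period, expression, counter_expression):
    """For each time period, return the fraction of documents matching
    `expression` out of all documents matching either expression or
    its related counter-expression (a frequency-based credibility proxy).
    """
    trend = {}
    for period, docs in docs_by_period.items():
        pos = sum(expression in d.lower() for d in docs)
        neg = sum(counter_expression in d.lower() for d in docs)
        if pos + neg:  # skip periods where neither expression appears
            trend[period] = pos / (pos + neg)
    return trend

# Hypothetical corpus keyed by year, using the example from the text:
corpus = {
    2008: ["Global warming is a hoax, say critics.",
           "Study: global warming is real."],
    2012: ["New data confirm global warming is real.",
           "Scientists agree global warming is real.",
           "Blog claims global warming is a hoax."],
}
trend = credibility_trend(corpus, "global warming is real",
                          "global warming is a hoax")
# trend[2008] == 0.5 and trend[2012] ≈ 0.67: the positive expression's
# share rises over time, suggesting increasing credibility under this proxy.
```

The sketch also illustrates the weakness noted above: each occurrence counts equally regardless of why the text was reproduced, so quoted, satirical, or syndicated copies of an expression skew the ratio just as readily as sincere assertions.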