Hundreds of millions of people currently contribute content to the Web. They do so by rating items; sharing links, photos, music, and video; creating their own webpages or writing pages for friends, family, or an employer; socializing on social networking sites; and blogging about their daily lives and thoughts. Among those who author Web content is a group of people who contribute to more than a single Web entity, be it on a different host, in a different application, or under a different username. This group is referred to as Serial Sharers. Good examples of people who produce several types of content are university professors and students who maintain a personal Web page on one host and a separate page on their faculty site.
Some authors contribute more than others, and their opinions are heard multiple times in multiple contexts. They not only contribute content to the Web but do so on several different hosts and in various forms, be it by tagging public material, through their homepage, by blogging, by contributing portions to open content websites, and the like. These authors are not spammers in the trivial sense. Most have no intention of manipulating search results or influencing worldwide information. They simply enjoy utilizing everything the virtual world offers.
Knowing that the same person authored a collection of pages that are not trivially related may be used to enhance existing applications, and to create new ones, where knowledge about users is essential. Analyzing and using information about a single author that is extracted from different sources may add new dimensions to user information that are not easily available today.
The problems of Duplicate Page Detection and Mirror Site Detection use multi-dimensional aspects of the page to describe duplication in features such as size, structure, content, similar naming of the URL, etc. Duplication and mirroring are artifacts of hosting similar information on different machines or hosts in order to facilitate access to those pages in a desired context (e.g. hosting a mirror of a software library on a public university server).
Author Detection is somewhat similar in the sense that information written by the same author, such as a user profile or a homepage, is sometimes partially duplicated by mentioning similar topics, expressing similar opinions, repeating the same links or usernames, etc. However, sometimes each page written by the same author consists exclusively of unique segments, and some authors make a clear distinction between pages about different aspects of their lives, for example, their hobbies and their professional pages.
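By way of illustration only, the partial duplication described above (for example, the same links repeated on two pages) can be sketched as a set-overlap measure. The sketch below is a minimal example and not a method taken from the cited art: the choice of the Jaccard coefficient, the helper names `jaccard` and `link_hosts`, and the sample URLs are all assumptions made for this example.

```python
from urllib.parse import urlparse


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_hosts(links):
    """Reduce a list of URLs to the set of lowercase host names they point to."""
    hosts = (urlparse(u).netloc.lower() for u in links)
    return {h for h in hosts if h}


# Hypothetical outbound links taken from two pages on different hosts.
homepage_links = [
    "http://example.org/papers",
    "http://photos.example.net/alice",
    "http://blog.example.com/alice",
]
profile_links = [
    "http://blog.example.com/alice/about",
    "http://photos.example.net/alice",
    "http://music.example.io/charts",
]

score = jaccard(link_hosts(homepage_links), link_hosts(profile_links))
print(f"shared-link overlap: {score:.2f}")  # prints "shared-link overlap: 0.50"
```

Two pages that share no links or usernames would score zero under such a measure, which reflects the second case noted above: pages by the same author that consist exclusively of unique segments cannot be matched through repeated features alone.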
Studies have explored the field of author detection or author attribution in restricted domains. For instance, machine learning and shallow parsing methods have been used to detect authors in various collections of newsgroups. Using similar methods, short messages on online message boards have been clustered for detection of users who mask their identity.
These studies all look at very controlled and contained domains. However, employing shallow parsing and machine learning methods to solve the problem of author detection on the Web is very costly, for several reasons. First, feature extraction is a costly process that requires analyzing many aspects of the page and then producing large data structures for storing such information. Second, feature extraction in such an uncontrolled environment cannot scale up.
Rao J. R. and Rohatgi P. (2000), “Can pseudonymity really guarantee privacy?”, in Proceedings of the 9th USENIX Security Symposium, pages 85-96, try to align authors from both mailing lists and newsgroups. They report that the stylistic conventions practiced by users of the different media resulted in very poor detection rates with learning and shallow parsing methods.
US 2007/0033168 discloses a method of receiving multiple content items from a corpus of content items and ranking the content items by using an agent rank. The method includes receiving digital signatures each made by one of multiple agents, each digital signature associating one of the agents with one or more of the content items. A score is assigned to a first agent of the multiple agents, wherein the score is based upon the content items associated with the first agent by the digital signatures. This disclosure provides a method of associating an author signature with content, but it does not address the problem of Web content without such an author signature.