This application relates generally to electronic documents such as email messages and HTML (HyperText Markup Language) based documents. More particularly, this application relates to the auditing of electronic documents in large-scale Web email services. In addition, the application relates to DOM (Document Object Model) applications and DOM processing.
In the last decade, Web email has evolved, very much like regular snail mail, into an electronic messaging paradigm dominated by machine-generated messages. Some recent studies have verified that automated scripts generate more than 90% of non-spam Web email. Such messages vary in importance. Some, such as shipment notifications, flight itineraries, and bank statements, carry highly valuable information (and often sensitive personal details), while others, such as promotions or newsletters, amount to little more than junk mail, although they are not malicious. One common characteristic of these machine-generated messages is that most are highly structured documents that utilize rich HTML formatting, which repeats at a large scale, rendering them more “predictable.” This enables the application of various automated data extraction or learning methods. Features and applications driven by such methods are ubiquitous in Web email. Analyzing mail content is critical to supporting many product features, such as customized search results, tailored advertising, and spam and malware detection.
As much as these methods are, by design, automated, they still require human intervention for debugging, evaluation, and research and development. Major Web email services restrict access to such personal information to a small group of employees referred to as auditors. Auditors operate under strict contractual confidentiality obligations, which require, among other factors, that user privacy be maintained during auditing. Automated mechanisms are therefore needed that support auditing while preserving user privacy.