1. Field of the Invention
The invention relates to pre-processing of data for a production database. More particularly, the invention relates to a method and apparatus for identifying duplicate and near-duplicate content or documents for a production database.
2. Description of Related Technology
In the early days of the Internet, information retrieval tools were rudimentary, consisting of text-based search tools such as ARCHIE, GOPHER, and WAIS. In the 1990s, the World Wide Web emerged and the first graphical web browsers, MOSAIC and NETSCAPE, became available. Internet use started to increase dramatically among individual citizens, who could connect to the network from their own homes via modem over a telephone line. With the growth of the Internet and the corresponding increase in the user population, there arose the need for more sophisticated information retrieval tools. To satisfy this need, powerful search engines, such as WEBCRAWLER (INFOSPACE, INC., BELLEVUE Wash.), ALTAVISTA (YAHOO, INC., SUNNYVALE Calif.) and GOOGLE (GOOGLE, INC., MOUNTAIN VIEW Calif.) were developed. These search engines had to be able to sift through enormous numbers of duplicate documents and avoid returning them in search results in order to provide users the most useful information. Unfortunately, as the web has continued to expand, the volume of available information has mushroomed. While search engines, such as GOOGLE, remain highly effective, the sheer volume of information they return in response to a query can overwhelm the user. Thus, the user experience has, in spite of the power of these search engines, begun to deteriorate.
In response to the proliferation of online information, vertical search tools have arisen to serve highly specific information needs. A vertical search tool may be thought of as a specialized, domain-specific search engine that mines data for one narrow niche of the market place. Post-retrieval, a vertical search tool may classify and process the information and present it in a way that renders the information easier and simpler to use and consume.
The Internet has been recognized as an excellent medium for disseminating job and employment information and has rapidly become an important tool for employers and jobseekers alike. Professional associations often provide job listings for their members and large commercial jobs databases such as MONSTER (MONSTER WORLDWIDE, INC., NEW YORK N.Y.) enjoy great popularity. Employment experts generally counsel job-seekers to use as many modalities as possible to identify and make contact with potential employers. It is also a very common practice for employers seeking employees to use different recruiting modalities: recruiters, Internet-based job bulletin boards, newspaper ads and so on. A result of this practice is that there may exist a large number of announcements, ads and descriptions for a given job on the Internet that are duplicates or near-duplicates of each other. Furthermore, the jobseeker, in order to manage a job search effectively, must find a way to manage jobs information from a multiplicity of sources. For this reason, producers of employment information, in order to serve their market most effectively, must find a way to limit or eliminate the frustratingly large number of duplicate and near-duplicate job listings that are bound to turn up in a job search.
The prior art provides various methods of assessing similarity between documents in order to identify duplicates and near duplicates in a group of documents. Approaches are often based on “signatures” wherein a document signature—a digest of the document—is created, and then pair-wise comparison of signatures is made to identify documents that are similar to each other.
For example, one approach uses “shingling” to represent a document as a series of numeric encodings, each of which encodes an n-term text span called a “shingle.” A document sketch is created by keeping every mth shingle or the shingle with the smallest hash value. There is also a super-shingling technique that creates meta-sketches to reduce computational complexity. Pairs of documents that share a large number of shingles are considered to be near-duplicates of each other. Such approaches suffer the disadvantage of performing poorly on small documents, such as web-published job listings. Additionally, reducing the volume of data in this manner can result in relatively dissimilar documents being identified as duplicates.
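The shingling idea described above can be illustrated with a minimal sketch. It is not taken from any particular prior-art system: the function names, the shingle size n=3, the sketch size k=4, the use of SHA-1 as the hash, and the Jaccard coefficient as the overlap measure are all assumptions chosen for illustration. Keeping the k smallest hashes generalizes the "smallest hash value" rule mentioned in the text (k=1).

```python
import hashlib


def shingles(text, n=3):
    """Return the set of n-term shingles (token tuples) found in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shingle_hash(shingle):
    """Encode a shingle as a stable 64-bit integer (SHA-1 is an assumption)."""
    digest = hashlib.sha1(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(text, n=3, k=4):
    """Keep only the k smallest shingle hashes as the document sketch."""
    return set(sorted(shingle_hash(s) for s in shingles(text, n))[:k])


def resemblance(a, b):
    """Fraction of members shared by two sets (Jaccard coefficient)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two hypothetical short listings that differ in a single word.
doc_a = "registered nurse needed for night shift at city hospital"
doc_b = "registered nurse needed for day shift at city hospital"

full = resemblance(shingles(doc_a), shingles(doc_b))   # 4 shared of 10 distinct
reduced = resemblance(sketch(doc_a), sketch(doc_b))    # overlap of reduced sketches
```

With these two nine-word listings, a single changed word alters three of the seven shingles in each document, so the overlap of the full shingle sets falls to 0.4. This is one way to see why shingle overlap can be an unstable signal for short documents.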
Another approach also identifies near-duplicate documents based on fingerprints. A fingerprint is generated for each of a fixed number of lists by extracting elements from a document, hashing each of the extracted elements, and determining which of the lists is to be populated with a given element. If the fingerprints of two documents are the same for any one of the lists, the documents are considered duplicates or near-duplicates.
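The list-based fingerprinting described above might look roughly like the following sketch. The details are assumptions made for illustration only: elements are taken to be whitespace-separated words, an element's hash modulo the number of lists selects its list, and each list's fingerprint is a hash of its sorted members. The names `list_fingerprints` and `share_fingerprint` are hypothetical.

```python
import hashlib


def _hash64(value):
    """Stable 64-bit hash of a string (SHA-1 is an illustrative choice)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")


def list_fingerprints(text, num_lists=4):
    """Distribute extracted elements over a fixed number of lists and
    return one fingerprint per list (None for a list that stays empty)."""
    lists = [set() for _ in range(num_lists)]
    for element in text.lower().split():
        # Assumption: the element's hash decides which list it populates.
        lists[_hash64(element) % num_lists].add(element)
    return [
        _hash64(" ".join(sorted(members))) if members else None
        for members in lists
    ]


def share_fingerprint(fp_a, fp_b):
    """True if the two documents agree on the fingerprint of any one list."""
    return any(a is not None and a == b for a, b in zip(fp_a, fp_b))


original = list_fingerprints("senior java developer wanted in austin")
reordered = list_fingerprints("wanted in austin senior java developer")
unrelated = list_fingerprints("pastry chef position denver bakery")

same = share_fingerprint(original, reordered)       # True: same elements
different = share_fingerprint(original, unrelated)  # False: no list matches
```

Because a single added or changed element disturbs only the lists it falls into, documents that differ slightly can still match on the remaining lists, which is what makes the scheme tolerant of near-duplicates.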
There is a disadvantage to approaches that rely exclusively on a comparison of fingerprints to identify duplicates and near duplicates. Documents having identical fingerprints may, in fact, not be duplicates or near duplicates. Thus, a unique document may be identified as a duplicate or near-duplicate of another document based on a non-unique fingerprint. In a case where duplicates and near-duplicates are being identified to remove them from a repository, the content contained in the mistakenly-identified and removed document is then lost. Additionally, such approaches are computationally intensive.
There also exist feature-based approaches to duplicate detection. For example, one approach uses collection statistics to identify terms occurring in the entire collection of documents that are useful for duplicate document detection. Such an approach relies on the premise that removing very infrequent and very common terms yields good document representations for identifying duplicates.
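A minimal sketch of such collection-statistics filtering follows. It assumes that document frequency is the statistic used and that terms are whitespace-separated words; the thresholds `min_df` and `max_df_ratio` and all function names are hypothetical values chosen only to make the example concrete.

```python
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def mid_frequency_terms(docs, min_df=2, max_df_ratio=0.8):
    """Keep terms that are neither very infrequent nor very common."""
    df = document_frequencies(docs)
    limit = max_df_ratio * len(docs)
    return {term for term, count in df.items() if min_df <= count <= limit}


def represent(doc, vocabulary):
    """Reduce a document to the retained terms it contains."""
    return frozenset(t for t in doc.lower().split() if t in vocabulary)


# Hypothetical collection of short listings.
docs = [
    "the nurse job in boston",
    "the nurse job in boston apply",
    "the chef job in denver",
    "the driver role in austin",
    "the clerk role in dallas",
]
vocabulary = mid_frequency_terms(docs)
first = represent(docs[0], vocabulary)
second = represent(docs[1], vocabulary)
```

In this toy collection, "the" and "in" occur in every listing and "apply" occurs in only one, so all three are dropped, and the first two listings reduce to the same representation.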
A still further approach uses document length as a binning method to reduce the number of duplicate candidates. Keywords from different parts of a seed document are used to query documents within the bin to reduce the number of duplicate candidates even further. Then, a similarity measure such as Kullback-Leibler divergence is used to perform pair-wise comparisons to determine near-duplicates.
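The length-binning and pair-wise comparison steps can be sketched as follows. This is an illustration under stated assumptions, not the referenced method itself: length is measured in words, the bin width and the additive smoothing constant are arbitrary, the keyword-query filtering step is omitted, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def bin_by_length(docs, bin_width=50):
    """Group document indices into bins by word count."""
    bins = defaultdict(list)
    for index, doc in enumerate(docs):
        bins[len(doc.split()) // bin_width].append(index)
    return bins


def term_distribution(text, vocabulary, smoothing=0.01):
    """Smoothed relative term frequencies over a shared, non-empty vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[t] for t in vocabulary) + smoothing * len(vocabulary)
    return {t: (counts[t] + smoothing) / total for t in vocabulary}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); smoothing keeps q nonzero."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)


# Hypothetical documents: two short listings and one much longer text.
docs = [
    "nurse job boston",
    "nurse job boston hospital",
    " ".join(["word"] * 60),
]
bins = bin_by_length(docs)

vocabulary = {"nurse", "job", "boston", "denver"}
p = term_distribution("nurse job boston", vocabulary)
q = term_distribution("nurse job denver", vocabulary)
same = kl_divergence(p, p)      # 0.0 for identical distributions
changed = kl_divergence(p, q)   # positive when the distributions differ
```

Only documents that land in the same bin would be compared, and a pair would be treated as near-duplicates when the divergence falls below some chosen threshold. Note that Kullback-Leibler divergence is asymmetric, so an implementation has to decide which document plays the role of p.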
Job listings typically contain a fair amount of noise, that is, content unrelated to the job listing itself. For example, job listings often contain advertising or promotional information for the search engine itself. Additionally, the formatting and layout of job listings vary greatly from one job site to another. For example, on one site, the actual description of the job may be placed in the middle portion of the listing; a description of the very same job from another site may include promotional material in the middle portion of the document. Furthermore, online job listings are often fairly short in length. The wide variation in the information content, layout and formatting of job listings from one site to another, coupled with their typically short length, poses special challenges for identifying and removing duplicates and near-duplicates that conventional duplicate removal stratagems have difficulty meeting.