1. Technical Field
The present invention generally relates to copyrighted material and in particular to discovering copyright infringement on a network, including the Internet. Still more particularly, the present invention relates to discovering copyright infringement without infringing on copyrighted material.
2. Description of the Related Art
Copyright infringement is a major problem on the Internet (Web). Digital documents such as Web pages and MP3 audio files are very easy to copy and post on a Web site. Because the number of documents online on the Web is reaching into the billions of pages, it is extremely hard for a publisher to track down sites that have infringed on an author's copyright by posting copies of the author's original work. An article entitled “Extent of copyright infringement on the Web” in the Sep. 14, 1999 issue of Fortune Investor News details the extent of copyright violations on the Web: “There are more than 2 million web sites offering, linking or referencing “warez,” the Internet code word for illegal copies of software. This problem has increased significantly over the past three years, from roughly 100,000 warez sites two years ago, to 900,000 last year.”
Generally, in the past, utilizing a search engine service to detect copyright infringement would suffice. Keywords would be entered into a search engine, which indexes a large portion of the Web, to identify candidate pages to examine for copyright infringement. Typically hundreds, if not thousands, of hits would be returned by the search engine based on the keyword search criteria.
The candidate pages were then downloaded to the author's or publisher's computer. The searcher would then perform further computer-aided processing on the candidate pages to identify potential infringers. If there were just a few pages, reading the downloaded files would be the next step to determine whether there was any infringement. Typically, however, there would be many files to inspect, and this would require a further search involving more complex pattern matching. This step would narrow the choices further so that the remaining files could be visually inspected to see whether a copyright was being violated.
Unfortunately, the passage of The Digital Millennium Copyright Act, signed into law on Oct. 28, 1998, has made the approach described above untenable. The digital age has prompted the United States Congress to pass strict laws on copyright protection. A strict interpretation of the law would prevent anyone but exempted entities from storing copies of copyrighted Web documents on their computers, except for downloading incidental to viewing (caching and immediate viewing). While the law is complex, the only clear exemptions are: Internet Service Providers (ISPs); search engines, as long as they do not profit directly; non-profit educational institutions; and system caching.
Generally, a publisher is not concerned about the copying of a line or two of text or a few bars of music, because such copying generally falls within “fair use.” The publisher is most concerned about the verbatim copying of entire paragraphs of text or sections of music. Even if data could be downloaded “legally” to disk, typical pattern matching algorithms take an inordinate amount of time when the strings being matched are very long (e.g., a text paragraph).
Due to the billions of Web pages on the Internet and the restrictions of The Digital Millennium Copyright Act, detecting unauthorized posting or copyright infringement on the Web becomes nearly impossible. Therefore, it would be desirable to provide a process that would enable an author or publisher to perform a reasonably thorough search of the Internet for copyright infringers without violating The Digital Millennium Copyright Act. Further, it would be desirable to detect Web pages that have copied or modified copyrighted digital data on the Internet without extracting and storing those pages for further processing.