Embodiments described herein relate generally to webpage content analysis including, for example, methods and apparatus for programmatically determining the validity of one or more webpages.
Owners of viable brands derive benefit from Internet traffic directed to their web content. To maximize this benefit, such owners often seek to avoid potential consumer confusion occasioned by the inappropriate use of Internet domain names similar to their own. For example, a third party may create a webpage under a domain name that contains unauthorized content relating to the brand owner's product, thereby confusing visitors as to the page's true source. Additionally, brand owners often seek to avoid negative impressions that can result when a website with an Internet domain name similar to the name of the brand owner's product or service includes questionable content (such as pornographic material, content related to criminal activity, defamatory content, etc.), or links to such content. More particularly, brand owners may seek to know which webpages associated with such domain names contain at least some substantive (non-advertising) content, and which contain merely a plurality of hyperlinks to other webpages as a revenue-generating device. Those in the latter category are often referred to as pay-per-click sites.
To police third-party activity of the type described above, a brand owner must first be aware of which particular Internet domain names with lexicographical similarity to its own company, product, or service names contain content and/or hyperlinks to content likely to induce consumer confusion. Due to the sheer volume of potentially problematic domain names similar to a given brand, however, this task can be both time- and cost-intensive.
Thus, a need exists for methods and apparatus that programmatically determine the validity of a webpage in a sufficiently robust and accurate way.