The Transmission Control Protocol (TCP) and the Internet Protocol (IP) are used to transmit data packets in a network between a sender and a receiver. Data packets include TCP/IP headers, which identify the respective source and destination addresses. It is desirable to anonymize TCP/IP headers in order to protect a client's privacy.
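As an illustration only, header anonymization can be sketched as replacing each address with a keyed pseudonym while leaving the other header fields intact. The sketch below assumes a simplified header represented as a dictionary and a keyed hash (HMAC-SHA256) as the anonymization technique; the field names, the key, and the function names `pseudonymize` and `anonymize_header` are hypothetical and are not taken from any particular anonymization tool.

```python
import hashlib
import hmac

# Hypothetical secret held by whoever collects the trace (assumption).
SECRET_KEY = b"example-key-not-for-production"


def pseudonymize(address: str, key: bytes = SECRET_KEY) -> str:
    """Map an address to a stable 16-hex-character pseudonym using a keyed hash."""
    digest = hmac.new(key, address.encode("ascii"), hashlib.sha256).hexdigest()
    return digest[:16]


def anonymize_header(header: dict, key: bytes = SECRET_KEY) -> dict:
    """Return a copy of a simplified header with both addresses pseudonymized."""
    anonymized = dict(header)
    anonymized["src"] = pseudonymize(header["src"], key)
    anonymized["dst"] = pseudonymize(header["dst"], key)
    return anonymized


# Simplified, hypothetical header; real TCP/IP headers carry more fields.
header = {
    "src": "192.0.2.10",
    "dst": "198.51.100.7",
    "src_port": 51514,
    "dst_port": 80,
    "length": 1460,
}
anon = anonymize_header(header)
```

Because the same address always maps to the same pseudonym under a given key, packets can still be grouped by endpoint after anonymization, while the original addresses are not exposed in the trace.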
Currently, anonymized TCP/IP headers may be used to identify the application layer protocol, for example to determine whether a web page is served over the Hypertext Transfer Protocol (HTTP) or over a non-HTTP protocol. However, this protocol-level distinction alone conveys little information.
Anonymized TCP/IP headers may also be used for identifying specific web pages. However, these types of identification methods are limited by resource constraints and do not scale well.
Deep packet inspection approaches examine the payload of a packet as it passes an inspection point, searching for protocol non-compliance, viruses, spam, intrusions, or other defined criteria in order to decide whether the packet may pass or needs to be routed to a different node. However, such methods are not robust enough to accommodate obfuscated traffic (i.e., encrypted or compressed traffic), which renders this methodology infeasible for such traffic.
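The limitation can be illustrated with a minimal sketch, assuming a payload inspector that searches for a fixed byte signature. The signature `EVIL`, the sample payload, and the single-byte XOR used as a stand-in for encryption are all hypothetical and chosen only for illustration; a real cipher or compressor would hide the pattern in a similar way.

```python
SIGNATURE = b"EVIL"  # hypothetical byte pattern the inspection point looks for
XOR_KEY = 0x5A       # toy stand-in for encryption (assumption, not a real cipher)


def payload_matches(payload: bytes, signature: bytes = SIGNATURE) -> bool:
    """Signature-based inspection: report whether the pattern occurs in the payload."""
    return signature in payload


def obfuscate(payload: bytes, key: int = XOR_KEY) -> bytes:
    """XOR every byte with the key; applying it twice restores the original."""
    return bytes(b ^ key for b in payload)


plain = b"GET /index.html?x=EVIL HTTP/1.1"
hidden = obfuscate(plain)

found_in_plain = payload_matches(plain)    # pattern is visible in cleartext
found_in_hidden = payload_matches(hidden)  # pattern is no longer visible
```

The payload carries the same content in both cases, but the inspector only detects the pattern in the cleartext version.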
Accordingly, a need exists for methods, systems, and computer-readable media for generating and using a web page classification model. Such a web page classification model has a wide range of applicability, including, but not limited to, predicting network traffic for network planning, identifying security breaches (e.g., web crawlers or malicious bots), profiling web page content for advertisement targeting, profiling the usage of mobile devices, profiling application type, and profiling navigation styles.