Owners of web sites need to understand the capabilities of client communication devices accessing their web sites in order to optimise the content provided to different device types. For example, a news organisation's web page containing an article will be surrounded by areas highlighting other articles to which the reader can progress. On a mobile phone, a single area listing further articles might be displayed at the top of the page using plain text. On a desktop web browser with its larger screen, multiple areas listing additional articles including thumbnail images could be displayed above and to the right of the article. In both cases, the article's content will be identical. FIG. 1 shows an example layout of a web page on a mobile phone screen, in which content area 1 floats at the top of the page and always remains in view. FIG. 2 shows an example layout of the same web page for a desktop or laptop computer screen, in which two content areas are shown. The web page shown in FIG. 2 is the same as that shown in FIG. 1, but more content has been added to the right of the page in area 2 and area 1 is larger and does not float at the top of the page.
Web site owners also need to include characteristics of client communication devices in analysis of web usage in order to understand if user behaviour varies by device type. For example, analysis of the percentage of people failing to read a second news article by screen size may provide the information needed to improve the user interface on devices that correlate with a higher than average failure to read further news articles.
The Hypertext Transfer Protocol (HTTP) specification advises client devices to include headers that control how the server should manage a request. Example headers include preferred language, cookies containing information about previous requests, the types of media the device can support and information about the device. The most widely used header for the identification of device capabilities is known as a User-Agent. A User-Agent is a string of characters that a communication device can transmit to a remote service, such as a web server. The User-Agent contains information about the properties of a communication device, such as the device's hardware, operating system and web browser. Upon receiving a User-Agent from a particular communication device, the remote service can analyse the User-Agent in order to determine the properties of that device.
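As a minimal illustration of how a remote service might read the User-Agent from an incoming request, the following Python sketch splits a raw HTTP request into its headers. The request text, the host name `www.example.com` and the helper `parse_headers` are assumptions made for this example only; a production server would normally rely on its web framework's header handling.

```python
# Hypothetical raw request; the User-Agent value is Row 1 of Table 1.
RAW_REQUEST = (
    "GET /news/article HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "Accept-Language: en-GB\r\n"
    "User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; "
    "WOW64; Trident/5.0)\r\n"
    "\r\n"
)


def parse_headers(raw):
    """Return a dict of lower-cased header names to values."""
    head = raw.split("\r\n\r\n", 1)[0]   # drop any message body
    lines = head.split("\r\n")[1:]       # skip the request line
    headers = {}
    for line in lines:
        # Split on the first colon only; header values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


headers = parse_headers(RAW_REQUEST)
user_agent = headers.get("user-agent", "")
```

Header names are lower-cased because HTTP header field names are case-insensitive.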
Whilst the HTTP specification advises devices to transmit a User-Agent header, it provides no guidance concerning the structure of the character string that the header contains. As a result, a wide variety of User-Agent conventions exist, and the structure of User-Agents continues to evolve.
Table 1 shows some examples of User-Agents.
TABLE 1

| Row | Example User-Agent | Explanation |
| --- | --- | --- |
| 1 | Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0) | Used by Microsoft to identify different versions of Internet Explorer on desktop or laptop devices. |
| 2 | Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html) | Used by the Baidu search engine to identify its web site crawler. |
| 3 | Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_3 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10B329 Safari/8536.25 | Used by Apple to identify iPhone type devices. |
| 4 | Mozilla/5.0 (Linux; U; Android 4.0.4; en-us; SPH-D710 Build/IMM76I) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30 | Used by manufacturers of Android based devices to identify their devices. |
| 5 | HUAWEI Y320-T00_TD/1.0 Android 4.0.3 Release/10.01.2012 Browser/WAP2.0 appleWebkit/534.30 | Used by Huawei to identify its Y320 smartphone. |
User-Agents do not follow any defined rules and usually only the inclusion of the prefix “Mozilla/5.0” and some information between succeeding brackets can be expected. However, the User-Agent in Row 5 of Table 1 does not even contain the prefix “Mozilla/5.0” or any brackets.
Different hardware and software vendors use different formats for their User-Agents. In the Apple example at Row 3 of Table 1, the type of device can be found by looking at the string immediately following the first bracket. In the case of Row 3 of Table 1, the string is “iPhone” indicating the device is an Apple iPhone. However, the Android example at Row 4 of Table 1 contains a string indicating the device's model number before the string “Build”. In the case of Row 4 of Table 1, the string is “SPH-D710” indicating the device is a Samsung Galaxy S II. The Baidu search engine example at Row 2 of Table 1 contains no information about the type of device, but instead includes the Uniform Resource Locator (URL) “http://www.baidu.com/search/spider.html”.
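The vendor-specific conventions described above can be sketched in Python as follows. The two helper functions and their regular expressions are hypothetical and written only for this illustration; each encodes one vendor's convention and, as the example shows, does not transfer to the others.

```python
import re

# User-Agents taken from Rows 2, 3 and 4 of Table 1.
BAIDU_UA = ("Mozilla/5.0 (compatible; Baiduspider/2.0; "
            "+http://www.baidu.com/search/spider.html)")
IPHONE_UA = ("Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_3 like Mac OS X) "
             "AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 "
             "Mobile/10B329 Safari/8536.25")
ANDROID_UA = ("Mozilla/5.0 (Linux; U; Android 4.0.4; en-us; SPH-D710 "
              "Build/IMM76I) AppleWebKit/534.30 (KHTML, like Gecko) "
              "Version/4.0 Mobile Safari/534.30")


def token_after_first_bracket(ua):
    """Apple-style rule: the string immediately after the first '('."""
    match = re.search(r"\(([^;)]+)", ua)
    return match.group(1).strip() if match else None


def model_before_build(ua):
    """Android-style rule: the string immediately before 'Build/'."""
    match = re.search(r";\s*([^;()]+?)\s+Build/", ua)
    return match.group(1) if match else None


iphone_device = token_after_first_bracket(IPHONE_UA)    # "iPhone"
android_model = model_before_build(ANDROID_UA)          # "SPH-D710"

# Applying a rule to another vendor's format gives unhelpful results.
android_wrong = token_after_first_bracket(ANDROID_UA)   # "Linux"
baidu_wrong = token_after_first_bracket(BAIDU_UA)       # "compatible"
iphone_model = model_before_build(IPHONE_UA)            # None
```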
Some hardware and software vendors also include serial number information within the User-Agent to uniquely identify a specific communication device. As a result, there is a vast number of User-Agent headers in use today. The number is increasing non-linearly and, for practical purposes, is unbounded.
To identify the properties of a communication device accessing a web site, two things are required:
1. information about devices, including details of the hardware, operating system and browser information; and
2. a method of relating User-Agents, and other relevant HTTP headers, to entities contained within the information about devices.
Regular Expressions and tries are two methods currently used to achieve the latter requirement.
Regular Expressions (RegExs) are a method of matching patterns within a string of characters. Open source projects such as DetectMobileBrowsers.com (http://detectmobilebrowsers.com/) use a long list of RegExs to determine if a device is a mobile browser, or a traditional desktop or laptop based browser. RegEx based algorithms require relatively little storage space to store the list of RegExs. However, as the number of User-Agents increases, more RegExs need to be evaluated when a request is received by a web site to achieve an accurate and useful result. The number of User-Agents is now so great that the time taken to execute these RegExs is longer than web site owners wish to wait for the resulting device characteristics to be provided. For a web site where response time is extremely important, it is unacceptable to wait even 5 milliseconds whilst all the available central processing unit (CPU) capacity is used to determine the characteristics of the requesting device. A faster solution is required. Furthermore, the accuracy of the results provided by RegEx-based algorithms is often so poor as to be unusable.
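The way the cost of a RegEx-based approach grows with the number of rules can be illustrated with a short Python sketch. The rule list below is hypothetical and far shorter than any list used in practice; it is not taken from DetectMobileBrowsers.com. The function returns both a label and the number of patterns it had to evaluate, which in the worst case equals the length of the list.

```python
import re

# Hypothetical rules, evaluated in order until one matches.
RULES = [
    (re.compile(r"Baiduspider", re.I), "crawler"),
    (re.compile(r"iPhone|iPod", re.I), "mobile"),
    (re.compile(r"Android.*Mobile|WAP2\.0", re.I), "mobile"),
    (re.compile(r"MSIE|Trident", re.I), "desktop"),
]


def classify(ua):
    """Return (label, number of patterns evaluated)."""
    for tried, (pattern, label) in enumerate(RULES, start=1):
        if pattern.search(ua):
            return label, tried
    # No rule matched: every pattern was evaluated for nothing.
    return "unknown", len(RULES)


MSIE_UA = ("Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; "
           "WOW64; Trident/5.0)")
HUAWEI_UA = ("HUAWEI Y320-T00_TD/1.0 Android 4.0.3 Release/10.01.2012 "
             "Browser/WAP2.0 appleWebkit/534.30")

msie_result = classify(MSIE_UA)
huawei_result = classify(HUAWEI_UA)
unknown_result = classify("ExampleClient/1.0")
```

Each new User-Agent convention tends to require another rule, so both the worst-case number of evaluations and the maintenance burden grow with the list.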
Trie data structures can be used to provide considerably faster results, as they reduce the number of complex calculations which need to be performed. A trie is a type of tree data structure that is particularly suited for storing character strings, such as User-Agents. A typical trie has one node for every common prefix, with additional strings contained in child nodes culminating in a leaf node. The trie is evaluated from the root node down. Trie data structures are commonly used for dictionary applications to determine if a word is valid and to suggest alternative words. They work very well in such applications where there are hundreds of thousands of possible results. When used for device identification, however, tries need to be populated with tens of millions of possible User-Agents in order to maintain the required level of accuracy. Tries for accurate device identification are very large, typically more than several gigabytes. As such they are only suitable for web sites that have a large amount of available storage. They are unsuited to small and medium sized web sites that operate on relatively constrained CPU and memory resources.
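A minimal Python sketch of a trie is given below to illustrate the trade-off between lookup speed and storage. For simplicity it assumes one node per character rather than one node per common prefix, and the class names, stored keys and labels are hypothetical. Lookup walks the trie once from the root, but the node count grows with every distinct string stored.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> TrieNode
        self.value = None    # set only on nodes that end a stored string


class Trie:
    def __init__(self):
        self.root = TrieNode()
        self.node_count = 1  # counts the root

    def insert(self, key, value):
        node = self.root
        for ch in key:
            if ch not in node.children:
                node.children[ch] = TrieNode()
                self.node_count += 1
            node = node.children[ch]
        node.value = value

    def longest_prefix(self, text):
        """Return the value of the longest stored key that prefixes text."""
        node, best = self.root, None
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                break
            if node.value is not None:
                best = node.value
        return best


trie = Trie()
trie.insert("Mozilla/5.0 (iPhone", "Apple iPhone")
trie.insert("Mozilla/5.0 (Linux", "Linux based device")

result = trie.longest_prefix(
    "Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_3 like Mac OS X)")
```

The two keys share the 13-character prefix "Mozilla/5.0 (", so they occupy 25 nodes in total rather than 38. User-Agents that diverge early share far less, which is why a trie holding tens of millions of them becomes so large.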
The applicant's earlier patent, European Patent No. 2 871 816, discloses a method of identifying a property of a communication device. A plurality of data structures (such as trie data structures) are provided, each of which is designated for storing substrings that occur at a particular character position in a character string (such as a User-Agent). Each data structure comprises one or more entries, each of which comprises a substring. Data representing an association between each entry and a respective profile is stored, wherein each profile includes a value of at least one property of a communication device. The property of a communication device can be identified by searching the plurality of data structures for substrings of a character string that identifies the device (such as its User-Agent).
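The general idea of keying substrings by character position can be loosely sketched in Python as follows. This is not the method of European Patent No. 2 871 816 itself: a plain dictionary stands in for each per-position data structure, and the profile identifiers, property names and helper functions are hypothetical, chosen only to illustrate the description above.

```python
from collections import defaultdict

IPHONE_UA = ("Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_3 like Mac OS X) "
             "AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 "
             "Mobile/10B329 Safari/8536.25")
ANDROID_UA = ("Mozilla/5.0 (Linux; U; Android 4.0.4; en-us; SPH-D710 "
              "Build/IMM76I) AppleWebKit/534.30 (KHTML, like Gecko) "
              "Version/4.0 Mobile Safari/534.30")

# Hypothetical profiles, each holding property values for a device.
PROFILES = {
    "p_iphone": {"hardware": "Apple iPhone"},
    "p_sph_d710": {"hardware": "Samsung Galaxy S II"},
}

# One structure per character position: position -> {substring: profile id}.
index = defaultdict(dict)


def add_entry(position, substring, profile_id):
    index[position][substring] = profile_id


def identify(ua):
    """Return the profiles whose substring occurs at its stored position."""
    found = []
    for position, entries in sorted(index.items()):
        for substring, profile_id in entries.items():
            # str.startswith with an offset tests the substring in place.
            if ua.startswith(substring, position):
                found.append(PROFILES[profile_id])
    return found


# Positions are derived from the example strings rather than hard-coded.
add_entry(IPHONE_UA.index("iPhone"), "iPhone", "p_iphone")
add_entry(ANDROID_UA.index("SPH-D710"), "SPH-D710", "p_sph_d710")

iphone_profiles = identify(IPHONE_UA)
android_profiles = identify(ANDROID_UA)
```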
The method described in European Patent No. 2 871 816 is very accurate, requires less storage than prior trie based algorithms, and is capable of accurately identifying a device from a large corpus of known devices faster than RegEx based algorithms. It is nevertheless desirable to reduce storage requirements even further and/or to identify a device even faster, whilst maintaining a high level of accuracy.