Individual users commonly have multiple electronic devices. For example, an individual user may have a desktop computer, a laptop, a tablet, a cell phone, and a work computer. It is desirable to determine a set of devices that are associated with a particular user so that, when actions on those devices are tracked, the actions can be associated with a particular user profile and collectively used, for example, to identify and provide targeted marketing and content to the user. However, identifying a set of devices associated with a particular user is often difficult because users commonly have multiple devices, share devices with other users, borrow devices from one another, and use public-access devices. For example, a particular user may view an advertisement for a product on the user's mobile phone while at home. Once the user arrives at work, the user may perform online research for the product using the user's work computer. At the end of the day, the user purchases the product from the user's home computer. By using three different devices in this example (the user's phone, work computer, and home computer), the marketer that provided the original advertisement as displayed on the mobile phone sees the advertisement as wasted ad placement dollars because no purchase was made using the mobile phone. Further, the advertiser is not able to gain an understanding as to the sequence of events and the user's research done to arrive at the successful purchase because there is currently no ability to link the various devices together accurately to identify the user as a single person using multiple devices to receive the advertisement, research the product, and purchase the product.
Current techniques for identifying which devices belong to a particular user are limited in that the current techniques do not scale accurately for large data sets providing analytics information on millions of devices. In deterministic methods for identifying groups of devices associated with a particular user, an analytics system identifies multiple devices that share deterministic user identifiers, such as a login pattern for logging into one or more online services. However, while deterministic methods provide accuracy in identifying multiple devices for a user, the deterministic methods lack the scale required for large scale data analytics for data collected on millions of users operating millions of devices and interacting with thousands of different Internet brands. For example, deterministic data may not be available for many user devices or online services. There is thus a need for clustering multiple devices to identify particular users in a way that provides both accuracy and scale for large scale data analytics.