A user commonly operates multiple computing devices to access online services. For example, early in the day, the user logs in to a social media platform using a smartphone. Upon arriving at work, the user turns on a desktop computer and browses various work-related online forums. Later in the day, the user enters a meeting room and operates a laptop for a virtual meeting with attendees at different geographic locations. At home in the evening, the user streams content via a tablet.
Each of the devices is typically associated with a number of device identifiers including, for instance, a cookie and an Internet protocol (IP) address. The device identifiers are used to customize the online services for the user. However, the customization is not optimized across the different online services unless the different device identifiers are associated with the same user. For example, if one online service uses a cookie and another online service uses an IP address, and these two device identifiers are not detected as belonging to the same user, the customization can vary significantly between the two online services. For instance, the targeted content presented to the user across the two online services can be unrelated or even inconsistent. Hence, the overall quality of the online services is degraded because the user experience is not seamless across the online services.
Accordingly, many existing systems associate device identifiers with users. Clustering is one technique that can be used to perform the association. Generally, the device identifiers are clustered such that those likely belonging to the same user are grouped together in the same cluster. The customization is then performed at a cluster level. In this way, as long as the device identifiers of a user belong to the same cluster, the different online services can be customized consistently for the user across the different computing devices. For example, whether accessing a social media platform, a work-related web site, a virtual meeting service, or online content via a smartphone, a desktop computer, a laptop, or a tablet, the same message or related messages can be inserted and presented to the user if the respective device identifiers are in the same cluster.
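As an illustration only, the grouping of device identifiers into clusters can be sketched with a union-find structure. The description above does not specify a particular clustering algorithm; the function name cluster_identifiers, the notion of a "link" between two identifiers (for example, two identifiers observed in the same session), and the sample identifiers are all assumptions made for this sketch.

```python
from collections import defaultdict


def cluster_identifiers(links):
    """Group device identifiers that are connected by links.

    `links` is an iterable of (identifier_a, identifier_b) pairs. A link is a
    hypothetical signal that two identifiers likely belong to the same user.
    """
    parent = {}

    def find(x):
        # Register unseen identifiers as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(set)
    for identifier in list(parent):
        clusters[find(identifier)].add(identifier)
    # Sort for a stable, readable result.
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical sample data: two cookies share an IP address, a third does not.
links = [
    ("cookie:abc", "ip:203.0.113.7"),
    ("ip:203.0.113.7", "cookie:def"),
    ("cookie:xyz", "ip:198.51.100.4"),
]
clusters = cluster_identifiers(links)
```

In this sketch, the first three identifiers end up in one cluster and the remaining two in another, so customization performed at the cluster level would treat the first three as one user.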
The computational burden on an existing system to cluster device identifiers is significant. Typically, millions if not billions of device identifiers must be processed because of the large number of users, each being associated with multiple device identifiers. Further, some of the users are web robots (e.g., Internet bots), each having thousands or more device identifiers. To handle the computational burden, many existing systems rely on distributed computing.
In distributed computing, many computers are used. Each computer receives and processes data about a subset of the device identifiers. Processed data is exchanged between the computers to complete the clustering. Although this architecture is generally successful, it can fail under certain scenarios. For example, when the subset of data allocated to a computer is too large, the computer may not have the memory space or the processing capability to handle the data, and the clustering fails altogether. This scenario occurs, for instance, when some of the users, such as web robots, are associated with a very large number of device identifiers, resulting in skewed data and in too much data being distributed to the same computer.
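The skew scenario can be illustrated with a minimal sketch. It assumes, purely for illustration, that records are assigned to computers by hashing a grouping key; the description above does not state how existing systems partition the data. The names assign_worker and worker_loads, the key format, and the record counts are hypothetical.

```python
import zlib
from collections import Counter


def assign_worker(group_key, num_workers):
    # Deterministic hash partitioning (an assumed scheme): every record with
    # the same key is sent to the same worker.
    return zlib.crc32(group_key.encode("utf-8")) % num_workers


def worker_loads(records, num_workers):
    """Count how many (group_key, identifier) records each worker receives."""
    loads = Counter({worker: 0 for worker in range(num_workers)})
    for group_key, _identifier in records:
        loads[assign_worker(group_key, num_workers)] += 1
    return loads


# Hypothetical data: 1,000 ordinary users with one identifier each, plus one
# web robot associated with 50,000 identifiers.
records = [(f"user-{i}", f"cookie:{i}") for i in range(1000)]
records += [("bot-1", f"ip:{i}") for i in range(50_000)]

loads = worker_loads(records, num_workers=8)
heaviest = max(loads.values())
```

Because all of the robot's records share one key, they land on a single worker regardless of how many workers are available. That worker holds at least 50,000 of the 51,000 records, which is the kind of imbalance that can exhaust one computer's memory.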