1. Field of the Invention
Embodiments of the present invention generally relate to data analysis and, more specifically, to generating data clusters of related data entities with customizable analysis strategies.
2. Description of the Related Art
In financial and security investigations, an analyst may have to make decisions regarding data entities within a collection of data. For instance, the analyst could have to decide whether an account data entity represents a fraudulent bank account. However, an individual data entity oftentimes includes insufficient information for the analyst to make such decisions. The analyst can generally make better decisions based upon a collection of related data entities. For instance, two financial transactions may be related by an identical account identifier, or two accounts belonging to one customer may be related by an identical customer identifier or other attribute (e.g., a shared phone number or address). Some currently available systems assist the analyst by identifying data entities that are directly related to an initial data entity. For example, the analyst could initiate an investigation with a single suspicious data entity or “seed,” such as a fraudulent credit card account. If the analyst examined this data entity by itself, then the analyst might not observe anything that implicates other data entities. However, the analyst could request a list of data entities related to the seed by a shared attribute, such as a customer identifier. In doing so, the analyst could discover an additional data entity, such as an additional credit card account, which relates to the original fraudulent account because of a shared customer identifier. The analyst could then mark the additional credit card account as potentially fraudulent, based upon the relationship of the shared customer identifier.
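By way of illustration only, the single-step lookup described above, in which entities directly related to a seed by a shared attribute are listed, could be sketched as follows. The function name `related_entities`, the dictionary layout, and the sample identifiers are hypothetical and are not taken from any particular existing system; the sketch assumes each data entity is a simple mapping of attribute names to values.

```python
def related_entities(seed_id, entities, attributes):
    """Return entities that share at least one attribute value with the seed.

    Hypothetical helper: maps each related entity id to the list of
    attribute names whose values match the seed's values.
    """
    seed = entities[seed_id]
    related = {}
    for entity_id, entity in entities.items():
        if entity_id == seed_id:
            continue  # the seed is not related to itself
        shared = [
            name for name in attributes
            if seed.get(name) is not None and entity.get(name) == seed.get(name)
        ]
        if shared:
            related[entity_id] = shared
    return related


# Illustrative data only: three credit card accounts, two sharing a customer.
entities = {
    "card-1": {"customer_id": "C100", "phone": "555-0101"},
    "card-2": {"customer_id": "C100", "phone": "555-0199"},
    "card-3": {"customer_id": "C200", "phone": "555-0142"},
}

result = related_entities("card-1", entities, ["customer_id", "phone"])
print(result)
```

In this sketch, the seed "card-1" is related only to "card-2", through the shared customer identifier. The lookup covers a single hop from the seed, so the analyst would have to repeat it for each newly discovered entity.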
Although these systems can be very helpful in discovering related data entities, they typically require the analyst to manually repeat the same series of searches for many investigations. Repeating the same investigation process consumes time and resources, such that there are oftentimes more investigations than can be performed. Thus, analysts typically prioritize investigations based upon the characteristics of the seeds. However, the differences between the seeds may be insignificant, so the analyst may not be able to determine the correct priority for investigations. For instance, the analyst could have to choose between two potential investigations based upon separate fraudulent credit card accounts. One investigation could reveal more potentially fraudulent credit card accounts than the other, and therefore could be more important to perform. Yet the characteristics of the two original credit card accounts could be similar, so the analyst would not be able to tell which investigation is more important. Without more information, prioritizing investigations is difficult and error-prone.