Certain techniques may be used to analyze and derive insights from user interaction data gathered from online services, such as digital marketing platforms. For example, interaction data can be used to predict future user behavior. User interaction data may be represented as a sequence of event records that include, for example, categorical values (such as state, ZIP code, browser type, etc.), numerical values (such as price, age, duration of use, etc.), or some combination thereof. Moreover, user interactions may be encoded by event (e.g., by encoding each individual user event as a separate vector) or by session (e.g., by encoding all user interactions in a session into a common vector).
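The difference between encoding by event and encoding by session can be illustrated with a minimal Python sketch. The field names (`type`, `duration`), the category list `EVENT_TYPES`, and the choice of aggregating a session by counting event types and summing durations are all assumptions made for illustration; other aggregation schemes are equally possible.

```python
from collections import Counter

# Hypothetical list of event types; a real data set would define its own.
EVENT_TYPES = ["view", "click", "purchase"]

def encode_event(event):
    """Per-event encoding: one vector for each individual event."""
    flags = [1.0 if event["type"] == t else 0.0 for t in EVENT_TYPES]
    return flags + [float(event["duration"])]

def encode_session(events):
    """Per-session encoding: one common vector for the whole session.

    Assumed aggregation: count of each event type, then total duration.
    """
    counts = Counter(e["type"] for e in events)
    total_duration = sum(e["duration"] for e in events)
    return [float(counts[t]) for t in EVENT_TYPES] + [float(total_duration)]

# Hypothetical session made up of three event records.
session = [
    {"type": "view", "duration": 12},
    {"type": "view", "duration": 30},
    {"type": "purchase", "duration": 5},
]

event_vectors = [encode_event(e) for e in session]  # three vectors
session_vector = encode_session(session)            # one vector
```

In this sketch the per-event encoding yields one four-dimensional vector per event, while the per-session encoding collapses the same three events into a single four-dimensional vector.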
However, given the growth in the amount and complexity of data to be analyzed, existing analysis techniques are ineffective for deriving insights from interaction data. Hence, advanced analysis techniques may be used. One such technique is topological data analysis (“TDA”). TDA uses topology, the sub-field of mathematics concerned with the study of shape, to describe the shape or pattern of a set of data. But many advanced analysis techniques cannot operate directly on interaction data. More specifically, these techniques require data sets with fixed-dimension records and numerical fields, such that the data can be encoded as vectors forming a point cloud in a real Euclidean space. Differences between two sets of interaction data should be reflected by the distance between the two corresponding vectors (which represent the interaction data sets).
In contrast, representation vectors are a suitable input for such advanced techniques. A representation vector is a point in a coordinate system whose dimensions represent aspects of the user interactions. Interaction data must therefore be transformed into representation vectors. However, existing solutions for transforming interaction data into representation vectors present disadvantages. For instance, existing solutions are unable to encode categorical data in a manner such that the data is adequately represented in a Euclidean space. Existing solutions are also unable to compute or otherwise provide a distance reflecting the difference between two categorical variable values (e.g., a designation of “California” versus “Florida”). Additionally, the distance between values in different categories must be taken into consideration. For example, the distance between values within one category, e.g., male versus female, may differ from the distance between different possible values within another category, e.g., age group. Mixing real-valued and categorical data, or numerical data with differing scales, poses a similar problem.
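The following Python sketch illustrates two of the problems described above. It uses a naive one-hot encoding, which is chosen here only as a familiar baseline and is not attributed to any particular existing solution. The category list `STATES` and the price values are hypothetical.

```python
import math

def one_hot(value, categories):
    """Return a one-hot list for `value` over the ordered `categories`."""
    return [1.0 if value == c else 0.0 for c in categories]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical category list for a "state" field.
STATES = ["California", "Florida", "Oregon"]

california = one_hot("California", STATES)
florida = one_hot("Florida", STATES)
oregon = one_hot("Oregon", STATES)

# Problem 1: every pair of distinct categories is equally far apart,
# so the encoding carries no notion of which states are more alike.
d_ca_fl = euclidean(california, florida)
d_ca_or = euclidean(california, oregon)

# Problem 2: appending an unscaled numerical field (a hypothetical price)
# lets that field dominate the distance between mixed records.
record_a = california + [20.0]
record_b = florida + [20.0]     # differs from record_a only in state
record_c = california + [520.0]  # differs from record_a only in price

d_state_change = euclidean(record_a, record_b)
d_price_change = euclidean(record_a, record_c)
```

Under these assumptions, the distance between any two distinct states is the same (the square root of 2), and a change in price alone produces a distance of 500, which dwarfs the distance produced by a change in state.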
Accordingly, there exists a need to effectively transform user interaction data into a form suitable for advanced analysis techniques, specifically into representation vectors.