Field
The disclosed embodiments relate to collaborative machine learning. More specifically, the disclosed embodiments relate to techniques for providing a common feature protocol for collaborative machine learning.
Related Art
Analytics may be used to discover trends, patterns, relationships, and/or other attributes related to large sets of complex, interconnected, and/or multidimensional data. In turn, the discovered information may be used to gain insights and/or guide decisions and/or actions related to the data. For example, business analytics may be used to assess past performance, guide business planning, and/or identify actions that may improve future performance.
To glean such insights, large data sets of features may be analyzed using regression models, artificial neural networks, support vector machines, decision trees, naïve Bayes classifiers, and/or other types of statistical models. The discovered information may then be used to guide decisions and/or perform actions related to the data. For example, the output of a statistical model may be used to guide marketing decisions, assess risk, detect fraud, predict behavior, and/or customize or optimize use of an application or website.
However, significant time, effort, and overhead may be spent on feature selection during creation and training of statistical models for analytics. For example, a data set for a statistical model may have thousands to millions of features, including features that are created from combinations of other features, while only a fraction of the features and/or combinations may be relevant and/or important to the statistical model. At the same time, training and/or execution of statistical models with large numbers of features typically require more memory, computational resources, and time than those of statistical models with smaller numbers of features. Excessively complex statistical models that utilize too many features may additionally be at risk for overfitting.
Additional overhead and complexity may be incurred during sharing and organizing of feature sets. For example, a set of features may be shared across projects, teams, or usage contexts by denormalizing and duplicating the features in separate feature repositories for offline and online execution environments. As a result, the duplicated features may occupy significant storage resources and require synchronization across the repositories. Each team that uses the features may further incur the overhead of manually identifying features that are relevant to the team's operation from a much larger list of features for all of the teams.
Consequently, creation and use of statistical models in analytics may be facilitated by mechanisms for improving the sharing and reuse of features among the statistical models.
In the figures, like reference numerals refer to the same figure elements.