Investment Insights
Understanding the true value of assets of any kind is a critical problem in purchase decisions. Often, the available markets are inefficient at finding the true value (appropriate risk adjusted return) of the assets. As such, it is difficult for an investor or potential investor to make, monitor and communicate decisions around comparison of the value of two assets, comparing the value of two different asset types, and comparing the value of an asset to the market of all such assets.
Most types of assets have a price (as determined by what you pay for it) and a value (what it is actually worth). If you paid the right price, you would realize the expected value. Value is reflective of the price in relation to the risk-adjusted return. In poorly valued items, the price is subject to arbitrage. As the market becomes more efficient, the price and value should converge. Duration is the time horizon between when an interest in an asset is acquired as a primary and when sale, transfer, or the like realizes the complete value, e.g., in a completely measurable asset such as treasury cash, e.g., by sale, transfer, or the like. As the duration is increased, a secondary market or exchange may be available to trade the asset, and speculative investors may take part seeking to profit off changes in risk or reward over time, and often not focusing on the true value.
While there is much emphasis on data correctness, using incomplete data can also lead to faulty analysis. Overall completeness consists of the various completeness factors being monitored, as well as the timeliness of the data. Coverage refers to the percent of the total universe covered by a dataset and the expanse (or breadth) of the coverage. Proper coverage metrics must address the issue of whether a particular data set is indeed a representative sample of the space, such that analysis run on the data set gives results that are consistent with the same analysis on the full universe. Correctness, completeness and coverage are a function of the statistical confidence interval that is required by the user on the results. User may trade-off confidence to increase other parameters.
When investing in assets, each with its own risk and reward profile, it is important for an investor to understand what risks (including pricing risk) are being incurred, what return on investment is to be expected, what term/duration of investment is considered, and how a hypothetical efficient market would value the asset. In terms of market valuation (especially in an inefficient market that does not correctly price assets), it is important to compare an asset to three potential different valuation anchors, its “peers”, or assets which have similar risks vs. returns to evaluate which provides superior value (or explained and accounted for differences in risk and reward profiles), the overall market benchmark (i.e. the largest universe of peers that represent the appropriate market) and potentially a broader category of peers from other markets (or clusters). In some cases, the identification of the peer group may be difficult, and not readily amenable to accurate fully automated determination. Humans, however, may have insights that can resolve this issue.
Some users may have superior insights into understanding the price, risks and rewards, and thus a better assessment of the value or risk adjusted return than others. Investors seek benefit of those with superior insights as advisors. Those with superior insights can recognize patterns that others might not see, and classify data and draw abstractions and conclusions differently than others. However, prospectively determining which advisor(s) to rely on in terms of superior return on investment while incurring acceptable risk remains an unresolved problem.
Data Clustering
Data clustering is a process of grouping together data points having common characteristics. In automated processes, a cost function or distance function is defined, and data is classified is belonging to various clusters by making decisions about its relationship to the various defined clusters (or automatically defined clusters) in accordance with the cost function or distance function. Therefore, the clustering problem is an automated decision-making problem. The science of clustering is well established, and various different paradigms are available. After the cost or distance function is defined and formulated as clustering criteria, the clustering process becomes one of optimization according to an optimization process, which itself may be imperfect or provide different optimized results in dependence on the particular optimization employed. For large data sets, a complete evaluation of a single optimum state may be infeasible, and therefore the optimization process subject to error, bias, ambiguity, or other known artifacts.
In some cases, the distribution of data is continuous, and the cluster boundaries sensitive to subjective considerations or have particular sensitivity to the aspects and characteristics of the clustering technology employed. In contrast, in other cases, the inclusion of data within a particular cluster is relatively insensitive to the clustering methodology. Likewise, in some cases, the use of the clustering results focuses on the marginal data, that is, the quality of the clustering is a critical factor in the use of the system.
The ultimate goal of clustering is to provide users with meaningful insights from the original data, so that they can effectively solve the problems encountered. Clustering acts to effectively reduce the dimensionality of a data set by treating each cluster as a degree of freedom, with a distance from a centroid or other characteristic exemplar of the set. In a non-hybrid system, the distance is a scalar, while in systems that retain some flexibility at the cost of complexity, the distance itself may be a vector. Thus, a data set with 10,000 data points, potentially has 10,000 degrees of freedom, that is, each data point represents the centroid of its own cluster. However, if it is clustered into 100 groups of 100 data points, the degrees of freedom is reduced to 100, with the remaining differences expressed as a distance from the cluster definition. Cluster analysis groups data objects based on information in or about the data that describes the objects and their relationships. The goal is that the objects within a group be similar (or related) to one another and different from (or unrelated to) the objects in other groups. The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the “better” or more distinct is the clustering.
In some cases, the dimensionality may be reduced to one, in which case all of the dimensional variety of the data set is reduced to a distance according to a distance function. This distance function may be useful, since it permits dimensionless comparison of the entire data set, and allows a user to modify the distance function to meet various constraints. Likewise, in certain types of clustering, the distance functions for each cluster may be defined independently, and then applied to the entire data set. In other types of clustering, the distance function is defined for the entire data set, and is not (or cannot readily be) tweaked for each cluster. Similarly, feasible clustering algorithms for large data sets preferably do not have interactive distance functions in which the distance function itself changes depending on the data. Many clustering processes are iterative, and as such produce a putative clustering of the data, and then seek to produce a better clustering, and when a better clustering is found, making that the putative clustering. However, in complex data sets, there are relationships between data points such that a cost or penalty (or reward) is incurred if data points are clustered in a certain way. Thus, while the clustering algorithm may split data points which have an affinity (or group together data points, which have a negative affinity, the optimization becomes more difficult.
Thus, for example, a semantic database may be represented as a set of documents with words or phrases. Words may be ambiguous, such as “apple”, representing a fruit, a computer company, a record company, and a musical artist. In order to effectively use the database, the multiple meanings or contexts need to be resolved. In order to resolve the context, an automated process might be used to exploit available information for separating the meanings, i.e., clustering documents according to their context. This automated process can be difficult as the data set grows, and in some cases the available information is insufficient for accurate automated clustering. On the other hand, a human can often determine a context by making an inference, which, though subject to error or bias, may represent a most useful result regardless.
In supervised classification, the mapping from a set of input data vectors to a finite set of discrete class labels is modeled in terms of some mathematical function including a vector of adjustable parameters. The values of these adjustable parameters are determined (optimized) by an inductive learning algorithm (also termed inducer), whose aim is to minimize an empirical risk function on a finite data set of input. When the inducer reaches convergence or terminates, an induced classifier is generated. In unsupervised classification, called clustering or exploratory data analysis, no labeled data are available. The goal of clustering is to separate a finite unlabeled data set into a finite and discrete set of “natural,” hidden data structures, rather than provide an accurate characterization of unobserved samples generated from the same probability distribution. In semi-supervised classification, a portion of the data are labeled, or sparse label feedback is used during the process.
Non-predictive clustering is a subjective process in nature, seeking to ensure that the similarity between objects within a cluster is larger than the similarity between objects belonging to different clusters. Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. If meaningful groups are the goal, then the clusters should capture the “natural” structure of the data. In some cases, however, cluster analysis is only a useful starting point for other purposes, such as data summarization. However, this often begs the question, especially in marginal cases; what is the natural structure of the data, and how do we know when the clustering deviates from “truth”?
Many data analysis techniques, such as regression or principal component analysis (PCA), have a time or space complexity of O(m2) or higher (where m is the number of objects), and thus, are not practical for large data sets. However, instead of applying the algorithm to the entire data set, it can be applied to a reduced data set consisting only of cluster prototypes. Depending on the type of analysis, the number of prototypes, and the accuracy with which the prototypes represent the data, the results can be comparable to those that would have been obtained if all the data could have been used. The entire data set may then be assigned to the clusters based on a distance function.
Clustering algorithms partition data into a certain number of clusters (groups, subsets, or categories). Important considerations include feature selection or extraction (choosing distinguishing or important features, and only such features); Clustering algorithm design or selection (accuracy and precision with respect to the intended use of the classification result; feasibility and computational cost; etc.); and to the extent different from the clustering criterion, optimization algorithm design or selection.
Finding nearest neighbors can require computing the pairwise distance between all points. However, clusters and their cluster prototypes might be found more efficiently. Assuming that the clustering distance metric reasonably includes close points, and excludes far points, then the neighbor analysis may be limited to members of nearby clusters, thus reducing the complexity of the computation.
There are generally three types of clustering structures, known as partitional clustering, hierarchical clustering, and individual clusters. The most commonly discussed distinction among different types of clusterings is whether the set of clusters is nested or unnested, or in more traditional terminology, hierarchical or partitional. A partitional clustering is simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. If the clusters have sub-clusters, then we obtain a hierarchical clustering, which is a set of nested clusters that are organized as a tree. Each node (cluster) in the tree (except for the leaf nodes) is the union of its children (sub-clusters), and the root of the tree is the cluster containing all the objects. Often, but not always, the leaves of the tree are singleton clusters of individual data objects. A hierarchical clustering can be viewed as a sequence of partitional clusterings and a partitional clustering can be obtained by taking any member of that sequence; i.e., by cutting the hierarchical tree at a particular level.
There are many situations in which a point could reasonably be placed in more than one cluster, and these situations are better addressed by non-exclusive clustering. In the most general sense, an overlapping or non-exclusive clustering is used to reflect the fact that an object can simultaneously belong to more than one group (class). A non-exclusive clustering is also often used when, for example, an object is “between” two or more clusters and could reasonably be assigned to any of these clusters. In a fuzzy clustering, every object belongs to every cluster with a membership weight. In other words, clusters are treated as fuzzy sets. Similarly, probabilistic clustering techniques compute the probability with which each point belongs to each cluster.
In many cases, a fuzzy or probabilistic clustering is converted to an exclusive clustering by assigning each object to the cluster in which its membership weight or probability is highest. Thus, the inter-cluster and intra-cluster distance function is symmetric. However, it is also possible to apply a different function to uniquely assign objects to a particular cluster.
A well-separated cluster is a set of objects in which each object is closer (or more similar) to every other object in the cluster than to any object not in the cluster. Sometimes a threshold is used to specify that all the objects in a cluster must be sufficiently close (or similar) to one another. The distance between any two points in different groups is larger than the distance between any two points within a group. Well-separated clusters do not need to be spherical, but can have any shape.
If the data is represented as a graph, where the nodes are objects and the links represent connections among objects, then a cluster can be defined as a connected component; i.e., a group of objects that are significantly connected to one another, but that have less connected to objects outside the group. This implies that each object in a contiguity-based cluster is closer to some other object in the cluster than to any point in a different cluster.
A density-based cluster is a dense region of objects that is surrounded by a region of low density. A density-based definition of a cluster is often employed when the clusters are irregular or intertwined, and when noise and outliers are present. DBSCAN is a density-based clustering algorithm that produces a partitional clustering, in which the number of clusters is automatically determined by the algorithm. Points in low-density regions are classified as noise and omitted; thus, DBSCAN does not produce a complete clustering.
A prototype-based cluster is a set of objects in which each object is closer (more similar) to the prototype that defines the cluster than to the prototype of any other cluster. For data with continuous attributes, the prototype of a cluster is often a centroid, i.e., the average (mean) of all the points in the cluster. When a centroid is not meaningful, such as when the data has categorical attributes, the prototype is often a medoid, i.e., the most representative point of a cluster. For many types of data, the prototype can be regarded as the most central point. These clusters tend to be globular. K-means is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (K), which are represented by their centroids. Prototype-based clustering techniques create a one-level partitioning of the data objects. There are a number of such techniques, but two of the most prominent are K-means and K-medoid. K-means defines a prototype in terms of a centroid, which is usually the mean of a group of points, and is typically applied to objects in a continuous n-dimensional space. K-medoid defines a prototype in terms of a medoid, which is the most representative point for a group of points, and can be applied to a wide range of data since it requires only a proximity measure for a pair of objects. While a centroid almost never corresponds to an actual data point, a medoid, by its definition, must be an actual data point.
In the K-means clustering technique, we first choose K initial centroids, the number of clusters desired. Each point in the data set is then assigned to the closest centroid, and each collection of points assigned to a centroid is a cluster. The centroid of each cluster is then updated based on the points assigned to the cluster. We iteratively assign points and update until convergence (no point changes clusters), or equivalently, until the centroids remain the same. For some combinations of proximity functions and types of centroids, K-means always converges to a solution; i.e., K-means reaches a state in which no points are shifting from one cluster to another, and hence, the centroids don't change. Because convergence tends to b asymptotic, the end condition may be set as a maximum change between iterations. Because of the possibility that the optimization results in a local minimum instead of a global minimum, errors may be maintained unless and until corrected. Therefore, a human assignment or reassignment of data points into classes, either as a constraint on the optimization, or as an initial condition, is possible.
To assign a point to the closest centroid, a proximity measure is required. Euclidean (L2) distance is often used for data points in Euclidean space, while cosine similarity may be more appropriate for documents. However, there may be several types of proximity measures that are appropriate for a given type of data. For example, Manhattan (L1) distance can be used for Euclidean data, while the Jaccard measure is often employed for documents. Usually, the similarity measures used for K-means are relatively simple since the algorithm repeatedly calculates the similarity of each point to each centroid, and thus complex distance functions incur computational complexity. The clustering may be computed as a statistical function, e.g., mean square error of the distance of each data point according to the distance function from the centroid. Note that the K-means may only find a local minimum, since the algorithm does not test each point for each possible centroid, and the starting presumptions may influence the outcome. The typical distance functions for documents include the Manhattan (L1) distance, Bregman divergence, Mahalanobis distance, squared Euclidean distance and cosine similarity.
An optimal clustering will be obtained as long as two initial centroids fall anywhere in a pair of clusters, since the centroids will redistribute themselves, one to each cluster. As the number of clusters increases, it is increasingly likely that at least one pair of clusters will have only one initial centroid, and because the pairs of clusters are further apart than clusters within a pair, the K-means algorithm will not redistribute the centroids between pairs of clusters, leading to a suboptimal local minimum. One effective approach is to take a sample of points and cluster them using a hierarchical clustering technique. K clusters are extracted from the hierarchical clustering, and the centroids of those clusters are used as the initial centroids. This approach often works well, but is practical only if the sample is relatively small, e.g., a few hundred to a few thousand (hierarchical clustering is expensive), and K is relatively small compared to the sample size. Other selection schemes are also available.
The space requirements for K-means are modest because only the data points and centroids are stored. Specifically, the storage required is O((m+K)n), where m is the number of points and n is the number of attributes. The time requirements for K-means are also modest—basically linear in the number of data points. In particular, the time required is O(I×K×m×n), where I is the number of iterations required for convergence. As mentioned, I is often small and can usually be safely bounded, as most changes typically occur in the first few iterations. Therefore, K-means is linear in m, the number of points, and is efficient as well as simple provided that K, the number of clusters, is significantly less than m.
Outliers can unduly influence the clusters, especially when a squared error criterion is used. However, in some clustering applications, the outliers should not be eliminated or discounted, as their appropriate inclusion may lead to important insights. In some cases, such as financial analysis, apparent outliers, e.g., unusually profitable investments, can be the most interesting points.
Hierarchical clustering techniques are a second important category of clustering methods. There are two basic approaches for generating a hierarchical clustering: Agglomerative and divisive. Agglomerative clustering merges close clusters in an initially high dimensionality space, while divisive splits large clusters. Agglomerative clustering relies upon a cluster distance, as opposed to an object distance. For example the distance between centroids or medioids of the clusters, the closest points in two clusters, the further points in two clusters, or some average distance metric. Ward's method measures the proximity between two clusters in terms of the increase in the sum of the squares of the errors that results from merging the two clusters.
Agglomerative Hierarchical Clustering refers to clustering techniques that produce a hierarchical clustering by starting with each point as a singleton cluster and then repeatedly merging the two closest clusters until a single, all-encompassing cluster remains. Agglomerative hierarchical clustering cannot be viewed as globally optimizing an objective function. Instead, agglomerative hierarchical clustering techniques use various criteria to decide locally, at each step, which clusters should be merged (or split for divisive approaches). This approach yields clustering algorithms that avoid the difficulty of attempting to solve a hard combinatorial optimization problem. Furthermore, such approaches do not have problems with local minima or difficulties in choosing initial points. Of course, the time complexity of O(m2 log m) and the space complexity of O(m2) are prohibitive in many cases. Agglomerative hierarchical clustering algorithms tend to make good local decisions about combining two clusters since they can use information about the pair-wise similarity of all points. However, once a decision is made to merge two clusters, it cannot be undone at a later time. This approach prevents a local optimization criterion from becoming a global optimization criterion.
In supervised classification, the evaluation of the resulting classification model is an integral part of the process of developing a classification model. Being able to distinguish whether there is non-random structure in the data is an important aspect of cluster validation.