There is currently a large number of techniques (for example flow and scanning cytometry, mass cytometry, confocal microscopy, thermal cyclers, plate microarrays and microarrays on spheres, ultrasequencing, etc.) applied to several fields (for example proteomics, genomics, cytomics, cellular, metabolomics, etc.) for analyzing biological samples (for example samples of blood, bone marrow, tissues, biological liquids, yeasts, bacteria, foods, body fluids, cell cultures, etc.). These techniques provide measurements of a series of heterogeneous parameters, in digital format, defining each event separately. In the context of the invention, “event” will be understood as each element detected by means of hardware and/or software and defined by a set of parameters obtained by means of said hardware and/or software. The events can be biological or artificial. A cell is an example of a biological event; a microsphere is an example of an artificial event.
This enormous amount of information associated with events, preferably biological events, can be represented in a multidimensional space, where the values of the parameters define the position coordinates of the events in said multidimensional space. Each analysis or experiment performed on a sample can include from thousands to several millions of events with their corresponding associated parameters. The analysis of this data, which involves classifying the events into populations, can be done manually, but this process slows down considerably as the number of parameters to be analyzed increases. The trend in recent years in the fields of chemistry, medicine and biology is for the acquisition software of biomedical devices (e.g. cytometers, thermal cyclers, ultrasequencers, etc.) to take increasingly complex measurements, with a larger number of heterogeneous parameters. This makes it very complicated to manually analyze the large amounts of information obtained, and requires a great deal of time, resources and specialized experts to perform said analysis.
The methods conventionally used to solve this problem involve carrying out a number of manual steps for each population to be identified, such as data selection, cleaning, classification and reclassification, so the number of steps increases drastically with the number of parameters analyzed. In an analysis of n parameters, it would be necessary to analyze (n*(n−1)/2) individual two-dimensional graphs, one for each combination of two parameters; for example, 10 parameters would require 45 graphs. Most of the time the user does not manually analyze all the populations present in the sample, but rather only takes into account the population considered to be of interest, ignoring a large amount of information that could be relevant, particularly in disease diagnosis, prognosis and monitoring. Furthermore, the user performing manual analysis must be a person skilled in the art of analysis to obtain reliable results that can be reproduced by another user. Even so, analyses are not always completely objective, with the risks and inaccuracies this entails.
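The growth in the number of two-dimensional graphs can be illustrated with a short sketch. The function names and the parameter names below are hypothetical and chosen only for illustration; the sketch simply enumerates every combination of two parameters and checks the count against n*(n−1)/2.

```python
from itertools import combinations


def pairwise_plot_count(n_params: int) -> int:
    """Number of distinct two-parameter graphs for n parameters: n*(n-1)/2."""
    return n_params * (n_params - 1) // 2


def pairwise_plots(param_names):
    """List every unordered pair of parameters, one pair per 2-D graph."""
    return list(combinations(param_names, 2))


# Hypothetical parameter names, used only as an example.
params = ["FSC", "SSC", "CD3", "CD4", "CD8"]
pairs = pairwise_plots(params)

# The enumeration and the closed formula agree.
assert len(pairs) == pairwise_plot_count(len(params))

# The count grows quadratically with the number of parameters.
for n in (5, 10, 20, 40):
    print(n, pairwise_plot_count(n))
```

Doubling the number of parameters roughly quadruples the number of graphs to be reviewed, which is why manual review scales poorly.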
There are methods of automatic population clustering in the literature, many of which are based on finite mixture distribution models, such as the one described in patent document U.S. Pat. No. 9,164,022, or on agglomerative hierarchical methods, such as the one referred to in patent document US20130060775. However, these methods require the user to possess prior knowledge, because the user must define beforehand either the number of groups to be detected or a threshold that governs the iterations until the number of groups identified is equal to the number of target groups defined by the user.
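The dependence on prior knowledge can be seen in a minimal sketch of a generic agglomerative hierarchical clustering. This is not the method of either cited patent document; it is a plain single-linkage procedure, and the function name `agglomerative`, the argument `n_groups` and the sample events are assumptions made for illustration. The merging loop has no stopping criterion of its own and ends only when the number of groups supplied by the user is reached.

```python
import math


def agglomerative(points, n_groups):
    """Single-linkage agglomerative clustering that stops at n_groups clusters.

    n_groups is the prior knowledge the user has to supply.
    """
    if not 1 <= n_groups <= len(points):
        raise ValueError("n_groups must be between 1 and the number of points")
    clusters = [[p] for p in points]  # start with one cluster per event
    while len(clusters) > n_groups:
        best = None  # (distance, i, j) for the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge the closest pair (j > i)
    return clusters


# Hypothetical events described by two parameters each.
events = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.8, 0.1), (10.0, 0.0)]

three = agglomerative(events, 3)  # matches the three visible groups
two = agglomerative(events, 2)    # forces two of those groups to merge
print(sorted(len(c) for c in three), sorted(len(c) for c in two))
```

When the requested number of groups does not match the populations actually present, the procedure still returns an answer, with no indication that the result is forced.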
The prior art relating to methods of automatic data classification in a multidimensional space is very scarce. Patent document EP1785899A2 is known, describing a method that uses a finite mixture model characterized by expected Gaussian distributions and expert databases for clustering the data by means of applying expectation-maximization algorithms. This method is envisaged for repetitive analyses in which the same type of sample is always analyzed and in which the populations present in the sample must always be known beforehand, but it is rather ineffective in cases in which the populations are unknown, when the populations follow a non-Gaussian distribution or when it is complicated to infer data about the distribution of the populations.
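For orientation, a textbook expectation-maximization fit of a one-dimensional Gaussian mixture is sketched below. It is not the method of EP1785899A2 and uses no expert database; the function names, the synthetic data and the initial means are assumptions made for illustration. The sketch makes the two limitations discussed above visible: the number of populations is fixed by the length of `initial_means`, and every population is assumed to be Gaussian.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gaussian_mixture(data, initial_means, n_iter=50):
    """Fit a 1-D Gaussian mixture by EM; len(initial_means) fixes the group count."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each event.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            # Fall back to uniform responsibilities if all densities underflow.
            resp.append([pi / total for pi in p] if total > 0 else [1.0 / k] * k)
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # keep the variance strictly positive
    return weights, means, variances


# Synthetic events drawn from two Gaussian populations (illustrative only).
rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(6.0, 1.0) for _ in range(200)]

weights, means, variances = em_gaussian_mixture(data, initial_means=(1.0, 5.0))
print([round(m, 2) for m in means], [round(w, 2) for w in weights])
```

The fit recovers the two populations here only because the data were generated to match the model; with an unknown number of populations or a non-Gaussian shape, the same procedure would still converge, but to a description that does not reflect the sample.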
A more efficient and more reliable method of automatically classifying event-associated information is therefore necessary.