Methods for determining the probability of observing an overlap of a given size between two sets of data picked independently and randomly from the same population are known in the art. A method of choice for determining the statistical significance of such an overlap is to employ the hypergeometric distribution. Methods for determining the probability of overlap of two sets of data picked independently and randomly from two different but overlapping populations are also known. However, these methods either oversimplify the problem in order to employ the hypergeometric distribution, in which case accuracy is compromised (except in the limiting case where the two populations overlap completely), or employ a permutation method to determine the probability, in which case the solution is also approximate and, in addition, very time-consuming to compute.
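By way of illustration only, the same-population case can be sketched as follows. The function name and parameter names are hypothetical and do not appear in the art cited herein; the sketch assumes that both sets are drawn from a single population of known size and computes the upper-tail probability of the hypergeometric distribution, that is, the chance of an overlap at least as large as the one observed.

```python
from math import comb


def overlap_pvalue(population_size, size_a, size_b, overlap):
    """Return P(X >= overlap) for X ~ Hypergeometric(N, size_a, size_b).

    population_size: number of items in the shared population (N)
    size_a, size_b:  sizes of the two independently drawn sets
    overlap:         observed number of items common to both sets
    """
    largest_possible = min(size_a, size_b)
    # Number of ways to draw the second set from the population.
    total = comb(population_size, size_b)
    # Sum the probability mass for every overlap size k >= observed overlap.
    tail = sum(
        comb(size_a, k) * comb(population_size - size_a, size_b - k)
        for k in range(overlap, largest_possible + 1)
    )
    return tail / total
```

A small p-value from such a calculation would suggest that the observed overlap is unlikely to arise by chance, provided the single-population assumption actually holds.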
For the specific case of microarrays, the use of the hypergeometric distribution for determining the probability of overlap of two gene signatures derived from two experiments using the same chip type is known. Further, when the two experiments being compared are from different chip types, the current practice is to reduce the problem by considering only those genes that are common to both chips, so that the hypergeometric distribution can be utilized. See, for example, GeneSpring™ (Agilent Technologies, Inc.), and Resolver™ (Rosetta Inpharmatics, LLC). Alternatively, a random permutation technique is available in Oncomine™ (Compendia Bioscience Inc.).
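The common-gene reduction described above can be sketched, purely as an illustration, as follows. This is not the implementation used by any of the named products; the function name, the parameter names, and the representation of chips and signatures as Python sets of gene identifiers are all assumptions. The sketch discards every gene that is not present on both chips and then applies the hypergeometric upper tail to what remains, which is where the loss of accuracy arises when the chips overlap only partially.

```python
from math import comb


def reduced_overlap_pvalue(chip_a, chip_b, signature_a, signature_b):
    """Restrict both signatures to genes common to both chips, then
    return (observed_overlap, hypergeometric upper-tail probability).

    All four arguments are sets of gene identifiers (an assumption).
    """
    common = chip_a & chip_b           # genes present on both chips
    sig_a = signature_a & common       # signature genes outside the
    sig_b = signature_b & common       # common set are discarded here
    observed = len(sig_a & sig_b)

    n, a, b = len(common), len(sig_a), len(sig_b)
    total = comb(n, b)
    # Probability of an overlap at least as large as the observed one.
    tail = sum(
        comb(a, k) * comb(n - a, b - k)
        for k in range(observed, min(a, b) + 1)
    )
    return observed, tail / total
```

Note that signature genes found on only one chip do not influence the result at all, which is the simplification referred to above.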
Thus, there is a need for a simple and accurate method that can determine the probability of overlap when the underlying populations are different but overlapping. In particular, there is a need for an easy-to-use and accurate method of determining the probability of overlap between two sets of genes selected from two different but overlapping microarray chips.