The present invention is concerned with improving the interpretation of results from DNA analysis. In particular, the invention improves the manner in which a test result from a test sample is considered against a plurality of stored test results. The number of stored test results used in the consideration can be vast. The consideration is often intended to give an outcome, for instance, the presence of one or more matches and/or a likelihood of that match. Basically, the DNA analysis involves taking a sample of DNA and analysing the variations present at a number of loci. The identities of the variations give rise to a data set which is then interpreted to give a profile or genotype. This may form the test result. Once the process has been completed for a test result, the test result is often then one of the stored results in the context of a subsequent consideration. The extent of interpretation required can be extensive and/or introduce uncertainties. This is particularly so where the DNA sample contains DNA from more than one person, a mixture.
There is often a need to consider various hypotheses for the identities of the persons responsible for the DNA and evaluate the likelihood of those hypotheses; evidential uses.
There is often a need to consider the analysis profile or genotype, test result, against a database of profiles or genotypes, stored results, so as to establish a list of stored profiles or genotypes that are likely matches with the analysis profile or genotype; intelligence uses.
In support of this analysis, the applicant has developed and disclosed a mathematical specification of a model for computing likelihood ratios (LRs) that uses peak heights taken from such DNA analysis. The approach draws on an estimation of a two-dimensional, 2D, probability density function, pdf, which is estimated from the heights or areas of peaks observed after the analysis of control samples. Such pdfs may be generated from heterozygous donors and separately from homozygous donors. The approach goes on to calculate the probability of dropout and achieve other benefits. Full details of these developments are to be found in International Patent Publication number WO2009/066067 and/or US Patent Application Publication number US2009/0132173, the contents of both of which are fully incorporated herein by reference, particularly with respect to the analysis of the samples, their mathematical expression and their comparison with others, including the determination of the likelihood ratio for a match between them.
Subsequently, the applicant has developed that technology further. The statistical model now provides for computing likelihood ratios for single profiles and mixed profiles while considering peak heights or areas, but also takes into consideration allelic dropout and stutters. In this way, the technique makes far greater use of a far greater proportion of the information in the results and hence gives a more informative and useful overall result.
To achieve this, the present invention includes the use of a number of components. The main components of the approach are:
1. An estimated PDF for homozygote peaks conditional on DNA quantity;
2. An estimated PDF for stutter heights conditional on the height of the parent allele;
3. An estimated joint probability density function (PDF) of peak height pairs conditional on DNA quantity;
4. A latent variable X representing DNA quantity that models the variability of peak heights across the profile;
5. The calculation of the LR is done separately for the numerator and the denominator. The overall joint PDF for the numerator and the denominator can be represented with Bayesian networks (BNs).
Full details of these further developments to the technology are included in International Patent Publication number WO2010/116158, the contents of which are fully incorporated herein by reference, particularly with respect to the analysis of the samples, generation of the test results and/or stored results, their mathematical expression and their comparison with others, including the determination of the likelihood ratio for a match between them.
The use of such technology, and potentially other approaches, for the consideration of the DNA sample gives a test result, and hence stored results, which include a data set. This data set includes a far larger volume of data than was produced under previous approaches. This is beneficial in terms of the information which may be obtained and the ability to consider a wider range of possible matches. The volume of data in the data set may be larger because instead of reaching a single or relatively limited number of possibilities (expressed as possible alleles/identities at one or more loci, through to expression as profiles or genotypes through interpretation of the results), the results include a far larger number of possibilities (expressed as possible alleles/identities at one or more loci, through to expression as profiles or genotypes). In general, a test result provides a data set which is formed of a series of sub-sets. Each sub-set is formed of data elements, with a data element for the genotype and/or profile of each person deemed to have contributed to the sample and an expression of the probability of that combination of genotypes and/or profiles. Thus a sample which was a mixture of two people's DNA could have a sub-set formed of a first genotype, a second genotype and an expression of the probability of that combination of two genotypes. This format for the data set will also be present, therefore, when the test result becomes one of the many stored results. In general, the data set is in the form of a vector made up of the sub-sets, potentially a large number of them.
However, the number of combinations in a data set represented by the sub-sets and/or the format of the data sets also creates problems with respect to the computational resources and/or time needed for the subsequent data processing stages. A much larger number of possibilities needs to be considered against others to see if there is a match.
Overview of the Invention—Hardware Configuration
In prior art approaches to considering the test result against stored results to consider whether there is a match and/or give a likelihood of a match, the entirety of the data set forming the test result is considered against the entirety of the data set forming the stored results with respect to all of the stored results. This means the test result is compared with a vast number of stored results in large databases, such as The National DNA Database® operated in the UK.
The type of developments identified above greatly increase the amount of data which forms the data set for the test result and the stored results, hence greatly increasing the computational needs for making such a comparison. The present invention seeks to avoid this problem by materially reducing the computational need. This is achieved through a different hardware structure and through a different organisation of the comparison of the data set for the test sample with the data set for the stored samples.
FIG. 1 shows a schematic of a hardware configuration suitable for use in the present invention. A master node 1 is provided which is connected to a switching unit 3 to allow communications between the master node 1 and one or more of a set of worker nodes 5. In this case, sixteen separate worker nodes, 5a, 5b, . . . 5p, are provided. Each of the worker nodes 5 is connected to each of the others. Each of the worker nodes 5 is connected to a data storage device 7. The data storage device 7 is also connected to the master node 1. In this specific example, the master node 1 is in the form of a server 9 and accompanying user display 11 and interface 13. The switching unit 3 is an Ethernet switch, such as a 10/100 Mbps Ethernet switch. The worker nodes 5 are each provided by a unit of the same type and specification (speed, RAM, ROM etc.) and can be personal computer type units.
The system, potentially via the master node 1, is provided with an optional connection to the Internet 15. This can be used to provide communications between the system and other locations. The other locations may be those at which further results are generated by the collection, analysis and reporting of results. Connection to other communications networks, internal to the operating organisation and/or external thereto, can be provided.
A computer cluster of this type is capable of achieving high rates of computation by linking the master node 1 and worker nodes 5 so that they work closely together. Such a cluster is capable of performing parallel computing, where multiple calculations are performed concurrently. Such clusters may use the Linux operating system, open source software and a TCP/IP LAN as the network.
In operation, the master node 1 is responsible for allocating the work to the worker nodes 5.
The use of a cluster of this type offers improved computing performance which is beneficial in the context of the computations the present invention is concerned with.
Further Benefits and Details of the Hardware Approach
As mentioned above, the use of a cluster of this type offers improved computing performance which is beneficial in the context of the computations the present invention is concerned with. This comes in a number of ways.
Firstly, a parallel processing cluster is capable of high computational rates.
Secondly, such a configuration of the hardware is highly scalable. In the example described above, sixteen worker nodes are used so that each of the loci considered by the multiplex of primers, which is used to amplify parts of the DNA sample, is handled on a different worker node 5 from the other loci. If the system needed to switch to a larger multiplex, for instance a thirty-two-plex, to give greater discrimination power, it is a simple task to increase the number of worker nodes 5 in the system. A worker node 5 for each locus can still be provided and a similar level of performance can be obtained. In other instances, the computational load may prove too great with respect to one or more of the sixteen loci being considered. In such a case, it is possible to split one or more of the loci so that a single locus is handled on two different worker nodes 5. Hence, scaling up of the number of worker nodes provided can be used to maintain computational performance.
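By way of illustration only, the allocation of loci to worker nodes, including the splitting of a heavily loaded locus across two worker nodes, might be sketched as follows. The function name `assign_loci`, the `shards` parameter, the worker labels and the locus TH01 are hypothetical and are not taken from the described system.

```python
def assign_loci(loci, shards=None):
    """Give each locus (or each part of a split locus) its own worker label.

    `shards` maps a locus name to the number of worker nodes it is split
    across; loci not listed are handled on a single worker node.
    """
    shards = shards or {}
    assignment = {}
    worker = 0
    for locus in loci:
        parts = shards.get(locus, 1)
        for part in range(parts):
            # Each entry records the locus, which part of it, and how many parts exist.
            assignment[f"worker-{worker}"] = (locus, part, parts)
            worker += 1
    return assignment


# Three illustrative loci, with D21S11 split across two worker nodes.
plan = assign_loci(["vWA", "D21S11", "TH01"], shards={"D21S11": 2})
```

Under this sketch, moving to a larger multiplex only lengthens the `loci` list, and a locus whose load proves too great is given a larger shard count.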
Overview of the Invention—Processes Applied
Generally, the process involves a number of different stages/sub-stages:
a) A stored result selection and plurality of stored result database creation stage;
b) A test result against stored result comparison stage, including:
    1) A test result selection and plurality of test result database creation sub-stage;
    2) A single test result database against single stored result database search sub-stage, performed for the various pairs of test result databases and stored result databases, to establish matches;
    3) An established match review sub-stage, to filter out established matches which do not feature as matches across the other test result against stored result databases;
    4) A process outcome sub-stage which provides details of the matches which extend across all the database pairs.
In the present invention, further benefits are obtained through the manner in which these stages and/or sub-stages are assigned to and/or performed by the master node 1 and worker nodes 5 used in the process to provide the consideration/comparison between the test sample and the stored samples.
An explanation and further details for each of these stages and sub-stages are provided in the sections set out below.
A variety of possibilities exist for deploying such an approach in terms of the code used. However, the following pseudo code provides a useful indication of the general requirements involved:
[Pseudo Code]
Master Node receives collection of samples
While (there are still samples to be searched)
    Send locus information to appropriate worker-nodes
    Worker-node creates search-term
    Search-term is run against database
    Unique identifiers list returned to master-node
    Master-node coordinates synchronisation of lists from each worker-node
    Synchronised list sent to each worker-node
    Worker-node returns results for each unique identifier in synchronised list
    Combine results from each worker-node into a collection
    Return collection
Endwhile

Further Benefits and Details of the Stored Result Selection and Plurality of Stored Result Database Creation Stage
As mentioned above, the invention differs in the manner in which the test result and the stored results are considered. In particular, there is a comparison of less than the entirety of the data set for a test result with less than the entirety of the data set for each stored result. In particular, certain elements from within the sub-sets of the data are considered separately from others.
As a first stage, the system must be prepared with respect to the stored results against which a comparison with a test result is to occur.
As an initial step, a selection is made of those stored results to be considered. This may be a selection from a larger number of stored results which are available or may be all of the stored results.
Those selected stored results are provided to the master node 1. This may be from a data storage device 7 within the system or from outside, for instance using the connection to the Internet. The stored results include a data set in each case. As mentioned above, the data set includes a series of sub-sets. Each sub-set is formed of data elements, with a data element for the genotype and/or profile of each person deemed to have contributed to the sample and an expression of the probability of that combination of genotypes and/or profiles. The data element for the genotype and/or profile will reflect, in terms of the data sub-elements present, the allele identities for each of the different loci in respect of which results were collected in the physical analysis stage.
Having received the stored results, the master node 1 processes those to divide out the data to be provided to each of the worker nodes 5 for the subsequent processing. The intention is to provide each worker node only with the information it needs. In this preferred embodiment, that involves sending a worker node only the data sub-elements which relate to the locus it is concerned with processing. Thus the data for locus vWA may be provided to worker node 5a, the data for locus D21S11 to worker node 5b and so on. The worker node data set includes for a stored result, sub-elements relating to the identities observed in the analysis for that locus for each of the genotypes and/or profiles in a combination represented by a sub-set, together with the probability information for that combination. This is repeated for each of the combinations in each of the stored results.
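The division of the stored results into locus specific worker node data sets can be sketched as follows. This is a minimal illustration under assumed data structures: each stored result is taken to be a dictionary with a hypothetical `code` (its unique identifier) and a list of `combinations`, each holding one genotype per contributor and a `probability`. None of these names comes from the described system.

```python
from collections import defaultdict


def split_by_locus(stored_results):
    """Group the data sub-elements of every combination by locus.

    Returns a dict mapping locus name to a list of
    (code, [alleles for each contributor], probability) records,
    which is all a locus specific worker node would need.
    """
    per_locus = defaultdict(list)
    for result in stored_results:
        for combo in result["combinations"]:
            # Assumes every contributor genotype covers the same loci.
            for locus in combo["genotypes"][0]:
                alleles = [genotype[locus] for genotype in combo["genotypes"]]
                per_locus[locus].append(
                    (result["code"], alleles, combo["probability"])
                )
    return dict(per_locus)


# One illustrative two-person mixture with a single combination.
stored = [
    {
        "code": "10001",
        "combinations": [
            {
                "genotypes": [
                    {"vWA": (9, 9), "D21S11": (28, 30)},
                    {"vWA": (9, 10), "D21S11": (29, 30)},
                ],
                "probability": 0.7,
            }
        ],
    }
]
worker_data = split_by_locus(stored)
```

In this sketch `worker_data["vWA"]` would go to worker node 5a and `worker_data["D21S11"]` to worker node 5b, so that no worker node receives data for a locus it does not process.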
Having sent the data from the master node to the worker nodes, the focus of the processing moves to the worker nodes. Each worker node acts in an equivalent manner on the locus specific data it has received.
The worker node is required to establish a database which represents all of the identity combinations observed in at least one of the genotypes and/or profiles in at least one of the combinations in at least one of the stored results. This can be thought of as the creation of the locus-estate for the stored results.
In doing so, the worker node applies the same process to each of the sub-sets. First, the worker node stores the probability for that combination for later use. The worker node then looks to see whether the identity information for one of the genotypes and/or profiles in that combination corresponds to an entry in the database being created. If not, then an entry in the database is generated for that identity information. The next genotype and/or profile in that combination is then considered. If there is no corresponding entry, then one is created. If there is a corresponding entry already, then no new entry is needed. Once all of the genotypes and/or profiles in a combination are considered in this way, the worker node advances to the next combination and works through the genotypes and/or profiles therein. Once all of the stored results have been processed in this way, the stored result database is completed. There is an entry or slot, but only one, for each identity information form observed in all of the combinations in all of the stored results.
For each entry or slot, the database has further associated information. This is best understood in the context of the example of FIG. 3 and the text below.
In the example, five of the slots established for that locus are shown (left column). These are designated by the allele designations attributed to the identities observed for that slot. Thus, the top slot is homozygous with respect to alleles 9, 9; the next slot is heterozygous with respect to alleles 9, 10; and so on. Each slot has linked to it, a collection of profiles and/or genotypes (eight in the example) which had the identities of that slot. For each of these profiles, a unique coding is present (middle column). In this case, a five digit number is used, but there are many possibilities. This unique code forms a link between the slot and the origins of the profile. Also present (right column) is information for each of the profiles and/or genotypes, as to which of the contributors within that result gave rise to the profile and/or genotype, together with the probability information (expressed here as a number between 0 and 1).
This process can be thought of in terms of the following Pseudo code for its implementation by the master node:
[Pseudo Code]
master-node receives sample
For each (locus in sample)
    Send locus information to appropriate worker-node
EndFor
While (there are any worker-nodes still to finish)
    For each (worker-node that has not finished)
        Check if worker-node has finished creating locus-estate
        If (worker-node has finished)
            Mark worker-node as finished
        EndIf
    EndFor
Endwhile

and the process can be thought of in terms of the following Pseudo code for its implementation by the worker-nodes:
[Pseudo Code]
Worker-node receives locus information from master-node
E2-vector information is extracted from the locus information
For each (combination in the E2 vector)
    If (combination is included for searching)
        For each (potential contributor)
            Extract genotype from combination
            If (genotype not already present in the locus-estate)
                Create a new genotype-slot to store that genotype and place in the locus-estate
            EndIf
            Get genotype-slot and store E2 vector information in it
        EndFor
    EndIf
EndFor

where the E2 vector information is the probability information discussed elsewhere.
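The creation of the locus-estate by a worker node can also be sketched in Python. The sketch assumes, purely for illustration, that the locus information arrives as records of the form (unique code, list of allele pairs with one pair per contributor, probability). The function name `build_locus_estate` and this record layout are hypothetical.

```python
def build_locus_estate(records):
    """Build one slot per distinct genotype seen at this locus.

    Each slot holds (unique code, contributor number, probability)
    entries, mirroring the middle and right columns described for FIG. 3.
    """
    estate = {}
    for code, genotypes, probability in records:
        for contributor, genotype in enumerate(genotypes, start=1):
            # Sort the alleles so that (10, 9) and (9, 10) share a slot.
            slot = tuple(sorted(genotype))
            # setdefault creates the slot only if it is not already present.
            estate.setdefault(slot, []).append((code, contributor, probability))
    return estate


records = [
    ("10001", [(9, 9), (10, 9)], 0.7),  # two-person combination
    ("10002", [(9, 10)], 1.0),          # single-source result
]
estate = build_locus_estate(records)
```

A dictionary keyed on the genotype keeps the check for an existing slot cheap, which is in keeping with the low computational demand noted for this stage.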
The above processing can be performed by each worker node 5 in parallel and can start as soon as data is transferred to the worker node for the first of the stored results. This speeds up the implementation. Furthermore, the compilation of the database is made through a relatively easy and low computational demand process by virtue of the checking of the identity information against, in effect, a list of those already seen in previous stored results which have been processed.
Having completed this stage, the process can advance to the test result against stored result comparison stage, and in particular the test result selection and plurality of test result database creation sub-stage.
Further Benefits and Details of the Test Result Selection and Plurality of Test Result Database Creation Sub-Stage
As an initial step, a selection is made of the test result to be considered. This may be a selection from a larger number of test results and could be more than one test result for processing in parallel.
The selected test result is provided to the master node 1. This may be from a data storage device 7 within the system or from outside, for instance using the connection to the Internet. Just as with the stored results, the test result includes a data set and the data set has the same format.
Having received the test result, the master node 1 processes it to divide out the data to be provided to each of the worker nodes 5 for the subsequent processing; just as with the stored results. The worker node data set includes for a test result, sub-elements relating to the identities observed in the analysis for that locus for each of the genotypes and/or profiles in a combination represented by a sub-set, together with the probability information for that combination.
Each worker node acts in an equivalent manner on the locus specific data it has received.
The worker node is required to establish a test result database which represents all of the identity combinations observed in at least one of the genotypes and/or profiles in at least one of the combinations in the test result. This can be thought of as the creation of the locus-estate for the test result.
In doing so, the worker node applies the same process to each of the sub-sets. First, the worker node stores the probability for that combination for later use. The worker node then looks to see whether the identity information for one of the genotypes and/or profiles in that combination corresponds to an entry in the database being created. If not, then an entry in the database is generated for that identity information. The next genotype and/or profile in that combination is then considered. If there is no corresponding entry, then one is created. If there is a corresponding entry already, then no new entry is needed. Once all of the genotypes and/or profiles in the combination which represents the test result are considered in this way, the sub-stage is complete. There is an entry or slot, but only one, for each identity information form observed in all of the combinations in the test result.
The same information as to the unique code, contributor and probability as was described above for the stored results, is obtained for the test results.
The next sub-stage can then be performed.
Further Benefits and Details of the Single Test Result Database Against Single Stored Result Database Search Sub-Stage
With all the stored samples loaded and the stored result database created for each locus and with the test result loaded and the test result database created for each locus, it is possible to start the comparison.
The comparison is only carried out on worker nodes and is performed in an equivalent manner on each, in parallel.
As described above, the test result database for a locus has an entry or slot for each form of identity information observed in it. The comparison takes a slot from the test result database, and looks to see whether there is a match for this test result slot in the slots of the stored result database.
When a match is observed, then a note is made in a match list. The note means that slot is included in those for which a match is established at that locus. The note provides a link to not just the slot, but also to the unique codes behind that slot (as described above in the example) and the information behind that, as to contributor and probability.
When a match is not observed, then no note is added to the match list.
This process is repeated until all of the test result slots have been considered against the slots in the stored sample database for that locus. The process is taken to completion on each of the locus specific worker nodes 5.
This process can be thought of in terms of the following Pseudo code for its implementation by the worker nodes:
[Pseudo Code]
Receive sample from master-node
Create search term from sample
For each (genotype slot in search-term)
    For each (genotype-slot in database)
        If (genotypes match)
            Store match
            Add all codes to collection
        EndIf
    EndFor
EndFor
Maintain stored matches in memory for next stage
Return collection.length to master-node
As a result of these operations, the worker nodes each generate a match list of their own, a locus specific match list. The worker nodes keep a record of their own locus specific match list and send a copy of it to the master node. In the next sub-stage, the master node works upon the set of locus specific match lists it has received.
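The locus specific search can be sketched as follows, assuming both databases are held as dictionaries keyed on the genotype slot, with each stored slot holding (unique code, contributor, probability) entries. This layout and the name `search_locus` are assumptions made for illustration. The sketch also replaces the nested loop of the pseudo code with a direct dictionary lookup, which gives the same matches.

```python
def search_locus(test_estate, stored_estate):
    """Return the matched slots and the set of unique codes behind them."""
    matches = {}
    codes = set()
    for slot in test_estate:
        entries = stored_estate.get(slot)
        if entries is not None:
            # Keep the link to the codes, contributors and probabilities.
            matches[slot] = entries
            codes.update(code for code, _contributor, _probability in entries)
    return matches, codes


test_estate = {(9, 10): [("T0001", 1, 1.0)], (11, 12): [("T0001", 2, 1.0)]}
stored_estate = {
    (9, 9): [("10001", 1, 0.7)],
    (9, 10): [("10001", 2, 0.7), ("10002", 1, 1.0)],
}
matches, match_list = search_locus(test_estate, stored_estate)
match_list_length = len(match_list)  # the value reported to the master node
```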
Further Benefits and Details of the Established Match Review Sub-Stage
Having obtained the set of locus specific match lists, the method proceeds to establish which of those matches are true across the different loci.
The comparison of the locus specific match lists can be parallelised to an extent, as it is possible to start the comparison once two locus specific match lists have been received; without having to wait for all the locus specific match lists to be received.
The master node coordinates which of the locus specific match lists are to be considered by which worker nodes. The master node is aware of the length of the locus specific match list each worker node has. Hence, it can instruct the worker node with the shortest list to send a copy to the worker node with the longest list for the process to start.
The worker node which has sent the match list, the transmitting worker node, then becomes inactive.
Once the worker node, the receiving worker node, has both its own generated match list and the locus specific match list sent to it, that worker node can work through its processing.
The worker node compares the two match lists.
If the unique code is present in both, then there is a match across both loci. That unique code is then added to a combination list; further match list.
If the unique code is only present in one of the match lists, then it is not a match across both loci and it can be discounted from further processing.
The outcome is a combination list (first further match list) of all the matches across those two loci. A note of the length of the combination list can then be sent back to the master node.
Other worker nodes can be working through other pairs of match lists to generate other combination lists (second further match lists and so on). They too provide length information on their combination lists to the master node.
Once the length information on two lists is received, be they combination lists (further match lists) or match lists (which have not yet been processed), then the master node can tell the worker node with the shortest list to send a copy of that list to the worker node with the longest list.
The process is continued until all of the match lists and combination lists (further match lists) have been combined to generate a single combination list; a final match list.
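The repeated pairing of lists can be illustrated with a single-process sketch. It simulates the coordination sequentially rather than across worker nodes, and it assumes that each match list is simply a set of unique codes. The name `combine_match_lists` is hypothetical.

```python
def combine_match_lists(match_lists):
    """Reduce locus specific match lists to one final match list.

    Lists are paired as they become available; within a pair the shorter
    list is 'sent' to the holder of the longer list, and only codes
    present in both survive into the combination list.
    """
    pending = [set(match_list) for match_list in match_lists]
    if not pending:
        return set()
    while len(pending) > 1:
        first = pending.pop(0)
        second = pending.pop(0)
        if len(first) <= len(second):
            shorter, longer = first, second
        else:
            shorter, longer = second, first
        # A code missing from either list is discounted from further processing.
        pending.append({code for code in shorter if code in longer})
    return pending[0]


final_match_list = combine_match_lists(
    [
        {"10001", "10002", "10003"},  # matches at a first locus
        {"10002", "10003"},           # matches at a second locus
        {"10002", "10003", "10004"},  # matches at a third locus
    ]
)
```

Iterating over the shorter list keeps each comparison bounded by the size of the smaller input, which is one plausible reading of why the shortest list is the one transmitted.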
This process can be thought of in terms of the following Pseudo code for its implementation by the master node:
[Pseudo Code]
While (not all lists have been combined)
    Master-node receives a list length from a worker node
    If (master-node already holds another list-length)
        Compare list lengths
        Send message to worker-node with the shortest list to send its list to the worker node with the longer list-length for comparison
    Else
        Wait for a second list length to be returned
    EndIf
Endwhile

and by the following Pseudo code for its implementation by the worker nodes:
[Pseudo Code]
Worker-node searches locus-estate and creates match-list
Worker-node sends match-list.length to master-node and waits for response
While (worker-node is active)
    If (response is from master-node)
        Send match-list to the worker-node specified in the master-node response
        Worker-node becomes inactive
    Else
        Compare local match-list with the match-list received in the response creating a combined list
        Worker-node sends combined-list.length to master and waits for response
    EndIf
Endwhile

Further Benefits and Details of the Process Outcome Sub-Stage
The outcome list represents those unique codes which link to stored samples, in terms of their genotypes and/or profiles, which are a match with the test result across all loci present.
For each of those unique codes, it is then possible to use the associated probability information to assign a probability for that genotype and/or profile being the one which matches the test sample. The matches can then be ranked according to the probability to give a ranked list of matches. Some matches may be more likely than others, on the basis that a genotype is a match, but the occurrences/circumstances which give rise to that genotype are more or less unlikely.
Where the test result itself is a mixture, then the matches will reflect both the genotype and/or profile of the test result and that of the stored results, with the probability being a combination of both.
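The ranking of the final matches can be sketched as below. The sketch assumes that a single probability has already been assigned to each unique code; how that value is derived from the per-locus probability information, or combined for a mixed test result, is not shown. The names `rank_matches` and `probability_by_code` are hypothetical.

```python
def rank_matches(final_codes, probability_by_code):
    """Order the matching unique codes from most to least probable."""
    scored = [(code, probability_by_code[code]) for code in final_codes]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranked = rank_matches(
    {"10002", "10003"},
    {"10001": 0.5, "10002": 0.2, "10003": 0.9},
)
```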