The present invention relates generally to data processing systems and, more particularly, to the generation of Bayesian networks.
The advent of artificial intelligence within computer science has brought an abundance of decision-support systems. Decision-support systems are computer systems in which decisions, typically rendered by humans, are recommended and sometimes made. In creating decision-support systems, computer scientists seek to provide decisions with the greatest possible accuracy. Thus, computer scientists strive to create decision-support systems that are equivalent to or more accurate than a human expert. Applications of decision-support systems include medical diagnosis, troubleshooting computer networks, or other systems wherein a decision is based upon identifiable criteria.
One of the most promising new areas for research in decision-support systems is Bayesian networks. A Bayesian network is a representation of the probabilistic relationships among distinctions about the world. Each distinction, sometimes called a variable, can take on one of a mutually exclusive and exhaustive set of possible states. A Bayesian network is expressed as an acyclic-directed graph where the variables correspond to nodes and the relationships between the nodes correspond to arcs. FIG. 1 depicts an exemplary Bayesian network 101. In FIG. 1 there are three variables, X1, X2, and X3, which are represented by nodes 102, 106 and 110, respectively. This Bayesian network contains two arcs 104 and 108. Associated with each variable in a Bayesian network is a set of probability distributions. Using conditional probability notation, the set of probability distributions for a variable can be denoted by p(xi|Πi, ξ), where "p" refers to the probability distribution, where "Πi" denotes the parents of variable Xi and where "ξ" denotes the knowledge of the expert. The Greek letter "ξ" indicates that the Bayesian network reflects the knowledge of an expert in a given field. Thus, this expression reads as follows: the probability distribution for variable Xi given the parents of Xi and the knowledge of the expert. For example, X1 is the parent of X2. The probability distributions specify the strength of the relationships between variables. For instance, if X1 has two states (true and false), then associated with X1 is a single probability distribution p(x1|ξ) and associated with X2 are two probability distributions p(x2|x1=t, ξ) and p(x2|x1=f, ξ). In the remainder of this specification, ξ is not specifically mentioned.
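The structure just described can be sketched in code as a minimal illustration. In the sketch below, each node of the network of FIG. 1 records its parents and one local probability distribution per configuration of those parents; all probability values are hypothetical, chosen only to make the example concrete.

```python
# Minimal sketch of the three-node Bayesian network of FIG. 1:
# X1 -> X2 -> X3. Each node stores one probability distribution per
# configuration of its parents. All probability values are hypothetical.

parents = {"X1": [], "X2": ["X1"], "X3": ["X2"]}

# p(node = True | parent configuration); p(False | ...) is the complement.
cpt = {
    "X1": {(): 0.20},
    "X2": {(True,): 0.70, (False,): 0.10},
    "X3": {(True,): 0.90, (False,): 0.40},
}

def joint(states):
    """p(x1, x2, x3) as the product of the local distributions."""
    p = 1.0
    for node in ("X1", "X2", "X3"):
        config = tuple(states[par] for par in parents[node])
        p_true = cpt[node][config]
        p *= p_true if states[node] else 1.0 - p_true
    return p
```

For example, the joint probability of all three variables being true is simply p(x1) · p(x2|x1) · p(x3|x2), reflecting how the graph factors the distribution.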
The arcs in a Bayesian network convey dependence between nodes. When an arc points from a second node to a first node, the probability distribution of the first node depends upon the value of the second node. For example, node 106 depends upon node 102. Therefore, nodes 102 and 106 are said to be conditionally dependent. Missing arcs in a Bayesian network convey conditional independencies. For example, node 102 and node 110 are conditionally independent given node 106. However, two variables indirectly connected through intermediate variables are conditionally dependent given lack of knowledge of the values ("states") of the intermediate variables. Therefore, if the value for node 106 is not known, node 102 and node 110 are conditionally dependent.
In other words, sets of variables X and Y are said to be conditionally independent, given a set of variables Z, if the probability distribution for X given Z does not depend on Y. If Z is empty, however, X and Y are said to be "independent" as opposed to conditionally independent. If X and Y are not conditionally independent, given Z, then X and Y are said to be conditionally dependent given Z.
The variables used for each node may be of different types. Specifically, variables may be of two types: discrete or continuous. A discrete variable is a variable that has a finite or countable number of states, whereas a continuous variable is a variable that has an uncountably infinite number of states. All discrete variables considered in this specification have a finite number of states. An example of a discrete variable is a Boolean variable. Such a variable can assume only one of two states: "true" or "false." An example of a continuous variable is a variable that may assume any real value between −1 and 1. Discrete variables have an associated probability distribution. Continuous variables, however, have an associated probability density function ("density"). Where an event is a set of possible outcomes, the density p(x) for a variable "x" and events "a" and "b" is defined as:

p(x) = lim(a→b) [ p(a ≤ x ≤ b) / |a − b| ]
where p(a ≤ x ≤ b) is the probability that x lies between a and b. Conventional systems for generating Bayesian networks cannot use continuous variables in their nodes.
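The limit above can be checked numerically in a simple case. The sketch below assumes a variable distributed uniformly on [−1, 1], whose true density is 0.5 everywhere on that interval, and approximates the density at a point by shrinking the interval [a, b] around it; the function names and interval are illustrative, not part of the specification.

```python
def uniform_cdf(x, lo=-1.0, hi=1.0):
    """Cumulative distribution of a variable uniform on [lo, hi]."""
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)

def density_estimate(x, half_width):
    """Approximate p(x) as p(a <= x <= b) / |a - b| with
    a = x - half_width and b = x + half_width, mirroring the
    limit definition of the density."""
    a, b = x - half_width, x + half_width
    return (uniform_cdf(b) - uniform_cdf(a)) / (b - a)
```

As the half-width shrinks toward zero, the ratio converges to the true density of 0.5 at any interior point.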
FIG. 2 depicts an example Bayesian network for troubleshooting automobile problems. The Bayesian network of FIG. 2 contains many variables 202, 204, 206, 208, 210, 212, 214, 216, 218, 220, 222, 224, 226, 228, 230, 232, and 234, relating to whether an automobile will work properly, and arcs 236, 238, 240, 242, 244, 246, 248, 250, 252, 254, 256, 258, 260, 262, 264, 268. A few examples of the relationships between the variables follow. For the radio 214 to work properly, there must be battery power 212 (arc 246). Battery power 212, in turn, depends upon the battery working properly 208 and a charge 210 (arcs 242 and 244). The battery working properly 208 depends upon the battery age 202 (arc 236). The charge 210 of the battery depends upon the alternator 204 working properly (arc 238) and the fan belt 206 being intact (arc 240). The battery age variable 202, whose values lie from zero to infinity, is an example of a continuous variable that can contain an infinite number of values. However, the battery variable 208 reflecting the correct operations of the battery is a discrete variable being either true or false.
The automobile troubleshooting Bayesian network also provides a number of examples of conditional independence and conditional dependence. The nodes for the operation of the lights 216 and battery power 212 are dependent, and the nodes for the operation of the lights 216 and the operation of the radio 214 are conditionally independent given battery power 212. However, when battery power 212 is unobserved, the operation of the radio 214 and the operation of the lights 216 are dependent. The concepts of conditional dependence and conditional independence can be expressed using conditional probability notation. For example, the operation of the lights 216 is conditionally dependent on battery power 212 and conditionally independent of the radio 214 given battery power 212. Therefore, the probability of the lights working properly 216 given both the battery power 212 and the radio 214 is equivalent to the probability of the lights working properly given the battery power alone: p(Lights|Battery Power, Radio) = p(Lights|Battery Power). An example of a conditional dependence relationship is that the probability of the lights working properly 216 given the battery power 212 is not equivalent to the probability of the lights working properly given no information, that is, p(Lights|Battery Power) ≠ p(Lights).
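The equality p(Lights|Battery Power, Radio) = p(Lights|Battery Power) can be verified numerically on a toy version of this network fragment, in which the radio and the lights each depend only on battery power. All probability values below are hypothetical and serve only to make the conditional independence concrete.

```python
# Toy model of the battery-power fragment of FIG. 2: Radio and Lights
# each depend only on Battery Power, so they are conditionally
# independent given Battery Power. All probability values are
# hypothetical.
p_battery = {True: 0.8, False: 0.2}
p_radio_given_b = {True: 0.9, False: 0.1}    # p(Radio=True | BatteryPower)
p_lights_given_b = {True: 0.95, False: 0.05} # p(Lights=True | BatteryPower)

def joint(b, r, l):
    """p(BatteryPower=b, Radio=r, Lights=l)."""
    pr = p_radio_given_b[b] if r else 1.0 - p_radio_given_b[b]
    pl = p_lights_given_b[b] if l else 1.0 - p_lights_given_b[b]
    return p_battery[b] * pr * pl

def p_lights(b, r=None):
    """p(Lights=True | BatteryPower=b) or, if r is given,
    p(Lights=True | BatteryPower=b, Radio=r)."""
    radios = (True, False) if r is None else (r,)
    num = sum(joint(b, rr, True) for rr in radios)
    den = sum(joint(b, rr, ll) for rr in radios for ll in (True, False))
    return num / den
```

Conditioning additionally on the radio leaves the probability of the lights unchanged, exactly as the conditional independence statement asserts.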
There are two conventional approaches for constructing Bayesian networks. Using the first approach ("the knowledge-based approach"), a person known as a knowledge engineer interviews an expert in a given field to obtain the knowledge of the expert about the field of expertise of the expert. The knowledge engineer and expert first determine the distinctions of the world that are important for decision making in the field of the expert. These distinctions correspond to the variables of the domain of the Bayesian network. The "domain" of a Bayesian network is the set of all variables in the Bayesian network. The knowledge engineer and the expert next determine the dependencies among the variables (the arcs) and the probability distributions that quantify the strengths of the dependencies.
In the second approach ("the data-based approach"), the knowledge engineer and the expert first determine the variables of the domain. Next, data is accumulated for those variables, and an algorithm is applied that creates a Bayesian network from this data. The accumulated data comes from real world instances of the domain, that is, from real world instances of decision making in a given field. Conventionally, this second approach exists for domains containing only discrete variables.
After the Bayesian network has been created, the Bayesian network becomes the engine for a decision-support system. The Bayesian network is converted into a computer-readable form, such as a file, and input into a computer system. Then, the computer system uses the Bayesian network to determine the probabilities of variable states given observations, determine the benefits of performing tests, and ultimately recommend or render a decision. Consider an example where a decision-support system uses the Bayesian network of FIG. 2 to troubleshoot automobile problems. If the engine for an automobile did not start, the decision-support system could request an observation of whether there was gas 224, whether the fuel pump 226 was in working order by possibly performing a test, whether the fuel line 228 was obstructed, whether the distributor 230 was working, and whether the spark plugs 232 were working. While the observations and tests are being performed, the Bayesian network assists in determining which variable should be observed next.
U.S. application Ser. No. 08/240,019, filed May 9, 1994, entitled "Generating Improved Belief Networks," describes an improved system and method for generating Bayesian networks (also known as "belief networks") that utilize both expert data received from an expert ("expert knowledge") and data received from real world instances of decisions made ("empirical data"). By utilizing both expert knowledge and empirical data, the network generator provides an improved Bayesian network that is more accurate than conventional Bayesian networks. In addition, the exemplary embodiment facilitates the use of continuous variables in Bayesian networks and handles missing data in the empirical data that is used to construct Bayesian networks.
Expert knowledge consists of two components: an equivalent sample size or sizes ("sample size"), and the prior probabilities of all possible Bayesian-network structures ("priors on structures"). The equivalent sample size is the effective number of times that the expert has rendered a specific decision. For example, a doctor with 20 years of experience diagnosing a specific illness may have an equivalent sample size in the hundreds. The priors on structures refers to the confidence of the expert that there is a relationship between variables (e.g., the expert is 70 percent sure that two variables are related). The priors on structures can be decomposed for each variable-parent pair, known as the "prior probability" of the variable-parent pair. Empirical data is typically stored in a database. An example of acquiring empirical data can be given relative to the Bayesian network of FIG. 2. If, at a service station, a log is maintained for all automobiles brought in for repair, the log constitutes empirical data. The log entry for each automobile may contain a list of the observed state of some or all of the variables in the Bayesian network. Each log entry constitutes a case. When one or more variables are unobserved in a case, the case containing the unobserved variable is said to have "missing data." Thus, missing data refers to when there are cases in the empirical data database that contain no observed value for one or more of the variables in the domain. An assignment of one state to each variable in a set of variables is called an "instance" of that set of variables. Thus, a "case" is an instance of the domain. The "database" is the collection of all cases.
An example of a case can more clearly be described relative to the Bayesian network of FIG. 2. A case may consist of the battery age 202 being 2.132 years old, the battery working properly 208 being true, the alternator working properly 204 being true, the fan belt 206 being intact, the charge 210 being sufficient, the battery power 212 being sufficient, the starter working properly 220 being true, the engine turning over 218 being true, the amount of gas 224 being equal to 5.3 gallons, the fuel pump working properly 226 being true, the fuel line working properly 228 being true, the distributor working properly 230 being false, the spark plugs working properly 232 being true, and the engine starting 234 being false. In addition, the variables for the gas gauge 222, the radio working properly 214, and the lights working properly 216 may be unobserved. Thus, the above-described case contains missing data.
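A case of this kind is naturally represented as a mapping from each variable in the domain to its observed state, with a sentinel for unobserved variables. The sketch below encodes the example case just described; the variable names are shortened hypothetically for illustration.

```python
# One case from the service-station log of the FIG. 2 example, as a
# mapping from variable name to observed state. Unobserved variables
# are recorded as None, so this case contains missing data. Variable
# names are illustrative shorthand.
case = {
    "BatteryAge": 2.132, "Battery": True, "Alternator": True,
    "FanBelt": True, "Charge": "sufficient",
    "BatteryPower": "sufficient", "Starter": True,
    "EngineTurnOver": True, "Gas": 5.3, "FuelPump": True,
    "FuelLine": True, "Distributor": False, "SparkPlugs": True,
    "EngineStart": False,
    "GasGauge": None, "Radio": None, "Lights": None,
}

def has_missing_data(case):
    """A case has missing data when any variable is unobserved."""
    return any(value is None for value in case.values())
```

A database, in the terminology above, would simply be a collection of such cases.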
Although Bayesian networks are quite useful in decision-support systems, Bayesian networks require a significant amount of storage. For example, in the Bayesian network 300 of FIG. 3A, the value of nodes X and Y causally influences the value of node Z. In this example, nodes X, Y, and Z have binary values of either 0 or 1. As such, node Z maintains a set of four probabilities, one probability for each combination of the values of X and Y, and stores these probabilities in a table 320 as shown in FIG. 3B. When performing probabilistic inference, it is the probabilities in table 320 that are accessed. As can be seen from table 320, only the probabilities for Z equaling 0 are stored; the probabilities for Z equaling 1 need not be stored, as they are easily derived by subtracting the probability of Z equaling 0 from 1. As the number of parents of a node increases, the table in the node that stores the probabilities becomes multiplicatively large and requires a significant amount of storage. For example, a node having binary values with 10 parents that also have binary values requires a table consisting of 1,024 entries. If either the node or one of its parents can take on more values than a binary variable, the number of probabilities in the table increases multiplicatively. To improve the storage of probabilities in a Bayesian network node, some conventional systems use a tree data structure. A tree data structure is an acyclic, undirected graph in which each vertex is connected to each other vertex via a single path. The graph is acyclic in that there is no path that both emanates from a vertex and returns to the same vertex, where each edge in the path is traversed only once. FIG. 3C depicts an example tree data structure 330 that stores in its leaf vertices 336-342 the probabilities shown in table 320 of FIG. 3B.
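The table storage just described can be sketched as follows. Only p(Z = 0 | X, Y) is stored, and the complementary probability is derived by subtraction; the four probability values are hypothetical, since FIG. 3B's actual values are not reproduced here.

```python
# Conditional probability table for node Z of FIGS. 3A and 3B.
# Only p(Z = 0 | X, Y) is stored; p(Z = 1 | X, Y) is derived by
# subtraction from 1. The four probability values are illustrative.
table_z0 = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.7}

def p_z(z, x, y):
    """Look up p(Z = z | X = x, Y = y) from the table."""
    p0 = table_z0[(x, y)]
    return p0 if z == 0 else 1.0 - p0

def table_size(num_binary_parents):
    """A node with n binary parents needs 2**n table entries,
    e.g. 1,024 entries for 10 binary parents."""
    return 2 ** num_binary_parents
```

The exponential growth of `table_size` with the number of parents is precisely the storage burden that motivates the tree representation discussed next.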
Assuming that a decision-support system performs probabilistic inference with X's value being 0 and Y's value being 1, the following steps occur to access the appropriate probability in the tree data structure 330. First, the root vertex 332, vertex X, is accessed, and its value determines the edge or branch to be traversed. In this example, X's value is 0, so edge 344 is traversed to vertex 334, which is vertex Y. Second, after reaching vertex Y, the value for this vertex determines which edge is traversed to the next vertex. In this example, the value for vertex Y is 1, so edge 346 is traversed to vertex 338, which is a leaf vertex. Finally, after reaching the leaf vertex 338, which stores the probability for Z equaling 0 when X=0 and Y=1, the appropriate probability can be accessed. As compared to a table, a tree is a more efficient way of storing probabilities in a node of a Bayesian network because it requires less space. However, tree data structures are inflexible in the sense that they cannot adequately represent relationships between probabilities. For example, because of the acyclic nature of tree data structures, a tree cannot be used to indicate some types of equality relationships where multiple combinations of the values of the parent vertices have the same probability (i.e., refer to the same leaf vertex). This inflexibility requires that multiple vertices must sometimes store the same probabilities, which is wasteful. It is thus desirable to improve Bayesian networks with tree distributions.
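The traversal just described can be sketched as a small decision tree. In the sketch below, an internal vertex branches on a parent variable's value and a leaf vertex holds the stored probability p(Z = 0 | ...); the probability values are hypothetical.

```python
# Sketch of tree-structured storage of p(Z = 0 | X, Y), in the manner
# of FIG. 3C. An internal vertex is a tuple
# (variable, branch-if-0, branch-if-1); a leaf vertex is the stored
# probability. Probability values are illustrative.
tree = ("X",
        ("Y", 0.1, 0.2),   # X = 0: branch on Y
        ("Y", 0.2, 0.7))   # X = 1: branch on Y

def lookup(vertex, values):
    """Traverse from the root to a leaf using the parents' values."""
    while isinstance(vertex, tuple):
        var, if_zero, if_one = vertex
        vertex = if_one if values[var] == 1 else if_zero
    return vertex
```

Note that the probability 0.2 appears in two separate leaves: because a tree permits only one path to each vertex, equal probabilities for different parent configurations must be duplicated, which is exactly the wastefulness described above.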
Collaborative filtering systems have been developed that predict the preferences of a user. The term "collaborative filtering" refers to predicting the preferences of a user based on known attributes of the user, as well as known attributes of other users. For example, a preference of a user may be whether they would like to watch the television show "I Love Lucy," and the attributes of the user may include their age, gender, and income. In addition, the attributes may contain one or more of the user's known preferences, such as their dislike of another television show. A user's preference can be predicted based on the similarity of that user's attributes to those of other users. For example, if all users over the age of 50 with a known preference happen to like "I Love Lucy," and if that user is also over 50, then that user may be predicted to also like "I Love Lucy" with a high degree of confidence. One conventional collaborative filtering system has been developed that receives a database as input. The database contains attribute-value pairs for a number of users. An attribute is a variable or distinction, such as a user's age, gender or income, for predicting user preferences. A value is an instance of the variable. For example, the attribute age may have a value of 23. Each preference contains a numeric value indicating whether the user likes or dislikes the preference (e.g., 0 for dislike and 1 for like). The data in the database is obtained by collecting attributes of the users and their preferences. It should be noted that conventional collaborative filtering systems can typically only utilize numerical attributes. As such, the values for non-numerical attributes, such as gender, are transposed into numerical values, which sometimes reduces the accuracy of the system.
For example, when a variable has three non-numerical states, such as vanilla, chocolate and strawberry, transposing these states into numerical values will unintentionally indicate dissimilarity between the states. That is, if vanilla were assigned a value of 1, chocolate 2, and strawberry 3, the difference between each pair of values indicates to the system how similar each state is to the others. Therefore, the system may make predictions based on chocolate being more similar to both vanilla and strawberry than vanilla is to strawberry. Such predictions may be based on a misinterpretation of the data and lead to a reduction in the accuracy of the system.
In performing collaborative filtering, the conventional system first computes the correlation of attributes between a given user "v" and each other user "u" (except v) in the database. The computation of the "correlation" is a well-known computation in the field of statistics. After computing the correlation, the conventional system computes, for example, the preference of a user "v" for a title of a television show "t" as follows:

pref(t, v) = ⟨pref(t)⟩ + [ Σu (pref(t, u) − ⟨pref(t)⟩) · corr(u, v) ] / [ Σu corr(u, v) ]
where "pref(t, v)" is the preference of user "v" for title "t," where "⟨pref(t)⟩" is the average preference for title "t" by all users, where "pref(t, u)" is the preference of user "u" for title "t," where "corr(u, v)" is the correlation of users "u" and "v," and the sums run over the users "u" that have expressed a preference for title "t." One drawback to this conventional system is that the entire database must be examined when predicting preferences, which requires a significant amount of processing time.
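The formula above can be sketched directly. In the sketch below, `prefs[u]` maps titles to a user's numeric preferences and `corr[(u, v)]` holds precomputed user correlations; these data-structure names are illustrative assumptions, and the correlation computation itself is omitted.

```python
from statistics import mean

def predict_preference(title, v, prefs, corr):
    """Correlation-weighted prediction of user v's preference for
    `title`, following the formula above: the average preference for
    the title plus a correlation-weighted average of each rater's
    deviation from that average. `prefs` and `corr` are illustrative
    data structures; correlations are assumed precomputed."""
    raters = [u for u in prefs if u != v and title in prefs[u]]
    avg = mean(prefs[u][title] for u in raters)
    weighted_dev = sum(
        (prefs[u][title] - avg) * corr[(u, v)] for u in raters
    )
    total_corr = sum(corr[(u, v)] for u in raters)
    return avg + weighted_dev / total_corr
```

Note that the summations range over every rater of the title, which is the whole-database scan identified above as the drawback of this approach.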
One way to improve upon this conventional system is to utilize a clustering algorithm. Using this approach, a collaborative filtering system uses any of a number of well-known clustering algorithms to divide the database into a number of clusters. For example, the algorithms described in A. K. Jain, Algorithms for Clustering Data (1988) can be used. Each cluster contains the data of users whose preferences tend to be similar. As such, when predicting the preferences of one user in a cluster, only the preferences of the other users in the cluster need to be examined, and not the preferences of all other users in the database. A collaborative filtering system that utilizes a clustering algorithm receives as input a database, as described above, a guess of the number of clusters, and a distance metric. The guess of the number of clusters is provided by an administrator of the collaborative filtering system based on their own knowledge of how many clusters the database can probably be divided into. The distance metric is a metric provided by the administrator for each pair of users in the database that estimates how similar one user is to another based on the users' preferences and attributes. The distance metric ranges between 0 and 1, with 0 indicating that two users are least similar and 1 indicating that two users are most similar. This similarity is expressed as a numerical value. Each user will have a distance metric for every other user. Thus, the distance metrics are conveniently represented by an N-by-N matrix, where "N" is the number of users. After receiving the number of clusters and the distance metrics, the clustering algorithm identifies the clusters. The clustering algorithm outputs a list of the users in the database and a cluster number assigned to each user. To determine the preferences of a user, the other users within that user's cluster are examined.
For example, if the system is attempting to determine whether a user would like the television show "I Love Lucy," the other users within that user's cluster are examined. If there are six other users within the cluster and five out of the six like "I Love Lucy," then it is likely that the user will like it as well.
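The within-cluster prediction just described can be sketched as a simple majority vote among cluster-mates. The data-structure names below (`cluster_of`, `prefs`) are illustrative assumptions, with preferences encoded as 1 for like and 0 for dislike as in the database description above.

```python
def predict_in_cluster(title, user, cluster_of, prefs):
    """Predict `user`'s preference for `title` as the majority
    preference among the other users assigned to the same cluster.
    `cluster_of` maps users to cluster numbers output by the
    clustering algorithm; `prefs[u]` maps titles to 0/1 preferences.
    Both names are illustrative."""
    mates = [u for u in prefs
             if u != user
             and cluster_of[u] == cluster_of[user]
             and title in prefs[u]]
    likes = sum(prefs[u][title] for u in mates)
    return likes / len(mates) >= 0.5
```

Because only the cluster-mates are consulted, the cost of a prediction no longer grows with the size of the whole database.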
Although utilizing a clustering algorithm may be an improvement over the previously-described conventional system, it has limitations. One such limitation is that the exact number of clusters is determined manually, which renders the algorithm prone to human error. Another limitation is that all attributes must be numerical, and as such, the values of non-numerical attributes must be transposed into numerical values. Based upon the above-described limitations of conventional collaborative filtering systems, it is desirable to improve collaborative filtering systems.
The invention performs speech recognition using an array of mixtures of Bayesian networks. A mixture of Bayesian networks (MBN) consists of plural hypothesis-specific Bayesian networks (HSBNs) having possibly hidden and observed variables. A common external hidden variable is associated with the MBN, but is not included in any of the HSBNs. The number of HSBNs in the MBN corresponds to the number of states of the common external hidden variable, and each HSBN models the world under the hypothesis that the common external hidden variable is in a corresponding one of those states. In accordance with the invention, the MBNs encode the probabilities of observing the sets of acoustic observations given the utterance of a respective one of said acoustic parts of speech. Each of the HSBNs encodes the probabilities of observing the sets of acoustic observations given the utterance of a respective one of the acoustic parts of speech and given a hidden common variable being in a particular state. Each HSBN has nodes corresponding to the attributes of the acoustic observations. These nodes store probability parameters corresponding to the probabilities, and causal links between ones of said nodes represent dependencies. The array of MBNs is trained by conducting a parameter search for a set of changes in the probability parameters which improves the goodness of each HSBN in predicting the observed data, and then modifying the probability parameters of each HSBN accordingly. Then, for each HSBN, a structure score is computed reflecting the goodness of the HSBN in predicting the observed data. A structure search is conducted for a change in the causal links which improves the structure score, and the causal links of each HSBN are adjusted accordingly.
The MBN structure is initialized as a collection of identical HSBNs whose discrete hidden variables are connected to all observed variables and whose continuous hidden variables are connected only to each of the continuous observed variables, the directionality being from hidden variable to observed variable.
In constructing the MBN, the parameters of the current HSBNs are improved using an expectation-maximization process applied to the training data. The expectation-maximization process is iterated to improve the network's performance in predicting the training data, until some criterion has been met. Early in the process, this criterion may be a fixed number of iterations, which may itself be a function of the number of times the overall learning process has iterated. Later in the process, this criterion may be convergence of the parameters to a near optimum network performance level.
Then, expected complete-model sufficient statistics are generated from the training data. The expected complete-model sufficient statistics are generated as follows: first, a vector is formed for each observed case in the training data. Each entry in the vector corresponds to a configuration of the discrete variables. Each entry is itself a vector with subentries. The subentries for a given case are (1) the probability that, given the data of the particular case, the discrete variables are in the configuration corresponding to the entry's position within the vector, and (2) information defining the state of the continuous variables in that case multiplied by the probability in (1). These probabilities are computed by conventional techniques using the MBN in its current form. In this computation, conditional probabilities derived from the individual HSBNs are weighted and then summed together. The individual weights correspond to the current probabilities of the common external hidden variable being in a corresponding one of its states. These weights are computed from the MBN in its current form using conventional techniques. As each one of the set of vectors for all the cases represented by the training data is formed, a running summation of the vectors is updated, so that the expected complete-model sufficient statistics are generated as the sum of the vectors over all cases. (Waiting until all the vectors have been produced and then performing one grand summation would probably use too much memory space.)
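The running-summation computation just described can be illustrated on a deliberately simplified stand-in for the MBN: a two-state hidden variable (playing the role of the common external hidden variable) and a single continuous observed variable with a Gaussian density per hidden state. The component parameters, weights, and the choice of sufficient statistics below are illustrative assumptions, not the invention's actual model.

```python
import math

def gauss(x, mu, var):
    """Gaussian density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def expected_sufficient_statistics(cases, weights, params):
    """Running-sum computation of expected complete-model sufficient
    statistics for a two-component mixture with one continuous
    observed variable -- a simplified stand-in for the MBN
    computation described above. weights[k] is the current
    probability of the hidden variable's k-th state; params[k] is
    (mean, variance). All values are illustrative."""
    # Per hidden state: [sum of r, sum of r*x, sum of r*x^2], where r
    # is the posterior probability of that state given the case.
    stats = [[0.0, 0.0, 0.0] for _ in weights]
    for x in cases:
        # Weighted component densities, summed to normalize, mirroring
        # the weighted summation of HSBN-derived probabilities.
        joint = [w * gauss(x, m, v) for w, (m, v) in zip(weights, params)]
        total = sum(joint)
        for k, jk in enumerate(joint):
            r = jk / total
            stats[k][0] += r
            stats[k][1] += r * x
            stats[k][2] += r * x * x
    return stats
```

The statistics accumulate case by case, as in the running summation above, rather than being collected and summed in one pass at the end.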
After computation of the expected complete-model sufficient statistics for the MBN, the structures of the HSBNs are searched for changes which improve the HSBN's score, or performance in predicting the training data, given the current parameters. The MBN score preferably is determined by the HSBN scores, the score for the common hidden external variable, and a correction factor. If the structure of any HSBN changes as a result of this search, the prior steps beginning with the expectation-maximization process are repeated. The foregoing is iteratively repeated until the network structure stabilizes. At this point the current forms of the HSBNs are saved as the MBN. An MBN is thus generated for each possible combination of numbers of states of the hidden discrete variables, including the common external hidden variable, so that a number of MBNs is produced in accordance with the number of combinations of numbers of states of the hidden discrete variables.