The prior art includes two basic technologies for implementing knowledge-based systems on machines: expert systems and neural networks. In constructing an expert system, a human expert (or several experts) is generally consulted about how the expert would solve a certain problem. Through these consultations, general rules about how the data associated with a particular problem should be manipulated are developed. These rules are eventually programmed into the machine so that, given a set of input data, the formulated rules can be applied to the data to yield a solution.
As the above discussion indicates, expert systems are generally associated with top-down knowledge engineering and model-based or deductive reasoning. In other words, to implement an expert system one must have some previous information indicating how a problem should be solved or a model describing the problem through a group of rules.
In contrast to expert systems, neural networks are generally associated with bottom-up learning and inductive reasoning. To construct a neural network, one first constructs a network of neurons that receive input and produce an output in response to the input. In most neural networks, the neurons assign differing weights to each input, and combine the weighted inputs to produce the output. Once the basic neural network is constructed, it is then trained by feeding it data representative of known problems and their known solutions. The network then adjusts the weight factors in accordance with predetermined feedback rules so that it can correctly produce an acceptable output for each set of known inputs. In this sense, the neural network "learns" from the sets of known problems and solutions.
Additional details concerning expert systems and neural networks are set out below.
1.1 Expert Systems
The above provided a brief explanation regarding how an expert system works. This section provides a more detailed discussion of expert systems. As used in this specification, an "expert system" is generally defined as comprising three interacting components: a rule base, an inference engine, and a cache. Each element of such an expert system is figuratively illustrated in regard to a typical expert system 10 in FIG. 1A and further discussed below.
1.1(a) Rule Base
The rule base 12 of expert system 10 typically consists of statements called rules that are essentially implications with certainty factors. As discussed above, these rules are generally established by interviewing human experts. An entry in the rule base is a rule of the form
$$a \Rightarrow b \quad (cf)$$
where a is the antecedent, b the consequent, and cf the certainty factor of the rule. For example, a human expert in the area of auto repair may posit that 75% of the time that a car's brakes squeak, the brake pads need replacement. Thus, a rule may be established indicating that if the brakes squeak then the brake pads need replacing, with a certainty factor of 0.75. In this example the squeaking brakes would be the antecedent a, the brake pad replacement would be the consequent b, and 0.75 would be the certainty factor cf. For the purposes of this discussion, it is assumed that certainty factors are restricted to the range [-1,1], although the present invention may be practiced with other certainty factors.
Generally, the rules contained in an expert system's rule base may be expressed in mathematical terms. For example, let A denote the set of antecedents, B denote the set of consequents, and define $C = A \cup B$. Under such a description the elements of C are called assertions. Each assertion $c \in C$ has three attributes: a label c.l denoting a fixed logical statement, a real variable c.i called its internal state, and a real variable c.o called its output value. The significance of these labels, internal states, and output values is further discussed below. Again, for the purposes of this specification it is assumed that states are restricted to the range [-1,1] and output values are restricted to [0,1] although other ranges may be used without departing from the inventive scope of the present invention.
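As a rough illustration of the terms just defined, a rule and its assertions might be held in simple record types. The following is a minimal Python sketch only; the class and field names (`Assertion`, `Rule`, `label`, `internal_state`, `output`) are hypothetical stand-ins for c.l, c.i, and c.o and are not taken from this specification.

```python
from dataclasses import dataclass


@dataclass
class Assertion:
    """An element c of the assertion set C (field names are illustrative)."""
    label: str                    # c.l: fixed logical statement
    internal_state: float = 0.0   # c.i: assumed to lie in [-1, 1]
    output: float = 0.0           # c.o: assumed to lie in [0, 1]


@dataclass
class Rule:
    """A rule 'if antecedent then consequent' with certainty factor cf."""
    antecedent: Assertion
    consequent: Assertion
    cf: float                     # certainty factor, assumed to lie in [-1, 1]

    def __post_init__(self):
        # Enforce the range assumed in the text for certainty factors.
        if not -1.0 <= self.cf <= 1.0:
            raise ValueError("certainty factor must lie in [-1, 1]")


# The brake-pad rule from the example above.
squeak = Assertion("brakes squeak")
pads = Assertion("brake pads need replacing")
rule = Rule(squeak, pads, cf=0.75)
```

The range check reflects only the [-1,1] assumption stated above; a system using other certainty factors would relax it.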
In nearly all expert systems, the rules are constructed in a feed-forward form and no cycles occur among the expert system rules, as circular logic is typically disallowed in an expert system database. An example of this feed-forward rule construction is illustrated in FIG. 1B. As illustrated, while the output of a rule in one layer may serve as an antecedent for a rule in a subsequent layer, the output of a rule in a subsequent layer does not serve as an antecedent for rules in preceding layers.
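The feed-forward property described above can be checked mechanically by treating each rule as a directed edge from its antecedent to its consequent and testing the resulting graph for cycles. The sketch below assumes rules are given as (antecedent, consequent) label pairs and uses Kahn's topological sort; the function name `is_feed_forward` is hypothetical.

```python
from collections import defaultdict, deque


def is_feed_forward(rules):
    """Return True if the rules, given as (antecedent, consequent) label
    pairs, contain no cycle (checked with Kahn's topological sort)."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for a, b in rules:
        successors[a].append(b)
        indegree[b] += 1
        nodes.update((a, b))

    # Start from assertions that are never the consequent of any rule.
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        n = queue.popleft()
        visited += 1
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    # Every assertion is visited exactly when the rule graph has no cycle.
    return visited == len(nodes)
```

A rule base such as `[("squeak", "pads"), ("pads", "service")]` passes the check, while adding `("service", "squeak")` would introduce circular logic and fail it.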
1.1(b) Cache
The second element in most expert systems is known as the cache or working memory. Such a cache is figuratively illustrated as element 14 in FIG. 1. Basically, the cache 14 is the dynamic working memory of the expert system. The current state of any active rule (or assertion) is stored in the cache along with facts consisting of information about the validity of rule antecedents. The cache may be viewed as the place where the label of an assertion is associated with its current state and output value. For example, using the simple rule discussed above, the cache 14 would contain information indicating whether the brakes are squeaking. If the brakes were not squeaking, the current state of that antecedent would probably be 0, while if the brakes were squeaking the value would be 1.
1.1(c) Inference Engine
The inference engine is the part of the expert system that draws conclusions by manipulating rules from the rule base and facts from the cache, and updates the current values in the cache during processing. Even though it is usually superimposed on a clocked computer, this cache updating is naturally an event-driven computation.
Using the above example, assume that the driver of the car is not sure whether the brakes or shocks are squeaking but believes that there is an 80% chance that the brakes are squeaking. In this case the cache may include an indication that the current state of the antecedent "squeaking brakes" is 0.80. The inference engine would read this current state, apply the rule, and generate an output signal indicating that there is a 60% (0.8 × 0.75) chance that the brake pads need replacing. Of course, the above example is an extremely simple one. In most expert systems the number of antecedents will be much greater than one and the output state of one rule may serve as the input antecedent for another. In such systems, the modification of one antecedent almost always involves a recalculation of several other antecedents, an updating of these antecedent states in the cache, and a reapplication of the rules in the rule base to the updated antecedents.
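The single-rule inference step in this example amounts to multiplying the antecedent's current state by the rule's certainty factor and writing the result back to the cache. A minimal Python sketch follows, assuming the cache is a plain dictionary keyed by assertion label; the function name `apply_rule` and the dictionary layout are illustrative assumptions.

```python
def apply_rule(antecedent_state, cf):
    """Propagate an antecedent's current state through one rule with
    certainty factor cf (simple product, as in the brake-pad example)."""
    return antecedent_state * cf


# Cache holding the driver's 80% belief that the brakes are squeaking.
cache = {"brakes squeak": 0.80}

# The inference engine reads the state, applies the rule, and updates the cache.
cache["brake pads need replacing"] = apply_rule(cache["brakes squeak"], 0.75)
```

After this step the cache entry for "brake pads need replacing" is approximately 0.6, up to floating-point rounding.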
The inference processing of an expert system may be referred to as inferential dynamics. The inferential dynamics are determined by three components of the inference engine: the evidentiary combining function, the firing function, and the control system.
The "evidentiary combining function" is used to evaluate the effect of multiple rules having the same consequent. For example, assume that two rules are known, the first being the rule that when the brakes squeak the brake pads need to be changed 75% of the time, and the second being the rule that when the car takes 200 feet to stop at 25 miles per hour the brake pads need replacing 75% of the time. If the expert system is given the information that a car takes 200 feet to stop at 25 mph and that the brakes squeak, it can use the evidentiary combining function to combine the evidence from the two rules and indicate that there is probably a greater than 75% chance that the brake pads need replacing.
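The text above does not fix a particular evidentiary combining function. Purely as an illustration, the sketch below assumes the probabilistic-sum rule used for positive evidence in MYCIN-style certainty-factor systems, x + y - xy; this choice and the function names are assumptions, not the combining function of any particular system described here.

```python
from functools import reduce


def combine_positive(x, y):
    """Combine two positive pieces of evidence for the same consequent.
    Assumed MYCIN-style rule: x + y - x*y, which stays within [0, 1]."""
    return x + y - x * y


def combine_all(evidence):
    """Fold any number of positive evidence values into one value."""
    return reduce(combine_positive, evidence, 0.0)


# Two rules, each supporting "brake pads need replacing" with 0.75.
combined = combine_all([0.75, 0.75])   # 0.9375
```

Under this assumed rule the combined value exceeds either individual value, which matches the qualitative statement that the chance is probably greater than 75%.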
The "firing function" determines whether or not a rule will fire based on the value of its internal state, and then determines the output value. For an assertion a, a.o := f(a.i), where f is the firing function. Continuing the above example, a firing function for the rule "change brake pads" may be set to fire only when there are adequate indicators that the task needs to be performed. For instance, the firing function may be set to fire only when the internal state of the rule is greater than 0.75. Thus, if the brakes merely squeak or the car merely takes 200 ft to stop at 25 mph, the rule will not fire, since the internal state will be 0.75. However, if both facts are found in the cache, the internal state of the rule will be 1.5 and the rule will fire, indicating that the brake pads should be changed. In keeping with the above example, the output value for changing brake pads would be set to 1, since the output values are restricted to the range [0, 1].
The "control system" performs selection and scheduling. Selection consists of determining which rules and facts, if any, are to be considered for activation. Scheduling resolves conflicts that may arise, for example, when more than one rule is ready to fire or when rules are encountered in some sequential order that may not reflect real knowledge.
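The threshold behavior in the brake-pad firing example can be sketched as follows. The sketch assumes the additive combination implied by the example (0.75 + 0.75 = 1.5) and assumes that a rule that does not fire produces an output of 0; the function names and the default threshold are illustrative only.

```python
def internal_state(evidence):
    """Additive combination, as in the brake-pad example (0.75 + 0.75 = 1.5)."""
    return sum(evidence)


def fire(state, threshold=0.75):
    """Firing function f: return the output value a.o for internal state a.i.
    The rule fires only if the state is strictly greater than the threshold;
    the output is clipped to the range [0, 1]."""
    if state <= threshold:
        return 0.0            # assumed output when the rule does not fire
    return min(state, 1.0)


one_fact = fire(internal_state([0.75]))           # 0.0: rule does not fire
both_facts = fire(internal_state([0.75, 0.75]))   # 1.0: rule fires, output clipped
```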
Closely related to the concept of an inference engine is the concept of a "shell." Basically, a shell is an un-instantiated expert system, consisting of an inference engine and empty memory structures (i.e., no set rules in the rule-base). A shell can be instantiated (made into an expert system) by insertion of knowledge into the knowledge base or by placing rules into the rule base.
1.2 Neural Networks
In contrast to the non-cyclic, rule-based expert systems described above, most neural networks consist of networks of artificial neural objects, sometimes referred to as processing units, nodes, or "neurons," which receive input data, process that data, and generate an output signal. In such systems, the key to solving a problem lies not in a rule proclaimed by an expert but in the processing functions of the many neurons which comprise the network. One example of a neural network is illustrated in FIG. 2A.
As discussed above, most neural networks must be trained to establish the processing functions of the neurons and enable the network to solve problems. The role and nature of time are important for both computational expediency and model realism in the training of such neural networks. This role is influenced and often constrained by architectural assumptions. For example, the most widely used learning algorithm, backpropagation, typically depends on a layered feed-forward architecture that implicitly defines the role of time in both activation and learning phases. The layers impose what amounts to a global clock on the entire network. Additional discussion of the artificial neural objects that make up a neural network is contained below.
In object-oriented programming, an object is characterized by a set of attributes describing its current state and a set of operations which can be applied to that object. An "artificial neural object" (ANO) may be defined as an artificial neuron with its attendant states, I/O, and processes: activation state, processing state, incoming connections and connection strengths, outgoing connections, combining function, output function, and possibly a learning function, together with a communications facility to signal changes in processing state (waiting, active, ready) to adjacent nodes in the network. Precise specification of communications facilities for an ANO is dependent on the learning method imposed on the network and possibly other application-specific considerations. The exact nature of the combining and output functions is also variable. There is one specific type of ANO that is widely used in prior art neural networks. The combining function for this ANO is the taking of a weighted sum of the inputs; the output function for this ANO is a sigmoidal squashing function applied to the value of the weighted sum of the inputs. For the purposes of this specification only, an ANO meeting this definition, i.e., having a weighted sum combining function and a sigmoidal squashing output function, is referred to as an analog perceptron.
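An analog perceptron as defined above can be sketched in a few lines of Python. The text requires only a sigmoidal squashing function; the logistic function used here is one common choice and is an assumption, as are the function names.

```python
import math


def sigmoid(x):
    """Logistic squashing function (assumed choice of sigmoid), written to
    avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def analog_perceptron(inputs, weights):
    """Combining function: weighted sum of the inputs.
    Output function: sigmoid applied to that weighted sum."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    net = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(net)


# Two inputs, each weighted 0.75; the output always lies strictly in (0, 1).
y = analog_perceptron([1.0, 1.0], [0.75, 0.75])
```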
1.3 Learning
As discussed above, one important step in training a neural network to solve problems is to teach the neural network how to solve the desired problem by supplying it with known problems with known solutions. This teaching process is often referred to as "learning" since the neural network learns from the known correct cases.
Once the nodes in a neural network have been enumerated, each node can be assigned a "weight vector" whose components are the weights of its incoming edges. The weight vectors are the rows of the "weight matrix," also called the "knowledge state" of the network. The "weight space" of a node consists of all possible weight vectors. "Learning" is defined to be a change in the knowledge state. The process of updating the knowledge state over time according to some algorithm is called "knowledge dynamics." The time scale of knowledge dynamics is generally assumed to be slower than that of activation dynamics.
Learning implies a change in knowledge. Generally speaking, neural networks are said to represent knowledge in their connections. There are two levels on which to interpret such a statement.
First, given a set of connections (a network topology), knowledge is stored in the synaptic functions. This is the more usual interpretation and is usually referred to as "fine" knowledge. In other words, fine knowledge is represented in a neural network by the weights of the connections between the ANOs.
On the other hand, the specification of which connections exist could also fit this concept of knowledge in neural networks. This is referred to as "coarse" knowledge.
Thus, coarse knowledge is captured in a network topology; fine knowledge is captured in the synaptic functionality of the connections. Learning coarse knowledge means changing the network topology while learning fine knowledge (or knowledge refinement) involves changing the synaptic functionalities. In either case learning is change, or knowledge dynamics.
The above brake-pad example may be used to illustrate the difference between coarse learning and fine learning. Assume that a neural network is established and that numerous known-correct cases are applied where it has been proper to change the brake pads. After learning, the neural network should establish a link between (a) the neurons responsible for indicating that the brakes squeak and that the car takes 200 ft to stop at 25 mph and (b) the neuron responsible for indicating that the brake pads need to be changed. This establishment of the link may be referred to as "coarse learning."
Once the coarse learning has been accomplished, the neural network must next determine what weight factors to apply to the respective outputs indicating that the brakes squeak and that the car takes 200 ft to stop at 25 mph. After repeated learning, the neural network should determine that the weight factors for both of these outputs should be 0.75. This determination of the exact weight factor to be applied to given inputs is referred to as "fine learning."
Learning coarse knowledge could be loosely interpreted as rule extraction; a considerable body of research on this topic exists independent of neural networks. Some connectionist methods have also been introduced in recent years that build or modify network topology. While these methods are mostly not directed at high-level networks, where a single connection may be assigned meaning, some of them have potential in the realm of expert networks.
1.4 Backpropagation Learning
As discussed above, a neural network must go through a learning process before it can accurately be used to solve basic problems. Although several procedures are available for training neural networks, one of the most widely used is "backpropagation of error" learning.
Backpropagation learning, more precisely described as steepest-descent supervised learning using backpropagation of error, has had a significant impact in the field of neural networks. Basically, backpropagation of error involves comparing the actual output of a neuron with a known correct value and determining the error for that neuron. The error for that neuron is then sent back (or backpropagated) to the neurons that provided input into the neuron for which the error was calculated; the errors for those neurons are then calculated and backpropagated through the network.
Once each neuron receives its backpropagated error, it has an indication of both what its output value actually is and what that output value should be. Because the error for a given neuron is essentially a vector representing the erroneous weights given to the various input values, each node can be designed (a) to determine the gradient of the error vector and (b) to determine in which direction it must change its weight vector to minimize the magnitude of the error vector. In other words, each neuron can be designed to determine the change in its weight vector that would tend to minimize the magnitude of the error vector the fastest, and then to change its weight vector in that direction. By periodically receiving error vectors, calculating the fastest way to minimize the magnitude of the error vector (i.e., calculating the steepest descent), and altering its weight vector, the neurons of a neural network can learn how to solve various problems.
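For a single analog perceptron, one steepest-descent step can be sketched as below. The sketch assumes a squared-error measure E = (t - y)^2 / 2 and a logistic output function, neither of which is fixed by the text; the learning rate `rate` and the function names are likewise illustrative.

```python
import math


def sigmoid(x):
    """Logistic squashing function (assumed choice of sigmoid)."""
    return 1.0 / (1.0 + math.exp(-x))


def output(weights, inputs):
    """Analog perceptron output: sigmoid of the weighted sum of inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))


def squared_error(weights, inputs, target):
    """Assumed error measure E = (t - y)^2 / 2."""
    return 0.5 * (target - output(weights, inputs)) ** 2


def descent_step(weights, inputs, target, rate=0.5):
    """Move the weight vector a small step against the gradient of E.
    With y = sigmoid(w . x), dE/dw_i = -(t - y) * y * (1 - y) * x_i."""
    y = output(weights, inputs)
    delta = (target - y) * y * (1.0 - y)
    return [w + rate * delta * x for w, x in zip(weights, inputs)]


w0 = [0.0, 0.0]
w1 = descent_step(w0, inputs=[1.0, 1.0], target=1.0)
```

Repeating `descent_step` on the same example drives the error returned by `squared_error` toward zero, which is the sense in which the neuron "learns" from a known case.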
Because of the need to backpropagate errors, prior art backpropagation learning methods typically depend on a layered feed-forward architecture that implicitly defines the role of time in both activation and learning phases. In other words, most neural networks using backpropagation divide the ANOs into separate layers and backpropagate the error from each layer to its predecessor through the use of a global clock. Such layered backpropagation is illustrated in FIG. 2B.
As illustrated, the neural network is divided into four layers: A, B, C and D. Known inputs are applied to layer A and the network is activated to yield outputs in layer D. The error for the ANOs in layer D is then calculated using the known outputs and backpropagated to layer C. This process is repeated from layer C to B and B to A. Notably, the errors for all the neurons in a layer are clocked to the neurons of the preceding layer together, so that the layers impose what amounts to a global clock on the entire network.
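The layer-by-layer procedure just described can be sketched in plain Python for a small four-layer network. The layer sizes, learning rate, squared-error measure, logistic output function, and all function names are assumptions made for illustration; the sketch omits bias terms and is not the network of FIG. 2B.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_network(sizes, seed=0):
    """One weight matrix per pair of adjacent layers; each row holds the
    incoming weights of one neuron."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
             for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])]


def forward(weights, inputs):
    """Activate layer by layer; returns the outputs of every layer (A..D)."""
    activations = [inputs]
    for layer in weights:
        prev = activations[-1]
        activations.append(
            [sigmoid(sum(w * x for w, x in zip(row, prev))) for row in layer])
    return activations


def backprop_step(weights, inputs, targets, rate=0.5):
    """One synchronous training step. The error is computed at the output
    layer and passed back one whole layer at a time (D -> C -> B).
    Returns the squared error measured before the weight update."""
    acts = forward(weights, inputs)
    # Error signal at the output layer D.
    deltas = [(t - y) * y * (1.0 - y) for t, y in zip(targets, acts[-1])]
    for k in range(len(weights) - 1, -1, -1):
        prev = acts[k]
        next_deltas = []
        if k > 0:
            # Error signal for the preceding layer, using the old weights.
            # Layer A has no incoming weights, so nothing is computed for it.
            next_deltas = [
                sum(weights[k][j][i] * deltas[j] for j in range(len(deltas)))
                * prev[i] * (1.0 - prev[i])
                for i in range(len(prev))]
        # Steepest-descent update of the weights feeding layer k + 1.
        for j, row in enumerate(weights[k]):
            for i in range(len(row)):
                row[i] += rate * deltas[j] * prev[i]
        deltas = next_deltas
    return 0.5 * sum((t - y) ** 2 for t, y in zip(targets, acts[-1]))


# Layers A, B, C, D with 2, 3, 3 and 1 neurons respectively.
net = make_network([2, 3, 3, 1])
first_error = backprop_step(net, [1.0, 0.0], [1.0])
```

Because each pass of the loop handles an entire layer before moving to its predecessor, the loop index `k` plays the role of the global clock discussed above.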
Another feature typical of most prior art neural networks using backpropagation learning is that the ANOs are almost always simple analog perceptrons. As discussed above, an analog perceptron is an ANO where the combining function takes the weighted sum of the inputs and the output function is a sigmoidal squashing function.
Although analog perceptrons are useful for solving many problems, they are often inadequate when more complicated types of neural networks are to be implemented. Apparently, the complexities of implementing backpropagation with ANOs that are not perceptrons have been one of the factors preventing the prior art from using non-perceptron ANOs in backpropagation neural networks.
In summary, although backpropagation has been widely used in the prior art as a supervised learning paradigm, it has been applied almost exclusively to layered, feed-forward networks of analog perceptrons.