Various methods are known from the prior art in which the dynamic behavior of a technical system over time is optimized with computer assistance while taking prespecified criteria into account. Examples of such methods are learning methods which, on the basis of known data of the technical system, learn an optimized adjustment of the system. In such methods the technical system is described by states which represent measurable state variables of the technical system, for example physical variables such as pressure, temperature, power and the like. Furthermore, actions are defined in the technical system which describe the modification of specific adjustment variables of the technical system, such as the modification of valve settings, an increase in pressure and the like. An action puts the technical system from its current state into a new follow-up state. Known learning methods learn an optimum action selection rule which defines, for each state of the technical system, the optimum action for putting the system into a new state. Each action is typically assessed by either a reward or a penalty, in particular with the inclusion of a cost function, so that an optimum dynamic behavior of the technical system can be achieved with the aid of the rewards.
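By way of illustration only, the relationship between states, actions, rewards, follow-up states and an action selection rule can be sketched as follows. The sketch assumes a tabular value-iteration scheme over recorded, deterministic transitions; this scheme, the function name `learn_action_selection_rule`, the state and action labels and the reward values are all hypothetical and are not taken from any particular known method.

```python
from collections import defaultdict


def learn_action_selection_rule(transitions, gamma=0.9, sweeps=100):
    """Derive a greedy action selection rule from recorded transitions.

    transitions: list of (state, action, reward, follow_up_state) tuples.
    Assumes deterministic dynamics, so each value is simply overwritten.
    """
    # Only actions actually observed in a state are considered there.
    observed = defaultdict(set)
    for state, action, _, _ in transitions:
        observed[state].add(action)

    q = defaultdict(float)  # q[(state, action)]: estimated long-term reward
    for _ in range(sweeps):
        for state, action, reward, follow_up in transitions:
            best_next = max(
                (q[(follow_up, a)] for a in observed.get(follow_up, ())),
                default=0.0,
            )
            q[(state, action)] = reward + gamma * best_next

    # The rule maps each state to its highest-valued observed action.
    return {
        state: max(sorted(actions), key=lambda a: q[(state, a)])
        for state, actions in observed.items()
    }


# Hypothetical data: three pressure levels and three valve actions.
# A reward of 1.0 is given for reaching or holding the nominal level,
# a penalty of -1.0 otherwise.
recorded = [
    ("low", "close_valve", 1.0, "nominal"),
    ("low", "open_valve", -1.0, "low"),
    ("nominal", "hold", 1.0, "nominal"),
    ("nominal", "close_valve", -1.0, "high"),
    ("nominal", "open_valve", -1.0, "low"),
    ("high", "open_valve", 1.0, "nominal"),
    ("high", "close_valve", -1.0, "high"),
]

policy = learn_action_selection_rule(recorded)
print(policy)
```

In this sketch the rule is only as good as the recorded data: an action that was never observed in a state cannot be selected for that state.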
The basis for suitably determining an optimum behavior of a technical system is a good database in the form of a well-explored state space of the technical system. This means that a large amount of suitable data about the technical system must be available in the form of states, actions and follow-up states. The problem arises here that, during the exploration of the state space carried out to determine a suitable database, the system can assume states which are to be classified as unsafe, i.e. states which under some circumstances can lead directly or indirectly to damage to the technical system or to its malfunctioning.
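The exploration problem described above can likewise be illustrated by a minimal sketch. It assumes a hypothetical one-dimensional system with five pressure levels, of which the highest is arbitrarily designated as unsafe; the names `step`, `explore`, `ACTIONS` and `UNSAFE_STATES` are illustrative assumptions. Exploration here simply tries every action in every reached state, which is one conceivable strategy among many.

```python
from collections import deque

# Hypothetical toy system: pressure levels 0..4, level 4 assumed unsafe.
UNSAFE_STATES = {4}
ACTIONS = (-1, 0, 1)  # lower, hold, or raise the pressure level


def step(state, action):
    """Assumed dynamics: the level changes by the action, clamped to 0..4."""
    return min(4, max(0, state + action))


def explore(start):
    """Try every action in every reached state, breadth first.

    Returns the collected (state, action, follow_up_state) transitions and
    the subset of transitions that led into an unsafe state.
    """
    transitions, unsafe_visits = [], []
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        for action in ACTIONS:
            follow_up = step(state, action)
            transitions.append((state, action, follow_up))
            if follow_up in UNSAFE_STATES:
                unsafe_visits.append((state, action, follow_up))
            if follow_up not in seen:
                seen.add(follow_up)
                # Exploration is not continued from an unsafe state.
                if follow_up not in UNSAFE_STATES:
                    queue.append(follow_up)
    return transitions, unsafe_visits


transitions, unsafe_visits = explore(start=2)
print(len(transitions), unsafe_visits)
```

Under these assumptions, complete coverage of the safe states is obtained only at the price of entering the unsafe state once, which is the difficulty the paragraph above describes.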