In reinforcement learning, an area of machine learning, a software model is trained to take actions in an environment so as to maximize some notion of cumulative reward. In a typical training process, the model generates an action, sends the action to the environment, and waits for the resulting effect on, or outcome from, the environment to be returned through a feedback loop. The feedback can then be used to calculate a score indicating the reward for taking the action, and the score can be incorporated into the model so that its subsequent actions better maximize the score/reward.
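By way of illustration only, the typical training loop described above may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from this disclosure: the names `ToyEnvironment`, `ToyModel`, and `train`, the two-action environment with its success probabilities, and the epsilon-greedy update rule are all hypothetical and were chosen only to keep the example small.

```python
import random


class ToyEnvironment:
    """Hypothetical environment: action 1 succeeds more often than action 0."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._success_prob = {0: 0.2, 1: 0.8}  # assumed values, for illustration

    def step(self, action):
        # Return the feedback for an action: 1 on success, 0 otherwise.
        # In a real system this call may take a long time to return.
        return 1 if self._rng.random() < self._success_prob[action] else 0


class ToyModel:
    """Hypothetical model: keeps a running average reward per action."""

    def __init__(self, n_actions=2, epsilon=0.1, seed=1):
        self._rng = random.Random(seed)
        self.epsilon = epsilon
        self.values = [0.0] * n_actions  # estimated reward of each action
        self.counts = [0] * n_actions    # how often each action was taken

    def act(self):
        # Explore with probability epsilon; otherwise pick the best-known action.
        if self._rng.random() < self.epsilon:
            return self._rng.randrange(len(self.values))
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, action, reward):
        # Incorporate the reward as an incremental running average.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]


def train(steps=2000):
    env = ToyEnvironment()
    model = ToyModel()
    total_reward = 0.0
    for _ in range(steps):
        action = model.act()          # model generates an action
        feedback = env.step(action)   # model waits here for the feedback
        reward = float(feedback)      # score calculated from the feedback
        model.update(action, reward)  # score incorporated into the model
        total_reward += reward
    return model, total_reward


if __name__ == "__main__":
    model, total_reward = train()
    print("estimated action values:", model.values)
    print("cumulative reward:", total_reward)
```

In this sketch each iteration blocks on `env.step` before the model can be updated, which corresponds to the waiting step of the feedback loop.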
In this typical training process, the resources for training the model normally have to remain idle while waiting for the effect or outcome to be fed back from the environment before the next training task can be launched. Such idle time prolongs the total time required to train the model. In some cases, the effect of an action on the environment may take a long time to materialize (e.g., weeks or even months). Leaving the training resources idle for such a long period may waste significant computational power.
Thus, there is a need for systems and methods capable of utilizing the waiting period to accelerate the model training process.