Technical Field
The present disclosure relates to the field of reinforcement learning. Particularly, the present disclosure relates to a processor specifically programmed for implementing reinforcement learning operations, and to an application-domain specific instruction set (ASI) comprising instructions specifically designed for implementing reinforcement learning operations.
Description of the Related Art
Artificial Intelligence (AI) aims to make a computer/computer-controlled robot/computer-implemented software program mimic the thought process of a human brain. Artificial Intelligence is utilized in various computer-implemented applications including gaming, natural language processing, creation and implementation of expert systems, creation and implementation of vision systems, speech recognition, handwriting recognition, and robotics. A computer/computer-controlled robot/computer-implemented software program achieves or implements Artificial Intelligence through iterative learning, reasoning, perception, problem-solving and linguistic intelligence.
Machine learning is a branch of artificial intelligence that provides computers the ability to learn without necessitating explicit functional programming. Machine learning emphasizes the development of (artificially intelligent) learning agents that could tweak their actions and states dynamically and appropriately when exposed to a new set of data. Reinforcement learning is a type of machine learning where a reinforcement learning agent learns by utilizing the feedback received from a surrounding environment in each entered state. The reinforcement learning agent traverses from one state to another by way of performing an appropriate action at every state, thereby receiving an observation/feedback and a reward from the environment. The objective of a Reinforcement Learning (RL) system is to maximize the reinforcement learning agent's total rewards in an unknown environment through a learning process that requires the reinforcement learning agent to traverse between multiple states while receiving feedback and reward at every state, in response to an action performed at every state.
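The interaction described above, in which the agent performs an action at every state and receives an observation and a reward in return, can be illustrated with a minimal sketch. The `ChainEnvironment` class, its five states, its two actions and its reward of 1.0 at the final state are hypothetical choices made purely for illustration and are not part of the present disclosure.

```python
class ChainEnvironment:
    """Hypothetical chain of states 0..n-1; reaching the last state yields reward 1.0."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        # Start every episode in the leftmost state.
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, any other action moves left; both ends are walls.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, choose_action, max_steps=100):
    """Let the agent act until the episode ends; return the total reward collected."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = choose_action(state)           # agent acts in the current state
        state, reward, done = env.step(action)  # environment returns feedback and reward
        total_reward += reward
        if done:
            break
    return total_reward
```

In this sketch, an agent that always moves right collects the reward, whereas an agent that always moves left collects none within the step limit.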
Further, essential elements of a reinforcement learning system include a ‘policy’, ‘reward functions’, ‘action-value functions’ and ‘state-value functions’. Typically, a ‘policy’ is defined as a framework for interaction between the reinforcement learning agent and a corresponding reinforcement learning environment. Typically, the actions undertaken by the reinforcement learning agent and the states traversed by the reinforcement learning agent during an interaction with a reinforcement learning environment are governed by the policy. When an action is undertaken, the reinforcement learning agent moves within the environment from one state to another and the quality of a state-action combination defines an action-value function. The action-value function (Qπ) determines the expected utility of a (selected) action. The reward function is representative of the rewards received by the reinforcement learning agent at every state in response to performing a predetermined action. Even though rewards are provided directly by the environment after the reinforcement learning agent performs specific actions, the ‘rewards’ are estimated and re-estimated (approximated/forecasted) from the sequences of observations a reinforcement learning agent makes over its entire lifetime. Thus, a reinforcement learning algorithm aims to estimate a state-value function and an action-value function that help approximate/forecast the maximum possible reward to the reinforcement learning agent.
Q-learning is one of the techniques employed to perform reinforcement learning. In Q-learning, the reinforcement learning agent attempts to learn an optimal policy based on the historic information corresponding to the interaction between the reinforcement learning agent and the reinforcement learning environment. The reinforcement learning agent learns to carry out actions in the reinforcement learning environment to maximize the rewards achieved or to minimize the costs incurred. Q-learning estimates the action-value function that further provides the expected utility of performing a given action in a given state and following the optimal policy thereafter. Thus, by finding the optimal policy, the agents can perform actions to achieve maximum rewards.
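By way of illustration only, the estimation of the action-value function in Q-learning can be sketched in tabular form using the standard update Q(s, a) ← Q(s, a) + α[r + γ max Q(s', a') − Q(s, a)]. The five-state chain, the `step` helper, the epsilon-greedy exploration and the values chosen for the learning rate `alpha`, the discount factor `gamma` and `epsilon` are all assumptions made for this sketch and are not taken from the present disclosure.

```python
import random

N_STATES, N_ACTIONS = 5, 2   # hypothetical chain; action 0 = left, action 1 = right
GOAL = N_STATES - 1


def step(state, action):
    """Hypothetical environment: reward 1.0 only on reaching the goal state."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), GOAL)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # action-value table Q[s][a]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy selection: mostly exploit, occasionally explore.
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice(
                    [a for a in range(N_ACTIONS) if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Target: reward plus discounted value of the best next action.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Derive a policy by choosing the highest-valued action in each state."""
    return [max(range(N_ACTIONS), key=lambda a: q[s][a]) for s in range(N_STATES)]
```

In this sketch the learned table favours moving right in every non-goal state, which is the optimal policy for the hypothetical chain; the entry for the goal state itself is never updated and carries no meaning.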
Existing methods disclose the use of neural networks (by the reinforcement learning agents) to determine the action to be performed in response to the observation/feedback. Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Each layer of the network generates an output from a received input in accordance with current values of a respective set of parameters. However, existing methods do not disclose processor architectures specifically configured to perform reinforcement learning operations. Furthermore, existing methods that promote the use of neural networks do not support reward function approximation.
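The layer-by-layer computation described above can be sketched as follows. The sketch assumes fully connected layers with a tanh nonlinearity and assumes that the action with the highest output score is selected; the function names `dense_layer`, `forward` and `select_action` are hypothetical and are not drawn from any existing method.

```python
import math


def dense_layer(inputs, weights, biases):
    """One layer of nonlinear units: tanh(W x + b), one unit per weight row."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(observation, layers):
    """Pass the observation through each layer using its current parameters."""
    activation = observation
    for weights, biases in layers:
        activation = dense_layer(activation, weights, biases)
    return activation


def select_action(observation, layers):
    """Pick the index of the largest output (assumed here to score the actions)."""
    scores = forward(observation, layers)
    return max(range(len(scores)), key=lambda i: scores[i])
```

Here `layers` is a list of `(weights, biases)` pairs, one pair per layer, so that changing the parameter values changes the action selected for a given observation.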
To implement deep reinforcement learning and other AI applications, existing systems typically use Graphics Processing Units (GPUs). GPUs typically incorporate a Single Instruction Multiple Data (SIMD) architecture to execute reinforcement learning operations. In SIMD, all the processing units of a GPU share the same instruction but perform operations on different data elements. However, the GPUs require a large amount of processing time to extract actionable data. Further, GPUs are unsuitable for sequential decision-making tasks and are inefficient in handling the memory accesses of reinforcement learning tasks.
Therefore, in order to overcome the drawbacks discussed hitherto, there is felt a need for a processor architecture specifically designed for implementing reinforcement learning operations/tasks. Further, there is also felt a need for a processor architecture that renders rich actionable data for effective and efficient implementation of reinforcement learning operations. Further, there is also felt a need for a processor architecture that incorporates an application-domain specific instruction set, a memory architecture and a multi-core processor specifically designed for performing reinforcement learning tasks/operations.