This disclosure relates to methods and systems for dispatching vehicles in a public transportation network.
The Reinforcement Learning (RL) framework has promised solutions for several applications, such as slow server problems, in which arriving customers wait in a queue before obtaining service (e.g., call center operations, web server load balancing, etc.); machine replacement problems in inventory management; and river swim problems, in which an agent needs to swim left or right in a stream. A recent goal in the RL framework is to choose a sequence of actions, or a policy, that maximizes the reward collected or minimizes the regret incurred over a finite time horizon. For several RL problems in operations research and optimal control, the optimal policy of the underlying Markov Decision Process (MDP) is characterized by a known structure. The current state of the art does not utilize this known structure of the optimal policy while minimizing the regret. Other systems attempt to optimize long-range average reward, an approach that has previously been shown to be disadvantageous in some scenarios compared with algorithms that minimize regret. In other RL systems, the transition probabilities and reward values are not known a priori, making it harder to compute a decision rule.
This document describes devices and methods that are intended to address at least some of the issues discussed above and/or other issues.