Schematic framework of the reinforcement learning algorithm using policy iteration for continuous-time dynamical systems (IMAGE)
Caption
At each sampling instant, the system output and the applied action are observed and used to form a discrete-time reward. The sampled input-output data are collected in real time along the trajectory of the dynamical system and stacked over the time interval of interest. Together with the prescribed optimization criterion, these data are used to update the value estimate in the critic module, which in turn is used to update the control policy in the actor module. The ultimate goal of the framework is to learn, from input-output data alone, the optimal decision law that minimizes the user-defined optimization criterion.
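The critic/actor loop described in the caption can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a scalar linear plant dx/dt = a*x + b*u with a fully measured state (rather than output feedback), a quadratic running cost q*x^2 + r*u^2, a known input gain b, and a stabilizing initial gain. The function names, parameter values, and the simulated plant standing in for real measurements are all hypothetical. The critic fits p in V(x) = p*x^2 by least squares on the sampled data, and the actor then sets the gain to k = b*p/r.

```python
import math


def simulate_interval(x, k, a, b, q, r, dt, steps):
    """Run the stand-in plant under u = -k*x for one interval.

    Returns the final state and the running cost integrated from samples.
    """
    cost = 0.0
    for _ in range(steps):
        # Exact step of the closed-loop plant; it only plays the role of the
        # real system, and the learner never reads `a` directly.
        x_next = x * math.exp((a - b * k) * dt)
        # Trapezoidal rule on the sampled cost q*x^2 + r*u^2 with u = -k*x.
        cost += 0.5 * dt * (q + r * k * k) * (x * x + x_next * x_next)
        x = x_next
    return x, cost


def policy_iteration(k0, a=1.0, b=1.0, q=1.0, r=1.0,
                     dt=1e-3, steps=50, intervals=20, iterations=6):
    """Alternate critic and actor updates, starting from a stabilizing gain k0."""
    k = k0
    p = 0.0
    for _ in range(iterations):
        x = 1.0  # assumed initial condition for each data-collection run
        num = 0.0
        den = 0.0
        for _ in range(intervals):
            x_end, cost = simulate_interval(x, k, a, b, q, r, dt, steps)
            # Critic regression: p * (x_start^2 - x_end^2) = integrated cost.
            phi = x * x - x_end * x_end
            num += phi * cost
            den += phi * phi
            x = x_end
        p = num / den      # critic: least-squares value estimate
        k = b * p / r      # actor: policy improvement (assumes b is known)
    return k, p


if __name__ == "__main__":
    gain, value = policy_iteration(k0=3.0)
    print("learned gain:", gain, "value weight:", value)
```

With the assumed values a = b = q = r = 1, the learned value weight should approach 1 + sqrt(2), the positive root of the scalar Riccati equation p^2 - 2p - 1 = 0, up to the numerical integration error.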
Credit
©Science China Press
Usage Restrictions
Use with credit.
License
Original content