Nonrandomized Markov and semi-Markov policies in dynamic programming
Teoriâ veroâtnostej i ee primeneniâ, Volume 27 (1982) no. 1, pp. 109-119
See the article record from the source Math-Net.Ru
We consider a discrete-time, infinite-horizon, non-stationary Markov decision model with Borel state and action spaces and the expected total reward criterion. For an arbitrary fixed policy $\pi$, the following two statements are proved: a) for an arbitrary initial measure $\mu$ and any constant $K<\infty$ there exists a nonrandomized Markov policy $\varphi$ such that \begin{gather*} w(\mu,\varphi)\ge w(\mu,\pi)\ \text{if}\ w(\mu,\pi)<\infty, \\ w(\mu,\varphi)\ge K\ \text{if}\ w(\mu,\pi)=\infty, \end{gather*} b) for an arbitrary measurable function $K(x)<\infty$ on the initial state space $X_0$ there exists a nonrandomized semi-Markov policy $\varphi'$ such that \begin{gather*} w(x,\varphi')\ge w(x,\pi)\ \text{if}\ w(x,\pi)<\infty, \\ w(x,\varphi')\ge K(x)\ \text{if}\ w(x,\pi)=\infty\ \text{for every}\ x\in X_0. \end{gather*} Here, for every policy $\sigma$, $w(\mu,\sigma)$ and $w(x,\sigma)$ denote the values of the criterion for the initial measure $\mu$ and the initial state $x$, respectively.