$Q$-learning in a stochastic Stackelberg game between an uninformed leader and a naive follower
Teoriâ veroâtnostej i ee primeneniâ, Tome 64 (2019) no. 1, pp. 53-74
See the article's record at its source, Math-Net.Ru
We consider a game between a leader and a follower, where the players' actions affect the stochastic dynamics of the state process $x_t$, $t\in\mathbb Z_+$. The players observe their own rewards and the system state $x_t$. The transition kernel of the process $x_t$ and the opponent's rewards are unobservable. At each stage of the game, the leader selects an action $a_t$ first. When selecting the action $b_t$, the follower knows the action $a_t$. The follower's actions are unknown to the leader (an uninformed leader). Each player tries to maximize the discounted criterion by applying the $Q$-learning algorithm. The players' randomized strategies are uniquely determined by Boltzmann distributions depending on the $Q$-functions, which are updated in the course of learning. The specific feature of the algorithm is that, when updating the $Q$-function, the follower believes that the leader's action in the next state will be the same as in the current one (a naive follower). It is shown that the convergence of the algorithm is ensured by the existence of deterministic stationary strategies that generate an irreducible Markov chain. The limiting large-time behavior of the players' $Q$-functions is described in terms of controlled Markov processes. The distributions of the players' actions converge to Boltzmann distributions depending on the limiting $Q$-functions.
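The two ingredients the abstract names can be illustrated briefly: a Boltzmann (softmax) distribution over actions derived from a $Q$-function, and the naive follower's update, which treats the leader's current action as if it will be repeated in the next state. This is a minimal sketch, not the paper's construction; the state/action dimensions, the learning rate `alpha`, the discount `gamma`, and all function names are illustrative assumptions.

```python
import numpy as np

def boltzmann(q_values, temperature=1.0):
    """Boltzmann (softmax) distribution over actions for given Q-values."""
    z = q_values / temperature
    z = z - z.max()            # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical sizes: states, leader actions, follower actions.
n_states, n_a, n_b = 3, 2, 2
gamma, alpha = 0.9, 0.1       # discount factor and learning rate (assumed)

# The follower's Q-function depends on (state, leader action, own action).
Q_f = np.zeros((n_states, n_a, n_b))

def follower_update(x, a, b, reward, x_next):
    """Naive follower's Q-update: it assumes the leader will play the
    same action a in the next state x_next as in the current state x."""
    target = reward + gamma * Q_f[x_next, a].max()
    Q_f[x, a, b] += alpha * (target - Q_f[x, a, b])
```

For example, with equal Q-values `boltzmann` returns the uniform distribution, and after observing a transition the follower moves `Q_f[x, a, b]` a fraction `alpha` of the way toward the bootstrapped target. The follower's randomized strategy at $(x, a)$ would then be `boltzmann(Q_f[x, a], temperature)`.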
Keywords:
$Q$-learning, leader, follower, stochastic Stackelberg game, discounted criterion, Boltzmann distribution.
@article{TVP_2019_64_1_a3,
author = {D. B. Rokhlin},
title = {$Q$-learning in a stochastic {Stackelberg} game between an uninformed leader and a naive follower},
journal = {Teori{\^a} vero{\^a}tnostej i ee primeneni{\^a}},
pages = {53--74},
publisher = {mathdoc},
volume = {64},
number = {1},
year = {2019},
language = {ru},
url = {http://geodesic.mathdoc.fr/item/TVP_2019_64_1_a3/}
}
TY  - JOUR
AU  - D. B. Rokhlin
TI  - $Q$-learning in a stochastic Stackelberg game between an uninformed leader and a naive follower
JO  - Teoriâ veroâtnostej i ee primeneniâ
PY  - 2019
SP  - 53
EP  - 74
VL  - 64
IS  - 1
PB  - mathdoc
UR  - http://geodesic.mathdoc.fr/item/TVP_2019_64_1_a3/
LA  - ru
ID  - TVP_2019_64_1_a3
ER  - 
D. B. Rokhlin. $Q$-learning in a stochastic Stackelberg game between an uninformed leader and a naive follower. Teoriâ veroâtnostej i ee primeneniâ, Tome 64 (2019) no. 1, pp. 53-74. http://geodesic.mathdoc.fr/item/TVP_2019_64_1_a3/