https://www.youtube.com/watch?v=6KSf-j4LL-c&t=92s&list=PLlMkM4tgfjnKsCWav-Z2F-MMFRx-2gMGG&index=8
================================================================================
Deterministic environment
- If the agent wants to go right, the agent goes right
- The agent obtains a fixed pattern of rewards
Non-deterministic (stochastic) environment
- If the agent wants to go right, it may end up going in a different direction
- The agent obtains varying patterns of rewards
================================================================================
The guide (Q values) from the Q-mentor (the Q function) doesn't work in a non-deterministic environment
================================================================================
Why doesn't it work?
The Q-mentor gives guidance based on its own real experience. But since the environment is non-deterministic, the same action from the same state can lead to different outcomes, so the Q-mentor's experience can't simply be copied
================================================================================
How to solve this problem?
- The agent takes only "a bit" of the guide from the Q-mentor
- The agent relies "more" on the Q values it previously had
================================================================================
$$$Q(s,a) \leftarrow (1-\alpha) Q(s,a) + \alpha [ r + \gamma \max_{a^{'}} Q(s^{'},a^{'}) ]$$$
$$$\alpha$$$: learning rate, say $$$\alpha=0.1$$$, i.e., 10%
$$$(1-\alpha) Q(s,a)$$$: the agent's own "stubborn" estimate, weighted 90%
$$$\alpha [ r + \gamma \max_{a^{'}} Q(s^{'},a^{'}) ]$$$: the Q-mentor's guide, weighted 10%
================================================================================
You can rewrite the equation by rearranging the terms around the hyperparameter $$$\alpha$$$ (the two forms are algebraically identical)
$$$Q(s,a) \leftarrow Q(s,a) + \alpha [ r + \gamma \max_{a^{'}} Q(s^{'},a^{'}) - Q(s,a) ]$$$
================================================================================
$$$\hat{Q}(s,a) \leftarrow \hat{Q}(s,a) + \alpha [ r + \gamma \max_{a^{'}} \hat{Q}(s^{'},a^{'}) - \hat{Q}(s,a) ]$$$
$$$\hat{Q}$$$: approximated Q
$$$Q$$$: actual Q
If you iterate the algorithm enough times, the convergence of $$$\hat{Q}$$$ to $$$Q$$$ is mathematically proven
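================================================================================
A tiny check (my own sketch, not from the lecture) that the weighted-average form and the rearranged (TD-error) form of the update compute the same value. The numbers used for the current estimate and the target are made up for illustration.

```python
def update_weighted(q_sa, target, alpha):
    # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * [r + gamma * max_a' Q(s',a')]
    return (1 - alpha) * q_sa + alpha * target

def update_td(q_sa, target, alpha):
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    return q_sa + alpha * (target - q_sa)

q_sa = 0.5                 # agent's current ("stubborn") estimate, hypothetical
target = 1.0 + 0.9 * 0.8   # hypothetical r + gamma * max_a' Q(s',a')
alpha = 0.1

print(update_weighted(q_sa, target, alpha))  # ~0.622
print(update_td(q_sa, target, alpha))        # ~0.622, same up to float rounding
```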
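================================================================================
A minimal sketch (my own, not the lecture's code) of tabular Q-learning with a learning rate on a non-deterministic environment. It assumes the gymnasium package is installed and uses its FrozenLake-v1 environment with is_slippery=True, where the agent often slides in a direction it did not choose; the hyperparameters are example values.

```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=True)  # stochastic transitions
n_states = env.observation_space.n
n_actions = env.action_space.n

Q = np.zeros((n_states, n_actions))  # Q-hat: the agent's approximated Q table
alpha = 0.1       # learning rate: accept only 10% of the "mentor's" guide
gamma = 0.99      # discount factor
epsilon = 0.1     # exploration rate (epsilon-greedy)
n_episodes = 5000 # example training budget

for _ in range(n_episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        # (rows of terminal states are never updated, so they stay 0 and
        #  contribute nothing to the bootstrap target)
        target = reward + gamma * np.max(Q[next_state])
        Q[state, action] += alpha * (target - Q[state, action])

        state = next_state

print(Q)  # with enough episodes, Q-hat should approach the true Q values
```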