https://www.youtube.com/watch?v=wYgyiCEkwC8
================================================================================
Reinforcement learning
* There is no supervisor who teaches how to maximize reward
* There is only a reward signal
* Feedback can be delayed
  For example, the agent does an action and only gets the reward 10 minutes later
* Time matters (the data is sequential, so its order matters; it is not i.i.d.)
* The agent's actions affect the subsequent data the agent receives
================================================================================
Reward
* $$$R_t$$$ is a scalar feedback signal
* Reward $$$R_t$$$ indicates how well the agent is doing at time t
* The agent's job is to maximize "cumulative reward"
* Reinforcement learning is based on the "reward hypothesis"
* "Reward hypothesis": all goals can be described by the maximization of expected cumulative reward
* Do you agree with this hypothesis?
================================================================================
Reinforcement learning is "sequential decision making"
RL is not about a single good action but about a sequence of good actions.
================================================================================
The agent does an "action"
The environment gives a "reward" and an "observation (state)" to the agent
The agent does the next "action" based on the reward and observation (state)
... and the loop repeats
================================================================================
History: the record of everything the agent has experienced
$$$H_t = O_1,R_1,A_1,\cdots,A_{t-1},O_t,R_t$$$
$$$H_t$$$: all observations, rewards, and actions up to time t
================================================================================
State: the information used by the agent and the environment to decide what happens next
* $$$S_t=f(H_t)$$$
* Input: history $$$H_t$$$
* Output: state $$$S_t$$$
================================================================================
Environment state $$$S_t^e$$$
* $$$S_t^e$$$ is all the information the environment uses to produce the next observation and reward for the agent
================================================================================
Agent state $$$S_t^a$$$
* All the information the agent uses when choosing its next action
================================================================================
Markov state (aka information state)
* Given the current state, the agent's decision does not need the earlier history; the most recent state is sufficient
* Example: when controlling a helicopter, its current physical state (position, velocity, etc.) is a Markov state
* $$$\mathbb{P}[S_{t+1}|S_{t}] = \mathbb{P}[S_{t+1}|S_{1},\cdots,S_{t}]$$$
================================================================================
Full observability: an environment where the agent directly sees the environment state
$$$O_t=S_t^a=S_t^e$$$
This is also called a Markov decision process (MDP)
================================================================================
Partial observability: an environment where the agent cannot directly see the environment state
Agent state $$$\ne$$$ environment state
This is also called a partially observable Markov decision process (POMDP)
* The agent must construct its own state representation $$$S_t^a$$$
(1) The agent can use the full history as its state: $$$S_t^a = H_t$$$
(2) The agent can use beliefs over the environment state: $$$S_t^a = (\mathbb{P}[S_t^e=s^1],\cdots,\mathbb{P}[S_t^e=s^n])$$$
(3) The agent can use an RNN: $$$S_t^a=\sigma(S_{t-1}^aW_s+O_tW_o)$$$ (see the sketch right below)
================================================================================
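A minimal numpy sketch of option (3), the recurrent agent-state update $$$S_t^a=\sigma(S_{t-1}^aW_s+O_tW_o)$$$. The dimensions, the weight matrices W_s and W_o, and the stubbed observations are hypothetical placeholders for illustration; they are not from the lecture.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

state_dim, obs_dim = 4, 3                         # hypothetical sizes
rng = np.random.default_rng(0)
W_s = rng.normal(size=(state_dim, state_dim))     # recurrent weights (fixed here)
W_o = rng.normal(size=(obs_dim, state_dim))       # observation weights (fixed here)

s_agent = np.zeros(state_dim)                     # S_0^a: initial agent state
for t in range(5):                                # a few steps of a partially observed episode
    o_t = rng.normal(size=obs_dim)                # O_t: observation from the environment (stubbed)
    # S_t^a = sigma(S_{t-1}^a W_s + O_t W_o): compress the whole history into a fixed-size state
    s_agent = sigmoid(s_agent @ W_s + o_t @ W_o)

In practice W_s and W_o would be learned (e.g. by backpropagating a value or policy loss through time); here they are fixed random matrices just to show the state update itself.
================================================================================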
How is an agent composed?
(1) Policy: the agent's behaviour function
(2) Value function: represents how good each state and/or action is
(3) Model: the agent's representation of the environment
An agent can have one, two, or all three of these.
================================================================================
Policy
* represents the agent's behaviour
* a policy is a function mapping state to action: action = policy(state)
* deterministic policy: $$$a=\pi(s)$$$
  Given state s, one action is deterministically chosen
* stochastic policy: $$$\pi(a|s)=\mathbb{P}[A_t=a|S_t=s]$$$
  Given state s, several actions are possible; the agent samples one according to the probabilities
  action_probs = stochastic_policy(state)                 # [prob_of_action1, prob_of_action2, ...]
  chosen_action = sample(actions, weights=action_probs)   # sample rather than argmax, so every action can occur
================================================================================
Value function
* represents how good the agent's states and actions are
* is a prediction of future reward
* is used to evaluate the goodness/badness of states
* is used to select between actions
* $$$v_{\pi}(s)=\mathbb{E}_{\pi}[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots | S_t=s]$$$
  v: value function
  s: the state given as input
  $$$\pi$$$: starting from state s, the agent follows policy $$$\pi$$$
  the expectation of all (discounted) rewards until the end of the episode
  $$$\gamma$$$: discount factor
  Without a policy the value function is not well defined: the expected return depends on how the agent plays until the end of the episode, and the policy is exactly the rule that tells it how to play.
  (a Monte Carlo sketch of this return appears after the agent-categorization sections below)
================================================================================
Model: the agent's prediction of how the environment behaves
The transition model $$$\mathbb{P}$$$ predicts the next state:
$$$\mathbb{P}_{ss'}^a=\mathbb{P}[S_{t+1}=s'|S_t=s,A_t=a]$$$
The reward model $$$\mathbb{R}$$$ predicts the next (immediate) reward:
$$$\mathbb{R}_s^a=\mathbb{E}[R_{t+1}|S_t=s,A_t=a]$$$
Model-free agent: the agent does not use a model
Model-based agent: the agent uses a model
================================================================================
(Figure in the lecture: maze/grid example) Each cell shows the optimal policy's action for that state: action = policy_function(state)
================================================================================
(Figure in the lecture: same grid) Each cell shows the value of that state when the agent follows the optimal policy
================================================================================
Categorizing agents
(1) Value-based agent
    has a value function (no explicit policy; the policy is implicit)
    it knows which states are good, so it can move toward the good states
(2) Policy-based agent
    has a policy (no value function)
(3) Actor-critic agent
    has both a policy and a value function
================================================================================
(4) Model-free agent
    does not build a model; it acts using a policy and/or a value function
(5) Model-based agent
    builds a model and acts based on it
================================================================================
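The value function above can be made concrete with a small Monte Carlo sketch: roll a policy out many times and average the discounted returns. The chain environment, the random policy, and every name below are hypothetical illustrations; they are not from the lecture.

import random

# Toy 5-state chain: states 0..4, actions -1 (left) / +1 (right),
# reward 1 only when state 4 is reached, which also ends the episode.
def step(state, action):
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    done = next_state == 4
    return next_state, reward, done

def random_policy(state):
    # stochastic policy pi(a|s): both actions equally likely in every state
    return random.choice([-1, +1])

def estimate_value(start_state, policy, gamma=0.9, episodes=2000):
    # Monte Carlo estimate of v_pi(s) = E[R_{t+1} + gamma*R_{t+2} + ... | S_t = s]
    returns = []
    for _ in range(episodes):
        state, discount, g, done = start_state, 1.0, 0.0, False
        while not done:
            action = policy(state)
            state, reward, done = step(state, action)
            g += discount * reward
            discount *= gamma
        returns.append(g)
    return sum(returns) / len(returns)

print(estimate_value(start_state=2, policy=random_policy))   # roughly v_pi(2) under the random policy

In the terms of the categorization above this is a model-free computation: it never uses the transition probabilities directly, only sampled experience. It is also an instance of the "prediction" problem mentioned at the end of these notes.
================================================================================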
Categorizing problems
Two fundamental problems in sequential decision making:
(1) Learning (reinforcement learning)
    The environment is initially unknown
    The agent interacts with the environment
    The agent improves its policy
(2) Planning (aka search)
    A model of the environment is known (the reward and transition functions are known)
    * The agent performs computations with its model (simulation, with no external interaction or real trials), because it already knows the environment --> e.g. the MCTS used in AlphaGo
    * The agent improves its policy
    aka deliberation, reasoning, introspection, pondering, thought, search
================================================================================
Exploration and exploitation: a trade-off
* Exploration gathers more information about the environment
* Exploitation uses the information already known to maximize reward
* (one standard way to balance the two, epsilon-greedy, is sketched at the end of these notes)
================================================================================
Prediction problem
  Given a policy, predict the future
  Goal: estimate the value function well
Control problem
  Optimize the future: find the best policy
================================================================================
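The notes above only name the exploration/exploitation trade-off, so here is a minimal epsilon-greedy sketch, the simplest standard way to balance the two. The action-value estimates q and the value of epsilon are hypothetical placeholders.

import random

def epsilon_greedy(q_values, epsilon=0.1):
    # q_values: estimated action values for the current state
    # With probability epsilon, explore: pick a random action to gather information.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # Otherwise exploit: pick the action currently believed to be best.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical action-value estimates for one state; action 2 looks best so far.
q = [0.1, 0.5, 0.8, 0.2]
counts = [0, 0, 0, 0]
for _ in range(1000):
    counts[epsilon_greedy(q, epsilon=0.1)] += 1
print(counts)   # mostly action 2, with occasional exploratory picks of the other actions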