Reinforcement learning is based on the reward hypothesis
Rewards
A reward R_t is a scalar feedback signal indicating how well the agent is doing at step t
Definition (reward hypothesis)
All goals can be described by the maximization of expected cumulative reward
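The cumulative reward is usually taken to be the discounted return G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + … . A minimal sketch (the reward sequence and γ below are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards discounted by gamma^k: G = r0 + gamma*r1 + gamma^2*r2 + ..."""
    g = 0.0
    for r in reversed(rewards):  # accumulate backwards: g = r + gamma * g
        g = r + gamma * g
    return g

# A reward of 1 at each of 3 steps with gamma = 0.5:
# G = 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1, 1, 1], gamma=0.5))  # 1.75
```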
Sequential Decision Making
• Actions may have long term consequences
• Reward may be delayed
• It may be better to sacrifice immediate reward to gain more long-term reward
History and State
The history is the sequence of observations, actions, rewards
State is the information used to determine what happens next
Information State
An information state (a.k.a. Markov state) contains all useful information from the history
"The future is independent of the past given the present"
→ Given the current state, past states are no longer needed.
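A toy illustration, using a hypothetical coin-flip environment: the running parity of heads is an information state here, since it carries everything in the flip history needed to predict what matters next, so the full history can be discarded:

```python
import random

def head_count_parity_demo(steps=200, seed=0):
    """The running parity of heads acts as a compact information (Markov)
    state: it summarizes all useful information from the flip history."""
    rng = random.Random(seed)
    history, parity = [], 0
    for _ in range(steps):
        flip = rng.randint(0, 1)   # observation
        history.append(flip)
        parity ^= flip             # compact agent state, updated incrementally
        # the compact state agrees with recomputing from the full history
        assert parity == sum(history) % 2
    return parity

parity = head_count_parity_demo()
```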
Fully Observable Environments
Full observability : agent directly observes environment state
Agent state = environment state = information state
This setting is called a Markov Decision Process (MDP).
Partial observability : agent indirectly observes environment:
The agent can only partially observe the environment.
→ Agent state ≠ environment state
Formally this is a partially observable Markov decision process (POMDP)
Agent must construct its own state representation
→ The agent must define its own state.
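One simple (hypothetical) way to construct an agent state under partial observability is a sliding window of the last k observations:

```python
from collections import deque

class AgentState:
    """Under partial observability the agent builds its own state.
    One simple choice: a window of the last k observations."""
    def __init__(self, k=3):
        self.window = deque(maxlen=k)  # old observations fall off automatically

    def update(self, observation):
        self.window.append(observation)
        return tuple(self.window)      # hashable state representation

s = AgentState(k=2)
s.update("o1")
state = s.update("o2")  # state is ("o1", "o2")
```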
RL Agent
1. Policy : agent's behaviour function
2. Value function : how good is each state and/or action
3. Model : agent's representation of the environment
Policy
A policy is the agent's behaviour
It is a map from state to action, e.g. a deterministic policy a = π(s) or a stochastic policy π(a|s)
→ Given a state, it returns an action; usually denoted π
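The two standard forms can be sketched as follows; the states, actions, and probabilities below are illustrative:

```python
import random

# A deterministic policy: a direct map from state to action.
deterministic_pi = {"low_battery": "recharge", "full_battery": "explore"}

def stochastic_pi(state, rng=random):
    """A stochastic policy pi(a|s): samples an action from a
    state-dependent distribution (weights here are made up)."""
    if state == "low_battery":
        return rng.choices(["recharge", "explore"], weights=[0.9, 0.1])[0]
    return "explore"

action = deterministic_pi["low_battery"]  # "recharge"
```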
Value Function
Value function is a prediction of future reward
Used to evaluate the goodness/badness of states
→ Value is measured with respect to a policy; without one it is hard to define.
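One simple way to estimate a value function is first-visit Monte Carlo: average the discounted returns observed after visiting each state under a fixed policy. A sketch with made-up trajectories:

```python
def mc_value_estimate(episodes, gamma=0.9):
    """First-visit Monte Carlo estimate of v_pi(s): average the discounted
    return observed after the first visit to s in each episode. `episodes`
    is a list of [(state, reward), ...] trajectories from a fixed policy."""
    returns = {}
    for episode in episodes:
        g, seen = 0.0, {}
        # compute the return following each step, working backwards
        for state, reward in reversed(episode):
            g = reward + gamma * g
            seen[state] = g          # overwriting keeps the FIRST visit's return
        for state, g_first in seen.items():
            returns.setdefault(state, []).append(g_first)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Two episodes from a fixed policy, gamma = 0.5:
v = mc_value_estimate([[("A", 0), ("B", 1)], [("A", 2)]], gamma=0.5)
# episode 1: return from A = 0 + 0.5*1 = 0.5 ; episode 2: return from A = 2
# so v["A"] = (0.5 + 2) / 2 = 1.25
```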
Model
A model predicts what the environment will do next
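A minimal sketch: a tabular model that records, for each (state, action), the next state and reward it has observed, assuming a deterministic environment (all names are illustrative):

```python
class TabularModel:
    """A model predicts what the environment does next: here, a table
    mapping (state, action) -> (next_state, reward), learned from experience.
    Assumes a deterministic environment for simplicity."""
    def __init__(self):
        self.table = {}

    def learn(self, state, action, next_state, reward):
        self.table[(state, action)] = (next_state, reward)

    def predict(self, state, action):
        return self.table[(state, action)]

m = TabularModel()
m.learn("s0", "right", "s1", 1.0)
prediction = m.predict("s0", "right")  # ("s1", 1.0)
```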
Categorizing RL agent
• Model Free
  ◦ Policy and/or Value Function
  ◦ No Model
• Model Based
  ◦ Policy and/or Value Function
  ◦ Model
Learning and Planning
Reinforcement Learning:
• The environment is initially unknown
• The agent interacts with the environment
• The agent improves its policy
→ Since nothing is known at first, the agent learns through interaction.
Planning:
• A model of the environment is known
• The agent performs computations with its model (without any external interaction)
• The agent improves its policy
• a.k.a. deliberation, reasoning, introspection, pondering, thought, search
→ Because the model is known, the agent can query it directly.
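The idea can be sketched with a tiny value-iteration loop that improves a policy purely by querying a known, deterministic model, with no environment interaction (the two-state chain below is made up):

```python
def plan_greedy(model, states, actions, gamma=0.9, sweeps=50):
    """Planning: repeatedly query model(s, a) -> (next_state, reward) to
    compute state values, then act greedily. A small value-iteration sketch
    for a known deterministic model."""
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            v[s] = max(model(s, a)[1] + gamma * v[model(s, a)[0]] for a in actions)
    # extract the greedy policy from the computed values
    return {s: max(actions, key=lambda a: model(s, a)[1] + gamma * v[model(s, a)[0]])
            for s in states}

# A 2-state chain: from s0, "go" reaches the rewarding state s1.
def model(s, a):
    if s == "s0" and a == "go":
        return ("s1", 0.0)
    if s == "s1":
        return ("s1", 1.0)
    return (s, 0.0)

pi = plan_greedy(model, ["s0", "s1"], ["go", "stay"], gamma=0.9)
# pi["s0"] == "go": planning found the path to reward without ever acting
```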
Exploration and Exploitation
Reinforcement learning is like trial-and-error learning
The agent should discover a good policy
Exploration : finds more information about the environment
Exploitation : exploits known information to maximise reward
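A standard way to trade the two off is ε-greedy action selection; the action values below are illustrative:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Balance exploration and exploitation: with probability epsilon pick a
    random action (explore), otherwise the current best action (exploit)."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))      # explore: any action
    return max(q_values, key=q_values.get)     # exploit: highest estimated value

q = {"left": 0.2, "right": 0.8}
best = epsilon_greedy(q, epsilon=0.0)  # epsilon = 0 always exploits: "right"
```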
Prediction and Control
Prediction: evaluate the future
Control: optimise the future
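A toy sketch of the distinction, for a one-state problem where a fixed policy earning reward r per step has value r/(1−γ): prediction evaluates a given policy, control picks the best one (the reward numbers are illustrative):

```python
def predict_value(r, gamma=0.9):
    """Prediction: evaluate a FIXED policy. For a single recurring state
    earning reward r each step, v = r + gamma*v, so v = r / (1 - gamma)."""
    return r / (1.0 - gamma)

def control(action_rewards, gamma=0.9):
    """Control: optimise by choosing the action whose evaluated long-run
    value is highest."""
    return max(action_rewards, key=lambda a: predict_value(action_rewards[a], gamma))

chosen = control({"slow": 1.0, "fast": 2.0})  # "fast"
```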