
Chapter 1

Reinforcement learning is based on the reward hypothesis


Rewards

A reward R_t is a scalar feedback signal indicating how well the agent is doing at step t

Definition (reward hypothesis)

All goals can be described by the maximization of expected cumulative reward
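The "expected cumulative reward" is usually formalised as the discounted return (the standard formulation, stated here for completeness):

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad \gamma \in [0, 1]
```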

Sequential Decision Making

• Actions may have long-term consequences
• Reward may be delayed
• It may be better to sacrifice immediate reward to gain more long-term reward

History and State

The history is the sequence of observations, actions, and rewards: H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
State is the information used to determine what happens next; formally, state is a function of the history: S_t = f(H_t)

Information State

An information state (a.k.a. Markov state) contains all useful information from the history
"The future is independent of the past given the present"
→ If the current state is known, the past states are no longer needed.
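The Markov property quoted above has a standard formal statement:

```latex
\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]
```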

Fully Observable Environments

Full observability : agent directly observes environment state
Agent state = environment state = information state
์ด ์ƒํƒœ๋ฅผ Markov Decision Process (MDP)๋ผ๊ณ  ํ•œ๋‹ค.
Partial observability : agent indirectly observes environment
→ the agent can only partially observe the environment
→ Agent state ≠ Environment state
Formally this is a partially observable Markov decision process (POMDP)
Agent must construct its own state representation
โ†’ ์—์ด์ „ํŠธ๋Š” ์ž์‹ ์ž์‹ ์˜ ์ƒํƒœ๊ฐ€ ์ •์˜๋˜์–ด์•ผ ํ•œ๋‹ค.

RL Agent

An RL agent may include one or more of these components:
1. Policy : agent's behaviour function
2. Value function : how good is each state and/or action
3. Model : agent's representation of the environment

Policy

A policy is the agentโ€™s behaviour
It is a map from state to action, e.g. a deterministic policy a = π(s), or a stochastic policy π(a|s) = P[A_t = a | S_t = s]
→ given a state, it returns an action; usually denoted π
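A minimal sketch of both policy types, using a hypothetical two-state example (state and action names are made up for illustration):

```python
import random

# Deterministic policy: a = pi(s)
deterministic_pi = {"s0": "right", "s1": "up"}

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s]
stochastic_pi = {"s0": {"right": 0.8, "left": 0.2}}

def act(state):
    """Sample an action from the stochastic policy for this state."""
    probs = stochastic_pi[state]
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights)[0]

deterministic_pi["s0"]  # -> "right"
act("s0")               # -> "right" with prob 0.8, "left" with prob 0.2
```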

Value Function

A value function is a prediction of future reward, used to evaluate the goodness/badness of states
→ value is defined with respect to a policy; without a policy it is hard to measure value.
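The value of a state is the expected discounted return under a policy; a minimal sketch is computing the return of one sampled reward trajectory (a Monte Carlo sample of the value):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = r_1 + gamma*r_2 + gamma^2*r_3 + ... for one
    trajectory of rewards, working backwards for numerical simplicity."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

discounted_return([1, 0, 2], gamma=0.5)  # 1 + 0.5*0 + 0.25*2 = 1.5
```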

Model

A model predicts what the environment will do next: P predicts the next state, R predicts the next (immediate) reward
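As a minimal sketch (hypothetical one-transition example), a tabular model is just these two parts stored per state–action pair:

```python
# P[(s, a)] -> distribution over next states s'
P = {("s0", "go"): {"s1": 1.0}}
# R[(s, a)] -> expected immediate reward
R = {("s0", "go"): 1.0}
```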

Categorizing RL Agents

• Model Free
◦ Policy and/or Value Function
◦ No Model
• Model Based
◦ Policy and/or Value Function
◦ Model

Learning and Planning

Reinforcement Learning:

• The environment is initially unknown
• The agent interacts with the environment
• The agent improves its policy
→ Since the agent starts with no information, it learns through interaction.

Planning:

• A model of the environment is known
• The agent performs computations with its model (without any external interaction)
• The agent improves its policy
• a.k.a. deliberation, reasoning, introspection, pondering, thought, search
→ Since the model is known, the agent can query it instead of interacting with the environment.
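The planning setting can be sketched concretely with value iteration on a tiny hypothetical 2-state MDP: the model (P, R) is given, and the agent improves its policy by computation alone, never touching a real environment:

```python
# Hypothetical 2-state MDP; value iteration as one concrete planner.
P = {  # P[(s, a)] -> {s': prob}
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "go"): 0.0}
states, actions, gamma = ["s0", "s1"], ["stay", "go"], 0.9

def q(s, a, V):
    """One-step lookahead value of (s, a) under the model."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

V = {s: 0.0 for s in states}
for _ in range(200):  # sweep Bellman optimality backups until converged
    V = {s: max(q(s, a, V) for a in actions) for s in states}

# Greedy policy with respect to the computed values.
policy = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
# policy -> {"s0": "go", "s1": "stay"}: head to s1 and collect reward 2 forever
```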

Exploration and Exploitation

Reinforcement learning is like trial-and-error learning
The agent should discover a good policy
Exploration : finds more information about the environment
Exploitation : exploits known information to maximise reward
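The classic way to balance the two is epsilon-greedy action selection, a minimal sketch (the action-value dict is hypothetical):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)     # exploit

q = {"left": 0.2, "right": 1.0}
epsilon_greedy(q, epsilon=0.0)  # never explores -> "right"
```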

Prediction and Control

Prediction: evaluate the future (given a policy)
Control: optimise the future (find the best policy)