Reinforcement learning is based on the reward hypothesis
Rewards
A reward R_t is a scalar feedback signal indicating how well the agent is doing at step t
Definition (reward hypothesis)
All goals can be described by the maximization of expected cumulative reward
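The cumulative reward is usually taken to be the discounted return G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + … . A minimal sketch (the reward sequence and γ below are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards discounted by gamma^k: G = r0 + gamma*r1 + gamma^2*r2 + ..."""
    g = 0.0
    for r in reversed(rewards):  # accumulate backwards: g = r + gamma * g
        g = r + gamma * g
    return g

# A reward of 1 at each of 3 steps with gamma = 0.5:
# G = 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1, 1, 1], gamma=0.5))  # 1.75
```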
Sequential Decision Making
• Actions may have long term consequences
• Reward may be delayed
• It may be better to sacrifice immediate reward to gain more long-term reward
History and State
The history is the sequence of observations, actions, rewards
State is the information used to determine what happens next
Information State
An information state (a.k.a. Markov state) contains all useful information from the history
"The future is independent of the past given the present"
→ Given the current state, past states are no longer needed.
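A toy illustration, using a hypothetical coin-flip environment: the running parity of heads is an information state here, since it carries everything in the flip history needed to predict what matters next, so the full history can be discarded:

```python
import random

def head_count_parity_demo(steps=200, seed=0):
    """The running parity of heads acts as a compact information (Markov)
    state: it summarizes all useful information from the flip history."""
    rng = random.Random(seed)
    history, parity = [], 0
    for _ in range(steps):
        flip = rng.randint(0, 1)   # observation
        history.append(flip)
        parity ^= flip             # compact agent state, updated incrementally
        # the compact state agrees with recomputing from the full history
        assert parity == sum(history) % 2
    return parity

parity = head_count_parity_demo()
```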
Fully Observable Environments
Full observability : agent directly observes environment state
Agent state = environment state = information state
This setting is called a Markov Decision Process (MDP).
Partial observability : agent indirectly observes environment:
The agent can only partially observe the environment.
→ Agent state ≠ environment state
Formally this is a partially observable Markov decision process (POMDP)
Agent must construct its own state representation
→ The agent must define its own state.
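One simple (hypothetical) way to construct an agent state under partial observability is a sliding window of the last k observations:

```python
from collections import deque

class AgentState:
    """Under partial observability the agent builds its own state.
    One simple choice: a window of the last k observations."""
    def __init__(self, k=3):
        self.window = deque(maxlen=k)  # old observations fall off automatically

    def update(self, observation):
        self.window.append(observation)
        return tuple(self.window)      # hashable state representation

s = AgentState(k=2)
s.update("o1")
state = s.update("o2")  # state is ("o1", "o2")
```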
RL Agent
1. Policy : agent's behaviour function
2. Value function : how good is each state and/or action
3. Model : agent's representation of the environment
Policy
A policy is the agent's behaviour
It is a map from state to action, e.g. a deterministic policy a = π(s) or a stochastic policy π(a|s)
→ Given a state, it returns an action; usually denoted π
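The two standard forms can be sketched as follows; the states, actions, and probabilities below are illustrative:

```python
import random

# A deterministic policy: a direct map from state to action.
deterministic_pi = {"low_battery": "recharge", "full_battery": "explore"}

def stochastic_pi(state, rng=random):
    """A stochastic policy pi(a|s): samples an action from a
    state-dependent distribution (weights here are made up)."""
    if state == "low_battery":
        return rng.choices(["recharge", "explore"], weights=[0.9, 0.1])[0]
    return "explore"

action = deterministic_pi["low_battery"]  # "recharge"
```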
Value Function
Value function is a prediction of future reward
Used to evaluate the goodness/badness of states
→ Value is measured with respect to a policy; without one it is hard to define.
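One simple way to estimate a value function is first-visit Monte Carlo: average the discounted returns observed after visiting each state under a fixed policy. A sketch with made-up trajectories:

```python
def mc_value_estimate(episodes, gamma=0.9):
    """First-visit Monte Carlo estimate of v_pi(s): average the discounted
    return observed after the first visit to s in each episode. `episodes`
    is a list of [(state, reward), ...] trajectories from a fixed policy."""
    returns = {}
    for episode in episodes:
        g, seen = 0.0, {}
        # compute the return following each step, working backwards
        for state, reward in reversed(episode):
            g = reward + gamma * g
            seen[state] = g          # overwriting keeps the FIRST visit's return
        for state, g_first in seen.items():
            returns.setdefault(state, []).append(g_first)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Two episodes from a fixed policy, gamma = 0.5:
v = mc_value_estimate([[("A", 0), ("B", 1)], [("A", 2)]], gamma=0.5)
# episode 1: return from A = 0 + 0.5*1 = 0.5 ; episode 2: return from A = 2
# so v["A"] = (0.5 + 2) / 2 = 1.25
```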
Model
A model predicts what the environment will do next
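A minimal sketch: a tabular model that records, for each (state, action), the next state and reward it has observed, assuming a deterministic environment (all names are illustrative):

```python
class TabularModel:
    """A model predicts what the environment does next: here, a table
    mapping (state, action) -> (next_state, reward), learned from experience.
    Assumes a deterministic environment for simplicity."""
    def __init__(self):
        self.table = {}

    def learn(self, state, action, next_state, reward):
        self.table[(state, action)] = (next_state, reward)

    def predict(self, state, action):
        return self.table[(state, action)]

m = TabularModel()
m.learn("s0", "right", "s1", 1.0)
prediction = m.predict("s0", "right")  # ("s1", 1.0)
```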
Categorizing RL agent
• Model Free
  ◦ Policy and/or Value Function
  ◦ No Model
• Model Based
  ◦ Policy and/or Value Function
  ◦ Model
Learning and Planning
Reinforcement Learning:
• The environment is initially unknown
• The agent interacts with the environment
• The agent improves its policy
→ Since nothing is known at first, the agent learns through interaction.
Planning:
• A model of the environment is known
• The agent performs computations with its model (without any external interaction)
• The agent improves its policy
• a.k.a. deliberation, reasoning, introspection, pondering, thought, search
→ Because the model is known, the agent can query it directly.
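The idea can be sketched with a tiny value-iteration loop that improves a policy purely by querying a known, deterministic model, with no environment interaction (the two-state chain below is made up):

```python
def plan_greedy(model, states, actions, gamma=0.9, sweeps=50):
    """Planning: repeatedly query model(s, a) -> (next_state, reward) to
    compute state values, then act greedily. A small value-iteration sketch
    for a known deterministic model."""
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            v[s] = max(model(s, a)[1] + gamma * v[model(s, a)[0]] for a in actions)
    # extract the greedy policy from the computed values
    return {s: max(actions, key=lambda a: model(s, a)[1] + gamma * v[model(s, a)[0]])
            for s in states}

# A 2-state chain: from s0, "go" reaches the rewarding state s1.
def model(s, a):
    if s == "s0" and a == "go":
        return ("s1", 0.0)
    if s == "s1":
        return ("s1", 1.0)
    return (s, 0.0)

pi = plan_greedy(model, ["s0", "s1"], ["go", "stay"], gamma=0.9)
# pi["s0"] == "go": planning found the path to reward without ever acting
```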
Exploration and Exploitation
Reinforcement learning is like trial-and-error learning
The agent should discover a good policy
Exploration : finds more information about the environment
Exploitation : exploits known information to maximise reward
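A standard way to trade the two off is ε-greedy action selection; the action values below are illustrative:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Balance exploration and exploitation: with probability epsilon pick a
    random action (explore), otherwise the current best action (exploit)."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))      # explore: any action
    return max(q_values, key=q_values.get)     # exploit: highest estimated value

q = {"left": 0.2, "right": 0.8}
best = epsilon_greedy(q, epsilon=0.0)  # epsilon = 0 always exploits: "right"
```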
Prediction and Control
Prediction: evaluate the future
Control: optimise the future
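A toy sketch of the distinction, for a one-state problem where a fixed policy earning reward r per step has value r/(1−γ): prediction evaluates a given policy, control picks the best one (the reward numbers are illustrative):

```python
def predict_value(r, gamma=0.9):
    """Prediction: evaluate a FIXED policy. For a single recurring state
    earning reward r each step, v = r + gamma*v, so v = r / (1 - gamma)."""
    return r / (1.0 - gamma)

def control(action_rewards, gamma=0.9):
    """Control: optimise by choosing the action whose evaluated long-run
    value is highest."""
    return max(action_rewards, key=lambda a: predict_value(action_rewards[a], gamma))

chosen = control({"slow": 1.0, "fast": 2.0})  # "fast"
```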