Rafał Kamiński

Reinforcement Learning Part 1

Introduction

Reinforcement Learning can be thought of as an application of the psychological concept of conditioning. The central position here is taken by an agent who takes actions in a specific environment. Every action can result in a reward or a punishment (a negative reward). The core of the learning process is the change of behavior based on these interactions, with the ultimate goal of maximizing the sum of rewards in the long run. Crucial here is the abstract concept of a state, that is, all the information about the environment that the agent is aware of at a given moment.

state → action → reward → behavioral change
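
To make this loop concrete, here is a minimal sketch in Python of how the agent-environment interaction could be organized. The `Environment` and `Agent` classes and their methods are hypothetical placeholders chosen for illustration (loosely based on the dog-training example below), not an interface from any particular library.

```python
# A minimal sketch of the agent-environment loop.
# The classes and method names here are illustrative assumptions,
# not an API of any particular reinforcement learning library.

class Environment:
    def reset(self):
        """Start a new episode and return the initial state."""
        return "standing"

    def step(self, action):
        """Apply the action and return (next_state, reward, done)."""
        if action == "sit":
            return "sitting", 1.0, True   # a treat for sitting on command
        return "standing", 0.0, False     # nothing happens otherwise


class Agent:
    def act(self, state):
        """Choose an action for the given state (here: a fixed toy policy)."""
        return "sit" if state == "standing" else "stay"

    def learn(self, state, action, reward, next_state):
        """Adjust behavior based on the observed reward (omitted in this sketch)."""
        pass


env, agent = Environment(), Agent()
state = env.reset()
done = False
while not done:
    action = agent.act(state)                       # state -> action
    next_state, reward, done = env.step(action)     # action -> reward
    agent.learn(state, action, reward, next_state)  # reward -> behavioral change
    state = next_state
```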


The law of effect stipulates that "responses that produce a satisfying effect in a particular situation become more likely to occur again in that situation, and responses that produce a discomforting effect become less likely to occur again in that situation". Responses can be understood as actions, situations as states, and effects as rewards. Let's look at an example from animal training. A dog's owner, who plays the role of part of the environment, tries to teach his dog (an agent) to sit every time the command "Sit!" is given. In the simplest form, the owner can give the command once in a while when his dog is standing (even this can be treated as part of the environment). The dog can probably take an infinite number of actions, but it's quite likely that by random chance, after a large number of tries, he will sit just after the command "Sit!" is given. If nothing happens, if there is no response from the environment, the dog's behavior won't change. But if the owner gives him a treat as a reward, something can switch in the dog's brain to make the action of sitting in the state of (standing + being given the "Sit!" command by the owner) more probable. Honestly, we cannot be sure what happens; we may have some intuitive idea, but a dog's brain is a black box. And that is exactly what an unknown algorithm is!

The behavior of an agent in given circumstances is dictated by its policy, a mapping from states to actions. Learning is just the evolution of a policy. The action-taking process doesn't have to be deterministic; a policy could assign only some probability to choosing a given action in a given state. After all, in many situations we cannot be sure what rewards await us. We can do the same thing every time, and that doesn't mean we will always get the same result. We have information about the environment, but in the real world it is neither accurate nor exhaustive all the time; otherwise dodging bullets would be as easy as shown in The Matrix. Unfortunately it is not... Similarly, performing the same action all the time can put the law of diminishing returns to work. If we make the same chess moves at the beginning of every game, our opponents (parts of our environment) might take notice and prepare for them better. So it seems that taking a stochastic rather than a deterministic approach could bring us some benefits. Every reward gives us more information about the value of taking a given action in a given state. We base our decision about which action to take on the perceived values of all available actions. More information makes us more perceptive and makes it more viable to go the deterministic way, i.e. to choose the action we perceive as the most valuable. Taking such an action is described as exploitation. In contrast, taking actions perceived as less valuable is known as exploration. The exploitation versus exploration conflict is one of the most crucial in reinforcement learning, and most of the time resolving it is not straightforward...
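
One simple way to balance the two is an ε-greedy rule: most of the time exploit by picking the action with the highest estimated value, but with a small probability ε explore by picking an action at random. Below is a minimal sketch; the action-value dictionary `q` and its made-up numbers are just an illustration.

```python
import random

def epsilon_greedy(q, actions, epsilon=0.1):
    """With probability epsilon pick a random action (exploration),
    otherwise pick the action with the highest estimated value (exploitation)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q[a])

# Example: estimated values of three chess openings (made-up numbers).
q = {"e4": 0.52, "d4": 0.50, "c4": 0.48}
action = epsilon_greedy(q, list(q))  # usually "e4", occasionally a random one
```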

We learn things over time, so it is logical that we must acquire some new abstractions... Most of the time we declare discrete time steps that make up episodes. An episode can be one game of chess, and in it every time step would correspond to a situation in which we have to make a move. Episodes can be of different lengths; not every game is the same. At every time step we find ourselves in some state, and when we take an action the environment moves us to a different state. The last state in an episode is called the terminal state. Depending on the problem, it can be more or less abstract, but the point is that it is the last one: in it we can take no more actions, it is the end... The goal in many problems is to maximize the cumulative reward, known as the return, at the terminal state. If we are hedonistic, we can decide to choose the action giving the highest immediate reward, like eating pizza non-stop every day, but we know better; that wouldn't end well for us. Delayed gratification is the resistance to the temptation of immediate reward. Running can be quite unpleasant at the beginning, but in the long run <hehe> it gives us more opportunities in a better future! Our agents should work the same way!
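
In the notation introduced in the list below, the simplest (undiscounted) form of the return after time step $t$ is just the sum of the rewards collected up to the final time step $T$ of the episode:

$G_{t} = R_{t+1} + R_{t+2} + \dots + R_{T}$

Maximizing $G_{t}$ rather than the immediate reward $R_{t+1}$ is the formal counterpart of delayed gratification.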

Let us learn the symbols commonly used to denote described concepts!

Capital letters are used for random variables, whereas lower case letters are used for the values of random variables and for scalar functions. Quantities that are required to be real-valued vectors are written in bold and in lower case (even if random variables). Matrices are bold capitals.

$a$ - action
$s$ - state
$r$ - reward
$\pi$ - policy
$\pi(s)$ - action taken in state $s$ under deterministic policy $\pi$
$\pi(a|s)$ - probability of taking action $a$ in state $s$ under stochastic policy $\pi$
$\mathcal{A}(s)$ - set of all actions available in state $s$
$\mathcal{R}$ - set of all possible rewards, a finite subset of $\mathbb{R}$ (real numbers)
$\mathcal{S}$ - set of all non-terminal states
$\mathcal{S}^{+}$ - set of all states, including the terminal state
$|\mathcal{S}|$ - number of elements in set $\mathcal{S}$
$v_{\pi}(s)$ - value of state $s$ under policy $\pi$ (expected return)
$v_{*}(s)$ - value of state $s$ under the optimal policy
$q_{\pi}(s,a)$ - value of taking action $a$ in state $s$ under policy $\pi$
$q_{*}(s,a)$ - value of taking action $a$ in state $s$ under the optimal policy
$\varepsilon$ - probability of taking random action in $\varepsilon$-greedy policy
$V$, $V_{t}$ - array estimates of state-value function $v_{\pi}$ or $v_{*}$
$Q$, $Q_{t}$ - array estimates of action-value function $q_{\pi}$ or $q_{*}$
$p(s',r|s,a)$ - probability of transition to state $s'$ with reward $r$, from state $s$ and action $a$
$p(s'|s,a)$ - probability of transition to state $s'$, from state $s$ taking action $a$
$r(s,a)$ - expected immediate reward from state $s$ after action $a$
$r(s,a,s')$ - expected immediate reward on transition from $s$ to $s'$ under action $a$
$t$ - discrete time step
$T$, $T(t)$ - final time step of an episode, or of the episode including time step $t$
$A_{t}$ - action at time $t$
$S_{t}$ - state at time $t$, typically due, stochastically, to $S_{t-1}$ and $A_{t-1}$
$R_{t}$ - reward at time $t$, typically due, stochastically, to $S_{t-1}$ and $A_{t-1}$
$G_{t}$ - return following time $t$
$\alpha$ - step-size parameter
$V(S_{t}) \leftarrow V(S_{t}) + \alpha [V(S_{t+1}) - V(S_{t})]$ - temporal-difference learning
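
As a small preview, the last entry of the list translates almost directly into code. Here is a minimal sketch; the dictionary-based value table and the state names are illustrative assumptions, not part of the notation itself.

```python
# Temporal-difference style update: V(S_t) <- V(S_t) + alpha * (V(S_{t+1}) - V(S_t)).
# The value table and state names below are illustrative assumptions.

V = {"s1": 0.5, "s2": 0.8}  # current estimates of state values
alpha = 0.1                 # step-size parameter

def td_update(V, s_t, s_next, alpha):
    """Nudge the estimated value of s_t toward the value of the state that followed it."""
    V[s_t] = V[s_t] + alpha * (V[s_next] - V[s_t])

td_update(V, "s1", "s2", alpha)
print(V["s1"])  # ~0.53: the estimate for s1 moved a step toward the estimate for s2
```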

Let us now use them...