Rating: 9.0/10.

Overall, great textbook about reinforcement learning using deep neural networks, I liked it because it places roughly equal emphasis on theory and code, there are some equations, but the author explains everything more through intuition rather than formal mathematics, making it easy to understand quickly compared to other textbooks. Many of the algorithms have Python code, and the author points you to references as well as practical libraries where you can get some hands-on experience. The book is up to date (written in 2022), so it is a good resource for getting up to the state-of-the-art in deep RL.

## Ch1: Introduction

What is RL; difference between supervised, unsupervised, and reinforcement learning, applications to sequential decision problems / robotics / games.

## Ch2: Tabular Value-based RL

An **agent** sends an action to the **environment**, which returns a new state and a reward. The goal of RL is to find the best action in each state. A **state** contains all the information that represents the configuration of the environment.

The **transition function** determines how the state changes after an action is selected. Agents do not have access to this (but may try to learn it). Transitions can be stochastic (multiple possible next states given action) or deterministic. The **reward** (r) is assigned at each step, a **discount factor** (\gamma) makes immediate rewards more preferable than future rewards.

A **policy** (\pi(a|s)) assigns a probability distribution over the action space for each state; a deterministic policy maps a state to an action. A **trace** (or trajectory or episode) is a sequence of states, actions, and rewards until termination. The **return** of a trace (R) is the sum of all rewards multiplied by discount factors.

A **state value** function V^\pi(s) assigns a real value to each state; a **state-action value** function Q^\pi(s, a) assigns a real value to a state-action pair. Benefit of Q(s, a) is you can obtain a policy from it by taking the max value action, but you can’t do this with V(s). The **Bellman equation** recursively assigns V(s) to each state by traversing the full tree, summing up the children, and applying discount factor.

**Value iteration** iteratively updates V and Q functions until convergence, but requires knowledge of the world dynamics (instead of learning by acting on the environment). **Temporal difference (TD) learning** updates V and Q based on difference from expected values from sampling the environment. There algorithms must make a tradeoff between exploration and exploitation; a popular strategy is **epsilon-greedy** which takes a random action some of the time and the best action otherwise.

**SARSA** is an on-policy algorithm that updates Q(s, a) given a samples from its own policy. **Q-learning** is the off-policy variant, learns Q(s, a) assuming the max valued action is taken from each state, although this max operation makes it less stable.

Example code is given in the Taxi environment (discrete, grid world), two other popular Gym environments are Mountain Car and Cartpole.

## Ch3: Deep Value-based RL

**DQN** was the first paper in deep RL, uses neural network to approximate Q function which is too large to tabulate. The training loop is actually quite similar to supervised learning, but instead of looping over the dataset in each epoch, you collect episodes and train on them.

The **deadly triad** of three elements that make deep RL difficult are: coverage (hard to ensure all states are covered), correlation (similar states are explored together), and convergence (updating Q function while using it to train makes it unstable). DQN uses **experience replay** to store previous experience and sample it back randomly, and a separate Q network as targets that is less frequently updated, this is to improve stability.

Lots of improvements to DQN after it was released: DDQN uses separate networks to choose and evaluate the action; PEX prioritizes important experiences to replay more often; Dueling DDQN separates value and advantage function; Rainbow combines several improvements together.

## Ch4: Policy-based RL

Policy-based methods find a policy directly, without finding an intermediate value action; they are also able to handle continuous action spaces whereas value-based methods cannot. This is useful in many robotics situations and some games.

**REINFORCE** (or Monte-Carlo policy gradient) is the simplest policy gradient algorithm, in each iteration, collect an episode and update the model using discounted rewards. The policy gradient formula divides the gradient \nabla \pi(a|s) by the probability of itself, \pi(a|s), for normalization, so the whole thing simplifies to \nabla \ln \pi(a|s), then multiply that by the reward.

**Actor-critic** methods have a critic network in addition to the policy (or actor) network, critic learns the value of state/action pair. This can be used to replace the full discounted rewards with an estimated value after n steps. **Advantage function** is to learn the baseline value of a state (independent of action) and separately learn the difference of an action from the baseline value.

**A3C** does parallel workers which each runs the environment and sends gradient updates back to a central worker. **TRPO** uses KL divergence from old to new policy as a loss term to prevent large drastic updates; **PPO** uses a clipping of the objective function to accomplish the same thing in an easier way.

**SAC** adds an entropy term to the loss function to encourage exploration. **DDPG** learns a Q(s, a) function, but instead of taking an argmax, trains the policy to maximize the learned Q function. All these algorithms can be applied to locomotion tasks like MuJoCo from DeepMind Control suite, where the agent is rewarded for going forward.

## Ch5: Model-based RL

Model-based algorithms learn the world dynamics from the environment, which is good for sample efficiency. Concretely, model M takes state and action as input and predicts next state and reward, it is used to train the policy (in hybrid setups, the real environment is used to train the policy too). Often about 1000 model sampling steps are done for each environment sampling step. Another advantage is the agent can try and undo actions, whereas undo is not possible in the environment.

Multiple models can be trained in an ensemble to learn the world dynamics, and the policy can use them all for training. **Latent models** predict a lower dimensional latent state instead of the actual state, this can allow it to ignore unimportant details of the state. **Planning** can be done using traditional search algorithms, or end-to-end models (like VIN) where the NN learns to search, these end-to-end methods are better for generalizability.

## Ch6: Two-agent self-play

AlphaGo Zero uses self-play in three levels simultaneously:

- Move-level self play meaning the Monte Carlo playouts has an opponent that’s a copy of ourselves.
- Example-level self play, meaning the value and policy networks are trained using the data from playing ourselves.
- Tournament-level self play, meaning the outer loop contains copies of ourselves playing against each other, creating a learning curriculum.

The classical **minimax** algorithm does a tree search to compute the value of a game state, using a heuristic evaluation function at the leaves. **Monte Carlo Tree Search** (MCTS) does not need a heuristic evaluation function, instead it plays out games randomly until the end, storing the win rate and visit count. It uses **UCT** to select the most promising node to expand using win rate and visitation statistics, balancing exploration vs exploitation. We then perform **playout** on that node and add it to the tree. AlphaGo uses **P-UCT**, where the policy is used to weigh the UCT formula.

The first AlphaGo used supervised learning from human grandmasters to aid training. AlphaGo Zero was trained from scratch; AlphaZero played chess, shogi, and Go using a single architecture. Polygames framework is good for playing around with MCTS.

## Ch7: Multi-agent RL

Multi-agent problems can be competitive or cooperative settings. Challenge is partial observability (including imperfect information from simultaneous moves), and nonstationary environments (what you do can affect the other agents), which can blow up the state space. **Nash equilibrium** is a strategy that cannot be exploited, so opponents have no reason to deviate from it, although it does not try to exploit opponents’ weaknesses.

**Counterfactual regret minimization** (CFR) is a tree search similar to minimax, computes the value for each node and eventually converges to the Nash equilibrium. Deep CFR uses neural networks to approximate within CFR. **Population-based training** (PBT) has a population of many agents that try to explore and exploit each other, this was used in AlphaStar to play Starcraft.

## Ch8: Hierarchical RL

Hierarchical RL is setting **subgoals** (which are solved by sub-policies) to achieve intermediate objectives, like navigating to a hallway in a room maze task. In the options framework, an **option** (or macro-action) is a group of actions to achieve an objective, a **meta-policy** chooses among option to take, and a sub-policy tries to solve an option.

Subgoals can be defined manually, or methods like **Option Critic** or Hierarchical Actor Critic can find subgoals automatically. Another idea is **intrinsic motivation** where the critic provides rewards back to the policy, this is useful for solving Montezuma’s revenge.

## Ch9: Meta-learning

Many tasks in NLP and CV use the **pretrain-finetune** paradigm where a model is fine-tuned on task specific data. **Meta-learning** is learning a model that can learn new tasks in few-shot settings. In RL, one way is using LSTMs to remember previous episodes in the hidden state. **MAML** is a model-agnostic optimization approach, finds model parameters such that optimal parameters for new tasks can be reached in few iterations, this requires a second-derivative optimization.