As software and hardware agents begin to perform tasks of genuine interest, they will be faced with environments too complex for humans to predetermine the correct actions to take. Three characteristics shared by many complex domains are:
- high-dimensional state and action spaces
- partial observability
- multiple learning agents
To tackle such problems, algorithms combine deep neural network function approximation with reinforcement learning. The paper first described using Recurrent Neural Networks (RNNs) to handle partial observability in Atari games. Next, a multiagent soccer domain, Half Field Offense, is described, and approaches for learning effective policies in its parameterized-continuous action space are enumerated.
Hierarchical RL: the possibility to observe different goal states.
- n-step Q-learning
- On-Policy Monte Carlo
- Taking an on-policy approach in the beginning before switching to off-policy (Matthew's future work); see the sketch of the two targets after this list.
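As a rough illustration (not from the paper), the difference between an n-step Q-learning target and an on-policy Monte Carlo target can be sketched as follows; `rewards`, `gamma`, and `bootstrap_q` are hypothetical names.

```python
# Sketch: n-step Q-learning target vs. on-policy Monte Carlo target.
# `rewards` holds r_t, r_{t+1}, ... observed along a trajectory;
# `bootstrap_q` stands in for max_a Q(s_{t+n}, a).

def n_step_target(rewards, gamma, n, bootstrap_q):
    """Sum the first n discounted rewards, then bootstrap from a Q estimate."""
    partial_return = sum(gamma ** i * r for i, r in enumerate(rewards[:n]))
    return partial_return + gamma ** n * bootstrap_q

def monte_carlo_target(rewards, gamma):
    """Discounted return over the full episode; no bootstrapping (on-policy MC)."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))
```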
In a fully observable MDP: simply observe the current state and learn the action to execute.
In a POMDP, instead of receiving the full state of the world, the agent only receives observations (which may be noisy and incomplete) -- the agent still performs actions a_t and receives rewards r_t.
An optimal Q-function yields an optimal policy -- it is important to correctly estimate the Q-value of every action from every state, so that acting optimally reduces to simply choosing the action that maximizes the Q-function in each state.
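In standard notation (added here for reference, not from the slides):

```latex
Q^*(s,a) \;=\; \mathbb{E}\!\left[\, r_t + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\middle|\, s_t = s,\; a_t = a \right],
\qquad
\pi^*(s) \;=\; \arg\max_a Q^*(s,a).
```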
Observation: the current game screen of the Atari game (a 160x210 image with 3 channels).
Are Atari Games MDPs or POMDPs? Depends on the number of game screens used in the state representation.
Many games are partially observable from a single frame -- you can tell the position of the ball but not its velocity!
The most successful approach to playing Atari games -- it estimates the Q-values for each of the 18 possible actions in an Atari game. The DQN accepts the last 4 game screens as input. Learning is via TD reinforcement learning: maintain a replay memory D, sample transitions from the memory, and set the neural net's target y to the reward plus the gamma-discounted Q-value of the next state encountered.
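A minimal PyTorch sketch of that update, assuming a `q_net` that maps a stack of 4 screens to 18 action values; the names, and the omission of the separate target network, are simplifications rather than the paper's code.

```python
import random
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, replay_memory, batch_size=32, gamma=0.99):
    """One TD update: sample transitions (s, a, r, s', done) from replay memory D
    and regress Q(s, a) toward y = r + gamma * max_a' Q(s', a')."""
    batch = random.sample(replay_memory, batch_size)
    # Each element of the memory is a tuple of tensors; actions stored as int64.
    states, actions, rewards, next_states, dones = map(torch.stack, zip(*batch))

    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = q_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1 - dones.float())

    loss = F.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```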
DeepMind has established that DQNs perform very well on MDPs; the motivation here is to test their performance on POMDPs.
Here the game state must be inferred with high probability from the previous history of observations! DQN does not learn a flickering version of Pong very well -- it seems to have trouble establishing the position of the ball and inferring its velocity. Half of the 4 game screens are noisy, but the DQN treats them as normal game screens. Reason: the flickering is meant to induce partial observability, not to handicap the algorithm.
Two major changes: the fully connected layer of the DQN is replaced with an LSTM (with the same number of nodes in both layers). The recurrence in the LSTM layer hopefully extracts the relevant information from the screen at the current timestep.
BPTT: Backpropagation Through Time over the last 10 timesteps.
The LSTM gives the DQN some redundancy to combat the noisy incoming observations. The hope is to infer the current state of the world even though the observations are noisy. It is important to note that the LSTM infers the velocity of the Pong ball despite observing just a single frame at each timestep -- this can be visualized via a sequence of inputs that maximizes the activation of a given LSTM unit.
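A hedged sketch of the architectural change in PyTorch, assuming the standard 84x84 grayscale preprocessing; the layer sizes and names here are illustrative, not taken from the paper's code.

```python
import torch.nn as nn

class DRQN(nn.Module):
    """DQN-style convolutional stack with the fully connected layer replaced by
    an LSTM of the same width; processes one 84x84 screen per timestep."""
    def __init__(self, num_actions=18, hidden_size=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64 * 7 * 7, hidden_size, batch_first=True)
        self.q_head = nn.Linear(hidden_size, num_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 1, 84, 84) -- a single frame per timestep,
        # unrolled over e.g. 10 timesteps during BPTT.
        b, t = frames.shape[:2]
        feats = self.conv(frames.view(b * t, *frames.shape[2:]))
        feats = feats.view(b, t, -1)
        out, hidden = self.lstm(feats, hidden)
        return self.q_head(out), hidden
```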
Note : Pong maxes out at 21
Observe the view cones. The exact positions and the velocities of the agents are not observable.
Squashing Gradients
![squashinggradients](https://cloud.githubusercontent.com/assets/7057078/16124275/2986f05a-33a3-11e6-9027-fc185637ce8d.PNG)
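One plausible reading of the slide, sketched below: squash the actor's raw parameter outputs through tanh and rescale them into the legal bounds, so gradients naturally shrink as a parameter approaches its limit. The function and bound names are mine, not from the paper.

```python
import torch

def squash_to_bounds(raw_params, p_min, p_max):
    """Map unbounded actor outputs into [p_min, p_max] via tanh.
    tanh saturates near the bounds, so gradients w.r.t. the raw outputs
    shrink there -- one way to keep parameterized-continuous actions legal."""
    squashed = torch.tanh(raw_params)                      # values in (-1, 1)
    return p_min + (squashed + 1.0) * 0.5 * (p_max - p_min)

# Example with illustrative bounds, e.g. a kick power in [0, 100]:
kick_power = squash_to_bounds(torch.randn(1, requires_grad=True), 0.0, 100.0)
```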
Need to capture the Monte-Carlo (HV, LB -- high-variance, low-bias) approach while also staying computationally efficient.
More challenging because the action space is twice as large -- like learning to write: we have two hands but use only one to write.
Reducing sample complexity while learning better policies; non-differentiable components in RNNs.
- Instead of the full joint action space, whose size grows exponentially
- The idea is to have a sequence of tasks to learn -- a curriculum; a promising direction for shaping rewards.
- Implementation question: mini-batch of size 32, with 10 steps of backpropagation through time.
- Training time: the LSTM receives a single frame per timestep, unrolled over 10 timesteps.
- RL is a little slower at learning policies, but flexible enough to handle non-differentiabilities.
- Deterministic Policy Gradient vs. Stochastic Policy Gradient
- to address continuous action spaces
- trade-off between TD methods for estimating Q-values vs. policy methods (see the actor-critic sketch after this list)
- The critic must know [ToDo]
- Loss may not be able to tell you that you have converged.
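To make the last few bullets concrete, here is a hedged actor-critic sketch in the deterministic-policy-gradient style: the critic is trained with a TD target while the actor is improved through the critic's gradient with respect to the action. `actor`, `critic`, and the optimizers are hypothetical PyTorch modules, not code from the talk.

```python
import torch
import torch.nn.functional as F

def dpg_style_update(actor, critic, actor_opt, critic_opt,
                     states, actions, rewards, next_states, gamma=0.99):
    """One deterministic-policy-gradient style step (sketch).
    Critic: TD regression of Q(s, a) toward r + gamma * Q(s', mu(s')).
    Actor: ascend the critic's estimate of Q(s, mu(s))."""
    # Critic update (TD method estimating Q-values).
    with torch.no_grad():
        td_target = rewards + gamma * critic(next_states, actor(next_states))
    critic_loss = F.mse_loss(critic(states, actions), td_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update (policy method): follow the critic's gradient w.r.t. the action.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```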