As software and hardware agents begin to perform tasks of genuine interest, they will be faced with environments too complex for humans to predetermine the correct actions to take. Three characteristics shared by many complex domains are:
- high-dimensional state and action spaces
- partial observability
- multiple learning agents
To tackle such problems, algorithms combine deep neural network function approximation with reinforcement learning. The paper first described using Recurrent Neural Networks (RNNs) to handle partial observability in Atari games. Next, a multiagent soccer domain, Half Field Offense, is described, and approaches for learning effective policies in its parameterized-continuous action space are enumerated.
Hierarchical RL: the possibility to observe different goal states.
- n-step Q-learning
- On-Policy Monte Carlo
- Taking an on-policy approach in the beginning before switching to off-policy (Matthew's future work); see the sketch of the two targets after this list.
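As a rough illustration (not from the paper), the difference between an n-step Q-learning target and an on-policy Monte Carlo target can be sketched as follows; `rewards`, `gamma`, and `bootstrap_q` are hypothetical names.

```python
# Sketch: n-step Q-learning target vs. on-policy Monte Carlo target.
# `rewards` holds r_t, r_{t+1}, ... observed along a trajectory;
# `bootstrap_q` stands in for max_a Q(s_{t+n}, a).

def n_step_target(rewards, gamma, n, bootstrap_q):
    """Sum the first n discounted rewards, then bootstrap from a Q estimate."""
    partial_return = sum(gamma ** i * r for i, r in enumerate(rewards[:n]))
    return partial_return + gamma ** n * bootstrap_q

def monte_carlo_target(rewards, gamma):
    """Discounted return over the full episode; no bootstrapping (on-policy MC)."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))
```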
In a fully observable MDP: simply observe the current state and learn the action to execute.
In a POMDP, instead of receiving the full state of the world, the agent only receives observations (which may be noisy and incomplete) -- the agent still performs actions a_t and receives rewards r_t.
An optimal Q-function yields an optimal policy -- it is important to correctly estimate the Q-value of every action from every state, so that acting optimally reduces to simply choosing the action that maximizes the Q-function in each state.
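In standard notation (added here for reference, not from the slides):

```latex
Q^*(s,a) \;=\; \mathbb{E}\!\left[\, r_t + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\middle|\, s_t = s,\; a_t = a \right],
\qquad
\pi^*(s) \;=\; \arg\max_a Q^*(s,a).
```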
Observation: the current game screen of the Atari game (a 160x210 image with 3 channels).
Are Atari Games MDPs or POMDPs? Depends on the number of game screens used in the state representation.
Many games are partially observable from a single frame -- you can tell the position of the ball but not its velocity!
The most successful approach to playing Atari games -- it estimates the Q-values for each of the 18 possible actions in an Atari game. The DQN accepts the last 4 game screens as input. Learning is via TD reinforcement learning: maintain a replay memory D, sample transitions from the memory, and set the neural net's target y to the reward plus the gamma-discounted Q-value of the next state encountered.
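A minimal PyTorch sketch of that update, assuming a `q_net` that maps a stack of 4 screens to 18 action values; the names, and the omission of the separate target network, are simplifications rather than the paper's code.

```python
import random
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, replay_memory, batch_size=32, gamma=0.99):
    """One TD update: sample transitions (s, a, r, s', done) from replay memory D
    and regress Q(s, a) toward y = r + gamma * max_a' Q(s', a')."""
    batch = random.sample(replay_memory, batch_size)
    # Each element of the memory is a tuple of tensors; actions stored as int64.
    states, actions, rewards, next_states, dones = map(torch.stack, zip(*batch))

    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = q_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1 - dones.float())

    loss = F.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```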
DeepMind has established that DQNs perform very well on MDPs; the motivation here is to test their performance on POMDPs.
Here the game state must be inferred with high probability from the previous history of observations! DQN does not learn a flickering version of Pong very well -- it seems to have trouble establishing the position of the ball and inferring its velocity. Half of the 4 game screens are noisy, but the DQN treats them as normal game screens. Reason: the flickering is meant to induce partial observability, not to handicap the algorithm.
Two major changes: the fully connected layer of the DQN is replaced with an LSTM (with the same number of nodes in both layers). The recurrence in the LSTM layer hopefully extracts the relevant information from the screen at the current timestep.
BPTT: Backpropagation Through Time over the last 10 timesteps.
The LSTM gives the DQN some redundancy to combat the noisy incoming observations. The hope is to infer the current state of the world even though the observations are noisy. It is important to note that the LSTM infers the velocity of the Pong ball despite observing just a single frame at each timestep -- this can be visualized via a sequence of inputs that maximizes the activation of a given LSTM unit.
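A hedged sketch of the architectural change in PyTorch, assuming the standard 84x84 grayscale preprocessing; the layer sizes and names here are illustrative, not taken from the paper's code.

```python
import torch.nn as nn

class DRQN(nn.Module):
    """DQN-style convolutional stack with the fully connected layer replaced by
    an LSTM of the same width; processes one 84x84 screen per timestep."""
    def __init__(self, num_actions=18, hidden_size=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64 * 7 * 7, hidden_size, batch_first=True)
        self.q_head = nn.Linear(hidden_size, num_actions)

    def forward(self, frames, hidden=None):
        # frames: (batch, time, 1, 84, 84) -- a single frame per timestep,
        # unrolled over e.g. 10 timesteps during BPTT.
        b, t = frames.shape[:2]
        feats = self.conv(frames.view(b * t, *frames.shape[2:]))
        feats = feats.view(b, t, -1)
        out, hidden = self.lstm(feats, hidden)
        return self.q_head(out), hidden
```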
Note : Pong maxes out at 21
Observe the view cones. The exact positions and the velocities of the agents are not observable.
Squashing Gradients
![squashinggradients](https://cloud.githubusercontent.com/assets/7057078/16124275/2986f05a-33a3-11e6-9027-fc185637ce8d.PNG)
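One plausible reading of the slide, sketched below: squash the actor's raw parameter outputs through tanh and rescale them into the legal bounds, so gradients naturally shrink as a parameter approaches its limit. The function and bound names are mine, not from the paper.

```python
import torch

def squash_to_bounds(raw_params, p_min, p_max):
    """Map unbounded actor outputs into [p_min, p_max] via tanh.
    tanh saturates near the bounds, so gradients w.r.t. the raw outputs
    shrink there -- one way to keep parameterized-continuous actions legal."""
    squashed = torch.tanh(raw_params)                      # values in (-1, 1)
    return p_min + (squashed + 1.0) * 0.5 * (p_max - p_min)

# Example with illustrative bounds, e.g. a kick power in [0, 100]:
kick_power = squash_to_bounds(torch.randn(1, requires_grad=True), 0.0, 100.0)
```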
Need to capture the Monte-Carlo (HV, LB -- high-variance, low-bias) approach while also staying computationally efficient.
More challenging because the action space is twice as large -- like learning to write: we have two hands but use only one to write.
Reducing sample complexity while learning better policies; non-differentiable components in RNNs.
- Instead of the full joint action space, whose size grows exponentially
- The idea is to have a sequence of tasks to learn -- a curriculum; a promising direction for shaping rewards.
- Implementation question: mini-batch of size 32, with 10 steps of backpropagation through time.
- Training time: the LSTM receives a single frame per timestep, unrolled over 10 timesteps.
- RL is a little slower at learning policies, but flexible enough to handle non-differentiabilities.
- Deterministic Policy Gradient vs. Stochastic Policy Gradient
- to address continuous action spaces
- trade-off between TD methods for estimating Q-values vs. policy methods (see the actor-critic sketch after this list)
- The critic must know [ToDo]
- Loss may not be able to tell you that you have converged.
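To make the last few bullets concrete, here is a hedged actor-critic sketch in the deterministic-policy-gradient style: the critic is trained with a TD target while the actor is improved through the critic's gradient with respect to the action. `actor`, `critic`, and the optimizers are hypothetical PyTorch modules, not code from the talk.

```python
import torch
import torch.nn.functional as F

def dpg_style_update(actor, critic, actor_opt, critic_opt,
                     states, actions, rewards, next_states, gamma=0.99):
    """One deterministic-policy-gradient style step (sketch).
    Critic: TD regression of Q(s, a) toward r + gamma * Q(s', mu(s')).
    Actor: ascend the critic's estimate of Q(s, mu(s))."""
    # Critic update (TD method estimating Q-values).
    with torch.no_grad():
        td_target = rewards + gamma * critic(next_states, actor(next_states))
    critic_loss = F.mse_loss(critic(states, actions), td_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update (policy method): follow the critic's gradient w.r.t. the action.
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```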