# Policy Iteration

A general framework for finding a good policy for an MDP is to start with some policy and compute its value function: the value associated with each state (or state-action pair) under that policy.

The value of a state is our estimate of the discounted return we would obtain by starting in that state and following the policy forever after, $V^\pi(s) = \mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, \pi\big]$. Policy iteration alternates between two steps: estimating the value function of the current policy (policy evaluation) and improving the policy greedily with respect to that value function (policy improvement).
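
As a minimal sketch, the loop below implements tabular policy iteration in Python. It assumes the MDP dynamics are available as a nested structure `P[s][a]` holding `(probability, next_state, reward)` tuples; this representation, the function name `policy_iteration`, and its parameters are illustrative assumptions rather than taken from a particular library.

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9, eval_tol=1e-8):
    """Tabular policy iteration (sketch).

    P[s][a] is assumed to be a list of (probability, next_state, reward)
    tuples describing the MDP dynamics -- a hypothetical representation.
    """
    policy = np.zeros(n_states, dtype=int)  # start from an arbitrary policy

    while True:
        # Policy evaluation: iterate Bellman backups until V^pi converges.
        V = np.zeros(n_states)
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = sum(p * (r + gamma * V[s2])
                            for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < eval_tol:
                break

        # Policy improvement: act greedily with respect to V.
        stable = True
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best_a = int(np.argmax(q))
            if best_a != policy[s]:
                policy[s] = best_a
                stable = False

        if stable:  # no action changed: the policy is greedy w.r.t. its own values
            return policy, V
```

The outer loop stops when the improvement step leaves every action unchanged; at that point the policy is greedy with respect to its own value function, which for a finite MDP means it is optimal.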