Coding With Fun

How does policy iteration and value iteration work?


Asked by Bryan Phillips on Dec 05, 2021 FAQ



Value iteration and policy iteration rely on the Bellman equations to compute the optimal value function. Value iteration computes the optimal state-value function by iteratively improving the estimate of V(s). The algorithm initializes V(s) to arbitrary random values, then repeatedly updates the Q(s, a) and V(s) values until they converge.
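The update loop described above can be sketched in Python. The tiny two-state MDP below (states, actions, transitions, and rewards) is a made-up example for illustration, not something from the question itself:

```python
# Minimal sketch of value iteration on a tiny, made-up 2-state MDP.
# P[s][a] is a list of (probability, next_state, reward) outcomes.

GAMMA = 0.9   # discount factor
THETA = 1e-8  # convergence threshold

P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def value_iteration(P, gamma=GAMMA, theta=THETA):
    V = {s: 0.0 for s in P}          # initialize V(s) arbitrarily
    while True:
        delta = 0.0
        for s in P:
            # Q(s, a) = sum over outcomes of prob * (reward + gamma * V(s'))
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in P[s]}
            new_v = max(q.values())  # V(s) = max_a Q(s, a)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:            # stop once updates are negligible
            return V

V = value_iteration(P)
```

With this particular MDP the loop converges to V(1) = 20 and V(0) = 19 (staying in state 1 earns reward 2 forever, discounted by 0.9).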
Moreover,
Policy iteration works on the principle "policy evaluation —> policy improvement". Value iteration works on the principle "optimal value function —> optimal policy". As far as I am concerned, and contrary to @zyxue's idea, VI is generally much faster than PI.
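The "policy evaluation —> policy improvement" loop can be sketched as follows. The two-state transition table is an illustrative assumption, not part of the original answer:

```python
# Minimal sketch of policy iteration on a tiny, made-up 2-state MDP.
# P[s][a] is a list of (probability, next_state, reward) outcomes.

GAMMA = 0.9

P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def evaluate(policy, P, gamma, theta=1e-8):
    """Policy evaluation: sweep V(s) = sum p * (r + gamma V(s')) to convergence."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def policy_iteration(P, gamma=GAMMA):
    policy = {s: next(iter(P[s])) for s in P}   # arbitrary initial policy
    while True:
        V = evaluate(policy, P, gamma)          # policy evaluation
        stable = True
        for s in P:
            # Policy improvement: act greedily w.r.t. the evaluated V
            best = max(P[s], key=lambda a: sum(
                p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:                               # policy no longer changes
            return policy, V

policy, V = policy_iteration(P)
```

On this example the loop settles on the greedy policy that always picks action 1 in both states, with the same values value iteration would find.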
In this manner, one important special case is when policy evaluation is stopped after just one sweep (one backup of each state). This algorithm is called value iteration. It can be written as a particularly simple backup operation that combines the policy improvement and truncated policy evaluation steps:

V(s) ← max_a Σ_{s'} P(s' | s, a) [R(s, a, s') + γ V(s')], for all s ∈ S.
And,
Value iteration is a method of computing an optimal MDP policy and its value. Value iteration starts at the "end" and then works backward, refining an estimate of either Q* or V*. There is really no end, so it uses an arbitrary end point.
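The "work backward from an arbitrary end point" view corresponds to finite-horizon backward induction: V_0 is the value at the chosen end, and each sweep computes values one more step away from it. The MDP and horizon below are made-up assumptions for illustration:

```python
# Sketch of backward induction: start from an arbitrary "end" value V_0
# and sweep backward; V_{k+1}(s) = max_a sum_{s'} p * (r + gamma * V_k(s')).
# The 2-state MDP and the horizon H are illustrative assumptions.

GAMMA = 0.9
H = 200  # arbitrary horizon / number of backward sweeps

P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

V = {s: 0.0 for s in P}  # V_0: value at the chosen "end" point
for _ in range(H):
    V = {s: max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])
                for a in P[s])
         for s in P}
```

Because the discount factor shrinks the influence of the end point by a factor of γ per sweep, the exact choice of V_0 stops mattering once the horizon is long enough, which is why an arbitrary end point works.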