
TP7 - RL - ex1cd - iteration 4 in algorithm #6

Open

mtonglet opened this issue May 24, 2022 · 3 comments

@mtonglet
I think there's an error at the fourth iteration when updating the policy (and thus the values accordingly). Indeed, I've found a higher value of Q when the action is right (Q = 16.446) instead of up (Q = 16.32) for the position (line 2, column 4). The same error also occurs in the next iterations, and all other values are thus slightly different.
In fact, the algorithm finds it more interesting to run into the wall (0% chance of falling into the cliff) than to go up (5% chance of falling into the cliff).

@KenN7 (Member) commented May 24, 2022

Hi,
How do you compute it?

Note that, in order to take into account the walls, any policy that would take you to a wall has the effect of making you stay at the same spot.
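A minimal sketch of that rule (the grid dimensions and (row, column) convention here are assumptions for illustration, not the exact setup of the exercise):

```python
# Sketch of the "bumping into a wall keeps you in place" rule.
# Grid dimensions and (row, column) convention are assumptions for illustration.
N_ROWS, N_COLS = 4, 12
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def next_cell(row, col, action):
    """Cell reached by `action`; a move off the grid leaves you where you are."""
    dr, dc = MOVES[action]
    r, c = row + dr, col + dc
    if 0 <= r < N_ROWS and 0 <= c < N_COLS:
        return r, c
    return row, col  # bumped into a wall: stay at the same spot
```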

For me, starting from iteration 2 (where M(2,4) = 16.27):

If you compute for policy "up" at M(2,4) you get: V = 0.9*(20 + 1*0) + 0.05*(-50 + 1*0) + 0.05*(0 + 1*16.27) = 16.3,
that is, a 5% chance of staying in place with max Q = 16.27, a 5% chance of the -50 reward, and a 90% chance of getting the +20 reward.
From then on it converges; you just apply the same calculation, replacing 16.27 with 16.3.

If you compute for policy "right" at M(2,4), you get V = 0.9*(0 + 1*15.5) + 0.05*(20 + 1*0) + 0.05*(0 + 1*13.95) = 15.65,
that is, a 5% chance of going up with reward +20, a 5% chance of going down with max Q = 13.95, and a 90% chance of staying in place with max Q = 15.5.
Policy "up" is thus better.
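As a quick numeric check, the two back-ups above can be reproduced directly; the probabilities and neighbour values are copied from this comment (the "right" case is revisited in the follow-up below):

```python
# Reproduces the two Bellman back-ups written above, with gamma = 1.
gamma = 1.0

# Policy "up" at M(2,4): 90% goal (+20), 5% cliff (-50), 5% stay in place (value 16.27).
v_up = 0.9 * (20 + gamma * 0) + 0.05 * (-50 + gamma * 0) + 0.05 * (0 + gamma * 16.27)

# Policy "right" at M(2,4), as written here: 90% stay (15.5), 5% up (+20), 5% down (13.95).
v_right = 0.9 * (0 + gamma * 15.5) + 0.05 * (20 + gamma * 0) + 0.05 * (0 + gamma * 13.95)

print(v_up, v_right)  # ≈ 16.31 (16.3 above) and ≈ 15.65
```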

@mtonglet (Author)

Hello, thanks for your answer so far. I think I misused the Bellman equation, but I still have some grey areas about the reasoning and the algorithm...

I computed it as follows, considering the 3rd iteration where M(2,4) = 16.3:

For policy "up", I got v = -50*0.05 + 20*0.9 + 16.3*0.05 = 16.315, rounded up to 16.32.

For policy "right", v = 20*0.05 + 16.3*0.9 + 15.34*0.05 = 16.437, which is bigger than for the policy "up".

Btw, for the 2nd iteration you've written out, I don't understand why you're computing the term [...] + 0.05*(0 + 1*16.27) for the "up" policy, but [...] + 0.9*(0 + 1*15.5) for the "right" policy.
Why are the values different? Shouldn't it be the same value, i.e. 16.27? Personally, I would have computed it as v = 20*0.05 + 16.27*0.9 + 13.95*0.05 = 16.34, which would already have been greater than the value found for the "up" policy, where `v = 16.3`...
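The same check with the third-iteration numbers quoted above (all values copied from this comment):

```python
# Third iteration, M(2,4) = 16.3, gamma = 1; neighbour values copied from the comment.
v_up = -50 * 0.05 + 20 * 0.9 + 16.3 * 0.05       # ≈ 16.315, rounded up to 16.32
v_right = 20 * 0.05 + 16.3 * 0.9 + 15.34 * 0.05  # ≈ 16.437, larger than v_up
print(v_up, v_right)
```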

@KenN7 (Member) commented May 30, 2022

Hi,

Yes indeed, my bad, you are right: there is a mistake in the correction of the practical session.
I just double-checked the computation and indeed your calculations are right, sorry for that.

As for fixing the problem:
In the current form, there are two things that make the algorithm converge to the "maximum reward" (20); see the value-iteration sketch after the list:

  • there is no negative reward for making a move (we could, for example, add a -1 reward when an action is taken)
  • the discount factor (gamma) is 1
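As a hedged sketch of both knobs in code: the grid layout, cliff cells, terminal rewards and the 0.9/0.05/0.05 slip model below are assumptions for illustration, not the exact environment of the practical session, but the `step_reward` and `gamma` parameters correspond to the two points above:

```python
import numpy as np

# Value iteration exposing the two "knobs" discussed above: a per-move reward
# (step_reward, e.g. -1) and a discount factor (gamma, e.g. 0.9).
# Grid layout, cliff cells and the 0.9/0.05/0.05 slip model are assumptions.
N_ROWS, N_COLS = 3, 5
GOAL = (0, 4)                      # terminal cell, reward +20 on entry
CLIFF = {(0, 1), (0, 2), (0, 3)}   # terminal cells, reward -50 on entry
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
SLIPS = {"up": ("left", "right"), "down": ("left", "right"),
         "left": ("up", "down"), "right": ("up", "down")}
TERMINAL = CLIFF | {GOAL}

def next_cell(cell, move):
    """Apply a move; bumping into a wall keeps you in place."""
    r, c = cell[0] + MOVES[move][0], cell[1] + MOVES[move][1]
    return (r, c) if 0 <= r < N_ROWS and 0 <= c < N_COLS else cell

def reward(cell, step_reward):
    """Reward received for entering `cell`."""
    if cell == GOAL:
        return 20.0
    if cell in CLIFF:
        return -50.0
    return step_reward

def value_iteration(step_reward=0.0, gamma=1.0, n_iter=200):
    V = np.zeros((N_ROWS, N_COLS))
    for _ in range(n_iter):
        new_V = np.zeros_like(V)
        for r in range(N_ROWS):
            for c in range(N_COLS):
                if (r, c) in TERMINAL:
                    continue
                q_values = []
                for action, (slip1, slip2) in SLIPS.items():
                    q = 0.0
                    for p, move in ((0.9, action), (0.05, slip1), (0.05, slip2)):
                        nxt = next_cell((r, c), move)
                        v_next = 0.0 if nxt in TERMINAL else V[nxt]
                        q += p * (reward(nxt, step_reward) + gamma * v_next)
                    q_values.append(q)
                new_V[r, c] = max(q_values)
        V = new_V
    return V

print(value_iteration(step_reward=-1.0))   # fix 1: -1 reward per move, gamma = 1
print(value_iteration(gamma=0.9))          # fix 2: discount factor 0.9
```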

In its current form, the optimal policy is (U=up, D=down, L=left, R=right):

[image: optimal policy grid]

and the values converge to:

[image: converged value grid]

But if you introduce a negative reward of -1 for each action then it becomes:

[image: optimal policy with a -1 reward per action]

[image: converged values with a -1 reward per action]

Or if you define the discount factor (gamma) as 0.9:

[image: optimal policy with gamma = 0.9]

[image: converged values with gamma = 0.9]

I hope this clarifies the matter.

(this SE answer should also give you more info: https://ai.stackexchange.com/questions/35596/what-should-the-discount-factor-for-the-non-slippery-version-of-the-frozenlake-e )

I don't have time to fix it in the correction right now (you are welcome to make a pull request, btw), so I'll leave the issue open.
