[Bug]: Incorrectly interpreting a Python error as a success message #5637

Open · avi12 opened this issue Dec 16, 2024 · 17 comments · May be fixed by #5644
Labels: bug (Something isn't working)

Comments

avi12 commented Dec 16, 2024

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Describe the bug and reproduction steps

(video attachment: 2024-12-16.23-33-04.msedge.mp4)

OpenHands Installation

Docker command in README

OpenHands Version

0.15.2, from main

Operating System

WSL on Windows

Logs, Errors, Screenshots, and Additional Context

The error message is:

Parameter `old_str` is required for command: str_replace.
avi12 added the bug label on Dec 16, 2024
enyst (Collaborator) commented Dec 16, 2024

Does it continue from there? This error only means that the LLM asked for a replace without giving the text to replace. The LLM will then receive this error message as its next observation and hopefully correct itself.
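
For illustration, a minimal sketch of the kind of tool-call validation that produces this message; the names and structure here are assumptions, not the actual OpenHands code:

```python
# Illustrative only: the kind of tool-call validation that yields
# "Parameter `old_str` is required for command: str_replace."
# Names and structure are assumptions, not the actual OpenHands code.

def validate_str_replace(args: dict) -> None:
    """Reject an LLM-issued str_replace call that lacks required parameters."""
    for required in ("path", "old_str"):
        if required not in args:
            raise ValueError(
                f"Parameter `{required}` is required for command: str_replace."
            )

# A call missing old_str trips the check; the error text is then fed back
# to the LLM as its next observation so it can retry with a corrected call.
try:
    validate_str_replace({"path": "app.py", "new_str": "print('fixed')"})
except ValueError as e:
    print(e)  # Parameter `old_str` is required for command: str_replace.
```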

What LLM are you using?

avi12 (Author) commented Dec 16, 2024

(video attachment: 2024-12-17.00-02-47.msedge.mp4)

enyst (Collaborator) commented Dec 16, 2024

Sorry, this looks like normal, intended behavior. The model will make mistakes. When it makes a mistake, like trying to execute a command without a required parameter (this `old_str`), then on the next step it will receive that information (the error message) and is expected to figure out how to fix the problem.

That's what it says it will do: "let's try again, by specifying the correct old_str parameter". This kind of error is caused by the model, and the model will be helped - and expected - to solve it in the next steps.

avi12 (Author) commented Dec 17, 2024

I see, then why is the icon next to the command result a ✅ instead of ❌?

enyst (Collaborator) commented Dec 17, 2024

Aha! You are correct: it appears as a pretty green checkbox, and the rest of the application treats it like any other message of the model doing its thing.

Maybe we could fix this. @openhands-agent the ipython observation with an "ERROR" string is displayed in the frontend as a success message, with a green checkbox. Can we detect that it was an error and display it with a red X?
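
In spirit, the requested check might look something like this. This is a hedged sketch in Python of just the classification logic; the actual OpenHands frontend is TypeScript/React and may detect errors differently:

```python
# Hedged sketch of the requested check: flag ipython observations whose
# content carries an "ERROR" marker so the UI can render a red X instead
# of a green check. Illustrative only, not the actual OpenHands code.

def is_error_observation(content: str) -> bool:
    """Heuristic: treat output containing an ERROR marker as a failure."""
    return "ERROR" in content

def status_icon(content: str) -> str:
    return "❌" if is_error_observation(content) else "✅"

print(status_icon("ERROR: Parameter `old_str` is required for command: str_replace."))  # ❌
print(status_icon("File /workspace/app.py edited successfully."))  # ✅
```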

openhands-agent (Contributor)

OpenHands started fixing the issue! You can monitor the progress here.

openhands-agent (Contributor)

A potential fix has been generated and a draft PR #5644 has been created. Please review the changes.

CitroenGames commented Dec 17, 2024

Same issue here. I'm using Qwen 2.5 Coder 14B with LM Studio as the backend.

BradKML commented Dec 18, 2024

I have this issue as well regarding edit loops: either the model forgets some of the parameters, or worse, the strings don't match the file, which makes substitutions harder. #5310
Prompting them to just delete and rewrite the file is just as difficult. Imagine if they could substitute by line numbers or by function/class/method. But then the issue is making sure the substitution can handle varying line counts between the old and new text, and that intended function changes are handled well (a toy sketch follows below). Refactoring would also be a pain if there are name changes.
@CitroenGames have you tried other models like Phi-3/Phi-4 and DeepSeek Coder v2.5?
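
As a toy illustration of that line-based substitution idea (purely hypothetical, not an OpenHands feature), the replacement block simply doesn't need to match the length of the range it replaces:

```python
# Hypothetical sketch of line-range substitution (not an OpenHands
# feature). Lines are 1-indexed and inclusive; the replacement block may
# have more or fewer lines than the range it replaces.

def replace_lines(text: str, start: int, end: int, new_block: str) -> str:
    lines = text.splitlines()
    return "\n".join(lines[: start - 1] + new_block.splitlines() + lines[end:])

src = "def f():\n    return 1\n\nprint(f())"
# Replace the single line 2 with two lines; the surrounding code is untouched.
print(replace_lines(src, 2, 2, "    x = 2\n    return x"))
```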

avi12 (Author) commented Dec 18, 2024

I don't think we need the ability to instruct the agent to edit lines manually; we might as well click "Open VS Code" and edit the file ourselves.
The goal is rather a proper fix so the agent parses the result of the Python code correctly.

BradKML commented Dec 18, 2024

@avi12 it's like telling the agent to learn how to parse the file correctly, whether it's Python or another language. Yet this is very tricky when models like Qwen use ... to skip code, the way a human writing a tutorial would. It's partly a prompting issue and partly a tool-use issue.

avi12 (Author) commented Dec 18, 2024

I get what you're saying, but from my POV I just want the agent to figure things out on its own using the LLM, whether that's keeping a record of the file structure, reading many files to get better context, etc.
If I have to manually instruct the model what to do, then it kind of loses the point IMHO.

BradKML commented Dec 18, 2024

@avi12 yeah, you're right, it would be much better if it had self-instruction built in so it could correct itself and "learn" how to edit (e.g. section checks, line checks, function checks). I hate how low compatibility is for some models that are good at theory and algorithms but bad at writing and comprehension.

avi12 (Author) commented Dec 18, 2024

I mean, I can't really complain; it's a FOSS project.
Cheaper than paying $500 for Devin.

CitroenGames commented Dec 18, 2024

> I have this issue as well regarding edit loops: either the model forgets some of the parameters, or worse, the strings don't match the file, which makes substitutions harder. #5310
> Prompting them to just delete and rewrite the file is just as difficult. Imagine if they could substitute by line numbers or by function/class/method. But then the issue is making sure the substitution can handle varying line counts between the old and new text, and that intended function changes are handled well. Refactoring would also be a pain if there are name changes.
> @CitroenGames have you tried other models like Phi-3/Phi-4 and DeepSeek Coder v2.5?

I used a phi "4" that isn't official from Microsoft, and it had fewer issues getting stuck in a loop, but it still happened, so I think it's also something related to OpenHands.

enyst (Collaborator) commented Dec 18, 2024

Getting stuck in a loop is a problem for many LLMs. In fact, to my knowledge only Anthropic appears to have solved it somehow, especially for Opus, but Sonnet too, and I think even Haiku. Just about every other LLM gets weighed down by its own history at some point and gets stuck in a loop every once in a while. The next tokens end up the same every time: the model settles on one particular answer and responds with it over and over.

This happens with coding agents, but it also happens with multi-user dialogue, Discord experiments, attempts at analysis of text, etc. I've seen it everywhere there's a larger or more complex history than a simple back-and-forth, basically.

What OpenHands currently does about it is stop it (instead of going on and on until the maximum allowed steps!). A naive sketch of such a stop check follows below.

And we try to hand the user the ability to prompt it again, in the hope that it will be able to continue. Many LLMs might not be able to anyway, or not without a lot of prompting.

Sorry for elaborating; I'm sure we can do more, but this isn't really on topic here. We should probably open a new issue if you wish to discuss improvements to how we handle the LLM getting stuck in a loop. It really gets lost in unrelated issues.
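
A naive version of that stop check might look like this; this is illustrative only, and the real OpenHands stuck detector is more involved:

```python
# Naive sketch of stuck-in-a-loop detection: halt when the last few
# actions are identical repeats. Illustrative only; the actual OpenHands
# stuck detector is more involved.

def is_stuck(action_history: list[str], window: int = 3) -> bool:
    if len(action_history) < window:
        return False
    tail = action_history[-window:]
    return all(action == tail[0] for action in tail)

history = ["str_replace app.py", "str_replace app.py", "str_replace app.py"]
assert is_stuck(history)  # the controller stops here instead of burning steps
```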

Back to the topic here: we can fix this icon. I haven't verified the openhands-agent's fix yet; I will get to that. Please note that it's just about the web frontend display; it makes no difference to the agent.

BradKML commented Dec 18, 2024

@enyst have you seen the DRY sampler as a replacement for repetition penalty yet? It's not in OpenRouter yet, but it got a lot of good press; it was made by @p-e-w oobabooga/text-generation-webui#5677
Besides that, I would highly recommend checking out Dynamic Temperature and min_p by @kalomaze (see here #5310 (comment)), but the issue is that sampler hacks require upstream support, which is a tall order for proprietary models.
A more proper approach, beyond relying on open-weights hacking, is to create a mechanism that forces the LLM to "get creative" and break out of established loops. Temperature in OpenHands defaults to 0; I tried other params and need to run a larger set of tests to see whether alternative param pairs matter. Raising the temperature based on an attempt counter would be helpful (a sketch follows below).
But beyond the parameter side, we should also get the LLM to observe previous attempts and do some introspection on how to avoid redoing bad solutions, and learn from them a little. Exposing errors from tool usage, and throwing in some prompt engineering around debugging procedures, would help with this.
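
A hedged sketch of that attempt-counter idea; `llm_complete` and `looks_stuck` are placeholders for a completion call and a loop-detecting predicate, not real OpenHands APIs:

```python
# Hypothetical sketch of raising temperature with an attempt counter so a
# looping model gets a chance to break out of its repeated answer.
# `llm_complete` and `looks_stuck` are placeholders, not OpenHands APIs.

def complete_with_escalation(prompt, llm_complete, looks_stuck, max_attempts=4):
    last = None
    for attempt in range(max_attempts):
        temperature = min(0.3 * attempt, 1.0)  # 0.0, 0.3, 0.6, 0.9
        last = llm_complete(prompt, temperature=temperature)
        if not looks_stuck(last):
            return last
    return last  # give up and hand control back to the user
```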
