Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Replace pexpect with libtmux in BashSession #4881

Open
wants to merge 226 commits into
base: main
Choose a base branch
from

Conversation

xingyaoww
Copy link
Collaborator

@xingyaoww xingyaoww commented Nov 10, 2024

  • Simplified implementation using libtmux instead of pexpect
  • Added proper handling of command errors, interactive commands, and timeouts
  • Added test suite to verify behavior
  • Improved output handling and error detection

End-user friendly description of the problem this fixes or functionality that this introduces

  • Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below

Give a summary of what the PR does, explaining any non-trivial design decisions

Collaborated with OpenHands:


Link of any specific issues this addresses


To run this PR locally, use the following command:

docker run -it --rm   -p 3000:3000   -v /var/run/docker.sock:/var/run/docker.sock   --add-host host.docker.internal:host-gateway   -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:cf98287-nikolaik   --name openhands-app-cf98287   docker.all-hands.dev/all-hands-ai/openhands:cf98287

openhands-agent and others added 30 commits November 10, 2024 18:12
- Simplified implementation using libtmux instead of pexpect
- Added proper handling of command errors, interactive commands, and timeouts
- Added test suite to verify behavior
- Improved output handling and error detection
…ould do it in the CmdOutputObservation end for keep_prompt
- Add tests for missing fields in PS1 metadata
- Add tests for malformed values in numeric fields
- Add tests for boolean values in numeric fields
- Fix JSON parsing in test_ps1_metadata_json_structure
- Fix handling of malformed values in from_ps1_match
- Move error handling from from_ps1_match to from_ps1
- Let from_ps1_match raise exceptions for invalid data
- Update tests to match new error handling behavior
- Add support for float values in numeric fields
- Fix regex pattern to handle different line endings
- Add more test cases for edge cases
- Keep valid string fields when numeric fields fail to parse
- Use re.escape() to properly escape special characters in markers
- Use constants to avoid duplication
- Update tests to use constants and handle newlines consistently
Copy link
Collaborator

@tofarr tofarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🍰

@enyst
Copy link
Collaborator

enyst commented Dec 20, 2024

We got around 41% on SWE-Bench Lite (well, in my first run), then I fix one minor issue and retry, it got us around 40%.

Running SWE-Bench verify to validate 🏃

I wonder what were the empty generations? I notice you used the proxy with the general model, and it's possible it will fall into a weird "not found" error we've seen before. I think I've seen that results in empty patches.🤔

@xingyaoww
Copy link
Collaborator Author

@enyst i think that should be exposed as an .error attribute, right? I didn't really see smth like that 🤔

@ryanhoangt
Copy link
Contributor

ryanhoangt commented Dec 21, 2024

I wonder what were the empty generations? I notice you used the proxy with the general model, and it's possible it will fall into a weird "not found" error we've seen before. I think I've seen that results in empty patches.🤔

I think another possibility is when the agent faced issues reproducing the error and poked around with the reproduce script, and that script was in the /workspace instead of /workspace/xxx, which is where we collect diff.

@@ -82,7 +90,7 @@ def event_to_dict(event: 'Event') -> dict:
d['timeout'] = event.timeout
elif 'observation' in d:
d['content'] = props.pop('content', '')
d['extras'] = props
d['extras'] = {k: _convert_pydantic_to_dict(v) for k, v in props.items()}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
d['extras'] = {k: _convert_pydantic_to_dict(v) for k, v in props.items()}
# props is a dict whose values can include a complex object like an instance of a BaseModel subclass
# such as CmdOutputMetadata
# we serialize it along with the rest
d['extras'] = {k: _convert_pydantic_to_dict(v) for k, v in props.items()}

self.confirmation_state = confirmation_state
self.security_risk = security_risk
logger.debug(
f'CmdRunAction ignored kwargs (likely due to legacy serialization): {kwargs}'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is compatibility the reason for adding this init? In other cases we've done it outside the constructor, so we could let the dataclass continue to create its init undercover. It's only keep_prompt, if I see this right?

'--END AGENT OBSERVATION--'
)

def to_agent_observation(self) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def to_agent_observation(self) -> str:
def _to_agent_observation(self) -> str:

This method seems used only here? Maybe we can make it private.

It confused me a bit, I thought it was used in the agent... 😅

metadata = CmdOutputMetadata.from_ps1_match(ps1_matches[-1])

# Special case where the previous command output is truncated due to history limit
# We should content BEFORE the last PS1 prompt
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# We should content BEFORE the last PS1 prompt
# We should get the content BEFORE the last PS1 prompt

command: str
exit_code: int = 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the stuck equivalence check _eq_no_pid may need a bit of update, to account for the move of exit_code.

With these changes, what should it mean for a CmdOutputObservation to be "the same" with another?

# Handle legacy attribute
if 'exit_code' in kwargs:
self.metadata.exit_code = kwargs['exit_code']
if 'command_id' in kwargs:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, if we need to handle legacy attrs, we can do it as before, maybe we don't have to write an init (yet)?

@enyst
Copy link
Collaborator

enyst commented Dec 25, 2024

@enyst i think that should be exposed as an .error attribute, right? I didn't really see smth like that 🤔

I've seen it not reported as error in a recent eval... I'll have to verify what's up, it might be just the summary script that missed it as error.
The completions of the eval, any of those with empty patch, would show that error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants