refactor: Replace pexpect with libtmux in BashSession #4881

Open: wants to merge 226 commits into base: main
Changes from 223 commits

Commits
7b86e33
refactor: Replace pexpect with libtmux in BashSession
openhands-agent Nov 10, 2024
9e5653c
update poetry and implement pwd
xingyaoww Nov 10, 2024
522eb53
add CmdOutputMetadata to get a lot of info from ps1
xingyaoww Nov 10, 2024
d60065b
handle and test on multiple PS1 block
xingyaoww Nov 10, 2024
e304973
greatly simplify command to not accepting blocking/keep_prompt, we sh…
xingyaoww Nov 10, 2024
c743833
fix PS1 so PS1JSON works
xingyaoww Nov 10, 2024
49eae72
support pid
xingyaoww Nov 10, 2024
363b379
preliminary impl of bash
xingyaoww Nov 10, 2024
ffa0676
slight refactor of cmd
xingyaoww Nov 10, 2024
554e03a
add blocking back
xingyaoww Nov 10, 2024
2b252e0
add blocking back
xingyaoww Nov 10, 2024
e994620
Improve test coverage for CmdOutputMetadata
openhands-agent Nov 10, 2024
1935483
Refactor error handling in CmdOutputMetadata
openhands-agent Nov 10, 2024
23ddbe4
Improve CmdOutputMetadata handling of malformed values and line endings
openhands-agent Nov 11, 2024
744938c
Refactor PS1 metadata regex pattern
openhands-agent Nov 11, 2024
f5518bf
update test bash session to be compatible with latest interface
xingyaoww Nov 11, 2024
aaed596
make sure ps1 end begin with newline
xingyaoww Nov 11, 2024
71e4ec5
add newline suffix for test
xingyaoww Nov 11, 2024
3a2443f
add test
xingyaoww Nov 11, 2024
ceb0d32
add testcase & tweak for ps1
xingyaoww Nov 11, 2024
145f141
remove re escape
xingyaoww Nov 11, 2024
56e4df5
tweak ps1
xingyaoww Nov 11, 2024
9e43ee5
fix bash session arg
xingyaoww Nov 11, 2024
ed382c6
make action execution server compatible
xingyaoww Nov 11, 2024
8e4180f
remove command id from agent obs & make other places compatible with …
xingyaoww Nov 11, 2024
3c68c7d
fix typo
xingyaoww Nov 11, 2024
d82c420
fix typo
xingyaoww Nov 11, 2024
010e453
do not wrap lines in tmux captured output
xingyaoww Nov 11, 2024
4aeb681
use pwd to get working_dir
xingyaoww Nov 11, 2024
28615cd
use PROMPT_COMMAND to make sure PS1 changes
xingyaoww Nov 11, 2024
9aa3e47
remove warnings for pid
xingyaoww Nov 11, 2024
959733c
add 1 to re.match.end() to remove extra newline
xingyaoww Nov 11, 2024
210303c
lstrip on cmd output
xingyaoww Nov 11, 2024
e050ceb
update tests to be more strict on newline
xingyaoww Nov 11, 2024
8db543d
add tests
xingyaoww Nov 11, 2024
3dd21fa
Improve test coverage for BashSession
openhands-agent Nov 11, 2024
efc481f
Merge commit '910b283ac2f6b3896e174cb77377c5ab6900da22' into feature/…
xingyaoww Nov 12, 2024
03ba929
only clean screen if prev status is not timeout
xingyaoww Nov 12, 2024
92b2b0c
also don't clear screen on CONTINUE
xingyaoww Nov 12, 2024
fec3083
fix command prefix
xingyaoww Nov 12, 2024
a28212b
tweak debug viz
xingyaoww Nov 12, 2024
f03de49
print agent obs
xingyaoww Nov 12, 2024
42b69a3
tweak
xingyaoww Nov 12, 2024
4ee07fe
Merge commit 'a93f1402debd325dac68360650bd12ae6abad643' into feature/…
xingyaoww Nov 14, 2024
99ef1ef
make timeout configurable
xingyaoww Nov 14, 2024
34a14fd
add info when command completed
xingyaoww Nov 14, 2024
c3ae9cf
refactor _get_command_output
xingyaoww Nov 14, 2024
3affa77
rename custom prefix to continue_prefix
xingyaoww Nov 14, 2024
d90a338
add missing newlines
xingyaoww Nov 14, 2024
4d47241
fix ctrl+c
xingyaoww Nov 14, 2024
3e1f12b
allow continue if prev command is also continue
xingyaoww Nov 14, 2024
0df9ba2
tweak bash doc
xingyaoww Nov 14, 2024
ebeccc6
improve "continue" mode for bash
xingyaoww Nov 14, 2024
fa351fe
fix all bash session tests
xingyaoww Nov 14, 2024
0c14a80
Merge commit '00ffc33d1bdc4d3287d26a4b63cedd2244e96570' into feature/…
xingyaoww Nov 15, 2024
77b4c7c
fix linter
xingyaoww Nov 15, 2024
bc3428a
add tmux to Dockerfile
xingyaoww Nov 15, 2024
7a8ff37
make eventstream runtime atexit register happen only at init
xingyaoww Nov 15, 2024
f9f37ad
improve ps1 for env commands
xingyaoww Nov 15, 2024
fae1185
fix CmdOutputObservation constructor; fix env test
xingyaoww Nov 15, 2024
184794a
fix cmdoutput constructor
xingyaoww Nov 15, 2024
e1c2ac2
handle multi line session
xingyaoww Nov 18, 2024
fce1b07
fix multiline runtime tests
xingyaoww Nov 18, 2024
fab1438
(hopefully) fix all tests
xingyaoww Nov 18, 2024
c7aee63
Merge commit 'de821718fd579448150a8a614be9da550fd743bf' into feature/…
xingyaoww Nov 18, 2024
f5d23b3
update poetry lock
xingyaoww Nov 18, 2024
88658dd
fix security test
xingyaoww Nov 18, 2024
25ae18c
fix deserialization test
xingyaoww Nov 18, 2024
488a1a7
refactor imports
xingyaoww Nov 18, 2024
c173a03
fix codeact test
xingyaoww Nov 18, 2024
1bb9f82
Add tmux installation to GitHub workflows
openhands-agent Nov 18, 2024
7752a94
Add tmux installation to additional GitHub workflows
openhands-agent Nov 18, 2024
cd94759
feat: add keep_prompt parameter to CmdRunAction
openhands-agent Nov 18, 2024
d491e47
feat: implement keep_prompt handling in BashSession and add tests
openhands-agent Nov 18, 2024
d76bbfa
Revert "feat: implement keep_prompt handling in BashSession and add t…
xingyaoww Nov 18, 2024
4d1c742
Revert "feat: add keep_prompt parameter to CmdRunAction"
xingyaoww Nov 18, 2024
51d0bcb
refactor: move prefix/suffix to CmdOutputMetadata
openhands-agent Nov 18, 2024
fa714a4
test: update test_bash_session.py to verify prefix/suffix fields
openhands-agent Nov 18, 2024
53d2de2
fix testcase
xingyaoww Nov 18, 2024
fefabd1
fix tests
xingyaoww Nov 18, 2024
e5001a3
improve 500 error message
xingyaoww Nov 18, 2024
9c00bd6
improve error message & fix ps1 parsing
xingyaoww Nov 18, 2024
7ef9e37
fix test
xingyaoww Nov 18, 2024
20721e3
remove keep_prompt from everywhere
xingyaoww Nov 18, 2024
725eeb1
fix resolver tests
xingyaoww Nov 18, 2024
1dfee78
improve error message
xingyaoww Nov 18, 2024
9fe792f
fix resolver test
xingyaoww Nov 18, 2024
c335b1e
fix test bash ps1
xingyaoww Nov 18, 2024
a15708a
remove the complex local tmux test
xingyaoww Nov 18, 2024
ff3d971
try fix conflict of tmux session
xingyaoww Nov 19, 2024
22a2572
Merge commit 'a531413d8649640842d2e639e15b4e7ecadf35c5' into feature/…
xingyaoww Nov 19, 2024
904bc29
remove command id
xingyaoww Nov 19, 2024
f016fbc
fix test
xingyaoww Nov 19, 2024
0f40b4c
fix PS1 parsing
xingyaoww Nov 19, 2024
313a901
fix ipython pwd
xingyaoww Nov 19, 2024
1f9168a
remove specified sid
xingyaoww Nov 19, 2024
1a40358
only raise RuntimeError when error code >= 500
xingyaoww Nov 19, 2024
95add43
relax tests
xingyaoww Nov 19, 2024
6da2636
resize tmux window
xingyaoww Nov 19, 2024
bc995ef
temporarily bump ver for runtime
xingyaoww Nov 19, 2024
153a501
fix resize arg
xingyaoww Nov 19, 2024
796a100
simplify test bash session in favor of runtime test
xingyaoww Nov 19, 2024
483f4b1
tweak test
xingyaoww Nov 19, 2024
3d7b44c
tweak test
xingyaoww Nov 19, 2024
bd12b99
fix serialization for CmdOutputMetadata
xingyaoww Nov 19, 2024
f2d57f9
Merge commit '302e41d7bb3d5b2b319f1ce2d15e5925dda069a2' into feature/…
xingyaoww Nov 19, 2024
cf7897b
hopefully fixes the bash
xingyaoww Nov 19, 2024
b430cb4
fix empty cmd handling
xingyaoww Nov 19, 2024
bae44a7
fix test
xingyaoww Nov 19, 2024
60daaa3
fix request
xingyaoww Nov 19, 2024
04397fe
fix VERY long cmd output
xingyaoww Nov 19, 2024
206eb19
update runtime test for looooong output
xingyaoww Nov 19, 2024
a0b5c9f
fix history limit
xingyaoww Nov 20, 2024
868e5a3
fix window start dir
xingyaoww Nov 20, 2024
48a866f
tweak
xingyaoww Nov 20, 2024
e914055
Merge commit '68e52a9c62f4cc6d48d33c5f1179aa4c1008b5a8' into feature/…
xingyaoww Nov 21, 2024
902a484
Merge commit '36d85b65c809f0c522c590dce5c6f96d48169dae' into feature/…
xingyaoww Nov 25, 2024
5e4e238
merge
xingyaoww Nov 26, 2024
e8d734d
get preliminary ver of pipe-pane working
xingyaoww Nov 27, 2024
96fa5be
get bash session tests working with pipe-pane
xingyaoww Nov 27, 2024
a657690
ok we may need to live with color when doing pipe-pane
xingyaoww Nov 27, 2024
8132820
Merge commit '082a55195ffa669ff71669156f9d8aa887217075' into feature/…
xingyaoww Nov 27, 2024
3ad3a39
add tests back
xingyaoww Nov 27, 2024
f360b87
add destructor
xingyaoww Dec 2, 2024
49be926
cleanup bracketed-paste
xingyaoww Dec 2, 2024
6fd958c
only .close() if not closed
xingyaoww Dec 2, 2024
61ebe54
Merge commit '5069a8700a8fc1219b10e2b57b1922eab995ec9f' into feature/…
xingyaoww Dec 2, 2024
6db1672
remove ansi test
xingyaoww Dec 2, 2024
b1652be
improve debug log
xingyaoww Dec 2, 2024
c50a45e
feat: display exact error for runtime requests exception handling
xingyaoww Dec 3, 2024
fb19118
Merge commit 'c50a45e9f058182def36c4a07650323ebaee020a' into feature/…
xingyaoww Dec 3, 2024
40e6767
fix action execution detail
xingyaoww Dec 3, 2024
a5815e6
fix action execution detail
xingyaoww Dec 3, 2024
dce4a38
replace all occurences of requests.HTTPError
xingyaoww Dec 3, 2024
f05af62
replace all occurences of requests.HTTPError
xingyaoww Dec 3, 2024
ab4f0e4
simplify error
xingyaoww Dec 3, 2024
3d03509
Merge commit 'ab4f0e497046c97e47e3d2cf369bdd16f6593a09' into feature/…
xingyaoww Dec 3, 2024
84c75e4
only print stacktrace
xingyaoww Dec 3, 2024
79410c5
get pipe to work for bash session (kinda)
xingyaoww Dec 3, 2024
a878109
do not reset pane every time
xingyaoww Dec 3, 2024
115cde3
remove extra debug; fix session test
xingyaoww Dec 3, 2024
990fb03
fix bug for very long outputs
xingyaoww Dec 3, 2024
cc44952
reduce freq of getting pane output & parse ps1
xingyaoww Dec 3, 2024
ba52ac5
Merge commit '1b8104ba14234599ce3a19e266582be7b87cf23c' into feature/…
xingyaoww Dec 3, 2024
be8c9d5
get read -p test back
xingyaoww Dec 3, 2024
aee78f3
disable enter name check
xingyaoww Dec 3, 2024
dc2c23b
fix cleanup
xingyaoww Dec 3, 2024
8db6055
always combine outputs between matches on all cases
xingyaoww Dec 3, 2024
043cc16
fix combine output bugs
xingyaoww Dec 3, 2024
4641566
strip commands before execute; fix bash loop
xingyaoww Dec 4, 2024
2b554bd
log openhands version in eval runs, instead of agent ver
xingyaoww Dec 4, 2024
2052829
fix ver
xingyaoww Dec 4, 2024
4d6d069
use get_version
xingyaoww Dec 4, 2024
a3fff39
support log debug remotely1
xingyaoww Dec 4, 2024
12ecd35
support directly stream docker/devbox logs to stdout in debug mode
xingyaoww Dec 4, 2024
4fa842e
add sse-starlette
xingyaoww Dec 4, 2024
0952c38
tweak test
xingyaoww Dec 4, 2024
f085364
hit enter for cases when matches <1
xingyaoww Dec 4, 2024
923f88d
Merge commit 'ceb60b9a37d669a51945710ae036e7fc428dc7e9' into feature/…
xingyaoww Dec 5, 2024
3ea1fd8
handle multiple ps1 before start
xingyaoww Dec 5, 2024
4615908
fix poetry lock
xingyaoww Dec 5, 2024
c90a95a
print pod log when failed remote runtime
xingyaoww Dec 5, 2024
f093c69
use non-login shell to start a new shell for the given user
xingyaoww Dec 5, 2024
3ae045f
condense test_bash to single line
xingyaoww Dec 5, 2024
8569e7a
update pyproject ver
xingyaoww Dec 5, 2024
b1fde67
revert window command
xingyaoww Dec 5, 2024
1679810
do login
xingyaoww Dec 5, 2024
27c2455
increase timeout
xingyaoww Dec 5, 2024
69d8f34
add tests
xingyaoww Dec 5, 2024
ace691e
revert to polling capture-pane since pipe-pane can't capture prompts …
xingyaoww Dec 5, 2024
2a41ee5
log decoder error for match ps1
xingyaoww Dec 5, 2024
2e8452e
remove commented code
xingyaoww Dec 5, 2024
0514bed
add test_python_interactive_input to test_bash
xingyaoww Dec 5, 2024
4a2c880
increase history limit
xingyaoww Dec 5, 2024
85c5431
increase timeout
xingyaoww Dec 5, 2024
47d0ba4
reduce num lines for testing
xingyaoww Dec 5, 2024
7cbebdf
reduce max lines
xingyaoww Dec 5, 2024
f34dbd3
update implementation to handle overly long cmd output
xingyaoww Dec 6, 2024
ef04cdb
increase timeout for CI
xingyaoww Dec 6, 2024
eb27320
remove extra stuff from tests
xingyaoww Dec 6, 2024
db8114e
handle requests.exceptions.JSONDecodeError
xingyaoww Dec 9, 2024
d5c5db6
fix request error handling
xingyaoww Dec 10, 2024
256b352
add a bunch of debug log
xingyaoww Dec 13, 2024
29bf36b
Merge commit '8ae2fb636eb9ded9039ea8c3a7227b3fce5cc68b' into feature/…
xingyaoww Dec 13, 2024
6ec1683
get git op tests
xingyaoww Dec 13, 2024
14b1085
add mechanism to avoid double newline
xingyaoww Dec 13, 2024
f529bc8
try fix serialization
xingyaoww Dec 13, 2024
00253b9
try fix serialization
xingyaoww Dec 13, 2024
47ae5bf
fix serialization
xingyaoww Dec 13, 2024
9ded783
fix command success test
xingyaoww Dec 13, 2024
06f7694
fix tests
xingyaoww Dec 13, 2024
21e497b
Merge commit 'd733bc6bdd8e743d2e5a7f5fe592f7462548c5d9' into feature/…
xingyaoww Dec 13, 2024
9bd5143
fix test case
xingyaoww Dec 16, 2024
6cf0a08
return alive only when client is initialized
xingyaoww Dec 17, 2024
5953ee8
update log
xingyaoww Dec 17, 2024
a5404b8
add check for python interpreter
xingyaoww Dec 17, 2024
2dba843
add cwd to agent observation
xingyaoww Dec 17, 2024
dfb33ca
remove request body
xingyaoww Dec 17, 2024
06a68eb
use cp -r instead of mv
xingyaoww Dec 17, 2024
e6f095c
Merge commit '3297e4d5a8c8578bbe220bed6489d74a659a832a' into feature/…
xingyaoww Dec 17, 2024
faaf63d
increase timeout
xingyaoww Dec 18, 2024
22cb1e6
set max retries back to 5
xingyaoww Dec 18, 2024
e5f798b
make cannot restore state a debug message
xingyaoww Dec 18, 2024
fcc7fdf
cleanup runtime exception handling
xingyaoww Dec 19, 2024
65742fa
increase resource factor for runtime when previous run failed likely …
xingyaoww Dec 20, 2024
901a2c8
remove stuck in look from fatal exception; add AgentRuntimeUnavailabl…
xingyaoww Dec 20, 2024
2bf3202
Merge commit '73c38f1163cc37048c3e31e1941fe4cd798c296e' into feature/…
xingyaoww Dec 20, 2024
b4ed2dc
replace while true with while should_continue
xingyaoww Dec 20, 2024
be8914b
rename pwd to cwd
xingyaoww Dec 20, 2024
8f2e9a9
move bash init logic to a separate init function
xingyaoww Dec 20, 2024
178e029
update resource factor
xingyaoww Dec 20, 2024
7498fe4
Merge commit 'd62cf7e7319850ce8c0dc47a3ddab0f4151d2af6' into feature/…
xingyaoww Dec 23, 2024
fa78313
add initialized for bash session
xingyaoww Dec 23, 2024
8040497
make sure legacy CmdOutputObservation is still serializable
xingyaoww Dec 23, 2024
5ff8998
fix missing init
xingyaoww Dec 23, 2024
c5ca25f
re-order thought
xingyaoww Dec 23, 2024
b34beaa
fix serialization of action
xingyaoww Dec 23, 2024
73f379e
fix obs serialization
xingyaoww Dec 23, 2024
c593295
fix serialization
xingyaoww Dec 23, 2024
bf34c7e
try fix test
xingyaoww Dec 23, 2024
68ffd0c
fix test again
xingyaoww Dec 24, 2024
cf98287
Merge commit 'ecff5c67fb7f1995556f0f36f5050f33dc0953d2' into feature/…
xingyaoww Dec 24, 2024
bb9c19b
pretty print file write action
xingyaoww Dec 24, 2024
c89677d
improve util script for swebench
xingyaoww Dec 26, 2024
165ee7a
print actual visualization file path of the diff
xingyaoww Dec 26, 2024
9bc721b
fix grab test_output logic
xingyaoww Dec 26, 2024
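
Several commits in the list above (for example 522eb53, d60065b, c743833 and 28615cd) refer to reading command metadata from a JSON payload embedded in the shell prompt (PS1). The PR's actual delimiters and field names are not visible in this part of the diff, so the following is only a minimal sketch of the idea. The marker strings `###PS1JSON###` / `###PS1END###`, the helper name `parse_ps1_blocks`, and the example fields are all assumptions made for illustration.

```python
import json
import re

# Hypothetical markers; the PR's real PS1 delimiters are not shown in this excerpt.
PS1_BEGIN = "###PS1JSON###"
PS1_END = "###PS1END###"

# Non-greedy match so several prompt blocks in one capture are found separately.
PS1_PATTERN = re.compile(
    re.escape(PS1_BEGIN) + r"\s*(.*?)\s*" + re.escape(PS1_END),
    re.DOTALL,
)


def parse_ps1_blocks(captured: str) -> list[dict]:
    """Return every well-formed JSON metadata block found in the captured text."""
    blocks = []
    for match in PS1_PATTERN.finditer(captured):
        try:
            value = json.loads(match.group(1))
        except json.JSONDecodeError:
            # Skip malformed payloads instead of failing the whole parse.
            continue
        if isinstance(value, dict):
            blocks.append(value)
    return blocks


if __name__ == "__main__":
    screen = (
        "hello\n"
        '###PS1JSON###\n{"exit_code": 0, "working_dir": "/workspace"}\n###PS1END###\n'
    )
    print(parse_ps1_blocks(screen))
```

Collecting all blocks rather than only the last one keeps the caller free to decide how to treat output that contains more than one prompt, which is the situation commit d60065b ("handle and test on multiple PS1 block") appears to address.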
2 changes: 2 additions & 0 deletions .github/workflows/dummy-agent-test.yml
@@ -36,6 +36,8 @@ jobs:
  - name: Set up Docker Buildx
    id: buildx
    uses: docker/setup-buildx-action@v3
+ - name: Install tmux
+   run: sudo apt-get update && sudo apt-get install -y tmux
  - name: Install poetry via pipx
    run: pipx install poetry
  - name: Set up Python
2 changes: 2 additions & 0 deletions .github/workflows/eval-runner.yml
@@ -29,6 +29,8 @@ jobs:
  - name: Checkout repository
    uses: actions/checkout@v4

+ - name: Install tmux
+   run: sudo apt-get update && sudo apt-get install -y tmux
  - name: Install poetry via pipx
    run: pipx install poetry

2 changes: 2 additions & 0 deletions .github/workflows/py-unit-tests-mac.yml
@@ -31,6 +31,8 @@ jobs:
      key: ${{ runner.os }}-poetry-${{ hashFiles('**/poetry.lock') }}
      restore-keys: |
        ${{ runner.os }}-poetry-
+ - name: Install tmux
+   run: brew install tmux
  - name: Install poetry via pipx
    run: pipx install poetry
  - name: Install Python dependencies using Poetry
2 changes: 2 additions & 0 deletions .github/workflows/py-unit-tests.yml
@@ -30,6 +30,8 @@ jobs:
  - name: Set up Docker Buildx
    id: buildx
    uses: docker/setup-buildx-action@v3
+ - name: Install tmux
+   run: sudo apt-get update && sudo apt-get install -y tmux
  - name: Install poetry via pipx
    run: pipx install poetry
  - name: Set up Python
1 change: 0 additions & 1 deletion docs/static/img/backend_architecture.puml
@@ -123,7 +123,6 @@ class openhands.state.State {
updated_info: List[Tuple[Action, Observation]]
}
class openhands.observation.CmdOutputObservation {
- command_id: int
command: str
exit_code: int
observation: str
4 changes: 1 addition & 3 deletions evaluation/benchmarks/agent_bench/run_infer.py
@@ -135,7 +135,6 @@ def complete_runtime(

action = CmdRunAction(
command=f'chmod +x ./{script_name} && ./{script_name}',
- keep_prompt=False,
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
@@ -162,8 +161,7 @@ def complete_runtime(
logger.info(f'Running get ground truth cmd: {script_name}')

action = CmdRunAction(
- command=f'chmod +x ./{script_name} && ./{script_name}',
- keep_prompt=False,
+ command=f'chmod +x ./{script_name} && ./{script_name}'
)
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
5 changes: 1 addition & 4 deletions evaluation/benchmarks/aider_bench/run_infer.py
@@ -143,10 +143,7 @@ def complete_runtime(
)
logger.info(f'Running test file: {script_name}')

- action = CmdRunAction(
- command=f'python3 -m unittest {script_name}',
- keep_prompt=False,
- )
+ action = CmdRunAction(command=f'python3 -m unittest {script_name}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
6 changes: 2 additions & 4 deletions evaluation/benchmarks/biocoder/run_infer.py
@@ -197,7 +197,7 @@ def complete_runtime(
if obs.exit_code == 0:
test_result['metadata']['1_copy_change_success'] = True

- action = CmdRunAction(command=f'cat {generated_path}', keep_prompt=False)
+ action = CmdRunAction(command=f'cat {generated_path}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
assert obs.exit_code == 0
@@ -221,9 +221,7 @@ def complete_runtime(
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0

- action = CmdRunAction(
- command='cat /testing_files/results_biocoder.json', keep_prompt=False
- )
+ action = CmdRunAction(command='cat /testing_files/results_biocoder.json')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
if obs.exit_code == 0:
1 change: 0 additions & 1 deletion evaluation/benchmarks/bird/README.md
@@ -127,7 +127,6 @@ For each problem, OpenHands is given a set number of iterations to fix the faili
"observation": "run",
"content": "california_schools/california_schools.sqlite\r\n[(1.0,)]",
"extras": {
- "command_id": -1,
"command": "python3 0.py",
"exit_code": 0
}
10 changes: 2 additions & 8 deletions evaluation/benchmarks/bird/run_infer.py
@@ -266,10 +266,7 @@ def initialize_runtime(
runtime.copy_to(db_file, '/workspace')

# Check the database is copied
- action = CmdRunAction(
- command='cd /workspace && ls -l',
- keep_prompt=False,
- )
+ action = CmdRunAction(command='cd /workspace && ls -l')
obs = runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
@@ -298,10 +295,7 @@ def complete_runtime(
instance_id = instance.instance_id.replace('/', '__')
path = os.path.join('/workspace', f'{instance_id}.py')

- action = CmdRunAction(
- command=f'cat {path}',
- keep_prompt=False,
- )
+ action = CmdRunAction(command=f'cat {path}')
obs = runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})

3 changes: 0 additions & 3 deletions evaluation/benchmarks/humanevalfix/README.md
@@ -71,7 +71,6 @@ For each problem, OpenHands is given a set number of iterations to fix the faili
"observation": "run",
"content": "[File: /workspace/Python__2.py (14 lines total)]\r\n1:def truncate_number(number: float) -> float:\r\n2: return number % 1.0 + 1.0\r\n3:\r\n4:\r\n5:\r\n6:\r\n7:\r\n8:\r\n9:def check(truncate_number):\r\n10: assert truncate_number(3.5) == 0.5\r\n11: assert abs(truncate_number(1.33) - 0.33) < 1e-6\r\n12: assert abs(truncate_number(123.456) - 0.456) < 1e-6\r\n13:\r\n14:check(truncate_number)",
"extras": {
- "command_id": -1,
"command": "open Python__2.py",
"exit_code": 0
}
@@ -98,7 +97,6 @@ For each problem, OpenHands is given a set number of iterations to fix the faili
"observation": "run",
"content": "> > [File: /workspace/Python__2.py (14 lines total)]\r\n1:def truncate_number(number: float) -> float:\r\n2: return number % 1.0\r\n3:\r\n4:\r\n5:\r\n6:\r\n7:\r\n8:\r\n9:def check(truncate_number):\r\n10: assert truncate_number(3.5) == 0.5\r\n11: assert abs(truncate_number(1.33) - 0.33) < 1e-6\r\n12: assert abs(truncate_number(123.456) - 0.456) < 1e-6\r\n13:\r\n14:check(truncate_number)\r\nFile updated. Please review the changes and make sure they are correct (correct indentation, no duplicate lines, etc). Edit the file again if necessary.",
"extras": {
- "command_id": -1,
"command": "edit 2:2 <<EOF\n return number % 1.0\nEOF",
"exit_code": 0
}
@@ -125,7 +123,6 @@ For each problem, OpenHands is given a set number of iterations to fix the faili
"observation": "run",
"content": "",
"extras": {
- "command_id": -1,
"command": "python3 Python__2.py",
"exit_code": 0
}
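
The README examples above drop `command_id` from `extras`, and commit 8040497 mentions keeping legacy `CmdOutputObservation` data serializable. How the PR does that is not shown in this excerpt. As a rough sketch only, a loader could tolerate older records by discarding the retired key before constructing the observation; the helper name `strip_legacy_extras` and the key set are assumptions for illustration.

```python
# Keys that older serialized observations may still carry. This set is an
# assumption based on the README diff above, which removes "command_id".
LEGACY_EXTRA_KEYS = frozenset({"command_id"})


def strip_legacy_extras(event: dict) -> dict:
    """Return a copy of a serialized observation without legacy extras keys."""
    cleaned = dict(event)
    extras = cleaned.get("extras")
    if isinstance(extras, dict):
        cleaned["extras"] = {
            key: value
            for key, value in extras.items()
            if key not in LEGACY_EXTRA_KEYS
        }
    return cleaned
```

The function copies rather than mutates, so stored histories stay untouched while newer code sees only the fields it expects.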
4 changes: 1 addition & 3 deletions evaluation/benchmarks/humanevalfix/run_infer.py
@@ -169,9 +169,7 @@ def complete_runtime(
num_workers = LANGUAGE_TO_NUM_WORKERS[language]
python_imports = '\n'.join(IMPORT_HELPER[language])

- action = CmdRunAction(
- command=f'cat /workspace/{_get_instance_id(instance)}.py', keep_prompt=False
- )
+ action = CmdRunAction(command=f'cat /workspace/{_get_instance_id(instance)}.py')
obs = runtime.run_action(action)
assert obs.exit_code == 0

2 changes: 1 addition & 1 deletion evaluation/benchmarks/ml_bench/run_infer.py
@@ -161,7 +161,7 @@ def complete_runtime(
eval_script = os.path.join(task_path, 'run.sh')
logger.info(f'Running evaluation script: {eval_script}')

- action = CmdRunAction(command=f'cat {eval_script}', keep_prompt=False)
+ action = CmdRunAction(command=f'cat {eval_script}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)
if obs.exit_code == 0:
10 changes: 2 additions & 8 deletions evaluation/benchmarks/scienceagentbench/run_infer.py
@@ -121,10 +121,7 @@ def initialize_runtime(
runtime.copy_to(dataset_dir, '/workspace/benchmark/datasets', recursive=True)

# Check the dataset exists
- action = CmdRunAction(
- command='cd /workspace/benchmark/datasets && ls',
- keep_prompt=False,
- )
+ action = CmdRunAction(command='cd /workspace/benchmark/datasets && ls')
obs = runtime.run_action(action)
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert obs.exit_code == 0
@@ -154,10 +151,7 @@ def complete_runtime(

assert obs.exit_code == 0

- action = CmdRunAction(
- command=f'cat pred_programs/{instance.pred_program_name}',
- keep_prompt=False,
- )
+ action = CmdRunAction(command=f'cat pred_programs/{instance.pred_program_name}')
logger.info(action, extra={'msg_type': 'ACTION'})
obs = runtime.run_action(action)

10 changes: 4 additions & 6 deletions evaluation/benchmarks/swe_bench/eval_infer.py
@@ -177,7 +177,7 @@ def process_instance(
"(patch --batch --fuzz=5 -p1 -i /tmp/patch.diff && echo 'APPLY_PATCH_PASS' || "
"echo 'APPLY_PATCH_FAIL')))"
)
- action = CmdRunAction(command=exec_command, keep_prompt=False)
+ action = CmdRunAction(command=exec_command)
action.timeout = 600
obs = runtime.run_action(action)
assert isinstance(obs, CmdOutputObservation)
@@ -200,9 +200,7 @@ def process_instance(

# Run eval script in background and save output to log file
log_file = '/tmp/eval_output.log'
- action = CmdRunAction(
- command=f'/tmp/eval.sh > {log_file} 2>&1 & echo $!', keep_prompt=False
- )
+ action = CmdRunAction(command=f'/tmp/eval.sh > {log_file} 2>&1 & echo $!')
action.timeout = 60 # Short timeout just to get the process ID
obs = runtime.run_action(action)

@@ -224,7 +222,7 @@ def process_instance(
instance['test_result']['report']['test_timeout'] = True
break
check_action = CmdRunAction(
- command=f'ps -p {pid} > /dev/null; echo $?', keep_prompt=False
+ command=f'ps -p {pid} > /dev/null; echo $?'
)
check_action.timeout = 60
check_obs = runtime.run_action(check_action)
@@ -242,7 +240,7 @@ def process_instance(
time.sleep(30) # Wait for 30 seconds before checking again

# Read the log file
- cat_action = CmdRunAction(command=f'cat {log_file}', keep_prompt=False)
+ cat_action = CmdRunAction(command=f'cat {log_file}')
cat_action.timeout = 300
cat_obs = runtime.run_action(cat_action)
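
The hunks above start the eval script in the background with `... & echo $!`, poll `ps -p {pid}` until the process is gone or a deadline passes, and then `cat` the log file. The control flow on its own can be sketched as below. This is not the file's actual code: `Runner` is a hypothetical stand-in for `runtime.run_action` that returns `(exit_code, output)`, the function name is made up, and the 1800-second default timeout is an assumption (only the 30-second sleep appears in the diff).

```python
import time
from typing import Callable

# Hypothetical stand-in for runtime.run_action: takes a shell command and
# returns (exit_code, output).
Runner = Callable[[str], tuple[int, str]]


def wait_for_background_job(
    run: Runner,
    script: str,
    log_file: str,
    timeout_s: float = 1800.0,  # assumed value, not taken from the diff
    poll_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[bool, str]:
    """Launch `script` in the background, poll its PID, then return its log.

    Returns (timed_out, log_contents).
    """
    # `& echo $!` makes the shell print the PID of the job it just backgrounded.
    _, out = run(f"{script} > {log_file} 2>&1 & echo $!")
    pid = int(out.strip().splitlines()[-1])

    deadline = time.monotonic() + timeout_s
    timed_out = False
    while True:
        # `ps -p PID` exits non-zero once the process is gone; `echo $?` turns
        # that status into text so it can be read from the command output.
        _, status = run(f"ps -p {pid} > /dev/null; echo $?")
        if status.strip() != "0":
            break
        if time.monotonic() >= deadline:
            timed_out = True
            break
        sleep(poll_s)

    _, log = run(f"cat {log_file}")
    return timed_out, log
```

Passing the runner and the sleep function in as parameters keeps the polling logic testable without a real shell or real waiting.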

16 changes: 13 additions & 3 deletions evaluation/benchmarks/swe_bench/run_infer.py
@@ -282,6 +282,16 @@ def initialize_runtime(
logger.info(obs, extra={'msg_type': 'OBSERVATION'})
assert_and_raise(obs.exit_code == 0, f'Failed to remove git remotes: {str(obs)}')

+ action = CmdRunAction(command='which python')
+ action.timeout = 600
+ logger.info(action, extra={'msg_type': 'ACTION'})
+ obs = runtime.run_action(action)
+ logger.info(obs, extra={'msg_type': 'OBSERVATION'})
+ assert_and_raise(
+ obs.exit_code == 0 and 'testbed' in obs.content,
+ f'Expected to find python interpreter from testbed, but got: {str(obs)}',
+ )
+
logger.info('-' * 30)
logger.info('END Runtime Initialization Fn')
logger.info('-' * 30)
@@ -337,8 +347,7 @@ def complete_runtime(
git_patch = None
while n_retries < 5:
action = CmdRunAction(
- command=f'git diff --no-color --cached {instance["base_commit"]}',
- keep_prompt=False,
+ command=f'git diff --no-color --cached {instance["base_commit"]}'
)
action.timeout = 600 + 100 * n_retries
logger.info(action, extra={'msg_type': 'ACTION'})
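
The loop in this hunk retries `git diff` while `n_retries < 5` and widens the timeout by 100 seconds per attempt (`600 + 100 * n_retries`). How the loop decides to stop is not visible in the excerpt, so the sketch below assumes a hypothetical `attempt` callable that returns `None` on failure; only the timeout arithmetic is taken from the diff.

```python
from typing import Callable, Optional


def run_with_growing_timeout(
    attempt: Callable[[int], Optional[str]],
    max_retries: int = 5,
    base_timeout: int = 600,
    step: int = 100,
) -> Optional[str]:
    """Call attempt(timeout) with a growing timeout until it returns a result."""
    for n_retries in range(max_retries):
        # Same arithmetic as the hunk above: 600, 700, 800, 900, 1000 seconds.
        timeout = base_timeout + step * n_retries
        result = attempt(timeout)
        if result is not None:
            return result
    return None
```

With the defaults, a caller that fails twice and then succeeds sees timeouts of 600, 700 and 800 seconds.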
@@ -385,7 +394,7 @@ def process_instance(
if runtime_failure_count > 0:
config.sandbox.remote_runtime_resource_factor = min(
config.sandbox.remote_runtime_resource_factor * (2**runtime_failure_count),
- 2, # hardcode maximum resource factor to 2
+ 4, # hardcode maximum resource factor to 4
)
logger.warning(
f'This is the second attempt for instance {instance.instance_id}, setting resource factor to {config.sandbox.remote_runtime_resource_factor}'
@@ -535,4 +544,5 @@ def filter_dataset(dataset: pd.DataFrame, filter_column: str) -> pd.DataFrame:
args.eval_num_workers,
process_instance,
timeout_seconds=120 * 60, # 2 hour PER instance should be more than enough
+ max_retries=5,
)
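
The `process_instance` hunk earlier in this file multiplies `remote_runtime_resource_factor` by `2**runtime_failure_count` and raises the cap from 2 to 4. The arithmetic in isolation looks roughly like this; the helper name `next_resource_factor` is hypothetical and exists only to make the calculation easy to check.

```python
def next_resource_factor(current: float, failure_count: int, cap: float = 4) -> float:
    """Scale `current` by 2**failure_count, never exceeding `cap`."""
    if failure_count <= 0:
        # The diff only adjusts the factor when runtime_failure_count > 0.
        return current
    return min(current * (2 ** failure_count), cap)


if __name__ == "__main__":
    for failures in range(4):
        print(failures, next_resource_factor(1, failures))
```

Starting from a factor of 1, one failure gives 2 and two failures give 4, after which the cap holds. Under the previous cap of 2, a second failure could not raise the factor any further.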
2 changes: 1 addition & 1 deletion evaluation/integration_tests/tests/t01_fix_simple_typo.py
@@ -24,7 +24,7 @@ def initialize_runtime(cls, runtime: Runtime) -> None:
@classmethod
def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
# check if the file /workspace/bad.txt has been fixed
- action = CmdRunAction(command='cat /workspace/bad.txt', keep_prompt=False)
+ action = CmdRunAction(command='cat /workspace/bad.txt')
obs = runtime.run_action(action)
if obs.exit_code != 0:
return TestResult(
6 changes: 3 additions & 3 deletions evaluation/integration_tests/tests/t02_add_bash_hello.py
@@ -10,14 +10,14 @@ class Test(BaseIntegrationTest):

@classmethod
def initialize_runtime(cls, runtime: Runtime) -> None:
- action = CmdRunAction(command='mkdir -p /workspace', keep_prompt=False)
+ action = CmdRunAction(command='mkdir -p /workspace')
obs = runtime.run_action(action)
assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')

@classmethod
def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
# check if the file /workspace/hello.sh exists
- action = CmdRunAction(command='cat /workspace/hello.sh', keep_prompt=False)
+ action = CmdRunAction(command='cat /workspace/hello.sh')
obs = runtime.run_action(action)
if obs.exit_code != 0:
return TestResult(
@@ -26,7 +26,7 @@ def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
)

# execute the script
- action = CmdRunAction(command='bash /workspace/hello.sh', keep_prompt=False)
+ action = CmdRunAction(command='bash /workspace/hello.sh')
obs = runtime.run_action(action)
if obs.exit_code != 0:
return TestResult(
6 changes: 3 additions & 3 deletions evaluation/integration_tests/tests/t03_jupyter_write_file.py
@@ -10,14 +10,14 @@ class Test(BaseIntegrationTest):

@classmethod
def initialize_runtime(cls, runtime: Runtime) -> None:
- action = CmdRunAction(command='mkdir -p /workspace', keep_prompt=False)
+ action = CmdRunAction(command='mkdir -p /workspace')
obs = runtime.run_action(action)
assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')

@classmethod
def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
# check if the file /workspace/hello.sh exists
- action = CmdRunAction(command='cat /workspace/test.txt', keep_prompt=False)
+ action = CmdRunAction(command='cat /workspace/test.txt')
obs = runtime.run_action(action)
if obs.exit_code != 0:
return TestResult(
@@ -26,7 +26,7 @@ def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
)

# execute the script
- action = CmdRunAction(command='cat /workspace/test.txt', keep_prompt=False)
+ action = CmdRunAction(command='cat /workspace/test.txt')
obs = runtime.run_action(action)

if obs.exit_code != 0:
14 changes: 6 additions & 8 deletions evaluation/integration_tests/tests/t04_git_staging.py
@@ -10,31 +10,29 @@ class Test(BaseIntegrationTest):

@classmethod
def initialize_runtime(cls, runtime: Runtime) -> None:
- action = CmdRunAction(command='mkdir -p /workspace', keep_prompt=False)
+ action = CmdRunAction(command='mkdir -p /workspace')
obs = runtime.run_action(action)
assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')

# git init
- action = CmdRunAction(command='git init', keep_prompt=False)
+ action = CmdRunAction(command='git init')
obs = runtime.run_action(action)
assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')

# create README.md
- action = CmdRunAction(
- command='echo \'print("hello world")\' > hello.py', keep_prompt=False
- )
+ action = CmdRunAction(command='echo \'print("hello world")\' > hello.py')
obs = runtime.run_action(action)
assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')

# git add README.md
- action = CmdRunAction(command='git add hello.py', keep_prompt=False)
+ action = CmdRunAction(command='git add hello.py')
obs = runtime.run_action(action)
assert_and_raise(obs.exit_code == 0, f'Failed to run command: {obs.content}')

@classmethod
def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
# check if the file /workspace/hello.py exists
- action = CmdRunAction(command='cat /workspace/hello.py', keep_prompt=False)
+ action = CmdRunAction(command='cat /workspace/hello.py')
obs = runtime.run_action(action)
if obs.exit_code != 0:
return TestResult(
@@ -43,7 +41,7 @@ def verify_result(cls, runtime: Runtime, histories: list[Event]) -> TestResult:
)

# check if the staging area is empty
- action = CmdRunAction(command='git status', keep_prompt=False)
+ action = CmdRunAction(command='git status')
obs = runtime.run_action(action)
if obs.exit_code != 0:
return TestResult(