Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug Report] Stale values in scene robot data #240

Closed
2 of 3 tasks
amacati opened this issue Feb 12, 2024 · 5 comments
Closed
2 of 3 tasks

[Bug Report] Stale values in scene robot data #240

amacati opened this issue Feb 12, 2024 · 5 comments
Assignees
Labels
question Further information is requested wontfix This will not be worked on

Comments

@amacati
Copy link

amacati commented Feb 12, 2024

Describe the bug

The data in env.scene["robot"].data is sometimes (or potentially always) stale. This can be seen when using observations that extract cartesian positions of robot frames.

Steps to reproduce

You can see the impact in the following example. It creates the FrankaCubeLiftEnv, but instead of the usual observation space it uses only the end effector world position as observation.

When resetting the environment, the end effector world position is exactly zero. This cannot be, as the robot's default joint configuration does not place the end effector at [0, 0, 0] in world coordinates.

from omni.isaac.orbit.app import AppLauncher

app_launcher = AppLauncher({"headless": False})

import torch
import gymnasium

from omni.isaac.orbit.managers import ObservationGroupCfg, ObservationTermCfg
from omni.isaac.orbit_tasks.manipulation.lift.config.franka.ik_rel_env_cfg import FrankaCubeLiftEnvCfg
from omni.isaac.orbit_tasks.manipulation.lift.lift_env_cfg import ObjectTableSceneCfg
from omni.isaac.orbit_tasks.manipulation.lift.lift_env_cfg import ObservationsCfg
from omni.isaac.orbit.utils import configclass


def ee_pos_w(env) -> torch.Tensor:
    return env.scene["robot"].data.body_pos_w[:, 8, :3]


@configclass
class ObservationsCfg():

    @configclass
    class ObsCfg(ObservationGroupCfg):
        ee_pos_w = ObservationTermCfg(func=ee_pos_w)

        def __post_init__(self):
            self.concatenate_terms = True

    obs = ObsCfg()


def main():
    scene = ObjectTableSceneCfg(num_envs=1, env_spacing=2.5, replicate_physics=False)
    observations = ObservationsCfg()
    env_cfg = FrankaCubeLiftEnvCfg(scene=scene, observations=observations)
    env = gymnasium.make("Isaac-Lift-Cube-Franka-v0", cfg=env_cfg)

    obs, _ = env.reset()
    assert not torch.all(obs["obs"] == 0), "Robot end effector position is all zeros"


if __name__ == "__main__":
    main()
[INFO]: Completed setting up the environment...
Traceback (most recent call last):
  File "/home/amacati/repos/lsy_rl/debug.py", line 43, in <module>
    main()
  File "/home/amacati/repos/lsy_rl/debug.py", line 39, in main
    assert not torch.all(obs["obs"] == 0), "Robot end effector position is all zeros"
AssertionError: Robot end effector position is all zeros

Furthermore, when the simulation runs for two episodes, the first observation of the second episode from calling env.reset() has the end effector at the exact same coordinates as the last observation of the first episode. This cannot be, as the robot has made random movements throughout the episode. The buffer seems to have never been cleared. This leads me to suspect that either the buffer is not properly written to during env.reset(), or that it always lags behind by one environment step. Either of those two options leads to false observations.

from omni.isaac.orbit.app import AppLauncher

app_launcher = AppLauncher({"headless": False})

import torch
import gymnasium

from omni.isaac.orbit.managers import ObservationGroupCfg, ObservationTermCfg
from omni.isaac.orbit_tasks.manipulation.lift.config.franka.ik_rel_env_cfg import FrankaCubeLiftEnvCfg
from omni.isaac.orbit_tasks.manipulation.lift.lift_env_cfg import ObjectTableSceneCfg
from omni.isaac.orbit_tasks.manipulation.lift.lift_env_cfg import ObservationsCfg
from omni.isaac.orbit.utils import configclass


def ee_pos_w(env) -> torch.Tensor:
    return env.scene["robot"].data.body_pos_w[:, 8, :3]


@configclass
class ObservationsCfg():

    @configclass
    class ObsCfg(ObservationGroupCfg):
        ee_pos_w = ObservationTermCfg(func=ee_pos_w)

        def __post_init__(self):
            self.concatenate_terms = True

    obs = ObsCfg()


def main():
    scene = ObjectTableSceneCfg(num_envs=1, env_spacing=2.5, replicate_physics=False)
    observations = ObservationsCfg()
    env_cfg = FrankaCubeLiftEnvCfg(scene=scene, observations=observations)
    env = gymnasium.make("Isaac-Lift-Cube-Franka-v0", cfg=env_cfg)

    final_obs = None
    for _ in range(2):
        done = False
        obs, _ = env.reset()
        if final_obs is not None:
            if torch.all(obs["obs"] == final_obs["obs"]):
                raise RuntimeError("Stale observation detected. Wrist coordinates are identical")
        while not done:
            action = torch.as_tensor(env.action_space.sample()).cuda()
            next_obs, _, terminated, truncated, _ = env.step(action)
            done = torch.any(terminated | truncated)
            if done:
                final_obs = next_obs


if __name__ == "__main__":
    main()
[INFO]: Completed setting up the environment...
Traceback (most recent call last):
  File "/home/amacati/repos/lsy_rl/debug.py", line 54, in <module>
    main()
  File "/home/amacati/repos/lsy_rl/debug.py", line 44, in main
    raise RuntimeError("Stale observation detected. Wrist coordinates are identical")
RuntimeError: Stale observation detected. Wrist coordinates are identical

Interestingly enough, this problem does not manifest itself in the object data as can be checked by looking at the object world position instead of the end effector position. It does however affect more members of env.scene["robot"].data, e.g. root_state_w.

from omni.isaac.orbit.app import AppLauncher

app_launcher = AppLauncher({"headless": False})

import torch
import gymnasium

from omni.isaac.orbit.managers import ObservationGroupCfg, ObservationTermCfg
from omni.isaac.orbit_tasks.manipulation.lift.config.franka.ik_rel_env_cfg import FrankaCubeLiftEnvCfg
from omni.isaac.orbit_tasks.manipulation.lift.lift_env_cfg import ObjectTableSceneCfg
from omni.isaac.orbit_tasks.manipulation.lift.lift_env_cfg import ObservationsCfg
from omni.isaac.orbit.utils import configclass


def object_pos_w(env) -> torch.Tensor:
    return env.scene["object"].data.root_pos_w[:, :3]


@configclass
class ObservationsCfg():

    @configclass
    class ObsCfg(ObservationGroupCfg):
        object_pos_w = ObservationTermCfg(func=object_pos_w)

        def __post_init__(self):
            self.concatenate_terms = True

    obs = ObsCfg()


def main():
    scene = ObjectTableSceneCfg(num_envs=1, env_spacing=2.5, replicate_physics=False)
    observations = ObservationsCfg()
    env_cfg = FrankaCubeLiftEnvCfg(scene=scene, observations=observations)
    env = gymnasium.make("Isaac-Lift-Cube-Franka-v0", cfg=env_cfg)

    final_obs = None
    for _ in range(2):
        done = False
        obs, _ = env.reset()
        if final_obs is not None:
            if torch.all(obs["obs"] == final_obs["obs"]):
                raise RuntimeError("Stale observation detected. Object coordinates are identical")
        while not done:
            action = torch.as_tensor(env.action_space.sample()).cuda()
            next_obs, _, terminated, truncated, _ = env.step(action)
            done = torch.any(terminated | truncated)
            if done:
                final_obs = next_obs


if __name__ == "__main__":
    main()

(Finishes without any issue)

System Info

Describe the characteristic of your environment:

  • Commit: 62aeea1
  • Isaac Sim Version: 2023.1.1
  • OS: Ubuntu 20.04
  • GPU: RTX 4090
  • CUDA: 12.0
  • GPU Driver: 525.147.05

Checklist

  • I have checked that there is no similar issue in the repo (required)
  • I have checked that the issue is not in running Isaac Sim itself and is related to the repo

Acceptance Criteria

Add the criteria for which this task is considered done. If not known at issue creation time, you can add this once the issue is assigned.

  • The data in env.scene["robot"] is updated correctly on reset
@Mayankm96
Copy link
Contributor

Mayankm96 commented Feb 12, 2024

Hi @amacati ,

Many physics engines do a simulation step as a two-level call: forward() and simulate(), where the kinematic and dynamic states are updated, respectively. Unfortunately, PhysX has only a single step() call where these two operations are combined. Due to computations through GPU kernels, it is not so straightforward for them to split these operations. Thus, at the moment, it is not possible to set the joint states and do a forward call to update the kinematic states of links.

As you have noticed, the issue mainly appears in kinematic chains, where you'd expect the link states to be updated correctly when you update the root or joint states. This affects both initialization as well as episodic resets. For root and joint states, the setters directly write into the data field as well as the PhysX buffers. Thus, those values appear "updated."

While this is erroneous, there is currently no direct workaround for it. From our experience in using IsaacGym, the reset values are crucial. However, it does not affect the agent learning critically enough.

If you always need the link states to be correct, a possible solution is to make a warp simulation model that is detached from PhysX, and you can call it to update the link states whenever you need. It's not ideal, but that's a possible workaround for now. We have made a feature request to the PhysX team to have separated forward and simulate calls like other physics engines. However, at this point, there is no set timeline for this feature request.

@Mayankm96 Mayankm96 added question Further information is requested wontfix This will not be worked on and removed wontfix This will not be worked on labels Feb 12, 2024
@Mayankm96 Mayankm96 self-assigned this Feb 12, 2024
@amacati
Copy link
Author

amacati commented Feb 12, 2024

Hi @Mayankm96 ,

Thanks a lot for the swift reply! Your explanation makes a lot of sense, but it's a bummer nevertheless. I think it would be great to place a warning in the docs somewhere so users know about this limitation.

Also, if I understood you correctly, the reset() function just "queues" a reset by overwriting the joint states, and then gets applied during the next step() call. That means the buffers from t=1,... are, in fact, back in sync again, right? I was worried they might always lag behind by one time step, which would be pretty bad.

I guess I will go with a wrapper for now that fuses each reset() with a single step(). Thanks again for your detailed answer, feel free to close the issue at any time. Just letting it open for know so people with the same question can find it more easily.

@Mayankm96
Copy link
Contributor

Mayankm96 commented Feb 12, 2024

Good point. I have made an MR #241 based on the reply above.

I would be careful about putting a step() following a reset() call. This only works well when you have a single environment. Since right now you may have multiple environment instances running simultaneously (i.e. sharing the same stage), calling step() will cause all of them to do a forward stepping with cached commands from the last time. This may not be what you want. Depending on how often "reset" happens, this may adversely affect learning.

Mayankm96 added a commit that referenced this issue Feb 12, 2024
# Description

As pointed out in #234 and #240, the environment returns stale values on
reset. Since there is no immediate fix for this, this MR adds the issue
to "Known issues" page in the documentation.

## Type of change

- This change requires a documentation update

## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with
`./orbit.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [x] I have run all the tests with `./orbit.sh --test` and they pass
- [ ] I have updated the changelog and the corresponding version in the
extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already
exists there
@amacati
Copy link
Author

amacati commented Feb 12, 2024

True. I'm working with a constant time horizon task that always truncates though, so I think it should be fine.

@Mayankm96 Mayankm96 added the wontfix This will not be worked on label Feb 12, 2024
@Mayankm96
Copy link
Contributor

I see. Then it is fine.

Closing this issue now as it isn't fixable on Orbit side directly.

Junfeng-Long pushed a commit to Junfeng-Long/IsaacLab that referenced this issue Feb 27, 2024
This MR fixes a typo in `RigidObject` where for a rigid object, the private member `_body_view` is not defined (since it should be using the `root_view`). The fix changes the reference to be the public `body_view` property.

- Bug fix (non-breaking change which fixes an issue)

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./orbit.sh --format`
- [ ] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [x] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already exists there
fatimaanes pushed a commit to fatimaanes/omniperf that referenced this issue Aug 8, 2024
This MR fixes a typo in `RigidObject` where for a rigid object, the private member `_body_view` is not defined (since it should be using the `root_view`). The fix changes the reference to be the public `body_view` property.

- Bug fix (non-breaking change which fixes an issue)

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./orbit.sh --format`
- [ ] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [x] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already exists there
fatimaanes pushed a commit to fatimaanes/omniperf that referenced this issue Aug 8, 2024
# Description

As pointed out in isaac-sim#234 and isaac-sim#240, the environment returns stale values on
reset. Since there is no immediate fix for this, this MR adds the issue
to "Known issues" page in the documentation.

## Type of change

- This change requires a documentation update

## Checklist

- [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with
`./orbit.sh --format`
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my
feature works
- [x] I have run all the tests with `./orbit.sh --test` and they pass
- [ ] I have updated the changelog and the corresponding version in the
extension's `config/extension.toml` file
- [x] I have added my name to the `CONTRIBUTORS.md` or my name already
exists there
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested wontfix This will not be worked on
Projects
None yet
Development

No branches or pull requests

2 participants