[Bug Report] Stale values in scene robot data #240

amacati · 2024-02-12T13:08:09Z

Describe the bug

The data in env.scene["robot"].data is sometimes (or potentially always) stale. This can be seen when using observations that extract cartesian positions of robot frames.

Steps to reproduce

You can see the impact in the following example. It creates the FrankaCubeLiftEnv, but instead of the usual observation space it uses only the end effector world position as observation.

When resetting the environment, the end effector world position is exactly zero. This cannot be, as the robot's default joint configuration does not place the end effector at [0, 0, 0] in world coordinates.

from omni.isaac.orbit.app import AppLauncher

app_launcher = AppLauncher({"headless": False})

import torch
import gymnasium

from omni.isaac.orbit.managers import ObservationGroupCfg, ObservationTermCfg
from omni.isaac.orbit_tasks.manipulation.lift.config.franka.ik_rel_env_cfg import FrankaCubeLiftEnvCfg
from omni.isaac.orbit_tasks.manipulation.lift.lift_env_cfg import ObjectTableSceneCfg
from omni.isaac.orbit_tasks.manipulation.lift.lift_env_cfg import ObservationsCfg
from omni.isaac.orbit.utils import configclass


def ee_pos_w(env) -> torch.Tensor:
    return env.scene["robot"].data.body_pos_w[:, 8, :3]


@configclass
class ObservationsCfg():

    @configclass
    class ObsCfg(ObservationGroupCfg):
        ee_pos_w = ObservationTermCfg(func=ee_pos_w)

        def __post_init__(self):
            self.concatenate_terms = True

    obs = ObsCfg()


def main():
    scene = ObjectTableSceneCfg(num_envs=1, env_spacing=2.5, replicate_physics=False)
    observations = ObservationsCfg()
    env_cfg = FrankaCubeLiftEnvCfg(scene=scene, observations=observations)
    env = gymnasium.make("Isaac-Lift-Cube-Franka-v0", cfg=env_cfg)

    obs, _ = env.reset()
    assert not torch.all(obs["obs"] == 0), "Robot end effector position is all zeros"


if __name__ == "__main__":
    main()

[INFO]: Completed setting up the environment...
Traceback (most recent call last):
  File "/home/amacati/repos/lsy_rl/debug.py", line 43, in <module>
    main()
  File "/home/amacati/repos/lsy_rl/debug.py", line 39, in main
    assert not torch.all(obs["obs"] == 0), "Robot end effector position is all zeros"
AssertionError: Robot end effector position is all zeros

Furthermore, when the simulation runs for two episodes, the first observation of the second episode from calling env.reset() has the end effector at the exact same coordinates as the last observation of the first episode. This cannot be, as the robot has made random movements throughout the episode. The buffer seems to have never been cleared. This leads me to suspect that either the buffer is not properly written to during env.reset(), or that it always lags behind by one environment step. Either of those two options leads to false observations.

from omni.isaac.orbit.app import AppLauncher

app_launcher = AppLauncher({"headless": False})

import torch
import gymnasium

from omni.isaac.orbit.managers import ObservationGroupCfg, ObservationTermCfg
from omni.isaac.orbit_tasks.manipulation.lift.config.franka.ik_rel_env_cfg import FrankaCubeLiftEnvCfg
from omni.isaac.orbit_tasks.manipulation.lift.lift_env_cfg import ObjectTableSceneCfg
from omni.isaac.orbit_tasks.manipulation.lift.lift_env_cfg import ObservationsCfg
from omni.isaac.orbit.utils import configclass


def ee_pos_w(env) -> torch.Tensor:
    return env.scene["robot"].data.body_pos_w[:, 8, :3]


@configclass
class ObservationsCfg():

    @configclass
    class ObsCfg(ObservationGroupCfg):
        ee_pos_w = ObservationTermCfg(func=ee_pos_w)

        def __post_init__(self):
            self.concatenate_terms = True

    obs = ObsCfg()


def main():
    scene = ObjectTableSceneCfg(num_envs=1, env_spacing=2.5, replicate_physics=False)
    observations = ObservationsCfg()
    env_cfg = FrankaCubeLiftEnvCfg(scene=scene, observations=observations)
    env = gymnasium.make("Isaac-Lift-Cube-Franka-v0", cfg=env_cfg)

    final_obs = None
    for _ in range(2):
        done = False
        obs, _ = env.reset()
        if final_obs is not None:
            if torch.all(obs["obs"] == final_obs["obs"]):
                raise RuntimeError("Stale observation detected. Wrist coordinates are identical")
        while not done:
            action = torch.as_tensor(env.action_space.sample()).cuda()
            next_obs, _, terminated, truncated, _ = env.step(action)
            done = torch.any(terminated | truncated)
            if done:
                final_obs = next_obs


if __name__ == "__main__":
    main()

[INFO]: Completed setting up the environment...
Traceback (most recent call last):
  File "/home/amacati/repos/lsy_rl/debug.py", line 54, in <module>
    main()
  File "/home/amacati/repos/lsy_rl/debug.py", line 44, in main
    raise RuntimeError("Stale observation detected. Wrist coordinates are identical")
RuntimeError: Stale observation detected. Wrist coordinates are identical

Interestingly enough, this problem does not manifest itself in the object data as can be checked by looking at the object world position instead of the end effector position. It does however affect more members of env.scene["robot"].data, e.g. root_state_w.

from omni.isaac.orbit.app import AppLauncher

app_launcher = AppLauncher({"headless": False})

import torch
import gymnasium

from omni.isaac.orbit.managers import ObservationGroupCfg, ObservationTermCfg
from omni.isaac.orbit_tasks.manipulation.lift.config.franka.ik_rel_env_cfg import FrankaCubeLiftEnvCfg
from omni.isaac.orbit_tasks.manipulation.lift.lift_env_cfg import ObjectTableSceneCfg
from omni.isaac.orbit_tasks.manipulation.lift.lift_env_cfg import ObservationsCfg
from omni.isaac.orbit.utils import configclass


def object_pos_w(env) -> torch.Tensor:
    return env.scene["object"].data.root_pos_w[:, :3]


@configclass
class ObservationsCfg():

    @configclass
    class ObsCfg(ObservationGroupCfg):
        object_pos_w = ObservationTermCfg(func=object_pos_w)

        def __post_init__(self):
            self.concatenate_terms = True

    obs = ObsCfg()


def main():
    scene = ObjectTableSceneCfg(num_envs=1, env_spacing=2.5, replicate_physics=False)
    observations = ObservationsCfg()
    env_cfg = FrankaCubeLiftEnvCfg(scene=scene, observations=observations)
    env = gymnasium.make("Isaac-Lift-Cube-Franka-v0", cfg=env_cfg)

    final_obs = None
    for _ in range(2):
        done = False
        obs, _ = env.reset()
        if final_obs is not None:
            if torch.all(obs["obs"] == final_obs["obs"]):
                raise RuntimeError("Stale observation detected. Object coordinates are identical")
        while not done:
            action = torch.as_tensor(env.action_space.sample()).cuda()
            next_obs, _, terminated, truncated, _ = env.step(action)
            done = torch.any(terminated | truncated)
            if done:
                final_obs = next_obs


if __name__ == "__main__":
    main()

(Finishes without any issue)

System Info

Describe the characteristic of your environment:

Commit: 62aeea1
Isaac Sim Version: 2023.1.1
OS: Ubuntu 20.04
GPU: RTX 4090
CUDA: 12.0
GPU Driver: 525.147.05

Checklist

I have checked that there is no similar issue in the repo (required)
I have checked that the issue is not in running Isaac Sim itself and is related to the repo

Acceptance Criteria

Add the criteria for which this task is considered done. If not known at issue creation time, you can add this once the issue is assigned.

The data in env.scene["robot"] is updated correctly on reset

The text was updated successfully, but these errors were encountered:

Mayankm96 · 2024-02-12T13:31:34Z

Hi @amacati ,

Many physics engines do a simulation step as a two-level call: forward() and simulate(), where the kinematic and dynamic states are updated, respectively. Unfortunately, PhysX has only a single step() call where these two operations are combined. Due to computations through GPU kernels, it is not so straightforward for them to split these operations. Thus, at the moment, it is not possible to set the joint states and do a forward call to update the kinematic states of links.

As you have noticed, the issue mainly appears in kinematic chains, where you'd expect the link states to be updated correctly when you update the root or joint states. This affects both initialization as well as episodic resets. For root and joint states, the setters directly write into the data field as well as the PhysX buffers. Thus, those values appear "updated."

While this is erroneous, there is currently no direct workaround for it. From our experience in using IsaacGym, the reset values are crucial. However, it does not affect the agent learning critically enough.

If you always need the link states to be correct, a possible solution is to make a warp simulation model that is detached from PhysX, and you can call it to update the link states whenever you need. It's not ideal, but that's a possible workaround for now. We have made a feature request to the PhysX team to have separated forward and simulate calls like other physics engines. However, at this point, there is no set timeline for this feature request.

amacati · 2024-02-12T14:07:07Z

Hi @Mayankm96 ,

Thanks a lot for the swift reply! Your explanation makes a lot of sense, but it's a bummer nevertheless. I think it would be great to place a warning in the docs somewhere so users know about this limitation.

Also, if I understood you correctly, the reset() function just "queues" a reset by overwriting the joint states, and then gets applied during the next step() call. That means the buffers from t=1,... are, in fact, back in sync again, right? I was worried they might always lag behind by one time step, which would be pretty bad.

I guess I will go with a wrapper for now that fuses each reset() with a single step(). Thanks again for your detailed answer, feel free to close the issue at any time. Just letting it open for know so people with the same question can find it more easily.

Mayankm96 · 2024-02-12T14:39:30Z

Good point. I have made an MR #241 based on the reply above.

I would be careful about putting a step() following a reset() call. This only works well when you have a single environment. Since right now you may have multiple environment instances running simultaneously (i.e. sharing the same stage), calling step() will cause all of them to do a forward stepping with cached commands from the last time. This may not be what you want. Depending on how often "reset" happens, this may adversely affect learning.

# Description As pointed out in #234 and #240, the environment returns stale values on reset. Since there is no immediate fix for this, this MR adds the issue to "Known issues" page in the documentation. ## Type of change - This change requires a documentation update ## Checklist - [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./orbit.sh --format` - [x] I have made corresponding changes to the documentation - [x] My changes generate no new warnings - [ ] I have added tests that prove my fix is effective or that my feature works - [x] I have run all the tests with `./orbit.sh --test` and they pass - [ ] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file - [x] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

amacati · 2024-02-12T17:07:44Z

True. I'm working with a constant time horizon task that always truncates though, so I think it should be fine.

Mayankm96 · 2024-02-12T21:58:31Z

I see. Then it is fine.

Closing this issue now as it isn't fixable on Orbit side directly.

This MR fixes a typo in `RigidObject` where for a rigid object, the private member `_body_view` is not defined (since it should be using the `root_view`). The fix changes the reference to be the public `body_view` property. - Bug fix (non-breaking change which fixes an issue) - [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./orbit.sh --format` - [ ] I have made corresponding changes to the documentation - [x] My changes generate no new warnings - [ ] I have added tests that prove my fix is effective or that my feature works - [x] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file - [x] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

# Description As pointed out in isaac-sim#234 and isaac-sim#240, the environment returns stale values on reset. Since there is no immediate fix for this, this MR adds the issue to "Known issues" page in the documentation. ## Type of change - This change requires a documentation update ## Checklist - [x] I have run the [`pre-commit` checks](https://pre-commit.com/) with `./orbit.sh --format` - [x] I have made corresponding changes to the documentation - [x] My changes generate no new warnings - [ ] I have added tests that prove my fix is effective or that my feature works - [x] I have run all the tests with `./orbit.sh --test` and they pass - [ ] I have updated the changelog and the corresponding version in the extension's `config/extension.toml` file - [x] I have added my name to the `CONTRIBUTORS.md` or my name already exists there

Mayankm96 mentioned this issue Feb 12, 2024

[Question] Sensor reset #234

Closed

Mayankm96 added question Further information is requested wontfix This will not be worked on and removed wontfix This will not be worked on labels Feb 12, 2024

Mayankm96 self-assigned this Feb 12, 2024

Mayankm96 mentioned this issue Feb 12, 2024

Adds known issue docs for simulation stepping #241

Merged

7 tasks

Mayankm96 added the wontfix This will not be worked on label Feb 12, 2024

Mayankm96 closed this as completed Feb 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug Report] Stale values in scene robot data #240

[Bug Report] Stale values in scene robot data #240

amacati commented Feb 12, 2024 •

edited

Loading

Mayankm96 commented Feb 12, 2024 •

edited

Loading

amacati commented Feb 12, 2024

Mayankm96 commented Feb 12, 2024 •

edited

Loading

amacati commented Feb 12, 2024

Mayankm96 commented Feb 12, 2024

[Bug Report] Stale values in scene robot data #240

[Bug Report] Stale values in scene robot data #240

Comments

amacati commented Feb 12, 2024 • edited Loading

Describe the bug

Steps to reproduce

System Info

Checklist

Acceptance Criteria

Mayankm96 commented Feb 12, 2024 • edited Loading

amacati commented Feb 12, 2024

Mayankm96 commented Feb 12, 2024 • edited Loading

amacati commented Feb 12, 2024

Mayankm96 commented Feb 12, 2024

amacati commented Feb 12, 2024 •

edited

Loading

Mayankm96 commented Feb 12, 2024 •

edited

Loading

Mayankm96 commented Feb 12, 2024 •

edited

Loading