Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dagster SQLite EventLog failures #3868

Open
zaneselvans opened this issue Sep 26, 2024 · 2 comments
Open

Dagster SQLite EventLog failures #3868

zaneselvans opened this issue Sep 26, 2024 · 2 comments
Labels
dagster Issues related to our use of the Dagster orchestrator sqlite Issues related to interacting with sqlite databases

Comments

@zaneselvans
Copy link
Member

As previously addressed in e.g. #2417, #2996, #3003, #3208, and #3211, SQLite can't handle multiple concurrent writes. Our SQLite IO Manager has worked around this for the PUDL DB, but we seem to be hitting a new limit of some kind with Dagster's event logging DB, which still uses SQLite locally by default.

@cmgosnell recently encountered the issue in attempting to debug some integration test failures through the Dagster UI, in which it failed basically 100% of the time. I have had the problem on and off with maybe a 50-75% failure rate on the full ETL.

For me, the failure always seems to happen right after the execution of the raw_eia860m__all_dfs asset.

2024-09-25 21:43:12 -0600 - dagster - DEBUG - etl_full - 3162fc18-e15b-4284-81dd-b46c063d8d35 - 8873 - raw_eia860m__all_dfs - STEP_SUCCESS - Finished execution of step "raw_eia860m__all_dfs" in 2m35s.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 941, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: unable to open database file

Full stack trace here:

2024-09-25 21:43:12 -0600 - dagster - DEBUG - etl_full - 3162fc18-e15b-4284-81dd-b46c063d8d35 - 8873 - raw_eia860m__all_dfs - STEP_SUCCESS - Finished execution of step "raw_eia860m__all_dfs" in 2m35s.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 941, in do_execute
    cursor.execute(statement, parameters)
sqlite3.OperationalError: unable to open database file

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_grpc/impl.py", line 266, in start_run_in_subprocess
    _run_in_subprocess(
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_grpc/impl.py", line 250, in _run_in_subprocess
    instance.report_engine_event(
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_core/instance/__init__.py", line 2497, in report_engine_event
    self.report_dagster_event(dagster_event, run_id=run_id, log_level=log_level)
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_core/instance/__init__.py", line 2519, in report_dagster_event
    self.handle_new_event(event_record)
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_core/instance/__init__.py", line 2417, in handle_new_event
    self._event_storage.store_event(events[0])
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/dagster/_core/storage/event_log/sqlite/sqlite_event_log.py", line 242, in store_event
    conn.execute(insert_event_statement)
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1418, in execute
    return meth(
           ^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/sql/elements.py", line 515, in _execute_on_connection
    return connection._execute_clauseelement(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1640, in _execute_clauseelement
    ret = self._execute_context(
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1846, in _execute_context
    return self._exec_single_context(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1986, in _exec_single_context
    self._handle_dbapi_exception(
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 2355, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/Users/zane/miniforge3/envs/pudl-dev/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 941, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) unable to open database file
[SQL: INSERT INTO event_logs (run_id, event, dagster_event_type, timestamp, step_key, asset_key, partition) VALUES (?, ?, ?, ?, ?, ?, ?)]
[parameters: ('3162fc18-e15b-4284-81dd-b46c063d8d35', '{"__class__": "EventLogEntry", "dagster_event": {"__class__": "DagsterEvent", "event_specific_data": {"__class__": "EngineEventData", "error": null,  ... (350 characters truncated) ... "pipeline_name": "etl_full", "run_id": "3162fc18-e15b-4284-81dd-b46c063d8d35", "step_key": null, "timestamp": 1727322130.5879638, "user_message": ""}', 'ENGINE_EVENT', '2024-09-26 03:42:10.587964', None, None, None)]
(Background on this error at: https://sqlalche.me/e/20/e3q8)

This "unable to open database file" error seems somewhat different than the locked DB error that we were getting before due to attempted concurrent writes.

So, is there a new workaround to squeeze some more life out of SQLite? Or what is the easiest way to use Postgres locally for development?

  • @jdangerx suggested Orbstack as a containerized Postgres solution for local use that doesn't run like 10x slower on MacOS/arm64.
  • Is there some way to make Dagster log fewer events? Can we turn off DEBUG level logging? I don't think we really look at it!
  • How does this intersect with our development environment setup, which is already challenging for new contributors? Is there an "easy" way to use dev containers and integrate this Postgres setup into that?
  • Is there a possibility of making our tests not use the in-process (single threaded) executor if we're messing around with containers anyway?
@zaneselvans zaneselvans added sqlite Issues related to interacting with sqlite databases dagster Issues related to our use of the Dagster orchestrator labels Sep 26, 2024
@zaneselvans
Copy link
Member Author

I may just be a superstitious pigeon on this but it seems like sometimes I get this failure 100% reliably, and other times it's maybe just a 50% chance, so if it fails, and I try to run the ETL again, it might work. But I don't know what influences the probability. Last night I tried to run it 10 times in a row, and they all failed within 2.5 minutes.

Some things I've explored in the past without success:

  • Does the probability vary with the python environment? I.e. is there some combination of packages that aren't playing well together that causes the problem to be more likely? Sometimes updating the pudl-dev environment seems to fix things. But it's not clear if that's real or me imagining things.
  • Are there too many files & sockets open at once? Using lsof I can see that Dagster often has thousands of "files" open at once, and my system as a whole might have 20,000 open when the ETL is running.

@zaneselvans
Copy link
Member Author

Looking at the logging databases themselves:

  • they exist, and Dagster has been able to write data to them before it starts failing
  • Recent DBs from failures are 4.0 MB in size
  • They appear to contain only Dagster logging messages in the event_logs table, and that is the only table that contains any records.
  • In the failed logging DBs, there are about 2000 event logging records.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dagster Issues related to our use of the Dagster orchestrator sqlite Issues related to interacting with sqlite databases
Projects
Status: New
Development

No branches or pull requests

1 participant