[Bug] dbt seed fails for Delta Lake + S3DynamoDBLogStore configuration #1090

Open · Paulskit opened this issue Aug 20, 2024 · 0 comments
Labels: bug
Paulskit commented Aug 20, 2024

Is this a new bug in dbt-spark?

  • I believe this is a new bug in dbt-spark
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

The dbt seed command fails with java.nio.file.FileSystemException: Old entries for table s3://<bucket_name>/<table_path> still exist in the external log store.

Expected Behavior

dbt seed successfully creates the table from the CSV seed file.

Steps To Reproduce

dbt-core 1.8.5 + dbt-spark 1.8.0. The seed is configured with the delta file format:

seeds:
   +file_format: 'delta'

Apache Spark is configured with the Delta Lake session extension and the S3 DynamoDB log store:

spark.delta.logStore.s3.impl=io.delta.storage.S3DynamoDBLogStore
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName=<your dynamo db table name>
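
A minimal way to reproduce this outside of dbt is to replay roughly the statements the seed materialization sends to the Thrift server. This is only a sketch with placeholder table and column names, assuming the schema location is on S3 and backed by the S3DynamoDBLogStore (not verified against the adapter's exact macros):

-- placeholder names; schema location assumed to be s3://... with the DynamoDB log store
DROP TABLE IF EXISTS my_schema.my_seed;
CREATE TABLE my_schema.my_seed (id INT, name STRING) USING delta;
INSERT INTO my_schema.my_seed VALUES (1, 'a');
-- a second run of the same three statements fails on CREATE TABLE with
-- "Old entries for table s3://... still exist in the external log store"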

Relevant log output

[2024-08-20T15:50:46.627+0000] {logging_mixin.py:188} INFO - 15:50:46  Running with dbt=1.8.5
[2024-08-20T15:50:46.804+0000] {logging_mixin.py:188} INFO - 15:50:46  Registered adapter: spark=1.8.0
[2024-08-20T15:50:46.845+0000] {logging_mixin.py:188} INFO - 15:50:46  Unable to do partial parsing because saved manifest not found. Starting full parse.
[2024-08-20T15:50:49.738+0000] {logging_mixin.py:188} INFO - 15:50:49  [WARNING]: Deprecated functionality
The `tests` config has been renamed to `data_tests`. Please see
https://docs.getdbt.com/docs/build/data-tests#new-data_tests-syntax for more
information.
[2024-08-20T15:50:50.267+0000] {logging_mixin.py:188} INFO - 15:50:50  Found 6 models, 42 data tests, 1 seed, 36 sources, 733 macros, 25 unit tests
[2024-08-20T15:50:50.275+0000] {logging_mixin.py:188} INFO - 15:50:50
[2024-08-20T15:52:03.105+0000] {logging_mixin.py:188} INFO - 15:52:03  Concurrency: 4 threads (target='dev')
[2024-08-20T15:52:03.106+0000] {logging_mixin.py:188} INFO - 15:52:03
[2024-08-20T15:52:03.110+0000] {logging_mixin.py:188} INFO - 15:52:03  1 of 1 START seed file <database>.<table> ........... [RUN]
[2024-08-20T15:52:11.569+0000] {logging_mixin.py:188} INFO - 15:52:11  1 of 1 ERROR loading seed file <database>.<table> ... [ERROR in 8.45s]
[2024-08-20T15:52:11.947+0000] {logging_mixin.py:188} INFO - 15:52:11
[2024-08-20T15:52:11.948+0000] {logging_mixin.py:188} INFO - 15:52:11  Finished running 1 seed in 0 hours 1 minutes and 21.67 seconds (81.67s).
[2024-08-20T15:52:12.023+0000] {logging_mixin.py:188} INFO - 15:52:12
[2024-08-20T15:52:12.024+0000] {logging_mixin.py:188} INFO - 15:52:12  Completed with 1 error and 0 warnings:
[2024-08-20T15:52:12.025+0000] {logging_mixin.py:188} INFO - 15:52:12
[2024-08-20T15:52:12.027+0000] {logging_mixin.py:188} INFO - 15:52:12    Runtime Error in seed <table> (seeds/<table>.csv)
  Database Error
    org.apache.hive.service.cli.HiveSQLException: Error running query: java.nio.file.FileSystemException: Old entries for table <s3_path> still exist in the external log store
    	at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:46)
    	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:262)
    	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:166)
    	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79)
    	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63)
    	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:41)
    	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:166)
    	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:161)
    	at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
    	at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
    	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:175)
    	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    	at java.base/java.lang.Thread.run(Thread.java:840)
    Caused by: java.nio.file.FileSystemException: Old entries for table <s3_path> still exist in the external log store
    	at io.delta.storage.BaseExternalLogStore.write(BaseExternalLogStore.java:222)
    	at org.apache.spark.sql.delta.storage.LogStoreAdaptor.write(LogStore.scala:444)
    	at org.apache.spark.sql.delta.storage.DelegatingLogStore.write(DelegatingLogStore.scala:119)
    	at org.apache.spark.sql.delta.OptimisticTransactionImpl.writeCommitFile(OptimisticTransaction.scala:1806)
    	at org.apache.spark.sql.delta.OptimisticTransactionImpl.writeCommitFile$(OptimisticTransaction.scala:1798)
    	at org.apache.spark.sql.delta.OptimisticTransaction.writeCommitFile(OptimisticTransaction.scala:142)
    	at org.apache.spark.sql.delta.OptimisticTransactionImpl.doCommit(OptimisticTransaction.scala:1711)
    	at org.apache.spark.sql.delta.OptimisticTransactionImpl.doCommit$(OptimisticTransaction.scala:1682)
    	at org.apache.spark.sql.delta.OptimisticTransaction.doCommit(OptimisticTransaction.scala:142)
    	at org.apache.spark.sql.delta.OptimisticTransactionImpl.$anonfun$doCommitRetryIteratively$3(OptimisticTransaction.scala:1651)
    	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
    	at org.apache.spark.sql.delta.OptimisticTransactionImpl.$anonfun$doCommitRetryIteratively$2(OptimisticTransaction.scala:1648)
    	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    	at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile(DeltaLogging.scala:140)
    	at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile$(DeltaLogging.scala:138)
    	at org.apache.spark.sql.delta.OptimisticTransaction.recordFrameProfile(OptimisticTransaction.scala:142)
    	at org.apache.spark.sql.delta.metering.DeltaLogging.$anonfun$recordDeltaOperationInternal$1(DeltaLogging.scala:133)
    	at com.databricks.spark.util.DatabricksLogging.recordOperation(DatabricksLogging.scala:128)
    	at com.databricks.spark.util.DatabricksLogging.recordOperation$(DatabricksLogging.scala:117)
    	at org.apache.spark.sql.delta.OptimisticTransaction.recordOperation(OptimisticTransaction.scala:142)
    	at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperationInternal(DeltaLogging.scala:132)

Environment

- OS: Amazon Linux 2 (AWS EMR)
- Python: 3.11
- dbt-core: 1.8.5
- dbt-spark: 1.8.0

Additional Context

Possible root cause: dbt-spark uses drop table + create table for seeds.
[screenshot: dbt-spark seed macro showing the drop table + create table sequence]

On the first seed run everything works: the table is created and the data is inserted. However, since we're using the DynamoDB log store, the following records for the Delta log files are created in DynamoDB:
[screenshot: DynamoDB items created by S3DynamoDBLogStore for the Delta log files]

On the next run, the table is dropped and the Delta Lake files in S3 are deleted, but the records in DynamoDB are not. When dbt-spark executes the create table command, Delta tries to write 00000000000000000000.json and fails because the corresponding lock record already exists.

It feels like a better way to work with Delta Lake tables is not drop + create, but create if not exists followed by truncate + insert, or merge if that makes more sense.
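
For illustration only, one possible shape of that flow, again with placeholder names (whether the emptying step is TRUNCATE or DELETE FROM depends on what the Spark/Delta version supports, and MERGE would be a further option):

-- create the Delta table only once, then reload it in place
CREATE TABLE IF NOT EXISTS my_schema.my_seed (id INT, name STRING) USING delta;
DELETE FROM my_schema.my_seed;  -- or TRUNCATE TABLE, where supported
INSERT INTO my_schema.my_seed VALUES (1, 'a');
-- the Delta log is never recreated from version 0, so the existing DynamoDB entries stay valid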

Paulskit added the bug and triage labels on Aug 20, 2024