[Bug] dbt seed fails for Delta Lake + S3DynamoDBLogStore configuration #1090

Open · Paulskit opened this issue Aug 20, 2024 · 0 comments
Labels: bug
Paulskit commented Aug 20, 2024

Is this a new bug in dbt-spark?

  • I believe this is a new bug in dbt-spark
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

The dbt seed command fails with java.nio.file.FileSystemException: Old entries for table s3://<bucket_name>/<table_path> still exist in the external log store.

Expected Behavior

dbt seed successfully creates the table from the CSV seed file.

Steps To Reproduce

dbt-core 1.8.5 + dbt-spark 1.8.0. The seed is configured with the delta file format:

seeds:
   +file_format: 'delta'

Apache Spark is configured with the Delta Lake session extension and the S3 DynamoDB log store:

spark.delta.logStore.s3.impl=io.delta.storage.S3DynamoDBLogStore
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName=<your dynamo db table name>
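
A minimal way to reproduce this outside of dbt is to replay roughly the statements the seed materialization sends to the Thrift server. This is only a sketch with placeholder table and column names, assuming the schema location is on S3 and backed by the S3DynamoDBLogStore (not verified against the adapter's exact macros):

-- placeholder names; schema location assumed to be s3://... with the DynamoDB log store
DROP TABLE IF EXISTS my_schema.my_seed;
CREATE TABLE my_schema.my_seed (id INT, name STRING) USING delta;
INSERT INTO my_schema.my_seed VALUES (1, 'a');
-- a second run of the same three statements fails on CREATE TABLE with
-- "Old entries for table s3://... still exist in the external log store"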

Relevant log output

[2024-08-20T15:50:46.627+0000] {logging_mixin.py:188} INFO - 15:50:46  Running with dbt=1.8.5
[2024-08-20T15:50:46.804+0000] {logging_mixin.py:188} INFO - 15:50:46  Registered adapter: spark=1.8.0
[2024-08-20T15:50:46.845+0000] {logging_mixin.py:188} INFO - 15:50:46  Unable to do partial parsing because saved manifest not found. Starting full parse.
[2024-08-20T15:50:49.738+0000] {logging_mixin.py:188} INFO - 15:50:49  [WARNING]: Deprecated functionality
The `tests` config has been renamed to `data_tests`. Please see
https://docs.getdbt.com/docs/build/data-tests#new-data_tests-syntax for more
information.
[2024-08-20T15:50:50.267+0000] {logging_mixin.py:188} INFO - 15:50:50  Found 6 models, 42 data tests, 1 seed, 36 sources, 733 macros, 25 unit tests
[2024-08-20T15:50:50.275+0000] {logging_mixin.py:188} INFO - 15:50:50
[2024-08-20T15:52:03.105+0000] {logging_mixin.py:188} INFO - 15:52:03  Concurrency: 4 threads (target='dev')
[2024-08-20T15:52:03.106+0000] {logging_mixin.py:188} INFO - 15:52:03
[2024-08-20T15:52:03.110+0000] {logging_mixin.py:188} INFO - 15:52:03  1 of 1 START seed file <database>.<table> ........... [RUN]
[2024-08-20T15:52:11.569+0000] {logging_mixin.py:188} INFO - 15:52:11  1 of 1 ERROR loading seed file <database>.<table> ... [ERROR in 8.45s]
[2024-08-20T15:52:11.947+0000] {logging_mixin.py:188} INFO - 15:52:11
[2024-08-20T15:52:11.948+0000] {logging_mixin.py:188} INFO - 15:52:11  Finished running 1 seed in 0 hours 1 minutes and 21.67 seconds (81.67s).
[2024-08-20T15:52:12.023+0000] {logging_mixin.py:188} INFO - 15:52:12
[2024-08-20T15:52:12.024+0000] {logging_mixin.py:188} INFO - 15:52:12  Completed with 1 error and 0 warnings:
[2024-08-20T15:52:12.025+0000] {logging_mixin.py:188} INFO - 15:52:12
[2024-08-20T15:52:12.027+0000] {logging_mixin.py:188} INFO - 15:52:12    Runtime Error in seed <table> (seeds/<table>.csv)
  Database Error
    org.apache.hive.service.cli.HiveSQLException: Error running query: java.nio.file.FileSystemException: Old entries for table <s3_path> still exist in the external log store
    	at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:46)
    	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:262)
    	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:166)
    	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79)
    	at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63)
    	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:41)
    	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:166)
    	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:161)
    	at java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
    	at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
    	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
    	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:175)
    	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    	at java.base/java.lang.Thread.run(Thread.java:840)
    Caused by: java.nio.file.FileSystemException: Old entries for table <s3_path> still exist in the external log store
    	at io.delta.storage.BaseExternalLogStore.write(BaseExternalLogStore.java:222)
    	at org.apache.spark.sql.delta.storage.LogStoreAdaptor.write(LogStore.scala:444)
    	at org.apache.spark.sql.delta.storage.DelegatingLogStore.write(DelegatingLogStore.scala:119)
    	at org.apache.spark.sql.delta.OptimisticTransactionImpl.writeCommitFile(OptimisticTransaction.scala:1806)
    	at org.apache.spark.sql.delta.OptimisticTransactionImpl.writeCommitFile$(OptimisticTransaction.scala:1798)
    	at org.apache.spark.sql.delta.OptimisticTransaction.writeCommitFile(OptimisticTransaction.scala:142)
    	at org.apache.spark.sql.delta.OptimisticTransactionImpl.doCommit(OptimisticTransaction.scala:1711)
    	at org.apache.spark.sql.delta.OptimisticTransactionImpl.doCommit$(OptimisticTransaction.scala:1682)
    	at org.apache.spark.sql.delta.OptimisticTransaction.doCommit(OptimisticTransaction.scala:142)
    	at org.apache.spark.sql.delta.OptimisticTransactionImpl.$anonfun$doCommitRetryIteratively$3(OptimisticTransaction.scala:1651)
    	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
    	at org.apache.spark.sql.delta.OptimisticTransactionImpl.$anonfun$doCommitRetryIteratively$2(OptimisticTransaction.scala:1648)
    	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    	at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile(DeltaLogging.scala:140)
    	at org.apache.spark.sql.delta.metering.DeltaLogging.recordFrameProfile$(DeltaLogging.scala:138)
    	at org.apache.spark.sql.delta.OptimisticTransaction.recordFrameProfile(OptimisticTransaction.scala:142)
    	at org.apache.spark.sql.delta.metering.DeltaLogging.$anonfun$recordDeltaOperationInternal$1(DeltaLogging.scala:133)
    	at com.databricks.spark.util.DatabricksLogging.recordOperation(DatabricksLogging.scala:128)
    	at com.databricks.spark.util.DatabricksLogging.recordOperation$(DatabricksLogging.scala:117)
    	at org.apache.spark.sql.delta.OptimisticTransaction.recordOperation(OptimisticTransaction.scala:142)
    	at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperationInternal(DeltaLogging.scala:132)

Environment

- OS: Amazon Linux 2 (AWS EMR)
- Python: 3.11
- dbt-core: 1.8.5
- dbt-spark: 1.8.0

Additional Context

Possible root cause: dbt-spark uses drop table + create table for seeds.
[screenshot: dbt-spark seed macro showing the drop table + create table sequence]

On the first seed run everything works: the table is created and the data is inserted. However, since we're using the DynamoDB log store, the following records for the Delta log files are created in DynamoDB:
[screenshot: DynamoDB items created by S3DynamoDBLogStore for the Delta log files]

On the next run, the table is dropped and the Delta Lake files in S3 are deleted, but the records in DynamoDB are not. When dbt-spark executes the create table command, Delta tries to write 00000000000000000000.json and fails because the corresponding lock record already exists.

It feels like a better way to work with Delta Lake tables is not drop + create, but create if not exists followed by truncate + insert, or merge if that makes more sense.
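
For illustration only, one possible shape of that flow, again with placeholder names (whether the emptying step is TRUNCATE or DELETE FROM depends on what the Spark/Delta version supports, and MERGE would be a further option):

-- create the Delta table only once, then reload it in place
CREATE TABLE IF NOT EXISTS my_schema.my_seed (id INT, name STRING) USING delta;
DELETE FROM my_schema.my_seed;  -- or TRUNCATE TABLE, where supported
INSERT INTO my_schema.my_seed VALUES (1, 'a');
-- the Delta log is never recreated from version 0, so the existing DynamoDB entries stay valid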

Paulskit added the bug and triage labels on Aug 20, 2024