-
gitsync is the way that Airflow recommends. We have used PVCs in the past, but they have their own drawbacks, as not all k8s clusters make suitable volume types (e.g. ReadWriteMany) available.
-
You can use the S3Hook in Airflow code to download from an S3 bucket, but it is difficult to use that in conjunction with a volumeMount without a PVC in between. You could also provision a Pod from within an Airflow DAG using the PodOperator, collecting the data from S3 via a container command (which would basically do what the S3Hook does, but using Python directly).
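As a rough illustration of the second approach (not tested; the image, bucket, paths and DAG/task names are just placeholders, and it assumes a recent cncf.kubernetes provider with the KubernetesPodOperator), such a pod could run the AWS CLI to copy the files:

```python
# Untested sketch: run a short-lived pod that copies job files out of S3
# using the AWS CLI. Image, bucket, target path and names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="fetch_job_files_from_s3",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    fetch_job_files = KubernetesPodOperator(
        task_id="fetch_job_files",
        name="fetch-job-files",
        image="amazon/aws-cli:latest",
        cmds=["aws"],
        arguments=["s3", "cp", "s3://my-bucket/jobs/", "/data/jobs/", "--recursive"],
        # credentials would have to be provided, e.g. via env_vars or a mounted secret,
        # and a volume + volume_mount (e.g. backed by a PVC) would still be needed
        # so that /data/jobs remains visible after this pod terminates
    )
```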
-
Specifically, you can define a PVC and mount it in the Airflow cluster so that it is available to the workers. A DAG can then be written that downloads from S3 and writes to the PVC folder mounted on all workers. A second DAG can then use the same PVC mount, which now contains the DAG files from S3. This may be more complicated if you are using the KubernetesExecutor.
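A minimal, untested sketch of the first of those DAGs, assuming the PVC is mounted at /opt/airflow/dags on every worker and that the bucket name and S3 connection id are placeholders:

```python
# Untested sketch: copy DAG files from S3 onto the PVC that is mounted into
# all workers. Mount path, bucket name and connection id are placeholders.
import os
from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

PVC_DAGS_DIR = "/opt/airflow/dags"  # assumed PVC mount path

with DAG(
    dag_id="sync_dags_from_s3",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:

    @task
    def download_dags():
        hook = S3Hook(aws_conn_id="aws_default")
        for key in hook.list_keys(bucket_name="my-dag-bucket", prefix="dags/") or []:
            if key.endswith(".py"):
                obj = hook.get_key(key=key, bucket_name="my-dag-bucket")
                obj.download_file(os.path.join(PVC_DAGS_DIR, os.path.basename(key)))

    download_dags()
```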
-
Thank you for the detailed answer @adwk67. Although the gitsync option does not fit our use case very well, I tried it out, without success. First of all I had to tweak some things that are not described in the documentation.
Airflow is running and the DAG is available to execute, however the driver pod fails because it does not find the main application file. I have three files which should be synced with gitsync (basically the files from the Spark waterlevel demo).
I assume that the main application file is not synced from the git repository to the driver pod? Or is there a special folder where it is copied to?
-
Thank you for the clarification. I think gitsync is the way to go for us then, but I can't get it to work. When the Airflow job is running it starts a pod (e.g. spark-wl-compaction-job-20240912134621-bvb4b) with the "spark-submit" container, which starts a driver pod (e.g. spark-wl-compaction-job-20240912134621-eccafe91e67c4fc3-driver) with a spark container, which tries to execute the Python job file but fails because the path I provided is wrong. So the question is: where will the files synced with gitsync be placed, so that I can provide the correct path for the mainApplicationFile?
-
It sounds like it would make sense to separate a) the gitsync operation from b) dealing with the spark job dependencies.
For the latter, you can use the S3Hook to do this. It would look something like this (not tested):
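(A rough reconstruction rather than the original snippet; the connection id, bucket, keys and target directory below are placeholders, and the target would typically be a location the spark job can also see, e.g. a shared PVC mount.)

```python
# Untested sketch: pull the spark job's files from S3 to a local directory
# before the spark job is triggered. Connection id, bucket, keys and
# target directory are placeholders.
import os

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def download_spark_dependencies(target_dir: str = "/shared/spark-jobs", **_):
    hook = S3Hook(aws_conn_id="aws_default")
    os.makedirs(target_dir, exist_ok=True)
    for key in ["jobs/compaction_job.py", "jobs/helpers.py"]:
        obj = hook.get_key(key=key, bucket_name="my-spark-bucket")
        obj.download_file(os.path.join(target_dir, os.path.basename(key)))

# wired into the DAG roughly like this:
# download_deps = PythonOperator(task_id="download_spark_deps",
#                                python_callable=download_spark_dependencies)
# download_deps >> submit_spark_job  # run before the task that submits the spark job
```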
This task would then be called before the one that calls the spark job.
-
The documentation provides two ways to load DAGs into Airflow:
configMaps
gitSync
Neither method is suitable for our use case. Are there any other ways to get DAGs into Airflow? We were thinking about a PersistentVolume (we were not able to copy files into the PV using kubectl cp) and an S3 bucket (we could not figure out how to mount the bucket into Airflow). Do you have any recommendations or ideas on how to solve our problem?
Thank you in advance!