-
gitsync is the way that Airflow recommends. We have used PVCs in the past, but they have their own drawbacks, as not all k8s clusters make suitable volume types (e.g. ReadWriteMany) available.
-
You can use the S3Hook in Airflow code to download from an S3 bucket, but it is difficult to use that in conjunction with a volumeMount without a PVC in between. You could also provision a Pod from within an Airflow DAG using the PodOperator, collecting the data from S3 via a container command (which would basically do what the S3Hook does, but using Python directly).
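As a rough illustration of the second approach (not tested; the image, bucket, paths and DAG/task names are just placeholders, and it assumes a recent cncf.kubernetes provider with the KubernetesPodOperator), such a pod could run the AWS CLI to copy the files:

```python
# Untested sketch: run a short-lived pod that copies job files out of S3
# using the AWS CLI. Image, bucket, target path and names are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="fetch_job_files_from_s3",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    fetch_job_files = KubernetesPodOperator(
        task_id="fetch_job_files",
        name="fetch-job-files",
        image="amazon/aws-cli:latest",
        cmds=["aws"],
        arguments=["s3", "cp", "s3://my-bucket/jobs/", "/data/jobs/", "--recursive"],
        # credentials would have to be provided, e.g. via env_vars or a mounted secret,
        # and a volume + volume_mount (e.g. backed by a PVC) would still be needed
        # so that /data/jobs remains visible after this pod terminates
    )
```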
-
Specifically, you can define a PVC and mount it in the Airflow cluster so that it is available to the workers. A DAG can then be written that downloads from S3 and writes to the PVC folder mounted on all workers. A second DAG can then use the same PVC mount, which now contains the DAG files from S3. This may be more complicated if you are using the KubernetesExecutor.
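A minimal, untested sketch of the first of those DAGs, assuming the PVC is mounted at /opt/airflow/dags on every worker and that the bucket name and S3 connection id are placeholders:

```python
# Untested sketch: copy DAG files from S3 onto the PVC that is mounted into
# all workers. Mount path, bucket name and connection id are placeholders.
import os
from datetime import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

PVC_DAGS_DIR = "/opt/airflow/dags"  # assumed PVC mount path

with DAG(
    dag_id="sync_dags_from_s3",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:

    @task
    def download_dags():
        hook = S3Hook(aws_conn_id="aws_default")
        for key in hook.list_keys(bucket_name="my-dag-bucket", prefix="dags/") or []:
            if key.endswith(".py"):
                obj = hook.get_key(key=key, bucket_name="my-dag-bucket")
                obj.download_file(os.path.join(PVC_DAGS_DIR, os.path.basename(key)))

    download_dags()
```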
-
Thank you for the detailed answer @adwk67. Although the gitsync option does not fit our use case very well, I tried it out, without success. First of all I had to tweak some things that are not described in the documentation.
Airflow is running and the DAG is available to execute, however the driver pod fails because it does not find the main application file. I have three files which should be synced with gitsync (basically the files from the Spark waterlevel demo).
I assume that the main application file is not synced from the git repository to the driver pod? Or is there a special folder where it is copied to?
-
Thank you for the clarification. I think gitsync is the way to go for us then, but I can't get it to work. When the Airflow job is running it starts a pod (e.g. spark-wl-compaction-job-20240912134621-bvb4b) with the "spark-submit" container, which starts a driver pod (e.g. spark-wl-compaction-job-20240912134621-eccafe91e67c4fc3-driver) with a spark container, which tries to execute the Python job file but fails because the path I provided is wrong. So the question is: where will the files synced with gitsync be placed, so that I can provide the correct path for the mainApplicationFile?
-
It sounds like it would make sense to separate a) the gitsync operation from b) dealing with the spark job dependencies.
For the latter, you can use the S3Hook to do this. It would look something like this (not tested):
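(A rough reconstruction rather than the original snippet; the connection id, bucket, keys and target directory below are placeholders, and the target would typically be a location the spark job can also see, e.g. a shared PVC mount.)

```python
# Untested sketch: pull the spark job's files from S3 to a local directory
# before the spark job is triggered. Connection id, bucket, keys and
# target directory are placeholders.
import os

from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def download_spark_dependencies(target_dir: str = "/shared/spark-jobs", **_):
    hook = S3Hook(aws_conn_id="aws_default")
    os.makedirs(target_dir, exist_ok=True)
    for key in ["jobs/compaction_job.py", "jobs/helpers.py"]:
        obj = hook.get_key(key=key, bucket_name="my-spark-bucket")
        obj.download_file(os.path.join(target_dir, os.path.basename(key)))

# wired into the DAG roughly like this:
# download_deps = PythonOperator(task_id="download_spark_deps",
#                                python_callable=download_spark_dependencies)
# download_deps >> submit_spark_job  # run before the task that submits the spark job
```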
This task would then be called before the one that calls the spark job.
-
The documentation provides two ways to load DAGs into Airflow:
configMaps
gitSync
Neither method is suitable for our use case. Are there any other ways to get DAGs into Airflow? We were thinking about a PersistentVolume (we were not able to copy files into the PV using kubectl cp) and an S3 bucket (we could not figure out how to mount the bucket into Airflow). Do you have any recommendations or ideas on how to solve our problem?
Thank you in advance!