Supporting Airflow as executor #671

Closed
pchevali opened this issue Mar 20, 2025 · 2 comments

@pchevali

Hi,

I'm creating an issue here, but it's more of a discussion.

I have developed a similar engine, but it is far less powerful than this one, so I am looking into how I can use mapchete for my purpose.
My use case is building large-scale GeoTIFFs from a simulation grid. I am currently building COGs on Web Mercator tiles. I am using Apache Airflow as a task scheduler (as I already use it for other use cases).
I was wondering how I can use mapchete with an Airflow executor. I guess that I would have to extend the BaseExecutor, but as mapchete does the supervision, I don't think that is the best way.
I think that I should generate a task list and then let Airflow run the DAG with the task list.
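
To make the "task list" idea concrete, here is a minimal sketch that enumerates the Web Mercator (XYZ) tiles covering a bounding box and serializes them so that a separate orchestrator could pick them up. The function names, the dictionary layout and the example bounds are assumptions made for illustration; none of this is mapchete's API, and mapchete's own tile pyramid handling may differ.

```python
import json
import math


def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) index of the XYZ Web Mercator tile containing a point."""
    n = 2 ** zoom
    lat_rad = math.radians(lat)
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that points on the outer edges stay inside the tile grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def build_task_list(west, south, east, north, zoom):
    """List one task per tile intersecting the bounds (hypothetical layout)."""
    x_min, y_min = lonlat_to_tile(west, north, zoom)
    x_max, y_max = lonlat_to_tile(east, south, zoom)
    return [
        {"zoom": zoom, "x": x, "y": y}
        for y in range(y_min, y_max + 1)
        for x in range(x_min, x_max + 1)
    ]


# Example bounds are made up; the JSON string could be written to a file
# that an Airflow DAG reads to create its tasks.
tasks = build_task_list(-180.0, -85.0, 180.0, 85.0, zoom=1)
task_list_json = json.dumps(tasks)
```

A plain, serializable list like this keeps the tile enumeration separate from whatever system ends up running the tasks.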

Do you know whether there is a place in the code where I could do that?

Thanks

Regards

@ungarj
Owner

ungarj commented Mar 25, 2025

Hi,

I have not used Airflow so far but I think it could indeed be used as an alternative executor. It is going to be a bit tricky because the task graph generation is for now bound to dask and is implemented separately from the executors. But it is feasible.

We did some evaluation of Airflow a while back and had the impression that it is better suited to larger, longer-running tasks, whereas dask works better for a large number of tasks because it has less overhead. This may be outdated already, so please correct me if I'm wrong. Anyway, we came to the conclusion that sticking with dask was the better option at that time and didn't test Airflow.

How many tasks (roughly) does your process spawn?
Do they need to have dependencies (i.e. require a task graph) or can they be executed independently from each other?
If it is in the thousands or tens of thousands, how does Airflow cope?
Do you intend to call it similarly to the external dask scheduler execution (mapchete execute --dask-scheduler my-scheduler ...)?

As for implementation, a simple first step would be to add an AirflowExecutor class to mapchete.executors which implements the required methods (like DaskExecutor does) and add some unit tests for it. I can then help you out with connecting the task graph creation bits with the other parts of the code.
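
As a rough illustration of what such a first step could look like, the sketch below shows the general shape of an executor class. The thread does not list the methods mapchete requires, so the class name, the `as_completed` and `cancel` methods and the constructor argument are all hypothetical; the real interface should be taken from DaskExecutor. A thread pool stands in for the Airflow backend so that the example runs on its own.

```python
import concurrent.futures


class PlaceholderExecutor:
    """Hypothetical executor shape; NOT mapchete's actual interface."""

    def __init__(self, max_workers=4):
        # A thread pool stands in for whatever would hand work to Airflow.
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)

    def as_completed(self, func, iterable):
        """Submit func(item) for every item and yield results as they finish."""
        futures = [self._pool.submit(func, item) for item in iterable]
        for future in concurrent.futures.as_completed(futures):
            yield future.result()

    def cancel(self):
        """Drop tasks that have not started yet (requires Python 3.9+)."""
        self._pool.shutdown(wait=False, cancel_futures=True)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self._pool.shutdown(wait=True)


# Results arrive in completion order, so sort them before comparing.
with PlaceholderExecutor(max_workers=2) as executor:
    squares = sorted(executor.as_completed(lambda value: value * value, range(5)))
```

A unit test for a real implementation could follow the same pattern: run a trivial function over a small iterable and compare the sorted results.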

Cheers

@pchevali
Author

Hi,

After thinking about this again, making an Airflow executor might not be the right choice, as stated above. I should keep Airflow for orchestration and rely on dask for scheduling, as it is so much more lightweight than Airflow.
I was thinking of making batches of tasks to use with Airflow, but that would mean rewriting things.
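
For reference, the batching idea can be sketched in a few lines: many small tasks are grouped into fewer, larger units so that a scheduler with noticeable per-task overhead only sees the batches. The helper name `batch_tasks` and the batch size are assumptions for illustration and are not part of mapchete or Airflow.

```python
from itertools import islice


def batch_tasks(tasks, batch_size):
    """Yield consecutive lists of at most batch_size tasks."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(tasks)
    # islice returns an empty list once the iterator is exhausted,
    # which ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


# 10,000 hypothetical tile tasks grouped into batches of 500.
batches = list(batch_tasks(range(10_000), 500))
```

With these numbers the orchestrator would see 20 units of work instead of 10,000, while each batch could still be processed by a lightweight scheduler such as dask.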

Regards
