How to retry multiple related tasks in Apache Airflow?
There is a standard ETL task: take data from a source and load it into PostgreSQL.
Apache Airflow (2.2.3) and Python scripts are used for this.
But there are some nuances:
- Heavy queries are often run against the collected data, so materialized views are used, which must be refreshed after each collection.
- The data is not always collected in full on the first attempt, so the collection tasks have several retries, but the materialized views must be refreshed after each (or almost each) retry of the upstream task.
In the DAG, it currently looks like this:
Two tasks: A, a PythonOperator that collects the data and loads it into the database, and B, a PostgresOperator that runs a REFRESH MATERIALIZED VIEW query.
They are connected in series: A >> B (a rough sketch follows below).
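Roughly, as a minimal sketch (the task names, the "reporting" view, the "postgres_default" connection id and the retry count are placeholders, not the real project values; PostgresOperator assumes the postgres provider package is installed):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator


def collect_data():
    # pulls data from the source and inserts it into PostgreSQL
    ...


with DAG(
    dag_id="etl_with_refresh",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    a = PythonOperator(
        task_id="collect_data",
        python_callable=collect_data,
        retries=3,  # several retries on the collection task
    )
    b = PostgresOperator(
        task_id="refresh_view",
        postgres_conn_id="postgres_default",
        sql="REFRESH MATERIALIZED VIEW reporting;",
    )
    a >> b  # B starts only after A's retries are exhausted
```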
The problem is that A has several retries, and until they are all exhausted (successfully or not, that is not so important here), task B will not start.
The refresh logic could be moved into task A itself, but then it would be impossible to do something like
[A0, A1, A2] >> B (a single refresh after several upstream collection tasks), as sketched below.
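The fan-in layout mentioned above would look something like this, continuing the sketch and placed inside the same `with DAG(...)` block (names are again illustrative):

```python
# Several independent collectors feeding a single refresh task.
collectors = [
    PythonOperator(
        task_id=f"collect_source_{i}",
        python_callable=collect_data,
        retries=3,
    )
    for i in range(3)
]
collectors >> b  # [A0, A1, A2] >> B: refresh runs once, after all collectors finish
```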
Ideally, it seems to me, all of this would go into a TaskGroup and the whole group would be retried, but unfortunately Airflow does not allow retrying a group (a grouping sketch is shown below for reference). There used to be SubDAGs, but they are now deprecated and replaced by TaskGroups.
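For reference, grouping the two tasks would look roughly like this (same imports and `with DAG(...)` block as above); the point is that retries remains a per-task parameter, and there is nothing like a group-level retry:

```python
from airflow.utils.task_group import TaskGroup

# Inside the with DAG(...) block:
with TaskGroup("collect_and_refresh") as collect_and_refresh:
    a = PythonOperator(
        task_id="collect_data",
        python_callable=collect_data,
        retries=3,  # retries are configured per task, not per group
    )
    b = PostgresOperator(
        task_id="refresh_view",
        postgres_conn_id="postgres_default",
        sql="REFRESH MATERIALIZED VIEW reporting;",
    )
    a >> b
```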