Heartbeat timeout docs #46257

Draft
wants to merge 10 commits into
base: main
from
18 changes: 9 additions & 9 deletions docs/apache-airflow/core-concepts/tasks.rst
@@ -167,25 +167,25 @@ These can be useful if your code has extra knowledge about its environment and w

.. _concepts:zombies:

Zombie Tasks
------------
Task Heartbeat Timeout (Zombie Tasks)
---------------------------------------

No system runs perfectly, and task instances are expected to die once in a while.

*Zombie tasks* are ``TaskInstances`` stuck in a ``running`` state despite their associated jobs being inactive
(e.g. their process did not send a recent heartbeat as it got killed, or the machine died). Airflow will find these
periodically, clean them up, and either fail or retry the task depending on its settings. Tasks can become zombies for
``TaskInstances`` may get stuck in a ``running`` state despite their associated jobs being inactive
(e.g. their local task job did not send a recent heartbeat as it got killed, or the machine died). Such tasks are also known as zombie tasks. Airflow will find these
periodically, clean them up, and either fail or retry the task depending on its settings. The heartbeat of a local task job can time out for
many reasons, including:

* The Airflow worker ran out of memory and was OOMKilled.
* The Airflow worker failed its liveness probe, so the system (for example, Kubernetes) restarted the worker.
* The system (for example, Kubernetes) scaled down and moved an Airflow worker from one node to another.
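
The detect-then-fail-or-retry behaviour described above can be illustrated with a minimal sketch. This is not Airflow's implementation: ``RunningTask``, ``heartbeat_timed_out``, ``resolve_timed_out_tasks``, and the 300-second threshold are all hypothetical names and values chosen for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class RunningTask:
    """Hypothetical record of a running task; not an Airflow class."""

    task_id: str
    last_heartbeat: datetime
    retries_left: int


def heartbeat_timed_out(task, now, timeout):
    """Return True if the last heartbeat is older than the allowed timeout."""
    return now - task.last_heartbeat > timeout


def resolve_timed_out_tasks(tasks, now, timeout):
    """Map each timed-out task id to 'retry' or 'fail'; healthy tasks are skipped."""
    outcome = {}
    for task in tasks:
        if heartbeat_timed_out(task, now, timeout):
            # Retry only if the task still has retries available.
            outcome[task.task_id] = "retry" if task.retries_left > 0 else "fail"
    return outcome


now = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
timeout = timedelta(seconds=300)  # assumed threshold, for illustration only
tasks = [
    RunningTask("healthy", now - timedelta(seconds=10), retries_left=1),
    RunningTask("stale_with_retry", now - timedelta(seconds=600), retries_left=1),
    RunningTask("stale_no_retry", now - timedelta(seconds=600), retries_left=0),
]
result = resolve_timed_out_tasks(tasks, now, timeout)
print(result)
```

In this sketch a task whose heartbeat is exactly at the threshold is still treated as healthy; only a strictly older heartbeat counts as timed out.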


Reproducing zombie tasks locally
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Reproducing task heartbeat timeouts locally
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you'd like to reproduce zombie tasks for development/testing processes, follow the steps below:
If you'd like to reproduce local task job heartbeat timeouts for development/testing processes, follow the steps below:

1. Set the following environment variables for your local Airflow setup (alternatively, you can change the corresponding config values in ``airflow.cfg``):

@@ -216,7 +216,7 @@ If you'd like to reproduce zombie tasks for development/testing processes, follo
sleep_dag()


Run the above DAG and wait for a while. You should see the task instance becoming a zombie task and then being killed by the scheduler.
Run the above DAG and wait for a while. You should see the task experience a heartbeat timeout and then get killed by the scheduler.


