From bb8d47fed037c91325bf390c9791418c510d5b1e Mon Sep 17 00:00:00 2001 From: Karen Braganza Date: Wed, 29 Jan 2025 13:39:34 -0500 Subject: [PATCH 1/8] Emphasize task heartbeat timeout terminology in docs to match logs --- docs/apache-airflow/core-concepts/tasks.rst | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/apache-airflow/core-concepts/tasks.rst b/docs/apache-airflow/core-concepts/tasks.rst index 530b58e28456b..dbe10db53283c 100644 --- a/docs/apache-airflow/core-concepts/tasks.rst +++ b/docs/apache-airflow/core-concepts/tasks.rst @@ -167,14 +167,14 @@ These can be useful if your code has extra knowledge about its environment and w .. _concepts:zombies: -Zombie Tasks ------------- +Task Heartbeat Timeout (Zombie Tasks) +--------------------------------------- No system runs perfectly, and task instances are expected to die once in a while. -*Zombie tasks* are ``TaskInstances`` stuck in a ``running`` state despite their associated jobs being inactive -(e.g. their process did not send a recent heartbeat as it got killed, or the machine died). Airflow will find these -periodically, clean them up, and either fail or retry the task depending on its settings. Tasks can become zombies for +``TaskInstances`` may get stuck in a ``running`` state despite their associated jobs being inactive +(e.g. their local task job did not send a recent heartbeat as it got killed, or the machine died). Such tasks are also known as zombie tasks. Airflow will find these +periodically, clean them up, and either fail or retry the task depending on its settings. The heartbeat of a local task job can timeout for many reasons, including: * The Airflow worker ran out of memory and was OOMKilled. @@ -182,10 +182,10 @@ many reasons, including: * The system (for example, Kubernetes) scaled down and moved an Airflow worker from one node to another. -Reproducing zombie tasks locally -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Reproducing task heartbeat timeouts locally +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -If you'd like to reproduce zombie tasks for development/testing processes, follow the steps below: +If you'd like to reproduce local task job heartbeat timeouts for development/testing processes, follow the steps below: 1. Set the below environment variables for your local Airflow setup (alternatively you could tweak the corresponding config values in airflow.cfg) @@ -216,7 +216,7 @@ If you'd like to reproduce zombie tasks for development/testing processes, follo sleep_dag() -Run the above DAG and wait for a while. You should see the task instance becoming a zombie task and then being killed by the scheduler. +Run the above DAG and wait for a while. You should see the task experiencing a heartbeat timeout and then being killed by the scheduler. From 7ab777ad8bf9fbbdbe895355cae1ed9eff98e775 Mon Sep 17 00:00:00 2001 From: Karen Braganza Date: Wed, 29 Jan 2025 13:47:38 -0500 Subject: [PATCH 2/8] Grammatical correction --- docs/apache-airflow/core-concepts/tasks.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/apache-airflow/core-concepts/tasks.rst b/docs/apache-airflow/core-concepts/tasks.rst index dbe10db53283c..4f9254c81079a 100644 --- a/docs/apache-airflow/core-concepts/tasks.rst +++ b/docs/apache-airflow/core-concepts/tasks.rst @@ -216,7 +216,7 @@ If you'd like to reproduce local task job heartbeat timeouts for development/tes sleep_dag() -Run the above DAG and wait for a while. You should see the task experiencing a heartbeat timeout and then being killed by the scheduler. +Run the above DAG and wait for a while. You should see the task experience a heartbeat timeout and then being killed by the scheduler. From 8aa1f72ff145285fd5082120542db6e0017bff7e Mon Sep 17 00:00:00 2001 From: Karen Braganza Date: Wed, 29 Jan 2025 13:48:43 -0500 Subject: [PATCH 3/8] Grammatical correction --- docs/apache-airflow/core-concepts/tasks.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/apache-airflow/core-concepts/tasks.rst b/docs/apache-airflow/core-concepts/tasks.rst index 4f9254c81079a..af6dcb0b12d29 100644 --- a/docs/apache-airflow/core-concepts/tasks.rst +++ b/docs/apache-airflow/core-concepts/tasks.rst @@ -216,7 +216,7 @@ If you'd like to reproduce local task job heartbeat timeouts for development/tes sleep_dag() -Run the above DAG and wait for a while. You should see the task experience a heartbeat timeout and then being killed by the scheduler. +Run the above DAG and wait for a while. You should see the task experience a heartbeat timeout and then get killed by the scheduler. From 6b8e595b23d9308f20e2e38577203006c88c1266 Mon Sep 17 00:00:00 2001 From: Karen Braganza Date: Tue, 4 Feb 2025 22:14:22 -0500 Subject: [PATCH 4/8] edit docs --- docs/apache-airflow/core-concepts/tasks.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/apache-airflow/core-concepts/tasks.rst b/docs/apache-airflow/core-concepts/tasks.rst index af6dcb0b12d29..ac87b3f394538 100644 --- a/docs/apache-airflow/core-concepts/tasks.rst +++ b/docs/apache-airflow/core-concepts/tasks.rst @@ -167,13 +167,13 @@ These can be useful if your code has extra knowledge about its environment and w .. _concepts:zombies: -Task Heartbeat Timeout (Zombie Tasks) ---------------------------------------- +Task Heartbeat Timeout +---------------------- No system runs perfectly, and task instances are expected to die once in a while. ``TaskInstances`` may get stuck in a ``running`` state despite their associated jobs being inactive -(e.g. their local task job did not send a recent heartbeat as it got killed, or the machine died). Such tasks are also known as zombie tasks. Airflow will find these +(for example, their local task job did not send a recent heartbeat as it got killed, or the machine died). Such tasks were formerly known as zombie tasks. Airflow will find these periodically, clean them up, and either fail or retry the task depending on its settings. The heartbeat of a local task job can timeout for many reasons, including: @@ -216,7 +216,7 @@ If you'd like to reproduce local task job heartbeat timeouts for development/tes sleep_dag() -Run the above DAG and wait for a while. You should see the task experience a heartbeat timeout and then get killed by the scheduler. +Run the above DAG and wait for a while. You should see the task experiencing a heartbeat timeout and then being killed by the scheduler. From 5e0aa2d8c716dd70825745c81434a03b555998b4 Mon Sep 17 00:00:00 2001 From: Karen Braganza Date: Tue, 4 Feb 2025 22:34:26 -0500 Subject: [PATCH 5/8] redirect URL --- docs/apache-airflow/static/redirects.js | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/apache-airflow/static/redirects.js b/docs/apache-airflow/static/redirects.js index 761409601faf2..c4b5bc1132c95 100644 --- a/docs/apache-airflow/static/redirects.js +++ b/docs/apache-airflow/static/redirects.js @@ -19,7 +19,7 @@ document.addEventListener("DOMContentLoaded", function () { const redirects = { - "zombie-undead-tasks": "zombie-tasks", + "zombie-undead-tasks": "task-heartbeat-timeout", }; const fragment = window.location.hash.substring(1); if (redirects[fragment]) { From b479fdc872f07e73c77912154f130ea84bceb208 Mon Sep 17 00:00:00 2001 From: Karen Braganza Date: Wed, 5 Feb 2025 15:54:28 -0500 Subject: [PATCH 6/8] Update docs/apache-airflow/core-concepts/tasks.rst Co-authored-by: Ryan Hatter <25823361+RNHTTR@users.noreply.github.com> --- docs/apache-airflow/core-concepts/tasks.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/apache-airflow/core-concepts/tasks.rst b/docs/apache-airflow/core-concepts/tasks.rst index ac87b3f394538..1f70dc4063829 100644 --- a/docs/apache-airflow/core-concepts/tasks.rst +++ b/docs/apache-airflow/core-concepts/tasks.rst @@ -173,7 +173,7 @@ Task Heartbeat Timeout No system runs perfectly, and task instances are expected to die once in a while. ``TaskInstances`` may get stuck in a ``running`` state despite their associated jobs being inactive -(for example, their local task job did not send a recent heartbeat as it got killed, or the machine died). Such tasks were formerly known as zombie tasks. Airflow will find these +(for example if the ``TaskInstance``'s worker ran out of memory). Such tasks were formerly known as zombie tasks. Airflow will find these periodically, clean them up, and either fail or retry the task depending on its settings. The heartbeat of a local task job can timeout for many reasons, including: From 1d16f1970f02c3170133380b1220e5f080174794 Mon Sep 17 00:00:00 2001 From: Karen Braganza Date: Wed, 5 Feb 2025 15:54:54 -0500 Subject: [PATCH 7/8] Update docs/apache-airflow/core-concepts/tasks.rst Co-authored-by: Ryan Hatter <25823361+RNHTTR@users.noreply.github.com> --- docs/apache-airflow/core-concepts/tasks.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/apache-airflow/core-concepts/tasks.rst b/docs/apache-airflow/core-concepts/tasks.rst index 1f70dc4063829..81e28d0473292 100644 --- a/docs/apache-airflow/core-concepts/tasks.rst +++ b/docs/apache-airflow/core-concepts/tasks.rst @@ -174,7 +174,7 @@ No system runs perfectly, and task instances are expected to die once in a while ``TaskInstances`` may get stuck in a ``running`` state despite their associated jobs being inactive (for example if the ``TaskInstance``'s worker ran out of memory). Such tasks were formerly known as zombie tasks. Airflow will find these -periodically, clean them up, and either fail or retry the task depending on its settings. The heartbeat of a local task job can timeout for +periodically, clean them up, and mark the ``TaskInstance`` as failed or retry it if it has available retries. The ``TaskInstance``'s heartbeat can timeout for many reasons, including: * The Airflow worker ran out of memory and was OOMKilled. From 5eb20baf45612a67eb49928805c9f04838cd9602 Mon Sep 17 00:00:00 2001 From: Karen Braganza Date: Wed, 5 Feb 2025 15:55:04 -0500 Subject: [PATCH 8/8] Update docs/apache-airflow/core-concepts/tasks.rst Co-authored-by: Ryan Hatter <25823361+RNHTTR@users.noreply.github.com> --- docs/apache-airflow/core-concepts/tasks.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/apache-airflow/core-concepts/tasks.rst b/docs/apache-airflow/core-concepts/tasks.rst index 81e28d0473292..df0a0bd382904 100644 --- a/docs/apache-airflow/core-concepts/tasks.rst +++ b/docs/apache-airflow/core-concepts/tasks.rst @@ -216,7 +216,7 @@ If you'd like to reproduce local task job heartbeat timeouts for development/tes sleep_dag() -Run the above DAG and wait for a while. You should see the task experiencing a heartbeat timeout and then being killed by the scheduler. +Run the above DAG and wait for a while. The ``TaskInstance`` will be marked failed after seconds.