Citus worker pods tend to go into recovery when brought back up #10288

xin-hedera · 2025-02-03T17:03:43Z

Description

When running Citus database runbook scripts, e.g., volume-snapshot.sh, worker pods tend to go into recovery when brought back up. The theory is that pods created and managed by StackGres have a default terminationGracePeriodSeconds of 60 seconds. If the graceful shutdown of the database server takes close to or longer than 60s, the server is killed before its WAL is fully checkpointed, and recovery is guaranteed the next time the database starts.
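One way to check the theory is to read the grace period from the pod spec and compare it with how long a shutdown actually took. The snippet below is only a sketch: it assumes the pod JSON was exported beforehand (for example with `kubectl get pod <name> -o json`), and the pod name `example-worker-0`, the helper names, and the sample durations are all hypothetical. The 30 second fallback is the Kubernetes default when the field is unset.

```python
import json

# Hypothetical pod JSON, trimmed to the fields this check needs.
SAMPLE_POD = json.dumps({
    "metadata": {"name": "example-worker-0"},  # hypothetical pod name
    "spec": {"terminationGracePeriodSeconds": 60},
})


def grace_period_seconds(pod_json, default=30):
    """Return spec.terminationGracePeriodSeconds from a pod's JSON.

    Falls back to `default` (the Kubernetes default of 30s) if the field is absent.
    """
    pod = json.loads(pod_json)
    return pod.get("spec", {}).get("terminationGracePeriodSeconds", default)


def may_be_killed(pod_json, observed_shutdown_seconds):
    """True if a shutdown of the given length would reach the grace period."""
    return observed_shutdown_seconds >= grace_period_seconds(pod_json)


if __name__ == "__main__":
    print("grace period:", grace_period_seconds(SAMPLE_POD))
    # 75 and 25 are made-up shutdown durations, used only for illustration.
    print("75s shutdown at risk:", may_be_killed(SAMPLE_POD, 75))
    print("25s shutdown at risk:", may_be_killed(SAMPLE_POD, 25))
```

If the observed shutdown time is regularly close to the grace period, that would support the theory; it does not by itself prove the pod was killed.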

Logs to compare

shard2 - likely killed instead of shut down gracefully

2025-02-03 16:29:35,253 WARNING: Request to Citus coordinator leader mirror-citus-coord-0 http://XXX/patroni failed: ReadTimeoutError("HTTPConnectionPool(host='XXXX', port=8009): Read timed out. (read timeout=30)")
2025-02-03 16:29:35 UTC [1293]: db=,user=,app=,client= LOG: received fast shutdown request
2025-02-03 16:29:35 UTC [1293]: db=,user=,app=,client= LOG: aborting any active transactions
2025-02-03 16:29:35 UTC [1359]: db=postgres,user=postgres,app=Patroni restapi,client=[local] FATAL: terminating connection due to administrator command
2025-02-03 16:29:35 UTC [1306]: db=postgres,user=postgres,app=Patroni heartbeat,client=[local] FATAL: terminating connection due to administrator command
2025-02-03 16:29:35 UTC [1359]: db=postgres,user=postgres,app=Patroni restapi,client=[local] LOG: disconnection: session time: 1172:50:06.109 user=postgres database=postgres host=[local]
2025-02-03 16:29:35 UTC [1306]: db=postgres,user=postgres,app=Patroni heartbeat,client=[local] LOG: disconnection: session time: 1172:50:13.701 user=postgres database=postgres host=[local]
2025-02-03 16:29:35 UTC [1293]: db=,user=,app=,client= LOG: background worker "logical replication launcher" (PID 1323) exited with exit code 1

shard0 - checkpoint completed

2025-02-03 16:29:21 UTC [906]: db=,user=,app=,client= LOG: checkpoint complete: wrote 40034 buffers (1.5%); 0 WAL file(s) added, 5 removed, 0 recycled; write=1165.619 s, sync=0.367 s, total=1168.819 s; sync files=48331, longest=0.005 s, average=0.001 s; distance=126781 kB, estimate=126781 kB; lsn=178A/9C046700, redo lsn=178A/98329908
2025-02-03 16:29:21 UTC [906]: db=,user=,app=,client= LOG: shutting down
2025-02-03 16:29:21 UTC [906]: db=,user=,app=,client= LOG: checkpoint starting: shutdown immediate
2025-02-03 16:29:43 UTC [906]: db=,user=,app=,client= LOG: checkpoint complete: wrote 16405 buffers (0.6%); 0 WAL file(s) added, 1 removed, 0 recycled; write=19.744 s, sync=0.244 s, total=21.503 s; sync files=430, longest=0.041 s, average=0.001 s; distance=78681 kB, estimate=121971 kB; lsn=178A/9D000028, redo lsn=178A/9D000028
2025-02-03 16:29:46 UTC [886]: db=,user=,app=,client= LOG: database system is shut down
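The comparison between the two excerpts can be made mechanical. The following is a rough sketch, not part of the runbook: it scans PostgreSQL log lines for the start of a shutdown and for the final "database system is shut down" message, then reports whether the shutdown finished and how long it took. The function name, the 60 second default, and the choice of marker strings are assumptions based only on the excerpts above.

```python
import re
from datetime import datetime

# Matches the PostgreSQL log prefix seen in the excerpts,
# e.g. "2025-02-03 16:29:21 UTC [906]:". Patroni lines use another prefix.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC \[\d+\]")

SHARD0 = [
    "2025-02-03 16:29:21 UTC [906]: db=,user=,app=,client= LOG: shutting down",
    "2025-02-03 16:29:21 UTC [906]: db=,user=,app=,client= LOG: checkpoint starting: shutdown immediate",
    "2025-02-03 16:29:46 UTC [886]: db=,user=,app=,client= LOG: database system is shut down",
]

SHARD2 = [
    "2025-02-03 16:29:35 UTC [1293]: db=,user=,app=,client= LOG: received fast shutdown request",
    "2025-02-03 16:29:35 UTC [1293]: db=,user=,app=,client= LOG: aborting any active transactions",
]


def shutdown_summary(lines, grace_seconds=60):
    """Summarize one shutdown attempt found in a list of log lines."""
    started = finished = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match is None:
            continue  # skip lines without the PostgreSQL prefix
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        begins = ("received fast shutdown request" in line
                  or line.rstrip().endswith("LOG: shutting down"))
        if begins and started is None:
            started = stamp  # keep the earliest shutdown marker
        elif "database system is shut down" in line:
            finished = stamp
    clean = started is not None and finished is not None
    seconds = (finished - started).total_seconds() if clean else None
    return {
        "clean": clean,
        "seconds": seconds,
        "within_grace": clean and seconds < grace_seconds,
    }


if __name__ == "__main__":
    print("shard0:", shutdown_summary(SHARD0))
    print("shard2:", shutdown_summary(SHARD2))
```

A missing final message in an excerpt only means the line was not captured, so a result of `clean: False` is a hint that the server was killed rather than proof.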

Steps to reproduce

See the description.

Additional context

No response

Hedera network

other

Version

v0.121.2

Operating system

None

xin-hedera added the bug (Type: Something isn't working) and database (Area: Database) labels on Feb 3, 2025
