Description
When running the Citus database runbook scripts, e.g., volume-snapshot.sh, worker pods tend to go into recovery when brought back up. The theory is that the pods created and managed by StackGres have a default terminationGracePeriodSeconds of 60 seconds; if the graceful shutdown of the database server takes close to or longer than 60s, the pod is killed before the WALs are fully checkpointed, and recovery is guaranteed the next time the database starts.
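The claimed 60s default can be verified by reading the grace period straight off a worker pod spec; the pod and namespace names below are placeholders, not the actual resource names:

# Check the termination grace period StackGres set on a Citus worker pod
kubectl get pod <citus-worker-pod> -n <namespace> \
  -o jsonpath='{.spec.terminationGracePeriodSeconds}{"\n"}'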
Logs to compare
shard2 - likely killed instead of being gracefully shut down
2025-02-03 16:29:35,253 WARNING: Request to Citus coordinator leader mirror-citus-coord-0 http://XXX/patroni failed: ReadTimeoutError("HTTPConnectionPool(host='XXXX', port=8009): Read timed out. (read timeout=30)")
2025-02-03 16:29:35 UTC [1293]: db=,user=,app=,client= LOG: received fast shutdown request
2025-02-03 16:29:35 UTC [1293]: db=,user=,app=,client= LOG: aborting any active transactions
2025-02-03 16:29:35 UTC [1359]: db=postgres,user=postgres,app=Patroni restapi,client=[local] FATAL: terminating connection due to administrator command
2025-02-03 16:29:35 UTC [1306]: db=postgres,user=postgres,app=Patroni heartbeat,client=[local] FATAL: terminating connection due to administrator command
2025-02-03 16:29:35 UTC [1359]: db=postgres,user=postgres,app=Patroni restapi,client=[local] LOG: disconnection: session time: 1172:50:06.109 user=postgres database=postgres host=[local]
2025-02-03 16:29:35 UTC [1306]: db=postgres,user=postgres,app=Patroni heartbeat,client=[local] LOG: disconnection: session time: 1172:50:13.701 user=postgres database=postgres host=[local]
2025-02-03 16:29:35 UTC [1293]: db=,user=,app=,client= LOG: background worker "logical replication launcher" (PID 1323) exited with exit code 1
shard0 - checkpoint completed
2025-02-03 16:29:21 UTC [906]: db=,user=,app=,client= LOG: checkpoint complete: wrote 40034 buffers (1.5%); 0 WAL file(s) added, 5 removed, 0 recycled; write=1165.619 s, sync=0.367 s, total=1168.819 s; sync files=48331, longest=0.005 s, average=0.001 s; distance=126781 kB, estimate=126781 kB; lsn=178A/9C046700, redo lsn=178A/98329908
2025-02-03 16:29:21 UTC [906]: db=,user=,app=,client= LOG: shutting down
2025-02-03 16:29:21 UTC [906]: db=,user=,app=,client= LOG: checkpoint starting: shutdown immediate
2025-02-03 16:29:43 UTC [906]: db=,user=,app=,client= LOG: checkpoint complete: wrote 16405 buffers (0.6%); 0 WAL file(s) added, 1 removed, 0 recycled; write=19.744 s, sync=0.244 s, total=21.503 s; sync files=430, longest=0.041 s, average=0.001 s; distance=78681 kB, estimate=121971 kB; lsn=178A/9D000028, redo lsn=178A/9D000028
2025-02-03 16:29:46 UTC [886]: db=,user=,app=,client= LOG: database system is shut down
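One possible mitigation, sketched here as an assumption rather than a verified fix, is for the runbook to force a checkpoint on each worker immediately before shutting the pods down, so the shutdown checkpoint has little left to write and fits inside the grace period; pod and namespace names are again placeholders, and this assumes psql is reachable in the pod's default container:

# Hypothetical pre-shutdown step: flush dirty buffers ahead of the fast shutdown
kubectl exec <citus-worker-pod> -n <namespace> -- psql -U postgres -c 'CHECKPOINT;'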
Steps to reproduce
check the description
Additional context
No response
Hedera network
other
Version
v0.121.2
Operating system
None