Citus worker pods tend to go into recovery when brought back up #10288

Open
xin-hedera opened this issue Feb 3, 2025 · 0 comments
Labels
bug Type: Something isn't working database Area: Database


xin-hedera commented Feb 3, 2025

Description

When running Citus database runbook scripts, e.g., volume-snapshot.sh, worker pods tend to go into recovery when brought back up. The theory is that pods created and managed by StackGres have a default terminationGracePeriodSeconds of 60 seconds. If the graceful shutdown of the database server takes close to or longer than 60 seconds, the server is killed before its WAL is fully checkpointed, and recovery is guaranteed the next time the database starts.

Logs to compare

  • shard2 - likely killed instead of shutting down gracefully
2025-02-03 16:29:35,253 WARNING: Request to Citus coordinator leader mirror-citus-coord-0 http://XXX/patroni failed: ReadTimeoutError("HTTPConnectionPool(host='XXXX', port=8009): Read timed out. (read timeout=30)")
2025-02-03 16:29:35 UTC [1293]: db=,user=,app=,client= LOG: received fast shutdown request
2025-02-03 16:29:35 UTC [1293]: db=,user=,app=,client= LOG: aborting any active transactions
2025-02-03 16:29:35 UTC [1359]: db=postgres,user=postgres,app=Patroni restapi,client=[local] FATAL: terminating connection due to administrator command
2025-02-03 16:29:35 UTC [1306]: db=postgres,user=postgres,app=Patroni heartbeat,client=[local] FATAL: terminating connection due to administrator command
2025-02-03 16:29:35 UTC [1359]: db=postgres,user=postgres,app=Patroni restapi,client=[local] LOG: disconnection: session time: 1172:50:06.109 user=postgres database=postgres host=[local]
2025-02-03 16:29:35 UTC [1306]: db=postgres,user=postgres,app=Patroni heartbeat,client=[local] LOG: disconnection: session time: 1172:50:13.701 user=postgres database=postgres host=[local]
2025-02-03 16:29:35 UTC [1293]: db=,user=,app=,client= LOG: background worker "logical replication launcher" (PID 1323) exited with exit code 1
  • shard0 - checkpoint completed
2025-02-03 16:29:21 UTC [906]: db=,user=,app=,client= LOG: checkpoint complete: wrote 40034 buffers (1.5%); 0 WAL file(s) added, 5 removed, 0 recycled; write=1165.619 s, sync=0.367 s, total=1168.819 s; sync files=48331, longest=0.005 s, average=0.001 s; distance=126781 kB, estimate=126781 kB; lsn=178A/9C046700, redo lsn=178A/98329908
2025-02-03 16:29:21 UTC [906]: db=,user=,app=,client= LOG: shutting down
2025-02-03 16:29:21 UTC [906]: db=,user=,app=,client= LOG: checkpoint starting: shutdown immediate
2025-02-03 16:29:43 UTC [906]: db=,user=,app=,client= LOG: checkpoint complete: wrote 16405 buffers (0.6%); 0 WAL file(s) added, 1 removed, 0 recycled; write=19.744 s, sync=0.244 s, total=21.503 s; sync files=430, longest=0.041 s, average=0.001 s; distance=78681 kB, estimate=121971 kB; lsn=178A/9D000028, redo lsn=178A/9D000028
2025-02-03 16:29:46 UTC [886]: db=,user=,app=,client= LOG: database system is shut down
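
One way to compare the two excerpts is to measure how long each shutdown took and check it against the 60-second grace period. The following is a minimal Python sketch, not part of the runbook scripts; the function name `shutdown_seconds` and the choice of log markers are assumptions made for illustration. It measures from the first shutdown log line, so the real time elapsed since the pod received its termination signal may be somewhat longer.

```python
import re
from datetime import datetime

# Leading timestamp of a PostgreSQL log line, e.g. "2025-02-03 16:29:21 UTC".
# Patroni lines (timestamp followed by ",253 WARNING") do not match and are skipped.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) UTC")

# Assumed markers, taken from the log excerpts in this issue.
START_MARKERS = ("received fast shutdown request", "LOG: shutting down")
END_MARKER = "database system is shut down"


def shutdown_seconds(log_lines):
    """Return seconds from the first shutdown marker to the final
    "database system is shut down" line, or None if either is missing."""
    start = end = None
    for line in log_lines:
        m = TS.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        if start is None and any(marker in line for marker in START_MARKERS):
            start = ts
        if END_MARKER in line:
            end = ts
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


# Lines copied from the excerpts above.
shard0 = [
    "2025-02-03 16:29:21 UTC [906]: db=,user=,app=,client= LOG: shutting down",
    "2025-02-03 16:29:21 UTC [906]: db=,user=,app=,client= LOG: checkpoint starting: shutdown immediate",
    "2025-02-03 16:29:46 UTC [886]: db=,user=,app=,client= LOG: database system is shut down",
]
shard2 = [
    "2025-02-03 16:29:35 UTC [1293]: db=,user=,app=,client= LOG: received fast shutdown request",
    "2025-02-03 16:29:35 UTC [1293]: db=,user=,app=,client= LOG: aborting any active transactions",
]

GRACE_PERIOD_SECONDS = 60  # default terminationGracePeriodSeconds per the description

for name, lines in (("shard0", shard0), ("shard2", shard2)):
    took = shutdown_seconds(lines)
    if took is None:
        print(f"{name}: no completed shutdown found in excerpt")
    else:
        print(f"{name}: shutdown took {took:.0f}s of {GRACE_PERIOD_SECONDS}s")
```

For shard0 this gives 25 seconds (16:29:21 to 16:29:46), well inside the grace period. For shard2 the excerpt has no final "database system is shut down" line, which is consistent with the process being killed, though the excerpt may simply be truncated.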

Steps to reproduce

See the description.

Additional context

No response

Hedera network

other

Version

v0.121.2

Operating system

None

@xin-hedera xin-hedera added bug Type: Something isn't working database Area: Database labels Feb 3, 2025