Way to change master without downtime? #255
Comments
@albertvaka The declarative config of stolon (like kubernetes, pacemaker, etc.) declares the desired cluster state, and the leader sentinel calculates the best way to achieve it. So it's not really possible to just define the wanted master: to do that you would have to rewrite the cluster config in the store, which would be a big job since it would require adding a stolonctl command that reimplements a lot of the sentinel logic for finding the best standby. Even then it wouldn't be possible to achieve zero downtime (only a lower downtime), since the keepers and proxies need to react to the new cluster config.

Another solution would be to add a stolonctl command that marks a keeper as faulted, so the sentinel elects a new master without waiting for the failure timeout. Today we achieve this by restarting the keeper; since postgres is usually very fast to start, this takes only a few seconds. Have you tried this approach?

An improvement on the above would be to make the keeper handle a signal (like SIGUSR1) and restart the postgres instance when it receives it. That would be a bit faster than restarting the keeper, since it skips the keeper boot registration (which takes a few seconds). What do you think?
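For illustration only, here is a minimal sketch (not stolon's actual keeper code) of what such a SIGUSR1 handler could look like; the data directory path is a made-up example and it assumes `pg_ctl` is on the PATH:

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

// dataDir is a hypothetical postgres data directory managed by the keeper.
const dataDir = "/var/lib/postgresql/data"

// restartPostgres performs a fast restart of the managed instance via pg_ctl.
func restartPostgres() error {
	cmd := exec.Command("pg_ctl", "restart", "-D", dataDir, "-m", "fast")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGUSR1)
	for range sigs {
		log.Println("SIGUSR1 received, restarting the postgres instance")
		if err := restartPostgres(); err != nil {
			log.Printf("restart failed: %v", err)
		}
	}
}
```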
The solution we found was to kill the postgres server from within the keeper with a SIGINT (fast shutdown) and then let the keeper restart it. The downtime was small enough doing it this way. Maybe this could even be triggered automatically when updating the cluster spec, if a field that requires a restart is changed?
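A minimal sketch of this workaround, run outside stolon: read the postmaster pid and send SIGINT (fast shutdown) so the keeper notices the instance is down and restarts it. The pid file path is an assumption; adjust it to the keeper's data directory.

```go
package main

import (
	"bufio"
	"log"
	"os"
	"strconv"
	"strings"
	"syscall"
)

// pidFile is a hypothetical location; point it at the keeper's data directory.
const pidFile = "/var/lib/postgresql/data/postmaster.pid"

func main() {
	f, err := os.Open(pidFile)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// The first line of postmaster.pid holds the postmaster's pid.
	scanner := bufio.NewScanner(f)
	if !scanner.Scan() {
		log.Fatal("empty postmaster.pid")
	}
	pid, err := strconv.Atoi(strings.TrimSpace(scanner.Text()))
	if err != nil {
		log.Fatal(err)
	}

	// SIGINT asks postgres for a fast shutdown; the keeper then restarts it.
	if err := syscall.Kill(pid, syscall.SIGINT); err != nil {
		log.Fatal(err)
	}
	log.Printf("sent SIGINT to postmaster (pid %d)", pid)
}
```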
@albertvaka Can you please open a new issue for that proposed enhancement? We aren't currently doing this for these reasons:
@sgotti I am interested in this feature, and I think it is related to #235. Right now I am experimenting with stolon and I would like to create a Bosh release to automatically deploy a cluster. The idea is to use postgres 9.6, with a master and at least a couple of slaves. I would like to have a "safe" way to "evacuate the master" (mainly to apply upgrades) to whichever slave (I do not care which one). I think it is related to #235 because, instead of analyzing the parameters (to see if postgres needs to be restarted), the user could take responsibility for the workflow:
With this workflow the "evacuate" option becomes important, and the cluster operator is responsible for following the workflow and deciding whether PostgreSQL has to be restarted or not. I am suggesting this as a first step, but having the functionality you wrote about would be nice. What do you think?
@jriguera I commented on #235 since I noticed I hadn't answered it 😞. Your proposed workflow is very similar to what we're doing right now:
Does it match your idea (with the additional temporary enabling/disabling of synchronous replication)? I agree that automating this could be useful, but with the declarative approach it would be difficult to code, since we'd have to handle a complex state machine and find a recovery path if any of the steps fails, and I'd really like to avoid taking that path. Plus, people may want to use different workflows and can automate them outside of stolon.

But the initial question was about how to do this with the lowest possible downtime, and the above workflow doesn't achieve the minimal possible downtime. A solution for that is to kill/fast-shutdown the instance so the keeper restarts it (#255 (comment)), or to make the keeper trap an OS signal like SIGUSR1 and restart the instance. What about, as an initial step, adding @albertvaka's solution and your/my proposed workflow as a "tested" recipe to the documentation?
Looks like automating this would be quite difficult, so I agree that we could just document the "workaround", and it might be as effective as actually implementing all the logic :)
@albertvaka would you like to create a pull request that adds these recipes to the documentation?
@sgotti Nice, I like the idea! In my case the plan is to have a 3-node cluster with PG 9.6 and synchronous replication to the 2 slaves (that is what I want to start experimenting with). I was suggesting the evacuate behaviour because I thought that marking a keeper as dead would be easy to implement (I have not had enough time to look at the implementation, sorry). But since my plan is to try sync replication, it is enough for now. If nobody else is doing it, I will try to open a PR this week to add those recipes to the docs. Thanks!!!
@jriguera Great for the PR! Marking a keeper as faulted is an improvement that could be implemented, and asking a keeper to restart its managed postgres instance could be another step. I'll open some new issues to track these enhancements.
I don't feel knowledgeable enough to document this... It would be great if you do it, @jriguera!
Hi!
I'm using stolon in kubernetes in a production app, and I'm facing the following problem.
I want to change the max_connections setting of my pgsql. I'm making the change in the cluster spec using stolonctl, and it gets applied to the postgres config. This setting, however, requires a server restart to take effect (a config reload is not enough).
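For reference, a minimal sketch of how such a pgParameters change could be expressed; the value "300" is only an example, and the printed JSON is the kind of patch you would hand to stolonctl when updating the cluster spec:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

func main() {
	// Patch only the pgParameters section of the cluster spec.
	patch := map[string]interface{}{
		"pgParameters": map[string]string{
			"max_connections": "300", // example value
		},
	}
	b, err := json.Marshal(patch)
	if err != nil {
		log.Fatal(err)
	}
	// Prints {"pgParameters":{"max_connections":"300"}}
	fmt.Println(string(b))
}
```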
Since I have a master and a replica, I thought that I would be able to do it by first changing it in the replica and restarting it, then promoting it to master, and finally changing it for the old master and restarting it.
The problem, though, is that there is no way to manually promote a replica to master. The only way seems to be to kill the master and wait for the replica to be promoted, but (at least in my dev environment) this takes about a minute or two, during which the database is down.
Is there any solution to this problem? Can I tell my sentinels to change the master somehow, without downtime and without losing data?
Thanks!