Stolon cluster doesn't update master after upgrade #154
Any ideas here would be greatly appreciated @sgotti - looks like I'm working the weekend here in the US to get this resolved... If I used the etcdctl utility to wipe out the config when we do an upgrade, would that effectively address this situation? Wouldn't the stolon cluster come up, figure out the master, and start working correctly in that case?
I can confirm that I used etcdctl to kill the stolon config like so:
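The exact etcdctl invocation isn't reproduced above; as a rough sketch, assuming the etcd v2 API and stolon's default /stolon/cluster/<cluster-name> key prefix (the cluster name below is a placeholder), it amounts to recursively deleting the cluster's keys:

```
# recursively remove everything stolon wrote for this cluster (etcd v2 API);
# the prefix and cluster name are assumptions -- adjust to your deployment
etcdctl rm --recursive /stolon/cluster/stolon-cluster
```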
Then I restarted the proxy & the sentinel, and the cluster seems to be working again. Going forward, I'd like to understand whether this approach is appropriate. It might make sense for stolon to do periodic checks to verify which node is the master and update the config if it has changed, especially in a cloud-centric environment.
I don't know how you are upgrading (and what is being upgraded), but it looks like the upgrade steps did some quite strange things:
Looks like a new keeper is taking the IP of an old keeper (172.18.62.12). If this is under Kubernetes then it looks strange (are you defining single pods, a PetSet, or something else?). This, combined with the sentinel not being the leader (why the sentinel is not electing itself as leader isn't clear without the sentinel debug logs), means that the clusterview and the keeper states will not be updated (they are managed only by the leader sentinel). So the proxy will keep pointing to 172.18.62.12:5432, which is no longer 0c9aa03a but e738a36b, a standby.

All of this will be fixed once the sentinel becomes the leader and updates the clusterview with the correct IPs. I'll add that, even in this strange state, db consistency is kept, which is stolon's primary goal.
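If you want to see exactly what the proxy is following, you can look at the cluster data stolon keeps in the store. A minimal sketch, assuming the etcd v2 API and the default /stolon/cluster/<cluster-name> prefix (cluster name and key names are placeholders; the layout differs between stolon versions):

```
# list every key stolon has written for this cluster
# (prefix and cluster name are assumptions -- adjust to your deployment)
etcdctl ls --recursive /stolon/cluster/stolon-cluster

# then dump the key that holds the cluster view / proxy config found in the
# listing above, e.g.:
# etcdctl get /stolon/cluster/stolon-cluster/<key-from-listing>
```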
Some notes related to your next questions/comments:
I hope you did this with only the "real" master keeper active, or, since you are using
Having more than one sentinel will help if one of them dies or isn't able to talk to etcd. In the next weeks I'm going to implement additional features for stolon (like PITR) that will probably require some architectural changes (and could break backward compatibility). I'll take advantage of this breaking point (trying to avoid future ones) to fix or improve the real-world experience with stolon. For example, on the upgrade side I'll be happy to hear your use cases and how you're doing this now, and to see how stolon can be improved there (and document a suggested upgrade procedure).
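For what it's worth, adding a sentinel is just a matter of starting another stolon-sentinel process (or pod) pointed at the same store and cluster name. A sketch only: the flag names below are assumptions and vary between stolon releases, so check stolon-sentinel --help for your version:

```
# second sentinel instance joining the same cluster
# (flag names and endpoints are assumptions -- verify against your stolon version)
stolon-sentinel --cluster-name=stolon-cluster \
  --store-backend=etcd \
  --store-endpoints=http://127.0.0.1:2379
```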
My use case is such that when we perform an upgrade, I'm fine with losing everything in etcd, to start over if you will, as long as stolon refreshes the configuration when it comes back up. During our upgrade process everything (stolon cluster, etcd cluster, other components) gets upgraded simultaneously. The stolon containers each have their own shared storage volume that retains the database data, so theoretically we should lose little to nothing when we upgrade. Without going into specifics, we don't store a ton in the database to begin with, so it's not critical data for us. We'd simply like to avoid having to recreate most of it in case of some sort of failure. I'm going to add a second sentinel here to try and avoid this scenario in the future. And I very well may just wipe the etcd data on each upgrade.
@wchrisjohnson Given that having a standby keeper take the IP of the master keeper is a really strange situation that should probably be fixed in the deployment procedure, in the future we'll try to detect cases where a keeper takes another keeper's IP even when a sentinel is not running (#171). As an additional note: deleting the cluster data in the store to make it elect a random keeper with existing data isn't a supported operation, and in the future its behavior will change (#160), leading to data removal. For the moment I'm closing this; feel free to reopen with additional information.
After an upgrade of our cluster, we are seeing messages that the client cannot perform any operations against the database. It appears that our application is pointed at the wrong keeper (standby) instead of the master.
I checked the status of the cluster:
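The output itself isn't reproduced above; a status check of this kind looks roughly like the following (flags and cluster name are assumptions for the stolon version in use - see stolonctl --help):

```
# show stolon's view of the cluster: keepers, their reported roles, and the
# proxy config (flags are assumptions -- check stolonctl --help for your version)
stolonctl status --cluster-name=stolon-cluster \
  --store-backend=etcd --store-endpoints=http://127.0.0.1:2379
```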
Then I jumped on each of the three keepers and did the following:
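The per-keeper commands didn't make it into this thread either; a typical check of this kind is to ask each keeper's local PostgreSQL whether it believes it is in recovery (connection details below are placeholders):

```
# on each keeper: a standby returns 't', the master returns 'f'
# (host/port/user are placeholders -- adjust to your keeper's settings)
psql -h 127.0.0.1 -p 5432 -U stolon -c "SELECT pg_is_in_recovery();"
```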