
DHCP lease timeout #4

Open
craig-willis opened this issue Dec 19, 2017 · 5 comments

@craig-willis (Collaborator)

We're seeing frequent log entries indicating network configuration changes:

Dec 19 14:00:27 host-192-168-149-8 systemd-timesyncd[605]: Network configuration changed, trying to establish connection.
Dec 19 14:00:27 host-192-168-149-8 systemd-timesyncd[605]: Synchronized to time server 129.114.97.2:123 (129.114.97.2).
Dec 19 14:00:47 host-192-168-149-8 systemd-timesyncd[605]: Network configuration changed, trying to establish connection.
Dec 19 14:00:47 host-192-168-149-8 systemd-timesyncd[605]: Synchronized to time server 129.114.97.2:123 (129.114.97.2).
...
Dec 19 14:02:41 host-192-168-149-8 systemd-timesyncd[605]: Network configuration changed, trying to establish connection.
Dec 19 14:02:41 host-192-168-149-8 systemd-timesyncd[605]: Synchronized to time server 129.114.97.2:123 (129.114.97.2).
Dec 19 14:03:00 host-192-168-149-8 systemd-timesyncd[605]: Network configuration changed, trying to establish connection.
Dec 19 14:03:00 host-192-168-149-8 systemd-timesyncd[605]: Synchronized to time server 129.114.97.2:123 (129.114.97.2).

Since we're using Docker swarm, we're also seeing frequent "node join" events as the system responds to the network change.
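
To confirm the correlation, one option (a sketch; assumes access to a swarm manager node) is to watch swarm node events alongside the journal:

$ docker events --filter type=node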

This may be related to the short DHCP lease timeout:

$ cat /run/systemd/netif/leases/2
..
MTU=9000
T1=133
T2=245
LIFETIME=300
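
In DHCP terms (RFC 2131), T1 is the renewal time and T2 the rebinding time, so with LIFETIME=300 the client starts renegotiating after T1=133 seconds, i.e. every couple of minutes, consistent with the timesyncd messages above. To watch the lease events directly (assuming systemd-networkd is the DHCP client here, which the lease path suggests):

$ journalctl -fu systemd-networkd | grep -i dhcp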

According to the OpenStack docs, the default value of dhcp_lease_duration is 24 hours (86400 seconds).

Confirm with TACC why the lease is so short and consider the impacts.
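
For reference, this knob lives on the cloud operator's side; a sketch of what it looks like, assuming TACC runs OpenStack Neutron:

# /etc/neutron/neutron.conf -- operator-controlled, not editable from our instances
[DEFAULT]
# DHCP lease duration in seconds; the documented default is 86400 (24 hours)
dhcp_lease_duration = 86400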

@craig-willis craig-willis self-assigned this Dec 19, 2017
@craig-willis (Collaborator, Author) commented Dec 19, 2017

From tickets.xsede.org #80694:

DHCP leases are short primarily because of suspend and migration issues.

Essentially, during either Suspend or non-live Migration, the VM's internal clock stops.
Back in the "real" world, the DHCP server's clock didn't stop.

When the VM resumes, its lease may already be expired, but it doesn't ask for a new lease until its internal timer fires, at which point it renegotiates. If we left it at 24 hours, the pool of IPs could be used up and/or VMs might wait up to 24 hours to renegotiate.
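
If that's the rationale, one mitigation on our side (a sketch, assuming systemd-networkd manages the interface) would be to force a renegotiation immediately after a resume rather than waiting for the timer:

$ sudo systemctl restart systemd-networkd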

Worth noting, for comparison (T1, T2, LIFETIME in seconds):

Site          MTU    T1      T2       LIFETIME
SDSC/Cloud    1458   78090   142890   172800
NCSA/Nebula   1454   40307   72707    86400

@craig-willis (Collaborator, Author)

Comment from SDSC about their longer lease time:

Don't think we have given it much thought; suspend and non-live migration are not very common at SDSC. Each project has its own subnet and pool of IPs, so unused IPs have not been a concern. Perhaps if we couldn't live migrate then this would be a concern, but there are very few circumstances where we can't.

@craig-willis (Collaborator, Author)

@Xarthisius I don't think TACC will change the DHCP lease timeout based on the above -- it seems to have been an intentional decision. Do you have any further questions? Otherwise, I think we should go forward expecting frequent network config changes and swarm join log entries.

@craig-willis (Collaborator, Author)

Actually, I just had another response from Jetstream:

since you’re already playing with fire, I’ll assume you’re willing to play with explosives as well ;)
If you want to adjust the DHCP life time for an instance, on that instance edit the file (on RHEL anyway) /etc/dhcp/dhclient.conf and add the line
supersede dhcp-lease-time XXXXX;
where XXXXX is the number of seconds you want your lease to have.
Obviously, if you’re unable to communicate with an instance after resuming it, you may have to wait for the lease time to run out.
Please, let us know how that works out,

So perhaps we can override the lease time on our end, if needed.
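
A minimal sketch of that override (86400 seconds, i.e. 24 hours, is an illustrative value, not a Jetstream recommendation):

# /etc/dhcp/dhclient.conf
# Ask the DHCP server for a 24-hour lease on the next negotiation
supersede dhcp-lease-time 86400;

Note this only applies where dhclient manages the interface; on images where systemd-networkd is the DHCP client (as the /run/systemd/netif/leases path above suggests), dhclient.conf has no effect.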

@Xarthisius (Contributor)

No, until we have a concrete issue that this is causing, I don't think we can push them.
