Support k3s-agent deployment #128

Open
wants to merge 1 commit into base: master

Conversation

@jvassev commented May 13, 2024

No description provided.

@ctalledo (Member) left a comment

Hi @jvassev, thanks for the contribution, looks good to me. Just one minor request (see comments below).

Thanks!

@@ -1509,6 +1533,8 @@ function main() {
do_config_kubelet_rke2
elif kubelet_docker_systemd_deployment; then
do_config_kubelet_docker_systemd
elif kubelet_k3s_deploymet; then
@ctalledo (Member)

Please add a comment after line 1526 above.

@jvassev (Author) commented May 17, 2024

I discovered a few more missing pieces and added them too.
There is a strange issue when pods get rescheduled on CRI-O where I occasionally see:

level=error err="listen tcp :9100: bind: address already in use"

A simple pod recreation solves it. That's why I'm adding a sleep 20 before restarting k3s-agent.

Is there a smarter way to solve this?

@ctalledo (Member) left a comment

LGTM ...

@@ -1500,7 +1534,7 @@ function main() {
# * RKE2: Host-based kubelet managed by rke2-agent's systemd service (Rancher's RKE2 approach).
# * Systemd+Docker: Docker-based kubelet managed by a systemd service (Lokomotive's approach).
# * Systemd: Host-based kubelet managed by a systemd service (most common approach).
#
# * Systemd: k3s when run as an agent (k3s-agent.service), if k3s is run as controlplane + node (k3s.service) it will not work
@ctalledo (Member)

Thanks; to avoid having two "systemd" entries, I would re-word as: "k3s: when run as agent only (if k3s is run as control plane + node (i.e., k3s.service) it won't work)."

@ctalledo (Member) commented

I discovered a few more missing pieces and added them too. There is a strange issue when pods get rescheduled on CRI-O where I occasionally see:

level=error err="listen tcp :9100: bind: address already in use"

A simple pod recreation solves it. That's why I'm adding a sleep 20 before restarting k3s-agent.

Is there a smarter way to solve this?

I don't know, but it's certainly not ideal.

Is it because the k3s agent is not fully stopped after systemctl stop k3s-agent? If that's the case, then we could try looping until it is. Do you know what agent is using that tcp 9100 port?
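For what it's worth, a possible alternative to the fixed sleep 20, along the lines of the "looping until it is" idea above, would be a small wait loop. This is only a sketch: the port number 9100 and the 60-second cap are assumptions, not anything in the PR.

# Sketch: wait (up to ~60s) for TCP port 9100 to be released before restarting
# k3s-agent, instead of sleeping for a fixed 20 seconds.
systemctl stop k3s-agent
for i in $(seq 1 60); do
    ss -ltn | grep -q ':9100 ' || break
    sleep 1
done
systemctl start k3s-agent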

@jvassev (Author) commented May 20, 2024

In my case it was node-exporter, but it happens with other pods like calico-node. I'm sure systemctl stop k3s-agent blocks until the process is down.
Maybe the containerd-managed pods need to get wiped out too? I see this in do_config_kubelet_docker_systemd:
https://github.com/nestybox/sysbox-pkgr/blob/master/k8s/scripts/kubelet-config-helper.sh#L1396-L1401
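For illustration, a rough sketch of what wiping the leftover containerd-managed pods could look like (hypothetical; this is not the code at the link above, and the k3s containerd socket path is an assumption):

# Hypothetical cleanup: stop and remove every pod sandbox known to the runtime,
# so nothing keeps ports such as 9100 bound across the kubelet switch.
export CONTAINER_RUNTIME_ENDPOINT=unix:///run/k3s/containerd/containerd.sock
for pod in $(crictl pods -q); do
    crictl stopp "$pod"
    crictl rmp "$pod"
done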

@jvassev (Author) commented May 21, 2024

With that latest change I think the pod running the kubelet-config-helper.sh script is stopped because of the call to clean_runtime_state "$runtime", and the final systemctl start k3s-agent never has a chance to run.
Starting it manually fixes the node.

@jvassev (Author) commented May 23, 2024

After some more debugging I noticed that it just takes too long to kill the old *.slice units.
So the last change is to stop them in parallel.
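A minimal sketch of what stopping the pod slices in parallel might look like (the kubepods* unit name pattern is an assumption, not taken from the PR):

# Sketch: stop each kubepods-related slice in the background, then wait for all
# of them, rather than stopping them one at a time.
for slice in $(systemctl list-units --type=slice --no-legend 'kubepods*' | awk '{print $1}'); do
    systemctl stop "$slice" &
done
wait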

@ctalledo (Member) commented

With that latest change I think the pod running the kubelet-config-helper.sh script is stopped because of the call to clean_runtime_state "$runtime", and the final systemctl start k3s-agent never has a chance to run. Starting it manually fixes the node.

Mmm ... not sure about this. The kubelet-config-helper.sh does not run within a pod; it runs directly on the host (i.e., k8s node) as a systemd unit. The systemd unit is created and then started by the sysbox-deploy-k8s.sh script running inside the sysbox-deploy-k8s pod.

Thus the call to clean_runtime_state should not affect the execution of the kubelet-config-helper.sh. Maybe something else is going on?
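To illustrate the arrangement being described (an illustration only, not the actual sysbox-deploy-k8s.sh code): a privileged pod with the host's root filesystem mounted at /host, and with access to the host's systemd, could install and kick off the helper roughly like this:

# Illustration: drop a one-shot unit onto the host and start it, so the helper
# keeps running on the node even if the deploy pod itself is torn down.
cat > /host/etc/systemd/system/kubelet-config-helper.service <<'EOF'
[Unit]
Description=One-shot kubelet reconfiguration helper
[Service]
Type=oneshot
ExecStart=/usr/local/bin/kubelet-config-helper.sh
EOF
systemctl daemon-reload                               # assumes this reaches the host's systemd
systemctl start --no-block kubelet-config-helper.service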

@ctalledo (Member) commented

Hi @jvassev, thanks again for the contribution.

Where is this PR at? Is it ready for merging, or are you still debugging/testing it?

Thanks!
