konnectivity fails on kubectl cp #261
The problem also occurs once I try to copy a 20M file to a container.
Then, trying to get some logs:
In the log,
Thank you!
Here is what I have found so far:
Once I remove this agent identifier, the problem goes away. It is strange to me, but it seems the destHost strategy has some bug - later on I will check that part of the code. Thank you!
Sorry, right now just a passing comment; will dig in later. It seems clear that default should return a backend irrespective of what destHost does. However, if the agent identifier is changing that, it sounds like something is wrong. My best guess is that something about having an agent identifier is preventing the backend (agent) from being registered with the default backend manager. Also curious what the IP is for the test-k8s-cp-ds-mzlbb pod; I'm guessing it does not match the agent identifier.
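For illustration only, here is a simplified model (not the actual ANP code) of how ordered proxy strategies are expected to behave: each configured manager is tried in turn, and the default manager should hand back some registered backend no matter which destination host is requested.

```go
// Simplified sketch of ordered proxy strategies; types and names here are
// illustrative placeholders, not the real apiserver-network-proxy types.
package strategy

// Backend stands in for a registered agent connection.
type Backend interface {
	Send(packet []byte) error
}

// BackendManager stands in for one proxy strategy (destHost, default, ...).
type BackendManager interface {
	// Backend returns a backend for the requested destination host, if any.
	Backend(destHost string) (Backend, bool)
}

// selectBackend consults the managers in their configured order, e.g.
// [destHost, default]. "default" should match any registered agent, so a miss
// here normally means the agent never registered with that manager at all.
func selectBackend(managers []BackendManager, destHost string) (Backend, bool) {
	for _, m := range managers {
		if b, ok := m.Backend(destHost); ok {
			return b, true
		}
	}
	return nil, false
}
```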
Hi! Yes, the identifier is the node private IP where the agent is running, e.g. 10.10.42.22.
Do you think when a
Thank you!
I'm just guessing. I haven't spent much time playing with kubectl cp.
@mihivagyok Out of curiosity, how many agents do you have?
Hi! In this test, I have two workers, so there are 2 agents. Thank you!
Just tried to repro locally using the following commands.
So UDS + GRPC and two clients. No luck so far. Does seem like something about either data volume or the cp command. Plan on trying on a proper K8s cluster tomorrow. What version of server and client are you using?
@cheftako
Thank you!
Few more breadcrumbs. (Still working on narrowing down the repro.)
I0829 23:38:51.775518 184241 round_trippers.go:405] POST https://X.X.X.X/api/v1/namespaces/kube-system/pods/kube-proxy-kubernetes-minion-group-xxxx/exec?command=tar&command=xf&command=-&command=-C&command=%2Ftmp&container=kube-proxy&container=kube-proxy&stderr=true&stdin=true&stdout=true 101 Switching Protocols in 228 milliseconds
KAS logs show the following:
I0830 06:39:01.873427 10 client.go:107] "[tracing] recv packet" type="DATA"
OK, tried with agent-identifiers, the destHost,default strategies, grpc, and 3 agents. No problem I could see. Tomorrow I will try switching to http-connect. Maybe we're hitting a size limit trying to support SPDY and http-connect. (Not sure where agent-id falls in that.)
Yes, okay, I will try to investigate it too. We have a bunch of customers who are complaining, so I still believe the problem is real.
@cheftako If you are open to a webex session, I could show how we configure Konnectivity and the problem itself. Thank you!
@mihivagyok I would be open to a webex session. Feel free to reach out on Slack to arrange some time.
Worth checking out kubernetes/kubernetes#60140 (comment)
Thanks, I read it, although I think our problem is a bit different, because once I change the config, everything works well. Thank you!
Thanks for meeting with me and showing me more details of how to repro. I have now been able to repro, although it still seems like a less frequent problem on my test system. Having reproduced on my system, I was able to look at the relevant agent logs. The last thing I see in the agent log for the relevant connection is
the agent log continues after that, but I see nothing more for connectionID=4. Going to keep digging.
Breadcrumbs. Managed to get goroutine dumps on the Konnectivity Server and Konnectivity Agent after I repro'd the issue. I then subtracted the goroutine dumps from the same processes when they were idle. Here are the diffs.
1 @ 0x43ae45 0x44c057 0x93291a 0x947dd9 0x9a3462 0x13cbee9 0x13ce11f 0x13dad9b 0x6e9523 0x6e4a4d 0x471d01
# 0x932919 google.golang.org/grpc/internal/transport.(*writeQuota).get+0x79 /go/pkg/mod/google.golang.org/[email protected]/internal/transport/flowcontrol.go:59
# 0x947dd8 google.golang.org/grpc/internal/transport.(*http2Server).Write+0x1f8 /go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:927
# 0x9a3461 google.golang.org/grpc.(*serverStream).SendMsg+0x241 /go/pkg/mod/google.golang.org/[email protected]/stream.go:1421
# 0x13cbee8 sigs.k8s.io/apiserver-network-proxy/proto/agent.(*agentServiceConnectServer).Send+0x48 /go/src/sigs.k8s.io/apiserver-network-proxy/proto/agent/agent.pb.go:147
# 0x13ce11e sigs.k8s.io/apiserver-network-proxy/pkg/server.(*backend).Send+0x7e /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/backend_manager.go:88
# 0x13dad9a sigs.k8s.io/apiserver-network-proxy/pkg/server.(*Tunnel).ServeHTTP+0xb5a /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/tunnel.go:149
# 0x6e9522 net/http.serverHandler.ServeHTTP+0xa2 /usr/local/go/src/net/http/server.go:2887
# 0x6e4a4c net/http.(*conn).serve+0x8cc /usr/local/go/src/net/http/server.go:1952
1/3 @ 0x43ae45 0x43349b 0x46c275 0x4da265 0x4db355 0x4db337 0x5c1e0f 0x5d4271 0x6268c3 0x50c03e 0x626b33 0x623975 0x629d45 0x629d50 0x6de759 0x5110a2 0x13dac65 0x6e9523 0x6e4a4d 0x471d01
# 0x46c274 internal/poll.runtime_pollWait+0x54 /usr/local/go/src/runtime/netpoll.go:222
# 0x4da264 internal/poll.(*pollDesc).wait+0x44 /usr/local/go/src/internal/poll/fd_poll_runtime.go:87
# 0x4db354 internal/poll.(*pollDesc).waitRead+0x1d4 /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
# 0x4db336 internal/poll.(*FD).Read+0x1b6 /usr/local/go/src/internal/poll/fd_unix.go:166
# 0x5c1e0e net.(*netFD).Read+0x4e /usr/local/go/src/net/fd_posix.go:55
# 0x5d4270 net.(*conn).Read+0x90 /usr/local/go/src/net/net.go:183
# 0x6268c2 crypto/tls.(*atLeastReader).Read+0x62 /usr/local/go/src/crypto/tls/conn.go:776
# 0x50c03d bytes.(*Buffer).ReadFrom+0xbd /usr/local/go/src/bytes/buffer.go:204
# 0x626b32 crypto/tls.(*Conn).readFromUntil+0xf2 /usr/local/go/src/crypto/tls/conn.go:798
# 0x623974 crypto/tls.(*Conn).readRecordOrCCS+0x114 /usr/local/go/src/crypto/tls/conn.go:605
# 0x629d44 crypto/tls.(*Conn).readRecord+0x164 /usr/local/go/src/crypto/tls/conn.go:573
# 0x629d4f crypto/tls.(*Conn).Read+0x16f /usr/local/go/src/crypto/tls/conn.go:1276
# 0x6de758 net/http.(*connReader).Read+0x1b8 /usr/local/go/src/net/http/server.go:800
# 0x5110a1 bufio.(*Reader).Read+0x141 /usr/local/go/src/bufio/bufio.go:213
# 0x13dac64 sigs.k8s.io/apiserver-network-proxy/pkg/server.(*Tunnel).ServeHTTP+0xa24 /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/tunnel.go:129
# 0x6e9522 net/http.serverHandler.ServeHTTP+0xa2 /usr/local/go/src/net/http/server.go:2887
# 0x6e4a4c net/http.(*conn).serve+0x8cc /usr/local/go/src/net/http/server.go:1952
4 @ 0x43aec5 0x4068cf 0x40654b 0x934c05 0x4710e1
# 0x934c04 sigs.k8s.io/apiserver-network-proxy/pkg/agent.(*Client).proxyToRemote+0x4c4 /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/agent/client.go:503
4 @ 0x43aec5 0x43351b 0x46b9f5 0x4d2bc5 0x4d3cb5 0x4d3c97 0x5a646f 0x5b88d1 0x9341d8 0x4710e1
# 0x46b9f4 internal/poll.runtime_pollWait+0x54 /usr/local/go/src/runtime/netpoll.go:222
# 0x4d2bc4 internal/poll.(*pollDesc).wait+0x44 /usr/local/go/src/internal/poll/fd_poll_runtime.go:87
# 0x4d3cb4 internal/poll.(*pollDesc).waitRead+0x1d4 /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
# 0x4d3c96 internal/poll.(*FD).Read+0x1b6 /usr/local/go/src/internal/poll/fd_unix.go:166
# 0x5a646e net.(*netFD).Read+0x4e /usr/local/go/src/net/fd_posix.go:55
# 0x5b88d0 net.(*conn).Read+0x90 /usr/local/go/src/net/net.go:183
# 0x9341d7 sigs.k8s.io/apiserver-network-proxy/pkg/agent.(*Client).remoteToProxy+0xd7 /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/agent/client.go:478
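For anyone following along: a full goroutine dump like the one above can be pulled from any Go process that serves the standard net/http/pprof handlers. Whether and where the Konnectivity server and agent expose that endpoint depends on how they are built and configured, so the address below is an assumption; diffing against an idle dump is the manual step described above.

```go
// Sketch: fetch a debug=2 goroutine dump (full stacks for every goroutine)
// from a pprof endpoint and save it for diffing against an "idle" dump.
// The host/port is a placeholder assumption.
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:8095/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("goroutines-after-hang.txt")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	// Write the raw dump; compare it to a dump taken before the hang to see
	// which goroutines are new or stuck.
	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}
```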
The grpc quota writer is known to block writes if the other end has stopped reading (or is reading too slowly). Need to work out whether that's what's happening, which end is the other end, and why.
Hi @cheftako, could you be more specific about the grpc quota writer? What's the impact on ANP? I suspect #268 is similar to this one.
@cheftako did you happen to find anything on this? We are still seeing problems.
This PR aims to help start fixing the problem (by randomizing across a pool of multiple redundant instances that can serve the same IPs/hosts).
Let me share our use case and how we use the agent-identifier. So the idea is the following:
I'm also trying to figure out what is happening here, and I have some interesting findings which I cannot fully explain: the agent container has a CPU limit configured, but that limit is way too low. If I remove the limit, I basically cannot repro the issue anymore. My test is the following:
When the CPU limit is:
I'm also checking the metrics (kubectl top), but it does not show high usage - around ~200m. Agent settings:
Server settings:
Version is 0.0.24. What I still don't understand: when the limit is low (50m) and the destHost proxy strategy is not used, it works even with big files - of course, it is quite slow, but it works. I'm still trying to figure out how that is possible. Do you have any idea about that? Thank you!
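For reference, a hypothetical version of the repro loop described above (namespace, pod name, and file path are placeholders I made up); it just runs kubectl cp repeatedly and reports the iteration that hangs or fails.

```go
// Hypothetical repro helper: copy a file into a pod in a loop until kubectl cp
// hangs (killed by the timeout) or fails. All names and paths are placeholders.
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

func main() {
	for i := 1; ; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
		cmd := exec.CommandContext(ctx, "kubectl", "cp",
			"./testfile-20M.bin", "default/test-pod:/tmp/testfile-20M.bin")
		out, err := cmd.CombinedOutput()
		cancel()

		switch {
		case errors.Is(ctx.Err(), context.DeadlineExceeded):
			fmt.Printf("copy #%d hung and was killed after the timeout\n", i)
			return
		case err != nil:
			fmt.Printf("copy #%d failed: %v\n%s\n", i, err, out)
			return
		default:
			fmt.Printf("copy #%d succeeded\n", i)
		}
	}
}
```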
I have new test results which show the following:
I will execute the same test when the cluster has one node only - to see whether destHost has any effect in that case. Thank you!
Okay, I see the following:
Transfer starts and this is the last log regarding the
Then there is nothing about the connection closing.
That's the end. I don't see closing logs either. Then the next
Here I think it is stuck in the grpc SendMsg call: I have added a log message after this line, but it never prints. I also see that other service traffic works:
The agent also finishes fine:
What do you think? @caesarxuchao @cheftako @jkh52 Thank you very much!
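One hedged way to confirm the blocked-SendMsg suspicion (rather than logging only after the call returns, which never happens if it blocks) is to wrap the send in a small watchdog that logs while the call is still outstanding. The helper below is a sketch I made up; sendFn would be whatever the server's send path ends up calling, and the 10-second threshold is arbitrary.

```go
// Watchdog sketch for a possibly-blocking send: run the send in a goroutine
// and periodically log if it has not returned yet (e.g. because the gRPC
// flow-control write quota is exhausted). Names and threshold are made up.
package sendwatch

import (
	"log"
	"time"
)

func sendWithWatchdog(connID int64, sendFn func() error) error {
	done := make(chan error, 1)
	go func() { done <- sendFn() }()

	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case err := <-done:
			return err
		case <-ticker.C:
			log.Printf("send for connectionID=%d still blocked after another 10s "+
				"(possible stalled gRPC stream)", connID)
		}
	}
}
```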
I have also checked for the log "Receive channel on Proxy is full", but it does not appear (yes, I have enabled it in the parameters). Thanks!
A similar finding: GRPC itself might get stuck, as in #268, but I think they are triggered in different places.
Sorry I've been quiet on this one. I've added a few optimizations on the channels, but while they seem to let more data through on the
That would suggest grpc/grpc-go#2427. However, that essentially says the other end (the agent) is not reading. In this reproduction of the issue I have exactly 1 agent, and if I look at its goroutines I find the following.
So it would appear it is calling RecvMsg. Seems like it's time to engage the GRPC folks.
Hi All! I have run my tests with the latest v0.0.26 (compiled with golang 1.16.10), and I still see the issue (although it has newer GRPC). Thank you!
I commented on #324, but let me repeat it: I think we have some idea why kubectl cp eventually fails. The issue comes when we have multiple proxy strategies. The backend (agent) connections are guarded by a mutex, but if there are multiple strategies, then the same backend with the same connection will be registered with multiple backend managers. To illustrate this:
We have two agent instances (one for the DestHost and one for the Default backend manager), but with the same conn. In this case it is possible that different goroutines are trying to write to the same connection, and as there is no protection between the instances (there is a mutex only within each instance), an error could happen. I could submit an issue about this and discuss this theory there. What do you think about this theory? Thanks!
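A minimal sketch of that theory (the types are illustrative stand-ins, not the real ANP code): if each backend manager wraps the same gRPC stream in its own backend value, each value's mutex only serializes its own callers, so writes from the two managers can still interleave on the one underlying stream.

```go
// Sketch of the shared-connection theory. gRPC streams are not safe for
// concurrent SendMsg calls, so two backend values sharing one conn but each
// holding only their own mutex do not actually protect the stream.
package backendsketch

import "sync"

// Packet and stream are placeholders for the proxy protocol packet and the
// gRPC agent stream.
type Packet struct{ Payload []byte }

type stream interface {
	Send(p *Packet) error
}

type backend struct {
	mu   sync.Mutex // serializes callers of *this* backend only
	conn stream     // may also be held by another manager's backend
}

func (b *backend) Send(p *Packet) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.conn.Send(p) // unsafe if a second backend shares conn
}

// A possible direction for a fix: build one backend value (or one shared
// per-connection mutex) when the agent registers, and hand that same value
// to every backend manager, so all writes are serialized.
func newSharedBackend(conn stream) *backend {
	return &backend{conn: conn}
}
```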
@mihivagyok I was able to reproduce the issue with kubectl cp, but testing #324 doesn't seem to fix the issue I'm seeing. It's possible I'm running into a different issue from what you're seeing, though; I'm only using the default strategy in this case. Is it possible for you to test kubectl cp on a cluster using only 1 backend strategy to test your theory?
@andrewsykim Yes, there could be several leaks :) We have already tested with single and multiple strategies, and we came to the conclusion that the hanging stream / kubectl cp issues happen only when we have multiple proxy server strategies. cc @Tamas-Biro1 Thanks!
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
/lifecycle rotten
/remove-lifecycle rotten
/lifecycle stale
/lifecycle rotten
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
We see issues with the kubectl cp functionality with version 0.0.23 (and earlier). In my test, I'm copying 3M files to a pod, but the third or fourth copy always fails, and then nothing else works with that konnectivity agent.
Command:
When a successful copy ends, I see the following logs in the agent:
When it fails, I don't see the connection EOF message, and the agent still writes some logs (for different connectionIDs), but the kubectl cp hangs. Then when I try to get some logs, I get the following error:
Once I restart the agent, everything goes back to normal.
I think it is easy to reproduce the problem, and I can share more configuration options and pprof data if necessary.
Please give me some suggestions on how to tackle this issue.
We have customers who are hitting this issue, and we need to provide some correction.
Thank you!
Adam