
konnectivity fails on kubectl cp #261

Closed
mihivagyok opened this issue Aug 26, 2021 · 43 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@mihivagyok
Contributor

We see issues with the kubectl cp functionality with version 0.0.23 (and earlier).

In my test, I'm copying a 3 MB file to a pod, but the third or fourth copy always fails, and after that nothing else works through that konnectivity agent.

Command:

kubectl cp /tmp/testfile-3M.bin test-k8s-cp-ds-mzlbb:/tmp/testfile-3M.bin -n my-namespace

When a successful copy ends, I see the following logs in the agent:

I0826 13:53:51.986610       1 client.go:482] connection EOF
I0826 13:53:51.986665       1 client.go:400] "close connection" connectionID=6
I0826 13:53:52.021504       1 client.go:444] "received CLOSE_REQ" connectionID=6
I0826 13:53:52.021568       1 client.go:450] "Failed to find connection context for close" connectionID=6

When it fails, I don't see the connection EOF message, and the agent still writes some logs (for different connectionIDs), but the kubectl cp hangs.
Then when I try to get some logs, I get the following error:

➜ kubectl logs -n kube-system konnectivity-agent-bvzf6 --tail 10
Error from server: Get "https://10.188.233.5:10250/containerLogs/kube-system/konnectivity-agent-bvzf6/konnectivity-agent?tailLines=10": net/http: TLS handshake timeout

Once I restart the agent, everything goes back to normal.

I think it is easy to reproduce the problem, and I could share more configuration options and pprof data if necessary.
Please give me some suggestions on how to tackle this issue.

We have customers who are hitting this issue, and we need to provide a fix.

Thank you!
Adam

@mihivagyok
Contributor Author

mihivagyok commented Aug 26, 2021

The problem also occurs when I try to copy a 20 MB file to a container.

user@virtualserver01:~/home/tests/pvg# head -c "20M" </dev/urandom > 20M
user@virtualserver01:~/home/tests/pvg# kubectl cp 20M test-k8s-cp-ds-mzlbb:/tmp/20M -npvg-armada-network-k8s-cp
E0826 17:08:02.772597    6087 v2.go:105] write tcp 169.x.x.x:42940->169.y.y.y:30719: write: connection reset by peer

Then trying some logs:

➜ kubectl logs -n kube-system konnectivity-agent-ztbrj
Error from server: Get "https://10.188.233.5:10250/containerLogs/kube-system/konnectivity-agent-ztbrj/konnectivity-agent": net/http: TLS handshake timeout

In the log, 169.y.y.y is the server address and 169.x.x.x is my virtual server's IP address.
More configuration details:

  • agent runs in container network (CNI is calico)
  • it has one identifier: --agent-identifiers=ipv4=$(HOST_IP)
  • it connects to the master using a DNS name which resolves to one IP, but sometimes to a different one

Thank you!

@mihivagyok
Contributor Author

Here is what I have found so far:

  • our konnectivity server has the following strategies: destHost,default
  • agents have the identifier as ipv4=$(HOST_IP) as I posted above

Once I remove this agent identifier, kubectl cp starts working fine, with no issues even with big files. I have a test case which executes several copies, and so far it has been successful.

It is strange to me, but it seems the destHost strategy somehow has a bug - I will check that part of the code later.

Thank you!

@cheftako
Contributor

cheftako commented Aug 27, 2021

Sorry, right now just a passing comment; will dig in later. It seems clear that default should return a backend irrespective of what destHost does. However, if the agent identifier is changing that, it sounds like something is wrong. My best guess is that something about having an agent identifier is preventing the backend (agent) from being registered with the default backend manager. Also curious what the IP is for the test-k8s-cp-ds-mzlbb pod; I'm guessing it does not match the agent identifier.

@mihivagyok
Contributor Author

Hi! Yes, the identifier is the node private IP where the agent is running, e.g. 10.10.42.22.
The agent runs in the pod network, and so does the test pod, so both have an address like 172.32.2.x.

Do you think that for a kubectl cp, the destination host is the target pod IP?
In the case of logs, the destination host is the node IP where the pod runs (since we go to the kubelet).

Thank you!

@cheftako
Contributor

Do you think that for a kubectl cp, the destination host is the target pod IP?
In the case of logs, the destination host is the node IP where the pod runs (since we go to the kubelet).

I'm just guessing; I haven't spent much time playing with kubectl cp.
However, having it sent to the pod IP could explain not getting a backend from the destHost manager.
You could try looking for the pod IP in the konnectivity server log.
I would expect to see it in the log if that's where it was being sent.
Even if true, there is still something clearly wrong, as the default manager should still have handled this.
I should have time to reproduce and look later.

@PratikDeoghare
Contributor

@mihivagyok Out of curiosity how many agents do you have?

@mihivagyok
Contributor Author

Hi! In this test, I have two workers, so there are 2 agents. Thank you!

@cheftako
Contributor

Just tried to repro locally using the following commands.

  • ./bin/http-test-server
  • ./bin/proxy-agent --ca-cert=certs/agent/issued/ca.crt --agent-cert=certs/agent/issued/proxy-agent.crt --agent-key=certs/agent/private/proxy-agent.key --admin-server-port=9094 --health-server-port=9095 --agent-identifiers=ipv4=127.0.0.1
  • ./bin/proxy-agent --ca-cert=certs/agent/issued/ca.crt --agent-cert=certs/agent/issued/proxy-agent.crt --agent-key=certs/agent/private/proxy-agent.key --agent-identifiers=ipv4=172.17.0.1
  • ./bin/proxy-server --server-port=0 --uds-name=/tmp/uds-proxy --cluster-ca-cert=certs/agent/issued/ca.crt --cluster-cert=certs/agent/issued/proxy-frontend.crt --cluster-key=certs/agent/private/proxy-frontend.key --proxy-strategies=destHost,default
  • ./bin/proxy-test-client --proxy-port=0 --proxy-uds=/tmp/uds-proxy --proxy-host="" --request-host=127.0.0.1

So UDS + gRPC and two agents. No luck so far. It does seem like something about either data volume or the cp command. I plan on trying on a proper K8s cluster tomorrow. What version of the server and client are you using?

@mihivagyok
Contributor Author

@cheftako
Thank you for trying! From your test, I don't see the "copy" part, but let me share some differences:

  • We use HTTP-CONNECT between the Konnectivity server and client (kube API server)
  • Yes, I think you need a Kubernetes cluster for this problem; we have integrated Konnectivity into Kubernetes 1.21
  • The problem is that the copy hangs after a few tries or with bigger files. This also tells me that the copy starts (backend selection is fine), but it never ends.
  • But once I remove the agent-identifier, I can copy as many and as big files as I want; it works so far without problems. This is very strange, I know - but it also tells me that Kubernetes/kubectl is fine.

Thank you!
Adam

@cheftako
Contributor

A few more breadcrumbs (still working on narrowing down the repro).
kubectl -v=9 shows the following as the request:

I0829 23:38:51.775518 184241 round_trippers.go:405] POST https://X.X.X.X/api/v1/namespaces/kube-system/pods/kube-proxy-kubernetes-minion-group-xxxx/exec?command=tar&command=xf&command=-&command=-C&command=%2Ftmp&container=kube-proxy&container=kube-proxy&stderr=true&stdin=true&stdout=true 101 Switching Protocols in 228 milliseconds
I0829 23:38:51.775568 184241 round_trippers.go:411] Response Headers:
I0829 23:38:51.775589 184241 round_trippers.go:414] Connection: Upgrade
I0829 23:38:51.775606 184241 round_trippers.go:414] Upgrade: SPDY/3.1
I0829 23:38:51.775618 184241 round_trippers.go:414] X-Stream-Protocol-Version: v4.channel.k8s.io
I0829 23:38:51.775629 184241 round_trippers.go:414] Date: Mon, 30 Aug 2021 06:38:51 GMT

KAS logs show the following:

I0830 06:39:01.873427 10 client.go:107] "[tracing] recv packet" type="DATA"
I0830 06:39:01.873488 10 client.go:107] "[tracing] recv packet" type="CLOSE_RSP"
I0830 06:39:01.873509 10 clientconn.go:435] [core] Channel Connectivity change to SHUTDOWN
I0830 06:39:01.873520 10 clientconn.go:1114] [core] Subchannel Connectivity change to SHUTDOWN
I0830 06:39:01.873757 10 conn.go:57] "[tracing] send req" type="DATA"
I0830 06:39:01.873781 10 conn.go:115] closing connection
I0830 06:39:01.873789 10 conn.go:125] "[tracing] send req" type="CLOSE_REQ"
I0830 06:39:01.873865 10 httplog.go:105] "HTTP" verb="POST" URI="/api/v1/namespaces/kube-system/pods/kube-proxy-kubernetes-minion-group-xxxx/exec?command=tar&command=xf&command=-&command=-C&command=%2Ftmp&container=kube-proxy&container=kube-proxy&stderr=true&stdin=true&stdout=true" latency="10.168115135s" userAgent="kubectl/v0.0.0 (linux/amd64) kubernetes/$Format" audit-ID="****" srcIP="X.X.X.X:57902" hijacked=true
I0830 06:39:01.873931 10 controlbuf.go:521] [transport] transport: loopyWriter.run returning. connection error: desc = "transport is closing"

@cheftako
Contributor

OK, tried with agent-identifiers, destHost,default, grpc, and 3 agents. No problem that I could see. Tomorrow I will try switching to http-connect. Maybe we're hitting a size limit trying to support SPDY and http-connect. (Not sure where agent-id falls in that.)

@mihivagyok
Copy link
Contributor Author

Yes, okay, I will try to investigate it too. We have a bunch of customers who are complaining, so I still believe the problem is real.
Thank you!

@mihivagyok
Contributor Author

@cheftako If you are open to a webex session, I could show how we configure Konnectivity and the problem itself. Thank you!

@cheftako
Contributor

cheftako commented Sep 1, 2021

@mihivagyok I would be open to a webex session. Feel free to reach out on Slack to arrange some time.

@cheftako
Contributor

cheftako commented Sep 2, 2021

Worth checking out kubernetes/kubernetes#60140 (comment)

@mihivagyok
Contributor Author

Thanks, I read it, although I think our problem is a bit different, because once I change the config, everything works well.
I will look for you on slack.

Thank you!
Adam

@cheftako
Contributor

cheftako commented Sep 3, 2021

Thanks for meeting with me and showing me more details of how to repro. I have now been able to repro, although it still seems like a less frequent problem on my test system. Having reproduced it on my system, I was able to look at the relevant agent logs. The last thing I see in the agent log for the relevant connection is:

2021-09-03T18:18:08.354627866Z stderr F I0903 18:18:08.354328       1 client.go:508] "write to remote" connectionID=4 lastData=4096

The agent log continues after that, but I see nothing more for connectionID=4.

Going to keep digging.

@cheftako
Contributor

cheftako commented Sep 9, 2021

Breadcrumbs: I managed to get goroutine dumps on the Konnectivity Server and Konnectivity Agent after I repro'd the issue. I then subtracted the goroutine dumps from the same processes when they were 'idle'. Here are the diffs.

1 @ 0x43ae45 0x44c057 0x93291a 0x947dd9 0x9a3462 0x13cbee9 0x13ce11f 0x13dad9b 0x6e9523 0x6e4a4d 0x471d01
#	0x932919	google.golang.org/grpc/internal/transport.(*writeQuota).get+0x79			/go/pkg/mod/google.golang.org/[email protected]/internal/transport/flowcontrol.go:59
#	0x947dd8	google.golang.org/grpc/internal/transport.(*http2Server).Write+0x1f8			/go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:927
#	0x9a3461	google.golang.org/grpc.(*serverStream).SendMsg+0x241					/go/pkg/mod/google.golang.org/[email protected]/stream.go:1421
#	0x13cbee8	sigs.k8s.io/apiserver-network-proxy/proto/agent.(*agentServiceConnectServer).Send+0x48	/go/src/sigs.k8s.io/apiserver-network-proxy/proto/agent/agent.pb.go:147
#	0x13ce11e	sigs.k8s.io/apiserver-network-proxy/pkg/server.(*backend).Send+0x7e			/go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/backend_manager.go:88
#	0x13dad9a	sigs.k8s.io/apiserver-network-proxy/pkg/server.(*Tunnel).ServeHTTP+0xb5a		/go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/tunnel.go:149
#	0x6e9522	net/http.serverHandler.ServeHTTP+0xa2							/usr/local/go/src/net/http/server.go:2887
#	0x6e4a4c	net/http.(*conn).serve+0x8cc								/usr/local/go/src/net/http/server.go:1952

1/3 @ 0x43ae45 0x43349b 0x46c275 0x4da265 0x4db355 0x4db337 0x5c1e0f 0x5d4271 0x6268c3 0x50c03e 0x626b33 0x623975 0x629d45 0x629d50 0x6de759 0x5110a2 0x13dac65 0x6e9523 0x6e4a4d 0x471d01
#	0x46c274	internal/poll.runtime_pollWait+0x54						/usr/local/go/src/runtime/netpoll.go:222
#	0x4da264	internal/poll.(*pollDesc).wait+0x44						/usr/local/go/src/internal/poll/fd_poll_runtime.go:87
#	0x4db354	internal/poll.(*pollDesc).waitRead+0x1d4					/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
#	0x4db336	internal/poll.(*FD).Read+0x1b6							/usr/local/go/src/internal/poll/fd_unix.go:166
#	0x5c1e0e	net.(*netFD).Read+0x4e								/usr/local/go/src/net/fd_posix.go:55
#	0x5d4270	net.(*conn).Read+0x90								/usr/local/go/src/net/net.go:183
#	0x6268c2	crypto/tls.(*atLeastReader).Read+0x62						/usr/local/go/src/crypto/tls/conn.go:776
#	0x50c03d	bytes.(*Buffer).ReadFrom+0xbd							/usr/local/go/src/bytes/buffer.go:204
#	0x626b32	crypto/tls.(*Conn).readFromUntil+0xf2						/usr/local/go/src/crypto/tls/conn.go:798
#	0x623974	crypto/tls.(*Conn).readRecordOrCCS+0x114					/usr/local/go/src/crypto/tls/conn.go:605
#	0x629d44	crypto/tls.(*Conn).readRecord+0x164						/usr/local/go/src/crypto/tls/conn.go:573
#	0x629d4f	crypto/tls.(*Conn).Read+0x16f							/usr/local/go/src/crypto/tls/conn.go:1276
#	0x6de758	net/http.(*connReader).Read+0x1b8						/usr/local/go/src/net/http/server.go:800
#	0x5110a1	bufio.(*Reader).Read+0x141							/usr/local/go/src/bufio/bufio.go:213
#	0x13dac64	sigs.k8s.io/apiserver-network-proxy/pkg/server.(*Tunnel).ServeHTTP+0xa24	/go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/tunnel.go:129
#	0x6e9522	net/http.serverHandler.ServeHTTP+0xa2						/usr/local/go/src/net/http/server.go:2887
#	0x6e4a4c	net/http.(*conn).serve+0x8cc							/usr/local/go/src/net/http/server.go:1952

4 @ 0x43aec5 0x4068cf 0x40654b 0x934c05 0x4710e1
#	0x934c04	sigs.k8s.io/apiserver-network-proxy/pkg/agent.(*Client).proxyToRemote+0x4c4	/go/src/sigs.k8s.io/apiserver-network-proxy/pkg/agent/client.go:503

4 @ 0x43aec5 0x43351b 0x46b9f5 0x4d2bc5 0x4d3cb5 0x4d3c97 0x5a646f 0x5b88d1 0x9341d8 0x4710e1
#	0x46b9f4	internal/poll.runtime_pollWait+0x54						/usr/local/go/src/runtime/netpoll.go:222
#	0x4d2bc4	internal/poll.(*pollDesc).wait+0x44						/usr/local/go/src/internal/poll/fd_poll_runtime.go:87
#	0x4d3cb4	internal/poll.(*pollDesc).waitRead+0x1d4					/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
#	0x4d3c96	internal/poll.(*FD).Read+0x1b6							/usr/local/go/src/internal/poll/fd_unix.go:166
#	0x5a646e	net.(*netFD).Read+0x4e								/usr/local/go/src/net/fd_posix.go:55
#	0x5b88d0	net.(*conn).Read+0x90								/usr/local/go/src/net/net.go:183
#	0x9341d7	sigs.k8s.io/apiserver-network-proxy/pkg/agent.(*Client).remoteToProxy+0xd7	/go/src/sigs.k8s.io/apiserver-network-proxy/pkg/agent/client.go:478

@cheftako
Contributor

cheftako commented Sep 9, 2021

The gRPC write quota is known to block writes if the other end has stopped reading (or is reading too slowly). I need to work out whether that is what's happening here, which end is the "other end", and why.
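
For context, here is a toy Go sketch of that behavior (a simplified model, not gRPC's actual flow-control code): the sender must acquire write quota before it can send, and only the receiver returns quota, so a receiver that stops reading leaves the sender parked in get() indefinitely.

package main

import (
    "fmt"
    "time"
)

// writeQuota is a toy stand-in for gRPC's per-stream flow-control quota:
// a sender must take quota before writing, and only the receiver gives it back.
type writeQuota struct {
    replenish chan int
    quota     int
}

// get blocks until at least n units of quota are available.
func (q *writeQuota) get(n int) {
    for q.quota < n {
        q.quota += <-q.replenish // parked here when the peer stops reading
    }
    q.quota -= n
}

func main() {
    q := &writeQuota{replenish: make(chan int), quota: 8}

    // "Receiver": drains the stream (returning quota) twice, then stops reading.
    go func() {
        for i := 0; i < 2; i++ {
            time.Sleep(50 * time.Millisecond)
            q.replenish <- 8
        }
    }()

    // "Sender": tries to push five chunks through the stream.
    done := make(chan struct{})
    go func() {
        for i := 0; i < 5; i++ {
            q.get(8)
            fmt.Println("sent chunk", i)
        }
        close(done)
    }()

    select {
    case <-done:
        fmt.Println("all chunks sent")
    case <-time.After(time.Second):
        fmt.Println("sender is stuck in get(): the receiver stopped reading")
    }
}

In the real system the "receiver" is whichever side of the tunnel stopped draining its gRPC stream, which is what the stuck writeQuota.get frame above suggests.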

@ydp

ydp commented Sep 27, 2021

The gRPC write quota is known to block writes if the other end has stopped reading (or is reading too slowly). I need to work out whether that is what's happening here, which end is the "other end", and why.

Hi @cheftako, could you be more specific about the gRPC quota writer? What's the impact on ANP? I suspect #268 is similar to this one.

@relyt0925
Contributor

@cheftako did you happen to find anything on this? We are still seeing problems.

@relyt0925
Contributor

relyt0925 commented Oct 15, 2021

This PR aims to help start fixing the problem (by randomizing across a pool of multiple redundant instances that can serve the same IPs/hosts):
https://github.com/kubernetes-sigs/apiserver-network-proxy/pull/272/files

@mihivagyok
Contributor Author

mihivagyok commented Oct 15, 2021

Let me share our use-case and how we use the agent-identifier.

So the idea is the following:

  • when the Kubernetes API server asks for logs, it goes to the node's kubelet, e.g. https://10.0.2.12:10250/logs...
  • the IP 10.0.2.12 is the node private IP, which is registered as the agent-identifier ipv4=10.0.2.12
  • this also means there is always exactly one agent which handles a specific host - in this case randomization does not help
  • why do we want this? It is always good to route directly to the right agent if we know the destination. For a multi-zone cluster, it is also beneficial not to send traffic between zones (see the sketch after this list).
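
To make that routing intent concrete, here is a simplified Go sketch with hypothetical types (not the project's actual BackendManager code): the destHost strategy resolves the request's destination host against registered agent identifiers, while the default strategy ignores the destination and picks any connected agent.

package main

import (
    "errors"
    "fmt"
    "math/rand"
)

// Backend stands in for one agent's tunnel connection to the proxy server.
type Backend struct{ agentID string }

// destHostManager indexes backends by their advertised identifiers
// (e.g. ipv4=10.0.2.12), so a request for that host always lands on
// the agent running on that node.
type destHostManager struct {
    byIdentifier map[string]*Backend
}

func (m *destHostManager) Backend(destHost string) (*Backend, error) {
    if b, ok := m.byIdentifier[destHost]; ok {
        return b, nil
    }
    return nil, errors.New("no backend registered for " + destHost)
}

// defaultManager ignores the destination and picks any connected agent.
type defaultManager struct {
    backends []*Backend
}

func (m *defaultManager) Backend(string) (*Backend, error) {
    if len(m.backends) == 0 {
        return nil, errors.New("no backends connected")
    }
    return m.backends[rand.Intn(len(m.backends))], nil
}

func main() {
    agent1 := &Backend{agentID: "agent-on-10.0.2.12"}
    agent2 := &Backend{agentID: "agent-on-10.0.2.13"}

    byHost := &destHostManager{byIdentifier: map[string]*Backend{
        "10.0.2.12": agent1,
        "10.0.2.13": agent2,
    }}
    fallback := &defaultManager{backends: []*Backend{agent1, agent2}}

    // "destHost,default": try the host-specific manager first, then fall back.
    for _, host := range []string{"10.0.2.12", "172.32.2.7"} {
        b, err := byHost.Backend(host)
        if err != nil {
            b, err = fallback.Backend(host)
        }
        if err != nil {
            fmt.Println("no backend for", host)
            continue
        }
        fmt.Printf("dial %s via %s\n", host, b.agentID)
    }
}

With kubelet traffic (logs, exec) the destination is the node IP, so the destHost lookup matches; with pod or service IPs it falls through to the default strategy.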

I'm also trying to figure out what is happening here, and I have some interesting findings that I cannot fully explain yet.
The main issue we have found is that we set too low a CPU limit for the agents (50m), based on https://github.com/kubernetes/kubernetes/blob/v1.21.0/cluster/gce/addons/konnectivity-agent/konnectivity-agent-ds.yaml#L49

But that limit is way too low. If I remove the limit, I basically cannot reproduce the issue anymore. My test is the following:
copy files of different sizes to a pod:

  # ${test_file}, ${pod_to_copy} and the log helper are defined earlier in the test script
  for i in 1 3 5 10 20 30 40 50
  do
    log "Copy to pod: ${pod_to_copy}"
    head -c "${i}m" </dev/urandom > "${test_file}-${i}M"
    kubectl cp "${test_file}-${i}M" "${pod_to_copy}:/tmp/${test_file}-${i}M"
    kubectl exec "${pod_to_copy}" -- ls /tmp | grep -q "${test_file}-${i}M"
  done

When the CPU limit is:

  • 50m - it always fails
  • 200m - it fails with the bigger files
  • 1 core - it works well so far
  • no limit - it works well so far

I'm also checking the metrics (kubectl top), but it does not show big usage - around ~200m.

Agent settings:

      - --agent-identifiers=ipv4=$(HOST_IP)

Server settings:

        - --mode=http-connect
        - --proxy-strategies=destHost,default

Version is 0.0.24.

What I still don't understand: when the limit is low (50m) and the destHost proxy strategy is not used, it works even with big files - of course, it is quite slow, but it works. I'm still trying to figure out how that is possible. Do you have any idea about that?

Thank you!
Adam

@mihivagyok
Contributor Author

I have new test results which show the following:

  • using --proxy-strategies=destHost,default and no CPU limit in the Konnectivity agents, the kubectl cp test is failing after 4 runs
  • using --proxy-strategies=default and no CPU limit in the Konnectivity agents, the kubectl cp test is successful even after 50 runs

I will execute the same test when the cluster has one node only - to see whether destHost has any effect in that case.

Thank you!
Adam

@mihivagyok
Contributor Author

mihivagyok commented Oct 21, 2021

Okay, I see the following:

  • in the konnectivity-server, the last transfer starts fine:
I1021 13:51:28.314586       1 tunnel.go:42] "Received request for host" method="CONNECT" host="10.211.145.183:10250" userAgent=""
I1021 13:51:28.314643       1 tunnel.go:44] "TLS" commonName="konnectivity-client"
I1021 13:51:28.314907       1 tunnel.go:78] "Set pending(rand=900783470661715446) to &{0xc000572320 0xc000210100 {} 0x5253a0 true false false false 0 {0 0} <nil> {0xc0001dac40 map[] true true} map[] false 0 -1 200 false false [] 0 [84 104 117 44 32 50 49 32 79 99 116 32 50 48 50 49 32 49 51 58 53 49 58 50 56 32 71 77 84] [0 0 0 0 0 0 0 0 0 0] [50 48 48] 0xc000550d20 0}"
I1021 13:51:28.314968       1 desthost_backend_manager.go:49] "Get the backend through the DestHostBackendManager" destHost="10.211.145.183"
I1021 13:51:28.315076       1 tunnel.go:103] "DIAL_REQ sent to backend"
I1021 13:51:28.351718       1 server.go:712] "Received DIAL_RSP" random=900783470661715446 agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=448
I1021 13:51:28.351798       1 server.go:269] "Register frontend for agent" frontend=&{Mode:http-connect Grpc:<nil> HTTP:0xc00001ce00 CloseHTTP:0x143b6a0 connected:0xc00052c3c0 connectID:448 agentID:7e0f66a9-efde-44ec-8d13-1f9fd8345e31 start:{wall:13858836059699697597 ext:12587356314015 loc:0x223cee0} backend:0xc00053a198} agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=448
I1021 13:51:28.351835       1 tunnel.go:137] "Starting proxy to host" host="10.211.145.183:10250"
I1021 13:51:28.351943       1 tunnel.go:170] "Forwarding data on tunnel to agent" bytes=254 totalBytes=254 agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=448

The transfer starts, and this is the last log regarding connectionID=448:

I1021 13:51:28.901338       1 tunnel.go:170] "Forwarding data on tunnel to agent" bytes=4096 totalBytes=3628308 agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=448
I1021 13:51:28.901367       1 tunnel.go:170] "Forwarding data on tunnel to agent" bytes=4096 totalBytes=3632404 agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=448
I1021 13:51:28.901414       1 tunnel.go:170] "Forwarding data on tunnel to agent" bytes=4096 totalBytes=3636500 agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=448

Then there is nothing about the connection closing.
From the agent's point of view:

I1021 13:51:28.333167       1 client.go:375] "received DIAL_REQ"
I1021 13:51:28.369890       1 client.go:438] "received DATA" connectionID=448
I1021 13:51:28.369991       1 client.go:527] "write to remote" connectionID=448 lastData=254
I1021 13:51:28.415230       1 client.go:438] "received DATA" connectionID=448
I1021 13:51:28.415279       1 client.go:438] "received DATA" connectionID=448
I1021 13:51:28.415725       1 client.go:527] "write to remote" connectionID=448 lastData=2787
I1021 13:51:28.416603       1 client.go:527] "write to remote" connectionID=448 lastData=551
I1021 13:51:28.464363       1 client.go:438] "received DATA" connectionID=448
I1021 13:51:28.464414       1 client.go:438] "received DATA" connectionID=448
I1021 13:51:28.464431       1 client.go:438] "received DATA" connectionID=448
...
I1021 13:51:28.921415       1 client.go:527] "write to remote" connectionID=448 lastData=26
I1021 13:51:28.921416       1 client.go:438] "received DATA" connectionID=353
I1021 13:51:28.921473       1 client.go:527] "write to remote" connectionID=448 lastData=26
I1021 13:51:28.921491       1 client.go:527] "write to remote" connectionID=353 lastData=39
I1021 13:51:28.921500       1 client.go:527] "write to remote" connectionID=448 lastData=4096
I1021 13:51:28.921535       1 client.go:527] "write to remote" connectionID=448 lastData=4096
I1021 13:51:28.921594       1 client.go:527] "write to remote" connectionID=448 lastData=4096
I1021 13:51:28.921628       1 client.go:527] "write to remote" connectionID=448 lastData=4096

That's the end. I don't see closing logs either.

Then the next kubelet-related request comes, and I see only these logs:

I1021 14:02:46.886012       1 tunnel.go:42] "Received request for host" method="CONNECT" host="10.211.145.183:10250" userAgent=""
I1021 14:02:46.886053       1 tunnel.go:44] "TLS" commonName="konnectivity-client"
I1021 14:02:46.886218       1 tunnel.go:78] "Set pending(rand=6648738534997005833) to &{0xc0005723c0 0xc0000be800 {} 0x5253a0 true false false false 0 {0 0} <nil> {0xc0001da7e0 map[] true true} map[] false 0 -1 200 false false [] 0 [84 104 117 44 32 50 49 32 79 99 116 32 50 48 50 49 32 49 52 58 48 50 58 52 54 32 71 77 84] [0 0 0 0 0 0 0 0 0 0] [50 48 48] 0xc000526fc0 0}"
I1021 14:02:46.886292       1 desthost_backend_manager.go:49] "Get the backend through the DestHostBackendManager" destHost="10.211.145.183"

Here I think it is stuck in the gRPC SendMsg call:
https://github.com/kubernetes-sigs/apiserver-network-proxy/blob/master/pkg/server/tunnel.go#L99

I have added a log message after this line, but it is never printed.
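
One low-impact way to confirm that is to wrap the suspected call with a watchdog rather than logging after it. A sketch, where sendFn is a hypothetical stand-in for whatever call is suspected of hanging (not the project's actual code):

package main

import (
    "log"
    "time"
)

// sendWithWatchdog runs a possibly-blocking send and logs if it has not
// returned after `timeout`, without changing the call's behaviour.
func sendWithWatchdog(sendFn func() error, timeout time.Duration, connID int64) error {
    done := make(chan error, 1)
    go func() { done <- sendFn() }()
    select {
    case err := <-done:
        return err
    case <-time.After(timeout):
        log.Printf("send still blocked after %v connectionID=%d", timeout, connID)
        return <-done // keep waiting; the log above confirms the hang
    }
}

func main() {
    // Simulate a send that takes far longer than expected.
    slowSend := func() error { time.Sleep(5 * time.Second); return nil }
    err := sendWithWatchdog(slowSend, 2*time.Second, 448)
    log.Printf("send finally returned: err=%v", err)
}

Here the watchdog line prints after 2 seconds and the final line after 5, so the same pattern around the suspected call would show whether SendMsg ever returns.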

I also see that other service traffic works:

I1021 13:56:34.135441       1 tunnel.go:42] "Received request for host" method="CONNECT" host="172.21.144.88:443" userAgent=""
I1021 13:56:34.135486       1 tunnel.go:44] "TLS" commonName="konnectivity-client"
I1021 13:56:34.135657       1 tunnel.go:78] "Set pending(rand=212649966676544707) to &{0xc0005721e0 0xc0000be400 {} 0x5253a0 true false false false 0 {0 0} <nil> {0xc0003421c0 map[] true true} map[] false 0 -1 200 false false [] 0 [84 104 117 44 32 50 49 32 79 99 116 32 50 48 50 49 32 49 51 58 53 54 58 51 52 32 71 77 84] [0 0 0 0 0 0 0 0 0 0] [50 48 48] 0xc0004f19d0 0}"
I1021 13:56:34.135685       1 desthost_backend_manager.go:49] "Get the backend through the DestHostBackendManager" destHost="172.21.144.88"
I1021 13:56:34.135758       1 tunnel.go:103] "DIAL_REQ sent to backend"
I1021 13:56:34.152607       1 tunnel.go:170] "Forwarding data on tunnel to agent" bytes=39 totalBytes=3953 agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=449
I1021 13:56:34.172690       1 server.go:712] "Received DIAL_RSP" random=212649966676544707 agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=453
I1021 13:56:34.172793       1 server.go:269] "Register frontend for agent" frontend=&{Mode:http-connect Grpc:<nil> HTTP:0xc00001d880 CloseHTTP:0x143b6a0 connected:0xc0001ec540 connectID:453 agentID:7e0f66a9-efde-44ec-8d13-1f9fd8345e31 start:{wall:13858836388085405434 ext:12893177022994 loc:0x223cee0} backend:0xc00053a1b0} agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=453
I1021 13:56:34.172834       1 tunnel.go:137] "Starting proxy to host" host="172.21.144.88:443"
I1021 13:56:34.172906       1 tunnel.go:170] "Forwarding data on tunnel to agent" bytes=296 totalBytes=296 agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=453
...
I1021 14:02:04.612720       1 tunnel.go:148] "EOF from host" host="172.21.144.88:443"
I1021 14:02:04.612747       1 tunnel.go:177] "Stopping transfer to host" host="172.21.144.88:443" agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=453
I1021 14:02:04.649387       1 server.go:756] "Received CLOSE_RSP" serverID="0ecd9419-3bc4-4dce-8b9b-d87e23a61d96" agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=453
I1021 14:02:04.649512       1 server.go:766] "CLOSE_RSP sent to frontend" connectionID=453
I1021 14:02:04.649602       1 server.go:290] "Remove frontend for agent" frontend=&{Mode:http-connect Grpc:<nil> HTTP:0xc00001d880 CloseHTTP:0x143b6a0 connected:0xc0001ec540 connectID:453 agentID:7e0f66a9-efde-44ec-8d13-1f9fd8345e31 start:{wall:13858836388085405434 ext:12893177022994 loc:0x223cee0} backend:0xc00053a1b0} agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=453
I1021 14:02:04.649629       1 server.go:769] "Close streaming" agentID="7e0f66a9-efde-44ec-8d13-1f9fd8345e31" connectionID=453

The agent also finishes fine:

I1021 14:02:04.631057       1 client.go:438] "received DATA" connectionID=453
I1021 14:02:04.631090       1 client.go:449] "received CLOSE_REQ" connectionID=453
I1021 14:02:04.631092       1 client.go:395] "close connection" connectionID=453

What do you think? @caesarxuchao @cheftako @jkh52

Thank you very much!
Adam

@mihivagyok
Contributor Author

I have also checked for the log "Receive channel on Proxy is full", but it does not appear (yes, I have enabled it in the parameters).

Thanks!
Adam

@ydp

ydp commented Oct 22, 2021

A similar finding that gRPC itself might get stuck, like #268, but I think they are triggered in different places.

@cheftako
Contributor

Sorry I've been quiet on this one. I've added a few optimizations on the channels, but while they seem to let more data through on the kubectl cp, they are not fixing the problem. At this point it seems clear the issue has to do with the following stuck goroutine:

goroutine 1551 [select, 4 minutes]:
google.golang.org/grpc/internal/transport.(*writeQuota).get(0xc0003985a0, 0x23)
        /go/pkg/mod/google.golang.org/[email protected]/internal/transport/flowcontrol.go:59 +0x74
google.golang.org/grpc/internal/transport.(*http2Server).Write(0xc000450000, 0xc0002a9680, {0xc0002cd520, 0x5, 0x5}, {0xc00030e7c0, 0x1e, 0x1e}, 0x0)
        /go/pkg/mod/google.golang.org/[email protected]/internal/transport/http2_server.go:1037 +0x23b
google.golang.org/grpc.(*serverStream).SendMsg(0xc0000fb5c0, {0x1422700, 0xc0002add40})
        /go/pkg/mod/google.golang.org/[email protected]/stream.go:1510 +0x1cc
sigs.k8s.io/apiserver-network-proxy/proto/agent.(*agentServiceConnectServer).Send(0x506225, 0xc0003d0800)
        /go/src/sigs.k8s.io/apiserver-network-proxy/proto/agent/agent.pb.go:198 +0x2b
sigs.k8s.io/apiserver-network-proxy/pkg/server.(*backend).Send(0xc00031f3e0, 0x6)
        /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/backend_manager.go:88 +0x94
sigs.k8s.io/apiserver-network-proxy/pkg/server.(*Tunnel).ServeHTTP(0xc0003079a8, {0x16ad3a8, 0xc0004c07e0}, 0xc0000e0f00)
        /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/server/tunnel.go:166 +0x1a9a
net/http.serverHandler.ServeHTTP({0xc0002dc4e0}, {0x16ad3a8, 0xc0004c07e0}, 0xc0000e0f00)
        /usr/local/go/src/net/http/server.go:2878 +0x43b
net/http.(*conn).serve(0xc000231720, {0x16b0e98, 0xc00038ee10})
        /usr/local/go/src/net/http/server.go:1929 +0xb08
created by net/http.(*Server).Serve
        /usr/local/go/src/net/http/server.go:3033 +0x4e8

That would suggest grpc/grpc-go#2427, which essentially says the other end (the agent) is not reading. However, in this reproduction of the issue I have exactly 1 agent, and if I look at its goroutines I find the following:

goroutine 23 [select]:
google.golang.org/grpc/internal/transport.(*recvBufferReader).readClient(0xc00006b310, {0xc00000cfb8, 0x5, 0x5})
        /go/pkg/mod/google.golang.org/[email protected]/internal/transport/transport.go:190 +0xaa
google.golang.org/grpc/internal/transport.(*recvBufferReader).Read(0xc00006b310, {0xc00000cfb8, 0xc000028378, 0xc0002715b8})
        /go/pkg/mod/google.golang.org/[email protected]/internal/transport/transport.go:170 +0x147
google.golang.org/grpc/internal/transport.(*transportReader).Read(0xc0001db110, {0xc00000cfb8, 0xc000271630, 0x855847})
        /go/pkg/mod/google.golang.org/[email protected]/internal/transport/transport.go:484 +0x32
io.ReadAtLeast({0xa98980, 0xc0001db110}, {0xc00000cfb8, 0x5, 0x5}, 0x5)
        /usr/local/go/src/io/io.go:328 +0x9a
io.ReadFull(...)
        /usr/local/go/src/io/io.go:347
google.golang.org/grpc/internal/transport.(*Stream).Read(0xc00010eea0, {0xc00000cfb8, 0x5, 0x5})
        /go/pkg/mod/google.golang.org/[email protected]/internal/transport/transport.go:468 +0xa5
google.golang.org/grpc.(*parser).recvMsg(0xc00000cfa8, 0x400000)
        /go/pkg/mod/google.golang.org/[email protected]/rpc_util.go:559 +0x47
google.golang.org/grpc.recvAndDecompress(0x4c1f05, 0xc00010eea0, {0x0, 0x0}, 0x400000, 0x0, {0x0, 0x0})
        /go/pkg/mod/google.golang.org/[email protected]/rpc_util.go:690 +0x66
google.golang.org/grpc.recv(0x2, {0x7f74a39086d8, 0xeb0338}, 0xc00010d080, {0x0, 0x0}, {0x9b8720, 0xc0000da900}, 0xc000271858, 0x0, ...)
        /go/pkg/mod/google.golang.org/[email protected]/rpc_util.go:758 +0x6e
google.golang.org/grpc.(*csAttempt).recvMsg(0xc00011ef20, {0x9b8720, 0xc0000da900}, 0xc000120180)
        /go/pkg/mod/google.golang.org/[email protected]/stream.go:975 +0x2b0
google.golang.org/grpc.(*clientStream).RecvMsg.func1(0x203000)
        /go/pkg/mod/google.golang.org/[email protected]/stream.go:826 +0x25
google.golang.org/grpc.(*clientStream).withRetry(0xc000150600, 0xc000271a20, 0xc0002719f0)
        /go/pkg/mod/google.golang.org/[email protected]/stream.go:680 +0x2f6
google.golang.org/grpc.(*clientStream).RecvMsg(0xc000150600, {0x9b8720, 0xc0000da900})
        /go/pkg/mod/google.golang.org/[email protected]/stream.go:825 +0x11f
sigs.k8s.io/apiserver-network-proxy/proto/agent.(*agentServiceConnectClient).Recv(0xc0002a0100)
        /go/src/sigs.k8s.io/apiserver-network-proxy/proto/agent/agent.pb.go:159 +0x4c
sigs.k8s.io/apiserver-network-proxy/pkg/agent.(*Client).Recv(0xc0000b7940)
        /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/agent/client.go:285 +0xa5
sigs.k8s.io/apiserver-network-proxy/pkg/agent.(*Client).Serve(0xc0001680c0)
        /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/agent/client.go:367 +0x393
created by sigs.k8s.io/apiserver-network-proxy/pkg/agent.(*ClientSet).syncOnce
        /go/src/sigs.k8s.io/apiserver-network-proxy/pkg/agent/clientset.go:197 +0x405

So it would appear it is calling RecvMsg. Seems like it's time to engage the gRPC folks.

@mihivagyok
Contributor Author

Hi All!

I have run my tests with the latest v0.0.26 (compiled with Go 1.16.10), and I still see the issue (although it has a newer gRPC).

Thank you!
Adam

@mihivagyok
Contributor Author

@cheftako @andrewsykim @jkh52

I commented on #324, but let me repeat it:

I think we have some idea why kubectl cp eventually fails: the issue comes when we have multiple proxy strategies. The backend (agent) connections are guarded by a mutex, but if there are multiple strategies, then the same backend with the same connection gets registered with multiple backend managers. To illustrate:

BackendManager - DestHost
    - Backend1 (agent identifier, conn, mutex)
BackendManager - Default
    - Backend1* (conn, mutex*)

We have two backend instances (one for the DestHost manager, one for the Default backend manager), but they share the same conn. In that case, different goroutines can write to the same connection at the same time, because there is no protection between the instances (each has a mutex only within itself), so errors can happen. I could submit an issue about this and discuss the theory there.
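
A minimal sketch of this theory with hypothetical types (not the project's actual backend code): each wrapper's mutex only serializes its own callers, so two managers holding separate wrappers around the same stream can still call Send concurrently, which gRPC streams do not allow.

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
    "time"
)

// stream stands in for one agent's gRPC stream; gRPC forbids calling
// SendMsg concurrently on the same stream.
type stream struct{ inSend int32 }

func (s *stream) Send(msg string) {
    if !atomic.CompareAndSwapInt32(&s.inSend, 0, 1) {
        fmt.Println("concurrent Send on the same stream:", msg)
        return
    }
    defer atomic.StoreInt32(&s.inSend, 0)
    time.Sleep(time.Microsecond) // pretend to marshal and write the frame
}

// backend mirrors the theory: each manager gets its own wrapper with its
// own mutex, but both wrappers share the same underlying stream.
type backend struct {
    mu   sync.Mutex
    conn *stream
}

func (b *backend) Send(msg string) {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.conn.Send(msg)
}

func main() {
    shared := &stream{}
    destHostBackend := &backend{conn: shared} // registered in the DestHost manager
    defaultBackend := &backend{conn: shared}  // same agent, registered again in the Default manager

    var wg sync.WaitGroup
    for i := 0; i < 1000; i++ {
        wg.Add(2)
        go func(i int) { defer wg.Done(); destHostBackend.Send(fmt.Sprintf("destHost-%d", i)) }(i)
        go func(i int) { defer wg.Done(); defaultBackend.Send(fmt.Sprintf("default-%d", i)) }(i)
    }
    wg.Wait()
}

If both managers shared a single Backend object (or a single per-connection lock), the concurrent Send in this sketch would go away.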

What do you think about this theory?

Thanks!

@andrewsykim
Member

@mihivagyok I was able to reproduce the issue w/ kubectl cp. But testing #324 doesn't seem to fix the issue I'm seeing. It's possible I'm running into a different issue from what you're seeing though. I'm only using the default strategy in this case.

Is it possible for you to test kubectl cp on a cluster using only 1 backend strategy to test your theory?

@mihivagyok
Contributor Author

@andrewsykim Yes, there could be several leaks :)

So we have already tested with single and multiple strategies, and we came to the conclusion that the hanging stream / kubectl cp issues happen only when we have multiple proxy-server strategies.

cc @Tamas-Biro1

Thanks!
Adam

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2022
@rastislavs

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 24, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 22, 2022
@mihivagyok
Contributor Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 22, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 20, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 20, 2022
@Tamas-Biro1
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 24, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 24, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 23, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale May 23, 2023
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
