Invalid metric reported on the transmit side and missing metric on the receiving service side #1045
Thanks so much for reporting this @corneliu-calciu! I believe we fixed a number of these issues in our main branch. 1.7.0 is imminent, I think, so I expect it will happen this week. If you are able to try the beyla:main branch that would be great; if not, 1.7.0 should come very soon.
Hi Nikola,

I ran the test against the beyla main version of the container, and the second issue, related to missing flows, is still present. The flow from load-generator-4 to frontend-4 was missing from the server-side metrics (the http_server_request_duration_seconds_count family) but present in the client-side http_client_request_duration_seconds_count metrics.

Client side:

```
http_client_request_duration_seconds_count{http_request_method="GET", http_response_status_code="200", http_route="/**", instance="localhost:8999", job="beylaremote", k8s_deployment_name="smartscaler-load-gen-acme-4", k8s_namespace_name="acme", k8s_node_name="ip-10-0-14-188.ec2.internal", k8s_pod_name="smartscaler-load-gen-acme-4-85f77698fb-shwsk", k8s_pod_start_time="2024-07-23 19:08:26 +0000 UTC", k8s_pod_uid="288f3690-ebc9-4491-8e6d-4be72ea775cd", k8s_replicaset_name="smartscaler-load-gen-acme-4-85f77698fb", server_address="frontend-acme-4", server_port="3000", service_name="smartscaler-load-gen-acme-4", service_namespace="acme", url_path="/products"}
```

Server side, relaxing the query to specify only client_address="smartscaler-load-gen-acme-4":

```
http_server_request_duration_seconds_count{client_address="smartscaler-load-gen-acme-4", http_request_method="GET", http_response_status_code="200", http_route="/", instance="localhost:8999", job="beylaremote", k8s_deployment_name="catalog-acme-0", k8s_namespace_name="acme", k8s_node_name="ip-10-0-6-123.ec2.internal", k8s_pod_name="catalog-acme-0-8666f949bf-926zb", k8s_pod_start_time="2024-07-24 03:07:08 +0000 UTC", k8s_pod_uid="d8374dff-882a-4113-b174-e31e4db1377a", k8s_replicaset_name="catalog-acme-0-8666f949bf", server_address="frontend-acme-4", server_port="15006", service_name="catalog-acme-0", service_namespace="acme", url_path="/products"} 1
```

No entry with server_address="frontend-acme-4" and k8s_deployment_name="frontend-acme-4" is found, but two other stray entries are present, with counter values of 1 and 2. The server_address is "frontend-acme-4", but the k8s_deployment_name is "frontend-acme-9" or "catalog-acme-0".

Please let me know if it is possible to enable print_traces: true at runtime to inspect the samples sent to userspace. I was able to reproduce the issue, but that run had print_traces disabled, and now I cannot do any effective troubleshooting. Please let me know if I can collect any other information that could help isolate the root cause of this intermittent issue.

The Beyla configuration is: … and: ebpf: …
Best Regards
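As an aside, re-enabling trace printing is done through Beyla's YAML configuration (or the corresponding environment variables). A minimal sketch, assuming the option names below match the Beyla version in use (verify them against the Beyla docs):

```yaml
# Sketch of a Beyla configuration with debug output enabled.
# Option names assumed from Beyla's documented YAML layout; values are illustrative.
log_level: debug     # same effect as BEYLA_LOG_LEVEL=debug
print_traces: true   # print each captured request/response to stdout
```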
Thanks again @corneliu-calciu! I think the issue is that we are not instrumenting the server application correctly. It's marked in the demo as a Go service, which makes me think it's one of two reasons:
I was unable to find the source for these services. Do you know if they have published it anywhere? One way to investigate this further would be to set BEYLA_LOG_LEVEL=debug as an environment variable; this would tell us what we instrumented and what we could find. It will be in the first 200 lines of the Beyla log.
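The suggestion above can be sketched as a fragment of the Beyla DaemonSet spec (the container name is a placeholder, not taken from the actual deployment):

```yaml
# Hypothetical DaemonSet fragment; only the env section is the point here.
spec:
  template:
    spec:
      containers:
        - name: beyla              # placeholder container name
          env:
            - name: BEYLA_LOG_LEVEL
              value: "debug"       # instrumentation decisions appear early in the log
```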
Hi Nikola,

Please see the diagram here: https://github.com/vmwarecloudadvocacy/acme_fitness_demo/blob/master/acmeshop.png; the front-end service is described as a NodeJS application. Unfortunately, I don't have information on where the source code for these services is available. I will continue the research on this path. Thanks for the support!

Corneliu
Hi Nikola,

Please find attached the logs collected with BEYLA_LOG_LEVEL=debug. The issue was not reproduced, but they may give you some information. I was reproducing the issue with the Istio service mesh installed and enabled for the "acme" namespace where I was performing the tests. I don't know whether this environment is the trigger of this intermittent condition. I will try to reproduce the issue without Istio sidecars in the next days.

Best Regards
Hi @corneliu-calciu, I wasn't able to find the line in your log that mentions us instrumenting the catalog service, but I was able to run their example myself. I didn't try the k8s deployment; I tried their docker-compose after making the docker-compose example use the same images (:latest) as the k8s demo. After I tried this, Beyla's output shows the problem:
The examples they have use a very old Go version, which we don't support. Go officially supports the latest version minus 2, so 1.20 is the oldest officially supported release. Beyla goes down to 1.17, but we can't go earlier than that. When we encounter an old Go version we fall back to kernel-level instrumentation, which can miss a lot of signals. The reason is that Go internally overlays OS threads with virtual threads (goroutines), and it's not always possible for us to track the connection information without explicit Go support. So technically this is a current limitation.
Hi Nikola,

Thank you very much for the explanation.

Best Regards
Hi Nikola,

I checked on my side, and the issue I have is with the "frontend" service, which is NodeJS. From the Beyla logs: Please let me know if there are known limitations for NodeJS applications similar to the supported Go versions. The Kubernetes manifest to deploy only the frontend:

Best Regards
Hi @corneliu-calciu, the nodejs application will not have these limitations.
Hi
I was running a test with 10 load-generator instances sending HTTP requests to 10 ACME frontend service instances (https://github.com/vmwarecloudadvocacy/acme_fitness_demo). Load-generator instance N sends requests to frontend instance N, in a one-to-one relation.
The test was performed on a 14-node AKS Kubernetes cluster using the Grafana Beyla 1.6.4 release.
Important note: the Grafana Beyla DaemonSet was deployed after the applications and the load generators were ready and sending requests.
I found two issues that may be related:
On the node (6) where load-generator instance 6 was running, the http_server_request_duration_seconds family contained one unexpected metric whose "k8s_deployment_name", "service_name", and "client_address" labels were all equal to "loadgenerator-6". In effect, a flow carrying all the data sent by the traffic generator was reported as having been received by traffic generator instance 6.
This issue was seen several times, including in one basic deployment with a single load-generator instance and a single destination microservice. The issue is intermittent and seems to occur only if the Grafana Beyla eBPF agent is started in a cluster where traffic (HTTP requests) is already flowing.
On the node (14) where the frontend service was running, receiving about 10 HTTP requests per second, the http_server_request_duration_seconds family contained no metric with "k8s_deployment_name" and "service_name" labels equal to "frontend-6" and "client_address" equal to "loadgenerator-6". This is unexpected, as the HTTP requests were reported as successfully processed on the "loadgenerator-6" instance.
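The asymmetry described in the two issues above can be checked mechanically. Below is a minimal Python sketch (not part of Beyla; series are modeled as plain label dicts rather than a live Prometheus scrape, and the label values are illustrative) that flags client-side series with no matching server-side counterpart:

```python
# Sketch: flag client-side HTTP series whose server-side counterpart is missing.
# In practice the series would come from scraping the
# http_client/_server_request_duration_seconds_count metric families.

def missing_server_side(client_series, server_series):
    """Return client series whose (server_address, method, path) has no
    server-side series whose service_name matches that server_address."""
    missing = []
    for c in client_series:
        key = (c["server_address"], c["http_request_method"], c["url_path"])
        matched = any(
            (s["service_name"], s["http_request_method"], s["url_path"]) == key
            for s in server_series
        )
        if not matched:
            missing.append(c)
    return missing

# Illustrative series modeled after the symptoms in this issue.
client = [{"service_name": "loadgenerator-6", "server_address": "frontend-6",
           "http_request_method": "GET", "url_path": "/products"}]
server = [{"service_name": "catalog-acme-0", "server_address": "frontend-6",
           "http_request_method": "GET", "url_path": "/products"}]

# The only server-side series belongs to the wrong service, so the
# client-side flow is reported as missing on the server side.
print(missing_server_side(client, server))
```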
The next steps were performed to understand the issue's behavior:
I think the issue is related to the identification of the first packet of a flow: it seems a response packet was selected to build a flow session, but I have no strong evidence to confirm this hypothesis.
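To make the hypothesis concrete, here is a toy Python model of first-packet direction inference. This is an illustration of the suspected failure mode, not Beyla's actual logic: if a tracer attaches mid-flow and labels whichever endpoint sent the first observed packet as the client, a response seen first inverts the roles.

```python
# Toy model of first-packet direction inference when attaching mid-conversation.
# NOT Beyla's implementation; purely an illustration of the failure mode.

def infer_roles(packets):
    """Assign client/server roles from the first observed packet's source."""
    first = packets[0]
    return {"client": first["src"], "server": first["dst"]}

# Full conversation: a request from the load generator, then a response.
request  = {"src": "loadgenerator-6", "dst": "frontend-6"}
response = {"src": "frontend-6", "dst": "loadgenerator-6"}

# Tracer started before traffic: sees the request first -> roles are correct.
print(infer_roles([request, response]))
# Tracer attached mid-flow: sees a response first -> roles are inverted,
# which would misattribute the server-side metric exactly as described above.
print(infer_roles([response, request]))
```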
I will try to find an easy and more predictable way to reproduce the issues and do further investigation.
Please let me know if such issues are known, and whether they were fixed after release 1.6.4.
Please see the simple diagram below presenting the issues.
Best Regards
Corneliu