
Invalid metric reported on the transmit side and missing metric on the receiving service side #1045

Open
corneliu-calciu opened this issue Jul 22, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@corneliu-calciu
Contributor

Hi

I was running a test with 10 load generator instances sending HTTP requests to 10 ACME frontend service instances (https://github.com/vmwarecloudadvocacy/acme_fitness_demo). Load generator instance N sends requests to frontend instance N, in a one-to-one relation.
The test was performed on a 14-node AKS Kubernetes cluster using the Grafana Beyla 1.6.4 release.
Important note: the Grafana Beyla DaemonSet was deployed after the applications and the load generators were already up and sending requests.

I found two issues that may be related:

  1. On the node (6) where load generator instance 6 was running, the http_server_request_duration_seconds family contained one unexpected metric whose "k8s_deployment_name", "service_name" and "client_address" labels were all equal to "loadgenerator-6". In other words, a flow receiving all the data sent by the traffic generator was reported as if it had been received by traffic generator instance 6 itself.
    This issue was seen several times, including in a basic deployment with a single load generator instance and a single destination microservice. The issue is intermittent and seems to appear only if the Grafana Beyla eBPF agent is started in a cluster where the traffic (HTTP requests) is already flowing.

  2. On the node (14) where the frontend service was running, receiving about 10 HTTP requests per second, the http_server_request_duration_seconds family contained no metric with the "k8s_deployment_name" and "service_name" labels equal to "frontend-6" and "client_address" equal to "loadgenerator-6". This is unexpected, since the HTTP requests were reported as successfully processed on the "loadgenerator-6" instance. A way to check which series a node exposes is sketched right after this list.
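
For reference, a minimal way to verify which http_server_request_duration_seconds series a node's Beyla exporter exposes, assuming the Prometheus exporter in this setup listens on localhost:8999 (the label value below is the one from this test):

  # On the node running frontend-6: list the server-side duration series reported
  # for requests coming from loadgenerator-6 (expected, but missing in issue 2).
  curl -s http://localhost:8999/metrics \
    | grep http_server_request_duration_seconds_count \
    | grep 'client_address="loadgenerator-6"'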

The following steps were performed to understand the behavior of the issue:

  • Delete the "loadgenerator-6" pod, wait for it to be recreated, and let it send requests again. The "loadgenerator-6" pod was scheduled on another cluster node. Both issues remained present.
  • Delete the "frontend-6" pod and wait for it to be recreated and operational. The "frontend-6" pod was scheduled on another cluster node. At that point both issues disappeared.

I think the issue is related to the first-packet identification for a flow; it seems a response packet was selected to build the flow session, but I have no strong evidence to confirm this hypothesis.

I will try to find an easier and more predictable way to reproduce the issues and do further investigation.

Please let me know if such issues are known and whether they were fixed after the 1.6.4 release.

Please see below a simple diagram presenting the issues.
[image: diagram of the two issues]

Best Regards
Corneliu

@grcevski
Contributor

Thanks so much for reporting this @corneliu-calciu ! I believe we fixed a number of these issues in our main branch. 1.7.0 is imminent, I think, so I expect it will happen this week. If you are able to try the beyla:main branch that would be great; if not, 1.7.0 should come very soon.
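
If it helps, a hedged sketch of how the DaemonSet could be pointed at a main-branch image for testing (the namespace, DaemonSet and container names below are placeholders, and grafana/beyla:main is assumed to be the published main-branch tag):

  # Switch the Beyla DaemonSet to the main-branch image and wait for the rollout
  # (names and tag are assumptions; adjust to your deployment).
  kubectl -n beyla set image daemonset/beyla beyla=grafana/beyla:main
  kubectl -n beyla rollout status daemonset/beyla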

@grcevski grcevski added the bug Something isn't working label Jul 22, 2024
@corneliu-calciu
Contributor Author

corneliu-calciu commented Jul 24, 2024

Hi Nikola

I ran the test against the beyla main version of the container and the second issue, the one about missing flows, is still present.

The flow from load-generator-4 to frontend-4 is missing from the server-side http_server_request_duration_seconds_count family, but present in the client-side http_client_request_duration_seconds_count metrics.

Client side
Verified on the node where “smartscaler-load-gen-acme-4” is running

[screenshot: client-side query result]

http_client_request_duration_seconds_count{http_request_method="GET", http_response_status_code="200", http_route="/**", instance="localhost:8999", job="beylaremote", k8s_deployment_name="smartscaler-load-gen-acme-4", k8s_namespace_name="acme", k8s_node_name="ip-10-0-14-188.ec2.internal", k8s_pod_name="smartscaler-load-gen-acme-4-85f77698fb-shwsk", k8s_pod_start_time="2024-07-23 19:08:26 +0000 UTC", k8s_pod_uid="288f3690-ebc9-4491-8e6d-4be72ea775cd", k8s_replicaset_name="smartscaler-load-gen-acme-4-85f77698fb", server_address="frontend-acme-4", server_port="3000", service_name="smartscaler-load-gen-acme-4", service_namespace="acme", url_path="/products"} 33038   << Correct

Server side
Verified on the node where “frontend-acme-4” is running.

[screenshot: server-side query result]

Relaxing the query to specify only client_address="smartscaler-load-gen-acme-4":

[screenshot: relaxed server-side query result]

http_server_request_duration_seconds_count{client_address="smartscaler-load-gen-acme-4", http_request_method="GET", http_response_status_code="200", http_route="/", instance="localhost:8999", job="beylaremote", k8s_deployment_name="catalog-acme-0", k8s_namespace_name="acme", k8s_node_name="ip-10-0-6-123.ec2.internal", k8s_pod_name="catalog-acme-0-8666f949bf-926zb", k8s_pod_start_time="2024-07-24 03:07:08 +0000 UTC", k8s_pod_uid="d8374dff-882a-4113-b174-e31e4db1377a", k8s_replicaset_name="catalog-acme-0-8666f949bf", server_address="frontend-acme-4", server_port="15006", service_name="catalog-acme-0", service_namespace="acme", url_path="/products"} 1
http_server_request_duration_seconds_count{client_address="smartscaler-load-gen-acme-4", http_request_method="GET", http_response_status_code="200", http_route="/", instance="localhost:8999", job="beylaremote", k8s_deployment_name="frontend-acme-9", k8s_namespace_name="acme", k8s_node_name="ip-10-0-6-123.ec2.internal", k8s_pod_name="frontend-acme-9-68ddb6ff8d-sff9t", k8s_pod_start_time="2024-07-23 19:07:29 +0000 UTC", k8s_pod_uid="b44c8a9d-fc8d-445f-a34c-0f46fe5aa1ad", k8s_replicaset_name="frontend-acme-9-68ddb6ff8d", server_address="frontend-acme-4", server_port="15006", service_name="frontend-acme-9", service_namespace="acme", url_path="/products"} 2

No entry with server_address="frontend-acme-4" and k8s_deployment_name="frontend-acme-4" is found, but two other stray entries are present, with counter values of 1 and 2.

The server_address is "frontend-acme-4", but the k8s_deployment_name is "frontend-acme-9" or "catalog-acme-0": the server_address and k8s_deployment_name values are inconsistent.

Please let me know if it is possible to enable print_traces: true at runtime to inspect the samples sent to userspace. I was able to reproduce the issue, but the configuration had print_traces disabled at the time, so I cannot do any effective troubleshooting now.
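
One hedged way to flip that flag is to update the DaemonSet environment, which restarts the Beyla pods rather than changing the setting of a live process (the namespace and DaemonSet name below are placeholders):

  # Enable printing of traced requests; note the pods are recreated,
  # so this is not a true runtime toggle.
  kubectl -n beyla set env daemonset/beyla BEYLA_PRINT_TRACES=true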

Please let me know if I can collect any other information that could help isolate the root cause of this intermittent issue.

The Beyla configuration is:
AGENT_MODE: flow
BEYLA_KUBE_METADATA_ENABLE: true
BEYLA_DISCOVERY_POLL_INTERVAL: 500ms
BEYLA_METRICS_REPORT_PEER: true
BEYLA_METRICS_REPORT_TARGET: true
BEYLA_SKIP_GO_SPECIFIC_TRACERS: true
BEYLA_PRINT_TRACES: false
BEYLA_METRICS_INTERVAL: 10ms
BEYLA_BPF_BATCH_TIMEOUT: 10ms
BEYLA_LOG_LEVEL: INFO
BEYLA_BPF_DEBUG: false
BEYLA_BPF_HTTP_REQUEST_TIMEOUT: 10s
BEYLA_PROCESSES_INTERVAL: 100ms
BEYLA_INTERNAL_METRICS_PROMETHEUS_PORT: 8999
BEYLA_NETWORK_DROP_EXTERNAL: true

and:

ebpf:
  wakeup_len: 100
grafana:
  otlp:
    submit:
      - metrics
      - traces
otel_traces_export:
  sampler:
    name: parentbased_traceidratio
    arg: 0.01
routes:
  patterns:
    - /**
  ignored_patterns:
    - /readyz
    - /livez
    - /stats/prometheus
    - /healthz/ready
    - /health**
    - /readiness
    - /liveness
    - /is-dynamic-lb-initialized
    - /apis/metrics.k8s.io/v1beta1
    - /openapi/v2
    - /metrics/resource
  unmatched: wildcard
discovery:
  services:
    - k8s_namespace: (.*)
attributes:
  kubernetes:
    enable: true
prometheus_export:
  port: 8999
  path: /metrics
Best Regards
Corneliu

@grcevski
Contributor

Thanks again @corneliu-calciu !

I think the issue is that we are not instrumenting the server application correctly. It's marked in the demo as a Go service, which makes me think it's one of two reasons:

  • It's using a server framework we don't support yet, like Go Fiber.
  • It's using an old Go version, prior to 1.17, which we don't support properly.

I was unable to find the source code for these services. Do you know if they have published it anywhere?

One way to investigate this further would be to set BEYLA_LOG_LEVEL=debug as an environment variable. This would tell us what we instrumented and what we could find; it will be in the first 200 lines of the Beyla log.
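
A hedged sketch of how that could be applied and collected (the namespace and DaemonSet name below are placeholders; kubectl logs on a DaemonSet reads one of its pods, so for a specific node use that node's pod name instead):

  # Turn on debug logging, then capture the start of the log, where the
  # discovery/instrumentation decisions are printed.
  kubectl -n beyla set env daemonset/beyla BEYLA_LOG_LEVEL=debug
  kubectl -n beyla logs daemonset/beyla | head -n 200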

@corneliu-calciu
Contributor Author

Hi Nikola

Please see the diagram here: https://github.com/vmwarecloudadvocacy/acme_fitness_demo/blob/master/acmeshop.png; the front-end service is described there as a NodeJS application. Unfortunately I don't have information about where the source code for these services is available. I will continue researching along this path.
I will follow your advice, set BEYLA_LOG_LEVEL=debug, and try to reproduce the issue again to collect more data.

Thanks for the support!

Corneliu

@corneliu-calciu
Contributor Author

Hi Nikola

Please find attached the logs collected with BEYLA_LOG_LEVEL=DEBUG. The issue was not reproduced, but they may still give you some information.

I was reproducing the issue with the Istio service mesh installed and enabled for the "acme" namespace where I was performing the tests. I don't know whether this environment may be the trigger of this intermittent condition. I will try to reproduce the issue without Istio sidecars in the next days.

Best Regards
Corneliu
beyla.log

@grcevski
Contributor

Hi @corneliu-calciu,

I wasn't able to find a line in your log that mentions us instrumenting the catalog service, but I was able to run their example myself. I didn't try the k8s deployment; I tried their docker-compose after making the docker-compose example use the same images (:latest) as the k8s demo.

After I tried this, Beyla's output shows the problem:

time=2024-07-24T15:41:35.168-04:00 level=INFO msg="instrumenting process" component=discover.TraceAttacher cmd=/app/catalog pid=2505270
time=2024-07-24T15:41:35.168-04:00 level=WARN msg="Unsupported Go program detected, using generic instrumentation" component=discover.TraceAttacher error="finding instrumentation points: unsupported Go version: go1.13.6. Minimum supported version is 1.17"

The examples they have use a very old Go version which we don't support. Go officially supports the latest version minus 2, so the oldest version they still support is around 1.20. Beyla goes down to 1.17, but we can't go earlier than that.

When we encounter an old Go version we fall back to kernel-level instrumentation, which can miss a lot of signals. The reason for this is that Go internally overlays OS threads with virtual threads (goroutines), and it's not always possible for us to track the connection information without explicit Go support.

So technically this is a current limitation.
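
As a side note, one hedged way to confirm the Go version a service binary was built with is to copy it out of the container and inspect it with the Go toolchain (the pod name below is a placeholder; /app/catalog is the path shown in the log above, and kubectl cp needs tar inside the container):

  # Copy the catalog binary locally and print the Go version it was built with.
  kubectl cp acme/<catalog-pod>:/app/catalog ./catalog
  go version ./catalog   # prints something like "./catalog: go1.13.6"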

@corneliu-calciu
Contributor Author

Hi Nikola

Thank you very much for the explanation.

Best Regards
Corneliu

@corneliu-calciu
Contributor Author

Hi Nikola

I checked on my side and the issue I have is with the "frontend" service, which is NodeJS.
From inside the frontend container I ran:
root@frontend-78d557cc6c-xsmxx:/app# cat /proc/25/cmdline
nodeapp.js

From Beyla logs:
time=2024-07-25T07:31:08.095Z level=INFO msg="new process for already instrumented executable" component=discover.TraceAttacher pid=136953 child=[] exec=/opt/bitnami/node/bin/node

Please let me know if there are known limitations for NodeJS applications similar to the supported Go version limitation.

The Kubernetes manifest used to deploy only the frontend:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  selector:
    matchLabels:
      app: frontend
  replicas: 1
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: frontend
          image: gcr.io/vmwarecloudadvocacy/acmeshop-front-end:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 30000
          env:
            - name: FRONTEND_PORT
              value: "3000"
            - name: USERS_HOST
              value: users-0
            - name: CATALOG_HOST
              value: catalog-0
            - name: ORDER_HOST
              value: order-0
            - name: CART_HOST
              value: cart-0
            - name: USERS_PORT
              value: "8083"
            - name: CATALOG_PORT
              value: "8082"
            - name: CART_PORT
              value: "5000"
            - name: ORDER_PORT
              value: "6000"
            - name: JAEGER_AGENT_HOST
              value: localhost
          resources:
            requests:
              memory: "10Mi"
              cpu: "10m"
            limits:
              memory: "200Mi"
              cpu: "1000m"
      restartPolicy: Always

Best Regards
Corneliu

@grcevski
Contributor

Hi @corneliu-calciu, the nodejs application will not have these limitations.
