Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add specification for telemetry #17

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from
Draft

Add specification for telemetry #17

wants to merge 1 commit into from

Conversation

adklempner
Copy link
Member

No description provided.

@adklempner adklempner marked this pull request as draft May 24, 2024 01:23
@@ -0,0 +1,39 @@
---
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need metrics section. There can be 2 types of them: general and implementation specific.

As a basic I think following should be specified:

  • timestamps of receiving and sending message;
  • size of message packets;
  • content topics / pubsub topics of the messages;
  • code logs of errors and warnings;
  • peer ID or sender and receiver;
  • message hashes;

In case we collect multi addresses we can derive IP and try to see correlation between geographical presence and availability. This approach is questionable and unless we need it we should drop it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relay metrics needed:

  • mesh state: peers discovered, connections established;
  • any errors with peers;

This is a follow-up from today's Bug Bash meeting.

What else can we add, @jm-clius @chaitanyaprem ?


#### Group Chats

#### Contact Requests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should understand what is the center of our interest for this RFC. It seems Waku and not Status (there should be one specific for them).

Status as an app will have to collect protocol level telemetry and on top of it application level.
Ideally for collecting Waku - they shouldn't care and we should provide some framework. I might be:

  • a way to get data from within a node;
  • or turn on or off Telemetry service embedded into running Waku node (makes sense if we start running our own nodes with embedded telemetry to measure network);

@@ -0,0 +1,39 @@
---
title: TELEMETRY
name: Waku v2 Telemetry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd propose to add Metrics usage section that align with our building in public principle.
In that section we should explain what knowledge we are aiming to get from the collected data and re-assure a reader in us not having any way of personalizing collected data etc. This might come in handy in case we'd need to communicate is as a Privacy policy.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think beyond the metruc usage, we need to state what we are trying to learn so we can justify the work to ourselves. Not really any point collecting random data if you we don't know how we are going to analyse it.

@@ -0,0 +1,39 @@
---
title: TELEMETRY
name: Waku v2 Telemetry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should stop using the v2 term

Suggested change
name: Waku v2 Telemetry
name: Waku Telemetry

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's more of a meta concern cc @jm-clius

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. We're certainly not consistent here yet, but:

  • the name of a new spec should no longer include "v2" (just call it "Waku Telemetry")
  • the title of specs should preferably be prefixed with WAKU2- (i.e. I propose WAKU2-TELEMETRY as title)

I've updated the contribution guidelines to reflect this: https://github.com/waku-org/specs/blob/master/README.md#contributions

@adklempner
Copy link
Member Author

Two strategies for tracking message reliability:

  • A:

    1. User sends message with light push and notifies telemetry after receiving success confirmation from minimum number of peers (likely 1 peer)
    2. Telemetry waits for a store node to have received the message
  • B: (more akin to end-to-end reliability)

    1. User sends message with Light Push and notifies telemetry (after success)
    2. Another user running Waku receives the message via Filter and notifies telemetry

Note that for B, both users must be running the Waku application at the same time, since we do not expect "offline" nodes to receive a historical message via Filter

Message that was sent but never received by store node is worst case
In each case, the latency between steps 1 and 2 can be a metric to measure.


Types of metrics, sorted by priority:

  1. Reliability
  2. Bandwidth
  3. Latency

@adklempner
Copy link
Member Author

Should telemetry implementation be purely external or purely internal? i.e. with current work in status-go and waku dogfooding, all metrics are collected without making changes within the go-waku and js-waku libraries. This makes telemetry an application-level feature (or perhaps an application in itself), and thus any changes to the implementation or spec should not affect the protocol. It also limits metrics to use data exposed by the waku libraries' public interfaces.

If purely internal, then telemetry becomes more closely coupled with core Waku specs/implementation but allows for more granular metric collection.

@fryorcraken
Copy link

Should telemetry implementation be purely external or purely internal? i.e. with current work in status-go and waku dogfooding, all metrics are collected without making changes within the go-waku and js-waku libraries. This makes telemetry an application-level feature (or perhaps an application in itself), and thus any changes to the implementation or spec should not affect the protocol. It also limits metrics to use data exposed by the waku libraries' public interfaces.

If purely internal, then telemetry becomes more closely coupled with core Waku specs/implementation but allows for more granular metric collection.

There could be merit of having telemetry at both level.
The issue with telemetry at Waku level is that a number of information and learning would be lost, as telemetry also hleps understand application behaviour but if it's at Waku level then application information (e.g. status message type) is lost.

For now, let's focus telemetry at application level. We can review if Waku level telemetry is something wanted later.

@fryorcraken
Copy link

I think this is not necessary anymore because of the direction we have taken for telemetry.

Let's review as part of 2025 H1 roadmap but it looks like having Prometheus metrics is good enough for what we need and we can move in this direction.

I think a spec was more necessary when we had a complex behaviour as per the original telemetry design.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants