Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[metrics] [dataexchange] [networkmap] DXConnect Callbacks for Node Identity Check Metrics #1652

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

onelapahead
Copy link
Contributor

@onelapahead onelapahead commented Feb 28, 2025

Proposed changes

Private messaging can be easily disrupted if the identity broadcasted DX certificates are out-of-sync with the certificates actually in-use by a DX plugin. This can be true for both the OSS plugins and other proprietary DX plugin implementations. We've documented this procedure, and are considering further automating this procedure within FireFly or auxiliary tooling in future releases.

For now, FireFly has all the means to monitor this and make it known to the administrators running it via metrics and logs - so that they can use their judgement on whether to update a profile on-chain, or rollback their DX certs, etc.

This PR adds a new CheckNodeIdentityStatus to the NetworkMap, and a DXConnect callback to the DX plugin. Whenever a DX reconnects, either due to FireFly starting up or DX starting up or network issues, FireFly sees if its needs to re-initialize the DX. On this new connection is a great moment to ask DX for its cert and compare it to what was broadcasted, additionally the DX cert can be examined to see when a soonest expiry of its cert chain is for further monitoring. Using the DXConnect callback, the Orchestrator can be notified of the new connection, and ask the NetworkMap to do the status checks.

New metrics ff_multiparty_node_identity_dx_mismatch and ff_multiparty_node_identity_dx_expiry_epoch are added so that the NetworkMap can then publish these status findings as gauges that can be easily monitoring by a timeseries and alerting system.

Additional changes

  • returns a 400 if profile is not set on a PATCH /identitites/{iid} as we found the code currently assumes a profile object is always provided and if its not that causes a nil error / connection abruptly closed
  • all existing metrics are given a ns label so our users can monitoring the different metrics from the perspective of the various namespaces they may be running. NOTE: for Prometheus and other timeseries monitoring systems this could greatly increase the cardinality of the metrics being produced if there are a lot of namespaces running. But this is necessary to be able to identify if message, etc. are failing or pending for a particular namespace vs. another w/o inspecting logs.

Types of changes

  • Bug fix
  • New feature added
  • Documentation Update
  • New metrics added / updated

Please make sure to follow these points

  • I have read the contributing guidelines.
  • I have performed a self-review of my own code or work.
  • I have commented my code, particularly in hard-to-understand areas.
  • My changes generates no new warnings.
  • My Pull Request title is in format < issue name > eg Added links in the documentation.
  • I have added tests that prove my fix is effective or that my feature works.
  • My changes have sufficient code coverage (unit, integration, e2e tests).

Screenshots (If Applicable)

TODO

Other Information

TODO

Copy link

codecov bot commented Feb 28, 2025

Codecov Report

Attention: Patch coverage is 78.48837% with 37 lines in your changes missing coverage. Please review.

Project coverage is 99.83%. Comparing base (490539e) to head (b069dfd).

Files with missing lines Patch % Lines
internal/networkmap/check_node.go 55.38% 19 Missing and 10 partials ⚠️
internal/dataexchange/ffdx/ffdx.go 40.00% 2 Missing and 1 partial ⚠️
internal/networkmap/update_identity.go 0.00% 2 Missing and 1 partial ⚠️
internal/metrics/identity.go 92.59% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1652      +/-   ##
==========================================
- Coverage   99.95%   99.83%   -0.13%     
==========================================
  Files         337      339       +2     
  Lines       29607    29715     +108     
==========================================
+ Hits        29595    29666      +71     
- Misses          8       33      +25     
- Partials        4       16      +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant