[metrics] [dataexchange] [networkmap] DXConnect Callbacks for Node Identity Check Metrics #1652
Proposed changes
Private messaging can easily be disrupted if the DX certificates broadcast with a node's identity are out of sync with the certificates actually in use by a DX plugin. This applies to both the OSS plugins and proprietary DX plugin implementations. We've documented this procedure, and are considering automating it further within FireFly or auxiliary tooling in future releases.
For now, FireFly has the means to monitor this and surface it to the administrators running it through metrics and logs, so that they can use their judgement on whether to update a profile on-chain, roll back their DX certs, etc.
This PR adds a new `CheckNodeIdentityStatus` to the `NetworkMap`, and a `DXConnect` callback to the DX plugin. Whenever a DX reconnects, whether due to FireFly starting up, DX starting up, or network issues, FireFly checks whether it needs to re-initialize the DX. This new connection is a good moment to ask DX for its cert and compare it to what was broadcast. The DX cert can also be examined to find the soonest expiry in its cert chain, for further monitoring. Using the `DXConnect` callback, the `Orchestrator` can be notified of the new connection and ask the `NetworkMap` to do the status checks.
New metrics
`ff_multiparty_node_identity_dx_mismatch` and `ff_multiparty_node_identity_dx_expiry_epoch` are added so that the `NetworkMap` can publish these status findings as gauges that can easily be monitored by a timeseries and alerting system.
Additional changes
- Handle the case where `profile` is not set on a `PATCH /identities/{iid}`. The code currently assumes a `profile` object is always provided; if it is not, that causes a nil error and the connection is abruptly closed.
- Metrics carry an `ns` label so that users can monitor them from the perspective of each namespace they are running. NOTE: for Prometheus and other timeseries monitoring systems, this could greatly increase the cardinality of the metrics produced if many namespaces are running. It is necessary, however, to identify whether messages, etc. are failing or pending for one namespace versus another without inspecting logs.
Types of changes
Screenshots (If Applicable)
TODO
Other Information
TODO