Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(outbound): TCP/TLS route metrics #3355

Merged
merged 13 commits into from
Nov 13, 2024
Merged

feat(outbound): TCP/TLS route metrics #3355

merged 13 commits into from
Nov 13, 2024

Conversation

olix0r
Copy link
Member

@olix0r olix0r commented Nov 12, 2024

This PR adds transport level route metrics for TCP and TLS. The metrics added are:

  • outbound_(tcp/tls)_route_open_total - provides parent and route reference information.
  • outbound_(tcp/tls)_route_close_total - provides parent and route information as well as an error label that is populated in case the connection has been closed due to an error.

Signed-off-by: Zahari Dichev [email protected]

zaharidichev and others added 10 commits October 29, 2024 18:18
Signed-off-by: Zahari Dichev <[email protected]>
Signed-off-by: Zahari Dichev <[email protected]>
Signed-off-by: Zahari Dichev <[email protected]>
Because the opaq route mocks its selection logic, the first route returned by
the API will always be used. This means that all subsequent routes are
incoherent and cannot be selected.

There's no value in preserving incoherent states in the proxy, so it's
preferable to initially require that the control plane only returns one opaq
route.

This change removes some of the unused matching types and simplifies the opaq
router to only handle one route.
Signed-off-by: Zahari Dichev <[email protected]>
Copy link
Member Author

@olix0r olix0r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My general sense is:

  • We don't want the errno changes, or anything to do with errno in the short term.
  • We should use a simple 'Error' enum with expected errors ('Forbidden', 'I/O', 'Invalid', 'Unexpected', '')--we can expand this with additional variants matching against IO error codes, but we can use a bounded set rather than one that takes a numeric code.
  • If we have a constrained set of errors, we should eagerly register them rather than lazily.
    • I.e. we have a TcpRouteMetricsFamily that is registered against route labels in a NewService, which produces a TcpRouteMetrics with various error counters.
  • My sense is that we don't gain much from the current factoring/reusability of InstrumentConnection--we may be better off having all of the metrics and labels defined in the outbound crate. I'm not sure that the middleware portion of this is particularly resusable. Given that errors are defined in the outbound crate, it probably makes sense for all of the middleware to be bound in the outbound crate. If we do want a reusable middleware, it probabably need not be prom::Counter-oriented (i.e. it could hold a trait that calls a function on Result<(), &Error>).

This is all to try to avoid being generic in places that are not really advantageous.

Comment on lines 39 to 92
impl<T, N, L> NewService<T> for NewInstrumentConnection<N, L>
where
N: Clone,
L: Clone,
{
type Service = InstrumentConnection<T, N, L>;

fn new_service(&self, target: T) -> Self::Service {
InstrumentConnection::new(target, self.inner.clone(), self.params.clone())
}
}

// === impl InstrumentConnection ===

impl<T, N, L: Clone> InstrumentConnection<T, N, L> {
fn new(target: T, inner: N, params: TcpMetricsParams<L>) -> Self {
Self {
target,
inner,
params,
}
}
}

impl<T, I, N, S, L> Service<I> for InstrumentConnection<T, N, L>
where
L: Clone + Hash + Eq + EncodeLabelSetMut + Debug + Send + Sync + 'static,
T: Clone + Send + Sync + 'static,
T: Param<L>,
I: io::AsyncRead + io::AsyncWrite + Debug + Send + Unpin + 'static,
N: NewService<T, Service = S> + Clone + Send + 'static,
S: Service<I> + Send,
S::Error: Into<Error>,
S::Future: Send,
{
type Response = S::Response;
type Error = Error;
type Future = Pin<Box<dyn Future<Output = Result<S::Response, Error>> + Send + 'static>>;

#[inline]
fn poll_ready(&mut self, _: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
Poll::Ready(Ok(()))
}

fn call(&mut self, io: I) -> Self::Future {
const ERRNO_UNKNOWN: i32 = 0;
let target = self.target.clone();
let new_accept = self.inner.clone();
let labels: L = target.param();
let params = self.params.clone();

Box::pin(async move {
let metrics_open = params.metrics_open(labels.clone());
let svc = new_accept.new_service(target);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be structured such that we don't "break" the stacking--the NewService should eagerly do as much work as possible and this future should only handle registering completion metrics.

Signed-off-by: Zahari Dichev <[email protected]>
let metrics = self.metrics.clone();

Box::pin(async move {
let svc = new_accept.new_service(target);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't this happen in the NewService impl so that this middleware can properly proxy poll_ready()?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will give it a try

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 2549251

Signed-off-by: Zahari Dichev <[email protected]>
Comment on lines 219 to 223
let inner = self.inner.clone();

Box::pin(async move {
metrics.inc_open();
match inner.oneshot(io).await.map_err(Into::into) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let inner = self.inner.clone();
Box::pin(async move {
metrics.inc_open();
match inner.oneshot(io).await.map_err(Into::into) {
metrics.inc_open();
let call = self.inner.call(io);
Box::pin(async move {
match call.await.map_err(Into::into) {

The clone/oneshot invalidates the prior poll_ready -- poll_ready "reserves" us the ability to send a request. We want to use that reservation. When we clone, we invalidate that reservation. Oneshot combines waiting for a reservation and dispatching the request.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@olix0r Done in 5ac1352

@olix0r
Copy link
Member Author

olix0r commented Nov 13, 2024

LGTM once we fix the service middleware.

Signed-off-by: Zahari Dichev <[email protected]>
@zaharidichev zaharidichev marked this pull request as ready for review November 13, 2024 17:16
@zaharidichev zaharidichev requested a review from a team as a code owner November 13, 2024 17:16
@olix0r olix0r merged commit aaf6db0 into main Nov 13, 2024
15 checks passed
@olix0r olix0r deleted the zd/tcp-level-route-metrics branch November 13, 2024 17:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants