[SCHEMATIC-240] stripped google sheet information from open telemetry span status and message #1573

Draft
wants to merge 6 commits into
base: develop

linglp
Contributor

@linglp linglp commented Feb 6, 2025

Problem

When an error involving Google Sheets is logged, the message may include the URL of the Google Sheet being processed. That link then shows up in SigNoz, in both logs and traces.

Solution

Create a custom log processor to filter out the sensitive information in the log

Evidence that this is working

Screenshot 2025-02-11 at 3 38 46 PM

@linglp linglp requested a review from a team as a code owner February 6, 2025 20:57
@andrewelamb
Contributor

Perhaps we should split the Jira issue out, with another to further investigate doing this via OTEL/Signoz instead of in the code?

try:
    wb.set_dataframe(manifest_df, (1, 1), fit=True)
except HttpError as ex:
    pattern = r"https://sheets\.googleapis\.com/v4/spreadsheets/[\w-]+"
Contributor

Will this catch all google sheet urls?

Collaborator

Also my question. Could a change to the google api, that would otherwise be non-breaking, change the format of the URL used so that it's no longer appropriately found in the message string?
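To make the coverage question concrete, here is a minimal check of the pattern from the diff against the two URL shapes that appear in this thread. The spreadsheet ID is made up, and `BROAD_PATTERN` is only an assumption about what a wider match could look like; neither it nor the `redact` helper comes from the PR.

```python
import re

# Pattern quoted from the diff under review (API endpoint form).
API_PATTERN = re.compile(r"https://sheets\.googleapis\.com/v4/spreadsheets/[\w-]+")

# Hypothetical broader pattern (an assumption, not from the PR): also covers
# the user-facing docs.google.com form quoted later in this thread.
BROAD_PATTERN = re.compile(
    r"https://(?:sheets\.googleapis\.com/v4/spreadsheets"
    r"|docs\.google\.com/spreadsheets/d)/[\w-]+"
)

SHEET_ID = "1AbC-dEf_123"  # made-up ID for illustration only
api_url = f"https://sheets.googleapis.com/v4/spreadsheets/{SHEET_ID}"
docs_url = f"https://docs.google.com/spreadsheets/d/{SHEET_ID}"


def redact(pattern: re.Pattern, message: str) -> str:
    """Replace every match of `pattern` in `message` with a fixed marker."""
    return pattern.sub("[REDACTED_GOOGLE_SHEET_URL]", message)


# The diff's pattern handles the API form but leaves the docs form untouched.
api_result = redact(API_PATTERN, f"error at {api_url}/values")
docs_result_narrow = redact(API_PATTERN, f"see {docs_url}")
docs_result_broad = redact(BROAD_PATTERN, f"see {docs_url}")
```

Under these assumptions, the narrow pattern would miss any message that carries the `docs.google.com` form, and any future change to the API host or path would have the same effect.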

@linglp
Contributor Author

linglp commented Feb 6, 2025

@andrewelamb I'm not entirely sure if I'm on the right track, so I'd like to wait until @BryanFauble returns to confirm. This implementation removes Google Sheet links in spans, but I'm unsure if the ticket also requires a more general approach to stripping all sensitive information. I'll make further modifications as I discuss more with Bryan.

@thomasyu888
Member

thomasyu888 commented Feb 7, 2025

not entirely sure if I'm on the right track... This implementation removes Google Sheet links in spans, but I'm unsure if the ticket also requires a more general approach to stripping all sensitive information.

@linglp this is a good line of thought. Some questions to guide you: Is this the only place Google Sheets are logged? Off the top of your head, do you know of any other sensitive information that is logged? Is there functionality in OTEL that filters out logs? (e.g., https://signoz.io/blog/sending-and-filtering-python-logs-with-opentelemetry/#how-the-default-filter-keeps-out-unwanted-logs?)

@andrewelamb what are your thoughts? What we want to avoid is sensitive data being transferred to signoz cloud.

I'll add my personal views after this discussion.
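On the log-filtering question above: one option on the Python side is a standard-library `logging.Filter` that rewrites a record rather than dropping it. The sketch below is only an illustration; the class name, the regex, and the `[REDACTED]` marker are assumptions and are not taken from schematic or the PR.

```python
import logging
import re

# Assumed pattern covering both URL forms mentioned in this thread.
SHEET_URL = re.compile(
    r"https://(?:sheets\.googleapis\.com/v4/spreadsheets"
    r"|docs\.google\.com/spreadsheets/d)/[\w-]+"
)


class RedactSheetUrlFilter(logging.Filter):
    """Hypothetical filter that sanitizes a record instead of discarding it."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render msg % args first so URLs passed as arguments are covered too.
        record.msg = SHEET_URL.sub("[REDACTED]", record.getMessage())
        record.args = ()  # message is already rendered; avoid re-formatting
        return True  # keep the (now sanitized) record


# Usage: attach to whichever logger or handler feeds the exporter.
logger = logging.getLogger("redaction_filter_demo")
logger.addFilter(RedactSheetUrlFilter())
```

Two limits worth keeping in mind: a filter attached to a logger is not applied to records propagated from child loggers, and this sketch does not touch traceback text attached through `exc_info`.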

wb.set_dataframe(manifest_df, (1, 1), fit=True)
try:
    wb.set_dataframe(manifest_df, (1, 1), fit=True)
except HttpError as ex:
Member

Pending further discussion of whether we think this is the right approach, this new exception that's being caught should have a unit test.

This is a good example of where a tiny design before doing the work would be helpful. That said, sometimes you have to do a little bit before knowing your options.

Contributor

Agreed.

@andrewelamb
Contributor

@thomasyu888 I don't have anything to add at the moment on data getting into Signoz that shouldn't (except to agree that it's bad :) ). This is something I'll keep in mind as I start working more closely with Signoz, however.


wb.set_dataframe(manifest_df, (1, 1), fit=True)
try:
    wb.set_dataframe(manifest_df, (1, 1), fit=True)
Collaborator

Is this the only call that can trigger an error with the URL in it? I'm thinking of things like sh = gc.open_by_url(manifest_url) in get_dataframe_by_url or similar.

@SageGJ
Collaborator

SageGJ commented Feb 7, 2025

What are y'all's thoughts about wrapping/modifying sys.excepthook, catching HttpErrors within, and sanitizing the messages there?

Something like*

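import sys

from googleapiclient.errors import HttpError  # assumed source of HttpError


def custom_except_hook(exc_type, value, traceback):
    # issubclass also matches subclasses of HttpError, unlike `==`
    if issubclass(exc_type, HttpError):
        # message sanitizing
        # message raising
        ...  # placeholder so the sketch parses
    else:
        sys.__excepthook__(exc_type, value, traceback)


sys.excepthook = custom_except_hook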
  • would apply to multiple locations where an error is raised that includes a google sheets url
  • could be extended to modify different error types that contain other sensitive information
  • avoids having to wrap multiple blocks of code in try/except statements
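
A fuller version of the hook idea above might look like the sketch below. It is a minimal illustration under stated assumptions: `FakeHttpError` stands in for the real `HttpError` so the example runs without the Google client libraries, and `sanitize` and the `[REDACTED]` marker are hypothetical names, not part of the PR.

```python
import re
import sys

# Pattern quoted from the diff under review.
SHEET_URL = re.compile(r"https://sheets\.googleapis\.com/v4/spreadsheets/[\w-]+")


class FakeHttpError(Exception):
    """Stand-in for the real HttpError (assumption, for a runnable sketch)."""


def sanitize(message: str) -> str:
    """Replace any matching Google Sheets API URL with a fixed marker."""
    return SHEET_URL.sub("[REDACTED]", message)


def custom_except_hook(exc_type, value, traceback):
    if issubclass(exc_type, FakeHttpError):
        # Print a sanitized one-line summary instead of the default traceback.
        print(f"{exc_type.__name__}: {sanitize(str(value))}", file=sys.stderr)
    else:
        # Everything else goes to the interpreter's original hook.
        sys.__excepthook__(exc_type, value, traceback)


def install_hook() -> None:
    """Install the hook; kept in a function so importing has no side effects."""
    sys.excepthook = custom_except_hook
```

Note that a hook like this only changes what is printed for uncaught exceptions. It does not, by itself, alter anything a tracing SDK may already have recorded before the exception reached the hook.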

@linglp @andrewelamb @thomasyu888

*modified from here

@linglp
Contributor Author

linglp commented Feb 7, 2025

Thanks for all the discussion here. In retrospect, a design document could help clarify things further. To summarize, the Google Sheet link was originally found in SigNoz traces, and this solution specifically removes it from traces and spans within a single function call. My plan was to confirm with Bryan whether removing sensitive information from traces is necessary, as the ticket only mentioned "logs," and whether handling it directly in the function (rather than using a custom trace processor) is an acceptable approach. I can also document what I’ve tried and why those approaches didn’t work separately.

Since the error originates from Google APIs, any part of the system that interacts with them could potentially trigger it. If Bryan confirms this approach is acceptable, I can proceed with adding unit tests and considering how to wrap the exception.

@linglp linglp marked this pull request as draft February 7, 2025 20:26
@thomasyu888
Member

thomasyu888 commented Feb 10, 2025

What are y'all's thoughts about wrapping/modifying sys.excepthook, catching HttpErrors within, and sanitizing the messages there?

@SageGJ here are my thoughts. Using sys.excepthook to filter out sensitive Google API URLs from exception messages can work and is creative, but some considerations:

  1. Setting sys.excepthook changes how all unhandled exceptions are processed globally in one module. I wonder if it would have issues when it's run in a multi-threaded or multi-module environment.
  2. This only catches uncaught exceptions. If the HttpError is handled somewhere else (e.g., inside a try-except block), this function won't intercept it.
  3. Useful debug info could be suppressed. This is a general issue: we actually want these logs when schematic is run as a CLI/library; we just don't want to send the information to SigNoz, @linglp. For example, users of the schematic CLI should have these Google Sheet links returned to them.
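
Points 1 and 2 above can be checked with a small standard-library experiment. The sketch below temporarily swaps in recording hooks (hypothetical names) and then restores the originals. It relies on `threading.excepthook`, which is available from Python 3.8 and is where exceptions escaping a `threading.Thread` are routed.

```python
import sys
import threading

seen_by_sys_hook = []
seen_by_thread_hook = []


def recording_hook(exc_type, value, traceback):
    seen_by_sys_hook.append(exc_type.__name__)


def recording_thread_hook(args):
    seen_by_thread_hook.append(args.exc_type.__name__)


def run_demo() -> None:
    old_sys, old_thread = sys.excepthook, threading.excepthook
    sys.excepthook, threading.excepthook = recording_hook, recording_thread_hook
    try:
        # Handled exception: neither hook is called.
        try:
            raise ValueError("handled")
        except ValueError:
            pass
        # Exception escaping a worker thread: routed to threading.excepthook.
        worker = threading.Thread(target=lambda: 1 / 0)
        worker.start()
        worker.join()
    finally:
        # Always restore the original hooks.
        sys.excepthook, threading.excepthook = old_sys, old_thread


run_demo()
```

After `run_demo()`, `seen_by_sys_hook` stays empty while `seen_by_thread_hook` holds the worker's error, so a sanitizer installed only on `sys.excepthook` would see neither case.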

@SageGJ
Collaborator

SageGJ commented Feb 11, 2025

@thomasyu888 thanks for adding!
For the points you've added:

  1. I agree with the concern about multi-threaded or multi-module environments. I was envisioning applying this across all of schematic, in the __init__.py file, since we could specifically catch and modify the appropriate HTTP errors with sheets URLs in them and pass the other exceptions to the regular handler.
  2. If the error is caught and handled without being raised, will it still be logged in signoz?
  3. I agree. There's also the concern when signoz is used locally with the library/cli where we'd want to censor this information from reaching signoz.

Given points 1 and 3, and if we decide we'd want to handle this within schematic and not within OTEL itself, we could modify __init__.py where the tracing is currently set up so that it also includes something like

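import os
import sys

from googleapiclient.errors import HttpError  # assumed source of HttpError


def custom_except_hook(exc_type, value, traceback):
    # issubclass also matches subclasses of HttpError, unlike `==`
    if issubclass(exc_type, HttpError):
        # message sanitizing
        # message raising
        ...  # placeholder so the sketch parses
    else:
        sys.__excepthook__(exc_type, value, traceback)


# Only install the hook when an OTLP endpoint is configured.
signoz_enabled = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT")
if signoz_enabled:
    sys.excepthook = custom_except_hook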

I noticed tracing is still enabled in the absence of the OTEL headers, so it might be better to check for the presence of TRACING_EXPORT_FORMAT or LOGGING_EXPORT_FORMAT for tracing in general.
I also realize changing how we process all exceptions could be a bit much, so if we go this route we'd want to be really strict in selecting which ones are caught, to minimize side effects.

        self._exporter = exporter
        self._shutdown = False

    def redact_google_sheet(self, message: str) -> str:
Member

@thomasyu888 thomasyu888 Feb 12, 2025

It's important we keep the logs when people are running the CLI. If I'm not mistaken, I think some CLI commands rely on the gsheets link being returned.

My guess is that this would fail some of the CLI tests.

See:

google_sheet_result = [
    result
    for result in result_list
    if result.startswith("https://docs.google.com/spreadsheets/d/")
]
assert len(google_sheet_result) == 1
google_sheet_url = google_sheet_result[0]
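
One way to keep links in CLI output while stripping them from exported telemetry is to redact per destination rather than at the source. The sketch below is a standard-library analogy only: two in-memory streams stand in for the terminal and the telemetry exporter, and `RedactingFormatter` is a hypothetical name. How this would be wired into the actual OTEL handler or exporter is not shown and would need checking.

```python
import io
import logging
import re

SHEET_URL = re.compile(r"https://docs\.google\.com/spreadsheets/d/[\w-]+")


class RedactingFormatter(logging.Formatter):
    """Hypothetical formatter: redacts only the output of its own handler."""

    def format(self, record: logging.LogRecord) -> str:
        # The record itself is not mutated, so other handlers see the original.
        return SHEET_URL.sub("[REDACTED]", super().format(record))


cli_stream = io.StringIO()     # stands in for the user's terminal
export_stream = io.StringIO()  # stands in for the telemetry exporter

cli_handler = logging.StreamHandler(cli_stream)
cli_handler.setFormatter(logging.Formatter("%(message)s"))
export_handler = logging.StreamHandler(export_stream)
export_handler.setFormatter(RedactingFormatter("%(message)s"))

logger = logging.getLogger("redaction_per_destination_demo")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo's output out of the root logger
logger.addHandler(cli_handler)
logger.addHandler(export_handler)

logger.info("https://docs.google.com/spreadsheets/d/abc123")
```

Here the CLI stream keeps the full link, so a check like the `startswith` assertion above would still pass, while the export stream receives only the redacted text.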
