-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SCHEMATIC-240] stripped google sheet information from open telemetry span status and message #1573
base: develop
Are you sure you want to change the base?
Conversation
Perhaps we should split the Jira issue out, with another to further investiagte doing this via OTEL/Signoz instead of the code? |
try: | ||
wb.set_dataframe(manifest_df, (1, 1), fit=True) | ||
except HttpError as ex: | ||
pattern = r"https://sheets\.googleapis\.com/v4/spreadsheets/[\w-]+" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this catch all google sheet urls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also my question. Could a change to the google api, that would otherwise be non-breaking, change the format of the URL used so that it's no longer appropriately found in the message string?
@andrewelamb I'm not entirely sure if I'm on the right track, so I'd like to wait until @BryanFauble returns to confirm. This implementation removes Google Sheet links in spans, but I'm unsure if the ticket also requires a more general approach to stripping all sensitive information. I'll make further modifications as I discuss more with Bryan. |
@linglp this is a good line of thought. Some questions to guide you. Is this the only place googlesheets are logged? Off the top of your head, do you know of any other sensitive information that is logged? Is there functionality with OTEL that filters out logs? (e,g https://signoz.io/blog/sending-and-filtering-python-logs-with-opentelemetry/#how-the-default-filter-keeps-out-unwanted-logs?) @andrewelamb what are your thoughts? What we want to avoid is sensitive data being transferred to signoz cloud. I'll add my personal views after this discussion. |
wb.set_dataframe(manifest_df, (1, 1), fit=True) | ||
try: | ||
wb.set_dataframe(manifest_df, (1, 1), fit=True) | ||
except HttpError as ex: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pending further discussion of whether we think this is the right approach, this new exception that's being caught should have a unit test.
This is a good example of tiny design before doing the work would be helpful, that said, sometimes you have to do a little bit before knowing your options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
@thomasyu888 I don't have anythign to add at the moment on data getting into Signoz that shouldn't(Except to agree that it's bad :) ). This is somethign I'll keep in mind as I start working more cloesly with Signoz however. |
|
||
wb.set_dataframe(manifest_df, (1, 1), fit=True) | ||
try: | ||
wb.set_dataframe(manifest_df, (1, 1), fit=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the only call that can trigger an error with the URL in it? I'm thinking of things like sh = gc.open_by_url(manifest_url)
in get_dataframe_by_url
or similar.
What are y'all's thoughts about wrapping/modifying sys.excepthook, catching Something like*
@linglp @andrewelamb @thomasyu888 *modified from here |
Thanks for all the discussion here. In retrospect, a design document could help clarify things further. To summarize, the Google Sheet link was originally found in SigNoz traces, and this solution specifically removes it from traces and spans within a single function call. My plan was to confirm with Bryan whether removing sensitive information from traces is necessary, as the ticket only mentioned "logs," and whether handling it directly in the function (rather than using a custom trace processor) is an acceptable approach. I can also document what I’ve tried and why those approaches didn’t work separately. Since the error originates from Google APIs, any part of the system that interacts with them could potentially trigger it. If Bryan confirms this approach is acceptable, I can proceed with adding unit tests and considering how to wrap the exception. |
@SageGJ here are my thoughts. Using
|
@thomasyu888 thanks for adding!
Given points 1 and 3, and if we decide we'd want to handle this within schematic and not within OTEL itself, we could modify
I noticed tracing is still enabled in the absence of the OTEL headers so it might be better to check for the presence of |
|
self._exporter = exporter | ||
self._shutdown = False | ||
|
||
def redact_google_sheet(self, message: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's important we keep the logs when people are running the CLI. If I'm not mistaken, I think some CLI commands rely on the gsheets link being returned.
My guess is that this would fail some of the CLI tests.
See:
schematic/tests/integration/test_commands.py
Lines 675 to 681 in c50c599
google_sheet_result = [ | |
result | |
for result in result_list | |
if result.startswith("https://docs.google.com/spreadsheets/d/") | |
] | |
assert len(google_sheet_result) == 1 | |
google_sheet_url = google_sheet_result[0] |
Problem
When an error involving Google Sheets is logged, the message may include the URL of the Google Sheet being processed and the google sheet link can be found in signoz both in log and in traces.
Solution
Create a custom log processor to filter out the sensitive information in the log
Evidence that this is working