Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: Add Mysql2 and Trilogy db.collection.name attribute #1109

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

hannahramadan
Copy link
Contributor

@hannahramadan hannahramadan commented Aug 8, 2024

The db.collection.name attribute is conditionally required for Database spans. This PR uses regex to detect the collection name and report it as the db.collection.name attribute for Mysql2 and Trilogy instrumentation.

Closes #1100

@hannahramadan hannahramadan changed the title Add Mysql2 and Trilogy db.collection.name attribute Feat: Add Mysql2 and Trilogy db.collection.name attribute Aug 8, 2024
Copy link
Contributor

@kaylareopelle kaylareopelle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small wording things on the trilogy test instructions you can keep or leave. Other than that, looks great! Thank you!

@@ -83,9 +86,17 @@ def _otel_client_attributes

attributes[SemanticConventions::Trace::DB_NAME] = _otel_database_name
attributes[SemanticConventions::Trace::PEER_SERVICE] = config[:peer_service]
attributes['db.collection.name'] = collection_name(sql)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to avoid adding additional overhead to the instrumentation if possible.

In cases where the SQL statement is omitted or not sanitized, there isn't any regexp scan that occurs.

Is there a way to optimize it so that there isn't additional processing?

I would also like to consider that this db.collection.name matches newer versions of the OTel Schema and the instrumentations are still using pre-1.0 semantics.

If I am not mistaken that would have been db.table.name. I think we should continue using pre-1.0 semantics until we have
OTEL_SEMCONV_STABILITY_OPT_IN implemented

https://opentelemetry.io/docs/specs/semconv/database/database-spans/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @arielvalentin! It makes sense to keep consistent with pre-1.0 semantics. In this case, the attribute is db.sql.table. I've made this update!

Re additional processing: I see your point about putting every SQL statement through a regexp scan. Because the table/collection name isn't available on the client, I'm not sure how else we'd be able to get this information. We could make table name an opt-in/opt-out attribute, but is that level of control something we want to provide? What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean it's difficult to say. I feel like maybe this should only be done when the we also include the db.statement and I don't know if it would be possible to combine with the obfuscation code.

Maybe I'm making too much of it? I'm not sure.

Perhaps benchmarking will put me a bit at ease.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if only including the table name if we're recording the db.statement deviates from the convention (because as of now, the convention has it that recording the collection name isn't based on any condition, just availability).

The change to Regexp.last_match(1) and an updated regex string improved speed. In the following benchmark example, I used the MySQL obfuscation SQL code as a baseline, and found running the extra db.collection.name sql was 1.19x slower.

require 'benchmark/ipsa'

TABLE_NAME = /\b(?:(?:FROM|INTO|UPDATE)|(?:(?:CREATE|DROP|ALTER)\s+TABLE(?:\s+IF\s+(?:NOT\s+)?EXISTS)?))\s+["']?([\w.]+)["']?/i
MYSQL_OBFUSCATION = /(?-mix:'(?:[^']|'')*?(?:\\'.*|'(?!')))|(?-mix:"(?:[^"]|"")*?(?:\\".*|"(?!")))|(?-mix:(\$(?!\d)[^$]*?\$).*?(?:\1|$))|(?-mix:\{?(?:[0-9a-fA-F]\-*){32}\}?)|(?-mix:-?\b(?:[0-9]+\.)?[0-9]+([eE][+-]?[0-9]+)?\b)|(?i-mx:\b(?:true|false|null)\b)|(?-mix:0x[0-9a-fA-F]+)|(?i-mx:(?:#|--).*?(?=\r|\n|$))|(?m-ix:\/\*.*?\*\/)|(?-mix:q'\[.*?(?:\]'|$)|q'\{.*?(?:\}'|$)|q'\<.*?(?:\>'|$)|q'\(.*?(?:\)'|$))/
SQL = 'SELECT * FROM test_table'

def collection_name
  shared_operation
  Regexp.last_match(1) if SQL =~ TABLE_NAME
end

def no_collection_name
  shared_operation
  SQL
end

def shared_operation
  SQL.gsub(MYSQL_OBFUSCATION, '?')
end

Benchmark.ipsa do |x|
  x.report('collection_name') { collection_name }
  x.report('no_collection_name') { no_collection_name }

  x.compare!
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arielvalentin We chatted about this in the SIG today and decided the reporting of table/collection names can be put behind a feature flag. What do you think? If that works, the remaining consideration is if the default on or off - do you have any preference on this?

Copy link
Contributor Author

@hannahramadan hannahramadan Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arielvalentin - I made a suggestion for the config name db_collection_name and default value include, with omit as the other option: 8328668

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @arielvalentin! Wanted to check in and see if you had thoughts on the above.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your patience. I'm going to add this to my list to review by EoD tomorrow

attributes
end

def collection_name(sql)
Regexp.last_match[1] if sql =~ TABLE_NAME
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could probably look this up myself but...

How does Regexp.last_match work?

Is this thread safe?

What happens if you are in a multi-threaded worker and an unrelated Regexp runs. Will it return the last_match from an unexpected regular expression?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regexp.last_match(1) is returning the first matched result and stopping. Since we just need the first match, this approach is faster than the previously used .scan, which was capturing every single match.

Re thread safety: I think so. Here's a StackOverflow answer that talks a bit about it. Because its being called inside an instance method, Regexp.last_match() should only be yielding the last capture that was evaluated inside the method, so no clobbering should be taking place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add db.collection.name to mysql-based instrumentation libraries
4 participants