<regex>: Simplify regex_traits<_Elem>::translate(_Elem) #5209
Aha, it seems that we took too much time to look into this.
[re.traits]/4 simply specifies that an implementation-provided regex_traits<C>::translate
just returns the argument. See also https://en.cppreference.com/w/cpp/regex/regex_traits/translate.
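As a small illustration of what [re.traits]/4 requires, the following sketch checks that the standard library's regex_traits<char>::translate hands back its argument unchanged. The helper name translate_is_identity is made up for this example and does not appear in the PR.

```cpp
#include <regex>

// Hypothetical helper (not part of the PR): returns true when the
// library-provided regex_traits<char>::translate returns its argument
// unchanged, as [re.traits]/4 specifies.
inline bool translate_is_identity(char ch) {
    const std::regex_traits<char> traits;
    return traits.translate(ch) == ch;
}
```

On a conforming implementation this should hold for every char value, including bytes such as 0x80 that are not valid on their own in a UTF-8 locale.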
I put so much work into this mainly because I was worried about mix-and-match scenarios, not because I consider the old implementation correct. Suppose the two implementations commonly produced different results, and the regex parser and matcher picked up different implementations, or some other strange combination occurred. Then this change could actually degrade the regex engine in such a mix-and-match scenario and lead to regex bugs that are difficult to understand and reproduce.
Thanks for the careful analysis! 😻
This has gotta be a perf win.
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.
Who knew machine translation was so easy! 🤖 😹 🤪
Simplifies regex_traits<_Elem>::translate(_Elem) to just return its only argument. In #5204 (comment), I voiced my suspicion that the current implementation essentially does just that in a very complicated and expensive way. I have now verified this by running a program that tested 902 locales available on my machine (the program produced no output to stdout).
Even so, this PR still introduces a minor behavior change: the previous implementation can throw length_error("string too long") when it is passed a char that isn't a valid character in the locale's encoding (e.g., 0x80 in locales using UTF-8 encoding). But I think that the old behavior is undesirable anyway, as it makes the regex engine always fail with an exception in regex_constants::collate mode when a locale using UTF-8 encoding is imbued and the regex engine is applied to strings containing non-ASCII characters.
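The test program mentioned in the description is not reproduced on this page. The following is only a rough sketch of the kind of check described, not the author's program. It assumes the old translate() was roughly "transform the single character through the locale's collate facet and use the result only if it is exactly one character long". The function names translate_via_collate, translate_identity and count_mismatches are hypothetical.

```cpp
#include <cstddef>
#include <exception>
#include <locale>
#include <string>

// Assumed stand-in for the old translate(): run the single character
// through the locale's collate facet and keep the result only if it is
// exactly one character long. This is an assumption, not the STL source.
inline char translate_via_collate(const std::locale& loc, char ch) {
    const auto& coll = std::use_facet<std::collate<char>>(loc);
    const std::string key = coll.transform(&ch, &ch + 1);
    return key.size() == 1 ? key[0] : ch;
}

// The simplified translate(): return the argument unchanged.
inline char translate_identity(char ch) {
    return ch;
}

// Counts the char values for which the two versions disagree in the
// given locale. A throwing transform is counted as a disagreement.
inline std::size_t count_mismatches(const std::locale& loc) {
    std::size_t mismatches = 0;
    for (int i = 0; i < 256; ++i) {
        const char ch = static_cast<char>(static_cast<unsigned char>(i));
        try {
            if (translate_via_collate(loc, ch) != translate_identity(ch)) {
                ++mismatches;
            }
        } catch (const std::exception&) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

A driver could call count_mismatches for each locale name available on the machine and print only the locales with a nonzero count, which would match the "no output" result reported above if the two versions agree everywhere.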