Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize like/ilike kernels for StringView #5951

Closed
Tracked by #6163 ...
alamb opened this issue Jun 24, 2024 · 4 comments
Closed
Tracked by #6163 ...

Optimize like/ilike kernels for StringView #5951

alamb opened this issue Jun 24, 2024 · 4 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Jun 24, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

@XiangpengHao added support for StringView to the like/ilike kernels in #5931

#5931 (review)

This PR does not leverage the special 4 bytes inlined prefix for large string views, which might be able to optimize certain cases (specifically for quickly testing if the starts_with variant of like doesn't match without having to consult the actual strings, as @wjones127 discusses in #5931 (review)

does not leverage the inlined 4 bytes to shortcut some comparison. I plan to leave this as a future work as it can significantly complicate the code. (I also doubt that we can gain any benefit)

I agree we should only start adding specialized paths if we determine it is worth it. starts_with might be, especially if the parameter is short enough to always be able to use the prefix. We can decide based on benchmarks which are appropriate.

I added a benchmark in #5931

Describe the solution you'd like
Investigate various ways to make the benchmark faster. Run the benchmark like this:

cargo bench --bench comparison_kernels -- utf8view

Describe alternatives you've considered
Maybe it is fast enough as is

Additional context

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Jun 24, 2024
@alamb alamb changed the title Optimizer like/ilke kernels for StringView Optimize like/ilke kernels for StringView Jun 24, 2024
@xinlifoobar
Copy link
Contributor

take

@xinlifoobar
Copy link
Contributor

I read through the code and have some initial ideas. There are serval sub-optimizations:

Good to go:

  • Startwith/IStartwith when the needle has length < 4

Side effects (might be, this could be removed completed by specialized string view functions):

  • Contains: when the needle has length < 4 and the first 4 bytes match.
  • Regex: when the regex has match expected < 4 and the first 4 bytes match.

I would start with startwith and istartwith and needs some inputs for the latter 2. Please let me know your thoughts on generating better implementations :).

CC @alamb @XiangpengHao

@XiangpengHao
Copy link
Contributor

Sounds good to me! Here's the current comparsion function that has similar handling for 4/12 bytes:

https://github.com/apache/arrow-rs/blob/master/arrow-array/src/array/byte_view_array.rs#L340-L399

@alamb alamb changed the title Optimize like/ilke kernels for StringView Optimize like/illke kernels for StringView Sep 9, 2024
@alamb alamb changed the title Optimize like/illke kernels for StringView Optimize like/ilike kernels for StringView Sep 9, 2024
@alamb
Copy link
Contributor Author

alamb commented Sep 9, 2024

It seems to me like this ticket was completed with #6231

@xinlifoobar has some ideas for improving other kernels (like regexp) here #5951 (comment)

I did file #6370 for regexp_is_match_utf8

Let me know if there are other kernels that we should update

@alamb alamb closed this as completed Sep 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

3 participants