-
Notifications
You must be signed in to change notification settings - Fork 884
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Implement filesystem="arrow"
in dask_cudf.read_parquet
#16684
Draft
rjzamora
wants to merge
26
commits into
rapidsai:branch-24.10
Choose a base branch
from
rjzamora:dask-cudf-arrow-filesystem
base: branch-24.10
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
[WIP] Implement filesystem="arrow"
in dask_cudf.read_parquet
#16684
rjzamora
wants to merge
26
commits into
rapidsai:branch-24.10
from
rjzamora:dask-cudf-arrow-filesystem
+305
−39
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
rjzamora
added
feature request
New feature or request
2 - In Progress
Currently a work in progress
dask
Dask issue
non-breaking
Non-breaking change
labels
Aug 28, 2024
rapids-bot bot
pushed a commit
to rapidsai/dask-cuda
that referenced
this pull request
Aug 30, 2024
Adds new benchmark for parquet read performance using a `LocalCUDACluster`. The user can pass in `--key` and `--secret` options to specify S3 credentials. E.g. ``` $ python ./local_read_parquet.py --devs 0,1,2,3,4,5,6,7 --filesystem fsspec --type gpu --file-count 48 --aggregate-files Parquet read benchmark -------------------------------------------------------------------------------- Path | s3://dask-cudf-parquet-testing/dedup_parquet Columns | None Backend | cudf Filesystem | fsspec Blocksize | 244.14 MiB Aggregate files | True Row count | 372066 Size on disk | 1.03 GiB Number of workers | 8 ================================================================================ Wall clock | Throughput -------------------------------------------------------------------------------- 36.75 s | 28.78 MiB/s 21.29 s | 49.67 MiB/s 17.91 s | 59.05 MiB/s ================================================================================ Throughput | 41.77 MiB/s +/- 7.81 MiB/s Bandwidth | 0 B/s +/- 0 B/s Wall clock | 25.32 s +/- 8.20 s ================================================================================ ... ``` **Notes**: - S3 Performance generally scales with the number of workers (multiplied the number of threads per worker) - The example shown above was not executed from an EC2 instance - The example shown above *should* perform better after rapidsai/cudf#16657 - Using `--filesystem arrow` together with `--type gpu` performs well, but depends on rapidsai/cudf#16684 Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Mads R. B. Kristensen (https://github.com/madsbk) - Peter Andreas Entschev (https://github.com/pentschev) URL: #1371
3 tasks
Closes rapidsai#14537. Authors: - Matthew Murray (https://github.com/Matt711) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - GALI PREM SAGAR (https://github.com/galipremsagar) - Matthew Roeschke (https://github.com/mroeschke) - Vyas Ramasubramani (https://github.com/vyasr) URL: rapidsai#16601
…#16574) Improves performance of wide strings (avg > 64 bytes) when using `cudf::strings::slice_strings`. Addresses some concerns from issue rapidsai#15924 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Muhammad Haseeb (https://github.com/mhaseeb123) URL: rapidsai#16574
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
2 - In Progress
Currently a work in progress
dask
Dask issue
feature request
New feature or request
non-breaking
Non-breaking change
Python
Affects Python cuDF API.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
This PR piggybacks on the existing CPU/Arrow Parquet infrastructure in dask-expr. With this PR,
will produce a
cudf
-backed collection using PyArrow for IO (i.e. disk->pa.Table
->cudf.DataFrame
). Before this PR, passingfilesystem="arrow"
will simply result in an error.Although this code path is not ideal for fast/local storage, it can be very efficient for remote storage (e.g. S3).
Checklist