
[WIP] Implement filesystem="arrow" in dask_cudf.read_parquet #16684

Draft · wants to merge 26 commits into branch-24.10
Conversation

rjzamora (Member)

Description
Description

This PR piggybacks on the existing CPU/Arrow Parquet infrastructure in dask-expr. With this PR,

```python
df = dask_cudf.read_parquet(path, filesystem="arrow")
```

will produce a cudf-backed collection using PyArrow for IO (i.e. disk -> pa.Table -> cudf.DataFrame). Before this PR, passing filesystem="arrow" would simply result in an error.

Although this code path is not ideal for fast/local storage, it can be very efficient for remote storage (e.g. S3).
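A minimal usage sketch (the S3 path below is hypothetical; `filesystem="fsspec"` is shown for contrast as the pre-existing default code path):

```python
import dask_cudf

# Hypothetical dataset location; any Parquet data reachable by PyArrow works.
path = "s3://my-bucket/my-dataset/"

# With this PR: PyArrow performs the IO
# (remote storage -> pa.Table -> cudf.DataFrame).
df = dask_cudf.read_parquet(path, filesystem="arrow")

# The default fsspec-based code path is unchanged.
df_fsspec = dask_cudf.read_parquet(path, filesystem="fsspec")

print(df.head())
```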

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora added the feature request, 2 - In Progress, dask, and non-breaking labels on Aug 28, 2024
@rjzamora self-assigned this on Aug 28, 2024
@github-actions bot added the Python (Affects Python cuDF API) label on Aug 28, 2024
rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this pull request Aug 30, 2024
Adds a new benchmark for Parquet read performance using a `LocalCUDACluster`. The user can pass in `--key` and `--secret` options to specify S3 credentials.

E.g.
```
$ python ./local_read_parquet.py --devs 0,1,2,3,4,5,6,7 --filesystem fsspec --type gpu --file-count 48 --aggregate-files

Parquet read benchmark
--------------------------------------------------------------------------------
Path                      | s3://dask-cudf-parquet-testing/dedup_parquet
Columns                   | None
Backend                   | cudf
Filesystem                | fsspec
Blocksize                 | 244.14 MiB
Aggregate files           | True
Row count                 | 372066
Size on disk              | 1.03 GiB
Number of workers         | 8
================================================================================
Wall clock                | Throughput
--------------------------------------------------------------------------------
36.75 s                   | 28.78 MiB/s
21.29 s                   | 49.67 MiB/s
17.91 s                   | 59.05 MiB/s
================================================================================
Throughput                | 41.77 MiB/s +/- 7.81 MiB/s
Bandwidth                 | 0 B/s +/- 0 B/s
Wall clock                | 25.32 s +/- 8.20 s
================================================================================
...
```

**Notes**:
- S3 performance generally scales with the number of workers (multiplied by the number of threads per worker)
- The example shown above was not executed from an EC2 instance
- The example shown above *should* perform better after rapidsai/cudf#16657
- Using `--filesystem arrow` together with `--type gpu` performs well, but depends on rapidsai/cudf#16684 (see the example invocation after this list)
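For reference, a sketch of exercising the Arrow code path with the same benchmark script (device list and file count copied from the fsspec run above; the key/secret values are placeholders):

```
$ python ./local_read_parquet.py --devs 0,1,2,3,4,5,6,7 --filesystem arrow --type gpu --file-count 48 --aggregate-files --key <aws-key> --secret <aws-secret>
```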

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #1371