
[WIP] Implement filesystem="arrow" in dask_cudf.read_parquet #16684

Draft · wants to merge 26 commits into branch-24.10
Conversation

rjzamora (Member)

Description
Description

This PR piggybacks on the existing CPU/Arrow Parquet infrastructure in dask-expr. With this PR,

```python
df = dask_cudf.read_parquet(path, filesystem="arrow")
```

will produce a cudf-backed collection using PyArrow for IO (i.e. disk -> pa.Table -> cudf.DataFrame). Before this PR, passing filesystem="arrow" would simply result in an error.

Although this code path is not ideal for fast/local storage, it can be very efficient for remote storage (e.g. S3).
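A minimal usage sketch (the S3 path below is hypothetical; `filesystem="fsspec"` is shown for contrast as the pre-existing default code path):

```python
import dask_cudf

# Hypothetical dataset location; any Parquet data reachable by PyArrow works.
path = "s3://my-bucket/my-dataset/"

# With this PR: PyArrow performs the IO
# (remote storage -> pa.Table -> cudf.DataFrame).
df = dask_cudf.read_parquet(path, filesystem="arrow")

# The default fsspec-based code path is unchanged.
df_fsspec = dask_cudf.read_parquet(path, filesystem="fsspec")

print(df.head())
```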

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora added the feature request, 2 - In Progress, dask, and non-breaking labels on Aug 28, 2024
@rjzamora self-assigned this on Aug 28, 2024
@github-actions bot added the Python (Affects Python cuDF API) label on Aug 28, 2024
rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this pull request Aug 30, 2024
Adds a new benchmark for Parquet read performance using a `LocalCUDACluster`. The user can pass in `--key` and `--secret` options to specify S3 credentials.

E.g.
```
$ python ./local_read_parquet.py --devs 0,1,2,3,4,5,6,7 --filesystem fsspec --type gpu --file-count 48 --aggregate-files

Parquet read benchmark
--------------------------------------------------------------------------------
Path                      | s3://dask-cudf-parquet-testing/dedup_parquet
Columns                   | None
Backend                   | cudf
Filesystem                | fsspec
Blocksize                 | 244.14 MiB
Aggregate files           | True
Row count                 | 372066
Size on disk              | 1.03 GiB
Number of workers         | 8
================================================================================
Wall clock                | Throughput
--------------------------------------------------------------------------------
36.75 s                   | 28.78 MiB/s
21.29 s                   | 49.67 MiB/s
17.91 s                   | 59.05 MiB/s
================================================================================
Throughput                | 41.77 MiB/s +/- 7.81 MiB/s
Bandwidth                 | 0 B/s +/- 0 B/s
Wall clock                | 25.32 s +/- 8.20 s
================================================================================
...
```

**Notes**:
- S3 performance generally scales with the number of workers (multiplied by the number of threads per worker)
- The example shown above was not executed from an EC2 instance
- The example shown above *should* perform better after rapidsai/cudf#16657
- Using `--filesystem arrow` together with `--type gpu` performs well, but depends on rapidsai/cudf#16684 (see the example invocation after this list)
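For reference, a sketch of exercising the Arrow code path with the same benchmark script (device list and file count copied from the fsspec run above; the key/secret values are placeholders):

```
$ python ./local_read_parquet.py --devs 0,1,2,3,4,5,6,7 --filesystem arrow --type gpu --file-count 48 --aggregate-files --key <aws-key> --secret <aws-secret>
```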

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #1371