Some users hit out-of-memory errors during IO when loading datasets they expect to fit on their GPU. This is currently less of a problem for cudf.pandas, as we now enable prefetch-optimized unified memory by default.
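For context, a minimal sketch of roughly what that default amounts to in RMM terms; treat the exact resource composition here as an assumption rather than the documented cudf.pandas behavior:

```python
import rmm

# Assumed approximation of the cudf.pandas default: a managed (unified)
# memory resource wrapped in a prefetch adaptor, so oversubscribed data
# migrates between host and device instead of failing outright.
mr = rmm.mr.PrefetchResourceAdaptor(rmm.mr.ManagedMemoryResource())
rmm.mr.set_current_device_resource(mr)
```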
Because we don't currently have a similar UVM setup for the Polars GPU engine, this is an acute pain point that blocks many workflows. We've developed chunked readers for Parquet and ORC files that may help in this situation.
Initial testing suggests that a properly configured chunked Parquet reader may be effective at reducing peak memory requirements without significantly impacting performance.
For example, running PDS-H q7 at SF200 immediately hits an OOM with the default Parquet reader. With a `pass_read_limit` of 16 GB for the chunked reader, the query finishes smoothly and an H100 still delivers a speedup over the CPU engine on a high-end CPU.
Default CPU engine on a dual-socket Intel 8480CL:

```
Running experiments...
Running scale factor 200.0 on GPU setting false...
Code block 'Run polars query 7' took: 18.65865 s
Experiments completed. ...
Run complete!
```
Default GPU engine behavior with cuda-async memory resource:

```
Running experiments...
Running scale factor 200.0 on GPU setting true...
Code block 'Run polars-gpu-cuda-async query 7' took: 3.28927 s
q7 FAILED
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /tmp/pip-build-env-7ci1mf7i/normal/lib/python3.11/site-packages/librmm/include/rmm/mr/device/cuda_async_view_memory_resource.hpp:111: cudaErrorMemoryAllocation out of memory
Experiments completed. ...
Run complete!
```
GPU engine behavior with cuda-async memory resource and `"pass_read_limit": 16024000000`:

```
Running experiments...
Running scale factor 200.0 on GPU setting true...
Code block 'Run polars-gpu-cuda-async query 7' took: 10.78470 s
Experiments completed. ...
Run complete!
```
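For reference, a minimal sketch of how such a run could be wired up from Python. `pl.GPUEngine` and `rmm.mr.CudaAsyncMemoryResource` are real APIs, but the `parquet_options` mapping and the `"chunked"` key are assumptions here; only `"pass_read_limit"` is taken from the runs above, and the query is a placeholder rather than the actual q7.

```python
import polars as pl
import rmm

# cuda-async memory resource, matching the GPU runs above
mr = rmm.mr.CudaAsyncMemoryResource()

# Hypothetical engine configuration: the "parquet_options"/"chunked" keys
# are assumptions; only "pass_read_limit" comes from the experiments above.
engine = pl.GPUEngine(
    memory_resource=mr,
    parquet_options={
        "chunked": True,
        "pass_read_limit": 16_024_000_000,  # ~16 GB per pass
    },
)

# Placeholder query; the real PDS-H q7 joins several tables.
q = pl.scan_parquet("lineitem.parquet").select(pl.len())
print(q.collect(engine=engine))
```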
We should do a full evaluation of the chunked Parquet reader on the PDS-H benchmarks to empirically assess the opportunity and tradeoffs of chunked IO. Starting with Parquet makes sense, as it's the more common of the two formats in the PyData world. We can expand from there as needed.
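To make that evaluation concrete, here is a rough sketch of the kind of sweep it might run, timing each PDS-H query under the default reader and a few `pass_read_limit` settings. `build_query` and the data are placeholders, and the `parquet_options` keys are the same assumptions as in the sketch above.

```python
import time
import polars as pl

# Hypothetical sweep over PDS-H queries and pass_read_limit settings.
LIMITS = [None, 8_000_000_000, 16_000_000_000, 32_000_000_000]

def build_query(n: int) -> pl.LazyFrame:
    # Placeholder: return a trivial query; real runs would build PDS-H q<n>
    # against the SF200 Parquet files.
    return pl.LazyFrame({"x": [1, 2, 3]}).select(pl.len())

def time_query(query: pl.LazyFrame, pass_read_limit: int | None) -> float:
    opts = {}
    if pass_read_limit is not None:
        # Assumed option names; only pass_read_limit appears in the issue.
        opts["parquet_options"] = {"chunked": True, "pass_read_limit": pass_read_limit}
    engine = pl.GPUEngine(**opts)
    start = time.perf_counter()
    query.collect(engine=engine)
    return time.perf_counter() - start

for qnum in range(1, 23):
    query = build_query(qnum)
    for limit in LIMITS:
        try:
            print(f"q{qnum} limit={limit}: {time_query(query, limit):.2f}s")
        except MemoryError:
            print(f"q{qnum} limit={limit}: OOM")
```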
beckernick changed the title from "[FEA] Evaluate and enable the chunked parquet reader for Polars GPU engine" to "[FEA] Investigate the chunked parquet reader for Polars GPU engine" on Sep 17, 2024
cc @quasiben @brandon-b-miller (offline discussion)