When using the new Parquet SplittableDoFn implementation to read a large number of files, the file metadata lookup (required to break individual files down into parallelizable row groups) can be a performance bottleneck, because it is essentially single-threaded and sequential: if you look at the worker graph, you'll see a single worker doing nothing but metadata lookups for 10-20 minutes before the actual splitting operations kick in. Using the ParquetReadConfiguration.SplitGranularityFile option can remediate this, but at the cost of available parallelism.
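For reference, the file-granularity workaround looks roughly like this in a scio pipeline. This is a minimal sketch under assumptions: the ParquetReadConfiguration.SplitGranularity key, the ParquetConfiguration.of helper, the conf parameter on parquetAvroFile, and the MyRecord type are illustrative and may not match the actual API exactly.

```scala
import com.spotify.scio.ScioContext
import com.spotify.scio.parquet.ParquetConfiguration
import com.spotify.scio.parquet.read.ParquetReadConfiguration

// Skip per-file row-group metadata lookups by splitting at file granularity.
// Trade-off: each file becomes a single unsplittable unit of work, so
// parallelism is capped at the number of files.
def readAtFileGranularity(sc: ScioContext, path: String) =
  sc.parquetAvroFile[MyRecord]( // MyRecord: placeholder Avro SpecificRecord type
    path,
    conf = ParquetConfiguration.of(
      ParquetReadConfiguration.SplitGranularity -> ParquetReadConfiguration.SplitGranularityFile
    )
  )
```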
Can we improve this? Some ideas:
1. Simplest: just do the file metadata lookups in parallel (see the sketch after this list).
2. Introduce an option like ParquetReadConfiguration.UseEstimatedRowGroupSize: instead of reading every file's metadata, sample a few files and use their average row-group size to extrapolate the rest (also sketched below).
3. Write some kind of manifest file/metastore entry that maps individual files --> [# row groups, row group byte size].
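A rough sketch of what (1) and (2) could look like, using parquet-hadoop's ParquetFileReader footer API directly rather than any existing Beam/scio hook; the object and method names, thread-pool size, and sample size below are illustrative:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.jdk.CollectionConverters._

object RowGroupEstimation {
  final case class FileStats(path: String, numRowGroups: Int, totalRowGroupBytes: Long)

  // Read one file's footer and summarize its row groups.
  private def readStats(conf: Configuration)(path: String): FileStats = {
    val reader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(path), conf))
    try {
      val blocks = reader.getFooter.getBlocks.asScala
      FileStats(path, blocks.size, blocks.map(_.getTotalByteSize).sum)
    } finally reader.close()
  }

  // Idea (1): fetch footers with a bounded thread pool instead of one at a time.
  def statsInParallel(paths: Seq[String], conf: Configuration, threads: Int = 32): Seq[FileStats] = {
    val pool = Executors.newFixedThreadPool(threads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try Await.result(Future.traverse(paths)(p => Future(readStats(conf)(p))), Duration.Inf)
    finally pool.shutdown()
  }

  // Idea (2): read footers for a small random sample only, and extrapolate an
  // average row-group byte size to apply to the remaining files.
  def estimatedRowGroupBytes(paths: Seq[String], conf: Configuration, sampleSize: Int = 10): Long = {
    val sampled = statsInParallel(scala.util.Random.shuffle(paths).take(sampleSize), conf)
    val groups  = sampled.map(_.numRowGroups.toLong).sum
    if (groups == 0) 0L else sampled.map(_.totalRowGroupBytes).sum / groups
  }
}
```

The extrapolation in (2) can of course misjudge skewed datasets where file and row-group sizes vary a lot, which is why it would make sense as an opt-in configuration key rather than default behavior.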