When using the new Parquet SplittableDoFn implementation to read a large number of files, the file metadata lookup (required to break individual files down into parallelizable row groups) can be a performance bottleneck, because it is essentially single-threaded and sequential: if you look at the worker graph, you'll see a single worker doing nothing but metadata lookups for 10-20 minutes before the actual splitting operations kick in. Using the ParquetReadConfiguration.SplitGranularityFile option can remediate this, but at the cost of available parallelism.
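For reference, the file-granularity workaround looks roughly like this in a scio pipeline. This is a minimal sketch under assumptions: the ParquetReadConfiguration.SplitGranularity key, the ParquetConfiguration.of helper, the conf parameter on parquetAvroFile, and the MyRecord type are illustrative and may not match the actual API exactly.

```scala
import com.spotify.scio.ScioContext
import com.spotify.scio.parquet.ParquetConfiguration
import com.spotify.scio.parquet.read.ParquetReadConfiguration

// Skip per-file row-group metadata lookups by splitting at file granularity.
// Trade-off: each file becomes a single unsplittable unit of work, so
// parallelism is capped at the number of files.
def readAtFileGranularity(sc: ScioContext, path: String) =
  sc.parquetAvroFile[MyRecord]( // MyRecord: placeholder Avro SpecificRecord type
    path,
    conf = ParquetConfiguration.of(
      ParquetReadConfiguration.SplitGranularity -> ParquetReadConfiguration.SplitGranularityFile
    )
  )
```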
Can we improve this? Some ideas:
1. Simplest: just do the file metadata lookups in parallel (see the sketch after this list).
2. Introduce an option like ParquetReadConfiguration.UseEstimatedRowGroupSize: instead of reading every file's metadata, sample a few files and use their average row-group size to extrapolate the rest (also sketched below).
3. Write some kind of manifest file/metastore entry that maps individual files --> [# row groups, row group byte size].
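A rough sketch of what (1) and (2) could look like, using parquet-hadoop's ParquetFileReader footer API directly rather than any existing Beam/scio hook; the object and method names, thread-pool size, and sample size below are illustrative:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.jdk.CollectionConverters._

object RowGroupEstimation {
  final case class FileStats(path: String, numRowGroups: Int, totalRowGroupBytes: Long)

  // Read one file's footer and summarize its row groups.
  private def readStats(conf: Configuration)(path: String): FileStats = {
    val reader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(path), conf))
    try {
      val blocks = reader.getFooter.getBlocks.asScala
      FileStats(path, blocks.size, blocks.map(_.getTotalByteSize).sum)
    } finally reader.close()
  }

  // Idea (1): fetch footers with a bounded thread pool instead of one at a time.
  def statsInParallel(paths: Seq[String], conf: Configuration, threads: Int = 32): Seq[FileStats] = {
    val pool = Executors.newFixedThreadPool(threads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try Await.result(Future.traverse(paths)(p => Future(readStats(conf)(p))), Duration.Inf)
    finally pool.shutdown()
  }

  // Idea (2): read footers for a small random sample only, and extrapolate an
  // average row-group byte size to apply to the remaining files.
  def estimatedRowGroupBytes(paths: Seq[String], conf: Configuration, sampleSize: Int = 10): Long = {
    val sampled = statsInParallel(scala.util.Random.shuffle(paths).take(sampleSize), conf)
    val groups  = sampled.map(_.numRowGroups.toLong).sum
    if (groups == 0) 0L else sampled.map(_.totalRowGroupBytes).sum / groups
  }
}
```

The extrapolation in (2) can of course misjudge skewed datasets where file and row-group sizes vary a lot, which is why it would make sense as an opt-in configuration key rather than default behavior.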