Ahmad Dorri

07/02/2023, 5:24 PM
Hello, I have a question about customizing the Delta Parquet Reader in Spark and its effect on byte-range requests for Parquet files stored in a distributed environment, specifically on Ceph storage. I'd like to start with the spark.sql.parquet.columnarReaderBatchSize configuration. In my experiments, changing this setting affects memory usage: a very high value results in Out of Memory (OOM) errors. However, for every other value I tried, the number of GET requests issued to Ceph for a given file stays constant; I consistently observe 18. So columnarReaderBatchSize doesn't seem to affect the number of byte-range requests made to the distributed storage. Could you shed some light on why this is the case, and explain whether other factors influence the number of GET requests?
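
For reference, a minimal sketch of this kind of experiment, assuming PySpark and that Ceph is reached through an S3-compatible endpoint via s3a (both are assumptions, as are the helper names and the example path; none of them come from the thread). It sets the batch size and forces a full scan of one file, so the GET count can then be checked on the storage side.

```python
# Sketch only: helper names and the example path are hypothetical.
CONF_KEY = "spark.sql.parquet.columnarReaderBatchSize"


def reader_conf(batch_size: int) -> dict:
    """Build the Spark SQL settings for one run of the experiment."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return {
        CONF_KEY: str(batch_size),
        # The batch size only applies to the vectorized Parquet reader.
        "spark.sql.parquet.enableVectorizedReader": "true",
    }


def scan_with_batch_size(path: str, batch_size: int) -> None:
    """Read every column of `path` once with the given batch size."""
    # Imported lazily so reader_conf() can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("batch-size-experiment")
    for key, value in reader_conf(batch_size).items():
        builder = builder.config(key, value)
    spark = builder.getOrCreate()

    # The "noop" sink (Spark 3.0+) materializes all rows and discards them,
    # which forces a full scan without writing any output.
    spark.read.parquet(path).write.format("noop").mode("overwrite").save()


# Example usage (hypothetical path, assumes s3a is configured for Ceph):
# for size in (1024, 4096, 16384):
#     scan_with_batch_size("s3a://my-bucket/data/file.parquet", size)
```

Comparing the storage-side request logs across these runs is what gives the constant count of 18 described above.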

JosephK (exDatabricks)

07/02/2023, 6:46 PM
there is no such thing as "delta parquet reader in spark"