Oliver Angelil
07/29/2023, 6:44 PM
After running dt.optimize().executeCompaction() on a directory with 2411 parquet files (average file size around 150 MB) with OPTIMIZE_TARGET_SIZE at 268 MB, there were only 164 added files, with file sizes much closer to 268 MB. The other 2K+ files were untouched and remain at ~150 MB. Any clues why the vast majority of files could not be compacted? Is this a common experience?
Dominique Brezinski
07/29/2023, 7:07 PM
Oliver Angelil
07/29/2023, 7:19 PM
dt.optimize().executeZOrderBy("TimeStamp").executeCompaction()
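One way to reason about the compaction numbers reported above is a toy bin-packing model. This is only a sketch under loud assumptions: plan_compaction is a hypothetical helper, not Delta Lake's implementation; it assumes files are packed greedily in size order, that a bin is closed as soon as the next file would push it past the target size, and that bins holding a single file are left untouched. It also ignores partition boundaries. Under those assumptions, two ~150 MB files (300 MB) never fit a 268 MB bin, so only the smaller files get merged.

```python
def plan_compaction(file_sizes_mb, target_mb):
    """Toy model: greedy bin packing, returning only bins that hold 2+ files."""
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        # Close the current bin if this file would push it past the target.
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    # A bin with one file has nothing to merge with, so it is skipped.
    return [b for b in bins if len(b) > 1]


# Hypothetical sizes in MB: mostly ~150 MB files plus a few smaller ones.
files = [150] * 10 + [60, 70, 80, 90]
plan = plan_compaction(files, target_mb=268)
rewritten = sum(len(b) for b in plan)
print(f"{rewritten} of {len(files)} files rewritten into {len(plan)} new files")
```

In this model, a table made up only of 150 MB files produces an empty plan at a 268 MB target. Whether that matches what happened in the table above is not confirmed anywhere in this thread.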
Dominique Brezinski
07/29/2023, 7:46 PM
.executeCompaction() after .executeZOrderBy()
Z-Ordering inherently will compact files
Oliver Angelil
07/29/2023, 8:10 PM
df.coalesce(1) my entire table? Also, why would the default OPTIMIZE_TARGET_SIZE be more than 3 orders of magnitude smaller 🤔
Dominique Brezinski
07/29/2023, 8:22 PM
Oliver Angelil
07/29/2023, 10:34 PM
Dominique Brezinski
07/29/2023, 10:36 PM
Oliver Angelil
07/31/2023, 6:59 PM
• dt.optimize().executeZOrderBy("TimeStamp") with targetFileSize=1gb: query time = 15 seconds
• dt.optimize().executeZOrderBy("TimeStamp") with targetFileSize=100mb: query time = 1.2 seconds
• just for fun: repartition(1326) to shuffle all data: query time = 20 mins 🙃
The trend looks like the smaller the parquet file, the faster the query... I'm not so convinced that having a single parquet file of 354GB would be a good idea 😄
Dominique Brezinski
07/31/2023, 8:08 PM
Oliver Angelil
07/31/2023, 8:12 PM
.partitionBy subdirectory style of storage on disk... and it doesn't mean all data in 1 parquet file...
Dominique Brezinski
07/31/2023, 8:12 PM
Oliver Angelil
07/31/2023, 8:13 PM
Dominique Brezinski
07/31/2023, 8:23 PM
Oliver Angelil
07/31/2023, 8:32 PM
Dominique Brezinski
07/31/2023, 8:34 PM
Oliver Angelil
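08/01/2023, 9:30 AM
from pyspark.sql import functions as F  # F is used by the filter below

df = spark.read.format('delta').load(...)
# Count rows whose TimeStamp falls in (2021-06-19T01:20, 2022-07-20T01:20]
(df
    .filter(
        (F.col('TimeStamp') > '2021-06-19T01:20')
        & (F.col('TimeStamp') <= '2022-07-20T01:20'))
).count()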
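A rough way to think about why file layout changes the cost of a range filter like the one above is min/max file skipping: a file only needs to be read if its TimeStamp range overlaps the filter range. The sketch below is an illustration under assumptions, not a description of what Spark or Delta actually did for this table. files_to_read and the per-file statistics are hypothetical; only the two filter bounds come from the query.

```python
from datetime import datetime


def files_to_read(file_stats, start, end):
    """Names of files whose [min, max] TimeStamp range can hold rows in (start, end]."""
    return [name for name, lo, hi in file_stats if hi > start and lo <= end]


def ts(text):
    return datetime.fromisoformat(text)


# Filter bounds taken from the query: start < TimeStamp <= end.
start, end = ts("2021-06-19T01:20"), ts("2022-07-20T01:20")

# Hypothetical layout 1: files clustered on TimeStamp, so ranges barely overlap.
clustered = [
    ("a", ts("2021-01-01"), ts("2021-06-30")),
    ("b", ts("2021-07-01"), ts("2021-12-31")),
    ("c", ts("2022-01-01"), ts("2022-06-30")),
    ("d", ts("2022-07-01"), ts("2022-12-31")),
    ("e", ts("2023-01-01"), ts("2023-06-30")),
]

# Hypothetical layout 2: rows shuffled, so every file spans the full time range.
shuffled = [(name, ts("2021-01-01"), ts("2023-06-30")) for name in "abcde"]

print(files_to_read(clustered, start, end))
print(files_to_read(shuffled, start, end))
```

In the clustered layout, file "e" can be skipped because its earliest TimeStamp is past the end of the range; in the shuffled layout nothing can be skipped. Smaller files make the skipped ranges finer-grained, which is one possible reading of the timings reported earlier, but the thread does not confirm it.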
Dominique Brezinski
08/01/2023, 1:18 PM
Oliver Angelil
08/01/2023, 1:30 PM
Dominique Brezinski
08/01/2023, 3:17 PM
Oliver Angelil
08/01/2023, 8:11 PM
{"add":{"path":"SiteNumber=BE01/Year=2023/Month=8/Day=1/part-00004-2dfc8f9a-1f94-4d87-881a-5f970cb00451.c000.snappy.parquet". The files are simply written into daily bins.
• In the 2nd, I am z-ordering on TimeStamp, with targetFileSize=268mb.