Rahul Sharma

02/17/2023, 5:32 AM
Hi, I can see the table property targetFileSize is set to 10 MB, but I am getting data files of 40 MB and larger. Please suggest where I am going wrong.
{"commitInfo":{"timestamp":1676611727893,"operation":"SET TBLPROPERTIES","operationParameters":{"properties":"{\"delta.compatibility.symlinkFormatManifest.enabled\":\"true\",\"delta.targetFileSize\":\"10mb\",\"delta.tuneFileSizesForRewrites\":\"10mb\"}"}

Jon Stockham

02/17/2023, 8:56 AM
Did you try what I suggested?

Rahul Sharma

02/17/2023, 8:56 AM
Yes.
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 20971520)
I have set the above property in the Spark session.
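A quick unit check on the value above, as a minimal sketch: the setting takes a size in bytes, and 20971520 bytes is 20 MB, not the 10 MB set in the table property. The helper names `mib_to_bytes` and `bytes_to_mib` are made up for illustration, and the commented `spark.conf.set` line assumes an already active session named `spark`.

```python
MIB = 1024 * 1024  # bytes in one mebibyte


def mib_to_bytes(mib: int) -> int:
    """Convert a size in MiB to bytes."""
    return mib * MIB


def bytes_to_mib(n_bytes: int) -> float:
    """Convert a size in bytes to MiB."""
    return n_bytes / MIB


configured = 20971520  # value passed to spark.conf.set in the message above
print(f"{configured} bytes = {bytes_to_mib(configured)} MiB")
print(f"10 MiB = {mib_to_bytes(10)} bytes")

# With an active SparkSession named `spark` (assumed, not created here),
# a 10 MiB limit would be set as:
# spark.conf.set("spark.databricks.delta.optimize.maxFileSize", mib_to_bytes(10))
```

So a session configured this way would be expected to produce files of roughly 20 MB after OPTIMIZE, unless the value is changed to 10485760.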

Jon Stockham

02/17/2023, 10:03 AM
That works for me. If I set it to 10 MB, I get a little bit of spillover, but all files are roughly 10 MB in size after I execute Z-order.

JosephK (exDatabricks)

02/17/2023, 12:11 PM
Why are you using a 10 MB file size? This could potentially create thousands of small files and make the table unreadable.

Rahul Sharma

02/17/2023, 12:13 PM
Every update I get lands in one partition, so if I have 50 MB files, Delta will have to read many 50 MB files to update the values. I think that would be a heavy task. What do you think?

JosephK (exDatabricks)

02/17/2023, 12:24 PM
So in Spark the file reader works in 128 MB chunks by default, so anything smaller than that will cause problems. If your data is in the size of MBs, then maybe just use Postgres. You don't need Spark for that scale.

Rahul Sharma

02/17/2023, 12:27 PM
I have 1 GB of data in one partition (Deposittransaction data). Suppose one update comes in: it will read a 128 MB file and then perform the update. I am describing a single update, but within 1 second there could be multiple updates. Don't you think it is too much to read 4 GB of data every trigger interval and then perform the UPSERT? That is why I was thinking that if I set the file size to 10 MB, only the specific files would be touched.
Please tell me what you suggest.
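The trade-off being argued here can be put into rough numbers. This is only a back-of-the-envelope sketch: it assumes every file is exactly the target size and that an update rewrites each file it touches in full. The function names and the "files touched" count are made up for illustration.

```python
import math

MIB = 1024 * 1024
GIB = 1024 * MIB


def file_count(data_bytes: int, target_file_bytes: int) -> int:
    """Number of files needed if every file is exactly the target size."""
    return math.ceil(data_bytes / target_file_bytes)


def rewrite_bytes(files_touched: int, target_file_bytes: int) -> int:
    """Bytes rewritten if each touched file is rewritten in full."""
    return files_touched * target_file_bytes


partition_bytes = 1 * GIB  # the 1 GB partition from the message above
files_touched = 3          # hypothetical: one micro-batch hits 3 files

for target_mib in (10, 128):
    target = target_mib * MIB
    n_files = file_count(partition_bytes, target)
    rewritten_mib = rewrite_bytes(files_touched, target) // MIB
    print(f"target {target_mib} MiB: {n_files} files in the partition, "
          f"~{rewritten_mib} MiB rewritten per batch")
```

Under these assumptions, smaller files reduce the bytes rewritten per update but multiply the number of files to track and open, which is the concern raised about a 10 MB target.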

JosephK (exDatabricks)

02/17/2023, 12:28 PM
I think if you're processing multiple updates per second and have small data, then use Postgres or a traditional database.

Rahul Sharma

02/17/2023, 12:29 PM
Bro, I have 10 GB of data in a very small table.