
rtyler

02/09/2023, 10:24 PM
I am doing some optimization and I am wondering if anybody has found a way to reduce the number of ListBucket operations coming from Delta/Spark?

Scott Sandre (Delta Lake)

02/09/2023, 10:29 PM
Which cloud are you using? S3?

rtyler

02/09/2023, 10:42 PM
aye

Dominique Brezinski

02/09/2023, 10:54 PM
I think Databricks is investigating, since we were beating up S3 pretty hard.

Scott Sandre (Delta Lake)

02/09/2023, 10:55 PM
This PR https://github.com/delta-io/delta/pull/1210 will be part of the next 2.3.0 release and should reduce and speed up LIST operations. Would this help you?

rtyler

02/09/2023, 10:57 PM
@Scott Sandre (Delta Lake) I am sure this will help, but I don't have sufficient insight into the system to understand where the plethora of ListBucket operations is coming from.
I look forward to picking this change up in a DBR sometime in 2024 🤣

Scott Sandre (Delta Lake)

02/09/2023, 10:58 PM
Btw, if you have a specific DBR complaint, you should contact Databricks

rtyler

02/09/2023, 11:32 PM
Hah, don't worry, my account team gets plenty of feedback. What I am understanding from your comments @Scott Sandre (Delta Lake) is that there are no known knobs or switches to turn to optimize lists, and there doesn't seem to be any decent way to figure out which workloads are causing excessive list operations (other than guess-and-check through access logs, I suppose).
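For the guess-and-check route through access logs, here is a rough sketch of how the counting could go. It assumes S3 server access logging is enabled on the bucket and that the log lines use the standard space-delimited format, where list requests show up with the operation `REST.GET.BUCKET`. The function name `list_prefix_counts` and the choice to group by requester and prefix are illustrative only and not something from this thread.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# A field is a [bracketed timestamp], a "quoted string", or a bare word.
TOKEN = re.compile(r'\[[^\]]*\]|"[^"]*"|\S+')

def list_prefix_counts(lines):
    """Count list requests per (requester, prefix) in S3 server access logs.

    Assumes the standard field order: bucket owner, bucket, time, remote IP,
    requester, request ID, operation, key, "request-URI", ...
    """
    counts = Counter()
    for line in lines:
        fields = TOKEN.findall(line)
        if len(fields) < 9:
            continue  # skip blank or truncated lines
        requester, operation, request_uri = fields[4], fields[6], fields[8]
        if operation != "REST.GET.BUCKET":
            continue  # keep only list requests
        parts = request_uri.strip('"').split(" ")
        if len(parts) < 2:
            continue
        # The listed prefix is carried in the query string of the request.
        query = parse_qs(urlsplit(parts[1]).query)
        prefix = query.get("prefix", [""])[0]
        counts[(requester, prefix)] += 1
    return counts
```

Feeding it the lines of the downloaded log files and printing `counts.most_common(20)` should show which principals and table paths account for most of the listing. Grouping by prefix tends to point at specific tables (for example anything ending in `_delta_log/`), which narrows down which workloads to look at.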