Jordan Cuevas

07/07/2023, 2:02 PM
I'm hoping someone can help me troubleshoot a weird performance issue I'm seeing. I have a job that joins three Delta tables, does some aggregation, and returns the output. It used to take about 4 minutes to complete, but this week it suddenly started taking 15 minutes. Nothing has changed in the cluster config, the underlying Delta tables' schemas, or the relative sizes of the underlying tables. I also benchmarked the job against last week's versions of the underlying Delta tables to rule out I/O problems reading/writing to our storage location (S3), and I see the same performance gap. Does anybody have advice on what I can dig into to figure out what's going on?
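One way to benchmark against last week's versions of the tables is Delta time travel; a rough sketch, with the table path and timestamp as illustrative placeholders rather than anything from this thread:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read an input table as it existed last week using Delta time travel;
# the S3 path and timestamp are placeholders, not values from this thread.
last_week = (
    spark.read.format("delta")
    .option("timestampAsOf", "2023-06-30")
    .load("s3://my-bucket/tables/table_a")
)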

Dominique Brezinski

07/07/2023, 4:45 PM
So it is quite possible your data size just crossed a threshold that is causing spilling, or enough imbalance to greatly impact run time. Can you check the executor metrics for each stage of the job in the Spark UI (or the notebook in Databricks) to see how the 75th percentile compares to the max?
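If clicking through the UI isn't convenient, the same per-stage task distributions are exposed by Spark's monitoring REST API; a minimal sketch, assuming the driver UI (or history server) is reachable and using placeholder host, application, and stage IDs:

import requests

base = "http://localhost:4040/api/v1"   # driver UI; placeholder host/port
app_id = "app-20230707120000-0001"      # placeholder application ID
stage_id, attempt = 3, 0                # placeholder stage and attempt IDs

# taskSummary returns task metric distributions at the requested quantiles,
# including spill metrics, so 75th percentile vs. max is a direct comparison.
resp = requests.get(
    f"{base}/applications/{app_id}/stages/{stage_id}/{attempt}/taskSummary",
    params={"quantiles": "0.75,1.0"},
)
summary = resp.json()
print(summary["quantiles"], summary["executorRunTime"], summary["memoryBytesSpilled"])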

Jordan Cuevas

07/07/2023, 4:58 PM
I don't have the Spark UI enabled for these jobs, but looking through the logs I'm seeing a lot of messages about spilling associated with big wait times, so I suspect you're right. I think this query can be written more efficiently, so I'm going to work on that, but is there any config option I can set that would help with the spilling?
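For future runs, event logging can be turned on so the history server can reconstruct the stage and executor metrics after the fact; a sketch, with the log directory as a placeholder:

from pyspark.sql import SparkSession

# Event logging has to be configured before the SparkContext is created;
# the S3 log directory below is an illustrative placeholder.
spark = (
    SparkSession.builder
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "s3a://my-bucket/spark-events")
    .getOrCreate()
)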

Dominique Brezinski

07/07/2023, 5:04 PM
I am not sure if there are any tuning parameters that affect that 4096-row threshold, but if you don't have to sort at that stage, you would avoid it altogether. You would have to Google or search the codebase to see if there is anything settable that governs it.

Jordan Cuevas

07/07/2023, 5:05 PM
OK, thank you. This step is running a bunch of window functions, hence the sorting. I'll work on rewriting those, but I'll reply back here for visibility if I find another solution.
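The actual query isn't in the thread, but as an illustration of the kind of rewrite that avoids the window operator's per-partition sort, here is a "max per key" computed with a window versus with a groupBy plus join (column names and path are placeholders):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.format("delta").load("s3://my-bucket/tables/table_a")  # placeholder

# Window version: every partition of rows per key goes through the window buffer.
w = Window.partitionBy("user_id")
with_max = df.withColumn("max_amount", F.max("amount").over(w))

# Equivalent groupBy + join: no window operator, so no per-partition row buffer,
# at the cost of an extra shuffle for the join.
maxes = df.groupBy("user_id").agg(F.max("amount").alias("max_amount"))
with_max_alt = df.join(maxes, on="user_id")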

Dominique Brezinski

07/07/2023, 5:06 PM
So you can set
spark.sql.windowExec.buffer.spill.threshold
to something larger to avoid spilling on the window operation
I just learned something new!
The comments on that issue are worth reading for a little more understanding
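For reference, that setting can be applied per session; the value below is purely illustrative, not a recommendation:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Example only: raise the number of rows the window operator buffers
# before it starts spilling to disk; 1_000_000 is an arbitrary illustrative value.
spark.conf.set("spark.sql.windowExec.buffer.spill.threshold", 1_000_000)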

Jordan Cuevas

07/07/2023, 5:11 PM
Yeah, I'm reading through them now. I'll report back with an update shortly.
Interestingly, I updated the config, but it doesn't seem to have any effect on the spilling. I'll keep digging into that thread and report back when I find a solution.
Looks like I also had to update spark.sql.windowExec.buffer.in.memory.threshold. Now I've stopped seeing the spilling in the logs, but I'm running into a memory error, which I'm more comfortable troubleshooting. Thanks for the help!
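Putting the two settings together, the working configuration ends up looking roughly like this (values are illustrative only); keeping more rows buffered in memory is exactly the trade that swaps disk spill for memory pressure, which lines up with the memory error that followed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Both thresholds are row counts; the values here are illustrative only.
spark.conf.set("spark.sql.windowExec.buffer.in.memory.threshold", 1_000_000)
spark.conf.set("spark.sql.windowExec.buffer.spill.threshold", 1_000_000)
# Buffering more rows on-heap shifts the cost from disk spill to executor memory,
# so executor memory may need to grow accordingly.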