I am processing around 130 GB of delta table in EMR on EKS with fargate(Spark Version 3.3.2 and delta 2.2) and it took around 6 hrs for processing. In processing, I selected only a few columns and wrote into another delta table with partitionby on one column with the help of merge function. I tested this job with 16vCPU and 80 GB memory of driver and executor and the number of executors set to 50. So could you please suggest a few optimization techniques which help me to reduce run time.
Thanks in advance!
07/02/2023, 6:36 PM
It depends on so many things like what the columns are, how many files, and other things. My first guess is that 50 executors for 130gb of data is wrong and you should probably have 1 VM with 200gb of ram and 1 executor