
Vishal Kadam

06/09/2023, 4:41 AM
Hi everyone, my PySpark job is failing with "no space left on device". A lot of shuffle and spill happens in the job, which makes the executors fail with disk space issues. Use case: I am using an EMR Serverless cluster to process around 130 GB of data. The job reads a Delta table from S3 and writes it to another S3 location as a partitioned Delta table using the merge function, but it fails because of all the shuffle and spill on the executor side. I increased the number of shuffle partitions, but that doesn't help. I even increased the executor disk size to 200G along with a few Spark properties, but that doesn't help either.
Delta version: 2.2.0
EMR release version: 6.10.0
Could you please help me with this issue? #delta-community #deltalake-questions
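(A minimal sketch of the kind of merge job described above; the S3 paths, the partition column `event_date`, and the join key `id` are assumptions for illustration, not the poster's actual schema.)
```python
# Sketch of a Delta MERGE from one S3 location into a partitioned Delta table,
# roughly matching the use case described in this thread (Delta 2.2 / Spark on EMR 6.10).
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-merge-sketch")
    # Standard Delta Lake session wiring, per the Delta 2.x docs.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read the source Delta table from S3.
source_df = spark.read.format("delta").load("s3://source-bucket/source-table")

# Merge into the partitioned target Delta table at another S3 location.
target = DeltaTable.forPath(spark, "s3://target-bucket/target-table")

(
    target.alias("t")
    .merge(source_df.alias("s"), "t.id = s.id AND t.event_date = s.event_date")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```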

Sukumar Nataraj

06/09/2023, 7:34 AM
The "no space left on device" error usually comes when the disk volume on the EC2 instances is low. Increase the EBS volume when spinning up EMR. It is not a Delta issue.

Vishal Kadam

06/09/2023, 7:35 AM
Yes... but I am using EMR Serverless, which has a disk size limit of 200G. Is there any way I can handle this issue with Spark properties?
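(For reference, a hedged sketch of the kind of properties commonly tuned for spill-heavy shuffles; the values are illustrative only, and the EMR Serverless disk property name should be verified against the AWS docs for release 6.10.)
```python
# Shuffle/spill-related settings often adjusted when executors run out of local disk.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("merge-tuning-sketch")
    # More shuffle partitions -> smaller per-task spill files on local disk.
    .config("spark.sql.shuffle.partitions", "2000")
    # Adaptive query execution (on by default in the Spark 3.3 shipped with
    # EMR 6.10) can coalesce or split shuffle partitions at runtime.
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Per-worker ephemeral disk on EMR Serverless, capped at 200G as noted
    # in this thread; assumed property name, check the AWS documentation.
    .config("spark.emr-serverless.executor.disk", "200G")
    .getOrCreate()
)
```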

Sukumar Nataraj

06/09/2023, 7:41 AM
I believe the AWS Support team is the right one to help you here. Nothing to do with Spark or Delta as far as I know, because this is due to infra.
👍 1

Vishal Kadam

06/09/2023, 7:49 AM
@Sukumar Nataraj Thank you. I have already raised a support case with AWS... hoping I will get a solution from them.

Kashyap Bhatt

06/09/2023, 1:32 PM
IMO, if your hardware was reasonable to begin with (i.e. you didn't explicitly choose smaller hardware), then I would instead repartition your data so that your jobs/transformations operate on smaller amounts of data and need fewer resources. The only exception would be when you notice that you're doing joins which balloon the data size.
Nothing to do with Spark or Delta as far as I know, because this is due to infra.
I do not agree with this at all. You could cross-join a billion rows with a billion rows, then throw as much hardware at it as you like, and it won't work (e.g. the job will take 3 years to run). If you're running out of memory, the first thing I'd check is what I'm doing wrong instead of increasing memory, unless I was being cheap to begin with and using small hardware.
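(A hedged sketch of the repartition-and-prune idea, reusing `source_df` and `spark` from the sketch earlier in the thread; the partition column `event_date` (assumed to be a string), the key `id`, the partition count, and the S3 path are all hypothetical.)
```python
# "Operate on less data": collect the partitions present in the incoming batch
# and add them to the merge condition so Delta can prune the target table,
# then repartition the source so no single task carries an oversized chunk.
from delta.tables import DeltaTable

dates = [r["event_date"] for r in source_df.select("event_date").distinct().collect()]
date_list = ", ".join(f"'{d}'" for d in dates)

# Spread the source evenly across tasks before the merge shuffle.
pruned_source = source_df.repartition(400, "event_date", "id")

(
    DeltaTable.forPath(spark, "s3://target-bucket/target-table").alias("t")
    .merge(
        pruned_source.alias("s"),
        f"t.event_date IN ({date_list}) AND t.event_date = s.event_date AND t.id = s.id",
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```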