
Roy Green

01/06/2023, 11:36 AM
Hi, I’m trying to convert vanilla parquet files on S3 to Delta tables using Spark on Databricks. There is ~30 TB of data, and I’m running on a cluster of 10 r5d.4xlarge instances (delta cache accelerated, with Photon enabled). I’m running this command:
CONVERT TO DELTA parquet.`s3a://path_to_s3_dir` NO STATISTICS PARTITIONED BY (is_valid string, created_dt string)
Any idea how much time the execution should take? Even a rough prediction would help… Thanks!

JosephK (exDatabricks)

01/06/2023, 12:13 PM
It depends on how many files there are. If you have 3000 1 GB files it will take a lot less time than if you have 30k 0.1 GB files. If you schedule it as a job and set the cluster to auto-terminate after 10 minutes, it shouldn’t be too long or expensive.
👍 1

Roy Green

01/06/2023, 12:17 PM
Thanks for the quick response. So, a follow-up question if you don’t mind… There are actually a lot of small files (the average is ~40 MB). I did run it on a job cluster with auto-termination, and currently it seems to have been stuck for a long time on
try: def ____databricks_percent_sql(): i...
collect at ConvertToDeltaCommand.scala:579
The executor logs are full of entries like this:
[Full GC (Ergonomics) [PSYoungGen: 7047168K->3975055K(7699456K)] [ParOldGen: 16703010K->16703377K(16703488K)] 23750178K->20678432K(24402944K), [Metaspace: 104549K->104549K(1157120K)], 35.9680545 secs] [Times: user=455.87 sys=1.05, real=35.97 secs] 
2023-01-06T12:12:39.204+0000: [Full GC (Ergonomics) [PSYoungGen: 7047168K->4016849K(7699456K)] [ParOldGen: 16703377K->16703233K(16703488K)] 23750545K->20720083K(24402944K), [Metaspace: 104557K->104557K(1157120K)], 30.7333304 secs] [Times: user=388.86 sys=0.82, real=30.73 secs]
and nothing is moving. Any idea? Thanks a lot!

JosephK (exDatabricks)

01/06/2023, 12:22 PM
Yeah, the small files are going to be a problem. Convert won’t work well because you’ll have to collect a lot of stats on many files. In this case it’s better to do a parquet read and a Delta write, with a repartition before the write to compact the small files. I think Auto Loader will work well on the initial read, and because it’s streaming it will work incrementally.
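For reference, a minimal PySpark sketch of the read/repartition/write approach described above. The bucket paths, the repartition count, and the target file size are assumptions, not from the thread; the partition columns come from Roy’s CONVERT command.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

src_path = "s3a://my-bucket/path_to_s3_dir"     # hypothetical source of the small parquet files
dst_path = "s3a://my-bucket/path_to_delta_dir"  # hypothetical target Delta location

(spark.read.parquet(src_path)
      # Hash-partition on the table's partition columns into a fixed number of tasks,
      # so each task writes a few large files instead of many ~40 MB ones.
      # 8000 is a guess; pick a number that lands output files near ~1 GB.
      .repartition(8000, "is_valid", "created_dt")
      .write.format("delta")
      .partitionBy("is_valid", "created_dt")
      .mode("overwrite")
      .save(dst_path))

The Auto Loader variant would swap the batch read for a spark.readStream.format("cloudFiles") stream so the compaction can proceed incrementally instead of in one large batch.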

Roy Green

01/06/2023, 12:26 PM
But I’m running the command with NO STATISTICS; shouldn’t that help with the issue? (I planned to run the optimizer for the statistics later, after compaction.)

JosephK (exDatabricks)

01/06/2023, 12:28 PM
NO STATISTICS will help make it faster, but it won’t help with the small-file problem, which will be the much slower part. I think you need the stats to do the Z-Ordering later.
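As a rough sketch of that later step, this is what the compaction/Z-Ordering pass could look like on Databricks once the data is a Delta table. The table path matches the earlier sketch, and the Z-Order column is purely hypothetical, since the thread never names one.

# OPTIMIZE compacts files; ZORDER BY is only useful together with the column statistics mentioned above.
spark.sql("""
  OPTIMIZE delta.`s3a://my-bucket/path_to_delta_dir`
  ZORDER BY (some_query_column)
""")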

Roy Green

01/06/2023, 12:29 PM
Got you. So you think it won’t finish? Or if I waited, let’s say, 24 hours, would it be OK? I want to simplify the process and don’t mind paying a one-time cost in money and time for that.

JosephK (exDatabricks)

01/06/2023, 12:38 PM
I’m not as sure about whether it will finish or how long it will take. What you can do is try writing a single partition and see how long that takes; it should then scale roughly linearly. There shouldn’t be any harm in doing it partition by partition, since they are independent.
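A hedged sketch of that timing experiment, assuming the same hypothetical paths as above; the sample partition values and the per-partition file count are also assumptions.

import time
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

src_path = "s3a://my-bucket/path_to_s3_dir"
dst_path = "s3a://my-bucket/path_to_delta_dir"

t0 = time.time()
# Read only one partition's worth of data; filtering on partition columns prunes directories.
one_partition = (spark.read.parquet(src_path)
                      .filter((F.col("is_valid") == "true") &
                              (F.col("created_dt") == "2023-01-01")))
(one_partition
    .repartition(64)                       # assumed target file count for a single partition
    .write.format("delta")
    .partitionBy("is_valid", "created_dt")
    .mode("append")                        # remaining partitions can be appended later, independently
    .save(dst_path))
print(f"One partition took {time.time() - t0:.0f} s; the full run should scale roughly linearly.")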

Roy Green

01/06/2023, 12:39 PM
OK, I’ll try that. Thanks!
😄 1