
Ritesh Malav

03/09/2023, 4:06 PM
Hi team, we have a data lake of parquet files that we want to convert to Delta Lake format. The existing folder structure in our data lake looks like this:
user_data
     user_hash=1
          date=2022-01-01
          date=2022-01-02
     user_hash=2
          date=2022-01-01
          date=2022-01-02
     user_hash=3
          date=2022-01-01
          date=2022-01-02
     user_hash=4
          date=2022-01-01
          date=2022-01-02
Can someone suggest how I can do this conversion faster? I have already tried the following code snippet:
# Spark reads the partitioned parquet; user_hash and date become ordinary columns
data = spark.read.format("parquet").load("/data-pipeline")
# Writing without partitionBy, so the directory layout is not preserved
data.write.format("delta").save("/tmp/delta/data-pipeline/")
but it gets rid of the folder structure and writes the Delta table in a flat, unpartitioned layout.

JosephK (exDatabricks)

03/09/2023, 4:39 PM
You can partition by in Delta the same as in parquet. Make sure your partitions are about 10-50 GB in size.
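
For reference, a minimal sketch of that partitioned rewrite, assuming the paths from the question above and a Spark session configured with the Delta Lake extensions (the app name is illustrative):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("parquet-to-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Spark infers user_hash and date as columns from the user_hash=.../date=... directories
data = spark.read.format("parquet").load("/data-pipeline")

# partitionBy recreates the same user_hash=/date= directory layout in the Delta table
(data.write.format("delta")
     .partitionBy("user_hash", "date")
     .save("/tmp/delta/data-pipeline/"))

This is still a full rewrite of the data, but the resulting table keeps the original directory hierarchy, with a _delta_log directory added at the table root.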

Ritesh Malav

03/09/2023, 4:49 PM
Ok, I got what I was looking for.