Hana BOUACILA

05/30/2023, 1:58 PM
Hi, I am currently working on a project where I have stored my tables as individual Parquet files in HDFS. To leverage the benefits of Delta Lake, such as data versioning and easier data transformations, I am looking to convert these Parquet files to Delta Lake format. I have a few questions about the conversion process and would greatly appreciate your guidance.
1. Is it necessary to convert each Parquet file individually to a Delta table? Since I only need a subset of the tables, I don't want to convert all of them. Alternatively, can I convert the directory containing the Parquet files into a single large Delta table? My concern is whether I would still be able to access and manipulate each table individually, such as performing joins and selecting specific columns, after consolidating them into one Delta table.
2. I am ingesting data from a PostgreSQL database into HDFS using Apache NiFi. My intention is to configure the ingestion process to continuously capture new data from the database and store it in the same Parquet files in the data lake. In this scenario, would it be necessary to reconvert the Parquet files to Delta Lake every time new data appears, or is there an alternative mechanism that ensures seamless integration of new data into the Delta Lake format?
I genuinely appreciate any assistance or recommendations.

Nick Karpov

05/30/2023, 7:45 PM
1. There are use cases that benefit from consolidating everything into one big table, but it's very rare. I would follow the standard 1:1 pattern (one Parquet directory per Delta table) to start, converting only the subset of tables you actually need (see the conversion sketch below).
2. There doesn't appear to be a native NiFi connector for Delta yet (https://issues.apache.org/jira/browse/NIFI-7090) 😞. After a brief search, the most set-it-and-forget-it option at the moment looks like adding a Spark Structured Streaming job that stream-reads the Parquet files and writes them into a Delta table. Not ideal, but it will work, and you can set a very low data retention policy on the intermediate Parquet table. You could also put something like Kafka in between, which would save on storage but introduce more infrastructure, so that's not ideal either. Hopefully others here have better ideas. A rough sketch of the streaming bridge is also below.
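To make the 1:1 pattern from (1) concrete, here is a minimal sketch using delta-spark's in-place conversion. It assumes each table lives in its own Parquet directory on HDFS; the table names and paths are hypothetical:
```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Delta Lake needs these two settings on the session (delta-spark package assumed installed).
spark = (
    SparkSession.builder
    .appName("convert-parquet-tables")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Convert only the subset of tables you need. The conversion happens in place:
# it adds a _delta_log directory next to the existing Parquet files, so each
# directory becomes its own Delta table and can still be joined and queried individually.
for table in ["customers", "orders"]:  # hypothetical subset
    DeltaTable.convertToDelta(spark, f"parquet.`hdfs:///warehouse/{table}`")
```
After conversion you read each table with `spark.read.format("delta").load(path)` and join or select columns exactly as before.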
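And a rough sketch of the streaming bridge from (2), assuming NiFi keeps landing Parquet files in a single directory; the paths and schema here are made up for illustration:
```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = (
    SparkSession.builder
    .appName("parquet-to-delta-bridge")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Streaming file sources require an explicit schema; adjust to your actual table.
schema = StructType([
    StructField("id", StringType()),
    StructField("updated_at", TimestampType()),
])

# Continuously pick up new Parquet files as NiFi drops them and append to the
# Delta table. The checkpoint tracks which files were already processed, so
# nothing is reconverted when new data appears.
(
    spark.readStream
    .schema(schema)
    .parquet("hdfs:///landing/orders/")      # hypothetical NiFi landing directory
    .writeStream
    .format("delta")
    .option("checkpointLocation", "hdfs:///checkpoints/orders/")
    .outputMode("append")
    .start("hdfs:///delta/orders/")          # hypothetical Delta table path
    .awaitTermination()
)
```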