Hi All, We are currently in the process of choosing between Delta Lake and Iceberg as the table format for our data lake. Our data lake consists of both JSON and CSV files, with the AWS Glue Catalog managing table definitions for querying the CSV data. However, we do not currently manage tables for the JSON data.
Our objective is to transition both the JSON and CSV data to Delta Lake tables, with all new data written to Delta tables going forward. However, we want to avoid the expensive reprocessing of our existing CSV files into Parquet.
Hence, I would like to ask whether there are any known architectural patterns or strategies that would let us query the existing CSV data from the same tables without costly reprocessing.
We have considered the following possibilities but would greatly appreciate your insights or any additional recommendations:
1. Schema evolution: create Delta Lake tables over the CSV data now and evolve them to Parquet at some future point. Is this possible?
2. Conversion on read: builds on #1, but converts the CSV data to Parquet on read.
3. Conversion: write a pipeline that loads the CSV files into a DataFrame and then writes them out as Parquet-backed Delta tables (see the sketch after this list). We want to avoid this because our lake is petabytes in size.
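For context on option 3, here is a minimal sketch of the kind of backfill pipeline we mean, and want to avoid running over petabytes of data. The paths, header/schema handling, and Spark session configuration are placeholders, not our actual setup:

```python
from pyspark.sql import SparkSession

# Hypothetical locations; adjust to the real lake layout.
CSV_PATH = "s3://example-bucket/raw/events/*.csv"
DELTA_PATH = "s3://example-bucket/delta/events"

spark = (
    SparkSession.builder
    .appName("csv-to-delta-backfill")
    # Standard Delta Lake Spark settings; requires the delta-spark package.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read the existing CSV files. inferSchema avoids hand-writing a schema,
# but it adds an extra scan, which is costly at PB scale.
df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(CSV_PATH)
)

# Rewrite the data as a Delta table (Parquet under the hood).
# This full rewrite of historical CSV data is the step we want to avoid.
(
    df.write
    .format("delta")
    .mode("append")
    .save(DELTA_PATH)
)
```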
Thanks for the help. Please feel free to point me towards a README or relevant docs.