06/27/2023, 6:48 AM
Hi, I have two questions. 1. In my project, I need to ingest data for MLOps with Delta Lake. I have JSON-formatted event data and video data in blob storage. To manage the JSON data, I'm considering storing each event's JSON within a single Parquet cell, keyed by ID. Then I can load and analyze it using pandas when needed. I would need to update the event JSON within its Parquet cell whenever there are updates. Please let me know your thoughts on this idea. Is it reasonable? 2. I want to know if there's any way to update Parquet with delta-rs (without using Spark, for sure). So far, I have been using pandas to create and merge the data before overwriting it as an update. However, given the large volume of event data being generated, it seems impractical to keep handling it that way.
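A minimal sketch of the JSON-per-cell idea described above, using pandas. The column names and event payloads here are hypothetical; persisting the frame would be a separate `to_parquet` call.

```python
import json
import pandas as pd

# Hypothetical event records: each row stores one event's JSON blob
# in a single string cell, keyed by event id.
events = pd.DataFrame({
    "id": ["evt-1", "evt-2"],
    "event_json": [
        json.dumps({"type": "view", "ts": 1}),
        json.dumps({"type": "click", "ts": 2}),
    ],
})

# To analyze, parse the JSON column back into Python objects.
parsed = events["event_json"].map(json.loads)
types = parsed.map(lambda e: e["type"]).tolist()
```

From here, `events.to_parquet("events.parquet")` (which needs pyarrow or fastparquet installed) would write the table out; reading it back with `pd.read_parquet` restores the JSON strings unchanged.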

Michael Nacey

06/27/2023, 8:01 PM
This is feasible. However, note that your updates can be slow if it takes a long time to look up a row by key. If you have large volumes and low-latency requirements, landing data with a merge may not keep up. In that case, I might land the data as an append and merge it into a second table as a separate operation (i.e. bronze --> silver). As to #2, you may have to build your own "upsert" in the same way you did with pandas.
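A minimal sketch of that do-it-yourself upsert in pandas, assuming events are keyed by an `id` column (names here are hypothetical). Rows from the update batch replace matching rows in the existing table; new keys are appended.

```python
import pandas as pd

def upsert(existing: pd.DataFrame, updates: pd.DataFrame, key: str) -> pd.DataFrame:
    """Manual upsert: rows in `updates` replace rows in `existing` with the same key."""
    # Put updates first so drop_duplicates(keep="first") prefers them.
    combined = pd.concat([updates, existing], ignore_index=True)
    return (combined.drop_duplicates(subset=[key], keep="first")
                    .sort_values(key)
                    .reset_index(drop=True))

# Hypothetical example data.
existing = pd.DataFrame({"id": [1, 2, 3],
                         "payload": ['{"a":1}', '{"b":2}', '{"c":3}']})
updates = pd.DataFrame({"id": [2, 4],
                        "payload": ['{"b":99}', '{"d":4}']})

merged = upsert(existing, updates, "id")
```

With delta-rs, the result could then be written back with something like `write_deltalake(path, merged, mode="overwrite")` from the `deltalake` Python package; that is still a full overwrite, which is why landing as append and merging bronze --> silver in a separate step scales better for high-volume streams.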
🙌 1