Prashanth Ramanna

09/16/2023, 7:51 PM
Hi all, I am using the Delta Standalone library to update a Delta table for Parquet files written by an SQS worker fleet. Checkpointing latency has been increasing as the table has grown to tens of millions of files, and it keeps rising as more files are added. I have identified a few options to experiment with:
1. Increasing the checkpoint interval configuration
2. Retaining the logs for a shorter period of time
3. Compacting the data before write
I have yet to run these experiments to see how much scaling headroom they give us and what trade-offs are being made. I saw a related feature in the 3.0.0rc1 announcement:
Support for disabling Delta checkpointing during commits - For very large tables with millions of files, performing Delta checkpoints can become an expensive overhead during writes. Users can now disable this checkpointing by setting the hadoop configuration property
. This is only safe and suggested to do if another job will periodically perform the checkpointing.
I looked into the codebase, but the Standalone APIs do not expose a checkpointing capability. So here are my two questions:
1. How can I configure another job (preferably using Standalone) to perform checkpointing?
2. Are there any other recommendations or learnings for handling large Delta tables?
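As a rough illustration of option 1 in the list above, here is a minimal sketch of how the checkpoint interval amortizes checkpoint cost across commits. It assumes checkpoint time grows roughly linearly with the number of files in the table; the figure of 2 seconds per million files and the function name are hypothetical placeholders, not measurements from any real table.

```python
def amortized_checkpoint_seconds(num_files: int,
                                 seconds_per_million_files: float,
                                 checkpoint_interval: int) -> float:
    """Average checkpoint time added per commit.

    Assumption: one checkpoint costs time proportional to the number of
    files, and a checkpoint is written once every `checkpoint_interval`
    commits.
    """
    if checkpoint_interval < 1:
        raise ValueError("checkpoint_interval must be >= 1")
    one_checkpoint = num_files / 1_000_000 * seconds_per_million_files
    return one_checkpoint / checkpoint_interval


# Hypothetical table: 30 million files, 2 s of checkpoint work per million files.
for interval in (10, 100, 1000):
    cost = amortized_checkpoint_seconds(30_000_000, 2.0, interval)
    print(f"interval={interval}: about {cost:.2f} s of checkpoint work per commit")
```

The trade-off is that a larger interval leaves more JSON commit files after the latest checkpoint, so readers have more log entries to replay when reconstructing table state.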

Ashok Krishna

09/17/2023, 3:20 PM
Is it possible to do the commit at the end, after all the work is done by the worker fleet?

Prashanth Ramanna

09/17/2023, 7:41 PM
The commits are not done in the worker fleet. Instead, each worker sends notifications that are batched and committed by a "DeltaCommitter" service. The worker fleet continuously polls and writes ingested data to Parquet, sending notifications with the file path and other info to the "DeltaCommitter". This has helped in two ways:
1. It reduces the number of commits
2. It decouples the worker fleet from Delta
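The batching described above could be sketched roughly as follows. This is only an illustration of the pattern, not the actual "DeltaCommitter" implementation: the class name, the thresholds, and the `commit_fn` callback are all hypothetical, and the caller passes in the current time so the logic stays easy to test.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FileNotification:
    """What a worker reports after writing one Parquet file (hypothetical shape)."""
    path: str
    size_bytes: int


class BatchingCommitter:
    """Buffers notifications and hands them to commit_fn in batches."""

    def __init__(self, commit_fn: Callable[[List[FileNotification]], None],
                 max_files: int = 1000, max_wait_seconds: float = 60.0):
        self.commit_fn = commit_fn
        self.max_files = max_files
        self.max_wait_seconds = max_wait_seconds
        self._pending: List[FileNotification] = []
        self._first_arrival = 0.0

    def on_notification(self, notification: FileNotification, now: float) -> None:
        # Remember when the oldest buffered notification arrived.
        if not self._pending:
            self._first_arrival = now
        self._pending.append(notification)
        self._maybe_flush(now)

    def tick(self, now: float) -> None:
        # Called periodically so a small batch is not held forever.
        self._maybe_flush(now)

    def _maybe_flush(self, now: float) -> None:
        if not self._pending:
            return
        full = len(self._pending) >= self.max_files
        stale = now - self._first_arrival >= self.max_wait_seconds
        if full or stale:
            batch, self._pending = self._pending, []
            self.commit_fn(batch)  # one table commit for the whole batch


# Usage: record batches in a list instead of committing to a real table.
commits: List[List[FileNotification]] = []
committer = BatchingCommitter(commits.append, max_files=3, max_wait_seconds=10.0)
for i in range(3):
    committer.on_notification(FileNotification(f"part-{i}.parquet", 1024), now=0.0)
committer.on_notification(FileNotification("part-3.parquet", 1024), now=1.0)
committer.tick(now=11.0)  # the lone notification is flushed once it is stale
```

In a real service, `commit_fn` would presumably wrap a single Delta Standalone transaction that adds one `AddFile` action per notification in the batch.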
FWIW, I am experimenting with Delta Standalone in the setup below to work around high latency in checkpointing large tables: 1. Integrate with the 3.0.0rc1 version of Standalone 2. The "DeltaCommitter" service will have
set to false 3. Create a new "DeltaCheckpointer" service, which periodically wakes up and performs checkpointing. •
will be set to 1 •
will be set to true • Since Standalone doesn't have APIs to commit, I plan to listen to one of the notifications from the worker fleet and use it to perform a commit. I would prefer that the DeltaCheckpointer just checkpoint without any integration with the worker fleet, but I haven't been able to make it work without a valid AddFile in the commit info.