Michael Romero

07/11/2023, 8:53 PM
[In Order Processing of Micro Batches] Let's say I'm streaming from Kafka into delta table 1 in append-only mode. Then I open a stream on delta table 1 to write to delta table 2. The current delta version has 100 parquet files in the initial snapshot, but the stream can only process 10 at a time. Is there a way to guarantee the order of processing? It's vitally important that we process the records in order of the columns (kafka_partition, kafka_offset), because between delta table 1 and delta table 2 we "reduce" the rows based on the operation in each row. E.g. 2 * 3 + 4 is different from 4 + 3, then * 2.
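
To make the order sensitivity concrete, here is a minimal plain-Python sketch of that kind of "reduce". The row layout (`op`, `operand`) and the starting value of 0 are assumptions made for illustration; only `kafka_partition` and `kafka_offset` come from the question.

```python
from functools import reduce

# Hypothetical rows as they might sit in delta table 1.
# "op" and "operand" are illustrative column names, not from the thread.
rows = [
    {"kafka_partition": 0, "kafka_offset": 0, "op": "+", "operand": 2},
    {"kafka_partition": 0, "kafka_offset": 1, "op": "*", "operand": 3},
    {"kafka_partition": 0, "kafka_offset": 2, "op": "+", "operand": 4},
]


def apply_op(acc, row):
    """Apply one row's operation to the running value."""
    if row["op"] == "+":
        return acc + row["operand"]
    if row["op"] == "*":
        return acc * row["operand"]
    raise ValueError(f"unknown op: {row['op']}")


def reduce_rows(rows, initial=0):
    """Fold the rows into a single value, in the order given."""
    return reduce(apply_op, rows, initial)


# Same rows, two different processing orders.
in_order = reduce_rows(
    sorted(rows, key=lambda r: (r["kafka_partition"], r["kafka_offset"]))
)
out_of_order = reduce_rows(list(reversed(rows)))

print(in_order)      # ((0 + 2) * 3) + 4 = 10
print(out_of_order)  # ((0 + 4) * 3) + 2 = 14
```

The same three rows give 10 or 14 depending only on the order in which they are applied, which is why the downstream stream cannot treat the files as an unordered set.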

Nick Karpov

07/11/2023, 9:03 PM
you're guaranteed to read delta versions in order, but not the files within a version
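
A toy model of that guarantee, in plain Python, using the "10 files at a time" limit from the question. This is not the Delta source's actual planning logic: the function name, the file names, and the use of a shuffle to stand in for "no defined order within a version" are all assumptions for illustration.

```python
import random


def plan_micro_batches(versions, files_per_batch, seed=0):
    """Toy planner: visit versions in commit order, take files within a
    version in arbitrary order, then cut the queue into fixed-size batches."""
    rng = random.Random(seed)
    queue = []
    for version in sorted(versions):
        files = list(versions[version])
        rng.shuffle(files)  # stands in for "no guaranteed order within a version"
        queue.extend((version, name) for name in files)
    return [
        queue[i:i + files_per_batch]
        for i in range(0, len(queue), files_per_batch)
    ]


# Hypothetical table history: version 0 added 15 files, version 1 added 5.
versions = {
    0: [f"v0-part-{i}.parquet" for i in range(15)],
    1: [f"v1-part-{i}.parquet" for i in range(5)],
}

batches = plan_micro_batches(versions, files_per_batch=10)
for number, batch in enumerate(batches):
    versions_in_batch = sorted({version for version, _ in batch})
    print(number, versions_in_batch, len(batch))
```

In this model the first batch holds 10 files from version 0 and the second holds the remaining 5 plus all of version 1. Versions never go backwards, but a batch boundary can fall in the middle of a version, and nothing says which of a version's files land on which side of it.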

Michael Romero

07/11/2023, 9:05 PM
Is there a way to extend it to files within a version? Or how do large companies use delta merge to update reliably?
or theoretically, if the stream was able to handle the entire snapshot in 1 batch, then we'd be able to run group by/order by to solve it?

Nick Karpov

07/11/2023, 10:22 PM
yup, you can definitely use actual event data to guarantee ordering, but i think you can only do it with Trigger.Once and not a continuous stream

Dominique Brezinski

07/11/2023, 10:32 PM
You can set maxFilesPerTrigger or maxBytesPerTrigger to control micro-batch size, but that does not guarantee that each batch will process all the files from a version. The semantics don't work that way. The stream processes the delta log in order of add/remove file operations. Per commit, there is no strong guarantee of any specific ordering of the add_file actions. The best you can do is order within foreachBatch before the merge, OR use a suitably sized watermark on the stream and order within the stream before merging.
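
The "order within foreachBatch before the merge" option can be sketched in plain Python as follows. This models only the handler's logic, not the Spark or Delta APIs: `process_batch`, the `key`/`op`/`operand` fields, and the `state` dict standing in for delta table 2 are hypothetical, and it assumes all rows for a key share one Kafka partition.

```python
from collections import defaultdict


def apply_op(acc, op, operand):
    """Apply a single operation to the running value."""
    if op == "+":
        return acc + operand
    if op == "*":
        return acc * operand
    raise ValueError(f"unknown op: {op}")


def process_batch(batch_rows, state):
    """Model of a per-batch handler: group rows by key, sort each group by
    (kafka_partition, kafka_offset), then fold it onto the stored value."""
    by_key = defaultdict(list)
    for row in batch_rows:
        by_key[row["key"]].append(row)

    for key, rows in by_key.items():
        rows.sort(key=lambda r: (r["kafka_partition"], r["kafka_offset"]))
        acc = state.get(key, 0)  # current value in the target, 0 if absent
        for r in rows:
            acc = apply_op(acc, r["op"], r["operand"])
        state[key] = acc  # stands in for the merge into delta table 2
    return state


# A batch whose rows arrive in no particular order.
batch = [
    {"key": "a", "kafka_partition": 0, "kafka_offset": 2, "op": "+", "operand": 4},
    {"key": "a", "kafka_partition": 0, "kafka_offset": 0, "op": "+", "operand": 2},
    {"key": "b", "kafka_partition": 1, "kafka_offset": 0, "op": "+", "operand": 5},
    {"key": "a", "kafka_partition": 0, "kafka_offset": 1, "op": "*", "operand": 3},
]

state = process_batch(batch, {})
print(state)  # {'a': 10, 'b': 5}
```

Sorting this way only orders rows that arrive in the same batch. If one key's rows are split across two batches and the later offsets arrive first, the per-batch sort cannot repair that, which appears to be why a single-batch run with Trigger.Once was suggested earlier in the thread.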

Harry Metzger

07/13/2023, 5:38 PM
are we guaranteed to get an entire parquet file in a micro-batch? if we use partitioning and coalesce each partition to 1 file and guarantee that the rows which need to be processed in order are all in the same parquet file, would that work? Or is it possible to get a partial parquet file in a micro-batch?

Dominique Brezinski

07/13/2023, 6:49 PM
i am pretty sure it will not give you a partial parquet file in a batch.