GapyNi

05/23/2023, 5:45 PM
Dear all, we are loading data into the bronze layer in append mode using Auto Loader. We then read it in the silver layer with `readStream.option("startingVersion", lastMaxVersionFromAppend)` and, in the next step, use `foreachBatch` to merge it into the silver layer. My question: looking at the metrics, the process appears to read all of the `numAddedFiles`, but `numOutputRows` (from the bronze layer) does not exactly match `numSourceRows` in the silver layer when the `foreachBatch` merge command runs. Do you know what could be the reason? Thanks and regards, GapyNi

Joydeep Banik Roy

05/24/2023, 3:02 AM
If you look closely, 593144726 + 12547566 = 605692292. This is because your source rows are candidates for either updates or inserts (assuming there are no deletes in your case). The affected target rows reside in Parquet files in your Delta table; when a row is updated, the entire Parquet file containing it is rewritten, and the other, unchanged rows in that file are copied along. That is why you see `numTargetRowsCopied`. Therefore, `numOutputRows = numTargetRowsCopied + numSourceRows`.
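The relation above can be checked with a few lines of Python. This is only a sketch: the helper name `merge_rows_balance` is made up for illustration, the metrics are passed in as a plain dict of strings (the way `operationMetrics` values appear in the table history), and which of the two addends is `numTargetRowsCopied` versus `numSourceRows` is an assumption, since only the three totals are quoted here. The relation is also only expected to hold when every source row ends up as an insert or an update.

```python
def merge_rows_balance(metrics):
    """Check numOutputRows == numTargetRowsCopied + numSourceRows.

    `metrics` is a dict of metric name -> value (strings or ints).
    Returns (actual_output_rows, expected_output_rows, balanced).
    """
    actual = int(metrics["numOutputRows"])
    copied = int(metrics["numTargetRowsCopied"])
    source = int(metrics["numSourceRows"])
    expected = copied + source
    return actual, expected, actual == expected


# The three figures quoted in the thread. The assignment of the two
# addends to "copied" and "source" is an assumption for illustration.
example_metrics = {
    "numOutputRows": "605692292",
    "numTargetRowsCopied": "593144726",
    "numSourceRows": "12547566",
}

actual, expected, balanced = merge_rows_balance(example_metrics)
print(actual, expected, balanced)
```

If `balanced` comes back false for a real merge commit, the gap would suggest that some source rows were neither inserted nor used for an update, or that deletes were involved.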

GapyNi

05/24/2023, 7:14 AM
Hi Banik, thanks for the answer. I was referring to the comparison between `numOutputRows` (in the bronze layer) from the first screenshot and `numSourceRows` (in the silver layer) from the second screenshot. Thanks and regards, GapyNi