Chetan Joshi

08/10/2023, 8:04 PM
Hello Team, I have 2 Spark jobs - one keeps adding new sets of rows to a Delta table and the other performs updates on the same table. Inserts are more frequent than the updates. Inserts are done with Spark write commands and updates via Spark SQL. We are seeing that some of the updates get lost (roughly once in a hundred runs) and we don't get any exceptions. Has anyone observed a similar issue and mitigated it? (Delta v2.0.1)
More details: Delta table schema --> {inputId, processedFlag, fileName, metadata}
• Inserts are done by job1, writing data partitioned by the incremental inputId.
• Updates are done by job2 to set processedFlag=true using the following where condition:
fileName in (file1, file2..) and inputId < N
(Basically we are marking files which got processed as true, and taking the max inputId (N) corresponding to these files to avoid partition/file lookup for new entries.)
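For reference, the two workloads described above could look roughly like the following minimal Scala sketch; the table path, file names, and the inputId cutoff are placeholders, not details taken from the thread:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Hypothetical table location; the real path was not shared in the thread.
val tablePath = "oci://bucket@namespace/tables/events"

// Job 1: frequent appends of new rows, partitioned by the incremental inputId.
val newRowsDf = Seq((101L, false, "file101", "{}"))
  .toDF("inputId", "processedFlag", "fileName", "metadata")

newRowsDf.write
  .format("delta")
  .mode("append")
  .partitionBy("inputId")
  .save(tablePath)

// Job 2: less frequent update marking processed files, issued via Spark SQL.
// File names and the inputId cutoff (N) are placeholders.
spark.sql(s"""
  UPDATE delta.`$tablePath`
  SET processedFlag = true
  WHERE fileName IN ('file1', 'file2') AND inputId < 100
""")
```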

Madhumita Bharde

08/10/2023, 8:14 PM
delta on S3?

Chetan Joshi

08/10/2023, 8:14 PM
It is over Oracle Cloud’s Object Store

Madhumita Bharde

08/10/2023, 8:15 PM
It is most likely an issue with the consistency guarantees provided by the underlying object storage. I don't know much about Oracle Cloud's Object Store, though.

Chetan Joshi

08/10/2023, 8:18 PM
Thanks, let me take a look.

Nick Karpov

08/10/2023, 8:36 PM
it's likely you'll need to implement a LogStore very similar to this one, but with an Oracle service (I'm assuming DynamoDB is not an option)... perhaps Oracle's NoSQL
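A sketch of how such a custom LogStore might be wired up once implemented; the class name com.example.delta.OracleNoSQLLogStore and the oci:// scheme mapping are assumptions rather than existing Delta components, while the per-scheme spark.delta.logStore.<scheme>.impl key is the documented way to plug a LogStore implementation into Delta:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: OracleNoSQLLogStore does not exist in Delta; it stands in for a
// custom implementation of io.delta.storage.LogStore backed by an Oracle service.
val spark = SparkSession.builder()
  .appName("delta-oci-custom-logstore")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  // Per-scheme LogStore override; assumes OCI object storage paths use the oci:// scheme.
  .config("spark.delta.logStore.oci.impl", "com.example.delta.OracleNoSQLLogStore")
  .getOrCreate()
```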