
Andy Grove

06/29/2023, 7:30 PM
I wonder if these metrics are specific to the Databricks implementation, and not in the OSS version?

Nick Karpov

06/30/2023, 4:10 PM
yes, that's correct. i don't think this is intentional though, @venki? @Andy Grove, anything particular you're looking for/trying to do, or just something you noticed?

Andy Grove

06/30/2023, 4:13 PM
I work on the RAPIDS Accelerator for Apache Spark and discovered this while trying to make sure that our behavior is consistent with Databricks when we GPU-accelerate DELETE commands. I was checking to see if this was implemented in OSS and found the reference in the example transaction logs but not the code.
Is there a channel where I can ask about Databricks-specific Delta Lake functionality?

Nick Karpov

06/30/2023, 4:30 PM
no, please feel free to ask here - we're doing everything we can to bring it to parity so the visibility is good... we can create a specific channel for the rapids work, though, so it doesn't drown this channel?

Andy Grove

06/30/2023, 4:35 PM
thank you! yes, a separate channel probably makes sense since this won't be of interest to a wider audience

venki

06/30/2023, 4:44 PM
Hi, the metrics work for DELETE with deletion vectors (DV) is not done yet. It requires refactoring the existing DELETE code. We have an open task to fix this in the next release.

Andy Grove

06/30/2023, 4:51 PM
Thanks @venki, is there a GitHub issue for this that you could share?

venki

06/30/2023, 4:55 PM
There is an uber issue that mentions this subtask. I recall there was an issue created for metrics, but I can't find it. Creating one now.
🙌 1

Andy Grove

06/30/2023, 4:56 PM
so here is my Databricks-specific question (and we can move this to a new channel once created): our plugin works by replacing Spark/Databricks physical operators with our own custom versions that run on GPU. For example, we replace DeleteCommand with GpuDeleteCommand, which is based on OSS Delta Lake. We have this code that logs the delete event/metrics:
recordDeltaEvent(
  deltaLog,
  "delta.dml.delete.stats",
  data = DeleteMetric(
    ...
  )
)
There are no deletion vector metrics here, so I tried just adding
persistentDeletionVectorsUsed = true,
numDeletionVectorsAdded = Some(0),
numDeletionVectorsRemoved = Some(0)
This is compiling against the Databricks libraries. These metrics do not show up in the JSON though and that is where I am stuck right now.

venki

06/30/2023, 5:05 PM
In the current OSS, the DELETE command collects metrics as SQL metrics (which are basically Spark accumulators). As part of the Delta commit, the commit method looks up the metrics named in the list DeltaOperations.Delete.operationMetrics among the SQL metrics. These metrics are what get filled in as part of the CommitInfo, which is serialized as JSON.
I am wondering if the missing metric names in DeltaOperations.Delete.operationMetrics are causing the issue you are seeing?
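A minimal sketch of the lookup described above, to show why a metric that is collected but not named in the list would never reach the JSON. This is a simplified model, not Delta Lake's actual code: `OperationMetricsSketch`, `deleteOperationMetrics`, and `metricsForCommit` are hypothetical names, and the metric names in the set are illustrative stand-ins for whatever DeltaOperations.Delete.operationMetrics really contains.

```scala
// Hypothetical, simplified model of the allow-list lookup; not Delta Lake's real implementation.
object OperationMetricsSketch {

  // Illustrative stand-in for DeltaOperations.Delete.operationMetrics:
  // the metric names the commit looks up among the SQL metrics.
  val deleteOperationMetrics: Set[String] =
    Set("numRemovedFiles", "numAddedFiles", "numDeletedRows")

  // Keep only the collected metrics whose names are in the allow-list,
  // rendered as strings the way they would appear in a serialized commit.
  def metricsForCommit(
      sqlMetrics: Map[String, Long],
      allowed: Set[String]): Map[String, String] =
    sqlMetrics
      .filter { case (name, _) => allowed.contains(name) }
      .map { case (name, value) => name -> value.toString }

  def main(args: Array[String]): Unit = {
    // Metrics the command collected, including a deletion vector metric.
    val collected = Map(
      "numRemovedFiles" -> 1L,
      "numAddedFiles" -> 1L,
      "numDeletedRows" -> 10L,
      "numDeletionVectorsAdded" -> 0L
    )

    // The DV metric was collected but its name is not in the list, so it is dropped.
    println(metricsForCommit(collected, deleteOperationMetrics))

    // Once the name is added to the list, the metric is carried into the commit.
    println(metricsForCommit(collected, deleteOperationMetrics + "numDeletionVectorsAdded"))
  }
}
```

Under this model, adding a field to the logged event alone would not change the commit JSON; the metric would have to be registered as a SQL metric and its name added to the list.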

Andy Grove

06/30/2023, 5:07 PM
I see. That makes sense. I had also been experimenting with adding SQL metrics for these. Let me try that approach again. Thanks for the pointer .. this is very helpful
👍 1
Yes, it does seem likely that this could be the issue. I will study this today and get back to you. Are you looking for contributors to help with implementing the deletion vector metrics in OSS? This is something I could potentially help with
🙌 1

venki

06/30/2023, 6:03 PM
Hi Andy, as far as I know no one is working on this yet. I just created an issue. Feel free to post a PR. Thank you for helping.
🙌 1