Alessandro Mangone

01/25/2023, 4:00 PM
Hi everyone, I am writing some code that needs to run a `DELETE` statement on a Delta table, and the delete condition relies on a `row_number()` window function. When using the `row_number() > 1` condition, I get the error that windows and aggregations are not supported by DeltaTable’s delete. I tried rewriting that as a subquery, but I get the error that subqueries are not supported by delete. Is collecting the unwanted values and using `Column.isin(unwantedValues)` my only option? Are there any alternatives I might consider?
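For context, a minimal sketch of the failing delete and the `isin` workaround described above (assuming a SparkSession `spark`; the table and column names are hypothetical):
```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val table = DeltaTable.forName(spark, "events") // hypothetical table
val w = Window.partitionBy("id").orderBy(col("ts").desc)

// Fails at analysis time: DeltaTable.delete rejects window functions
// (and subqueries) in its condition.
// table.delete(row_number().over(w) > 1)

// The workaround in question: collect the keys of the stale rows on
// the driver, then delete them with an isin condition.
val unwanted: Array[String] = table.toDF
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") > 1)
  .select("uid") // assumes a unique row-id column
  .collect()
  .map(_.getString(0))

table.delete(col("uid").isin(unwanted: _*))
```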

Nick Karpov

01/25/2023, 4:53 PM
There's a chance you can accomplish it with MERGE, since the matching condition is dynamic, but I'm not sure... can you share the original query/sample data? Otherwise, the subquery limitation is known (https://github.com/delta-io/delta/issues/826); if you could add your case to it, that would help the community prioritize.

Alessandro Mangone

01/25/2023, 5:00 PM
Thank you for the tip! The idea is pretty simple: I have a table where I am constantly appending data in a streaming job. This guarantees the lowest possible latency, with the drawback that users querying this data access it through views where we retrieve the latest version of each record using a `row_number()` function. What I would like to achieve is periodically removing old record versions from the table (using `row_number()` in this case as well), so that our data grows in a controlled way.
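As a sketch of that read pattern (with illustrative `id` key and `ts` event-time columns, not Alessandro's actual schema):
```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// The streaming job appends every record version; readers see only
// the latest version per key through a view defined like this.
val w = Window.partitionBy("id").orderBy(col("ts").desc)
spark.table("events")
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")
  .createOrReplaceTempView("events_latest")
```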

Matthew Powers

01/25/2023, 5:06 PM
@Alessandro Mangone - we use some `row_number() > 1` logic in the `drop_duplicates_pkey` function in mack: https://github.com/MrPowers/mack/blob/main/mack/__init__.py#L246. You might find that code useful. cc: @Robert Kossendey
👍 2

Alessandro Mangone

01/25/2023, 5:07 PM
Thanks! It seems very similar to what I’m trying to achieve!

Lennart Skogmo

01/25/2023, 5:15 PM
The Spark API also has a built-in:
```scala
df.dropDuplicates("col1", "col2")
```

Nick Karpov

01/25/2023, 5:16 PM
I think you can use Matt's drop duplicates out of the box to delete your stale rows async

Alessandro Mangone

01/25/2023, 5:16 PM
dropDuplicates won’t know which version to keep; I’d like it to be the latest according to a column. Plus, I’d need to rewrite the whole dataset.
yes that’s what I’ll try
👍 2
Thank you for all your help!
🙌 2

Nick Karpov

01/25/2023, 5:18 PM
If you have a spare moment, I'd love for you to add your case, just briefly, to that GitHub issue; we'd love to get this kind of functionality more native in the Delta Spark connector.

Alessandro Mangone

01/25/2023, 5:18 PM
sure!
🙏 1

Lennart Skogmo

01/25/2023, 5:22 PM
Also, regarding getting around the limitation of subqueries not being supported for some operations, as Nick mentioned: I made a demo some time back where I used MERGE to update from a subquery with a plain s=source and t=target syntax. It can probably be rewritten to a delete as well if you want to stick to SQL.
```sql
MERGE INTO pbc.mrt_leave t
USING
(
  SELECT
  member_id,
  COUNT(*) AS order_cnt,
  SUM(CASE WHEN order_status = 'Shipped' THEN 1 ELSE 0 END) AS shipped_order_cnt,
  SUM(CASE WHEN order_status = 'Cancelled' THEN 1 ELSE 0 END) AS cancelled_order_cnt,
  SUM(CASE WHEN order_status = 'Shipped' THEN price + discount ELSE 0 END) AS revenue
  FROM pbc.arc_order
  GROUP BY member_id
)
s
ON t.member_id = s.member_id
WHEN MATCHED THEN UPDATE SET
t.order_cnt = s.order_cnt,
t.shipped_order_cnt = s.shipped_order_cnt,
t.cancelled_order_cnt = s.cancelled_order_cnt,
t.revenue = s.revenue

-- Untested delete merge
MERGE INTO table t
USING
(
  SELECT id, date, row_number() OVER (PARTITION BY id ORDER BY date DESC) AS rank
  FROM table
)
s
ON t.id = s.id AND t.date = s.date AND s.rank > 1
WHEN MATCHED THEN DELETE
```
👍 1

Alessandro Mangone

01/25/2023, 5:23 PM
I’m actually using Scala, but thanks!
👍 1
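For the Scala route, Lennart's untested delete merge can be sketched against the DeltaTable API as well (same illustrative `id`/`date` columns as the SQL above):
```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val target = DeltaTable.forName(spark, "events") // hypothetical table
val w = Window.partitionBy("id").orderBy(col("date").desc)

// Rows ranked past 1 within each key are stale versions.
val stale = target.toDF
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") > 1)
  .select("id", "date")

target.alias("t")
  .merge(stale.alias("s"), "t.id = s.id AND t.date = s.date")
  .whenMatched()
  .delete()
  .execute()
```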

Matthew Powers

01/25/2023, 5:24 PM
@Alessandro Mangone - If you’re using Scala, then jodie is an even better project to reference: https://github.com/MrPowers/jodie/blob/main/src/main/scala/mrpowers/jodie/DeltaHelpers.scala#L80. Thanks to @Brayan Jules for writing this great code!
👀 2

Alessandro Mangone

01/25/2023, 5:25 PM
awesome!