https://delta.io logo
j

jacktby jacktby

08/11/2023, 7:39 AM
I get "Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table". This is not reasonable, although match multi_row by join on id, but for the match clause, we should use cdc.operation='U', so we can perform it with only one row match. Bad response.
Copy code
create table account 
using delta select col1 as id,col2 as name,col3 as last_updated from values (1, 'alice', '2023-01-01');

create temp view account_stage as select col1 as id, col2 as name,col3 as last_updated,col4 as operation 
from values(1, 'alice', '2023-02-22', 'U'),(1, 'alice', '2023-02-22', 'D');

MERGE INTO account
using (SELECT id, NAME, last_updated,operation FROM   account_stage) AS cdc
ON account.id = cdc.id

WHEN matched AND cdc.operation='U' THEN
  UPDATE SET NAME = cdc.NAME, last_updated = cdc.last_updated
WHEN matched AND cdc.operation='D' THEN
  DELETE
WHEN NOT matched AND cdc.operation='I' THEN
  INSERT (account.id,account.name,account.last_updated) VALUES(cdc.id, cdc.NAME, cdc.last_updated);
v

Vincent Chee

08/11/2023, 8:08 AM
You have to deduplicate by id before merge operation.
You need to understand merge is a batch operation and all updates are performed in a single commit. If there exists multiple rows matching the update, there will be a conflict.
j

jacktby jacktby

08/11/2023, 8:11 AM
no, I think you shouldn't just consider about id, you also need to consider about the match_condition
delta_lake just throw an error when there are over two match rows on merge_condition at the first join, but no idea on match condition forward.
v

Vincent Chee

08/11/2023, 8:14 AM
so do you expect the row to be deleted or updated in this case?
since there are two corresponding events for
alice
j

jacktby jacktby

08/11/2023, 8:16 AM
yes, you should use the match_option to see if there are multi_rows match but not only use merge_option.
Copy code
CREATE TABLE account_stage
  (
     id           INT,
     NAME         VARCHAR,
     last_updated DATETIME,
     operation    VARCHAR
  );

INSERT INTO account_stage VALUES     (1, 'alice', '2023-02-22', 'U');
INSERT INTO account_stage VALUES     (1, 'alice', '2023-02-22', 'D'); 

CREATE TABLE account
 (
     id           INT,
     name         VARCHAR,
     last_updated DATETIME
 );

INSERT INTO account VALUES     (1, 'alice', '2023-01-01');

MERGE INTO account
using (SELECT id, NAME, last_updated,operation FROM   account_stage) AS cdc
ON account.id = cdc.id

WHEN matched AND cdc.operation='U' THEN
  UPDATE SET NAME = cdc.NAME, last_updated = cdc.last_updated
WHEN matched AND cdc.operation='D' THEN
  DELETE
WHEN NOT matched AND cdc.operation='I' THEN
  INSERT VALUES(cdc.id, cdc.NAME, cdc.last_updated);
we can success on other products
v

Vincent Chee

08/11/2023, 8:19 AM
Im not familiar with Snowflake. So the update happens first?
And how Snowflake determines that the update precedes deletes?
btw, i just found this doc on Snowflake. Snowflake will produce an error by default https://docs.snowflake.com/en/sql-reference/sql/merge#duplicate-join-behavior
j

jacktby jacktby

08/11/2023, 8:33 AM
but you can try the sqls I give above in snowflake, it really success.
and the two match clauses will apply action on different target rows. So it's Deterministic
v

Vincent Chee

08/11/2023, 8:36 AM
your target is a cdc table?
j

jacktby jacktby

08/11/2023, 8:36 AM
it's just a normal table, you can see I just use "create table xxx"
v

Vincent Chee

08/11/2023, 8:37 AM
Copy code
(1, 'alice', '2023-02-22', 'U')
(1, 'alice', '2023-02-22', 'D')
These two events are modifying two different rows?
FWIU, the target table contains one alice row?
j

jacktby jacktby

08/11/2023, 8:38 AM
yes, initially is below:
Copy code
CREATE TABLE account
 (
     id           INT,
     name         VARCHAR,
     last_updated DATETIME
 );

INSERT INTO account VALUES     (1, 'alice', '2023-01-01');
v

Vincent Chee

08/11/2023, 8:40 AM
yea, two events will attempt to modify that one alice row so it means “multiple sources one target”
j

jacktby jacktby

08/11/2023, 8:53 AM
Copy code
CREATE TABLE account
 (
     id           INT,
     name         VARCHAR,
     last_updated DATETIME
 );

INSERT INTO account VALUES     (1, 'alice', '2023-01-01');
INSERT INTO account VALUES     (2, 'bob',  '2023-01-01');
INSERT INTO account VALUES     (3, 'charlie',  '2023-01-02'); 

CREATE TABLE account_stage
  (
     id           INT,
     NAME         VARCHAR,
     last_updated DATETIME,
     operation    VARCHAR
  );

INSERT INTO account_stage VALUES     (1, 'alice', '2023-02-22', 'U');
INSERT INTO account_stage VALUES     (1, 'alice', '2023-02-23', 'D');
INSERT INTO account_stage VALUES     (2, 'bob', '2023-02-21', 'U');
INSERT INTO account_stage VALUES     (2, 'bob', '2023-02-23', 'U');
INSERT INTO account_stage VALUES     (4, 'dell',  '2022-02-23',  'I'); 

MERGE INTO account
using (SELECT id, NAME, last_updated,operation FROM   account_stage) AS cdc
ON account.id = cdc.id

WHEN matched AND cdc.operation='U' THEN
  UPDATE SET NAME = cdc.NAME, last_updated = cdc.last_updated
WHEN matched AND cdc.operation='D' THEN
  DELETE
WHEN NOT matched AND cdc.operation='I' THEN
  INSERT VALUES(cdc.id, cdc.NAME, cdc.last_updated)
what about this? you think it will success or fail?
v

Vincent Chee

08/11/2023, 8:56 AM
i think this would have conflict as well
j

jacktby jacktby

08/11/2023, 9:05 AM
this is true, but I success it 5 hours ago in snowflake, but now it failed ..... confused
v

Vincent Chee

08/11/2023, 9:12 AM
i can only assume your merge runs in a different timing
j

jacktby jacktby

08/11/2023, 9:27 AM
Copy code
create table account 
using delta select col1 as id,col2 as name,col3 as last_updated from values (1, 'alice', '2023-01-01');

create temp view account_stage as select col1 as id, col2 as name,col3 as last_updated,col4 as operation 
from values(1, 'alice', '2023-02-22', 'U'),(1, 'alice', '2023-02-22', 'D');

MERGE INTO account
using (SELECT id, NAME, last_updated,operation FROM   account_stage) AS cdc
ON account.id = cdc.id

WHEN matched AND cdc.operation='U' THEN
  UPDATE SET NAME = cdc.NAME, last_updated = cdc.last_updated
WHEN matched AND cdc.operation='U' THEN
  DELETE
WHEN NOT matched AND cdc.operation='I' THEN
  INSERT (account.id,account.name,account.last_updated) VALUES(cdc.id, cdc.NAME, cdc.last_updated);
what about this?