Szymon Sikora

06/28/2023, 12:07 PM
Hi, I have a beginner question regarding Delta Lake. Can I set a UUID automatically during a merge statement? I've written the following code snippet, but it does not let me use the uuid() function.
kv = {c: f"source.{c}" for c in df2.columns}

target_df.alias("target").merge(
    df2.alias("source"), "source.Name = target.Name"
).whenNotMatchedInsert(
    values={
        "uuid": "uuid()",
        **kv
    }
).whenNotMatchedBySourceDelete().execute()
Is there any way to do that, or is such an action not supported?

Martin

06/28/2023, 2:05 PM
Spark SQL has a uuid() function: https://spark.apache.org/docs/latest/api/sql/index.html#uuid I think it is not (yet) exposed in the PySpark API. You could work around this by using PySpark's expr() function: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.expr.html#pyspark.sql.functions.expr
from pyspark.sql import functions as f

kv = {c: f"source.{c}" for c in df2.columns}

target_df.alias("target").merge(
    df2.alias("source"), "source.Name = target.Name"
).whenNotMatchedInsert(
    values={
        "uuid": f.expr("uuid()"),
        **kv
    }
).whenNotMatchedBySourceDelete().execute()

Szymon Sikora

06/28/2023, 2:54 PM
I've tried that method too, but when I run the merge I get:
AnalysisException: nondeterministic expressions are only allowed in
Project, Filter, Aggregate or Window

Martin

06/28/2023, 5:31 PM
What if you add a uuid column to df2 up front (before the merge)?
from pyspark.sql import functions as f

df2 = df2.withColumn("uuid", f.expr("uuid()"))

kv = {c: f"source.{c}" for c in df2.columns}

target_df.alias("target").merge(
    df2.alias("source"), "source.Name = target.Name"
).whenNotMatchedInsert(
    values={
        **kv
    }
).whenNotMatchedBySourceDelete().execute()

Szymon Sikora

06/29/2023, 7:06 AM
Yes, that would work. I was thinking about doing it in the merge to make it more explicit that I'm assigning a new UUID only to new data. But if that's the only way, I will go with it 🙂 Thanks!
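On the concern about assigning UUIDs only to new data: since the merge in this thread has no whenMatched clause, UUIDs pre-generated on the source should only reach rows that get inserted. The sketch below is a plain-Python simulation of that behaviour, not Delta Lake code. The function simulate_merge and the sample rows are made up for illustration, and it assumes the key (Name) is unique on both sides.

```python
import uuid


def simulate_merge(target, source, key="Name"):
    """Plain-Python stand-in for the merge used in this thread.

    - matched rows: kept as they are in the target (no whenMatched clause)
    - source rows without a match: inserted (whenNotMatchedInsert)
    - target rows without a match: dropped (whenNotMatchedBySourceDelete)
    """
    source_by_key = {row[key]: row for row in source}
    target_by_key = {row[key]: row for row in target}

    result = []
    for k, row in target_by_key.items():
        if k in source_by_key:
            result.append(dict(row))  # matched: existing uuid is untouched
        # otherwise the target row is deleted

    for k, row in source_by_key.items():
        if k not in target_by_key:
            result.append(dict(row))  # new row: carries the source uuid

    return result


# Made-up sample data.
target = [
    {"Name": "a", "uuid": "existing-a"},
    {"Name": "b", "uuid": "existing-b"},
]
# Every source row gets a fresh uuid up front, as in the withColumn approach.
source = [{"Name": n, "uuid": str(uuid.uuid4())} for n in ("a", "c")]

merged = simulate_merge(target, source)
for row in merged:
    print(row)
```

In this simulation the UUID generated for the matched row "a" is simply discarded, "b" is deleted, and only "c" receives a new value.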