
Rudhra Raveendran

05/11/2023, 6:43 PM
Hi folks, does anyone know how to drop columns from a Delta Lake schema when the data is stored in Azure Blob Storage? E.g. I have data in a storage account like so: abfss://container@storageaccount/path/to/data. path/to/data contains a bunch of folders in yyyy/mm/dd format, as well as a _delta_log folder. I'm trying to interact with this Delta table via Scala in Azure Synapse, e.g. val deltaTable = DeltaTable.forPath("abfss://..."). But from there I'm not sure what to do, as all the docs I can find mention writing SQL like ALTER TABLE table DROP COLUMNS columns, and I see no way to do that with this deltaTable. If I try spark.sql("SHOW TABLES") just to see what's there, I get an error about a null path, and if I try spark.sql("SELECT * FROM delta.`abfss://...`") that errors out too. Is there just something simple I'm missing? I feel like dropping columns should be a simple matter, but I can't find any way to do it.
I've also been told there should be a way of just registering the Delta table with Spark SQL and then running the ALTER TABLE commands on it, but I can't find any docs on that either.

Matthew Powers

05/11/2023, 6:54 PM
I wrote a blog post on this that you might find useful: https://delta.io/blog/2022-08-29-delta-lake-drop-column/
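The short version from the post (table name here is illustrative, assuming it's registered in the metastore): DROP COLUMN needs column mapping enabled first, which in turn needs the table protocol upgraded:

// Enable column mapping (requires reader/writer protocol 2/5), then drop.
spark.sql("""ALTER TABLE my_cool_table SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5')""")

spark.sql("ALTER TABLE my_cool_table DROP COLUMN language")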

Rudhra Raveendran

05/11/2023, 7:22 PM
Hi Matthew, I actually read that already! My confusion is how to read an existing Delta table (I know the _delta_log folder is all JSON, but my actual data is Parquet) and then write the changes I made back to that table. One thing I tried was val table = spark.read.format("delta").load(path), then table.drop("column") and table.write.format("delta").save(path), but that gave me some errors about Parquet files not existing (no idea why that is; Delta referring to files that don't exist anymore?). And the DeltaTable.vacuum command didn't find any files to clean up, so not sure what's going on.
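For reference, the attempt looked roughly like this (a sketch; the mode and overwriteSchema options are what I understand a schema-changing rewrite needs, and the path is illustrative):

// Read the table, drop the column, and rewrite the whole table in place.
val path = "abfss://container@storageaccount.dfs.core.windows.net/path/to/data"
val table = spark.read.format("delta").load(path)

table
  .drop("column")                    // drop() returns a new DataFrame
  .write
  .format("delta")
  .mode("overwrite")                 // replace the existing table contents
  .option("overwriteSchema", "true") // allow the written schema to differ
  .save(path)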

Matthew Powers

05/11/2023, 8:08 PM
Here's the command: `spark.sql("ALTER TABLE my_cool_table DROP COLUMN language")`
Here’s what you can use if you have a path to a Delta table:
spark.sql("ALTER TABLE delta.`path/to/my_cool_table` DROP COLUMN language")

Rudhra Raveendran

05/11/2023, 9:57 PM
Does that latter command you sent work with abfss paths? I tried
spark.sql("SELECT * FROM delta.`<abfss://container@account/path/to/data>`")
Where path/to/data has the _delta_log folder, but that gave an error:
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: null path
Am I writing the path wrong?
Hi @Matthew Powers just wanted to bump this in case it got buried! Thanks for your help so far!

Gerhard Brueckl

05/15/2023, 7:34 AM
two things come to mind: • did you upgrade your table's properties?
spark.sql(
    """ALTER TABLE `my_cool_table` SET TBLPROPERTIES (
   'delta.columnMapping.mode' = 'name',
   'delta.minReaderVersion' = '2',
   'delta.minWriterVersion' = '5')"""
)
• it could very well be that this is not yet working on Azure Synapse due to an older version of Delta Lake 🤔 (you can check what protocol versions your table actually has; see the sketch below)
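A quick way to check (a sketch; DESCRIBE DETAIL reports the table's protocol versions, path is illustrative):

// Column mapping needs minReaderVersion >= 2 and minWriterVersion >= 5.
spark.sql("DESCRIBE DETAIL delta.`abfss://container@account.dfs.core.windows.net/path/to/data`")
  .select("minReaderVersion", "minWriterVersion")
  .show()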

Rudhra Raveendran

05/15/2023, 5:58 PM
I tried running that command to upgrade the table properties like so:
spark.sql(
    """ALTER TABLE delta.`<abfss://container@account/path/to/data>` SET TBLPROPERTIES (
   'delta.columnMapping.mode' = 'name',
   'delta.minReaderVersion' = '2',
   'delta.minWriterVersion' = '5')"""
)
But that resulted in the same null path error. I'll try reaching out on the Synapse forums/support and see if this is a known issue on Synapse!

Gerhard Brueckl

05/15/2023, 8:50 PM
so in general this works for me on Synapse: my path is in the format
abfss://<container>@<account>.dfs.core.windows.net/MyTable

Rudhra Raveendran

05/15/2023, 10:01 PM
@Gerhard Brueckl Is this a Scala or PySpark notebook? And which Spark runtime are you using? Just want to eliminate as many differences as possible!
Hmm so even in PySpark I get this issue. One thing I realized may be important is the existence of special characters in the path. If my path
abfss://<container>@<account>.dfs.core.windows.net/path/to/table
contains special characters, e.g. "=" in the path, could that cause this error? And if so, are there any workarounds?

Gerhard Brueckl

05/16/2023, 7:18 AM
hmm, could be - can you try without? You might need to URL-escape the special characters: = would be %3D (see https://www.w3schools.com/tags/ref_urlencode.ASP for the full list). I am using this configuration.
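If you want to escape a path segment programmatically, something like this should do it (a sketch using standard Java URL encoding; the '+' replacement keeps spaces path-friendly):

import java.net.URLEncoder

// URLEncoder.encode("year=2023", "UTF-8") returns "year%3D2023".
// URLEncoder uses '+' for spaces, so swap it for '%20' in paths.
def escapeSegment(segment: String): String =
  URLEncoder.encode(segment, "UTF-8").replace("+", "%20")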

Rudhra Raveendran

05/16/2023, 6:33 PM
I tried replacing the = with %3D, but unfortunately that didn't work either, and my Spark configuration was the same as yours. For some reason this just doesn't work for me. On the bright side, my coworker figured out a solution that did work for us, and I'll share it here in case it's useful for anyone else:
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.delta.catalog.DeltaTableV2
import org.apache.spark.sql.delta.commands.{AlterTableDropColumnsDeltaCommand, AlterTableSetPropertiesDeltaCommand}

val tablePath = "abfss://container@storage.dfs.core.windows.net/path/to/data"

// Build the table object straight from the path, bypassing the metastore
// lookup that was producing the "null path" error.
val deltaTableV2 = DeltaTableV2(spark, new Path(tablePath))

// Enable column mapping (required for dropping columns) by upgrading the
// table properties and protocol versions.
AlterTableSetPropertiesDeltaCommand(
  deltaTableV2,
  Map(
    "delta.columnMapping.mode" -> "name",
    "delta.minReaderVersion" -> "2",
    "delta.minWriterVersion" -> "5"
  )
).run(spark)

// Drop the column; each inner Seq is a column path, so nested fields can be
// addressed as e.g. Seq("parent", "child").
AlterTableDropColumnsDeltaCommand(
  deltaTableV2,
  Seq(Seq("columnName"))
).run(spark)

// Verify the column is gone.
spark.read.format("delta").load(tablePath).printSchema()
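One caveat from our side: DeltaTableV2 and the AlterTable...DeltaCommand classes are internal Delta Lake APIs (note the org.apache.spark.sql.delta packages), so they may change between releases, but this got us unblocked on Synapse where the delta.`path` SQL syntax kept failing.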