
Rudhra Raveendran

05/22/2023, 4:51 PM
Heyo! If I delete a column from a Delta table schema (because the column has the incorrect datatype and is causing schema mismatch issues), is there any way to query that data + new data together? For example, imagine I have a column Date that is incorrectly typed as a string, so I delete it from the Delta table, and then new data flows in with type date, so the column gets recreated with the date type (I have schema merging on). Now if a user queries the Delta table for Date, they will only get new data, right? Is there any best practice for also getting the old data (even if it means a separate dataframe)?

TD

05/22/2023, 4:53 PM
Dropping a column in Delta Lake is designed like in any SQL database... it's meant to fully drop the column and its associated data. So it's not intended that the "old" data remain queryable.
If you want new data to be stored in the correct format + old data to be queryable... then you should not drop the old column... just add a new column (with a different name) and start storing all new data in that column (the old column will be null for new rows). And then, if you want to query both old data (with some "fixing code") and new data together as a single unified column... you can define a view on the table which defines a column
unified_col = coalesce(new_col, some_fixing_expression(old_col))
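For illustration, such a view might look like this in Spark SQL (a minimal sketch; my_table, new_col, old_col, and the yyyy-MM-dd format are assumptions, and spark is an active SparkSession):
spark.sql("""
    CREATE OR REPLACE VIEW my_table_unified AS
    SELECT *,
           COALESCE(new_col, TO_DATE(old_col, 'yyyy-MM-dd')) AS unified_col
    FROM my_table
""")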

Rudhra Raveendran

05/22/2023, 5:21 PM
I see, the issue is I want to keep the table schema clean without duplicate columns. I do know that if I drop a column, I can no longer see the column with
df = spark.read.format("delta").load(path)
but I can still see the column with
spark.read.parquet(path)
, so would it make sense to use the two different functions, one for the current schema and one for historical data (until retention expiry)?
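For illustration, the two reads side by side (path is the table location, spark an active SparkSession). One caution with the second read: spark.read.parquet bypasses the Delta transaction log, so it can also surface files Delta has logically removed but not yet VACUUMed, meaning the result may not match any committed version of the table.
current_df = spark.read.format("delta").load(path)   # log-aware read: current schema only
raw_df = spark.read.parquet(path)                    # raw files: dropped column still present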

Kees Duvekot

05/22/2023, 11:10 PM
Sounds like you just need to do a "select ... cast(to_date) .. overwrite" (in very simple terms) on your table
And then continue with the normal process
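Spelled out as a rough sketch (path, the column name Date, the date format, and an active SparkSession spark are all assumptions; overwriteSchema is needed because the column's type changes):
from pyspark.sql import functions as F

df = spark.read.format("delta").load(path)
fixed = df.withColumn("Date", F.to_date("Date", "yyyy-MM-dd"))   # string -> date, in place
(fixed.write.format("delta")
      .mode("overwrite")
      .option("overwriteSchema", "true")
      .save(path))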

Gerhard Brueckl

05/23/2023, 6:57 AM
technically, you should be able to
• add a new column with a random name
• load old data and new data into that new column
• remove the old column
• rename the new column to the name of the old column
just make sure to enable column mapping first: https://docs.databricks.com/delta/delta-column-mapping.html#how-to-enable-delta-lake-column-mapping
not sure if this is a Databricks-only feature though šŸ¤”
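Column mapping is available in open-source Delta Lake as well, not just Databricks; it requires reader version 2 / writer version 5 or above. A rough sketch of the sequence above in Spark SQL (my_table, Date, Date_fixed, and the date format are assumptions):
spark.sql("""
    ALTER TABLE my_table SET TBLPROPERTIES (
        'delta.columnMapping.mode' = 'name',
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5')
""")
spark.sql("ALTER TABLE my_table ADD COLUMNS (Date_fixed DATE)")
spark.sql("UPDATE my_table SET Date_fixed = TO_DATE(Date, 'yyyy-MM-dd')")
spark.sql("ALTER TABLE my_table DROP COLUMN Date")
spark.sql("ALTER TABLE my_table RENAME COLUMN Date_fixed TO Date")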

Chain Hermanson

06/03/2023, 7:48 AM
Hey folks, would like to check with the delta community šŸ™ Coming from a similar situation, but not exactly the same. I own a delta table, and after some assessment we decided to drop a set of columns due to no usage. 2 weeks after dropping the columns (10 of them), we had a use case where we need some of the columns we had dropped previously, but not all of them (3 out of 10). Is the only option to restore the table to the version before we dropped the columns? That would lose us 2 weeks' worth of data, since the table is updated daily and it's been 2 weeks since we dropped the columns. Or is there another way to restore the columns with minimum disruption?

Kees Duvekot

06/03/2023, 7:50 AM
How would you have solved this on a regular RDBMS?

Chain Hermanson

06/03/2023, 7:58 AM
Our end users normally interact with Hive external tables pointing to a location like an S3 bucket that holds data in parquet format. In that scenario, what I would probably do is just update the table definition to remove the 10 columns; the underlying parquet would still have the columns, it's just that when users query the table, the removed columns would not show, since the Hive table definition has been updated. Based on my experience, doing this will not work for delta, since the delta catalog takes precedence; that's why I was upgrading my delta writer version from 2 -> 5 to utilize table column mapping. When we want to restore, since the parquet still contains the data, I would just need to update the table definition. Initially I thought ALTER TABLE ... DROP COLUMN would only update the delta table definition without deleting the underlying data; turns out my understanding was wrong after reading this thread šŸ™

Kees Duvekot

06/03/2023, 8:17 AM
My question was more ... if you had a similar table on an Oracle (or other RDBMS) system ... how would you have solved the situation?

Chain Hermanson

06/03/2023, 8:33 AM
Correct me if I am wrong, but normally it would be to 1. restore the table to the point in time before the drop happened; the con is that we lose any data generated after that point due to the restoration, or 2. create a backup of the table as of before the drop and backfill the lost data; once we are done filling in the missing data, we switch the original table over to the backup version with the backfilled data. I guess where you are coming from is that to solve this we would need to isolate the blast radius and then switch the table's data source once we are ready with the fixed data?

Kees Duvekot

06/03/2023, 8:37 AM
If the data from before the drop is still available (possibly in the Delta history), I would create a temporary new table with that dataset, then merge that with the current table (adding the 3 columns back with all the data that was available). And then find a source for the 2 weeks of missing data for those 3 columns (you didn't lose the 2 weeks of data in all the other columns).
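A minimal sketch of that recovery; the pre-drop version number (123 here), the join key (id), the three column names, path, and spark are all placeholders/assumptions:
pre_drop = (spark.read.format("delta")
                 .option("versionAsOf", 123)             # a version from before the DROP COLUMN
                 .load(path)
                 .select("id", "col_a", "col_b", "col_c"))
current = spark.read.format("delta").load(path)
restored = current.join(pre_drop, on="id", how="left")   # the 3 columns stay null for the missing 2 weeks
(restored.write.format("delta")
         .mode("overwrite")
         .option("overwriteSchema", "true")              # the schema gains the 3 columns back
         .save(path))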
But nothing here is specific to Delta Lake ... just normal "data recovery/engineering" ... like I said, this situation would not have been any different on a normal RDBMS ...
The history feature of Delta is nice ... and sometimes very convenient to have ... but it is no substitute for "proper" backups
And if I had dropped 10 columns from a big table, I would have made sure I had a backup somewhere ... because "the business" always comes back with questions about it
šŸ‘ 1

Chain Hermanson

06/03/2023, 8:52 AM
Understood, valid points. Will go through the usual data recovery process like any other RDBMS. Thanks for the inputs šŸ™

Kees Duvekot

06/03/2023, 8:53 AM
But do have a look at whether your Delta history still has the "pre-drop" data ...
And make sure you copy that version to a "backup" table as soon as possible ... otherwise you might lose it when the history is cleaned up
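For example (the version number, path, and backup_path are placeholders; spark is an active SparkSession):
(spark.read.format("delta")
      .option("versionAsOf", 123)        # the last pre-drop version
      .load(path)
      .write.format("delta")
      .save(backup_path))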

Chain Hermanson

06/03/2023, 8:58 AM
Sure, that's a good callout. Thank you šŸ™