Hello everyone,
Can I get some advice on how we can improve the integration between data ingestion and ML using Delta Lake? Right now our data ingestion is done in Databricks notebooks and saved as Hive tables. The pain points with this setup are the lack of good version control for the ingestion code and how it integrates with downstream ML. We end up running into errors in feature generation because schema updates were made but not communicated well from the data team to ML. Ideally we would like feature generation to run with every schema update and either reject the change or log the failure so we can catch those issues earlier. I am new to Delta Lake, so if there are any resources I should check out, please point me to them.
Kashyap Bhatt
01/13/2023, 8:41 PM
Very open-ended question, so it's hard to answer.
See schema evolution and enforcement: https://www.databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html
Note that OSS Delta Lake may not have all the features offered by the Databricks version of Delta Lake that ships as part of DBR. E.g. https://delta.io/blog/2022-11-16-delta-lake-schema-enforcement/
Using this you should be able to ensure that:
• If there are backward-compatible changes in incoming data's schema, your pipeline doesn't break.
• If there are non-backward-compatible changes in the incoming data's schema, your pipeline will give you good info/errors to pinpoint what changed and whose neck needs wringing (see the sketch below).
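A minimal sketch of what that looks like in practice, assuming OSS delta-spark and hypothetical /mnt/landing/events and /mnt/bronze/events paths (on Databricks the two Delta configs are already set for you):
```python
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

# Hypothetical paths; adjust to your landing and bronze locations.
LANDING_PATH = "/mnt/landing/events"
BRONZE_PATH = "/mnt/bronze/events"

spark = (
    SparkSession.builder.appName("schema-check")
    # Only needed for OSS delta-spark outside Databricks.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

incoming = spark.read.parquet(LANDING_PATH)

try:
    # Default behaviour: Delta enforces the target table's schema, so a
    # mismatched write fails fast instead of silently corrupting the table.
    incoming.write.format("delta").mode("append").save(BRONZE_PATH)
except AnalysisException as exc:
    # Surface the failure to the ML side (alert, failed job, log entry)
    # before feature generation runs against a changed schema.
    print(f"Write rejected by schema enforcement: {exc}")
    raise

# For changes you decide are backward compatible (e.g. new nullable columns),
# opt in to schema evolution explicitly instead of letting it happen silently:
# incoming.write.format("delta").option("mergeSchema", "true") \
#     .mode("append").save(BRONZE_PATH)
```
The point is that the rejection happens at ingestion time, where you can log it and notify the ML team, rather than downstream in feature generation.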
If you have more specific questions, you might get more specific answers.
Shane Torgerson
01/14/2023, 12:31 AM
Get your ingestion code out of notebooks and into JAR or PySpark jobs.
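For example, a minimal sketch of an ingestion step as a plain PySpark job (hypothetical file ingest/events_job.py with hypothetical paths) that lives in git and runs via spark-submit or a Databricks Jobs task instead of a notebook:
```python
# ingest/events_job.py -- hypothetical module in a version-controlled repo.
import argparse

from pyspark.sql import SparkSession


def run(source_path: str, target_path: str) -> None:
    spark = SparkSession.builder.appName("ingest-events").getOrCreate()
    df = spark.read.json(source_path)  # assumed JSON landing files
    # Delta enforces the target schema on append, so a breaking upstream
    # change fails the job here rather than in downstream feature generation.
    df.write.format("delta").mode("append").save(target_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Ingest raw events into Delta")
    parser.add_argument("--source", required=True)
    parser.add_argument("--target", required=True)
    args = parser.parse_args()
    run(args.source, args.target)
```
Once it's a normal Python module you get code review, versioning, and tests for free, and the same entry point works from spark-submit (adding the Delta package for OSS Spark) or a Databricks job.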