
sudo

06/30/2023, 1:58 PM
Not sure if this is a use case for Delta Lake, but let me try. We have a set of metadata (JSON) that we would like to ingest (perhaps through REST), and we also consume binary blobs (PDFs, zip archives, images, file directories … perhaps GBs in size) related to that metadata, which we'd like to store for a given entry. Q1) Is this use case supported? If yes, Q2) is it possible to version the binary blob data along with the metadata?

Dominique Brezinski

06/30/2023, 6:42 PM
There is a lot to this question. Serializing a very large binary blob into a column value of a row is probably not wise. A traditional database pattern for this use case is to store the blob in an object store and, once the write is confirmed, insert the metadata and the blob's path into the database. You can do the same thing with Delta Lake. If you want to delete the blob, you will have to delete the metadata record and the blob independently. Delta Lake does have a convention that directories under the table root that start with an underscore won't have their contents deleted by vacuum, so you can place the blobs there.

You can handle versioning in one of two ways: write each new blob version to a different path and update the metadata record, so the history is addressable through time travel; or, if you're using S3 or a similar object store that supports versioning, rely on that and update/pin versions in the metadata record. You will need some frontend service/application logic to perform these things.
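[A minimal sketch of this pattern, assuming PySpark with Delta Lake configured and boto3 for the object store; the bucket name, table path, column names, and `ingest` helper are illustrative, not from the thread.]

```python
import uuid

import boto3
from pyspark.sql import Row, SparkSession

spark = (
    SparkSession.builder.appName("blob-metadata")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

BUCKET = "my-bucket"                        # hypothetical bucket
TABLE_ROOT = f"s3a://{BUCKET}/datasets"     # hypothetical Delta table location
s3 = boto3.client("s3")


def ingest(blob_bytes: bytes, metadata: dict) -> str:
    """Store the blob first; only after that write succeeds, insert the metadata row."""
    # Blobs live under an underscore-prefixed directory inside the table root,
    # which Delta's VACUUM leaves alone by convention.
    key = f"datasets/_blobs/{uuid.uuid4()}.bin"
    s3.put_object(Bucket=BUCKET, Key=key, Body=blob_bytes)

    # Record the metadata plus a pointer to the blob in the Delta table.
    row = Row(blob_path=f"s3a://{BUCKET}/{key}", **metadata)
    spark.createDataFrame([row]).write.format("delta").mode("append").save(TABLE_ROOT)
    return row.blob_path


# Versioning: each new blob version gets a new path and a new metadata row,
# so older pointers stay reachable through Delta time travel:
# spark.read.format("delta").option("versionAsOf", 3).load(TABLE_ROOT)
```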

sudo

06/30/2023, 7:02 PM
Hi @Dominique Brezinski, that was insightful. We are building machine learning pipelines, so we would like to associate datasets with metadata describing what each dataset provides (who the creators/authors of the dataset are, etc.).
I wasn't sure if we could store the dataset along with its metadata in Delta Lake, but based on your response, it seems like the serialization overhead would slow this strategy down.

Dominique Brezinski

06/30/2023, 9:32 PM
Well, technically you could probably stuff a blob in a column; it has been done before. I'm just not sure it is something you would want to do at high scale. I mean, you can try it and see how it performs. The JVM has a 2 GB ByteBuffer limitation, and I worry you might trip over that working with large blobs in a column. People definitely store things like images…
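[For comparison, a sketch of the inline approach being described here, using Spark's BinaryType to put raw bytes directly in a column; the file name and table path are illustrative, and the 2 GB caveat above applies.]

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import BinaryType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()  # assumes Delta configured as in the earlier sketch

schema = StructType([
    StructField("name", StringType()),
    StructField("payload", BinaryType()),   # raw bytes serialized into the row itself
])

# Illustrative local file; large blobs risk the JVM's ~2 GB ByteBuffer limit.
with open("sample.png", "rb") as f:
    payload = f.read()

df = spark.createDataFrame([("sample.png", payload)], schema)
df.write.format("delta").mode("append").save("s3a://my-bucket/inline_blobs")  # hypothetical path
```

[This works fine for small payloads like thumbnails; for GB-scale blobs the pointer-plus-object-store pattern above is the safer bet.]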