
junhui huang

04/06/2023, 2:54 AM
Hi, I'm a new user of Delta. How can Delta Lake be used to manage massive machine learning datasets: images (each belonging to a dataset) and metadata (dataset metadata, image metadata)? Are there any suggestions or similar use cases? Also, does the number of Delta Lake tables have an impact on performance, especially when the number of tables reaches hundreds of millions? Thanks.

Jim Hibbard

04/06/2023, 8:27 AM
Great question! You mention the possibility of "hundreds of millions" of Delta tables; could you elaborate on your use case?

junhui huang

04/06/2023, 9:16 AM
We currently have 20 million datasets stored in MongoDB. Each dataset contains anywhere from a few to millions of image files, and each image has metadata records such as resolution, size, and storage address.

Option 1: map each dataset to a separate dataset_<uuid>_image table, which would mean 20 million such tables. Storing each dataset's image metadata in its own table lets us quickly find all the files under a given dataset.

Option 2: store all image metadata in a single dataset_images table with a dataset_uuid field indicating which dataset each row belongs to. To look up images efficiently by dataset_uuid, we would want to partition the table on it; but because dataset_uuid has very high cardinality, we would instead add a field like partition_id, where partition_id = dataset_uuid % PARTITION_NUM, and partition on that.

Which of these two options is more suitable for Delta Lake, or is there a better solution for managing datasets and images?
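To make Option 2 concrete, here is a minimal PySpark sketch, assuming the delta-spark package. The table and column names (dataset_images, dataset_uuid, partition_id) follow the message above; PARTITION_NUM = 1024, the storage path, and the hash-based bucketing (dataset_uuid is presumably a string, so xxhash64 stands in for a direct modulo on the UUID) are illustrative assumptions, not a tested recommendation.

```python
from pyspark.sql import SparkSession, functions as F
from delta import configure_spark_with_delta_pip

PARTITION_NUM = 1024  # hypothetical bucket count; tune to your data volume

builder = (
    SparkSession.builder.appName("dataset-images")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Example image-metadata rows; in practice these would be read from MongoDB.
images = spark.createDataFrame(
    [("uuid-1", "img-001.jpg", 1920, 1080, "s3://bucket/a/img-001.jpg")],
    ["dataset_uuid", "file_name", "width", "height", "storage_path"],
)

# Derive a low-cardinality partition column from the high-cardinality UUID:
# hash the string UUID to a long, then bucket it with modulo.
images = images.withColumn(
    "partition_id", F.abs(F.xxhash64("dataset_uuid")) % PARTITION_NUM
)

# One Delta table, partitioned on the derived bucket column.
(images.write.format("delta")
    .mode("append")
    .partitionBy("partition_id")
    .save("/delta/dataset_images"))

# Lookup by dataset: recompute the bucket so the partition filter prunes
# directories, then filter on dataset_uuid within the surviving partition.
uuid = "uuid-1"
hits = (
    spark.read.format("delta").load("/delta/dataset_images")
    .where(F.col("partition_id") == F.abs(F.xxhash64(F.lit(uuid))) % PARTITION_NUM)
    .where(F.col("dataset_uuid") == uuid)
)
hits.show()
```

Worth noting when weighing the two options: Delta Lake collects per-file statistics for data skipping, and OPTIMIZE with ZORDER BY dataset_uuid can cluster a high-cardinality column without partitioning on it, so the single-table design may not even need the modulo buckets to get efficient per-dataset lookups.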