We are moving 1 TB of data from one storage account to another (both mounted in Azure Databricks) using dbutils.fs.mv(), which is a slow operation. Meanwhile, moving the same data manually with Microsoft Azure Storage Explorer is about 100x faster. Does anyone have suggestions for an API that can perform this data movement at that higher transfer speed?
I would probably try Azure Data Factory - it scales pretty well for these scenarios
if you are moving Delta Lake tables you should also have a look at DEEP CLONE
we copied 100TB of data in about 3 hours this way
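To illustrate the deep clone approach: a Delta deep clone copies both the data files and the table metadata to the target, and the copy runs in parallel across the cluster. A minimal sketch in Databricks SQL - the database, table, and storage path names here are placeholders, not from this thread:

```sql
-- Deep clone a Delta table into a second storage account.
-- source_db.events, target_db.events and the abfss:// path are example names.
CREATE OR REPLACE TABLE target_db.events
DEEP CLONE source_db.events
LOCATION 'abfss://container@targetaccount.dfs.core.windows.net/delta/events';
```

Re-running the same statement later copies only files that changed since the last clone, which is handy for incremental migrations.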
05/16/2023, 8:04 PM
Related to this topic: as far as I understand, if the data stays in the same storage account I can just copy everything (including _checkpoint) to a new folder and it should work - I have tested it and it seems the streaming checkpoint does not store any absolute paths. I would just need to re-register the Hive tables to point to the new location.
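The re-registration step mentioned above can be done without dropping and recreating the tables, by pointing the metastore entry at the copied folder. A sketch in Databricks SQL - my_db.my_table and the path are hypothetical names:

```sql
-- Re-point an existing Hive metastore table at the new copy of the data.
-- Table name and abfss:// path are example values, not from the thread.
ALTER TABLE my_db.my_table
SET LOCATION 'abfss://container@account.dfs.core.windows.net/new_folder/my_table';
```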