
Martin

04/13/2023, 5:07 PM
Design decision: batched inserts vs. lots of appends + regular compaction? We'd like to store process data from a Workflow Management System (WFMS) in a Delta table. We have an Azure Function (Java) that is triggered ~300k times per day and fetches the process data from the WFMS in near real time. The triggers are not evenly distributed over the day, but rather come in two peaks, in the late morning and in the afternoon, with up to 10k triggers per minute. We are currently considering two options to ingest the data into a Delta table:
1. Have the Azure Function(s) write the data to JSON files (300k files per day) in Azure Data Lake Gen2, then read the newly arrived data in batches every 15 minutes into a Spark DataFrame and append it to the Delta table. The Azure Function would create a new folder every 15 minutes for the JSONs to prevent excessive file listing when reading with Spark.
2. Have the Azure Function(s) write directly to the Delta table using the Delta Standalone library. Since every append creates a new Parquet file and a new version of the Delta table, we'd need regular compaction (`OPTIMIZE`) to keep the table healthy.
We are currently leaning towards (1). What are your thoughts?
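For context, option (1) amounts to a small scheduled Spark batch job. The following is only a minimal sketch under assumptions: the ADLS Gen2 paths for the 15-minute landing folders and for the Delta table are hypothetical placeholders, and the delta-spark dependency is assumed to be on the classpath.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WfmsBatchIngest {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("wfms-json-to-delta")
                // Enable Delta Lake support in Spark SQL
                .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
                .config("spark.sql.catalog.spark_catalog",
                        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
                .getOrCreate();

        // Hypothetical layout: one landing folder per 15-minute window, passed in by the scheduler.
        String inputFolder = args[0]; // e.g. abfss://landing@<account>.dfs.core.windows.net/wfms/2023-04-13/17-00
        String tablePath   = args[1]; // e.g. abfss://lake@<account>.dfs.core.windows.net/tables/wfms_process_data

        // Read only the JSON files that arrived in this window...
        Dataset<Row> newData = spark.read().json(inputFolder);

        // ...and append them to the Delta table in a single commit.
        newData.write()
                .format("delta")
                .mode("append")
                .save(tablePath);

        spark.stop();
    }
}
```

The folder-per-window layout is what keeps each batch read targeted at new files only, which is the file-listing concern raised above.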
n

Nick Karpov

04/13/2023, 5:15 PM
300k commits per day makes (2) very unfavorable IMO. that’s a lot of extra metadata… you could implement a pattern that does a commit every X minutes asynchronous to the actual file writes, but that’s a lot to maintain (just throwing the idea out)… option (1) is the standard approach for sure, you could always start with that and adjust based on perf
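The "commit every X minutes, asynchronous to the actual file writes" pattern mentioned above could look roughly like the sketch below. This is only an outline under several assumptions: it uses the Delta Standalone DeltaLog/OptimisticTransaction API, the Parquet files are assumed to have already been written to the table directory by some other code path, the AddFile arguments are placeholders and may need adjusting to the Delta Standalone version in use, and it ignores retries and conflict handling, which is exactly the maintenance burden being warned about.

```java
import io.delta.standalone.DeltaLog;
import io.delta.standalone.Operation;
import io.delta.standalone.OptimisticTransaction;
import io.delta.standalone.actions.Action;
import io.delta.standalone.actions.AddFile;
import org.apache.hadoop.conf.Configuration;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BatchedCommitter {
    private final DeltaLog deltaLog;
    // Parquet files already written to storage but not yet committed to the Delta log.
    private final List<Action> pending = Collections.synchronizedList(new ArrayList<>());

    public BatchedCommitter(String tablePath) {
        this.deltaLog = DeltaLog.forTable(new Configuration(), tablePath);
    }

    /** Register a Parquet file that has already been written into the table directory. */
    public void registerFile(String relativePath, long sizeInBytes, long modificationTime) {
        // AddFile action for an unpartitioned table; no stats or tags in this sketch.
        pending.add(new AddFile(relativePath, Collections.emptyMap(), sizeInBytes,
                modificationTime, true /* dataChange */, null, null));
    }

    /** Every X minutes, commit whatever has accumulated as a single Delta version. */
    public void start(long intervalMinutes) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            List<Action> batch;
            synchronized (pending) {
                if (pending.isEmpty()) return; // nothing to commit this interval
                batch = new ArrayList<>(pending);
                pending.clear();
            }
            OptimisticTransaction txn = deltaLog.startTransaction();
            txn.commit(batch, new Operation(Operation.Name.WRITE), "wfms-batched-committer");
        }, intervalMinutes, intervalMinutes, TimeUnit.MINUTES);
    }
}
```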

Kashyap Bhatt

04/13/2023, 5:26 PM
Both are bad IMO. #2 is slightly worse than #1.
1. Write 300k files:
a. You'll end up paying for 300k writes and more than 300k reads. That's cost and performance, and you'll have reliability issues to deal with.
2. Perform 300k inserts:
a. Not even sure this is feasible from a performance point of view. It's not just the creation of a Parquet file; it's the update of the logs and a bunch of other calls to ADLS.
b. Then you pay compute for compaction of those 300k files into a smaller number of files.
I would recommend:
• create a little pipe (storage-Q, Kafka, service-bus, ...)
• post from your Function into the pipe (so this pipe will get 300k messages/day with a peak of ~10k/minute)
• tie a new Azure Function to consume from this pipe and write one file per minute (if there are any messages). This way you'll end up with at most 1440 files per day, assuming you have at least one request every single minute of the day.
The cost savings from the following would easily offset the cost of the new storage-Q and Function:
• reduced number of calls to ADLS (down from on the order of 600k (write/list/read/...) to <10k)
• reduced compute required to stream this data into your Delta table at every stage (reading from ADLS, writing to the Delta table, running OPTIMIZE on the table to compact/z-order).
PS: I included Kafka/Service Bus etc. as alternatives to the storage-Q in case:
• the storage Q cannot handle your peak load (it's a poor man's Q)
• your existing applications already use them (otherwise the cost might be too high to adopt one only for this purpose, which would be a deal breaker).
• tie a new Azure Function to consume from this pipe and write one file per minute (if there are any messages).
It's been a while and things change pretty quickly, but IIRC you might be able to define the trigger for this new Function based on time, or quantity (of messages waiting in the Q), or both. So you might end up with even fewer than 1440 files per day. The only driver for consuming from the pipe more often would be your application's latency requirements.
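As an illustration of the recommendation above, here is a rough sketch of a timer-triggered Azure Function (Java) that drains a Storage Queue once per minute and writes a single JSON-lines file if any messages arrived. The queue and container names, the connection-string app setting, and the blob naming scheme are assumptions for the example, and the Azure SDK calls may need adjusting to the SDK version in use (older azure-storage-queue versions expose getMessageText() instead of getBody()).

```java
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobContainerClientBuilder;
import com.azure.storage.queue.QueueClient;
import com.azure.storage.queue.QueueClientBuilder;
import com.azure.storage.queue.models.QueueMessageItem;
import com.microsoft.azure.functions.ExecutionContext;
import com.microsoft.azure.functions.annotation.FunctionName;
import com.microsoft.azure.functions.annotation.TimerTrigger;

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.time.Instant;

public class DrainWfmsQueue {

    @FunctionName("DrainWfmsQueue")
    public void run(
            // Fire once per minute; adjust the CRON expression to trade latency for file count.
            @TimerTrigger(name = "timer", schedule = "0 */1 * * * *") String timerInfo,
            final ExecutionContext context) {

        String conn = System.getenv("StorageConnectionString"); // assumed app setting
        QueueClient queue = new QueueClientBuilder()
                .connectionString(conn)
                .queueName("wfms-events")                        // assumed queue name
                .buildClient();

        StringBuilder batch = new StringBuilder();
        int count = 0;

        // Drain the queue in pages of up to 32 messages; the page cap is an arbitrary
        // safety limit for this sketch and should be sized for the ~10k/minute peak.
        for (int page = 0; page < 400; page++) {
            int before = count;
            for (QueueMessageItem msg : queue.receiveMessages(32)) {
                batch.append(msg.getBody().toString()).append('\n'); // one JSON document per line
                queue.deleteMessage(msg.getMessageId(), msg.getPopReceipt());
                count++;
            }
            if (count == before) break; // queue is empty
        }

        if (count == 0) {
            return; // nothing arrived this minute, so no file is written
        }

        // Write a single JSON-lines blob for this minute into the landing container.
        BlobContainerClient container = new BlobContainerClientBuilder()
                .connectionString(conn)
                .containerName("landing")                        // assumed container name
                .buildClient();
        BlobClient blob = container.getBlobClient("wfms/" + Instant.now() + ".jsonl");

        byte[] bytes = batch.toString().getBytes(StandardCharsets.UTF_8);
        blob.upload(new ByteArrayInputStream(bytes), bytes.length);

        context.getLogger().info("Wrote " + count + " messages to one file.");
    }
}
```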

Nick Karpov

04/13/2023, 5:31 PM
check out this project if you're willing to add infra: https://github.com/delta-io/kafka-delta-ingest. i would still advocate for trying (1) bc of how simple it is to set up, but agree with Kash otherwise

Kashyap Bhatt

04/13/2023, 5:37 PM
Agreed @Nick Karpov. 300k is just in the gray area where you might get away with it (~USD 100-150/month for just the ADLS part, on the Hot tier). If it were, say, a few million messages per day, the decision would be more black and white.

Martin

04/13/2023, 7:39 PM
I hadn't thought about costs, but @Kashyap Bhatt you are perfectly right that writing lots of small files to ADLS Gen2 can get quite expensive. Thanks a lot to both of you for your expertise!