Seungchan Lee

08/01/2023, 2:09 AM
Hi, is there a way to use Airbyte with Delta Lake? I remember seeing Airbyte mentioned in connection with delta-rs in the Delta Lake 3.0 presentation but couldn’t find any documentation on this. If anyone can point me in the right direction, I’d very much appreciate it!

Simon Thelin

08/02/2023, 3:39 PM
So far there is no way that I know of. I don’t know how Airbyte’s Databricks connector works, so that is prob why it was in the presentation. However, we also use Trino with Galaxy; they have a connector that can push to Apache Iceberg, and with Delta 3.0 you can then link this, though I have not yet tested it. For our main workloads we go
pg -> s3 -> delta
atm.
I asked in the Airbyte Slack and no responses yet.

Seungchan Lee

08/03/2023, 3:59 PM
Hi @Simon Thelin thanks! So when you say
pg -> s3 -> delta
do you mean you use Airbyte to connect pg -> s3, and then have some sort of manual script from s3 -> delta? Would you mind elaborating a little on the setup? I’d love to try a workaround like this if it works

Simon Thelin

08/04/2023, 1:22 PM
I have a Spark bronze job that handles any schema changes, data type changes, etc., yes.
So I run Airbyte to just incrementally dump data to raw S3 parquet. Then in my Spark bronze job I do partial/full soft-delete checks and manage new columns and anything that got dropped, following standard slowly changing dimensions practice. Here I also allow potential duplicates and similar, then clean that up in silver jobs. I have a quite complex scenario where data can be updated, deleted, and inserted in all possible ways on the Postgres side, which means data I just synced may no longer exist, so I have to do quite a lot of things to make sure it propagates correctly. I unfortunately can’t set up CDC atm because the PG instance I’m using was set up manually, so I’m worried about what might happen; otherwise I would just go full Debezium and then use kafka-delta-ingest, and I will move to that at some point. I think Airbyte is quite slow and annoying in general, but it does the job, and we are only a team of 3, so it saves me a lot of time right now. I would advise you @Seungchan Lee, if you can, to use debezium and then kafka-delta-ingest. That way you get a very high-performance sync directly into Delta, given your Postgres is properly set up with IaC so you can easily tear it up and down.
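Roughly, the bronze pass looks something like this, as a minimal sketch (table paths, the id key, and the _deleted flag are made up for illustration, not our real schema):
```python
# Minimal sketch of a bronze pass over raw Airbyte parquet dumps.
# Paths, the `id` key, and the `_deleted` flag are illustrative only.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable
import pyspark.sql.functions as F

spark = (
    SparkSession.builder.appName("bronze-orders")
    # standard Delta Lake session configs, assuming delta-spark is installed
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Incremental dump that Airbyte wrote to raw S3.
raw = spark.read.parquet("s3a://raw/airbyte/orders/")

bronze = DeltaTable.forPath(spark, "s3a://lake/bronze/orders")

# Upsert by primary key; duplicates are tolerated here and cleaned up in silver.
(
    bronze.alias("t")
    .merge(raw.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Soft-delete check (only valid against a full snapshot): rows in bronze that
# are missing from the sync get flagged, not physically removed.
missing = bronze.toDF().join(raw.select("id"), "id", "left_anti").select("id")
(
    bronze.alias("t")
    .merge(missing.alias("m"), "t.id = m.id")
    .whenMatchedUpdate(set={"_deleted": F.lit(True)})  # column assumed to exist
    .execute()
)
```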
@Seungchan Lee here is the current architecture I have implemented. What you can’t see here is that we also run some lambda-architecture (not AWS Lambda) Spark streaming/batch jobs, combined with multiple compute sources like dask and polars. I hope to be able to move to full kappa soon and ditch Airbyte altogether, though I’ll potentially keep it for Salesforce. I will be featured soon in a full article on Starburst, and I’m happy to explain more about this if it helps you in any way.

Seungchan Lee

08/04/2023, 3:07 PM
@Simon Thelin This is awesome - thank you so much for explaining it in detail. I’m quite new to this so this is very helpful! I’m going to have to look more into kappa architecture, but wouldn’t Airbyte still be useful so you don’t have to write connectors for multiple data sources? Also, do you run your own Spark cluster or use Databricks? I was hoping to run my own Spark cluster but not sure if that’s feasible. And do you run your own Kafka implementation or use Aiven or Confluent? Lastly, what do you use for BI dashboards and visualizations once you connect Trino/Starburst to Delta Lake?
Oh and also, do you mind if I send you a connection request on LinkedIn?

Simon Thelin

08/04/2023, 4:06 PM
You can run CDC with Airbyte, but I would rather just run pure Debezium then. However, for a source like Salesforce, you can still publish to a topic via Airbyte if you go kappa, to unify it. I run Spark on k8s via the open-source Spark on k8s operator. I host Kafka myself as well, with Strimzi, but I’m keen to try Redpanda. I host everything on my own; the only thing I don’t host is Trino atm, due to not having enough hours in a day haha
I use Trino for DBT workloads via their own Delta connector, and for BI tools to connect to; it is very versatile in that sense. Databricks has some SQL engine of their own, but last time I checked it was pretty useless. And I don’t mind at all, for sure.
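If you do go the Debezium route, registering the Postgres source is just a POST to the Kafka Connect REST API; a rough sketch (host names, credentials, and the table list are placeholders):
```python
# Sketch: register a Debezium Postgres source via Kafka Connect's REST API.
# Host names, credentials, and the table list are placeholders.
import requests

connector = {
    "name": "pg-orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",   # logical decoding plugin built into PG 10+
        "database.hostname": "pg.internal",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "********",
        "database.dbname": "app",
        "topic.prefix": "pg",        # Debezium 2.x; 'database.server.name' on 1.x
        "table.include.list": "public.orders",
    },
}

resp = requests.post(
    "http://kafka-connect:8083/connectors",  # Connect REST endpoint
    json=connector,
    timeout=30,
)
resp.raise_for_status()
```
kafka-delta-ingest then just consumes the resulting topic and writes straight to the Delta table.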
@Seungchan Lee this is what I use: https://github.com/GoogleCloudPlatform/spark-on-k8s-operator. You have a spark operator in your cluster, and when you later submit a job, each job becomes its own small cluster. I mimic what https://www.datamechanics.co/ are doing, but I just host it myself.
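To give a feel for it, submitting a job to the operator is just applying a SparkApplication manifest, roughly like this (image, file path, and sizing are placeholders):
```yaml
# Sketch of a SparkApplication for the operator; image/paths/sizing are placeholders.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: bronze-orders
  namespace: spark
spec:
  type: Python
  mode: cluster
  image: my-registry/spark-delta:3.4.1
  mainApplicationFile: s3a://artifacts/jobs/bronze_orders.py
  sparkVersion: "3.4.1"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: 2g
    serviceAccount: spark
  executor:
    instances: 2
    cores: 2
    memory: 4g
```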
This is what I use to host Kafka in k8s @Seungchan Lee: https://strimzi.io/
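With Strimzi the whole Kafka cluster is likewise just a custom resource; a minimal sketch (versions and sizing are placeholders):
```yaml
# Sketch of a Strimzi Kafka custom resource; versions and sizing are placeholders.
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: lake-kafka
spec:
  kafka:
    version: 3.5.1
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    storage:
      type: persistent-claim
      size: 100Gi
    config:
      offsets.topic.replication.factor: 3
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 20Gi
  entityOperator:
    topicOperator: {}
    userOperator: {}
```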

Seungchan Lee

08/04/2023, 11:25 PM
@Simon Thelin Very nice - thank you so much for the information! So are you using Superset for BI? Or do you use paid ones like Tableau or something? Lastly, you seem to prefer open source + self-hosting. What is the rationale behind this preference, and in your experience, has it been worth the effort? (I’m assuming it hasn’t been easy running your own Kafka and Spark clusters, etc., but correct me if I’m wrong!)