Shubham Goyal

03/25/2023, 9:48 AM
Has anyone worked on parsing complex XML column data into multiple columns in Databricks PySpark?

Jim Hibbard

03/25/2023, 6:31 PM
Hi Shubham, I haven't personally had to tackle this problem but it looks like there's a Databricks package that could help: https://github.com/databricks/spark-xml There's some example code in the README.md too. Hope that helps!

Christina

03/25/2023, 10:06 PM
The spark-xml library will help you parse it out into a structure. If you're using Auto Loader, you can read it in as binary and then convert it to str/struct.
👍 1
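A minimal sketch of the "binary, then str/struct" step described above, using only Python's standard library rather than spark-xml. The helper names `element_to_struct` and `xml_bytes_to_struct` and the sample payload are hypothetical; in Databricks the same logic would have to run inside a UDF or be replaced by spark-xml's own parsing.

```python
import xml.etree.ElementTree as ET


def element_to_struct(elem):
    """Recursively convert an Element into a nested dict (a struct-like value).

    Attributes are ignored in this sketch; only child elements and text are kept.
    """
    children = list(elem)
    if not children:
        # Leaf element: return its text content
        return (elem.text or "").strip()
    struct = {}
    for child in children:
        value = element_to_struct(child)
        if child.tag in struct:
            # A repeated tag becomes a list of values
            if not isinstance(struct[child.tag], list):
                struct[child.tag] = [struct[child.tag]]
            struct[child.tag].append(value)
        else:
            struct[child.tag] = value
    return struct


def xml_bytes_to_struct(raw: bytes):
    # ET.fromstring accepts bytes, so the binary payload can be parsed directly
    root = ET.fromstring(raw)
    return {root.tag: element_to_struct(root)}


# Hypothetical payload, as it might arrive in a binary column
payload = b"<order><id>42</id><item>pen</item><item>ink</item></order>"
print(xml_bytes_to_struct(payload))
```

All leaf values come back as strings here; casting to proper types would be a separate step.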

Kees Duvekot

03/26/2023, 10:39 AM
My experience is that it is harder when you have a complex XML with multiple namespaces and related XSDs.
Processing XML is not easy with Spark, even though it's very structured and, with XSDs, you have all the details needed to create the structures automatically.
And when the XMLs are not files but, for example, a SOAP response coming back from a service, it becomes even more difficult to get things processed.
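To illustrate the namespace issue on a SOAP response held as a string rather than a file, here is a small sketch with Python's standard library. The envelope namespace is the standard SOAP 1.1 one; the `http://example.com/customer` namespace, the element names, and `extract_customer` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical SOAP response, as it might come back from a service call
SOAP_RESPONSE = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
               xmlns:c="http://example.com/customer">
  <soap:Body>
    <c:GetCustomerResponse>
      <c:Customer>
        <c:Id>7</c:Id>
        <c:Name>Ada</c:Name>
      </c:Customer>
    </c:GetCustomerResponse>
  </soap:Body>
</soap:Envelope>"""

# Prefix-to-URI map used when searching; the prefixes here are local aliases
# and need not match the prefixes used inside the document itself
NS = {
    "soap": "http://schemas.xmlsoap.org/soap/envelope/",
    "c": "http://example.com/customer",
}


def extract_customer(xml_text):
    """Return the customer fields as a flat dict, or None if the node is missing."""
    root = ET.fromstring(xml_text)
    customer = root.find("soap:Body/c:GetCustomerResponse/c:Customer", NS)
    if customer is None:
        return None
    return {
        "id": customer.findtext("c:Id", namespaces=NS),
        "name": customer.findtext("c:Name", namespaces=NS),
    }


print(extract_customer(SOAP_RESPONSE))
```

Every path step has to carry its namespace prefix, which is a large part of why multi-namespace documents get tedious quickly.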

Shubham Goyal

03/26/2023, 11:00 AM
Thanks everyone for responding. @Kees Duvekot I am also facing a lot of difficulties while trying to process a complex XML using Spark. Is there any alternative you would suggest?

Kees Duvekot

03/26/2023, 11:01 AM
Preprocess it using an XSLT .. to make it simple 😁
👍 1
But that usually means losing data ...

Shubham Goyal

03/26/2023, 11:08 AM
My requirement is to store XML data in a structured format without losing the versioning of the data. So I was thinking of storing the data in Delta format using Databricks PySpark.

Kees Duvekot

03/26/2023, 11:09 AM
The storage is not the problem ... The structured format is.
👍 1
You might be able to store the raw data as separate records in a Delta table (bronze) .. and store the relevant parts in a structured format in a silver Delta table
👍 1
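A rough sketch of that bronze/silver split in plain Python, with no Spark or Delta involved. The invoice payload, the field names, and the `to_bronze` / `to_silver` helpers are all hypothetical; on Databricks each set of rows would instead be written to its own Delta table.

```python
import xml.etree.ElementTree as ET


def to_bronze(record_id, raw_xml):
    # Bronze: keep the payload exactly as received, so nothing is lost
    return {"record_id": record_id, "raw_xml": raw_xml}


def to_silver(bronze_row):
    # Silver: extract only the fields that are relevant right now
    root = ET.fromstring(bronze_row["raw_xml"])
    return {
        "record_id": bronze_row["record_id"],
        "customer": root.findtext("customer/name"),
        "total": float(root.findtext("total", default="0")),
    }


# Hypothetical record; <vip> and <notes> are not extracted into silver
raw = (
    "<invoice><customer><name>Ada</name><vip>true</vip></customer>"
    "<total>12.50</total><notes>call back</notes></invoice>"
)

bronze = to_bronze("inv-1", raw)
silver = to_silver(bronze)
print(silver)
```

Because the bronze row still holds the full XML, a later version of `to_silver` could pull out `vip` or `notes` by reprocessing bronze, without going back to the source system.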

Shubham Goyal

03/26/2023, 11:12 AM
Yes, converting it into a structured format is becoming difficult.

Jim Hibbard

03/26/2023, 3:47 PM
I think Kees makes a great point. You can always store a lossless version in Delta (easy to accomplish, but not easy to query) and then reshape the data / extract only the currently relevant values in your silver and gold layers. Then, if you need other details in the future, you can reprocess the fully saved Delta version of the XML record.