u

04/23/2023, 8:49 AM
spark.read.format("xml").option("rowTag", "Documents").load("/data/pat/66086B.XML").write.format("delta").mode("append").save("s3a://delta-lake/demo1")
org.apache.spark.sql.AnalysisException: Failed to merge fields 'PatentDocument' and 'PatentDocument'. Failed to merge fields 'BibliographicData' and 'BibliographicData'. Failed to merge fields 'Parties' and 'Parties'. Failed to merge fields 'Agents' and 'Agents'. Failed to merge fields 'Agent' and 'Agent'. Failed to merge incompatible data types StructType(StructField(Name,StructType(StructField(_VALUE,StringType,true),StructField(_lang,StringType,true)),true),StructField(_format,StringType,true),StructField(_sequence,LongType,true)) and ArrayType(StructType(StructField(Name,StructType(StructField(_VALUE,StringType,true),StructField(_lang,StringType,true)),true),StructField(_format,StringType,true),StructField(_sequence,LongType,true)),true)
Why do the above errors occur?
j

JosephK (exDatabricks)

04/23/2023, 11:16 AM
The XML shockingly didn't parse correctly
😂 3
j

Jim Hibbard

04/23/2023, 5:59 PM
Hi Abel! It looks like your fields change data types partway through your XML parse. You'll have to resolve that to avoid the error, either by casting the incoming values to the same dtype or by merging the schemas so that the dtypes are flexible enough to accommodate all incoming values. Hope that helps!
1
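A minimal sketch of how this kind of type flip can arise, using only the Python standard library. It is an illustration, not the XML reader's actual inference code: the assumption is that a child element seen once is treated as a single struct, while the same element repeated under one parent is treated as an array. The sample documents and the helper names `infer_shape` and `as_list` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents: the same <Agents> parent with one or two children.
one_agent = "<Agents><Agent sequence='1'><Name>A</Name></Agent></Agents>"
two_agents = (
    "<Agents><Agent sequence='1'><Name>A</Name></Agent>"
    "<Agent sequence='2'><Name>B</Name></Agent></Agents>"
)

def infer_shape(parent_xml, child_tag):
    """Return 'array' if child_tag repeats under the parent, otherwise 'struct'."""
    parent = ET.fromstring(parent_xml)
    count = len(parent.findall(child_tag))
    return "array" if count > 1 else "struct"

def as_list(parent_xml, child_tag):
    """Normalise the children to a list, even when only one is present."""
    parent = ET.fromstring(parent_xml)
    return [child.findtext("Name") for child in parent.findall(child_tag)]

print(infer_shape(one_agent, "Agent"), infer_shape(two_agents, "Agent"))
print(as_list(one_agent, "Agent"), as_list(two_agents, "Agent"))
```

Treating the field as a list in every file, as `as_list` does, is the "same dtype for all incoming values" option: the shape no longer depends on how many children a particular document happens to contain.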
u

04/24/2023, 1:16 AM
I am sure that the XML is parsed properly
@Jim Hibbard I'll give it a try
👍 1
j

Jim Hibbard

04/24/2023, 1:23 AM
Sounds good. It looks like initially the field is parsing as an ArrayType and later as a StructType. If you look at the schema of your destination table and compare it to the schema of the parsed and unmerged records, what do you get?
1
🙌 1
👍 1
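One way to do that comparison is to walk both schemas and report the paths where the type kind differs. The sketch below works on plain dicts in the JSON layout that Spark schemas serialise to (in PySpark, `df.schema.jsonValue()` should give this shape). The two sample schemas are cut-down, hypothetical stand-ins for the table and the incoming batch, which side holds the struct is an assumption, and `diff_types` is an illustrative helper, not part of Spark or Delta Lake.

```python
def diff_types(left, right, path=""):
    """Yield (path, left_kind, right_kind) where two schema JSON trees disagree.

    Only struct and array nesting is followed; other complex types are skipped.
    """
    kind = lambda t: t if isinstance(t, str) else t["type"]
    if kind(left) != kind(right):
        yield (path or "<root>", kind(left), kind(right))
        return
    if isinstance(left, str):
        return  # identical primitive types
    if left["type"] == "struct":
        right_fields = {f["name"]: f["type"] for f in right["fields"]}
        for f in left["fields"]:
            if f["name"] in right_fields:
                child = f"{path}.{f['name']}".lstrip(".")
                yield from diff_types(f["type"], right_fields[f["name"]], child)
    elif left["type"] == "array":
        yield from diff_types(left["elementType"], right["elementType"], path + "[]")

def field(name, dtype):
    """Build one struct field entry in Spark's schema JSON layout."""
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}

def struct(*fields):
    return {"type": "struct", "fields": list(fields)}

# Hypothetical, heavily reduced schemas.
agent = struct(field("_format", "string"), field("_sequence", "long"))
table_schema = struct(field("Agents", struct(field("Agent", agent))))
incoming_schema = struct(
    field("Agents", struct(field("Agent", {"type": "array", "elementType": agent, "containsNull": True})))
)

mismatches = list(diff_types(table_schema, incoming_schema))
print(mismatches)
```

Each reported path points at a field that would need to be cast or reshaped before the append can succeed.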
u

04/24/2023, 2:51 AM
@Jim Hibbard As you said, changing the field type to String works, thanks
👍 1
j

Jim Hibbard

04/24/2023, 3:50 AM
Excellent! Glad that worked for you. Let me know if you need anything else 🙌
k

Kees Duvekot

04/24/2023, 5:11 PM
@Jim Hibbard if the XML has pointers to the XSD (or even multiple) ... it would be very handy if the XSDs were used to determine the actual schema, instead of just the contents (which is what it looks to be doing now)
👍 1
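A rough sketch of what an XSD can contribute here, using only the Python standard library. An XSD declares how often an element may occur (`maxOccurs`), so a reader could decide "array or single struct" up front instead of guessing from each file's contents. The XSD fragment below is hypothetical, not the real patent schema, and `repeated_elements` is an illustrative helper rather than an existing Spark or Delta Lake API.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical XSD fragment: <Agent> may repeat, <Agents> may not.
XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Agents">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Agent" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

def repeated_elements(xsd_text):
    """Return the names of elements whose maxOccurs allows more than one occurrence."""
    root = ET.fromstring(xsd_text.strip())
    names = set()
    for el in root.iter(XS + "element"):
        max_occurs = el.get("maxOccurs", "1")  # the XSD default is 1
        if max_occurs == "unbounded" or int(max_occurs) > 1:
            names.add(el.get("name") or el.get("ref"))
    return names

print(repeated_elements(XSD))
```

With that information, `Agent` would always be read as an array, whether a given document contains one agent or many.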
j

Jim Hibbard

04/24/2023, 5:26 PM
I'm not very familiar with XSD, would you be up for creating a PR describing this approach? If there are already schemas built into a document it'd be great to take advantage of that. I could see some complexity cropping up if XSD schemas don't map well onto Delta Lake's data types or the schemas are especially difficult to parse / in an inconvenient part of the document. But parsing XML is a common use case, so definitely interested in any enhancements possible. Thanks Kees!
(Also, if not up for writing that PR, I can definitely submit something on your behalf. It'd be a huge help if you could jot some bullets down on what you'd like to see though!)
k

Kees Duvekot

04/24/2023, 5:36 PM
I know how to write "Programmer" .. but I am not one. But a writeup on what I am thinking about is definitely possible
Let me plan something for later this week
j

Jim Hibbard

04/24/2023, 5:43 PM
Thank you so much! A write-up is exactly what we need. You definitely don't need to contribute the whole feature, your insights are super valuable though. For example, I didn't realize XSD existed until you mentioned it. Thanks Kees 🙌