
Tim Burns

03/26/2023, 7:50 PM
I have a question about the metadata field in the data lake table and how to get it with python. Intuitively, this call seems like it should work:
Copy code
# Load the Delta Lake table
delta_table = DeltaTable.forPath(spark, "/path/to/delta_table")

# Get the metadata of the table
metadata = delta_table.metadata

# Print the metadata
print(metadata)
However, I find that this doesn't work, even though the metadata JSON is right there in the log, so I'm now writing code like this just to find the schema of a table from its metadata:
Copy code
import json
import os
from os.path import join

delta_dir = join(table_path, "_delta_log")
for file in os.listdir(delta_dir):
    if file.endswith(".json"):
        # Each log file is newline-delimited JSON, one action per line
        with open(join(delta_dir, file)) as json_file:
            for json_line in json_file:
                json_obj = json.loads(json_line)
                if "metaData" in json_obj:
                    self.catalog_metadata[table_path] = json_obj["metaData"]
                    schema_string = json_obj["metaData"]["schemaString"]
                    result = json.loads(schema_string)
It seems like such a simple, basic question to ask of a Delta table: what is your schema? Why isn't it baked into the API? Or if it is, how do I get at it without resorting to writing code like this? Thanks, Tim
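The loop above can be made self-contained. This is just a sketch (the helper name is mine, and it assumes the table lives on the local filesystem, with no checkpoint files to consider):
Copy code
# Sketch: read the latest metaData action from a local Delta log.
import json
import os
from os.path import join

def read_delta_metadata(table_path):
    """Return the last metaData action found in the _delta_log JSON files."""
    delta_dir = join(table_path, "_delta_log")
    metadata = None
    # Commit files are named 00000000000000000000.json, ...0001.json, etc.,
    # so sorting the directory listing replays them in commit order.
    for file in sorted(os.listdir(delta_dir)):
        if not file.endswith(".json"):
            continue
        with open(join(delta_dir, file)) as json_file:
            for json_line in json_file:
                json_obj = json.loads(json_line)
                if "metaData" in json_obj:
                    metadata = json_obj["metaData"]
    return metadata

# The table schema is stored as a JSON string inside the metaData action:
# schema = json.loads(read_delta_metadata("/path/to/delta_table")["schemaString"])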

Jim Hibbard

03/26/2023, 10:03 PM
Hi Tim, once you've loaded the delta table as a DataFrame you can use the .schema property:
Copy code
# load delta table
delta_table = DeltaTable.forPath(spark, '/path/to/delta_table')

delta_table.schema
• docs on the schema property
Hope that helps! Let me know if you're looking for something slightly different. There are a couple of ways to get at this.

Tim Burns

03/27/2023, 10:31 AM
Hi Jim, thanks. I think maybe there was some refactoring that moved this out in 3.2.3 (which is what I have). I see the docs here: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.schema.html but there's no schema property (or printSchema) there: https://spark.apache.org/docs/3.3.2/api/python/reference/api/ It looks like they moved it into the DataFrame interface, because this works:
Copy code
# Convert the DeltaTable to a DataFrame, which does expose the schema
table_df = delta_table.toDF()
table_df.show()
table_df.printSchema()  # Good
print(table_df.schema)
I'm guessing this will all become clear in time.
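For what it's worth, the schemaString stored in the Delta log is the JSON form of a Spark StructType, so you can inspect field names and types with plain json and no Spark session at all. The schema string below is a made-up example:
Copy code
import json

# A made-up schemaString, in the shape Delta writes to the log
schema_string = (
    '{"type":"struct","fields":['
    '{"name":"id","type":"long","nullable":true,"metadata":{}},'
    '{"name":"name","type":"string","nullable":true,"metadata":{}}]}'
)

schema = json.loads(schema_string)
fields = {f["name"]: f["type"] for f in schema["fields"]}
print(fields)  # {'id': 'long', 'name': 'string'}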
šŸ‘ 1

Jim Hibbard

03/27/2023, 2:18 PM
Ah, I see what you're saying. I think it's done that way because you can access different versions of the delta table, and they could have different schemas. Maybe there should be a .current_schema property to make this easier.
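That versioning point can be sketched without Spark: replay the _delta_log and keep the last metaData action at or below the requested commit version. The helper name and the local-filesystem, no-checkpoint assumptions are mine, just to illustrate the idea:
Copy code
import json
import os
from os.path import join

def schema_at_version(table_path, version):
    """Return the schema in effect at a given table version by replaying
    metaData actions up to and including that commit (local files only)."""
    delta_dir = join(table_path, "_delta_log")
    schema_string = None
    for file in sorted(os.listdir(delta_dir)):
        if not file.endswith(".json"):
            continue
        # Commit files are zero-padded version numbers, e.g. 00...003.json
        commit_version = int(file.split(".")[0])
        if commit_version > version:
            break
        with open(join(delta_dir, file)) as f:
            for line in f:
                action = json.loads(line)
                if "metaData" in action:
                    schema_string = action["metaData"]["schemaString"]
    return json.loads(schema_string) if schema_string else None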

Tim Burns

03/28/2023, 11:18 AM
Yes, very interesting. I am working on a Delta-Snowflake integration, so I think this is a very interesting perspective. Snowflake also allows for the evolution of a table through time travel, so ideally the Delta table and the time-traveled table should be in sync.