Amit Panwar (05/12/2023, 1:00 PM):
from delta.tables import *
from pyspark.sql.functions import *

deltaTable = DeltaTable.forPath(spark, "tmp/delta-table")

# Update every even value by adding 100 to it
deltaTable.update(condition=expr("id % 2 == 0"), set={"id": expr("id + 100")})

I want to run this code locally with PySpark.
Robin Moffatt (05/12/2023, 3:06 PM):
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 14.0 failed 1 times, most recent failure: Lost task 2.0 in stage 14.0 (TID 66) (1e1dc78cf259 executor driver): java.io.FileNotFoundException:
No such file or directory: <s3a://example/data_load/raw/soil/_delta_log/00000000000000000001.json>
It is possible the underlying files have been updated. You can explicitly invalidate
the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by
recreating the Dataset/DataFrame involved.
❓ So, on to my question:
I've found that if I restart the Jupyter kernel I can re-query the delta table successfully.
However, I also thought I'd be able to re-create the spark session:
spark.stop()
spark = SparkSession.builder.appName […]
to the same effect - but doing this I still get the FileNotFound
exception, suggesting that something's cached somewhere.
I've tried the suggestion in the error message, but this also throws the FileNotFound exception:
refresh table <delta.s3a://example/data_load/my/table>
Is there a way to programmatically flush whatever's been cached, without the "turn it off and turn it back on again" approach of restarting the kernel?
John Darrington (05/17/2023, 12:57 AM):
delta crate and could shed some light on how to hook it up to the storage providers it says it supports?
Ainesh Pandey (05/17/2023, 3:47 PM):
df_final.where(col('dh_state_name') == 'Wisconsin') \
    .where(col('dh_place_name') == 'Madison') \
    .where(col('property_type') == 'Single-family') \
    .display()

These should be showing the exact same results, but sometimes it randomly says "Query returned no results". There is no consistency either; the exact same code will show different results across runs. All of the code is deterministic, so I'm confused about why this is happening. Any thoughts?
Ovi (05/19/2023, 3:45 PM):
The _delta_log directory is not created or, if created, it's left empty, so the resulting data folder isn't considered to be a Delta table.
The command is a basic one:
sourceDf.write
  .format("delta")
  .mode(SaveMode.ErrorIfExists)
  .partitionBy("date")
  .option("mergeSchema", "true")
  .save(deltaLocation)
If anyone has an idea on this, please share...
Thank you in advance,
Ovi
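For debugging cases like Ovi's, a filesystem-level check can confirm whether a write actually produced a commit. The supported API for this is DeltaTable.isDeltaTable(spark, path); the helper below is only a heuristic sketch for local paths, built on the assumption that a Delta table's directory contains a non-empty _delta_log folder with at least one commit JSON file:

```python
import os


def looks_like_delta_table(path: str) -> bool:
    """Heuristic, local-filesystem-only check (not the official API):
    a Delta table directory holds a _delta_log folder containing at
    least one commit JSON file such as 00000000000000000000.json."""
    log_dir = os.path.join(path, "_delta_log")
    if not os.path.isdir(log_dir):
        return False
    return any(name.endswith(".json") for name in os.listdir(log_dir))
```

For object stores like s3a:// this simple os-based walk does not apply; there the DeltaTable.isDeltaTable check against the configured filesystem is the safer route.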