Ryan Aston
09/09/2023, 9:21 PM
{"received_date": null, "context_page_path": "/bar/"}
and the resulting path in S3 after the write is <s3://bucket/table/received_date=__HIVE_DEFAULT_PARTITION__/context_page_path=/bar/part-00000-da65176e-fb16-418b-be85-b3a28d3edc11-c000.snappy.parquet>
. It seems like if context_page_path
was ever /bar
it would collide with this partition, even though it's technically a different value.
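A possible workaround here (just a sketch under assumptions, not something confirmed in the thread) is to encode the raw path into a collision-free partition value before writing, so /bar and /bar/ land in distinct partition directories. The df, tablePath and column names below are placeholders for whatever the real pipeline uses:

import java.net.URLEncoder
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical handles: `df` is the incoming DataFrame, `tablePath` the Delta table location.
val encodePath = udf((p: String) => Option(p).map(URLEncoder.encode(_, "UTF-8")).orNull)

df.withColumn("context_page_path", encodePath(col("context_page_path")))
  .write
  .format("delta")
  .mode("append")
  .partitionBy("received_date", "context_page_path")
  .save(tablePath)

With this, "/bar/" and "/bar" become "%2Fbar%2F" and "%2Fbar" and can no longer map to the same directory; reads just have to decode the column (or filter on the encoded value).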
Sasidharan Sangeeth
09/11/2023, 8:49 AM
Anshuman Raina
09/11/2023, 10:49 AM
Gal Stainfeld
09/11/2023, 12:09 PM
Run Job
which allows triggering a workflow from within a workflow.
There is a mandatory Parameters tab that has to have some values in it; these are the parameters passed to the job.
Why is this a mandatory section? In the workflow I trigger, each task already has the parameters required for its own run, so I don't need any parameters to be passed to the job itself.
Is it possible that it's a bug (forcing me to add parameters to the job)? If not, I can't seem to use it, because I have the same parameter key used for various tasks in my workflow with different values.
Thanks.
Athi
09/11/2023, 5:31 PM
Hamad
09/11/2023, 9:04 PM
Markus Hube
09/12/2023, 1:10 PM
df.writeStream.foreachBatch(...)
to merge against a Delta table. The merge is done on a uid column and an hour partition column. But when my data grows in size the query fails due to spark.driver.maxResultSize
. If I understand the Spark plan correctly (see below), there is a broadcast join going on. It clearly makes sense to have merges without shuffles, but is there a way to bypass the driver, since this would not scale well? (See the sketch after the plan.)
Thanks
== Physical Plan ==
AdaptiveSparkPlan (30)
+- == Current Plan ==
HashAggregate (19)
+- Exchange (18)
+- HashAggregate (17)
+- Filter (16)
+- HashAggregate (15)
+- Exchange (14)
+- HashAggregate (13)
+- Project (12)
+- BroadcastHashJoin Inner BuildLeft (11)
:- BroadcastQueryStage (6)
: +- BroadcastExchange (5)
: +- * Project (4)
: +- * Filter (3)
: +- * Filter (2)
: +- * Scan ExistingRDD mergeMaterializedSource (1)
+- Filter (10)
+- Project (9)
+- Project (8)
+- Scan parquet (7)
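Not a confirmed fix, only a sketch of one thing to try: turn off the automatic broadcast inside each batch so the materialized source isn't collected through the driver, and keep the hour partition column in the merge condition so files can be pruned instead. The table path, checkpoint location and the df/spark handles below are placeholders:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

// Placeholder table path; uid and hour follow the columns described above.
val target = DeltaTable.forPath(spark, "s3://bucket/delta-table")

df.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Keep the merge join from being broadcast through the driver for this batch.
    batch.sparkSession.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    target.as("t")
      .merge(batch.as("s"), "t.hour = s.hour AND t.uid = s.uid")
      .whenMatched().updateAll()
      .whenNotMatched().insertAll()
      .execute()
  }
  .option("checkpointLocation", "s3://bucket/_checkpoints/merge") // placeholder
  .start()

Whether this actually removes the bottleneck depends on where maxResultSize is being hit; disabling the broadcast trades driver memory pressure for an extra shuffle.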
Antara
09/12/2023, 3:38 PM
AnalysisException: The provided timestamp (2023-09-08 02:30:00.0) is before the earliest version available to this table (2023-09-08 23:12:30.0). Please use a timestamp after 2023-09-08 23:12:30
I am totally confused 😅 Am I missing something here? Am I setting deleteFileRetentionDuration
or logRetentionDuration
wrong? Also my vacuum command has a RETAIN 360 HOURS clause.
Inputs are appreciated. Thanks 🙂
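For comparison, a minimal sketch of how those two windows are usually set as table properties (the path and intervals here are placeholders): time travel by timestamp can only reach back as far as the retained log/checkpoint history, so it is worth confirming the properties actually landed on the table rather than only in the session conf.

spark.sql("""
  ALTER TABLE delta.`s3://bucket/my_table`
  SET TBLPROPERTIES (
    'delta.logRetentionDuration' = 'interval 30 days',
    'delta.deletedFileRetentionDuration' = 'interval 15 days'
  )
""")

// Inspect what the table actually has set (see the "properties" column).
spark.sql("DESCRIBE DETAIL delta.`s3://bucket/my_table`").show(truncate = false)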
Sangeeth Sasidharan
09/13/2023, 1:28 PM
Alexandra Samakhova
09/13/2023, 3:55 PM
StatisticsCollection#recompute
command?
If not, what would be your advice?
(I'm currently on 2.1.0)
Stephane DUGUET
09/14/2023, 7:40 AM
krishna prasad
09/14/2023, 9:50 AM
let mut config = DeltaOps(table).create().with_columns(schema.get_fields().clone());
I create a DataFusion dataframe with `let df = self.ctx.sql(sql).await.unwrap();`. When I try to write the record batches from this dataframe with `record_batch_writer.write(batch).await?;`, I see that the record batch's schema has changed: the timestamp is data_type: Timestamp(Nanosecond, None)
but the table schema has data_type: Timestamp(Microsecond, None)
. Is there a config to fix this behaviour?
Any help is appreciated.
Sandeep Anumula
09/14/2023, 1:11 PM
Radu Stoian
09/14/2023, 10:11 PM
Ian
09/15/2023, 2:29 AM
No such file or directory: <s3a://bucket/delta-table/part-00000-57b118c1-a1a6-4ea1-aaa6-e2069601e1b6.c000.snappy.parquet>
It is possible the underlying files have been updated. You can explicitly invalidate
the cache in Spark by running ‘REFRESH TABLE tableName’ command in SQL or by
recreating the Dataset/DataFrame involved.
To note, we store our delta tables on MinIO.
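The quoted error is Spark's stale-file-listing message, so a first thing to try (a sketch only; the table name and path are placeholders) is explicitly invalidating the cached listing before re-reading:

// For a catalog table (placeholder name), drop the stale file listing.
spark.sql("REFRESH TABLE my_delta_table")

// For path-based reads, refresh that path or clear all cached entries.
spark.catalog.refreshByPath("s3a://bucket/delta-table")
spark.catalog.clearCache()

If the file was removed by a vacuum or a concurrent overwrite rather than a stale cache, refreshing won't bring it back; that would point at retention/concurrency settings instead.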
Simon Thelin
09/15/2023, 12:35 PM
Christian Daudt
09/15/2023, 4:08 PM
Prashanth Ramanna
09/16/2023, 7:51 PMSupport for disabling Delta checkpointing during commits - For very large tables with millions of files, performing Delta checkpoints can become an expensive overhead during writes. Users can now disable this checkpointing by setting the hadoop configuration propertyI looked into the codebase but the standalone API's do not expose checkpointing capability. So here are my two questions 1. How can I configure another job (preferably using Standalone) to perform checkpointing ? 2. Are there any other recommendations/learnings for handling large delta tables ?toio.delta.standalone.checkpointing.enabled
. This is only safe and suggested to do if another job will periodically perform the checkpointing.false
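On question 1, the flag in that release note is a Hadoop configuration property, so on the write-heavy job it would go onto the Configuration handed to the Standalone DeltaLog, roughly as below. This is only a sketch: the table path is a placeholder, and it deliberately leaves open how the separate periodic checkpointing job is implemented, since (as noted) Standalone does not expose a public checkpoint call.

import org.apache.hadoop.conf.Configuration
import io.delta.standalone.DeltaLog

// High-frequency writer: disable checkpoint creation during commits, using the
// key/value quoted in the release note above. The table path is a placeholder.
val hadoopConf = new Configuration()
hadoopConf.set("io.delta.standalone.checkpointing.enabled", "false")

val log = DeltaLog.forTable(hadoopConf, "s3a://bucket/very-large-table")
// ...commits via log.startTransaction() proceed as usual, just without checkpoints...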
Thanhtan Le
09/18/2023, 6:27 AM
Lori
09/18/2023, 7:01 PM
ConcurrentAppendException
when I write to the same table (using Delta merge
) from 2 EMR clusters (EMR version 6.10.0, Spark 3.3.1, Delta 2.2.0), and there were no entries in DynamoDB after writing. It looks like I did not enable Delta Lake Multi-writer successfully. My questions are:
1. How can I debug this? I checked online, but I did not find useful solutions/debug steps.
2. I use SparkSession.builder().appName().config(…)
to configure the Spark session. Is this correct? Besides the Spark configuration listed in the Setup steps, are there any required Spark configurations I need to set up? (See the sketch below.)
Thank you!
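For what it's worth, a sketch of a builder-based setup for the S3 multi-cluster LogStore, using the conf keys from the Delta S3 multi-cluster documentation. The app name, DynamoDB table name and region are placeholders, and the delta-storage-s3-dynamodb jar has to be on the classpath of both clusters:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("emr-delta-writer") // placeholder
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  // Point every cluster at the same LogStore and DynamoDB table; repeat the
  // logStore line for each URI scheme actually used (s3, s3a, s3n).
  .config("spark.delta.logStore.s3.impl", "io.delta.storage.S3DynamoDBLogStore")
  .config("spark.delta.logStore.s3a.impl", "io.delta.storage.S3DynamoDBLogStore")
  .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log") // placeholder
  .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-east-1")    // placeholder
  .getOrCreate()

Note that the DynamoDB LogStore is about making concurrent S3 writes safe; ConcurrentAppendException can still be raised as a legitimate conflict when two merges touch the same files, so the empty DynamoDB table and the exception are somewhat separate symptoms.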
Alec Ventura
09/18/2023, 9:38 PM
ConcurrentAppendException
error:
Files were added to partition [event_date=2023-09-18] by a concurrent update
This log is from the daily job that has a replaceWhere = event_date between '2023-08-18' and '2023-09-17'
and I am sure that there is no 2023-09-18 data in that dataframe.
So my idea now is to use upsert on the hourly run and continue with replaceWhere on the daily one... but I don't know if the daily job will continue to fail, since it will be an insert and an update operation at the same time... Any thoughts? Thank you
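One angle that is often suggested for this pattern (a sketch only; the path, the hourlyDf/dailyDf handles and the id merge key are placeholders): make the partitions each writer can touch explicitly disjoint in the operation conditions themselves, so the conflict checker can see that the hourly merge and the daily replaceWhere window never overlap.

import io.delta.tables.DeltaTable

val target = DeltaTable.forPath(spark, "s3://bucket/events") // placeholder path

// Hourly job: pin the merge to the run's own partition so it cannot overlap
// the daily job's past-dated window. Substitute the run date; `id` is a placeholder key.
target.as("t")
  .merge(
    hourlyDf.as("s"),
    "t.event_date = '2023-09-18' AND t.event_date = s.event_date AND t.id = s.id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()

// Daily job: keep overwriting only a closed window of past partitions.
dailyDf.write
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", "event_date between '2023-08-18' and '2023-09-17'")
  .save("s3://bucket/events")

If the two windows genuinely never overlap, remaining failures usually point at the predicates not being explicit enough for conflict detection rather than at actual data overlap.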
Lori
09/19/2023, 12:51 AM
Perfect Stranger
09/19/2023, 4:26 AM
Gerhard Brueckl
09/19/2023, 9:22 AM
Sasidharan Sangeeth
09/20/2023, 5:30 AM
Matt Moen
09/20/2023, 3:55 PM
li yuanxing
09/21/2023, 9:17 AM
li yuanxing
09/21/2023, 9:18 AM
li yuanxing
09/21/2023, 9:18 AM