Dmitry Volodin
03/28/2023, 9:43 AM
I have two tables, A and B. Both tables have a record_date field, for example.
Now what I want is to select the records from A whose record_date > max(B.record_date).
This works fine in Databricks:
select
  ...
from
  A
where
  A.record_date > (select max(record_date) from B)
When I look at the plan, it actually precomputes the subquery and uses the single record_date value for pruning on read.
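A minimal way to see this, assuming a Spark 3.x / Databricks runtime where explain formatted is available (the * stands in for the elided column list):

explain formatted
select *
from A
where A.record_date > (select max(record_date) from B)

In the formatted output the subquery appears as its own node, listed separately from the scan of A and executed before it.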
But now let's suppose we want to compute a few more attributes from B to filter the records in A. A logically equivalent declaration would be:
with constants as (
  select
    max(record_date) as max_record_date,
    min(record_date) as min_record_date,
    now()::date < min(record_date) as is_some_flag
  from
    B
)
select
  ...
from
  A
cross join constants
where
  A.record_date > max_record_date
However, in that case Spark reads everything from A and then does a nested loop join to filter. Example in the screenshot.
Does anyone know how to instruct the Spark optimizer to be smarter in this case?
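One rewrite that is often suggested for this pattern (an illustrative sketch, not a confirmed answer from the thread) is to keep each constant as its own uncorrelated scalar subquery instead of cross joining a CTE, since Spark precomputes each scalar subquery to a single value before planning the scan:

select
  A.*
from
  A
where
  A.record_date > (select max(record_date) from B)
  -- each extra constant becomes its own scalar subquery,
  -- e.g. the boolean flag from the CTE version:
  and not (select now()::date < min(record_date) from B)

Each uncorrelated scalar subquery runs once as a small separate job, so the resulting literal can again be used for pruning on read; the trade-off is that B may be scanned once per subquery.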
Chanukya Pekala
04/12/2023, 1:28 PM
I am trying to create a table with spaces in the column names:
  First Account varchar(50),
  Second Account varchar(50)
)
It returns the error: contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
If I could keep the same column names, it would be easier for me to write the data to SQL Server in a straightforward way!
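For reference, Delta Lake's column mapping in name mode lifts this restriction on column names. A minimal sketch, assuming Databricks Runtime 10.2+ / Delta Lake 1.2+, with a made-up table name accounts:

create table accounts (
  `First Account` varchar(50),
  `Second Account` varchar(50)
) using delta
tblproperties (
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5',
  'delta.columnMapping.mode' = 'name'
)

Name mode decouples the logical column names from the physical Parquet column names, so spaces and the other listed characters become legal; the protocol version bump means older readers and writers can no longer access the table.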