
Kashyap Bhatt

03/07/2023, 5:10 PM
Hi everyone, is there any dummy's guide that describes SCD-Type-2 implementation with Delta Lake/Spark? Specifically, how to *optimally join* with an SCD-Type-2 dimension table while aggregating facts for reporting. I have a working solution with a query. When I run my query in Databricks, it gives me a little warning at the bottom: "_Use range join optimization: This query has a join condition that can benefit from range join optimization. To improve performance, consider adding a range join hint_" and points to this link. Question: is there a more optimal way to query when I'm joining using a `between` condition (instead of an `=` condition)? Or is there something more optimal than `between`?
Sample output.

Matthew Powers

03/07/2023, 5:16 PM
Here’s a PySpark implementation that I created: https://github.com/MrPowers/mack/blob/main/mack/__init__.py#L11-L142. You might be able to compare it with your SQL code.

Kashyap Bhatt

03/07/2023, 6:10 PM
Thanks @Matthew Powers. Sorry, but where is the query? The code above seems to create/upsert into the SCD table. I'm talking about querying the data in the SCD table, not upserting to it.

Quentin Ambard

03/07/2023, 9:47 PM
If you're on Databricks, did you try DLT? It has a native SCD Type 2 implementation and does out-of-order updates for you.

Dominique Brezinski

03/07/2023, 10:07 PM
His question is more about when your query has a non-equality predicate, hence the platform suggesting a range join hint. The right answer is to always have at least one equality in the join predicate if possible, and then use the range join hint to tune the range portion of the predicate. You could also probably run the aggregation over a window function on the range when you want to do something like the sum.

Kashyap Bhatt

03/07/2023, 10:11 PM
> if you're on databricks did you try DLT ?
@Quentin Ambard, DLT would provide ways to populate facts and dimensions. I am talking about using/querying these tables; I already have many working ways to create/maintain my SCD tables. Besides the point, but DLT is too new to be used in our production system.
@Dominique Brezinski, yes, that's exactly what I am looking for. Could you elaborate? Could you give me the queries? (I've attached a fully reproducible standalone example that you should be able to copy, paste, and execute.) Or point me to some better documentation that describes:
• what a range join hint is, and
• how to use a window function for aggregating

Dominique Brezinski

03/08/2023, 7:31 PM
select /*+ RANGE_JOIN(e, 365) */ 
       e.name,
       e.region, 
       year(s.sale_date) as sale_year,
       SUM(s.sold_amt)   as sale_amt
from       sales s
left join  employee_scd2 e
  on     e.name       = s.name
  and    s.sale_date between e.start_date and e.end_date
group by e.name, e.region, year(s.sale_date);
Since `e.start_date` and `e.end_date` are DATE, the bin size is in days, so 365 says to make a bin for each year. It is well described at https://docs.databricks.com/optimizations/range-join.html
Bin size does require tuning based on the type used in the interval and the max range that needs to be addressed. For small data like that it will have no effect on performance, but for large data the impact can be substantial when well tuned.
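As a sanity check, the join shape above (minus the `/*+ RANGE_JOIN */` hint, which is Databricks-specific) runs on any SQL engine that supports `BETWEEN`. A minimal sketch via Python's built-in sqlite3, with made-up sample rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee_scd2 (name TEXT, region TEXT, start_date TEXT, end_date TEXT);
CREATE TABLE sales (name TEXT, sale_date TEXT, sold_amt REAL);
-- made-up sample data: one employee with two SCD2 versions
INSERT INTO employee_scd2 VALUES
  ('ann', 'east', '2020-01-01', '2020-12-31'),
  ('ann', 'west', '2021-01-01', '9999-12-31');
INSERT INTO sales VALUES
  ('ann', '2020-06-15', 100.0),
  ('ann', '2021-03-01', 250.0),
  ('ann', '2021-07-09', 50.0);
""")

# Same shape as the query above: equality on name plus a BETWEEN range.
# sqlite has no year(), so substr() stands in for it.
rows = con.execute("""
SELECT e.name, e.region,
       substr(s.sale_date, 1, 4) AS sale_year,
       SUM(s.sold_amt)           AS sale_amt
FROM sales s
LEFT JOIN employee_scd2 e
  ON  e.name = s.name
  AND s.sale_date BETWEEN e.start_date AND e.end_date
GROUP BY e.name, e.region, sale_year
ORDER BY sale_year
""").fetchall()
print(rows)  # -> [('ann', 'east', '2020', 100.0), ('ann', 'west', '2021', 300.0)]
```

Each sale lands in exactly one SCD2 version, which is what the range predicate guarantees when version intervals don't overlap.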

Kashyap Bhatt

03/08/2023, 8:30 PM
Only after looking at your example did I realize the comment `/*+ RANGE_JOIN(e, 365) */` is the hint.
> You could also probably run the aggregation over a window function on the range when you want to do something like the sum.
Would appreciate an example of doing this using a window function if you have the time. Thank you @Dominique Brezinski.
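One possible reading of the window-function suggestion (a sketch under assumptions, not Dominique's actual answer): instead of a range join, interleave the SCD2 version-start rows with the sales rows into one per-employee stream, then use window functions to carry each version's `region` forward to the sales that follow it. Table/column names follow the query above; the sample rows are made up, and this variant assumes version intervals are contiguous (it ignores `end_date`). sqlite3 is used so it runs anywhere, but the same SQL pattern works in Spark:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee_scd2 (name TEXT, region TEXT, start_date TEXT, end_date TEXT);
CREATE TABLE sales (name TEXT, sale_date TEXT, sold_amt REAL);
-- made-up sample data: one employee with two SCD2 versions
INSERT INTO employee_scd2 VALUES
  ('ann', 'east', '2020-01-01', '2020-12-31'),
  ('ann', 'west', '2021-01-01', '9999-12-31');
INSERT INTO sales VALUES
  ('ann', '2020-06-15', 100.0),
  ('ann', '2021-03-01', 250.0),
  ('ann', '2021-07-09', 50.0);
""")

rows = con.execute("""
WITH events AS (
  -- interleave version starts and sales into one ordered stream
  SELECT name, start_date AS ts, region, NULL AS sold_amt FROM employee_scd2
  UNION ALL
  SELECT name, sale_date AS ts, NULL, sold_amt FROM sales
),
grouped AS (
  -- running count of version rows tells us which SCD2 version each row is under
  SELECT name, ts, region, sold_amt,
         COUNT(region) OVER (
           PARTITION BY name
           ORDER BY ts, (region IS NULL)   -- version row first on date ties
           ROWS UNBOUNDED PRECEDING
         ) AS version_id
  FROM events
),
resolved AS (
  -- each version group holds exactly one non-NULL region: spread it to the sales
  SELECT name, ts, sold_amt,
         MAX(region) OVER (PARTITION BY name, version_id) AS region
  FROM grouped
)
SELECT name, region, substr(ts, 1, 4) AS sale_year, SUM(sold_amt) AS sale_amt
FROM resolved
WHERE sold_amt IS NOT NULL
GROUP BY name, region, sale_year
ORDER BY sale_year
""").fetchall()
print(rows)  # -> [('ann', 'east', '2020', 100.0), ('ann', 'west', '2021', 300.0)]
```

The appeal of this shape is that it replaces the range join with a sort within each `name` partition, which some engines can do without the quadratic blowup a naive non-equi join risks; whether it beats a well-tuned range join in Spark would need measuring.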