
Roy Green

01/22/2023, 11:47 AM
Hi team, I have a Delta table (~200 GB, 6 billion rows) stored on S3. I want to query, using Databricks SQL, the count of distinct values of one column (string) over a given timeframe. I partitioned by day with auto optimize=True, so the file sizes range between 100-300 MB. I tried Z-ordering on this column, a bloom filter, and a combination of both. I've tried different cluster sizes up to X-Large, and no matter what, query time is a minimum of one minute. Is this the optimum Databricks & Delta can achieve, or am I doing something wrong? Thanks!
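Roughly, the query looks like this (table and column names here are placeholders, not my real schema):

    SELECT COUNT(DISTINCT user_id)
    FROM events
    WHERE event_date BETWEEN '2023-01-01' AND '2023-01-07';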

Lennart Skogmo

01/22/2023, 4:25 PM
Let me first say I'm no expert. Seems like the biggest optimization would be taking advantage of partition pruning, in other words verifying that the partitions outside of your timeframe are not scanned/read in any way at all from S3. Beyond that I think this might be more of a Spark tuning challenge. Depending on what your specific goal for the distinct count is, there are some functions that can give you approximate counts, which should perform better. It might also be worth trying a single big machine with good I/O and lots of RAM over a big cluster of average machines. It could be that the bottleneck is merging the various sets as they come back from the nodes, and even the network I/O associated with that. Something else to look out for could be a lack of partitions/Spark distribution (separate from Delta), maybe? So even if you apply larger clusters, the work isn't being spread out properly and nothing is gained. Since you are using Databricks it might also be worth investigating whether your code can run mostly on Photon. Just some speculation 😛
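For the approximate count, something like this should work (using the same placeholder table/column names as above; approx_count_distinct is a HyperLogLog-based estimate, so it trades a small error for speed):

    SELECT approx_count_distinct(user_id)
    FROM events
    WHERE event_date BETWEEN '2023-01-01' AND '2023-01-07';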

Roy Green

01/22/2023, 6:32 PM
Thanks for the detailed answer, but as far as I understand, most of the things you suggested are impossible on Databricks SQL; I can't switch to fewer but bigger machines and can't control Spark internals such as partitions. The only strategy I can think of right now is to create aggregation tables per hour or something (roughly as sketched below), which is costly and a bit ugly…
I’d be happy to hear if there are some other ideas
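Something along these lines (placeholder names again; keeping one row per distinct value per day keeps the final count exact while scanning far less data):

    CREATE TABLE distinct_values_per_day
    USING DELTA
    PARTITIONED BY (event_date)
    AS SELECT DISTINCT event_date, user_id FROM events;

    -- the dashboard query then runs against the much smaller table
    SELECT COUNT(DISTINCT user_id)
    FROM distinct_values_per_day
    WHERE event_date BETWEEN '2023-01-01' AND '2023-01-07';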

JosephK (exDatabricks)

01/22/2023, 6:36 PM
One important thing is that Z-order only works on the first 32 columns. It's better for high-cardinality data, so it's hard to say without knowing the distinct count. A bloom filter will be good if it's heavier and more unique text. All that said, 1 minute for a 200 GB table doesn't seem too bad to me considering it's close to 1000 files.
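For reference, applying both looks roughly like this (table/column names and the index options are placeholders; note the bloom filter only applies to files written after the index is created):

    CREATE BLOOMFILTER INDEX ON TABLE events
    FOR COLUMNS (user_id OPTIONS (fpp = 0.1, numItems = 50000000));

    -- rewrite/cluster the files by the filter column
    OPTIMIZE events ZORDER BY (user_id);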

Roy Green

01/22/2023, 6:39 PM
Thanks for the answer! There are 6 columns total, and the relevant column is indeed high-cardinality data (a random 48-character string). So, using both Z-order and a bloom filter, do you think this is the best I can get?

JosephK (exDatabricks)

01/22/2023, 6:41 PM
I know that Z-order will help; I'm not sure how much a bloom filter will matter on top of that. I think it's better for longer text, like searching for a particular word, but I'm absolutely not sure about it.

Roy Green

01/22/2023, 6:41 PM
OK, thanks
Worth noting, with Athena on Delta it's a little faster (~40 seconds for the same query).

Omkar

01/23/2023, 12:12 PM
@Roy Green since you're using Databricks, have you tried Delta caching? (https://docs.databricks.com/optimizations/disk-cache.html#delta-cache-renamed-to-disk-cache) In addition to the Z-order indexes you discussed above, I suppose you might be able to leverage this feature to improve your query speed (as it caches the associated Parquet data files for faster retrieval), especially if the count distinct query is going to run frequently (or on a regular interval). Check it out, maybe!
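A rough sketch of warming it up for the timeframe your dashboard hits (table/column names are placeholders; on SSD-backed workers the cache is usually enabled by default, and setting the config may not be allowed on a SQL warehouse):

    -- on a cluster, the disk cache can be enabled explicitly if needed
    SET spark.databricks.io.cache.enabled = true;

    -- pre-load the relevant column/partitions into the cache
    CACHE SELECT user_id FROM events WHERE event_date >= '2023-01-01';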

Roy Green

01/23/2023, 12:14 PM
Thanks Omkar, I think this is enabled by default on Databricks SQL, isn't it? I can confirm that the second time I run the query it's much faster. However, since it's serverless and I turned on automatic shutdown, the cache invalidation happens a lot. Any idea how to overcome that?

Omkar

01/23/2023, 12:34 PM
Yes @Roy Green, if you're using SSD-based workers, I suppose the workers come preconfigured to optimally make use of the disk cache (mentioned here in the first para). Indeed you're right: in autoscaling mode, whenever a worker is shut down the cache is lost and Spark would have to reload the partitions from the source, so in such cases the query might take longer to execute. This too is mentioned in this doc in the Notes section. If you're facing too-frequent cache invalidation due to autoscaling, you can probably try to tweak the Min Idle and Max Capacity settings in your serverless pool config as required. Although please be careful not to set Min Idle too high, as it may cost you more and the workers may stay idle during off-peak hours. More info on best practices here: https://docs.databricks.com/clusters/cluster-config-best-practices.html#autoscaling This talk might also be useful: https://www.databricks.com/dataaisummit/session/scaling-your-workloads-databricks-serverless Hoping this will be helpful to you! 😁

Roy Green

01/23/2023, 12:50 PM
Thanks, it's indeed helpful, but I'm still struggling to optimize the query time itself (regardless of the cache).

Omkar

01/23/2023, 1:54 PM
@Roy Green Can you check the output of EXPLAIN for your query? It may tell you which parts of the query are taking time. Explain syntax here: https://docs.databricks.com/sql/language-manual/sql-ref-syntax-qry-explain.html
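For example (same placeholder names as earlier):

    EXPLAIN FORMATTED
    SELECT COUNT(DISTINCT user_id)
    FROM events
    WHERE event_date BETWEEN '2023-01-01' AND '2023-01-07';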

Roy Green

01/23/2023, 1:54 PM
I did, it's the scan, i.e. the fetching of the files from S3.
Any way to optimize this part?

JosephK (exDatabricks)

01/23/2023, 1:55 PM
Not really. S3 takes some time. Delta does its own listing to avoid expensive/long S3 listings. You have a lot of files to read.

Roy Green

01/23/2023, 1:55 PM
yeah that’s true
the Z-order made the file sizes range between 100-300 MB
is this the expected behavior?

JosephK (exDatabricks)

01/23/2023, 1:58 PM
Yes, plain OPTIMIZE will do ~1 GB files.
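If you want to nudge that, there's a Databricks table property for the target file size; a rough sketch with placeholder names (whether bigger files actually help depends on how much data skipping you'd give up):

    -- ask OPTIMIZE / auto compaction to aim for ~256 MB files (value in bytes)
    ALTER TABLE events SET TBLPROPERTIES ('delta.targetFileSize' = '268435456');
    OPTIMIZE events ZORDER BY (user_id);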

Roy Green

01/23/2023, 1:58 PM
but that conflicts with the file sizes the Z-order produced
so should I avoid doing z-order?

JosephK (exDatabricks)

01/23/2023, 1:59 PM
Depends on how many files you're skipping thanks to Z-order. You can see that in the SQL tab of the query: how many files scanned vs. total.

Roy Green

01/23/2023, 1:59 PM
a lot are skipped
thanks to z-order I guess
but on the other hand, the size of each file is too small

JosephK (exDatabricks)

01/23/2023, 2:00 PM
it’s a balance

Roy Green

01/23/2023, 2:00 PM
I understand
so it seems there's nothing more I can do to optimize, besides aggregation tables
anything I've missed?

JosephK (exDatabricks)

01/23/2023, 2:02 PM
I don’t think so

Roy Green

01/23/2023, 2:02 PM
OK
thanks anyway
Really appreciate it!

Omkar

01/23/2023, 2:09 PM
@Roy Green Yeah, if your goal specifically is to keep a distinct count by a string column, you can probably try storing it in Redis (key: your string column, value: the column's count). Periodically update the counters by checking for new data in your source table. This should be super fast, as this is a typical Redis (hashmap) use case which will return the count for a given key in O(1) time.

Roy Green

01/23/2023, 2:10 PM
but then how can I use these values on my dashboards?

Omkar

01/23/2023, 2:11 PM
Depends, what kind of dashboards are these?

Roy Green

01/23/2023, 2:11 PM
Databricks SQL dashboard

Omkar

01/23/2023, 2:15 PM
Okay, in that case you won't be able to use Redis. But yes, you'll be able to do the same in a Delta table, e.g. if your column's name is book_name then your schema will be something like:

    CREATE TABLE BookCounter (
      book_name  STRING,  -- your string column
      book_count BIGINT   -- its overall count
    ) USING DELTA;

Periodically update this counter table from your source table.
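A rough sketch of that periodic update (source_table and ingest_date are placeholders for your actual table and its date column):

    MERGE INTO BookCounter t
    USING (
      -- count today's new occurrences per value in the source table
      SELECT book_name, COUNT(*) AS cnt
      FROM source_table
      WHERE ingest_date = current_date()
      GROUP BY book_name
    ) s
    ON t.book_name = s.book_name
    WHEN MATCHED THEN UPDATE SET t.book_count = t.book_count + s.cnt
    WHEN NOT MATCHED THEN INSERT (book_name, book_count) VALUES (s.book_name, s.cnt);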

Roy Green

01/23/2023, 2:15 PM
OK I’ll try that
thanks a lot!!