
rtyler

04/12/2023, 10:52 PM
@Matthew Powers or @Jim Hibbard I need to do some optimization testing. I know that @Denny Lee has had some bigger Delta tables floating around for demos of public data sets. Do y'all have something I can copy into my own S3 or MinIO bucket that can be optimized?

Matthew Powers

04/12/2023, 11:12 PM
The h2o benchmarks are well known and easy to generate. You can easily make 5GB or 50GB datasets. The 50GB generation script only works if you have enough RAM because it’s written in R. Here are the instructions: https://github.com/MrPowers/mrpowers-benchmarks/blob/main/CONTRIBUTING.md#generating-h2o-csv-datasets-with-h2o-r-code Is 50GB large enough for you?

rtyler

04/12/2023, 11:13 PM
that should be sufficient, thanks!

Matthew Powers

04/12/2023, 11:14 PM
Let me know if you’re able to get that script running. I can jump on a call and pair with you in case you hit any snags.
Also, let’s rewrite that script in Rust!!
😆 1

rtyler

04/12/2023, 11:55 PM
well bugger, I think I'm already seeing a challenge with our optimize implementation: it appears to be single-threaded 😛

Matthew Powers

04/12/2023, 11:55 PM
Awesome that you’re running the benchmarks to find out!!

rtyler

04/12/2023, 11:58 PM
all aboard the pain train!~

Matthew Powers

04/13/2023, 12:03 AM
@rtyler - here’s the frawk script someone made to generate the dataset BTW: https://github.com/MrPowers/mrpowers-benchmarks/issues/2#issuecomment-1016311537
Really would be cool to make that script in Rust. Would be a huge benefit for the data community.

rtyler

04/13/2023, 5:33 AM
Looked at the code briefly, why does this need to be converted to Rust? It ran real fast on my machine generating 400MB of data

Matthew Powers

04/13/2023, 4:57 PM
@rtyler - the 50GB generation script only works on machines with 64GB of RAM or more, so most users can’t generate this dataset with the R script. The data is also output to a single file, so really large datasets (say 500GB) can’t be generated.

rtyler

04/13/2023, 4:59 PM
ah, I see. I have never seen R before, so I'm not sure if I'm going to be nerd sniped here 😆 I have enough problems in #delta-rs to consume all my available time
😜 1

Denny Lee

04/13/2023, 5:03 PM
We have some larger datasets like NYC Taxi and the TPC-DS benchmark ones - would those work?