David Conner
04/18/2023, 8:17 AM
mp into lake)
I'm looking at the partition discovery section of the Spark SQL guide (and a few other sources), but I can't quite figure out how my data will be treated when it's pulled in. The directory structure looks like this, except I've used a regex to change those directory names to participant_id=$id. Inside each directory, there are parquet files where the name is $id.parquet, but they're not in a directory named sequence_id=$id.
.
├── lake
└── mp
    ├── sign_to_prediction_index_map.json
    ├── train.csv
    └── train_landmark_files
        ├── 16069
        ├── 18796
        ├── ...
        ├── 61333
        └── 62590
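For reference, the rename described above (numeric directory names changed to participant_id=$id) could be sketched roughly like this. This is only an illustration, not the exact regex that was used: the function name `rename_participant_dirs` is hypothetical, and it assumes every purely numeric directory under the given root is a participant directory.

```python
import re
from pathlib import Path


def rename_participant_dirs(root: Path) -> list[Path]:
    """Rename purely numeric subdirectories of `root` to participant_id=<id>.

    Hypothetical helper: assumes each all-digit directory name is a participant id.
    """
    renamed = []
    # Materialize the listing first so renames do not disturb the iteration.
    for child in sorted(root.iterdir()):
        if child.is_dir() and re.fullmatch(r"\d+", child.name):
            target = child.with_name(f"participant_id={child.name}")
            child.rename(target)
            renamed.append(target)
    return renamed
```

Called on the train_landmark_files directory, this would leave regular files and already-renamed directories untouched.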
Is the directory required, or can I just rename the files to sequence_id=$id.parquet?
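To make the question concrete, here is a small pure-Python sketch of how Hive-style partition discovery reads a path, as I understand the Spark SQL guide: partition values are taken from key=value directory names between the base path and the file, and the file name itself is not inspected. The function `partition_values` and the example ids are hypothetical, and the sketch keeps values as strings, whereas Spark can also infer column types.

```python
from pathlib import PurePosixPath


def partition_values(file_path: str, base_path: str) -> dict[str, str]:
    """Collect key=value pairs from the directories between base_path and the file.

    Hypothetical illustration of Hive-style partition layout; not Spark's own code.
    """
    rel = PurePosixPath(file_path).relative_to(base_path)
    values = {}
    # rel.parent.parts covers directories only, so the file name is never parsed.
    for part in rel.parent.parts:
        key, sep, value = part.partition("=")
        if sep and key and value:
            values[key] = value
    return values


base = "mp/train_landmark_files"
# Renamed file, no sequence_id directory: only participant_id is picked up.
print(partition_values(f"{base}/participant_id=16069/sequence_id=1000.parquet", base))
# With a sequence_id directory: both keys are picked up.
print(partition_values(f"{base}/participant_id=16069/sequence_id=1000/part.parquet", base))
```

Under that assumption, renaming the files alone would not create a sequence_id column; the directory level would be needed.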
I'm getting ready to try messing with a DLT using just one participant's data to see what happens. I haven't extracted any additional data, but
Jim Hibbard
04/18/2023, 6:04 PM
CONVERT TO DELTA? Documentation here.
David Conner
04/18/2023, 6:31 PM
Jim Hibbard
04/18/2023, 6:32 PM