
David Conner

04/18/2023, 8:17 AM
I need to import a parquet table to make it accessible via Delta tables. It's about 55GB, but there's an extra string field in all the records that I'd like to remove, since it can be reconstructed from other fields. So I don't want to ingest the data into a Delta table yet, as it would create metadata for each of the records. Ideally, I'd like to ingest it into a new Delta table entirely (clone from `mp` into `lake`). I'm looking at the partition discovery section of the Spark SQL guide (and a few other sources), but I can't quite figure out how my data will be treated when it's pulled in. The directory structure looks like this, except I've used a regex to change those directory names to `participant_id=$id`. Inside each directory there are parquet files where the name is `$id.parquet`, but they're not in a directory named `sequence_id=$id`.
```
.
├── lake
└── mp
    ├── sign_to_prediction_index_map.json
    ├── train.csv
    └── train_landmark_files
        ├── 16069
        ├── 18796
        ├──  ...
        ├── 61333
    └── 62590
```
Is the directory required? Or can I just rename the files to `sequence_id=$id.parquet`? I'm getting ready to try messing with a DLT using just one participant's data to see what happens. I haven't extracted any additional data, but
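For context on the question above: Spark's partition discovery only parses directory names of the form `key=value`; file names are never inspected for partition values, so renaming files to `sequence_id=$id.parquet` would not surface a `sequence_id` column. A minimal PySpark sketch of both behaviors, assuming the renamed `participant_id=$id` directories and a hypothetical redundant column name:

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, regexp_extract

# Delta-enabled session (needed for the Delta write at the end).
builder = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Partition discovery: participant_id=<id> directories become a
# participant_id column; the $id.parquet file names contribute nothing.
df = spark.read.parquet("mp/train_landmark_files")

# Recover sequence_id from the file path instead of the directory layout.
df = df.withColumn(
    "sequence_id",
    regexp_extract(input_file_name(), r"(\d+)\.parquet$", 1).cast("long"),
)

# Drop the redundant string field (hypothetical name) and write the result
# out as a new partitioned Delta table under lake/.
(
    df.drop("redundant_string_col")
    .write.format("delta")
    .partitionBy("participant_id")
    .save("lake/train_landmark_files")
)
```

Since this route rewrites the data anyway, it drops the extra field and lands the cleaned table in `lake` in a single pass.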

Jim Hibbard

04/18/2023, 6:04 PM
Hi David, if you're doing a one-time import, have you considered `CONVERT TO DELTA`? Documentation here.
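A sketch of what that might look like for the layout above, assuming the `participant_id=$id` directories are already in place (the path and partition schema here are illustrative):

```python
from delta.tables import DeltaTable

# Converting partitioned parquet in place requires spelling out the
# partition schema; it is not inferred from the directory names.
DeltaTable.convertToDelta(
    spark,  # a Delta-enabled SparkSession, as in the sketch above
    "parquet.`mp/train_landmark_files`",
    "participant_id INT",
)
```

One caveat: `CONVERT TO DELTA` works in place by adding a transaction log over the existing files, so dropping the extra string field would still take a separate rewrite.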

David Conner

04/18/2023, 6:31 PM
I'll give that a try, thanks

Jim Hibbard

04/18/2023, 6:32 PM
No problem! If it doesn't work for your situation just ping me and we'll hash it out 😄