Replies: 2 comments
Thanks for the great proposal! Episode-level row granularity seems like the right abstraction - it avoids frame scattering and lets users filter on episode metadata before touching the heavy data. Strategy 2 (lazy frame loading) is also the better choice for the same reasons. A few thoughts:
This keeps everything in the lazy evaluation graph and avoids users needing to call it inside UDFs or loops.
Hey there, thanks for the proposal! Looking forward to seeing LeRobot dataset format support in Daft 🔥 Regarding the strategy, I agree with @plotor on the second suggestion. Accessing frames is the critical path at training time (VLAs are trained on batches of single frames, not episodes), and efforts should be focused there. Accessing episodes (i.e., a chunk of contiguous frames in the dataset) is somewhat more relevant for human visualization, but it should still benefit from several optimizations, as you'll be fetching slices of MP4 files and groups of rows in Parquet files.
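The frame-level access pattern described here can be sketched as a global-to-local index mapping, the way a training dataloader sampling single frames would locate them. This is a hypothetical illustration with made-up episode lengths, not Daft or LeRobot API code:

```python
import bisect

# Illustrative per-episode frame counts; a real loader would read these
# from the per-episode metadata.
episode_lengths = [100, 250, 80]

# Cumulative start offset of each episode in the global frame index space.
offsets = [0]
for n in episode_lengths:
    offsets.append(offsets[-1] + n)

def locate(global_frame: int) -> tuple[int, int]:
    """Map a global frame index to (episode_index, frame_within_episode)."""
    ep = bisect.bisect_right(offsets, global_frame) - 1
    return ep, global_frame - offsets[ep]

print(locate(0))    # (0, 0)
print(locate(120))  # (1, 20)
print(locate(429))  # (2, 79)
```

With the offsets precomputed, each lookup is a binary search, so random single-frame sampling stays cheap even for large datasets.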
LeRobot V3.0 Format Description
A simple example of LeRobot v3.0 dataset file organization:
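For instance, a small dataset might look like the tree below. This layout is reconstructed from the path templates and the explanation that follows; the chunk/file indices and the camera key `observation.images.top` are illustrative:

```
my_dataset/
├── meta/
│   ├── info.json
│   ├── stats.json
│   ├── tasks.jsonl
│   └── episodes/
│       └── chunk-000/
│           └── file-000.parquet
├── data/
│   └── chunk-000/
│       └── file-000.parquet
└── videos/
    └── observation.images.top/
        └── chunk-000/
            └── file-000.mp4
```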
Explanation:
- `meta/info.json`: canonical schema (features, shapes/dtypes), FPS, codebase version, and path templates to locate data/video shards.
- `meta/stats.json`: global feature statistics (mean/std/min/max) used for normalization; exposed as `dataset.meta.stats`.
- `meta/tasks.jsonl`: natural-language task descriptions mapped to integer IDs for task-conditioned policies.
- `meta/episodes/`: per-episode records (lengths, tasks, offsets) stored as chunked Parquet for scalability.
- `data/`: frame-by-frame Parquet shards; each file typically contains many episodes.
- `videos/`: MP4 shards per camera; each file typically contains many episodes.

Daft reads the LeRobot dataset
Currently, Daft doesn't support directly reading the LeRobot data format. However, since a LeRobot dataset is organized as "Parquet + MP4", the LeRobot data files can already be read through the `read_parquet` API. This method has the following issue: `read_parquet` isn't designed for the LeRobot format, so the metadata information is incomplete. That is, the episode frame data itself doesn't contain the dataset's metadata, and users need to manually load the metadata files from their relative paths inside their UDFs.

To support natively reading LeRobot-format data, this proposal puts forward a data reading method organized at episode-row granularity. The core idea is that each row in the DataFrame corresponds to the data of one complete episode, ensuring that frames belonging to an episode will not be scattered. The data organization consists of two parts:
- 【Required】The episode's metadata, loaded from the `meta/episodes` path of the dataset.
- 【Optional】The episode's frame data, loaded from the `data` path of the dataset. Note that it doesn't include the frame data of the corresponding video files.

For the【Required】episode metadata, the typical information includes the shard path templates:
- `data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet`: path template for the Parquet shard containing the episode's frames.
- `videos/{key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4`: path template for the MP4 shard of camera `{key}` containing the episode's frames.

For the【Optional】episode frame data, the typical information is as follows:
The frame-level fields follow the feature schema defined in the dataset's `info.json` file.

Considering that the episode frame data is relatively large, and that frames belonging to the same episode usually need to be aggregated for processing, the following two loading strategies are considered:
- Strategy 1 (eager loading): when scanning, the ScanTask loads the episode's frame data as a column (tentatively named `episode_frames`) after the corresponding episode's metadata. We will consider adding a parameter at the API level to allow users to control whether the ScanTask reads this part of the data. The downside is that the `episode_frames` column may be relatively large.
- Strategy 2 (lazy loading): provide a function in `_lerobot.py` to load all frame label data of a specified episode (tentatively named `load_episode_frames`), allowing users to call this function on demand to obtain frame data, for example by invoking `load_episode_frames` inside a UDF to load frame data during processing.

I am more inclined towards Strategy 2. If you have different opinions or better strategies, please comment below. I need to hear your voices. ❤️❤️❤️
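The Strategy 2 calling pattern can be sketched in plain Python. Here `load_episode_frames` is a hypothetical stub for the proposed helper: instead of resolving the episode's chunk/file from its metadata and reading its row range from the Parquet shard, it returns canned frames, so only the on-demand calling shape is shown:

```python
# Hypothetical stub of the proposed lazy loader (Strategy 2). The real
# helper would resolve the episode's chunk/file from its metadata and
# read only that episode's rows from the Parquet shard; here we return
# canned frames to make the user-facing pattern visible.
def load_episode_frames(episode_index: int) -> list[dict]:
    fake_shards = {
        0: [{"frame_index": 0, "timestamp": 0.000},
            {"frame_index": 1, "timestamp": 0.033}],
        1: [{"frame_index": 0, "timestamp": 0.000}],
    }
    return fake_shards[episode_index]

# User side: each DataFrame row carries only episode metadata; frames
# are pulled on demand (e.g. inside a UDF or a post-collect loop).
episodes = [{"episode_index": 0, "length": 2},
            {"episode_index": 1, "length": 1}]
for ep in episodes:
    frames = load_episode_frames(ep["episode_index"])
    assert len(frames) == ep["length"]
```

The design advantage over Strategy 1 is that the (potentially large) frame payload never materializes in the scan output unless the user explicitly asks for it.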
How can users access the MP4 file corresponding to an episode?
All frames of a specified episode can be obtained through the `episode_frames` column or by calling the `load_episode_frames` function. The chunk and file where a given frame is located can be found through the corresponding fields in the metadata; combined with the timestamp, the frame's content can then be read from the corresponding video file.

Some additional thoughts:
- Add a `dataset_path_column` field to control whether to include the absolute path of the dataset in the data.
- For the dataset-level metadata (`info.json`, `stats.json`, `tasks.parquet`), implement a `broadcast` operator similar to Spark's to broadcast this metadata to all nodes.

Some existing limitations: