Replies: 2 comments
Thanks for the great proposal! Episode-level row granularity seems like the right abstraction - it avoids frame scattering and lets users filter on episode metadata before touching the heavy data. Strategy 2 (lazy frame loading) is also the better choice for the same reasons. A few thoughts:
This keeps everything in the lazy evaluation graph and avoids users needing to call it inside UDFs or loops.
Hey there, thanks for the proposal! Looking forward to seeing LeRobot dataset format support in Daft 🔥 Regarding the strategy, I agree with @plotor on the second suggestion. Accessing frames is the critical path at training time (VLAs are trained on batches of single frames, not episodes), and efforts should be focused there. Accessing episodes (i.e., a chunk of contiguous frames in the dataset) is somewhat more relevant for human visualization, but it should still benefit from several optimizations, as you'll be fetching slices of MP4 files and groups of rows in Parquet files.
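The frame-level access pattern described here can be sketched as a global-to-local index mapping, the way a training dataloader sampling single frames would locate them. This is a hypothetical illustration with made-up episode lengths, not Daft or LeRobot API code:

```python
import bisect

# Illustrative per-episode frame counts; a real loader would read these
# from the per-episode metadata.
episode_lengths = [100, 250, 80]

# Cumulative start offset of each episode in the global frame index space.
offsets = [0]
for n in episode_lengths:
    offsets.append(offsets[-1] + n)

def locate(global_frame: int) -> tuple[int, int]:
    """Map a global frame index to (episode_index, frame_within_episode)."""
    ep = bisect.bisect_right(offsets, global_frame) - 1
    return ep, global_frame - offsets[ep]

print(locate(0))    # (0, 0)
print(locate(120))  # (1, 20)
print(locate(429))  # (2, 79)
```

With the offsets precomputed, each lookup is a binary search, so random single-frame sampling stays cheap even for large datasets.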
LeRobot V3.0 Format Description
A simple example of LeRobot v3.0 dataset file organization:
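For instance, a small dataset might look like the tree below. This layout is reconstructed from the path templates and the explanation that follows; the chunk/file indices and the camera key `observation.images.top` are illustrative:

```
my_dataset/
├── meta/
│   ├── info.json
│   ├── stats.json
│   ├── tasks.jsonl
│   └── episodes/
│       └── chunk-000/
│           └── file-000.parquet
├── data/
│   └── chunk-000/
│       └── file-000.parquet
└── videos/
    └── observation.images.top/
        └── chunk-000/
            └── file-000.mp4
```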
Explanation:
- `meta/info.json`: canonical schema (features, shapes/dtypes), FPS, codebase version, and path templates to locate data/video shards.
- `meta/stats.json`: global feature statistics (mean/std/min/max) used for normalization; exposed as `dataset.meta.stats`.
- `meta/tasks.jsonl`: natural-language task descriptions mapped to integer IDs for task-conditioned policies.
- `meta/episodes/`: per-episode records (lengths, tasks, offsets) stored as chunked Parquet for scalability.
- `data/`: frame-by-frame Parquet shards; each file typically contains many episodes.
- `videos/`: MP4 shards per camera; each file typically contains many episodes.

Daft reads the LeRobot dataset
Currently, Daft doesn't support directly reading the LeRobot data format. However, since a LeRobot dataset is organized as "Parquet + MP4", the LeRobot data files can already be read through the `read_parquet` API. This method has the following issue: `read_parquet` isn't designed for the LeRobot format, so the metadata information is incomplete. That is, the episode frame data itself doesn't contain the dataset's metadata, and users need to manually load the metadata files from their relative paths inside their UDFs.

To support natively reading LeRobot-format data, this proposal puts forward a data reading method organized at episode-row granularity. The core idea is that each row in the DataFrame corresponds to the data of one complete episode, ensuring that frames belonging to an episode will not be scattered. The data organization consists of two parts:
- 【Required】The episode's metadata, loaded from the `meta/episodes` path of the dataset.
- 【Optional】The episode's frame data, loaded from the `data` path of the dataset. Note that it doesn't include the frame data of the corresponding video files.

For the【Required】episode metadata, the typical information includes the shard path templates:
- `data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet`: path template for the Parquet shard containing the episode's frames.
- `videos/{key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4`: path template for the MP4 shard of camera `{key}` containing the episode's frames.

For the【Optional】episode frame data, the typical information is as follows:
The frame-level fields follow the feature schema defined in the dataset's `info.json` file.

Considering that the episode frame data is relatively large, and that frames belonging to the same episode usually need to be aggregated for processing, the following two loading strategies are considered:
- Strategy 1 (eager loading): when scanning, the ScanTask loads the episode's frame data as a column (tentatively named `episode_frames`) after the corresponding episode's metadata. We will consider adding a parameter at the API level to allow users to control whether the ScanTask reads this part of the data. The downside is that the `episode_frames` column may be relatively large.
- Strategy 2 (lazy loading): provide a function in `_lerobot.py` to load all frame label data of a specified episode (tentatively named `load_episode_frames`), allowing users to call this function on demand to obtain frame data, for example by invoking `load_episode_frames` inside a UDF to load frame data during processing.

I am more inclined towards Strategy 2. If you have different opinions or better strategies, please comment below. I need to hear your voices. ❤️❤️❤️
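The Strategy 2 calling pattern can be sketched in plain Python. Here `load_episode_frames` is a hypothetical stub for the proposed helper: instead of resolving the episode's chunk/file from its metadata and reading its row range from the Parquet shard, it returns canned frames, so only the on-demand calling shape is shown:

```python
# Hypothetical stub of the proposed lazy loader (Strategy 2). The real
# helper would resolve the episode's chunk/file from its metadata and
# read only that episode's rows from the Parquet shard; here we return
# canned frames to make the user-facing pattern visible.
def load_episode_frames(episode_index: int) -> list[dict]:
    fake_shards = {
        0: [{"frame_index": 0, "timestamp": 0.000},
            {"frame_index": 1, "timestamp": 0.033}],
        1: [{"frame_index": 0, "timestamp": 0.000}],
    }
    return fake_shards[episode_index]

# User side: each DataFrame row carries only episode metadata; frames
# are pulled on demand (e.g. inside a UDF or a post-collect loop).
episodes = [{"episode_index": 0, "length": 2},
            {"episode_index": 1, "length": 1}]
for ep in episodes:
    frames = load_episode_frames(ep["episode_index"])
    assert len(frames) == ep["length"]
```

The design advantage over Strategy 1 is that the (potentially large) frame payload never materializes in the scan output unless the user explicitly asks for it.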
How can users access the MP4 file corresponding to an episode?
All frames of a specified episode can be obtained through the `episode_frames` column or by calling the `load_episode_frames` function. The chunk and file where a given frame is located can be found through the corresponding fields in the metadata; combined with the timestamp, the frame's content can then be read from the corresponding video file.

Some additional thoughts:
- Add a `dataset_path_column` field to control whether to include the absolute path of the dataset in the data.
- For the dataset-level metadata (`info.json`, `stats.json`, `tasks.parquet`), implement a `broadcast` operator similar to Spark's to broadcast this metadata to all nodes.

Some existing limitations: