Chunking functionality #1556

Merged
pzelasko merged 9 commits into lhotse-speech:master from nune-tadevosyan:chunking_functionality
Apr 16, 2026

Conversation

@nune-tadevosyan
Contributor

This adds dynamic chunking for audio files, so that the resulting chunks can later be grouped in NeMo.

nune-tadevosyan and others added 5 commits March 9, 2026 18:04
Signed-off-by: Nune <ntadevosyan@nvidia.com>
Signed-off-by: Nune <ntadevosyan@nvidia.com>
Signed-off-by: Nune <ntadevosyan@nvidia.com>
Signed-off-by: Nune <ntadevosyan@nvidia.com>
Collaborator

@pzelasko pzelasko left a comment

@nune-tadevosyan can we add unit tests for this as well?

Signed-off-by: Nune <ntadevosyan@nvidia.com>
cut = _make_cut(duration=30.0)
result = list(cut.cut_into_windows_balanced(min_duration=30, max_duration=40))
assert len(result) == 1
assert result[0].id == cut.id
Collaborator

It should test strict equality, not just id equality.
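To illustrate why an id-only check is weaker, here is a minimal sketch. `FakeCut` is a hypothetical stand-in defined only for this example (it is not the lhotse cut class), and the field names are assumptions. It shows that two objects can share an id while differing in duration, which only a full equality comparison catches.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FakeCut:
    # Hypothetical stand-in for a cut; not the lhotse class.
    id: str
    start: float
    duration: float


original = FakeCut(id="cut-0", start=0.0, duration=30.0)

# Same id, but one second shorter: the kind of regression an id-only check misses.
truncated = replace(original, duration=29.0)

id_only_passes = truncated.id == original.id  # True: the ids still match
strict_passes = truncated == original         # False: the durations differ
```

In the test under review, the analogous change would presumably be to compare `result[0]` against `cut` directly, assuming the cut type defines field-wise equality.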

cut = _make_cut(duration=40.0)
result = list(cut.cut_into_windows_balanced(min_duration=30, max_duration=40))
assert len(result) == 1
assert result[0].id == cut.id
Collaborator

It should test strict equality, not just id equality.

cut = _make_cut(duration=duration)
windows = list(cut.cut_into_windows_balanced(min_duration=30, max_duration=40, overlap=overlap))

assert len(windows) >= 2
Collaborator

It should test for an exact number of windows and specific start/duration values; the assertions are too flexible.
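As a rough sketch of what exact assertions could look like, the example below pins both the window count and every start/duration pair. `balanced_windows` is a hypothetical helper written only for this illustration. It is not the `cut_into_windows_balanced` implementation from this PR, and it ignores `min_duration` and `overlap` entirely. The input is chosen so that the expected values are exact in floating point.

```python
import math


def balanced_windows(duration, max_duration):
    """Hypothetical helper: split [0, duration) into equal, non-overlapping windows.

    Not lhotse's implementation; min_duration and overlap handling are omitted.
    Returns a list of (start, duration) pairs.
    """
    if duration <= max_duration:
        # Short enough already: a single window covering everything.
        return [(0.0, float(duration))]
    # Smallest number of windows that keeps each one within max_duration.
    num_windows = math.ceil(duration / max_duration)
    window = duration / num_windows
    return [(i * window, window) for i in range(num_windows)]


windows = balanced_windows(duration=90.0, max_duration=40.0)

# Pin the exact count and every (start, duration) pair,
# instead of a loose check such as `len(windows) >= 2`.
assert len(windows) == 3
assert windows == [(0.0, 30.0), (30.0, 30.0), (60.0, 30.0)]
```

For inputs that do not divide evenly, comparing with a tolerance (for example `pytest.approx`) is a safer choice than exact float equality.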

assert any(w_id.startswith("long-") for w_id in ids)


@pytest.mark.parametrize("num_jobs", [1, 2])
Collaborator

We can remove this test; it'll be slow due to multiprocessing.

Signed-off-by: Nune <ntadevosyan@nvidia.com>
pzelasko previously approved these changes Mar 23, 2026
Signed-off-by: Nune <ntadevosyan@nvidia.com>
@pzelasko pzelasko merged commit 517a2e8 into lhotse-speech:master Apr 16, 2026
8 checks passed
2 participants