[Feat] Adds LongCat-AudioDiT pipeline #13390
RuixiangMa wants to merge 11 commits into huggingface:main
Conversation
Signed-off-by: Lancer <maruixiang6688@gmail.com>
Force-pushed from 9c4613f to d2a2621 (compare)
src/diffusers/models/autoencoders/autoencoder_longcat_audio_dit.py: several resolved review threads (outdated)
```python
def _pixel_shuffle_1d(hidden_states: torch.Tensor, factor: int) -> torch.Tensor:
```

Similarly, I think we should inline `_pixel_shuffle_1d` in `UpsampleShortcut`, following #13390 (comment).
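For context, the interleaving a 1D pixel shuffle performs can be illustrated with plain Python lists. This is a hedged sketch only: `pixel_shuffle_1d` and its channel-major layout below follow the usual `torch.pixel_shuffle` convention, and the helper in this PR may differ in detail.

```python
# Illustrative sketch of 1D pixel shuffle semantics (the helper the review
# suggests inlining into UpsampleShortcut). Plain lists stand in for tensors:
# `factor` consecutive channel groups are interleaved along the length axis,
# so (C * factor, L) becomes (C, L * factor).
def pixel_shuffle_1d(channels, factor):
    c_out = len(channels) // factor
    out = []
    for c in range(c_out):
        group = channels[c * factor:(c + 1) * factor]
        # interleave: output[t * factor + f] = group[f][t]
        out.append([group[f][t] for t in range(len(group[0])) for f in range(factor)])
    return out

print(pixel_shuffle_1d([[1, 2], [3, 4]], factor=2))  # [[1, 3, 2, 4]]
```

Inlining this two-line reshape/permute at its single call site, as suggested, avoids a one-off helper without losing clarity.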
src/diffusers/models/transformers/transformer_longcat_audio_dit.py: several resolved review threads (outdated)
```python
self.time_embed = AudioDiTTimestepEmbedding(dim)
self.input_embed = AudioDiTEmbedder(latent_dim, dim)
self.text_embed = AudioDiTEmbedder(dit_text_dim, dim)
self.rotary_embed = AudioDiTRotaryEmbedding(dim_head, 2048, base=100000.0)
self.blocks = nn.ModuleList(
```

See #13390 (comment).
```python
batch_size = hidden_states.shape[0]
if timestep.ndim == 0:
    timestep = timestep.repeat(batch_size)
timestep_embed = self.time_embed(timestep)
text_mask = encoder_attention_mask.bool()
encoder_hidden_states = self.text_embed(encoder_hidden_states, text_mask)
```

Can you also refactor `forward` here so that it is better organized, following #13390 (comment)? See for example the `QwenImageTransformer2DModel.forward` method.
Reorganized parts of `forward` incrementally; kept the current structure otherwise to avoid unnecessary behavioral churn.
Thanks, PTAL.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@bot /style

Style bot fixed some files and pushed the changes.

These CI failures do not appear to be related to this PR.
```python
def _get_uniform_flow_match_scheduler_sigmas(num_inference_steps: int) -> list[float]:
    num_inference_steps = max(int(num_inference_steps), 2)
    num_updates = num_inference_steps - 1
```

I think we should define `num_inference_steps` to match the number of function evaluations we're performing (that is, to have the same semantics that `num_updates` currently has), which is the usual diffusers behavior. This would also allow us to remove the behavior where we overwrite `num_inference_steps = 1` below in `__call__`.
```python
    return {key[len(prefix) :]: value for key, value in state_dict.items() if key.startswith(prefix)}


def _get_uniform_flow_match_scheduler_sigmas(num_inference_steps: int) -> list[float]:
```

I think we should inline `_get_uniform_flow_match_scheduler_sigmas` into `__call__` so that it's easier to understand how the sigma schedule is being prepared. See e.g. `Flux2Pipeline` for an example of this. We generally prefer not to have too many small functions in the pipeline code.
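A uniform flow-matching sigma schedule with the semantics the review asks for (steps counted as function evaluations) is short enough to inline. A hedged sketch, assuming a simple linear schedule from 1.0 to 0.0; the function name is illustrative and the PR's actual schedule may differ:

```python
# Sketch of the inlined schedule: num_inference_steps counts scheduler updates,
# so the sigma list has num_inference_steps + 1 entries and num_inference_steps=1
# needs no special-casing.
def uniform_flow_match_sigmas(num_inference_steps: int) -> list[float]:
    return [1.0 - i / num_inference_steps for i in range(num_inference_steps + 1)]

print(uniform_flow_match_sigmas(4))  # [1.0, 0.75, 0.5, 0.25, 0.0]
print(uniform_flow_match_sigmas(1))  # [1.0, 0.0]
```

With these semantics, `num_inference_steps=1` yields exactly one Euler update, so the overwrite in `__call__` becomes unnecessary.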
```python
def _approx_duration_from_text(text: str | list[str], max_duration: float = 30.0) -> float:
    if isinstance(text, list):
        if not text:
            return 0.0
        return max(_approx_duration_from_text(prompt, max_duration=max_duration) for prompt in text)

    en_dur_per_char = 0.082
```

Suggested change:

```diff
-    if isinstance(text, list):
-        if not text:
-            return 0.0
-        return max(_approx_duration_from_text(prompt, max_duration=max_duration) for prompt in text)
+    if not text:
+        return 0.0
+    if isinstance(text, str):
+        text = [text]
```

nit: I think refactoring this function to be non-recursive (by making it work naturally with a list of strings) would make it more clear.
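The non-recursive shape the reviewer suggests might look like the sketch below. Only the English per-character constant (0.082 s) appears in the PR diff; the clamp to `max_duration` mirrors the original signature, and the real helper likely also weights non-English characters, which is omitted here.

```python
# Hedged sketch of the suggested non-recursive refactor: normalize the input
# to a list up front, then take one pass over it (longest prompt wins).
def approx_duration_from_text(text, max_duration: float = 30.0) -> float:
    if not text:
        return 0.0
    if isinstance(text, str):
        text = [text]
    en_dur_per_char = 0.082  # rough seconds of speech per English character
    return min(max(len(prompt) * en_dur_per_char for prompt in text), max_duration)

print(approx_duration_from_text(["hello", "a longer prompt"]))  # ~1.23
```

Normalizing `str` to `[str]` at the top removes the recursion while keeping both call shapes working.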
```python
first_hidden = F.layer_norm(first_hidden, (first_hidden.shape[-1],), eps=1e-6)
prompt_embeds = prompt_embeds + first_hidden
lengths = attention_mask.sum(dim=1).to(device)
return prompt_embeds.float(), lengths
```

Suggested change:

```diff
-    return prompt_embeds.float(), lengths
+    return prompt_embeds, lengths
```

Do we need to call `.float()` on `prompt_embeds` here? I think we should generally respect the output dtype from `self.text_encoder`.
```python
self.scheduler.set_begin_index(0)
timesteps = self.scheduler.timesteps
sample = latents
```

I think using the standard name `latents` instead of `sample` would be more clear. It would also work better with `PipelineTesterMixin` tests.
```python
if latents is None:
    duration = max(1, min(duration, max_duration))

text_condition, text_condition_len = self.encode_prompt(normalized_prompts, device)
```

Suggested change:

```diff
-    text_condition, text_condition_len = self.encode_prompt(normalized_prompts, device)
+    prompt_embeds, text_condition_len = self.encode_prompt(normalized_prompts, device)
```

Similarly to #13390 (comment), I think using the standard name `prompt_embeds` would be better here.
```python
if not return_dict:
    return (waveform,)
```

Suggested change:

```diff
+    self.maybe_free_model_hooks()
     if not return_dict:
         return (waveform,)
```

Calling `self.maybe_free_model_hooks()` allows the pipeline to clear model hooks correctly, such as those used to support model offloading.
```python
if output_type == "latent":
    if not return_dict:
        return (sample,)
    return AudioPipelineOutput(audios=sample)
```

Suggested change:

```diff
-    if output_type == "latent":
-        if not return_dict:
-            return (sample,)
-        return AudioPipelineOutput(audios=sample)
+    if output_type == "latent":
+        waveform = sample
```

A little simpler. Also makes it so that we don't have to call `self.maybe_free_model_hooks()` twice (see #13390 (comment)).
```python
    latent_cond=latent_cond,
).sample
pred = null_pred + (pred - null_pred) * guidance_scale
sample = self.scheduler.step(pred, t, sample, return_dict=False)[0]
```

Suggested change:

```diff
     sample = self.scheduler.step(pred, t, sample, return_dict=False)[0]
+    if callback_on_step_end is not None:
+        callback_kwargs = {}
+        for k in callback_on_step_end_tensor_inputs:
+            callback_kwargs[k] = locals()[k]
+        callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
+        latents = callback_outputs.pop("latents", latents)
+        prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
```

Example for supporting callbacks. This assumes we use the standard names `latents` and `prompt_embeds` (see #13390 (comment), #13390 (comment)). See also how e.g. `Flux2Pipeline` supports callbacks: diffusers/src/diffusers/pipelines/flux2/pipeline_flux2.py, lines 993 to 997 in dc8d903.
```python
guidance_scale: float = 4.0,
generator: torch.Generator | list[torch.Generator] | None = None,
output_type: str = "np",
return_dict: bool = True,
```

Suggested change:

```diff
     return_dict: bool = True,
+    callback_on_step_end: Callable[[int, int], None] | None = None,
+    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
```

Follow-up for callback support (see #13390 (comment)).
```python
class LongCatAudioDiTPipeline(DiffusionPipeline):
    model_cpu_offload_seq = "text_encoder->transformer->vae"
```

Suggested change:

```diff
     model_cpu_offload_seq = "text_encoder->transformer->vae"
+    _callback_tensor_inputs = ["latents", "prompt_embeds"]
```

Follow-up for callback support (#13390 (comment)). The callback tests specifically check for the name `latents` here, which is one reason to use it over `sample`.
```python
class LongCatAudioDiTTransformer(ModelMixin, ConfigMixin):
    _supports_gradient_checkpointing = False
```

Suggested change:

```diff
     _supports_gradient_checkpointing = False
+    _repeated_blocks = ["AudioDiTBlock"]
```

Setting `_repeated_blocks` here enables regional compilation support. This also allows us to not skip the `TestLongCatAudioDiTTransformerCompile.test_torch_compile_repeated_blocks` test.
What does this PR do?
Adds LongCat-AudioDiT model support to diffusers.
Although LongCat-AudioDiT can be used for TTS-like generation, it is fundamentally a diffusion-based audio generation model (text conditioning + iterative latent denoising + VAE decoding) rather than a conventional autoregressive TTS model, so I think it fits naturally into diffusers.
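The "iterative latent denoising" step mentioned above can be sketched as a plain Euler flow-matching loop. This is an illustration only, not the pipeline's actual code: the real `__call__` uses the diffusers scheduler and the LongCat transformer, and `velocity_model` here is a stand-in name.

```python
# Minimal sketch of a flow-matching denoising loop: walk the latent along the
# predicted velocity field from sigma=1 (pure noise) down to sigma=0 (clean).
def denoise(latent, sigmas, velocity_model):
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = velocity_model(latent, sigma)           # model predicts velocity at this noise level
        latent = latent + (sigma_next - sigma) * v  # Euler step toward the next sigma
    return latent

# Toy check with a scalar "latent" and an identity velocity model:
print(denoise(1.0, [1.0, 0.5, 0.0], lambda x, s: x))  # 0.25
```

In the actual pipeline, the result of this loop is then decoded to a waveform by the VAE.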
Test
Result
longcat.wav
Before submitting: see the documentation guidelines and the tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.