LongCat‑Image is a 6B bilingual (zh/en) text‑to‑image model that uses flow matching and the Qwen‑2.5‑VL text encoder. This guide walks you through setup, data prep, and running a first training/validation with SimpleTuner.
- VRAM: 16–24 GB covers 1024px LoRA at `int8-quanto` or `fp8-torchao`. Full bf16 runs may need ~24 GB.
- System RAM: ~32 GB is normally enough.
- Apple MPS: supported for inference/preview; we already downcast pos‑ids to float32 on MPS to avoid dtype issues.
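As a sanity check on those VRAM figures, here is a back-of-envelope calculation of weight memory for a 6B-parameter model. This is illustrative arithmetic only, not a SimpleTuner measurement; it ignores activations, optimizer state, and LoRA parameters, which is why real runs need headroom beyond these numbers.

```python
# Rough weight-memory arithmetic: bf16 = 2 bytes/param, int8 = 1 byte/param.
def weight_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB."""
    return n_params * bytes_per_param / 1024**3

N = 6e9  # LongCat-Image parameter count
print(f"bf16 weights: {weight_gb(N, 2):.1f} GiB")  # ~11.2 GiB
print(f"int8 weights: {weight_gb(N, 1):.1f} GiB")  # ~5.6 GiB
```

Halving the base weights via `int8-quanto` is what brings a 1024px LoRA run into the 16–24 GB range.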
- Python 3.10–3.13 verified:

  ```bash
  python --version
  ```
- (Linux/CUDA) On fresh images, install the usual build/toolchain bits:

  ```bash
  apt -y update
  apt -y install build-essential nvidia-cuda-toolkit
  ```
- Install SimpleTuner with the right extras for your backend:

  ```bash
  pip install "simpletuner[cuda]"    # CUDA
  pip install "simpletuner[cuda13]"  # CUDA 13 / Blackwell (NVIDIA B-series GPUs)
  pip install "simpletuner[mps]"     # Apple Silicon
  pip install "simpletuner[cpu]"     # CPU-only
  ```
- Quantisation is built in (`int8-quanto`, `int4-quanto`, `fp8-torchao`) and does not need extra manual installs in normal setups.
```bash
simpletuner server
```

Visit http://localhost:8001 and pick model family `longcat_image`.
Key defaults to keep:
- Flow matching scheduler is automatic; no special schedule flags needed.
- Aspect buckets stay 64‑pixel aligned; do not lower `aspect_bucket_alignment`.
- Max token length is 512 (Qwen‑2.5‑VL).
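To see what the 64-pixel alignment means in practice, here is a tiny hypothetical helper (not part of SimpleTuner's API — SimpleTuner handles this internally via `aspect_bucket_alignment`) that snaps an arbitrary resolution down to the grid:

```python
# Hypothetical helper: round each side down to the nearest multiple of 64,
# mirroring what 64-pixel-aligned aspect buckets require.
def snap_to_grid(width: int, height: int, align: int = 64) -> tuple[int, int]:
    return (width // align) * align, (height // align) * align

print(snap_to_grid(1350, 770))    # -> (1344, 768)
print(snap_to_grid(1024, 1024))   # -> (1024, 1024), already aligned
```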
Optional memory savers (pick what matches your hardware):
- `--enable_group_offload --group_offload_type block_level --group_offload_blocks_per_group 1`
- Lower `lora_rank` (4–8) and/or use `int8-quanto` base precision.
- If validation OOMs, reduce `validation_resolution` or steps first.
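Assuming the `config/config.json` keys mirror the CLI flag names (drop the leading `--`), the same memory savers can be expressed as a config fragment like this sketch:

```jsonc
{
  "enable_group_offload": true,
  "group_offload_type": "block_level",
  "group_offload_blocks_per_group": 1,
  "lora_rank": 8,
  "base_model_precision": "int8-quanto"
}
```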
```bash
cp config/config.json.example config/config.json
```

Edit the fields above (`model_family`, `flavour`, `precision`, paths). Point `output_dir` and dataset paths to your storage.
```bash
simpletuner train --config config/config.json
```

Or launch the WebUI and start a run from the Jobs page after selecting the same config.
- Standard captioned image folders (textfile/JSON/CSV) work. Include both zh/en if you want bilingual strength to persist.
- Keep bucket edges on the 64px grid. If you train multi‑aspect, list several resolutions (e.g., `1024x1024,1344x768`).
- The VAE is KL with shift+scale; caches use the built‑in scaling factor automatically.
- Guidance: 4–6 is a good start; leave the negative prompt empty.
- Steps: ~30 for speed checks; 40–50 for best quality.
- Validation preview works out of the box; latents are unpacked before decoding to avoid channel mismatches.
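The textfile-caption layout mentioned above can be sketched as follows. Filenames and captions here are hypothetical; the point is the pairing convention: each image sits next to a same-named `.txt` file holding its caption (bilingual zh/en captions shown, since mixing both helps preserve bilingual strength).

```python
# Sketch of a textfile-captioned dataset layout (paths are illustrative).
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp()) / "my-dataset"
root.mkdir(parents=True)

samples = {
    "cat_01.png": "a fluffy cat on a windowsill / 窗台上的蓬松猫咪",
    "city_02.png": "a rainy neon-lit street at night / 雨夜的霓虹街道",
}
for name, caption in samples.items():
    (root / name).touch()  # stand-in for a real image file
    (root / name).with_suffix(".txt").write_text(caption, encoding="utf-8")

captions = sorted(p.name for p in root.glob("*.txt"))
print(captions)  # -> ['cat_01.txt', 'city_02.txt']
```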
Example (CLI validate):
```bash
simpletuner validate \
  --model_family longcat_image \
  --model_flavour final \
  --validation_resolution 1024x1024 \
  --validation_num_inference_steps 30 \
  --validation_guidance 4.5
```

- MPS float64 errors: handled internally; keep your config on float32/bf16.
- Channel mismatch in previews: fixed by unpacking latents pre‑decode (included in this guide’s code).
- OOM: lower `validation_resolution`, reduce `lora_rank`, enable group offload, or switch to `int8-quanto`/`fp8-torchao`.
- Slow tokenisation: Qwen‑2.5‑VL caps at 512 tokens; avoid very long prompts.
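The latent "unpacking before decode" mentioned above can be illustrated with a toy pixel-unshuffle, assuming the common patch-packing scheme where a sequence of flat `(channels * patch * patch)` vectors is rearranged back into a `(channels, H, W)` grid before the VAE sees it. Shapes and names here are assumptions for illustration, not SimpleTuner internals.

```python
# Toy unpack: sequence of flat patch vectors -> (channels, H, W) grid.
# Each vector is laid out as (c, py, px); tokens are in row-major grid order.
def unpack_latents(seq, grid_h, grid_w, patch, channels):
    H, W = grid_h * patch, grid_w * patch
    out = [[[0.0] * W for _ in range(H)] for _ in range(channels)]
    for idx, vec in enumerate(seq):
        gy, gx = divmod(idx, grid_w)  # which patch in the grid
        for c in range(channels):
            for py in range(patch):
                for px in range(patch):
                    out[c][gy * patch + py][gx * patch + px] = \
                        vec[(c * patch + py) * patch + px]
    return out

# Two 2x2 single-channel patches side by side:
print(unpack_latents([[1, 2, 3, 4], [5, 6, 7, 8]], 1, 2, 2, 1))
# -> [[[1, 2, 5, 6], [3, 4, 7, 8]]]
```

Decoding the still-packed sequence directly is what produces the channel-mismatch errors described above.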
- `final`: main release (best quality).
- `dev`: mid‑training checkpoint for experiments/fine‑tunes.
```jsonc
{
  "model_type": "lora",
  "model_family": "longcat_image",
  "model_flavour": "final",              // options: final, dev
  "pretrained_model_name_or_path": null, // auto-selected from flavour; override with a local path if needed
  "base_model_precision": "int8-quanto", // good default; fp8-torchao also works
  "train_batch_size": 1,
  "gradient_checkpointing": true,
  "lora_rank": 16,
  "learning_rate": 1e-4,
  "validation_resolution": "1024x1024",
  "validation_guidance": 4.5,
  "validation_num_inference_steps": 30
}
```