LongCat‑Image is a 6B bilingual (zh/en) text‑to‑image model that uses flow matching and the Qwen‑2.5‑VL text encoder. This guide walks you through setup, data prep, and running a first training/validation with SimpleTuner.
- VRAM: 16–24 GB covers 1024px LoRA at `int8-quanto` or `fp8-torchao`. Full bf16 runs may need ~24 GB.
- System RAM: ~32 GB is normally enough.
- Apple MPS: supported for inference/preview; we already downcast pos‑ids to float32 on MPS to avoid dtype issues.
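As a sanity check on those VRAM figures, here is a back-of-envelope calculation of weight memory for a 6B-parameter model. This is illustrative arithmetic only, not a SimpleTuner measurement; it ignores activations, optimizer state, and LoRA parameters, which is why real runs need headroom beyond these numbers.

```python
# Rough weight-memory arithmetic: bf16 = 2 bytes/param, int8 = 1 byte/param.
def weight_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB."""
    return n_params * bytes_per_param / 1024**3

N = 6e9  # LongCat-Image parameter count
print(f"bf16 weights: {weight_gb(N, 2):.1f} GiB")  # ~11.2 GiB
print(f"int8 weights: {weight_gb(N, 1):.1f} GiB")  # ~5.6 GiB
```

Halving the base weights via `int8-quanto` is what brings a 1024px LoRA run into the 16–24 GB range.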
- Python 3.10–3.13 verified:

  ```bash
  python --version
  ```
- (Linux/CUDA) On fresh images, install the usual build/toolchain bits:

  ```bash
  apt -y update
  apt -y install build-essential nvidia-cuda-toolkit
  ```
- Install SimpleTuner with the right extras for your backend:

  ```bash
  pip install "simpletuner[cuda]"    # CUDA
  pip install "simpletuner[cuda13]"  # CUDA 13 / Blackwell (NVIDIA B-series GPUs)
  pip install "simpletuner[mps]"     # Apple Silicon
  pip install "simpletuner[cpu]"     # CPU-only
  ```
- Quantisation is built in (`int8-quanto`, `int4-quanto`, `fp8-torchao`) and does not need extra manual installs in normal setups.
```bash
simpletuner server
```

Visit http://localhost:8001 and pick model family `longcat_image`.
Key defaults to keep:
- Flow matching scheduler is automatic; no special schedule flags needed.
- Aspect buckets stay 64‑pixel aligned; do not lower `aspect_bucket_alignment`.
- Max token length is 512 (Qwen‑2.5‑VL).
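To see what the 64-pixel alignment means in practice, here is a tiny hypothetical helper (not part of SimpleTuner's API — SimpleTuner handles this internally via `aspect_bucket_alignment`) that snaps an arbitrary resolution down to the grid:

```python
# Hypothetical helper: round each side down to the nearest multiple of 64,
# mirroring what 64-pixel-aligned aspect buckets require.
def snap_to_grid(width: int, height: int, align: int = 64) -> tuple[int, int]:
    return (width // align) * align, (height // align) * align

print(snap_to_grid(1350, 770))    # -> (1344, 768)
print(snap_to_grid(1024, 1024))   # -> (1024, 1024), already aligned
```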
Optional memory savers (pick what matches your hardware):
- `--enable_group_offload --group_offload_type block_level --group_offload_blocks_per_group 1`
- Lower `lora_rank` (4–8) and/or use `int8-quanto` base precision.
- If validation OOMs, reduce `validation_resolution` or steps first.
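Assuming the `config/config.json` keys mirror the CLI flag names (drop the leading `--`), the same memory savers can be expressed as a config fragment like this sketch:

```jsonc
{
  "enable_group_offload": true,
  "group_offload_type": "block_level",
  "group_offload_blocks_per_group": 1,
  "lora_rank": 8,
  "base_model_precision": "int8-quanto"
}
```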
```bash
cp config/config.json.example config/config.json
```

Edit the fields above (`model_family`, `flavour`, `precision`, paths). Point `output_dir` and dataset paths to your storage.
```bash
simpletuner train --config config/config.json
```

Or launch the WebUI and start a run from the Jobs page after selecting the same config.
- Standard captioned image folders (textfile/JSON/CSV) work. Include both zh/en if you want bilingual strength to persist.
- Keep bucket edges on the 64px grid. If you train multi‑aspect, list several resolutions (e.g., `1024x1024,1344x768`).
- The VAE is KL with shift+scale; caches use the built‑in scaling factor automatically.
- Guidance: 4–6 is a good start; leave the negative prompt empty.
- Steps: ~30 for speed checks; 40–50 for best quality.
- Validation preview works out of the box; latents are unpacked before decoding to avoid channel mismatches.
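The textfile-caption layout mentioned above can be sketched as follows. Filenames and captions here are hypothetical; the point is the pairing convention: each image sits next to a same-named `.txt` file holding its caption (bilingual zh/en captions shown, since mixing both helps preserve bilingual strength).

```python
# Sketch of a textfile-captioned dataset layout (paths are illustrative).
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp()) / "my-dataset"
root.mkdir(parents=True)

samples = {
    "cat_01.png": "a fluffy cat on a windowsill / 窗台上的蓬松猫咪",
    "city_02.png": "a rainy neon-lit street at night / 雨夜的霓虹街道",
}
for name, caption in samples.items():
    (root / name).touch()  # stand-in for a real image file
    (root / name).with_suffix(".txt").write_text(caption, encoding="utf-8")

captions = sorted(p.name for p in root.glob("*.txt"))
print(captions)  # -> ['cat_01.txt', 'city_02.txt']
```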
Example (CLI validate):
```bash
simpletuner validate \
  --model_family longcat_image \
  --model_flavour final \
  --validation_resolution 1024x1024 \
  --validation_num_inference_steps 30 \
  --validation_guidance 4.5
```

- MPS float64 errors: handled internally; keep your config on float32/bf16.
- Channel mismatch in previews: fixed by unpacking latents pre‑decode (included in this guide’s code).
- OOM: lower `validation_resolution`, reduce `lora_rank`, enable group offload, or switch to `int8-quanto`/`fp8-torchao`.
- Slow tokenisation: Qwen‑2.5‑VL caps at 512 tokens; avoid very long prompts.
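The latent "unpacking before decode" mentioned above can be illustrated with a toy pixel-unshuffle, assuming the common patch-packing scheme where a sequence of flat `(channels * patch * patch)` vectors is rearranged back into a `(channels, H, W)` grid before the VAE sees it. Shapes and names here are assumptions for illustration, not SimpleTuner internals.

```python
# Toy unpack: sequence of flat patch vectors -> (channels, H, W) grid.
# Each vector is laid out as (c, py, px); tokens are in row-major grid order.
def unpack_latents(seq, grid_h, grid_w, patch, channels):
    H, W = grid_h * patch, grid_w * patch
    out = [[[0.0] * W for _ in range(H)] for _ in range(channels)]
    for idx, vec in enumerate(seq):
        gy, gx = divmod(idx, grid_w)  # which patch in the grid
        for c in range(channels):
            for py in range(patch):
                for px in range(patch):
                    out[c][gy * patch + py][gx * patch + px] = \
                        vec[(c * patch + py) * patch + px]
    return out

# Two 2x2 single-channel patches side by side:
print(unpack_latents([[1, 2, 3, 4], [5, 6, 7, 8]], 1, 2, 2, 1))
# -> [[[1, 2, 5, 6], [3, 4, 7, 8]]]
```

Decoding the still-packed sequence directly is what produces the channel-mismatch errors described above.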
- `final`: main release (best quality).
- `dev`: mid‑training checkpoint for experiments/fine‑tunes.
```jsonc
{
  "model_type": "lora",
  "model_family": "longcat_image",
  "model_flavour": "final",              // options: final, dev
  "pretrained_model_name_or_path": null, // auto-selected from flavour; override with a local path if needed
  "base_model_precision": "int8-quanto", // good default; fp8-torchao also works
  "train_batch_size": 1,
  "gradient_checkpointing": true,
  "lora_rank": 16,
  "learning_rate": 1e-4,
  "validation_resolution": "1024x1024",
  "validation_guidance": 4.5,
  "validation_num_inference_steps": 30
}
```