Sim-to-Real Style Transfer

January 03, 2026

style-transfer converts Gazebo simulator video into photorealistic factory footage, using your own factory images as the style reference. It runs a ComfyUI pipeline in one of two modes: AnimateDiff for temporal consistency, or per-frame processing with LoRA fine-tuning.

Two Inference Pipelines

Video Pipeline — processes the entire video as one AnimateDiff job in ComfyUI, using temporal attention for smooth output:

./infer.sh video

Image Pipeline — extracts frames via ffmpeg, submits each as a separate ComfyUI job, and reassembles with temporal blending, enabling per-frame control:

./infer.sh frame
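
The extract and reassemble steps of the image pipeline can be sketched in Python as follows. This is a minimal illustration, not the project's implementation: the function names and the frame_%05d.png naming pattern are hypothetical, and the exponential moving average is only one plausible form of temporal blending, since the description above does not say which method infer.sh uses. Frames are modelled as flat lists of pixel intensities to keep the sketch dependency-free.

```python
from typing import List

Frame = List[float]  # a flattened frame of pixel intensities


def ffmpeg_extract_cmd(video: str, out_dir: str) -> List[str]:
    """Build (but do not run) an ffmpeg command that writes every frame as a PNG."""
    # frame_%05d.png is an assumed naming pattern, not taken from the project.
    return ["ffmpeg", "-i", video, f"{out_dir}/frame_%05d.png"]


def blend_frames(frames: List[Frame], alpha: float = 0.7) -> List[Frame]:
    """Blend each stylized frame with the previous blended frame.

    alpha is the weight of the current frame; 1.0 disables blending.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    blended: List[Frame] = []
    for frame in frames:
        if not blended:
            blended.append(list(frame))  # the first frame has no predecessor
            continue
        prev = blended[-1]
        blended.append(
            [alpha * cur + (1.0 - alpha) * old for cur, old in zip(frame, prev)]
        )
    return blended
```

A lower alpha smooths flicker between independently stylized frames at the cost of some ghosting on fast motion, which is the usual trade-off with this kind of blending.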

System Requirements

  • Docker and Docker Compose v2
  • NVIDIA Container Toolkit
  • NVIDIA GPU with ≥ 8 GB VRAM
  • ~20 GB free disk space
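
Some of these requirements can be probed before starting. The sketch below is a rough pre-flight check under stated assumptions: check_host is a hypothetical helper, not part of the project, and it only tests whether the docker and nvidia-smi executables are on PATH and whether the disk has about 20 GB free. It does not verify the Compose version, the NVIDIA Container Toolkit, or the amount of VRAM.

```python
import shutil
from typing import Dict

REQUIRED_FREE_GB = 20  # taken from the requirements list above


def check_host(path: str = ".") -> Dict[str, bool]:
    """Report which locally checkable prerequisites appear to be met."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        # Presence on PATH only; this says nothing about versions.
        "docker": shutil.which("docker") is not None,
        "nvidia_smi": shutil.which("nvidia-smi") is not None,
        "disk": free_gb >= REQUIRED_FREE_GB,
    }
```

Calling check_host() from the project directory returns a dictionary of booleans; any False entry is worth investigating before downloading the models.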

Setup

1. Organize data

data/
├── input/
│   └── your_video.mp4
└── sample_images/
    └── *.jpg / *.png

Set input.video in config.yaml to your video's filename.
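
Before training, it can help to confirm that the directory layout matches the tree above. The following is a small sketch, assuming the data/ layout shown; find_inputs is a hypothetical helper and is not part of the project's scripts.

```python
from pathlib import Path
from typing import List, Tuple

IMAGE_SUFFIXES = {".jpg", ".png"}  # the extensions shown in the tree above


def find_inputs(data_dir: str, video_name: str) -> Tuple[Path, List[Path]]:
    """Return the input video and the sorted style images, or raise if missing."""
    root = Path(data_dir)
    video = root / "input" / video_name
    if not video.is_file():
        raise FileNotFoundError(f"input video not found: {video}")
    images = sorted(
        p
        for p in (root / "sample_images").glob("*")
        if p.suffix.lower() in IMAGE_SUFFIXES
    )
    if not images:
        raise FileNotFoundError("no .jpg or .png files in sample_images/")
    return video, images
```

For example, find_inputs("data", "your_video.mp4") raises FileNotFoundError if the video named in config.yaml is not actually in data/input/.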

2. Download pretrained models (~11 GB)

chmod +x *.sh
./download_models.sh

Downloads: Realistic Vision V5.1, ControlNet canny, IP-Adapter Plus, CLIP Vision ViT-H, VAE, AnimateDiff v2.

3. Train LoRA (learns factory style from your reference images)

./train.sh

Output: models/loras/factory_lora.safetensors

4. Run inference

./infer.sh video    # AnimateDiff pipeline
./infer.sh frame    # Per-frame pipeline

Output: data/output/stylized_video.mp4

The ComfyUI web interface is accessible at http://localhost:8188 during inference.
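
The same port also serves ComfyUI's HTTP API, which accepts queued workflows as JSON posted to /prompt. The sketch below only builds such a request and does not send it. Whether infer.sh submits jobs this way is an assumption, build_prompt_request is a hypothetical helper, and the workflow passed in must be a graph exported from ComfyUI in API format.

```python
import json
import urllib.request
from typing import Any, Dict

COMFYUI_URL = "http://localhost:8188"  # address given above


def build_prompt_request(
    workflow: Dict[str, Any], base_url: str = COMFYUI_URL
) -> urllib.request.Request:
    """Build a POST request that would queue an API-format workflow on ComfyUI."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would submit the job while the container is running.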

Parameter Testing

./test_params.sh

Runs four inference jobs with parameter variations, saving results with descriptive filenames for easy comparison.
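
The idea of descriptive filenames can be illustrated with a short sketch. The parameter names and values here (lora, denoise) are hypothetical placeholders; the actual variations are defined in test_params.sh and may differ.

```python
from typing import Dict, List

# Hypothetical parameter grid; the real one lives in test_params.sh.
VARIATIONS: List[Dict[str, float]] = [
    {"lora": 0.6, "denoise": 0.5},
    {"lora": 0.6, "denoise": 0.7},
    {"lora": 1.0, "denoise": 0.5},
    {"lora": 1.0, "denoise": 0.7},
]


def output_name(params: Dict[str, float], stem: str = "stylized") -> str:
    """Encode the parameter values in the filename so runs are easy to tell apart."""
    tags = "_".join(f"{key}{value:.2f}" for key, value in sorted(params.items()))
    return f"{stem}_{tags}.mp4"
```

With this scheme, output_name(VARIATIONS[0]) gives stylized_denoise0.50_lora0.60.mp4, so the settings behind each result are visible without opening a log.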

style-transfer GitHub repository