Running Wan2.1 Text-to-Video Model on macOS

llm neural network wan2.1 on May 20, 2025

I've been experimenting with running the Wan2.1 text-to-video model on macOS. With help from Cursor AI, I adapted it to run on M1 Pro chips using Metal Performance Shaders (MPS).

Source on GitHub

Adapting Wan2.1 for macOS

The keys to making Wan2.1 work on macOS were:

Using MPS (Metal Performance Shaders) instead of CUDA
Setting PYTORCH_ENABLE_MPS_FALLBACK=1 for CPU fallback
Adjusting memory usage with --offload_model True and --t5_cpu
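
The first two points can be sketched in Python as follows. This is a minimal illustration and not the code from the repository: pick_device is a hypothetical helper, and the environment variable is set before torch is imported on the assumption that PyTorch reads it when the library loads.

```python
import os

# Set this before importing torch so operators that MPS does not
# implement can fall back to the CPU instead of raising an error.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def pick_device(preferred="mps"):
    """Return 'mps', 'cuda', or 'cpu' depending on what this machine supports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # torch is not installed; nothing else to probe
    mps = getattr(torch.backends, "mps", None)  # missing on old torch builds
    if preferred == "mps" and mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```

On an Apple Silicon Mac with a recent PyTorch build this should select mps; elsewhere it falls through to cuda or cpu.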

Results

I generated these first samples on a MacBook Pro with an M4 Max and 128GB of RAM, then concatenated them with ffmpeg. The process used around 100GB of RAM.

Command to generate multiple short videos (call generate.py instead to generate a single video):

export PYTORCH_ENABLE_MPS_FALLBACK=1
python generate_loop.py \
  --task t2v-1.3B \
  --size "832*480" \
  --frame_num 48 \
  --sample_steps 15 \
  --ckpt_dir ./Wan2.1-T2V-1.3B \
  --offload_model True \
  --t5_cpu \
  --device mps \
  --num_videos 15 \
  --output_dir loop_output \
  --concat \
  --concat_output psychedelic_journey.mp4 \
  --prompt "A mesmerizing journey through a kaleidoscopic dimension where reality bends and flows. Vibrant colors swirl and morph into impossible geometries, creating a hypnotic dance of light and form. The scene continuously evolves, with each moment revealing new patterns and textures that seem to breathe and pulse with energy. The colors shift between electric blues, neon purples, and molten golds, creating a sense of infinite depth and movement."
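
A wrapper like generate_loop.py could assemble one generate.py call per clip. The sketch below is hypothetical and is not the script from the repository: build_commands and the clip file names are made up for illustration, and generate.py is assumed to accept --base_seed and --save_file. The remaining flags are copied from the command above. Nothing is executed; the function only builds the argument lists.

```python
import shlex
import sys
from pathlib import Path


def build_commands(prompt, num_videos, output_dir="loop_output", base_seed=0):
    """Return one generate.py argument list per clip (nothing is run here)."""
    out = Path(output_dir)
    commands = []
    for i in range(num_videos):
        commands.append([
            sys.executable, "generate.py",
            "--task", "t2v-1.3B",
            "--size", "832*480",
            "--frame_num", "48",
            "--sample_steps", "15",
            "--ckpt_dir", "./Wan2.1-T2V-1.3B",
            "--offload_model", "True",
            "--t5_cpu",
            "--device", "mps",
            "--prompt", prompt,
            # Assumed flags: a distinct seed and a zero-padded file name per clip.
            "--base_seed", str(base_seed + i),
            "--save_file", str(out / f"clip_{i:03d}.mp4"),
        ])
    return commands


if __name__ == "__main__":
    for cmd in build_commands("A kaleidoscopic journey", num_videos=3):
        # To actually run a clip, pass cmd to subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```

Zero-padded names keep the clips in generation order when they are later sorted for concatenation.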

Command to concatenate videos.

ffmpeg -y -f concat -safe 0 -i <(for f in loop_output/*.mp4; do echo "file '$PWD/$f'"; done) -c copy output/concatenated.mp4
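
The one-liner above relies on shell process substitution. As a hedged alternative for shells that lack it, the same list can be written to a file from Python. The helper names concat_list and ffmpeg_command are hypothetical; the ffmpeg options are the ones used in the command above.

```python
from pathlib import Path


def concat_list(clip_dir):
    """Return concat-demuxer list text for every .mp4 in clip_dir, sorted by name."""
    lines = []
    for clip in sorted(Path(clip_dir).glob("*.mp4")):
        # Inside single quotes, a literal ' has to be written as '\''.
        escaped = str(clip.resolve()).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "".join(f"{line}\n" for line in lines)


def ffmpeg_command(list_file, output):
    """Build the stream-copy concat command, reading the clip list from a file."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", str(list_file), "-c", "copy", str(output)]


if __name__ == "__main__":
    Path("clips.txt").write_text(concat_list("loop_output"))
    # Pass this list to subprocess.run(..., check=True) to run ffmpeg.
    print(ffmpeg_command("clips.txt", "output/concatenated.mp4"))
```

Because -c copy does not re-encode, this only works cleanly when all clips share the same codec and resolution, which holds for clips produced by a single generation run.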

Another compilation in portrait mode.

Memory Optimization

The 1.3B model works better than the 14B model on Mac hardware. Key findings:

Generating 32 frames at 480x832 resolution works reliably
Higher resolutions or frame counts can cause memory issues
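
As a rough illustration of why frame count and resolution matter, the sketch below computes the size of the raw decoded video tensor alone, assuming float32 RGB values. This is only a lower bound under that assumption: it ignores model weights and intermediate activations, which likely account for most of the memory in practice. video_tensor_mib is a hypothetical helper.

```python
def video_tensor_mib(frames, width, height, channels=3, bytes_per_value=4):
    """Size in MiB of an uncompressed video tensor (float32 RGB by default)."""
    return frames * width * height * channels * bytes_per_value / 2**20


if __name__ == "__main__":
    base = video_tensor_mib(32, 480, 832)    # 146.25
    longer = video_tensor_mib(48, 480, 832)  # 219.375
    print(base, longer, longer / base)       # the ratio is 1.5
```

The raw output grows linearly with frame count and pixel count, so even this smallest component gets 50% larger when going from 32 to 48 frames.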