Which RAM config to buy.

24 GB · the entry tier

Runs 7B–14B models comfortably. Great for everyday assistants (Llama 3.1 8B, Qwen 3 14B equivalents, Phi-4 14B). Good speculative-decoding host for a small draft model.

  • Llama 3.1 8B · ~38 tok/s
  • Phi-4 14B · ~22 tok/s
  • Won't fit anything 32B+
48 GB · the sweet spot

Runs every 32B model with headroom — Qwen 2.5 Coder 32B, Qwen 3 32B, DeepSeek R1 Distill 32B. Best price-per-capability tier on the M4 Pro. What r/LocalLLaMA recommends most.

  • Qwen 2.5 Coder 32B · ~10 tok/s
  • Qwen 3 32B · ~10 tok/s
  • Won't comfortably fit 70B
64 GB · the 70B tier

The only M4 Pro config that fits Llama 3.3 70B at Q4_K_M. Tight: ~42 GB model + ~6 GB context + macOS. Close Chrome and Docker. For comfortable 70B work, the M4 Max with 96 GB is the more honest recommendation.

  • Llama 3.3 70B · ~5 tok/s · tight
  • Better fit on M4 Max 96 GB
  • Babysit recommended for long sessions

Models that fit (with headroom).

Every model in our database, sorted by Q4_K_M memory cost. Tier assumes ~40% headroom for context, OS, and a normal dev stack. tok/s estimates are single-batch on the M4 Pro's 273 GB/s bandwidth at ~70% efficiency.

ModelQ4_K_M sizeMin Mac mini tiertok/s on M4 Pro
Gemma 3 1B0.7 GB24 GB+~273 tok/s
TinyLlama 1.1B0.8 GB24 GB+~239 tok/s
Llama 3.2 3B2.0 GB24 GB+~96 tok/s
Phi-4 Mini 3.8B2.5 GB24 GB+~76 tok/s
Gemma 3 4B3.0 GB24 GB+~64 tok/s
Qwen 3 4B3.0 GB24 GB+~64 tok/s
Llama 3.1 8B4.6 GB24 GB+~42 tok/s
Qwen 3.5 9B5.5 GB24 GB+~35 tok/s
Qwen 3 8B5.5 GB24 GB+~35 tok/s
Phi-4 14B8.7 GB24 GB+~22 tok/s
Qwen 3 14B9.5 GB24 GB+~20 tok/s
Kimi VL A3B Thinking9.5 GB24 GB+~20 tok/s
Mistral Small 3.1 24B14.4 GB24 GB+~13 tok/s
Gemma 3 27B16.5 GB24 GB+~12 tok/s
Qwen 2.5 Coder 32B19.0 GB48 GB+~10 tok/s
Qwen 3 32B19.0 GB48 GB+~10 tok/s
Qwen 3 30B-A3B (MoE)19.0 GB48 GB+~10 tok/s
DeepSeek R1 Distill 32B19.0 GB48 GB+~10 tok/s
Llama 3.3 70B40.6 GB64 GB+~5 tok/s

Run a model in 5 minutes.

# 1. install Ollama (or LM Studio for MLX speed)
brew install --cask ollama

# 2. install DevPulse — pre-flight + babysit
brew install --cask devpulse        # (coming soon — DMG meanwhile)

# 3. pre-flight: will it fit?
devpulse ai --before-load 20000 --auto-clean   # 20 GB ≈ 32B Q4_K_M

# 4. pull and run
ollama pull qwen2.5-coder:32b
ollama run qwen2.5-coder:32b

# 5. for hours-long sessions, babysit it
devpulse babysit --target-free-mb 4096 --json > session.log &

What it actually costs to run.

The Mac Mini M4 Pro idles at 4–7 W and pulls 30–50 W under inference. At a typical US electricity rate of $0.16/kWh, an 8-hour-a-day mixed workload runs roughly $20–30/year — about one month of Claude Pro.

For the full break-even math (Mac purchase amortized vs API cost):

Local-vs-cloud cost calculator →

What people get wrong.

Confusing the 24 GB and 64 GB configs

The Mac Mini M4 Pro spec page lists 24 / 48 / 64 GB. Buyers often misremember which they bought. devpulse status reads it from the kernel — the only ground truth.

“Model B = RAM GB” folk rule

Common shorthand on Reddit. It's almost right but ignores quantization and context. A 32B at Q4_K_M is ~20 GB; a 32B at F16 is ~64 GB. Always check the actual quant.

Buying 64 GB for 70B without trimming the stack

70B Q4_K_M = 42 GB. macOS + Chrome + Docker can easily eat 15-20 GB more. On a 64 GB machine the model OOMs. Either close everything, use Q3_K_M (~32 GB), or buy an M4 Max with 96 GB.

Skipping MLX for Ollama

Ollama uses GGUF; LM Studio supports MLX models built specifically for Apple Silicon. Same parameter count, 10-20% faster on Mac. Worth switching for long-running workloads.

The right model on the right Mac.

DevPulse tells you what fits — and what to free up if it doesn't.

Download for macOS

macOS 14+ · Apple Silicon & Intel · Free during launch