Setup guide · 15 minutes · offline-capable
A quantized 70B model running on a laptop with no internet is no longer a stunt. With unified memory at 64 GB+ and Ollama's mature local-first runtime, it's a workflow.
This guide covers the full setup: realistic memory budget, the actual tokens/second you should expect, the agent-loop pattern people use for long batches, and the operational fixes that keep the model from OOM-crashing halfway through your task.
What you actually need
Llama 3.3 70B at Q4_K_M needs ~42 GB of unified memory just for the weights. Your Mac also needs enough headroom for context (~4–8 GB) and the rest of your stack (browser, terminal, Docker if you use it). Realistic minimum: a 64 GB Mac with a deliberately lean foreground.
| Mac config | Verdict | Notes |
|---|---|---|
| M4 Max / M3 Max, 64 GB | Runs comfortably | ~42 GB model + 6 GB headroom — close Chrome and Docker first. |
| M4 Max / M3 Ultra, 96 GB+ | Runs great | Can stay productive in other apps while it runs. |
| M4 Pro, 48 GB | Tight | Try Q3_K_M (~32 GB) instead, or step down to a 32B. |
| M3/M4, 32 GB or less | Won't fit | Use Llama 3.1 8B or Qwen 2.5 Coder 32B instead. |
Full breakdown: /can-i-run/llama3.3-70b →
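The budget above is simple subtraction, and it can help to run it for your own machine. The sketch below uses only the figures quoted in this section (~42 GB for Q4_K_M, ~32 GB for Q3_K_M, 4 to 8 GB of context). The function name `remaining_gb` is made up for illustration, and the result is what is left for macOS and every other app, not a guarantee that the model will load.

```python
def remaining_gb(total_gb: float, weights_gb: float, context_gb: float) -> float:
    """Memory left for macOS and other apps once weights and context are resident."""
    return total_gb - weights_gb - context_gb


# (label, installed unified memory in GB, weight size in GB) taken from the table above.
CONFIGS = [
    ("64 GB Mac, Q4_K_M", 64, 42),
    ("96 GB Mac, Q4_K_M", 96, 42),
    ("48 GB Mac, Q4_K_M", 48, 42),
    ("48 GB Mac, Q3_K_M", 48, 32),
    ("32 GB Mac, Q4_K_M", 32, 42),
]

for label, total, weights in CONFIGS:
    worst = remaining_gb(total, weights, context_gb=8)  # large context
    best = remaining_gb(total, weights, context_gb=4)   # small context
    if best < 0:
        print(f"{label}: weights alone exceed memory")
    else:
        print(f"{label}: {worst} to {best} GB left for everything else")
```

A negative lower bound, as on the 48 GB configuration at Q4_K_M, is the arithmetic behind the "Tight" verdict: the model only fits if the context stays small and almost nothing else is running.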
The honest numbers
A viral post claimed 71 tok/s for Llama 3.3 70B on an M4 MacBook Pro. That figure is fiction. The realistic ceiling on M-series unified memory for a 70B at Q4_K_M is closer to 8–20 tok/s, bottlenecked by memory bandwidth, not FLOPS.
~12–18 tok/s single-batch on a quantized 70B. With speculative decoding (a small draft model), 20+ is reachable.
~8–14 tok/s. Slightly slower than M4 Max but still faster than most people read.
~18–25+ tok/s. The top-end Studio is genuinely fast on 70Bs and handles bigger contexts gracefully.
Why bandwidth matters: generating each token requires streaming the active weights through memory once. A 42 GB Q4_K_M model on 546 GB/s bandwidth gives a theoretical ceiling of about 13 tok/s before any other overhead. Quoting numbers above the bandwidth-derived ceiling without speculative decoding is a tell that something else is being measured (or fabricated).
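The ceiling is a one-line division, so it is easy to sanity-check any benchmark claim. The sketch below assumes the simplified model described above (every token streams the full weight file through memory once, no other overhead) and reuses the 546 GB/s figure and the quant sizes quoted in this guide. The function names are illustrative.

```python
def ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb


def required_bandwidth_gb_s(tok_s: float, weights_gb: float) -> float:
    """Memory bandwidth a claimed speed would need under the same model."""
    return tok_s * weights_gb


# Weight sizes in GB as quoted in this guide.
QUANT_SIZES_GB = {"Q3_K_M": 32, "Q4_K_M": 42, "Q5_K_M": 50, "Q6_K": 57}

for name, size_gb in QUANT_SIZES_GB.items():
    print(f"{name}: ceiling of {ceiling_tok_s(546, size_gb):.1f} tok/s at 546 GB/s")

# The viral 71 tok/s claim, expressed as the bandwidth it would require.
print(f"71 tok/s at Q4_K_M needs {required_bandwidth_gb_s(71, 42):.0f} GB/s")
```

Under this model, 71 tok/s on a 42 GB file implies roughly 2,982 GB/s of sustained bandwidth, several times the 546 GB/s used in the calculation above. Larger quants lower the ceiling in proportion to their size, which is the speed cost of moving up from Q4_K_M.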
The setup · 15 minutes
devpulse CLI runs the pre-flight check.
The agent-loop pattern
The reason the “11-hour-flight” pattern works isn't the model. It's the loop: process a task, save output to disk, advance, save state every N tasks. devpulse babysit's NDJSON event stream feeds directly into that pattern.
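The loop itself can be sketched in a few lines. This is a minimal illustration of the process, save, advance, checkpoint cycle, not devpulse's implementation: the names `run_batch`, `save_state`, and `load_next_index` are hypothetical, and `process` stands in for whatever function sends one task to your local model and returns its text.

```python
import json
from pathlib import Path
from typing import Callable, Sequence


def save_state(state_file: Path, next_index: int) -> None:
    """Write the checkpoint to a temp file, then swap it in with a rename."""
    tmp = state_file.with_suffix(".tmp")
    tmp.write_text(json.dumps({"next_index": next_index}), encoding="utf-8")
    tmp.replace(state_file)


def load_next_index(state_file: Path) -> int:
    """Return the index to resume from, or 0 if no checkpoint exists."""
    if state_file.exists():
        return json.loads(state_file.read_text(encoding="utf-8"))["next_index"]
    return 0


def run_batch(
    tasks: Sequence[str],
    process: Callable[[str], str],
    out_dir: Path,
    state_file: Path,
    save_every: int = 5,
) -> int:
    """Process tasks in order, resuming from the last checkpoint.

    Returns the number of tasks handled in this run.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    start = load_next_index(state_file)
    for i in range(start, len(tasks)):
        output = process(tasks[i])
        # Each result goes to disk immediately, so a crash loses at most one task's output.
        (out_dir / f"task_{i:05d}.txt").write_text(output, encoding="utf-8")
        done = i + 1
        # Checkpoint every N tasks and once more at the very end.
        if done % save_every == 0 or done == len(tasks):
            save_state(state_file, done)
    return max(0, len(tasks) - start)
```

If the run dies between checkpoints, the next run repeats up to `save_every - 1` tasks and overwrites their output files. That trades a little redundant work for a state file that is written less often; setting `save_every=1` removes the redundancy.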
Want a fully agentic loop instead of a queue? Pair this setup with Aider, Codex, or Hermes Agent — all of which run against the same local 70B and fit the same babysit pattern.
When it goes wrong
Almost always the rest of your stack, not the model. Full diagnosis here →
Thermal throttling on Air-class chassis, or swap pressure as context grows. devpulse status --json tells you which.
Local 70B inference draws ~30–50W under load. Plan on staying plugged in or swapping batteries. devpulse babysit reports battery percent and time-to-empty in every tick.
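To see why the advice is to stay plugged in, divide battery capacity by the draw. The sketch below uses the 30 to 50 W range quoted above. The 100 Wh capacity is an assumption, roughly the size of the largest MacBook Pro battery, so substitute your own machine's figure, and note that the calculation ignores the display and the rest of the system.

```python
def runtime_hours(battery_wh: float, draw_w: float) -> float:
    """Hours of inference a full battery can sustain at a constant draw."""
    return battery_wh / draw_w


def energy_needed_wh(hours: float, draw_w: float) -> float:
    """Energy a run of the given length consumes at a constant draw."""
    return hours * draw_w


BATTERY_WH = 100  # assumption: adjust for your own machine

for watts in (30, 50):
    print(f"{watts} W: about {runtime_hours(BATTERY_WH, watts):.1f} h per charge")

# An 11-hour batch, as in the flight scenario, at the low and high end of the range.
print(f"11 h needs {energy_needed_wh(11, 30):.0f} to {energy_needed_wh(11, 50):.0f} Wh")
```

Under these assumptions a single charge covers about two to three hours of sustained generation, so an 11-hour batch needs several times the battery's capacity from a wall outlet or seat power.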
Q4_K_M is excellent for most tasks but hits limits on tight reasoning. If you have headroom, Q5_K_M (~50 GB) and Q6_K (~57 GB) are noticeably stronger. Q8 is overkill for a 70B at this scale.
DevPulse is the co-pilot that keeps your stack from OOM-killing the model.