For developers running models offline
A viral post claimed a developer ran a quantized Llama 3.3 70B on a MacBook for an 11-hour transatlantic flight at 71 tokens/second. The 71 tok/s figure is fiction (real ceiling is ~20 tok/s), but the underlying idea is real: a 70B model running offline on a laptop, doing actual work, is finally possible on M3/M4 Max with 64 GB+ unified memory.
The hard part isn't the model. It's the rest of your machine. Chrome with 47 tabs. Docker idling at 4 GB. A stale TypeScript LSP from a project you closed yesterday. One OOM and your model unloads mid-task.
DevPulse ships a devpulse CLI built for this workflow. Pre-flight checks before you load. Watchdog while you run. JSON output that any agent or script can branch on. No daemon. No telemetry. No network calls.
Your MacBook is a one-person data center. DevPulse is what keeps it from crashing.
This isn't local versus cloud. The right setup is hybrid: route by query class, local for the bulk, cloud for the hard queries. The case for hybrid local + cloud →
Why this matters in 2026: Anthropic API uptime hit 98.95%, Claude Code is rate-limited during peak hours, and Blackwell GPU rentals are up 48% in two months. The macro case for local →
Step 1 · Before you load
Don't guess. Run the predicate, branch on exit code.
0 fits · 1 won't fit · 2 fits after unload · 3 tight. Branch in shell, no parsing.
Unloads idle Ollama models, kills orphaned LSPs and stale dev servers, then re-evaluates. Final exit code reflects post-cleanup state.
Reports what was reclaimed, what idle models exist, and how much headroom remains. Pipe to jq, feed to your agent.
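A minimal sketch of branching on those exit codes from Python instead of shell. The exit-code meanings come from the list above; the command passed to `check_fit` is whatever pre-flight invocation you use (the stand-in below is only a placeholder process that exits with 2, not a real DevPulse command), and `should_load` is a hypothetical policy helper, not part of the CLI.

```python
import subprocess
import sys

# Exit-code meanings as listed above.
FIT_STATUS = {
    0: "fits",
    1: "won't fit",
    2: "fits after unload",
    3: "tight",
}


def check_fit(command):
    """Run a pre-flight command and return (exit_code, meaning)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, FIT_STATUS.get(result.returncode, "unknown")


def should_load(exit_code, allow_tight=False):
    """Hypothetical policy: load on 0, optionally on 3, never on 1 or 2."""
    if exit_code == 0:
        return True
    if exit_code == 3:
        return allow_tight
    return False


if __name__ == "__main__":
    # Placeholder process that exits with 2; swap in your real pre-flight command.
    stand_in = [sys.executable, "-c", "import sys; sys.exit(2)"]
    code, meaning = check_fit(stand_in)
    print(code, meaning, should_load(code))
```

Exit code 2 is treated as "do not load yet" here, on the assumption that you would run the cleanup step first and branch again on the post-cleanup code.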
Step 2 · While you run
Throw it in the background. It watches free VRAM, battery, and swap. When pressure builds and there's something safe to reclaim, it acts — and emits an event you can checkpoint on.
Set --target-free-mb. When available VRAM dips below it and there's something safe to reclaim, babysit unloads idle models and kills zombies. No noise on healthy ticks.
Tracks battery percent, AC state, low-power mode, time-to-empty. Treats battery ≤ 20% as pressure so your agent can shrink context before the model gets killed.
Each cleanup event is a natural checkpoint trigger. Pipe to your script, write a state snapshot, keep going.
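One way the checkpoint pattern could look, as a sketch only. It assumes the watchdog's events arrive as one JSON object per line on a stream, which this page does not specify; the field names in the sample (`event`, `reclaimed_mb`) are invented for illustration, and `save_snapshot` is whatever your agent uses to persist state.

```python
import io
import json


def checkpoint_on_events(stream, save_snapshot):
    """Call save_snapshot(event) for each JSON object line; return how many were handled."""
    handled = 0
    for line in stream:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not valid JSON
        if not isinstance(event, dict):
            continue  # only JSON objects are treated as events
        save_snapshot(event)
        handled += 1
    return handled


if __name__ == "__main__":
    # Sample input with hypothetical field names; real output may differ.
    sample = io.StringIO(
        '{"event": "cleanup", "reclaimed_mb": 2048}\n'
        "not json\n"
        '{"event": "cleanup", "reclaimed_mb": 512}\n'
    )
    snapshots = []
    count = checkpoint_on_events(sample, snapshots.append)
    print(count, snapshots)
```

In a real pipeline you would pass `sys.stdin` as the stream and have `save_snapshot` write your agent's state to disk.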
The honest tok/s numbers
Llama 3.3 70B, Q4_K_M, llama.cpp / MLX on Metal. Anyone quoting 70+ tok/s is benchmarking a 7B–13B model and calling it 70B.
The 71 tok/s figure is consistent with Llama 3.1 8B, not 70B. The viral post conflated model sizes. Real ceiling for 70B on any MacBook today: ~20 tok/s with MLX on M4 Max 128 GB.
M4 Max under sustained inference draws 60–90 W. MacBook battery is ~100 Wh. That's 1–2 hours of 70B inference, not 11. And MacBook batteries aren't user-swappable.
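The battery arithmetic, worked through as a quick sketch. It uses only the figures quoted above (~100 Wh battery, 60–90 W sustained draw) and ignores everything else that drains a laptop, so treat the results as rough upper bounds.

```python
def runtime_hours(battery_wh, draw_watts):
    """Hours of runtime from a full battery at a constant power draw."""
    return battery_wh / draw_watts


def energy_needed_wh(hours, draw_watts):
    """Watt-hours required to sustain a constant draw for the given hours."""
    return hours * draw_watts


if __name__ == "__main__":
    shortest = runtime_hours(100, 90)  # heaviest draw
    longest = runtime_hours(100, 60)   # lightest draw
    print(f"{shortest:.1f} to {longest:.1f} hours")  # 1.1 to 1.7 hours

    # Energy an 11-hour flight of sustained inference would need.
    print(energy_needed_wh(11, 60), "to", energy_needed_wh(11, 90), "Wh")  # 660 to 990 Wh
```

An 11-hour run at these draws would need roughly 660 to 990 Wh, about seven to ten full charges.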
Quantized 70B running offline at ~20 tok/s is real and useful — faster than most people read, private, no API costs. See the full breakdown → or when it pays for itself →
Hitting Ollama OOM errors? On a Mac with 64 GB but Ollama still says “out of memory”? It's almost never the model. Diagnose and fix in one command →
Want the full 70B-on-a-MacBook setup? Hardware sizing, realistic tokens/second, the agent-loop pattern, and the failure modes you'll actually hit. The complete setup guide →
Worried about the distillation headlines? The White House memo about Chinese AI labs is about training-time IP, not runtime risk. Here's what it does and doesn't mean for running DeepSeek, Qwen or Kimi on your Mac. The technical answer →
No cloud. No daemon. No telemetry.
Install once, get both the GUI and the CLI. The CLI exposes the same data the menu bar shows, and your model load decisions never leave your Mac.
DevPulse is free, native, and uses less RAM than this webpage.
Download for macOS