Local AI on Mac — without the OOM crashes

Step 1 · Before you load

Will this 70B model fit?

Don't guess. Run the predicate, branch on exit code.

$ devpulse ai --before-load 42000 --auto-clean --json
{
  "modelSizeMB": 42000,
  "before":  { "verdict": "fits-after-unload", "exitCode": 2 },
  "actions": [
    "unloaded idle ollama model: qwen2.5:7b (4.2 GB)",
    "killed 6 zombie procs (812 MB reclaimed)"
  ],
  "after":   { "verdict": "fits", "exitCode": 0 }
}

$ if devpulse ai --before-load 42000 --auto-clean; then
    ollama run llama3.3:70b
  else
    echo "still won't fit. close Chrome."
  fi

Exit codes that mean something

0 fits · 1 won't fit · 2 fits after unload · 3 tight. Branch in shell, no parsing.
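
A minimal sketch of branching on all four codes with a case statement. The helper name handle_fit_code and its messages are illustrative and not part of devpulse; only the exit code meanings come from the list above.

```shell
# Map a devpulse exit code to a next step.
# handle_fit_code is a hypothetical helper; only the code meanings
# (0 fits, 1 won't fit, 2 fits after unload, 3 tight) come from the text above.
handle_fit_code() {
  case "$1" in
    0) echo "load" ;;
    1) echo "abort: won't fit" ;;
    2) echo "free memory first, then load" ;;
    3) echo "load with caution: tight" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}

# Usage (requires devpulse on PATH):
#   devpulse ai --before-load 42000
#   handle_fit_code $?
```

Keeping the mapping in one function means a script can treat "tight" differently from "won't fit" instead of lumping every non-zero code into one else branch.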

--auto-clean

Unloads idle Ollama models, kills orphaned LSPs and stale dev servers, then re-evaluates. Final exit code reflects post-cleanup state.

JSON when you need it

Reports what was reclaimed, what idle models exist, and how much headroom remains. Pipe to jq, feed to your agent.
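
As a rough example, assuming the report has the shape of the sample output above, two jq filters pull out the final verdict and the cleanup actions. The helper names are illustrative, and the field names are taken from that sample only.

```shell
# Hypothetical helpers that read a devpulse JSON report on stdin.
# Field names (.after.verdict, .actions) are assumed from the sample report.
final_verdict() { jq -r '.after.verdict'; }
cleanup_actions() { jq -r '.actions[]'; }

# Usage (requires devpulse and jq on PATH):
#   devpulse ai --before-load 42000 --auto-clean --json | final_verdict
```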

Step 2 · While you run

Babysit the long ones.

Throw it in the background. It watches free VRAM, battery, and swap. When pressure builds and there's something safe to reclaim, it acts — and emits an event you can checkpoint on.

$ devpulse babysit --duration 660 --target-free-mb 8192 --json > flight.log &

# meanwhile, your agent loop processes the queue...

$ tail -f flight.log
{"event":"started","intervalSec":30,"targetFreeMB":8192,...}
{"event":"tick","tickNum":1,"availableForAIMB":12450,"batteryPercent":94,"onAC":false,...}
{"event":"tick","tickNum":47,"availableForAIMB":7200,"pressure":"free<8192MB",...}
{"event":"cleanup","reasons":"free<8192MB","reclaimedMB":5400,"availableForAIMB":12600,...}
{"event":"done","ticks":1320,"cleanupRuns":3,"totalReclaimedMB":11200,"elapsedMin":660}

Threshold-driven

Set --target-free-mb. When available VRAM dips below it and there's something safe to reclaim, babysit unloads idle models and kills zombies. No noise on healthy ticks.

Battery-aware

Tracks battery percent, AC state, low-power mode, time-to-empty. Treats battery ≤ 20% as pressure so your agent can shrink context before the model gets killed.
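
On the consuming side, an agent could apply the same 20% threshold to tick events. A rough sketch, assuming each event is a single-line JSON object with a numeric batteryPercent field as in the sample log; battery_percent and should_shrink_context are hypothetical helper names.

```shell
# Extract batteryPercent from one event line (prints nothing if the field is absent).
battery_percent() {
  printf '%s\n' "$1" | sed -n 's/.*"batteryPercent":\([0-9][0-9]*\).*/\1/p'
}

# Succeeds (exit status 0) when the line reports battery at or below 20%.
should_shrink_context() {
  pct=$(battery_percent "$1")
  [ -n "$pct" ] && [ "$pct" -le 20 ]
}

# Usage:
#   if should_shrink_context "$line"; then echo "shrink context"; fi
```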

Checkpoint signals

Each cleanup event is a natural checkpoint trigger. Pipe to your script, write a state snapshot, keep going.
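
One way to wire that up, sketched under the assumption that events arrive one per line in the compact form of the sample log. save_checkpoint is a placeholder hook, not a devpulse feature.

```shell
# Hypothetical hook: replace the body with your own state snapshot logic.
save_checkpoint() {
  echo "checkpoint"
}

# Read event lines on stdin and call the hook on each cleanup event.
# Matches the compact "event":"cleanup" form seen in the sample log.
watch_for_cleanup() {
  while IFS= read -r line; do
    case "$line" in
      *'"event":"cleanup"'*) save_checkpoint ;;
    esac
  done
}

# Usage:
#   tail -f flight.log | watch_for_cleanup
```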

The honest tok/s numbers

What 70B actually runs at on a Mac.

Llama 3.3 70B, Q4_K_M, llama.cpp / MLX on Metal. Anyone quoting 70+ tok/s is benchmarking a 7B–13B model and calling it 70B.

Mac                    Quant         tok/s      Status
M4 Max 128 GB          Q4_K_M (MLX)  ~18–20     Usable
M4 Max 128 GB          Q4_K_M        ~18        Usable
M4 Max 128 GB          Q8_0          ~6.5       Slow
M4 Max 96 GB           Q4_K_M        ~8         Tight
M4 Pro 64 GB           Q4_K_M        ~5         Marginal
M4 Max 64 GB           Q4_K_M        won't fit  No
M3 Max 64 GB           Q4_K_M        ~7–8       Usable
M3 Ultra 512 GB        Q4 (4-bit)    ~15.5      Usable (highest documented)
M2 Max 64 GB           Q4 (4-bit)    ~8.8       Tight
M1 Ultra 64 GB         Q4 (4-bit)    ~12.6      Usable

Sources: llama.cpp #4167, Sean Kim M4 Max benchmarks,
LM Studio community runs, r/LocalLLM field reports.

71 tok/s is a 7B model

That figure is consistent with Llama 3.1 8B, not 70B. The viral post conflated model sizes. The realistic ceiling for 70B on any MacBook today is ~20 tok/s with MLX on an M4 Max 128 GB.

11 hours on battery doesn't happen

M4 Max under sustained inference draws 60–90 W. MacBook battery is ~100 Wh. That's 1–2 hours of 70B inference, not 11. And MacBook batteries aren't user-swappable.
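
The arithmetic, as a quick check using the round figures above (100 Wh battery, 60–90 W draw). runtime_min is an illustrative helper; it ignores conversion losses and other system load.

```shell
# Runtime in whole minutes for a battery capacity (Wh) and a sustained draw (W).
# Integer arithmetic, rounded down.
runtime_min() {
  echo $(( $1 * 60 / $2 ))
}

runtime_min 100 90   # prints 66
runtime_min 100 60   # prints 100
```

Roughly 66 to 100 minutes, which is where the 1–2 hour range comes from.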

What is true

Quantized 70B running offline at ~20 tok/s is real and useful — faster than most people read, private, no API costs. See the full breakdown → or when it pays for itself →

Hitting Ollama OOM errors? On a Mac with 64 GB but Ollama still says “out of memory”? It's almost never the model. Diagnose and fix in one command →

Want the full 70B-on-a-MacBook setup? Hardware sizing, realistic tokens/second, the agent-loop pattern, and the failure modes you'll actually hit. The complete setup guide →

Worried about the distillation headlines? The White House memo about Chinese AI labs is about training-time IP, not runtime risk. Here's what it does and doesn't mean for running DeepSeek, Qwen or Kimi on your Mac. The technical answer →

No cloud. No daemon. No telemetry.

Same binary as the menu bar app.

Install once, get both the GUI and the CLI. The CLI prints the same data the menu bar shows, and your model load decisions never leave your Mac.

# install
brew install --cask devpulse                # (coming soon)
# or download the DMG and drag to /Applications

# CLI is symlinked into ~/.local/bin or /usr/local/bin
$ devpulse status
memory  36.5 / 64.0 GB  (57%)  [healthy]
swap    4.1 GB
gpu     3.3 / 48.0 GB
battery 67%  [battery]  3h 47m to empty  · low-power mode

$ devpulse --help    # full subcommand list

11 hours of local AI. Zero crashes.

Stop letting your stack OOM your model loads.