Will this 70B model fit?

Don't guess. Run the predicate, branch on exit code.

$ devpulse ai --before-load 42000 --auto-clean --json
{
  "modelSizeMB": 42000,
  "before":  { "verdict": "fits-after-unload", "exitCode": 2 },
  "actions": [
    "unloaded idle ollama model: qwen2.5:7b (4.2 GB)",
    "killed 6 zombie procs (812 MB reclaimed)"
  ],
  "after":   { "verdict": "fits", "exitCode": 0 }
}

$ if devpulse ai --before-load 42000 --auto-clean; then
    ollama run llama3.3:70b
  else
    echo "still won't fit. close Chrome."
  fi

Exit codes that mean something

0 fits · 1 won't fit · 2 fits after unload · 3 tight. Branch in shell, no parsing.
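A single case statement can handle all four codes, including the "tight" case that a plain if/else would lump in with failure. This is a minimal sketch that assumes the codes listed above; `fit_verdict` is a hypothetical helper name, not part of devpulse.

```shell
# Map a devpulse exit code to the verdict listed above.
# fit_verdict is a hypothetical helper for illustration.
fit_verdict() {
  case "$1" in
    0) echo "fits" ;;
    1) echo "won't fit" ;;
    2) echo "fits after unload" ;;
    3) echo "tight" ;;
    *) echo "unknown" ;;
  esac
}

# Usage (assumes devpulse is on PATH):
#   devpulse ai --before-load 42000; fit_verdict "$?"
```

Treating 3 separately lets a script load the model but warn, rather than refusing outright.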

--auto-clean

Unloads idle Ollama models, kills orphaned LSPs and stale dev servers, then re-evaluates. Final exit code reflects post-cleanup state.

JSON when you need it

Reports what was reclaimed, what idle models exist, and how much headroom remains. Pipe to jq, feed to your agent.
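As a rough illustration, the fields in the sample report can be condensed into one line with jq. The sketch assumes jq is installed and that real output has the same shape as the sample; `report_summary` is a hypothetical helper name.

```shell
# Summarise a `devpulse ai --json` report read from stdin.
# Field names come from the sample output; report_summary is a hypothetical helper.
report_summary() {
  jq -r '"\(.before.verdict) -> \(.after.verdict) (\(.actions | length) cleanup actions)"'
}

# Sample report, copied from the example output.
sample='{
  "modelSizeMB": 42000,
  "before":  { "verdict": "fits-after-unload", "exitCode": 2 },
  "actions": [
    "unloaded idle ollama model: qwen2.5:7b (4.2 GB)",
    "killed 6 zombie procs (812 MB reclaimed)"
  ],
  "after":   { "verdict": "fits", "exitCode": 0 }
}'

# Usage:
#   printf '%s\n' "$sample" | report_summary
#   devpulse ai --before-load 42000 --auto-clean --json | report_summary
```

On the sample this yields `fits-after-unload -> fits (2 cleanup actions)`.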

Babysit the long ones.

Throw it in the background. It watches free VRAM, battery, and swap. When pressure builds and there's something safe to reclaim, it acts — and emits an event you can checkpoint on.

$ devpulse babysit --duration 660 --target-free-mb 8192 --json > flight.log &

# meanwhile, your agent loop processes the queue...

$ tail -f flight.log
{"event":"started","intervalSec":30,"targetFreeMB":8192,...}
{"event":"tick","tickNum":1,"availableForAIMB":12450,"batteryPercent":94,"onAC":false,...}
{"event":"tick","tickNum":47,"availableForAIMB":7200,"pressure":"free<8192MB",...}
{"event":"cleanup","reasons":"free<8192MB","reclaimedMB":5400,"availableForAIMB":12600,...}
{"event":"done","ticks":1320,"cleanupRuns":3,"totalReclaimedMB":11200,"elapsedMin":660}

Threshold-driven

Set --target-free-mb. When available VRAM dips below it and there's something safe to reclaim, babysit unloads idle models and kills zombies. No noise on healthy ticks.

Battery-aware

Tracks battery percent, AC state, low-power mode, time-to-empty. Treats battery ≤ 20% as pressure so your agent can shrink context before the model gets killed.
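The two pressure rules described here (free memory below the target, battery at or below 20%) can be sketched as a small check. This is an illustration of the rule as stated, not devpulse's implementation. `pressure_reasons` and the battery label are hypothetical; only the `free<NMB` format appears in the sample log.

```shell
# Print the pressure reasons for one tick, one per line (nothing when healthy).
# Arguments: available MB, target free MB, battery percent.
# pressure_reasons and the "battery<=20%" label are made up for illustration.
pressure_reasons() {
  avail_mb=$1
  target_mb=$2
  battery_pct=$3
  if [ "$avail_mb" -lt "$target_mb" ]; then
    echo "free<${target_mb}MB"
  fi
  if [ "$battery_pct" -le 20 ]; then
    echo "battery<=20%"
  fi
  return 0
}

# Usage: pressure_reasons 7200 8192 94
```

With the values from tick 47 in the sample log (7200 MB available, 8192 MB target), it prints `free<8192MB`.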

Checkpoint signals

Each cleanup event is a natural checkpoint trigger. Pipe to your script, write a state snapshot, keep going.
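One way to wire this up is a small loop that watches the log and runs a snapshot command on each cleanup event. This is a sketch that assumes the log is compact JSON, one event per line, as in the sample. `is_cleanup_event`, `watch_checkpoints`, and `./save_state.sh` are hypothetical names.

```shell
# Succeed when a log line is a cleanup event.
# Assumes compact JSON as in the sample log ("event":"cleanup" with no spaces).
is_cleanup_event() {
  case "$1" in
    *'"event":"cleanup"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Read events from stdin and run the given command on each cleanup event.
# The command is whatever snapshot script you use; it is hypothetical here.
watch_checkpoints() {
  while IFS= read -r line; do
    if is_cleanup_event "$line"; then
      "$@" </dev/null   # keep the command from consuming the log stream
    fi
  done
}

# Usage: tail -f flight.log | watch_checkpoints ./save_state.sh
```

Matching on the raw string avoids a JSON parser. Use jq instead if the output format is not guaranteed to stay compact.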

What 70B actually runs at on a Mac.

Llama 3.3 70B, Q4_K_M, llama.cpp / MLX on Metal. Anyone quoting 70+ tok/s is benchmarking a 7B–13B model and calling it 70B.

Mac                    Quant         tok/s      Status
M4 Max 128 GB          Q4_K_M (MLX)  ~18–20     Usable
M4 Max 128 GB          Q4_K_M        ~18        Usable
M4 Max 128 GB          Q8_0          ~6.5       Slow
M4 Max 96 GB           Q4_K_M        ~8         Tight
M4 Pro 64 GB           Q4_K_M        ~5         Marginal
M4 Max 64 GB           Q4_K_M        won't fit  No
M3 Max 64 GB           Q4_K_M        ~7–8       Usable
M3 Ultra 512 GB        Q4 (4-bit)    ~15.5      Usable (highest documented)
M2 Max 64 GB           Q4 (4-bit)    ~8.8       Tight
M1 Ultra 64 GB         Q4 (4-bit)    ~12.6      Usable

Sources: llama.cpp #4167, Sean Kim M4 Max benchmarks,
LM Studio community runs, r/LocalLLM field reports.

71 tok/s is a 7B-class model

That figure is consistent with Llama 3.1 8B, not 70B. The viral post conflated model sizes. Real ceiling for 70B on any MacBook today: ~20 tok/s with MLX on M4 Max 128 GB.

11 hours on battery doesn't happen

M4 Max under sustained inference draws 60–90 W. MacBook battery is ~100 Wh. That's roughly 1.1–1.7 hours of 70B inference, not 11. And MacBook batteries aren't user-swappable.
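The runtime estimate is capacity divided by draw. A quick check using the figures quoted above (~100 Wh, 60–90 W); `runtime_hours` is a hypothetical helper, and the calculation ignores everything else drawing power.

```shell
# Hours of runtime for a battery of $1 watt-hours at a steady draw of $2 watts.
runtime_hours() {
  awk -v wh="$1" -v w="$2" 'BEGIN { printf "%.1f\n", wh / w }'
}

runtime_hours 100 90   # upper end of the quoted draw: prints 1.1
runtime_hours 100 60   # lower end of the quoted draw: prints 1.7
```

Reaching 11 hours on the same battery would require an average draw of about 9 W.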

What is true

Quantized 70B running offline at ~20 tok/s is real and useful: faster than most people read, private, no API costs.

Same binary as the menu bar app.

Install once, get both the GUI and the CLI. The CLI reports the same data the menu bar shows, and your model load decisions never leave your Mac.

# install
brew install --cask devpulse                # (coming soon)
# or download the DMG and drag to /Applications

# CLI is symlinked into ~/.local/bin or /usr/local/bin
$ devpulse status
memory  36.5 / 64.0 GB  (57%)  [healthy]
swap    4.1 GB
gpu     3.3 / 48.0 GB
battery 67%  [battery]  3h 47m to empty  · low-power mode

$ devpulse --help    # full subcommand list

Stop letting your stack OOM your model loads.

DevPulse is free, native, and uses less RAM than this webpage.

Download for macOS

macOS 14+ · Apple Silicon & Intel · Free during launch