Memory protocol · works for any local-AI runtime
On Apple Silicon, the question isn't how much RAM your Mac has — it's how much is actually usable for the GPU after the rest of your stack has its share. This is the protocol for clearing the runway before you launch Ollama, llama.cpp, LM Studio, or MLX.
One command does most of it. devpulse ai --before-load <MB> --auto-clean performs the safe parts and tells you what's still in the way. The deep cuts below are for when you want to do it manually, or when you need every last GB.
First — measure
Before reclaiming, see what you have. devpulse status shows unified memory used/free, current GPU allocation, swap, battery, and zombies in one shot.
The protocol · in priority order
1. Ollama keeps a model loaded for 5 minutes after its last use by default. Often 4–8 GB of reclaimable memory is tied up in a model you queried earlier and forgot about. devpulse ai shows you which.
2. TypeScript LSPs, ESLint daemons, file watchers, and dev servers left running by projects you closed yesterday. Often hundreds of MB, occasionally GBs. Clear them with devpulse zombies --kill.
3. Chrome (15–25 GB on most dev Macs) and Docker (3–6 GB idle) are the heaviest. If you don't need them right now, quit them. Memory Saver in Chrome partially helps; closing tabs helps more.
4. Slack, Discord, Notion, Cursor: each carries a Chromium runtime worth 1–4 GB. Cumulatively a real win.
5. devpulse ai --before-load <MB> --auto-clean does steps 1–2 automatically, then re-evaluates. Won't touch your foreground apps. Returns exit code 0 when the model fits.
6. sudo sysctl iogpu.wired_limit_mb=57000 raises Apple Silicon's GPU cap. Risky above ~85% of total RAM; resets on reboot. Reclaiming existing usage is almost always safer.
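For the manual route, the first step (idle Ollama models) can be sketched in shell. This assumes a recent Ollama CLI that provides `ollama ps` and `ollama stop`. The helper name `unload_ollama_models` is hypothetical. It unloads every loaded model rather than only idle ones, and it is not a description of how DevPulse does it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: unload every model Ollama currently holds in memory.
# Assumes `ollama ps` prints one header row, then one row per loaded model
# with the model name in the first column.
unload_ollama_models() {
  local name
  ollama ps | awk 'NR > 1 && NF > 0 { print $1 }' | while read -r name; do
    echo "unloading $name"
    # Keep `ollama stop` from consuming the loop's stdin.
    ollama stop "$name" < /dev/null
  done
}
```

Run `unload_ollama_models` and then re-check with devpulse ai to confirm that nothing is still held warm.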
The single-command path
All the safe parts of the protocol, executed and reported, in one call. Pass the size of the model you're about to load and DevPulse tells you whether it fits — and what to do if not.
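Because the command returns exit code 0 when the model fits, it can gate a launch script. The wrapper below is a minimal sketch: `preflight_then_run` is a hypothetical name, and the only exit-code meaning it relies on is the documented one (0 means the model fits). Any other code is treated as "do not launch" and passed through unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run the DevPulse pre-flight for a model of the given
# size (in MB), then run the remaining arguments as the launch command only
# if the pre-flight exits with 0.
preflight_then_run() {
  local model_mb="$1"
  shift
  if devpulse ai --before-load "$model_mb" --auto-clean; then
    "$@"
  else
    local status=$?
    echo "pre-flight exited with $status; not launching" >&2
    return "$status"
  fi
}
```

For example, `preflight_then_run 8000 ./main -m model.gguf` would start llama.cpp only when an 8000 MB model fits. The 8000 is an illustrative size, not a recommendation.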
Per runtime · same memory rules
Apple Silicon's unified memory ceiling is a property of the OS, not the inference framework. Whatever you're launching, the headroom you reclaim with DevPulse is available to it.
Ollama. The most common case. devpulse ai integrates directly: it lists loaded models, surfaces idle ones, and can unload them via --auto-clean.
llama.cpp. No daemon: the model loads on demand and unloads on exit. Pre-flight before ./main -m model.gguf still applies; the same DevPulse check returns the same headroom.
LM Studio. Holds models more deliberately than Ollama. Use LM Studio's “Eject” to unload, then run --before-load with the new model's size.
MLX. Apple's own ML framework. Same unified-memory rules; same DevPulse pre-flight. MLX models tend to be slightly smaller in RAM than the same quant in GGUF.
Less common on Mac (CUDA-first), but BYOK setups via Factory's Droid or similar use it. Same memory accounting applies.
If you're loading a HuggingFace model directly via PyTorch with MPS, the OS-level ceiling still rules. Pre-flight first.
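Whichever runtime you use, the pre-flight needs a size in MB. One rough starting point is the size of the model file on disk, sketched below. The helper name `model_file_mb` is hypothetical, and two assumptions are built in: that 1 MB means 1,048,576 bytes, and that the on-disk size approximates the loaded size. The real footprint is usually larger once the runtime allocates context and KV-cache memory, so treat the result as a lower bound.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a model file's size in whole MB, rounded up.
# Assumes 1 MB = 1048576 bytes; the loaded footprint may be larger.
model_file_mb() {
  local bytes
  bytes=$(wc -c < "$1") || return 1
  echo $(( (bytes + 1048575) / 1048576 ))
}
```

A typical use would be `devpulse ai --before-load "$(model_file_mb model.gguf)" --auto-clean`, with extra margin added by hand for long contexts.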
Common questions
Default ceiling is ~75% of total RAM (Metal's recommendedMaxWorkingSetSize). On a 64 GB machine that is ~48 GB usable. Raise it via sudo sysctl iogpu.wired_limit_mb=<MB>; going past ~85% can destabilize the system.
Is sudo purge useful? Marginally. It flushes file caches and inactive pages, not wired/GPU memory. Buys hundreds of MB at most. Killing Chrome reclaims 100x more.
Partially — zombies and idle models are reclaimable without touching foreground apps. But the heaviest source (Chrome) needs Memory Saver or fewer tabs.
devpulse ai --before-load <MB> --auto-clean. Only reversible cleanup; stable exit codes (0/1/2/3); won't touch foreground apps or system settings.
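The ceiling arithmetic in the answers above is easy to check before touching sysctl. The sketch below only computes a number and changes nothing on the system. The helper name `wired_limit_mb` is hypothetical, and it assumes total RAM is given in whole GB with 1 GB = 1024 MB.

```shell
#!/usr/bin/env bash
# Hypothetical helper: a wired-limit value in MB for a given total RAM (GB)
# and a percentage of that RAM. Uses integer arithmetic, so it rounds down.
wired_limit_mb() {
  local total_gb="$1" pct="$2"
  echo $(( total_gb * 1024 * pct / 100 ))
}

wired_limit_mb 64 75   # prints 49152 (48 GB, the default ceiling on 64 GB)
wired_limit_mb 64 85   # prints 55705 (the ~85% line on 64 GB)
```

On a Mac, total RAM in bytes is available from `sysctl -n hw.memsize`. A value computed this way is what you would pass as <MB> to sudo sysctl iogpu.wired_limit_mb=<MB>, keeping the percentage at or below the ~85% guidance.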
DevPulse runs the math, reclaims the safe parts, and tells you in one command.