Does this work for llama.cpp and LM Studio, not just Ollama?

Yes. Unified memory pressure on Apple Silicon is identical regardless of which runtime you use. Reclaiming RAM with DevPulse before launching llama.cpp, LM Studio, MLX, or Ollama produces the same result — more headroom for the model. Only the unload-idle-models step is Ollama-specific (LM Studio holds models more deliberately, llama.cpp loads on-demand).

How much unified memory can my GPU actually use?

Apple Silicon's default ceiling is `recommendedMaxWorkingSetSize`, which Metal sets to roughly 75% of total RAM on most M-series Macs. On 64 GB you get about 48 GB usable. You can raise it via `sudo sysctl iogpu.wired_limit_mb=<MB>`, but going past ~85% of total RAM destabilizes the system.

Is `sudo purge` worth running before loading a model?

Marginally. `purge` flushes file system caches and inactive memory pages — not GPU/wired memory. It buys you a few hundred MB at most and only matters if you're right at the edge. Killing Chrome reclaims 100x more.

Can I free VRAM without quitting apps?

Partially. `devpulse zombies --kill` reclaims orphaned dev procs without touching foreground apps. Unloading idle Ollama models frees their VRAM. But the heaviest single source — Chrome with many tabs — needs at least Memory Saver enabled or fewer tabs to give it back. There's no magic; the memory has to come from somewhere.

What's the safest way to do this in a script?

Use `devpulse ai --before-load <MB> --auto-clean`, which only performs reversible cleanup (unload idle models, kill clearly orphaned procs) and reports whether the model now fits via stable exit codes (0 fits, 1 won't fit, 2 fits-after-unload, 3 tight). It won't quit your foreground apps or modify system settings.

Free VRAM before loading a local AI model on Mac

First — measure

Where you actually are.

Before reclaiming, see what you have. devpulse status shows unified memory used/free, current GPU allocation, swap, battery, and zombies in one shot.

$ devpulse status
memory  41.3 / 64.0 GB  (65%)  [healthy]
swap    4.1 GB
gpu     14.2 / 48.0 GB
battery 67%  [battery]  3h 47m to empty

⚠  6 zombie procs using 812 MB — run: devpulse zombies --kill
⚠  3 idle dev servers using 1.4 GB — projects: api-gateway, marketing-site

# JSON form for scripts:
$ devpulse status --json | jq '.gpu.allocatedMB, .memory.usedGB'
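For scripts that need more than a jq one-liner, the same JSON can be read from Python. This is a minimal sketch that assumes the output contains the two fields the jq filter above selects (`gpu.allocatedMB` and `memory.usedGB`). The sample payload is illustrative rather than captured output, and `parse_status` and `read_status` are hypothetical helper names.

```python
import json
import subprocess


def parse_status(raw: str) -> dict:
    """Pull the two fields used by the jq example out of the status JSON."""
    data = json.loads(raw)
    return {
        "gpu_allocated_mb": data["gpu"]["allocatedMB"],
        "memory_used_gb": data["memory"]["usedGB"],
    }


def read_status() -> dict:
    """Run `devpulse status --json` and parse it (needs devpulse on PATH)."""
    result = subprocess.run(
        ["devpulse", "status", "--json"],
        capture_output=True, text=True, check=True,
    )
    return parse_status(result.stdout)


# Illustrative payload only: the real output may carry more fields.
sample = '{"gpu": {"allocatedMB": 14540}, "memory": {"usedGB": 41.3}}'
print(parse_status(sample))
```

Keeping the parsing separate from the subprocess call means the field handling can be checked without the tool installed.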

The protocol · in priority order

Six steps from cheapest to deepest.

1 · Unload idle Ollama models

Ollama holds models warm for 5 minutes after last use by default. Often 4–8 GB of reclaimable VRAM is sitting in a model you queried earlier and forgot about. devpulse ai shows you which.

2 · Kill zombies and stale dev servers

TypeScript LSPs, ESLint daemons, file watchers, and dev servers from projects you closed yesterday. Often hundreds of MB, occasionally GBs. Run devpulse zombies --kill.

3 · Quit memory-heavy apps

Chrome (15–25 GB on most dev Macs) and Docker (3–6 GB idle) are the heaviest. If you don't need them right now, quit them. Memory Saver in Chrome partially helps; closing tabs helps more.

4 · Close idle Electron apps

Slack, Discord, Notion, Cursor — each carries a Chromium runtime worth 1–4 GB. Cumulatively a real win.

5 · Run --auto-clean

devpulse ai --before-load <MB> --auto-clean does steps 1–2 automatically, then re-evaluates. It won't touch your foreground apps, and it returns exit code 0 when the model fits.

6 · Raise the ceiling (last resort)

sudo sysctl iogpu.wired_limit_mb=57000 raises Apple Silicon's GPU cap. Risky above ~85% of total RAM; resets on reboot. Reclaiming existing usage is almost always safer.
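The percentages in step 6 translate to megabyte values with simple arithmetic. The sketch below assumes 1 GB = 1024 MB and treats the ~75% default and ~85% guideline from the text as plain fractions. `wired_limit_mb` and `is_risky` are hypothetical helper names, not part of DevPulse or macOS.

```python
def wired_limit_mb(total_ram_gb: int, fraction: float) -> int:
    """Return `fraction` of total RAM in MB (1 GB = 1024 MB), rounded down."""
    return int(total_ram_gb * 1024 * fraction)


def is_risky(limit_mb: int, total_ram_gb: int, risk_fraction: float = 0.85) -> bool:
    """True when a proposed limit exceeds the ~85% guideline."""
    return limit_mb > wired_limit_mb(total_ram_gb, risk_fraction)


total_gb = 64
print(wired_limit_mb(total_gb, 0.75))  # default ceiling in MB (48 GB on a 64 GB Mac)
print(wired_limit_mb(total_gb, 0.85))  # upper guideline in MB
```

Checking a candidate value with `is_risky` before passing it to `sysctl` is a cheap guard, though the guideline itself is approximate.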

The single-command path

Or just do this.

All the safe parts of the protocol, executed and reported, in one call. Pass the size of the model you're about to load and DevPulse tells you whether it fits — and what to do if not.

# Llama 3.3 70B (Q4_K_M ≈ 42 GB)
$ devpulse ai --before-load 42000 --auto-clean
before: Won't fit — 8.2 GB short
  - unloaded idle ollama model: qwen2.5:7b (4.2 GB)
  - killed 6 zombie procs (812 MB reclaimed)
after:  Fits comfortably — 4.4 GB headroom

# Use the exit code in scripts
$ devpulse ai --before-load 42000 --auto-clean && ollama run llama3.3:70b
# exit 0 = safe to load; 1 = won't fit; 2 = fits after unload; 3 = tight
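When a script needs to branch on more than success or failure, the four exit codes can be handled explicitly. This is a sketch under stated assumptions: the code meanings are taken from the list above, while the policy of loading only on exit 0 unless the caller opts in to other codes is a choice made for this example, not documented DevPulse behavior. `describe_exit`, `should_load`, and `preflight` are hypothetical names.

```python
import subprocess

# Exit-code meanings as listed above.
EXIT_MEANINGS = {
    0: "fits",
    1: "won't fit",
    2: "fits after unload",
    3: "tight",
}


def describe_exit(code: int) -> str:
    """Translate an exit code into the label used in this guide."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def should_load(code: int, accept: tuple = (0,)) -> bool:
    """Sketch policy: load only when the code is in the accepted set."""
    return code in accept


def preflight(model_mb: int) -> int:
    """Run the pre-flight check and return its exit code (needs devpulse on PATH)."""
    return subprocess.run(
        ["devpulse", "ai", "--before-load", str(model_mb), "--auto-clean"]
    ).returncode
```

A caller that is willing to load in the tight case could pass `accept=(0, 3)`; the default mirrors the `&&` form above, which proceeds only on exit 0.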

Per runtime · same memory rules

This works for every local-AI runtime.

Apple Silicon's unified memory ceiling is a property of the OS, not the inference framework. Whatever you're launching, the headroom you reclaim with DevPulse is available to it.

Ollama

The most common case. devpulse ai integrates directly: lists loaded models, surfaces idle ones, can unload via --auto-clean.
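For readers who want to see what is loaded without DevPulse, Ollama's own HTTP API exposes the same information. The sketch below assumes a local server on the default port and uses the `/api/ps` endpoint plus a `keep_alive` of 0 on `/api/generate` to request an unload. It illustrates the Ollama API, not how DevPulse implements the step, and the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def loaded_models(ps_response: dict) -> list[tuple[str, float]]:
    """Return (name, VRAM in GB) for each model in an /api/ps response."""
    return [
        (m["name"], m.get("size_vram", 0) / 1024**3)
        for m in ps_response.get("models", [])
    ]


def unload_request(model: str) -> urllib.request.Request:
    """Build a request asking Ollama to unload `model` (keep_alive of 0)."""
    body = json.dumps({"model": model, "keep_alive": 0}).encode()
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_ps() -> dict:
    """Query the running Ollama server for currently loaded models."""
    with urllib.request.urlopen(f"{OLLAMA_URL}/api/ps") as resp:
        return json.load(resp)
```

With a server running, `loaded_models(fetch_ps())` lists what is resident, and `urllib.request.urlopen(unload_request(name))` sends the unload request for one model.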

llama.cpp

No daemon: the model loads on demand and unloads on exit. Pre-flight before ./main -m model.gguf (llama-cli in recent builds) still applies; the same DevPulse check returns the same headroom.

LM Studio

Holds models more deliberately than Ollama. Use LM Studio's “Eject” to unload, then run --before-load with the new model's size.

MLX

Apple's own ML framework. Same unified-memory rules; same DevPulse pre-flight. MLX models tend to be slightly smaller in RAM than the same quant in GGUF.

vLLM

Less common on Mac (CUDA-first), but BYOK setups via Factory's Droid or similar use it. Same memory accounting applies.

Raw transformers

If you're loading a HuggingFace model directly via PyTorch with MPS, the OS-level ceiling still rules. Pre-flight first.

Common questions

FAQ.

How much unified memory can the GPU actually use?

Default ceiling is ~75% of total RAM (Metal's recommendedMaxWorkingSetSize). On 64 GB → ~48 GB usable. Raise via sudo sysctl iogpu.wired_limit_mb=<MB>; going past ~85% destabilizes the system.

Is `sudo purge` useful?

Marginally. Flushes file caches and inactive pages — not wired/GPU memory. Buys hundreds of MB at most. Killing Chrome reclaims 100x more.

Can I do this without quitting apps?

Partially — zombies and idle models are reclaimable without touching foreground apps. But the heaviest source (Chrome) needs Memory Saver or fewer tabs.

Safe way in a script?

devpulse ai --before-load <MB> --auto-clean. Only reversible cleanup; stable exit codes (0/1/2/3); won't touch foreground apps or system settings.

Stop guessing whether the model will fit.
