Hardware: 64 GB or more.

Llama 3.3 70B at Q4_K_M needs ~42 GB of unified memory just for the weights. Your Mac also needs enough headroom for context (~4–8 GB) and the rest of your stack (browser, terminal, Docker if you use it). Realistic minimum: a 64 GB Mac with a deliberately lean foreground.

Mac configVerdictNotes
M4 Max / M3 Max, 64 GBRuns comfortably~42 GB model + 6 GB headroom — close Chrome and Docker first.
M4 Max / M3 Ultra, 96 GB+Runs greatCan stay productive in other apps while it runs.
M4 Pro, 48 GBTightTry Q3_K_M (~32 GB) instead, or step down to a 32B.
M3/M4, 32 GB or lessWon't fitUse Llama 3.1 8B or Qwen 2.5 Coder 32B instead.

Full breakdown: /can-i-run/llama3.3-70b →

Tokens per second on Apple Silicon.

A viral post claimed 71 tok/s for Llama 3.3 70B on an M4 MacBook Pro. That figure is fiction. The realistic ceiling on M-series unified memory for a 70B at Q4_K_M is closer to 8–20 tok/s, bottlenecked by memory bandwidth, not FLOPS.

M4 Max (~546 GB/s)

~12–18 tok/s single-batch on a quantized 70B. With speculative decoding (a small draft model), 20+ is reachable.

M3 Max (~400 GB/s)

~8–14 tok/s. Slightly slower than M4 Max but still faster than most people read.

M3 Ultra (~819 GB/s)

~18–25+ tok/s. The top-end Studio is genuinely fast on 70Bs and handles bigger contexts gracefully.

Why bandwidth matters: generating each token requires streaming the active weights through memory once. A 42 GB Q4_K_M model on 546 GB/s bandwidth gives a theoretical ceiling of about 13 tok/s before any other overhead. Quoting numbers above the bandwidth-derived ceiling without speculative decoding is a tell that something else is being measured (or fabricated).

Step by step.

  1. Install Ollama from ollama.com. It registers itself as a local service on port 11434.
  2. Install DevPulse from devpulse.sh. The menu bar app shows you what's eating unified memory; the bundled devpulse CLI runs the pre-flight check.
  3. Pre-flightbefore pulling the model so you know it'll fit:
$ devpulse ai --before-load 42000 --auto-clean
before: Won't fit — 8.2 GB short
  - unloaded idle ollama model: qwen2.5:7b (4.2 GB)
  - killed 6 zombie procs (812 MB reclaimed)
after:  Fits comfortably — 4.4 GB headroom
$ echo $?
0   # safe to load
  1. Pull and run the model. First pull is ~42 GB — grab coffee.
$ ollama pull llama3.3:70b
$ ollama run llama3.3:70b
  1. For long batches: babysit.Hours-long agent loops need a watchdog so they don't lose progress when memory tightens:
$ devpulse babysit --target-free-mb 8192 --json > babysit.log &
$ ollama run llama3.3:70b < my-task-queue.txt

Process a queue, checkpoint on pressure events.

The reason the “11-hour-flight” pattern works isn't the model. It's the loop: process a task, save output to disk, advance, save state every N tasks. devpulse babysit's NDJSON event stream feeds directly into that pattern.

# pseudocode — drop into your shell or python loop
while task = pop_next(queue):
    output = ollama_chat("llama3.3:70b", task.prompt)
    write_output(task.id, output)
    if task.id % 12 == 0:
        save_checkpoint(queue.position)

# meanwhile, in another pane:
$ tail -f babysit.log | jq 'select(.event == "cleanup")' \
  | while read evt; do
      save_checkpoint  # extra checkpoint on memory pressure
    done

Want a fully agentic loop instead of a queue? Pair this setup with Aider, Codex, or Hermes Agent — all of which run against the same local 70B and fit the same babysit pattern.

Common failures.

OOM during load

Almost always the rest of your stack, not the model. Full diagnosis here →

Sudden slowdown after an hour

Thermal throttling on Air-class chassis, or swap pressure as context grows. devpulse status --json tells you which.

Battery dropping fast

Local 70B inference draws ~30–50W under load. Plan on staying plugged in or swapping batteries. devpulse babysit reports battery percent and time-to-empty in every tick.

Output quality lower than expected

Q4_K_M is excellent for most tasks but hits limits on tight reasoning. If you have headroom, Q5_K_M (~50 GB) and Q6_K (~57 GB) are noticeably stronger. Q8 is overkill for a 70B at this scale.

11 hours of local AI. Zero crashes.

DevPulse is the co-pilot that keeps your stack from OOM-killing the model.

Download for macOS

macOS 14+ · Apple Silicon & Intel · Free during launch