Seven reasons to mix.

1 · Cost

Routine tasks at zero marginal cost on local hardware. Frontier-only tasks at API rates. Typical net: 60-80% of cloud spend gone with negligible quality loss on the share that matters.

2 · Reliability

Cloud at 98.95% uptime (Anthropic, 90-day window) means ~5x the standard outage budget. Hybrid: that 1.05% transparently falls back to local. Graceful degradation, not failure.
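
The arithmetic behind that number, as a rough sketch. The text does not say which "standard" budget it means, so the 99.8% and 99.9% targets below are assumptions for comparison: against 99.8% the multiple comes out near the ~5x quoted, against 99.9% it is roughly double that.

```python
def downtime_hours(uptime_pct: float, window_days: int) -> float:
    """Expected hours of downtime over the window at a given uptime percentage."""
    return (100.0 - uptime_pct) / 100.0 * window_days * 24


def budget_multiple(observed_uptime_pct: float, target_uptime_pct: float) -> float:
    """How many times the observed downtime exceeds the target's downtime budget."""
    return (100.0 - observed_uptime_pct) / (100.0 - target_uptime_pct)


observed = 98.95  # uptime figure quoted in the text, 90-day window

print(round(downtime_hours(observed, 90), 2))     # hours down over 90 days
print(round(budget_multiple(observed, 99.8), 2))  # vs. an assumed 99.8% target
print(round(budget_multiple(observed, 99.9), 2))  # vs. an assumed 99.9% target
```

Whichever target you hold yourself to, those are the hours the local fallback has to cover.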

3 · Latency

Local first-token: hardware-bound, often <100ms. Cloud queues add 200-500ms+. For tight agentic loops local wins; for deliberation cloud wins. Different shapes, different fits.

4 · Privacy partition

Client code, sensitive docs, internal data → local. Public, generic, or already-shared content → cloud. The router classifies each request for privacy in the same pass that it classifies for cost.

5 · Capacity hedge

When Anthropic rate-limits at 9am Tuesday, your agent silently fails over to local instead of stopping. The babysit pattern + local fallback = an agent that's always running.

6 · Pricing-churn insulation

Cloud is your quality ceiling, local is your cost floor. As tokenizers creep, TTLs shrink, volume discounts vanish, your blended cost moves slowly. Not all-in on either side's bad day.

7 · Compounding skill

People who operate both sides outperform people who only know one. Hybrid is the technically literate position, and the bet pays off as both sides keep evolving.

What goes where.

A practical routing table for a typical developer workflow. Yours will differ — the ratios shift with what you build, not the principle.

| Task class | Default route | Why |
| --- | --- | --- |
| Autocomplete | Local | Latency-critical. 8B model is fast and good enough. |
| Drafting / summarization | Local | High volume, low complexity. 32B handles it. |
| Routine refactors | Local | Qwen 2.5 Coder 32B matches frontier on most. |
| Code review (your code) | Local | Privacy-sensitive + latency-good-enough. |
| Quick Q&A / explain | Local | Already covered by 8-14B at acceptable quality. |
| Complex multi-step reasoning | Cloud | Frontier still wins on hard reasoning chains. |
| Novel domain expertise | Cloud | Frontier breadth of training matters here. |
| Multimodal at scale | Cloud | Local vision models exist but lag for serious work. |
| Long agentic plans | Cloud | Tool-use depth + planning still favors frontier. |
| Cloud is rate-limited | Local (fallback) | Capacity hedge: degrade quality, don't stop. |
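
One way the table could turn into a routing function. This is a minimal sketch, not DevPulse's router: the task-class keys, the `Request` type, and the choice to send unknown task classes to cloud are all hypothetical. The privacy override and the rate-limit fallback follow the rules described above.

```python
from dataclasses import dataclass

# Default routes from the table above; the key names are hypothetical labels.
DEFAULT_ROUTE = {
    "autocomplete": "local",
    "drafting": "local",
    "routine_refactor": "local",
    "code_review": "local",
    "quick_qa": "local",
    "complex_reasoning": "cloud",
    "novel_domain": "cloud",
    "multimodal": "cloud",
    "long_agentic_plan": "cloud",
}


@dataclass
class Request:
    task_class: str
    sensitive: bool = False  # client code, sensitive docs, internal data


def route(req: Request, cloud_rate_limited: bool = False) -> str:
    """Return "local" or "cloud" for a single request."""
    # Privacy partition comes first: sensitive content never leaves the machine.
    if req.sensitive:
        return "local"
    # Unknown task classes go to cloud here; that default is an assumption.
    choice = DEFAULT_ROUTE.get(req.task_class, "cloud")
    # Capacity hedge: degrade quality rather than stop when cloud is throttled.
    if choice == "cloud" and cloud_rate_limited:
        return "local"
    return choice


print(route(Request("autocomplete")))
print(route(Request("complex_reasoning", sensitive=True)))
print(route(Request("multimodal"), cloud_rate_limited=True))
```

Checking privacy before cost means a sensitive request can never be routed to cloud just because cloud is the default for its task class.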

Open models lag closed by ~8 months, not 4.

Headline benchmark indices say open models are 4–5 months behind frontier. Once you adjust for token usage, eval freshness, distillation, and release-tax delays, the practical gap is closer to 8 months. That's why the split table above doesn't pretend everything goes local.

“Open models may be only 4–5 months behind on coding-heavy, benchmark-visible tasks. But once you adjust for token usage, benchmark weighting, eval freshness, distillation, release delays, and harder-to-measure general capabilities, the gap is likely much larger and closer to 8 months.”

Open models burn 1.5–2x more tokens

On the Artificial Analysis Index, top open models reach Opus 4.5 scores by spending roughly 1.5–2x as many tokens. Same benchmark number, different economics. A model that needs twice the tokens isn't equally capable in any sense that matters when you're paying.
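
To see what the token multiplier does to a price comparison, here is a small sketch. The per-million-token prices are placeholders chosen for illustration, not quotes from any provider; only the 1.5–2x multiplier comes from the text.

```python
def effective_cost_ratio(open_price: float, closed_price: float,
                         token_multiplier: float) -> float:
    """Open-model cost per task relative to the closed model at equal score.

    Prices are per million tokens. token_multiplier is how many times more
    tokens the open model spends to reach the same benchmark result.
    """
    return open_price * token_multiplier / closed_price


# Placeholder prices (hypothetical): open at 1.00, closed at 5.00 per 1M tokens.
list_price_ratio = effective_cost_ratio(1.00, 5.00, token_multiplier=1.0)
low = effective_cost_ratio(1.00, 5.00, token_multiplier=1.5)
high = effective_cost_ratio(1.00, 5.00, token_multiplier=2.0)

print(f"list price: {list_price_ratio:.2f}x, token-adjusted: {low:.2f}x to {high:.2f}x")
```

With these placeholder prices the open model is still cheaper per task, but the advantage is smaller than the list price suggests.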

Distillation compresses the visible gap

Some of the open-model catch-up isn't independent progress — it's downstream of frontier API output being used to generate synthetic data, reasoning traces, and preference pairs. The comparison isn't “open labs vs. closed labs.” It's “open labs plus leaked frontier capability vs. closed labs.”

Closed labs pay a release tax

Frontier closed models add 1–2 months of red-teaming, safety evals, and product hardening before release. Open models often ship faster because they skip most of that. So even on the coding axis where open looks closest, the effective gap is more like 5–7 months once release timing is accounted for.

DevPulse is built for this. Local handles the volume — code completions, drafting, agent loops, batch retrieval — where 8 months behind is indistinguishable from frontier. Frontier APIs handle the genuinely hard reasoning where the gap shows up. The router decides per-task, not per-vendor.

The token subsidy is being pulled back.

The frontier gap is real on the hardest tasks. The cost gap on everything else is decisive. As frontier APIs move to consumption billing and tighter rate limits, the agent volume that doesn't need frontier intelligence is getting expensive — fast.

“The age of the token subsidy is being pulled back. Open models have crossed an intelligence threshold making them viable for real-world agents at a fraction of the cost. As teams get exponentially larger monthly bills from the labs, it's worth exploring how many agents today perform just as well using open models.”

Subagents and drivers

You don't have to swap the whole agent. The first place open models slot in is as subagents: the calls that fan out from a frontier-driven plan. Or as drivers, with a frontier model in an advisor role. Either way, the expensive token count drops by an order of magnitude.

Volume work, not edge cases

Code completions, retrieval reranking, log triage, drafting, batch summarization — the work that happens on a tokens-per-minute basis. The marginal cost difference between a frontier API and a local 32B is > 20x. The output difference at this task class is < 5%.

Tuning is real but cheap

Open models need some prompt tuning to match a frontier model in a given harness. Not weeks — hours. The savings cover the tuning cost in days, then keep compounding while frontier billing tightens.

Where to start swapping.

Pick the workload, see the open models that replace it, click through for the full RAM / quantization breakdown.

Frontier reasoning · ~19 GB
Replaces Claude Sonnet · Mid-tier Mac
Qwen 3 32B · DeepSeek R1 32B · Kimi K2

Volume calls · ~5.5 GB
Replaces Claude Haiku · Any modern Mac
Qwen 3 8B · Llama 3.1 8B

Long-context reasoning · ~42 GB
Replaces GPT-4-class · 64 GB+ Mac
Llama 3.3 70B · DeepSeek R1

Edge / always-on agents · ~3 GB
Replaces Cheap cloud calls · Any Mac
Qwen 3 4B · Llama 3.2 3B
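
A quick way to sanity-check which of these tiers a given machine can hold. This is a sketch only: the footprints are the approximate figures above, and the 4 GB headroom is an assumption (it matches the 4096 MB free-memory target used in the babysit example further down, but pick your own).

```python
# Approximate memory footprints from the workload tiers above, in GB.
TIERS_GB = {
    "Edge / always-on agents": 3.0,
    "Volume calls": 5.5,
    "Frontier reasoning": 19.0,
    "Long-context reasoning": 42.0,
}


def tiers_that_fit(free_gb: float, headroom_gb: float = 4.0) -> list[str]:
    """Tiers whose footprint fits in free memory after reserving headroom.

    headroom_gb is an assumed safety margin for the OS and other apps.
    """
    usable = free_gb - headroom_gb
    return [name for name, gb in TIERS_GB.items() if gb <= usable]


print(tiers_that_fit(24.0))  # e.g. a machine with 24 GB currently free
print(tiers_that_fit(8.0))
```

Free memory, not installed memory, is the number that matters here, which is why a pre-flight check at load time is more useful than a static spec sheet.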

Field signal.

Builders are split between the ambition and the realism — exactly the hybrid line.

“I'm obsessed with running local LLMs. Been working with an engineer to build product(s) that are 100% local. A new model that came out recently instantly improved the quality of our product. We live in really interesting times.”

@hnshah, on local LLMs in product

“We're not quite there yet.”

@hnshah, replying to a builder proposing to swap his team's Claude / Codex subscription for Kimi K2 inference on a single M-series rig

Both true at once. Local is already production-grade for specific product surfaces. Whole-team replacement of a frontier subscription isn't there yet. The right move is hybrid now — and DevPulse is the local half of the harness: pre-flight VRAM checks, idle-model unloading, and the budget math that keeps the swap reliable on a Mac.

How DevPulse fits hybrid.

The router's job is easy if local is reliable. Local is reliable if something keeps the rest of your stack from squeezing it. That's DevPulse.

# 1. router asks: is local available right now?
$ devpulse ai --before-load 20000 --json | jq .verdict
"fits"

# 2. router decides per-request:
#    autocomplete + drafting + privacy → local (Ollama)
#    deep reasoning + multimodal       → cloud (Anthropic API)

# 3. babysit keeps local healthy in the background
$ devpulse babysit --target-free-mb 4096 --json &

# 4. when cloud rate-limits, router uses local exit-code as fallback
$ if devpulse ai --before-load 20000; then
    use_local
  else
    use_cloud_or_queue
  fi

DevPulse is the operations layer for the local side of a hybrid stack. Pre-flight checks make local reliable enough to lean on. Babysit keeps it alive over long runs. The CLI's exit codes give your router a clean signal to branch on. Coming in v1.5.0: devpulse route, a one-line routing decision baked into the CLI.
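
If your router is not a shell script, the same exit-code branch can be sketched in Python. The command and its arguments are copied from the shell example above. Treating exit code 0 as "fits" is inferred from that example's `if` statement, not from CLI documentation. `local_fits` and `pick_backend` are hypothetical helper names.

```python
import subprocess
from typing import Sequence

# Pre-flight command copied from the shell example above.
PREFLIGHT_CMD = ("devpulse", "ai", "--before-load", "20000")


def local_fits(cmd: Sequence[str] = PREFLIGHT_CMD, timeout_s: float = 10.0) -> bool:
    """True if the pre-flight command exits 0.

    A non-zero exit, a missing binary, or a timeout all count as "does not fit".
    """
    try:
        result = subprocess.run(list(cmd), capture_output=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def pick_backend(cloud_rate_limited: bool, cmd: Sequence[str] = PREFLIGHT_CMD) -> str:
    """For a request that would normally go to cloud, mirror step 4 above.

    On a rate limit, use local if the pre-flight check passes, otherwise queue.
    """
    if not cloud_rate_limited:
        return "cloud"
    return "local" if local_fits(cmd) else "queue"
```

Treating every failure mode as "does not fit" keeps the router conservative: if the check itself cannot run, the request queues instead of loading a model blind.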

What hybrid actually saves.

The economics page calculator now has a routing-mix slider. Move it to your real split and see the blended monthly cost.
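
The blended-cost arithmetic behind a slider like that can be sketched in a few lines. This is not the calculator's actual code. The 1,000 monthly bill is a placeholder, `local_share` is assumed to be a share of spend rather than of request count, and the 0.05 cost ratio is one reading of the "> 20x" marginal cost difference quoted earlier.

```python
def blended_monthly_cost(cloud_only_cost: float, local_share: float,
                         local_cost_ratio: float = 0.0) -> float:
    """Monthly cost after moving `local_share` of spend to local models.

    local_cost_ratio is local cost per unit of work relative to cloud:
    0.0 means zero marginal cost, 0.05 means local is 20x cheaper.
    """
    if not 0.0 <= local_share <= 1.0:
        raise ValueError("local_share must be between 0 and 1")
    cloud_part = cloud_only_cost * (1.0 - local_share)
    local_part = cloud_only_cost * local_share * local_cost_ratio
    return cloud_part + local_part


# Placeholder bill of 1,000 per month with 70% of spend routed local.
print(round(blended_monthly_cost(1000.0, 0.7), 2))        # zero marginal local cost
print(round(blended_monthly_cost(1000.0, 0.7, 0.05), 2))  # local at 1/20 of cloud cost
```

With zero marginal local cost, the saving equals the local share, which is where the 60-80% figure comes from. Any real local cost (power, hardware amortization) trims it slightly.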

Open the hybrid cost calculator →

FAQ.

Won't I get worse output if I route some traffic local?

Only on the share you misroute. The point of routing is that 60-80% of a typical workflow is task classes where 32B local models match frontier output. The 20-40% that doesn't still goes cloud. Net quality: ~unchanged. Net cost: 60-80% lower.

Isn't this just “use both”? What's actually new?

The new part is making the local side reliable enough that you can route to it deterministically instead of as a hobbyist fallback. Pre-flight checks, auto-clean, babysit watchdog: that's the difference between “I have Ollama installed” and “I trust local to handle 70% of my agent traffic.”

What about privacy if I'm sending some to the cloud?

Hybrid lets you partition explicitly. Sensitive code and client data always go local. Generic, public, or already-shared content can go cloud. The router enforces the partition; you're not trusting yourself to remember.

Doesn't this depend on Anthropic / OpenAI staying around?

It depends on some frontier provider staying around, and hybrid is precisely how you're insulated if one doesn't. The router can swap cloud providers without changing the local side. Multi-provider hybrid is more resilient than single-provider cloud.

The rational mix has a toolchain.

DevPulse is the operations layer that makes the local side of your hybrid AI stack reliable enough to route to.

Download for macOS

macOS 14+ · Apple Silicon & Intel · Free during launch