Next session: Hub LLM cost routing (hosted Hub only)
Decision: DeepInfra single-provider — 2026-04-30
Status: code on feat/hosted-mcp-hub-create-proposal; staging validation pending; production flip pending.

What was decided: Replace per-feature LLM providers (OpenAI primary, Anthropic fallback, separate Voyage / OpenAI embeddings, juggling ElevenLabs / image-gen keys) with a single DeepInfra OpenAI-compatible key (DEEPINFRA_API_KEY). The same key drives:
- hosted Hub chat (review hints + Enrich) via lib/llm-complete.mjs when KNOWTATION_CHAT_PROVIDER=deepinfra.
- hosted bridge embeddings via EMBEDDING_PROVIDER=deepinfra.
- OpenClaw 4.27 orchestration (chat, embeddings, image gen, TTS, audio understanding), using the same key.

Why this and not Groq / OpenRouter / self-hosted Ollama: Groq had rate limits and capability gaps that bit prior research; OpenRouter adds another middleman; self-hosted Ollama needs a $15–20/mo VPS reachable from Netlify. DeepInfra offers one key, the OpenAI wire format (drops into existing lib/llm-complete.mjs with one new branch), Qwen 2.5 / Llama 3.x / Mistral chat models, and BGE / Qwen embedding models. OpenClaw 4.27 also made it a first-class bundled provider, so the OpenClaw conveyor belt and hosted Hub share one bill, one rotation, one place to watch spend.

Backward compatibility (verified by 17 unit tests in test/llm-complete-deepinfra.test.mjs):
- KNOWTATION_CHAT_PROVIDER=deepinfra → DeepInfra wins, with OpenAI / Anthropic as automatic fallback if their keys are still set.
- KNOWTATION_CHAT_PROVIDER=openai|anthropic → explicit lock to that provider (no fallback).
- Implicit DeepInfra: only fires when DEEPINFRA_API_KEY is set AND neither OPENAI_API_KEY nor ANTHROPIC_API_KEY is set. Existing OpenAI deploys are NOT silently flipped by adding a DeepInfra key for OpenClaw.
- Otherwise: existing OpenAI → Anthropic → Ollama default order is preserved (and KNOWTATION_CHAT_PREFER_ANTHROPIC=1 still flips OpenAI/Anthropic order when both are set).

Required gates before production flip on Netlify (do NOT skip):
- Run node scripts/validate-deepinfra-enrich.mjs with KNOWTATION_CHAT_PROVIDER=deepinfra on a staging Netlify deploy. Pass condition: 10/10 of the built-in Enrich samples must return parseOk=true and produce only allow-list frontmatter keys. If <10/10, do not flip; try a stronger model (Qwen/Qwen2.5-72B-Instruct is the default; for cheap review hints set DEEPINFRA_CHAT_MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct only after validating the chosen model still passes Enrich).
- Re-index a non-production vault on EMBEDDING_PROVIDER=deepinfra + EMBEDDING_MODEL=BAAI/bge-large-en-v1.5 and verify Meaning search returns the same top-3 notes for 10 known queries. Embedding-dimension change requires a full vault re-index (1024 dim by default; see embeddingDimension in lib/embedding.mjs).

Production flip: On the gateway Netlify site, set DEEPINFRA_API_KEY, KNOWTATION_CHAT_PROVIDER=deepinfra, and optionally DEEPINFRA_CHAT_MODEL. Keep OPENAI_API_KEY set for fallback. On the bridge Netlify site (when separate), set DEEPINFRA_API_KEY, EMBEDDING_PROVIDER=deepinfra, EMBEDDING_MODEL=BAAI/bge-large-en-v1.5, then re-index. Watch proposal-review-hints-async + proposal-enrich-hosted logs for 24h. Roll back by removing KNOWTATION_CHAT_PROVIDER (chat falls back to OpenAI) and switching EMBEDDING_PROVIDER back on the bridge (then re-index on the prior model).

What this supersedes from the original options table below: the "Groq via OpenAI-compat", "Remote Ollama on small VPS", and "Hybrid" rows. The DeepInfra row is the answer for our scale and time budget. The remaining Groq / Ollama notes stay only as historical alternatives in case DeepInfra has an outage longer than fallback can absorb.
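
As a rough illustration of the second gate above (same top-3 notes for 10 known queries), the comparison could be scripted along these lines. compareTopNotes and the input shape (a query string mapped to an ordered list of note IDs) are hypothetical and not part of the repo, and the sketch assumes "same top-3" means the same set of notes regardless of order.

```javascript
// Hypothetical helper, not part of the repo: compares the top-k note IDs per query
// between the current index (baseline) and the DeepInfra re-index (candidate).
function compareTopNotes(baseline, candidate, k = 3) {
  const mismatches = [];
  for (const [query, baselineIds] of Object.entries(baseline)) {
    const expected = baselineIds.slice(0, k);
    const actual = (candidate[query] ?? []).slice(0, k);
    // Assumption: order within the top k does not matter, only membership.
    const expectedSet = new Set(expected);
    const same =
      expected.length === actual.length && actual.every((id) => expectedSet.has(id));
    if (!same) mismatches.push({ query, expected, actual });
  }
  return { pass: mismatches.length === 0, mismatches };
}

// Example with made-up queries and note IDs.
const report = compareTopNotes(
  { 'llm routing': ['n1', 'n2', 'n3'], 'netlify env': ['n4', 'n5', 'n6'] },
  { 'llm routing': ['n2', 'n1', 'n3'], 'netlify env': ['n4', 'n5', 'n9'] },
);
console.log(report.pass, report.mismatches.map((m) => m.query));
```

If ranking order should also match, the membership test can be swapped for an index-by-index comparison.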
Owner: repo author (this branch). Reviewers: none required for code (all tests green); operator must run the staging validation script before flipping production env vars.
Scope (read first)
| In scope for this plan | Out of scope (not the target of “save money” here) |
|---|---|
| Hosted Knowtation Hub: the app backed by hub/gateway on Netlify (or any serverless/long-lived cloud deploy of the same gateway), including Netlify environment variables that drive chat for production users | Local development: your laptop’s default LLM, local npm run hub, local CLI / daemon (daemon-llm.mjs, config/local.yaml); those may benefit from the same code changes later but are not what this document is optimizing for cost |
| Dollar impact: OpenAI / Anthropic bills triggered by hosted traffic (proposal review hints, proposal Enrich, hosted MCP paths that call completeChat, etc.) | “I want cheaper models when I run knowtation at home”: separate conversation; localhost Ollama already works locally without this plan |
Summary: This document is about cloud / hosted Hub spend (API keys and URLs on the gateway’s deploy, e.g. Netlify), not about replacing your local dev setup.
Use this document to plan research and implementation for reducing or eliminating OpenAI API spend on Knowtation hosted Hub features that call completeChat() (lib/llm-complete.mjs), especially:
- Proposal review hints (hub/gateway/proposal-review-hints-async.mjs)
- Proposal Enrich (hub/gateway/proposal-enrich-hosted.mjs)
- MCP summarize, hosted MCP, or other gateway paths that import completeChat in the same deploy

Embeddings (indexing / Meaning search on the hosted bridge) are a separate configuration (EMBEDDING_PROVIDER, bridge env, embedding.* in config). This session focuses on chat completions for the gateway unless you explicitly decide to align bridge + gateway secrets in one pass.
Current behavior (facts from repo)
- completeChat provider order: OPENAI_API_KEY set → OpenAI; else ANTHROPIC_API_KEY → Anthropic; else Ollama at OLLAMA_URL + /api/chat with OLLAMA_CHAT_MODEL / OLLAMA_MODEL (default llama3.2).
- Optional KNOWTATION_CHAT_PREFER_ANTHROPIC=1 flips the OpenAI/Anthropic order when both keys exist.
- Hosted Netlify cannot reach http://localhost:11434. On hosted Hub, Ollama only works if OLLAMA_URL points to a publicly reachable host (your VPS, Fly.io, etc.).
- daemon-llm.mjs already supports OpenAI-compatible endpoints (callOpenAiCompat, custom base_url) for local daemon flows; hosted proposal jobs use completeChat directly today, not daemonLlm.
- BornFree (bornfree-hub) uses Groq (OpenAI-compatible …/v1/chat/completions) with env keys and a provider fallback chain, a proven pattern for low/zero marginal cost hosted chat.
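
The default order above, combined with the KNOWTATION_CHAT_PROVIDER rules from the decision at the top of this document, can be sketched as a single pure function. This is a minimal sketch, not the code in lib/llm-complete.mjs: resolveChatProviders is a hypothetical name, the return shape (an ordered list, first entry tried first) is an assumption, and so is applying the prefer-Anthropic flag to the DeepInfra fallback list.

```javascript
// Sketch only: mirrors the selection rules described in this document.
// resolveChatProviders is a hypothetical name, not an export of lib/llm-complete.mjs.
function resolveChatProviders(env) {
  const has = (name) => Boolean(env[name] && String(env[name]).trim());
  const explicit = String(env.KNOWTATION_CHAT_PROVIDER ?? '').trim().toLowerCase();

  // Keyed providers in default order; flipped when the prefer flag is set and both keys exist.
  const keyed = [];
  if (has('OPENAI_API_KEY')) keyed.push('openai');
  if (has('ANTHROPIC_API_KEY')) keyed.push('anthropic');
  if (keyed.length === 2 && env.KNOWTATION_CHAT_PREFER_ANTHROPIC === '1') keyed.reverse();

  // Explicit DeepInfra: DeepInfra first, keyed providers kept as fallback.
  if (explicit === 'deepinfra') return ['deepinfra', ...keyed];

  // Explicit lock to OpenAI or Anthropic: no fallback.
  if (explicit === 'openai' || explicit === 'anthropic') return [explicit];

  // Implicit DeepInfra: only when neither an OpenAI nor an Anthropic key is present.
  if (has('DEEPINFRA_API_KEY') && keyed.length === 0) return ['deepinfra'];

  // Default order: OpenAI, then Anthropic, then Ollama.
  return [...keyed, 'ollama'];
}

// Adding a DeepInfra key next to an existing OpenAI key does not flip the deploy.
console.log(resolveChatProviders({ OPENAI_API_KEY: 'sk-x', DEEPINFRA_API_KEY: 'di-x' }));
```

Passing env as an argument instead of reading process.env directly keeps the function easy to exercise with different env combinations in unit tests.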
Problem statement
- Goal: On hosted Hub, keep review hints, Enrich, MCP summarize, and related gateway LLM features functionally equivalent (quality acceptable for internal/advisory use) while avoiding per-token OpenAI bills where possible.
- Constraint: Prefer no new always-on server when a managed API (Groq, Together, OpenRouter, etc.) is sufficient for Netlify-side calls; accept a small VPS + Ollama/vLLM ($15–20/mo) only if traffic, privacy, or rate limits require it.
Research checklist (assign owners / dates)
A. Volume and cost (hosted)
- [ ] Export Netlify / gateway logs or billing: approximate chat calls per day (hints + enrich + MCP summarize + hosted MCP).
- [ ] Estimate tokens per call (hints/enrich caps in code: e.g. maxTokens: 400, body slices ~12k chars).
- [ ] Price OpenAI gpt-4o-mini vs Groq vs OpenRouter small models at that volume.
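
The arithmetic for this checklist is simple enough to script once call counts are known. The sketch below is an illustration only: estimateMonthlyCostUsd is a hypothetical helper, the 4-characters-per-token ratio is a common rule of thumb rather than a measured value, and the prices in the example are placeholders, not any provider's real rates.

```javascript
// Hypothetical helper: rough monthly cost for one provider at a given call volume.
function estimateMonthlyCostUsd({
  callsPerDay,
  inputCharsPerCall,
  outputTokensPerCall,
  usdPerMillionInputTokens,
  usdPerMillionOutputTokens,
  charsPerToken = 4, // assumption: rule of thumb for English text
  daysPerMonth = 30,
}) {
  const inputTokensPerCall = Math.ceil(inputCharsPerCall / charsPerToken);
  const usdPerCall =
    (inputTokensPerCall * usdPerMillionInputTokens +
      outputTokensPerCall * usdPerMillionOutputTokens) /
    1e6;
  return usdPerCall * callsPerDay * daysPerMonth;
}

// Caps from the checklist (12k-char body slice, maxTokens 400); volume and prices are placeholders.
const monthly = estimateMonthlyCostUsd({
  callsPerDay: 100,
  inputCharsPerCall: 12000,
  outputTokensPerCall: 400,
  usdPerMillionInputTokens: 1.0,
  usdPerMillionOutputTokens: 2.0,
});
console.log(monthly.toFixed(2)); // "11.40" with these placeholder inputs
```

Running the same call with each provider's published prices gives a like-for-like comparison at the measured volume.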
B. Provider capabilities
- [ ] Groq: rate limits, free tier caps, model list (Llama 3.x), JSON reliability for Enrich (structured JSON output).
- [ ] Together / Fireworks / other: OpenAI-compat URL, pricing, EU data residency if needed.
- [ ] Self-hosted Ollama or vLLM (reachable from Netlify): GPU RAM for chosen model, cold start, TLS, auth in front of /api/chat.
C. Code touchpoints
- [ ] Single place to extend: lib/llm-complete.mjs (add optional OPENAI_COMPAT_BASE_URL + key env, or KNOWTATION_CHAT_PROVIDER=groq) vs duplicating in each gateway module.
- [ ] Tests: mock fetch for chat URL; assert provider selection order when env combinations change.
- [ ] Docs: .env.example, docs/HUB-PROPOSAL-LLM-FEATURES.md, Netlify deploy notes (explicit: “set on hosted site, not only in local .env”).
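
To make the first two items concrete, here is a minimal sketch of an OpenAI-compatible chat call with an injectable fetch, so a test can record the URL without network access. buildCompatChatRequest and completeChatCompat are hypothetical names, not functions in lib/llm-complete.mjs; the base URL is a placeholder, and the real value would come from the chosen provider's documentation. The sketch assumes the provider follows the OpenAI chat completions request and response shape.

```javascript
// Hypothetical helper: builds the request for an OpenAI-compatible chat completions endpoint.
function buildCompatChatRequest({ baseUrl, apiKey, model, messages, maxTokens = 400 }) {
  const url = `${baseUrl.replace(/\/+$/, '')}/chat/completions`;
  return {
    url,
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model, messages, max_tokens: maxTokens }),
    },
  };
}

// fetchImpl is injected so tests can pass a mock instead of the global fetch.
async function completeChatCompat(options, fetchImpl = fetch) {
  const { url, init } = buildCompatChatRequest(options);
  const res = await fetchImpl(url, init);
  if (!res.ok) throw new Error(`chat request failed: HTTP ${res.status}`);
  const data = await res.json();
  return data.choices?.[0]?.message?.content ?? '';
}

// Mock fetch that records calls and returns a canned OpenAI-style response.
const recordedCalls = [];
const mockFetch = async (url, init) => {
  recordedCalls.push({ url, init });
  return {
    ok: true,
    status: 200,
    json: async () => ({ choices: [{ message: { content: 'ok' } }] }),
  };
};

completeChatCompat(
  {
    baseUrl: 'https://api.example.com/v1/', // placeholder base URL
    apiKey: 'test-key',
    model: 'example-model',
    messages: [{ role: 'user', content: 'ping' }],
  },
  mockFetch,
).then((text) => console.log(text, recordedCalls[0].url));
```

Keeping request construction in a pure function means the URL and body can be asserted synchronously, while the async wrapper only needs one mock-based test.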
D. Risk
- [ ] Enrich prompts require valid JSON; smaller/weaker models may break validateAndNormalizeEnrichResult, so eval samples or a stricter repair prompt are needed.
- [ ] Hints are plain text; lower risk.
- [ ] Secrets: never commit keys; document GROQ_API_KEY or compat vars in Netlify UI only.
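
For the Enrich JSON risk, an eval harness can be as small as the sketch below: parse each raw model output, reject anything that is not a JSON object, and flag keys outside an allow-list. This does not reproduce validateAndNormalizeEnrichResult or scripts/validate-deepinfra-enrich.mjs; evaluateEnrichOutputs is a hypothetical helper and the allow-list in the example is a placeholder, not the repo's actual key list.

```javascript
// Hypothetical eval helper: every sample must parse as a JSON object and
// contain only allow-listed keys for the run to pass.
function evaluateEnrichOutputs(rawOutputs, allowedKeys) {
  const allowed = new Set(allowedKeys);
  const results = rawOutputs.map((raw) => {
    let parsed;
    try {
      parsed = JSON.parse(raw);
    } catch {
      return { parseOk: false, extraKeys: [] };
    }
    const isObject = parsed !== null && typeof parsed === 'object' && !Array.isArray(parsed);
    if (!isObject) return { parseOk: false, extraKeys: [] };
    const extraKeys = Object.keys(parsed).filter((key) => !allowed.has(key));
    return { parseOk: true, extraKeys };
  });
  const passed = results.filter((r) => r.parseOk && r.extraKeys.length === 0).length;
  return {
    passed,
    total: results.length,
    pass: results.length > 0 && passed === results.length,
    results,
  };
}

// Placeholder allow-list and made-up model outputs.
const evalReport = evaluateEnrichOutputs(
  ['{"title":"Note","tags":["a"]}', 'Sure! Here is the JSON: {...}', '{"title":"Note","author":"x"}'],
  ['title', 'tags'],
);
console.log(`${evalReport.passed}/${evalReport.total}`, evalReport.pass);
```

The second sample fails to parse and the third carries a key outside the allow-list, so only one of three passes and the run as a whole fails.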
Implementation options (high level)
| Option | New server? | Marginal API cost | Notes |
|---|---|---|---|
| Groq (or OpenRouter) via OpenAI-compat in completeChat | No | Low / free tier | Align with BornFree; one env block on Netlify. |
| Remote Ollama on small VPS | Yes (~$15–20/mo) | Electricity + VPS | Full control; set OLLAMA_URL on Netlify to that host; not localhost. |
| Hybrid | Optional | Embeddings on Voyage/OpenAI, chat on Groq/Ollama | Already conceptually split in docs. |
Suggested decision flow
- If hosted call volume is low and Groq free tier covers it → implement OpenAI-compat base URL in completeChat, point at Groq on Netlify, unset OPENAI_API_KEY for chat (or add explicit “chat provider” override so embeddings can keep OpenAI if desired).
- If rate limits or JSON quality bite → try paid Groq or OpenRouter mid model before self-hosting.
- If data must not leave your infra → VPS + Ollama reachable from Netlify; keep embeddings on current provider or run nomic-embed-text etc. on same box.
Deliverables for the PR that implements routing
- [ ] Env vars documented and backward compatible (default unchanged if only OPENAI_API_KEY set on hosted).
- [ ] Unit tests for provider selection.
- [ ] Staging Netlify deploy: run create proposal + confirm review hints and Enrich end-to-end.
Related files
- lib/llm-complete.mjs: chat routing (shared; hosted gateway loads this)
- lib/daemon-llm.mjs: callOpenAiCompat reference (local daemon; useful pattern for hosted completeChat)
- hub/gateway/proposal-review-hints-async.mjs, proposal-enrich-hosted.mjs
- docs/HUB-PROPOSAL-LLM-FEATURES.md
- bornfree-hub/api/lib/llm.js: Groq-first pattern (cross-repo reference)
Hint timeout context (fixed separately)
On hosted Hub, hints run inside an 18s race after POST /proposals. Merging client body into the hints job avoids an extra canister GET and reduces timeouts; see PR introducing proposal-hints-create-context.mjs (merged via fix/review-hints-merge-client-body).
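
For readers unfamiliar with the pattern, a time-boxed job of this kind is typically built on Promise.race. The sketch below is generic and does not reproduce the gateway code: raceWithTimeout is a hypothetical helper, and resolving with a fallback value on timeout (rather than rejecting) is an assumption about the desired behavior.

```javascript
// Hypothetical helper: resolves with the job's value, or with `fallback` once `ms` elapses.
function raceWithTimeout(jobPromise, ms, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  // Clear the timer once the race settles so a fast job does not leave it pending.
  return Promise.race([jobPromise, timeout]).finally(() => clearTimeout(timer));
}

// Demo with short delays: the job takes 200 ms, the budget is 50 ms.
const slowJob = new Promise((resolve) => setTimeout(() => resolve('late hints'), 200));
raceWithTimeout(slowJob, 50, null).then((result) => console.log(result)); // logs null
```

With an 18s budget the call would pass 18000 as the timeout; any work shaved off the job (such as the extra GET mentioned above) directly widens the margin before the fallback wins.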