# Next session: Hub LLM cost routing (**hosted Hub only**) > ## Decision: DeepInfra single-provider — 2026-04-30 > > **Status:** code on `feat/hosted-mcp-hub-create-proposal`; staging validation pending; production flip pending. > > **What was decided:** Replace per-feature LLM providers (OpenAI primary, Anthropic fallback, separate Voyage / OpenAI embeddings, juggling ElevenLabs / image-gen keys) with a **single DeepInfra OpenAI-compatible key** (`DEEPINFRA_API_KEY`). The same key drives: > - hosted Hub chat (review hints + Enrich) via `lib/llm-complete.mjs` when `KNOWTATION_CHAT_PROVIDER=deepinfra`. > - hosted bridge embeddings via `EMBEDDING_PROVIDER=deepinfra`. > - OpenClaw 4.27 orchestration (chat, embeddings, image gen, TTS, audio understanding) — same key. > > **Why this and not Groq / OpenRouter / self-hosted Ollama:** Groq had rate limits / capability gaps that bit prior research; OpenRouter adds another middleman; self-hosted Ollama needs a $15–20/mo VPS reachable from Netlify. DeepInfra: one key, OpenAI wire format (drops into existing `lib/llm-complete.mjs` with one new branch), Qwen 2.5 / Llama 3.x / Mistral chat models, BGE / Qwen embedding models, **and** OpenClaw 4.27 made it a first-class bundled provider — so the OpenClaw conveyor belt and hosted Hub share one bill, one rotation, one place to watch spend. > > **Backward compatibility (verified by 17 unit tests in `test/llm-complete-deepinfra.test.mjs`):** > 1. `KNOWTATION_CHAT_PROVIDER=deepinfra` → DeepInfra wins, with OpenAI / Anthropic as automatic fallback if their keys are still set. > 2. `KNOWTATION_CHAT_PROVIDER=openai|anthropic` → explicit lock to that provider (no fallback). > 3. **Implicit DeepInfra:** only fires when `DEEPINFRA_API_KEY` is set AND neither `OPENAI_API_KEY` nor `ANTHROPIC_API_KEY` is set. Existing OpenAI deploys are NOT silently flipped by adding a DeepInfra key for OpenClaw. > 4. Otherwise: existing OpenAI → Anthropic → Ollama default order is preserved (and `KNOWTATION_CHAT_PREFER_ANTHROPIC=1` still flips OpenAI/Anthropic order when both are set). > > **Required gates before production flip on Netlify (do NOT skip):** > - Run `node scripts/validate-deepinfra-enrich.mjs` with `KNOWTATION_CHAT_PROVIDER=deepinfra` on a staging Netlify deploy. Pass condition: 10/10 of the built-in Enrich samples must return `parseOk=true` and produce only allow-list frontmatter keys. If <10/10, do **not** flip — try a stronger model (`Qwen/Qwen2.5-72B-Instruct` is the default; for cheap review hints set `DEEPINFRA_CHAT_MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct` only after validating the chosen model still passes Enrich). > - Re-index a non-production vault on `EMBEDDING_PROVIDER=deepinfra` + `EMBEDDING_MODEL=BAAI/bge-large-en-v1.5` and verify Meaning search returns the same top-3 notes for 10 known queries. Embedding-dimension change requires a full vault re-index (1024 dim by default; see `embeddingDimension` in `lib/embedding.mjs`). > > **Production flip:** On the **gateway** Netlify site: `DEEPINFRA_API_KEY`, `KNOWTATION_CHAT_PROVIDER=deepinfra`, optionally `DEEPINFRA_CHAT_MODEL`. Keep `OPENAI_API_KEY` set for fallback. On the **bridge** Netlify site (when separate): `DEEPINFRA_API_KEY`, `EMBEDDING_PROVIDER=deepinfra`, `EMBEDDING_MODEL=BAAI/bge-large-en-v1.5`, then re-index. Watch `proposal-review-hints-async` + `proposal-enrich-hosted` logs for 24h. Roll back by removing `KNOWTATION_CHAT_PROVIDER` (chat falls back to OpenAI) and switching `EMBEDDING_PROVIDER` back on the bridge (then re-index on the prior model). > > **What this supersedes from the original options table below:** the "Groq via OpenAI-compat", "Remote Ollama on small VPS", and "Hybrid" rows. The DeepInfra row is the answer for our scale and time budget. The remaining Groq / Ollama notes stay only as historical alternatives in case DeepInfra has an outage longer than fallback can absorb. > > **Owner:** repo author (this branch). > **Reviewers:** none required for code (all tests green); operator must run the staging validation script before flipping production env vars. > > --- ## Scope (read first) | In scope for this plan | Out of scope (not the target of “save money” here) | |------------------------|------------------------------------------------------| | **Hosted Knowtation Hub**: the app backed by **`hub/gateway`** on **Netlify** (or any serverless/long-lived cloud deploy of the same gateway), including **Netlify environment variables** that drive chat for production users | **Local development**: your laptop’s default LLM, local **`npm run hub`**, local **CLI / daemon** (`daemon-llm.mjs`, `config/local.yaml`) — those may *benefit from the same code changes later* but are **not** what this document is optimizing for cost | | Dollar impact: **OpenAI / Anthropic bills** triggered by **hosted** traffic (proposal review hints, proposal Enrich, hosted MCP paths that call `completeChat`, etc.) | “I want cheaper models when I run knowtation at home” — separate conversation; localhost **Ollama already works** locally without this plan | **Summary:** This document is about **cloud / hosted Hub spend** (API keys and URLs on the **gateway’s deploy**, e.g. Netlify), **not** about replacing your local dev setup. --- Use this document to **plan research and implementation** for reducing or eliminating **OpenAI API** spend on Knowtation **hosted Hub** features that call `completeChat()` (`lib/llm-complete.mjs`), especially: - **Proposal review hints** (`hub/gateway/proposal-review-hints-async.mjs`) - **Proposal Enrich** (`hub/gateway/proposal-enrich-hosted.mjs`) - **MCP summarize**, **hosted MCP**, or other **gateway** paths that import `completeChat` in the same deploy **Embeddings** (indexing / Meaning search on the **hosted bridge**) are a **separate** configuration (`EMBEDDING_PROVIDER`, bridge env, `embedding.*` in config). This session focuses on **chat** completions for the **gateway** unless you explicitly decide to align bridge + gateway secrets in one pass. --- ## Current behavior (facts from repo) 1. **`completeChat`** provider order: **`OPENAI_API_KEY` set → OpenAI**; else **`ANTHROPIC_API_KEY` → Anthropic**; else **Ollama** at `OLLAMA_URL` + `/api/chat` with `OLLAMA_CHAT_MODEL` / `OLLAMA_MODEL` (default `llama3.2`). 2. **Optional** `KNOWTATION_CHAT_PREFER_ANTHROPIC=1` when **both** OpenAI and Anthropic keys exist. 3. **Hosted Netlify** cannot reach `http://localhost:11434`. On **hosted** Hub, Ollama only works if `OLLAMA_URL` points to a **publicly reachable** host (your VPS, Fly.io, etc.). 4. **`daemon-llm.mjs`** already supports **OpenAI-compatible** endpoints (`callOpenAiCompat`, custom `base_url`) for **local daemon** flows; **hosted proposal jobs** use `completeChat` directly today, not `daemonLlm`. 5. **BornFree** (`bornfree-hub`) uses **Groq** (OpenAI-compatible `…/v1/chat/completions`) with env keys and a provider fallback chain — a proven pattern for low/zero marginal cost **hosted** chat. --- ## Problem statement - **Goal:** On **hosted Hub**, keep **review hints**, **Enrich**, **MCP summarize**, and related **gateway** LLM features **functionally equivalent** (quality acceptable for internal/advisory use) while **avoiding per-token OpenAI bills** where possible. - **Constraint:** Prefer **no** new always-on server only if a **managed API** (Groq, Together, OpenRouter, etc.) is sufficient for **Netlify-side** calls; accept a **small VPS + Ollama/vLLM** ($15–20/mo) if traffic, privacy, or rate limits require it. --- ## Research checklist (assign owners / dates) ### A. Volume and cost (hosted) - [ ] Export **Netlify / gateway logs** or billing: approximate **chat calls per day** (hints + enrich + MCP summarize + hosted MCP). - [ ] Estimate **tokens per call** (hints/enrich caps in code: e.g. `maxTokens: 400`, body slices ~12k chars). - [ ] Price **OpenAI `gpt-4o-mini`** vs **Groq** vs **OpenRouter** small models at that volume. ### B. Provider capabilities - [ ] **Groq:** rate limits, free tier caps, model list (Llama 3.x), JSON reliability for **Enrich** (structured JSON output). - [ ] **Together / Fireworks / other:** OpenAI-compat URL, pricing, EU data residency if needed. - [ ] **Self-hosted Ollama or vLLM** (reachable from Netlify): GPU RAM for chosen model, cold start, TLS, auth in front of `/api/chat`. ### C. Code touchpoints - [ ] Single place to extend: **`lib/llm-complete.mjs`** (add optional `OPENAI_COMPAT_BASE_URL` + key env, or `KNOWTATION_CHAT_PROVIDER=groq`) vs duplicating in each gateway module. - [ ] **Tests:** mock `fetch` for chat URL; assert provider selection order when env combinations change. - [ ] **Docs:** `.env.example`, `docs/HUB-PROPOSAL-LLM-FEATURES.md`, **Netlify** deploy notes (explicit: “set on hosted site, not only in local `.env`”). ### D. Risk - [ ] **Enrich** prompts require **valid JSON**; smaller/weaker models may break `validateAndNormalizeEnrichResult` — need eval samples or stricter repair prompt. - [ ] **Hints** are plain text; lower risk. - [ ] **Secrets:** never commit keys; document `GROQ_API_KEY` or compat vars in **Netlify** UI only. --- ## Implementation options (high level) | Option | New server? | Marginal API cost | Notes | |--------|-------------|-------------------|--------| | **Groq (or OpenRouter) via OpenAI-compat** in `completeChat` | No | Low / free tier | Align with BornFree; one env block on **Netlify**. | | **Remote Ollama** on small VPS | Yes (~$15–20/mo) | Electricity + VPS | Full control; set **`OLLAMA_URL`** on Netlify to that host; not localhost. | | **Hybrid** | Optional | Embeddings on Voyage/OpenAI, chat on Groq/Ollama | Already conceptually split in docs. | --- ## Suggested decision flow 1. If **hosted** call volume is low and **Groq free tier** covers it → implement **OpenAI-compat base URL** in `completeChat`, point at Groq on **Netlify**, **unset `OPENAI_API_KEY`** for chat (or add explicit “chat provider” override so embeddings can keep OpenAI if desired). 2. If **rate limits or JSON quality** bite → try **paid Groq** or **OpenRouter** mid model before self-hosting. 3. If **data must not leave your infra** → **VPS + Ollama** reachable from Netlify; keep embeddings on current provider or run `nomic-embed-text` etc. on same box. --- ## Deliverables for the PR that implements routing - [ ] Env vars documented and **backward compatible** (default unchanged if only `OPENAI_API_KEY` set on hosted). - [ ] Unit tests for provider selection. - [ ] **Staging Netlify** deploy: run **create proposal** + confirm **review hints** and **Enrich** end-to-end. --- ## Related files - `lib/llm-complete.mjs` — chat routing (shared; **hosted gateway** loads this) - `lib/daemon-llm.mjs` — `callOpenAiCompat` reference (**local daemon**; useful pattern for hosted `completeChat`) - `hub/gateway/proposal-review-hints-async.mjs`, `proposal-enrich-hosted.mjs` - `docs/HUB-PROPOSAL-LLM-FEATURES.md` - `bornfree-hub/api/lib/llm.js` — Groq-first pattern (cross-repo reference) --- ## Hint timeout context (fixed separately) On **hosted** Hub, hints run inside a **18s** race after `POST /proposals`. Merging **client `body`** into the hints job avoids an extra canister **GET** and reduces timeouts; see PR introducing `proposal-hints-create-context.mjs` (merged via `fix/review-hints-merge-client-body`).