IMPORT-URL-AND-DOCUMENTS-PHASES.md markdown
246 lines 18.5 KB
Raw
sha256:65ccb454656ea5acdea0a10e559b78bcde1eb6ff753ecc2911bc99d1c3d7cadd feat(calendar): enforce agent context tiers in retrieval AP… Human minor ⚠ breaking 2 days ago

Import roadmap: URL, documents, bulk UX (phased)

Branch: feat/import-url-documents-mcp
Policy: Do not push or open a PR until there is solid, testable work worth review. Commit after each phase (or each shippable slice within a phase) on this branch.

This doc splits work so each phase matches how Knowtation already works: lib/importers/ + runImport, Hub / bridge / gateway HTTP, hosted MCP parity, and tests.


Current behavior (baseline — no new code)

Path Folder import Multiple files at once
CLI (knowtation import …) Yes for types that accept a directory (e.g. markdown walks a folder of .md files; see lib/importers/markdown.mjs). Run import multiple times or point at a folder.
Hub (browser) ZIP: uploads whose name ends in .zip are extracted server-side; runImport receives the extracted directory (see hub/server.mjs and bridge). That matches folder-capable importers (e.g. markdown walks .md/.markdown; ChatGPT/Claude exports). pdf and docx importers require a single file path and reject a directory (lib/importers/pdf.mjs, lib/importers/docx.mjs), so in practice Hub PDF/DOCX = one document per request (4B: N sequential POST for many files), not a server ZIP of a folder. 4A₂ builds a client ZIP for tree-shaped types. Yes (4B + 4A₂ + 4C): multiple, Choose folder (webkitdirectory), the Import modal drop zone (§4C), in-browser JSZip when getHubImportFileMode says client_zip — see web/hub/hub.js and web/hub/hub-client-import-zip.mjs.
Hosted MCP import One base64 file (or ZIP) per tool call. Agents can call import repeatedly (no import_batch).

So: entire-folder ingest already exists on the CLI for supported types. In the Hub, the practical “folder” paths are pre-made ZIP, Choose folder / multi-select, or the 4C drop target in the Import modal (same in-memory file pipeline and caps as 4A₂/4B; Chromium is best for full directory-tree paths from a drag; Safari / Firefox may see flatter File lists, same as a non-Chromium multi-file pick).


Phase 1 — URL import (highest adoption leverage)

Outcome: Paste an HTTPS URL → one or more vault notes (article text when possible, bookmark fallback).

Status on branch feat/import-url-documents-mcp: Shipped (single implementation commit: core lib/, POST /api/v1/import-url on Hub + bridge + gateway, Hub modal fields, hosted MCP import_url, self-hosted MCP url_mode, tests, docs).

Work (representative):

  • lib/url-fetch-safe.mjs — SSRF-safe fetch (HTTPS, DNS, redirects, size, timeout).
  • lib/importers/url.mjs — fetch + Readability-style extraction + writeNote / frontmatter.
  • lib/import-source-types.mjs + lib/import.mjs — register url.
  • POST /api/v1/import-url (JSON) on self-hub + bridge + gateway proxy (multipart import stays file-only).
  • Hub UI: URL field + submit; optional copy when extract fails.
  • Hosted MCP: import_url tool → same JSON route.
  • Self-hosted MCP: extend import with source_type: "url" and input = URL string.
  • Tests: unit (safe fetch + importer), bridge/gateway integration, update test/import-source-types.test.mjs for url.
  • Docs: IMPORT-SOURCES.md, AGENT-INTEGRATION.md, PARITY-MATRIX-HOSTED.md, openapi.yaml.

Commit suggestion: feat(import): url importer + import-url API + Hub + MCP + tests (or split into 2 commits: lib first, then HTTP/UI).


Phase 2 — PDF → Markdown notes

Outcome: source_type: pdf, upload .pdf (Hub multipart / MCP base64 / CLI path) → note body with extracted text.

Status on branch feat/import-url-documents-mcp: Shipped (commit feat(import): pdf source type and importer — core lib/importers/pdf.mjs, unpdf, Hub option + copy, hosted MCP enum via IMPORT_SOURCE_TYPES, tests, docs).

Work:

  • lib/importers/pdf.mjs + dependency (e.g. pdf-parse / unpdf — choose in implementation).
  • Register in import.mjs / import-source-types.mjs.
  • Hub: allow source type + file; hosted uses existing import MCP with new source_type.
  • Fixture PDFs + tests; docs row in IMPORT-SOURCES.

Commit suggestion: feat(import): pdf source type and importer.


Phase 3 — DOCX → Markdown notes

Outcome: source_type: docx, upload .docx → note(s) via Mammoth (or equivalent).

Status on branch feat/import-url-documents-mcp: Shipped (commit feat(import): docx source type and importerlib/importers/docx.mjs, mammoth, Hub option + copy, hosted MCP enum via IMPORT_SOURCE_TYPES, tests, docs).

Work:

  • lib/importers/docx.mjs + mammoth (typical).
  • Same registration / Hub / MCP pattern as PDF.
  • Tests + docs.

Commit suggestion: feat(import): docx source type and importer.


Phase 4 (optional) — Bulk UX: folder picker & multi-file

Goal: Make “bring my whole pile of documents” as easy in the Hub as it already is on the CLI (folder) or via ZIP.

4A — Low cost (recommend first)

Status on branch feat/import-url-documents-mcp: Shipped (Hub Import modal + docs: explain one multipart upload; ZIP for folder-capable types; PDF/DOCX = single file, not ZIP—see IMPORT-SOURCES.md § “Hub browser: ZIP and bulk”.)

  • Document clearly in Hub and IMPORT-SOURCES.md: ZIP a folder of Markdown (or use ZIP for exports that expect a directory); PDF and DOCX = upload one .pdf / .docx per import (ZIP extracts to a directory; those importers require a file).

Complexity: Low (copy/docs only). 4A does not go obsolete when 4A₂ or 4B land: exports still arrive as ZIPs; the copy stays the contract for server-side extraction.

4A₂ — Client-side ZIP (JSZip) — optional supplement

Status on branch feat/import-url-documents-mcp: Shippedweb/hub/hub-client-import-zip.mjs (mode rules, buildImportZipBlobWithJsZip, default caps) + UMD JSZip (CDN) + hub-import-zip-shim.mjs; one POST /api/v1/import with file = hub-bulk.zip when getHubImportFileMode returns client_zip (e.g. markdown + multiple files, chatgpt without a server .zip, claude + all .md, mif / gdrive / notebooklm trees). Not for PDF/DOCX.

Goal: User drags a folder (or many files) in the browser; the Hub builds one .zip and POSTs the existing multipart POST /api/v1/import once—no new server route if the payload is still one file field.

  • Hub: “Choose folder (ZIP in browser)” (webkitdirectory); size / file-count caps; duplicate path renames with optional warning text; large selections stay in browser memory (prefer a desktop-made ZIP for huge trees).
  • Tests: test/hub-client-import-zip.test.mjs.

Complexity: Low–medium (limits and UX). Scope: folder-capable source_type values only—not PDF/DOCX (see lib/importers/pdf.mjs / docx.mjs).

4B — Native multi-file / folder picker (webkitdirectory)

Status on branch feat/import-url-documents-mcp: Shippedweb/hub/hub.js + Import modal: multiple on the main file field; sequential POST /api/v1/import for pdf, docx, mem0-export, linear-export, jira-export, wallet-csv, supabase-memory, audio, and claude-export when the selection is not all-.md (multiple JSON → one import per file). Max 200 files per batch in UI; per-file cap ~100MB (multer on Hub/bridge). Live region #import-batch-aria, Stop batch (Abort). MCP: no import_batch; agents use repeated import as before.

Goal: Users can pick many files or a folder without pre-zipping, within caps.

  • Hub: multiple and webkitdirectory; per-file outcome summary.
  • Server approach: sequential POST /api/v1/import (N requests; reuses multer + runImport on Hub and bridge).
  • Progress / partial failure:Batch: x of n …” and named failure snippets.
  • Limits: 200/batch, ~100MB/file, hosted time limits (~26s common on serverless) — see modal copy and IMPORT-SOURCES.md.
  • Parity: PARITY-MATRIX-HOSTED.md.

Complexity: Medium (client loop + caps + UX).

4C — Import modal drop zone (drag files or a folder)

Status: Shipped — the Import dialog includes a drop region (#import-drop-zone in web/hub/index.html); web/hub/hub.js uses DataTransfer / webkitGetAsEntry where available to build a File[] with relative paths (Chromium: recursive directory walk; flat files fallback otherwise). The same getHubImportFileMode + JSZip/sequential pipeline as 4A₂/4B applies. No new HTTP route or canister. Clear drop state on modal close and when the file inputs change.

Goal: Same outcome as Choose folder or multi-file without a click, inside the Import modal only (not page-level drop).

  • Parity / caps: Same as 4A₂/4B (100MB, 200 sequential, JSZip limits, etc.).

Complexity: Low–medium (80–150 lines of client wiring + HTML/CSS; browser variance in directory drops).

4V — Pre-merge hosted verification

Status: Operational — this subsection is the full 4V gate. This is the import verification track, not HUB-WIZARD-HOSTED-STORY.md “Phase 5” (wizard).

Goal: Before merge (or before relying on a hosted release), prove bridge-backed POST /api/v1/import works in the real gateway + static Hub bundle, including 4C (same HTTP path as Choose folder; client-only diff).

Premises (code): Hosted POST /api/v1/import is proxyImportToBridge on the gateway when BRIDGE_URL is set (hub/gateway/server.mjs). If BRIDGE_URL is unset, the gateway returns 501 and code: NOT_AVAILABLE. 4C only changes how the browser builds File[] in web/hub/hub.js.

flowchart LR
  subgraph client [Hub browser]
    Drop4C[4C drop builds File array]
    ZipSeq[JSZip or sequential]
  end
  subgraph gw [Gateway]
    ImportPOST["POST /api/v1/import"]
    Proxy["proxyImportToBridge"]
  end
  subgraph br [Bridge]
    RunImport["runImport to canister"]
  end
  Drop4C --> ZipSeq
  ZipSeq --> ImportPOST
  ImportPOST --> Proxy
  Proxy --> RunImport

4V checks (copy into PR as evidence): V1 BRIDGE_URL on the gateway = bridge origin only (https://…, no path) — hub/gateway/README.md. V2 (optional) env without BRIDGE_URLPOST /api/v1/import returns 501 + NOT_AVAILABLE. V3 With bridge: Markdown + small .md tree — Choose folder once, 4C drop once (Chromium) — same Network pattern for client_zip (e.g. one hub-bulk.zip) and >0 notes. V4 After deploy, hub.js?v=… and #import-drop-zone in the Import modal. V5 (optional) GET {BRIDGE}/api/v1/bridge-version. V6 If BILLING_ENFORCE, import not blocked. V7 Small fixtures (hosted timeouts). Waiver: If time-critical, state in the PR; follow up in an issue.

  1. 4A₂ (JSZip)shipped.
  2. 4B (multi + folder) — shipped (sequential, no new HTTP body).
  3. 4C (import modal drop zone) — shipped.
  4. 4V (hosted verification) — complete §4V checks above before merge of Hub import or gateway proxyImport changes.
  5. Optional next: documented limits in openapi.yaml / HUB-API.md (HTTP shape unchanged; limits are product caps).

Recommendation: Treat 4A + 4A₂ + 4B + 4C as a stacked story: copy explains behavior; JSZip reduces friction for ZIP-shaped flows; 4B/4C cover users who never zip. None of these replace the others.


MCP summary (after phases land)

Capability Hosted MCP Self-hosted MCP
URL import_url (new) import + source_type: url
PDF / DOCX import + source_type pdf or docx + base64 file import + path or local workflow

Adoption note

Ease of import does affect adoption. Order of impact:

  1. URL (paste) — removes “save as Markdown” friction for web content.
  2. PDF + DOCX — removes external converter for the two most common office formats (both shipped on this branch).
  3. Bulk — ZIP (and later multi-file) removes friction for migrations and “dump folder here.”

  • Branch: Stay on feat/import-url-documents-mcp for Phase 2 (PDF) and Phase 3 (DOCX). One PR at the end is a good default: reviewers see URL + PDF + DOCX together or you can open the PR after Phase 2 if you want PDF reviewed before DOCX.
  • Commits: Keep one commit per phase (or per logical slice). Phase 1 is already committed; add feat(import): pdf … (and later feat(import): docx …) on the same branch.
  • Push / merge to main: Not required between phases for local or CI testing. Push when you want backup, CI on the remote, or a draft PR for early feedback. Merge to main when you are ready to release (hosted bridge/gateway deploy coordination).

Testing, merge, and what’s next (after 4A₂ + 4B)

The “failed” importUrl test

  • test/import-url-importer.test.mjs does a real HTTPS request to https://example.com/ (with dryRun). If your environment blocks DNS or outbound HTTPS (e.g. some local sandboxes), the test can fail with getaddrinfo ENOTFOUND or similar. That is not a regression from the Hub bulk work.
  • CI (see .github/workflows/ci.yml npm test on ubuntu-latest) has normal network, so this test is expected to pass there.
  • New code tests: node --test test/hub-client-import-zip.test.mjs (mode rules + small JSZip build). There is no browser E2E in the repo for the Hub UI; use IMPORT-MANUAL-CHECKLIST.md § “Hub — Phase 4A₂, 4B, and 4C” and §4V (hosted) above.

Canister / on-chain

  • No ICP canister changes are required for 4A₂/4B/4C: the same POST /api/v1/import and bridge runImport path is used. Hosted behavior still depends on bridge + gateway as today.

Merge to main

  • Reasonable to merge when: CI is green on the PR, 4V (this doc §4V) is complete for PRs that change web/hub Import flows or gateway proxyImport / POST /api/v1/import (or a waiver is recorded in the PR), plus a short manual pass in the Import modal (self-hosted npm run hub and/or hosted with BRIDGE_URL, deploy preview or production as appropriate), and you’re ready to coordinate deploy (gateway/bridge/Hub static assets). No extra canister work for this feature alone.

Optional follow-up work (not blocking merge)

  • OpenAPI / HUB-API prose for product caps (100MB, 200 files per batch) if you want the HTTP doc to spell out client limits (the route contract is unchanged).
  • Deeper E2E (Playwright, etc.) if you add that stack later.

Next session prompt (optional polish after 4A₂ / 4B)

Phase 4A₂ and 4B are shipped on feat/import-url-documents-mcp (see § above). For a new chat, copy the block below if you want to continue with optional polish; adjust paths if your clone differs.

We are on branch feat/import-url-documents-mcp. Phases 1–3 (URL, PDF, DOCX) and Phase 4A (bulk import docs + Hub copy: ZIP vs single-file, folder-capable types) are shipped — see docs/IMPORT-URL-AND-DOCUMENTS-PHASES.md § Phase 4 and docs/IMPORT-SOURCES.md § “Hub browser: one upload, ZIP extraction, and bulk”.

Implement the strongest Hub bulk UX in one pass, in this order:

**A) Phase 4A₂ — Client-side ZIP (JSZip)**
- Only for source types where a ZIP of a folder is already valid server-side (e.g. markdown, ChatGPT/Claude-style exports — verify against runImport + zip extraction in hub/server.mjs and hub/bridge/server.mjs). Do NOT offer “zip my folder” for PDF/DOCX (single-file importers; see lib/importers/pdf.mjs and docx.mjs).
- Add JSZip (or equivalent) to the Hub bundle; drag folder or multi-select files → build one in-memory ZIP with sensible paths (preserve relative paths for markdown trees) → single existing multipart POST /api/v1/import with field `file`.
- Hard limits: max uncompressed total size, max file count, max zip bytes; clear error UX and cancel where feasible; document tradeoffs (memory, mobile) in IMPORT-SOURCES.md or IMPORT-URL-AND-DOCUMENTS-PHASES.md.
- Tests: zip builder unit tests with small fixtures, or a tight manual checklist in test docs if full E2E is heavy.

**B) Phase 4B — Native multi-file / webkitdirectory**
- Hub: enable folder picker and/or multiple file selection where product policy allows; sequential POST /api/v1/import per file OR one batch multipart (pick one approach, justify in PR); progress and partial-failure summary (“3 of 5 imported; failures: …”).
- Caps: max files, max bytes per file, align with gateway/bridge timeouts; surface limits in UI copy.
- PDF/DOCX: keep one file per import (multiple PDFs = N requests or explicit batch loop), unless you explicitly extend importer scope in this session.
- Docs: update IMPORT-SOURCES.md, IMPORT-URL-AND-DOCUMENTS-PHASES.md (mark 4A₂/4B shipped), PARITY-MATRIX-HOSTED.md if user-visible behavior changes; HUB-API.md + openapi.yaml only if HTTP contract or documented limits change.
- Hosted MCP: only add import_batch or similar if justified; otherwise document that agents still use repeated `import` calls.

**C) Optional polish (if time):** accessibility (live region for batch progress), duplicate-file warnings, telemetry hooks only if the repo already patterns them.

Do not push or open a PR unless I ask; commit on the feature branch in logical slices (e.g. JSZip first, then 4B) with tests passing.

Cursor plan (implementation checklist): .cursor/plans/document_and_url_import_4dda68c9.plan.md

File History 2 commits
sha256:65ccb454656ea5acdea0a10e559b78bcde1eb6ff753ecc2911bc99d1c3d7cadd feat(calendar): enforce agent context tiers in retrieval AP… Human minor 2 days ago
sha256:9103f98c89257ed2b01c237cea895dabb3e85ea337dccb1161c175e4422355b6 docs: accept Calendar Events v0 spec with Phase 0 security … Human 2 days ago