The Technology Behind Zeno

Zeno combines semantic query routing with a three-bucket memory architecture. Every message is embedded, classified by nearest-neighbour search against a learned prototype set, and routed to the specialist model best suited for it. Conversations quietly contribute to a shared knowledge layer — personal facts about you, community insights across members, and verified facts about the world.

Semantic Routing

Queries pass through a three-stage pipeline. Strong-signal patterns (code fragments, math expressions, explicit translation requests) are matched instantly. Everything else is embedded with bge-m3 and matched by cosine similarity against a growing set of prototype queries. Only when the semantic layer is uncertain does the request fall through to an LLM classifier.
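The confidence test in the semantic stage can be sketched as follows. This is an illustrative reconstruction, not the production code: the `Prototype` shape and function names are assumptions, and the thresholds (0.70 top score, 0.05 margin) come from the pipeline described above.

```typescript
interface Prototype {
  category: string;
  embedding: number[]; // bge-m3, 1024-dim in production; any dim works here
}

// Plain cosine similarity between two vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

type Decision =
  | { route: "semantic"; category: string; score: number }
  | { route: "llm_fallback" };

function routeDecision(query: number[], prototypes: Prototype[]): Decision {
  // Score every prototype and keep the two best (assumes a non-empty set).
  const scored = prototypes
    .map((p) => ({ category: p.category, score: cosine(query, p.embedding) }))
    .sort((a, b) => b.score - a.score);
  const [top1, top2] = scored;
  const margin = top1.score - (top2?.score ?? 0);
  // Confident only when the best match is strong AND clearly ahead of the runner-up.
  if (top1.score >= 0.70 && margin >= 0.05) {
    return { route: "semantic", category: top1.category, score: top1.score };
  }
  return { route: "llm_fallback" };
}
```

The margin check is what catches ambiguous queries: a message equally close to two categories falls through to the LLM classifier even when its top score clears 0.70.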

flowchart TD
    A["User sends a message"] --> B{"Strong-signal regex match?\n(code, math, translate, etc.)"}
    B -->|"Yes"| J["Specialist Model"]
    B -->|"No"| C["Embed query via bge-m3\n(1024-dim, ~60ms)"]
    C --> D["Cosine top-2 against\nprototype embeddings"]
    D --> E{"top1 ≥ 0.70 AND\nmargin ≥ 0.05?"}
    E -->|"Yes"| F["Confident semantic hit\nbump prototype hit_count"]
    E -->|"No"| G["Grok 4.1 Fast\nLLM classification (~500ms)"]
    G --> H["Promote query as\nnew prototype (waitUntil)"]
    F --> J
    H --> J
    J --> K["Stream response to user"]

    style A fill:#1a1a2e,stroke:#d4a574,color:#fff
    style F fill:#1a1a2e,stroke:#22c55e,color:#fff
    style J fill:#1a1a2e,stroke:#22c55e,color:#fff
    style K fill:#1a1a2e,stroke:#22c55e,color:#fff
        

The prototype set is seeded with 180 hand-authored example queries (15 per category across 12 categories) and grows organically: every LLM-fallback classification is auto-promoted as a new prototype, with dedup at cosine ≥ 0.92 and a 100-prototype cap per category. Hit-count-based eviction keeps popular prototypes and prunes dead weight; seeds are protected from eviction.
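The promotion-and-eviction rules can be captured in a short sketch. The record shape and function names are assumptions; the 0.92 dedup threshold, 100-prototype cap, and seed protection are from the description above.

```typescript
interface StoredPrototype {
  embedding: number[];
  isSeed: boolean;   // seeds are protected from eviction
  hitCount: number;  // bumped on every confident semantic hit
}

const DEDUP_THRESHOLD = 0.92;
const CATEGORY_CAP = 100;

// Cosine similarity (same helper as in the routing sketch).
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function promote(
  categoryProtos: StoredPrototype[],
  candidate: number[],
): StoredPrototype[] {
  // Dedup: skip promotion if an existing prototype is nearly identical.
  if (categoryProtos.some((p) => cosine(p.embedding, candidate) >= DEDUP_THRESHOLD)) {
    return categoryProtos;
  }
  const next = [...categoryProtos];
  if (next.length >= CATEGORY_CAP) {
    // Evict the least-used non-seed prototype to make room.
    const evictable = next
      .map((p, i) => ({ p, i }))
      .filter((x) => !x.p.isSeed)
      .sort((a, b) => a.p.hitCount - b.p.hitCount);
    if (evictable.length === 0) return next; // all seeds: nothing to evict
    next.splice(evictable[0].i, 1);
  }
  next.push({ embedding: candidate, isSeed: false, hitCount: 0 });
  return next;
}
```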

Specialist Models

Each query category maps to a specialist model chosen for that task type, sourced from multiple providers via OpenRouter. Two specialists — Visual Designer and Deep Researcher — are boost-only premium tiers users opt into per conversation.

flowchart TD
    R["Semantic Router\n+ LLM fallback"] --> CODER["Code Specialist\nMiniMax M2.7"]
    R --> REASONER["Deep Thinker\nDeepSeek v3.2"]
    R --> CREATIVE["Creative Writer\nQwen 3.5 35B"]
    R --> ANALYST["Research Analyst\nQwen 3.5 Flash"]
    R --> TEACHER["Tutor\nQwen 3.5 35B"]
    R --> QUICK["Quick Responder\nQwen 3.5 Flash"]
    R --> POLY["Language Expert\nQwen 3.5 Flash"]
    R --> SUMM["Summarizer\nQwen 3.5 Flash"]
    R --> CHAT["Conversationalist\nQwen 3.5 Flash"]
    R --> STRAT["Strategist\nMiniMax M2.7"]
    R -.->|"5 credits"| DESIGNER["Visual Designer\nGemini 3.1 Flash Image"]
    R -.->|"5 credits"| DEEP_RES["Deep Researcher\nGrok 4.20 Multi-Agent"]

    style R fill:#1a1a2e,stroke:#d4a574,color:#fff
    style DESIGNER fill:#1a1a2e,stroke:#a78bfa,color:#fff
    style DEEP_RES fill:#1a1a2e,stroke:#a78bfa,color:#fff
        

Document co-authoring — when you have a doc project open — overrides the specialist routing and uses Grok 4.1 Fast directly, because it follows the structured <doc> output format more reliably than other models.
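Putting the diagram and the override together, model selection reduces to a lookup with one short-circuit. The category keys below are illustrative assumptions; the model names are the ones shown in the diagram.

```typescript
// Category → specialist mapping (keys are assumed; model names from the diagram).
const SPECIALISTS: Record<string, string> = {
  code: "MiniMax M2.7",
  reasoning: "DeepSeek v3.2",
  creative: "Qwen 3.5 35B",
  analysis: "Qwen 3.5 Flash",
  teaching: "Qwen 3.5 35B",
  quick: "Qwen 3.5 Flash",
  language: "Qwen 3.5 Flash",
  summary: "Qwen 3.5 Flash",
  chat: "Qwen 3.5 Flash",
  strategy: "MiniMax M2.7",
};

function pickModel(category: string, docProjectOpen: boolean): string {
  // Document co-authoring overrides specialist routing entirely.
  if (docProjectOpen) return "Grok 4.1 Fast";
  return SPECIALISTS[category] ?? SPECIALISTS.chat; // fall back to the conversationalist
}
```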

Tools & Capabilities

Specialist models can invoke tools to extend their capabilities. The router selects a specialist; that specialist can then chain up to fifteen tool calls per turn to fulfil the request. Every chat request also injects the current UTC time, the user's IANA timezone, and IP-derived city/country/coordinates from Cloudflare's edge — so tools and answers are temporally and geographically grounded.
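The tool-chaining loop is roughly the following. This is a hedged sketch: the `callModel` and `runTool` signatures are placeholders for illustration, and only the 15-call budget comes from the text above.

```typescript
const MAX_TOOL_CALLS = 15;

interface ModelTurn {
  toolCall?: { name: string; args: unknown }; // present while the model wants a tool
  text?: string;                              // present when it has a final answer
}

async function runTurn(
  messages: unknown[],
  callModel: (msgs: unknown[]) => Promise<ModelTurn>,
  runTool: (name: string, args: unknown) => Promise<string>,
): Promise<string> {
  for (let i = 0; i < MAX_TOOL_CALLS; i++) {
    const turn = await callModel(messages);
    if (!turn.toolCall) return turn.text ?? ""; // final answer, no more tools
    // Feed the tool result back so the model can continue the chain.
    const result = await runTool(turn.toolCall.name, turn.toolCall.args);
    messages = [
      ...messages,
      { role: "tool", name: turn.toolCall.name, content: result },
    ];
  }
  // Budget exhausted: ask for a final answer and ignore further tool requests.
  const final = await callModel(messages);
  return final.text ?? "";
}
```

The grounding context (UTC time, IANA timezone, edge-derived location) would be prepended to `messages` before the first `callModel`, so every tool call and answer inherits it.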

Three-Bucket Memory

Memory is split across three independent stores, each with different privacy and verification rules. Extraction is inline — the LLM emits hidden <memory> blocks in its response stream, which are stripped before display and dispatched to the correct bucket. No separate extraction pass means zero added LLM cost.
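The extraction step can be sketched over a completed response string (the real parser works on the stream, and the tag attribute shape is an assumption; the eight type names and three buckets are from the diagram below):

```typescript
const MEMORY_RE = /<memory\s+type="([^"]+)">([\s\S]*?)<\/memory>/g;

const PERSONAL = new Set(["user_fact", "user_preference", "user_relationship"]);
const CONNECTIVE = new Set(["user_experience", "user_opinion", "concept_link"]);
const GLOBAL = new Set(["entity_fact", "entity_update"]);

type Bucket = "personal" | "connective" | "global";

function extractMemories(response: string): {
  display: string;
  memories: { bucket: Bucket; type: string; content: string }[];
} {
  const memories: { bucket: Bucket; type: string; content: string }[] = [];
  // Strip every <memory> block from the visible text, dispatching each by type.
  const display = response.replace(MEMORY_RE, (_m, type: string, content: string) => {
    const bucket: Bucket | null = PERSONAL.has(type) ? "personal"
      : CONNECTIVE.has(type) ? "connective"
      : GLOBAL.has(type) ? "global" : null;
    if (bucket) memories.push({ bucket, type, content: content.trim() });
    return ""; // removed before the user sees it
  });
  return { display: display.trim(), memories };
}
```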

flowchart TD
    A["LLM response stream"] --> B["Parse <memory> blocks\n(8 declared types)"]
    B --> C{"Type"}
    C -->|"user_fact, user_preference,\nuser_relationship"| P["Personal Memory\n(per-user, private)"]
    C -->|"user_experience, user_opinion,\nconcept_link"| CN["Connective Candidate\n(PII-scrubbed, anonymous)"]
    C -->|"entity_fact, entity_update"| GC["Global Candidate\n(facts about the world)"]
    P --> PE["bge-m3 embedding\n→ personal_memory_embeddings"]
    CN --> CE["bge-m3 embedding\n→ connective_candidate_embeddings"]
    GC --> GE["bge-m3 embedding\n→ global_candidate_embeddings"]
    CN --> CL["Cron: cluster by\ncosine similarity ≥ 0.75"]
    CL --> CP{"≥ 10 distinct users\nin cluster?"}
    CP -->|"Yes"| CM["Promote to\nconnective_memories"]
    GC --> GV["Grok 4.20 batched\nverification (20/batch)"]
    GV --> GP{"Verdict"}
    GP -->|"verified"| GM["Promote to\nglobal_memories"]
    GP -->|"rejected"| GR["Drop"]
    GP -->|"uncertain"| GQ["Quarantine\n+ retry with backoff"]

    style A fill:#1a1a2e,stroke:#d4a574,color:#fff
    style P fill:#1a1a2e,stroke:#22c55e,color:#fff
    style CM fill:#1a1a2e,stroke:#a78bfa,color:#fff
    style GM fill:#1a1a2e,stroke:#a78bfa,color:#fff
        

Connective and global candidates are PII-scrubbed before storage (regexes for emails, phone numbers, addresses, URLs, handles, and credit card numbers). Candidates whose content was altered by more than 20% during scrubbing are downgraded to personal-only and never enter the shared layer.
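A simplified version of the scrub-and-downgrade rule, with deliberately loose stand-in patterns (the production regexes will be stricter, and the "altered" metric here — fraction of characters matched by PII patterns — is an assumption):

```typescript
const PII_PATTERNS: RegExp[] = [
  /[\w.+-]+@[\w-]+\.[\w.]+/g, // emails (run first so handles don't partially match)
  /\+?\d[\d\s().-]{7,}\d/g,   // phone numbers, loosely
  /https?:\/\/\S+/g,          // URLs
  /@\w{2,}/g,                 // social handles
  /\b(?:\d[ -]*?){13,16}\b/g, // credit-card-like digit runs
];

function scrub(text: string): { scrubbed: string; personalOnly: boolean } {
  let scrubbed = text;
  let matchedChars = 0;
  for (const re of PII_PATTERNS) {
    scrubbed = scrubbed.replace(re, (m) => {
      matchedChars += m.length;
      return "[redacted]";
    });
  }
  // Downgrade to personal-only when more than 20% of the text was PII.
  const personalOnly = matchedChars / text.length > 0.2;
  return { scrubbed, personalOnly };
}
```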

Connective Aggregation

Connective candidates — community experiences, opinions, and concept-link observations — are clustered nightly by cosine similarity against existing cluster centroids. When a cluster reaches ten distinct contributing users, it gets consolidated by an LLM into a single connective memory.

Search ranking uses a composite score: similarity × decay × confidence × volume_boost. Decay follows a 90-day half-life with a rescue clause: clusters with strong recent reinforcement stay alive longer. The volume boost, logarithmic in the number of distinct contributors, keeps niche but tightly clustered topics from crowding out broadly shared insights.

Global Fact Verification

Global candidates — facts about people, places, and entities — are verified by Grok 4.20 in batches of twenty per cron tick. Before verification, candidates are deduplicated against existing global memories at cosine ≥ 0.85, so the verifier never wastes calls on already-known facts. Verdicts are routed to promote, reject, or quarantine paths.

flowchart LR
    Q["Pending candidates\n(retry_after passed)"] --> D["Dedup vs existing\nglobal_memories (cos ≥ 0.85)"]
    D --> B["Batch 20 candidates\ninto Grok 4.20 prompt"]
    B --> V{"Verdict per item"}
    V -->|"verified"| P["Promote to\nglobal_memories"]
    V -->|"rejected"| R["Drop"]
    V -->|"uncertain"| Q2["Quarantine + exponential\nbackoff (7→14→28→56d)"]

    style P fill:#1a1a2e,stroke:#22c55e,color:#fff
    style Q2 fill:#1a1a2e,stroke:#d4a574,color:#fff
        

Verification is capped at five batches per cron tick (~$2/day at full capacity). Quarantined candidates retry on an exponential backoff schedule capped at 60 days, and the verifier's reasoning is stored alongside each verdict for auditability.
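The retry schedule from the diagram (7 → 14 → 28 → 56 days, then the 60-day cap) is a one-liner; only the function name is an assumption:

```typescript
const BASE_DELAY_DAYS = 7;
const MAX_DELAY_DAYS = 60;

// attempt 0 → 7d, 1 → 14d, 2 → 28d, 3 → 56d, 4+ → 60d (cap)
function nextRetryDelayDays(attempt: number): number {
  return Math.min(BASE_DELAY_DAYS * 2 ** attempt, MAX_DELAY_DAYS);
}
```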

Document Co-Authoring

Beyond chat, Zeno includes a collaborative document editor. When a doc project is open, your messages are routed through a special path that primes the LLM to emit a full updated document inside <doc>...</doc> tags alongside its conversational reply.

The editor applies streaming updates as a per-block diff against the previous document state. Unchanged paragraphs keep their DOM nodes — only the changed range is replaced. This preserves cursor position, scroll, and editing state during streaming, while still giving the LLM the simpler task of emitting whole documents rather than positional patches.
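The core of a per-block diff like this is finding the changed range between two block lists: keep the longest common prefix and suffix, replace only the middle. A minimal sketch (the DOM patching itself is omitted, and the function shape is an assumption):

```typescript
function changedBlockRange(
  prev: string[],
  next: string[],
): { start: number; prevEnd: number; nextEnd: number } {
  // Longest common prefix of unchanged blocks.
  let start = 0;
  while (start < prev.length && start < next.length && prev[start] === next[start]) {
    start++;
  }
  // Longest common suffix, never overlapping the prefix.
  let prevEnd = prev.length;
  let nextEnd = next.length;
  while (prevEnd > start && nextEnd > start && prev[prevEnd - 1] === next[nextEnd - 1]) {
    prevEnd--;
    nextEnd--;
  }
  // Blocks [start, prevEnd) in the old doc are replaced by [start, nextEnd) in the new;
  // everything outside that range keeps its DOM node untouched.
  return { start, prevEnd, nextEnd };
}
```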

Cross-User Recommendations

The recommendation engine identifies topics multiple community members are exploring but that a given user hasn't encountered yet. It compares embedding vectors across the membership, finds semantic gaps, and clusters them into actionable recommendations.

Each user's personal memory embeddings (bge-m3, 1024-dimensional) are compared against a sample of other members' memories using cosine similarity. Topics whose best similarity to the user's own memories falls below 0.5 are treated as genuine knowledge gaps. Gaps are clustered by category and ranked by a combined score of community size and novelty. Up to twelve recommendations are generated using Grok 4.20 multi-agent and cached for 8 hours, refreshing as new conversations add to the collective memory.
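The gap test itself can be sketched as below, with the 0.5 threshold from the text; the sampling, clustering, and ranking stages are omitted, and the data shapes are assumptions.

```typescript
const GAP_THRESHOLD = 0.5;

// Cosine similarity between two vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// A community topic is a gap when the user's best-matching memory is below 0.5.
function findGaps(
  userMemories: number[][],
  communityTopics: { label: string; embedding: number[] }[],
): string[] {
  return communityTopics
    .filter((t) => {
      const best = Math.max(0, ...userMemories.map((m) => cosine(m, t.embedding)));
      return best < GAP_THRESHOLD;
    })
    .map((t) => t.label);
}
```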

Recommendations are privacy-preserving by design. The system identifies trending topics across the community without exposing individual conversations. Members see what topics are popular, not who said what — and the connective layer they draw from has already been PII-scrubbed.