The Technology Behind Zeno
Zeno combines semantic query routing with a three-bucket memory architecture. Every message is embedded, classified by nearest-neighbour search against a learned prototype set, and routed to the specialist model best suited for it. Conversations quietly contribute to a shared knowledge layer — personal facts about you, community insights across members, and verified facts about the world.
Semantic Routing
Queries pass through a three-stage pipeline. Strong-signal patterns (code fragments, math expressions, explicit translation requests) are matched instantly. Everything else is embedded with bge-m3 and matched by cosine similarity against a growing set of prototype queries. Only when the semantic layer is uncertain does the request fall through to an LLM classifier.
flowchart TD
A["User sends a message"] --> B{"Strong-signal regex match?\n(code, math, translate, etc.)"}
B -->|"Yes"| J["Specialist Model"]
B -->|"No"| C["Embed query via bge-m3\n(1024-dim, ~60ms)"]
C --> D["Cosine top-2 against\nprototype embeddings"]
D --> E{"top1 ≥ 0.70 AND\nmargin ≥ 0.05?"}
E -->|"Yes"| F["Confident semantic hit\nbump prototype hit_count"]
E -->|"No"| G["Grok 4.1 Fast\nLLM classification (~500ms)"]
G --> H["Promote query as\nnew prototype (waitUntil)"]
F --> J
H --> J
J --> K["Stream response to user"]
style A fill:#1a1a2e,stroke:#d4a574,color:#fff
style F fill:#1a1a2e,stroke:#22c55e,color:#fff
style J fill:#1a1a2e,stroke:#22c55e,color:#fff
style K fill:#1a1a2e,stroke:#22c55e,color:#fff
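The confidence gate from the diagram above can be sketched in a few lines; the thresholds (top-1 ≥ 0.70, margin ≥ 0.05) come from the pipeline, while the type shapes and names are illustrative.

```typescript
// Sketch of the routing confidence gate. Thresholds (top-1 >= 0.70,
// margin >= 0.05) are from the pipeline above; the shapes are illustrative.
interface Prototype { category: string; embedding: number[]; hitCount: number }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function route(query: number[], prototypes: Prototype[]): { category: string; confident: boolean } {
  // Score all prototypes, keep the top two for the margin check.
  const scored = prototypes
    .map(p => ({ p, score: cosine(query, p.embedding) }))
    .sort((x, y) => y.score - x.score);
  const top1 = scored[0];
  const margin = top1.score - (scored[1]?.score ?? 0);
  if (top1.score >= 0.70 && margin >= 0.05) {
    top1.p.hitCount += 1; // confident semantic hit
    return { category: top1.p.category, confident: true };
  }
  // Uncertain: the caller falls through to the LLM classifier.
  return { category: top1.p.category, confident: false };
}
```

The margin check matters as much as the absolute threshold: a query sitting exactly between two categories can score well against both, and only the gap between the top two reveals the ambiguity.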
The prototype set is seeded with ~180 hand-authored example queries (15 per category × 12 categories) and grows organically. Every LLM-fallback classification is auto-promoted as a new prototype with dedup at cosine ≥ 0.92 and a 100-prototype cap per category. Hit-count-based eviction keeps the popular prototypes and prunes dead weight. Seeds are protected from eviction.
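That promotion-and-eviction policy can be sketched as follows; the cap and dedup threshold are parameterised for illustration, and all names are assumptions.

```typescript
// Hypothetical sketch of LLM-fallback promotion: dedup at cosine >= 0.92,
// a 100-prototype cap per category, hit-count eviction, seeds protected.
interface Proto { embedding: number[]; hitCount: number; isSeed: boolean }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function promote(category: Proto[], candidate: number[], cap = 100, dedupAt = 0.92): boolean {
  // Skip near-duplicates of existing prototypes.
  if (category.some(p => cosine(p.embedding, candidate) >= dedupAt)) return false;
  if (category.length >= cap) {
    // Evict the least-used non-seed prototype; seeds are protected.
    const evictable = category.filter(p => !p.isSeed);
    if (evictable.length === 0) return false;
    const victim = evictable.reduce((a, b) => (a.hitCount <= b.hitCount ? a : b));
    category.splice(category.indexOf(victim), 1);
  }
  category.push({ embedding: candidate, hitCount: 0, isSeed: false });
  return true;
}
```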
Specialist Models
Each query category maps to a specialist model chosen for that task type, sourced from multiple providers via OpenRouter. Two specialists — Visual Designer and Deep Researcher — are boost-only premium tiers users opt into per conversation.
flowchart TD
R["Semantic Router\n+ LLM fallback"] --> CODER["Code Specialist\nMiniMax M2.7"]
R --> REASONER["Deep Thinker\nDeepSeek v3.2"]
R --> CREATIVE["Creative Writer\nQwen 3.5 35B"]
R --> ANALYST["Research Analyst\nQwen 3.5 Flash"]
R --> TEACHER["Tutor\nQwen 3.5 35B"]
R --> QUICK["Quick Responder\nQwen 3.5 Flash"]
R --> POLY["Language Expert\nQwen 3.5 Flash"]
R --> SUMM["Summarizer\nQwen 3.5 Flash"]
R --> CHAT["Conversationalist\nQwen 3.5 Flash"]
R --> STRAT["Strategist\nMiniMax M2.7"]
R -.->|"5 credits"| DESIGNER["Visual Designer\nGemini 3.1 Flash Image"]
R -.->|"5 credits"| DEEP_RES["Deep Researcher\nGrok 4.20 Multi-Agent"]
style R fill:#1a1a2e,stroke:#d4a574,color:#fff
style DESIGNER fill:#1a1a2e,stroke:#a78bfa,color:#fff
style DEEP_RES fill:#1a1a2e,stroke:#a78bfa,color:#fff
Document co-authoring — when you have a doc project open — overrides the specialist routing and uses Grok 4.1 Fast directly, because it follows the structured <doc> output format more reliably than other models.
Tools & Capabilities
Specialist models can invoke tools to extend their capabilities. The router selects a specialist; that specialist can then chain up to fifteen tool calls per turn to fulfil the request. Every chat request also injects the current UTC time, the user's IANA timezone, and IP-derived city/country/coordinates from Cloudflare's edge — so tools and answers are temporally and geographically grounded.
- Web search & deep search — real-time information via DuckDuckGo, with deep search fetching and consolidating multiple pages in one call
- Image search — finds real photos, diagrams and illustrations via the Brave Image Search API and embeds them inline as markdown images
- Geo search — finds nearby restaurants, shops and landmarks via the OpenStreetMap Overpass API. Defaults to the user's IP-derived coordinates so "near me" works without device GPS
- Image generation & editing — Flux 2 Klein for text-to-image, Flux 2 Pro for transformations, Flux 2 Flex for blending multiple references
- Diagram generation — Flux Max renders 3D isometric diagrams from text descriptions for explanations
- Website builder — generates self-contained HTML/CSS/JS sites; preview-served from R2 with conversation-aware iteration
- Document creation — PDF and DOCX generation via a two-stage design + code pipeline
- Code execution — a run_javascript sandbox via QuickJS for math, date arithmetic, and data transforms
- Memory search — explicit lookups against your three-bucket memory store when the model needs to recall something specific
- Reminders & automations — schedule one-off reminders and recurring agentic workflows
- Knowledge base search — when a knowledge module is active, RAG-style lookups against curated documents
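The fifteen-call chain above reduces to a simple loop; the step and tool interfaces below are stand-ins, not Zeno's actual API.

```typescript
// Illustrative agent loop: the specialist may chain tool calls, capped at
// 15 per turn as described above. All types here are stand-ins.
type ToolCall = { name: string; args: unknown };
type ModelStep = { toolCall?: ToolCall; text?: string };

const MAX_TOOL_CALLS = 15;

function runTurn(
  step: (history: ModelStep[]) => ModelStep,    // one model invocation
  execTool: (call: ToolCall) => string,         // tool dispatcher
): string {
  const history: ModelStep[] = [];
  for (let calls = 0; calls < MAX_TOOL_CALLS; calls++) {
    const next = step(history);
    if (!next.toolCall) return next.text ?? ""; // model answered directly
    const result = execTool(next.toolCall);     // run the tool, feed result back
    history.push(next, { text: result });
  }
  // Cap reached: take whatever text the model can produce without tools.
  return step(history).text ?? "";
}
```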
Three-Bucket Memory
Memory is split across three independent stores, each with different privacy and verification rules. Extraction is inline — the LLM emits hidden <memory> blocks in its response stream, which are stripped before display and dispatched to the correct bucket. No separate extraction pass means zero added LLM cost.
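A minimal sketch of the strip-and-dispatch step (the exact tag syntax, a type attribute on <memory>, is an assumption):

```typescript
// Sketch of inline extraction: hidden <memory> blocks are stripped from the
// model's output before display and collected by type. The tag format with
// a type attribute is an assumed syntax.
interface MemoryBlock { type: string; content: string }

function extractMemories(raw: string): { visible: string; memories: MemoryBlock[] } {
  const memories: MemoryBlock[] = [];
  const visible = raw.replace(
    /<memory\s+type="([^"]+)">([\s\S]*?)<\/memory>/g,
    (_match, type: string, content: string) => {
      memories.push({ type, content: content.trim() });
      return ""; // removed before the user ever sees it
    },
  );
  return { visible: visible.trim(), memories };
}
```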
flowchart TD
A["LLM response stream"] --> B["Parse <memory> blocks\n(8 declared types)"]
B --> C{"Type"}
C -->|"user_fact, user_preference,\nuser_relationship"| P["Personal Memory\n(per-user, private)"]
C -->|"user_experience, user_opinion,\nconcept_link"| CN["Connective Candidate\n(PII-scrubbed, anonymous)"]
C -->|"entity_fact, entity_update"| GC["Global Candidate\n(facts about the world)"]
P --> PE["bge-m3 embedding\n→ personal_memory_embeddings"]
CN --> CE["bge-m3 embedding\n→ connective_candidate_embeddings"]
GC --> GE["bge-m3 embedding\n→ global_candidate_embeddings"]
CN --> CL["Cron: cluster by\ncosine similarity ≥ 0.75"]
CL --> CP{"≥ 10 distinct users\nin cluster?"}
CP -->|"Yes"| CM["Promote to\nconnective_memories"]
GC --> GV["Grok 4.20 batched\nverification (20/batch)"]
GV --> GP{"Verdict"}
GP -->|"verified"| GM["Promote to\nglobal_memories"]
GP -->|"rejected"| GR["Drop"]
GP -->|"uncertain"| GQ["Quarantine\n+ retry with backoff"]
style A fill:#1a1a2e,stroke:#d4a574,color:#fff
style P fill:#1a1a2e,stroke:#22c55e,color:#fff
style CM fill:#1a1a2e,stroke:#a78bfa,color:#fff
style GM fill:#1a1a2e,stroke:#a78bfa,color:#fff
Connective and global candidates are PII-scrubbed before storage (regexes for emails, phone numbers, addresses, URLs, handles, and credit card numbers). Candidates whose content is altered by more than 20% during scrubbing are downgraded to personal-only and never enter the shared layer.
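A hypothetical version of the scrub and downgrade check; the patterns are simplified stand-ins (addresses omitted), and interpreting "altered above 20%" as the share of characters redacted is an assumption.

```typescript
// Hypothetical PII scrub. Patterns are simplified stand-ins for the
// categories listed above (addresses omitted); the "altered above 20%"
// measure is interpreted as the share of characters redacted.
const PII_PATTERNS: RegExp[] = [
  /[\w.+-]+@[\w-]+\.[\w.-]+/g,   // emails
  /https?:\/\/\S+/g,             // URLs
  /\+?\d[\d\s().-]{7,}\d/g,      // phone numbers (and card-like digit runs)
  /@\w{2,}/g,                    // handles
];

function scrub(text: string): { scrubbed: string; downgrade: boolean } {
  let redacted = 0;
  let scrubbed = text;
  for (const p of PII_PATTERNS) {
    scrubbed = scrubbed.replace(p, (m) => {
      redacted += m.length;
      return "[redacted]";
    });
  }
  // Heavily altered candidates are downgraded to personal-only.
  return { scrubbed, downgrade: redacted / text.length > 0.2 };
}
```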
Connective Aggregation
Connective candidates — community experiences, opinions, and concept-link observations — are clustered nightly by cosine similarity against existing cluster centroids. When a cluster reaches ten distinct contributing users, an LLM consolidates it into a single connective memory.
Search ranking uses a composite score: similarity × decay × confidence × volume_boost. Decay follows a 90-day half-life with a rescue clause: clusters with strong recent reinforcement stay alive longer. The volume boost, logarithmic in the number of distinct contributors, stops niche but tightly clustered topics from outranking broadly shared insights.
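In code, that score might look like the following; the day units and natural-log boost base are assumptions, and measuring age from the most recent reinforcement approximates the rescue clause.

```typescript
// The composite score: similarity * decay * confidence * volume_boost.
// Day units and the natural-log boost are assumptions; measuring age from
// the most recent reinforcement approximates the rescue clause.
function rankScore(opts: {
  similarity: number;      // cosine similarity to the query
  ageDays: number;         // days since last reinforcement
  confidence: number;      // 0..1, set at consolidation
  distinctUsers: number;   // contributing members in the cluster
}): number {
  const HALF_LIFE_DAYS = 90;
  const decay = Math.pow(0.5, opts.ageDays / HALF_LIFE_DAYS);
  const volumeBoost = 1 + Math.log(opts.distinctUsers);
  return opts.similarity * decay * opts.confidence * volumeBoost;
}
```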
Global Fact Verification
Global candidates — facts about people, places, and entities — are verified by Grok 4.20 in batches of twenty per cron tick. Before verification, candidates are deduplicated against existing global memories at cosine ≥ 0.85, so the verifier never wastes calls on already-known facts. Verdicts are routed to promote, reject, or quarantine paths.
flowchart LR
Q["Pending candidates\n(retry_after passed)"] --> D["Dedup vs existing\nglobal_memories (cos ≥ 0.85)"]
D --> B["Batch 20 candidates\ninto Grok 4.20 prompt"]
B --> V{"Verdict per item"}
V -->|"verified"| P["Promote to\nglobal_memories"]
V -->|"rejected"| R["Drop"]
V -->|"uncertain"| Q2["Quarantine + exponential\nbackoff (7→14→28→56d)"]
style P fill:#1a1a2e,stroke:#22c55e,color:#fff
style Q2 fill:#1a1a2e,stroke:#d4a574,color:#fff
Verification is capped at five batches per cron tick (~$2/day at full capacity). Quarantined candidates retry on an exponential backoff schedule capped at 60 days. The verifier's reasoning is stored alongside each verdict for auditability.
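The per-tick cap and quarantine backoff can be sketched as below; batch size (20), batch cap (5), and the 7→14→28→56-day schedule capped at 60 days come from the text, while the function names are illustrative.

```typescript
// Sketch of the verification tick's batching and quarantine backoff.
const BATCH_SIZE = 20;
const MAX_BATCHES_PER_TICK = 5;

function takeBatches<T>(pending: T[]): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < pending.length && batches.length < MAX_BATCHES_PER_TICK; i += BATCH_SIZE) {
    batches.push(pending.slice(i, i + BATCH_SIZE));
  }
  return batches;
}

function nextRetryDelayDays(attempt: number): number {
  return Math.min(7 * 2 ** attempt, 60); // 7, 14, 28, 56, then 60
}
```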
Document Co-Authoring
Beyond chat, Zeno includes a collaborative document editor. When a doc project is open, your messages are routed through a special path that primes the LLM to emit a full updated document inside <doc>...</doc> tags alongside its conversational reply.
The editor applies streaming updates as a per-block diff against the previous document state. Unchanged paragraphs keep their DOM nodes — only the changed range is replaced. This preserves cursor position, scroll, and editing state during streaming, while still giving the LLM the simpler task of emitting whole documents rather than positional patches.
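The idea reduces to a classic common-prefix/common-suffix diff over blocks; a minimal sketch, on plain strings rather than DOM nodes:

```typescript
// Minimal per-block diff: keep the common prefix and suffix of the block
// lists, and replace only the changed middle range. The real editor applies
// this to DOM nodes; plain strings stand in for blocks here.
function blockDiff(prev: string[], next: string[]):
    { start: number; removed: number; inserted: string[] } {
  let start = 0;
  while (start < prev.length && start < next.length && prev[start] === next[start]) start++;
  let endPrev = prev.length;
  let endNext = next.length;
  while (endPrev > start && endNext > start && prev[endPrev - 1] === next[endNext - 1]) {
    endPrev--;
    endNext--;
  }
  return { start, removed: endPrev - start, inserted: next.slice(start, endNext) };
}
```

An unchanged document produces an empty diff, which is why blocks outside the edited range keep their DOM nodes untouched during streaming.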
Cross-User Recommendations
The recommendation engine identifies topics multiple community members are exploring but that a given user hasn't encountered yet. It compares embedding vectors across the membership, finds semantic gaps, and clusters them into actionable recommendations.
Each user's personal memory embeddings (bge-m3, 1024-dimensional) are compared against a sample of other members' memories using cosine similarity. Topics scoring below 0.5 similarity represent genuine knowledge gaps. Gaps are clustered by category and ranked by a combined score of community size and novelty. Up to twelve recommendations are generated using Grok 4.20 multi-agent and cached for 8 hours, refreshing as new conversations add to the collective memory.
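A hypothetical sketch of the gap test: the 0.5 threshold and twelve-item cap come from the text, while the ranking here uses contributor count alone (the real score also weighs novelty).

```typescript
// Hypothetical gap detection: a community topic counts as a gap when none
// of the user's memories reaches 0.5 cosine similarity against it. Ranking
// below uses contributor count only; the real score also weighs novelty.
interface Topic { label: string; embedding: number[]; distinctUsers: number }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function findGaps(userMemories: number[][], topics: Topic[]): Topic[] {
  return topics
    .filter(t => userMemories.every(m => cosine(m, t.embedding) < 0.5))
    .sort((a, b) => b.distinctUsers - a.distinctUsers)
    .slice(0, 12); // up to twelve recommendations
}
```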
Recommendations are privacy-preserving by design. The system identifies trending topics across the community without exposing individual conversations. Members see what topics are popular, not who said what — and the connective layer they draw from has already been PII-scrubbed.