Architecture

One daemon, four cooperating systems. Local-first by design — runs on your own machine or server, no cloud dependency required.

Clients: Web UI · CLI (cortex) · REST API · 14 Channel Platforms
CortexFlow-AI Gateway: FastAPI + WebSocket daemon · ws://127.0.0.1:7432
Channel Manager: Per-channel session isolation
Model Router: Task-aware fallback chains
Memory Pipeline: Redis · Qdrant · SQLite
Voice + Plugins: Whisper/TTS · Sandboxed SDK

Gateway

A FastAPI app exposing a WebSocket endpoint (/ws) for real-time chat and a REST API (/api/v1/...) for status, session, channel, memory, and metrics operations — see the REST API page. The same process also hosts the channel adapters, so a single cortex start brings the whole system up.

Model Router

Every request is classified into a task type, and each task type maps to an ordered fallback chain — if the first model fails or is unavailable, the router automatically tries the next:

Task type | Fallback chain (in order)
complex_reasoning | Claude Opus → GPT-4o → Gemini Pro → Ollama
code_generation | DeepSeek Coder → Claude Sonnet → GPT-4o → Gemini Flash
code_review | DeepSeek Coder → GPT-4o → Gemini Flash → Ollama
task_decomposition | Claude Sonnet → GPT-4o → Gemini Pro → Ollama
summarization / intent_extraction / reflection / validation | Gemini Flash → GPT-4o mini → Ollama
cheap_inference | Ollama → GPT-4o mini → Gemini Flash
general (default) | Gemini Flash → GPT-4o mini → Ollama
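
The fallback behavior can be sketched roughly as follows. This is an illustration, not the router's actual code: FALLBACK_CHAINS, ModelUnavailable, route, and the call_model callback are hypothetical names, the model strings are labels taken from the table rather than real provider identifiers, and sending unknown task types to the general chain is an assumption based on its "(default)" label.

```python
from typing import Callable

# A subset of the chains from the table above. The strings are labels,
# not real provider model identifiers.
FALLBACK_CHAINS: dict[str, list[str]] = {
    "complex_reasoning": ["Claude Opus", "GPT-4o", "Gemini Pro", "Ollama"],
    "cheap_inference": ["Ollama", "GPT-4o mini", "Gemini Flash"],
    "general": ["Gemini Flash", "GPT-4o mini", "Ollama"],
}


class ModelUnavailable(Exception):
    """Hypothetical error raised when a model fails or cannot be reached."""


def route(
    task_type: str,
    prompt: str,
    call_model: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try each model in the chain in order and return (model, response)."""
    # Assumption: unknown task types use the "general" chain.
    chain = FALLBACK_CHAINS.get(task_type, FALLBACK_CHAINS["general"])
    errors = []
    for model in chain:
        try:
            return model, call_model(model, prompt)
        except ModelUnavailable as exc:
            errors.append(f"{model}: {exc}")
    # Every model in the chain failed, so surface the collected reasons.
    raise ModelUnavailable("all models failed: " + "; ".join(errors))
```

Passing the provider call in as a callback keeps the chain logic independent of any particular client library, which also makes the fallback path easy to exercise with a stub.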

Setting local = "ollama/..." as the primary model in configuration enables full privacy mode — zero calls ever leave your machine.

Memory Pipeline

Three tiers, queried in order, merged into one ranked context for the LLM prompt:

  1. Short-term (Redis) — recent conversation turns, TTL-expired automatically
  2. Semantic (Qdrant) — vector similarity search over past conversations
  3. Long-term (SQLite) — durable storage with importance scoring, auto-pruning, tagging, and cross-session sharing
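
One way to picture the merge step is the sketch below. It uses in-memory stand-ins instead of Redis, Qdrant, and SQLite, and every name in it (MemoryHit, build_context, the score field, the limit parameter) is hypothetical. How the real pipeline scores and deduplicates results is not specified here, so the "keep the highest score per text" rule is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class MemoryHit:
    text: str
    score: float  # hypothetical relevance score, higher is better
    tier: str     # which tier produced the hit


# A tier is any callable that maps a query to candidate hits.
Tier = Callable[[str], Iterable[MemoryHit]]


def build_context(query: str, tiers: list[Tier], limit: int = 5) -> list[MemoryHit]:
    """Query each tier in order, drop duplicate texts, and rank by score."""
    best: dict[str, MemoryHit] = {}
    for tier in tiers:
        for hit in tier(query):
            seen = best.get(hit.text)
            # Assumption: when two tiers return the same text, keep the
            # higher-scoring copy.
            if seen is None or hit.score > seen.score:
                best[hit.text] = hit
    # sorted() is stable, so hits with equal scores keep first-seen order.
    return sorted(best.values(), key=lambda h: h.score, reverse=True)[:limit]
```

In this sketch the tier order only affects tie-breaking; a real implementation might instead weight tiers differently or reserve part of the context budget for recent turns.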

Reflection Engine

After the model router returns a response, the reflection engine scores it 0–100 on relevance, completeness, accuracy, and tone using a cheap model. Responses below the configured threshold are regenerated once with corrective guidance before being sent.
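
The score-then-regenerate-once loop might look like the sketch below. The generate and score callbacks, the default threshold of 70, and the wording of the guidance string are all hypothetical. Averaging the four criteria into one overall score is also an assumption, since the text above does not say how they are combined.

```python
from typing import Callable, Optional

CRITERIA = ("relevance", "completeness", "accuracy", "tone")


def reflect(
    prompt: str,
    generate: Callable[[str, Optional[str]], str],
    score: Callable[[str, str], dict[str, int]],
    threshold: int = 70,  # hypothetical default
) -> str:
    """Return a response, regenerating once if it scores below threshold."""
    response = generate(prompt, None)
    scores = score(prompt, response)  # each criterion scored 0-100
    # Assumption: the overall score is the plain mean of the criteria.
    overall = sum(scores[c] for c in CRITERIA) / len(CRITERIA)
    if overall >= threshold:
        return response
    # Name the weak criteria so the second attempt gets corrective guidance.
    weak = [c for c in CRITERIA if scores[c] < threshold]
    guidance = "Improve: " + ", ".join(weak)
    # Regenerate exactly once; the second response is sent without rescoring.
    return generate(prompt, guidance)
```

Capping the loop at a single retry bounds the added latency and cost to one extra generation plus one scoring call per request.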

Plugin System

Plugins are discovered via Python entry points and run in the same process as the daemon. Each plugin is a typed Plugin subclass that contributes tools, channel adapters, or lifecycle hooks. See Plugins & SDK.

Observability

Logs are structured JSON, or human-readable via rich when attached to a TTY. Prometheus metrics are exposed at GET /api/v1/metrics, with a JSON snapshot at GET /api/v1/metrics/snapshot for the web UI.
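
The TTY-dependent log format can be approximated with the standard logging module, as in the sketch below. JsonFormatter, make_handler, and the field names (ts, level, logger, message) are hypothetical, and the rich renderer is left out so the example stays dependency-free: on a TTY it simply falls back to the default plain-text format.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        # Field names are illustrative, not the daemon's actual schema.
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


def make_handler(stream=sys.stderr) -> logging.Handler:
    """Use plain text on a TTY and JSON lines everywhere else."""
    handler = logging.StreamHandler(stream)
    if not stream.isatty():
        handler.setFormatter(JsonFormatter())
    return handler
```

Attaching the handler with logging.getLogger().addHandler(make_handler()) would then emit JSON when output is piped to a file or log collector, and readable text in an interactive terminal.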