Architecture

One daemon, four cooperating systems. CortexFlow-AI is local-first by design: it runs on your own machine or server, with no cloud dependency required.

Clients: Web UI · CLI (cortex) · REST API · 14 Channel Platforms

↓

CortexFlow-AI Gateway: FastAPI + WebSocket daemon · ws://127.0.0.1:7432

↓

Channel Manager: per-channel session isolation
Model Router: task-aware fallback chains
Memory Pipeline: Redis · Qdrant · SQLite
Voice + Plugins: Whisper/TTS · sandboxed SDK

Gateway

The gateway is a FastAPI app that exposes a WebSocket endpoint (/ws) for real-time chat and a REST API (/api/v1/...) for status, session, channel, memory, and metrics operations; see the REST API page. The same process also hosts the channel adapters, so a single `cortex start` brings the whole system up.

Model Router

Every request is classified into a task type, and each task type maps to an ordered fallback chain — if the first model fails or is unavailable, the router automatically tries the next:

Task type	Fallback chain (in order)
`complex_reasoning`	Claude Opus → GPT-4o → Gemini Pro → Ollama
`code_generation`	DeepSeek Coder → Claude Sonnet → GPT-4o → Gemini Flash
`code_review`	DeepSeek Coder → GPT-4o → Gemini Flash → Ollama
`task_decomposition`	Claude Sonnet → GPT-4o → Gemini Pro → Ollama
`summarization` / `intent_extraction` / `reflection` / `validation`	Gemini Flash → GPT-4o mini → Ollama
`cheap_inference`	Ollama → GPT-4o mini → Gemini Flash
`general` (default)	Gemini Flash → GPT-4o mini → Ollama

Setting local = "ollama/..." as the primary model in the configuration enables full privacy mode, in which no model calls leave your machine.
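
The fallback behaviour in the table above can be sketched in a few lines. This is a minimal illustration, not the router's actual code: the `FALLBACK_CHAINS` mapping, the `route` function, and the `call_model` callable are hypothetical names, and the model labels are plain strings copied from the table rather than confirmed configuration identifiers. Treating any exception as "failed or unavailable" is also an assumption.

```python
from typing import Callable

# A subset of the chains from the table; labels are illustrative strings,
# not confirmed model identifiers.
FALLBACK_CHAINS: dict[str, list[str]] = {
    "complex_reasoning": ["Claude Opus", "GPT-4o", "Gemini Pro", "Ollama"],
    "cheap_inference": ["Ollama", "GPT-4o mini", "Gemini Flash"],
    "general": ["Gemini Flash", "GPT-4o mini", "Ollama"],
}


def route(
    task_type: str,
    prompt: str,
    call_model: Callable[[str, str], str],
) -> tuple[str, str]:
    """Try each model in the chain in order; return (model, response)."""
    # Unknown task types fall back to the default "general" chain.
    chain = FALLBACK_CHAINS.get(task_type, FALLBACK_CHAINS["general"])
    errors = []
    for model in chain:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:  # assumption: any failure triggers fallback
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Passing `call_model` in as a parameter keeps the chain logic independent of any particular provider client, which also makes the fallback order easy to exercise with a stub.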

Memory Pipeline

Three tiers, queried in order, merged into one ranked context for the LLM prompt:

Short-term (Redis) — recent conversation turns, TTL-expired automatically
Semantic (Qdrant) — vector similarity search over past conversations
Long-term (SQLite) — durable storage with importance scoring, auto-pruning, tagging, and cross-session sharing
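
The "query in order, then merge and rank" step described above might look roughly like the following. It is a sketch under stated assumptions: the real pipeline talks to Redis, Qdrant, and SQLite, whereas here each tier is a plain function returning `(text, score)` pairs. `MemoryItem`, `build_context`, the score scale, and the rule of dropping duplicate texts are all hypothetical; the source does not say how ranking or deduplication is done.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    text: str
    score: float  # hypothetical relevance score, higher is better
    tier: str     # which tier produced the item


def build_context(tiers, query, limit=5):
    """Query tiers in order, drop duplicate texts, and rank by score."""
    seen = set()
    merged = []
    for name, search in tiers:  # order: short-term, semantic, long-term
        for text, score in search(query):
            if text not in seen:  # the first tier to return a text wins
                seen.add(text)
                merged.append(MemoryItem(text, score, name))
    # sorted() is stable, so equal scores keep their tier order.
    return sorted(merged, key=lambda item: item.score, reverse=True)[:limit]
```

With stand-in tiers such as `("short_term", lambda q: [("user asked about billing", 0.9)])`, the function returns a single list ordered by score, ready to be rendered into the prompt.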

Reflection Engine

After the model router returns a response, the reflection engine scores it 0–100 on relevance, completeness, accuracy, and tone using a cheap model. Responses below the configured threshold are regenerated once with corrective guidance before being sent.
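
The score-then-regenerate-once loop can be illustrated as below. The details are assumptions made for the sketch: `reflect`, `generate`, and `score` are hypothetical names, the default threshold of 70 is arbitrary, and averaging the four criteria into one overall score is only one plausible way to compare against the threshold.

```python
CRITERIA = ("relevance", "completeness", "accuracy", "tone")


def reflect(prompt, generate, score, threshold=70):
    """Return a response, regenerating at most once if it scores too low.

    generate(prompt) -> str
    score(prompt, response) -> dict mapping each criterion to 0..100
    """
    response = generate(prompt)
    scores = score(prompt, response)
    # Assumption: the overall score is the plain average of the criteria.
    overall = sum(scores[c] for c in CRITERIA) / len(CRITERIA)
    if overall >= threshold:
        return response
    # Build corrective guidance from the criteria that fell short.
    weak = [c for c in CRITERIA if scores[c] < threshold]
    guidance = (
        f"{prompt}\n\nRevise the previous answer; improve: {', '.join(weak)}."
    )
    return generate(guidance)  # single retry, returned without re-scoring
```

Capping the loop at one retry bounds both latency and cost, which matches the "regenerated once" behaviour described above.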

Plugin System

Plugins are discovered via Python entry points and run in the same process as the gateway. Each plugin is a typed Plugin subclass that contributes tools, channel adapters, or lifecycle hooks; see Plugins & SDK.

Observability

Logs are emitted as structured JSON, or in human-readable form via rich when attached to a TTY. Prometheus metrics are exposed at GET /api/v1/metrics, with a JSON snapshot at GET /api/v1/metrics/snapshot for the web UI.
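
For a quick check from a script, the snapshot endpoint can be read with the standard library alone. This sketch assumes the REST API is served over HTTP on the same host and port as the WebSocket endpoint (127.0.0.1:7432) and that the snapshot body is a JSON object; `metrics_url` and `fetch_snapshot` are hypothetical helper names.

```python
import json
from urllib.request import urlopen

# Assumption: REST shares the host and port of ws://127.0.0.1:7432.
BASE_URL = "http://127.0.0.1:7432"


def metrics_url(snapshot: bool = False, base_url: str = BASE_URL) -> str:
    """Build the Prometheus or JSON-snapshot metrics URL."""
    path = "/api/v1/metrics/snapshot" if snapshot else "/api/v1/metrics"
    return base_url.rstrip("/") + path


def fetch_snapshot(base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """GET the JSON snapshot; raises URLError if the daemon is not running."""
    url = metrics_url(snapshot=True, base_url=base_url)
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

The plain `/api/v1/metrics` URL is the one to point a Prometheus scrape job at; the snapshot variant is more convenient for ad hoc inspection.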