Mentor Zero — Architecture & Deployment

Current architecture and deployment details for Mentor Zero.

Mentor Zero — Architecture & Deployment Guide (current)

This guide keeps the same three-part layout: (1) Product summary (~1 page), (2) Technical architecture (~10 pages), and (3) Deployment & AWS environment (~2 pages). It reflects the current dev stack: a Flask API and scheduler on EC2 (Docker Compose), Caddy for TLS, Postgres in Docker, SES/Lambda for inbound email, SSM-backed secrets, ECR images, static pages served by Caddy/app, and GitHub Actions for build/push plus an SSM-driven bootstrap. Recent updates: GitHub OIDC IAM role, build-and-push workflow, deploy of `run_bootstrap.sh` via S3 and SSM, prompt logging toggle, topic tagging via subject tag and header, admin config UI, verbose error logging, and reply dedupe by Message-ID via `event_log`.

---

1) Product Summary (~1 page)

  • What it is: Email-first AI mentor. One calm letter per topic per day, voiced like noted philosophers/scientists. Email is the deliberate medium; replying is the only interaction.
  • Personalization: Uses user context, streak/progress, notes, recent letter metadata to avoid repetition and tune tone.
  • Auth: Passwordless magic links (one-time tokens), sessions in DB, `mz_session` cookie. Logout revokes session. Admins from `app_admin`.
  • Delivery: SMTP (SES or Gmail). Scheduler sends daily. Footer carries login link (one-time token). Send-now per topic.
  • Scheduling: Per-user/topic send time + timezone; scheduler enqueues next job after each send; disabling a topic clears pending jobs.
  • Inbound replies: IMAP or SES→Lambda→webhook. Topic encoded via `[mz:<code>]` subject tag and `X-MZ-Topic` header. AI classifies done/note/unsubscribe; updates progress or disables topic; idempotent by Message-ID.
  • Admin: Prompt editor, users, footer editor, metrics dashboard, simulate reply, app-config toggle (prompt logging).
  • Observability: Structured logs; `event_log` for replies/sends; `/health` and `/metrics` (jobs + auth/webhook counters); verbose errors on webhook/login-link/scheduler; prompt logging optional.

---

2) Technical Architecture (~10 pages)

2.1 High-Level Flow

  • Enrollment: Landing page → `/api/login-link` sends magic link → `/login` consumes token, sets session cookie.
  • Settings: `/api/topics` saves enabled flag, context_note, send time, timezone; upserts schedule; enqueues next job; disabling clears pending jobs.
  • Send loop: Scheduler polls `email_job` (pending & due), marks sending, generates/sends, persists letter/metadata/progress, enqueues next run, updates `next_run_at_utc`.
  • Send-now: `/api/send-now` enqueues immediate job if user/topic active/enabled.
  • Replies: IMAP or SES/Lambda → webhook; AI marks done/note/unsubscribe; updates progress or disables topic; idempotent by Message-ID via `event_log`.
  • Admin: Prompt/templates, users, footer, metrics, simulate reply, config toggle for prompt logging; Admin link shown when `is_admin`.

2.2 Backend (Flask, `app.py`)

  • Serves SPA/static: `web/app.html`, `web/landing/index.html`, `web/architecture.html`, `web/user_guide.html`.
  • Auth: `/api/login-link`, `/api/signup`, `/login`, `/logout`, `/api/logout`, `/api/me`.
  • Settings/actions: `/api/topics` (GET/POST), `/api/complete`, `/api/send-now`.
  • Admin APIs/pages: prompts, users, footer, metrics, simulate reply, config toggle (`/api/admin/config`, `admin_config.html`).
  • Health/metrics: `/health`, `/metrics` (job/auth/webhook counters).
  • Webhooks: `/api/webhook/reply` (HMAC `X-MZ-Signature`), `/api/maildev-webhook` (local).
  • Middleware: request IDs; simple rate limits (login-link/webhook); host-aware cookies (Secure toggled by env).
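
The "simple rate limits" on login-link and webhook are not specified further here. As a minimal sketch, assuming an in-memory sliding window keyed by endpoint and client address (the class name, limits, and key format are hypothetical and not taken from `app.py`):

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """In-memory limiter: at most `limit` hits per `window_s` seconds per key."""

    def __init__(self, limit, window_s, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock  # injectable so tests can control time
        self.hits = defaultdict(deque)

    def allow(self, key):
        now = self.clock()
        q = self.hits[key]
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] >= self.window_s:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True


# Hypothetical usage: 5 login-link requests per minute per client address.
login_limiter = SlidingWindowLimiter(limit=5, window_s=60)
allowed = login_limiter.allow("login-link:203.0.113.7")
```

State held this way is per process, so with several gunicorn workers each worker would count separately; a shared store would be needed for a strict global limit.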

2.3 Core Modules

  • `config.py`: env+SSM config, OpenAI client, SMTP config, logging, defaults, magic link secret, `DEFAULT_ADMIN_EMAILS`.
  • `db.py` (re-exports helpers): users/topics/templates/context; letters/metadata/prompts; schedules (`user_topic_schedule`), jobs (`email_job`); `event_log`; login tokens/sessions; app_config; footer; quotes; queue helpers.
  • `progress.py`: streak/done/missed; update completion/note.
  • `prompt.py`: build per-topic prompt; OpenAI JSON-mode; subject normalization + topic tag.
  • `mail.py`: SMTP send; IMAP ingestion; AI classify (done/note/unsubscribe); idempotent by Message-ID; sets `X-MZ-Topic` and `[mz:<code>]` subject tag; footer login links with DB token.
  • `scheduler.py`: compute next run, enqueue, poll pending, dispatch worker, persist letter/metadata/progress, enqueue next job; send-now uses same queue.
  • `webhook.py`: SES/Lambda handler; HMAC verify; AI unsubscribe; idempotent by Message-ID; logs payload/analysis.
  • `health.py`: queue metrics; `cli.py` for manual runs.

2.4 Data Model (key tables)

  • Users/admin/auth: `app_user`, `app_admin`, `login_token`, `app_session`
  • Topics/prompts: `topic`, `prompt_template`, `philosopher`
  • User-topic: `user_topic` (enabled/context), `user_topic_schedule` (timezone, send_time_local, next_run_at_utc)
  • Queue: `email_job` (run_at_utc, status pending/sending/sent/error, schedule_id, letter_id)
  • Letters: `letter`, `letter_metadata`, `letter_prompt` (optional prompt logging)
  • Progress: `user_progress` (completed, note, streak_at_time, letter_id)
  • Events: `event_log` (LETTER_SENT, REPLY_PROCESSED, UNSUBSCRIBE, LOGIN, etc.)
  • Quotes/footer/config: `bottom_quote`, `bottom_quote_history`, `email_footer`, `app_config`

2.5 Email Generation, Scheduling, Send-Now

  • `/api/topics` save: upsert schedule, compute `next_run_at_utc`, insert pending job; disabling clears pending jobs and pauses schedule.
  • Scheduler loop (~60s): fetch due pending jobs, mark sending, build letter, send SMTP, mark sent/error, enqueue next job, update schedule.
  • Send-now: enqueue immediate job if active/enabled; uses same pipeline.
  • Subjects normalized and tagged; footer renders `{{login_link}}` using DB one-time token (48h).
  • Prompt logging optional via `app_config.log_prompts` (admin toggle) → `letter_prompt`.
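
The `next_run_at_utc` computation from a local send time and timezone might look roughly like the sketch below. The function name and signature are hypothetical, not the actual `scheduler.py` API, and it assumes the system has IANA timezone data available to `zoneinfo`.

```python
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo


def compute_next_run_utc(send_time_local, tz_name, now_utc):
    """Return the next UTC instant at which the local clock reads send_time_local.

    now_utc must be timezone-aware.
    """
    tz = ZoneInfo(tz_name)
    today_local = now_utc.astimezone(tz).date()
    candidate = datetime.combine(today_local, send_time_local, tzinfo=tz)
    # Compare in UTC so the check is unambiguous around DST changes.
    if candidate.astimezone(timezone.utc) <= now_utc:
        # Today's slot has passed; use the same wall-clock time tomorrow.
        candidate = datetime.combine(
            today_local + timedelta(days=1), send_time_local, tzinfo=tz
        )
    return candidate.astimezone(timezone.utc)


# Example: 12:00 UTC on 15 Jan 2024 is 07:00 in New York (UTC-5 in winter).
now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
next_run = compute_next_run_utc(time(8, 0), "America/New_York", now)
# next_run is 2024-01-15 13:00 UTC
```

Storing the result in UTC while keeping `send_time_local` and the timezone on the schedule row means the next run can be recomputed correctly after a daylight-saving change.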

2.6 Prompting

  • OpenAI JSON-mode: subject/body/summary/themes/tone/advice_focus/variation_tags.
  • Inputs: prompt template (DB or fallback), recent metadata hints, progress context (streak/done/missed/notes/context), user/topic context.
  • Subject normalization + topic tag; resilient to malformed JSON responses.
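
One way to tolerate malformed JSON and to normalize and tag subjects is sketched below. This is an illustration under assumptions, not the `prompt.py` implementation: the function names, the fallback subject, the length cap, and placing the tag at the end of the subject are all hypothetical.

```python
import json
import re

TAG_RE = re.compile(r"\s*\[mz:[^\]]*\]")


def parse_letter_json(raw, fallback_subject="Your letter"):
    """Parse a model response into a dict, tolerating code fences and stray text."""
    text = raw.strip()
    # Keep only the outermost {...} span, which drops fences or chatter around it.
    start, end = text.find("{"), text.rfind("}")
    data = {}
    if start != -1 and end > start:
        try:
            parsed = json.loads(text[start : end + 1])
            if isinstance(parsed, dict):
                data = parsed
        except json.JSONDecodeError:
            pass  # fall through to the fallback values below
    subject = str(data.get("subject") or fallback_subject)
    body = str(data.get("body") or text)
    return {**data, "subject": subject, "body": body}


def tag_subject(subject, topic_code, max_len=120):
    """Strip any existing tag, collapse whitespace, and append `[mz:<code>]`."""
    clean = " ".join(TAG_RE.sub("", subject).split())
    tag = f"[mz:{topic_code}]"
    return f"{clean[: max_len - len(tag) - 1].rstrip()} {tag}"
```

If parsing fails entirely, the raw text is used as the body so a letter can still be sent, which is one possible reading of "resilient" here.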

2.7 Inbound Replies & Unsubscribe

  • IMAP or SES/Lambda → webhook (`/api/webhook/reply` HMAC). Topic from `X-MZ-Topic` header or `[mz:<code>]` tag. AI decides done/note/unsubscribe; unsubscribe disables topic; idempotent by Message-ID via `event_log`.
  • Maildev webhook for local; admin simulate uses same logic.
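
The webhook checks can be sketched as follows. The source only states that `/api/webhook/reply` verifies an HMAC carried in `X-MZ-Signature`; the choice of HMAC-SHA256, a hex digest over the raw request body, and the helper names are assumptions for illustration.

```python
import hashlib
import hmac
import re

TOPIC_TAG_RE = re.compile(r"\[mz:([A-Za-z0-9_-]+)\]")


def sign_payload(secret, body):
    """Hex HMAC-SHA256 of the raw body (assumed scheme, see lead-in)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret, body, signature):
    """Constant-time comparison of the expected and received signatures."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected.encode(), (signature or "").encode())


def extract_topic_code(headers, subject):
    """Prefer the X-MZ-Topic header; fall back to the [mz:<code>] subject tag."""
    header_value = (headers.get("X-MZ-Topic") or "").strip()
    if header_value:
        return header_value
    match = TOPIC_TAG_RE.search(subject or "")
    return match.group(1) if match else None
```

Carrying the topic in both a header and the subject gives a fallback when a mail client drops custom headers on reply but preserves the subject line.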

2.8 Frontend

  • `landing/index.html`: calm copy, magic-link form, topic selector, contact mailto, links to Architecture/User Guide docs.
  • `app.html`: topic toggles, context_note, send time/TZ, send-now buttons, admin link when `is_admin`, logout.
  • Admin pages: prompts, users, footer, metrics, reply simulator, config toggle.
  • Docs: `architecture.html`, `user_guide.html`. Served by Flask/Caddy.

2.9 Configuration & Secrets

  • Env/SSM keys: `DATABASE_URL[_SSM]`, `OPENAI_API_KEY[_SSM]`, `MAGIC_LINK_SECRET[_SSM]`, `WEBHOOK_SECRET[_SSM]`, SMTP (`SMTP_HOST/PORT/USE_TLS/REQUIRE_AUTH/USERNAME/PASSWORD` or `_SSM`), `SENDER_EMAIL`, `DEFAULT_ADMIN_EMAILS`, `APP_BASE_URL`, `AWS_REGION`.
  • `_get_config_value`: env → SSM → default; required keys raise.
  • User-data writes `.env` with SSM param names (no secrets); app fetches via SSM at runtime using instance role.
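
The env → SSM → default lookup order could be sketched like this. It is not the real `_get_config_value`: the signature is hypothetical, and `fetch_ssm` is an injected callable standing in for an SSM client call (for example a wrapper around boto3's `get_parameter` with decryption enabled) so the sketch runs without AWS access.

```python
import os

_MISSING = object()


def get_config_value(name, default=_MISSING, env=None, fetch_ssm=None):
    """Resolve a config key: direct env value, then SSM via NAME_SSM, then default."""
    env = os.environ if env is None else env
    if env.get(name):
        return env[name]
    # NAME_SSM holds the parameter *name*, never the secret itself.
    param_name = env.get(f"{name}_SSM")
    if param_name and fetch_ssm is not None:
        value = fetch_ssm(param_name)
        if value:
            return value
    if default is not _MISSING:
        return default
    raise RuntimeError(f"Missing required config value: {name}")


# Hypothetical parameter names and values, for illustration only.
env = {"OPENAI_API_KEY_SSM": "/mz/dev/openai_api_key", "APP_BASE_URL": "https://example.test"}
fake_ssm = {"/mz/dev/openai_api_key": "sk-test"}
api_key = get_config_value("OPENAI_API_KEY", env=env, fetch_ssm=fake_ssm.get)
```

Using a sentinel rather than `None` for "no default" lets a key legitimately default to `None` while still raising for required keys.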

2.10 Observability & Logging

  • Structured logs; request IDs; verbose errors on webhook/login-link/scheduler; AI analysis logged in `event_log.metadata`.
  • `/health`, `/metrics` (job counts, auth/webhook counters, oldest pending, reply/unsubscribe counters).
  • Admin metrics UI; prompt logging toggle; event_log for sends/replies/unsubscribes/login.
  • Gaps: no remote log sink, no latency histograms, no alerts.

2.11 Reliability & Idempotency

  • Jobs: single-instance polling; no distributed locks; jobs are upserted on `(user_id, topic_id, run_at_utc)`, so a duplicate enqueue resets the existing job instead of adding a second one.
  • Replies: idempotent by Message-ID via `event_log`; HMAC on webhook; AI unsubscribe.
  • Gaps: send idempotency guard missing; no retries/backoff/DLQ; no multi-instance coordination.
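
Both idempotency mechanisms above can be illustrated with unique constraints. The sketch uses an in-memory SQLite database as a stand-in for Postgres; the reduced column sets, the unique constraint on `message_id`, and the function names are assumptions, not the actual schema or `db.py` helpers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE email_job (
        user_id INTEGER, topic_id INTEGER, run_at_utc TEXT, status TEXT,
        UNIQUE (user_id, topic_id, run_at_utc)
    );
    CREATE TABLE event_log (
        event_type TEXT, message_id TEXT UNIQUE
    );
    """
)


def enqueue_job(user_id, topic_id, run_at_utc):
    """Insert a pending job, or reset the existing job for the same slot."""
    conn.execute(
        """
        INSERT INTO email_job (user_id, topic_id, run_at_utc, status)
        VALUES (?, ?, ?, 'pending')
        ON CONFLICT (user_id, topic_id, run_at_utc)
        DO UPDATE SET status = 'pending'
        """,
        (user_id, topic_id, run_at_utc),
    )


def record_reply_once(message_id):
    """Return True the first time a Message-ID is seen, False on replays."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO event_log (event_type, message_id) "
        "VALUES ('REPLY_PROCESSED', ?)",
        (message_id,),
    )
    return cur.rowcount == 1
```

A caller would process a reply only when `record_reply_once` returns True, so a webhook retry or a second IMAP fetch of the same message becomes a no-op.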

2.12 Security & Auth

  • Magic links (one-time tokens), sessions in DB, logout revokes; cookies HttpOnly, env-aware Secure, SameSite=Lax.
  • Rate limits on login-link/webhook; HMAC on webhook; SSM-backed secrets; TLS via Caddy/Let’s Encrypt.
  • Logs currently contain PII; masking is recommended before prod. Admin access is controlled via `app_admin`.
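
A one-time magic-link token flow might be sketched as below. This is a simplified illustration: a dict stands in for the `login_token` table, storing only a SHA-256 hash of the token is an assumption, the 48-hour lifetime is borrowed from the footer link described earlier, and the sketch does not show how `MAGIC_LINK_SECRET` is used.

```python
import hashlib
import secrets
from datetime import datetime, timedelta, timezone

# token hash -> row; hypothetical stand-in for the `login_token` table
_tokens = {}


def _digest(token):
    return hashlib.sha256(token.encode()).hexdigest()


def issue_token(email, ttl=timedelta(hours=48), now=None):
    """Create a random token and store only its hash with an expiry."""
    now = now or datetime.now(timezone.utc)
    token = secrets.token_urlsafe(32)
    _tokens[_digest(token)] = {"email": email, "expires_at": now + ttl, "used": False}
    return token


def consume_token(token, now=None):
    """Return the email once for a valid token; None if unknown, used, or expired."""
    now = now or datetime.now(timezone.utc)
    row = _tokens.get(_digest(token))
    if row is None or row["used"] or now >= row["expires_at"]:
        return None
    row["used"] = True  # one-time: a second click on the same link fails
    return row["email"]
```

In a real database the check and the `used` update would need to happen atomically (for example a single conditional `UPDATE`) so two concurrent requests cannot both consume the same token.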

2.13 Extensibility

  • Topics: add rows to `topic` and `prompt_template`; prompt renderer is topic-code driven.
  • Prompts: editable via admin; per-topic model/temperature tunable in code.
  • Queue: can move to managed queue + retry/backoff.
  • Static/docs easily extended; admin config can grow (feature flags).

2.14 Testing

  • Unit: prompt parsing, streak calc, scheduler time calc, webhook signature.
  • Integration: progress flows, reply processing (IMAP/webhook), letter metadata persistence.
  • Gaps: E2E SMTP/IMAP, multi-instance scheduler, UI automation, load/alerting tests.

---

3) Deployment & AWS Environment (~2 pages)

3.1 Build & Publish

  • Image: Docker (gunicorn CMD), linux/amd64; Dockerfile installs app + alembic.
  • ECR: `${env_prefix}-app` (default repo). Build tags `latest` and git SHA.
  • GitHub Actions: `build-and-push.yml` (OIDC role) builds/pushes to ECR; `deploy-bootstrap.yml` triggers after successful build.

3.2 Infra Topology

  • Network: VPC (10.10.0.0/16), public subnet, IGW, SG allows 80/443/22.
  • EC2: Ubuntu, IAM role (SSM core, ECR read, SSM GetParameter, optional S3 deploy bucket). User-data installs Docker/compose/awscli, logs into ECR, writes compose/Caddy/.env, starts stack, waits for Postgres, seeds DB, runs alembic, ensures AWS_REGION in .env. Root volume 20 GB gp3.
  • Compose: `app` (gunicorn), `scheduler` (`python -m mentorzero.scheduler`), `db` (Postgres), `caddy` (TLS). Volumes: Postgres data, Caddy certs/config.
  • TLS: Caddy obtains Let’s Encrypt certificates for `www.<root>` and `api.dev.<root>`. (CloudFront removed.)
  • DNS: Route53 A records for `www` and `api.dev` point to the EC2 IP.
  • Inbound email: SES receipt rule → S3 → Lambda → `/api/webhook/reply` with HMAC; secret in SSM; Lambda logs payload to CloudWatch; topic via header/subject tag.
  • Secrets: SSM params for DB URL, DB password, OpenAI key, magic link secret, SMTP creds, webhook secret; `.env` holds SSM param names; app fetches from SSM at runtime.
  • Static: served by Flask/Caddy (`web/landing`, `web/*.html`, docs).

3.3 CI/CD & Bootstrap

  • CI: `build-and-push` builds/pushes on main; uses OIDC role with ECR/SSM/EC2 perms.
  • Deploy: `deploy-bootstrap` runs after a successful build. It uploads `run_bootstrap.sh` to S3; SSM then executes the script on EC2, where it pulls the latest compose config from S3 and runs the bootstrap (Docker up, alembic, seeds).
  • Manual: `run_bootstrap.sh` can be run on host; `docker compose pull && docker compose up -d` to pick a new tag.
  • Terraform: sets DNS, SES, S3 inbound, Lambda, EC2/SG/IAM/SSM params, GitHub OIDC provider/role, deploy artifact bucket policy, app secrets placeholders (ignore_changes on values).

3.4 User-Data / Bootstrap Steps

  • Install docker/compose/awscli; login to ECR.
  • Write compose/Caddy/.env (SSM param names, config; no secrets).
  • `docker compose up -d`.
  • Wait for Postgres; extract SQL from image; run create/seed/backfill; run alembic with SSM-fetched DATABASE_URL.
  • Ensure AWS_REGION in .env for SSM client.

3.5 Runbook (dev)

  • `terraform apply -var "app_image=<ECR_URI:tag>" -var "deploy_artifact_bucket=<bucket>"`.
  • Populate SSM params once (OpenAI, magic link, SMTP, webhook, DB URL/password); `ignore_changes` prevents TF overwrite.
  • Push code → build-and-push → deploy-bootstrap runs; alternatively, connect via SSH/SSM and run `docker compose pull && docker compose up -d`.
  • Logs: `docker logs mentorzero-app-1`, `mentorzero-scheduler-1`, `mentorzero-caddy-1`, `mentorzero-db-1`.
  • Health: `GET /health`; metrics: `/metrics`; webhook: check HMAC.
  • DB seeds in `/opt/mentorzero/sql`; rerun with `docker exec -i mentorzero-db-1 psql -U postgres -d postgres < file.sql`.

3.6 Gaps / Next Steps

  • Add job retry/backoff + idempotency guard; multi-instance scheduler coordination or managed queue.
  • RDS with backups/encryption; ALB/WAF if scaling.
  • Remote log sink, latency metrics, alerts on webhook/job failures/backlog.
  • Secret rotation/reload; CI deploy approval gates; pin TLS/email sender policies.