Production Deployment Checklist

Project: Tractatus AI Safety Framework Website
Environment: Production (vps-93a693da.vps.ovh.net)
Domain: https://agenticgovernance.digital
Created: 2025-10-09
Status: Active Procedure


Overview

This checklist ensures safe, consistent deployments to production. Always follow this procedure to prevent security incidents, service disruptions, and data loss.

Deployment Philosophy:

  • Deploy early, deploy often
  • Test thoroughly before deploying
  • Verify after deploying
  • Document incidents and learn

Incident Prevention: This checklist was created after a security incident where sensitive Claude Code governance files were accidentally deployed to production. Following this procedure prevents similar incidents.


Pre-Deployment Checklist

1. Code Quality Verification

  • All tests passing locally

    npm test
    
    • Expected: All tests pass, no failures
    • If any tests fail: Fix before deploying
  • Test coverage acceptable

    npm test -- --coverage
    
    • Check critical services maintain 80%+ coverage
    • Confirm that new code has reasonable coverage
  • Linting passes (if linter configured)

    npm run lint
    # OR
    npx eslint src/
    

2. Security Verification

  • Run security audit

    npm audit
    
    • Review all vulnerabilities
    • Critical/High: Must fix or document why acceptable
    • Medium/Low: Review and plan fix if needed
    • If fixes available: npm audit fix then re-test
  • Check for sensitive files in git

    git ls-files | grep -E '(CLAUDE|SESSION|\.env|SECRET|HANDOFF|CLOSEDOWN|_Maintenance_Guide)'
    
    • Expected: No matches (all sensitive files excluded)
    • If matches found: Review .gitignore and remove from git history
  • Verify .rsyncignore completeness

    cat .rsyncignore
    
    • Confirm excludes:
      • CLAUDE*.md, SESSION*.md, maintenance guides
      • .env, .env.local, .env.production.local
      • node_modules/, .git/, .claude/
      • Test files, coverage reports
      • Development-only files
  • Check environment secrets not in code

    grep -r "sk-ant-" src/ || echo "No API keys found ✓"
    grep -r "mongodb://tractatus" src/ || echo "No hardcoded DB URLs ✓"
    
    • Expected: No hardcoded secrets in source code
    • All secrets in .env files (which are excluded)
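
The two file checks above can be wrapped in small helpers so they fail loudly instead of relying on a visual scan. This is a minimal sketch, not part of the project's scripts: the function names are hypothetical, and the required-pattern list assumes `.rsyncignore` holds one pattern per line written exactly as shown.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not existing project scripts.

# Same pattern as the `git ls-files | grep -E ...` check in this checklist.
SENSITIVE_PATTERN='(CLAUDE|SESSION|\.env|SECRET|HANDOFF|CLOSEDOWN|_Maintenance_Guide)'

# Reads file paths on stdin and prints the ones that look sensitive.
# Returns 0 when nothing matches, 1 when at least one path is printed.
find_sensitive_files() {
  if grep -E "$SENSITIVE_PATTERN"; then
    return 1
  fi
  return 0
}

# Prints every required exclusion that is missing from an ignore file.
# Assumption: one pattern per line, spelled exactly as in the list below.
missing_rsync_excludes() {
  local ignore_file="$1" pattern status=0
  for pattern in 'CLAUDE*.md' 'SESSION*.md' '.env' 'node_modules/' '.git/' '.claude/'; do
    if ! grep -qxF -- "$pattern" "$ignore_file"; then
      echo "missing: $pattern"
      status=1
    fi
  done
  return "$status"
}
```

A possible invocation is `git ls-files | find_sensitive_files && missing_rsync_excludes .rsyncignore`, which stops at the first check that fails.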

3. Database Verification

  • Database migrations ready (if any)

    # Check if new migrations exist
    ls -la scripts/migrations/ | tail -5
    
    • If migrations exist: Plan migration execution
    • Document migration rollback procedure
  • Backup current database (for major changes)

    ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
      "mongodump --uri='mongodb://tractatus_user:PASSWORD@localhost:27017/tractatus_prod?authSource=tractatus_prod' --out=/tmp/backup-$(date +%Y%m%d-%H%M%S)"
    
    • Only needed for schema changes or major updates
    • Store backup location in deployment notes

4. Change Documentation

  • Review what's being deployed

    git log --oneline origin/main..HEAD
    
    • Confirm all commits are intentional
    • Verify no work-in-progress commits
  • Update CHANGELOG.md (if project uses one)

    • Document user-facing changes
    • Document breaking changes
    • Document security fixes
  • Commit all changes

    git status
    # If uncommitted changes exist, decide: commit or stash
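
The "no work-in-progress commits" review of the `git log --oneline` output can be partly automated. The sketch below assumes unfinished commits are marked with `wip`, `fixup!`, or `squash!` in the subject line; that marker list and the function name are illustrative assumptions, not a project convention.

```shell
#!/usr/bin/env bash
# Reads `git log --oneline` output on stdin and prints commits whose subject
# looks unfinished. The marker list (wip, fixup!, squash!) is an assumption.
# Exit status follows grep: 0 if something was printed, 1 if nothing matched.
find_wip_commits() {
  grep -iE '(^|[[:space:]])(wip|fixup!|squash!)([[:space:]:]|$)'
}
```

For example, `git log --oneline origin/main..HEAD | find_wip_commits` prints nothing when the pending commits look clean.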
    

Deployment Execution

Choose Deployment Method

Decision Matrix:

| What Changed | Script to Use | Command |
|---|---|---|
| Public HTML/CSS/JS only | deploy-frontend.sh | ./scripts/deploy-frontend.sh |
| Koha donation system | deploy-koha-to-production.sh | ./scripts/deploy-koha-to-production.sh |
| Full project (backend, routes, services) | deploy-full-project-SAFE.sh | ./scripts/deploy-full-project-SAFE.sh |
| Emergency rollback | Manual rsync | See rollback section |

Option 1: Frontend-Only Deployment

Use when only public-facing files changed (HTML, CSS, JS, images).

./scripts/deploy-frontend.sh

What it deploys:

  • public/ directory
  • Excludes: admin, backend code, config files

Safety level: Safest (public files only)

Option 2: Koha-Specific Deployment

Use when Koha donation system changed.

./scripts/deploy-koha-to-production.sh

What it deploys:

  • Koha controllers, services, routes
  • Koha frontend (public/koha.html)
  • Related middleware and models

Safety level: ⚠️ Moderate (includes backend code)

Option 3: Full Project Deployment (Most Common)

Use for backend changes, new features, or multi-component updates.

./scripts/deploy-full-project-SAFE.sh

Deployment steps:

  1. Script shows excluded patterns from .rsyncignore
  2. Review exclusions carefully - Verify sensitive files excluded
  3. Script shows dry-run summary
  4. Verify files to be deployed - Look for any unexpected files
  5. Confirm deployment (or Ctrl+C to abort)
  6. Script executes rsync with progress
  7. Deployment complete

What it deploys:

  • All source code (src/)
  • Public files (public/)
  • Configuration (package.json, etc.)
  • Documentation (docs/)
  • Scripts (scripts/)

What it excludes (via .rsyncignore):

  • Claude Code governance files (CLAUDE*.md, SESSION*.md)
  • Environment files (.env*)
  • Node modules (node_modules/)
  • Git repository (.git/)
  • Test files and coverage
  • Development-only files

Safety level: ⚠️ Use carefully (full codebase)

Deployment Verification During Execution

  • Watch for errors during deployment

    • Rsync errors (permission denied, connection failures)
    • File conflicts
    • Unexpected file deletions
  • Verify file count is reasonable

    • Frontend: ~50-100 files
    • Koha: ~20-30 files
    • Full: ~200-300 files (varies by project size)
    • If thousands of files: STOP - check .rsyncignore
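
The file-count sanity check can be expressed as a guard that counts a file listing and refuses to continue above a limit. This is a sketch under assumptions: the function name is hypothetical, and the default limit of 1000 is one reading of "thousands of files", not a value taken from the deployment scripts.

```shell
#!/usr/bin/env bash
# Reads a list of file paths (one per line) on stdin and checks the count.
# Assumption: the default limit of 1000 stands in for "thousands of files".
check_file_count() {
  local limit="${1:-1000}" count
  # Arithmetic expansion strips the padding some `wc` implementations add.
  count=$(( $(wc -l) ))
  echo "files to deploy: $count"
  [ "$count" -lt "$limit" ]
}
```

For a frontend deployment this could be fed with `find public -type f | check_file_count`; a non-zero exit status is the signal to stop and re-check `.rsyncignore`.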

Post-Deployment Verification

1. Immediate Checks (< 2 minutes)

  • Restart application (if backend changes)

    ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
      "sudo systemctl restart tractatus"
    
  • Check service status

    ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
      "sudo systemctl status tractatus"
    
    • Expected: active (running)
    • If failed: Check logs immediately
  • Health endpoint check

    curl https://agenticgovernance.digital/health
    
    • Expected: {"status":"ok","timestamp":"..."} (200 OK)
    • If 500 or error: Check logs, may need rollback
  • Homepage loads

    curl -I https://agenticgovernance.digital
    
    • Expected: HTTP/2 200
    • If 404/500: Critical issue, check logs
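
The health endpoint check can be scripted so that both the HTTP status and the response body are verified. The sketch below is illustrative: the function names and the 10-second timeout are assumptions, and the body is matched as text rather than parsed as JSON.

```shell
#!/usr/bin/env bash
# Returns 0 when a /health response body reports "status":"ok".
# This is a text match that tolerates whitespace around the colon;
# it does not parse JSON.
health_body_ok() {
  printf '%s' "$1" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Fetches the health endpoint and checks the body.
# The default URL is the one in this checklist; the timeout is an assumption.
check_health() {
  local url="${1:-https://agenticgovernance.digital/health}" body
  # -f makes curl fail on HTTP errors such as 500, -sS keeps output quiet
  # but still prints an error message when the request fails.
  body="$(curl -fsS --max-time 10 "$url")" || return 1
  health_body_ok "$body"
}
```

Running `check_health && echo "health OK"` after the restart gives a single pass/fail signal; on failure, continue with the log checks or the rollback procedure.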

2. Functional Checks (2-5 minutes)

3. Log Monitoring (5-15 minutes)

  • Monitor production logs for errors

    ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
      "sudo journalctl -u tractatus -f"
    
    • Watch for:
      • ERROR, CRITICAL log levels
      • Unhandled exceptions
      • Database connection failures
      • 500 errors on requests
    • Monitor for at least 5 minutes
    • If errors appear: Investigate immediately
  • Check for new error patterns

    ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
      "sudo journalctl -u tractatus --since '5 minutes ago' | grep -i error"
    
    • Compare to known errors (acceptable warnings)
    • New errors may indicate deployment issues
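
Comparing against known errors is easier with a filter that hides lines already judged acceptable. This is a sketch: the known-errors file (one fixed substring per line) and the function name are assumptions, since this checklist does not define where acceptable warnings are recorded.

```shell
#!/usr/bin/env bash
# Reads log lines on stdin and prints error lines that contain none of the
# fixed strings listed in a known-errors file (one substring per line).
# Assumption: the file and its format are hypothetical. Avoid blank lines in
# it, because an empty pattern matches every line and would hide everything.
new_error_lines() {
  local known_file="$1"
  if [ -s "$known_file" ]; then
    grep -i 'error' | grep -vF -f "$known_file"
  else
    # No known-errors list yet: report every error line.
    grep -i 'error'
  fi
}
```

On the server this could be used as `sudo journalctl -u tractatus --since '5 minutes ago' | new_error_lines known-errors.txt`, where `known-errors.txt` is a placeholder name.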

4. Analytics Check (Optional, 15+ minutes)

  • Verify Plausible Analytics tracking

  • Check Google Search Console (if configured)

    • Verify no new crawl errors
    • Check for 404 increases

Database Migration Procedure (If Needed)

Only required when schema changes or data migrations are needed.

Pre-Migration

  • Backup database (already done in pre-deployment)
  • Test migration on staging (if staging environment exists)
  • Review migration script
    cat scripts/migrations/YYYYMMDD-description.js
    

Execute Migration

ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
  "cd /var/www/tractatus && node scripts/migrations/YYYYMMDD-description.js"

Post-Migration

  • Verify migration success

    # Check migration completed
    # Check data integrity
    
  • Test affected features

    • Any features using migrated data

Migration Rollback (If Needed)

  • Restore database from backup

    ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
      "mongorestore --uri='...' /tmp/backup-TIMESTAMP"
    
  • Rollback code (see rollback section)


Rollback Procedure

Use if deployment causes critical issues that can't be quickly fixed.

When to Rollback

  • Application won't start
  • Critical features completely broken
  • Security vulnerability introduced
  • Data loss or corruption occurring
  • 500 errors on every request

How to Rollback

  1. Identify last known good commit

    git log --oneline -10
    # Find commit before problematic changes
    
  2. Checkout last good commit

    git checkout <commit-hash>
    
  3. Redeploy using same script

    # Use same deployment script as original deployment
    ./scripts/deploy-full-project-SAFE.sh
    
  4. Restart application

    ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
      "sudo systemctl restart tractatus"
    
  5. Verify rollback successful

    • Check health endpoint
    • Check homepage loads
    • Check logs for errors
  6. Return to main branch

    git checkout main
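
Steps 2 to 6 above can be chained in one function so that nothing is skipped under pressure. This is a minimal sketch rather than an existing project script: the `run` and `rollback` names and the `DRY_RUN` variable are hypothetical, and it assumes the original deployment used the full-project script.

```shell
#!/usr/bin/env bash
# Runs a command, or only prints it when DRY_RUN=1 (hypothetical variable).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Steps 2-6 of the rollback procedure for a commit chosen in step 1.
# Stops at the first failing step; in that case the working tree stays on
# the checked-out commit and `git checkout main` must be run by hand.
rollback() {
  local good_commit="$1"
  [ -n "$good_commit" ] || { echo "usage: rollback <commit-hash>" >&2; return 2; }
  run git checkout "$good_commit" || return 1
  run ./scripts/deploy-full-project-SAFE.sh || return 1
  run ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
    "sudo systemctl restart tractatus" || return 1
  run curl -fsS https://agenticgovernance.digital/health || return 1
  run git checkout main
}
```

`DRY_RUN=1 rollback <commit-hash>` prints the commands without executing them, which is a cheap way to keep the rollback path rehearsed.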
    

Post-Rollback

  • Document incident

    • What went wrong?
    • What was the impact?
    • How was it detected?
    • How long was it broken?
    • What was rolled back?
  • Create incident report (template below)

  • Fix issue in development

    • Reproduce locally
    • Fix root cause
    • Add tests to prevent recurrence
    • Re-deploy when ready

Incident Documentation Template

Create file: docs/incidents/YYYY-MM-DD-description.md

# Incident Report: [Brief Description]

**Date**: YYYY-MM-DD HH:MM (NZST)
**Severity**: [Critical / High / Medium / Low]
**Duration**: [X minutes/hours]
**Detected By**: [User report / Monitoring / Developer]

## Summary
[1-2 sentence summary of what went wrong]

## Timeline
- HH:MM - Deployment initiated
- HH:MM - Issue detected
- HH:MM - Rollback initiated
- HH:MM - Service restored

## Root Cause
[What caused the issue?]

## Impact
- User-facing impact: [What did users experience?]
- Data impact: [Was any data lost/corrupted?]
- Security impact: [Were any security boundaries crossed?]

## Resolution
[How was it fixed?]

## Prevention
[What changes prevent this from happening again?]

## Action Items
- [ ] Fix root cause
- [ ] Add tests
- [ ] Update deployment checklist
- [ ] Update monitoring

Deployment Log Template

Keep a deployment log in: docs/deployments/YYYY-MM.md

# Deployments: [Month Year]

## YYYY-MM-DD HH:MM - [Description]

**Deployed By**: [Name]
**Deployment Type**: [Frontend / Koha / Full]
**Commits Deployed**:
- abc123 - Description
- def456 - Description

**Pre-Deployment Checks**:
- [x] Tests passing
- [x] Security audit clean
- [x] No sensitive files

**Verification**:
- [x] Health check passed
- [x] Homepage loads
- [x] No errors in logs

**Issues**: None
**Rollback Required**: No
**Notes**: [Any relevant notes]

Emergency Procedures

Service Won't Start

  1. Check logs immediately

    ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
      "sudo journalctl -u tractatus -n 100"
    
  2. Common issues:

    • MongoDB connection failed → Check MongoDB running: sudo systemctl status mongod
    • Port already in use → Check for stale processes still holding the port: sudo lsof -i :9000
    • Missing environment variables → Check .env file exists
    • Syntax error in code → Rollback immediately
  3. Quick fixes:

    # Restart MongoDB if stopped
    sudo systemctl start mongod
    
    # Kill stale node processes (pattern quoted so the shell cannot glob-expand it)
    sudo pkill -f 'node.*tractatus'
    
    # Restart application
    sudo systemctl restart tractatus
    

Database Connection Lost

  1. Verify MongoDB running

    ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
      "sudo systemctl status mongod"
    
  2. Check MongoDB logs

    ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
      "sudo journalctl -u mongod -n 50"
    
  3. Test connection manually

    ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
      "mongosh --host localhost --port 27017 --authenticationDatabase tractatus_prod -u tractatus_user"
    

High Error Rate

  1. Identify error pattern

    ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
      "sudo journalctl -u tractatus --since '10 minutes ago' | grep ERROR | sort | uniq -c | sort -rn | head -10"
    
  2. Check if all endpoints affected or specific routes

    # Check health endpoint
    curl https://agenticgovernance.digital/health
    
    # Check specific routes
    curl https://agenticgovernance.digital/api/documents
    
  3. Decision:

    • If isolated to one feature: Disable feature, investigate
    • If site-wide: Rollback immediately

CRITICAL: HTML Caching Rules

MANDATORY REQUIREMENT: HTML files MUST be delivered fresh to users without requiring cache refresh.

The Problem

Service worker caching HTML files caused deployment failures where users saw OLD content even after deploying NEW code. Users should NEVER need to clear cache manually.

The Solution (Enforced as of 2025-10-17)

Service Worker (public/service-worker.js):

  • HTML files: Network-ONLY strategy (never cache, always fetch fresh)
  • Exception: /index.html only for offline fallback
  • Bump CACHE_VERSION constant whenever service worker logic changes

Server (src/server.js):

  • HTML files: Cache-Control: no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0
  • This ensures browsers never cache HTML pages
  • CSS/JS: Long cache OK (use version parameters for cache-busting)

Version Manifest (public/version.json):

  • Update version number when deploying HTML changes
  • Service worker checks this for updates
  • Set forceUpdate: true for critical fixes
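
Incrementing the version is a small piece of arithmetic that is easy to get wrong by hand. The sketch below computes the next patch version for a `MAJOR.MINOR.PATCH` string; the function name is hypothetical, it assumes the patch number has no leading zeros, and it leaves editing `public/version.json` itself to the deployer.

```shell
#!/usr/bin/env bash
# Prints the next patch version for a MAJOR.MINOR.PATCH string.
# Assumption: the patch component is a plain decimal without leading zeros.
bump_patch() {
  local prefix="${1%.*}"   # everything before the last dot, e.g. "1.1"
  local patch="${1##*.}"   # everything after the last dot, e.g. "2"
  echo "${prefix}.$((patch + 1))"
}
```

For example, `bump_patch 1.1.2` prints `1.1.3`.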

Deployment Rules for HTML Changes

When deploying HTML file changes:

  1. Verify service worker never caches HTML (except index.html)

    grep -A 10 "HTML files:" public/service-worker.js
    # Should show: Network-ONLY strategy, no caching
    
  2. Verify server sends no-cache headers

    grep -A 3 "HTML files:" src/server.js
    # Should show: no-store, no-cache, must-revalidate
    
  3. Bump version.json if critical content changed

    # Edit public/version.json
    # Increment version: 1.1.2 → 1.1.3
    # Update changelog
    # Set forceUpdate: true
    
  4. After deployment, verify headers in production

    curl -s -I https://agenticgovernance.digital/koha.html | grep -i cache-control
    # Expected: no-store, no-cache, must-revalidate
    
    curl -s https://agenticgovernance.digital/koha.html | grep "<title>"
    # Verify correct content showing
    
  5. Test in incognito window

Testing Cache Behavior

Before deployment:

# Local: Verify server sends correct headers
curl -s -I http://localhost:9000/koha.html | grep -i cache-control
# Expected: no-store, no-cache

# Verify service worker doesn't cache HTML
grep -A 10 "endsWith('.html')" public/service-worker.js
# Should NOT cache responses, only fetch

After deployment:

# Production: Verify headers
curl -s -I https://agenticgovernance.digital/<file>.html | grep -i cache-control

# Production: Verify fresh content
curl -s https://agenticgovernance.digital/<file>.html | grep "<title>"
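
The header checks above can be looped over several pages at once. This is a sketch under assumptions: the function names are hypothetical, the page list is whatever the deployer passes in, and only the `no-store` directive is tested as a proxy for the full header value.

```shell
#!/usr/bin/env bash
# Returns 0 when response headers (as printed by `curl -s -I`) contain a
# Cache-Control header with the no-store directive. Header names are
# case-insensitive, so the match is too.
html_not_cached() {
  printf '%s\n' "$1" | grep -i '^cache-control:' | grep -qi 'no-store'
}

# Checks several pages on one host and reports each result.
# Returns 1 if any page is missing the expected header.
check_html_headers() {
  local base="$1" page failed=0
  shift
  for page in "$@"; do
    if html_not_cached "$(curl -s -I "$base/$page")"; then
      echo "ok   $page"
    else
      echo "FAIL $page"
      failed=1
    fi
  done
  return "$failed"
}
```

A possible call is `check_html_headers https://agenticgovernance.digital koha.html index.html`, with the same function pointed at `http://localhost:9000` before deployment.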

Incident Prevention

Lesson Learned (2025-10-17 Koha Deployment):

  • Deployed koha.html with reciprocal giving updates
  • Service worker cached old version
  • Users saw old content despite fresh deployment
  • Required THREE deployment attempts to fix
  • Root cause: Service worker was caching HTML with network-first strategy

Prevention:

  • Service worker now enforces network-ONLY for all HTML (except offline index.html)
  • Server enforces no-cache headers
  • This checklist records the requirement as a standing architectural rule

Deployment Best Practices

DO:

  • Deploy during low-traffic hours (10am-2pm NZST, when US traffic is low)
  • Deploy small, focused changes (easier to debug)
  • Test thoroughly before deploying
  • Monitor logs after deployment
  • Document all deployments
  • Keep rollback procedure tested and ready
  • Communicate with team before major deployments
  • CRITICAL: Verify HTML cache headers before and after deployment
  • CRITICAL: Test in incognito window after HTML deployments

DON'T:

  • Deploy on Friday afternoon (limited time to fix issues)
  • Deploy multiple unrelated changes together
  • Skip testing "because it's a small change"
  • Deploy without checking logs after
  • Deploy when tired or rushed
  • Deploy without ability to rollback
  • Forget to restart services after backend changes
  • CRITICAL: Never cache HTML files in service worker (except offline fallback)
  • CRITICAL: Never ask users to clear their browser cache - fix it server-side

Deployment Timing Guidelines

Best Times (Low risk):

  • Monday-Thursday, 10am-2pm NZST
  • After morning coffee, before lunch
  • When you have 2+ hours to monitor

Acceptable Times (Medium risk):

  • Monday-Thursday, 2pm-5pm NZST
  • Early morning deployments (if you're alert)

Avoid Times (High risk):

  • Friday 3pm+ (weekend coverage issues)
  • Late evening (tired, less alert)
  • During known high-traffic events
  • When about to leave/travel
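
The time windows above can be encoded as a small lookup, for instance to print a warning at the top of a deployment script. This is an illustrative sketch: the function name is hypothetical, and any slot the guidelines do not name explicitly (weekends, early mornings, late evenings) is reported as `unclassified` rather than guessed.

```shell
#!/usr/bin/env bash
# Classifies a deployment slot from ISO weekday (1=Mon .. 7=Sun) and hour
# (0-23), both in New Zealand local time. Only the windows named in the
# guidelines are mapped; everything else is "unclassified".
deploy_risk() {
  local dow="$1" hour="$2"
  if [ "$dow" -eq 5 ] && [ "$hour" -ge 15 ]; then
    echo high          # Friday 3pm or later
  elif [ "$dow" -le 4 ] && [ "$hour" -ge 10 ] && [ "$hour" -lt 14 ]; then
    echo low           # Monday-Thursday, 10am-2pm
  elif [ "$dow" -le 4 ] && [ "$hour" -ge 14 ] && [ "$hour" -lt 17 ]; then
    echo medium        # Monday-Thursday, 2pm-5pm
  else
    echo unclassified
  fi
}
```

The inputs could come from `TZ=Pacific/Auckland date +%u` and `TZ=Pacific/Auckland date +%H`; judgment calls such as tiredness or upcoming travel still have to be made by the person deploying.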

Automation Opportunities (Future)

Potential Improvements:

  • Automated testing in CI/CD (GitHub Actions)
  • Automated deployment on merge to main (after tests pass)
  • Automated health checks post-deployment
  • Automated rollback on health check failure
  • Slack notifications for deployments
  • Blue-green deployment for zero-downtime
  • Canary deployments for gradual rollout

Not Ready Yet Because:

  • Need stable test suite (NOW READY - 380 tests passing)
  • Need monitoring in place (Next task - Option D)
  • Need error alerting (Next task - Option D)
  • Need staging environment (💡 Future consideration)

Checklist Quick Reference

Pre-Deploy:

  • Tests pass
  • Security audit clean
  • No sensitive files
  • .rsyncignore verified

Deploy:

  • Choose correct script
  • Review dry-run
  • Execute deployment
  • Note any errors

Verify:

  • Service running
  • Health check OK
  • Homepage loads
  • Monitor logs 5-15min

Document:

  • Log deployment
  • Note any issues
  • Update team

Contact & Support

Production Access:

  • SSH: ubuntu@vps-93a693da.vps.ovh.net
  • Key: ~/.ssh/tractatus_deploy
  • Sudo: Available for systemctl, journalctl

Service Management:

  • Service: tractatus.service (systemd)
  • Status: sudo systemctl status tractatus
  • Logs: sudo journalctl -u tractatus -f
  • Restart: sudo systemctl restart tractatus

Database:

  • Host: localhost:27017
  • Database: tractatus_prod
  • Auth: tractatus_prod database
  • User: tractatus_user

Domain:

  • https://agenticgovernance.digital


Document Status: Active Procedure
Last Updated: 2025-10-09
Next Review: After major deployment or incident
Maintainer: Technical Lead (Claude Code + John Stroh)