# Production Deployment Checklist

**Project**: Tractatus AI Safety Framework Website
**Environment**: Production (vps-93a693da.vps.ovh.net)
**Domain**: https://agenticgovernance.digital
**Created**: 2025-10-09
**Status**: Active Procedure

---

## Overview

This checklist ensures safe, consistent deployments to production. **Always follow this procedure** to prevent security incidents, service disruptions, and data loss.

**Deployment Philosophy**:
- Deploy early, deploy often
- Test thoroughly before deploying
- Verify after deploying
- Document incidents and learn

**Incident Prevention**: This checklist was created after a security incident in which sensitive Claude Code governance files were accidentally deployed to production. Following this procedure is intended to prevent similar incidents.

---

## Pre-Deployment Checklist

### 1. Code Quality Verification

- [ ] **All tests passing locally**
  ```bash
  npm test
  ```
  - Expected: All tests pass, no failures
  - If any tests fail: Fix before deploying

- [ ] **Test coverage acceptable**
  ```bash
  npm test -- --coverage
  ```
  - Check that critical services maintain 80%+ coverage
  - Confirm that new code has reasonable coverage

- [ ] **Linting passes** (if linter configured)
  ```bash
  npm run lint
  # OR
  npx eslint src/
  ```

### 2. Security Verification

- [ ] **Run security audit**
  ```bash
  npm audit
  ```
  - Review all vulnerabilities
  - Critical/High: Must fix, or document why the risk is acceptable
  - Medium/Low: Review and plan a fix if needed
  - If fixes are available: run `npm audit fix`, then re-test

- [ ] **Check for sensitive files in git**
  ```bash
  git ls-files | grep -E '(CLAUDE|SESSION|\.env|SECRET|HANDOFF|CLOSEDOWN|_Maintenance_Guide)'
  ```
  - Expected: No matches (all sensitive files excluded)
  - If matches found: Review .gitignore and remove the files from git history

- [ ] **Verify .rsyncignore completeness**
  ```bash
  cat .rsyncignore
  ```
  - Confirm it excludes:
    - `CLAUDE*.md`, `SESSION*.md`, maintenance guides
    - `.env`, `.env.local`, `.env.production.local`
    - `node_modules/`, `.git/`, `.claude/`
    - Test files, coverage reports
    - Development-only files

- [ ] **Check that secrets are not in code**
  ```bash
  grep -r "sk-ant-" src/ || echo "No API keys found ✓"
  grep -r "mongodb://tractatus" src/ || echo "No hardcoded DB URLs ✓"
  ```
  - Expected: No hardcoded secrets in source code
  - All secrets live in .env files (which are excluded)

### 3. Database Verification

- [ ] **Database migrations ready** (if any)
  ```bash
  # Check if new migrations exist
  ls -la scripts/migrations/ | tail -5
  ```
  - If migrations exist: Plan migration execution
  - Document the migration rollback procedure

- [ ] **Backup current database** (for major changes)
  ```bash
  # Note: $(date ...) is expanded on the local machine before ssh runs
  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
    "mongodump --uri='mongodb://tractatus_user:PASSWORD@localhost:27017/tractatus_prod?authSource=tractatus_prod' --out=/tmp/backup-$(date +%Y%m%d-%H%M%S)"
  ```
  - Only needed for schema changes or major updates
  - Record the backup location in the deployment notes

### 4. Change Documentation

- [ ] **Review what's being deployed**
  ```bash
  git log --oneline origin/main..HEAD
  ```
  - Confirm all commits are intentional
  - Verify there are no work-in-progress commits

- [ ] **Update CHANGELOG.md** (if project uses one)
  - Document user-facing changes
  - Document breaking changes
  - Document security fixes

- [ ] **Commit all changes**
  ```bash
  git status
  # If uncommitted changes exist, decide: commit or stash
  ```

---

## Deployment Execution

### Choose Deployment Method

**Decision Matrix:**

| What Changed | Script to Use | Command |
|--------------|---------------|---------|
| Public HTML/CSS/JS only | `deploy-frontend.sh` | `./scripts/deploy-frontend.sh` |
| Koha donation system | `deploy-koha-to-production.sh` | `./scripts/deploy-koha-to-production.sh` |
| Full project (backend, routes, services) | `deploy-full-project-SAFE.sh` | `./scripts/deploy-full-project-SAFE.sh` |
| Emergency rollback | Manual rsync | See rollback section |

### Option 1: Frontend-Only Deployment

Use when only public-facing files changed (HTML, CSS, JS, images).

```bash
./scripts/deploy-frontend.sh
```

**What it deploys:**
- `public/` directory
- Excludes: admin, backend code, config files

**Safety level:** ✅ Safest (public files only)

### Option 2: Koha-Specific Deployment

Use when the Koha donation system changed.

```bash
./scripts/deploy-koha-to-production.sh
```

**What it deploys:**
- Koha controllers, services, routes
- Koha frontend (public/koha.html)
- Related middleware and models

**Safety level:** ⚠️ Moderate (includes backend code)

### Option 3: Full Project Deployment (Most Common)

Use for backend changes, new features, or multi-component updates.

```bash
./scripts/deploy-full-project-SAFE.sh
```

**Deployment steps:**
1. Script shows excluded patterns from .rsyncignore
2. **Review exclusions carefully** - Verify sensitive files are excluded
3. Script shows dry-run summary
4. **Verify files to be deployed** - Look for any unexpected files
5. Confirm deployment (or Ctrl+C to abort)
6. Script executes rsync with progress
7. Deployment complete

**What it deploys:**
- All source code (src/)
- Public files (public/)
- Configuration (package.json, etc.)
- Documentation (docs/)
- Scripts (scripts/)

**What it excludes** (via .rsyncignore):
- Claude Code governance files (CLAUDE*.md, SESSION*.md)
- Environment files (.env*)
- Node modules (node_modules/)
- Git repository (.git/)
- Test files and coverage
- Development-only files

**Safety level:** ⚠️ Use carefully (full codebase)

### Deployment Verification During Execution

- [ ] **Watch for errors during deployment**
  - Rsync errors (permission denied, connection failures)
  - File conflicts
  - Unexpected file deletions

- [ ] **Verify file count is reasonable**
  - Frontend: ~50-100 files
  - Koha: ~20-30 files
  - Full: ~200-300 files (varies by project size)
  - If thousands of files: STOP and check .rsyncignore

---

## Post-Deployment Verification

### 1. Immediate Checks (< 2 minutes)

- [ ] **Restart application** (if backend changes)
  ```bash
  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
    "sudo systemctl restart tractatus"
  ```

- [ ] **Check service status**
  ```bash
  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
    "sudo systemctl status tractatus"
  ```
  - Expected: `active (running)`
  - If failed: Check logs immediately

- [ ] **Health endpoint check**
  ```bash
  curl https://agenticgovernance.digital/health
  ```
  - Expected: `{"status":"ok","timestamp":"..."}` (200 OK)
  - If 500 or error: Check logs, may need rollback

- [ ] **Homepage loads**
  ```bash
  curl -I https://agenticgovernance.digital
  ```
  - Expected: `HTTP/2 200`
  - If 404/500: Critical issue, check logs

### 2. Functional Checks (2-5 minutes)

- [ ] **Test primary user flows:**
  - Visit homepage: https://agenticgovernance.digital
  - Navigate to Researcher path: https://agenticgovernance.digital/researcher.html
  - Navigate to Implementer path: https://agenticgovernance.digital/implementer.html
  - Navigate to Leader path: https://agenticgovernance.digital/leader.html
  - Visit documentation: https://agenticgovernance.digital/docs.html
  - Test interactive demo: https://agenticgovernance.digital/demos/27027-demo.html

- [ ] **Test navigation:**
  - Navbar dropdown menus open on click
  - Mobile menu works (resize browser or use DevTools)
  - Footer links work

- [ ] **Test critical features** (based on what changed):
  - If Koha changed: Test donation flow (test mode)
  - If admin changed: Test admin login
  - If governance changed: Test governance API (with admin token)
  - If documents changed: Test document retrieval

### 3. Log Monitoring (5-15 minutes)

- [ ] **Monitor production logs for errors**
  ```bash
  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
    "sudo journalctl -u tractatus -f"
  ```
  - Watch for:
    - ERROR, CRITICAL log levels
    - Unhandled exceptions
    - Database connection failures
    - 500 errors on requests
  - Monitor for at least 5 minutes
  - If errors appear: Investigate immediately

- [ ] **Check for new error patterns**
  ```bash
  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
    "sudo journalctl -u tractatus --since '5 minutes ago' | grep -i error"
  ```
  - Compare to known errors (acceptable warnings)
  - New errors may indicate deployment issues

### 4. Analytics Check (Optional, 15+ minutes)

- [ ] **Verify Plausible Analytics tracking**
  - Visit https://plausible.io/agenticgovernance.digital
  - Confirm events are being tracked
  - Check for unusual bounce rates or errors

- [ ] **Check Google Search Console** (if configured)
  - Verify no new crawl errors
  - Check for 404 increases

---

## Database Migration Procedure (If Needed)

Only required when schema changes or data migrations are needed.
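As a rough aid for deciding whether this procedure applies, the sketch below flags a deployment whose commits touch `scripts/migrations/`. The function name `needs_migration` is hypothetical and not part of the project's scripts. It only detects changed migration scripts, so schema changes made elsewhere still need manual review.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not part of the project's scripts).
# Reads changed file paths on stdin, one per line.
# Exits 0 if any path is under scripts/migrations/, 1 otherwise.
needs_migration() {
  grep -q '^scripts/migrations/'
}

# Example usage (run from the repository root):
#   git diff --name-only origin/main..HEAD | needs_migration \
#     && echo "Migration scripts changed: follow the migration procedure"
```

Reading the file list from stdin keeps the check independent of how the list is produced, so the same function can be fed from `git diff --name-only` or from a dry-run file list.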

### Pre-Migration

- [ ] **Backup database** (already done in pre-deployment)
- [ ] **Test migration on staging** (if staging environment exists)
- [ ] **Review migration script**
  ```bash
  cat scripts/migrations/YYYYMMDD-description.js
  ```

### Execute Migration

```bash
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
  "cd /var/www/tractatus && node scripts/migrations/YYYYMMDD-description.js"
```

### Post-Migration

- [ ] **Verify migration success**
  ```bash
  # Check migration completed
  # Check data integrity
  ```

- [ ] **Test affected features**
  - Any features using migrated data

### Migration Rollback (If Needed)

- [ ] **Restore database from backup**
  ```bash
  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
    "mongorestore --uri='...' /tmp/backup-TIMESTAMP"
  ```

- [ ] **Rollback code** (see rollback section)

---

## Rollback Procedure

Use if a deployment causes critical issues that can't be fixed quickly.

### When to Rollback

- Application won't start
- Critical features completely broken
- Security vulnerability introduced
- Data loss or corruption occurring
- 500 errors on every request

### How to Rollback

1. **Identify last known good commit**
   ```bash
   git log --oneline -10
   # Find commit before problematic changes
   ```

2. **Checkout last good commit**
   ```bash
   git checkout <commit-hash>
   ```

3. **Redeploy using same script**
   ```bash
   # Use same deployment script as original deployment
   ./scripts/deploy-full-project-SAFE.sh
   ```

4. **Restart application**
   ```bash
   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
     "sudo systemctl restart tractatus"
   ```

5. **Verify rollback successful**
   - Check health endpoint
   - Check homepage loads
   - Check logs for errors

6. **Return to main branch**
   ```bash
   git checkout main
   ```

### Post-Rollback

- [ ] **Document incident**
  - What went wrong?
  - What was the impact?
  - How was it detected?
  - How long was it broken?
  - What was rolled back?
- [ ] **Create incident report** (template below)
- [ ] **Fix issue in development**
  - Reproduce locally
  - Fix root cause
  - Add tests to prevent recurrence
  - Re-deploy when ready

---

## Incident Documentation Template

Create file: `docs/incidents/YYYY-MM-DD-description.md`

```markdown
# Incident Report: [Brief Description]

**Date**: YYYY-MM-DD HH:MM (NZST)
**Severity**: [Critical / High / Medium / Low]
**Duration**: [X minutes/hours]
**Detected By**: [User report / Monitoring / Developer]

## Summary
[1-2 sentence summary of what went wrong]

## Timeline
- HH:MM - Deployment initiated
- HH:MM - Issue detected
- HH:MM - Rollback initiated
- HH:MM - Service restored

## Root Cause
[What caused the issue?]

## Impact
- User-facing impact: [What did users experience?]
- Data impact: [Was any data lost/corrupted?]
- Security impact: [Were any security boundaries crossed?]

## Resolution
[How was it fixed?]

## Prevention
[What changes prevent this from happening again?]

## Action Items
- [ ] Fix root cause
- [ ] Add tests
- [ ] Update deployment checklist
- [ ] Update monitoring
```

---

## Deployment Log Template

Keep a deployment log in: `docs/deployments/YYYY-MM.md`

```markdown
# Deployments: [Month Year]

## YYYY-MM-DD HH:MM - [Description]

**Deployed By**: [Name]
**Deployment Type**: [Frontend / Koha / Full]
**Commits Deployed**:
- abc123 - Description
- def456 - Description

**Pre-Deployment Checks**:
- [x] Tests passing
- [x] Security audit clean
- [x] No sensitive files

**Verification**:
- [x] Health check passed
- [x] Homepage loads
- [x] No errors in logs

**Issues**: None
**Rollback Required**: No
**Notes**: [Any relevant notes]
```

---

## Emergency Procedures

### Service Won't Start

1. **Check logs immediately**
   ```bash
   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
     "sudo journalctl -u tractatus -n 100"
   ```

2. **Common issues:**
   - MongoDB connection failed → Check MongoDB is running: `sudo systemctl status mongod`
   - Port already in use → Check for zombie processes: `sudo lsof -i :9000`
   - Missing environment variables → Check .env file exists
   - Syntax error in code → Rollback immediately

3. **Quick fixes:**
   ```bash
   # Restart MongoDB if stopped
   sudo systemctl start mongod

   # Kill zombie processes (pattern quoted so the shell does not expand it)
   sudo pkill -f 'node.*tractatus'

   # Restart application
   sudo systemctl restart tractatus
   ```

### Database Connection Lost

1. **Verify MongoDB running**
   ```bash
   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
     "sudo systemctl status mongod"
   ```

2. **Check MongoDB logs**
   ```bash
   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
     "sudo journalctl -u mongod -n 50"
   ```

3. **Test connection manually**
   ```bash
   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
     "mongosh --host localhost --port 27017 --authenticationDatabase tractatus_prod -u tractatus_user"
   ```

### High Error Rate

1. **Identify error pattern**
   ```bash
   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
     "sudo journalctl -u tractatus --since '10 minutes ago' | grep ERROR | sort | uniq -c | sort -rn | head -10"
   ```

2. **Check whether all endpoints are affected or only specific routes**
   ```bash
   # Check health endpoint
   curl https://agenticgovernance.digital/health

   # Check specific routes
   curl https://agenticgovernance.digital/api/documents
   ```

3. **Decision:**
   - If isolated to one feature: Disable feature, investigate
   - If site-wide: Rollback immediately

---

## CRITICAL: HTML Caching Rules

**MANDATORY REQUIREMENT**: HTML files MUST be delivered fresh to users without requiring a cache refresh.

### The Problem

The service worker cached HTML files, which caused deployment failures where users saw OLD content even after NEW code was deployed. Users should NEVER need to clear their cache manually.
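
One way to detect a cacheable HTML response from the command line is sketched below. It assumes, as this checklist requires for HTML, that a correct response carries a `Cache-Control` header containing `no-store`. The function name `html_is_uncached` is hypothetical and not part of the project's scripts. The match is case-insensitive because header names arrive lowercase over HTTP/2 but are often capitalized over HTTP/1.1.

```bash
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not part of the project's scripts).
# Reads HTTP response headers on stdin (for example from `curl -s -I URL`).
# Exits 0 if a Cache-Control header containing "no-store" is present,
# 1 otherwise (header missing, or present without no-store).
html_is_uncached() {
  grep -qiE '^cache-control:.*no-store'
}

# Example usage:
#   curl -s -I https://agenticgovernance.digital/koha.html | html_is_uncached \
#     && echo "HTML is not cached ✓" \
#     || echo "WARNING: HTML response may be cached"
```

This checks only the server header. It does not exercise the service worker, so a browser test in a fresh incognito window is still needed.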

### The Solution (Enforced as of 2025-10-17)

**Service Worker** (`public/service-worker.js`):
- HTML files: Network-ONLY strategy (never cache, always fetch fresh)
- Exception: `/index.html` only, as the offline fallback
- Bump the `CACHE_VERSION` constant whenever service worker logic changes

**Server** (`src/server.js`):
- HTML files: `Cache-Control: no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0`
- This ensures browsers never cache HTML pages
- CSS/JS: Long cache OK (use version parameters for cache-busting)

**Version Manifest** (`public/version.json`):
- Update the version number when deploying HTML changes
- The service worker checks this file for updates
- Set `forceUpdate: true` for critical fixes

### Deployment Rules for HTML Changes

When deploying HTML file changes:

1. **Verify service worker never caches HTML** (except index.html)
   ```bash
   grep -A 10 "HTML files:" public/service-worker.js
   # Should show: Network-ONLY strategy, no caching
   ```

2. **Verify server sends no-cache headers**
   ```bash
   grep -A 3 "HTML files:" src/server.js
   # Should show: no-store, no-cache, must-revalidate
   ```

3. **Bump version.json if critical content changed**
   ```bash
   # Edit public/version.json
   # Increment version: 1.1.2 → 1.1.3
   # Update changelog
   # Set forceUpdate: true
   ```

4. **After deployment, verify headers in production**
   ```bash
   curl -s -I https://agenticgovernance.digital/koha.html | grep -i cache-control
   # Expected: no-store, no-cache, must-revalidate

   curl -s https://agenticgovernance.digital/koha.html | grep "<title>"
   # Verify correct content showing
   ```

5. **Test in incognito window**
   - Open https://agenticgovernance.digital in a fresh incognito window
   - Verify new content loads immediately
   - No cache refresh should be needed

### Testing Cache Behavior

**Before deployment:**
```bash
# Local: Verify server sends correct headers
# (-i because header names may be capitalized over HTTP/1.1)
curl -s -I http://localhost:9000/koha.html | grep -i cache-control
# Expected: no-store, no-cache

# Verify service worker doesn't cache HTML
grep -A 10 "endsWith('.html')" public/service-worker.js
# Should NOT cache responses, only fetch
```

**After deployment:**
```bash
# Production: Verify headers
curl -s -I https://agenticgovernance.digital/<file>.html | grep -i cache-control

# Production: Verify fresh content
curl -s https://agenticgovernance.digital/<file>.html | grep "<title>"
```

### Incident Prevention

**Lesson Learned** (2025-10-17 Koha Deployment):
- Deployed koha.html with reciprocal giving updates
- Service worker cached old version
- Users saw old content despite fresh deployment
- Required THREE deployment attempts to fix
- Root cause: Service worker was caching HTML with network-first strategy

**Prevention**:
- Service worker now enforces network-ONLY for all HTML (except offline index.html)
- Server enforces no-cache headers
- This checklist records the requirement as a standing architectural rule

---

## Deployment Best Practices

### DO:

- ✅ Deploy during low-traffic hours (NZ: 10am-2pm NZST = low US traffic)
- ✅ Deploy small, focused changes (easier to debug)
- ✅ Test thoroughly before deploying
- ✅ Monitor logs after deployment
- ✅ Document all deployments
- ✅ Keep rollback procedure tested and ready
- ✅ Communicate with team before major deployments
- ✅ **CRITICAL: Verify HTML cache headers before and after deployment**
- ✅ **CRITICAL: Test in incognito window after HTML deployments**

### DON'T:

- ❌ Deploy on Friday afternoon (limited time to fix issues)
- ❌ Deploy multiple unrelated changes together
- ❌ Skip testing "because it's a small change"
- ❌ Deploy without checking logs afterwards
- ❌ Deploy when tired or rushed
- ❌ Deploy without the ability to roll back
- ❌ Forget to restart services after backend changes
- ❌ **CRITICAL: Never cache HTML files in service worker (except offline fallback)**
- ❌ **CRITICAL: Never ask users to clear their browser cache - fix it server-side**

### Deployment Timing Guidelines

**Best Times** (Low risk):
- Monday-Thursday, 10am-2pm NZST
- After morning coffee, before lunch
- When you have 2+ hours to monitor

**Acceptable Times** (Medium risk):
- Monday-Thursday, 2pm-5pm NZST
- Early morning deployments (if you're alert)

**Times to Avoid** (High risk):
- Friday 3pm+ (weekend coverage issues)
- Late evening (tired, less alert)
- During known high-traffic events
- When about to leave/travel

---

## Automation Opportunities (Future)

### Potential Improvements:

- [ ] Automated testing in CI/CD (GitHub Actions)
- [ ] Automated deployment on merge to main (after tests pass)
- [ ] Automated health checks post-deployment
- [ ] Automated rollback on health check failure
- [ ] Slack notifications for deployments
- [ ] Blue-green deployment for zero-downtime
- [ ] Canary deployments for gradual rollout

### Not Ready Yet Because:

- Need stable test suite (✅ NOW READY - 380 tests passing)
- Need monitoring in place (⏳ Next task - Option D)
- Need error alerting (⏳ Next task - Option D)
- Need staging environment (💡 Future consideration)

---

## Checklist Quick Reference

**Pre-Deploy:**
- [ ] Tests pass
- [ ] Security audit clean
- [ ] No sensitive files
- [ ] .rsyncignore verified

**Deploy:**
- [ ] Choose correct script
- [ ] Review dry-run
- [ ] Execute deployment
- [ ] Note any errors

**Verify:**
- [ ] Service running
- [ ] Health check OK
- [ ] Homepage loads
- [ ] Monitor logs 5-15min

**Document:**
- [ ] Log deployment
- [ ] Note any issues
- [ ] Update team

---

## Contact & Support

**Production Access:**
- SSH: `ubuntu@vps-93a693da.vps.ovh.net`
- Key: `~/.ssh/tractatus_deploy`
- Sudo: Available for systemctl, journalctl

**Service Management:**
- Service: `tractatus.service` (systemd)
- Status: `sudo systemctl status tractatus`
- Logs: `sudo journalctl -u tractatus -f`
- Restart: `sudo systemctl restart tractatus`

**Database:**
- Host: localhost:27017
- Database: `tractatus_prod`
- Auth: tractatus_prod database
- User: `tractatus_user`

**Domain:**
- Production: https://agenticgovernance.digital
- Analytics: https://plausible.io/agenticgovernance.digital

---

**Document Status**: Active Procedure
**Last Updated**: 2025-10-09
**Next Review**: After major deployment or incident
**Maintainer**: Technical Lead (Claude Code + John Stroh)