From 91925d899ce739e53149b15ea3d219d646f18917 Mon Sep 17 00:00:00 2001
From: TheFlow
Date: Thu, 9 Oct 2025 22:19:00 +1300
Subject: [PATCH] docs: create comprehensive production deployment checklist
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add detailed deployment procedure to prevent security incidents and
ensure consistent, safe deployments to production.

Includes:
- Pre-deployment verification (tests, security, sensitive file checks)
- Three deployment methods (frontend, Koha, full project)
- Post-deployment verification (health checks, log monitoring)
- Database migration procedure
- Emergency rollback procedure
- Incident documentation template
- Deployment log template
- Emergency procedures (service failures, DB issues)
- Best practices and timing guidelines

Created after security incident where sensitive Claude Code files
were accidentally deployed. This checklist prevents similar incidents
through:
- Mandatory .rsyncignore verification
- Sensitive file checks before deployment
- Dry-run review before execution
- Post-deployment monitoring

Status: Active procedure for all production deployments

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude
---
 docs/PRODUCTION_DEPLOYMENT_CHECKLIST.md | 676 ++++++++++++++++++++++++
 1 file changed, 676 insertions(+)
 create mode 100644 docs/PRODUCTION_DEPLOYMENT_CHECKLIST.md

diff --git a/docs/PRODUCTION_DEPLOYMENT_CHECKLIST.md b/docs/PRODUCTION_DEPLOYMENT_CHECKLIST.md
new file mode 100644
index 00000000..536bdb46
--- /dev/null
+++ b/docs/PRODUCTION_DEPLOYMENT_CHECKLIST.md
@@ -0,0 +1,676 @@
+# Production Deployment Checklist
+
+**Project**: Tractatus AI Safety Framework Website
+**Environment**: Production (vps-93a693da.vps.ovh.net)
+**Domain**: https://agenticgovernance.digital
+**Created**: 2025-10-09
+**Status**: Active Procedure
+
+---
+
+## Overview
+
+This checklist ensures safe, consistent deployments to production. **Always follow this procedure** to prevent security incidents, service disruptions, and data loss.
+
+**Deployment Philosophy**:
+- Deploy early, deploy often
+- Test thoroughly before deploying
+- Verify after deploying
+- Document incidents and learn
+
+**Incident Prevention**: This checklist was created after a security incident where sensitive Claude Code governance files were accidentally deployed to production. Following this procedure prevents similar incidents.
+
+---
+
+## Pre-Deployment Checklist
+
+### 1. Code Quality Verification
+
+- [ ] **All tests passing locally**
+  ```bash
+  npm test
+  ```
+  - Expected: All tests pass, no failures
+  - If any tests fail: Fix before deploying
+
+- [ ] **Test coverage acceptable**
+  ```bash
+  npm test -- --coverage
+  ```
+  - Check critical services maintain 80%+ coverage
+  - Review new code has reasonable coverage
+
+- [ ] **Linting passes** (if linter configured)
+  ```bash
+  npm run lint
+  # OR
+  npx eslint src/
+  ```
+
+### 2. Security Verification
+
+- [ ] **Run security audit**
+  ```bash
+  npm audit
+  ```
+  - Review all vulnerabilities
+  - Critical/High: Must fix or document why acceptable
+  - Medium/Low: Review and plan fix if needed
+  - If fixes available: `npm audit fix` then re-test
+
+- [ ] **Check for sensitive files in git**
+  ```bash
+  git ls-files | grep -E '(CLAUDE|SESSION|\.env|SECRET|HANDOFF|CLOSEDOWN|_Maintenance_Guide)'
+  ```
+  - Expected: No matches (all sensitive files excluded)
+  - If matches found: Review .gitignore and remove from git history
+
+- [ ] **Verify .rsyncignore completeness**
+  ```bash
+  cat .rsyncignore
+  ```
+  - Confirm excludes:
+    - `CLAUDE*.md`, `SESSION*.md`, maintenance guides
+    - `.env`, `.env.local`, `.env.production.local`
+    - `node_modules/`, `.git/`, `.claude/`
+    - Test files, coverage reports
+    - Development-only files
+
+- [ ] **Check environment secrets not in code**
+  ```bash
+  grep -r "sk-ant-" src/ || echo "No API keys found ✓"
+  grep -r "mongodb://tractatus" src/ || echo "No hardcoded DB URLs ✓"
+  ```
+  - Expected: No hardcoded secrets in source code
+  - All secrets in .env files (which are excluded)
+
+### 3. Database Verification
+
+- [ ] **Database migrations ready** (if any)
+  ```bash
+  # Check if new migrations exist
+  ls -la scripts/migrations/ | tail -5
+  ```
+  - If migrations exist: Plan migration execution
+  - Document migration rollback procedure
+
+- [ ] **Backup current database** (for major changes)
+  ```bash
+  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+    "mongodump --uri='mongodb://tractatus_user:PASSWORD@localhost:27017/tractatus_prod?authSource=tractatus_prod' --out=/tmp/backup-$(date +%Y%m%d-%H%M%S)"
+  ```
+  - Only needed for schema changes or major updates
+  - Store backup location in deployment notes
+
+### 4. Change Documentation
+
+- [ ] **Review what's being deployed**
+  ```bash
+  git log --oneline origin/main..HEAD
+  ```
+  - Confirm all commits are intentional
+  - Verify no work-in-progress commits
+
+- [ ] **Update CHANGELOG.md** (if project uses one)
+  - Document user-facing changes
+  - Document breaking changes
+  - Document security fixes
+
+- [ ] **Commit all changes**
+  ```bash
+  git status
+  # If uncommitted changes exist, decide: commit or stash
+  ```
+
+---
+
+## Deployment Execution
+
+### Choose Deployment Method
+
+**Decision Matrix:**
+
+| What Changed | Script to Use | Command |
+|--------------|---------------|---------|
+| Public HTML/CSS/JS only | `deploy-frontend.sh` | `./scripts/deploy-frontend.sh` |
+| Koha donation system | `deploy-koha-to-production.sh` | `./scripts/deploy-koha-to-production.sh` |
+| Full project (backend, routes, services) | `deploy-full-project-SAFE.sh` | `./scripts/deploy-full-project-SAFE.sh` |
+| Emergency rollback | Manual rsync | See rollback section |
+
+### Option 1: Frontend-Only Deployment
+
+Use when only public-facing files changed (HTML, CSS, JS, images).
+
+```bash
+./scripts/deploy-frontend.sh
+```
+
+**What it deploys:**
+- `public/` directory
+- Excludes: admin, backend code, config files
+
+**Safety level:** ✅ Safest (public files only)
+
+### Option 2: Koha-Specific Deployment
+
+Use when Koha donation system changed.
+
+```bash
+./scripts/deploy-koha-to-production.sh
+```
+
+**What it deploys:**
+- Koha controllers, services, routes
+- Koha frontend (public/koha.html)
+- Related middleware and models
+
+**Safety level:** ⚠️ Moderate (includes backend code)
+
+### Option 3: Full Project Deployment (Most Common)
+
+Use for backend changes, new features, or multi-component updates.
+
+```bash
+./scripts/deploy-full-project-SAFE.sh
+```
+
+**Deployment steps:**
+1. Script shows excluded patterns from .rsyncignore
+2. **Review exclusions carefully** - Verify sensitive files excluded
+3. Script shows dry-run summary
+4. **Verify files to be deployed** - Look for any unexpected files
+5. Confirm deployment (or Ctrl+C to abort)
+6. Script executes rsync with progress
+7. Deployment complete
+
+**What it deploys:**
+- All source code (src/)
+- Public files (public/)
+- Configuration (package.json, etc.)
+- Documentation (docs/)
+- Scripts (scripts/)
+
+**What it excludes** (via .rsyncignore):
+- Claude Code governance files (CLAUDE*.md, SESSION*.md)
+- Environment files (.env*)
+- Node modules (node_modules/)
+- Git repository (.git/)
+- Test files and coverage
+- Development-only files
+
+**Safety level:** ⚠️ Use carefully (full codebase)
+
+### Deployment Verification During Execution
+
+- [ ] **Watch for errors during deployment**
+  - Rsync errors (permission denied, connection failures)
+  - File conflicts
+  - Unexpected file deletions
+
+- [ ] **Verify file count is reasonable**
+  - Frontend: ~50-100 files
+  - Koha: ~20-30 files
+  - Full: ~200-300 files (varies by project size)
+  - If thousands of files: STOP - check .rsyncignore
+
+---
+
+## Post-Deployment Verification
+
+### 1. Immediate Checks (< 2 minutes)
+
+- [ ] **Restart application** (if backend changes)
+  ```bash
+  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+    "sudo systemctl restart tractatus"
+  ```
+
+- [ ] **Check service status**
+  ```bash
+  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+    "sudo systemctl status tractatus"
+  ```
+  - Expected: `active (running)`
+  - If failed: Check logs immediately
+
+- [ ] **Health endpoint check**
+  ```bash
+  curl https://agenticgovernance.digital/health
+  ```
+  - Expected: `{"status":"ok","timestamp":"..."}` (200 OK)
+  - If 500 or error: Check logs, may need rollback
+
+- [ ] **Homepage loads**
+  ```bash
+  curl -I https://agenticgovernance.digital
+  ```
+  - Expected: `HTTP/2 200`
+  - If 404/500: Critical issue, check logs
+
+### 2. Functional Checks (2-5 minutes)
+
+- [ ] **Test primary user flows:**
+  - Visit homepage: https://agenticgovernance.digital
+  - Navigate to Researcher path: https://agenticgovernance.digital/researcher.html
+  - Navigate to Implementer path: https://agenticgovernance.digital/implementer.html
+  - Navigate to Leader path: https://agenticgovernance.digital/leader.html
+  - Visit documentation: https://agenticgovernance.digital/docs.html
+  - Test interactive demo: https://agenticgovernance.digital/demos/27027-demo.html
+
+- [ ] **Test navigation:**
+  - Click navbar dropdown menus
+  - Mobile menu (resize browser or use DevTools)
+  - Footer links work
+
+- [ ] **Test critical features** (based on what changed):
+  - If Koha changed: Test donation flow (test mode)
+  - If admin changed: Test admin login
+  - If governance changed: Test governance API (with admin token)
+  - If documents changed: Test document retrieval
+
+### 3. Log Monitoring (5-15 minutes)
+
+- [ ] **Monitor production logs for errors**
+  ```bash
+  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+    "sudo journalctl -u tractatus -f"
+  ```
+  - Watch for:
+    - ERROR, CRITICAL log levels
+    - Unhandled exceptions
+    - Database connection failures
+    - 500 errors on requests
+  - Monitor for at least 5 minutes
+  - If errors appear: Investigate immediately
+
+- [ ] **Check for new error patterns**
+  ```bash
+  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+    "sudo journalctl -u tractatus --since '5 minutes ago' | grep -i error"
+  ```
+  - Compare to known errors (acceptable warnings)
+  - New errors may indicate deployment issues
+
+### 4. Analytics Check (Optional, 15+ minutes)
+
+- [ ] **Verify Plausible Analytics tracking**
+  - Visit https://plausible.io/agenticgovernance.digital
+  - Confirm events are being tracked
+  - Check for unusual bounce rates or errors
+
+- [ ] **Check Google Search Console** (if configured)
+  - Verify no new crawl errors
+  - Check for 404 increases
+
+---
+
+## Database Migration Procedure (If Needed)
+
+Only required when schema changes or data migrations needed.
+
+### Pre-Migration
+
+- [ ] **Backup database** (already done in pre-deployment)
+- [ ] **Test migration on staging** (if staging environment exists)
+- [ ] **Review migration script**
+  ```bash
+  cat scripts/migrations/YYYYMMDD-description.js
+  ```
+
+### Execute Migration
+
+```bash
+ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+  "cd /var/www/tractatus && node scripts/migrations/YYYYMMDD-description.js"
+```
+
+### Post-Migration
+
+- [ ] **Verify migration success**
+  ```bash
+  # Check migration completed
+  # Check data integrity
+  ```
+
+- [ ] **Test affected features**
+  - Any features using migrated data
+
+### Migration Rollback (If Needed)
+
+- [ ] **Restore database from backup**
+  ```bash
+  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+    "mongorestore --uri='...' /tmp/backup-TIMESTAMP"
+  ```
+
+- [ ] **Rollback code** (see rollback section)
+
+---
+
+## Rollback Procedure
+
+Use if deployment causes critical issues that can't be quickly fixed.
+
+### When to Rollback
+
+- Application won't start
+- Critical features completely broken
+- Security vulnerability introduced
+- Data loss or corruption occurring
+- 500 errors on every request
+
+### How to Rollback
+
+1. **Identify last known good commit**
+   ```bash
+   git log --oneline -10
+   # Find commit before problematic changes
+   ```
+
+2. **Checkout last good commit**
+   ```bash
+   # Placeholder: substitute the hash identified in step 1
+   git checkout <last-good-commit-hash>
+   ```
+
+3. **Redeploy using same script**
+   ```bash
+   # Use same deployment script as original deployment
+   ./scripts/deploy-full-project-SAFE.sh
+   ```
+
+4. **Restart application**
+   ```bash
+   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+     "sudo systemctl restart tractatus"
+   ```
+
+5. **Verify rollback successful**
+   - Check health endpoint
+   - Check homepage loads
+   - Check logs for errors
+
+6. **Return to main branch**
+   ```bash
+   git checkout main
+   ```
+
+### Post-Rollback
+
+- [ ] **Document incident**
+  - What went wrong?
+  - What was the impact?
+  - How was it detected?
+  - How long was it broken?
+  - What was rolled back?
+
+- [ ] **Create incident report** (template below)
+
+- [ ] **Fix issue in development**
+  - Reproduce locally
+  - Fix root cause
+  - Add tests to prevent recurrence
+  - Re-deploy when ready
+
+---
+
+## Incident Documentation Template
+
+Create file: `docs/incidents/YYYY-MM-DD-description.md`
+
+```markdown
+# Incident Report: [Brief Description]
+
+**Date**: YYYY-MM-DD HH:MM (NZST)
+**Severity**: [Critical / High / Medium / Low]
+**Duration**: [X minutes/hours]
+**Detected By**: [User report / Monitoring / Developer]
+
+## Summary
+[1-2 sentence summary of what went wrong]
+
+## Timeline
+- HH:MM - Deployment initiated
+- HH:MM - Issue detected
+- HH:MM - Rollback initiated
+- HH:MM - Service restored
+
+## Root Cause
+[What caused the issue?]
+
+## Impact
+- User-facing impact: [What did users experience?]
+- Data impact: [Was any data lost/corrupted?]
+- Security impact: [Were any security boundaries crossed?]
+
+## Resolution
+[How was it fixed?]
+
+## Prevention
+[What changes prevent this from happening again?]
+
+## Action Items
+- [ ] Fix root cause
+- [ ] Add tests
+- [ ] Update deployment checklist
+- [ ] Update monitoring
+```
+
+---
+
+## Deployment Log Template
+
+Keep a deployment log in: `docs/deployments/YYYY-MM.md`
+
+```markdown
+# Deployments: [Month Year]
+
+## YYYY-MM-DD HH:MM - [Description]
+
+**Deployed By**: [Name]
+**Deployment Type**: [Frontend / Koha / Full]
+**Commits Deployed**:
+- abc123 - Description
+- def456 - Description
+
+**Pre-Deployment Checks**:
+- [x] Tests passing
+- [x] Security audit clean
+- [x] No sensitive files
+
+**Verification**:
+- [x] Health check passed
+- [x] Homepage loads
+- [x] No errors in logs
+
+**Issues**: None
+**Rollback Required**: No
+**Notes**: [Any relevant notes]
+```
+
+---
+
+## Emergency Procedures
+
+### Service Won't Start
+
+1. **Check logs immediately**
+   ```bash
+   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+     "sudo journalctl -u tractatus -n 100"
+   ```
+
+2. **Common issues:**
+   - MongoDB connection failed → Check MongoDB running: `sudo systemctl status mongod`
+   - Port already in use → Check for zombie processes: `sudo lsof -i :9000`
+   - Missing environment variables → Check .env file exists
+   - Syntax error in code → Rollback immediately
+
+3. **Quick fixes:**
+   ```bash
+   # Restart MongoDB if stopped
+   sudo systemctl start mongod
+
+   # Kill zombie processes (pattern quoted so the shell does not glob-expand it)
+   sudo pkill -f 'node.*tractatus'
+
+   # Restart application
+   sudo systemctl restart tractatus
+   ```
+
+### Database Connection Lost
+
+1. **Verify MongoDB running**
+   ```bash
+   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+     "sudo systemctl status mongod"
+   ```
+
+2. **Check MongoDB logs**
+   ```bash
+   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+     "sudo journalctl -u mongod -n 50"
+   ```
+
+3. **Test connection manually**
+   ```bash
+   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+     "mongosh --host localhost --port 27017 --authenticationDatabase tractatus_prod -u tractatus_user"
+   ```
+
+### High Error Rate
+
+1. **Identify error pattern**
+   ```bash
+   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+     "sudo journalctl -u tractatus --since '10 minutes ago' | grep ERROR | sort | uniq -c | sort -rn | head -10"
+   ```
+
+2. **Check if all endpoints affected or specific routes**
+   ```bash
+   # Check health endpoint
+   curl https://agenticgovernance.digital/health
+
+   # Check specific routes
+   curl https://agenticgovernance.digital/api/documents
+   ```
+
+3. **Decision:**
+   - If isolated to one feature: Disable feature, investigate
+   - If site-wide: Rollback immediately
+
+---
+
+## Deployment Best Practices
+
+### DO:
+- ✅ Deploy during low-traffic hours (NZ: 10am-2pm NZST = low US traffic)
+- ✅ Deploy small, focused changes (easier to debug)
+- ✅ Test thoroughly before deploying
+- ✅ Monitor logs after deployment
+- ✅ Document all deployments
+- ✅ Keep rollback procedure tested and ready
+- ✅ Communicate with team before major deployments
+
+### DON'T:
+- ❌ Deploy on Friday afternoon (limited time to fix issues)
+- ❌ Deploy multiple unrelated changes together
+- ❌ Skip testing "because it's a small change"
+- ❌ Deploy without checking logs after
+- ❌ Deploy when tired or rushed
+- ❌ Deploy without ability to rollback
+- ❌ Forget to restart services after backend changes
+
+### Deployment Timing Guidelines
+
+**Best Times** (Low risk):
+- Monday-Thursday, 10am-2pm NZST
+- After morning coffee, before lunch
+- When you have 2+ hours to monitor
+
+**Acceptable Times** (Medium risk):
+- Monday-Thursday, 2pm-5pm NZST
+- Early morning deployments (if you're alert)
+
+**Avoid Times** (High risk):
+- Friday 3pm+ (weekend coverage issues)
+- Late evening (tired, less alert)
+- During known high-traffic events
+- When about to leave/travel
+
+---
+
+## Automation Opportunities (Future)
+
+### Potential Improvements:
+- [ ] Automated testing in CI/CD (GitHub Actions)
+- [ ] Automated deployment on merge to main (after tests pass)
+- [ ] Automated health checks post-deployment
+- [ ] Automated rollback on health check failure
+- [ ] Slack notifications for deployments
+- [ ] Blue-green deployment for zero-downtime
+- [ ] Canary deployments for gradual rollout
+
+### Not Ready Yet Because:
+- Need stable test suite (✅ NOW READY - 380 tests passing)
+- Need monitoring in place (⏳ Next task - Option D)
+- Need error alerting (⏳ Next task - Option D)
+- Need staging environment (💡 Future consideration)
+
+---
+
+## Checklist Quick Reference
+
+**Pre-Deploy:**
+- [ ] Tests pass
+- [ ] Security audit clean
+- [ ] No sensitive files
+- [ ] .rsyncignore verified
+
+**Deploy:**
+- [ ] Choose correct script
+- [ ] Review dry-run
+- [ ] Execute deployment
+- [ ] Note any errors
+
+**Verify:**
+- [ ] Service running
+- [ ] Health check OK
+- [ ] Homepage loads
+- [ ] Monitor logs 5-15min
+
+**Document:**
+- [ ] Log deployment
+- [ ] Note any issues
+- [ ] Update team
+
+---
+
+## Contact & Support
+
+**Production Access:**
+- SSH: `ubuntu@vps-93a693da.vps.ovh.net`
+- Key: `~/.ssh/tractatus_deploy`
+- Sudo: Available for systemctl, journalctl
+
+**Service Management:**
+- Service: `tractatus.service` (systemd)
+- Status: `sudo systemctl status tractatus`
+- Logs: `sudo journalctl -u tractatus -f`
+- Restart: `sudo systemctl restart tractatus`
+
+**Database:**
+- Host: localhost:27017
+- Database: `tractatus_prod`
+- Auth: tractatus_prod database
+- User: `tractatus_user`
+
+**Domain:**
+- Production: https://agenticgovernance.digital
+- Analytics: https://plausible.io/agenticgovernance.digital
+
+---
+
+**Document Status**: Active Procedure
+**Last Updated**: 2025-10-09
+**Next Review**: After major deployment or incident
+**Maintainer**: Technical Lead (Claude Code + John Stroh)
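
Reviewer note: the "Check for sensitive files in git" step in the checklist could be wrapped in a small helper, so the same pattern also guards any other file list (for example, the dry-run listing). This is a minimal sketch, not part of the patch. The function names `find_sensitive_files` and `check_file_list` are hypothetical; only the regular expression is taken from the checklist.

```shell
#!/usr/bin/env bash
# Pattern copied from the checklist's "Check for sensitive files in git" step.
SENSITIVE_PATTERN='(CLAUDE|SESSION|\.env|SECRET|HANDOFF|CLOSEDOWN|_Maintenance_Guide)'

# Hypothetical helper: read paths (one per line) on stdin and print those
# that match the sensitive pattern. Prints nothing when the list is clean.
find_sensitive_files() {
  grep -E "$SENSITIVE_PATTERN" || true
}

# Hypothetical helper: return 0 when the list on stdin is clean,
# 1 (with the offending paths on stderr) when anything matches.
check_file_list() {
  local matches
  matches="$(find_sensitive_files)"
  if [ -n "$matches" ]; then
    printf 'Sensitive files found:\n%s\n' "$matches" >&2
    return 1
  fi
  return 0
}
```

Assuming the functions are sourced into the current shell, a possible use is `git ls-files | check_file_list || echo "Aborting deployment"`, which stops the procedure before any deploy script runs.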