From 91925d899ce739e53149b15ea3d219d646f18917 Mon Sep 17 00:00:00 2001
From: TheFlow <theflow@sydigital.com>
Date: Thu, 9 Oct 2025 22:19:00 +1300
Subject: [PATCH] docs: create comprehensive production deployment checklist
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add detailed deployment procedure to prevent security incidents and
ensure consistent, safe deployments to production.

Includes:
- Pre-deployment verification (tests, security, sensitive file checks)
- Three deployment methods (frontend, Koha, full project)
- Post-deployment verification (health checks, log monitoring)
- Database migration procedure
- Emergency rollback procedure
- Incident documentation template
- Deployment log template
- Emergency procedures (service failures, DB issues)
- Best practices and timing guidelines

Created after security incident where sensitive Claude Code files were
accidentally deployed. This checklist prevents similar incidents through:
- Mandatory .rsyncignore verification
- Sensitive file checks before deployment
- Dry-run review before execution
- Post-deployment monitoring

Status: Active procedure for all production deployments

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
---
 docs/PRODUCTION_DEPLOYMENT_CHECKLIST.md | 676 ++++++++++++++++++++++++
 1 file changed, 676 insertions(+)
 create mode 100644 docs/PRODUCTION_DEPLOYMENT_CHECKLIST.md

diff --git a/docs/PRODUCTION_DEPLOYMENT_CHECKLIST.md b/docs/PRODUCTION_DEPLOYMENT_CHECKLIST.md
new file mode 100644
index 00000000..536bdb46
--- /dev/null
+++ b/docs/PRODUCTION_DEPLOYMENT_CHECKLIST.md
@@ -0,0 +1,676 @@
+# Production Deployment Checklist
+
+**Project**: Tractatus AI Safety Framework Website
+**Environment**: Production (vps-93a693da.vps.ovh.net)
+**Domain**: https://agenticgovernance.digital
+**Created**: 2025-10-09
+**Status**: Active Procedure
+
+---
+
+## Overview
+
+This checklist ensures safe, consistent deployments to production. **Always follow this procedure** to prevent security incidents, service disruptions, and data loss.
+
+**Deployment Philosophy**:
+- Deploy early, deploy often
+- Test thoroughly before deploying
+- Verify after deploying
+- Document incidents and learn
+
+**Incident Prevention**: This checklist was created after a security incident where sensitive Claude Code governance files were accidentally deployed to production. Following this procedure prevents similar incidents.
+
+---
+
+## Pre-Deployment Checklist
+
+### 1. Code Quality Verification
+
+- [ ] **All tests passing locally**
+  ```bash
+  npm test
+  ```
+  - Expected: All tests pass, no failures
+  - If any tests fail: Fix before deploying
+
+- [ ] **Test coverage acceptable**
+  ```bash
+  npm test -- --coverage
+  ```
+  - Check that critical services maintain 80%+ coverage
+  - Confirm that new code has reasonable coverage
+
+- [ ] **Linting passes** (if linter configured)
+  ```bash
+  npm run lint
+  # OR
+  npx eslint src/
+  ```
+
+### 2. Security Verification
+
+- [ ] **Run security audit**
+  ```bash
+  npm audit
+  ```
+  - Review all vulnerabilities
+  - Critical/High: Must fix or document why acceptable
+  - Medium/Low: Review and plan fix if needed
+  - If fixes are available: run `npm audit fix`, then re-test
+
+- [ ] **Check for sensitive files in git**
+  ```bash
+  git ls-files | grep -E '(CLAUDE|SESSION|\.env|SECRET|HANDOFF|CLOSEDOWN|_Maintenance_Guide)'
+  ```
+  - Expected: No matches (all sensitive files excluded)
+  - If matches are found: add them to .gitignore, untrack them with `git rm --cached <file>`, and purge them from git history if they contain secrets
+
+- [ ] **Verify .rsyncignore completeness**
+  ```bash
+  cat .rsyncignore
+  ```
+  - Confirm excludes:
+    - `CLAUDE*.md`, `SESSION*.md`, maintenance guides
+    - `.env`, `.env.local`, `.env.production.local`
+    - `node_modules/`, `.git/`, `.claude/`
+    - Test files, coverage reports
+    - Development-only files
+
+- [ ] **Check environment secrets not in code**
+  ```bash
+  grep -r "sk-ant-" src/ || echo "No API keys found ✓"
+  grep -r "mongodb://tractatus" src/ || echo "No hardcoded DB URLs ✓"
+  ```
+  - Expected: No hardcoded secrets in source code
+  - All secrets in .env files (which are excluded)
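+
+The tracked-file check in this section can be wrapped as a reusable function. A minimal sketch, assuming the same pattern as the `git ls-files` check; the function name `find_sensitive` is hypothetical and not part of the project's scripts:
+
+```shell
+#!/usr/bin/env bash
+# Hypothetical helper: reads file paths on stdin and prints the ones whose
+# names match the sensitive-file pattern used in this checklist.
+# Exit status 0 means nothing sensitive was found, 1 means matches were found.
+find_sensitive() {
+  local pattern='(CLAUDE|SESSION|\.env|SECRET|HANDOFF|CLOSEDOWN|_Maintenance_Guide)'
+  if grep -E "$pattern"; then
+    return 1   # grep printed at least one match
+  fi
+  return 0
+}
+
+# Example usage (run from the repository root):
+#   git ls-files | find_sensitive || echo "Sensitive files are tracked: stop"
+```
+
+Returning a non-zero status on a match lets the function gate a deploy script with `||` or `set -e`.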
+
+### 3. Database Verification
+
+- [ ] **Database migrations ready** (if any)
+  ```bash
+  # Check if new migrations exist
+  ls -la scripts/migrations/ | tail -5
+  ```
+  - If migrations exist: Plan migration execution
+  - Document migration rollback procedure
+
+- [ ] **Backup current database** (for major changes)
+  ```bash
+  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+    "mongodump --uri='mongodb://tractatus_user:PASSWORD@localhost:27017/tractatus_prod?authSource=tractatus_prod' --out=/tmp/backup-$(date +%Y%m%d-%H%M%S)"
+  ```
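+  - Only needed for schema changes or major updates
+  - Replace `PASSWORD` at run time; never commit the real value
+  - Store backup location in deployment notes (files under `/tmp` may be cleared on reboot, so copy the backup elsewhere if it must be kept)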
+
+### 4. Change Documentation
+
+- [ ] **Review what's being deployed**
+  ```bash
+  git log --oneline origin/main..HEAD
+  ```
+  - Confirm all commits are intentional
+  - Verify no work-in-progress commits
+
+- [ ] **Update CHANGELOG.md** (if project uses one)
+  - Document user-facing changes
+  - Document breaking changes
+  - Document security fixes
+
+- [ ] **Commit all changes**
+  ```bash
+  git status
+  # If uncommitted changes exist, decide: commit or stash
+  ```
+
+---
+
+## Deployment Execution
+
+### Choose Deployment Method
+
+**Decision Matrix:**
+
+| What Changed | Script to Use | Command |
+|--------------|---------------|---------|
+| Public HTML/CSS/JS only | `deploy-frontend.sh` | `./scripts/deploy-frontend.sh` |
+| Koha donation system | `deploy-koha-to-production.sh` | `./scripts/deploy-koha-to-production.sh` |
+| Full project (backend, routes, services) | `deploy-full-project-SAFE.sh` | `./scripts/deploy-full-project-SAFE.sh` |
+| Emergency rollback | Manual rsync | See rollback section |
+
+### Option 1: Frontend-Only Deployment
+
+Use when only public-facing files changed (HTML, CSS, JS, images).
+
+```bash
+./scripts/deploy-frontend.sh
+```
+
+**What it deploys:**
+- `public/` directory
+- Excludes: admin, backend code, config files
+
+**Safety level:** ✅ Safest (public files only)
+
+### Option 2: Koha-Specific Deployment
+
+Use when Koha donation system changed.
+
+```bash
+./scripts/deploy-koha-to-production.sh
+```
+
+**What it deploys:**
+- Koha controllers, services, routes
+- Koha frontend (public/koha.html)
+- Related middleware and models
+
+**Safety level:** ⚠️ Moderate (includes backend code)
+
+### Option 3: Full Project Deployment (Most Common)
+
+Use for backend changes, new features, or multi-component updates.
+
+```bash
+./scripts/deploy-full-project-SAFE.sh
+```
+
+**Deployment steps:**
+1. Script shows excluded patterns from .rsyncignore
+2. **Review exclusions carefully** - Verify sensitive files excluded
+3. Script shows dry-run summary
+4. **Verify files to be deployed** - Look for any unexpected files
+5. Confirm deployment (or Ctrl+C to abort)
+6. Script executes rsync with progress
+7. Deployment complete
+
+**What it deploys:**
+- All source code (src/)
+- Public files (public/)
+- Configuration (package.json, etc.)
+- Documentation (docs/)
+- Scripts (scripts/)
+
+**What it excludes** (via .rsyncignore):
+- Claude Code governance files (CLAUDE*.md, SESSION*.md)
+- Environment files (.env*)
+- Node modules (node_modules/)
+- Git repository (.git/)
+- Test files and coverage
+- Development-only files
+
+**Safety level:** ⚠️ Use carefully (full codebase)
+
+### Deployment Verification During Execution
+
+- [ ] **Watch for errors during deployment**
+  - Rsync errors (permission denied, connection failures)
+  - File conflicts
+  - Unexpected file deletions
+
+- [ ] **Verify file count is reasonable**
+  - Frontend: ~50-100 files
+  - Koha: ~20-30 files
+  - Full: ~200-300 files (varies by project size)
+  - If thousands of files: STOP - check .rsyncignore
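+
+The file-count guard above can be automated against a dry-run listing. A minimal sketch: `check_file_count` is a hypothetical helper, the default limit of 1000 is an assumed threshold standing in for "thousands of files", and the rsync invocation in the usage comment is an assumption about how the deploy scripts call rsync:
+
+```shell
+#!/usr/bin/env bash
+# Hypothetical helper: counts non-empty lines on stdin (one path per line)
+# and fails when the count exceeds the limit given as $1 (default 1000).
+check_file_count() {
+  local limit="${1:-1000}"
+  local count
+  count=$(grep -c . || true)   # grep -c prints 0 and exits 1 when nothing matches
+  echo "files to transfer: $count (limit $limit)"
+  [ "$count" -le "$limit" ]
+}
+
+# Example usage with an rsync dry run (-n) that prints one path per line.
+# Directories are listed too, so the count is approximate.
+#   rsync -an --out-format='%n' --exclude-from=.rsyncignore ./ \
+#     ubuntu@vps-93a693da.vps.ovh.net:/var/www/tractatus/ | check_file_count 1000
+```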
+
+---
+
+## Post-Deployment Verification
+
+### 1. Immediate Checks (< 2 minutes)
+
+- [ ] **Restart application** (if backend changes)
+  ```bash
+  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+    "sudo systemctl restart tractatus"
+  ```
+
+- [ ] **Check service status**
+  ```bash
+  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+    "sudo systemctl status tractatus"
+  ```
+  - Expected: `active (running)`
+  - If failed: Check logs immediately
+
+- [ ] **Health endpoint check**
+  ```bash
+  curl -i https://agenticgovernance.digital/health
+  ```
+  - Expected: `{"status":"ok","timestamp":"..."}` (200 OK)
+  - If 500 or error: Check logs, may need rollback
+
+- [ ] **Homepage loads**
+  ```bash
+  curl -I https://agenticgovernance.digital
+  ```
+  - Expected: `HTTP/2 200`
+  - If 404/500: Critical issue, check logs
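+
+Because the service may need a few seconds to come up after a restart, the health check can be wrapped in a retry loop. A minimal sketch; `retry` is a hypothetical helper, and the attempt count and delay in the usage comment are assumed values:
+
+```shell
+#!/usr/bin/env bash
+# Hypothetical helper: retry <attempts> <delay-seconds> <command...>
+# Runs the command until it succeeds or the attempts are used up.
+retry() {
+  local attempts="$1" delay="$2"
+  shift 2
+  local i=1
+  while [ "$i" -le "$attempts" ]; do
+    if "$@"; then
+      return 0
+    fi
+    echo "attempt $i/$attempts failed: $*" >&2
+    if [ "$i" -lt "$attempts" ]; then
+      sleep "$delay"
+    fi
+    i=$((i + 1))
+  done
+  return 1
+}
+
+# Example usage: curl -f turns HTTP errors (4xx/5xx) into a non-zero exit status.
+#   retry 5 3 curl -fsS https://agenticgovernance.digital/health
+```
+
+If all attempts fail, treat it the same as a failed health check: check the logs and consider a rollback.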
+
+### 2. Functional Checks (2-5 minutes)
+
+- [ ] **Test primary user flows:**
+  - Visit homepage: https://agenticgovernance.digital
+  - Navigate to Researcher path: https://agenticgovernance.digital/researcher.html
+  - Navigate to Implementer path: https://agenticgovernance.digital/implementer.html
+  - Navigate to Leader path: https://agenticgovernance.digital/leader.html
+  - Visit documentation: https://agenticgovernance.digital/docs.html
+  - Test interactive demo: https://agenticgovernance.digital/demos/27027-demo.html
+
+- [ ] **Test navigation:**
+  - Click navbar dropdown menus
+  - Mobile menu (resize browser or use DevTools)
+  - Footer links work
+
+- [ ] **Test critical features** (based on what changed):
+  - If Koha changed: Test donation flow (test mode)
+  - If admin changed: Test admin login
+  - If governance changed: Test governance API (with admin token)
+  - If documents changed: Test document retrieval
+
+### 3. Log Monitoring (5-15 minutes)
+
+- [ ] **Monitor production logs for errors**
+  ```bash
+  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+    "sudo journalctl -u tractatus -f"
+  ```
+  - Watch for:
+    - ERROR, CRITICAL log levels
+    - Unhandled exceptions
+    - Database connection failures
+    - 500 errors on requests
+  - Monitor for at least 5 minutes
+  - If errors appear: Investigate immediately
+
+- [ ] **Check for new error patterns**
+  ```bash
+  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+    "sudo journalctl -u tractatus --since '5 minutes ago' | grep -i error"
+  ```
+  - Compare to known errors (acceptable warnings)
+  - New errors may indicate deployment issues
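+
+Comparing against known errors can be done with a pattern file. A minimal sketch, assuming a hypothetical file `known-errors.txt` that holds one fixed string per line for each accepted warning; neither the file nor the `new_errors` helper exists in the project:
+
+```shell
+#!/usr/bin/env bash
+# Hypothetical helper: reads log lines on stdin, keeps lines containing
+# "error" (case-insensitive), and drops any that contain a string listed
+# in the known-errors file given as $1.
+# The file must not contain blank lines: an empty pattern matches every line.
+new_errors() {
+  local known="$1"
+  grep -i 'error' | grep -v -F -f "$known" || true
+}
+
+# Example usage:
+#   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+#     "sudo journalctl -u tractatus --since '5 minutes ago'" | new_errors known-errors.txt
+```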
+
+### 4. Analytics Check (Optional, 15+ minutes)
+
+- [ ] **Verify Plausible Analytics tracking**
+  - Visit https://plausible.io/agenticgovernance.digital
+  - Confirm events are being tracked
+  - Check for unusual bounce rates or errors
+
+- [ ] **Check Google Search Console** (if configured)
+  - Verify no new crawl errors
+  - Check for 404 increases
+
+---
+
+## Database Migration Procedure (If Needed)
+
+Only required when schema changes or data migrations are needed.
+
+### Pre-Migration
+
+- [ ] **Backup database** (already done in pre-deployment)
+- [ ] **Test migration on staging** (if staging environment exists)
+- [ ] **Review migration script**
+  ```bash
+  cat scripts/migrations/YYYYMMDD-description.js
+  ```
+
+### Execute Migration
+
+```bash
+ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+  "cd /var/www/tractatus && node scripts/migrations/YYYYMMDD-description.js"
+```
+
+### Post-Migration
+
+- [ ] **Verify migration success**
+  ```bash
+  # Check migration completed
+  # Check data integrity
+  ```
+
+- [ ] **Test affected features**
+  - Any features using migrated data
+
+### Migration Rollback (If Needed)
+
+- [ ] **Restore database from backup**
+  ```bash
+  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+    "mongorestore --uri='...' /tmp/backup-TIMESTAMP"
+  ```
+
+- [ ] **Rollback code** (see rollback section)
+
+---
+
+## Rollback Procedure
+
+Use if deployment causes critical issues that can't be quickly fixed.
+
+### When to Rollback
+
+- Application won't start
+- Critical features completely broken
+- Security vulnerability introduced
+- Data loss or corruption occurring
+- 500 errors on every request
+
+### How to Rollback
+
+1. **Identify last known good commit**
+   ```bash
+   git log --oneline -10
+   # Find commit before problematic changes
+   ```
+
+2. **Checkout last good commit**
+   ```bash
+   git checkout <commit-hash>
+   ```
+
+3. **Redeploy using same script**
+   ```bash
+   # Use same deployment script as original deployment
+   ./scripts/deploy-full-project-SAFE.sh
+   ```
+
+4. **Restart application**
+   ```bash
+   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+     "sudo systemctl restart tractatus"
+   ```
+
+5. **Verify rollback successful**
+   - Check health endpoint
+   - Check homepage loads
+   - Check logs for errors
+
+6. **Return to main branch**
+   ```bash
+   git checkout main
+   ```
+
+### Post-Rollback
+
+- [ ] **Document incident**
+  - What went wrong?
+  - What was the impact?
+  - How was it detected?
+  - How long was it broken?
+  - What was rolled back?
+
+- [ ] **Create incident report** (template below)
+
+- [ ] **Fix issue in development**
+  - Reproduce locally
+  - Fix root cause
+  - Add tests to prevent recurrence
+  - Re-deploy when ready
+
+---
+
+## Incident Documentation Template
+
+Create file: `docs/incidents/YYYY-MM-DD-description.md`
+
+```markdown
+# Incident Report: [Brief Description]
+
+**Date**: YYYY-MM-DD HH:MM (NZST)
+**Severity**: [Critical / High / Medium / Low]
+**Duration**: [X minutes/hours]
+**Detected By**: [User report / Monitoring / Developer]
+
+## Summary
+[1-2 sentence summary of what went wrong]
+
+## Timeline
+- HH:MM - Deployment initiated
+- HH:MM - Issue detected
+- HH:MM - Rollback initiated
+- HH:MM - Service restored
+
+## Root Cause
+[What caused the issue?]
+
+## Impact
+- User-facing impact: [What did users experience?]
+- Data impact: [Was any data lost/corrupted?]
+- Security impact: [Were any security boundaries crossed?]
+
+## Resolution
+[How was it fixed?]
+
+## Prevention
+[What changes prevent this from happening again?]
+
+## Action Items
+- [ ] Fix root cause
+- [ ] Add tests
+- [ ] Update deployment checklist
+- [ ] Update monitoring
+```
+
+---
+
+## Deployment Log Template
+
+Keep a deployment log in: `docs/deployments/YYYY-MM.md`
+
+```markdown
+# Deployments: [Month Year]
+
+## YYYY-MM-DD HH:MM - [Description]
+
+**Deployed By**: [Name]
+**Deployment Type**: [Frontend / Koha / Full]
+**Commits Deployed**:
+- abc123 - Description
+- def456 - Description
+
+**Pre-Deployment Checks**:
+- [x] Tests passing
+- [x] Security audit clean
+- [x] No sensitive files
+
+**Verification**:
+- [x] Health check passed
+- [x] Homepage loads
+- [x] No errors in logs
+
+**Issues**: None
+**Rollback Required**: No
+**Notes**: [Any relevant notes]
+```
+
+---
+
+## Emergency Procedures
+
+### Service Won't Start
+
+1. **Check logs immediately**
+   ```bash
+   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+     "sudo journalctl -u tractatus -n 100"
+   ```
+
+2. **Common issues:**
+   - MongoDB connection failed → Check MongoDB running: `sudo systemctl status mongod`
+   - Port already in use → Check for stale processes still holding the port: `sudo lsof -i :9000`
+   - Missing environment variables → Check .env file exists
+   - Syntax error in code → Rollback immediately
+
+3. **Quick fixes:**
+   ```bash
+   # Restart MongoDB if stopped
+   sudo systemctl start mongod
+
+   # Kill stale processes (pattern is quoted so the shell cannot glob-expand it)
+   sudo pkill -f 'node.*tractatus'
+
+   # Restart application
+   sudo systemctl restart tractatus
+   ```
+
+### Database Connection Lost
+
+1. **Verify MongoDB running**
+   ```bash
+   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+     "sudo systemctl status mongod"
+   ```
+
+2. **Check MongoDB logs**
+   ```bash
+   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+     "sudo journalctl -u mongod -n 50"
+   ```
+
+3. **Test connection manually**
+   ```bash
+   # -t allocates a terminal so mongosh can prompt for the password
+   ssh -t -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+     "mongosh --host localhost --port 27017 --authenticationDatabase tractatus_prod -u tractatus_user"
+   ```
+
+### High Error Rate
+
+1. **Identify error pattern**
+   ```bash
+   # -o cat drops journal timestamps so identical messages group under uniq -c
+   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
+     "sudo journalctl -u tractatus --since '10 minutes ago' -o cat | grep ERROR | sort | uniq -c | sort -rn | head -10"
+   ```
+
+2. **Check whether all endpoints are affected or only specific routes**
+   ```bash
+   # Check health endpoint
+   curl https://agenticgovernance.digital/health
+
+   # Check specific routes
+   curl https://agenticgovernance.digital/api/documents
+   ```
+
+3. **Decision:**
+   - If isolated to one feature: Disable feature, investigate
+   - If site-wide: Rollback immediately
+
+---
+
+## Deployment Best Practices
+
+### DO:
+- ✅ Deploy during low-traffic hours (10am-2pm NZST, i.e. 22:00-02:00 UTC, when US traffic is expected to be low)
+- ✅ Deploy small, focused changes (easier to debug)
+- ✅ Test thoroughly before deploying
+- ✅ Monitor logs after deployment
+- ✅ Document all deployments
+- ✅ Keep rollback procedure tested and ready
+- ✅ Communicate with team before major deployments
+
+### DON'T:
+- ❌ Deploy on Friday afternoon (limited time to fix issues)
+- ❌ Deploy multiple unrelated changes together
+- ❌ Skip testing "because it's a small change"
+- ❌ Deploy without checking logs afterwards
+- ❌ Deploy when tired or rushed
+- ❌ Deploy without ability to rollback
+- ❌ Forget to restart services after backend changes
+
+### Deployment Timing Guidelines
+
+**Best Times** (Low risk):
+- Monday-Thursday, 10am-2pm NZST
+- After morning coffee, before lunch
+- When you have 2+ hours to monitor
+
+**Acceptable Times** (Medium risk):
+- Monday-Thursday, 2pm-5pm NZST
+- Early morning deployments (if you're alert)
+
+**Avoid Times** (High risk):
+- Friday 3pm+ (weekend coverage issues)
+- Late evening (tired, less alert)
+- During known high-traffic events
+- When about to leave/travel
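+
+The timing windows above can be encoded as a quick pre-deploy check. A minimal sketch: `deploy_risk` is a hypothetical helper, the 20:00 cutoff for "late evening" is an assumption, and times the guidelines do not pin to specific hours (early mornings, weekends, Friday before 3pm) are reported as `unclassified` rather than guessed:
+
+```shell
+#!/usr/bin/env bash
+# Hypothetical helper: deploy_risk <day-of-week 1-7, Monday=1> <hour 0-23>
+# Prints low, medium, high, or unclassified for New Zealand local time.
+deploy_risk() {
+  local dow="$1" hour="$2"
+  hour=${hour#0}      # strip a leading zero such as "09" from `date +%H`
+  hour=${hour:-0}     # "0" becomes empty after stripping; restore it
+  if [ "$dow" -eq 5 ] && [ "$hour" -ge 15 ]; then echo high; return; fi
+  if [ "$hour" -ge 20 ]; then echo high; return; fi   # assumed late-evening cutoff
+  if [ "$dow" -ge 1 ] && [ "$dow" -le 4 ]; then
+    if [ "$hour" -ge 10 ] && [ "$hour" -lt 14 ]; then echo low; return; fi
+    if [ "$hour" -ge 14 ] && [ "$hour" -lt 17 ]; then echo medium; return; fi
+  fi
+  echo unclassified
+}
+
+# Example usage with the current New Zealand local time:
+#   deploy_risk "$(TZ=Pacific/Auckland date +%u)" "$(TZ=Pacific/Auckland date +%H)"
+```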
+
+---
+
+## Automation Opportunities (Future)
+
+### Potential Improvements:
+- [ ] Automated testing in CI/CD (GitHub Actions)
+- [ ] Automated deployment on merge to main (after tests pass)
+- [ ] Automated health checks post-deployment
+- [ ] Automated rollback on health check failure
+- [ ] Slack notifications for deployments
+- [ ] Blue-green deployment for zero-downtime
+- [ ] Canary deployments for gradual rollout
+
+### Not Ready Yet Because:
+- Need stable test suite (✅ NOW READY - 380 tests passing)
+- Need monitoring in place (⏳ Next task - Option D)
+- Need error alerting (⏳ Next task - Option D)
+- Need staging environment (💡 Future consideration)
+
+---
+
+## Checklist Quick Reference
+
+**Pre-Deploy:**
+- [ ] Tests pass
+- [ ] Security audit clean
+- [ ] No sensitive files
+- [ ] .rsyncignore verified
+
+**Deploy:**
+- [ ] Choose correct script
+- [ ] Review dry-run
+- [ ] Execute deployment
+- [ ] Note any errors
+
+**Verify:**
+- [ ] Service running
+- [ ] Health check OK
+- [ ] Homepage loads
+- [ ] Monitor logs 5-15min
+
+**Document:**
+- [ ] Log deployment
+- [ ] Note any issues
+- [ ] Update team
+
+---
+
+## Contact & Support
+
+**Production Access:**
+- SSH: `ubuntu@vps-93a693da.vps.ovh.net`
+- Key: `~/.ssh/tractatus_deploy`
+- Sudo: Available for systemctl, journalctl
+
+**Service Management:**
+- Service: `tractatus.service` (systemd)
+- Status: `sudo systemctl status tractatus`
+- Logs: `sudo journalctl -u tractatus -f`
+- Restart: `sudo systemctl restart tractatus`
+
+**Database:**
+- Host: localhost:27017
+- Database: `tractatus_prod`
+- Auth: tractatus_prod database
+- User: `tractatus_user`
+
+**Domain:**
+- Production: https://agenticgovernance.digital
+- Analytics: https://plausible.io/agenticgovernance.digital
+
+---
+
+**Document Status**: Active Procedure
+**Last Updated**: 2025-10-09
+**Next Review**: After major deployment or incident
+**Maintainer**: Technical Lead (Claude Code + John Stroh)