docs: create comprehensive production deployment checklist

Add detailed deployment procedure to prevent security incidents and
ensure consistent, safe deployments to production.

Includes:
- Pre-deployment verification (tests, security, sensitive file checks)
- Three deployment methods (frontend, Koha, full project)
- Post-deployment verification (health checks, log monitoring)
- Database migration procedure
- Emergency rollback procedure
- Incident documentation template
- Deployment log template
- Emergency procedures (service failures, DB issues)
- Best practices and timing guidelines

Created after security incident where sensitive Claude Code files were
accidentally deployed. This checklist prevents similar incidents through:
- Mandatory .rsyncignore verification
- Sensitive file checks before deployment
- Dry-run review before execution
- Post-deployment monitoring

Status: Active procedure for all production deployments

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
TheFlow 2025-10-09 22:19:00 +13:00
parent 20875e41fd
commit 91925d899c


@@ -0,0 +1,676 @@
# Production Deployment Checklist
**Project**: Tractatus AI Safety Framework Website
**Environment**: Production (vps-93a693da.vps.ovh.net)
**Domain**: https://agenticgovernance.digital
**Created**: 2025-10-09
**Status**: Active Procedure
---
## Overview
This checklist ensures safe, consistent deployments to production. **Always follow this procedure** to prevent security incidents, service disruptions, and data loss.
**Deployment Philosophy**:
- Deploy early, deploy often
- Test thoroughly before deploying
- Verify after deploying
- Document incidents and learn
**Incident Prevention**: This checklist was created after a security incident where sensitive Claude Code governance files were accidentally deployed to production. Following this procedure prevents similar incidents.
---
## Pre-Deployment Checklist
### 1. Code Quality Verification
- [ ] **All tests passing locally**
```bash
npm test
```
- Expected: All tests pass, no failures
- If any tests fail: Fix before deploying
- [ ] **Test coverage acceptable**
```bash
npm test -- --coverage
```
- Check critical services maintain 80%+ coverage
- Confirm new code has reasonable coverage
- [ ] **Linting passes** (if linter configured)
```bash
npm run lint
# OR
npx eslint src/
```
### 2. Security Verification
- [ ] **Run security audit**
```bash
npm audit
```
- Review all vulnerabilities
- Critical/High: Must fix or document why acceptable
- Medium/Low: Review and plan fix if needed
- If fixes available: `npm audit fix` then re-test
- [ ] **Check for sensitive files in git**
```bash
git ls-files | grep -E '(CLAUDE|SESSION|\.env|SECRET|HANDOFF|CLOSEDOWN|_Maintenance_Guide)'
```
- Expected: No matches (all sensitive files excluded)
- If matches found: Review .gitignore and remove from git history
- [ ] **Verify .rsyncignore completeness**
```bash
cat .rsyncignore
```
- Confirm excludes:
- `CLAUDE*.md`, `SESSION*.md`, maintenance guides
- `.env`, `.env.local`, `.env.production.local`
- `node_modules/`, `.git/`, `.claude/`
- Test files, coverage reports
- Development-only files
- [ ] **Check environment secrets not in code**
```bash
grep -r "sk-ant-" src/ || echo "No API keys found ✓"
grep -r "mongodb://tractatus" src/ || echo "No hardcoded DB URLs ✓"
```
- Expected: No hardcoded secrets in source code
- All secrets in .env files (which are excluded)
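The `.rsyncignore` and sensitive-file checks above can be scripted so they fail loudly instead of relying on a visual scan. The following is a minimal sketch, not part of the project's scripts: the function names and the `REQUIRED_PATTERNS` list are assumptions for illustration, and it assumes each pattern appears in `.rsyncignore` on its own line exactly as written in this checklist.

```bash
# Sketch of a pre-deployment gate (names here are hypothetical).

# Patterns this checklist expects .rsyncignore to contain, one per line.
REQUIRED_PATTERNS='CLAUDE*.md
SESSION*.md
.env
.env.local
.env.production.local
node_modules/
.git/
.claude/'

# Print every required pattern that is missing from the given ignore file.
missing_rsync_patterns() {
  local ignore_file="$1"
  printf '%s\n' "$REQUIRED_PATTERNS" | while IFS= read -r pattern; do
    # -x: match the whole line, -F: treat the pattern as a literal string
    grep -qxF -- "$pattern" "$ignore_file" || printf '%s\n' "$pattern"
  done
}

# Filter a list of paths on stdin down to those that look sensitive,
# using the same expression as the git ls-files check.
sensitive_paths() {
  grep -E '(CLAUDE|SESSION|\.env|SECRET|HANDOFF|CLOSEDOWN|_Maintenance_Guide)' || true
}
```

One way to use it: run `missing_rsync_patterns .rsyncignore` and `git ls-files | sensitive_paths`, and abort the deployment if either prints anything.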
### 3. Database Verification
- [ ] **Database migrations ready** (if any)
```bash
# Check if new migrations exist
ls -la scripts/migrations/ | tail -5
```
- If migrations exist: Plan migration execution
- Document migration rollback procedure
- [ ] **Back up current database** (for major changes)
```bash
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
"mongodump --uri='mongodb://tractatus_user:PASSWORD@localhost:27017/tractatus_prod?authSource=tractatus_prod' --out=/tmp/backup-$(date +%Y%m%d-%H%M%S)"
```
- Only needed for schema changes or major updates
- Store backup location in deployment notes
### 4. Change Documentation
- [ ] **Review what's being deployed**
```bash
git log --oneline origin/main..HEAD
```
- Confirm all commits are intentional
- Verify no work-in-progress commits
- [ ] **Update CHANGELOG.md** (if project uses one)
- Document user-facing changes
- Document breaking changes
- Document security fixes
- [ ] **Commit all changes**
```bash
git status
# If uncommitted changes exist, decide: commit or stash
```
---
## Deployment Execution
### Choose Deployment Method
**Decision Matrix:**
| What Changed | Script to Use | Command |
|--------------|---------------|---------|
| Public HTML/CSS/JS only | `deploy-frontend.sh` | `./scripts/deploy-frontend.sh` |
| Koha donation system | `deploy-koha-to-production.sh` | `./scripts/deploy-koha-to-production.sh` |
| Full project (backend, routes, services) | `deploy-full-project-SAFE.sh` | `./scripts/deploy-full-project-SAFE.sh` |
| Emergency rollback | Manual rsync | See rollback section |
### Option 1: Frontend-Only Deployment
Use when only public-facing files changed (HTML, CSS, JS, images).
```bash
./scripts/deploy-frontend.sh
```
**What it deploys:**
- `public/` directory
- Excludes: admin, backend code, config files
**Safety level:** ✅ Safest (public files only)
### Option 2: Koha-Specific Deployment
Use when the Koha donation system changed.
```bash
./scripts/deploy-koha-to-production.sh
```
**What it deploys:**
- Koha controllers, services, routes
- Koha frontend (public/koha.html)
- Related middleware and models
**Safety level:** ⚠️ Moderate (includes backend code)
### Option 3: Full Project Deployment (Most Common)
Use for backend changes, new features, or multi-component updates.
```bash
./scripts/deploy-full-project-SAFE.sh
```
**Deployment steps:**
1. Script shows excluded patterns from .rsyncignore
2. **Review exclusions carefully** - Verify sensitive files excluded
3. Script shows dry-run summary
4. **Verify files to be deployed** - Look for any unexpected files
5. Confirm deployment (or Ctrl+C to abort)
6. Script executes rsync with progress
7. Deployment complete
**What it deploys:**
- All source code (src/)
- Public files (public/)
- Configuration (package.json, etc.)
- Documentation (docs/)
- Scripts (scripts/)
**What it excludes** (via .rsyncignore):
- Claude Code governance files (CLAUDE*.md, SESSION*.md)
- Environment files (.env*)
- Node modules (node_modules/)
- Git repository (.git/)
- Test files and coverage
- Development-only files
**Safety level:** ⚠️ Use carefully (full codebase)
### Deployment Verification During Execution
- [ ] **Watch for errors during deployment**
- Rsync errors (permission denied, connection failures)
- File conflicts
- Unexpected file deletions
- [ ] **Verify file count is reasonable**
- Frontend: ~50-100 files
- Koha: ~20-30 files
- Full: ~200-300 files (varies by project size)
- If thousands of files: STOP - check .rsyncignore
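The file-count check can be turned into a guard that reads a dry-run file list and refuses to continue above a limit. This is a sketch only: `check_file_count` and the default limit of 300 are assumptions, and the rsync flags in the comment are a guess at what the deploy scripts run, not a copy of them.

```bash
# Sketch: stop when a dry run would transfer more files than expected.
# check_file_count is a hypothetical helper, not an existing project script.
check_file_count() {
  local limit="${1:-300}"
  local count
  # Count non-empty lines on stdin (one transferred path per line).
  count="$(grep -c . || true)"
  if [ "$count" -gt "$limit" ]; then
    echo "STOP: $count files exceeds limit of $limit; check .rsyncignore" >&2
    return 1
  fi
  echo "OK: $count files"
}

# Example (flags and destination path are assumptions about the deploy script):
#   rsync -a --dry-run --exclude-from=.rsyncignore --out-format='%n' ./ \
#     ubuntu@vps-93a693da.vps.ovh.net:/var/www/tractatus/ \
#     | grep -v '/$' | check_file_count 300
```

The `grep -v '/$'` step in the example drops directory entries so that only files are counted.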
---
## Post-Deployment Verification
### 1. Immediate Checks (< 2 minutes)
- [ ] **Restart application** (if backend changes)
```bash
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
"sudo systemctl restart tractatus"
```
- [ ] **Check service status**
```bash
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
"sudo systemctl status tractatus"
```
- Expected: `active (running)`
- If failed: Check logs immediately
- [ ] **Health endpoint check**
```bash
# -i prints the response headers so the status code is visible
curl -i https://agenticgovernance.digital/health
```
- Expected: `{"status":"ok","timestamp":"..."}` (200 OK)
- If 500 or error: Check logs, may need rollback
- [ ] **Homepage loads**
```bash
curl -I https://agenticgovernance.digital
```
- Expected: `HTTP/2 200`
- If 404/500: Critical issue, check logs
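The health check can also be automated so that both the status code and the response body are verified. A minimal sketch follows, assuming the `/health` body contains `"status":"ok"` as shown above; `health_ok` and `check_health` are hypothetical names.

```bash
# Sketch: healthy only if the status code is 200 and the body reports "ok".
health_ok() {
  local code="$1"
  local body="$2"
  [ "$code" = "200" ] || return 1
  printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"ok"'
}

# Fetch a URL and evaluate it with health_ok.
check_health() {
  local url="$1"
  local body_file code body
  body_file="$(mktemp)"
  # -o saves the body to a file; -w prints only the HTTP status code.
  code="$(curl -sS --max-time 10 -o "$body_file" -w '%{http_code}' "$url")" || code="000"
  body="$(cat "$body_file")"
  rm -f "$body_file"
  health_ok "$code" "$body"
}
```

For example, `check_health https://agenticgovernance.digital/health || echo "Health check failed"` gives a single pass/fail signal after a restart.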
### 2. Functional Checks (2-5 minutes)
- [ ] **Test primary user flows:**
- Visit homepage: https://agenticgovernance.digital
- Navigate to Researcher path: https://agenticgovernance.digital/researcher.html
- Navigate to Implementer path: https://agenticgovernance.digital/implementer.html
- Navigate to Leader path: https://agenticgovernance.digital/leader.html
- Visit documentation: https://agenticgovernance.digital/docs.html
- Test interactive demo: https://agenticgovernance.digital/demos/27027-demo.html
- [ ] **Test navigation:**
- Click navbar dropdown menus
- Mobile menu (resize browser or use DevTools)
- Footer links work
- [ ] **Test critical features** (based on what changed):
- If Koha changed: Test donation flow (test mode)
- If admin changed: Test admin login
- If governance changed: Test governance API (with admin token)
- If documents changed: Test document retrieval
### 3. Log Monitoring (5-15 minutes)
- [ ] **Monitor production logs for errors**
```bash
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
"sudo journalctl -u tractatus -f"
```
- Watch for:
- ERROR, CRITICAL log levels
- Unhandled exceptions
- Database connection failures
- 500 errors on requests
- Monitor for at least 5 minutes
- If errors appear: Investigate immediately
- [ ] **Check for new error patterns**
```bash
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
"sudo journalctl -u tractatus --since '5 minutes ago' | grep -i error"
```
- Compare to known errors (acceptable warnings)
- New errors may indicate deployment issues
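Comparing against known errors is easier with a small filter that hides lines matching an allow-list. The sketch below is an assumption about how that could be done: `new_errors` and the known-patterns file are hypothetical, and the file is expected to hold one pattern per line with no blank lines (a blank line would match everything).

```bash
# Sketch: print error lines that match none of the known-acceptable patterns.
new_errors() {
  local known_file="$1"
  if [ -s "$known_file" ]; then
    # -v -f drops every line matching any pattern listed in the file
    grep -i 'error' | grep -v -i -f "$known_file" || true
  else
    # No allow-list yet: show every error line
    grep -i 'error' || true
  fi
}
```

It reads log text on stdin, so the output of the `journalctl ... --since '5 minutes ago'` command can be piped into `new_errors known-errors.txt` (a hypothetical file name).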
### 4. Analytics Check (Optional, 15+ minutes)
- [ ] **Verify Plausible Analytics tracking**
- Visit https://plausible.io/agenticgovernance.digital
- Confirm events are being tracked
- Check for unusual bounce rates or errors
- [ ] **Check Google Search Console** (if configured)
- Verify no new crawl errors
- Check for 404 increases
---
## Database Migration Procedure (If Needed)
Only required when schema changes or data migrations are needed.
### Pre-Migration
- [ ] **Back up database** (already done in pre-deployment)
- [ ] **Test migration on staging** (if staging environment exists)
- [ ] **Review migration script**
```bash
cat scripts/migrations/YYYYMMDD-description.js
```
### Execute Migration
```bash
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
"cd /var/www/tractatus && node scripts/migrations/YYYYMMDD-description.js"
```
### Post-Migration
- [ ] **Verify migration success**
```bash
# Check migration completed
# Check data integrity
```
- [ ] **Test affected features**
- Any features using migrated data
### Migration Rollback (If Needed)
- [ ] **Restore database from backup**
```bash
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
"mongorestore --uri='...' /tmp/backup-TIMESTAMP"
```
- [ ] **Roll back code** (see rollback section)
---
## Rollback Procedure
Use if deployment causes critical issues that can't be quickly fixed.
### When to Roll Back
- Application won't start
- Critical features completely broken
- Security vulnerability introduced
- Data loss or corruption occurring
- 500 errors on every request
### How to Roll Back
1. **Identify last known good commit**
```bash
git log --oneline -10
# Find commit before problematic changes
```
2. **Checkout last good commit**
```bash
git checkout <commit-hash>
```
3. **Redeploy using same script**
```bash
# Use same deployment script as original deployment
./scripts/deploy-full-project-SAFE.sh
```
4. **Restart application**
```bash
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
"sudo systemctl restart tractatus"
```
5. **Verify rollback successful**
- Check health endpoint
- Check homepage loads
- Check logs for errors
6. **Return to main branch**
```bash
git checkout main
```
### Post-Rollback
- [ ] **Document incident**
- What went wrong?
- What was the impact?
- How was it detected?
- How long was it broken?
- What was rolled back?
- [ ] **Create incident report** (template below)
- [ ] **Fix issue in development**
- Reproduce locally
- Fix root cause
- Add tests to prevent recurrence
- Re-deploy when ready
---
## Incident Documentation Template
Create file: `docs/incidents/YYYY-MM-DD-description.md`
```markdown
# Incident Report: [Brief Description]
**Date**: YYYY-MM-DD HH:MM (NZST)
**Severity**: [Critical / High / Medium / Low]
**Duration**: [X minutes/hours]
**Detected By**: [User report / Monitoring / Developer]
## Summary
[1-2 sentence summary of what went wrong]
## Timeline
- HH:MM - Deployment initiated
- HH:MM - Issue detected
- HH:MM - Rollback initiated
- HH:MM - Service restored
## Root Cause
[What caused the issue?]
## Impact
- User-facing impact: [What did users experience?]
- Data impact: [Was any data lost/corrupted?]
- Security impact: [Were any security boundaries crossed?]
## Resolution
[How was it fixed?]
## Prevention
[What changes prevent this from happening again?]
## Action Items
- [ ] Fix root cause
- [ ] Add tests
- [ ] Update deployment checklist
- [ ] Update monitoring
```
---
## Deployment Log Template
Keep a deployment log in: `docs/deployments/YYYY-MM.md`
```markdown
# Deployments: [Month Year]
## YYYY-MM-DD HH:MM - [Description]
**Deployed By**: [Name]
**Deployment Type**: [Frontend / Koha / Full]
**Commits Deployed**:
- abc123 - Description
- def456 - Description
**Pre-Deployment Checks**:
- [x] Tests passing
- [x] Security audit clean
- [x] No sensitive files
**Verification**:
- [x] Health check passed
- [x] Homepage loads
- [x] No errors in logs
**Issues**: None
**Rollback Required**: No
**Notes**: [Any relevant notes]
```
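Starting a log entry can be scripted so the file name and heading always follow the template. This is a sketch under assumptions: `deploy_log_path` and `deploy_entry_header` are hypothetical helpers, and they only produce the first three lines of an entry.

```bash
# Sketch: derive the monthly log file and the entry header from a date.

# 2025-10-09 -> docs/deployments/2025-10.md
deploy_log_path() {
  printf 'docs/deployments/%s.md\n' "${1%-*}"
}

# Args: date (YYYY-MM-DD), time (HH:MM), deployer, type, description
deploy_entry_header() {
  printf '## %s %s - %s\n' "$1" "$2" "$5"
  printf '**Deployed By**: %s\n' "$3"
  printf '**Deployment Type**: %s\n' "$4"
}
```

Redirecting the output of `deploy_entry_header` with `>>` to the path printed by `deploy_log_path` appends a new entry, which can then be completed by hand.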
---
## Emergency Procedures
### Service Won't Start
1. **Check logs immediately**
```bash
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
"sudo journalctl -u tractatus -n 100"
```
2. **Common issues:**
- MongoDB connection failed → Check MongoDB running: `sudo systemctl status mongod`
- Port already in use → Check for stale processes still holding the port: `sudo lsof -i :9000`
- Missing environment variables → Check .env file exists
- Syntax error in code → Roll back immediately
3. **Quick fixes:**
```bash
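# Run these on the production server (after SSH login)
# Restart MongoDB if stopped
sudo systemctl start mongod
# Kill stale Node processes (pattern is quoted so the shell cannot glob-expand it)
sudo pkill -f 'node.*tractatus'
# Restart application
sudo systemctl restart tractatus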
```
### Database Connection Lost
1. **Verify MongoDB running**
```bash
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
"sudo systemctl status mongod"
```
2. **Check MongoDB logs**
```bash
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
"sudo journalctl -u mongod -n 50"
```
3. **Test connection manually**
```bash
# -t allocates a terminal so mongosh can prompt for the password
ssh -t -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
"mongosh --host localhost --port 27017 --authenticationDatabase tractatus_prod -u tractatus_user"
```
### High Error Rate
1. **Identify error pattern**
```bash
# -o cat drops the journal timestamp prefix so identical messages group under uniq -c
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
"sudo journalctl -u tractatus --since '10 minutes ago' -o cat | grep ERROR | sort | uniq -c | sort -rn | head -10"
```
2. **Check if all endpoints affected or specific routes**
```bash
# Check health endpoint
curl https://agenticgovernance.digital/health
# Check specific routes
curl https://agenticgovernance.digital/api/documents
```
3. **Decision:**
- If isolated to one feature: Disable feature, investigate
- If site-wide: Roll back immediately
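When working from logs saved in the default journalctl format, each line starts with a timestamp, host, and process prefix, so identical messages do not collapse under `uniq -c` until that prefix is removed. The sketch below shows one way to do that; `top_errors` is a hypothetical name, and the prefix pattern assumes the `Mon DD HH:MM:SS host unit[pid]: message` layout.

```bash
# Sketch: group error lines by message, ignoring the journal prefix.
top_errors() {
  grep 'ERROR' \
    | sed -E 's/^[^]]*\]: //' \
    | sort | uniq -c | sort -rn | head -10
}
```

A single dominant message near the top usually points at one failing feature, while many unrelated messages suggest a site-wide problem, which maps onto the decision above.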
---
## Deployment Best Practices
### DO:
- ✅ Deploy during low-traffic hours (NZ: 10am-2pm NZST = low US traffic)
- ✅ Deploy small, focused changes (easier to debug)
- ✅ Test thoroughly before deploying
- ✅ Monitor logs after deployment
- ✅ Document all deployments
- ✅ Keep rollback procedure tested and ready
- ✅ Communicate with team before major deployments
### DON'T:
- ❌ Deploy on Friday afternoon (limited time to fix issues)
- ❌ Deploy multiple unrelated changes together
- ❌ Skip testing "because it's a small change"
- ❌ Deploy without checking logs after
- ❌ Deploy when tired or rushed
- ❌ Deploy without the ability to roll back
- ❌ Forget to restart services after backend changes
### Deployment Timing Guidelines
**Best Times** (Low risk):
- Monday-Thursday, 10am-2pm NZST
- After morning coffee, before lunch
- When you have 2+ hours to monitor
**Acceptable Times** (Medium risk):
- Monday-Thursday, 2pm-5pm NZST
- Early morning deployments (if you're alert)
**Avoid Times** (High risk):
- Friday 3pm+ (weekend coverage issues)
- Late evening (tired, less alert)
- During known high-traffic events
- When about to leave/travel
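The time windows above can be encoded as a small check to run before deploying. This is only a sketch: `deploy_risk` is a hypothetical helper, it covers just the windows stated with explicit days and hours, and anything the guidelines leave open (weekends, Friday before 3pm, early morning, late evening) is reported as `unclassified` rather than guessed.

```bash
# Sketch: classify a deployment slot by day of week and hour (NZ local time).
# dow: 1=Mon .. 7=Sun (as printed by date +%u); hour: 0-23
deploy_risk() {
  local dow="$1"
  local hour="${2#0}"      # strip one leading zero, e.g. "09" -> "9"
  hour="${hour:-0}"        # "0" became empty above; restore it
  if [ "$dow" -eq 5 ] && [ "$hour" -ge 15 ]; then
    echo high
  elif [ "$dow" -ge 1 ] && [ "$dow" -le 4 ] && [ "$hour" -ge 10 ] && [ "$hour" -lt 14 ]; then
    echo low
  elif [ "$dow" -ge 1 ] && [ "$dow" -le 4 ] && [ "$hour" -ge 14 ] && [ "$hour" -lt 17 ]; then
    echo medium
  else
    echo unclassified
  fi
}
```

For example, `deploy_risk "$(TZ=Pacific/Auckland date +%u)" "$(TZ=Pacific/Auckland date +%H)"` evaluates the current moment in New Zealand time regardless of the machine's own time zone.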
---
## Automation Opportunities (Future)
### Potential Improvements:
- [ ] Automated testing in CI/CD (GitHub Actions)
- [ ] Automated deployment on merge to main (after tests pass)
- [ ] Automated health checks post-deployment
- [ ] Automated rollback on health check failure
- [ ] Slack notifications for deployments
- [ ] Blue-green deployment for zero-downtime
- [ ] Canary deployments for gradual rollout
### Not Ready Yet Because:
- Need stable test suite (✅ NOW READY - 380 tests passing)
- Need monitoring in place (⏳ Next task - Option D)
- Need error alerting (⏳ Next task - Option D)
- Need staging environment (💡 Future consideration)
---
## Checklist Quick Reference
**Pre-Deploy:**
- [ ] Tests pass
- [ ] Security audit clean
- [ ] No sensitive files
- [ ] .rsyncignore verified
**Deploy:**
- [ ] Choose correct script
- [ ] Review dry-run
- [ ] Execute deployment
- [ ] Note any errors
**Verify:**
- [ ] Service running
- [ ] Health check OK
- [ ] Homepage loads
- [ ] Monitor logs 5-15min
**Document:**
- [ ] Log deployment
- [ ] Note any issues
- [ ] Update team
---
## Contact & Support
**Production Access:**
- SSH: `ubuntu@vps-93a693da.vps.ovh.net`
- Key: `~/.ssh/tractatus_deploy`
- Sudo: Available for systemctl, journalctl
**Service Management:**
- Service: `tractatus.service` (systemd)
- Status: `sudo systemctl status tractatus`
- Logs: `sudo journalctl -u tractatus -f`
- Restart: `sudo systemctl restart tractatus`
**Database:**
- Host: localhost:27017
- Database: `tractatus_prod`
- Auth: tractatus_prod database
- User: `tractatus_user`
**Domain:**
- Production: https://agenticgovernance.digital
- Analytics: https://plausible.io/agenticgovernance.digital
---
**Document Status**: Active Procedure
**Last Updated**: 2025-10-09
**Next Review**: After major deployment or incident
**Maintainer**: Technical Lead (Claude Code + John Stroh)