TheFlow 91925d899c docs: create comprehensive production deployment checklist

Add detailed deployment procedure to prevent security incidents and
ensure consistent, safe deployments to production.

Includes:
- Pre-deployment verification (tests, security, sensitive file checks)
- Three deployment methods (frontend, Koha, full project)
- Post-deployment verification (health checks, log monitoring)
- Database migration procedure
- Emergency rollback procedure
- Incident documentation template
- Deployment log template
- Emergency procedures (service failures, DB issues)
- Best practices and timing guidelines

Created after security incident where sensitive Claude Code files were
accidentally deployed. This checklist prevents similar incidents through:
- Mandatory .rsyncignore verification
- Sensitive file checks before deployment
- Dry-run review before execution
- Post-deployment monitoring

Status: Active procedure for all production deployments

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-10-09 22:19:00 +13:00


Production Deployment Checklist

Project: Tractatus AI Safety Framework Website
Environment: Production (vps-93a693da.vps.ovh.net)
Domain: https://agenticgovernance.digital
Created: 2025-10-09
Status: Active Procedure

Overview

This checklist ensures safe, consistent deployments to production. Always follow this procedure to prevent security incidents, service disruptions, and data loss.

Deployment Philosophy:

- Deploy early, deploy often
- Test thoroughly before deploying
- Verify after deploying
- Document incidents and learn

Incident Prevention: This checklist was created after a security incident where sensitive Claude Code governance files were accidentally deployed to production. Following this procedure prevents similar incidents.

Pre-Deployment Checklist

1. Code Quality Verification

All tests passing locally
```
npm test
```
- Expected: All tests pass, no failures
- If any tests fail: Fix before deploying
Test coverage acceptable
```
npm test -- --coverage
```
- Check critical services maintain 80%+ coverage
- Review new code has reasonable coverage
Linting passes (if linter configured)
```
npm run lint
# OR
npx eslint src/
```

2. Security Verification

Run security audit
```
npm audit
```
- Review all vulnerabilities
- Critical/High: Must fix or document why acceptable
- Medium/Low: Review and plan fix if needed
- If fixes are available: run npm audit fix, then re-test
Check for sensitive files in git
```
git ls-files | grep -E '(CLAUDE|SESSION|\.env|SECRET|HANDOFF|CLOSEDOWN|_Maintenance_Guide)'
```
- Expected: No matches (all sensitive files excluded)
- If matches found: Review .gitignore and remove from git history
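The grep above succeeds when it finds matches, which is the opposite of what a gating script needs. A minimal sketch of the same check as a gate, assuming it is run from the repository root; the helper name `check_sensitive_paths` is hypothetical and the pattern is copied from the command above:

```shell
# Pattern copied from the checklist's grep command.
SENSITIVE_PATTERN='(CLAUDE|SESSION|\.env|SECRET|HANDOFF|CLOSEDOWN|_Maintenance_Guide)'

# Hypothetical helper: reads file paths on stdin, lists any that match the
# pattern on stderr, and exits non-zero so a calling script can abort.
check_sensitive_paths() (
  matches=$(grep -E "$SENSITIVE_PATTERN" || true)
  if [ -n "$matches" ]; then
    printf 'Sensitive files tracked by git:\n%s\n' "$matches" >&2
    exit 1
  fi
  echo "No sensitive files tracked"
)
```

For example, `git ls-files | check_sensitive_paths || exit 1` would stop a pre-deployment script as soon as a sensitive file is tracked.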
Verify .rsyncignore completeness
```
cat .rsyncignore
```
- Confirm excludes:
  - CLAUDE*.md, SESSION*.md, maintenance guides
  - .env, .env.local, .env.production.local
  - node_modules/, .git/, .claude/
  - Test files, coverage reports
  - Development-only files
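Reading .rsyncignore by eye is easy to get wrong under time pressure. The sketch below checks for a handful of the required entries mechanically. The helper name `check_rsyncignore` is hypothetical, and the exact spelling of each pattern is an assumption: the real file may use broader globs such as `.env*`, so adjust the list to match it.

```shell
# Hypothetical helper: check_rsyncignore FILE
# Prints each required exclude that is missing from FILE and exits non-zero
# if any is missing. Patterns are matched as whole lines, literally.
check_rsyncignore() (
  file=$1
  status=0
  # Assumed spellings of the entries listed in the checklist.
  for pattern in 'CLAUDE*.md' 'SESSION*.md' '.env' 'node_modules/' '.git/' '.claude/'; do
    if ! grep -qxF -- "$pattern" "$file"; then
      echo "MISSING: $pattern"
      status=1
    fi
  done
  exit "$status"
)
```

A pre-deployment script could call `check_rsyncignore .rsyncignore || exit 1` before any rsync runs.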

Check environment secrets not in code

grep -r "sk-ant-" src/ || echo "No API keys found ✓"
grep -r "mongodb://tractatus" src/ || echo "No hardcoded DB URLs ✓"

- Expected: No hardcoded secrets in source code
- All secrets in .env files (which are excluded)
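The two grep commands above print matches but still exit successfully in a script, because of the `|| echo` fallback. A minimal sketch that turns the same two patterns into a failing check; the helper name `scan_for_secrets` is hypothetical, and the pattern list only covers the two strings named above:

```shell
# Hypothetical helper: scan_for_secrets DIR
# Searches DIR recursively for the two secret markers used in the checklist.
# Prints every match (file:line:text) and exits non-zero if any were found.
scan_for_secrets() (
  dir=$1
  found=0
  for pattern in 'sk-ant-' 'mongodb://tractatus'; do
    # -F treats the pattern as a literal string; -n adds line numbers.
    if grep -rnF -- "$pattern" "$dir"; then
      found=1
    fi
  done
  if [ "$found" -eq 0 ]; then
    echo "No hardcoded secrets found in $dir"
  fi
  exit "$found"
)
```

Usage would be `scan_for_secrets src/ || exit 1`.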

3. Database Verification

Database migrations ready (if any)
```
# Check if new migrations exist
ls -la scripts/migrations/ | tail -5
```
- If migrations exist: Plan migration execution
- Document migration rollback procedure

Backup current database (for major changes)

ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
  "mongodump --uri='mongodb://tractatus_user:PASSWORD@localhost:27017/tractatus_prod?authSource=tractatus_prod' --out=/tmp/backup-$(date +%Y%m%d-%H%M%S)"

- Only needed for schema changes or major updates
- Store backup location in deployment notes

4. Change Documentation

Review what's being deployed
```
git log --oneline origin/main..HEAD
```
- Confirm all commits are intentional
- Verify no work-in-progress commits
Update CHANGELOG.md (if project uses one)
- Document user-facing changes
- Document breaking changes
- Document security fixes

Commit all changes

git status
# If uncommitted changes exist, decide: commit or stash
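The "no work-in-progress commits" check above can be partly automated by filtering the `git log --oneline` output. This is a sketch only: the helper name `find_wip_commits` is hypothetical, and the list of markers (WIP, fixup!, squash!, "do not merge") is an assumption about how unfinished commits are labelled in this project.

```shell
# Hypothetical helper: reads `git log --oneline` output on stdin and prints
# the commits whose subject contains an assumed "unfinished work" marker.
# Markers must stand alone as a word, so "swipe" does not match "wip".
find_wip_commits() (
  grep -iE '(^| )(wip|fixup!|squash!|do not merge)( |:|$)' || true
)
```

For example, `git log --oneline origin/main..HEAD | find_wip_commits` prints nothing when the outgoing commits look clean.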

Deployment Execution

Choose Deployment Method

Decision Matrix:

| What Changed | Script to Use | Command |
|---|---|---|
| Public HTML/CSS/JS only | `deploy-frontend.sh` | `./scripts/deploy-frontend.sh` |
| Koha donation system | `deploy-koha-to-production.sh` | `./scripts/deploy-koha-to-production.sh` |
| Full project (backend, routes, services) | `deploy-full-project-SAFE.sh` | `./scripts/deploy-full-project-SAFE.sh` |
| Emergency rollback | Manual rsync | See rollback section |
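The decision matrix can be approximated from the list of changed files. The sketch below is an illustration under stated assumptions, not part of the existing scripts: the helper name `choose_deploy_script` is hypothetical, "frontend" is taken to mean every changed path is under public/, and "Koha" is taken to mean every changed path contains "koha" in its name.

```shell
# Hypothetical helper: reads changed file paths on stdin (one per line) and
# prints the deployment script the decision matrix suggests.
choose_deploy_script() (
  paths=$(cat)
  if [ -z "$paths" ]; then
    echo "no changed files given" >&2
    exit 1
  fi
  # grep -v succeeds if at least one line does NOT match, so "! grep -qv"
  # means "every line matches".
  if ! printf '%s\n' "$paths" | grep -qv '^public/'; then
    echo ./scripts/deploy-frontend.sh
  elif ! printf '%s\n' "$paths" | grep -qiv 'koha'; then
    echo ./scripts/deploy-koha-to-production.sh
  else
    echo ./scripts/deploy-full-project-SAFE.sh
  fi
)
```

A possible invocation is `git diff --name-only origin/main..HEAD | choose_deploy_script`. Note that a change touching only public/koha.html is classified as frontend here; treat the output as a suggestion and confirm it against the table.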

Option 1: Frontend-Only Deployment

Use when only public-facing files changed (HTML, CSS, JS, images).

./scripts/deploy-frontend.sh

What it deploys:

- public/ directory
- Excludes: admin, backend code, config files

Safety level: ✅ Safest (public files only)

Option 2: Koha-Specific Deployment

Use when Koha donation system changed.

./scripts/deploy-koha-to-production.sh

What it deploys:

- Koha controllers, services, routes
- Koha frontend (public/koha.html)
- Related middleware and models

Safety level: ⚠️ Moderate (includes backend code)

Option 3: Full Project Deployment (Most Common)

Use for backend changes, new features, or multi-component updates.

./scripts/deploy-full-project-SAFE.sh

Deployment steps:

1. Script shows excluded patterns from .rsyncignore
2. Review exclusions carefully - Verify sensitive files excluded
3. Script shows dry-run summary
4. Verify files to be deployed - Look for any unexpected files
5. Confirm deployment (or Ctrl+C to abort)
6. Script executes rsync with progress
7. Deployment complete

What it deploys:

- All source code (src/)
- Public files (public/)
- Configuration (package.json, etc.)
- Documentation (docs/)
- Scripts (scripts/)

What it excludes (via .rsyncignore):

- Claude Code governance files (CLAUDE*.md, SESSION*.md)
- Environment files (.env*)
- Node modules (node_modules/)
- Git repository (.git/)
- Test files and coverage
- Development-only files

Safety level: ⚠️ Use carefully (full codebase)

Deployment Verification During Execution

Watch for errors during deployment
- Rsync errors (permission denied, connection failures)
- File conflicts
- Unexpected file deletions
Verify file count is reasonable
- Frontend: ~50-100 files
- Koha: ~20-30 files
- Full: ~200-300 files (varies by project size)
- If thousands of files: STOP - check .rsyncignore
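The file-count guidance above can be expressed as a simple threshold check. This is a sketch: the helper name `check_file_count` is hypothetical, the upper bounds are the approximate figures from the list above, and the cut-off of 1000 for "thousands of files" is an assumption.

```shell
# Hypothetical helper: check_file_count TYPE COUNT
# TYPE is frontend, koha, or full; COUNT is the number of files rsync reports.
# Exits 1 when the count suggests .rsyncignore is not being applied.
check_file_count() (
  kind=$1
  count=$2
  case $kind in
    frontend) max=100 ;;
    koha)     max=30 ;;
    full)     max=300 ;;
    *) echo "unknown deployment type: $kind" >&2; exit 2 ;;
  esac
  if [ "$count" -ge 1000 ]; then
    echo "STOP: $count files, check .rsyncignore"
    exit 1
  elif [ "$count" -gt "$max" ]; then
    echo "WARN: $count files exceeds the usual ~$max for $kind"
  else
    echo "OK: $count files"
  fi
)
```

The count itself would come from the dry-run summary the deployment script prints before asking for confirmation.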

Post-Deployment Verification

1. Immediate Checks (< 2 minutes)

Restart application (if backend changes)

ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
  "sudo systemctl restart tractatus"

Check service status

ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
  "sudo systemctl status tractatus"

- Expected: active (running)
- If failed: Check logs immediately

Health endpoint check
```
curl https://agenticgovernance.digital/health
```
- Expected: {"status":"ok","timestamp":"..."} (200 OK)
- If 500 or error: Check logs, may need rollback
Homepage loads
```
curl -I https://agenticgovernance.digital
```
- Expected: HTTP/2 200
- If 404/500: Critical issue, check logs
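The health check above relies on reading the curl output by eye. A minimal sketch of the pass/fail decision as a function; the helper name `evaluate_health` is hypothetical, and it assumes the response body contains the literal text `"status":"ok"` with no extra whitespace, as in the expected output shown above.

```shell
# Hypothetical helper: evaluate_health HTTP_CODE BODY
# Succeeds only for a 200 response whose body reports status ok.
evaluate_health() (
  code=$1
  body=$2
  if [ "$code" != "200" ]; then
    echo "FAIL: HTTP $code"
    exit 1
  fi
  case $body in
    *'"status":"ok"'*) echo "OK" ;;
    *) echo "FAIL: unexpected body"; exit 1 ;;
  esac
)

# Usage against production (not run here):
#   code=$(curl -sS -o /tmp/health.json -w '%{http_code}' https://agenticgovernance.digital/health)
#   evaluate_health "$code" "$(cat /tmp/health.json)"
```

Separating the decision from the curl call keeps the check easy to test without touching production.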

2. Functional Checks (2-5 minutes)

Test primary user flows:
- Visit homepage: https://agenticgovernance.digital
- Navigate to Researcher path: https://agenticgovernance.digital/researcher.html
- Navigate to Implementer path: https://agenticgovernance.digital/implementer.html
- Navigate to Leader path: https://agenticgovernance.digital/leader.html
- Visit documentation: https://agenticgovernance.digital/docs.html
- Test interactive demo: https://agenticgovernance.digital/demos/27027-demo.html
Test navigation:
- Click navbar dropdown menus
- Mobile menu (resize browser or use DevTools)
- Footer links work
Test critical features (based on what changed):
- If Koha changed: Test donation flow (test mode)
- If admin changed: Test admin login
- If governance changed: Test governance API (with admin token)
- If documents changed: Test document retrieval

3. Log Monitoring (5-15 minutes)

Monitor production logs for errors
```
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
  "sudo journalctl -u tractatus -f"
```
- Watch for:
  - ERROR, CRITICAL log levels
  - Unhandled exceptions
  - Database connection failures
  - 500 errors on requests
- Monitor for at least 5 minutes
- If errors appear: Investigate immediately

Check for new error patterns

ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
  "sudo journalctl -u tractatus --since '5 minutes ago' | grep -i error"

- Compare to known errors (acceptable warnings)
- New errors may indicate deployment issues
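Comparing against known errors is easier with a file of accepted patterns. The sketch below assumes such a file exists, one literal substring per line; both the file and the helper name `new_errors` are hypothetical.

```shell
# Hypothetical helper: new_errors KNOWN_FILE
# Reads log lines on stdin, keeps those mentioning "error" (any case), and
# drops lines containing any literal substring listed in KNOWN_FILE.
new_errors() (
  known_file=$1
  errors=$(grep -i 'error' || true)
  [ -n "$errors" ] || exit 0
  if [ -s "$known_file" ]; then
    # -F: literal substrings, -f: read them from the file, -v: invert.
    printf '%s\n' "$errors" | grep -vFf "$known_file" || true
  else
    printf '%s\n' "$errors"
  fi
)
```

The journalctl command above could be piped into it, for example `... | new_errors docs/known-errors.txt`, where the path is only an illustration.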

4. Analytics Check (Optional, 15+ minutes)

Verify Plausible Analytics tracking
- Visit https://plausible.io/agenticgovernance.digital
- Confirm events are being tracked
- Check for unusual bounce rates or errors
Check Google Search Console (if configured)
- Verify no new crawl errors
- Check for 404 increases

Database Migration Procedure (If Needed)

Only required when schema changes or data migrations are needed.

Pre-Migration

Backup database (already done in pre-deployment)
Test migration on staging (if staging environment exists)

Review migration script

cat scripts/migrations/YYYYMMDD-description.js

Execute Migration

ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
  "cd /var/www/tractatus && node scripts/migrations/YYYYMMDD-description.js"

Post-Migration

Verify migration success

# Check migration completed
# Check data integrity

Test affected features
- Any features using migrated data

Migration Rollback (If Needed)

Restore database from backup

ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
  "mongorestore --uri='...' /tmp/backup-TIMESTAMP"

Rollback code (see rollback section)

Rollback Procedure

Use if deployment causes critical issues that can't be quickly fixed.

When to Rollback

- Application won't start
- Critical features completely broken
- Security vulnerability introduced
- Data loss or corruption occurring
- 500 errors on every request

How to Rollback

Identify last known good commit

git log --oneline -10
# Find commit before problematic changes

Checkout last good commit
```
git checkout <commit-hash>
```

Redeploy using same script

# Use same deployment script as original deployment
./scripts/deploy-full-project-SAFE.sh

Restart application

ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
  "sudo systemctl restart tractatus"

Verify rollback successful
- Check health endpoint
- Check homepage loads
- Check logs for errors
Return to main branch
```
git checkout main
```
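During an incident it helps to have the rollback steps printed in order before running anything. The sketch below only prints the commands from this section for a given commit and executes nothing; the helper name `rollback_plan` is hypothetical, and the health check is included as one example of the verification step.

```shell
# Hypothetical helper: rollback_plan COMMIT [DEPLOY_SCRIPT]
# Prints the rollback commands in order for review. Nothing is executed.
rollback_plan() (
  commit=$1
  script=${2:-./scripts/deploy-full-project-SAFE.sh}
  # Accept only an abbreviated or full hex commit hash.
  if ! printf '%s' "$commit" | grep -Eq '^[0-9a-f]{7,40}$'; then
    echo "not a commit hash: $commit" >&2
    exit 1
  fi
  cat <<EOF
git checkout $commit
$script
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net "sudo systemctl restart tractatus"
curl https://agenticgovernance.digital/health
git checkout main
EOF
)
```

Pass the script used for the original deployment as the second argument if it was not the full-project script.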

Post-Rollback

Document incident
- What went wrong?
- What was the impact?
- How was it detected?
- How long was it broken?
- What was rolled back?
Create incident report (template below)
Fix issue in development
- Reproduce locally
- Fix root cause
- Add tests to prevent recurrence
- Re-deploy when ready

Incident Documentation Template

Create file: docs/incidents/YYYY-MM-DD-description.md

# Incident Report: [Brief Description]

**Date**: YYYY-MM-DD HH:MM (NZST)
**Severity**: [Critical / High / Medium / Low]
**Duration**: [X minutes/hours]
**Detected By**: [User report / Monitoring / Developer]

## Summary
[1-2 sentence summary of what went wrong]

## Timeline
- HH:MM - Deployment initiated
- HH:MM - Issue detected
- HH:MM - Rollback initiated
- HH:MM - Service restored

## Root Cause
[What caused the issue?]

## Impact
- User-facing impact: [What did users experience?]
- Data impact: [Was any data lost/corrupted?]
- Security impact: [Were any security boundaries crossed?]

## Resolution
[How was it fixed?]

## Prevention
[What changes prevent this from happening again?]

## Action Items
- [ ] Fix root cause
- [ ] Add tests
- [ ] Update deployment checklist
- [ ] Update monitoring

Deployment Log Template

Keep a deployment log in: docs/deployments/YYYY-MM.md

# Deployments: [Month Year]

## YYYY-MM-DD HH:MM - [Description]

**Deployed By**: [Name]
**Deployment Type**: [Frontend / Koha / Full]
**Commits Deployed**:
- abc123 - Description
- def456 - Description

**Pre-Deployment Checks**:
- [x] Tests passing
- [x] Security audit clean
- [x] No sensitive files

**Verification**:
- [x] Health check passed
- [x] Homepage loads
- [x] No errors in logs

**Issues**: None
**Rollback Required**: No
**Notes**: [Any relevant notes]
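The monthly log file name follows directly from the deployment date. A small sketch of that mapping; the helper name `deployment_log_path` is hypothetical.

```shell
# Hypothetical helper: deployment_log_path YYYY-MM-DD
# Prints the monthly deployment log file for the given date.
deployment_log_path() (
  date_str=$1
  case $date_str in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) echo "expected YYYY-MM-DD, got: $date_str" >&2; exit 1 ;;
  esac
  # Strip the trailing "-DD" to leave YYYY-MM.
  echo "docs/deployments/${date_str%-*}.md"
)
```

For example, `deployment_log_path "$(date +%Y-%m-%d)"` gives the file to append today's entry to.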

Emergency Procedures

Service Won't Start

Check logs immediately

ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
  "sudo journalctl -u tractatus -n 100"

Common issues:
- MongoDB connection failed → Check MongoDB running: sudo systemctl status mongod
- Port already in use → Check for stale processes still holding the port: sudo lsof -i :9000
- Missing environment variables → Check .env file exists
- Syntax error in code → Rollback immediately

Quick fixes:

# Restart MongoDB if stopped
sudo systemctl start mongod

# Kill stale Node processes (pattern quoted so the shell cannot glob-expand it)
sudo pkill -f 'node.*tractatus'

# Restart application
sudo systemctl restart tractatus

Database Connection Lost

Verify MongoDB running

ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
  "sudo systemctl status mongod"

Check MongoDB logs

ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
  "sudo journalctl -u mongod -n 50"

Test connection manually

ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
  "mongosh --host localhost --port 27017 --authenticationDatabase tractatus_prod -u tractatus_user"

High Error Rate

Identify error pattern

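# -o cat prints only the message text, so identical errors group together in uniq -c
# (assuming the application does not embed its own timestamps in each message)
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
  "sudo journalctl -u tractatus --since '10 minutes ago' -o cat | grep ERROR | sort | uniq -c | sort -rn | head -10"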

Check if all endpoints affected or specific routes

# Check health endpoint
curl https://agenticgovernance.digital/health

# Check specific routes
curl https://agenticgovernance.digital/api/documents

Decision:
- If isolated to one feature: Disable feature, investigate
- If site-wide: Rollback immediately

Deployment Best Practices

DO:

✅ Deploy during low-traffic hours (NZ: 10am-2pm NZST = low US traffic)
✅ Deploy small, focused changes (easier to debug)
✅ Test thoroughly before deploying
✅ Monitor logs after deployment
✅ Document all deployments
✅ Keep rollback procedure tested and ready
✅ Communicate with team before major deployments

DON'T:

❌ Deploy on Friday afternoon (limited time to fix issues)
❌ Deploy multiple unrelated changes together
❌ Skip testing "because it's a small change"
❌ Deploy without checking logs after
❌ Deploy when tired or rushed
❌ Deploy without ability to rollback
❌ Forget to restart services after backend changes

Deployment Timing Guidelines

Best Times (Low risk):

Monday-Thursday, 10am-2pm NZST
After morning coffee, before lunch
When you have 2+ hours to monitor

Acceptable Times (Medium risk):

Monday-Thursday, 2pm-5pm NZST
Early morning deployments (if you're alert)

Avoid Times (High risk):

Friday 3pm+ (weekend coverage issues)
Late evening (tired, less alert)
During known high-traffic events
When about to leave/travel
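The timing windows above can be encoded as a small lookup, for instance to print a warning at the start of a deployment script. This is a sketch with assumptions: the helper name `deploy_risk` is hypothetical, "early morning" is taken to mean 06:00-10:00, and any time the guidelines do not mention is conservatively treated as high risk.

```shell
# Hypothetical helper: deploy_risk DAY HOUR
# DAY is 1 (Monday) to 7 (Sunday); HOUR is 0-23 in NZ local time.
# Prints low, medium, or high.
deploy_risk() (
  day=$1
  hour=$2
  if [ "$day" -ge 1 ] && [ "$day" -le 4 ]; then
    # Monday-Thursday, 10am-2pm: best window.
    if [ "$hour" -ge 10 ] && [ "$hour" -lt 14 ]; then echo low; exit 0; fi
    # Monday-Thursday, 2pm-5pm: acceptable.
    if [ "$hour" -ge 14 ] && [ "$hour" -lt 17 ]; then echo medium; exit 0; fi
    # Early morning (assumed 06:00-10:00): acceptable if alert.
    if [ "$hour" -ge 6 ] && [ "$hour" -lt 10 ]; then echo medium; exit 0; fi
  fi
  # Friday 3pm+, late evening, and every unlisted time (assumption).
  echo high
)
```

For example, `deploy_risk 2 11` describes a Tuesday at 11am, and `deploy_risk 5 16` a Friday at 4pm.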

Automation Opportunities (Future)

Potential Improvements:

Automated testing in CI/CD (GitHub Actions)
Automated deployment on merge to main (after tests pass)
Automated health checks post-deployment
Automated rollback on health check failure
Slack notifications for deployments
Blue-green deployment for zero-downtime
Canary deployments for gradual rollout

Not Ready Yet Because:

Need stable test suite (✅ NOW READY - 380 tests passing)
Need monitoring in place (⏳ Next task - Option D)
Need error alerting (⏳ Next task - Option D)
Need staging environment (💡 Future consideration)

Checklist Quick Reference

Pre-Deploy:

Tests pass
Security audit clean
No sensitive files
.rsyncignore verified

Deploy:

Choose correct script
Review dry-run
Execute deployment
Note any errors

Verify:

Service running
Health check OK
Homepage loads
Monitor logs 5-15min

Document:

Log deployment
Note any issues
Update team

Contact & Support

Production Access:

SSH: ubuntu@vps-93a693da.vps.ovh.net
Key: ~/.ssh/tractatus_deploy
Sudo: Available for systemctl, journalctl

Service Management:

Service: tractatus.service (systemd)
Status: sudo systemctl status tractatus
Logs: sudo journalctl -u tractatus -f
Restart: sudo systemctl restart tractatus

Database:

Host: localhost:27017
Database: tractatus_prod
Auth: tractatus_prod database
User: tractatus_user

Domain:

Production: https://agenticgovernance.digital
Analytics: https://plausible.io/agenticgovernance.digital

Document Status: Active Procedure
Last Updated: 2025-10-09
Next Review: After major deployment or incident
Maintainer: Technical Lead (Claude Code + John Stroh)
