# Production Deployment Checklist

**Project**: Tractatus AI Safety Framework Website
**Environment**: Production (vps-93a693da.vps.ovh.net)
**Domain**: https://agenticgovernance.digital
**Created**: 2025-10-09
**Status**: Active Procedure

---

## Overview

This checklist ensures safe, consistent deployments to production. **Always follow this procedure** to prevent security incidents, service disruptions, and data loss.

**Deployment Philosophy**:
- Deploy early, deploy often
- Test thoroughly before deploying
- Verify after deploying
- Document incidents and learn

**Incident Prevention**: This checklist was created after a security incident in which sensitive Claude Code governance files were accidentally deployed to production. Following this procedure is intended to prevent similar incidents.

---

## Pre-Deployment Checklist

### 1. Code Quality Verification

- [ ] **All tests passing locally**
  ```bash
  npm test
  ```
  - Expected: All tests pass, no failures
  - If any tests fail: Fix before deploying

- [ ] **Test coverage acceptable**
  ```bash
  npm test -- --coverage
  ```
  - Check critical services maintain 80%+ coverage
  - Confirm that new code has reasonable coverage

- [ ] **Linting passes** (if linter configured)
  ```bash
  npm run lint
  # OR
  npx eslint src/
  ```

### 2. Security Verification

- [ ] **Run security audit**
  ```bash
  npm audit
  ```
  - Review all vulnerabilities
  - Critical/High: Must fix or document why acceptable
  - Medium/Low: Review and plan fix if needed
  - If fixes available: `npm audit fix` then re-test

- [ ] **Check for sensitive files in git**
  ```bash
  git ls-files | grep -E '(CLAUDE|SESSION|\.env|SECRET|HANDOFF|CLOSEDOWN|_Maintenance_Guide)'
  ```
  - Expected: No matches (all sensitive files excluded)
  - If matches found: Review .gitignore and remove from git history

- [ ] **Verify .rsyncignore completeness**
  ```bash
  cat .rsyncignore
  ```
  - Confirm excludes:
    - `CLAUDE*.md`, `SESSION*.md`, maintenance guides
    - `.env`, `.env.local`, `.env.production.local`
    - `node_modules/`, `.git/`, `.claude/`
    - Test files, coverage reports
    - Development-only files

- [ ] **Check environment secrets not in code**
  ```bash
  grep -r "sk-ant-" src/ || echo "No API keys found ✓"
  grep -r "mongodb://tractatus" src/ || echo "No hardcoded DB URLs ✓"
  ```
  - Expected: No hardcoded secrets in source code
  - All secrets in .env files (which are excluded)
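
The checks in this section can be combined into a small preflight script that fails loudly instead of relying on a visual scan. The following is a minimal sketch, not the project's actual tooling: all three function names are hypothetical, and `check_rsyncignore` assumes each required pattern appears on its own line in `.rsyncignore`, written exactly as passed in.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helpers; names and structure are illustrative only.

# Reads file paths on stdin, prints any sensitive-looking ones,
# and returns 1 if at least one was found.
find_sensitive_paths() {
  if grep -E '(CLAUDE|SESSION|\.env|SECRET|HANDOFF|CLOSEDOWN|_Maintenance_Guide)'; then
    return 1
  fi
  return 0
}

# Prints each required pattern that is missing from the ignore file;
# returns 1 if any are missing. Assumes one pattern per line.
check_rsyncignore() {
  local file="$1"
  shift
  local missing=0
  local pattern
  for pattern in "$@"; do
    if ! grep -qxF -- "$pattern" "$file"; then
      echo "MISSING: $pattern"
      missing=1
    fi
  done
  return "$missing"
}

# Prints matching lines and returns 1 if the directory contains either of
# the secret markers this checklist greps for.
scan_for_secrets() {
  if grep -rnE 'sk-ant-|mongodb://tractatus' "$1"; then
    return 1
  fi
  return 0
}

# Example usage from the project root:
#   git ls-files | find_sensitive_paths || echo "Sensitive files are tracked in git"
#   check_rsyncignore .rsyncignore 'CLAUDE*.md' 'SESSION*.md' '.env' 'node_modules/' '.git/' '.claude/'
#   scan_for_secrets src/ || echo "Hardcoded secrets found"
```

Unlike the `grep ... || echo` one-liners, each helper returns a non-zero status on a finding, so a wrapper script can abort the deployment.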

### 3. Database Verification

- [ ] **Database migrations ready** (if any)
  ```bash
  # Check if new migrations exist
  ls -la scripts/migrations/ | tail -5
  ```
  - If migrations exist: Plan migration execution
  - Document migration rollback procedure

- [ ] **Back up current database** (for major changes)
  ```bash
  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
    "mongodump --uri='mongodb://tractatus_user:PASSWORD@localhost:27017/tractatus_prod?authSource=tractatus_prod' --out=/tmp/backup-$(date +%Y%m%d-%H%M%S)"
  ```
  - Only needed for schema changes or major updates
  - Store backup location in deployment notes

### 4. Change Documentation

- [ ] **Review what's being deployed**
  ```bash
  git log --oneline origin/main..HEAD
  ```
  - Confirm all commits are intentional
  - Verify no work-in-progress commits

- [ ] **Update CHANGELOG.md** (if project uses one)
  - Document user-facing changes
  - Document breaking changes
  - Document security fixes

- [ ] **Commit all changes**
  ```bash
  git status
  # If uncommitted changes exist, decide: commit or stash
  ```

---

## Deployment Execution

### Choose Deployment Method

**Decision Matrix:**

| What Changed | Script to Use | Command |
|--------------|---------------|---------|
| Public HTML/CSS/JS only | `deploy-frontend.sh` | `./scripts/deploy-frontend.sh` |
| Koha donation system | `deploy-koha-to-production.sh` | `./scripts/deploy-koha-to-production.sh` |
| Full project (backend, routes, services) | `deploy-full-project-SAFE.sh` | `./scripts/deploy-full-project-SAFE.sh` |
| Emergency rollback | Manual rsync | See rollback section |
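
As a rough aid to the matrix above, the choice between the frontend and full scripts can be derived from the list of changed paths. This is a sketch under assumptions: `suggest_deploy_script` is a hypothetical helper, it treats any change outside `public/` as requiring a full deployment, and it leaves the Koha case to human judgement because the exact Koha file set is not enumerated here.

```bash
# Reads changed file paths on stdin (one per line) and prints the script
# suggested by the decision matrix. Returns 1 when there is nothing to deploy.
suggest_deploy_script() {
  local paths
  paths=$(cat)
  if [ -z "$paths" ]; then
    echo "nothing to deploy"
    return 1
  fi
  # Any path that does not start with public/ means backend or config changed.
  if printf '%s\n' "$paths" | grep -qv '^public/'; then
    echo "./scripts/deploy-full-project-SAFE.sh"
  else
    echo "./scripts/deploy-frontend.sh"
  fi
}

# Example:
#   git diff --name-only origin/main..HEAD | suggest_deploy_script
```

The output is a suggestion only; the reviewer still decides which script to run.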

### Option 1: Frontend-Only Deployment

Use when only public-facing files changed (HTML, CSS, JS, images).

```bash
./scripts/deploy-frontend.sh
```

**What it deploys:**
- `public/` directory
- Excludes: admin, backend code, config files

**Safety level:** ✅ Safest (public files only)

### Option 2: Koha-Specific Deployment

Use when Koha donation system changed.

```bash
./scripts/deploy-koha-to-production.sh
```

**What it deploys:**
- Koha controllers, services, routes
- Koha frontend (public/koha.html)
- Related middleware and models

**Safety level:** ⚠️ Moderate (includes backend code)

### Option 3: Full Project Deployment (Most Common)

Use for backend changes, new features, or multi-component updates.

```bash
./scripts/deploy-full-project-SAFE.sh
```

**Deployment steps:**
1. Script shows excluded patterns from .rsyncignore
2. **Review exclusions carefully** - Verify sensitive files excluded
3. Script shows dry-run summary
4. **Verify files to be deployed** - Look for any unexpected files
5. Confirm deployment (or Ctrl+C to abort)
6. Script executes rsync with progress
7. Deployment complete

**What it deploys:**
- All source code (src/)
- Public files (public/)
- Configuration (package.json, etc.)
- Documentation (docs/)
- Scripts (scripts/)

**What it excludes** (via .rsyncignore):
- Claude Code governance files (CLAUDE*.md, SESSION*.md)
- Environment files (.env*)
- Node modules (node_modules/)
- Git repository (.git/)
- Test files and coverage
- Development-only files

**Safety level:** ⚠️ Use carefully (full codebase)

### Deployment Verification During Execution

- [ ] **Watch for errors during deployment**
  - Rsync errors (permission denied, connection failures)
  - File conflicts
  - Unexpected file deletions

- [ ] **Verify file count is reasonable**
  - Frontend: ~50-100 files
  - Koha: ~20-30 files
  - Full: ~200-300 files (varies by project size)
  - If thousands of files: STOP - check .rsyncignore
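
The file-count sanity check can be sketched as a guard that counts the entries in a file listing and refuses to continue above a ceiling. `count_within_limit` is a hypothetical helper, and `list_files_to_deploy` in the usage comment stands in for whatever command produces the listing; the deployment scripts' own output format is not documented here.

```bash
# Reads one file path per line on stdin, prints the count, and returns
# non-zero when the count exceeds the given ceiling.
count_within_limit() {
  local max="$1"
  local count
  # grep -c exits 1 when it counts zero lines, so the status is ignored here.
  count=$(grep -c . || true)
  echo "$count"
  [ "$count" -le "$max" ]
}

# Example (300 is the upper end of the "Full" range above):
#   list_files_to_deploy | count_within_limit 300 || echo "STOP - check .rsyncignore"
```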

---

## Post-Deployment Verification

### 1. Immediate Checks (< 2 minutes)

- [ ] **Restart application** (if backend changes)
  ```bash
  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
    "sudo systemctl restart tractatus"
  ```

- [ ] **Check service status**
  ```bash
  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
    "sudo systemctl status tractatus"
  ```
  - Expected: `active (running)`
  - If failed: Check logs immediately

- [ ] **Health endpoint check**
  ```bash
  curl -i https://agenticgovernance.digital/health
  ```
  - Expected: `{"status":"ok","timestamp":"..."}` (200 OK)
  - If 500 or error: Check logs, may need rollback

- [ ] **Homepage loads**
  ```bash
  curl -I https://agenticgovernance.digital
  ```
  - Expected: `HTTP/2 200`
  - If 404/500: Critical issue, check logs
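
The health check is easier to script when the status code and body are evaluated together. A minimal sketch, assuming the compact JSON response shape shown above; `health_ok` is a hypothetical helper, kept free of network calls so it can be exercised offline.

```bash
# Succeeds only for HTTP 200 with a body containing "status":"ok"
# (compact JSON, as in the expected response above).
health_ok() {
  local status="$1"
  local body="$2"
  [ "$status" = "200" ] || return 1
  case "$body" in
    *'"status":"ok"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example against production:
#   url=https://agenticgovernance.digital/health
#   status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
#   body=$(curl -s "$url")
#   health_ok "$status" "$body" && echo "healthy" || echo "UNHEALTHY - check logs"
```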

### 2. Functional Checks (2-5 minutes)

- [ ] **Test primary user flows:**
  - Visit homepage: https://agenticgovernance.digital
  - Navigate to Researcher path: https://agenticgovernance.digital/researcher.html
  - Navigate to Implementer path: https://agenticgovernance.digital/implementer.html
  - Navigate to Leader path: https://agenticgovernance.digital/leader.html
  - Visit documentation: https://agenticgovernance.digital/docs.html
  - Test interactive demo: https://agenticgovernance.digital/demos/27027-demo.html

- [ ] **Test navigation:**
  - Click navbar dropdown menus
  - Mobile menu (resize browser or use DevTools)
  - Footer links work

- [ ] **Test critical features** (based on what changed):
  - If Koha changed: Test donation flow (test mode)
  - If admin changed: Test admin login
  - If governance changed: Test governance API (with admin token)
  - If documents changed: Test document retrieval

### 3. Log Monitoring (5-15 minutes)

- [ ] **Monitor production logs for errors**
  ```bash
  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
    "sudo journalctl -u tractatus -f"
  ```
  - Watch for:
    - ERROR, CRITICAL log levels
    - Unhandled exceptions
    - Database connection failures
    - 500 errors on requests
  - Monitor for at least 5 minutes
  - If errors appear: Investigate immediately

- [ ] **Check for new error patterns**
  ```bash
  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
    "sudo journalctl -u tractatus --since '5 minutes ago' | grep -i error"
  ```
  - Compare to known errors (acceptable warnings)
  - New errors may indicate deployment issues
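
Comparing against known errors can be made mechanical by keeping the acceptable messages in a file and filtering them out. This is a sketch only: `new_errors` and the `known-errors.txt` file are hypothetical and not part of the project.

```bash
# Reads log lines on stdin and prints only the error lines that do not
# contain any of the fixed strings listed (one per line) in the known-errors file.
# The file must not contain blank lines: an empty pattern matches every line.
new_errors() {
  local known_file="$1"
  { grep -i 'error' || true; } | { grep -v -i -F -f "$known_file" || true; }
}

# Example:
#   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
#     "sudo journalctl -u tractatus --since '5 minutes ago'" | new_errors known-errors.txt
```

Anything the filter prints is, by construction, an error line that has not been seen and accepted before.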

### 4. Analytics Check (Optional, 15+ minutes)

- [ ] **Verify Plausible Analytics tracking**
  - Visit https://plausible.io/agenticgovernance.digital
  - Confirm events are being tracked
  - Check for unusual bounce rates or errors

- [ ] **Check Google Search Console** (if configured)
  - Verify no new crawl errors
  - Check for 404 increases

---

## Database Migration Procedure (If Needed)

Only required when schema changes or data migrations are needed.

### Pre-Migration

- [ ] **Back up database** (Pre-Deployment step 3; run it now if it was skipped)
- [ ] **Test migration on staging** (if staging environment exists)
- [ ] **Review migration script**
  ```bash
  cat scripts/migrations/YYYYMMDD-description.js
  ```

### Execute Migration

```bash
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
  "cd /var/www/tractatus && node scripts/migrations/YYYYMMDD-description.js"
```

### Post-Migration

- [ ] **Verify migration success**
  ```bash
  # Check migration completed
  # Check data integrity
  ```

- [ ] **Test affected features**
  - Any features using migrated data

### Migration Rollback (If Needed)

- [ ] **Restore database from backup**
  ```bash
  ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
    "mongorestore --uri='...' /tmp/backup-TIMESTAMP"
  ```

- [ ] **Roll back code** (see rollback section)

---

## Rollback Procedure

Use if deployment causes critical issues that can't be quickly fixed.

### When to Roll Back

- Application won't start
- Critical features completely broken
- Security vulnerability introduced
- Data loss or corruption occurring
- 500 errors on every request

### How to Roll Back

1. **Identify last known good commit**
   ```bash
   git log --oneline -10
   # Find commit before problematic changes
   ```

2. **Checkout last good commit**
   ```bash
   git checkout <commit-hash>
   ```

3. **Redeploy using same script**
   ```bash
   # Use same deployment script as original deployment
   ./scripts/deploy-full-project-SAFE.sh
   ```

4. **Restart application**
   ```bash
   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
     "sudo systemctl restart tractatus"
   ```

5. **Verify rollback successful**
   - Check health endpoint
   - Check homepage loads
   - Check logs for errors

6. **Return to main branch**
   ```bash
   git checkout main
   ```
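
The six steps above can be strung together in a script that prints each command before running it, and only prints when in dry-run mode. This is a sketch, assuming the full-project script was used for the original deployment; `run`, `rollback`, and the `DRY_RUN` variable are hypothetical names. The chain stops at the first failing step rather than guessing how to recover.

```bash
# Default to dry-run so that sourcing or running this never deploys by accident.
: "${DRY_RUN:=1}"
HOST="ubuntu@vps-93a693da.vps.ovh.net"
KEY=~/.ssh/tractatus_deploy

# Prints the command, then executes it unless DRY_RUN is 1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

# Rolls production back to the given commit and returns to main afterwards.
rollback() {
  local commit="$1"
  run git checkout "$commit" &&
    run ./scripts/deploy-full-project-SAFE.sh &&
    run ssh -i "$KEY" "$HOST" "sudo systemctl restart tractatus" &&
    run curl -fsS https://agenticgovernance.digital/health &&
    run git checkout main
}

# Example: review the plan first, then execute it.
#   rollback abc1234
#   DRY_RUN=0 rollback abc1234
```

If the health check fails, the working tree stays on the rollback commit, which makes it obvious that the rollback itself still needs attention.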

### Post-Rollback

- [ ] **Document incident**
  - What went wrong?
  - What was the impact?
  - How was it detected?
  - How long was it broken?
  - What was rolled back?

- [ ] **Create incident report** (template below)

- [ ] **Fix issue in development**
  - Reproduce locally
  - Fix root cause
  - Add tests to prevent recurrence
  - Re-deploy when ready

---

## Incident Documentation Template

Create file: `docs/incidents/YYYY-MM-DD-description.md`

```markdown
# Incident Report: [Brief Description]

**Date**: YYYY-MM-DD HH:MM (NZST)
**Severity**: [Critical / High / Medium / Low]
**Duration**: [X minutes/hours]
**Detected By**: [User report / Monitoring / Developer]

## Summary
[1-2 sentence summary of what went wrong]

## Timeline
- HH:MM - Deployment initiated
- HH:MM - Issue detected
- HH:MM - Rollback initiated
- HH:MM - Service restored

## Root Cause
[What caused the issue?]

## Impact
- User-facing impact: [What did users experience?]
- Data impact: [Was any data lost/corrupted?]
- Security impact: [Were any security boundaries crossed?]

## Resolution
[How was it fixed?]

## Prevention
[What changes prevent this from happening again?]

## Action Items
- [ ] Fix root cause
- [ ] Add tests
- [ ] Update deployment checklist
- [ ] Update monitoring
```

---

## Deployment Log Template

Keep a deployment log in: `docs/deployments/YYYY-MM.md`

```markdown
# Deployments: [Month Year]

## YYYY-MM-DD HH:MM - [Description]

**Deployed By**: [Name]
**Deployment Type**: [Frontend / Koha / Full]
**Commits Deployed**:
- abc123 - Description
- def456 - Description

**Pre-Deployment Checks**:
- [x] Tests passing
- [x] Security audit clean
- [x] No sensitive files

**Verification**:
- [x] Health check passed
- [x] Homepage loads
- [x] No errors in logs

**Issues**: None
**Rollback Required**: No
**Notes**: [Any relevant notes]
```

---

## Emergency Procedures

### Service Won't Start

1. **Check logs immediately**
   ```bash
   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
     "sudo journalctl -u tractatus -n 100"
   ```

2. **Common issues:**
   - MongoDB connection failed → Check MongoDB running: `sudo systemctl status mongod`
   - Port already in use → Check for stale processes: `sudo lsof -i :9000`
   - Missing environment variables → Check .env file exists
   - Syntax error in code → Roll back immediately

3. **Quick fixes:**
   ```bash
   # Restart MongoDB if stopped
   sudo systemctl start mongod

   # Kill stale Node processes (pattern quoted so the shell cannot glob-expand it)
   sudo pkill -f 'node.*tractatus'

   # Restart application
   sudo systemctl restart tractatus
   ```

### Database Connection Lost

1. **Verify MongoDB running**
   ```bash
   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
     "sudo systemctl status mongod"
   ```

2. **Check MongoDB logs**
   ```bash
   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
     "sudo journalctl -u mongod -n 50"
   ```

3. **Test connection manually**
   ```bash
   # -t allocates a terminal so mongosh can prompt for the password
   ssh -t -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
     "mongosh --host localhost --port 27017 --authenticationDatabase tractatus_prod -u tractatus_user"
   ```

### High Error Rate

1. **Identify error pattern**
   ```bash
   # -o cat strips journald's timestamp prefix so repeated messages group under uniq -c
   ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net \
     "sudo journalctl -u tractatus -o cat --since '10 minutes ago' | grep ERROR | sort | uniq -c | sort -rn | head -10"
   ```

2. **Check if all endpoints affected or specific routes**
   ```bash
   # Check health endpoint
   curl https://agenticgovernance.digital/health

   # Check specific routes
   curl https://agenticgovernance.digital/api/documents
   ```

3. **Decision:**
   - If isolated to one feature: Disable feature, investigate
   - If site-wide: Roll back immediately

---

## CRITICAL: HTML Caching Rules

**MANDATORY REQUIREMENT**: HTML files MUST be delivered fresh to users without requiring cache refresh.

### The Problem
Service worker caching of HTML files made deployments appear to fail: users kept seeing OLD content even after NEW code had been deployed. Users should NEVER need to clear their cache manually.

### The Solution (Enforced as of 2025-10-17)

**Service Worker** (`public/service-worker.js`):
- HTML files: Network-ONLY strategy (never cache, always fetch fresh)
- Exception: `/index.html` only for offline fallback
- Bump `CACHE_VERSION` constant whenever service worker logic changes

**Server** (`src/server.js`):
- HTML files: `Cache-Control: no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0`
- This ensures browsers never cache HTML pages
- CSS/JS: Long cache OK (use version parameters for cache-busting)

**Version Manifest** (`public/version.json`):
- Update version number when deploying HTML changes
- Service worker checks this for updates
- Set `forceUpdate: true` for critical fixes
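
Incrementing the last component of the version number is simple enough to script. The helper below is a sketch: `bump_patch` is a hypothetical name, it assumes a dotted `MAJOR.MINOR.PATCH` string with no leading zeros, and it leaves writing the result back into `public/version.json` to whatever JSON tooling the project prefers, since that file's exact schema is not shown here.

```bash
# Prints the given dotted version with its last component incremented by one.
bump_patch() {
  local version="$1"
  local prefix="${version%.*}"   # everything before the last dot
  local patch="${version##*.}"   # everything after the last dot
  echo "$prefix.$((patch + 1))"
}

bump_patch 1.1.2   # prints 1.1.3
```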

### Deployment Rules for HTML Changes

When deploying HTML file changes:

1. **Verify service worker never caches HTML** (except index.html)
   ```bash
   grep -A 10 "HTML files:" public/service-worker.js
   # Should show: Network-ONLY strategy, no caching
   ```

2. **Verify server sends no-cache headers**
   ```bash
   grep -A 3 "HTML files:" src/server.js
   # Should show: no-store, no-cache, must-revalidate
   ```

3. **Bump version.json if critical content changed**
   ```bash
   # Edit public/version.json
   # Increment version: 1.1.2 → 1.1.3
   # Update changelog
   # Set forceUpdate: true
   ```

4. **After deployment, verify headers in production**
   ```bash
   curl -s -I https://agenticgovernance.digital/koha.html | grep -i cache-control
   # Expected: no-store, no-cache, must-revalidate

   curl -s https://agenticgovernance.digital/koha.html | grep "<title>"
   # Verify correct content showing
   ```

5. **Test in incognito window**
   - Open https://agenticgovernance.digital in fresh incognito window
   - Verify new content loads immediately
   - No cache refresh should be needed

### Testing Cache Behavior

**Before deployment:**
```bash
# Local: Verify server sends correct headers
# (-i because HTTP/1.1 servers usually send the header as "Cache-Control")
curl -s -I http://localhost:9000/koha.html | grep -i cache-control
# Expected: no-store, no-cache

# Verify service worker doesn't cache HTML
grep -A 10 "endsWith('.html')" public/service-worker.js
# Should NOT cache responses, only fetch
```

**After deployment:**
```bash
# Production: Verify headers
curl -s -I https://agenticgovernance.digital/<file>.html | grep -i cache-control

# Production: Verify fresh content
curl -s https://agenticgovernance.digital/<file>.html | grep "<title>"
```
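
The header check can also be turned into a pass/fail test rather than a visual comparison. A minimal sketch: `html_cache_headers_ok` is a hypothetical helper that reads response headers on stdin and requires both `no-store` and `no-cache` in `Cache-Control`. It matches the header name case-insensitively because HTTP/2 responses use lowercase header names while HTTP/1.1 servers typically do not.

```bash
# Reads HTTP response headers on stdin; succeeds only when Cache-Control
# contains both no-store and no-cache.
html_cache_headers_ok() {
  local header
  # Strip carriage returns and lowercase the line before matching.
  header=$(grep -i '^cache-control:' | tr -d '\r' | tr '[:upper:]' '[:lower:]')
  case "$header" in
    *no-store*) ;;
    *) return 1 ;;
  esac
  case "$header" in
    *no-cache*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
#   curl -s -I https://agenticgovernance.digital/koha.html | html_cache_headers_ok \
#     && echo "cache headers OK" || echo "HTML is cacheable - investigate"
```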

### Incident Prevention

**Lesson Learned** (2025-10-17 Koha Deployment):
- Deployed koha.html with reciprocal giving updates
- Service worker cached old version
- Users saw old content despite fresh deployment
- Required THREE deployment attempts to fix
- Root cause: Service worker was caching HTML with network-first strategy

**Prevention**:
- Service worker now enforces network-ONLY for all HTML (except offline index.html)
- Server enforces no-cache headers
- This checklist records the requirement as a standing architectural rule

---

## Deployment Best Practices

### DO:
- ✅ Deploy during low-traffic hours (10am-2pm NZST, when US traffic is low)
- ✅ Deploy small, focused changes (easier to debug)
- ✅ Test thoroughly before deploying
- ✅ Monitor logs after deployment
- ✅ Document all deployments
- ✅ Keep rollback procedure tested and ready
- ✅ Communicate with team before major deployments
- ✅ **CRITICAL: Verify HTML cache headers before and after deployment**
- ✅ **CRITICAL: Test in incognito window after HTML deployments**

### DON'T:
- ❌ Deploy on Friday afternoon (limited time to fix issues)
- ❌ Deploy multiple unrelated changes together
- ❌ Skip testing "because it's a small change"
- ❌ Deploy without checking logs after
- ❌ Deploy when tired or rushed
- ❌ Deploy without the ability to roll back
- ❌ Forget to restart services after backend changes
- ❌ **CRITICAL: Never cache HTML files in service worker (except offline fallback)**
- ❌ **CRITICAL: Never ask users to clear their browser cache - fix it server-side**

### Deployment Timing Guidelines

**Best Times** (Low risk):
- Monday-Thursday, 10am-2pm NZST
- After morning coffee, before lunch
- When you have 2+ hours to monitor

**Acceptable Times** (Medium risk):
- Monday-Thursday, 2pm-5pm NZST
- Early morning deployments (if you're alert)

**Avoid Times** (High risk):
- Friday 3pm+ (weekend coverage issues)
- Late evening (tired, less alert)
- During known high-traffic events
- When about to leave/travel

---

## Automation Opportunities (Future)

### Potential Improvements:
- [ ] Automated testing in CI/CD (GitHub Actions)
- [ ] Automated deployment on merge to main (after tests pass)
- [ ] Automated health checks post-deployment
- [ ] Automated rollback on health check failure
- [ ] Slack notifications for deployments
- [ ] Blue-green deployment for zero-downtime
- [ ] Canary deployments for gradual rollout

### Not Ready Yet Because:
- Need stable test suite (✅ NOW READY - 380 tests passing)
- Need monitoring in place (⏳ Next task - Option D)
- Need error alerting (⏳ Next task - Option D)
- Need staging environment (💡 Future consideration)

---

## Checklist Quick Reference

**Pre-Deploy:**
- [ ] Tests pass
- [ ] Security audit clean
- [ ] No sensitive files
- [ ] .rsyncignore verified

**Deploy:**
- [ ] Choose correct script
- [ ] Review dry-run
- [ ] Execute deployment
- [ ] Note any errors

**Verify:**
- [ ] Service running
- [ ] Health check OK
- [ ] Homepage loads
- [ ] Monitor logs 5-15min

**Document:**
- [ ] Log deployment
- [ ] Note any issues
- [ ] Update team

---

## Contact & Support

**Production Access:**
- SSH: `ubuntu@vps-93a693da.vps.ovh.net`
- Key: `~/.ssh/tractatus_deploy`
- Sudo: Available for systemctl, journalctl

**Service Management:**
- Service: `tractatus.service` (systemd)
- Status: `sudo systemctl status tractatus`
- Logs: `sudo journalctl -u tractatus -f`
- Restart: `sudo systemctl restart tractatus`

**Database:**
- Host: localhost:27017
- Database: `tractatus_prod`
- Auth: tractatus_prod database
- User: `tractatus_user`

**Domain:**
- Production: https://agenticgovernance.digital
- Analytics: https://plausible.io/agenticgovernance.digital

---

**Document Status**: Active Procedure
**Last Updated**: 2025-10-09
**Next Review**: After major deployment or incident
**Maintainer**: Technical Lead (Claude Code + John Stroh)