tractatus/docs/INCIDENT_RECOVERY_2026-01-19.md
TheFlow b302960a61 docs: Complete VPS recovery documentation and attack reference
- Update INCIDENT_RECOVERY_2026-01-19.md with complete recovery status
- Create VPS_RECOVERY_REFERENCE.md with step-by-step recovery guide
- Update remediation plan to show executed status
- Update OVH rescue mode doc with resolution notes

Documents the successful complete reinstall approach after multiple
failed partial cleanup attempts. Includes attack indicators, banned
software list, and verification checklist for future incidents.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 12:06:32 +13:00


# Incident Recovery Report - 2026-01-19/20
## Executive Summary
**Status:** COMPLETE RECOVERY (Updated 2026-01-20)
- Website: UP (https://agenticgovernance.digital/ responds HTTP 200)
- SSH Access: WORKING (via fresh VPS reinstall)
- Malware: ELIMINATED (complete OS reinstall)
- Application: FULLY RESTORED
- Database: MIGRATED from local backup (134 documents)
- SSL: VALID (Let's Encrypt, expires April 2026)
- Root Cause: PM2 process manager running Exodus botnet malware
---
## Incident Timeline
| Date/Time | Event |
|-----------|-------|
| 2025-12-09 | First botnet attack (Exodus via Docker/Umami) - 83Kpps/45Mbps |
| 2025-12-09 | Recovery claimed complete, Docker removed |
| 2026-01-18 11:38 UTC | Server working, services running |
| 2026-01-18 13:57 CET | Second attack detected - 171Kpps/51Mbps UDP to 15.184.38.247:9007 |
| 2026-01-18 | OVH forces rescue mode |
| 2026-01-18 23:44 CET | Third attack detected - 44Kpps/50Mbps UDP to 171.225.223.4:80 |
| 2026-01-19 ~00:00 UTC | Recovery session begins |
| 2026-01-19 ~00:10 UTC | Malware identified: PM2 running botnet |
| 2026-01-19 ~00:12 UTC | PM2 and umami-deployment removed |
| 2026-01-19 00:12 UTC | Server rebooted to normal mode |
| 2026-01-19 00:12 UTC | Website confirmed UP |
| 2026-01-19 00:12 UTC | SSH access BROKEN |
---
## Attack Details
### Attack 1 (2025-12-09)
- **Type:** DNS flood
- **Rate:** 83Kpps / 45Mbps
- **Target:** 171.225.223.108:53
- **Source:** Docker container (Umami Analytics)
- **Malware:** Exodus Botnet (Mirai variant)
### Attack 2 (2026-01-18 13:57 CET)
- **Type:** UDP flood
- **Rate:** 171Kpps / 51Mbps
- **Target:** 15.184.38.247:9007
- **Source:** Unknown (likely PM2 managed process)
### Attack 3 (2026-01-18 23:44 CET)
- **Type:** UDP flood
- **Rate:** 44Kpps / 50Mbps
- **Target:** 171.225.223.4:80
- **Source:** Unknown (likely PM2 managed process)
---
## Root Cause Analysis
### December 2025 Recovery Failure
The December recovery was **incomplete**. Claims made:
- "Docker removed" - TRUE (Docker binaries removed)
- "All malware cleaned" - FALSE
What was **NOT** removed in December:
1. `/home/ubuntu/umami-deployment/` directory with cron jobs
2. PM2 process manager (`pm2-ubuntu.service`)
3. PostgreSQL service (part of Umami stack)
4. Ubuntu crontab with umami backup/monitoring scripts
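Leftover cron entries like item 4 are easy to miss during a cleanup. The following is a minimal sketch, not the procedure that was actually used, for listing active user crontab lines on a disk mounted in rescue mode. The function name `list_user_cron` is hypothetical, and the crontab location assumes the standard Ubuntu path `/var/spool/cron/crontabs/`.

```bash
# Hypothetical helper: print every active (non-comment, non-blank) line from
# each user crontab under ROOT, prefixed with the owning user's name.
# ROOT is the mount point in rescue mode (e.g. /mnt) or empty for a live system.
list_user_cron() (
    root="${1:-}"
    root="${root%/}"                       # strip a trailing slash
    for tab in "$root"/var/spool/cron/crontabs/*; do
        [ -f "$tab" ] || continue          # glob matched nothing
        user=$(basename "$tab")
        # Skip lines that are blank or start with '#'
        awk -v u="$user" '!/^[ \t]*(#|$)/ { print u ": " $0 }' "$tab"
    done
)
```

Running `list_user_cron /mnt` from rescue mode should print nothing on a clean disk. System-wide locations such as `/etc/cron.d/` are not covered by this sketch and would need a separate check.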
### Persistence Mechanism
The botnet persisted via **PM2 process manager**:
- Service: `/etc/systemd/system/pm2-ubuntu.service`
- Enabled: `/etc/systemd/system/multi-user.target.wants/pm2-ubuntu.service`
- Config: `/home/ubuntu/.pm2/dump.pm2`
- Logs: `/home/ubuntu/.pm2/pm2.log` (375 MB)
- Behavior: on boot, the service runs `pm2 resurrect`, which restarts every process saved in `dump.pm2`
PM2 should NEVER have existed on this server. Project spec states "Systemd only (no PM2)".
---
## Recovery Actions Taken (2026-01-19)
### Via OVH Rescue Mode
1. Mounted main disk: `mount /dev/sdb1 /mnt`
2. Removed PM2 completely:
```bash
rm -rf /mnt/home/ubuntu/.pm2
rm -f /mnt/etc/systemd/system/pm2-ubuntu.service
rm -f /mnt/etc/systemd/system/multi-user.target.wants/pm2-ubuntu.service
```
3. Removed umami-deployment:
```bash
rm -rf /mnt/home/ubuntu/umami-deployment
rm -f /mnt/var/spool/cron/crontabs/ubuntu
```
4. Disabled PostgreSQL:
```bash
rm -f /mnt/etc/systemd/system/multi-user.target.wants/postgresql.service
```
5. Verified SSH keys present in `/mnt/home/ubuntu/.ssh/authorized_keys`
6. Rebooted to normal mode
---
## Current Status (as of 2026-01-19, before the reinstall)
### Working
- Website responds: https://agenticgovernance.digital/ (HTTP 200)
- nginx running
- tractatus service running (website works)
- mongod running (website works)
- Boot mode: LOCAL (not rescue)
### Broken
- SSH access: Connection closes immediately after authentication
- KVM console: Returns to login prompt after password entry
- No shell access to server
### Unknown
- Whether all malware is removed
- Whether another attack will occur
- Why SSH/shell access is broken
---
## SSH Keys (Should Be Present)
### Primary Key (theflow@the-flow)
```
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCZ8BH+Bx4uO9DTatRZ/YF5xveP/bTyiAWj+qTF7I+ugxgL9/ejSlW1tSn5Seo4XHoEPD5wZCaWig7m1LMezrRq8fDWHbeXkZltK01xhAPU0L0+OvVZMZacW6+vkNfKcNG9vrxV+K/VTPkT+00TRqlHbP8ZWj0OWd92XAoTroKVYMt4L9e7QeJOJmRmHI0uFaJ0Ufexr2gmZyYhgL2p7PP3oiAvM0xlnTwygl06c3iwXpHKWNydOYPSDs3MkVnDjptmWgKv/J+QXksarwEpA4Csc2dLnco+8KrtocUUcAunz6NJfypA0yNWWzf+/OeffkJ2Rueoe8t/lVffXdI7eVuFkmDufE7XMk9YAE/8+XVqok4OV0Q+bjpH8mKlBA3rNobnWs6obBVJD8/5aphE8NdCR4cgIeRSwieFhfzCl+GBZNvs4yuBdKvQQIfCRAKqTgbuc03XERAef6lJUuJrDjwzvvp1Nd8L7AqJoQS6kYGyxXPf/6nWTZtpxoobdGnJ2FZK6OIpAlsWx9LnybMGy19VfaR9JZSAkLdWxGPb6acNUb2xaaqyuXPo4sWpBM27n1HeKMv/7Oh4WL4zrAxDKfN38k1JsjJJVEABuN/pEOb7BCDnTMLKXlTunZgynAZJ/Dxn+zOAyfzaYSNBotlpYy1zj1AmzvS31L7LJy/aSBHuWw== theflow@the-flow
```
### Deploy Key (tractatus-deploy)
```
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPdJcKMabIVQRqKqNIpzxHNgxMZ8NOD+9gVCk6dY5uV0 tractatus-deploy
```
### Automated Deploy Key (added 2026-01-18)
```
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILPMcFAmLaRiLJLOD9EGJGm+EfdKu/Xb6p/+oBV/18HC tractatus-deploy-automated
```
### Key Backup URL
https://paste.rs/nELRM
---
## Outstanding Issues (as of 2026-01-19)
### Critical
1. **No shell access** - Cannot manage server without rescue mode
2. **Malware verification incomplete** - Cannot confirm all malware removed
### High
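1. **SSH broken** - Need to investigate via rescue mode (main disk mounted at `/mnt`):
   - Check `/mnt/var/log/auth.log`
   - Check the SSH journal, if persistent journaling is enabled: `journalctl -D /mnt/var/log/journal -u ssh` (Ubuntu names the unit `ssh`, not `sshd`)
   - Check PAM configuration (`/mnt/etc/pam.d/`)
   - Check shell configuration (login shell in `/mnt/etc/passwd`, plus `.bashrc` and `.profile` in `/mnt/home/ubuntu/`)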
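A session that closes right after authentication, on both SSH and the KVM console, is consistent with a missing or broken login shell, though that was never confirmed here. The sketch below shows one way to check it from rescue mode. The function names `login_shell_for` and `check_login_shell` are hypothetical, and the mount point is passed as an argument.

```bash
# Hypothetical helper: print the login shell recorded for USER in ROOT/etc/passwd.
login_shell_for() (
    user="$1"; root="${2:-}"
    # passwd field 1 is the user name, field 7 is the login shell
    awk -F: -v u="$user" '$1 == u { print $7 }' "${root%/}/etc/passwd"
)

# Report whether that shell exists and is executable on the mounted disk.
check_login_shell() (
    user="$1"; root="${2:-}"; root="${root%/}"
    shell=$(login_shell_for "$user" "$root")
    if [ -z "$shell" ]; then
        echo "no passwd entry (or empty shell field) for $user"
        return 1
    elif [ ! -x "$root$shell" ]; then
        echo "login shell $shell is missing or not executable"
        return 1
    fi
    echo "login shell $shell looks OK"
)
```

For example, `check_login_shell ubuntu /mnt` would be run after mounting the main disk. A passing result only rules out one cause; PAM and the shell startup files still need to be inspected separately.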
### Medium
1. **MongoDB log rotation** - Not configured, caused 45GB disk fill previously
2. **fail2ban** - May be blocking IPs aggressively
3. **No monitoring** - No alerts for future attacks
---
## Required Follow-up Actions (as of 2026-01-19)
1. **Re-enter rescue mode** to fix SSH access
2. **Check auth logs** to determine why connections close
3. **Configure MongoDB log rotation** to prevent disk fill
4. **Verify no remaining malware** with full filesystem scan
5. **Document all credentials** in secure location
6. **Set up monitoring** for future attack detection
---
## Lessons Learned
### December Recovery Failures
1. Did not verify all services running on server
2. Did not check for PM2 (shouldn't exist per spec)
3. Did not remove umami-deployment directory
4. Did not remove ubuntu crontab
5. Falsely claimed complete recovery
### Process Failures
1. No verification checklist for recovery
2. No documentation of what should/shouldn't exist on server
3. No monitoring for attack recurrence
4. Repeated SSH access issues due to poor key management
---
## Server Specification (What SHOULD Exist)
### Services (Systemd)
- tractatus.service - Node.js application
- nginx.service - Web server
- mongod.service - Database
- fail2ban.service - Intrusion prevention
### Services (Should NOT Exist)
- pm2-ubuntu.service - REMOVED
- postgresql.service - REMOVED (was for Umami)
- docker.service - Should not exist
- Any umami/analytics services
### Directories
- `/var/www/tractatus/` - Application
- `/home/ubuntu/` - User home
- `/home/ubuntu/.ssh/` - SSH keys
### Directories (Should NOT Exist)
- `/home/ubuntu/umami-deployment/` - REMOVED
- `/home/ubuntu/.pm2/` - REMOVED
- `/var/lib/docker/` - Should not exist
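The "Should NOT Exist" lists above can be turned into a repeatable check. This is a minimal sketch under stated assumptions: the function name `audit_banned` is hypothetical, and it only looks at units enabled through `multi-user.target.wants`, so a unit enabled under another target would not be caught.

```bash
# Hypothetical audit: flag anything from the "Should NOT Exist" lists under ROOT.
# ROOT is the mount point in rescue mode (e.g. /mnt) or empty for a live system.
# Returns non-zero if any banned unit or directory is found.
audit_banned() (
    root="${1:-}"; root="${root%/}"
    status=0
    wants="$root/etc/systemd/system/multi-user.target.wants"
    for unit in pm2-ubuntu.service postgresql.service docker.service; do
        # -L also catches a dangling symlink left behind by a partial removal
        if [ -e "$wants/$unit" ] || [ -L "$wants/$unit" ]; then
            echo "BANNED unit enabled: $unit"
            status=1
        fi
    done
    for dir in /home/ubuntu/umami-deployment /home/ubuntu/.pm2 /var/lib/docker; do
        if [ -d "$root$dir" ]; then
            echo "BANNED directory present: $dir"
            status=1
        fi
    done
    return "$status"
)
```

`audit_banned /mnt` works from rescue mode and `audit_banned` works on the live server. Because it returns non-zero on any finding, it can be chained into a larger check script.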
---
## OVH Reference Information
- **Server:** vps-93a693da.vps.ovh.net
- **IP:** 91.134.240.3
- **Manager:** https://www.ovh.com/manager/
- **Attack 2 Ref:** [ref=1.39fdba94] (2026-01-18 13:57 CET)
- **Attack 3 Ref:** [ref=1.39fdba94] (2026-01-18 23:44 CET)
- **Rescue Ref:** [ref=1.2378332d]
---
## Claude Code Accountability
This incident represents multiple failures:
1. **December 2025:** Incomplete malware removal, false claims of complete recovery
2. **January 2026:** Failed to identify botnet attack as cause of issues
3. **January 2026:** 8+ hours of user time wasted on repeated recovery
4. **January 2026:** Failed to implement preventive measures after first incident
5. **January 2026:** SSH access remains broken after recovery attempt
---
## COMPLETE RECOVERY - 2026-01-20
### What Was Done
After multiple failed partial cleanup attempts, the decision was made to perform a **complete VPS reinstallation** as recommended in the remediation plan.
#### Phase 1: VPS Reinstallation via OVH Manager
- User initiated complete OS reinstall from OVH Manager
- Fresh Ubuntu installation with new credentials
- All malware completely eliminated by full disk wipe
#### Phase 2: System Setup
```bash
# Security tools
apt install -y fail2ban rkhunter chkrootkit
# Daily security monitoring script installed at:
#   /usr/local/bin/daily-security-check.sh
# MongoDB with log rotation (mongodb-org comes from MongoDB's own apt
# repository, which must be configured before this step)
apt install -y mongodb-org
# Configured logrotate for /var/log/mongodb/
```
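The exact logrotate policy used for `/var/log/mongodb/` is not recorded in this report. As a sketch of what such a policy could look like, the helper below writes one to a given file. The function name `write_mongod_logrotate`, the daily schedule and the seven-file retention are all assumptions. `copytruncate` is chosen so that mongod does not need to be signalled, at the cost of possibly losing a few log lines during truncation.

```bash
# Hypothetical helper: write a logrotate policy for mongod to the file DEST.
# On the server DEST would be /etc/logrotate.d/mongod (requires root).
# The schedule and retention values below are assumptions, not the deployed config.
write_mongod_logrotate() {
    cat > "$1" <<'EOF'
/var/log/mongodb/*.log {
    daily
    rotate 7
    compress
    missingok
    notifempty
    copytruncate
}
EOF
}
```

After writing the file, `logrotate -d /etc/logrotate.d/mongod` performs a dry run and reports what would be rotated without changing anything.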
#### Phase 3: Application Deployment
1. Created `/var/www/tractatus/` directory
2. Created production `.env` file with NODE_ENV=production
3. Deployed application via rsync from local (CLEAN source)
4. Installed dependencies including `@anthropic-ai/sdk`
5. Created systemd service (`/etc/systemd/system/tractatus.service`)
6. Configured nginx with SSL reverse proxy
#### Phase 4: SSL Certificate
```bash
certbot --nginx -d agenticgovernance.digital
# Certificate valid until April 2026
```
#### Phase 5: Database Migration
```bash
# Local: Export database
mongodump --db tractatus_dev --out ~/tractatus-backup
# Transfer to VPS
rsync -avz ~/tractatus-backup/ ubuntu@vps:/tmp/tractatus-backup/
# VPS: Import to production
mongorestore --db tractatus /tmp/tractatus-backup/tractatus_dev/
# Result: 134 documents + 12 blog posts restored
```
#### Phase 6: Admin Setup
```bash
node scripts/fix-admin-user.js
node scripts/seed-projects.js
```
### Final System State (2026-01-20)
**Services Running:**
- `tractatus.service` - Node.js application (port 9000)
- `nginx.service` - Web server with SSL
- `mongod.service` - MongoDB database
- `fail2ban.service` - Intrusion prevention
**Services Explicitly BANNED:**
- PM2 - Never install (malware persistence vector)
- Docker - Never install (attack vector)
- PostgreSQL - Not needed (was for Umami)
**Security Measures:**
- SSH key authentication only (password disabled)
- UFW firewall enabled
- fail2ban active
- Daily security scan at 3 AM UTC (`/usr/local/bin/daily-security-check.sh`)
- rkhunter and chkrootkit installed
**Post-Recovery Improvements (same session):**
- Removed all Umami analytics references from codebase (29 HTML files)
- Deleted `/public/js/components/umami-tracker.js`
- Updated privacy policy to reflect "No Analytics"
- Added Research Papers section to landing page
- Created `/korero-counter-arguments.html` page
- Fixed Tailwind CSS to include emerald gradient classes
### Verification Completed
- [x] SSH access works with key authentication
- [x] Website responds correctly (HTTP 200)
- [x] SSL certificate valid
- [x] MongoDB running and accessible
- [x] All documents migrated (134 total)
- [x] Blog posts visible (12 posts)
- [x] Admin user functional
- [x] No PM2 installed
- [x] No Docker installed
- [x] Daily security scan configured
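The "No PM2 installed" and "No Docker installed" items can be re-checked at any time with a small script. The following is a sketch, with the hypothetical function name `assert_absent`; it only tests whether a command is reachable on `PATH`, so a binary stored elsewhere on disk would not be detected.

```bash
# Hypothetical helper: succeed only if none of the named commands is on PATH.
assert_absent() {
    rc=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "FAIL: $cmd is installed at $(command -v "$cmd")"
            rc=1
        else
            echo "ok: $cmd not found"
        fi
    done
    return "$rc"
}
```

On the server this would be invoked as `assert_absent pm2 docker`. A non-zero exit status means at least one banned tool has reappeared.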
---
**Report Date:** 2026-01-19 (initial) / 2026-01-20 (complete recovery)
**Status:** COMPLETE RECOVERY - All systems operational
**Next Action:** Resume normal development (/community project)