diff --git a/docs/INCIDENT_RECOVERY_2026-01-19.md b/docs/INCIDENT_RECOVERY_2026-01-19.md
new file mode 100644
index 00000000..4a849cbe
--- /dev/null
+++ b/docs/INCIDENT_RECOVERY_2026-01-19.md
@@ -0,0 +1,254 @@
+# Incident Recovery Report - 2026-01-19
+
+## Executive Summary
+
+**Status:** PARTIAL RECOVERY
+- Website: UP (https://agenticgovernance.digital/ responds HTTP 200)
+- SSH Access: BROKEN (connection closes after authentication)
+- Malware: Known persistence REMOVED (PM2 and umami-deployment deleted); complete removal not yet verified
+- Root Cause: PM2 process manager running botnet malware
+
+---
+
+## Incident Timeline
+
+| Date/Time | Event |
+|-----------|-------|
+| 2025-12-09 | First botnet attack (Exodus via Docker/Umami) - 83Kpps/45Mbps |
+| 2025-12-09 | Recovery claimed complete, Docker removed |
+| 2026-01-18 11:38 UTC | Server working, services running |
+| 2026-01-18 13:57 CET | Second attack detected - 171Kpps/51Mbps UDP to 15.184.38.247:9007 |
+| 2026-01-18 | OVH forces rescue mode |
+| 2026-01-18 23:44 CET | Third attack detected - 44Kpps/50Mbps UDP to 171.225.223.4:80 |
+| 2026-01-19 ~00:00 UTC | Recovery session begins |
+| 2026-01-19 ~00:10 UTC | Malware identified: PM2 running botnet |
+| 2026-01-19 ~00:12 UTC | PM2 and umami-deployment removed |
+| 2026-01-19 00:12 UTC | Server rebooted to normal mode |
+| 2026-01-19 00:12 UTC | Website confirmed UP |
+| 2026-01-19 00:12 UTC | SSH access BROKEN |
+
+---
+
+## Attack Details
+
+### Attack 1 (2025-12-09)
+- **Type:** DNS flood
+- **Rate:** 83Kpps / 45Mbps
+- **Target:** 171.225.223.108:53
+- **Source:** Docker container (Umami Analytics)
+- **Malware:** Exodus Botnet (Mirai variant)
+
+### Attack 2 (2026-01-18 13:57 CET)
+- **Type:** UDP flood
+- **Rate:** 171Kpps / 51Mbps
+- **Target:** 15.184.38.247:9007
+- **Source:** Unknown (likely PM2 managed process)
+
+### Attack 3 (2026-01-18 23:44 CET)
+- **Type:** UDP flood
+- **Rate:** 44Kpps / 50Mbps
+- **Target:** 171.225.223.4:80
+- **Source:** Unknown (likely PM2 managed process)
+
+---
+
+## Root Cause Analysis
+
+### December 2025 Recovery Failure
+
+The December recovery was **incomplete**. Claims made:
+- "Docker removed" - TRUE (Docker binaries removed)
+- "All malware cleaned" - FALSE
+
+What was **NOT** removed in December:
+1. `/home/ubuntu/umami-deployment/` directory with cron jobs
+2. PM2 process manager (`pm2-ubuntu.service`)
+3. PostgreSQL service (part of Umami stack)
+4. The `ubuntu` user's crontab with umami backup/monitoring scripts
+
+### Persistence Mechanism
+
+The botnet persisted via **PM2 process manager**:
+- Service: `/etc/systemd/system/pm2-ubuntu.service`
+- Enabled: `/etc/systemd/system/multi-user.target.wants/pm2-ubuntu.service`
+- Config: `/home/ubuntu/.pm2/dump.pm2`
+- Logs: `/home/ubuntu/.pm2/pm2.log` (375 MB)
+- Behavior: `pm2 resurrect` on boot restarts saved processes
+
+PM2 should NEVER have existed on this server. Project spec states "Systemd only (no PM2)".
+
+---
+
+## Recovery Actions Taken (2026-01-19)
+
+### Via OVH Rescue Mode
+
+1. Mounted main disk: `mount /dev/sdb1 /mnt`
+
+2. Removed PM2 completely:
+```bash
+rm -rf /mnt/home/ubuntu/.pm2
+rm -f /mnt/etc/systemd/system/pm2-ubuntu.service
+rm -f /mnt/etc/systemd/system/multi-user.target.wants/pm2-ubuntu.service
+```
+
+3. Removed umami-deployment:
+```bash
+rm -rf /mnt/home/ubuntu/umami-deployment
+rm -f /mnt/var/spool/cron/crontabs/ubuntu
+```
+
+4. Disabled PostgreSQL:
+```bash
+rm -f /mnt/etc/systemd/system/multi-user.target.wants/postgresql.service
+```
+
+5. Verified SSH keys present in `/mnt/home/ubuntu/.ssh/authorized_keys`
+
+6. Rebooted to normal mode
+
+---
+
+## Current Status
+
+### Working
+- Website responds: https://agenticgovernance.digital/ (HTTP 200)
+- nginx running
+- tractatus service running (website works)
+- mongod running (website works)
+- Boot mode: LOCAL (not rescue)
+
+### Broken
+- SSH access: Connection closes immediately after authentication
+- KVM console: Returns to login prompt after password entry
+- No shell access to server
+
+### Unknown
+- Whether all malware is removed
+- Whether another attack will occur
+- Why SSH/shell access is broken
+
+---
+
+## SSH Keys (Should Be Present)
+
+### Primary Key (theflow@the-flow)
+```
+ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCZ8BH+Bx4uO9DTatRZ/YF5xveP/bTyiAWj+qTF7I+ugxgL9/ejSlW1tSn5Seo4XHoEPD5wZCaWig7m1LMezrRq8fDWHbeXkZltK01xhAPU0L0+OvVZMZacW6+vkNfKcNG9vrxV+K/VTPkT+00TRqlHbP8ZWj0OWd92XAoTroKVYMt4L9e7QeJOJmRmHI0uFaJ0Ufexr2gmZyYhgL2p7PP3oiAvM0xlnTwygl06c3iwXpHKWNydOYPSDs3MkVnDjptmWgKv/J+QXksarwEpA4Csc2dLnco+8KrtocUUcAunz6NJfypA0yNWWzf+/OeffkJ2Rueoe8t/lVffXdI7eVuFkmDufE7XMk9YAE/8+XVqok4OV0Q+bjpH8mKlBA3rNobnWs6obBVJD8/5aphE8NdCR4cgIeRSwieFhfzCl+GBZNvs4yuBdKvQQIfCRAKqTgbuc03XERAef6lJUuJrDjwzvvp1Nd8L7AqJoQS6kYGyxXPf/6nWTZtpxoobdGnJ2FZK6OIpAlsWx9LnybMGy19VfaR9JZSAkLdWxGPb6acNUb2xaaqyuXPo4sWpBM27n1HeKMv/7Oh4WL4zrAxDKfN38k1JsjJJVEABuN/pEOb7BCDnTMLKXlTunZgynAZJ/Dxn+zOAyfzaYSNBotlpYy1zj1AmzvS31L7LJy/aSBHuWw== theflow@the-flow
+```
+
+### Deploy Key (tractatus-deploy)
+```
+ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPdJcKMabIVQRqKqNIpzxHNgxMZ8NOD+9gVCk6dY5uV0 tractatus-deploy
+```
+
+### Automated Deploy Key (added 2026-01-18)
+```
+ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILPMcFAmLaRiLJLOD9EGJGm+EfdKu/Xb6p/+oBV/18HC tractatus-deploy-automated
+```
+
+### Key Backup URL
+https://paste.rs/nELRM
+
+---
+
+## Outstanding Issues
+
+### Critical
+1. **No shell access** - Cannot manage server without rescue mode
+2. **Malware verification incomplete** - Cannot confirm all malware removed
+
+### High
+1. **SSH broken** - Need to investigate via rescue mode:
+   - Check `/var/log/auth.log`
+   - Check `journalctl -u ssh` (or `-u sshd`, depending on the unit name)
+   - Check PAM configuration
+   - Check shell configuration
+
+### Medium
+1. **MongoDB log rotation** - Not configured, caused 45 GB disk fill previously
+2. **fail2ban** - May be blocking IPs aggressively
+3. **No monitoring** - No alerts for future attacks
+
+---
+
+## Required Follow-up Actions
+
+1. **Re-enter rescue mode** to fix SSH access
+2. **Check auth logs** to determine why connections close
+3. **Configure MongoDB log rotation** to prevent disk fill
+4. **Verify no remaining malware** with full filesystem scan
+5. **Document all credentials** in secure location
+6. **Set up monitoring** for future attack detection
+
+---
+
+## Lessons Learned
+
+### December Recovery Failures
+1. Did not verify all services running on server
+2. Did not check for PM2 (shouldn't exist per spec)
+3. Did not remove umami-deployment directory
+4. Did not remove the `ubuntu` user's crontab
+5. Falsely claimed complete recovery
+
+### Process Failures
+1. No verification checklist for recovery
+2. No documentation of what should/shouldn't exist on server
+3. No monitoring for attack recurrence
+4. Repeated SSH access issues due to poor key management
+
+---
+
+## Server Specification (What SHOULD Exist)
+
+### Services (Systemd)
+- tractatus.service - Node.js application
+- nginx.service - Web server
+- mongod.service - Database
+- fail2ban.service - Intrusion prevention
+
+### Services (Should NOT Exist)
+- pm2-ubuntu.service - REMOVED
+- postgresql.service - DISABLED (was for Umami)
+- docker.service - Should not exist
+- Any umami/analytics services
+
+### Directories
+- `/var/www/tractatus/` - Application
+- `/home/ubuntu/` - User home
+- `/home/ubuntu/.ssh/` - SSH keys
+
+### Directories (Should NOT Exist)
+- `/home/ubuntu/umami-deployment/` - REMOVED
+- `/home/ubuntu/.pm2/` - REMOVED
+- `/var/lib/docker/` - Should not exist
+
+---
+
+## OVH Reference Information
+
+- **Server:** vps-93a693da.vps.ovh.net
+- **IP:** 91.134.240.3
+- **Manager:** https://www.ovh.com/manager/
+- **Attack Ref 1:** [ref=1.39fdba94] (Jan 18 13:57)
+- **Attack Ref 2:** [ref=1.39fdba94] (Jan 18 23:44)
+- **Rescue Ref:** [ref=1.2378332d]
+
+---
+
+## Claude Code Accountability
+
+This incident represents multiple failures:
+
+1. **December 2025:** Incomplete malware removal, false claims of complete recovery
+2. **January 2026:** Failed to identify botnet attack as cause of issues
+3. **January 2026:** 8+ hours of user time wasted on repeated recovery
+4. **January 2026:** Failed to implement preventive measures after first incident
+5. **January 2026:** SSH access remains broken after recovery attempt
+
+---
+
+**Report Date:** 2026-01-19
+**Status:** PARTIAL RECOVERY - Website up, SSH broken
+**Next Action:** Re-enter rescue mode to fix SSH access
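
The report lists "No verification checklist for recovery" as a process failure, and its "Should NOT Exist" lists together with the rescue-mode removal steps are concrete enough to check mechanically. The following is a minimal sketch of such a check, assuming the main disk is mounted at `/mnt` as in the recovery steps. The function name `audit_root` is hypothetical and not part of any existing tooling on the server. It only looks for the specific paths named in the report, so a clean result does not show the system is free of malware.

```bash
#!/bin/sh
# Sketch: audit a mounted root filesystem against the paths this report
# says should not exist. audit_root is an illustrative name (assumption).

# audit_root ROOT
# Prints one "FOUND: <path>" line per forbidden path present under ROOT
# (default: /mnt). Returns 0 if none are present, 1 otherwise.
audit_root() {
  root="${1:-/mnt}"
  found=0
  for rel in \
    home/ubuntu/umami-deployment \
    home/ubuntu/.pm2 \
    var/lib/docker \
    etc/systemd/system/pm2-ubuntu.service \
    etc/systemd/system/multi-user.target.wants/pm2-ubuntu.service \
    etc/systemd/system/multi-user.target.wants/postgresql.service
  do
    # -e follows symlinks; -L also catches dangling symlinks left in *.wants/
    if [ -e "$root/$rel" ] || [ -L "$root/$rel" ]; then
      echo "FOUND: $root/$rel"
      found=1
    fi
  done
  return "$found"
}
```

From rescue mode, after `mount /dev/sdb1 /mnt`, running `audit_root /mnt && echo "no known persistence paths found"` would report any of these paths that have reappeared. Returning a status instead of calling `exit` keeps the function safe to source into an interactive rescue shell.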