docs: Add incident recovery report 2026-01-19

- Documents three botnet attacks (Dec 2025, Jan 18 x2)
- Root cause: PM2 process manager running malware (should never have existed)
- December recovery was incomplete (umami-deployment, PM2 not removed)
- Current status: Website UP, SSH BROKEN
- Full SSH keys documented
- Lists all recovery actions taken
- Acknowledges Claude Code failures

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
TheFlow 2026-01-19 13:28:59 +13:00
parent d9ddb832b8
commit 9b95c364d2

# Incident Recovery Report - 2026-01-19
## Executive Summary
**Status:** PARTIAL RECOVERY
- Website: UP (https://agenticgovernance.digital/ responds HTTP 200)
- SSH Access: BROKEN (connection closes after authentication)
- Malware: Known components REMOVED (PM2 and umami-deployment deleted); complete removal not yet verified
- Root Cause: PM2 process manager running botnet malware
---
## Incident Timeline
| Date/Time | Event |
|-----------|-------|
| 2025-12-09 | First botnet attack (Exodus via Docker/Umami) - 83Kpps/45Mbps |
| 2025-12-09 | Recovery claimed complete, Docker removed |
| 2026-01-18 11:38 UTC | Server working, services running |
| 2026-01-18 13:57 CET (12:57 UTC) | Second attack detected - 171Kpps/51Mbps UDP to 15.184.38.247:9007 |
| 2026-01-18 | OVH forces rescue mode |
| 2026-01-18 23:44 CET (22:44 UTC) | Third attack detected - 44Kpps/50Mbps UDP to 171.225.223.4:80 |
| 2026-01-19 ~00:00 UTC | Recovery session begins |
| 2026-01-19 ~00:10 UTC | Malware identified: PM2 running botnet |
| 2026-01-19 ~00:12 UTC | PM2 and umami-deployment removed |
| 2026-01-19 00:12 UTC | Server rebooted to normal mode |
| 2026-01-19 00:12 UTC | Website confirmed UP |
| 2026-01-19 00:12 UTC | SSH access BROKEN |
---
## Attack Details
### Attack 1 (2025-12-09)
- **Type:** DNS flood
- **Rate:** 83Kpps / 45Mbps
- **Target:** 171.225.223.108:53
- **Source:** Docker container (Umami Analytics)
- **Malware:** Exodus Botnet (Mirai variant)
### Attack 2 (2026-01-18 13:57 CET)
- **Type:** UDP flood
- **Rate:** 171Kpps / 51Mbps
- **Target:** 15.184.38.247:9007
- **Source:** Unconfirmed (likely a PM2-managed process)
### Attack 3 (2026-01-18 23:44 CET)
- **Type:** UDP flood
- **Rate:** 44Kpps / 50Mbps
- **Target:** 171.225.223.4:80
- **Source:** Unconfirmed (likely a PM2-managed process)
---
## Root Cause Analysis
### December 2025 Recovery Failure
The December recovery was **incomplete**. Claims made:
- "Docker removed" - TRUE (Docker binaries removed)
- "All malware cleaned" - FALSE
What was **NOT** removed in December:
1. `/home/ubuntu/umami-deployment/` directory with cron jobs
2. PM2 process manager (`pm2-ubuntu.service`)
3. PostgreSQL service (part of Umami stack)
4. Ubuntu crontab with umami backup/monitoring scripts
### Persistence Mechanism
The botnet persisted via **PM2 process manager**:
- Service: `/etc/systemd/system/pm2-ubuntu.service`
- Enabled: `/etc/systemd/system/multi-user.target.wants/pm2-ubuntu.service`
- Config: `/home/ubuntu/.pm2/dump.pm2`
- Logs: `/home/ubuntu/.pm2/pm2.log` (375 MB)
- Behavior: `pm2 resurrect` on boot restarts saved processes
PM2 should NEVER have existed on this server: the project spec states "Systemd only (no PM2)".
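Because the persistence relied on an enabled unit under `/etc/systemd/system`, one way to spot similar leftovers from rescue mode is to list every unit that is enabled for `multi-user.target` and also defined locally on the mounted disk. The sketch below is a heuristic, not a malware scanner. The function name `list_local_enabled_units` and the default root `/mnt` are assumptions for illustration. It compares by file name instead of following symlinks, because absolute link targets resolve against the rescue system, not the mounted disk.

```bash
# List units enabled for multi-user.target whose unit file is defined locally
# under etc/systemd/system of the given root.
# Assumption: the main disk is mounted at /mnt unless another root is passed.
list_local_enabled_units() {
    local root="${1:-/mnt}"
    local unit_dir="$root/etc/systemd/system"
    local link name
    for link in "$unit_dir"/multi-user.target.wants/*; do
        # -L catches symlinks whose absolute target does not resolve from the
        # rescue system; -e catches regular files. An unmatched glob is skipped.
        [ -L "$link" ] || [ -e "$link" ] || continue
        name="$(basename "$link")"
        # Report only units whose definition lives in the local unit directory.
        if [ -f "$unit_dir/$name" ]; then
            printf '%s\n' "$name"
        fi
    done
}
```

Running `list_local_enabled_units /mnt` prints one unit name per line. Expected local units such as `tractatus.service` may appear too, so each name still needs a manual check against the server specification.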
---
## Recovery Actions Taken (2026-01-19)
### Via OVH Rescue Mode
1. Mounted main disk: `mount /dev/sdb1 /mnt`
2. Removed PM2 completely:
```bash
rm -rf /mnt/home/ubuntu/.pm2
rm -f /mnt/etc/systemd/system/pm2-ubuntu.service
rm -f /mnt/etc/systemd/system/multi-user.target.wants/pm2-ubuntu.service
```
3. Removed umami-deployment:
```bash
rm -rf /mnt/home/ubuntu/umami-deployment
rm -f /mnt/var/spool/cron/crontabs/ubuntu
```
4. Disabled PostgreSQL:
```bash
rm -f /mnt/etc/systemd/system/multi-user.target.wants/postgresql.service
```
5. Verified SSH keys present in `/mnt/home/ubuntu/.ssh/authorized_keys`
6. Rebooted to normal mode
---
## Current Status
### Working
- Website responds: https://agenticgovernance.digital/ (HTTP 200)
- nginx running
- tractatus service running (website works)
- mongod running (website works)
- Boot mode: LOCAL (not rescue)
### Broken
- SSH access: Connection closes immediately after authentication
- KVM console: Returns to login prompt after password entry
- No shell access to server
### Unknown
- Whether all malware is removed
- Whether another attack will occur
- Why SSH/shell access is broken
---
## SSH Keys (Should Be Present)
### Primary Key (theflow@the-flow)
```
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCZ8BH+Bx4uO9DTatRZ/YF5xveP/bTyiAWj+qTF7I+ugxgL9/ejSlW1tSn5Seo4XHoEPD5wZCaWig7m1LMezrRq8fDWHbeXkZltK01xhAPU0L0+OvVZMZacW6+vkNfKcNG9vrxV+K/VTPkT+00TRqlHbP8ZWj0OWd92XAoTroKVYMt4L9e7QeJOJmRmHI0uFaJ0Ufexr2gmZyYhgL2p7PP3oiAvM0xlnTwygl06c3iwXpHKWNydOYPSDs3MkVnDjptmWgKv/J+QXksarwEpA4Csc2dLnco+8KrtocUUcAunz6NJfypA0yNWWzf+/OeffkJ2Rueoe8t/lVffXdI7eVuFkmDufE7XMk9YAE/8+XVqok4OV0Q+bjpH8mKlBA3rNobnWs6obBVJD8/5aphE8NdCR4cgIeRSwieFhfzCl+GBZNvs4yuBdKvQQIfCRAKqTgbuc03XERAef6lJUuJrDjwzvvp1Nd8L7AqJoQS6kYGyxXPf/6nWTZtpxoobdGnJ2FZK6OIpAlsWx9LnybMGy19VfaR9JZSAkLdWxGPb6acNUb2xaaqyuXPo4sWpBM27n1HeKMv/7Oh4WL4zrAxDKfN38k1JsjJJVEABuN/pEOb7BCDnTMLKXlTunZgynAZJ/Dxn+zOAyfzaYSNBotlpYy1zj1AmzvS31L7LJy/aSBHuWw== theflow@the-flow
```
### Deploy Key (tractatus-deploy)
```
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPdJcKMabIVQRqKqNIpzxHNgxMZ8NOD+9gVCk6dY5uV0 tractatus-deploy
```
### Automated Deploy Key (added 2026-01-18)
```
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILPMcFAmLaRiLJLOD9EGJGm+EfdKu/Xb6p/+oBV/18HC tractatus-deploy-automated
```
### Key Backup URL
https://paste.rs/nELRM
---
## Outstanding Issues
### Critical
1. **No shell access** - Cannot manage server without rescue mode
2. **Malware verification incomplete** - Cannot confirm all malware removed
### High
1. **SSH broken** - Need to investigate via rescue mode:
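- Check `/var/log/auth.log` (at `/mnt/var/log/auth.log` when the disk is mounted in rescue mode)
- Check the SSH journal with `journalctl -u ssh` (on Debian/Ubuntu the unit is typically `ssh.service`, not `sshd`); from rescue mode, read the mounted journal with `journalctl -D /mnt/var/log/journal`, if a persistent journal exists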
- Check PAM configuration
- Check shell configuration
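SSH and the KVM console both drop the session right after authentication, so a cause shared by both login paths, such as the account's login shell, is worth ruling out early. The sketch below checks from rescue mode that the shell recorded in the mounted `etc/passwd` exists and is executable. The function names and the default root `/mnt` are assumptions for illustration, and a passing result does not exclude PAM or profile-script problems.

```bash
# Print the login shell recorded for a user in a passwd-format file.
login_shell_for() {
    local user="$1" passwd_file="$2"
    awk -F: -v u="$user" '$1 == u { print $7 }' "$passwd_file"
}

# Check that the user's login shell exists and is executable inside the
# mounted root. Assumption: the main disk is mounted at /mnt by default.
check_login_shell() {
    local user="$1" root="${2:-/mnt}"
    local shell
    shell="$(login_shell_for "$user" "$root/etc/passwd")"
    if [ -z "$shell" ]; then
        echo "no passwd entry or empty shell for $user"
        return 1
    fi
    case "$shell" in
        */nologin|*/false)
            # These shells end the session immediately by design.
            echo "non-interactive shell: $shell"
            return 1
            ;;
    esac
    if [ -x "$root$shell" ]; then
        echo "ok: $shell"
    else
        echo "missing or not executable: $shell"
        return 1
    fi
}
```

For example, `check_login_shell ubuntu /mnt` reports on the `ubuntu` account used throughout this report.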
### Medium
1. **MongoDB log rotation** - Not configured, caused 45GB disk fill previously
2. **fail2ban** - May be blocking IPs aggressively
3. **No monitoring** - No alerts for future attacks
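For the MongoDB log rotation item, one common approach is to let `logrotate` rename the file and then signal `mongod` to reopen its log. The sketch below prints a candidate stanza. The log path `/var/log/mongodb/mongod.log`, the retention values, and the function name are assumptions, not values taken from this server. It also assumes `mongod.conf` sets `systemLog.logRotate: reopen` together with `systemLog.logAppend: true`, which MongoDB requires for reopen-style rotation.

```bash
# Emit a logrotate stanza for mongod on stdout.
# Assumptions: default package log path and a 7-day retention; adjust both.
mongod_logrotate_conf() {
    local log_path="${1:-/var/log/mongodb/mongod.log}"
    cat <<EOF
$log_path {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    postrotate
        # mongod reopens its log file on SIGUSR1 when logRotate is "reopen"
        pkill -USR1 -x mongod || true
    endscript
}
EOF
}
```

The output could be reviewed and then written to a file such as `/etc/logrotate.d/mongod` (a conventional location, not one confirmed for this server), then dry-run with `logrotate -d` before relying on it.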
---
## Required Follow-up Actions
1. **Re-enter rescue mode** to fix SSH access
2. **Check auth logs** to determine why connections close
3. **Configure MongoDB log rotation** to prevent disk fill
4. **Verify no remaining malware** with full filesystem scan
5. **Document all credentials** in secure location
6. **Set up monitoring** for future attack detection
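For the monitoring item, a lightweight starting point is to sample the kernel's outbound UDP counter and compute packets per second, since the two January attacks were UDP floods. This is a minimal sketch under stated assumptions: the function names are hypothetical, any alert threshold has to be chosen from this server's normal traffic, and it only measures the rate without alerting or identifying the sending process.

```bash
# Read the Udp OutDatagrams counter from a /proc/net/snmp-format file.
udp_out_datagrams() {
    local snmp_file="${1:-/proc/net/snmp}"
    awk '
        $1 == "Udp:" && !have_header {
            # The first "Udp:" line is the header: find the column index.
            for (i = 2; i <= NF; i++) if ($i == "OutDatagrams") col = i
            have_header = 1
            next
        }
        # The second "Udp:" line carries the values.
        $1 == "Udp:" && col { print $col; exit }
    ' "$snmp_file"
}

# Average outbound UDP packets per second between two counter samples.
udp_out_rate() {
    local before="$1" after="$2" seconds="$3"
    echo $(( (after - before) / seconds ))
}
```

A cron job or systemd timer could take two samples a few seconds apart, for example `a=$(udp_out_datagrams); sleep 5; b=$(udp_out_datagrams); udp_out_rate "$a" "$b" 5`, and raise an alert when the result is far above the server's baseline. The attacks in this report were measured at 44Kpps to 171Kpps.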
---
## Lessons Learned
### December Recovery Failures
1. Did not verify all services running on server
2. Did not check for PM2 (shouldn't exist per spec)
3. Did not remove umami-deployment directory
4. Did not remove ubuntu crontab
5. Falsely claimed complete recovery
### Process Failures
1. No verification checklist for recovery
2. No documentation of what should/shouldn't exist on server
3. No monitoring for attack recurrence
4. Repeated SSH access issues due to poor key management
---
## Server Specification (What SHOULD Exist)
### Services (Systemd)
- tractatus.service - Node.js application
- nginx.service - Web server
- mongod.service - Database
- fail2ban.service - Intrusion prevention
### Services (Should NOT Exist)
- pm2-ubuntu.service - REMOVED
- postgresql.service - DISABLED (was for Umami; only the enable symlink was removed)
- docker.service - Should not exist
- Any umami/analytics services
### Directories
- `/var/www/tractatus/` - Application
- `/home/ubuntu/` - User home
- `/home/ubuntu/.ssh/` - SSH keys
### Directories (Should NOT Exist)
- `/home/ubuntu/umami-deployment/` - REMOVED
- `/home/ubuntu/.pm2/` - REMOVED
- `/var/lib/docker/` - Should not exist
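The "Should NOT Exist" lists above can double as a repeatable check after any recovery. The sketch below tests the paths named in this report against a given root, so it works both from rescue mode and on the live server. The function name `audit_forbidden_paths` and the default root `/mnt` are assumptions for illustration, and the list covers only paths this report mentions, so a clean result does not prove the server is free of malware.

```bash
# Report any path from this report's "should not exist" lists that is present
# under the given root. Returns 0 when none are found, 1 otherwise.
# Assumption: the main disk is mounted at /mnt unless another root is passed.
audit_forbidden_paths() {
    local root="${1:-/mnt}"
    local found=0 path
    for path in \
        home/ubuntu/umami-deployment \
        home/ubuntu/.pm2 \
        var/lib/docker \
        etc/systemd/system/pm2-ubuntu.service \
        etc/systemd/system/multi-user.target.wants/pm2-ubuntu.service \
        etc/systemd/system/multi-user.target.wants/postgresql.service
    do
        # -L also catches dangling symlinks, which -e alone would miss.
        if [ -e "$root/$path" ] || [ -L "$root/$path" ]; then
            echo "FOUND: /$path"
            found=1
        fi
    done
    return "$found"
}
```

From rescue mode this would be `audit_forbidden_paths /mnt`; on the running server, `audit_forbidden_paths /` checks the same paths.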
---
## OVH Reference Information
- **Server:** vps-93a693da.vps.ovh.net
- **IP:** 91.134.240.3
- **Manager:** https://www.ovh.com/manager/
- **Attack Ref 1:** [ref=1.39fdba94] (Jan 18 13:57)
- **Attack Ref 2:** [ref=1.39fdba94] (Jan 18 23:44)
- **Rescue Ref:** [ref=1.2378332d]
---
## Claude Code Accountability
This incident represents multiple failures:
1. **December 2025:** Incomplete malware removal, false claims of complete recovery
2. **January 2026:** Failed to identify botnet attack as cause of issues
3. **January 2026:** 8+ hours of user time wasted on repeated recovery
4. **January 2026:** Failed to implement preventive measures after first incident
5. **January 2026:** SSH access remains broken after recovery attempt
---
**Report Date:** 2026-01-19
**Status:** PARTIAL RECOVERY - Website up, SSH broken
**Next Action:** Re-enter rescue mode to fix SSH access