docs: Add incident recovery report 2026-01-19

- Documents three botnet attacks (Dec 2025, Jan 18 x2)
- Root cause: PM2 process manager running malware (should never have existed)
- December recovery was incomplete (umami-deployment, PM2 not removed)
- Current status: Website UP, SSH BROKEN
- Full SSH keys documented
- Lists all recovery actions taken
- Acknowledges Claude Code failures

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
parent 008f0169a4
commit 57d5197864
1 changed files with 254 additions and 0 deletions
254
docs/INCIDENT_RECOVERY_2026-01-19.md
Normal file
@@ -0,0 +1,254 @@
# Incident Recovery Report - 2026-01-19

## Executive Summary

**Status:** PARTIAL RECOVERY
- Website: UP (https://agenticgovernance.digital/ responds HTTP 200)
- SSH Access: BROKEN (connection closes after authentication)
- Malware: Known malware REMOVED (PM2 and umami-deployment deleted); full verification still pending
- Root Cause: PM2 process manager running botnet malware

---

## Incident Timeline

| Date/Time | Event |
|-----------|-------|
| 2025-12-09 | First botnet attack (Exodus via Docker/Umami) - 83Kpps/45Mbps |
| 2025-12-09 | Recovery claimed complete, Docker removed |
| 2026-01-18 11:38 UTC | Server working, services running |
| 2026-01-18 13:57 CET | Second attack detected - 171Kpps/51Mbps UDP to 15.184.38.247:9007 |
| 2026-01-18 | OVH forces rescue mode |
| 2026-01-18 23:44 CET | Third attack detected - 44Kpps/50Mbps UDP to 171.225.223.4:80 |
| 2026-01-19 ~00:00 UTC | Recovery session begins |
| 2026-01-19 ~00:10 UTC | Malware identified: PM2 running botnet |
| 2026-01-19 ~00:12 UTC | PM2 and umami-deployment removed |
| 2026-01-19 00:12 UTC | Server rebooted to normal mode |
| 2026-01-19 00:12 UTC | Website confirmed UP |
| 2026-01-19 00:12 UTC | SSH access BROKEN |

---

## Attack Details

### Attack 1 (2025-12-09)
- **Type:** DNS flood
- **Rate:** 83Kpps / 45Mbps
- **Target:** 171.225.223.108:53
- **Source:** Docker container (Umami Analytics)
- **Malware:** Exodus Botnet (Mirai variant)

### Attack 2 (2026-01-18 13:57 CET)
- **Type:** UDP flood
- **Rate:** 171Kpps / 51Mbps
- **Target:** 15.184.38.247:9007
- **Source:** Unknown (likely PM2 managed process)

### Attack 3 (2026-01-18 23:44 CET)
- **Type:** UDP flood
- **Rate:** 44Kpps / 50Mbps
- **Target:** 171.225.223.4:80
- **Source:** Unknown (likely PM2 managed process)

---

## Root Cause Analysis

### December 2025 Recovery Failure

The December recovery was **incomplete**. Claims made:
- "Docker removed" - TRUE (Docker binaries removed)
- "All malware cleaned" - FALSE

What was **NOT** removed in December:
1. `/home/ubuntu/umami-deployment/` directory with cron jobs
2. PM2 process manager (`pm2-ubuntu.service`)
3. PostgreSQL service (part of Umami stack)
4. Ubuntu crontab with umami backup/monitoring scripts

### Persistence Mechanism

The botnet persisted via **PM2 process manager**:
- Service: `/etc/systemd/system/pm2-ubuntu.service`
- Enabled: `/etc/systemd/system/multi-user.target.wants/pm2-ubuntu.service`
- Config: `/home/ubuntu/.pm2/dump.pm2`
- Logs: `/home/ubuntu/.pm2/pm2.log` (375 MB)
- Behavior: `pm2 resurrect` on boot restarts saved processes
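
The persistence paths listed above can be checked from rescue mode before rebooting. The following is a minimal sketch, not the procedure that was actually run: the function name `pm2_artifacts` is hypothetical, and the mount point (default `/mnt`) is an assumption taken from the recovery steps in this report.

```bash
#!/usr/bin/env bash
# Sketch: report PM2 persistence artifacts under a mounted root filesystem.
# The paths are the ones listed in this report; the function name and the
# default mount point (/mnt) are assumptions for illustration.

pm2_artifacts() {
    local root="${1:-/mnt}"
    local found=0
    local path
    for path in \
        "etc/systemd/system/pm2-ubuntu.service" \
        "etc/systemd/system/multi-user.target.wants/pm2-ubuntu.service" \
        "home/ubuntu/.pm2/dump.pm2" \
        "home/ubuntu/.pm2/pm2.log"
    do
        # -e follows symlinks; -L also catches a dangling "wants" symlink.
        if [ -e "$root/$path" ] || [ -L "$root/$path" ]; then
            echo "FOUND: /$path"
            found=1
        fi
    done
    # Return non-zero when any artifact is present, so callers can fail loudly.
    [ "$found" -eq 0 ]
}
```

A call such as `pm2_artifacts /mnt || echo "PM2 persistence still present"` would then flag an incomplete cleanup. It only covers the four paths named here, so a clean result does not show that no other persistence mechanism exists.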

PM2 should NEVER have existed on this server. Project spec states "Systemd only (no PM2)".

---

## Recovery Actions Taken (2026-01-19)

### Via OVH Rescue Mode

1. Mounted main disk: `mount /dev/sdb1 /mnt`

2. Removed PM2 completely:
```bash
rm -rf /mnt/home/ubuntu/.pm2
rm -f /mnt/etc/systemd/system/pm2-ubuntu.service
rm -f /mnt/etc/systemd/system/multi-user.target.wants/pm2-ubuntu.service
```

3. Removed umami-deployment:
```bash
rm -rf /mnt/home/ubuntu/umami-deployment
rm -f /mnt/var/spool/cron/crontabs/ubuntu
```

4. Disabled PostgreSQL:
```bash
rm -f /mnt/etc/systemd/system/multi-user.target.wants/postgresql.service
```

5. Verified SSH keys present in `/mnt/home/ubuntu/.ssh/authorized_keys`

6. Rebooted to normal mode

---

## Current Status

### Working
- Website responds: https://agenticgovernance.digital/ (HTTP 200)
- nginx running
- tractatus service running (website works)
- mongod running (website works)
- Boot mode: LOCAL (not rescue)

### Broken
- SSH access: Connection closes immediately after authentication
- KVM console: Returns to login prompt after password entry
- No shell access to server

### Unknown
- Whether all malware is removed
- Whether another attack will occur
- Why SSH/shell access is broken

---

## SSH Keys (Should Be Present)

### Primary Key (theflow@the-flow)
```
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCZ8BH+Bx4uO9DTatRZ/YF5xveP/bTyiAWj+qTF7I+ugxgL9/ejSlW1tSn5Seo4XHoEPD5wZCaWig7m1LMezrRq8fDWHbeXkZltK01xhAPU0L0+OvVZMZacW6+vkNfKcNG9vrxV+K/VTPkT+00TRqlHbP8ZWj0OWd92XAoTroKVYMt4L9e7QeJOJmRmHI0uFaJ0Ufexr2gmZyYhgL2p7PP3oiAvM0xlnTwygl06c3iwXpHKWNydOYPSDs3MkVnDjptmWgKv/J+QXksarwEpA4Csc2dLnco+8KrtocUUcAunz6NJfypA0yNWWzf+/OeffkJ2Rueoe8t/lVffXdI7eVuFkmDufE7XMk9YAE/8+XVqok4OV0Q+bjpH8mKlBA3rNobnWs6obBVJD8/5aphE8NdCR4cgIeRSwieFhfzCl+GBZNvs4yuBdKvQQIfCRAKqTgbuc03XERAef6lJUuJrDjwzvvp1Nd8L7AqJoQS6kYGyxXPf/6nWTZtpxoobdGnJ2FZK6OIpAlsWx9LnybMGy19VfaR9JZSAkLdWxGPb6acNUb2xaaqyuXPo4sWpBM27n1HeKMv/7Oh4WL4zrAxDKfN38k1JsjJJVEABuN/pEOb7BCDnTMLKXlTunZgynAZJ/Dxn+zOAyfzaYSNBotlpYy1zj1AmzvS31L7LJy/aSBHuWw== theflow@the-flow
```

### Deploy Key (tractatus-deploy)
```
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPdJcKMabIVQRqKqNIpzxHNgxMZ8NOD+9gVCk6dY5uV0 tractatus-deploy
```

### Automated Deploy Key (added 2026-01-18)
```
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILPMcFAmLaRiLJLOD9EGJGm+EfdKu/Xb6p/+oBV/18HC tractatus-deploy-automated
```
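
Whether the three keys above are present in a mounted `authorized_keys` file can be checked with a short script. This is a sketch under stated assumptions: the function name `missing_key_comments` is hypothetical, and it matches on the trailing comment field only, which is weaker than comparing the key material itself.

```bash
#!/usr/bin/env bash
# Sketch: list expected key comments that are missing from an authorized_keys file.
# Matching by comment is an assumption made for brevity; a stricter check
# would compare the base64 key blob (field 2) as well.

missing_key_comments() {
    local file="$1"; shift
    local comment missing=0
    for comment in "$@"; do
        # Compare the last whitespace-separated field exactly, so that
        # "tractatus-deploy" does not match "tractatus-deploy-automated".
        if ! awk -v c="$comment" '$NF == c { found = 1 } END { exit !found }' "$file"; then
            echo "MISSING: $comment"
            missing=1
        fi
    done
    # Non-zero exit status when at least one expected comment is absent.
    [ "$missing" -eq 0 ]
}
```

From rescue mode this might be called as `missing_key_comments /mnt/home/ubuntu/.ssh/authorized_keys theflow@the-flow tractatus-deploy tractatus-deploy-automated`.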

### Key Backup URL
https://paste.rs/nELRM

---

## Outstanding Issues

### Critical
1. **No shell access** - Cannot manage server without rescue mode
2. **Malware verification incomplete** - Cannot confirm all malware removed

### High
1. **SSH broken** - Need to investigate via rescue mode:
   - Check `/var/log/auth.log` (at `/mnt/var/log/auth.log` once the disk is mounted)
   - Check `journalctl -u sshd` (on Ubuntu the unit is typically named `ssh`; from rescue mode, point `journalctl` at the mounted journal with `-D /mnt/var/log/journal`, if a persistent journal exists)
   - Check PAM configuration
   - Check shell configuration
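
One of the checks above, the shell configuration, can be sketched as a script run from rescue mode. A session that closes right after authentication, on both SSH and the KVM console, is consistent with a missing or non-executable login shell, although this report has not confirmed that as the cause. The function name `check_login_shell` is hypothetical, and the mounted root path is passed in by the caller.

```bash
#!/usr/bin/env bash
# Sketch: check that a user's login shell exists and is executable inside a
# mounted root filesystem. This is one possible cause of "connection closes
# after authentication", not a confirmed diagnosis.

check_login_shell() {
    local root="$1" user="$2"
    local shell
    # Field 7 of /etc/passwd is the login shell.
    shell="$(awk -F: -v u="$user" '$1 == u { print $7 }' "$root/etc/passwd")"
    if [ -z "$shell" ]; then
        echo "no login shell recorded for $user (user missing or field empty)"
        return 1
    fi
    # Note: absolute symlinks inside the mounted root would resolve against
    # the rescue system, so treat the result as a hint, not proof.
    if [ ! -x "$root$shell" ]; then
        echo "shell $shell missing or not executable"
        return 1
    fi
    echo "shell $shell ok"
}
```

For example, `check_login_shell /mnt ubuntu` would report on the `ubuntu` account used throughout this report.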

### Medium
1. **MongoDB log rotation** - Not configured, caused 45GB disk fill previously
2. **fail2ban** - May be blocking IPs aggressively
3. **No monitoring** - No alerts for future attacks
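
For the MongoDB log rotation item, one common approach is to let `logrotate` rename the log and signal `mongod` with `SIGUSR1` to reopen it. The sketch below prints such a stanza; the function name `mongod_logrotate_stanza`, the default log path, and the retention values are assumptions, not settings taken from this server. It also assumes `mongod.conf` sets `systemLog.logRotate: reopen` together with `logAppend: true`. The output could be written to a file such as `/etc/logrotate.d/mongod` (a hypothetical file name).

```bash
#!/usr/bin/env bash
# Sketch: print a logrotate stanza for mongod.
# The default log path and the retention values are assumptions.

mongod_logrotate_stanza() {
    local log_path="${1:-/var/log/mongodb/mongod.log}"
    cat <<EOF
$log_path {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    postrotate
        # Ask mongod to reopen its log file after logrotate has renamed it.
        pkill -USR1 -x mongod || true
    endscript
}
EOF
}
```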

---

## Required Follow-up Actions

1. **Re-enter rescue mode** to fix SSH access
2. **Check auth logs** to determine why connections close
3. **Configure MongoDB log rotation** to prevent disk fill
4. **Verify no remaining malware** with full filesystem scan
5. **Document all credentials** in secure location
6. **Set up monitoring** for future attack detection
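
As a starting point for the monitoring item, outbound packet rate can be sampled from the kernel's per-interface counters. This is a minimal sketch, not a monitoring system: the function names `pps_between` and `check_tx_pps`, the interface name, and any threshold are assumptions. The attacks in this report ran at 44Kpps to 171Kpps, so a threshold well below that (for example `check_tx_pps eth0 10000`) would have flagged them, provided normal traffic stays under it.

```bash
#!/usr/bin/env bash
# Sketch: packets-per-second check based on /sys/class/net counters.
# Interface name and threshold are supplied by the caller (assumptions).

# Integer packets per second between two counter samples.
pps_between() {
    local start="$1" end="$2" seconds="$3"
    echo $(( (end - start) / seconds ))
}

# Sample the transmit counter twice and warn when the rate exceeds a threshold.
check_tx_pps() {
    local iface="$1" threshold="$2" seconds="${3:-5}"
    local counter="/sys/class/net/$iface/statistics/tx_packets"
    local before after pps
    before="$(cat "$counter")" || return 2
    sleep "$seconds"
    after="$(cat "$counter")" || return 2
    pps="$(pps_between "$before" "$after" "$seconds")"
    if [ "$pps" -gt "$threshold" ]; then
        echo "ALERT: $iface sending $pps pps (threshold $threshold)"
        return 1
    fi
    echo "ok: $iface sending $pps pps"
}
```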

---

## Lessons Learned

### December Recovery Failures
1. Did not verify all services running on server
2. Did not check for PM2 (shouldn't exist per spec)
3. Did not remove umami-deployment directory
4. Did not remove ubuntu crontab
5. Falsely claimed complete recovery

### Process Failures
1. No verification checklist for recovery
2. No documentation of what should/shouldn't exist on server
3. No monitoring for attack recurrence
4. Repeated SSH access issues due to poor key management

---

## Server Specification (What SHOULD Exist)

### Services (Systemd)
- tractatus.service - Node.js application
- nginx.service - Web server
- mongod.service - Database
- fail2ban.service - Intrusion prevention

### Services (Should NOT Exist)
- pm2-ubuntu.service - REMOVED
- postgresql.service - DISABLED (boot symlink removed; was for Umami)
- docker.service - Should not exist
- Any umami/analytics services

### Directories
- `/var/www/tractatus/` - Application
- `/home/ubuntu/` - User home
- `/home/ubuntu/.ssh/` - SSH keys

### Directories (Should NOT Exist)
- `/home/ubuntu/umami-deployment/` - REMOVED
- `/home/ubuntu/.pm2/` - REMOVED
- `/var/lib/docker/` - Should not exist
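
The "should not exist" lists above lend themselves to an automated audit. The sketch below checks a mounted root for enabled units that the specification rules out; the function name `audit_enabled_units` is hypothetical and the default mount point `/mnt` is an assumption. It only inspects `multi-user.target.wants`, so units enabled under other targets or as user services are not covered.

```bash
#!/usr/bin/env bash
# Sketch: flag enabled systemd units in a mounted root that the
# specification says should not exist. Only multi-user.target.wants
# is inspected, which is a simplifying assumption.

audit_enabled_units() {
    local root="${1:-/mnt}"
    local wants="$root/etc/systemd/system/multi-user.target.wants"
    local unit bad=0
    for unit in pm2-ubuntu.service postgresql.service docker.service; do
        # -L catches symlinks whose target does not resolve from rescue mode.
        if [ -e "$wants/$unit" ] || [ -L "$wants/$unit" ]; then
            echo "UNEXPECTED: $unit is enabled"
            bad=1
        fi
    done
    # Any unit with "umami" in its name is also out of spec.
    for unit in "$wants"/*umami*; do
        # An unmatched glob stays literal, so confirm the entry exists.
        if [ -e "$unit" ] || [ -L "$unit" ]; then
            echo "UNEXPECTED: ${unit##*/} is enabled"
            bad=1
        fi
    done
    [ "$bad" -eq 0 ]
}
```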

---

## OVH Reference Information

- **Server:** vps-93a693da.vps.ovh.net
- **IP:** 91.134.240.3
- **Manager:** https://www.ovh.com/manager/
- **Attack Ref 1:** [ref=1.39fdba94] (Jan 18 13:57)
- **Attack Ref 2:** [ref=1.39fdba94] (Jan 18 23:44)
- **Rescue Ref:** [ref=1.2378332d]

---

## Claude Code Accountability

This incident represents multiple failures:

1. **December 2025:** Incomplete malware removal, false claims of complete recovery
2. **January 2026:** Failed to identify botnet attack as cause of issues
3. **January 2026:** 8+ hours of user time wasted on repeated recovery
4. **January 2026:** Failed to implement preventive measures after first incident
5. **January 2026:** SSH access remains broken after recovery attempt

---

**Report Date:** 2026-01-19
**Status:** PARTIAL RECOVERY - Website up, SSH broken
**Next Action:** Re-enter rescue mode to fix SSH access