tractatus/docs/INCIDENT_RECOVERY_2026-01-19.md
TheFlow b302960a61 docs: Complete VPS recovery documentation and attack reference
- Update INCIDENT_RECOVERY_2026-01-19.md with complete recovery status
- Create VPS_RECOVERY_REFERENCE.md with step-by-step recovery guide
- Update remediation plan to show executed status
- Update OVH rescue mode doc with resolution notes

Documents the successful complete reinstall approach after multiple
failed partial cleanup attempts. Includes attack indicators, banned
software list, and verification checklist for future incidents.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-20 12:06:32 +13:00


Incident Recovery Report - 2026-01-19/20

Executive Summary

Status: COMPLETE RECOVERY (Updated 2026-01-20)

  • Website: UP (https://agenticgovernance.digital/ responds HTTP 200)
  • SSH Access: WORKING (via fresh VPS reinstall)
  • Malware: ELIMINATED (complete OS reinstall)
  • Application: FULLY RESTORED
  • Database: MIGRATED from local backup (134 documents)
  • SSL: VALID (Let's Encrypt, expires April 2026)
  • Root Cause: PM2 process manager running Exodus botnet malware

Incident Timeline

Date/Time | Event
2025-12-09 | First botnet attack (Exodus via Docker/Umami) - 83Kpps/45Mbps
2025-12-09 | Recovery claimed complete, Docker removed
2026-01-18 11:38 UTC | Server working, services running
2026-01-18 13:57 CET | Second attack detected - 171Kpps/51Mbps UDP to 15.184.38.247:9007
2026-01-18 | OVH forces rescue mode
2026-01-18 23:44 CET | Third attack detected - 44Kpps/50Mbps UDP to 171.225.223.4:80
2026-01-19 ~00:00 UTC | Recovery session begins
2026-01-19 ~00:10 UTC | Malware identified: PM2 running botnet
2026-01-19 ~00:12 UTC | PM2 and umami-deployment removed
2026-01-19 00:12 UTC | Server rebooted to normal mode
2026-01-19 00:12 UTC | Website confirmed UP
2026-01-19 00:12 UTC | SSH access BROKEN

Attack Details

Attack 1 (2025-12-09)

  • Type: DNS flood
  • Rate: 83Kpps / 45Mbps
  • Target: 171.225.223.108:53
  • Source: Docker container (Umami Analytics)
  • Malware: Exodus Botnet (Mirai variant)

Attack 2 (2026-01-18 13:57 CET)

  • Type: UDP flood
  • Rate: 171Kpps / 51Mbps
  • Target: 15.184.38.247:9007
  • Source: Unknown (likely a PM2-managed process)

Attack 3 (2026-01-18 23:44 CET)

  • Type: UDP flood
  • Rate: 44Kpps / 50Mbps
  • Target: 171.225.223.4:80
  • Source: Unknown (likely a PM2-managed process)

Root Cause Analysis

December 2025 Recovery Failure

The December recovery was incomplete. Claims made:

  • "Docker removed" - TRUE (Docker binaries removed)
  • "All malware cleaned" - FALSE

What was NOT removed in December:

  1. /home/ubuntu/umami-deployment/ directory with cron jobs
  2. PM2 process manager (pm2-ubuntu.service)
  3. PostgreSQL service (part of Umami stack)
  4. Ubuntu crontab with umami backup/monitoring scripts

Persistence Mechanism

The botnet persisted via PM2 process manager:

  • Service: /etc/systemd/system/pm2-ubuntu.service
  • Enabled: /etc/systemd/system/multi-user.target.wants/pm2-ubuntu.service
  • Config: /home/ubuntu/.pm2/dump.pm2
  • Logs: /home/ubuntu/.pm2/pm2.log (375 MB)
  • Behavior: on every boot, pm2 resurrect restarts the processes saved in dump.pm2

PM2 should NEVER have existed on this server. Project spec states "Systemd only (no PM2)".
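
A check along the following lines could surface these artifacts from rescue mode before a cleanup is declared complete. It is a minimal sketch, assuming the compromised disk is mounted under a root prefix such as /mnt; the function name find_persistence_artifacts is hypothetical, and the path list covers only the locations named in this report, so an empty result is not proof of a clean system.

```shell
#!/bin/sh
# Sketch: print known persistence artifacts found under a mounted root.
# The path list comes from this report; it is not an exhaustive malware scan.
find_persistence_artifacts() {
    root="${1:-/mnt}"   # assumed mount point of the main disk in rescue mode
    for rel in \
        etc/systemd/system/pm2-ubuntu.service \
        etc/systemd/system/multi-user.target.wants/pm2-ubuntu.service \
        home/ubuntu/.pm2 \
        home/ubuntu/umami-deployment \
        var/spool/cron/crontabs/ubuntu \
        var/lib/docker
    do
        # -e matches files and directories; -L also catches dangling symlinks,
        # which is what an enabled unit looks like once its target is deleted.
        if [ -e "$root/$rel" ] || [ -L "$root/$rel" ]; then
            printf '%s\n' "$root/$rel"
        fi
    done
}
```

Running find_persistence_artifacts /mnt prints one path per artifact still present and prints nothing when none of the listed paths exist.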


Recovery Actions Taken (2026-01-19)

Via OVH Rescue Mode

  1. Mounted main disk: mount /dev/sdb1 /mnt

  2. Removed PM2 completely:

rm -rf /mnt/home/ubuntu/.pm2
rm -f /mnt/etc/systemd/system/pm2-ubuntu.service
rm -f /mnt/etc/systemd/system/multi-user.target.wants/pm2-ubuntu.service

  3. Removed umami-deployment:

rm -rf /mnt/home/ubuntu/umami-deployment
rm -f /mnt/var/spool/cron/crontabs/ubuntu

  4. Disabled PostgreSQL:

rm -f /mnt/etc/systemd/system/multi-user.target.wants/postgresql.service

  5. Verified SSH keys present in /mnt/home/ubuntu/.ssh/authorized_keys

  6. Rebooted to normal mode


Current Status (as of 2026-01-19, before reinstall)

Working

  • Website responds: https://agenticgovernance.digital/ (HTTP 200)
  • nginx running
  • tractatus service running (website works)
  • mongod running (website works)
  • Boot mode: LOCAL (not rescue)

Broken

  • SSH access: Connection closes immediately after authentication
  • KVM console: Returns to login prompt after password entry
  • No shell access to server

Unknown

  • Whether all malware is removed
  • Whether another attack will occur
  • Why SSH/shell access is broken

SSH Keys (Should Be Present)

Primary Key (theflow@the-flow)

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCZ8BH+Bx4uO9DTatRZ/YF5xveP/bTyiAWj+qTF7I+ugxgL9/ejSlW1tSn5Seo4XHoEPD5wZCaWig7m1LMezrRq8fDWHbeXkZltK01xhAPU0L0+OvVZMZacW6+vkNfKcNG9vrxV+K/VTPkT+00TRqlHbP8ZWj0OWd92XAoTroKVYMt4L9e7QeJOJmRmHI0uFaJ0Ufexr2gmZyYhgL2p7PP3oiAvM0xlnTwygl06c3iwXpHKWNydOYPSDs3MkVnDjptmWgKv/J+QXksarwEpA4Csc2dLnco+8KrtocUUcAunz6NJfypA0yNWWzf+/OeffkJ2Rueoe8t/lVffXdI7eVuFkmDufE7XMk9YAE/8+XVqok4OV0Q+bjpH8mKlBA3rNobnWs6obBVJD8/5aphE8NdCR4cgIeRSwieFhfzCl+GBZNvs4yuBdKvQQIfCRAKqTgbuc03XERAef6lJUuJrDjwzvvp1Nd8L7AqJoQS6kYGyxXPf/6nWTZtpxoobdGnJ2FZK6OIpAlsWx9LnybMGy19VfaR9JZSAkLdWxGPb6acNUb2xaaqyuXPo4sWpBM27n1HeKMv/7Oh4WL4zrAxDKfN38k1JsjJJVEABuN/pEOb7BCDnTMLKXlTunZgynAZJ/Dxn+zOAyfzaYSNBotlpYy1zj1AmzvS31L7LJy/aSBHuWw== theflow@the-flow

Deploy Key (tractatus-deploy)

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPdJcKMabIVQRqKqNIpzxHNgxMZ8NOD+9gVCk6dY5uV0 tractatus-deploy

Automated Deploy Key (added 2026-01-18)

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILPMcFAmLaRiLJLOD9EGJGm+EfdKu/Xb6p/+oBV/18HC tractatus-deploy-automated

Key Backup URL

https://paste.rs/nELRM


Outstanding Issues (as of 2026-01-19, before reinstall)

Critical

  1. No shell access - Cannot manage server without rescue mode
  2. Malware verification incomplete - Cannot confirm all malware removed

High

  1. SSH broken - Need to investigate via rescue mode:
    • Check /var/log/auth.log
    • Check journalctl -u sshd
    • Check PAM configuration
    • Check shell configuration
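
A session that closes right after authentication, on both SSH and the KVM console, is consistent with a missing or non-executable login shell, although PAM or shell startup files can produce the same symptom. The shell configuration check could be sketched as follows, assuming the main disk is mounted at a root prefix in rescue mode; check_login_shell is a hypothetical helper, not something that was run during this incident.

```shell
#!/bin/sh
# Sketch: report whether a user's login shell exists under a mounted root.
# Prints one of: "missing", "disabled <shell>", "ok <shell>", "broken <shell>".
check_login_shell() {
    root="$1"   # mount point of the main disk, e.g. /mnt
    user="$2"
    # Field 7 of /etc/passwd is the login shell. An empty result means the
    # user has no entry or the shell field itself is empty.
    shell=$(awk -F: -v u="$user" '$1 == u { print $7 }' "$root/etc/passwd")
    if [ -z "$shell" ]; then
        echo "missing"
        return
    fi
    case "$shell" in
        */nologin|*/false)
            # Account is configured to refuse interactive logins.
            echo "disabled $shell"
            ;;
        *)
            # Note: absolute symlinks inside the mounted root resolve against
            # the rescue system, so treat "broken" as a hint, not a verdict.
            if [ -x "$root$shell" ]; then
                echo "ok $shell"
            else
                echo "broken $shell"
            fi
            ;;
    esac
}
```

For example, check_login_shell /mnt ubuntu reports on the account used in this report. The authentication log can be read from the same mount with tail -n 50 /mnt/var/log/auth.log, and if a persistent journal exists on the disk, journalctl -D /mnt/var/log/journal -u ssh reads it from rescue mode (on Ubuntu the OpenSSH unit is named ssh.service).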

Medium

  1. MongoDB log rotation - Not configured, caused 45GB disk fill previously
  2. fail2ban - May be blocking IPs aggressively
  3. No monitoring - No alerts for future attacks

Required Follow-up Actions

  1. Re-enter rescue mode to fix SSH access
  2. Check auth logs to determine why connections close
  3. Configure MongoDB log rotation to prevent disk fill
  4. Verify no remaining malware with full filesystem scan
  5. Document all credentials in secure location
  6. Set up monitoring for future attack detection

Lessons Learned

December Recovery Failures

  1. Did not verify all services running on server
  2. Did not check for PM2 (shouldn't exist per spec)
  3. Did not remove umami-deployment directory
  4. Did not remove ubuntu crontab
  5. Falsely claimed complete recovery

Process Failures

  1. No verification checklist for recovery
  2. No documentation of what should/shouldn't exist on server
  3. No monitoring for attack recurrence
  4. Repeated SSH access issues due to poor key management

Server Specification (What SHOULD Exist)

Services (Systemd)

  • tractatus.service - Node.js application
  • nginx.service - Web server
  • mongod.service - Database
  • fail2ban.service - Intrusion prevention

Services (Should NOT Exist)

  • pm2-ubuntu.service - REMOVED
  • postgresql.service - REMOVED (was for Umami)
  • docker.service - Should not exist
  • Any umami/analytics services

Directories

  • /var/www/tractatus/ - Application
  • /home/ubuntu/ - User home
  • /home/ubuntu/.ssh/ - SSH keys

Directories (Should NOT Exist)

  • /home/ubuntu/umami-deployment/ - REMOVED
  • /home/ubuntu/.pm2/ - REMOVED
  • /var/lib/docker/ - Should not exist
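
The service part of this specification can be turned into a repeatable check. The sketch below is one possible shape for it: it filters a list of unit names against the banned services listed above. The function name find_banned_units and the exact patterns are assumptions made for illustration; the patterns would need extending if other analytics services are ever in scope.

```shell
#!/bin/sh
# Sketch: read unit names on stdin (one per line, extra columns allowed)
# and print those that this report says must not exist on the server.
find_banned_units() {
    awk '{ print $1 }' |
        grep -E '^(pm2-.*|postgresql.*|docker|umami.*)\.service$' || true
    # "|| true" keeps the exit status at 0 when nothing matches, so the
    # function is safe to call from scripts running under "set -e".
}
```

On a running server the input could come from systemctl list-unit-files --type=service --state=enabled --no-legend | find_banned_units, where any output line is a deviation from the specification.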

OVH Reference Information

  • Server: vps-93a693da.vps.ovh.net
  • IP: 91.134.240.3
  • Manager: https://www.ovh.com/manager/
  • Attack Ref 1: [ref=1.39fdba94] (Jan 18 13:57)
  • Attack Ref 2: [ref=1.39fdba94] (Jan 18 23:44)
  • Rescue Ref: [ref=1.2378332d]

Claude Code Accountability

This incident represents multiple failures:

  1. December 2025: Incomplete malware removal, false claims of complete recovery
  2. January 2026: Failed to identify botnet attack as cause of issues
  3. January 2026: 8+ hours of user time wasted on repeated recovery
  4. January 2026: Failed to implement preventive measures after first incident
  5. January 2026: SSH access remains broken after recovery attempt


COMPLETE RECOVERY - 2026-01-20

What Was Done

After multiple failed partial cleanup attempts, the decision was made to perform a complete VPS reinstallation as recommended in the remediation plan.

Phase 1: VPS Reinstallation via OVH Manager

  • User initiated complete OS reinstall from OVH Manager
  • Fresh Ubuntu installation with new credentials
  • All malware completely eliminated by full disk wipe

Phase 2: System Setup

# Security tools
apt install -y fail2ban rkhunter chkrootkit

# Daily security monitoring script
/usr/local/bin/daily-security-check.sh

# MongoDB with log rotation
apt install -y mongodb-org
# Configured logrotate for /var/log/mongodb/
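
The logrotate configuration itself is not recorded in this report. A minimal sketch of what it might look like is shown below, written as a shell function so it can be reviewed before being installed. Every value is an assumption: the write_mongod_logrotate name, the 7-day retention, the mongodb user and group, and the destination path. It also assumes mongod.conf sets systemLog.logRotate to reopen with logAppend: true, so that SIGUSR1 makes mongod reopen the log file rather than rename it.

```shell
#!/bin/sh
# Sketch: write a logrotate policy for MongoDB logs to the given file.
# All values below are illustrative assumptions, not the deployed config.
write_mongod_logrotate() {
    dest="${1:-/etc/logrotate.d/mongod}"
    cat > "$dest" <<'EOF'
/var/log/mongodb/*.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    create 640 mongodb mongodb
    sharedscripts
    postrotate
        /bin/kill -USR1 "$(pidof mongod)" 2>/dev/null || true
    endscript
}
EOF
}
```

Writing to a scratch file first, for example write_mongod_logrotate /tmp/mongod.logrotate, allows a dry run with logrotate -d /tmp/mongod.logrotate before anything is placed under /etc/logrotate.d/.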

Phase 3: Application Deployment

  1. Created /var/www/tractatus/ directory
  2. Created production .env file with NODE_ENV=production
  3. Deployed application via rsync from local (CLEAN source)
  4. Installed dependencies including @anthropic-ai/sdk
  5. Created systemd service (/etc/systemd/system/tractatus.service)
  6. Configured nginx with SSL reverse proxy

Phase 4: SSL Certificate

certbot --nginx -d agenticgovernance.digital
# Certificate valid until April 2026

Phase 5: Database Migration

# Local: Export database
mongodump --db tractatus_dev --out ~/tractatus-backup

# Transfer to VPS
rsync -avz ~/tractatus-backup/ ubuntu@vps:/tmp/tractatus-backup/

# VPS: Import to production
mongorestore --db tractatus /tmp/tractatus-backup/tractatus_dev/
# Result: 134 documents + 12 blog posts restored

Phase 6: Admin Setup

node scripts/fix-admin-user.js
node scripts/seed-projects.js

Final System State (2026-01-20)

Services Running:

  • tractatus.service - Node.js application (port 9000)
  • nginx.service - Web server with SSL
  • mongod.service - MongoDB database
  • fail2ban.service - Intrusion prevention

Services Explicitly BANNED:

  • PM2 - Never install (malware persistence vector)
  • Docker - Never install (attack vector)
  • PostgreSQL - Not needed (was for Umami)

Security Measures:

  • SSH key authentication only (password disabled)
  • UFW firewall enabled
  • fail2ban active
  • Daily security scan at 3 AM UTC (/usr/local/bin/daily-security-check.sh)
  • rkhunter and chkrootkit installed
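
None of the measures above watches outbound traffic, which is how all three attacks showed up (44Kpps to 171Kpps of UDP). A rate check could be sketched as follows, assuming a Linux interface counter under /sys/class/net; the function names, the interface name, and any alert threshold are hypothetical and are not part of the deployed daily-security-check.sh.

```shell
#!/bin/sh
# Sketch: estimate outbound packets per second from kernel interface counters.

# Integer rate between two counter samples taken $3 seconds apart.
rate_per_second() {
    echo $(( ($2 - $1) / $3 ))
}

# Sample the TX packet counter of an interface twice and print the rate.
tx_pps() {
    iface="$1"              # e.g. eth0 or ens3 (assumed; check "ip link")
    interval="${2:-10}"     # seconds between samples
    counter="/sys/class/net/$iface/statistics/tx_packets"
    before=$(cat "$counter")
    sleep "$interval"
    after=$(cat "$counter")
    rate_per_second "$before" "$after" "$interval"
}
```

A cron job could compare the result against a threshold well below the observed attack rates, for instance [ "$(tx_pps eth0 10)" -gt 20000 ] && logger -p auth.warning "high outbound packet rate", where the value 20000 is only an illustrative starting point.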

Post-Recovery Improvements (same session):

  • Removed all Umami analytics references from codebase (29 HTML files)
  • Deleted /public/js/components/umami-tracker.js
  • Updated privacy policy to reflect "No Analytics"
  • Added Research Papers section to landing page
  • Created /korero-counter-arguments.html page
  • Fixed Tailwind CSS to include emerald gradient classes

Verification Completed

  • SSH access works with key authentication
  • Website responds correctly (HTTP 200)
  • SSL certificate valid
  • MongoDB running and accessible
  • All documents migrated (134 total)
  • Blog posts visible (12 posts)
  • Admin user functional
  • No PM2 installed
  • No Docker installed
  • Daily security scan configured

Report Date: 2026-01-19 (initial) / 2026-01-20 (complete recovery)
Status: COMPLETE RECOVERY - All systems operational
Next Action: Resume normal development (/community project)