# Production Monitoring Setup

**Project**: Tractatus AI Safety Framework Website
**Environment**: Production (vps-93a693da.vps.ovh.net)
**Created**: 2025-10-09
**Status**: Ready for Deployment

---

## Overview

A comprehensive monitoring system for the Tractatus production environment, providing:

- **Health monitoring** - Application uptime, service status, database connectivity
- **Log monitoring** - Error detection, security events, anomaly detection
- **Disk monitoring** - Disk space usage alerts
- **SSL monitoring** - Certificate expiry warnings
- **Email alerts** - Automated notifications for critical issues

**Philosophy**: Privacy-first, self-hosted monitoring aligned with Tractatus values.

---

## Monitoring Components

### 1. Health Check Monitor (`health-check.sh`)

**What it monitors:**
- Application health endpoint (https://agenticgovernance.digital/health)
- Systemd service status (tractatus.service)
- MongoDB database connectivity
- Disk space usage

**Alert Triggers:**
- Service not running
- Health endpoint returns non-200
- Database connection failed
- Disk space > 90%

**Frequency**: Every 5 minutes

### 2. Log Monitor (`log-monitor.sh`)

**What it monitors:**
- ERROR and CRITICAL log entries
- Security events (authentication failures, unauthorized access)
- Database errors
- HTTP 500 errors
- Unhandled exceptions

**Alert Triggers:**
- 10+ errors in 5-minute window
- 3+ critical errors in 5-minute window
- Any security events

**Frequency**: Every 5 minutes
**Follow Mode**: Can run continuously for real-time monitoring

### 3. Disk Space Monitor (`disk-monitor.sh`)

**What it monitors:**
- Root filesystem (/)
- Var directory (/var)
- Log directory (/var/log)
- Tractatus application (/var/www/tractatus)
- Temp directory (/tmp)

**Alert Triggers:**
- Warning: 80%+ usage
- Critical: 90%+ usage

**Frequency**: Every 15 minutes

### 4. SSL Certificate Monitor (`ssl-monitor.sh`)

**What it monitors:**
- SSL certificate expiry for agenticgovernance.digital

**Alert Triggers:**
- Warning: Expires in 30 days or less
- Critical: Expires in 7 days or less
- Critical: Already expired

**Frequency**: Daily

### 5. Master Monitor (`monitor-all.sh`)

Orchestrates all monitoring checks in a single run.

---

## Installation

### Prerequisites

```bash
# Ensure required commands are available
sudo apt-get update
sudo apt-get install -y curl jq openssl mailutils gnupg

# Install MongoDB shell (if not installed)
# The signing key goes into a dedicated keyring because apt-key is deprecated
curl -fsSL https://www.mongodb.org/static/pgp/server-7.0.asc | \
  sudo gpg -o /usr/share/keyrings/mongodb-server-7.0.gpg --dearmor
echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-7.0.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list
sudo apt-get update
sudo apt-get install -y mongodb-mongosh
```

### Deploy Monitoring Scripts

```bash
# From local machine, deploy monitoring scripts to production
rsync -avz -e "ssh -i ~/.ssh/tractatus_deploy" \
  scripts/monitoring/ \
  ubuntu@vps-93a693da.vps.ovh.net:/var/www/tractatus/scripts/monitoring/
```

### Set Up Log Directory

```bash
# Connect to the production server
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net

# Create log directory
sudo mkdir -p /var/log/tractatus
sudo chown ubuntu:ubuntu /var/log/tractatus
sudo chmod 755 /var/log/tractatus
```

### Make Scripts Executable

```bash
# On production server
cd /var/www/tractatus/scripts/monitoring
chmod +x *.sh
```

### Configure Email Alerts

**Option 1: Using Postfix (Recommended for production)**

```bash
# Install Postfix
sudo apt-get install -y postfix

# Configure Postfix (select "Internet Site")
sudo dpkg-reconfigure postfix

# Set ALERT_EMAIL environment variable
echo 'export ALERT_EMAIL="your-email@example.com"' | sudo tee -a /etc/environment
source /etc/environment
```

**Option 2: Using External SMTP (ProtonMail, Gmail, etc.)**

```bash
# Install sendemail
sudo apt-get install -y sendemail libio-socket-ssl-perl libnet-ssleay-perl

# Configure in monitoring scripts (or use system mail)
```

**Option 3: No Email (Testing)**

```bash
# Leave ALERT_EMAIL unset - monitoring will log but not send emails
# Useful for initial testing
```

### Test Monitoring Scripts

```bash
# Test health check
cd /var/www/tractatus/scripts/monitoring
./health-check.sh --test

# Test log monitor
./log-monitor.sh --since "10 minutes ago" --test

# Test disk monitor
./disk-monitor.sh --test

# Test SSL monitor
./ssl-monitor.sh --test

# Test master monitor
./monitor-all.sh --test
```

Expected output: Each script should run without errors and show `[INFO]` messages.

---

## Cron Configuration

### Create Monitoring Cron Jobs

```bash
# On production server
crontab -e
```

Add the following cron jobs:

```cron
# Tractatus Production Monitoring
# Logs: /var/log/tractatus/monitoring.log

# Master monitoring (every 5 minutes)
# Runs: health check, log monitor, disk monitor
*/5 * * * * /var/www/tractatus/scripts/monitoring/monitor-all.sh --skip-ssl >> /var/log/tractatus/cron-monitor.log 2>&1

# SSL certificate check (daily at 3am)
0 3 * * * /var/www/tractatus/scripts/monitoring/ssl-monitor.sh >> /var/log/tractatus/cron-ssl.log 2>&1

# Disk monitor (every 15 minutes - separate from master for frequency control)
# Note: monitor-all.sh above also runs the disk monitor, so disk checks run on both schedules
*/15 * * * * /var/www/tractatus/scripts/monitoring/disk-monitor.sh >> /var/log/tractatus/cron-disk.log 2>&1
```

### Verify Cron Jobs

```bash
# List active cron jobs
crontab -l

# Check cron logs
sudo journalctl -u cron -f

# Wait 5 minutes, then check monitoring logs
tail -f /var/log/tractatus/cron-monitor.log
```

### Alternative: Systemd Timers (Optional)

A more modern alternative to cron that provides better logging and failure handling.

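
If the timer is meant to replace the cron jobs above rather than run alongside them, leftover cron entries would make every check run twice. The following is a minimal sketch for spotting such entries; `find_monitoring_cron_jobs` is a hypothetical helper, not part of the Tractatus monitoring scripts, and it assumes the scripts are installed under the path used in this guide.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the Tractatus scripts): reads crontab text
# on stdin and prints the active (non-comment) lines that call the monitoring scripts.
find_monitoring_cron_jobs() {
    # Drop comment lines first, then keep lines that reference the scripts directory.
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -v '^[[:space:]]*#' | grep '/var/www/tractatus/scripts/monitoring/' || true
}

# Usage: list cron entries that would overlap with the systemd timer
#   crontab -l 2>/dev/null | find_monitoring_cron_jobs
```

Empty output means no active cron entry calls the monitoring scripts; any line printed is a candidate for removal with `crontab -e` before the timer is enabled.
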

**Create timer file**: `/etc/systemd/system/tractatus-monitoring.timer`

```ini
[Unit]
Description=Tractatus Monitoring Timer
Requires=tractatus-monitoring.service

[Timer]
OnBootSec=5min
OnUnitActiveSec=5min
AccuracySec=1s

[Install]
WantedBy=timers.target
```

**Create service file**: `/etc/systemd/system/tractatus-monitoring.service`

```ini
[Unit]
Description=Tractatus Production Monitoring
After=network.target tractatus.service

[Service]
Type=oneshot
User=ubuntu
WorkingDirectory=/var/www/tractatus
ExecStart=/var/www/tractatus/scripts/monitoring/monitor-all.sh --skip-ssl
StandardOutput=journal
StandardError=journal
Environment="ALERT_EMAIL=your-email@example.com"

[Install]
WantedBy=multi-user.target
```

**Enable and start:**

```bash
sudo systemctl daemon-reload
sudo systemctl enable tractatus-monitoring.timer
sudo systemctl start tractatus-monitoring.timer

# Check status
sudo systemctl status tractatus-monitoring.timer
sudo systemctl list-timers
```

---

## Alert Configuration

### Alert Thresholds

**Health Check:**
- Consecutive failures: 3 (alerts on 3rd failure)
- Check interval: 5 minutes
- Time to alert: 10 to 15 minutes of downtime (the third consecutive failed check)

**Log Monitor:**
- Error threshold: 10 errors in 5 minutes
- Critical threshold: 3 critical errors in 5 minutes
- Security events: Immediate alert

**Disk Space:**
- Warning: 80% usage
- Critical: 90% usage

**SSL Certificate:**
- Warning: 30 days until expiry
- Critical: 7 days until expiry

### Customize Alerts

Edit thresholds in scripts:

```bash
# Health check thresholds
vi /var/www/tractatus/scripts/monitoring/health-check.sh
# Change: MAX_FAILURES=3

# Log monitor thresholds
vi /var/www/tractatus/scripts/monitoring/log-monitor.sh
# Change: ERROR_THRESHOLD=10
# Change: CRITICAL_THRESHOLD=3

# Disk monitor thresholds
vi /var/www/tractatus/scripts/monitoring/disk-monitor.sh
# Change: WARN_THRESHOLD=80
# Change: CRITICAL_THRESHOLD=90

# SSL monitor thresholds
vi /var/www/tractatus/scripts/monitoring/ssl-monitor.sh
# Change: WARN_DAYS=30
# Change: CRITICAL_DAYS=7
```

---

## Manual Monitoring Commands

### Check Current Status

```bash
# Run all monitors manually
cd /var/www/tractatus/scripts/monitoring
./monitor-all.sh

# Run individual monitors
./health-check.sh
./log-monitor.sh --since "1 hour ago"
./disk-monitor.sh
./ssl-monitor.sh
```

### View Monitoring Logs

```bash
# View all monitoring logs
tail -f /var/log/tractatus/monitoring.log

# View specific monitor logs
tail -f /var/log/tractatus/health-check.log
tail -f /var/log/tractatus/log-monitor.log
tail -f /var/log/tractatus/disk-monitor.log
tail -f /var/log/tractatus/ssl-monitor.log

# View cron execution logs
tail -f /var/log/tractatus/cron-monitor.log
```

### Test Alert Delivery

```bash
cd /var/www/tractatus/scripts/monitoring

# Dry run: in test mode the script prints "would send alert" instead of sending email
./health-check.sh --test

# Force a real alert by temporarily stopping the service
sudo systemctl stop tractatus
./health-check.sh  # Alert fires on the 3rd consecutive failure (about 15 minutes at the cron interval)
sudo systemctl start tractatus
```

---

## Troubleshooting

### No Alerts Received

**Check email configuration:**

```bash
# Verify ALERT_EMAIL is set
echo "$ALERT_EMAIL"

# Test mail command
echo "Test email" | mail -s "Test Subject" "$ALERT_EMAIL"

# Check mail logs
sudo tail -f /var/log/mail.log
```

**Check cron execution:**

```bash
# Verify cron jobs are installed
crontab -l

# Check cron logs
sudo journalctl -u cron -n 50

# Check script logs
tail -100 /var/log/tractatus/cron-monitor.log
```

### Scripts Not Executing

**Check permissions:**

```bash
ls -la /var/www/tractatus/scripts/monitoring/
# Should show: -rwxr-xr-x (executable)

# Fix if needed
chmod +x /var/www/tractatus/scripts/monitoring/*.sh
```

**Check cron PATH:**

```bash
# Add to crontab
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

# Or use full paths in cron commands
```

### High Alert Frequency

**Increase thresholds:** Edit threshold values in scripts (see the Alert Configuration section).

**Increase consecutive failure count:**

```bash
vi /var/www/tractatus/scripts/monitoring/health-check.sh
# Increase MAX_FAILURES from 3 to 5 or higher
```

### False Positives

**Review alert conditions:**

```bash
# Check recent logs to understand why alerts triggered
tail -100 /var/log/tractatus/monitoring.log

# Run a manual check and read its output
/var/www/tractatus/scripts/monitoring/health-check.sh

# Check if service is actually unhealthy
sudo systemctl status tractatus
curl https://agenticgovernance.digital/health
```

---

## Monitoring Dashboard (Optional - Future Enhancement)

### Option 1: Grafana + Prometheus

Self-hosted metrics dashboard (requires setup).

### Option 2: Simple Web Dashboard

Create a minimal status page showing the last check results.

### Option 3: UptimeRobot Free Tier

External monitoring service (privacy tradeoff).

**Not implemented yet** - current solution uses email alerts only.

---

## Best Practices

### DO:
- ✅ Test monitoring scripts before deploying
- ✅ Check alert emails regularly
- ✅ Review monitoring logs weekly
- ✅ Adjust thresholds based on actual patterns
- ✅ Document any monitoring configuration changes
- ✅ Keep monitoring scripts updated

### DON'T:
- ❌ Ignore alert emails
- ❌ Set thresholds too low (alert fatigue)
- ❌ Deploy monitoring without testing
- ❌ Disable monitoring without planning
- ❌ Let log files grow unbounded
- ❌ Ignore repeated warnings

### Monitoring Hygiene

```bash
# Run log rotation for the monitoring logs (assumes a logrotate config exists at this path)
sudo logrotate /etc/logrotate.d/tractatus-monitoring

# Clean up old state files
find /var/tmp -name "tractatus-*-state" -mtime +7 -delete

# Review alert frequency monthly
grep -c "\[ALERT\]" /var/log/tractatus/monitoring.log
```

---

## Incident Response

### When Alert Received

1. **Acknowledge alert** - Note time received
2. **Check current status** - Run manual health check
3. **Review logs** - Check what triggered alert
4. **Investigate root cause** - See deployment checklist emergency procedures
5. **Take action** - Fix issue or escalate
6. **Document** - Create incident report

### Critical Alert Response Time

- **Health check failure**: Respond within 15 minutes
- **Log errors**: Respond within 30 minutes
- **Disk space critical**: Respond within 1 hour
- **SSL expiry (7 days)**: Respond within 24 hours

---

## Maintenance

### Weekly Tasks

- [ ] Review monitoring logs for patterns
- [ ] Check alert email inbox
- [ ] Verify cron jobs still running
- [ ] Review disk space trends

### Monthly Tasks

- [ ] Review and adjust alert thresholds
- [ ] Clean up old monitoring logs
- [ ] Test manual failover procedures
- [ ] Update monitoring documentation

### Quarterly Tasks

- [ ] Full monitoring system audit
- [ ] Test all alert scenarios
- [ ] Review incident response times
- [ ] Consider monitoring enhancements

---

## Monitoring Metrics

### Success Metrics

- **Uptime**: Target 99.9% (about 43 minutes of downtime per 30-day month)
- **Alert Response Time**: < 30 minutes for critical
- **False Positive Rate**: < 5% of alerts
- **Detection Time**: < 5 minutes for critical issues (first failed check; the health alert itself follows on the third consecutive failure)

### Tracking

```bash
# Count successful health checks (a rough proxy for uptime)
grep -c "Health endpoint OK" /var/log/tractatus/monitoring.log

# Count alerts sent
grep -c "Alert email sent" /var/log/tractatus/monitoring.log

# Review response times (manual from incident reports)
```

---

## Security Considerations

### Log Access Control

```bash
# Ensure logs are readable only by ubuntu user and root
sudo chown ubuntu:ubuntu /var/log/tractatus/*.log
sudo chmod 640 /var/log/tractatus/*.log
```

### Alert Email Security

- Use encrypted email if possible (ProtonMail)
- Don't include sensitive data in alert body
- Alerts show symptoms, not credentials

### Monitoring Script Security

- Scripts run as ubuntu user (not root)
- No credentials embedded in scripts
- Use environment variables for sensitive config

---

## Future Enhancements

### Planned Improvements

- [ ] **Metrics collection**: Store monitoring metrics in database for trend analysis
- [ ] **Status page**: Public status page showing service availability
- [ ] **Mobile alerts**: SMS or push notifications for critical alerts
- [ ] **Distributed monitoring**: Multiple monitoring locations for redundancy
- [ ] **Automated remediation**: Auto-restart service on failure
- [ ] **Performance monitoring**: Response time tracking, query performance
- [ ] **User impact monitoring**: Track error rates from user perspective

### Integration Opportunities

- [ ] **Plausible Analytics**: Monitor traffic patterns, correlate with errors
- [ ] **GitHub Actions**: Run monitoring checks in CI/CD
- [ ] **Slack integration**: Send alerts to Slack channel
- [ ] **Database backup monitoring**: Alert on backup failures

---

## Support & Documentation

**Monitoring Scripts**: `/var/www/tractatus/scripts/monitoring/`
**Monitoring Logs**: `/var/log/tractatus/`
**Cron Configuration**: `crontab -l` (ubuntu user)
**Alert Email**: Set via `ALERT_EMAIL` environment variable

**Related Documents:**
- [Production Deployment Checklist](PRODUCTION_DEPLOYMENT_CHECKLIST.md)
- [Phase 4 Preparation Checklist](../PHASE-4-PREPARATION-CHECKLIST.md)

---

**Document Status**: Ready for Production
**Last Updated**: 2025-10-09
**Next Review**: After 1 month of monitoring data
**Maintainer**: Technical Lead (Claude Code + John Stroh)
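
---

The `logrotate` command under Monitoring Hygiene expects a config at `/etc/logrotate.d/tractatus-monitoring`, which this guide does not define. The following is a minimal sketch of what that file might contain, not the project's actual config. `print_logrotate_config` is a hypothetical helper, and the retention count (8 weekly rotations) and the `create` mode are assumptions chosen to match the log ownership described under Security Considerations.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the Tractatus scripts): prints a candidate
# logrotate config for the monitoring logs to stdout.
# Assumptions: 8 weekly rotations are enough; logs stay owned by ubuntu with mode 640.
print_logrotate_config() {
    cat <<'EOF'
/var/log/tractatus/*.log {
    weekly
    rotate 8
    compress
    delaycompress
    missingok
    notifempty
    create 640 ubuntu ubuntu
}
EOF
}

# Usage: install the config, then dry-run it (-d changes nothing on disk)
#   print_logrotate_config | sudo tee /etc/logrotate.d/tractatus-monitoring
#   sudo logrotate -d /etc/logrotate.d/tractatus-monitoring
```

Printing the config to stdout keeps the helper free of side effects, so the result can be reviewed before it is written with `sudo tee`.
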