Create self-hosted, privacy-first monitoring infrastructure for production environment with automated health checks, log analysis, and alerting.
Monitoring Components:
- health-check.sh: Application health, service status, DB connectivity, disk space
- log-monitor.sh: Error detection, security events, anomaly detection
- disk-monitor.sh: Disk space usage monitoring (5 paths)
- ssl-monitor.sh: SSL certificate expiry monitoring
- monitor-all.sh: Master orchestration script
Features:
- Email alerting system (configurable thresholds)
- Consecutive failure tracking (prevents false positives)
- Test mode for safe deployment testing
- Comprehensive logging to /var/log/tractatus/
- Cron-ready for automated execution
- Exit codes for monitoring tool integration
Alert Triggers:
- Health: 3 consecutive failures (15min downtime)
- Logs: 10 errors OR 3 critical errors in 5min
- Disk: 80% warning, 90% critical
- SSL: 30 days warning, 7 days critical
Setup Documentation:
- Complete installation instructions
- Cron configuration examples
- Systemd timer alternative
- Troubleshooting guide
- Alert customization guide
- Incident response procedures
Privacy-First Design:
- Self-hosted (no external monitoring services)
- Minimal data exposure in alerts
- Local log storage only
- No telemetry to third parties
Aligns with Tractatus values: transparency, privacy, operational excellence
Addresses Phase 4 Prep Checklist Task #6: Production Monitoring & Alerting
Next: Deploy to production, configure email alerts, set up cron jobs
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Production Monitoring Setup
Project: Tractatus AI Safety Framework Website
Environment: Production (vps-93a693da.vps.ovh.net)
Created: 2025-10-09
Status: Ready for Deployment
Overview
Comprehensive monitoring system for Tractatus production environment, providing:
- Health monitoring - Application uptime, service status, database connectivity
- Log monitoring - Error detection, security events, anomaly detection
- Disk monitoring - Disk space usage alerts
- SSL monitoring - Certificate expiry warnings
- Email alerts - Automated notifications for critical issues
Philosophy: Privacy-first, self-hosted monitoring aligned with Tractatus values.
Monitoring Components
1. Health Check Monitor (health-check.sh)
What it monitors:
- Application health endpoint (https://agenticgovernance.digital/health)
- Systemd service status (tractatus.service)
- MongoDB database connectivity
- Disk space usage
Alert Triggers:
- Service not running
- Health endpoint returns non-200
- Database connection failed
- Disk space > 90%
Frequency: Every 5 minutes
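The consecutive-failure rule used by the health check can be sketched as a small counter kept in a state file. This is an illustration only, not the contents of health-check.sh: the function name record_check and the STATE_FILE path are assumptions, while MAX_FAILURES=3 matches the documented default.

```shell
#!/usr/bin/env bash
# Sketch of consecutive-failure tracking (hypothetical names, not the real script).
MAX_FAILURES="${MAX_FAILURES:-3}"
STATE_FILE="${STATE_FILE:-/var/tmp/tractatus-health-state}"   # assumed location

# record_check <status>
#   status 0 = healthy check, non-zero = failed check.
#   Prints the current streak of failures and returns 0 only when the
#   streak has just reached MAX_FAILURES (an alert is due).
record_check() {
    local status="$1" count=0
    if [ -r "$STATE_FILE" ]; then
        count="$(cat "$STATE_FILE")"
    fi
    count="${count:-0}"
    if [ "$status" -eq 0 ]; then
        count=0                      # any success resets the streak
    else
        count=$((count + 1))
    fi
    printf '%s\n' "$count" > "$STATE_FILE"
    printf '%s\n' "$count"
    [ "$count" -eq "$MAX_FAILURES" ]
}
```

Because the alert fires only when the counter equals the limit, a long outage produces one alert rather than one per run; whether the real script behaves this way is not stated in this document.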
2. Log Monitor (log-monitor.sh)
What it monitors:
- ERROR and CRITICAL log entries
- Security events (authentication failures, unauthorized access)
- Database errors
- HTTP 500 errors
- Unhandled exceptions
Alert Triggers:
- 10+ errors in 5-minute window
- 3+ critical errors in 5-minute window
- Any security events
Frequency: Every 5 minutes
Follow Mode: Can run continuously for real-time monitoring
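As a rough sketch of the threshold logic above, the function below counts ERROR and CRITICAL lines from a 5-minute window read on standard input. The function name evaluate_log_window is hypothetical, and matching on the literal strings ERROR and CRITICAL is an assumption about the log format.

```shell
#!/usr/bin/env bash
# Sketch of the log-window thresholds (hypothetical helper, assumed log format).
ERROR_THRESHOLD="${ERROR_THRESHOLD:-10}"
CRITICAL_THRESHOLD="${CRITICAL_THRESHOLD:-3}"

# evaluate_log_window: reads log lines on stdin, prints the counts, and
# returns 0 when either threshold is reached (an alert is due).
evaluate_log_window() {
    local errors=0 critical=0 line
    while IFS= read -r line; do
        case "$line" in
            *CRITICAL*) critical=$((critical + 1)) ;;
            *ERROR*)    errors=$((errors + 1)) ;;
        esac
    done
    printf 'errors=%d critical=%d\n' "$errors" "$critical"
    [ "$errors" -ge "$ERROR_THRESHOLD" ] || [ "$critical" -ge "$CRITICAL_THRESHOLD" ]
}

# Possible usage, assuming the application logs to the systemd journal:
#   journalctl -u tractatus --since "5 minutes ago" --no-pager | evaluate_log_window
```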
3. Disk Space Monitor (disk-monitor.sh)
What it monitors:
- Root filesystem (/)
- Var directory (/var)
- Log directory (/var/log)
- Tractatus application (/var/www/tractatus)
- Temp directory (/tmp)
Alert Triggers:
- Warning: 80%+ usage
- Critical: 90%+ usage
Frequency: Every 15 minutes
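A minimal sketch of how the warning and critical levels could be applied to each path, assuming usage is read from the Capacity column of df -P. The function names are illustrative; the threshold variable names follow the ones listed under Customize Alerts.

```shell
#!/usr/bin/env bash
# Sketch of disk-usage classification (hypothetical helpers).
WARN_THRESHOLD="${WARN_THRESHOLD:-80}"
CRITICAL_THRESHOLD="${CRITICAL_THRESHOLD:-90}"

# classify_usage <percent>: prints OK, WARNING or CRITICAL.
classify_usage() {
    local pct="$1"
    if [ "$pct" -ge "$CRITICAL_THRESHOLD" ]; then
        echo "CRITICAL"
    elif [ "$pct" -ge "$WARN_THRESHOLD" ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

# usage_percent <path>: integer usage of the filesystem holding <path>.
usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_paths <path>...: one status line per existing path.
check_paths() {
    local path pct
    for path in "$@"; do
        [ -d "$path" ] || continue
        pct="$(usage_percent "$path")"
        printf '[%s] %s at %s%%\n' "$(classify_usage "$pct")" "$path" "$pct"
    done
}

# Possible usage with the five monitored paths:
#   check_paths / /var /var/log /var/www/tractatus /tmp
```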
4. SSL Certificate Monitor (ssl-monitor.sh)
What it monitors:
- SSL certificate expiry for agenticgovernance.digital
Alert Triggers:
- Warning: Expires in 30 days or less
- Critical: Expires in 7 days or less
- Critical: Already expired
Frequency: Daily
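The expiry rules above can be sketched as follows. This is not the contents of ssl-monitor.sh: the function names are hypothetical, and cert_not_after_epoch assumes GNU date and outbound access to port 443.

```shell
#!/usr/bin/env bash
# Sketch of certificate-expiry classification (hypothetical helpers).
WARN_DAYS="${WARN_DAYS:-30}"
CRITICAL_DAYS="${CRITICAL_DAYS:-7}"

# classify_expiry <not_after_epoch> <now_epoch>
#   Prints EXPIRED, or "<LEVEL> <whole days remaining>".
classify_expiry() {
    local not_after="$1" now="$2" remaining days
    remaining=$((not_after - now))
    if [ "$remaining" -lt 0 ]; then
        echo "EXPIRED"
        return 0
    fi
    days=$((remaining / 86400))
    if [ "$days" -le "$CRITICAL_DAYS" ]; then
        echo "CRITICAL $days"
    elif [ "$days" -le "$WARN_DAYS" ]; then
        echo "WARNING $days"
    else
        echo "OK $days"
    fi
}

# cert_not_after_epoch <host>: expiry time of the served certificate as
# seconds since the epoch (assumes GNU date can parse the openssl output).
cert_not_after_epoch() {
    local host="$1" end
    end="$(openssl s_client -servername "$host" -connect "$host:443" </dev/null 2>/dev/null \
        | openssl x509 -noout -enddate | cut -d= -f2)"
    date -d "$end" +%s
}

# Possible usage:
#   classify_expiry "$(cert_not_after_epoch agenticgovernance.digital)" "$(date +%s)"
```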
5. Master Monitor (monitor-all.sh)
Orchestrates all monitoring checks in a single run.
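One way such an orchestrator might work is to run every check even when an earlier one fails, then exit with the worst status seen so that cron or another tool can act on the exit code. The sketch below assumes each check is a command that signals failure through a non-zero exit code; run_checks is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch of a master runner that aggregates exit codes (hypothetical helper).

# run_checks <command>...
#   Runs each command in order, logs the outcome, and returns the highest
#   exit code observed (0 when every check passed).
run_checks() {
    local worst=0 rc check
    for check in "$@"; do
        rc=0
        "$check" || rc=$?
        if [ "$rc" -eq 0 ]; then
            echo "[INFO] $check passed"
        else
            echo "[ERROR] $check failed (exit $rc)"
        fi
        if [ "$rc" -gt "$worst" ]; then
            worst="$rc"
        fi
    done
    return "$worst"
}

# Possible usage from the monitoring directory:
#   run_checks ./health-check.sh ./log-monitor.sh ./disk-monitor.sh
```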
Installation
Prerequisites
# Ensure required commands are available
sudo apt-get update
sudo apt-get install -y curl jq openssl mailutils
# Install MongoDB shell (if not installed)
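# apt-key is deprecated; store the key in a dedicated keyring instead
curl -fsSL https://www.mongodb.org/static/pgp/server-7.0.asc | sudo gpg --dearmor -o /usr/share/keyrings/mongodb-server-7.0.gpg
echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-7.0.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list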
sudo apt-get update
sudo apt-get install -y mongodb-mongosh
Deploy Monitoring Scripts
# From local machine, deploy monitoring scripts to production
rsync -avz -e "ssh -i ~/.ssh/tractatus_deploy" \
scripts/monitoring/ \
ubuntu@vps-93a693da.vps.ovh.net:/var/www/tractatus/scripts/monitoring/
Set Up Log Directory
# On production server
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net
# Create log directory
sudo mkdir -p /var/log/tractatus
sudo chown ubuntu:ubuntu /var/log/tractatus
sudo chmod 755 /var/log/tractatus
Make Scripts Executable
# On production server
cd /var/www/tractatus/scripts/monitoring
chmod +x *.sh
Configure Email Alerts
Option 1: Using Postfix (Recommended for production)
# Install Postfix
sudo apt-get install -y postfix
# Configure Postfix (select "Internet Site")
sudo dpkg-reconfigure postfix
# Set ALERT_EMAIL environment variable
echo 'export ALERT_EMAIL="your-email@example.com"' | sudo tee -a /etc/environment
source /etc/environment
Option 2: Using External SMTP (ProtonMail, Gmail, etc.)
# Install sendemail
sudo apt-get install -y sendemail libio-socket-ssl-perl libnet-ssleay-perl
# Configure in monitoring scripts (or use system mail)
Option 3: No Email (Testing)
# Leave ALERT_EMAIL unset - monitoring will log but not send emails
# Useful for initial testing
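The three options above imply an alert helper that falls back to logging when no recipient is configured. A minimal sketch, assuming a TEST_MODE variable stands in for the --test flag and that mail comes from mailutils; send_alert and the exact message texts are illustrative.

```shell
#!/usr/bin/env bash
# Sketch of an alert helper (hypothetical; TEST_MODE stands in for --test).
TEST_MODE="${TEST_MODE:-0}"

# send_alert <subject> <body>
#   Test mode: only reports what would be sent.
#   ALERT_EMAIL unset: logs the alert without sending mail.
#   Otherwise: pipes the body to mail(1).
send_alert() {
    local subject="$1" body="$2"
    if [ "$TEST_MODE" = "1" ]; then
        echo "[INFO] Test mode: would send alert: $subject"
    elif [ -z "${ALERT_EMAIL:-}" ]; then
        echo "[WARN] ALERT_EMAIL not set; alert logged only: $subject"
    else
        printf '%s\n' "$body" | mail -s "$subject" "$ALERT_EMAIL" \
            && echo "[INFO] Alert email sent: $subject"
    fi
}

# Possible usage:
#   send_alert "Tractatus health check failed" "Service not running"
```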
Test Monitoring Scripts
# Test health check
cd /var/www/tractatus/scripts/monitoring
./health-check.sh --test
# Test log monitor
./log-monitor.sh --since "10 minutes ago" --test
# Test disk monitor
./disk-monitor.sh --test
# Test SSL monitor
./ssl-monitor.sh --test
# Test master monitor
./monitor-all.sh --test
Expected output: Each script should run without errors and show [INFO] messages.
Cron Configuration
Create Monitoring Cron Jobs
# On production server
crontab -e
Add the following cron jobs:
# Tractatus Production Monitoring
# Logs: /var/log/tractatus/monitoring.log
# Master monitoring (every 5 minutes)
# Runs: health check, log monitor, disk monitor
*/5 * * * * /var/www/tractatus/scripts/monitoring/monitor-all.sh --skip-ssl >> /var/log/tractatus/cron-monitor.log 2>&1
# SSL certificate check (daily at 3am)
0 3 * * * /var/www/tractatus/scripts/monitoring/ssl-monitor.sh >> /var/log/tractatus/cron-ssl.log 2>&1
# Disk monitor (every 15 minutes, on its own schedule; note that monitor-all.sh above also runs the disk check every 5 minutes)
*/15 * * * * /var/www/tractatus/scripts/monitoring/disk-monitor.sh >> /var/log/tractatus/cron-disk.log 2>&1
Verify Cron Jobs
# List active cron jobs
crontab -l
# Check cron logs
sudo journalctl -u cron -f
# Wait 5 minutes, then check monitoring logs
tail -f /var/log/tractatus/cron-monitor.log
Alternative: Systemd Timers (Optional)
A more modern alternative to cron that provides better logging and failure handling.
Create timer file: /etc/systemd/system/tractatus-monitoring.timer
[Unit]
Description=Tractatus Monitoring Timer
Requires=tractatus-monitoring.service
[Timer]
OnBootSec=5min
OnUnitActiveSec=5min
AccuracySec=1s
[Install]
WantedBy=timers.target
Create service file: /etc/systemd/system/tractatus-monitoring.service
[Unit]
Description=Tractatus Production Monitoring
After=network.target tractatus.service
[Service]
Type=oneshot
User=ubuntu
WorkingDirectory=/var/www/tractatus
ExecStart=/var/www/tractatus/scripts/monitoring/monitor-all.sh --skip-ssl
StandardOutput=journal
StandardError=journal
Environment="ALERT_EMAIL=your-email@example.com"
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable tractatus-monitoring.timer
sudo systemctl start tractatus-monitoring.timer
# Check status
sudo systemctl status tractatus-monitoring.timer
sudo systemctl list-timers
Alert Configuration
Alert Thresholds
Health Check:
- Consecutive failures: 3 (alerts on 3rd failure)
- Check interval: 5 minutes
- Time to alert: 10 to 15 minutes of downtime (the 3rd failed check at a 5-minute interval)
Log Monitor:
- Error threshold: 10 errors in 5 minutes
- Critical threshold: 3 critical errors in 5 minutes
- Security events: Immediate alert
Disk Space:
- Warning: 80% usage
- Critical: 90% usage
SSL Certificate:
- Warning: 30 days until expiry
- Critical: 7 days until expiry
Customize Alerts
Edit thresholds in scripts:
# Health check thresholds
vi /var/www/tractatus/scripts/monitoring/health-check.sh
# Change: MAX_FAILURES=3
# Log monitor thresholds
vi /var/www/tractatus/scripts/monitoring/log-monitor.sh
# Change: ERROR_THRESHOLD=10
# Change: CRITICAL_THRESHOLD=3
# Disk monitor thresholds
vi /var/www/tractatus/scripts/monitoring/disk-monitor.sh
# Change: WARN_THRESHOLD=80
# Change: CRITICAL_THRESHOLD=90
# SSL monitor thresholds
vi /var/www/tractatus/scripts/monitoring/ssl-monitor.sh
# Change: WARN_DAYS=30
# Change: CRITICAL_DAYS=7
Manual Monitoring Commands
Check Current Status
# Run all monitors manually
cd /var/www/tractatus/scripts/monitoring
./monitor-all.sh
# Run individual monitors
./health-check.sh
./log-monitor.sh --since "1 hour ago"
./disk-monitor.sh
./ssl-monitor.sh
View Monitoring Logs
# View all monitoring logs
tail -f /var/log/tractatus/monitoring.log
# View specific monitor logs
tail -f /var/log/tractatus/health-check.log
tail -f /var/log/tractatus/log-monitor.log
tail -f /var/log/tractatus/disk-monitor.log
tail -f /var/log/tractatus/ssl-monitor.log
# View cron execution logs
tail -f /var/log/tractatus/cron-monitor.log
Test Alert Delivery
# Send test alert
cd /var/www/tractatus/scripts/monitoring
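# In test mode the script reports what it would do without sending email;
# when a check fails it prints "would send alert"
./health-check.sh --test
# Force a real alert by temporarily stopping the service
sudo systemctl stop tractatus
# Alerts fire on the 3rd consecutive failure, so run the check three times
# (under cron this takes roughly 10 to 15 minutes)
./health-check.sh; ./health-check.sh; ./health-check.sh
sudo systemctl start tractatus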
Troubleshooting
No Alerts Received
Check email configuration:
# Verify ALERT_EMAIL is set
echo "$ALERT_EMAIL"
# Test mail command
echo "Test email" | mail -s "Test Subject" "$ALERT_EMAIL"
# Check mail logs
sudo tail -f /var/log/mail.log
Check cron execution:
# Verify cron jobs are running
crontab -l
# Check cron logs
sudo journalctl -u cron -n 50
# Check script logs
tail -n 100 /var/log/tractatus/cron-monitor.log
Scripts Not Executing
Check permissions:
ls -la /var/www/tractatus/scripts/monitoring/
# Should show: -rwxr-xr-x (executable)
# Fix if needed
chmod +x /var/www/tractatus/scripts/monitoring/*.sh
Check cron PATH:
# Add to crontab
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# Or use full paths in cron commands
High Alert Frequency
Increase thresholds:
Edit threshold values in scripts (see Alert Configuration section).
Increase consecutive failure count:
vi /var/www/tractatus/scripts/monitoring/health-check.sh
# Increase MAX_FAILURES from 3 to 5 or higher
False Positives
Review alert conditions:
# Check recent logs to understand why alerts triggered
tail -n 100 /var/log/tractatus/monitoring.log
# Run manual check with verbose output
./health-check.sh
# Check if service is actually unhealthy
sudo systemctl status tractatus
curl https://agenticgovernance.digital/health
Monitoring Dashboard (Optional - Future Enhancement)
Option 1: Grafana + Prometheus
Self-hosted metrics dashboard (requires setup).
Option 2: Simple Web Dashboard
Create minimal status page showing last check results.
Option 3: UptimeRobot Free Tier
External monitoring service (privacy tradeoff).
Not implemented yet - current solution uses email alerts only.
Best Practices
DO:
- ✅ Test monitoring scripts before deploying
- ✅ Check alert emails regularly
- ✅ Review monitoring logs weekly
- ✅ Adjust thresholds based on actual patterns
- ✅ Document any monitoring configuration changes
- ✅ Keep monitoring scripts updated
DON'T:
- ❌ Ignore alert emails
- ❌ Set thresholds too low (alert fatigue)
- ❌ Deploy monitoring without testing
- ❌ Disable monitoring without planning
- ❌ Let log files grow unbounded
- ❌ Ignore repeated warnings
Monitoring Hygiene
# Rotate monitoring logs weekly
sudo logrotate /etc/logrotate.d/tractatus-monitoring
# Clean up old state files
find /var/tmp -name "tractatus-*-state" -mtime +7 -delete
# Review alert frequency monthly
grep "\[ALERT\]" /var/log/tractatus/monitoring.log | wc -l
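The logrotate command above expects a configuration file at /etc/logrotate.d/tractatus-monitoring, which this document does not show. The sketch below prints one plausible version; the retention of 8 rotations and the compression settings are assumptions, and the ownership and mode mirror the Log Access Control section.

```shell
#!/usr/bin/env bash
# Sketch: print a candidate logrotate config (values are assumptions).
render_logrotate_config() {
    cat <<'EOF'
/var/log/tractatus/*.log {
    weekly
    rotate 8
    compress
    delaycompress
    missingok
    notifempty
    create 0640 ubuntu ubuntu
}
EOF
}

# Possible usage:
#   render_logrotate_config | sudo tee /etc/logrotate.d/tractatus-monitoring
#   sudo logrotate -d /etc/logrotate.d/tractatus-monitoring   # dry run
```

On Ubuntu the system's scheduled logrotate run normally picks up files in /etc/logrotate.d/, so a manual invocation would mainly be useful for testing.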
Incident Response
When Alert Received
- Acknowledge alert - Note time received
- Check current status - Run manual health check
- Review logs - Check what triggered alert
- Investigate root cause - See deployment checklist emergency procedures
- Take action - Fix issue or escalate
- Document - Create incident report
Critical Alert Response Time
- Health check failure: Respond within 15 minutes
- Log errors: Respond within 30 minutes
- Disk space critical: Respond within 1 hour
- SSL expiry (7 days): Respond within 24 hours
Maintenance
Weekly Tasks
- Review monitoring logs for patterns
- Check alert email inbox
- Verify cron jobs still running
- Review disk space trends
Monthly Tasks
- Review and adjust alert thresholds
- Clean up old monitoring logs
- Test manual failover procedures
- Update monitoring documentation
Quarterly Tasks
- Full monitoring system audit
- Test all alert scenarios
- Review incident response times
- Consider monitoring enhancements
Monitoring Metrics
Success Metrics
- Uptime: Target 99.9% (about 43 minutes of downtime in a 30-day month)
- Alert Response Time: < 30 minutes for critical
- False Positive Rate: < 5% of alerts
- Detection Time: < 5 minutes for critical issues
Tracking
# Count successful health checks (a proxy for uptime)
grep -c "Health endpoint OK" /var/log/tractatus/monitoring.log
# Count alerts sent
grep "Alert email sent" /var/log/tractatus/monitoring.log | wc -l
# Review response times (manual from incident reports)
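To turn these counts into the uptime figure from Success Metrics, the number of successful checks can be compared with the number of checks expected. A rough sketch, assuming one health check every 5 minutes (288 per day); both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of uptime arithmetic (hypothetical helpers).

# uptime_percent <ok_checks> <total_checks>: percentage with two decimals.
uptime_percent() {
    awk -v ok="$1" -v total="$2" 'BEGIN {
        if (total == 0) { print "n/a"; exit }
        printf "%.2f\n", 100 * ok / total
    }'
}

# downtime_budget_minutes <target_percent> <days>: allowed downtime in minutes.
downtime_budget_minutes() {
    awk -v target="$1" -v days="$2" 'BEGIN {
        printf "%.1f\n", (100 - target) / 100 * days * 24 * 60
    }'
}

# Possible usage for a 30-day log:
#   ok="$(grep -c "Health endpoint OK" /var/log/tractatus/monitoring.log)"
#   uptime_percent "$ok" $((288 * 30))
#   downtime_budget_minutes 99.9 30
```

The result is only as precise as the 5-minute sampling: an outage shorter than one interval may not appear at all.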
Security Considerations
Log Access Control
# Ensure logs are readable only by ubuntu user and root
sudo chown ubuntu:ubuntu /var/log/tractatus/*.log
sudo chmod 640 /var/log/tractatus/*.log
Alert Email Security
- Use encrypted email if possible (ProtonMail)
- Don't include sensitive data in alert body
- Alerts show symptoms, not credentials
Monitoring Script Security
- Scripts run as ubuntu user (not root)
- No credentials embedded in scripts
- Use environment variables for sensitive config
Future Enhancements
Planned Improvements
- Metrics collection: Store monitoring metrics in database for trend analysis
- Status page: Public status page showing service availability
- Mobile alerts: SMS or push notifications for critical alerts
- Distributed monitoring: Multiple monitoring locations for redundancy
- Automated remediation: Auto-restart service on failure
- Performance monitoring: Response time tracking, query performance
- User impact monitoring: Track error rates from user perspective
Integration Opportunities
- Plausible Analytics: Monitor traffic patterns, correlate with errors
- GitHub Actions: Run monitoring checks in CI/CD
- Slack integration: Send alerts to Slack channel
- Database backup monitoring: Alert on backup failures
Support & Documentation
Monitoring Scripts: /var/www/tractatus/scripts/monitoring/
Monitoring Logs: /var/log/tractatus/
Cron Configuration: crontab -l (ubuntu user)
Alert Email: Set via ALERT_EMAIL environment variable
Related Documents:
Document Status: Ready for Production
Last Updated: 2025-10-09
Next Review: After 1 month of monitoring data
Maintainer: Technical Lead (Claude Code + John Stroh)