Create self-hosted, privacy-first monitoring infrastructure for production environment with automated health checks, log analysis, and alerting.

Monitoring Components:
- health-check.sh: Application health, service status, DB connectivity, disk space
- log-monitor.sh: Error detection, security events, anomaly detection
- disk-monitor.sh: Disk space usage monitoring (5 paths)
- ssl-monitor.sh: SSL certificate expiry monitoring
- monitor-all.sh: Master orchestration script

Features:
- Email alerting system (configurable thresholds)
- Consecutive failure tracking (prevents false positives)
- Test mode for safe deployment testing
- Comprehensive logging to /var/log/tractatus/
- Cron-ready for automated execution
- Exit codes for monitoring tool integration

Alert Triggers:
- Health: 3 consecutive failures (15min downtime)
- Logs: 10 errors OR 3 critical errors in 5min
- Disk: 80% warning, 90% critical
- SSL: 30 days warning, 7 days critical

Setup Documentation:
- Complete installation instructions
- Cron configuration examples
- Systemd timer alternative
- Troubleshooting guide
- Alert customization guide
- Incident response procedures

Privacy-First Design:
- Self-hosted (no external monitoring services)
- Minimal data exposure in alerts
- Local log storage only
- No telemetry to third parties

Aligns with Tractatus values: transparency, privacy, operational excellence

Addresses Phase 4 Prep Checklist Task #6: Production Monitoring & Alerting

Next: Deploy to production, configure email alerts, set up cron jobs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

# Production Monitoring Setup

**Project**: Tractatus AI Safety Framework Website
**Environment**: Production (vps-93a693da.vps.ovh.net)
**Created**: 2025-10-09
**Status**: Ready for Deployment

---

## Overview

A monitoring system for the Tractatus production environment. It provides:

- **Health monitoring** - Application uptime, service status, database connectivity
- **Log monitoring** - Error detection, security events, anomaly detection
- **Disk monitoring** - Disk space usage alerts
- **SSL monitoring** - Certificate expiry warnings
- **Email alerts** - Automated notifications for critical issues

**Philosophy**: Privacy-first, self-hosted monitoring aligned with Tractatus values.

---

## Monitoring Components

### 1. Health Check Monitor (`health-check.sh`)

**What it monitors:**
- Application health endpoint (https://agenticgovernance.digital/health)
- Systemd service status (tractatus.service)
- MongoDB database connectivity
- Disk space usage

**Alert Triggers:**
- Service not running
- Health endpoint returns non-200
- Database connection failed
- Disk space > 90%

**Frequency**: Every 5 minutes

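
Elsewhere in this document the health check is described as alerting only after 3 consecutive failures. The following is a minimal sketch of how such a counter could work, assuming the running count is kept in a small state file. The function name `record_check`, the state-file argument, and the `OK`/`FAILING`/`ALERT` labels are hypothetical and are not taken from `health-check.sh`; only the `MAX_FAILURES` setting appears in this document.

```bash
# Hypothetical helper (not from health-check.sh): count consecutive failures.
# $1 = exit status of the latest check (0 = healthy)
# $2 = path of the state file that stores the running failure count
record_check() {
    status=$1
    state_file=$2
    max_failures=${MAX_FAILURES:-3}

    # Load the previous count, treating a missing or corrupt file as zero
    count=0
    if [ -f "$state_file" ]; then
        count=$(cat "$state_file")
    fi
    case $count in
        ''|*[!0-9]*) count=0 ;;
    esac

    if [ "$status" -eq 0 ]; then
        # A healthy check resets the streak
        echo 0 > "$state_file"
        echo "OK"
        return 0
    fi

    count=$((count + 1))
    echo "$count" > "$state_file"

    if [ "$count" -ge "$max_failures" ]; then
        # Threshold reached: the caller would send an alert here
        echo "ALERT"
        return 2
    fi
    echo "FAILING"
    return 1
}
```

With a 5-minute schedule and the default of 3, the third failed run returns `ALERT`, which corresponds to roughly 15 minutes of downtime. This sketch keeps returning `ALERT` while the outage continues; a real script might suppress repeats.
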
### 2. Log Monitor (`log-monitor.sh`)

**What it monitors:**
- ERROR and CRITICAL log entries
- Security events (authentication failures, unauthorized access)
- Database errors
- HTTP 500 errors
- Unhandled exceptions

**Alert Triggers:**
- 10+ errors in 5-minute window
- 3+ critical errors in 5-minute window
- Any security events

**Frequency**: Every 5 minutes

**Follow Mode**: Can run continuously for real-time monitoring

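
The error and critical thresholds above can be illustrated with a short sketch. It assumes the log lines for the 5-minute window have already been written to a file (for example from `journalctl -u tractatus --since "5 minutes ago"`, which is an assumption about where the logs live). The function name `evaluate_log_window` and the output format are hypothetical; `ERROR_THRESHOLD` and `CRITICAL_THRESHOLD` are the setting names mentioned in this document.

```bash
# Hypothetical helper (not from log-monitor.sh): apply the window thresholds.
# $1 = file containing the log lines from the monitoring window
evaluate_log_window() {
    window_file=$1
    error_threshold=${ERROR_THRESHOLD:-10}
    critical_threshold=${CRITICAL_THRESHOLD:-3}

    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true"
    errors=$(grep -c 'ERROR' "$window_file" || true)
    criticals=$(grep -c 'CRITICAL' "$window_file" || true)

    if [ "$criticals" -ge "$critical_threshold" ] || [ "$errors" -ge "$error_threshold" ]; then
        echo "ALERT errors=$errors critical=$criticals"
        return 1
    fi
    echo "OK errors=$errors critical=$criticals"
    return 0
}
```

Security events are not covered by this sketch; per the triggers above, any single security event would alert immediately.
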
### 3. Disk Space Monitor (`disk-monitor.sh`)

**What it monitors:**
- Root filesystem (/)
- Var directory (/var)
- Log directory (/var/log)
- Tractatus application (/var/www/tractatus)
- Temp directory (/tmp)

**Alert Triggers:**
- Warning: 80%+ usage
- Critical: 90%+ usage

**Frequency**: Every 15 minutes

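
As an illustration of the two-level thresholds above, here is a small sketch that reads the usage percentage from `df` and classifies it. The function names `usage_percent` and `classify_usage` are hypothetical and not taken from `disk-monitor.sh`; `WARN_THRESHOLD` and `CRITICAL_THRESHOLD` are the setting names mentioned in this document.

```bash
# Hypothetical helpers (not from disk-monitor.sh).

# Print the used percentage (integer, no % sign) of the filesystem holding $1.
# df -P gives the portable one-line-per-filesystem format; column 5 is capacity.
usage_percent() {
    df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Map a usage percentage to OK / WARNING / CRITICAL.
classify_usage() {
    pct=$1
    warn=${WARN_THRESHOLD:-80}
    crit=${CRITICAL_THRESHOLD:-90}

    if [ "$pct" -ge "$crit" ]; then
        echo "CRITICAL"
    elif [ "$pct" -ge "$warn" ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}
```

A loop over the five monitored paths could then call `classify_usage "$(usage_percent "$path")"` for each one.
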
### 4. SSL Certificate Monitor (`ssl-monitor.sh`)

**What it monitors:**
- SSL certificate expiry for agenticgovernance.digital

**Alert Triggers:**
- Warning: Expires in 30 days or less
- Critical: Expires in 7 days or less
- Critical: Already expired

**Frequency**: Daily

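
The expiry thresholds above can be sketched as follows, assuming the certificate end date is read with `openssl` and converted with GNU `date`. The function names `cert_expiry_epoch` and `classify_cert` and the output format are hypothetical and not taken from `ssl-monitor.sh`; `WARN_DAYS` and `CRITICAL_DAYS` are the setting names mentioned in this document.

```bash
# Hypothetical helpers (not from ssl-monitor.sh).

# Print the certificate expiry of host $1 as epoch seconds (requires GNU date).
cert_expiry_epoch() {
    host=$1
    end=$(echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null \
        | openssl x509 -noout -enddate | cut -d= -f2)
    date -d "$end" +%s
}

# Classify a certificate given the current time and its expiry (epoch seconds).
# $1 = now, $2 = expiry
classify_cert() {
    now=$1
    expiry=$2
    warn_days=${WARN_DAYS:-30}
    crit_days=${CRITICAL_DAYS:-7}

    if [ "$expiry" -le "$now" ]; then
        echo "CRITICAL expired"
        return 2
    fi

    # Whole days remaining (integer division truncates partial days)
    days=$(( ($expiry - $now) / 86400 ))

    if [ "$days" -le "$crit_days" ]; then
        echo "CRITICAL days_left=$days"
        return 2
    fi
    if [ "$days" -le "$warn_days" ]; then
        echo "WARNING days_left=$days"
        return 1
    fi
    echo "OK days_left=$days"
    return 0
}
```

A daily run could then call `classify_cert "$(date +%s)" "$(cert_expiry_epoch agenticgovernance.digital)"`.
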
### 5. Master Monitor (`monitor-all.sh`)

Orchestrates all monitoring checks in a single run.

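
One way such an orchestrator could combine the individual monitors is to run each one and report the worst exit code. The sketch below assumes a convention of 0 = OK, 1 = warning, 2 = critical; that convention and the function name `run_monitors` are assumptions for illustration and are not taken from `monitor-all.sh`.

```bash
# Hypothetical helper (not from monitor-all.sh): run each command line given
# as an argument and return the highest exit code seen.
run_monitors() {
    worst=0
    for cmd in "$@"; do
        sh -c "$cmd"
        result=$?
        if [ "$result" -gt "$worst" ]; then
            worst=$result
        fi
    done
    return "$worst"
}

# Example call (paths as used elsewhere in this document):
# run_monitors ./health-check.sh ./log-monitor.sh ./disk-monitor.sh
```

Returning the worst status lets cron wrappers or external tools react to a single exit code while every monitor still gets a chance to run.
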
---

## Installation

### Prerequisites

```bash
# Ensure required commands are available
sudo apt-get update
sudo apt-get install -y curl gnupg jq openssl mailutils

# Install MongoDB shell (if not installed)
# apt-key is deprecated, so the repository key goes into a dedicated keyring
curl -fsSL https://www.mongodb.org/static/pgp/server-7.0.asc | \
  sudo gpg -o /usr/share/keyrings/mongodb-server-7.0.gpg --dearmor
echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-7.0.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list
sudo apt-get update
sudo apt-get install -y mongodb-mongosh
```

### Deploy Monitoring Scripts

```bash
# From local machine, deploy monitoring scripts to production
rsync -avz -e "ssh -i ~/.ssh/tractatus_deploy" \
  scripts/monitoring/ \
  ubuntu@vps-93a693da.vps.ovh.net:/var/www/tractatus/scripts/monitoring/
```

### Set Up Log Directory

```bash
# On production server
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net

# Create log directory
sudo mkdir -p /var/log/tractatus
sudo chown ubuntu:ubuntu /var/log/tractatus
sudo chmod 755 /var/log/tractatus
```

### Make Scripts Executable

```bash
# On production server
cd /var/www/tractatus/scripts/monitoring
chmod +x *.sh
```

### Configure Email Alerts

**Option 1: Using Postfix (Recommended for production)**

```bash
# Install Postfix
sudo apt-get install -y postfix

# Configure Postfix (select "Internet Site")
sudo dpkg-reconfigure postfix

# Set ALERT_EMAIL environment variable
echo 'export ALERT_EMAIL="your-email@example.com"' | sudo tee -a /etc/environment
source /etc/environment
```

**Option 2: Using External SMTP (ProtonMail, Gmail, etc.)**

```bash
# Install sendemail
sudo apt-get install -y sendemail libio-socket-ssl-perl libnet-ssleay-perl

# Configure in monitoring scripts (or use system mail)
```

**Option 3: No Email (Testing)**

```bash
# Leave ALERT_EMAIL unset - monitoring will log but not send emails
# Useful for initial testing
```

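
The three options above imply that an alert helper has to handle a test mode, an unset `ALERT_EMAIL`, and real delivery. A minimal sketch of that decision, assuming delivery through the `mail` command from `mailutils`: the function name `send_alert` and the `TEST_MODE` variable are hypothetical and not taken from the monitoring scripts, which expose test mode through a `--test` flag.

```bash
# Hypothetical helper (not from the monitoring scripts).
# $1 = subject, $2 = body
send_alert() {
    subject=$1
    body=$2

    # Test mode: report what would happen without sending anything
    if [ "${TEST_MODE:-0}" = "1" ]; then
        echo "[INFO] Test mode: would send alert: $subject"
        return 0
    fi

    # No recipient configured: log only (Option 3 above)
    if [ -z "${ALERT_EMAIL:-}" ]; then
        echo "[INFO] ALERT_EMAIL not set; alert logged only: $subject"
        return 0
    fi

    # Deliver through the local mail command (Option 1 above)
    printf '%s\n' "$body" | mail -s "$subject" "$ALERT_EMAIL"
}
```

Keeping the body short and free of credentials is in line with the privacy goals stated in this document.
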
### Test Monitoring Scripts

```bash
# Test health check
cd /var/www/tractatus/scripts/monitoring
./health-check.sh --test

# Test log monitor
./log-monitor.sh --since "10 minutes ago" --test

# Test disk monitor
./disk-monitor.sh --test

# Test SSL monitor
./ssl-monitor.sh --test

# Test master monitor
./monitor-all.sh --test
```

Expected output: Each script should run without errors and show `[INFO]` messages.

---

## Cron Configuration

### Create Monitoring Cron Jobs

```bash
# On production server
crontab -e
```

Add the following cron jobs:

```cron
# Tractatus Production Monitoring
# Logs: /var/log/tractatus/monitoring.log

# Master monitoring (every 5 minutes)
# Runs: health check, log monitor, disk monitor
*/5 * * * * /var/www/tractatus/scripts/monitoring/monitor-all.sh --skip-ssl >> /var/log/tractatus/cron-monitor.log 2>&1

# SSL certificate check (daily at 3am)
0 3 * * * /var/www/tractatus/scripts/monitoring/ssl-monitor.sh >> /var/log/tractatus/cron-ssl.log 2>&1

# Disk monitor (every 15 minutes - separate from master for frequency control)
*/15 * * * * /var/www/tractatus/scripts/monitoring/disk-monitor.sh >> /var/log/tractatus/cron-disk.log 2>&1
```

### Verify Cron Jobs

```bash
# List active cron jobs
crontab -l

# Check cron logs
sudo journalctl -u cron -f

# Wait 5 minutes, then check monitoring logs
tail -f /var/log/tractatus/cron-monitor.log
```

### Alternative: Systemd Timers (Optional)

Systemd timers are an alternative to cron. Output goes to the journal, and a failed run is visible in the unit status.

**Create timer file**: `/etc/systemd/system/tractatus-monitoring.timer`

```ini
[Unit]
Description=Tractatus Monitoring Timer
Requires=tractatus-monitoring.service

[Timer]
OnBootSec=5min
OnUnitActiveSec=5min
AccuracySec=1s

[Install]
WantedBy=timers.target
```

**Create service file**: `/etc/systemd/system/tractatus-monitoring.service`

```ini
[Unit]
Description=Tractatus Production Monitoring
After=network.target tractatus.service

[Service]
Type=oneshot
User=ubuntu
WorkingDirectory=/var/www/tractatus
ExecStart=/var/www/tractatus/scripts/monitoring/monitor-all.sh --skip-ssl
StandardOutput=journal
StandardError=journal
Environment="ALERT_EMAIL=your-email@example.com"

[Install]
WantedBy=multi-user.target
```

**Enable and start:**

```bash
sudo systemctl daemon-reload
sudo systemctl enable tractatus-monitoring.timer
sudo systemctl start tractatus-monitoring.timer

# Check status
sudo systemctl status tractatus-monitoring.timer
sudo systemctl list-timers
```

---

## Alert Configuration

### Alert Thresholds

**Health Check:**
- Consecutive failures: 3 (alerts on 3rd failure)
- Check interval: 5 minutes
- Time to alert: 15 minutes of downtime

**Log Monitor:**
- Error threshold: 10 errors in 5 minutes
- Critical threshold: 3 critical errors in 5 minutes
- Security events: Immediate alert

**Disk Space:**
- Warning: 80% usage
- Critical: 90% usage

**SSL Certificate:**
- Warning: 30 days until expiry
- Critical: 7 days until expiry

### Customize Alerts

Edit thresholds in scripts:

```bash
# Health check thresholds
vi /var/www/tractatus/scripts/monitoring/health-check.sh
# Change: MAX_FAILURES=3

# Log monitor thresholds
vi /var/www/tractatus/scripts/monitoring/log-monitor.sh
# Change: ERROR_THRESHOLD=10
# Change: CRITICAL_THRESHOLD=3

# Disk monitor thresholds
vi /var/www/tractatus/scripts/monitoring/disk-monitor.sh
# Change: WARN_THRESHOLD=80
# Change: CRITICAL_THRESHOLD=90

# SSL monitor thresholds
vi /var/www/tractatus/scripts/monitoring/ssl-monitor.sh
# Change: WARN_DAYS=30
# Change: CRITICAL_DAYS=7
```

---

## Manual Monitoring Commands

### Check Current Status

```bash
# Run all monitors manually
cd /var/www/tractatus/scripts/monitoring
./monitor-all.sh

# Run individual monitors
./health-check.sh
./log-monitor.sh --since "1 hour ago"
./disk-monitor.sh
./ssl-monitor.sh
```

### View Monitoring Logs

```bash
# View all monitoring logs
tail -f /var/log/tractatus/monitoring.log

# View specific monitor logs
tail -f /var/log/tractatus/health-check.log
tail -f /var/log/tractatus/log-monitor.log
tail -f /var/log/tractatus/disk-monitor.log
tail -f /var/log/tractatus/ssl-monitor.log

# View cron execution logs
tail -f /var/log/tractatus/cron-monitor.log
```

### Test Alert Delivery

```bash
# Send test alert
cd /var/www/tractatus/scripts/monitoring

# Dry run: when a check fails in test mode, the script shows
# "would send alert" instead of sending email
./health-check.sh --test

# Force alert by temporarily stopping service
sudo systemctl stop tractatus
./health-check.sh  # Alerts on the 3rd consecutive failure (run 3 times, or wait ~15 minutes of cron runs)
sudo systemctl start tractatus
```

---

## Troubleshooting

### No Alerts Received

**Check email configuration:**

```bash
# Verify ALERT_EMAIL is set
echo "$ALERT_EMAIL"

# Test mail command
echo "Test email" | mail -s "Test Subject" "$ALERT_EMAIL"

# Check mail logs
sudo tail -f /var/log/mail.log
```

**Check cron execution:**

```bash
# Verify cron jobs are installed
crontab -l

# Check cron logs
sudo journalctl -u cron -n 50

# Check script logs
tail -100 /var/log/tractatus/cron-monitor.log
```

### Scripts Not Executing

**Check permissions:**

```bash
ls -la /var/www/tractatus/scripts/monitoring/
# Should show: -rwxr-xr-x (executable)

# Fix if needed
chmod +x /var/www/tractatus/scripts/monitoring/*.sh
```

**Check cron PATH:**

```bash
# Add to crontab
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

# Or use full paths in cron commands
```

### High Alert Frequency

**Increase thresholds:**

Edit threshold values in scripts (see Alert Configuration section).

**Increase consecutive failure count:**

```bash
vi /var/www/tractatus/scripts/monitoring/health-check.sh
# Increase MAX_FAILURES from 3 to 5 or higher
```

### False Positives

**Review alert conditions:**

```bash
# Check recent logs to understand why alerts triggered
tail -100 /var/log/tractatus/monitoring.log

# Run a manual check and read its output
/var/www/tractatus/scripts/monitoring/health-check.sh

# Check if service is actually unhealthy
sudo systemctl status tractatus
curl https://agenticgovernance.digital/health
```

---

## Monitoring Dashboard (Optional - Future Enhancement)

### Option 1: Grafana + Prometheus

Self-hosted metrics dashboard (requires setup).

### Option 2: Simple Web Dashboard

Create minimal status page showing last check results.

### Option 3: UptimeRobot Free Tier

External monitoring service (privacy tradeoff).

**Not implemented yet** - current solution uses email alerts only.

---

## Best Practices

### DO:
- ✅ Test monitoring scripts before deploying
- ✅ Check alert emails regularly
- ✅ Review monitoring logs weekly
- ✅ Adjust thresholds based on actual patterns
- ✅ Document any monitoring configuration changes
- ✅ Keep monitoring scripts updated

### DON'T:
- ❌ Ignore alert emails
- ❌ Set thresholds too low (alert fatigue)
- ❌ Deploy monitoring without testing
- ❌ Disable monitoring without planning
- ❌ Let log files grow unbounded
- ❌ Ignore repeated warnings

### Monitoring Hygiene

```bash
# Rotate monitoring logs weekly
# (logrotate only rotates when a rotation is due; add -f to force one)
sudo logrotate /etc/logrotate.d/tractatus-monitoring

# Clean up old state files
find /var/tmp -name "tractatus-*-state" -mtime +7 -delete

# Review alert frequency monthly
grep "\[ALERT\]" /var/log/tractatus/monitoring.log | wc -l
```

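
The rotation command above expects a logrotate configuration at `/etc/logrotate.d/tractatus-monitoring`, which this document does not show. The following sketch prints one possible configuration. The retention of 8 rotations, the compression settings, and the function name `print_logrotate_conf` are assumptions; the log path, owner, and 640 mode follow values used elsewhere in this document.

```bash
# Hypothetical helper: print a candidate logrotate configuration.
# Retention and compression settings are assumptions, not project policy.
print_logrotate_conf() {
    cat <<'EOF'
/var/log/tractatus/*.log {
    weekly
    rotate 8
    compress
    delaycompress
    missingok
    notifempty
    create 640 ubuntu ubuntu
}
EOF
}

# Install and dry-run (commented out because both need root):
# print_logrotate_conf | sudo tee /etc/logrotate.d/tractatus-monitoring
# sudo logrotate -d /etc/logrotate.d/tractatus-monitoring
```

The `-d` flag runs logrotate in debug mode, which reports what would be rotated without changing any files.
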
---

## Incident Response

### When Alert Received

1. **Acknowledge alert** - Note time received
2. **Check current status** - Run manual health check
3. **Review logs** - Check what triggered alert
4. **Investigate root cause** - See deployment checklist emergency procedures
5. **Take action** - Fix issue or escalate
6. **Document** - Create incident report

### Critical Alert Response Time

- **Health check failure**: Respond within 15 minutes
- **Log errors**: Respond within 30 minutes
- **Disk space critical**: Respond within 1 hour
- **SSL expiry (7 days)**: Respond within 24 hours

---

## Maintenance

### Weekly Tasks

- [ ] Review monitoring logs for patterns
- [ ] Check alert email inbox
- [ ] Verify cron jobs still running
- [ ] Review disk space trends

### Monthly Tasks

- [ ] Review and adjust alert thresholds
- [ ] Clean up old monitoring logs
- [ ] Test manual failover procedures
- [ ] Update monitoring documentation

### Quarterly Tasks

- [ ] Full monitoring system audit
- [ ] Test all alert scenarios
- [ ] Review incident response times
- [ ] Consider monitoring enhancements

---

## Monitoring Metrics

### Success Metrics

- **Uptime**: Target 99.9% (about 43 minutes of downtime per 30-day month)
- **Alert Response Time**: < 30 minutes for critical
- **False Positive Rate**: < 5% of alerts
- **Detection Time**: < 5 minutes for critical issues

### Tracking

```bash
# Count successful health checks (the basis for an uptime estimate)
grep -c "Health endpoint OK" /var/log/tractatus/monitoring.log

# Count alerts sent
grep -c "Alert email sent" /var/log/tractatus/monitoring.log

# Review response times (manual from incident reports)
```

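
The count of successful health checks becomes an uptime figure only when divided by the number of checks that were expected to run. A minimal sketch of that arithmetic, assuming one check every 5 minutes (8640 checks in 30 days); the function names `uptime_percent` and `downtime_budget_minutes` are hypothetical.

```bash
# Hypothetical helpers for the uptime arithmetic.

# $1 = successful checks, $2 = total checks in the period
uptime_percent() {
    awk -v ok="$1" -v total="$2" 'BEGIN {
        if (total == 0) { print "n/a"; exit }
        printf "%.2f\n", 100 * ok / total
    }'
}

# $1 = uptime target in percent, $2 = days in the period
# Prints the allowed downtime in minutes
downtime_budget_minutes() {
    awk -v target="$1" -v days="$2" 'BEGIN {
        printf "%.1f\n", (100 - target) / 100 * days * 24 * 60
    }'
}
```

For example, `downtime_budget_minutes 99.9 30` prints `43.2`, and `uptime_percent 999 1000` prints `99.90`. Because checks run every 5 minutes, the estimate cannot resolve outages shorter than one interval.
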
---

## Security Considerations

### Log Access Control

```bash
# Ensure logs are readable only by ubuntu user and root
sudo chown ubuntu:ubuntu /var/log/tractatus/*.log
sudo chmod 640 /var/log/tractatus/*.log
```

### Alert Email Security

- Use encrypted email if possible (ProtonMail)
- Don't include sensitive data in alert body
- Alerts show symptoms, not credentials

### Monitoring Script Security

- Scripts run as ubuntu user (not root)
- No credentials embedded in scripts
- Use environment variables for sensitive config

---

## Future Enhancements

### Planned Improvements

- [ ] **Metrics collection**: Store monitoring metrics in database for trend analysis
- [ ] **Status page**: Public status page showing service availability
- [ ] **Mobile alerts**: SMS or push notifications for critical alerts
- [ ] **Distributed monitoring**: Multiple monitoring locations for redundancy
- [ ] **Automated remediation**: Auto-restart service on failure
- [ ] **Performance monitoring**: Response time tracking, query performance
- [ ] **User impact monitoring**: Track error rates from user perspective

### Integration Opportunities

- [ ] **Plausible Analytics**: Monitor traffic patterns, correlate with errors
- [ ] **GitHub Actions**: Run monitoring checks in CI/CD
- [ ] **Slack integration**: Send alerts to Slack channel
- [ ] **Database backup monitoring**: Alert on backup failures

---

## Support & Documentation

**Monitoring Scripts**: `/var/www/tractatus/scripts/monitoring/`
**Monitoring Logs**: `/var/log/tractatus/`
**Cron Configuration**: `crontab -l` (ubuntu user)
**Alert Email**: Set via `ALERT_EMAIL` environment variable

**Related Documents:**
- [Production Deployment Checklist](PRODUCTION_DEPLOYMENT_CHECKLIST.md)
- [Phase 4 Preparation Checklist](../PHASE-4-PREPARATION-CHECKLIST.md)

---

**Document Status**: Ready for Production
**Last Updated**: 2025-10-09
**Next Review**: After 1 month of monitoring data
**Maintainer**: Technical Lead (Claude Code + John Stroh)