Production Monitoring Setup

Project: Tractatus AI Safety Framework Website
Environment: Production (vps-93a693da.vps.ovh.net)
Created: 2025-10-09
Status: Ready for Deployment


Overview

A comprehensive monitoring system for the Tractatus production environment, providing:

  • Health monitoring - Application uptime, service status, database connectivity
  • Log monitoring - Error detection, security events, anomaly detection
  • Disk monitoring - Disk space usage alerts
  • SSL monitoring - Certificate expiry warnings
  • Email alerts - Automated notifications for critical issues

Philosophy: Privacy-first, self-hosted monitoring aligned with Tractatus values.


Monitoring Components

1. Health Check Monitor (health-check.sh)

What it monitors:

  • Application service status
  • Health endpoint response
  • Database connectivity
  • Disk space usage

Alert Triggers:

  • Service not running
  • Health endpoint returns non-200
  • Database connection failed
  • Disk space > 90%

Frequency: Every 5 minutes

2. Log Monitor (log-monitor.sh)

What it monitors:

  • ERROR and CRITICAL log entries
  • Security events (authentication failures, unauthorized access)
  • Database errors
  • HTTP 500 errors
  • Unhandled exceptions

Alert Triggers:

  • 10+ errors in 5-minute window
  • 3+ critical errors in 5-minute window
  • Any security events

Frequency: Every 5 minutes

Follow Mode: Can run continuously for real-time monitoring
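
The threshold logic described above can be sketched as follows. This is a minimal illustration, not the contents of log-monitor.sh: the function name `evaluate_log_window`, the matched substrings, and the rule that a line containing CRITICAL is counted only as critical are all assumptions. The default values mirror the thresholds listed in this section.

```bash
# Hypothetical sketch of the log-monitor decision rule (not the real script).
ERROR_THRESHOLD="${ERROR_THRESHOLD:-10}"
CRITICAL_THRESHOLD="${CRITICAL_THRESHOLD:-3}"

# evaluate_log_window: reads the log lines of one 5-minute window on stdin
# and prints one of: security, critical, error, ok.
evaluate_log_window() {
  local errors=0 criticals=0 security=0 line
  while IFS= read -r line; do
    # Severity counters (assumed markers: the literal words CRITICAL and ERROR)
    case "$line" in
      *CRITICAL*) criticals=$((criticals + 1)) ;;
      *ERROR*)    errors=$((errors + 1)) ;;
    esac
    # Security events (assumed phrases; real log wording may differ)
    case "$line" in
      *"authentication failure"*|*"unauthorized access"*) security=$((security + 1)) ;;
    esac
  done
  if [ "$security" -gt 0 ]; then
    echo "security"        # any security event alerts immediately
  elif [ "$criticals" -ge "$CRITICAL_THRESHOLD" ]; then
    echo "critical"
  elif [ "$errors" -ge "$ERROR_THRESHOLD" ]; then
    echo "error"
  else
    echo "ok"
  fi
}
```

For example, `journalctl -u tractatus --since "5 minutes ago" | evaluate_log_window` would classify the most recent window, assuming the application logs to the journal under that unit name.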

3. Disk Space Monitor (disk-monitor.sh)

What it monitors:

  • Root filesystem (/)
  • Var directory (/var)
  • Log directory (/var/log)
  • Tractatus application (/var/www/tractatus)
  • Temp directory (/tmp)

Alert Triggers:

  • Warning: 80%+ usage
  • Critical: 90%+ usage

Frequency: Every 15 minutes
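
As a rough sketch of how the warning and critical levels could be derived, the snippet below reads usage from `df -P` and classifies it. The function names (`classify_usage`, `usage_percent`, `check_paths`) are hypothetical and the real disk-monitor.sh may work differently; only the 80% and 90% thresholds come from this document.

```bash
# Hypothetical sketch of the disk-monitor classification (not the real script).
WARN_THRESHOLD="${WARN_THRESHOLD:-80}"
CRITICAL_THRESHOLD="${CRITICAL_THRESHOLD:-90}"

# classify_usage <percent>: prints critical, warning, or ok.
classify_usage() {
  local pct="$1"
  if [ "$pct" -ge "$CRITICAL_THRESHOLD" ]; then
    echo "critical"
  elif [ "$pct" -ge "$WARN_THRESHOLD" ]; then
    echo "warning"
  else
    echo "ok"
  fi
}

# usage_percent <path>: prints the usage of the filesystem holding <path>
# as a bare integer, taken from the Capacity column of POSIX df output.
usage_percent() {
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# check_paths <path>...: prints "<path> <pct>% <level>" for each existing path.
check_paths() {
  local path pct
  for path in "$@"; do
    [ -e "$path" ] || continue   # skip paths that do not exist on this host
    pct="$(usage_percent "$path")"
    echo "$path ${pct}% $(classify_usage "$pct")"
  done
}
```

A call such as `check_paths / /var /var/log /var/www/tractatus /tmp` covers the paths listed above. Several of them may sit on the same filesystem, in which case they report the same percentage.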

4. SSL Certificate Monitor (ssl-monitor.sh)

What it monitors:

  • SSL certificate expiry for agenticgovernance.digital

Alert Triggers:

  • Warning: Expires in 30 days or less
  • Critical: Expires in 7 days or less
  • Critical: Already expired

Frequency: Daily
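
The expiry check can be approximated with `openssl` and GNU `date`, as in the sketch below. It assumes GNU coreutils (as on Ubuntu) and network access for `fetch_enddate`; the function names are illustrative and are not taken from ssl-monitor.sh.

```bash
# Hypothetical sketch of the SSL expiry check (not the real script).
WARN_DAYS="${WARN_DAYS:-30}"
CRITICAL_DAYS="${CRITICAL_DAYS:-7}"

# fetch_enddate <host>: prints the certificate's notAfter date.
# Requires outbound access to <host>:443.
fetch_enddate() {
  echo | openssl s_client -servername "$1" -connect "$1:443" 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2
}

# days_until <date-string> [now-epoch]: prints whole days remaining,
# or -1 if the date is already in the past. Uses GNU date -d for parsing.
days_until() {
  local expiry now diff
  expiry="$(date -u -d "$1" +%s)" || return 1
  now="${2:-$(date -u +%s)}"
  diff=$((expiry - now))
  if [ "$diff" -lt 0 ]; then
    echo -1
  else
    echo $((diff / 86400))
  fi
}

# classify_days <days>: prints expired, critical, warning, or ok.
classify_days() {
  local days="$1"
  if [ "$days" -lt 0 ]; then
    echo "expired"
  elif [ "$days" -le "$CRITICAL_DAYS" ]; then
    echo "critical"
  elif [ "$days" -le "$WARN_DAYS" ]; then
    echo "warning"
  else
    echo "ok"
  fi
}
```

Usage would look like `classify_days "$(days_until "$(fetch_enddate agenticgovernance.digital)")"`.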

5. Master Monitor (monitor-all.sh)

Orchestrates all monitoring checks in a single run.
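
One plausible shape for such an orchestrator is shown below. This is a sketch under assumptions, not the actual monitor-all.sh: the function name `run_all` and the log wording are invented for illustration, while the script names and the `--skip-ssl` flag come from this document.

```bash
# Hypothetical sketch of a master monitor (not the real monitor-all.sh).

# run_all <scripts-dir> [--skip-ssl]: runs each monitor in turn, keeps going
# after a failure, and returns non-zero if any monitor failed.
run_all() {
  local dir="$1" skip_ssl="${2:-}" failed=0 script
  for script in health-check.sh log-monitor.sh disk-monitor.sh ssl-monitor.sh; do
    if [ "$script" = "ssl-monitor.sh" ] && [ "$skip_ssl" = "--skip-ssl" ]; then
      echo "[INFO] Skipping $script"
      continue
    fi
    # Invoked through bash so the sketch does not depend on the execute bit
    if bash "$dir/$script"; then
      echo "[INFO] $script passed"
    else
      echo "[ERROR] $script failed"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}
```

Running every check even after one fails means a single broken monitor does not hide problems reported by the others.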


Installation

Prerequisites

# Ensure required commands are available
sudo apt-get update
sudo apt-get install -y curl jq openssl mailutils

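# Install MongoDB shell (if not installed)
# apt-key is deprecated, so store the repository key in a dedicated keyring instead
curl -fsSL https://www.mongodb.org/static/pgp/server-7.0.asc | sudo gpg -o /usr/share/keyrings/mongodb-server-7.0.gpg --dearmor
echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-7.0.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list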
sudo apt-get update
sudo apt-get install -y mongodb-mongosh

Deploy Monitoring Scripts

# From local machine, deploy monitoring scripts to production
rsync -avz -e "ssh -i ~/.ssh/tractatus_deploy" \
  scripts/monitoring/ \
  ubuntu@vps-93a693da.vps.ovh.net:/var/www/tractatus/scripts/monitoring/

Set Up Log Directory

# On production server
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net

# Create log directory
sudo mkdir -p /var/log/tractatus
sudo chown ubuntu:ubuntu /var/log/tractatus
sudo chmod 755 /var/log/tractatus

Make Scripts Executable

# On production server
cd /var/www/tractatus/scripts/monitoring
chmod +x *.sh

Configure Email Alerts

Option 1: Using Postfix (Recommended for production)

# Install Postfix
sudo apt-get install -y postfix

# Configure Postfix (select "Internet Site")
sudo dpkg-reconfigure postfix

# Set ALERT_EMAIL environment variable
echo 'export ALERT_EMAIL="your-email@example.com"' | sudo tee -a /etc/environment
source /etc/environment

Option 2: Using External SMTP (ProtonMail, Gmail, etc.)

# Install sendemail
sudo apt-get install -y sendemail libio-socket-ssl-perl libnet-ssleay-perl

# Configure in monitoring scripts (or use system mail)

Option 3: No Email (Testing)

# Leave ALERT_EMAIL unset - monitoring will log but not send emails
# Useful for initial testing

Test Monitoring Scripts

# Test health check
cd /var/www/tractatus/scripts/monitoring
./health-check.sh --test

# Test log monitor
./log-monitor.sh --since "10 minutes ago" --test

# Test disk monitor
./disk-monitor.sh --test

# Test SSL monitor
./ssl-monitor.sh --test

# Test master monitor
./monitor-all.sh --test

Expected output: Each script should run without errors and show [INFO] messages.


Cron Configuration

Create Monitoring Cron Jobs

# On production server
crontab -e

Add the following cron jobs:

# Tractatus Production Monitoring
# Logs: /var/log/tractatus/monitoring.log

# Master monitoring (every 5 minutes)
# Runs: health check, log monitor, disk monitor
*/5 * * * * /var/www/tractatus/scripts/monitoring/monitor-all.sh --skip-ssl >> /var/log/tractatus/cron-monitor.log 2>&1

# SSL certificate check (daily at 3am)
0 3 * * * /var/www/tractatus/scripts/monitoring/ssl-monitor.sh >> /var/log/tractatus/cron-ssl.log 2>&1

# Disk monitor (every 15 minutes). The master run above also invokes the disk monitor,
# so keep this entry only if the disk check is removed from monitor-all.sh
*/15 * * * * /var/www/tractatus/scripts/monitoring/disk-monitor.sh >> /var/log/tractatus/cron-disk.log 2>&1

Verify Cron Jobs

# List active cron jobs
crontab -l

# Check cron logs
sudo journalctl -u cron -f

# Wait 5 minutes, then check monitoring logs
tail -f /var/log/tractatus/cron-monitor.log

Alternative: Systemd Timers (Optional)

Systemd timers are a more modern alternative to cron. They log to the journal and report failures through the unit status.

Create timer file: /etc/systemd/system/tractatus-monitoring.timer

[Unit]
Description=Tractatus Monitoring Timer
Requires=tractatus-monitoring.service

[Timer]
OnBootSec=5min
OnUnitActiveSec=5min
AccuracySec=1s

[Install]
WantedBy=timers.target

Create service file: /etc/systemd/system/tractatus-monitoring.service

[Unit]
Description=Tractatus Production Monitoring
After=network.target tractatus.service

[Service]
Type=oneshot
User=ubuntu
WorkingDirectory=/var/www/tractatus
ExecStart=/var/www/tractatus/scripts/monitoring/monitor-all.sh --skip-ssl
StandardOutput=journal
StandardError=journal
Environment="ALERT_EMAIL=your-email@example.com"

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl daemon-reload
sudo systemctl enable tractatus-monitoring.timer
sudo systemctl start tractatus-monitoring.timer

# Check status
sudo systemctl status tractatus-monitoring.timer
sudo systemctl list-timers

Alert Configuration

Alert Thresholds

Health Check:

  • Consecutive failures: 3 (alerts on 3rd failure)
  • Check interval: 5 minutes
  • Time to alert: 10 to 15 minutes after downtime begins (the 3rd failed check)

Log Monitor:

  • Error threshold: 10 errors in 5 minutes
  • Critical threshold: 3 critical errors in 5 minutes
  • Security events: Immediate alert

Disk Space:

  • Warning: 80% usage
  • Critical: 90% usage

SSL Certificate:

  • Warning: 30 days until expiry
  • Critical: 7 days until expiry
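
The consecutive-failure rule for the health check needs state that survives between cron runs. A minimal sketch follows, assuming a counter kept in a state file under /var/tmp (the file name and the function `record_result` are hypothetical) and assuming the alert fires once, on exactly the 3rd failure, rather than on every later failure.

```bash
# Hypothetical sketch of consecutive-failure tracking (not the real script).
MAX_FAILURES="${MAX_FAILURES:-3}"
STATE_FILE="${STATE_FILE:-/var/tmp/tractatus-health-state}"

# record_result <ok|fail>: updates the persisted failure counter and prints
# "alert" when the counter reaches MAX_FAILURES, otherwise "quiet".
record_result() {
  local result="$1" count
  count="$(cat "$STATE_FILE" 2>/dev/null)"
  count="${count:-0}"              # a missing or empty file means no failures yet
  if [ "$result" = "ok" ]; then
    count=0                        # any success resets the streak
  else
    count=$((count + 1))
  fi
  printf '%s\n' "$count" > "$STATE_FILE"
  if [ "$count" -eq "$MAX_FAILURES" ]; then
    echo "alert"
  else
    echo "quiet"
  fi
}
```

Resetting the counter on any success is what keeps a single slow response from paging anyone. The cost is the 10 to 15 minute delay before a real outage is reported.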

Customize Alerts

Edit thresholds in scripts:

# Health check thresholds
vi /var/www/tractatus/scripts/monitoring/health-check.sh
# Change: MAX_FAILURES=3

# Log monitor thresholds
vi /var/www/tractatus/scripts/monitoring/log-monitor.sh
# Change: ERROR_THRESHOLD=10
# Change: CRITICAL_THRESHOLD=3

# Disk monitor thresholds
vi /var/www/tractatus/scripts/monitoring/disk-monitor.sh
# Change: WARN_THRESHOLD=80
# Change: CRITICAL_THRESHOLD=90

# SSL monitor thresholds
vi /var/www/tractatus/scripts/monitoring/ssl-monitor.sh
# Change: WARN_DAYS=30
# Change: CRITICAL_DAYS=7

Manual Monitoring Commands

Check Current Status

# Run all monitors manually
cd /var/www/tractatus/scripts/monitoring
./monitor-all.sh

# Run individual monitors
./health-check.sh
./log-monitor.sh --since "1 hour ago"
./disk-monitor.sh
./ssl-monitor.sh

View Monitoring Logs

# View all monitoring logs
tail -f /var/log/tractatus/monitoring.log

# View specific monitor logs
tail -f /var/log/tractatus/health-check.log
tail -f /var/log/tractatus/log-monitor.log
tail -f /var/log/tractatus/disk-monitor.log
tail -f /var/log/tractatus/ssl-monitor.log

# View cron execution logs
tail -f /var/log/tractatus/cron-monitor.log

Test Alert Delivery

# Send test alert
cd /var/www/tractatus/scripts/monitoring

# Dry run: in test mode a failed check prints "would send alert" instead of sending email
./health-check.sh --test

# Force a real alert by temporarily stopping the service
sudo systemctl stop tractatus
./health-check.sh  # Alerts on the 3rd consecutive failure: run it 3 times, or wait about 15 minutes for cron
sudo systemctl start tractatus

Troubleshooting

No Alerts Received

Check email configuration:

# Verify ALERT_EMAIL is set
echo "$ALERT_EMAIL"

# Test mail command
echo "Test email" | mail -s "Test Subject" "$ALERT_EMAIL"

# Check mail logs
sudo tail -f /var/log/mail.log

Check cron execution:

# Verify cron jobs are running
crontab -l

# Check cron logs
sudo journalctl -u cron -n 50

# Check script logs
tail -100 /var/log/tractatus/cron-monitor.log

Scripts Not Executing

Check permissions:

ls -la /var/www/tractatus/scripts/monitoring/
# Should show: -rwxr-xr-x (executable)

# Fix if needed
chmod +x /var/www/tractatus/scripts/monitoring/*.sh

Check cron PATH:

# Add to crontab
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

# Or use full paths in cron commands

High Alert Frequency

Increase thresholds:

Edit threshold values in scripts (see Alert Configuration section).

Increase consecutive failure count:

vi /var/www/tractatus/scripts/monitoring/health-check.sh
# Increase MAX_FAILURES from 3 to 5 or higher

False Positives

Review alert conditions:

# Check recent logs to understand why alerts triggered
tail -100 /var/log/tractatus/monitoring.log

# Run manual check with verbose output
./health-check.sh

# Check if service is actually unhealthy
sudo systemctl status tractatus
curl https://agenticgovernance.digital/health

Monitoring Dashboard (Optional - Future Enhancement)

Option 1: Grafana + Prometheus

Self-hosted metrics dashboard (requires setup).

Option 2: Simple Web Dashboard

Create minimal status page showing last check results.

Option 3: UptimeRobot Free Tier

External monitoring service (privacy tradeoff).

Not implemented yet - current solution uses email alerts only.


Best Practices

DO:

  • Test monitoring scripts before deploying
  • Check alert emails regularly
  • Review monitoring logs weekly
  • Adjust thresholds based on actual patterns
  • Document any monitoring configuration changes
  • Keep monitoring scripts updated

DON'T:

  • Ignore alert emails
  • Set thresholds too low (alert fatigue)
  • Deploy monitoring without testing
  • Disable monitoring without planning
  • Let log files grow unbounded
  • Ignore repeated warnings

Monitoring Hygiene

# Run log rotation for the monitoring logs (add -f to force a rotation now)
sudo logrotate /etc/logrotate.d/tractatus-monitoring

# Clean up old state files
find /var/tmp -name "tractatus-*-state" -mtime +7 -delete

# Review alert frequency monthly
grep "\[ALERT\]" /var/log/tractatus/monitoring.log | wc -l
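
The commands above refer to /etc/logrotate.d/tractatus-monitoring, which this document does not define. One possible configuration is sketched below as a shell function that prints it. The retention of 8 weeks and the function name `logrotate_config` are assumptions; the file ownership and mode follow the Log Access Control section.

```bash
# Hypothetical logrotate configuration for the monitoring logs.

# logrotate_config: prints a candidate config to stdout.
logrotate_config() {
  cat <<'EOF'
/var/log/tractatus/*.log {
    su ubuntu ubuntu
    weekly
    rotate 8
    compress
    delaycompress
    missingok
    notifempty
    create 640 ubuntu ubuntu
}
EOF
}
```

To install it, pipe the output through `sudo tee /etc/logrotate.d/tractatus-monitoring`, then check it with `sudo logrotate -d /etc/logrotate.d/tractatus-monitoring`. The `-d` flag is a dry run that prints what would be rotated.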

Incident Response

When Alert Received

  1. Acknowledge alert - Note time received
  2. Check current status - Run manual health check
  3. Review logs - Check what triggered alert
  4. Investigate root cause - See deployment checklist emergency procedures
  5. Take action - Fix issue or escalate
  6. Document - Create incident report

Critical Alert Response Time

  • Health check failure: Respond within 15 minutes
  • Log errors: Respond within 30 minutes
  • Disk space critical: Respond within 1 hour
  • SSL expiry (7 days): Respond within 24 hours

Maintenance

Weekly Tasks

  • Review monitoring logs for patterns
  • Check alert email inbox
  • Verify cron jobs still running
  • Review disk space trends

Monthly Tasks

  • Review and adjust alert thresholds
  • Clean up old monitoring logs
  • Test manual failover procedures
  • Update monitoring documentation

Quarterly Tasks

  • Full monitoring system audit
  • Test all alert scenarios
  • Review incident response times
  • Consider monitoring enhancements

Monitoring Metrics

Success Metrics

  • Uptime: Target 99.9% (< 45 minutes downtime/month)
  • Alert Response Time: < 30 minutes for critical
  • False Positive Rate: < 5% of alerts
  • Detection Time: < 5 minutes for critical issues
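
The downtime allowance implied by an uptime target is simple arithmetic: (100 - target) / 100 × minutes in the period. The helper below is only an illustration of that calculation, and its name is made up.

```bash
# downtime_budget_minutes <target-percent> <days>: prints the allowed
# downtime in minutes for the given uptime target over <days> days.
downtime_budget_minutes() {
  awk -v target="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", (100 - target) / 100 * days * 24 * 60 }'
}

downtime_budget_minutes 99.9 30   # prints 43.2
downtime_budget_minutes 99.9 31   # prints 44.6
```

Since the health check only alerts after 10 to 15 minutes of downtime, a 99.9% target leaves room for roughly three alert-worthy outages per month, and only if each one is resolved almost immediately.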

Tracking

# Count successful health checks (compare against the 288 checks expected per day to estimate uptime)
grep "Health endpoint OK" /var/log/tractatus/monitoring.log | wc -l

# Count alerts sent
grep "Alert email sent" /var/log/tractatus/monitoring.log | wc -l

# Review response times (manual from incident reports)
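
A count of successful checks becomes an uptime estimate once it is divided by the number of checks expected in the period. The sketch below assumes one health check every 5 minutes (288 per day) and a log file covering exactly the period given. `uptime_percent` is a hypothetical helper, and manual runs that also write to the log would inflate the result, which is why it is capped at 100.

```bash
# Hypothetical uptime estimate from the monitoring log.
CHECKS_PER_DAY=288   # one check every 5 minutes

# uptime_percent <log-file> <days>: prints successful checks as a
# percentage of the checks expected over <days> days.
uptime_percent() {
  local log_file="$1" days="$2" ok expected
  ok="$(grep -c "Health endpoint OK" "$log_file")"
  expected=$((days * CHECKS_PER_DAY))
  awk -v ok="$ok" -v expected="$expected" \
    'BEGIN { pct = ok / expected * 100; if (pct > 100) pct = 100; printf "%.2f\n", pct }'
}
```

For example, `uptime_percent /var/log/tractatus/monitoring.log 7` estimates uptime over the last week, provided the log was rotated seven days ago.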

Security Considerations

Log Access Control

# Ensure logs are readable only by ubuntu user and root
sudo chown ubuntu:ubuntu /var/log/tractatus/*.log
sudo chmod 640 /var/log/tractatus/*.log

Alert Email Security

  • Use encrypted email if possible (ProtonMail)
  • Don't include sensitive data in alert body
  • Alerts show symptoms, not credentials

Monitoring Script Security

  • Scripts run as ubuntu user (not root)
  • No credentials embedded in scripts
  • Use environment variables for sensitive config

Future Enhancements

Planned Improvements

  • Metrics collection: Store monitoring metrics in database for trend analysis
  • Status page: Public status page showing service availability
  • Mobile alerts: SMS or push notifications for critical alerts
  • Distributed monitoring: Multiple monitoring locations for redundancy
  • Automated remediation: Auto-restart service on failure
  • Performance monitoring: Response time tracking, query performance
  • User impact monitoring: Track error rates from user perspective

Integration Opportunities

  • Plausible Analytics: Monitor traffic patterns, correlate with errors
  • GitHub Actions: Run monitoring checks in CI/CD
  • Slack integration: Send alerts to Slack channel
  • Database backup monitoring: Alert on backup failures

Support & Documentation

Monitoring Scripts: /var/www/tractatus/scripts/monitoring/
Monitoring Logs: /var/log/tractatus/
Cron Configuration: crontab -l (ubuntu user)
Alert Email: Set via ALERT_EMAIL environment variable

Related Documents:


Document Status: Ready for Production
Last Updated: 2025-10-09
Next Review: After 1 month of monitoring data
Maintainer: Technical Lead (Claude Code + John Stroh)