ops: implement comprehensive production monitoring system

Create self-hosted, privacy-first monitoring infrastructure for production
environment with automated health checks, log analysis, and alerting.

Monitoring Components:
- health-check.sh: Application health, service status, DB connectivity, disk space
- log-monitor.sh: Error detection, security events, anomaly detection
- disk-monitor.sh: Disk space usage monitoring (5 paths)
- ssl-monitor.sh: SSL certificate expiry monitoring
- monitor-all.sh: Master orchestration script

Features:
- Email alerting system (configurable thresholds)
- Consecutive failure tracking (prevents false positives)
- Test mode for safe deployment testing
- Comprehensive logging to /var/log/tractatus/
- Cron-ready for automated execution
- Exit codes for monitoring tool integration

Alert Triggers:
- Health: 3 consecutive failures (15min downtime)
- Logs: 10 errors OR 3 critical errors in 5min
- Disk: 80% warning, 90% critical
- SSL: 30 days warning, 7 days critical

Setup Documentation:
- Complete installation instructions
- Cron configuration examples
- Systemd timer alternative
- Troubleshooting guide
- Alert customization guide
- Incident response procedures

Privacy-First Design:
- Self-hosted (no external monitoring services)
- Minimal data exposure in alerts
- Local log storage only
- No telemetry to third parties

Aligns with Tractatus values: transparency, privacy, operational excellence

Addresses Phase 4 Prep Checklist Task #6: Production Monitoring & Alerting

Next: Deploy to production, configure email alerts, set up cron jobs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Commit c755c49ec1 by TheFlow, 2025-10-09 22:23:40 +13:00 (parent 1221941828)
6 changed files with 1940 additions and 0 deletions

# Production Monitoring Setup
**Project**: Tractatus AI Safety Framework Website
**Environment**: Production (vps-93a693da.vps.ovh.net)
**Created**: 2025-10-09
**Status**: Ready for Deployment
---
## Overview
Comprehensive monitoring system for Tractatus production environment, providing:
- **Health monitoring** - Application uptime, service status, database connectivity
- **Log monitoring** - Error detection, security events, anomaly detection
- **Disk monitoring** - Disk space usage alerts
- **SSL monitoring** - Certificate expiry warnings
- **Email alerts** - Automated notifications for critical issues
**Philosophy**: Privacy-first, self-hosted monitoring aligned with Tractatus values.
---
## Monitoring Components
### 1. Health Check Monitor (`health-check.sh`)
**What it monitors:**
- Application health endpoint (https://agenticgovernance.digital/health)
- Systemd service status (tractatus.service)
- MongoDB database connectivity
- Disk space usage
**Alert Triggers:**
- Service not running
- Health endpoint returns non-200
- Database connection failed
- Disk space > 90%
**Frequency**: Every 5 minutes
### 2. Log Monitor (`log-monitor.sh`)
**What it monitors:**
- ERROR and CRITICAL log entries
- Security events (authentication failures, unauthorized access)
- Database errors
- HTTP 500 errors
- Unhandled exceptions
**Alert Triggers:**
- 10+ errors in 5-minute window
- 3+ critical errors in 5-minute window
- Any security events
**Frequency**: Every 5 minutes
**Follow Mode**: Can run continuously for real-time monitoring
### 3. Disk Space Monitor (`disk-monitor.sh`)
**What it monitors:**
- Root filesystem (/)
- Var directory (/var)
- Log directory (/var/log)
- Tractatus application (/var/www/tractatus)
- Temp directory (/tmp)
**Alert Triggers:**
- Warning: 80%+ usage
- Critical: 90%+ usage
**Frequency**: Every 15 minutes
### 4. SSL Certificate Monitor (`ssl-monitor.sh`)
**What it monitors:**
- SSL certificate expiry for agenticgovernance.digital
**Alert Triggers:**
- Warning: Expires in 30 days or less
- Critical: Expires in 7 days or less
- Critical: Already expired
**Frequency**: Daily
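`ssl-monitor.sh` itself is not reproduced in this excerpt; a hedged sketch of the expiry arithmetic such a check typically performs (the `notAfter` date below is a placeholder - in production it would come from `openssl s_client` / `openssl x509 -enddate`):
```bash
# Placeholder notAfter value; in production it would come from:
#   echo | openssl s_client -servername agenticgovernance.digital \
#     -connect agenticgovernance.digital:443 2>/dev/null \
#     | openssl x509 -noout -enddate | cut -d= -f2
expiry="Dec 31 23:59:59 2030 GMT"
expiry_epoch=$(date -d "$expiry" +%s)                  # GNU date, as on Ubuntu
days_left=$(( (expiry_epoch - $(date +%s)) / 86400 ))
echo "Days until expiry: $days_left"
if (( days_left <= 7 )); then
    echo "CRITICAL: certificate expires in $days_left days"
elif (( days_left <= 30 )); then
    echo "WARN: certificate expires in $days_left days"
fi
```
The 30/7-day thresholds match the alert triggers listed above; the shipped script's exact parsing may differ.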
### 5. Master Monitor (`monitor-all.sh`)
Orchestrates all monitoring checks in a single run.
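`monitor-all.sh` is not reproduced in this excerpt; a minimal sketch of the orchestration pattern it follows, assuming the script paths above and a worst-exit-code convention (the real script also accepts `--test` and `--skip-ssl`):
```bash
#!/bin/bash
# Orchestration sketch - paths and aggregation convention are assumptions,
# not the shipped monitor-all.sh.
set -uo pipefail   # deliberately no -e: every check should run even if one fails

MONITOR_DIR="${MONITOR_DIR:-/var/www/tractatus/scripts/monitoring}"
checks=(health-check.sh log-monitor.sh disk-monitor.sh)

overall=0
for check in "${checks[@]}"; do
    rc=0
    if [[ -x "$MONITOR_DIR/$check" ]]; then
        "$MONITOR_DIR/$check" || rc=$?
    else
        echo "SKIP: $MONITOR_DIR/$check not found or not executable"
    fi
    (( rc > overall )) && overall=$rc
done

echo "monitor-all: worst exit code = $overall"
# exit "$overall"   # the real script would propagate the worst severity
```
Missing scripts are skipped rather than treated as failures here; the shipped script's aggregation may differ.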
---
## Installation
### Prerequisites
```bash
# Ensure required commands are available
sudo apt-get update
sudo apt-get install -y curl jq openssl mailutils
# Install MongoDB shell (if not installed)
# Add MongoDB's signing key via a dedicated keyring (apt-key is deprecated)
wget -qO- https://www.mongodb.org/static/pgp/server-7.0.asc | sudo gpg --dearmor -o /usr/share/keyrings/mongodb-server-7.0.gpg
echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-7.0.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list
sudo apt-get update
sudo apt-get install -y mongodb-mongosh
```
### Deploy Monitoring Scripts
```bash
# From local machine, deploy monitoring scripts to production
rsync -avz -e "ssh -i ~/.ssh/tractatus_deploy" \
scripts/monitoring/ \
ubuntu@vps-93a693da.vps.ovh.net:/var/www/tractatus/scripts/monitoring/
```
### Set Up Log Directory
```bash
# On production server
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net
# Create log directory
sudo mkdir -p /var/log/tractatus
sudo chown ubuntu:ubuntu /var/log/tractatus
sudo chmod 755 /var/log/tractatus
```
### Make Scripts Executable
```bash
# On production server
cd /var/www/tractatus/scripts/monitoring
chmod +x *.sh
```
### Configure Email Alerts
**Option 1: Using Postfix (Recommended for production)**
```bash
# Install Postfix
sudo apt-get install -y postfix
# Configure Postfix (select "Internet Site")
sudo dpkg-reconfigure postfix
# Set ALERT_EMAIL environment variable
echo 'export ALERT_EMAIL="your-email@example.com"' | sudo tee -a /etc/environment
source /etc/environment
```
**Option 2: Using External SMTP (ProtonMail, Gmail, etc.)**
```bash
# Install sendemail
sudo apt-get install -y sendemail libio-socket-ssl-perl libnet-ssleay-perl
# Configure in monitoring scripts (or use system mail)
```
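A hypothetical `sendemail` invocation for an external relay (server, credentials, and addresses below are placeholders, not values from this repository):
```bash
# Placeholder SMTP relay settings - substitute your provider's values
SMTP_SERVER="${SMTP_SERVER:-smtp.example.com:587}"
SMTP_USER="${SMTP_USER:-}"
SMTP_PASS="${SMTP_PASS:-}"
ALERT_EMAIL="${ALERT_EMAIL:-}"

if [[ -n "$SMTP_USER" && -n "$ALERT_EMAIL" ]]; then
    sendemail -f "tractatus-monitoring@agenticgovernance.digital" \
              -t "$ALERT_EMAIL" \
              -u "[TEST] Tractatus monitoring alert" \
              -m "SMTP relay test from $(hostname)" \
              -s "$SMTP_SERVER" -xu "$SMTP_USER" -xp "$SMTP_PASS" \
              -o tls=yes
else
    echo "SMTP relay not configured - set SMTP_USER, SMTP_PASS, ALERT_EMAIL"
fi
```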
**Option 3: No Email (Testing)**
```bash
# Leave ALERT_EMAIL unset - monitoring will log but not send emails
# Useful for initial testing
```
### Test Monitoring Scripts
```bash
# Test health check
cd /var/www/tractatus/scripts/monitoring
./health-check.sh --test
# Test log monitor
./log-monitor.sh --since "10 minutes ago" --test
# Test disk monitor
./disk-monitor.sh --test
# Test SSL monitor
./ssl-monitor.sh --test
# Test master monitor
./monitor-all.sh --test
```
Expected output: Each script should run without errors and show `[INFO]` messages.
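For reference, a healthy run logs lines of this shape (timestamps illustrative; messages taken from the scripts below):
```
[2025-10-09 22:30:01] [INFO] Starting health check
[2025-10-09 22:30:01] [INFO] Checking service status: tractatus
[2025-10-09 22:30:01] [INFO] Service tractatus is active
[2025-10-09 22:30:02] [INFO] Health endpoint OK (HTTP 200)
[2025-10-09 22:30:02] [INFO] All health checks passed ✓
```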
---
## Cron Configuration
### Create Monitoring Cron Jobs
```bash
# On production server
crontab -e
```
Add the following cron jobs:
```cron
# Tractatus Production Monitoring
# Logs: /var/log/tractatus/monitoring.log
# Master monitoring (every 5 minutes)
# Runs: health check, log monitor, disk monitor
*/5 * * * * /var/www/tractatus/scripts/monitoring/monitor-all.sh --skip-ssl >> /var/log/tractatus/cron-monitor.log 2>&1
# SSL certificate check (daily at 3am)
0 3 * * * /var/www/tractatus/scripts/monitoring/ssl-monitor.sh >> /var/log/tractatus/cron-ssl.log 2>&1
# Disk monitor (every 15 minutes - separate from master for frequency control)
*/15 * * * * /var/www/tractatus/scripts/monitoring/disk-monitor.sh >> /var/log/tractatus/cron-disk.log 2>&1
```
### Verify Cron Jobs
```bash
# List active cron jobs
crontab -l
# Check cron logs
sudo journalctl -u cron -f
# Wait 5 minutes, then check monitoring logs
tail -f /var/log/tractatus/cron-monitor.log
```
### Alternative: Systemd Timers (Optional)
A more modern alternative to cron that provides better logging and failure handling.
**Create timer file**: `/etc/systemd/system/tractatus-monitoring.timer`
```ini
[Unit]
Description=Tractatus Monitoring Timer
Requires=tractatus-monitoring.service
[Timer]
OnBootSec=5min
OnUnitActiveSec=5min
AccuracySec=1s
[Install]
WantedBy=timers.target
```
**Create service file**: `/etc/systemd/system/tractatus-monitoring.service`
```ini
[Unit]
Description=Tractatus Production Monitoring
After=network.target tractatus.service
[Service]
Type=oneshot
User=ubuntu
WorkingDirectory=/var/www/tractatus
ExecStart=/var/www/tractatus/scripts/monitoring/monitor-all.sh --skip-ssl
StandardOutput=journal
StandardError=journal
Environment="ALERT_EMAIL=your-email@example.com"
[Install]
WantedBy=multi-user.target
```
**Enable and start:**
```bash
sudo systemctl daemon-reload
sudo systemctl enable tractatus-monitoring.timer
sudo systemctl start tractatus-monitoring.timer
# Check status
sudo systemctl status tractatus-monitoring.timer
sudo systemctl list-timers
```
---
## Alert Configuration
### Alert Thresholds
**Health Check:**
- Consecutive failures: 3 (alerts on 3rd failure)
- Check interval: 5 minutes
- Time to alert: 15 minutes of downtime
**Log Monitor:**
- Error threshold: 10 errors in 5 minutes
- Critical threshold: 3 critical errors in 5 minutes
- Security events: Immediate alert
**Disk Space:**
- Warning: 80% usage
- Critical: 90% usage
**SSL Certificate:**
- Warning: 30 days until expiry
- Critical: 7 days until expiry
### Customize Alerts
Edit thresholds in scripts:
```bash
# Health check thresholds
vi /var/www/tractatus/scripts/monitoring/health-check.sh
# Change: MAX_FAILURES=3
# Log monitor thresholds
vi /var/www/tractatus/scripts/monitoring/log-monitor.sh
# Change: ERROR_THRESHOLD=10
# Change: CRITICAL_THRESHOLD=3
# Disk monitor thresholds
vi /var/www/tractatus/scripts/monitoring/disk-monitor.sh
# Change: WARN_THRESHOLD=80
# Change: CRITICAL_THRESHOLD=90
# SSL monitor thresholds
vi /var/www/tractatus/scripts/monitoring/ssl-monitor.sh
# Change: WARN_DAYS=30
# Change: CRITICAL_DAYS=7
```
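If you prefer a non-interactive change, the same edits can be scripted with `sed`; a sketch demonstrated on a temp file (in production the target would be the monitoring script itself):
```bash
# Raise the disk warning threshold from 80 to 85, demonstrated on a temp copy
f=$(mktemp)
printf 'WARN_THRESHOLD=80   # Warn at 80%% usage\n' > "$f"
sed -i 's/^WARN_THRESHOLD=80/WARN_THRESHOLD=85/' "$f"
new_value=$(grep -o '^WARN_THRESHOLD=[0-9]*' "$f")
echo "$new_value"   # → WARN_THRESHOLD=85
rm -f "$f"
```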
---
## Manual Monitoring Commands
### Check Current Status
```bash
# Run all monitors manually
cd /var/www/tractatus/scripts/monitoring
./monitor-all.sh
# Run individual monitors
./health-check.sh
./log-monitor.sh --since "1 hour ago"
./disk-monitor.sh
./ssl-monitor.sh
```
### View Monitoring Logs
```bash
# View all monitoring logs
tail -f /var/log/tractatus/monitoring.log
# View specific monitor logs
tail -f /var/log/tractatus/health-check.log
tail -f /var/log/tractatus/log-monitor.log
tail -f /var/log/tractatus/disk-monitor.log
tail -f /var/log/tractatus/ssl-monitor.log
# View cron execution logs
tail -f /var/log/tractatus/cron-monitor.log
```
### Test Alert Delivery
```bash
# Send test alert
cd /var/www/tractatus/scripts/monitoring
# This should trigger an alert (if service is running)
# It will show "would send alert" in test mode
./health-check.sh --test
# Force alert by temporarily stopping service
sudo systemctl stop tractatus
./health-check.sh # Run 3 times to reach MAX_FAILURES (via cron this takes ~15 minutes)
sudo systemctl start tractatus
```
---
## Troubleshooting
### No Alerts Received
**Check email configuration:**
```bash
# Verify ALERT_EMAIL is set
echo $ALERT_EMAIL
# Test mail command
echo "Test email" | mail -s "Test Subject" $ALERT_EMAIL
# Check mail logs
sudo tail -f /var/log/mail.log
```
**Check cron execution:**
```bash
# Verify cron jobs are running
crontab -l
# Check cron logs
sudo journalctl -u cron -n 50
# Check script logs
tail -100 /var/log/tractatus/cron-monitor.log
```
### Scripts Not Executing
**Check permissions:**
```bash
ls -la /var/www/tractatus/scripts/monitoring/
# Should show: -rwxr-xr-x (executable)
# Fix if needed
chmod +x /var/www/tractatus/scripts/monitoring/*.sh
```
**Check cron PATH:**
```bash
# Add to crontab
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# Or use full paths in cron commands
```
### High Alert Frequency
**Increase thresholds:**
Edit threshold values in scripts (see Alert Configuration section).
**Increase consecutive failure count:**
```bash
vi /var/www/tractatus/scripts/monitoring/health-check.sh
# Increase MAX_FAILURES from 3 to 5 or higher
```
### False Positives
**Review alert conditions:**
```bash
# Check recent logs to understand why alerts triggered
tail -100 /var/log/tractatus/monitoring.log
# Run manual check with verbose output
./health-check.sh
# Check if service is actually unhealthy
sudo systemctl status tractatus
curl https://agenticgovernance.digital/health
```
---
## Monitoring Dashboard (Optional - Future Enhancement)
### Option 1: Grafana + Prometheus
Self-hosted metrics dashboard (requires setup).
### Option 2: Simple Web Dashboard
Create minimal status page showing last check results.
### Option 3: UptimeRobot Free Tier
External monitoring service (privacy tradeoff).
**Not implemented yet** - current solution uses email alerts only.
---
## Best Practices
### DO:
- ✅ Test monitoring scripts before deploying
- ✅ Check alert emails regularly
- ✅ Review monitoring logs weekly
- ✅ Adjust thresholds based on actual patterns
- ✅ Document any monitoring configuration changes
- ✅ Keep monitoring scripts updated
### DON'T:
- ❌ Ignore alert emails
- ❌ Set thresholds too low (alert fatigue)
- ❌ Deploy monitoring without testing
- ❌ Disable monitoring without planning
- ❌ Let log files grow unbounded
- ❌ Ignore repeated warnings
### Monitoring Hygiene
```bash
# Rotate monitoring logs weekly
sudo logrotate /etc/logrotate.d/tractatus-monitoring
# Clean up old state files
find /var/tmp -name "tractatus-*-state" -mtime +7 -delete
# Review alert frequency monthly
grep "\[ALERT\]" /var/log/tractatus/monitoring.log | wc -l
```
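The logrotate policy file referenced above is not part of this excerpt; a plausible `/etc/logrotate.d/tractatus-monitoring`, assuming weekly rotation and the `0640 ubuntu:ubuntu` ownership used elsewhere in this guide:
```
/var/log/tractatus/*.log {
    weekly
    rotate 8
    compress
    delaycompress
    missingok
    notifempty
    create 0640 ubuntu ubuntu
}
```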
---
## Incident Response
### When Alert Received
1. **Acknowledge alert** - Note time received
2. **Check current status** - Run manual health check
3. **Review logs** - Check what triggered alert
4. **Investigate root cause** - See deployment checklist emergency procedures
5. **Take action** - Fix issue or escalate
6. **Document** - Create incident report
### Critical Alert Response Time
- **Health check failure**: Respond within 15 minutes
- **Log errors**: Respond within 30 minutes
- **Disk space critical**: Respond within 1 hour
- **SSL expiry (7 days)**: Respond within 24 hours
---
## Maintenance
### Weekly Tasks
- [ ] Review monitoring logs for patterns
- [ ] Check alert email inbox
- [ ] Verify cron jobs still running
- [ ] Review disk space trends
### Monthly Tasks
- [ ] Review and adjust alert thresholds
- [ ] Clean up old monitoring logs
- [ ] Test manual failover procedures
- [ ] Update monitoring documentation
### Quarterly Tasks
- [ ] Full monitoring system audit
- [ ] Test all alert scenarios
- [ ] Review incident response times
- [ ] Consider monitoring enhancements
---
## Monitoring Metrics
### Success Metrics
- **Uptime**: Target 99.9% (< 45 minutes downtime/month)
- **Alert Response Time**: < 30 minutes for critical
- **False Positive Rate**: < 5% of alerts
- **Detection Time**: < 5 minutes for critical issues
### Tracking
```bash
# Calculate uptime from logs
grep "Health endpoint OK" /var/log/tractatus/monitoring.log | wc -l
# Count alerts sent
grep "Alert email sent" /var/log/tractatus/monitoring.log | wc -l
# Review response times (manual from incident reports)
```
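Those counts can be reduced to an approximate uptime percentage; a small sketch with hypothetical numbers (a 5-minute cadence yields 8640 checks in a 30-day month):
```bash
ok=8632      # hypothetical: "Health endpoint OK" entries this month
total=8640   # hypothetical: total checks run (12/hour * 24 * 30)
awk -v ok="$ok" -v total="$total" \
    'BEGIN { printf "Uptime: %.2f%% (%d/%d checks passed)\n", (total ? 100*ok/total : 0), ok, total }'
```
In practice `ok` and `total` would come from the grep commands above.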
---
## Security Considerations
### Log Access Control
```bash
# Ensure logs are readable only by ubuntu user and root
sudo chown ubuntu:ubuntu /var/log/tractatus/*.log
sudo chmod 640 /var/log/tractatus/*.log
```
### Alert Email Security
- Use encrypted email if possible (ProtonMail)
- Don't include sensitive data in alert body
- Alerts show symptoms, not credentials
### Monitoring Script Security
- Scripts run as ubuntu user (not root)
- No credentials embedded in scripts
- Use environment variables for sensitive config
---
## Future Enhancements
### Planned Improvements
- [ ] **Metrics collection**: Store monitoring metrics in database for trend analysis
- [ ] **Status page**: Public status page showing service availability
- [ ] **Mobile alerts**: SMS or push notifications for critical alerts
- [ ] **Distributed monitoring**: Multiple monitoring locations for redundancy
- [ ] **Automated remediation**: Auto-restart service on failure
- [ ] **Performance monitoring**: Response time tracking, query performance
- [ ] **User impact monitoring**: Track error rates from user perspective
### Integration Opportunities
- [ ] **Plausible Analytics**: Monitor traffic patterns, correlate with errors
- [ ] **GitHub Actions**: Run monitoring checks in CI/CD
- [ ] **Slack integration**: Send alerts to Slack channel
- [ ] **Database backup monitoring**: Alert on backup failures
---
## Support & Documentation
**Monitoring Scripts**: `/var/www/tractatus/scripts/monitoring/`
**Monitoring Logs**: `/var/log/tractatus/`
**Cron Configuration**: `crontab -l` (ubuntu user)
**Alert Email**: Set via `ALERT_EMAIL` environment variable
**Related Documents:**
- [Production Deployment Checklist](PRODUCTION_DEPLOYMENT_CHECKLIST.md)
- [Phase 4 Preparation Checklist](../PHASE-4-PREPARATION-CHECKLIST.md)
---
**Document Status**: Ready for Production
**Last Updated**: 2025-10-09
**Next Review**: After 1 month of monitoring data
**Maintainer**: Technical Lead (Claude Code + John Stroh)

#!/bin/bash
#
# Disk Space Monitoring Script
# Monitors disk space usage and alerts when thresholds exceeded
#
# Usage:
# ./disk-monitor.sh # Check all monitored paths
# ./disk-monitor.sh --test # Test mode (no alerts)
#
# Exit codes:
# 0 = OK
# 1 = Warning threshold exceeded
# 2 = Critical threshold exceeded
# 3 = Configuration error (unknown option)
set -euo pipefail
# Configuration
ALERT_EMAIL="${ALERT_EMAIL:-}"
LOG_FILE="/var/log/tractatus/disk-monitor.log"
WARN_THRESHOLD=80 # Warn at 80% usage
CRITICAL_THRESHOLD=90 # Critical at 90% usage
# Paths to monitor
declare -A MONITORED_PATHS=(
["/"]="Root filesystem"
["/var"]="Var directory"
["/var/log"]="Log directory"
["/var/www/tractatus"]="Tractatus application"
["/tmp"]="Temp directory"
)
# Parse arguments
TEST_MODE=false
while [[ $# -gt 0 ]]; do
case $1 in
--test)
TEST_MODE=true
shift
;;
*)
echo "Unknown option: $1"
exit 3
;;
esac
done
# Logging function
log() {
local level="$1"
shift
local message="$*"
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
echo "[$timestamp] [$level] $message"
if [[ -d "$(dirname "$LOG_FILE")" ]]; then
echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
fi
}
# Send alert email
send_alert() {
local subject="$1"
local body="$2"
if [[ "$TEST_MODE" == "true" ]]; then
log "INFO" "TEST MODE: Would send alert: $subject"
return 0
fi
if [[ -z "$ALERT_EMAIL" ]]; then
log "WARN" "No alert email configured (ALERT_EMAIL not set)"
return 0
fi
if command -v mail &> /dev/null; then
echo "$body" | mail -s "$subject" "$ALERT_EMAIL"
log "INFO" "Alert email sent to $ALERT_EMAIL"
elif command -v sendmail &> /dev/null; then
{
echo "Subject: $subject"
echo "From: tractatus-monitoring@agenticgovernance.digital"
echo "To: $ALERT_EMAIL"
echo ""
echo "$body"
} | sendmail "$ALERT_EMAIL"
log "INFO" "Alert email sent via sendmail to $ALERT_EMAIL"
else
log "WARN" "No email command available"
fi
}
# Get disk usage for path
get_disk_usage() {
local path="$1"
# Check if path exists
if [[ ! -e "$path" ]]; then
echo "N/A"
return 1
fi
# Get usage percentage (remove % sign)
df -h "$path" 2>/dev/null | awk 'NR==2 {print $5}' | sed 's/%//' || echo "N/A"
}
# Get human-readable disk usage details
get_disk_details() {
local path="$1"
if [[ ! -e "$path" ]]; then
echo "Path does not exist"
return 1
fi
df -h "$path" 2>/dev/null | awk 'NR==2 {printf "Size: %s | Used: %s | Avail: %s | Use%%: %s | Mounted: %s\n", $2, $3, $4, $5, $6}'
}
# Find largest directories in path
find_largest_dirs() {
local path="$1"
local limit="${2:-10}"
if [[ ! -e "$path" ]]; then
return 1
fi
# -s summarizes each top-level entry instead of listing every subdirectory
du -sh "$path"/* 2>/dev/null | sort -rh | head -n "$limit" || echo "Unable to scan directory"
}
# Check single path
check_path() {
local path="$1"
local description="$2"
local usage=$(get_disk_usage "$path")
if [[ "$usage" == "N/A" ]]; then
log "WARN" "$description ($path): Unable to check"
return 0
fi
if [[ "$usage" -ge "$CRITICAL_THRESHOLD" ]]; then
log "CRITICAL" "$description ($path): ${usage}% used (>= $CRITICAL_THRESHOLD%)"
return 2
elif [[ "$usage" -ge "$WARN_THRESHOLD" ]]; then
log "WARN" "$description ($path): ${usage}% used (>= $WARN_THRESHOLD%)"
return 1
else
log "INFO" "$description ($path): ${usage}% used"
return 0
fi
}
# Main monitoring function
main() {
log "INFO" "Starting disk space monitoring"
local max_severity=0
local issues=()
local critical_paths=()
local warning_paths=()
# Check all monitored paths
for path in "${!MONITORED_PATHS[@]}"; do
local description="${MONITORED_PATHS[$path]}"
local exit_code=0
check_path "$path" "$description" || exit_code=$?
if [[ "$exit_code" -eq 2 ]]; then
max_severity=2
critical_paths+=("$path (${description})")
elif [[ "$exit_code" -eq 1 ]]; then
[[ "$max_severity" -lt 1 ]] && max_severity=1
warning_paths+=("$path (${description})")
fi
done
# Send alerts if thresholds exceeded
if [[ "$max_severity" -eq 2 ]]; then
local subject="[CRITICAL] Tractatus Disk Space Critical"
local body="CRITICAL: Disk space usage has exceeded ${CRITICAL_THRESHOLD}% on one or more paths.
Critical Paths (>= ${CRITICAL_THRESHOLD}%):
$(printf -- "- %s\n" "${critical_paths[@]}")
"
# Add warning paths if any
if [[ "${#warning_paths[@]}" -gt 0 ]]; then
body+="
Warning Paths (>= ${WARN_THRESHOLD}%):
$(printf -- "- %s\n" "${warning_paths[@]}")
"
fi
body+="
Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
Host: $(hostname)
Disk Usage Details:
$(df -h)
Largest directories in /var/www/tractatus:
$(find_largest_dirs /var/www/tractatus 10)
Largest log files:
$(du -h /var/log/tractatus/*.log 2>/dev/null | sort -rh | head -10 || echo "No log files found")
Action Required:
1. Clean up old log files
2. Remove unnecessary files
3. Check for runaway processes creating large files
4. Consider expanding disk space
Clean up commands:
# Rotate old logs
sudo journalctl --vacuum-time=7d
# Clean up npm cache
npm cache clean --force
# Find large files
find /var/www/tractatus -type f -size +100M -exec ls -lh {} \;
"
send_alert "$subject" "$body"
log "CRITICAL" "Disk space alert sent"
elif [[ "$max_severity" -eq 1 ]]; then
local subject="[WARN] Tractatus Disk Space Warning"
local body="WARNING: Disk space usage has exceeded ${WARN_THRESHOLD}% on one or more paths.
Warning Paths (>= ${WARN_THRESHOLD}%):
$(printf -- "- %s\n" "${warning_paths[@]}")
Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
Host: $(hostname)
Disk Usage:
$(df -h)
Please review disk usage and clean up if necessary.
"
send_alert "$subject" "$body"
log "WARN" "Disk space warning sent"
else
log "INFO" "All monitored paths within acceptable limits"
fi
exit $max_severity
}
# Run main function
main

#!/bin/bash
#
# Health Check Monitoring Script
# Monitors Tractatus application health endpoint and service status
#
# Usage:
# ./health-check.sh # Run check, alert if issues
# ./health-check.sh --quiet # Suppress output unless error
# ./health-check.sh --test # Test mode (no alerts)
#
# Exit codes:
# 0 = All checks healthy
# 1 = One or more checks failed (endpoint, service, database, or disk)
# 3 = Configuration error
set -euo pipefail
# Configuration
HEALTH_URL="${HEALTH_URL:-https://agenticgovernance.digital/health}"
SERVICE_NAME="${SERVICE_NAME:-tractatus}"
ALERT_EMAIL="${ALERT_EMAIL:-}"
LOG_FILE="/var/log/tractatus/health-check.log"
STATE_FILE="/var/tmp/tractatus-health-state"
MAX_FAILURES=3 # Alert after 3 consecutive failures
# Parse arguments
QUIET=false
TEST_MODE=false
while [[ $# -gt 0 ]]; do
case $1 in
--quiet) QUIET=true; shift ;;
--test) TEST_MODE=true; shift ;;
*) echo "Unknown option: $1"; exit 3 ;;
esac
done
# Logging function
log() {
local level="$1"
shift
local message="$*"
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
if [[ "$QUIET" != "true" ]] || [[ "$level" == "ERROR" ]] || [[ "$level" == "CRITICAL" ]]; then
echo "[$timestamp] [$level] $message"
fi
# Log to file if directory exists
if [[ -d "$(dirname "$LOG_FILE")" ]]; then
echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
fi
}
# Get current failure count
get_failure_count() {
if [[ -f "$STATE_FILE" ]]; then
cat "$STATE_FILE"
else
echo "0"
fi
}
# Increment failure count
increment_failure_count() {
local count=$(get_failure_count)
echo $((count + 1)) > "$STATE_FILE"
}
# Reset failure count
reset_failure_count() {
echo "0" > "$STATE_FILE"
}
# Send alert email
send_alert() {
local subject="$1"
local body="$2"
if [[ "$TEST_MODE" == "true" ]]; then
log "INFO" "TEST MODE: Would send alert: $subject"
return 0
fi
if [[ -z "$ALERT_EMAIL" ]]; then
log "WARN" "No alert email configured (ALERT_EMAIL not set)"
return 0
fi
# Try to send email using mail command (if available)
if command -v mail &> /dev/null; then
echo "$body" | mail -s "$subject" "$ALERT_EMAIL"
log "INFO" "Alert email sent to $ALERT_EMAIL"
elif command -v sendmail &> /dev/null; then
{
echo "Subject: $subject"
echo "From: tractatus-monitoring@agenticgovernance.digital"
echo "To: $ALERT_EMAIL"
echo ""
echo "$body"
} | sendmail "$ALERT_EMAIL"
log "INFO" "Alert email sent via sendmail to $ALERT_EMAIL"
else
log "WARN" "No email command available (install mailutils or sendmail)"
fi
}
# Check health endpoint
check_health_endpoint() {
log "INFO" "Checking health endpoint: $HEALTH_URL"
# Make HTTP request with timeout
local response
local http_code
response=$(curl -s -w "\n%{http_code}" --max-time 10 "$HEALTH_URL" 2>&1) || {
log "ERROR" "Health endpoint request failed: $response"
return 1
}
# Extract HTTP code (last line)
http_code=$(echo "$response" | tail -n 1)
# Extract response body (everything except last line)
local body=$(echo "$response" | sed '$d')
# Check HTTP status
if [[ "$http_code" != "200" ]]; then
log "ERROR" "Health endpoint returned HTTP $http_code"
return 1
fi
# Check response contains expected JSON
if ! echo "$body" | jq -e '.status == "ok"' &> /dev/null; then
log "ERROR" "Health endpoint response invalid: $body"
return 1
fi
log "INFO" "Health endpoint OK (HTTP $http_code)"
return 0
}
# Check systemd service status
check_service_status() {
log "INFO" "Checking service status: $SERVICE_NAME"
if ! systemctl is-active --quiet "$SERVICE_NAME"; then
log "ERROR" "Service $SERVICE_NAME is not active"
return 2
fi
# Check if service is enabled
if ! systemctl is-enabled --quiet "$SERVICE_NAME"; then
log "WARN" "Service $SERVICE_NAME is not enabled (won't start on boot)"
fi
log "INFO" "Service $SERVICE_NAME is active"
return 0
}
# Check database connectivity (quick MongoDB ping)
check_database() {
log "INFO" "Checking database connectivity"
# Try to connect to MongoDB (timeout 5 seconds)
if ! timeout 5 mongosh --quiet --eval "db.adminCommand('ping')" localhost:27017/tractatus_prod &> /dev/null; then
log "ERROR" "Database connection failed"
return 1
fi
log "INFO" "Database connectivity OK"
return 0
}
# Check disk space
check_disk_space() {
log "INFO" "Checking disk space"
# Get root filesystem usage percentage
local usage=$(df -h / | awk 'NR==2 {print $5}' | sed 's/%//')
if [[ "$usage" -gt 90 ]]; then
log "CRITICAL" "Disk space critical: ${usage}% used"
return 1
elif [[ "$usage" -gt 80 ]]; then
log "WARN" "Disk space high: ${usage}% used"
else
log "INFO" "Disk space OK: ${usage}% used"
fi
return 0
}
# Main health check
main() {
log "INFO" "Starting health check"
local all_healthy=true
local issues=()
# Run all checks
if ! check_service_status; then
all_healthy=false
issues+=("Service not running")
fi
if ! check_health_endpoint; then
all_healthy=false
issues+=("Health endpoint failed")
fi
if ! check_database; then
all_healthy=false
issues+=("Database connectivity failed")
fi
if ! check_disk_space; then
all_healthy=false
issues+=("Disk space issue")
fi
# Handle results
if [[ "$all_healthy" == "true" ]]; then
log "INFO" "All health checks passed ✓"
reset_failure_count
exit 0
else
log "ERROR" "Health check failed: ${issues[*]}"
increment_failure_count
local failure_count=$(get_failure_count)
log "WARN" "Consecutive failures: $failure_count/$MAX_FAILURES"
# Alert if threshold reached
if [[ "$failure_count" -ge "$MAX_FAILURES" ]]; then
local subject="[ALERT] Tractatus Health Check Failed ($failure_count failures)"
local body="Tractatus health check has failed $failure_count times consecutively.
Issues detected:
$(printf -- "- %s\n" "${issues[@]}")
Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
Host: $(hostname)
Service: $SERVICE_NAME
Health URL: $HEALTH_URL
Please investigate immediately.
View logs:
sudo journalctl -u $SERVICE_NAME -n 100
Check service status:
sudo systemctl status $SERVICE_NAME
Restart service:
sudo systemctl restart $SERVICE_NAME
"
send_alert "$subject" "$body"
log "CRITICAL" "Alert sent after $failure_count consecutive failures"
fi
exit 1
fi
}
# Run main function
main

scripts/monitoring/log-monitor.sh (executable, 269 lines)
#!/bin/bash
#
# Log Monitoring Script
# Monitors Tractatus service logs for errors, security events, and anomalies
#
# Usage:
# ./log-monitor.sh # Monitor logs since last check
# ./log-monitor.sh --since "1 hour ago" # Monitor specific time window
# ./log-monitor.sh --follow # Continuous monitoring
# ./log-monitor.sh --test # Test mode (no alerts)
#
# Exit codes:
# 0 = No issues found
# 1 = Errors detected
# 2 = Critical errors detected
# 3 = Configuration error
set -euo pipefail
# Configuration
SERVICE_NAME="${SERVICE_NAME:-tractatus}"
ALERT_EMAIL="${ALERT_EMAIL:-}"
LOG_FILE="/var/log/tractatus/log-monitor.log"
STATE_FILE="/var/tmp/tractatus-log-monitor-state"
ERROR_THRESHOLD=10 # Alert after 10 errors in window
CRITICAL_THRESHOLD=3 # Alert immediately after 3 critical errors
# Parse arguments
SINCE="5 minutes ago"
FOLLOW=false
TEST_MODE=false
while [[ $# -gt 0 ]]; do
case $1 in
--since)
SINCE="$2"
shift 2
;;
--follow)
FOLLOW=true
shift
;;
--test)
TEST_MODE=true
shift
;;
*)
echo "Unknown option: $1"
exit 3
;;
esac
done
# Logging function
log() {
local level="$1"
shift
local message="$*"
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
echo "[$timestamp] [$level] $message"
# Log to file if directory exists
if [[ -d "$(dirname "$LOG_FILE")" ]]; then
echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
fi
}
# Send alert email
send_alert() {
local subject="$1"
local body="$2"
if [[ "$TEST_MODE" == "true" ]]; then
log "INFO" "TEST MODE: Would send alert: $subject"
return 0
fi
if [[ -z "$ALERT_EMAIL" ]]; then
log "WARN" "No alert email configured (ALERT_EMAIL not set)"
return 0
fi
if command -v mail &> /dev/null; then
echo "$body" | mail -s "$subject" "$ALERT_EMAIL"
log "INFO" "Alert email sent to $ALERT_EMAIL"
elif command -v sendmail &> /dev/null; then
{
echo "Subject: $subject"
echo "From: tractatus-monitoring@agenticgovernance.digital"
echo "To: $ALERT_EMAIL"
echo ""
echo "$body"
} | sendmail "$ALERT_EMAIL"
log "INFO" "Alert email sent via sendmail to $ALERT_EMAIL"
else
log "WARN" "No email command available"
fi
}
# Extract errors from logs
extract_errors() {
local since="$1"
# Get logs since specified time
sudo journalctl -u "$SERVICE_NAME" --since "$since" --no-pager 2>/dev/null || {
log "ERROR" "Failed to read journal for $SERVICE_NAME"
return 1
}
}
# Analyze log patterns
analyze_logs() {
local logs="$1"
# Count different severity levels
# Note: grep -c prints 0 itself on no match but exits nonzero; "|| true"
# keeps the count clean ("|| echo 0" would append a second "0" line)
local error_count=$(echo "$logs" | grep -ci "\[ERROR\]" || true)
local critical_count=$(echo "$logs" | grep -ci "\[CRITICAL\]" || true)
local warn_count=$(echo "$logs" | grep -ci "\[WARN\]" || true)
# Security-related patterns
local security_count=$(echo "$logs" | grep -ciE "(SECURITY|unauthorized|forbidden|authentication failed)" || true)
# Database errors
local db_error_count=$(echo "$logs" | grep -ciE "(mongodb|database|connection.*failed)" || true)
# HTTP errors
local http_error_count=$(echo "$logs" | grep -ciE "HTTP.*50[0-9]|Internal Server Error" || true)
# Unhandled exceptions
local exception_count=$(echo "$logs" | grep -ciE "(Unhandled.*exception|TypeError|ReferenceError)" || true)
log "INFO" "Log analysis: CRITICAL=$critical_count ERROR=$error_count WARN=$warn_count SECURITY=$security_count DB_ERROR=$db_error_count HTTP_ERROR=$http_error_count EXCEPTION=$exception_count"
# Determine severity
if [[ "$critical_count" -ge "$CRITICAL_THRESHOLD" ]]; then
log "CRITICAL" "Critical error threshold exceeded: $critical_count critical errors"
return 2
fi
if [[ "$error_count" -ge "$ERROR_THRESHOLD" ]]; then
log "ERROR" "Error threshold exceeded: $error_count errors"
return 1
fi
if [[ "$security_count" -gt 0 ]]; then
log "WARN" "Security events detected: $security_count events"
fi
if [[ "$db_error_count" -gt 5 ]]; then
log "WARN" "Database errors detected: $db_error_count errors"
fi
if [[ "$exception_count" -gt 0 ]]; then
log "WARN" "Unhandled exceptions detected: $exception_count exceptions"
fi
return 0
}
# Extract top error messages
get_top_errors() {
local logs="$1"
local limit="${2:-10}"
echo "$logs" | grep -iE "\[ERROR\]|\[CRITICAL\]" | \
sed 's/^.*\] //' | \
sort | uniq -c | sort -rn | head -n "$limit"
}
# Main monitoring function
main() {
log "INFO" "Starting log monitoring (since: $SINCE)"
# Extract logs
local logs
logs=$(extract_errors "$SINCE") || {
log "ERROR" "Failed to extract logs"
exit 3
}
# Count total log entries ('grep -c .' avoids wc -l reporting 1 for empty input)
local log_count=$(echo "$logs" | grep -c . || true)
log "INFO" "Analyzing $log_count log entries"
if [[ "$log_count" -eq 0 ]]; then
log "INFO" "No logs found in time window"
exit 0
fi
# Analyze logs
local exit_code=0
analyze_logs "$logs" || exit_code=$?
# If errors detected, send alert
if [[ "$exit_code" -ne 0 ]]; then
local severity="ERROR"
[[ "$exit_code" -eq 2 ]] && severity="CRITICAL" || true
local subject="[ALERT] Tractatus Log Monitoring - $severity Detected"
# Extract top 10 error messages
local top_errors=$(get_top_errors "$logs" 10)
local body="Log monitoring detected $severity level issues in Tractatus service.
Time Window: $SINCE
Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
Host: $(hostname)
Service: $SERVICE_NAME
Top Error Messages:
$top_errors
Recent Critical/Error Logs:
$(echo "$logs" | grep -iE "\[ERROR\]|\[CRITICAL\]" | tail -n 20)
Full logs:
sudo journalctl -u $SERVICE_NAME --since \"$SINCE\"
Check service status:
sudo systemctl status $SERVICE_NAME
"
send_alert "$subject" "$body"
else
log "INFO" "No significant issues detected"
fi
exit $exit_code
}
# Follow mode (continuous monitoring)
follow_logs() {
log "INFO" "Starting continuous log monitoring"
sudo journalctl -u "$SERVICE_NAME" -f --no-pager | while read -r line; do
# Check for error patterns
if echo "$line" | grep -qiE "\[ERROR\]|\[CRITICAL\]"; then
log "ERROR" "$line"
# Check for critical patterns
if echo "$line" | grep -qiE "\[CRITICAL\]|Unhandled.*exception|Database.*failed|Service.*crashed"; then
local subject="[CRITICAL] Tractatus Error Detected"
local body="Critical error detected in Tractatus logs:
$line
Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
Host: $(hostname)
Recent logs:
$(sudo journalctl -u $SERVICE_NAME -n 10 --no-pager)
"
send_alert "$subject" "$body"
fi
fi
done
}
# Run appropriate mode
if [[ "$FOLLOW" == "true" ]]; then
follow_logs
else
main
fi
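As a sanity check on the alert thresholds, the analyze_logs decision logic can be exercised standalone. The sketch below mirrors (rather than reuses) the severity mapping above; the threshold values are assumptions taken from the commit message (10 errors / 3 critical in the window):

```shell
#!/bin/bash
# Standalone mirror of the analyze_logs severity mapping. Threshold values
# are assumptions from the commit message, not read from the real script.
ERROR_THRESHOLD=10
CRITICAL_THRESHOLD=3

classify() {
  local errors="$1" criticals="$2"
  if (( criticals >= CRITICAL_THRESHOLD )); then
    echo 2   # critical: same exit code analyze_logs returns
  elif (( errors >= ERROR_THRESHOLD )); then
    echo 1   # error threshold exceeded
  else
    echo 0   # healthy
  fi
}

r1=$(classify 4 0)    # below both thresholds
r2=$(classify 12 0)   # error threshold hit
r3=$(classify 12 3)   # critical threshold wins
echo "$r1 $r2 $r3"    # prints: 0 1 2
```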
scripts/monitoring/monitor-all.sh Executable file (178 lines)
@@ -0,0 +1,178 @@
#!/bin/bash
#
# Master Monitoring Script
# Orchestrates all monitoring checks for Tractatus production environment
#
# Usage:
# ./monitor-all.sh # Run all monitors
# ./monitor-all.sh --test # Test mode (no alerts)
# ./monitor-all.sh --skip-ssl # Skip SSL check
#
# Exit codes:
# 0 = All checks passed
# 1 = Some warnings
# 2 = Some critical issues
# 3 = Configuration error
set -euo pipefail
# Configuration
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
LOG_FILE="/var/log/tractatus/monitoring.log"
ALERT_EMAIL="${ALERT_EMAIL:-}"
# Parse arguments
TEST_MODE=false
SKIP_SSL=false
while [[ $# -gt 0 ]]; do
case $1 in
--test)
TEST_MODE=true
shift
;;
--skip-ssl)
SKIP_SSL=true
shift
;;
*)
echo "Unknown option: $1"
exit 3
;;
esac
done
# Export configuration for child scripts
export ALERT_EMAIL
[[ "$TEST_MODE" == "true" ]] && TEST_FLAG="--test" || TEST_FLAG=""
# Logging function
log() {
local level="$1"
shift
local message="$*"
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
echo "[$timestamp] [$level] $message"
if [[ -d "$(dirname "$LOG_FILE")" ]]; then
echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
fi
}
# Run monitoring check
run_check() {
local name="$1"
local script="$2"
shift 2
local args=("$@")
log "INFO" "Running $name..."
local exit_code=0
# Expand the argument array quoted (the ${args[@]+...} form is safe for an
# empty array under 'set -u'); $TEST_FLAG stays unquoted so an empty flag
# expands to nothing.
"$SCRIPT_DIR/$script" ${args[@]+"${args[@]}"} $TEST_FLAG || exit_code=$?
case $exit_code in
0)
log "INFO" "$name: OK ✓"
;;
1)
log "WARN" "$name: Warning"
;;
2)
log "CRITICAL" "$name: Critical"
;;
*)
log "ERROR" "$name: Error (exit code: $exit_code)"
;;
esac
return $exit_code
}
# Main monitoring function
main() {
log "INFO" "=== Starting Tractatus Monitoring Suite ==="
log "INFO" "Timestamp: $(date '+%Y-%m-%d %H:%M:%S %Z')"
log "INFO" "Host: $(hostname)"
[[ "$TEST_MODE" == "true" ]] && log "INFO" "TEST MODE: Alerts suppressed" || true
local max_severity=0
local checks_run=0
local checks_passed=0
local checks_warned=0
local checks_critical=0
local checks_failed=0
# Increments use 'var=$((var+1))' and tests end in '|| true': under
# 'set -e', '((var++))' aborts the script when the old value is 0,
# and a false '[[ ]] && cmd' list would likewise abort it.
# Health Check
if run_check "Health Check" "health-check.sh"; then
checks_passed=$((checks_passed+1))
else
local exit_code=$?
[[ $exit_code -eq 1 ]] && checks_warned=$((checks_warned+1)) || true
[[ $exit_code -eq 2 ]] && checks_critical=$((checks_critical+1)) || true
[[ $exit_code -ge 3 ]] && checks_failed=$((checks_failed+1)) || true
[[ $exit_code -gt $max_severity ]] && max_severity=$exit_code || true
fi
checks_run=$((checks_run+1))
# Log Monitor
if run_check "Log Monitor" "log-monitor.sh" --since "5 minutes ago"; then
checks_passed=$((checks_passed+1))
else
local exit_code=$?
[[ $exit_code -eq 1 ]] && checks_warned=$((checks_warned+1)) || true
[[ $exit_code -eq 2 ]] && checks_critical=$((checks_critical+1)) || true
[[ $exit_code -ge 3 ]] && checks_failed=$((checks_failed+1)) || true
[[ $exit_code -gt $max_severity ]] && max_severity=$exit_code || true
fi
checks_run=$((checks_run+1))
# Disk Monitor
if run_check "Disk Monitor" "disk-monitor.sh"; then
checks_passed=$((checks_passed+1))
else
local exit_code=$?
[[ $exit_code -eq 1 ]] && checks_warned=$((checks_warned+1)) || true
[[ $exit_code -eq 2 ]] && checks_critical=$((checks_critical+1)) || true
[[ $exit_code -ge 3 ]] && checks_failed=$((checks_failed+1)) || true
[[ $exit_code -gt $max_severity ]] && max_severity=$exit_code || true
fi
checks_run=$((checks_run+1))
# SSL Monitor (optional)
if [[ "$SKIP_SSL" != "true" ]]; then
if run_check "SSL Monitor" "ssl-monitor.sh"; then
checks_passed=$((checks_passed+1))
else
local exit_code=$?
[[ $exit_code -eq 1 ]] && checks_warned=$((checks_warned+1)) || true
[[ $exit_code -eq 2 ]] && checks_critical=$((checks_critical+1)) || true
[[ $exit_code -ge 3 ]] && checks_failed=$((checks_failed+1)) || true
[[ $exit_code -gt $max_severity ]] && max_severity=$exit_code || true
fi
checks_run=$((checks_run+1))
fi
# Summary
log "INFO" "=== Monitoring Summary ==="
log "INFO" "Checks run: $checks_run"
log "INFO" "Passed: $checks_passed | Warned: $checks_warned | Critical: $checks_critical | Failed: $checks_failed"
if [[ $max_severity -eq 0 ]]; then
log "INFO" "All monitoring checks passed ✓"
elif [[ $max_severity -eq 1 ]]; then
log "WARN" "Some checks returned warnings"
elif [[ $max_severity -eq 2 ]]; then
log "CRITICAL" "Some checks returned critical alerts"
else
log "ERROR" "Some checks failed"
fi
log "INFO" "=== Monitoring Complete ==="
exit $max_severity
}
# Run main function
main
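The master script above is what cron should invoke. A minimal crontab sketch follows; the schedule, the install path /opt/tractatus, and ops@example.com are assumptions to adjust for the real deployment:

```shell
# Hypothetical crontab entries (edit with 'crontab -e'); paths and address
# are placeholders, not the actual production values.
# Full monitoring suite every 5 minutes, matching the log-monitor window
*/5 * * * * ALERT_EMAIL=ops@example.com /opt/tractatus/scripts/monitoring/monitor-all.sh --skip-ssl >> /var/log/tractatus/cron.log 2>&1
# Daily SSL check is enough; certificate expiry moves slowly
0 6 * * * ALERT_EMAIL=ops@example.com /opt/tractatus/scripts/monitoring/ssl-monitor.sh >> /var/log/tractatus/cron.log 2>&1
```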
scripts/monitoring/ssl-monitor.sh Executable file (319 lines)
@@ -0,0 +1,319 @@
#!/bin/bash
#
# SSL Certificate Monitoring Script
# Monitors SSL certificate expiry and alerts before expiration
#
# Usage:
# ./ssl-monitor.sh # Check all domains
# ./ssl-monitor.sh --domain example.com # Check specific domain
# ./ssl-monitor.sh --test # Test mode (no alerts)
#
# Exit codes:
# 0 = OK
# 1 = Warning (expires soon)
# 2 = Critical (expires very soon)
# 3 = Expired or error
set -euo pipefail
# Configuration
ALERT_EMAIL="${ALERT_EMAIL:-}"
LOG_FILE="/var/log/tractatus/ssl-monitor.log"
WARN_DAYS=30 # Warn 30 days before expiry
CRITICAL_DAYS=7 # Critical alert 7 days before expiry
# Default domains to monitor
DOMAINS=(
"agenticgovernance.digital"
)
# Parse arguments
TEST_MODE=false
SPECIFIC_DOMAIN=""
while [[ $# -gt 0 ]]; do
case $1 in
--domain)
SPECIFIC_DOMAIN="$2"
shift 2
;;
--test)
TEST_MODE=true
shift
;;
*)
echo "Unknown option: $1"
exit 3
;;
esac
done
# Override domains if specific domain provided
if [[ -n "$SPECIFIC_DOMAIN" ]]; then
DOMAINS=("$SPECIFIC_DOMAIN")
fi
# Logging function
log() {
local level="$1"
shift
local message="$*"
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
echo "[$timestamp] [$level] $message"
if [[ -d "$(dirname "$LOG_FILE")" ]]; then
echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
fi
}
# Send alert email
send_alert() {
local subject="$1"
local body="$2"
if [[ "$TEST_MODE" == "true" ]]; then
log "INFO" "TEST MODE: Would send alert: $subject"
return 0
fi
if [[ -z "$ALERT_EMAIL" ]]; then
log "WARN" "No alert email configured (ALERT_EMAIL not set)"
return 0
fi
if command -v mail &> /dev/null; then
echo "$body" | mail -s "$subject" "$ALERT_EMAIL"
log "INFO" "Alert email sent to $ALERT_EMAIL"
elif command -v sendmail &> /dev/null; then
{
echo "Subject: $subject"
echo "From: tractatus-monitoring@agenticgovernance.digital"
echo "To: $ALERT_EMAIL"
echo ""
echo "$body"
} | sendmail "$ALERT_EMAIL"
log "INFO" "Alert email sent via sendmail to $ALERT_EMAIL"
else
log "WARN" "No email command available"
fi
}
# Get SSL certificate expiry date
get_cert_expiry() {
local domain="$1"
# Use openssl to get certificate
local expiry_date
expiry_date=$(echo | openssl s_client -servername "$domain" -connect "$domain:443" 2>/dev/null | \
openssl x509 -noout -enddate 2>/dev/null | \
cut -d= -f2) || {
log "ERROR" "Failed to retrieve certificate for $domain"
return 1
}
echo "$expiry_date"
}
# Get days until expiry
get_days_until_expiry() {
local expiry_date="$1"
# Convert expiry date to seconds since epoch
local expiry_epoch
expiry_epoch=$(date -d "$expiry_date" +%s 2>/dev/null) || {
log "ERROR" "Failed to parse expiry date: $expiry_date"
return 1
}
# Get current time in seconds since epoch
local now_epoch=$(date +%s)
# Calculate days until expiry
local seconds_until_expiry=$((expiry_epoch - now_epoch))
local days_until_expiry=$((seconds_until_expiry / 86400))
echo "$days_until_expiry"
}
# Get certificate details
get_cert_details() {
local domain="$1"
echo | openssl s_client -servername "$domain" -connect "$domain:443" 2>/dev/null | \
openssl x509 -noout -subject -issuer -dates 2>/dev/null || {
echo "Failed to retrieve certificate details"
return 1
}
}
# Check single domain
check_domain() {
local domain="$1"
log "INFO" "Checking SSL certificate for $domain"
# Get expiry date
local expiry_date
expiry_date=$(get_cert_expiry "$domain") || {
log "ERROR" "Failed to check certificate for $domain"
return 3
}
# Calculate days until expiry
local days_until_expiry
days_until_expiry=$(get_days_until_expiry "$expiry_date") || {
log "ERROR" "Failed to calculate expiry for $domain"
return 3
}
# Check if expired
if [[ "$days_until_expiry" -lt 0 ]]; then
log "CRITICAL" "$domain: Certificate EXPIRED ${days_until_expiry#-} days ago!"
return 3
fi
# Check thresholds
if [[ "$days_until_expiry" -le "$CRITICAL_DAYS" ]]; then
log "CRITICAL" "$domain: Certificate expires in $days_until_expiry days (expires: $expiry_date)"
return 2
elif [[ "$days_until_expiry" -le "$WARN_DAYS" ]]; then
log "WARN" "$domain: Certificate expires in $days_until_expiry days (expires: $expiry_date)"
return 1
else
log "INFO" "$domain: Certificate valid for $days_until_expiry days (expires: $expiry_date)"
return 0
fi
}
# Main monitoring function
main() {
log "INFO" "Starting SSL certificate monitoring"
local max_severity=0
local expired_domains=()
local critical_domains=()
local warning_domains=()
# Check all domains
for domain in "${DOMAINS[@]}"; do
local exit_code=0
check_domain "$domain" || exit_code=$?
# Fetch expiry details only on the alert path, so healthy domains
# don't pay for a second TLS connection.
local expiry_date="Unknown"
local days_until_expiry="Unknown"
if [[ "$exit_code" -ne 0 ]]; then
expiry_date=$(get_cert_expiry "$domain" 2>/dev/null || echo "Unknown")
if [[ "$expiry_date" != "Unknown" ]]; then
days_until_expiry=$(get_days_until_expiry "$expiry_date" 2>/dev/null || echo "Unknown")
fi
fi
if [[ "$exit_code" -eq 3 ]]; then
max_severity=3
expired_domains+=("$domain (EXPIRED or ERROR)")
elif [[ "$exit_code" -eq 2 ]]; then
[[ "$max_severity" -lt 2 ]] && max_severity=2 || true
critical_domains+=("$domain (expires in $days_until_expiry days)")
elif [[ "$exit_code" -eq 1 ]]; then
[[ "$max_severity" -lt 1 ]] && max_severity=1 || true
warning_domains+=("$domain (expires in $days_until_expiry days)")
fi
done
# Send alerts based on severity
if [[ "$max_severity" -eq 3 ]]; then
local subject="[CRITICAL] SSL Certificate Expired or Error"
local body="CRITICAL: SSL certificate has expired or error occurred.
Expired/Error Domains:
$(printf -- "- %s\n" "${expired_domains[@]}")
"
# Add other alerts if any
if [[ "${#critical_domains[@]}" -gt 0 ]]; then
body+="
Critical Domains (<= $CRITICAL_DAYS days):
$(printf -- "- %s\n" "${critical_domains[@]}")
"
fi
if [[ "${#warning_domains[@]}" -gt 0 ]]; then
body+="
Warning Domains (<= $WARN_DAYS days):
$(printf -- "- %s\n" "${warning_domains[@]}")
"
fi
body+="
Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
Host: $(hostname)
Action Required:
1. Renew SSL certificate immediately
2. Check Let's Encrypt auto-renewal:
sudo certbot renew --dry-run
Certificate details:
$(get_cert_details "${DOMAINS[0]}")
Renewal commands:
# Test renewal
sudo certbot renew --dry-run
# Force renewal
sudo certbot renew --force-renewal
# Check certificate status
sudo certbot certificates
"
send_alert "$subject" "$body"
log "CRITICAL" "SSL certificate alert sent"
elif [[ "$max_severity" -eq 2 ]]; then
local subject="[CRITICAL] SSL Certificate Expires Soon"
local body="CRITICAL: SSL certificate expires in $CRITICAL_DAYS days or less.
Critical Domains (<= $CRITICAL_DAYS days):
$(printf -- "- %s\n" "${critical_domains[@]}")
"
if [[ "${#warning_domains[@]}" -gt 0 ]]; then
body+="
Warning Domains (<= $WARN_DAYS days):
$(printf -- "- %s\n" "${warning_domains[@]}")
"
fi
body+="
Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
Host: $(hostname)
Please renew certificates soon.
Check renewal:
sudo certbot renew --dry-run
"
send_alert "$subject" "$body"
log "CRITICAL" "SSL expiry alert sent"
elif [[ "$max_severity" -eq 1 ]]; then
local subject="[WARN] SSL Certificate Expires Soon"
local body="WARNING: SSL certificate expires in $WARN_DAYS days or less.
Warning Domains (<= $WARN_DAYS days):
$(printf -- "- %s\n" "${warning_domains[@]}")
Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
Host: $(hostname)
Please plan certificate renewal.
"
send_alert "$subject" "$body"
log "WARN" "SSL expiry warning sent"
else
log "INFO" "All SSL certificates valid"
fi
exit $max_severity
}
# Run main function
main
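The epoch arithmetic in get_days_until_expiry can be spot-checked against fixed dates. This sketch assumes GNU `date -d` (available on the Linux VPS target; BSD/macOS date differs):

```shell
#!/bin/bash
# Spot-check of the epoch arithmetic used by get_days_until_expiry,
# using fixed UTC dates (requires GNU 'date -d').
expiry_epoch=$(date -d "2030-01-01 00:00:00 UTC" +%s)
now_epoch=$(date -d "2029-12-02 00:00:00 UTC" +%s)
days=$(( (expiry_epoch - now_epoch) / 86400 ))
echo "days until expiry: $days"   # prints: days until expiry: 30
```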