ops: implement comprehensive production monitoring system

Create self-hosted, privacy-first monitoring infrastructure for production environment with automated health checks, log analysis, and alerting.

Monitoring Components:
- health-check.sh: Application health, service status, DB connectivity, disk space
- log-monitor.sh: Error detection, security events, anomaly detection
- disk-monitor.sh: Disk space usage monitoring (5 paths)
- ssl-monitor.sh: SSL certificate expiry monitoring
- monitor-all.sh: Master orchestration script

Features:
- Email alerting system (configurable thresholds)
- Consecutive failure tracking (prevents false positives)
- Test mode for safe deployment testing
- Comprehensive logging to /var/log/tractatus/
- Cron-ready for automated execution
- Exit codes for monitoring tool integration

Alert Triggers:
- Health: 3 consecutive failures (15min downtime)
- Logs: 10 errors OR 3 critical errors in 5min
- Disk: 80% warning, 90% critical
- SSL: 30 days warning, 7 days critical

Setup Documentation:
- Complete installation instructions
- Cron configuration examples
- Systemd timer alternative
- Troubleshooting guide
- Alert customization guide
- Incident response procedures

Privacy-First Design:
- Self-hosted (no external monitoring services)
- Minimal data exposure in alerts
- Local log storage only
- No telemetry to third parties

Aligns with Tractatus values: transparency, privacy, operational excellence

Addresses Phase 4 Prep Checklist Task #6: Production Monitoring & Alerting

Next: Deploy to production, configure email alerts, set up cron jobs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-09 22:23:40 +13:00
commit c755c49ec1
parent 1221941828
6 changed files with 1940 additions and 0 deletions
--- a/docs/PRODUCTION_MONITORING_SETUP.md
+++ b/docs/PRODUCTION_MONITORING_SETUP.md
@@ -0,0 +1,648 @@
 # Production Monitoring Setup
 **Project**: Tractatus AI Safety Framework Website
 **Environment**: Production (vps-93a693da.vps.ovh.net)
 **Created**: 2025-10-09
 **Status**: Ready for Deployment
 ---
 ## Overview
 Comprehensive monitoring system for Tractatus production environment, providing:
 - **Health monitoring** - Application uptime, service status, database connectivity
 - **Log monitoring** - Error detection, security events, anomaly detection
 - **Disk monitoring** - Disk space usage alerts
 - **SSL monitoring** - Certificate expiry warnings
 - **Email alerts** - Automated notifications for critical issues
 **Philosophy**: Privacy-first, self-hosted monitoring aligned with Tractatus values.
 ---
 ## Monitoring Components
 ### 1. Health Check Monitor (`health-check.sh`)
 **What it monitors:**
 - Application health endpoint (https://agenticgovernance.digital/health)
 - Systemd service status (tractatus.service)
 - MongoDB database connectivity
 - Disk space usage
 **Alert Triggers:**
 - Service not running
 - Health endpoint returns non-200
 - Database connection failed
 - Disk space > 90%
 **Frequency**: Every 5 minutes
 ### 2. Log Monitor (`log-monitor.sh`)
 **What it monitors:**
 - ERROR and CRITICAL log entries
 - Security events (authentication failures, unauthorized access)
 - Database errors
 - HTTP 500 errors
 - Unhandled exceptions
 **Alert Triggers:**
 - 10+ errors in 5-minute window
 - 3+ critical errors in 5-minute window
 - Any security events
 **Frequency**: Every 5 minutes
 **Follow Mode**: Can run continuously for real-time monitoring
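The error and critical thresholds above depend on counting matching log lines reliably. The following is a minimal sketch of one way to do that in Bash, not the code from `log-monitor.sh`: the helper names `count_matches` and `classify_log_window` are hypothetical, the thresholds simply mirror the values listed above, and security-event handling is left out. One detail worth showing is that `grep -c` prints `0` itself when nothing matches but exits non-zero, so the sketch falls back to `true` instead of echoing a second value.

```bash
#!/bin/bash
# Hypothetical helpers illustrating the log-monitor thresholds.
ERROR_THRESHOLD=10
CRITICAL_THRESHOLD=3

# Print how many lines of $1 contain the literal string $2 (case-insensitive).
count_matches() {
  local text="$1" pattern="$2"
  # grep -c prints the count even when it is 0, but exits 1 in that case.
  printf '%s\n' "$text" | grep -ciF -- "$pattern" || true
}

# Print CRITICAL, ERROR, or OK for a block of log text.
classify_log_window() {
  local logs="$1"
  local errors criticals
  errors=$(count_matches "$logs" "[ERROR]")
  criticals=$(count_matches "$logs" "[CRITICAL]")
  if (( criticals >= CRITICAL_THRESHOLD )); then
    echo "CRITICAL"
  elif (( errors >= ERROR_THRESHOLD )); then
    echo "ERROR"
  else
    echo "OK"
  fi
}
```

Critical entries are checked first so that a window containing both kinds of entries reports the more severe state.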
 ### 3. Disk Space Monitor (`disk-monitor.sh`)
 **What it monitors:**
 - Root filesystem (/)
 - Var directory (/var)
 - Log directory (/var/log)
 - Tractatus application (/var/www/tractatus)
 - Temp directory (/tmp)
 **Alert Triggers:**
 - Warning: 80%+ usage
 - Critical: 90%+ usage
 **Frequency**: Every 15 minutes
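As a rough illustration of how the two disk thresholds map to a status, here is a small Bash sketch. The function names `usage_percent` and `classify_usage` are hypothetical and are not taken from `disk-monitor.sh`; the sketch assumes a `df` that supports the POSIX `-P` flag.

```bash
#!/bin/bash
# Hypothetical helpers; thresholds mirror the values listed above.
WARN_THRESHOLD=80
CRITICAL_THRESHOLD=90

# Print the integer usage percentage of the filesystem that holds path $1.
usage_percent() {
  # -P forces POSIX output so each filesystem stays on a single line.
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Map a usage percentage to OK, WARN, or CRITICAL.
classify_usage() {
  local usage="$1"
  if (( usage >= CRITICAL_THRESHOLD )); then
    echo "CRITICAL"
  elif (( usage >= WARN_THRESHOLD )); then
    echo "WARN"
  else
    echo "OK"
  fi
}
```

A call such as `classify_usage "$(usage_percent /var/log)"` would then report the state of the log directory's filesystem.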
 ### 4. SSL Certificate Monitor (`ssl-monitor.sh`)
 **What it monitors:**
 - SSL certificate expiry for agenticgovernance.digital
 **Alert Triggers:**
 - Warning: Expires in 30 days or less
 - Critical: Expires in 7 days or less
 - Critical: Already expired
 **Frequency**: Daily
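The expiry triggers above come down to a days-remaining calculation. The sketch below shows one possible approach and is not the content of `ssl-monitor.sh`: `cert_expiry_date`, `days_until`, and `classify_expiry` are hypothetical names, and the code assumes GNU `date` (for `-d`) and the `openssl` command-line tool.

```bash
#!/bin/bash
# Hypothetical helpers; thresholds mirror the values listed above.
WARN_DAYS=30
CRITICAL_DAYS=7

# Print the notAfter date of the certificate served by host $1 on port 443.
cert_expiry_date() {
  local host="$1"
  echo | openssl s_client -servername "$host" -connect "$host:443" 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2
}

# Print whole days from now until the given date string.
# Integer division truncates toward zero, so partial days are dropped.
days_until() {
  local expiry_epoch now_epoch
  expiry_epoch=$(date -d "$1" +%s)
  now_epoch=$(date +%s)
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Map days remaining to OK, WARN, or CRITICAL (expired counts as CRITICAL).
classify_expiry() {
  local days="$1"
  if (( days <= CRITICAL_DAYS )); then
    echo "CRITICAL"
  elif (( days <= WARN_DAYS )); then
    echo "WARN"
  else
    echo "OK"
  fi
}
```

For example, `classify_expiry "$(days_until "$(cert_expiry_date agenticgovernance.digital)")"` chains the three steps for the monitored domain.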
 ### 5. Master Monitor (`monitor-all.sh`)
 Orchestrates all monitoring checks in a single run.
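The internals of `monitor-all.sh` are not shown in this guide, so the following is only a sketch of how an orchestration loop could combine the individual exit codes. The function name `run_checks` is hypothetical, and treating the highest exit code as the overall result is an assumption made for illustration.

```bash
#!/bin/bash
# Hypothetical sketch of an orchestration loop; not the actual monitor-all.sh.

# Run each command named in the arguments and return the highest exit code seen.
run_checks() {
  local worst=0 rc cmd
  for cmd in "$@"; do
    rc=0
    # Capture the exit code without aborting, even under `set -e`.
    "$cmd" || rc=$?
    if (( rc > worst )); then
      worst=$rc
    fi
  done
  return "$worst"
}
```

Called as `run_checks ./health-check.sh ./log-monitor.sh ./disk-monitor.sh`, it would keep running the remaining checks after one of them fails, which is what lets a single cron entry cover several monitors.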
 ---
 ## Installation
 ### Prerequisites
 ```bash
 # Ensure required commands are available
 sudo apt-get update
 sudo apt-get install -y curl jq openssl mailutils
 # Install MongoDB shell (if not installed)
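 # apt-key is deprecated; store the key in a dedicated keyring instead
 curl -fsSL https://www.mongodb.org/static/pgp/server-7.0.asc | sudo gpg -o /usr/share/keyrings/mongodb-server-7.0.gpg --dearmor
 echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-7.0.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list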
 sudo apt-get update
 sudo apt-get install -y mongodb-mongosh
 ```
 ### Deploy Monitoring Scripts
 ```bash
 # From local machine, deploy monitoring scripts to production
 rsync -avz -e "ssh -i ~/.ssh/tractatus_deploy" \
  scripts/monitoring/ \
  ubuntu@vps-93a693da.vps.ovh.net:/var/www/tractatus/scripts/monitoring/
 ```
 ### Set Up Log Directory
 ```bash
 # On production server
 ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net
 # Create log directory
 sudo mkdir -p /var/log/tractatus
 sudo chown ubuntu:ubuntu /var/log/tractatus
 sudo chmod 755 /var/log/tractatus
 ```
 ### Make Scripts Executable
 ```bash
 # On production server
 cd /var/www/tractatus/scripts/monitoring
 chmod +x *.sh
 ```
 ### Configure Email Alerts
 **Option 1: Using Postfix (Recommended for production)**
 ```bash
 # Install Postfix
 sudo apt-get install -y postfix
 # Configure Postfix (select "Internet Site")
 sudo dpkg-reconfigure postfix
 # Set ALERT_EMAIL environment variable
 echo 'export ALERT_EMAIL="your-email@example.com"' | sudo tee -a /etc/environment
 source /etc/environment
 ```
 **Option 2: Using External SMTP (ProtonMail, Gmail, etc.)**
 ```bash
 # Install sendemail
 sudo apt-get install -y sendemail libio-socket-ssl-perl libnet-ssleay-perl
 # Configure in monitoring scripts (or use system mail)
 ```
 **Option 3: No Email (Testing)**
 ```bash
 # Leave ALERT_EMAIL unset - monitoring will log but not send emails
 # Useful for initial testing
 ```
 ### Test Monitoring Scripts
 ```bash
 # Test health check
 cd /var/www/tractatus/scripts/monitoring
 ./health-check.sh --test
 # Test log monitor
 ./log-monitor.sh --since "10 minutes ago" --test
 # Test disk monitor
 ./disk-monitor.sh --test
 # Test SSL monitor
 ./ssl-monitor.sh --test
 # Test master monitor
 ./monitor-all.sh --test
 ```
 Expected output: Each script should run without errors and show `[INFO]` messages.
 ---
 ## Cron Configuration
 ### Create Monitoring Cron Jobs
 ```bash
 # On production server
 crontab -e
 ```
 Add the following cron jobs:
 ```cron
 # Tractatus Production Monitoring
 # Logs: /var/log/tractatus/monitoring.log
 # Master monitoring (every 5 minutes)
 # Runs: health check, log monitor, disk monitor
 */5 * * * * /var/www/tractatus/scripts/monitoring/monitor-all.sh --skip-ssl >> /var/log/tractatus/cron-monitor.log 2>&1
 # SSL certificate check (daily at 3am)
 0 3 * * * /var/www/tractatus/scripts/monitoring/ssl-monitor.sh >> /var/log/tractatus/cron-ssl.log 2>&1
 # Disk monitor (every 15 minutes - separate from master for frequency control)
 */15 * * * * /var/www/tractatus/scripts/monitoring/disk-monitor.sh >> /var/log/tractatus/cron-disk.log 2>&1
 ```
 ### Verify Cron Jobs
 ```bash
 # List active cron jobs
 crontab -l
 # Check cron logs
 sudo journalctl -u cron -f
 # Wait 5 minutes, then check monitoring logs
 tail -f /var/log/tractatus/cron-monitor.log
 ```
 ### Alternative: Systemd Timers (Optional)
 Systemd timers are a more modern alternative to cron and provide better logging and failure handling.
 **Create timer file**: `/etc/systemd/system/tractatus-monitoring.timer`
 ```ini
 [Unit]
 Description=Tractatus Monitoring Timer
 Requires=tractatus-monitoring.service
 [Timer]
 OnBootSec=5min
 OnUnitActiveSec=5min
 AccuracySec=1s
 [Install]
 WantedBy=timers.target
 ```
 **Create service file**: `/etc/systemd/system/tractatus-monitoring.service`
 ```ini
 [Unit]
 Description=Tractatus Production Monitoring
 After=network.target tractatus.service
 [Service]
 Type=oneshot
 User=ubuntu
 WorkingDirectory=/var/www/tractatus
 ExecStart=/var/www/tractatus/scripts/monitoring/monitor-all.sh --skip-ssl
 StandardOutput=journal
 StandardError=journal
 Environment="ALERT_EMAIL=your-email@example.com"
 [Install]
 WantedBy=multi-user.target
 ```
 **Enable and start:**
 ```bash
 sudo systemctl daemon-reload
 sudo systemctl enable tractatus-monitoring.timer
 sudo systemctl start tractatus-monitoring.timer
 # Check status
 sudo systemctl status tractatus-monitoring.timer
 sudo systemctl list-timers
 ```
 ---
 ## Alert Configuration
 ### Alert Thresholds
 **Health Check:**
 - Consecutive failures: 3 (alerts on 3rd failure)
 - Check interval: 5 minutes
 - Time to alert: 10 to 15 minutes after an outage begins (the 3rd consecutive failed check)
 **Log Monitor:**
 - Error threshold: 10 errors in 5 minutes
 - Critical threshold: 3 critical errors in 5 minutes
 - Security events: Immediate alert
 **Disk Space:**
 - Warning: 80% usage
 - Critical: 90% usage
 **SSL Certificate:**
 - Warning: 30 days until expiry
 - Critical: 7 days until expiry
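The consecutive-failure rule for the health check can be sketched with a small state file. This is an illustration under stated assumptions rather than the code in `health-check.sh`: the function name `record_result` and the default state-file path are hypothetical, and the threshold mirrors the value listed above.

```bash
#!/bin/bash
# Hypothetical sketch of consecutive-failure tracking with a state file.
STATE_FILE="${STATE_FILE:-/var/tmp/example-health-state}"
MAX_FAILURES=3

# Record one check result ("ok" or "fail") and print ALERT once the
# failure streak reaches MAX_FAILURES, otherwise print QUIET.
record_result() {
  local result="$1" count=0
  if [[ -f "$STATE_FILE" ]]; then
    count=$(<"$STATE_FILE")
  fi
  # Treat an empty or corrupt state file as zero failures.
  [[ "$count" =~ ^[0-9]+$ ]] || count=0
  if [[ "$result" == "ok" ]]; then
    count=0
  else
    count=$((count + 1))
  fi
  echo "$count" > "$STATE_FILE"
  if (( count >= MAX_FAILURES )); then
    echo "ALERT"
  else
    echo "QUIET"
  fi
}
```

A single successful check resets the streak, so a brief blip that recovers before the third check never produces an alert.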
 ### Customize Alerts
 Edit thresholds in scripts:
 ```bash
 # Health check thresholds
 vi /var/www/tractatus/scripts/monitoring/health-check.sh
 # Change: MAX_FAILURES=3
 # Log monitor thresholds
 vi /var/www/tractatus/scripts/monitoring/log-monitor.sh
 # Change: ERROR_THRESHOLD=10
 # Change: CRITICAL_THRESHOLD=3
 # Disk monitor thresholds
 vi /var/www/tractatus/scripts/monitoring/disk-monitor.sh
 # Change: WARN_THRESHOLD=80
 # Change: CRITICAL_THRESHOLD=90
 # SSL monitor thresholds
 vi /var/www/tractatus/scripts/monitoring/ssl-monitor.sh
 # Change: WARN_DAYS=30
 # Change: CRITICAL_DAYS=7
 ```
 ---
 ## Manual Monitoring Commands
 ### Check Current Status
 ```bash
 # Run all monitors manually
 cd /var/www/tractatus/scripts/monitoring
 ./monitor-all.sh
 # Run individual monitors
 ./health-check.sh
 ./log-monitor.sh --since "1 hour ago"
 ./disk-monitor.sh
 ./ssl-monitor.sh
 ```
 ### View Monitoring Logs
 ```bash
 # View all monitoring logs
 tail -f /var/log/tractatus/monitoring.log
 # View specific monitor logs
 tail -f /var/log/tractatus/health-check.log
 tail -f /var/log/tractatus/log-monitor.log
 tail -f /var/log/tractatus/disk-monitor.log
 tail -f /var/log/tractatus/ssl-monitor.log
 # View cron execution logs
 tail -f /var/log/tractatus/cron-monitor.log
 ```
 ### Test Alert Delivery
 ```bash
 # Send test alert
 cd /var/www/tractatus/scripts/monitoring
 # In test mode no email is sent; once the failure threshold is reached,
 # the script logs "TEST MODE: Would send alert" instead
 ./health-check.sh --test
 # Force alert by temporarily stopping service
 sudo systemctl stop tractatus
 ./health-check.sh  # Run 3 times (or wait for 3 cron runs); the alert fires on the 3rd consecutive failure
 sudo systemctl start tractatus
 ```
 ---
 ## Troubleshooting
 ### No Alerts Received
 **Check email configuration:**
 ```bash
 # Verify ALERT_EMAIL is set
 echo "$ALERT_EMAIL"
 # Test mail command
 echo "Test email" | mail -s "Test Subject" "$ALERT_EMAIL"
 # Check mail logs
 sudo tail -f /var/log/mail.log
 ```
 **Check cron execution:**
 ```bash
 # Verify cron jobs are running
 crontab -l
 # Check cron logs
 sudo journalctl -u cron -n 50
 # Check script logs
 tail -100 /var/log/tractatus/cron-monitor.log
 ```
 ### Scripts Not Executing
 **Check permissions:**
 ```bash
 ls -la /var/www/tractatus/scripts/monitoring/
 # Should show: -rwxr-xr-x (executable)
 # Fix if needed
 chmod +x /var/www/tractatus/scripts/monitoring/*.sh
 ```
 **Check cron PATH:**
 ```bash
 # Add to crontab
 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
 # Or use full paths in cron commands
 ```
 ### High Alert Frequency
 **Increase thresholds:**
 Edit threshold values in scripts (see Alert Configuration section).
 **Increase consecutive failure count:**
 ```bash
 vi /var/www/tractatus/scripts/monitoring/health-check.sh
 # Increase MAX_FAILURES from 3 to 5 or higher
 ```
 ### False Positives
 **Review alert conditions:**
 ```bash
 # Check recent logs to understand why alerts triggered
 tail -100 /var/log/tractatus/monitoring.log
 # Run manual check with verbose output
 ./health-check.sh
 # Check if service is actually unhealthy
 sudo systemctl status tractatus
 curl https://agenticgovernance.digital/health
 ```
 ---
 ## Monitoring Dashboard (Optional - Future Enhancement)
 ### Option 1: Grafana + Prometheus
 Self-hosted metrics dashboard (requires setup).
 ### Option 2: Simple Web Dashboard
 Create minimal status page showing last check results.
 ### Option 3: UptimeRobot Free Tier
 External monitoring service (privacy tradeoff).
 **Not implemented yet** - current solution uses email alerts only.
 ---
 ## Best Practices
 ### DO:
 - ✅ Test monitoring scripts before deploying
 - ✅ Check alert emails regularly
 - ✅ Review monitoring logs weekly
 - ✅ Adjust thresholds based on actual patterns
 - ✅ Document any monitoring configuration changes
 - ✅ Keep monitoring scripts updated
 ### DON'T:
 - ❌ Ignore alert emails
 - ❌ Set thresholds too low (alert fatigue)
 - ❌ Deploy monitoring without testing
 - ❌ Disable monitoring without planning
 - ❌ Let log files grow unbounded
 - ❌ Ignore repeated warnings
 ### Monitoring Hygiene
 ```bash
 # Rotate monitoring logs weekly
 sudo logrotate /etc/logrotate.d/tractatus-monitoring
 # Clean up old state files
 find /var/tmp -name "tractatus-*-state" -mtime +7 -delete
 # Review alert frequency monthly
 grep "\[ALERT\]" /var/log/tractatus/monitoring.log | wc -l
 ```
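The `logrotate` command above expects a policy file at `/etc/logrotate.d/tractatus-monitoring`, which this guide does not define. The sketch below prints one possible policy; the function name `print_logrotate_config` is hypothetical, and the retention values (weekly rotation, 8 files kept) are assumptions rather than project settings.

```bash
#!/bin/bash
# Hypothetical sketch: prints a logrotate policy for the monitoring logs.
# The retention values (weekly, keep 8) are assumptions, not project settings.
print_logrotate_config() {
  cat <<'EOF'
/var/log/tractatus/*.log {
    weekly
    rotate 8
    compress
    delaycompress
    missingok
    notifempty
    create 640 ubuntu ubuntu
}
EOF
}
```

The output could be installed with `print_logrotate_config | sudo tee /etc/logrotate.d/tractatus-monitoring` and dry-run with `sudo logrotate --debug /etc/logrotate.d/tractatus-monitoring` before relying on it.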
 ---
 ## Incident Response
 ### When Alert Received
 1. **Acknowledge alert** - Note time received
 2. **Check current status** - Run manual health check
 3. **Review logs** - Check what triggered alert
 4. **Investigate root cause** - See deployment checklist emergency procedures
 5. **Take action** - Fix issue or escalate
 6. **Document** - Create incident report
 ### Critical Alert Response Time
 - **Health check failure**: Respond within 15 minutes
 - **Log errors**: Respond within 30 minutes
 - **Disk space critical**: Respond within 1 hour
 - **SSL expiry (7 days)**: Respond within 24 hours
 ---
 ## Maintenance
 ### Weekly Tasks
 - [ ] Review monitoring logs for patterns
 - [ ] Check alert email inbox
 - [ ] Verify cron jobs still running
 - [ ] Review disk space trends
 ### Monthly Tasks
 - [ ] Review and adjust alert thresholds
 - [ ] Clean up old monitoring logs
 - [ ] Test manual failover procedures
 - [ ] Update monitoring documentation
 ### Quarterly Tasks
 - [ ] Full monitoring system audit
 - [ ] Test all alert scenarios
 - [ ] Review incident response times
 - [ ] Consider monitoring enhancements
 ---
 ## Monitoring Metrics
 ### Success Metrics
 - **Uptime**: Target 99.9% (about 43 minutes of downtime per 30-day month)
 - **Alert Response Time**: < 30 minutes for critical
 - **False Positive Rate**: < 5% of alerts
 - **Detection Time**: < 5 minutes for critical issues
 ### Tracking
 ```bash
 # Count successful health endpoint checks (a rough proxy for uptime)
 grep "Health endpoint OK" /var/log/tractatus/monitoring.log | wc -l
 # Count alerts sent
 grep "Alert email sent" /var/log/tractatus/monitoring.log | wc -l
 # Review response times (manual from incident reports)
 ```
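Raw counts become easier to compare against the uptime target when expressed as a percentage. The following is a minimal sketch, assuming the log contains the "All health checks passed" and "Health check failed" messages that `health-check.sh` writes; the function name `check_success_rate` is hypothetical, and the result measures the share of passing checks, which only approximates true uptime.

```bash
#!/bin/bash
# Hypothetical helper: estimates the share of health checks that passed.
# It relies on the "All health checks passed" and "Health check failed"
# messages that health-check.sh writes to its log.
check_success_rate() {
  local log_file="$1" passed failed total
  # grep -c prints 0 and exits 1 when nothing matches, hence `|| true`.
  passed=$(grep -c "All health checks passed" "$log_file" || true)
  failed=$(grep -c "Health check failed" "$log_file" || true)
  total=$((passed + failed))
  if (( total == 0 )); then
    echo "n/a"
    return 0
  fi
  # awk handles the floating-point division that Bash arithmetic lacks.
  awk -v p="$passed" -v t="$total" 'BEGIN { printf "%.2f\n", (p / t) * 100 }'
}
```

For instance, `check_success_rate /var/log/tractatus/health-check.log` prints the percentage of recorded checks that passed, or `n/a` when the log holds no check results.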
 ---
 ## Security Considerations
 ### Log Access Control
 ```bash
 # Ensure logs are readable only by ubuntu user and root
 sudo chown ubuntu:ubuntu /var/log/tractatus/*.log
 sudo chmod 640 /var/log/tractatus/*.log
 ```
 ### Alert Email Security
 - Use encrypted email if possible (ProtonMail)
 - Don't include sensitive data in alert body
 - Alerts show symptoms, not credentials
 ### Monitoring Script Security
 - Scripts run as ubuntu user (not root)
 - No credentials embedded in scripts
 - Use environment variables for sensitive config
 ---
 ## Future Enhancements
 ### Planned Improvements
 - [ ] **Metrics collection**: Store monitoring metrics in database for trend analysis
 - [ ] **Status page**: Public status page showing service availability
 - [ ] **Mobile alerts**: SMS or push notifications for critical alerts
 - [ ] **Distributed monitoring**: Multiple monitoring locations for redundancy
 - [ ] **Automated remediation**: Auto-restart service on failure
 - [ ] **Performance monitoring**: Response time tracking, query performance
 - [ ] **User impact monitoring**: Track error rates from user perspective
 ### Integration Opportunities
 - [ ] **Plausible Analytics**: Monitor traffic patterns, correlate with errors
 - [ ] **GitHub Actions**: Run monitoring checks in CI/CD
 - [ ] **Slack integration**: Send alerts to Slack channel
 - [ ] **Database backup monitoring**: Alert on backup failures
 ---
 ## Support & Documentation
 **Monitoring Scripts**: `/var/www/tractatus/scripts/monitoring/`
 **Monitoring Logs**: `/var/log/tractatus/`
 **Cron Configuration**: `crontab -l` (ubuntu user)
 **Alert Email**: Set via `ALERT_EMAIL` environment variable
 **Related Documents:**
 - [Production Deployment Checklist](PRODUCTION_DEPLOYMENT_CHECKLIST.md)
 - [Phase 4 Preparation Checklist](../PHASE-4-PREPARATION-CHECKLIST.md)
 ---
 **Document Status**: Ready for Production
 **Last Updated**: 2025-10-09
 **Next Review**: After 1 month of monitoring data
 **Maintainer**: Technical Lead (Claude Code + John Stroh)
--- a/scripts/monitoring/disk-monitor.sh
+++ b/scripts/monitoring/disk-monitor.sh
@@ -0,0 +1,257 @@
 #!/bin/bash
 #
 # Disk Space Monitoring Script
 # Monitors disk space usage and alerts when thresholds exceeded
 #
 # Usage:
 #   ./disk-monitor.sh          # Check all monitored paths
 #   ./disk-monitor.sh --test   # Test mode (no alerts)
 #
 # Exit codes:
 #   0 = OK
 #   1 = Warning threshold exceeded
 #   2 = Critical threshold exceeded
 set -euo pipefail
 # Configuration
 ALERT_EMAIL="${ALERT_EMAIL:-}"
 LOG_FILE="/var/log/tractatus/disk-monitor.log"
 WARN_THRESHOLD=80      # Warn at 80% usage
 CRITICAL_THRESHOLD=90  # Critical at 90% usage
 # Paths to monitor
 declare -A MONITORED_PATHS=(
  ["/"]="Root filesystem"
  ["/var"]="Var directory"
  ["/var/log"]="Log directory"
  ["/var/www/tractatus"]="Tractatus application"
  ["/tmp"]="Temp directory"
 )
 # Parse arguments
 TEST_MODE=false
 while [[ $# -gt 0 ]]; do
  case $1 in
    --test)
      TEST_MODE=true
      shift
      ;;
    *)
      echo "Unknown option: $1"
      exit 3
      ;;
  esac
 done
 # Logging function
 log() {
  local level="$1"
  shift
  local message="$*"
  local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
  echo "[$timestamp] [$level] $message"
  if [[ -d "$(dirname "$LOG_FILE")" ]]; then
    echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
  fi
 }
 # Send alert email
 send_alert() {
  local subject="$1"
  local body="$2"
  if [[ "$TEST_MODE" == "true" ]]; then
    log "INFO" "TEST MODE: Would send alert: $subject"
    return 0
  fi
  if [[ -z "$ALERT_EMAIL" ]]; then
    log "WARN" "No alert email configured (ALERT_EMAIL not set)"
    return 0
  fi
  if command -v mail &> /dev/null; then
    echo "$body" | mail -s "$subject" "$ALERT_EMAIL"
    log "INFO" "Alert email sent to $ALERT_EMAIL"
  elif command -v sendmail &> /dev/null; then
    {
      echo "Subject: $subject"
      echo "From: tractatus-monitoring@agenticgovernance.digital"
      echo "To: $ALERT_EMAIL"
      echo ""
      echo "$body"
    } | sendmail "$ALERT_EMAIL"
    log "INFO" "Alert email sent via sendmail to $ALERT_EMAIL"
  else
    log "WARN" "No email command available"
  fi
 }
 # Get disk usage for path
 get_disk_usage() {
  local path="$1"
  # Check if path exists
  if [[ ! -e "$path" ]]; then
    echo "N/A"
    return 1
  fi
  # Get usage percentage (remove % sign)
  # -P keeps each filesystem on one line, even with long device names
  df -P "$path" 2>/dev/null | awk 'NR==2 {print $5}' | sed 's/%//' || echo "N/A"
 }
 # Get human-readable disk usage details
 get_disk_details() {
  local path="$1"
  if [[ ! -e "$path" ]]; then
    echo "Path does not exist"
    return 1
  fi
  df -h "$path" 2>/dev/null | awk 'NR==2 {printf "Size: %s | Used: %s | Avail: %s | Use%%: %s | Mounted: %s\n", $2, $3, $4, $5, $6}'
 }
 # Find largest directories in path
 find_largest_dirs() {
  local path="$1"
  local limit="${2:-10}"
  if [[ ! -e "$path" ]]; then
    return 1
  fi
  du -h "$path"/* 2>/dev/null | sort -rh | head -n "$limit" || echo "Unable to scan directory"
 }
 # Check single path
 check_path() {
  local path="$1"
  local description="$2"
  local usage=$(get_disk_usage "$path")
  if [[ "$usage" == "N/A" ]]; then
    log "WARN" "$description ($path): Unable to check"
    return 0
  fi
  if [[ "$usage" -ge "$CRITICAL_THRESHOLD" ]]; then
    log "CRITICAL" "$description ($path): ${usage}% used (>= $CRITICAL_THRESHOLD%)"
    return 2
  elif [[ "$usage" -ge "$WARN_THRESHOLD" ]]; then
    log "WARN" "$description ($path): ${usage}% used (>= $WARN_THRESHOLD%)"
    return 1
  else
    log "INFO" "$description ($path): ${usage}% used"
    return 0
  fi
 }
 # Main monitoring function
 main() {
  log "INFO" "Starting disk space monitoring"
  local max_severity=0
  local issues=()
  local critical_paths=()
  local warning_paths=()
  # Check all monitored paths
  for path in "${!MONITORED_PATHS[@]}"; do
    local description="${MONITORED_PATHS[$path]}"
    local exit_code=0
    check_path "$path" "$description" || exit_code=$?
    if [[ "$exit_code" -eq 2 ]]; then
      max_severity=2
      critical_paths+=("$path (${description})")
    elif [[ "$exit_code" -eq 1 ]]; then
      [[ "$max_severity" -lt 1 ]] && max_severity=1
      warning_paths+=("$path (${description})")
    fi
  done
  # Send alerts if thresholds exceeded
  if [[ "$max_severity" -eq 2 ]]; then
    local subject="[CRITICAL] Tractatus Disk Space Critical"
    local body="CRITICAL: Disk space usage has exceeded ${CRITICAL_THRESHOLD}% on one or more paths.
 Critical Paths (>= ${CRITICAL_THRESHOLD}%):
 $(printf -- "- %s\n" "${critical_paths[@]}")
 "
    # Add warning paths if any
    if [[ "${#warning_paths[@]}" -gt 0 ]]; then
      body+="
 Warning Paths (>= ${WARN_THRESHOLD}%):
 $(printf -- "- %s\n" "${warning_paths[@]}")
 "
    fi
    body+="
 Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
 Host: $(hostname)
 Disk Usage Details:
 $(df -h)
 Largest directories in /var/www/tractatus:
 $(find_largest_dirs /var/www/tractatus 10)
 Largest log files:
 $(du -h /var/log/tractatus/*.log 2>/dev/null | sort -rh | head -10 || echo "No log files found")
 Action Required:
 1. Clean up old log files
 2. Remove unnecessary files
 3. Check for runaway processes creating large files
 4. Consider expanding disk space
 Clean up commands:
 # Rotate old logs
 sudo journalctl --vacuum-time=7d
 # Clean up npm cache
 npm cache clean --force
 # Find large files
 find /var/www/tractatus -type f -size +100M -exec ls -lh {} \;
 "
    send_alert "$subject" "$body"
    log "CRITICAL" "Disk space alert sent"
  elif [[ "$max_severity" -eq 1 ]]; then
    local subject="[WARN] Tractatus Disk Space Warning"
    local body="WARNING: Disk space usage has exceeded ${WARN_THRESHOLD}% on one or more paths.
 Warning Paths (>= ${WARN_THRESHOLD}%):
 $(printf -- "- %s\n" "${warning_paths[@]}")
 Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
 Host: $(hostname)
 Disk Usage:
 $(df -h)
 Please review disk usage and clean up if necessary.
 "
    send_alert "$subject" "$body"
    log "WARN" "Disk space warning sent"
  else
    log "INFO" "All monitored paths within acceptable limits"
  fi
  exit $max_severity
 }
 # Run main function
 main
--- a/scripts/monitoring/health-check.sh
+++ b/scripts/monitoring/health-check.sh
@@ -0,0 +1,269 @@
 #!/bin/bash
 #
 # Health Check Monitoring Script
 # Monitors Tractatus application health endpoint and service status
 #
 # Usage:
 #   ./health-check.sh                 # Run check, alert if issues
 #   ./health-check.sh --quiet         # Suppress output unless error
 #   ./health-check.sh --test          # Test mode (no alerts)
 #
 # Exit codes:
 #   0 = Healthy
 #   1 = One or more checks failed (service, endpoint, database, or disk)
 #   3 = Configuration error
 set -euo pipefail
 # Configuration
 HEALTH_URL="${HEALTH_URL:-https://agenticgovernance.digital/health}"
 SERVICE_NAME="${SERVICE_NAME:-tractatus}"
 ALERT_EMAIL="${ALERT_EMAIL:-}"
 LOG_FILE="/var/log/tractatus/health-check.log"
 STATE_FILE="/var/tmp/tractatus-health-state"
 MAX_FAILURES=3  # Alert after 3 consecutive failures
 # Parse arguments
 QUIET=false
 TEST_MODE=false
 while [[ $# -gt 0 ]]; do
  case $1 in
    --quiet) QUIET=true; shift ;;
    --test) TEST_MODE=true; shift ;;
    *) echo "Unknown option: $1"; exit 3 ;;
  esac
 done
 # Logging function
 log() {
  local level="$1"
  shift
  local message="$*"
  local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
  if [[ "$QUIET" != "true" ]] || [[ "$level" == "ERROR" ]] || [[ "$level" == "CRITICAL" ]]; then
    echo "[$timestamp] [$level] $message"
  fi
  # Log to file if directory exists
  if [[ -d "$(dirname "$LOG_FILE")" ]]; then
    echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
  fi
 }
 # Get current failure count
 get_failure_count() {
  if [[ -f "$STATE_FILE" ]]; then
    cat "$STATE_FILE"
  else
    echo "0"
  fi
 }
 # Increment failure count
 increment_failure_count() {
  local count=$(get_failure_count)
  echo $((count + 1)) > "$STATE_FILE"
 }
 # Reset failure count
 reset_failure_count() {
  echo "0" > "$STATE_FILE"
 }
 # Send alert email
 send_alert() {
  local subject="$1"
  local body="$2"
  if [[ "$TEST_MODE" == "true" ]]; then
    log "INFO" "TEST MODE: Would send alert: $subject"
    return 0
  fi
  if [[ -z "$ALERT_EMAIL" ]]; then
    log "WARN" "No alert email configured (ALERT_EMAIL not set)"
    return 0
  fi
  # Try to send email using mail command (if available)
  if command -v mail &> /dev/null; then
    echo "$body" | mail -s "$subject" "$ALERT_EMAIL"
    log "INFO" "Alert email sent to $ALERT_EMAIL"
  elif command -v sendmail &> /dev/null; then
    {
      echo "Subject: $subject"
      echo "From: tractatus-monitoring@agenticgovernance.digital"
      echo "To: $ALERT_EMAIL"
      echo ""
      echo "$body"
    } | sendmail "$ALERT_EMAIL"
    log "INFO" "Alert email sent via sendmail to $ALERT_EMAIL"
  else
    log "WARN" "No email command available (install mailutils or sendmail)"
  fi
 }
 # Check health endpoint
 check_health_endpoint() {
  log "INFO" "Checking health endpoint: $HEALTH_URL"
  # Make HTTP request with timeout
  local response
  local http_code
  response=$(curl -s -w "\n%{http_code}" --max-time 10 "$HEALTH_URL" 2>&1) || {
    log "ERROR" "Health endpoint request failed: $response"
    return 1
  }
  # Extract HTTP code (last line)
  http_code=$(echo "$response" | tail -n 1)
  # Extract response body (everything except last line)
  local body=$(echo "$response" | sed '$d')
  # Check HTTP status
  if [[ "$http_code" != "200" ]]; then
    log "ERROR" "Health endpoint returned HTTP $http_code"
    return 1
  fi
  # Check response contains expected JSON
  if ! echo "$body" | jq -e '.status == "ok"' &> /dev/null; then
    log "ERROR" "Health endpoint response invalid: $body"
    return 1
  fi
  log "INFO" "Health endpoint OK (HTTP $http_code)"
  return 0
 }
 # Check systemd service status
 check_service_status() {
  log "INFO" "Checking service status: $SERVICE_NAME"
  if ! systemctl is-active --quiet "$SERVICE_NAME"; then
    log "ERROR" "Service $SERVICE_NAME is not active"
    return 2
  fi
  # Check if service is enabled
  if ! systemctl is-enabled --quiet "$SERVICE_NAME"; then
    log "WARN" "Service $SERVICE_NAME is not enabled (won't start on boot)"
  fi
  log "INFO" "Service $SERVICE_NAME is active"
  return 0
 }
 # Check database connectivity (quick MongoDB ping)
 check_database() {
  log "INFO" "Checking database connectivity"
  # Try to connect to MongoDB (timeout 5 seconds)
  if ! timeout 5 mongosh --quiet --eval "db.adminCommand('ping')" localhost:27017/tractatus_prod &> /dev/null; then
    log "ERROR" "Database connection failed"
    return 1
  fi
  log "INFO" "Database connectivity OK"
  return 0
 }
 # Check disk space
 check_disk_space() {
  log "INFO" "Checking disk space"
  # Get root filesystem usage percentage
  local usage=$(df -h / | awk 'NR==2 {print $5}' | sed 's/%//')
  if [[ "$usage" -gt 90 ]]; then
    log "CRITICAL" "Disk space critical: ${usage}% used"
    return 1
  elif [[ "$usage" -gt 80 ]]; then
    log "WARN" "Disk space high: ${usage}% used"
  else
    log "INFO" "Disk space OK: ${usage}% used"
  fi
  return 0
 }
 # Main health check
 main() {
  log "INFO" "Starting health check"
  local all_healthy=true
  local issues=()
  # Run all checks
  if ! check_service_status; then
    all_healthy=false
    issues+=("Service not running")
  fi
  if ! check_health_endpoint; then
    all_healthy=false
    issues+=("Health endpoint failed")
  fi
  if ! check_database; then
    all_healthy=false
    issues+=("Database connectivity failed")
  fi
  if ! check_disk_space; then
    all_healthy=false
    issues+=("Disk space issue")
  fi
  # Handle results
  if [[ "$all_healthy" == "true" ]]; then
    log "INFO" "All health checks passed ✓"
    reset_failure_count
    exit 0
  else
    log "ERROR" "Health check failed: ${issues[*]}"
    increment_failure_count
    local failure_count=$(get_failure_count)
    log "WARN" "Consecutive failures: $failure_count/$MAX_FAILURES"
    # Alert if threshold reached
    if [[ "$failure_count" -ge "$MAX_FAILURES" ]]; then
      local subject="[ALERT] Tractatus Health Check Failed ($failure_count failures)"
      local body="Tractatus health check has failed $failure_count times consecutively.
 Issues detected:
 $(printf -- "- %s\n" "${issues[@]}")
 Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
 Host: $(hostname)
 Service: $SERVICE_NAME
 Health URL: $HEALTH_URL
 Please investigate immediately.
 View logs:
 sudo journalctl -u $SERVICE_NAME -n 100
 Check service status:
 sudo systemctl status $SERVICE_NAME
 Restart service:
 sudo systemctl restart $SERVICE_NAME
 "
      send_alert "$subject" "$body"
      log "CRITICAL" "Alert sent after $failure_count consecutive failures"
    fi
    exit 1
  fi
 }
 # Run main function
 main
--- a/scripts/monitoring/log-monitor.sh
+++ b/scripts/monitoring/log-monitor.sh
@@ -0,0 +1,269 @@
 #!/bin/bash
 #
 # Log Monitoring Script
 # Monitors Tractatus service logs for errors, security events, and anomalies
 #
 # Usage:
 #   ./log-monitor.sh                  # Monitor logs since last check
 #   ./log-monitor.sh --since "1 hour ago" # Monitor specific time window
 #   ./log-monitor.sh --follow         # Continuous monitoring
 #   ./log-monitor.sh --test           # Test mode (no alerts)
 #
 # Exit codes:
 #   0 = No issues found
 #   1 = Errors detected
 #   2 = Critical errors detected
 #   3 = Configuration error
 set -euo pipefail
 # Configuration
 SERVICE_NAME="${SERVICE_NAME:-tractatus}"
 ALERT_EMAIL="${ALERT_EMAIL:-}"
 LOG_FILE="/var/log/tractatus/log-monitor.log"
 STATE_FILE="/var/tmp/tractatus-log-monitor-state"
 ERROR_THRESHOLD=10     # Alert after 10 errors in window
 CRITICAL_THRESHOLD=3   # Alert immediately after 3 critical errors
 # Parse arguments
 SINCE="5 minutes ago"
 FOLLOW=false
 TEST_MODE=false
 while [[ $# -gt 0 ]]; do
  case $1 in
    --since)
      SINCE="$2"
      shift 2
      ;;
    --follow)
      FOLLOW=true
      shift
      ;;
    --test)
      TEST_MODE=true
      shift
      ;;
    *)
      echo "Unknown option: $1"
      exit 3
      ;;
  esac
 done
 # Logging function
 log() {
  local level="$1"
  shift
  local message="$*"
  local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
  echo "[$timestamp] [$level] $message"
  # Log to file if directory exists
  if [[ -d "$(dirname "$LOG_FILE")" ]]; then
    echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
  fi
 }
 # Send alert email
 send_alert() {
  local subject="$1"
  local body="$2"
  if [[ "$TEST_MODE" == "true" ]]; then
    log "INFO" "TEST MODE: Would send alert: $subject"
    return 0
  fi
  if [[ -z "$ALERT_EMAIL" ]]; then
    log "WARN" "No alert email configured (ALERT_EMAIL not set)"
    return 0
  fi
  if command -v mail &> /dev/null; then
    echo "$body" | mail -s "$subject" "$ALERT_EMAIL"
    log "INFO" "Alert email sent to $ALERT_EMAIL"
  elif command -v sendmail &> /dev/null; then
    {
      echo "Subject: $subject"
      echo "From: tractatus-monitoring@agenticgovernance.digital"
      echo "To: $ALERT_EMAIL"
      echo ""
      echo "$body"
    } | sendmail "$ALERT_EMAIL"
    log "INFO" "Alert email sent via sendmail to $ALERT_EMAIL"
  else
    log "WARN" "No email command available"
  fi
 }
 # Extract errors from logs
 extract_errors() {
  local since="$1"
  # Get logs since specified time
  sudo journalctl -u "$SERVICE_NAME" --since "$since" --no-pager 2>/dev/null || {
    log "ERROR" "Failed to read journal for $SERVICE_NAME"
    return 1
  }
 }
 # Analyze log patterns
 analyze_logs() {
  local logs="$1"
  # Count different severity levels.
  # grep -c prints "0" itself when nothing matches but exits 1, so the failure
  # is swallowed with "|| true"; echoing another "0" would capture "0\n0" and
  # break the numeric comparisons below.
  local error_count=$(echo "$logs" | grep -ci "\[ERROR\]" || true)
  local critical_count=$(echo "$logs" | grep -ci "\[CRITICAL\]" || true)
  local warn_count=$(echo "$logs" | grep -ci "\[WARN\]" || true)
  # Security-related patterns
  local security_count=$(echo "$logs" | grep -ciE "(SECURITY|unauthorized|forbidden|authentication failed)" || true)
  # Database errors
  local db_error_count=$(echo "$logs" | grep -ciE "(mongodb|database|connection.*failed)" || true)
  # HTTP errors
  local http_error_count=$(echo "$logs" | grep -ciE "HTTP.*50[0-9]|Internal Server Error" || true)
  # Unhandled exceptions
  local exception_count=$(echo "$logs" | grep -ciE "(Unhandled.*exception|TypeError|ReferenceError)" || true)
  log "INFO" "Log analysis: CRITICAL=$critical_count ERROR=$error_count WARN=$warn_count SECURITY=$security_count DB_ERROR=$db_error_count HTTP_ERROR=$http_error_count EXCEPTION=$exception_count"
  # Determine severity
  if [[ "$critical_count" -ge "$CRITICAL_THRESHOLD" ]]; then
    log "CRITICAL" "Critical error threshold exceeded: $critical_count critical errors"
    return 2
  fi
  if [[ "$error_count" -ge "$ERROR_THRESHOLD" ]]; then
    log "ERROR" "Error threshold exceeded: $error_count errors"
    return 1
  fi
  if [[ "$security_count" -gt 0 ]]; then
    log "WARN" "Security events detected: $security_count events"
  fi
  if [[ "$db_error_count" -gt 5 ]]; then
    log "WARN" "Database errors detected: $db_error_count errors"
  fi
  if [[ "$exception_count" -gt 0 ]]; then
    log "WARN" "Unhandled exceptions detected: $exception_count exceptions"
  fi
  return 0
 }
 # Extract top error messages
 get_top_errors() {
  local logs="$1"
  local limit="${2:-10}"
  echo "$logs" | grep -iE "\[ERROR\]|\[CRITICAL\]" | \
    sed 's/^.*\] //' | \
    sort | uniq -c | sort -rn | head -n "$limit"
 }
 # Main monitoring function
 main() {
  log "INFO" "Starting log monitoring (since: $SINCE)"
  # Extract logs
  local logs
  logs=$(extract_errors "$SINCE") || {
    log "ERROR" "Failed to extract logs"
    exit 3
  }
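  # Count total log entries (echo on an empty capture still yields one line,
  # so the empty case is checked explicitly)
  local log_count=0
  if [[ -n "$logs" ]]; then
    log_count=$(echo "$logs" | wc -l)
  fi
  log "INFO" "Analyzing $log_count log entries"
  if [[ "$log_count" -eq 0 ]]; then
    log "INFO" "No logs found in time window"
    exit 0
  fi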
  # Analyze logs
  local exit_code=0
  analyze_logs "$logs" || exit_code=$?
  # If errors detected, send alert
  if [[ "$exit_code" -ne 0 ]]; then
    local severity="ERROR"
    [[ "$exit_code" -eq 2 ]] && severity="CRITICAL"
    local subject="[ALERT] Tractatus Log Monitoring - $severity Detected"
    # Extract top 10 error messages
    local top_errors=$(get_top_errors "$logs" 10)
    local body="Log monitoring detected $severity level issues in Tractatus service.
 Time Window: $SINCE
 Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
 Host: $(hostname)
 Service: $SERVICE_NAME
 Top Error Messages:
 $top_errors
 Recent Critical/Error Logs:
 $(echo "$logs" | grep -iE "\[ERROR\]|\[CRITICAL\]" | tail -n 20)
 Full logs:
 sudo journalctl -u $SERVICE_NAME --since \"$SINCE\"
 Check service status:
 sudo systemctl status $SERVICE_NAME
 "
    send_alert "$subject" "$body"
  else
    log "INFO" "No significant issues detected"
  fi
  exit $exit_code
 }
 # Follow mode (continuous monitoring)
 follow_logs() {
  log "INFO" "Starting continuous log monitoring"
  sudo journalctl -u "$SERVICE_NAME" -f --no-pager | while read -r line; do
    # Check for error patterns
    if echo "$line" | grep -qiE "\[ERROR\]|\[CRITICAL\]"; then
      log "ERROR" "$line"
      # Extract error message
      local error_msg=$(echo "$line" | sed 's/^.*\] //')
      # Check for critical patterns
      if echo "$line" | grep -qiE "\[CRITICAL\]|Unhandled.*exception|Database.*failed|Service.*crashed"; then
        local subject="[CRITICAL] Tractatus Error Detected"
        local body="Critical error detected in Tractatus logs:
 $line
 Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
 Host: $(hostname)
 Recent logs:
 $(sudo journalctl -u $SERVICE_NAME -n 10 --no-pager)
 "
        send_alert "$subject" "$body"
      fi
    fi
  done
 }
 # Run appropriate mode
 if [[ "$FOLLOW" == "true" ]]; then
  follow_logs
 else
  main
 fi
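
The severity counters in `analyze_logs` depend on how `grep -c` behaves under `set -euo pipefail`. The sketch below shows one way to count matching lines so that a zero-match result stays a single integer. `count_matches` and `sample_logs` are hypothetical names used only for illustration; they are not part of log-monitor.sh.

```shell
#!/bin/bash
set -euo pipefail

# count_matches PATTERN TEXT (hypothetical helper)
# Prints the number of lines in TEXT matching PATTERN (case-insensitive ERE).
# grep -c already prints "0" when nothing matches but exits with status 1,
# so the failure is swallowed with "|| true" instead of echoing a second "0".
count_matches() {
  local pattern="$1"
  local text="$2"
  grep -ciE -- "$pattern" <<< "$text" || true
}

# Made-up sample lines in the "[timestamp] [LEVEL] message" shape
sample_logs='[2025-10-09 22:00:01] [ERROR] Database connection failed
[2025-10-09 22:00:02] [INFO] Request served
[2025-10-09 22:00:03] [CRITICAL] Service crashed'

errors=$(count_matches '\[ERROR\]' "$sample_logs")
warnings=$(count_matches '\[WARN\]' "$sample_logs")
echo "errors=$errors warnings=$warnings"
```

Because each count is a single integer, comparisons such as `[[ "$errors" -ge "$ERROR_THRESHOLD" ]]` evaluate without an arithmetic syntax error.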
--- a/scripts/monitoring/monitor-all.sh
+++ b/scripts/monitoring/monitor-all.sh
@ -0,0 +1,178 @@
 #!/bin/bash
 #
 # Master Monitoring Script
 # Orchestrates all monitoring checks for Tractatus production environment
 #
 # Usage:
 #   ./monitor-all.sh              # Run all monitors
 #   ./monitor-all.sh --test       # Test mode (no alerts)
 #   ./monitor-all.sh --skip-ssl   # Skip SSL check
 #
 # Exit codes:
 #   0 = All checks passed
 #   1 = Some warnings
 #   2 = Some critical issues
 #   3 = Configuration error
 set -euo pipefail
 # Configuration
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 LOG_FILE="/var/log/tractatus/monitoring.log"
 ALERT_EMAIL="${ALERT_EMAIL:-}"
 # Parse arguments
 TEST_MODE=false
 SKIP_SSL=false
 while [[ $# -gt 0 ]]; do
  case $1 in
    --test)
      TEST_MODE=true
      shift
      ;;
    --skip-ssl)
      SKIP_SSL=true
      shift
      ;;
    *)
      echo "Unknown option: $1"
      exit 3
      ;;
  esac
 done
 # Export configuration for child scripts
 export ALERT_EMAIL
 [[ "$TEST_MODE" == "true" ]] && TEST_FLAG="--test" || TEST_FLAG=""
 # Logging function
 log() {
  local level="$1"
  shift
  local message="$*"
  local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
  echo "[$timestamp] [$level] $message"
  if [[ -d "$(dirname "$LOG_FILE")" ]]; then
    echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
  fi
 }
 # Run monitoring check
 run_check() {
  local name="$1"
  local script="$2"
  shift 2
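  log "INFO" "Running $name..."
  local exit_code=0
  # Forward the remaining arguments with "$@" so values containing spaces
  # (e.g. --since "5 minutes ago") reach the child script as one argument
  "$SCRIPT_DIR/$script" "$@" $TEST_FLAG || exit_code=$?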
  case $exit_code in
    0)
      log "INFO" "$name: OK ✓"
      ;;
    1)
      log "WARN" "$name: Warning"
      ;;
    2)
      log "CRITICAL" "$name: Critical"
      ;;
    *)
      log "ERROR" "$name: Error (exit code: $exit_code)"
      ;;
  esac
  return $exit_code
 }
 # Main monitoring function
 main() {
  log "INFO" "=== Starting Tractatus Monitoring Suite ==="
  log "INFO" "Timestamp: $(date '+%Y-%m-%d %H:%M:%S %Z')"
  log "INFO" "Host: $(hostname)"
  [[ "$TEST_MODE" == "true" ]] && log "INFO" "TEST MODE: Alerts suppressed"
  local max_severity=0
  local checks_run=0
  local checks_passed=0
  local checks_warned=0
  local checks_critical=0
  local checks_failed=0
  # Counters use var=$((var + 1)) rather than ((var++)): under set -e,
  # ((var++)) returns status 1 when var is 0 and would abort the script.
  # Health Check
  if run_check "Health Check" "health-check.sh"; then
    checks_passed=$((checks_passed + 1))
  else
    local exit_code=$?
    [[ $exit_code -eq 1 ]] && checks_warned=$((checks_warned + 1))
    [[ $exit_code -eq 2 ]] && checks_critical=$((checks_critical + 1))
    [[ $exit_code -ge 3 ]] && checks_failed=$((checks_failed + 1))
    [[ $exit_code -gt $max_severity ]] && max_severity=$exit_code
  fi
  checks_run=$((checks_run + 1))
  # Log Monitor
  if run_check "Log Monitor" "log-monitor.sh" --since "5 minutes ago"; then
    checks_passed=$((checks_passed + 1))
  else
    local exit_code=$?
    [[ $exit_code -eq 1 ]] && checks_warned=$((checks_warned + 1))
    [[ $exit_code -eq 2 ]] && checks_critical=$((checks_critical + 1))
    [[ $exit_code -ge 3 ]] && checks_failed=$((checks_failed + 1))
    [[ $exit_code -gt $max_severity ]] && max_severity=$exit_code
  fi
  checks_run=$((checks_run + 1))
  # Disk Monitor
  if run_check "Disk Monitor" "disk-monitor.sh"; then
    checks_passed=$((checks_passed + 1))
  else
    local exit_code=$?
    [[ $exit_code -eq 1 ]] && checks_warned=$((checks_warned + 1))
    [[ $exit_code -eq 2 ]] && checks_critical=$((checks_critical + 1))
    [[ $exit_code -ge 3 ]] && checks_failed=$((checks_failed + 1))
    [[ $exit_code -gt $max_severity ]] && max_severity=$exit_code
  fi
  checks_run=$((checks_run + 1))
  # SSL Monitor (optional)
  if [[ "$SKIP_SSL" != "true" ]]; then
    if run_check "SSL Monitor" "ssl-monitor.sh"; then
      checks_passed=$((checks_passed + 1))
    else
      local exit_code=$?
      [[ $exit_code -eq 1 ]] && checks_warned=$((checks_warned + 1))
      [[ $exit_code -eq 2 ]] && checks_critical=$((checks_critical + 1))
      [[ $exit_code -ge 3 ]] && checks_failed=$((checks_failed + 1))
      [[ $exit_code -gt $max_severity ]] && max_severity=$exit_code
    fi
    checks_run=$((checks_run + 1))
  fi
  # Summary
  log "INFO" "=== Monitoring Summary ==="
  log "INFO" "Checks run: $checks_run"
  log "INFO" "Passed: $checks_passed | Warned: $checks_warned | Critical: $checks_critical | Failed: $checks_failed"
  if [[ $max_severity -eq 0 ]]; then
    log "INFO" "All monitoring checks passed ✓"
  elif [[ $max_severity -eq 1 ]]; then
    log "WARN" "Some checks returned warnings"
  elif [[ $max_severity -eq 2 ]]; then
    log "CRITICAL" "Some checks returned critical alerts"
  else
    log "ERROR" "Some checks failed"
  fi
  log "INFO" "=== Monitoring Complete ==="
  exit $max_severity
 }
 # Run main function
 main
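
Two shell behaviors matter for an orchestrator like this one: how arguments with spaces are forwarded to child scripts, and how counters are incremented under `set -e`. The sketch below illustrates both under stated assumptions. `show_args`, `forward_quoted`, and `forward_flattened` are hypothetical stand-ins, not functions from monitor-all.sh.

```shell
#!/bin/bash
set -euo pipefail

# show_args stands in for a child monitoring script (hypothetical):
# it prints each argument it receives on its own line, in brackets.
show_args() {
  local arg
  for arg in "$@"; do
    printf '[%s]\n' "$arg"
  done
}

# Forwarding with "$@" keeps "5 minutes ago" as a single argument.
forward_quoted() {
  show_args "$@"
}

# Flattening into one string and expanding it unquoted splits on spaces.
forward_flattened() {
  local args="$*"
  # shellcheck disable=SC2086  # word splitting is the point of this demo
  show_args $args
}

quoted_count=$(forward_quoted --since "5 minutes ago" | wc -l)
flattened_count=$(forward_flattened --since "5 minutes ago" | wc -l)
echo "quoted=$((quoted_count)) flattened=$((flattened_count))"

# Under set -e, ((n++)) returns status 1 when n is 0 and would abort the
# script; a plain arithmetic assignment always succeeds.
checks_run=0
checks_run=$((checks_run + 1))
echo "checks_run=$checks_run"
```

The quoted form passes two arguments and the flattened form passes four, which a child script would likely reject as an unknown option.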
--- a/scripts/monitoring/ssl-monitor.sh
+++ b/scripts/monitoring/ssl-monitor.sh
@ -0,0 +1,319 @@
 #!/bin/bash
 #
 # SSL Certificate Monitoring Script
 # Monitors SSL certificate expiry and alerts before expiration
 #
 # Usage:
 #   ./ssl-monitor.sh                       # Check all domains
 #   ./ssl-monitor.sh --domain example.com  # Check specific domain
 #   ./ssl-monitor.sh --test                # Test mode (no alerts)
 #
 # Exit codes:
 #   0 = OK
 #   1 = Warning (expires soon)
 #   2 = Critical (expires very soon)
 #   3 = Expired or error
 set -euo pipefail
 # Configuration
 ALERT_EMAIL="${ALERT_EMAIL:-}"
 LOG_FILE="/var/log/tractatus/ssl-monitor.log"
 WARN_DAYS=30       # Warn 30 days before expiry
 CRITICAL_DAYS=7    # Critical alert 7 days before expiry
 # Default domains to monitor
 DOMAINS=(
  "agenticgovernance.digital"
 )
 # Parse arguments
 TEST_MODE=false
 SPECIFIC_DOMAIN=""
 while [[ $# -gt 0 ]]; do
  case $1 in
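    --domain)
      # Without a value, "$2" is unbound and set -u would abort with status 1
      if [[ $# -lt 2 ]]; then
        echo "Option --domain requires a value"
        exit 3
      fi
      SPECIFIC_DOMAIN="$2"
      shift 2
      ;;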
    --test)
      TEST_MODE=true
      shift
      ;;
    *)
      echo "Unknown option: $1"
      exit 3
      ;;
  esac
 done
 # Override domains if specific domain provided
 if [[ -n "$SPECIFIC_DOMAIN" ]]; then
  DOMAINS=("$SPECIFIC_DOMAIN")
 fi
 # Logging function
 log() {
  local level="$1"
  shift
  local message="$*"
  local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
  echo "[$timestamp] [$level] $message"
  if [[ -d "$(dirname "$LOG_FILE")" ]]; then
    echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
  fi
 }
 # Send alert email
 send_alert() {
  local subject="$1"
  local body="$2"
  if [[ "$TEST_MODE" == "true" ]]; then
    log "INFO" "TEST MODE: Would send alert: $subject"
    return 0
  fi
  if [[ -z "$ALERT_EMAIL" ]]; then
    log "WARN" "No alert email configured (ALERT_EMAIL not set)"
    return 0
  fi
  if command -v mail &> /dev/null; then
    echo "$body" | mail -s "$subject" "$ALERT_EMAIL"
    log "INFO" "Alert email sent to $ALERT_EMAIL"
  elif command -v sendmail &> /dev/null; then
    {
      echo "Subject: $subject"
      echo "From: tractatus-monitoring@agenticgovernance.digital"
      echo "To: $ALERT_EMAIL"
      echo ""
      echo "$body"
    } | sendmail "$ALERT_EMAIL"
    log "INFO" "Alert email sent via sendmail to $ALERT_EMAIL"
  else
    log "WARN" "No email command available"
  fi
 }
 # Get SSL certificate expiry date
 get_cert_expiry() {
  local domain="$1"
  # Use openssl to get certificate
  local expiry_date
  expiry_date=$(echo | openssl s_client -servername "$domain" -connect "$domain:443" 2>/dev/null | \
    openssl x509 -noout -enddate 2>/dev/null | \
    cut -d= -f2) || {
    log "ERROR" "Failed to retrieve certificate for $domain"
    return 1
  }
  echo "$expiry_date"
 }
 # Get days until expiry
 get_days_until_expiry() {
  local expiry_date="$1"
  # Convert expiry date to seconds since epoch
  local expiry_epoch
  expiry_epoch=$(date -d "$expiry_date" +%s 2>/dev/null) || {
    log "ERROR" "Failed to parse expiry date: $expiry_date"
    return 1
  }
  # Get current time in seconds since epoch
  local now_epoch=$(date +%s)
  # Calculate days until expiry
  local seconds_until_expiry=$((expiry_epoch - now_epoch))
  local days_until_expiry=$((seconds_until_expiry / 86400))
  echo "$days_until_expiry"
 }
 # Get certificate details
 get_cert_details() {
  local domain="$1"
  echo | openssl s_client -servername "$domain" -connect "$domain:443" 2>/dev/null | \
    openssl x509 -noout -subject -issuer -dates 2>/dev/null || {
    echo "Failed to retrieve certificate details"
    return 1
  }
 }
 # Check single domain
 check_domain() {
  local domain="$1"
  log "INFO" "Checking SSL certificate for $domain"
  # Get expiry date
  local expiry_date
  expiry_date=$(get_cert_expiry "$domain") || {
    log "ERROR" "Failed to check certificate for $domain"
    return 3
  }
  # Calculate days until expiry
  local days_until_expiry
  days_until_expiry=$(get_days_until_expiry "$expiry_date") || {
    log "ERROR" "Failed to calculate expiry for $domain"
    return 3
  }
  # Check if expired
  if [[ "$days_until_expiry" -lt 0 ]]; then
    log "CRITICAL" "$domain: Certificate EXPIRED ${days_until_expiry#-} days ago!"
    return 3
  fi
  # Check thresholds
  if [[ "$days_until_expiry" -le "$CRITICAL_DAYS" ]]; then
    log "CRITICAL" "$domain: Certificate expires in $days_until_expiry days (expires: $expiry_date)"
    return 2
  elif [[ "$days_until_expiry" -le "$WARN_DAYS" ]]; then
    log "WARN" "$domain: Certificate expires in $days_until_expiry days (expires: $expiry_date)"
    return 1
  else
    log "INFO" "$domain: Certificate valid for $days_until_expiry days (expires: $expiry_date)"
    return 0
  fi
 }
 # Main monitoring function
 main() {
  log "INFO" "Starting SSL certificate monitoring"
  local max_severity=0
  local expired_domains=()
  local critical_domains=()
  local warning_domains=()
  # Check all domains
  for domain in "${DOMAINS[@]}"; do
    local exit_code=0
    local expiry_date=$(get_cert_expiry "$domain" 2>/dev/null || echo "Unknown")
    local days_until_expiry=$(get_days_until_expiry "$expiry_date" 2>/dev/null || echo "Unknown")
    check_domain "$domain" || exit_code=$?
    if [[ "$exit_code" -eq 3 ]]; then
      max_severity=3
      expired_domains+=("$domain (EXPIRED or ERROR)")
    elif [[ "$exit_code" -eq 2 ]]; then
      [[ "$max_severity" -lt 2 ]] && max_severity=2
      critical_domains+=("$domain (expires in $days_until_expiry days)")
    elif [[ "$exit_code" -eq 1 ]]; then
      [[ "$max_severity" -lt 1 ]] && max_severity=1
      warning_domains+=("$domain (expires in $days_until_expiry days)")
    fi
  done
  # Send alerts based on severity
  if [[ "$max_severity" -eq 3 ]]; then
    local subject="[CRITICAL] SSL Certificate Expired or Error"
    local body="CRITICAL: SSL certificate has expired or error occurred.
 Expired/Error Domains:
 $(printf -- "- %s\n" "${expired_domains[@]}")
 "
    # Add other alerts if any
    if [[ "${#critical_domains[@]}" -gt 0 ]]; then
      body+="
 Critical Domains (<= $CRITICAL_DAYS days):
 $(printf -- "- %s\n" "${critical_domains[@]}")
 "
    fi
    if [[ "${#warning_domains[@]}" -gt 0 ]]; then
      body+="
 Warning Domains (<= $WARN_DAYS days):
 $(printf -- "- %s\n" "${warning_domains[@]}")
 "
    fi
    body+="
 Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
 Host: $(hostname)
 Action Required:
 1. Renew SSL certificate immediately
 2. Check Let's Encrypt auto-renewal:
   sudo certbot renew --dry-run
 Certificate details:
 $(get_cert_details "${DOMAINS[0]}" || true)
 Renewal commands:
 # Test renewal
 sudo certbot renew --dry-run
 # Force renewal
 sudo certbot renew --force-renewal
 # Check certificate status
 sudo certbot certificates
 "
    send_alert "$subject" "$body"
    log "CRITICAL" "SSL certificate alert sent"
  elif [[ "$max_severity" -eq 2 ]]; then
    local subject="[CRITICAL] SSL Certificate Expires Soon"
    local body="CRITICAL: SSL certificate expires in $CRITICAL_DAYS days or less.
 Critical Domains (<= $CRITICAL_DAYS days):
 $(printf -- "- %s\n" "${critical_domains[@]}")
 "
    if [[ "${#warning_domains[@]}" -gt 0 ]]; then
      body+="
 Warning Domains (<= $WARN_DAYS days):
 $(printf -- "- %s\n" "${warning_domains[@]}")
 "
    fi
    body+="
 Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
 Host: $(hostname)
 Please renew certificates soon.
 Check renewal:
 sudo certbot renew --dry-run
 "
    send_alert "$subject" "$body"
    log "CRITICAL" "SSL expiry alert sent"
  elif [[ "$max_severity" -eq 1 ]]; then
    local subject="[WARN] SSL Certificate Expires Soon"
    local body="WARNING: SSL certificate expires in $WARN_DAYS days or less.
 Warning Domains (<= $WARN_DAYS days):
 $(printf -- "- %s\n" "${warning_domains[@]}")
 Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
 Host: $(hostname)
 Please plan certificate renewal.
 "
    send_alert "$subject" "$body"
    log "WARN" "SSL expiry warning sent"
  else
    log "INFO" "All SSL certificates valid"
  fi
  exit $max_severity
 }
 # Run main function
 main
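
The threshold logic in `check_domain` can be exercised without a network connection. The sketch below works from epoch seconds and mirrors the documented exit codes (0 OK, 1 warning, 2 critical, 3 expired). `days_between` and `classify_days` are hypothetical helpers, and the `now` value is an arbitrary fixed timestamp chosen so the output is reproducible.

```shell
#!/bin/bash
set -euo pipefail

WARN_DAYS=30       # same defaults as ssl-monitor.sh
CRITICAL_DAYS=7

# days_between NOW_EPOCH EXPIRY_EPOCH (hypothetical helper)
# Whole days from now until expiry, truncated toward zero.
days_between() {
  local now_epoch="$1"
  local expiry_epoch="$2"
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# classify_days DAYS (hypothetical helper)
# Prints a severity code: 0 = OK, 1 = warning, 2 = critical, 3 = expired.
classify_days() {
  local days="$1"
  if (( days < 0 )); then
    echo 3
  elif (( days <= CRITICAL_DAYS )); then
    echo 2
  elif (( days <= WARN_DAYS )); then
    echo 1
  else
    echo 0
  fi
}

now=1760000000   # arbitrary fixed "current time" for the demo
for offset_days in 90 30 7 -2; do
  expiry=$(( now + offset_days * 86400 ))
  days=$(days_between "$now" "$expiry")
  echo "$offset_days days out -> severity $(classify_days "$days")"
done
```

Because the division truncates toward zero, a certificate that expired less than 24 hours ago yields 0 days and is classified as critical rather than expired. Comparing the raw seconds instead would be one way to close that gap.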