From c755c49ec1b3feafd14b22cea3603803f37fd256 Mon Sep 17 00:00:00 2001 From: TheFlow Date: Thu, 9 Oct 2025 22:23:40 +1300 Subject: [PATCH] ops: implement comprehensive production monitoring system MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Create self-hosted, privacy-first monitoring infrastructure for production environment with automated health checks, log analysis, and alerting. Monitoring Components: - health-check.sh: Application health, service status, DB connectivity, disk space - log-monitor.sh: Error detection, security events, anomaly detection - disk-monitor.sh: Disk space usage monitoring (5 paths) - ssl-monitor.sh: SSL certificate expiry monitoring - monitor-all.sh: Master orchestration script Features: - Email alerting system (configurable thresholds) - Consecutive failure tracking (prevents false positives) - Test mode for safe deployment testing - Comprehensive logging to /var/log/tractatus/ - Cron-ready for automated execution - Exit codes for monitoring tool integration Alert Triggers: - Health: 3 consecutive failures (15min downtime) - Logs: 10 errors OR 3 critical errors in 5min - Disk: 80% warning, 90% critical - SSL: 30 days warning, 7 days critical Setup Documentation: - Complete installation instructions - Cron configuration examples - Systemd timer alternative - Troubleshooting guide - Alert customization guide - Incident response procedures Privacy-First Design: - Self-hosted (no external monitoring services) - Minimal data exposure in alerts - Local log storage only - No telemetry to third parties Aligns with Tractatus values: transparency, privacy, operational excellence Addresses Phase 4 Prep Checklist Task #6: Production Monitoring & Alerting Next: Deploy to production, configure email alerts, set up cron jobs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- docs/PRODUCTION_MONITORING_SETUP.md | 648 ++++++++++++++++++++++++++++ 
scripts/monitoring/disk-monitor.sh | 257 +++++++++++ scripts/monitoring/health-check.sh | 269 ++++++++++++ scripts/monitoring/log-monitor.sh | 269 ++++++++++++ scripts/monitoring/monitor-all.sh | 178 ++++++++ scripts/monitoring/ssl-monitor.sh | 319 ++++++++++++++ 6 files changed, 1940 insertions(+) create mode 100644 docs/PRODUCTION_MONITORING_SETUP.md create mode 100755 scripts/monitoring/disk-monitor.sh create mode 100755 scripts/monitoring/health-check.sh create mode 100755 scripts/monitoring/log-monitor.sh create mode 100755 scripts/monitoring/monitor-all.sh create mode 100755 scripts/monitoring/ssl-monitor.sh diff --git a/docs/PRODUCTION_MONITORING_SETUP.md b/docs/PRODUCTION_MONITORING_SETUP.md new file mode 100644 index 00000000..613ba0b1 --- /dev/null +++ b/docs/PRODUCTION_MONITORING_SETUP.md @@ -0,0 +1,648 @@ +# Production Monitoring Setup + +**Project**: Tractatus AI Safety Framework Website +**Environment**: Production (vps-93a693da.vps.ovh.net) +**Created**: 2025-10-09 +**Status**: Ready for Deployment + +--- + +## Overview + +Comprehensive monitoring system for Tractatus production environment, providing: + +- **Health monitoring** - Application uptime, service status, database connectivity +- **Log monitoring** - Error detection, security events, anomaly detection +- **Disk monitoring** - Disk space usage alerts +- **SSL monitoring** - Certificate expiry warnings +- **Email alerts** - Automated notifications for critical issues + +**Philosophy**: Privacy-first, self-hosted monitoring aligned with Tractatus values. + +--- + +## Monitoring Components + +### 1. 
Health Check Monitor (`health-check.sh`) + +**What it monitors:** +- Application health endpoint (https://agenticgovernance.digital/health) +- Systemd service status (tractatus.service) +- MongoDB database connectivity +- Disk space usage + +**Alert Triggers:** +- Service not running +- Health endpoint returns non-200 +- Database connection failed +- Disk space > 90% + +**Frequency**: Every 5 minutes + +### 2. Log Monitor (`log-monitor.sh`) + +**What it monitors:** +- ERROR and CRITICAL log entries +- Security events (authentication failures, unauthorized access) +- Database errors +- HTTP 500 errors +- Unhandled exceptions + +**Alert Triggers:** +- 10+ errors in 5-minute window +- 3+ critical errors in 5-minute window +- Any security events + +**Frequency**: Every 5 minutes + +**Follow Mode**: Can run continuously for real-time monitoring + +### 3. Disk Space Monitor (`disk-monitor.sh`) + +**What it monitors:** +- Root filesystem (/) +- Var directory (/var) +- Log directory (/var/log) +- Tractatus application (/var/www/tractatus) +- Temp directory (/tmp) + +**Alert Triggers:** +- Warning: 80%+ usage +- Critical: 90%+ usage + +**Frequency**: Every 15 minutes + +### 4. SSL Certificate Monitor (`ssl-monitor.sh`) + +**What it monitors:** +- SSL certificate expiry for agenticgovernance.digital + +**Alert Triggers:** +- Warning: Expires in 30 days or less +- Critical: Expires in 7 days or less +- Critical: Already expired + +**Frequency**: Daily + +### 5. Master Monitor (`monitor-all.sh`) + +Orchestrates all monitoring checks in a single run. 
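
The orchestration pattern the master script is described as using — run each check, remember the worst exit code, and exit with it so cron or a systemd unit sees the overall severity — can be sketched as follows. This is an illustrative reconstruction, not the `monitor-all.sh` shipped in this patch; stub checks stand in for the real scripts.

```shell
#!/bin/bash
# Sketch of the master-monitor orchestration pattern (NOT the actual
# monitor-all.sh from this patch). Stub checks stand in for the real
# scripts; the loop runs each one and propagates the worst exit code.
set -uo pipefail

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Stub checks: health passes (exit 0), disk reports a warning (exit 1).
printf '#!/bin/bash\nexit 0\n' > "$workdir/health-check.sh"
printf '#!/bin/bash\nexit 1\n' > "$workdir/disk-monitor.sh"
chmod +x "$workdir"/*.sh

worst=0
for check in health-check.sh disk-monitor.sh; do
    rc=0
    "$workdir/$check" || rc=$?
    echo "$check exited $rc"
    if (( rc > worst )); then worst=$rc; fi
done

echo "overall severity: $worst"
```

In the real script the loop would invoke `health-check.sh`, `log-monitor.sh`, and `disk-monitor.sh` (plus `ssl-monitor.sh` unless `--skip-ssl` is given) from `/var/www/tractatus/scripts/monitoring/`.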
+ +--- + +## Installation + +### Prerequisites + +```bash +# Ensure required commands are available +sudo apt-get update +sudo apt-get install -y curl jq openssl mailutils + +# Install MongoDB shell (if not installed) +wget -qO - https://www.mongodb.org/static/pgp/server-7.0.asc | sudo apt-key add - +echo "deb [ arch=amd64,arm64 ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list +sudo apt-get update +sudo apt-get install -y mongodb-mongosh +``` + +### Deploy Monitoring Scripts + +```bash +# From local machine, deploy monitoring scripts to production +rsync -avz -e "ssh -i ~/.ssh/tractatus_deploy" \ + scripts/monitoring/ \ + ubuntu@vps-93a693da.vps.ovh.net:/var/www/tractatus/scripts/monitoring/ +``` + +### Set Up Log Directory + +```bash +# On production server +ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net + +# Create log directory +sudo mkdir -p /var/log/tractatus +sudo chown ubuntu:ubuntu /var/log/tractatus +sudo chmod 755 /var/log/tractatus +``` + +### Make Scripts Executable + +```bash +# On production server +cd /var/www/tractatus/scripts/monitoring +chmod +x *.sh +``` + +### Configure Email Alerts + +**Option 1: Using Postfix (Recommended for production)** + +```bash +# Install Postfix +sudo apt-get install -y postfix + +# Configure Postfix (select "Internet Site") +sudo dpkg-reconfigure postfix + +# Set ALERT_EMAIL environment variable +echo 'export ALERT_EMAIL="your-email@example.com"' | sudo tee -a /etc/environment +source /etc/environment +``` + +**Option 2: Using External SMTP (ProtonMail, Gmail, etc.)** + +```bash +# Install sendemail +sudo apt-get install -y sendemail libio-socket-ssl-perl libnet-ssleay-perl + +# Configure in monitoring scripts (or use system mail) +``` + +**Option 3: No Email (Testing)** + +```bash +# Leave ALERT_EMAIL unset - monitoring will log but not send emails +# Useful for initial testing +``` + +### Test Monitoring Scripts + +```bash 
+# Test health check +cd /var/www/tractatus/scripts/monitoring +./health-check.sh --test + +# Test log monitor +./log-monitor.sh --since "10 minutes ago" --test + +# Test disk monitor +./disk-monitor.sh --test + +# Test SSL monitor +./ssl-monitor.sh --test + +# Test master monitor +./monitor-all.sh --test +``` + +Expected output: Each script should run without errors and show `[INFO]` messages. + +--- + +## Cron Configuration + +### Create Monitoring Cron Jobs + +```bash +# On production server +crontab -e +``` + +Add the following cron jobs: + +```cron +# Tractatus Production Monitoring +# Logs: /var/log/tractatus/monitoring.log + +# Master monitoring (every 5 minutes) +# Runs: health check, log monitor, disk monitor +*/5 * * * * /var/www/tractatus/scripts/monitoring/monitor-all.sh --skip-ssl >> /var/log/tractatus/cron-monitor.log 2>&1 + +# SSL certificate check (daily at 3am) +0 3 * * * /var/www/tractatus/scripts/monitoring/ssl-monitor.sh >> /var/log/tractatus/cron-ssl.log 2>&1 + +# Disk monitor (every 15 minutes - separate from master for frequency control) +*/15 * * * * /var/www/tractatus/scripts/monitoring/disk-monitor.sh >> /var/log/tractatus/cron-disk.log 2>&1 +``` + +### Verify Cron Jobs + +```bash +# List active cron jobs +crontab -l + +# Check cron logs +sudo journalctl -u cron -f + +# Wait 5 minutes, then check monitoring logs +tail -f /var/log/tractatus/cron-monitor.log +``` + +### Alternative: Systemd Timers (Optional) + +More modern alternative to cron, provides better logging and failure handling. 
+ +**Create timer file**: `/etc/systemd/system/tractatus-monitoring.timer` + +```ini +[Unit] +Description=Tractatus Monitoring Timer +Requires=tractatus-monitoring.service + +[Timer] +OnBootSec=5min +OnUnitActiveSec=5min +AccuracySec=1s + +[Install] +WantedBy=timers.target +``` + +**Create service file**: `/etc/systemd/system/tractatus-monitoring.service` + +```ini +[Unit] +Description=Tractatus Production Monitoring +After=network.target tractatus.service + +[Service] +Type=oneshot +User=ubuntu +WorkingDirectory=/var/www/tractatus +ExecStart=/var/www/tractatus/scripts/monitoring/monitor-all.sh --skip-ssl +StandardOutput=journal +StandardError=journal +Environment="ALERT_EMAIL=your-email@example.com" + +[Install] +WantedBy=multi-user.target +``` + +**Enable and start:** + +```bash +sudo systemctl daemon-reload +sudo systemctl enable tractatus-monitoring.timer +sudo systemctl start tractatus-monitoring.timer + +# Check status +sudo systemctl status tractatus-monitoring.timer +sudo systemctl list-timers +``` + +--- + +## Alert Configuration + +### Alert Thresholds + +**Health Check:** +- Consecutive failures: 3 (alerts on 3rd failure) +- Check interval: 5 minutes +- Time to alert: 15 minutes of downtime + +**Log Monitor:** +- Error threshold: 10 errors in 5 minutes +- Critical threshold: 3 critical errors in 5 minutes +- Security events: Immediate alert + +**Disk Space:** +- Warning: 80% usage +- Critical: 90% usage + +**SSL Certificate:** +- Warning: 30 days until expiry +- Critical: 7 days until expiry + +### Customize Alerts + +Edit thresholds in scripts: + +```bash +# Health check thresholds +vi /var/www/tractatus/scripts/monitoring/health-check.sh +# Change: MAX_FAILURES=3 + +# Log monitor thresholds +vi /var/www/tractatus/scripts/monitoring/log-monitor.sh +# Change: ERROR_THRESHOLD=10 +# Change: CRITICAL_THRESHOLD=3 + +# Disk monitor thresholds +vi /var/www/tractatus/scripts/monitoring/disk-monitor.sh +# Change: WARN_THRESHOLD=80 +# Change: 
CRITICAL_THRESHOLD=90 + +# SSL monitor thresholds +vi /var/www/tractatus/scripts/monitoring/ssl-monitor.sh +# Change: WARN_DAYS=30 +# Change: CRITICAL_DAYS=7 +``` + +--- + +## Manual Monitoring Commands + +### Check Current Status + +```bash +# Run all monitors manually +cd /var/www/tractatus/scripts/monitoring +./monitor-all.sh + +# Run individual monitors +./health-check.sh +./log-monitor.sh --since "1 hour" +./disk-monitor.sh +./ssl-monitor.sh +``` + +### View Monitoring Logs + +```bash +# View all monitoring logs +tail -f /var/log/tractatus/monitoring.log + +# View specific monitor logs +tail -f /var/log/tractatus/health-check.log +tail -f /var/log/tractatus/log-monitor.log +tail -f /var/log/tractatus/disk-monitor.log +tail -f /var/log/tractatus/ssl-monitor.log + +# View cron execution logs +tail -f /var/log/tractatus/cron-monitor.log +``` + +### Test Alert Delivery + +```bash +# Send test alert +cd /var/www/tractatus/scripts/monitoring + +# This should trigger an alert (if service is running) +# It will show "would send alert" in test mode +./health-check.sh --test + +# Force alert by temporarily stopping service +sudo systemctl stop tractatus +./health-check.sh # Should alert after 3 failures (15 minutes) +sudo systemctl start tractatus +``` + +--- + +## Troubleshooting + +### No Alerts Received + +**Check email configuration:** + +```bash +# Verify ALERT_EMAIL is set +echo $ALERT_EMAIL + +# Test mail command +echo "Test email" | mail -s "Test Subject" $ALERT_EMAIL + +# Check mail logs +sudo tail -f /var/log/mail.log +``` + +**Check cron execution:** + +```bash +# Verify cron jobs are running +crontab -l + +# Check cron logs +sudo journalctl -u cron -n 50 + +# Check script logs +tail -100 /var/log/tractatus/cron-monitor.log +``` + +### Scripts Not Executing + +**Check permissions:** + +```bash +ls -la /var/www/tractatus/scripts/monitoring/ +# Should show: -rwxr-xr-x (executable) + +# Fix if needed +chmod +x /var/www/tractatus/scripts/monitoring/*.sh +``` + 
+**Check cron PATH:** + +```bash +# Add to crontab +PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin + +# Or use full paths in cron commands +``` + +### High Alert Frequency + +**Increase thresholds:** + +Edit threshold values in scripts (see Alert Configuration section). + +**Increase consecutive failure count:** + +```bash +vi /var/www/tractatus/scripts/monitoring/health-check.sh +# Increase MAX_FAILURES from 3 to 5 or higher +``` + +### False Positives + +**Review alert conditions:** + +```bash +# Check recent logs to understand why alerts triggered +tail -100 /var/log/tractatus/monitoring.log + +# Run manual check with verbose output +./health-check.sh + +# Check if service is actually unhealthy +sudo systemctl status tractatus +curl https://agenticgovernance.digital/health +``` + +--- + +## Monitoring Dashboard (Optional - Future Enhancement) + +### Option 1: Grafana + Prometheus + +Self-hosted metrics dashboard (requires setup). + +### Option 2: Simple Web Dashboard + +Create minimal status page showing last check results. + +### Option 3: UptimeRobot Free Tier + +External monitoring service (privacy tradeoff). + +**Not implemented yet** - current solution uses email alerts only. 
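
---

## Log Rotation

The Monitoring Hygiene commands later in this document invoke `sudo logrotate /etc/logrotate.d/tractatus-monitoring`, but that config file is not created by this patch. A minimal sketch is below; the weekly cadence and 8-rotation retention are assumptions to adjust, while the ownership and mode match the Security Considerations section:

```
/var/log/tractatus/*.log {
    weekly
    rotate 8
    compress
    delaycompress
    missingok
    notifempty
    create 0640 ubuntu ubuntu
}
```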
+ +--- + +## Best Practices + +### DO: +- ✅ Test monitoring scripts before deploying +- ✅ Check alert emails regularly +- ✅ Review monitoring logs weekly +- ✅ Adjust thresholds based on actual patterns +- ✅ Document any monitoring configuration changes +- ✅ Keep monitoring scripts updated + +### DON'T: +- ❌ Ignore alert emails +- ❌ Set thresholds too low (alert fatigue) +- ❌ Deploy monitoring without testing +- ❌ Disable monitoring without planning +- ❌ Let log files grow unbounded +- ❌ Ignore repeated warnings + +### Monitoring Hygiene + +```bash +# Rotate monitoring logs weekly +sudo logrotate /etc/logrotate.d/tractatus-monitoring + +# Clean up old state files +find /var/tmp -name "tractatus-*-state" -mtime +7 -delete + +# Review alert frequency monthly +grep "\[ALERT\]" /var/log/tractatus/monitoring.log | wc -l +``` + +--- + +## Incident Response + +### When Alert Received + +1. **Acknowledge alert** - Note time received +2. **Check current status** - Run manual health check +3. **Review logs** - Check what triggered alert +4. **Investigate root cause** - See deployment checklist emergency procedures +5. **Take action** - Fix issue or escalate +6. 
**Document** - Create incident report + +### Critical Alert Response Time + +- **Health check failure**: Respond within 15 minutes +- **Log errors**: Respond within 30 minutes +- **Disk space critical**: Respond within 1 hour +- **SSL expiry (7 days)**: Respond within 24 hours + +--- + +## Maintenance + +### Weekly Tasks + +- [ ] Review monitoring logs for patterns +- [ ] Check alert email inbox +- [ ] Verify cron jobs still running +- [ ] Review disk space trends + +### Monthly Tasks + +- [ ] Review and adjust alert thresholds +- [ ] Clean up old monitoring logs +- [ ] Test manual failover procedures +- [ ] Update monitoring documentation + +### Quarterly Tasks + +- [ ] Full monitoring system audit +- [ ] Test all alert scenarios +- [ ] Review incident response times +- [ ] Consider monitoring enhancements + +--- + +## Monitoring Metrics + +### Success Metrics + +- **Uptime**: Target 99.9% (< 45 minutes downtime/month) +- **Alert Response Time**: < 30 minutes for critical +- **False Positive Rate**: < 5% of alerts +- **Detection Time**: < 5 minutes for critical issues + +### Tracking + +```bash +# Calculate uptime from logs +grep "Health endpoint OK" /var/log/tractatus/monitoring.log | wc -l + +# Count alerts sent +grep "Alert email sent" /var/log/tractatus/monitoring.log | wc -l + +# Review response times (manual from incident reports) +``` + +--- + +## Security Considerations + +### Log Access Control + +```bash +# Ensure logs are readable only by ubuntu user and root +sudo chown ubuntu:ubuntu /var/log/tractatus/*.log +sudo chmod 640 /var/log/tractatus/*.log +``` + +### Alert Email Security + +- Use encrypted email if possible (ProtonMail) +- Don't include sensitive data in alert body +- Alerts show symptoms, not credentials + +### Monitoring Script Security + +- Scripts run as ubuntu user (not root) +- No credentials embedded in scripts +- Use environment variables for sensitive config + +--- + +## Future Enhancements + +### Planned Improvements + +- [ ] 
**Metrics collection**: Store monitoring metrics in database for trend analysis +- [ ] **Status page**: Public status page showing service availability +- [ ] **Mobile alerts**: SMS or push notifications for critical alerts +- [ ] **Distributed monitoring**: Multiple monitoring locations for redundancy +- [ ] **Automated remediation**: Auto-restart service on failure +- [ ] **Performance monitoring**: Response time tracking, query performance +- [ ] **User impact monitoring**: Track error rates from user perspective + +### Integration Opportunities + +- [ ] **Plausible Analytics**: Monitor traffic patterns, correlate with errors +- [ ] **GitHub Actions**: Run monitoring checks in CI/CD +- [ ] **Slack integration**: Send alerts to Slack channel +- [ ] **Database backup monitoring**: Alert on backup failures + +--- + +## Support & Documentation + +**Monitoring Scripts**: `/var/www/tractatus/scripts/monitoring/` +**Monitoring Logs**: `/var/log/tractatus/` +**Cron Configuration**: `crontab -l` (ubuntu user) +**Alert Email**: Set via `ALERT_EMAIL` environment variable + +**Related Documents:** +- [Production Deployment Checklist](PRODUCTION_DEPLOYMENT_CHECKLIST.md) +- [Phase 4 Preparation Checklist](../PHASE-4-PREPARATION-CHECKLIST.md) + +--- + +**Document Status**: Ready for Production +**Last Updated**: 2025-10-09 +**Next Review**: After 1 month of monitoring data +**Maintainer**: Technical Lead (Claude Code + John Stroh) diff --git a/scripts/monitoring/disk-monitor.sh b/scripts/monitoring/disk-monitor.sh new file mode 100755 index 00000000..181ff981 --- /dev/null +++ b/scripts/monitoring/disk-monitor.sh @@ -0,0 +1,257 @@ +#!/bin/bash +# +# Disk Space Monitoring Script +# Monitors disk space usage and alerts when thresholds exceeded +# +# Usage: +# ./disk-monitor.sh # Check all monitored paths +# ./disk-monitor.sh --test # Test mode (no alerts) +# +# Exit codes: +# 0 = OK +# 1 = Warning threshold exceeded +# 2 = Critical threshold exceeded + +set -euo pipefail + +# 
Configuration +ALERT_EMAIL="${ALERT_EMAIL:-}" +LOG_FILE="/var/log/tractatus/disk-monitor.log" +WARN_THRESHOLD=80 # Warn at 80% usage +CRITICAL_THRESHOLD=90 # Critical at 90% usage + +# Paths to monitor +declare -A MONITORED_PATHS=( + ["/"]="Root filesystem" + ["/var"]="Var directory" + ["/var/log"]="Log directory" + ["/var/www/tractatus"]="Tractatus application" + ["/tmp"]="Temp directory" +) + +# Parse arguments +TEST_MODE=false + +while [[ $# -gt 0 ]]; do + case $1 in + --test) + TEST_MODE=true + shift + ;; + *) + echo "Unknown option: $1" + exit 3 + ;; + esac +done + +# Logging function +log() { + local level="$1" + shift + local message="$*" + local timestamp=$(date '+%Y-%m-%d %H:%M:%S') + + echo "[$timestamp] [$level] $message" + + if [[ -d "$(dirname "$LOG_FILE")" ]]; then + echo "[$timestamp] [$level] $message" >> "$LOG_FILE" + fi +} + +# Send alert email +send_alert() { + local subject="$1" + local body="$2" + + if [[ "$TEST_MODE" == "true" ]]; then + log "INFO" "TEST MODE: Would send alert: $subject" + return 0 + fi + + if [[ -z "$ALERT_EMAIL" ]]; then + log "WARN" "No alert email configured (ALERT_EMAIL not set)" + return 0 + fi + + if command -v mail &> /dev/null; then + echo "$body" | mail -s "$subject" "$ALERT_EMAIL" + log "INFO" "Alert email sent to $ALERT_EMAIL" + elif command -v sendmail &> /dev/null; then + { + echo "Subject: $subject" + echo "From: tractatus-monitoring@agenticgovernance.digital" + echo "To: $ALERT_EMAIL" + echo "" + echo "$body" + } | sendmail "$ALERT_EMAIL" + log "INFO" "Alert email sent via sendmail to $ALERT_EMAIL" + else + log "WARN" "No email command available" + fi +} + +# Get disk usage for path +get_disk_usage() { + local path="$1" + + # Check if path exists + if [[ ! 
-e "$path" ]]; then
+        echo "N/A"
+        return 1
+    fi
+
+    # Get usage percentage (remove % sign)
+    df -h "$path" 2>/dev/null | awk 'NR==2 {print $5}' | sed 's/%//' || echo "N/A"
+}
+
+# Get human-readable disk usage details
+get_disk_details() {
+    local path="$1"
+
+    if [[ ! -e "$path" ]]; then
+        echo "Path does not exist"
+        return 1
+    fi
+
+    df -h "$path" 2>/dev/null | awk 'NR==2 {printf "Size: %s | Used: %s | Avail: %s | Use%%: %s | Mounted: %s\n", $2, $3, $4, $5, $6}'
+}
+
+# Find largest directories in path
+find_largest_dirs() {
+    local path="$1"
+    local limit="${2:-10}"
+
+    if [[ ! -e "$path" ]]; then
+        return 1
+    fi
+
+    # -s summarizes each entry; without it du also prints every nested
+    # subdirectory, so the "largest" list fills up with duplicates
+    du -sh "$path"/* 2>/dev/null | sort -rh | head -n "$limit" || echo "Unable to scan directory"
+}
+
+# Check single path
+check_path() {
+    local path="$1"
+    local description="$2"
+
+    local usage=$(get_disk_usage "$path")
+
+    if [[ "$usage" == "N/A" ]]; then
+        log "WARN" "$description ($path): Unable to check"
+        return 0
+    fi
+
+    if [[ "$usage" -ge "$CRITICAL_THRESHOLD" ]]; then
+        log "CRITICAL" "$description ($path): ${usage}% used (>= $CRITICAL_THRESHOLD%)"
+        return 2
+    elif [[ "$usage" -ge "$WARN_THRESHOLD" ]]; then
+        log "WARN" "$description ($path): ${usage}% used (>= $WARN_THRESHOLD%)"
+        return 1
+    else
+        log "INFO" "$description ($path): ${usage}% used"
+        return 0
+    fi
+}
+
+# Main monitoring function
+main() {
+    log "INFO" "Starting disk space monitoring"
+
+    local max_severity=0
+    local issues=()
+    local critical_paths=()
+    local warning_paths=()
+
+    # Check all monitored paths
+    for path in "${!MONITORED_PATHS[@]}"; do
+        local description="${MONITORED_PATHS[$path]}"
+        local exit_code=0
+
+        check_path "$path" "$description" || exit_code=$?
+ + if [[ "$exit_code" -eq 2 ]]; then + max_severity=2 + critical_paths+=("$path (${description})") + elif [[ "$exit_code" -eq 1 ]]; then + [[ "$max_severity" -lt 1 ]] && max_severity=1 + warning_paths+=("$path (${description})") + fi + done + + # Send alerts if thresholds exceeded + if [[ "$max_severity" -eq 2 ]]; then + local subject="[CRITICAL] Tractatus Disk Space Critical" + local body="CRITICAL: Disk space usage has exceeded ${CRITICAL_THRESHOLD}% on one or more paths. + +Critical Paths (>= ${CRITICAL_THRESHOLD}%): +$(printf -- "- %s\n" "${critical_paths[@]}") +" + + # Add warning paths if any + if [[ "${#warning_paths[@]}" -gt 0 ]]; then + body+=" +Warning Paths (>= ${WARN_THRESHOLD}%): +$(printf -- "- %s\n" "${warning_paths[@]}") +" + fi + + body+=" +Time: $(date '+%Y-%m-%d %H:%M:%S %Z') +Host: $(hostname) + +Disk Usage Details: +$(df -h) + +Largest directories in /var/www/tractatus: +$(find_largest_dirs /var/www/tractatus 10) + +Largest log files: +$(du -h /var/log/tractatus/*.log 2>/dev/null | sort -rh | head -10 || echo "No log files found") + +Action Required: +1. Clean up old log files +2. Remove unnecessary files +3. Check for runaway processes creating large files +4. Consider expanding disk space + +Clean up commands: +# Rotate old logs +sudo journalctl --vacuum-time=7d + +# Clean up npm cache +npm cache clean --force + +# Find large files +find /var/www/tractatus -type f -size +100M -exec ls -lh {} \; +" + + send_alert "$subject" "$body" + log "CRITICAL" "Disk space alert sent" + + elif [[ "$max_severity" -eq 1 ]]; then + local subject="[WARN] Tractatus Disk Space Warning" + local body="WARNING: Disk space usage has exceeded ${WARN_THRESHOLD}% on one or more paths. + +Warning Paths (>= ${WARN_THRESHOLD}%): +$(printf -- "- %s\n" "${warning_paths[@]}") + +Time: $(date '+%Y-%m-%d %H:%M:%S %Z') +Host: $(hostname) + +Disk Usage: +$(df -h) + +Please review disk usage and clean up if necessary. 
+" + + send_alert "$subject" "$body" + log "WARN" "Disk space warning sent" + else + log "INFO" "All monitored paths within acceptable limits" + fi + + exit $max_severity +} + +# Run main function +main diff --git a/scripts/monitoring/health-check.sh b/scripts/monitoring/health-check.sh new file mode 100755 index 00000000..797d611c --- /dev/null +++ b/scripts/monitoring/health-check.sh @@ -0,0 +1,269 @@ +#!/bin/bash +# +# Health Check Monitoring Script +# Monitors Tractatus application health endpoint and service status +# +# Usage: +# ./health-check.sh # Run check, alert if issues +# ./health-check.sh --quiet # Suppress output unless error +# ./health-check.sh --test # Test mode (no alerts) +# +# Exit codes: +# 0 = Healthy +# 1 = Health endpoint failed +# 2 = Service not running +# 3 = Configuration error + +set -euo pipefail + +# Configuration +HEALTH_URL="${HEALTH_URL:-https://agenticgovernance.digital/health}" +SERVICE_NAME="${SERVICE_NAME:-tractatus}" +ALERT_EMAIL="${ALERT_EMAIL:-}" +LOG_FILE="/var/log/tractatus/health-check.log" +STATE_FILE="/var/tmp/tractatus-health-state" +MAX_FAILURES=3 # Alert after 3 consecutive failures + +# Parse arguments +QUIET=false +TEST_MODE=false + +while [[ $# -gt 0 ]]; do + case $1 in + --quiet) QUIET=true; shift ;; + --test) TEST_MODE=true; shift ;; + *) echo "Unknown option: $1"; exit 3 ;; + esac +done + +# Logging function +log() { + local level="$1" + shift + local message="$*" + local timestamp=$(date '+%Y-%m-%d %H:%M:%S') + + if [[ "$QUIET" != "true" ]] || [[ "$level" == "ERROR" ]] || [[ "$level" == "CRITICAL" ]]; then + echo "[$timestamp] [$level] $message" + fi + + # Log to file if directory exists + if [[ -d "$(dirname "$LOG_FILE")" ]]; then + echo "[$timestamp] [$level] $message" >> "$LOG_FILE" + fi +} + +# Get current failure count +get_failure_count() { + if [[ -f "$STATE_FILE" ]]; then + cat "$STATE_FILE" + else + echo "0" + fi +} + +# Increment failure count +increment_failure_count() { + local 
count=$(get_failure_count) + echo $((count + 1)) > "$STATE_FILE" +} + +# Reset failure count +reset_failure_count() { + echo "0" > "$STATE_FILE" +} + +# Send alert email +send_alert() { + local subject="$1" + local body="$2" + + if [[ "$TEST_MODE" == "true" ]]; then + log "INFO" "TEST MODE: Would send alert: $subject" + return 0 + fi + + if [[ -z "$ALERT_EMAIL" ]]; then + log "WARN" "No alert email configured (ALERT_EMAIL not set)" + return 0 + fi + + # Try to send email using mail command (if available) + if command -v mail &> /dev/null; then + echo "$body" | mail -s "$subject" "$ALERT_EMAIL" + log "INFO" "Alert email sent to $ALERT_EMAIL" + elif command -v sendmail &> /dev/null; then + { + echo "Subject: $subject" + echo "From: tractatus-monitoring@agenticgovernance.digital" + echo "To: $ALERT_EMAIL" + echo "" + echo "$body" + } | sendmail "$ALERT_EMAIL" + log "INFO" "Alert email sent via sendmail to $ALERT_EMAIL" + else + log "WARN" "No email command available (install mailutils or sendmail)" + fi +} + +# Check health endpoint +check_health_endpoint() { + log "INFO" "Checking health endpoint: $HEALTH_URL" + + # Make HTTP request with timeout + local response + local http_code + + response=$(curl -s -w "\n%{http_code}" --max-time 10 "$HEALTH_URL" 2>&1) || { + log "ERROR" "Health endpoint request failed: $response" + return 1 + } + + # Extract HTTP code (last line) + http_code=$(echo "$response" | tail -n 1) + + # Extract response body (everything except last line) + local body=$(echo "$response" | sed '$d') + + # Check HTTP status + if [[ "$http_code" != "200" ]]; then + log "ERROR" "Health endpoint returned HTTP $http_code" + return 1 + fi + + # Check response contains expected JSON + if ! 
echo "$body" | jq -e '.status == "ok"' &> /dev/null; then + log "ERROR" "Health endpoint response invalid: $body" + return 1 + fi + + log "INFO" "Health endpoint OK (HTTP $http_code)" + return 0 +} + +# Check systemd service status +check_service_status() { + log "INFO" "Checking service status: $SERVICE_NAME" + + if ! systemctl is-active --quiet "$SERVICE_NAME"; then + log "ERROR" "Service $SERVICE_NAME is not active" + return 2 + fi + + # Check if service is enabled + if ! systemctl is-enabled --quiet "$SERVICE_NAME"; then + log "WARN" "Service $SERVICE_NAME is not enabled (won't start on boot)" + fi + + log "INFO" "Service $SERVICE_NAME is active" + return 0 +} + +# Check database connectivity (quick MongoDB ping) +check_database() { + log "INFO" "Checking database connectivity" + + # Try to connect to MongoDB (timeout 5 seconds) + if ! timeout 5 mongosh --quiet --eval "db.adminCommand('ping')" localhost:27017/tractatus_prod &> /dev/null; then + log "ERROR" "Database connection failed" + return 1 + fi + + log "INFO" "Database connectivity OK" + return 0 +} + +# Check disk space +check_disk_space() { + log "INFO" "Checking disk space" + + # Get root filesystem usage percentage + local usage=$(df -h / | awk 'NR==2 {print $5}' | sed 's/%//') + + if [[ "$usage" -gt 90 ]]; then + log "CRITICAL" "Disk space critical: ${usage}% used" + return 1 + elif [[ "$usage" -gt 80 ]]; then + log "WARN" "Disk space high: ${usage}% used" + else + log "INFO" "Disk space OK: ${usage}% used" + fi + + return 0 +} + +# Main health check +main() { + log "INFO" "Starting health check" + + local all_healthy=true + local issues=() + + # Run all checks + if ! check_service_status; then + all_healthy=false + issues+=("Service not running") + fi + + if ! check_health_endpoint; then + all_healthy=false + issues+=("Health endpoint failed") + fi + + if ! check_database; then + all_healthy=false + issues+=("Database connectivity failed") + fi + + if ! 
check_disk_space; then + all_healthy=false + issues+=("Disk space issue") + fi + + # Handle results + if [[ "$all_healthy" == "true" ]]; then + log "INFO" "All health checks passed ✓" + reset_failure_count + exit 0 + else + log "ERROR" "Health check failed: ${issues[*]}" + increment_failure_count + + local failure_count=$(get_failure_count) + log "WARN" "Consecutive failures: $failure_count/$MAX_FAILURES" + + # Alert if threshold reached + if [[ "$failure_count" -ge "$MAX_FAILURES" ]]; then + local subject="[ALERT] Tractatus Health Check Failed ($failure_count failures)" + local body="Tractatus health check has failed $failure_count times consecutively. + +Issues detected: +$(printf -- "- %s\n" "${issues[@]}") + +Time: $(date '+%Y-%m-%d %H:%M:%S %Z') +Host: $(hostname) +Service: $SERVICE_NAME +Health URL: $HEALTH_URL + +Please investigate immediately. + +View logs: +sudo journalctl -u $SERVICE_NAME -n 100 + +Check service status: +sudo systemctl status $SERVICE_NAME + +Restart service: +sudo systemctl restart $SERVICE_NAME +" + + send_alert "$subject" "$body" + log "CRITICAL" "Alert sent after $failure_count consecutive failures" + fi + + exit 1 + fi +} + +# Run main function +main diff --git a/scripts/monitoring/log-monitor.sh b/scripts/monitoring/log-monitor.sh new file mode 100755 index 00000000..b832feaa --- /dev/null +++ b/scripts/monitoring/log-monitor.sh @@ -0,0 +1,269 @@ +#!/bin/bash +# +# Log Monitoring Script +# Monitors Tractatus service logs for errors, security events, and anomalies +# +# Usage: +# ./log-monitor.sh # Monitor logs since last check +# ./log-monitor.sh --since "1 hour" # Monitor specific time window +# ./log-monitor.sh --follow # Continuous monitoring +# ./log-monitor.sh --test # Test mode (no alerts) +# +# Exit codes: +# 0 = No issues found +# 1 = Errors detected +# 2 = Critical errors detected +# 3 = Configuration error + +set -euo pipefail + +# Configuration +SERVICE_NAME="${SERVICE_NAME:-tractatus}" +ALERT_EMAIL="${ALERT_EMAIL:-}" 
+LOG_FILE="/var/log/tractatus/log-monitor.log" +STATE_FILE="/var/tmp/tractatus-log-monitor-state" +ERROR_THRESHOLD=10 # Alert after 10 errors in window +CRITICAL_THRESHOLD=3 # Alert immediately after 3 critical errors + +# Parse arguments +SINCE="5 minutes ago" +FOLLOW=false +TEST_MODE=false + +while [[ $# -gt 0 ]]; do + case $1 in + --since) + SINCE="$2" + shift 2 + ;; + --follow) + FOLLOW=true + shift + ;; + --test) + TEST_MODE=true + shift + ;; + *) + echo "Unknown option: $1" + exit 3 + ;; + esac +done + +# Logging function +log() { + local level="$1" + shift + local message="$*" + local timestamp=$(date '+%Y-%m-%d %H:%M:%S') + + echo "[$timestamp] [$level] $message" + + # Log to file if directory exists + if [[ -d "$(dirname "$LOG_FILE")" ]]; then + echo "[$timestamp] [$level] $message" >> "$LOG_FILE" + fi +} + +# Send alert email +send_alert() { + local subject="$1" + local body="$2" + + if [[ "$TEST_MODE" == "true" ]]; then + log "INFO" "TEST MODE: Would send alert: $subject" + return 0 + fi + + if [[ -z "$ALERT_EMAIL" ]]; then + log "WARN" "No alert email configured (ALERT_EMAIL not set)" + return 0 + fi + + if command -v mail &> /dev/null; then + echo "$body" | mail -s "$subject" "$ALERT_EMAIL" + log "INFO" "Alert email sent to $ALERT_EMAIL" + elif command -v sendmail &> /dev/null; then + { + echo "Subject: $subject" + echo "From: tractatus-monitoring@agenticgovernance.digital" + echo "To: $ALERT_EMAIL" + echo "" + echo "$body" + } | sendmail "$ALERT_EMAIL" + log "INFO" "Alert email sent via sendmail to $ALERT_EMAIL" + else + log "WARN" "No email command available" + fi +} + +# Extract errors from logs +extract_errors() { + local since="$1" + + # Get logs since specified time + sudo journalctl -u "$SERVICE_NAME" --since "$since" --no-pager 2>/dev/null || { + log "ERROR" "Failed to read journal for $SERVICE_NAME" + return 1 + } +} + +# Analyze log patterns +analyze_logs() { + local logs="$1" + + # Count different severity levels + local error_count=$(echo 
"$logs" | grep -ci "\[ERROR\]" || true)
+ # grep -c prints 0 on no match but exits non-zero; "|| true" keeps set -e
+ # satisfied ("|| echo 0" would append a second 0 and corrupt the count)
+ local critical_count=$(echo "$logs" | grep -ci "\[CRITICAL\]" || true)
+ local warn_count=$(echo "$logs" | grep -ci "\[WARN\]" || true)
+
+ # Security-related patterns
+ local security_count=$(echo "$logs" | grep -ciE "(SECURITY|unauthorized|forbidden|authentication failed)" || true)
+
+ # Database errors
+ local db_error_count=$(echo "$logs" | grep -ciE "(mongodb|database|connection.*failed)" || true)
+
+ # HTTP errors
+ local http_error_count=$(echo "$logs" | grep -ciE "HTTP.*50[0-9]|Internal Server Error" || true)
+
+ # Unhandled exceptions
+ local exception_count=$(echo "$logs" | grep -ciE "(Unhandled.*exception|TypeError|ReferenceError)" || true)
+
+ log "INFO" "Log analysis: CRITICAL=$critical_count ERROR=$error_count WARN=$warn_count SECURITY=$security_count DB_ERROR=$db_error_count HTTP_ERROR=$http_error_count EXCEPTION=$exception_count"
+
+ # Determine severity
+ if [[ "$critical_count" -ge "$CRITICAL_THRESHOLD" ]]; then
+ log "CRITICAL" "Critical error threshold exceeded: $critical_count critical errors"
+ return 2
+ fi
+
+ if [[ "$error_count" -ge "$ERROR_THRESHOLD" ]]; then
+ log "ERROR" "Error threshold exceeded: $error_count errors"
+ return 1
+ fi
+
+ if [[ "$security_count" -gt 0 ]]; then
+ log "WARN" "Security events detected: $security_count events"
+ fi
+
+ if [[ "$db_error_count" -gt 5 ]]; then
+ log "WARN" "Database errors detected: $db_error_count errors"
+ fi
+
+ if [[ "$exception_count" -gt 0 ]]; then
+ log "WARN" "Unhandled exceptions detected: $exception_count exceptions"
+ fi
+
+ return 0
+}
+
+# Extract top error messages
+get_top_errors() {
+ local logs="$1"
+ local limit="${2:-10}"
+
+ echo "$logs" | grep -iE "\[ERROR\]|\[CRITICAL\]" | \
+ sed 's/^.*\] //' | \
+ sort | uniq -c | sort -rn | head -n "$limit"
+}
+
+# Main monitoring function
+main() {
+ log "INFO" "Starting log monitoring (since: $SINCE)"
+
+ # Extract logs
+ local logs
+ logs=$(extract_errors "$SINCE") || {
+ log "ERROR" "Failed to extract logs"
+ exit 3
+ }
+
+ # Bail out early on an empty window (echo "" | wc -l would report 1)
+ if [[ -z "$logs" ]]; then
+ log "INFO" "No logs found in time window"
+ exit 0
+ fi
+
+ # Count total log entries
+ local log_count=$(echo "$logs" | wc -l)
+ log "INFO" "Analyzing $log_count log entries"
+
+ # Analyze logs
+ local exit_code=0
+ analyze_logs "$logs" || exit_code=$?
+
+ # If errors detected, send alert
+ if [[ "$exit_code" -ne 0 ]]; then
+ local severity="ERROR"
+ [[ "$exit_code" -eq 2 ]] && severity="CRITICAL"
+
+ local subject="[ALERT] Tractatus Log Monitoring - $severity Detected"
+
+ # Extract top 10 error messages
+ local top_errors=$(get_top_errors "$logs" 10)
+
+ local body="Log monitoring detected $severity level issues in Tractatus service.
+
+Time Window: $SINCE
+Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
+Host: $(hostname)
+Service: $SERVICE_NAME
+
+Top Error Messages:
+$top_errors
+
+Recent Critical/Error Logs:
+$(echo "$logs" | grep -iE "\[ERROR\]|\[CRITICAL\]" | tail -n 20)
+
+Full logs:
+sudo journalctl -u $SERVICE_NAME --since \"$SINCE\"
+
+Check service status:
+sudo systemctl status $SERVICE_NAME
+"
+
+ send_alert "$subject" "$body"
+ else
+ log "INFO" "No significant issues detected"
+ fi
+
+ exit $exit_code
+}
+
+# Follow mode (continuous monitoring)
+follow_logs() {
+ log "INFO" "Starting continuous log monitoring"
+
+ sudo journalctl -u "$SERVICE_NAME" -f --no-pager | while read -r line; do
+ # Check for error patterns
+ if echo "$line" | grep -qiE "\[ERROR\]|\[CRITICAL\]"; then
+ log "ERROR" "$line"
+
+ # Extract error message
+ local error_msg=$(echo "$line" | sed 's/^.*\] //')
+
+ # Check for critical patterns
+ if echo "$line" | grep -qiE "\[CRITICAL\]|Unhandled.*exception|Database.*failed|Service.*crashed"; then
+ local subject="[CRITICAL] Tractatus Error Detected"
+ local body="Critical error detected in Tractatus logs:
+
+$line
+
+Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
+Host: $(hostname)
+
+Recent logs:
+$(sudo journalctl -u $SERVICE_NAME -n 10 --no-pager)
+"
+ send_alert "$subject" "$body"
+ fi
+ fi
+ done
+}
+
+# Run appropriate mode
+if [[ "$FOLLOW" == "true" ]]; then
+ follow_logs
+else
+ main
+fi
diff --git a/scripts/monitoring/monitor-all.sh b/scripts/monitoring/monitor-all.sh
new file mode 100755
index 00000000..283b0499
--- /dev/null
+++ b/scripts/monitoring/monitor-all.sh
@@ -0,0 +1,178 @@
+#!/bin/bash
+#
+# Master Monitoring Script
+# Orchestrates all monitoring checks for Tractatus production environment
+#
+# Usage:
+# ./monitor-all.sh # Run all monitors
+# ./monitor-all.sh --test # Test mode (no alerts)
+# ./monitor-all.sh --skip-ssl # Skip SSL check
+#
+# Exit codes:
+# 0 = All checks passed
+# 1 = Some warnings
+# 2 = Some critical issues
+# 3 = Configuration error
+
+set -euo pipefail
+
+# Configuration
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+LOG_FILE="/var/log/tractatus/monitoring.log"
+ALERT_EMAIL="${ALERT_EMAIL:-}"
+
+# Parse arguments
+TEST_MODE=false
+SKIP_SSL=false
+
+while [[ $# -gt 0 ]]; do
+ case $1 in
+ --test)
+ TEST_MODE=true
+ shift
+ ;;
+ --skip-ssl)
+ SKIP_SSL=true
+ shift
+ ;;
+ *)
+ echo "Unknown option: $1"
+ exit 3
+ ;;
+ esac
+done
+
+# Export configuration for child scripts
+export ALERT_EMAIL
+[[ "$TEST_MODE" == "true" ]] && TEST_FLAG="--test" || TEST_FLAG=""
+
+# Logging function
+log() {
+ local level="$1"
+ shift
+ local message="$*"
+ local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
+
+ echo "[$timestamp] [$level] $message"
+
+ if [[ -d "$(dirname "$LOG_FILE")" ]]; then
+ echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
+ fi
+}
+
+# Run monitoring check
+run_check() {
+ local name="$1"
+ local script="$2"
+ shift 2
+ # Keep remaining arguments as an array so boundaries survive; a flat
+ # string would split --since "5 minutes ago" into four separate words
+ local args=("$@")
+
+ log "INFO" "Running $name..."
+
+ # The ${args[@]+...} expansion keeps set -u happy when no extra args are given
+ local exit_code=0
+ "$SCRIPT_DIR/$script" ${args[@]+"${args[@]}"} $TEST_FLAG || exit_code=$?
+
+ case $exit_code in
+ 0)
+ log "INFO" "$name: OK ✓"
+ ;;
+ 1)
+ log "WARN" "$name: Warning"
+ ;;
+ 2)
+ log "CRITICAL" "$name: Critical"
+ ;;
+ *)
+ log "ERROR" "$name: Error (exit code: $exit_code)"
+ ;;
+ esac
+
+ return $exit_code
+}
+
+# Main monitoring function
+main() {
+ log "INFO" "=== Starting Tractatus Monitoring Suite ==="
+ log "INFO" "Timestamp: $(date '+%Y-%m-%d %H:%M:%S %Z')"
+ log "INFO" "Host: $(hostname)"
+ [[ "$TEST_MODE" == "true" ]] && log "INFO" "TEST MODE: Alerts suppressed"
+
+ local max_severity=0
+ local checks_run=0
+ local checks_passed=0
+ local checks_warned=0
+ local checks_critical=0
+ local checks_failed=0
+
+ # Counters use explicit assignment rather than ((x++)): the increment form
+ # evaluates to 0 on first use, which would abort the script under set -e
+
+ # Health Check
+ if run_check "Health Check" "health-check.sh"; then
+ checks_passed=$((checks_passed + 1))
+ else
+ local exit_code=$?
+ [[ $exit_code -eq 1 ]] && checks_warned=$((checks_warned + 1))
+ [[ $exit_code -eq 2 ]] && checks_critical=$((checks_critical + 1))
+ [[ $exit_code -ge 3 ]] && checks_failed=$((checks_failed + 1))
+ [[ $exit_code -gt $max_severity ]] && max_severity=$exit_code
+ fi
+ checks_run=$((checks_run + 1))
+
+ # Log Monitor
+ if run_check "Log Monitor" "log-monitor.sh" --since "5 minutes ago"; then
+ checks_passed=$((checks_passed + 1))
+ else
+ local exit_code=$?
+ [[ $exit_code -eq 1 ]] && checks_warned=$((checks_warned + 1))
+ [[ $exit_code -eq 2 ]] && checks_critical=$((checks_critical + 1))
+ [[ $exit_code -ge 3 ]] && checks_failed=$((checks_failed + 1))
+ [[ $exit_code -gt $max_severity ]] && max_severity=$exit_code
+ fi
+ checks_run=$((checks_run + 1))
+
+ # Disk Monitor
+ if run_check "Disk Monitor" "disk-monitor.sh"; then
+ checks_passed=$((checks_passed + 1))
+ else
+ local exit_code=$?
+ [[ $exit_code -eq 1 ]] && checks_warned=$((checks_warned + 1))
+ [[ $exit_code -eq 2 ]] && checks_critical=$((checks_critical + 1))
+ [[ $exit_code -ge 3 ]] && checks_failed=$((checks_failed + 1))
+ [[ $exit_code -gt $max_severity ]] && max_severity=$exit_code
+ fi
+ checks_run=$((checks_run + 1))
+
+ # SSL Monitor (optional)
+ if [[ "$SKIP_SSL" != "true" ]]; then
+ if run_check "SSL Monitor" "ssl-monitor.sh"; then
+ checks_passed=$((checks_passed + 1))
+ else
+ local exit_code=$?
+ [[ $exit_code -eq 1 ]] && checks_warned=$((checks_warned + 1))
+ [[ $exit_code -eq 2 ]] && checks_critical=$((checks_critical + 1))
+ [[ $exit_code -ge 3 ]] && checks_failed=$((checks_failed + 1))
+ [[ $exit_code -gt $max_severity ]] && max_severity=$exit_code
+ fi
+ checks_run=$((checks_run + 1))
+ fi
+
+ # Summary
+ log "INFO" "=== Monitoring Summary ==="
+ log "INFO" "Checks run: $checks_run"
+ log "INFO" "Passed: $checks_passed | Warned: $checks_warned | Critical: $checks_critical | Failed: $checks_failed"
+
+ if [[ $max_severity -eq 0 ]]; then
+ log "INFO" "All monitoring checks passed ✓"
+ elif [[ $max_severity -eq 1 ]]; then
+ log "WARN" "Some checks returned warnings"
+ elif [[ $max_severity -eq 2 ]]; then
+ log "CRITICAL" "Some checks returned critical alerts"
+ else
+ log "ERROR" "Some checks failed"
+ fi
+
+ log "INFO" "=== Monitoring Complete ==="
+
+ exit $max_severity
+}
+
+# Run main function
+main
diff --git a/scripts/monitoring/ssl-monitor.sh b/scripts/monitoring/ssl-monitor.sh
new file mode 100755
index 00000000..5d75d8f7
--- /dev/null
+++ b/scripts/monitoring/ssl-monitor.sh
@@ -0,0 +1,319 @@
+#!/bin/bash
+#
+# SSL Certificate Monitoring Script
+# Monitors SSL certificate expiry and alerts before expiration
+#
+# Usage:
+# ./ssl-monitor.sh # Check all domains
+# ./ssl-monitor.sh --domain example.com # Check specific domain
+# ./ssl-monitor.sh --test # Test mode (no alerts)
+#
+# Exit codes:
+# 0 = OK
+# 1 = Warning (expires soon)
+# 2 = Critical (expires very soon)
+# 3 = Expired or error

+set -euo pipefail
+
+# Configuration
+ALERT_EMAIL="${ALERT_EMAIL:-}"
+LOG_FILE="/var/log/tractatus/ssl-monitor.log"
+WARN_DAYS=30 # Warn 30 days before expiry
+CRITICAL_DAYS=7 # Critical alert 7 days before expiry
+
+# Default domains to monitor
+DOMAINS=(
+ "agenticgovernance.digital"
+)
+
+# Parse arguments
+TEST_MODE=false
+SPECIFIC_DOMAIN=""
+
+while [[ $# -gt 0 ]]; do
+ case $1 in
+ --domain)
+ SPECIFIC_DOMAIN="$2"
+ shift 2
+ ;;
+ --test)
+ TEST_MODE=true
+ shift
+ ;;
+ *)
+ echo "Unknown option: $1"
+ exit 3
+ ;;
+ 
esac +done + +# Override domains if specific domain provided +if [[ -n "$SPECIFIC_DOMAIN" ]]; then + DOMAINS=("$SPECIFIC_DOMAIN") +fi + +# Logging function +log() { + local level="$1" + shift + local message="$*" + local timestamp=$(date '+%Y-%m-%d %H:%M:%S') + + echo "[$timestamp] [$level] $message" + + if [[ -d "$(dirname "$LOG_FILE")" ]]; then + echo "[$timestamp] [$level] $message" >> "$LOG_FILE" + fi +} + +# Send alert email +send_alert() { + local subject="$1" + local body="$2" + + if [[ "$TEST_MODE" == "true" ]]; then + log "INFO" "TEST MODE: Would send alert: $subject" + return 0 + fi + + if [[ -z "$ALERT_EMAIL" ]]; then + log "WARN" "No alert email configured (ALERT_EMAIL not set)" + return 0 + fi + + if command -v mail &> /dev/null; then + echo "$body" | mail -s "$subject" "$ALERT_EMAIL" + log "INFO" "Alert email sent to $ALERT_EMAIL" + elif command -v sendmail &> /dev/null; then + { + echo "Subject: $subject" + echo "From: tractatus-monitoring@agenticgovernance.digital" + echo "To: $ALERT_EMAIL" + echo "" + echo "$body" + } | sendmail "$ALERT_EMAIL" + log "INFO" "Alert email sent via sendmail to $ALERT_EMAIL" + else + log "WARN" "No email command available" + fi +} + +# Get SSL certificate expiry date +get_cert_expiry() { + local domain="$1" + + # Use openssl to get certificate + local expiry_date + expiry_date=$(echo | openssl s_client -servername "$domain" -connect "$domain:443" 2>/dev/null | \ + openssl x509 -noout -enddate 2>/dev/null | \ + cut -d= -f2) || { + log "ERROR" "Failed to retrieve certificate for $domain" + return 1 + } + + echo "$expiry_date" +} + +# Get days until expiry +get_days_until_expiry() { + local expiry_date="$1" + + # Convert expiry date to seconds since epoch + local expiry_epoch + expiry_epoch=$(date -d "$expiry_date" +%s 2>/dev/null) || { + log "ERROR" "Failed to parse expiry date: $expiry_date" + return 1 + } + + # Get current time in seconds since epoch + local now_epoch=$(date +%s) + + # Calculate days until expiry + 
local seconds_until_expiry=$((expiry_epoch - now_epoch)) + local days_until_expiry=$((seconds_until_expiry / 86400)) + + echo "$days_until_expiry" +} + +# Get certificate details +get_cert_details() { + local domain="$1" + + echo | openssl s_client -servername "$domain" -connect "$domain:443" 2>/dev/null | \ + openssl x509 -noout -subject -issuer -dates 2>/dev/null || { + echo "Failed to retrieve certificate details" + return 1 + } +} + +# Check single domain +check_domain() { + local domain="$1" + + log "INFO" "Checking SSL certificate for $domain" + + # Get expiry date + local expiry_date + expiry_date=$(get_cert_expiry "$domain") || { + log "ERROR" "Failed to check certificate for $domain" + return 3 + } + + # Calculate days until expiry + local days_until_expiry + days_until_expiry=$(get_days_until_expiry "$expiry_date") || { + log "ERROR" "Failed to calculate expiry for $domain" + return 3 + } + + # Check if expired + if [[ "$days_until_expiry" -lt 0 ]]; then + log "CRITICAL" "$domain: Certificate EXPIRED ${days_until_expiry#-} days ago!" 
+ return 3
+ fi
+
+ # Check thresholds
+ if [[ "$days_until_expiry" -le "$CRITICAL_DAYS" ]]; then
+ log "CRITICAL" "$domain: Certificate expires in $days_until_expiry days (expires: $expiry_date)"
+ return 2
+ elif [[ "$days_until_expiry" -le "$WARN_DAYS" ]]; then
+ log "WARN" "$domain: Certificate expires in $days_until_expiry days (expires: $expiry_date)"
+ return 1
+ else
+ log "INFO" "$domain: Certificate valid for $days_until_expiry days (expires: $expiry_date)"
+ return 0
+ fi
+}
+
+# Main monitoring function
+main() {
+ log "INFO" "Starting SSL certificate monitoring"
+
+ local max_severity=0
+ local expired_domains=()
+ local critical_domains=()
+ local warning_domains=()
+
+ # Check all domains
+ for domain in "${DOMAINS[@]}"; do
+ local exit_code=0
+ # Derive expiry info for the alert text; keep only the last line of
+ # output so a helper's logged error cannot leak into the message body
+ local expiry_date=$( { get_cert_expiry "$domain" || echo "Unknown"; } 2>/dev/null | tail -n 1)
+ local days_until_expiry=$( { get_days_until_expiry "$expiry_date" || echo "Unknown"; } 2>/dev/null | tail -n 1)
+
+ check_domain "$domain" || exit_code=$?
+
+ if [[ "$exit_code" -eq 3 ]]; then
+ max_severity=3
+ expired_domains+=("$domain (EXPIRED or ERROR)")
+ elif [[ "$exit_code" -eq 2 ]]; then
+ [[ "$max_severity" -lt 2 ]] && max_severity=2
+ critical_domains+=("$domain (expires in $days_until_expiry days)")
+ elif [[ "$exit_code" -eq 1 ]]; then
+ [[ "$max_severity" -lt 1 ]] && max_severity=1
+ warning_domains+=("$domain (expires in $days_until_expiry days)")
+ fi
+ done
+
+ # Send alerts based on severity
+ if [[ "$max_severity" -eq 3 ]]; then
+ local subject="[CRITICAL] SSL Certificate Expired or Error"
+ local body="CRITICAL: SSL certificate has expired or error occurred.
+ +Expired/Error Domains: +$(printf -- "- %s\n" "${expired_domains[@]}") +" + + # Add other alerts if any + if [[ "${#critical_domains[@]}" -gt 0 ]]; then + body+=" +Critical Domains (<= $CRITICAL_DAYS days): +$(printf -- "- %s\n" "${critical_domains[@]}") +" + fi + + if [[ "${#warning_domains[@]}" -gt 0 ]]; then + body+=" +Warning Domains (<= $WARN_DAYS days): +$(printf -- "- %s\n" "${warning_domains[@]}") +" + fi + + body+=" +Time: $(date '+%Y-%m-%d %H:%M:%S %Z') +Host: $(hostname) + +Action Required: +1. Renew SSL certificate immediately +2. Check Let's Encrypt auto-renewal: + sudo certbot renew --dry-run + +Certificate details: +$(get_cert_details "${DOMAINS[0]}") + +Renewal commands: +# Test renewal +sudo certbot renew --dry-run + +# Force renewal +sudo certbot renew --force-renewal + +# Check certificate status +sudo certbot certificates +" + + send_alert "$subject" "$body" + log "CRITICAL" "SSL certificate alert sent" + + elif [[ "$max_severity" -eq 2 ]]; then + local subject="[CRITICAL] SSL Certificate Expires Soon" + local body="CRITICAL: SSL certificate expires in $CRITICAL_DAYS days or less. + +Critical Domains (<= $CRITICAL_DAYS days): +$(printf -- "- %s\n" "${critical_domains[@]}") +" + + if [[ "${#warning_domains[@]}" -gt 0 ]]; then + body+=" +Warning Domains (<= $WARN_DAYS days): +$(printf -- "- %s\n" "${warning_domains[@]}") +" + fi + + body+=" +Time: $(date '+%Y-%m-%d %H:%M:%S %Z') +Host: $(hostname) + +Please renew certificates soon. + +Check renewal: +sudo certbot renew --dry-run +" + + send_alert "$subject" "$body" + log "CRITICAL" "SSL expiry alert sent" + + elif [[ "$max_severity" -eq 1 ]]; then + local subject="[WARN] SSL Certificate Expires Soon" + local body="WARNING: SSL certificate expires in $WARN_DAYS days or less. + +Warning Domains (<= $WARN_DAYS days): +$(printf -- "- %s\n" "${warning_domains[@]}") + +Time: $(date '+%Y-%m-%d %H:%M:%S %Z') +Host: $(hostname) + +Please plan certificate renewal. 
+" + + send_alert "$subject" "$body" + log "WARN" "SSL expiry warning sent" + else + log "INFO" "All SSL certificates valid" + fi + + exit $max_severity +} + +# Run main function +main