ops: implement comprehensive production monitoring system
Create self-hosted, privacy-first monitoring infrastructure for production environment with automated health checks, log analysis, and alerting.

Monitoring Components:
- health-check.sh: Application health, service status, DB connectivity, disk space
- log-monitor.sh: Error detection, security events, anomaly detection
- disk-monitor.sh: Disk space usage monitoring (5 paths)
- ssl-monitor.sh: SSL certificate expiry monitoring
- monitor-all.sh: Master orchestration script

Features:
- Email alerting system (configurable thresholds)
- Consecutive failure tracking (prevents false positives)
- Test mode for safe deployment testing
- Comprehensive logging to /var/log/tractatus/
- Cron-ready for automated execution
- Exit codes for monitoring tool integration

Alert Triggers:
- Health: 3 consecutive failures (15min downtime)
- Logs: 10 errors OR 3 critical errors in 5min
- Disk: 80% warning, 90% critical
- SSL: 30 days warning, 7 days critical

Setup Documentation:
- Complete installation instructions
- Cron configuration examples
- Systemd timer alternative
- Troubleshooting guide
- Alert customization guide
- Incident response procedures

Privacy-First Design:
- Self-hosted (no external monitoring services)
- Minimal data exposure in alerts
- Local log storage only
- No telemetry to third parties

Aligns with Tractatus values: transparency, privacy, operational excellence

Addresses Phase 4 Prep Checklist Task #6: Production Monitoring & Alerting

Next: Deploy to production, configure email alerts, set up cron jobs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
parent 1221941828
commit c755c49ec1
6 changed files with 1940 additions and 0 deletions
648
docs/PRODUCTION_MONITORING_SETUP.md
Normal file
|
|
@@ -0,0 +1,648 @@
|
||||||
|
# Production Monitoring Setup
|
||||||
|
|
||||||
|
**Project**: Tractatus AI Safety Framework Website
|
||||||
|
**Environment**: Production (vps-93a693da.vps.ovh.net)
|
||||||
|
**Created**: 2025-10-09
|
||||||
|
**Status**: Ready for Deployment
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
Comprehensive monitoring system for the Tractatus production environment, providing:
|
||||||
|
|
||||||
|
- **Health monitoring** - Application uptime, service status, database connectivity
|
||||||
|
- **Log monitoring** - Error detection, security events, anomaly detection
|
||||||
|
- **Disk monitoring** - Disk space usage alerts
|
||||||
|
- **SSL monitoring** - Certificate expiry warnings
|
||||||
|
- **Email alerts** - Automated notifications for critical issues
|
||||||
|
|
||||||
|
**Philosophy**: Privacy-first, self-hosted monitoring aligned with Tractatus values.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Monitoring Components
|
||||||
|
|
||||||
|
### 1. Health Check Monitor (`health-check.sh`)
|
||||||
|
|
||||||
|
**What it monitors:**
|
||||||
|
- Application health endpoint (https://agenticgovernance.digital/health)
|
||||||
|
- Systemd service status (tractatus.service)
|
||||||
|
- MongoDB database connectivity
|
||||||
|
- Disk space usage
|
||||||
|
|
||||||
|
**Alert Triggers:**
|
||||||
|
- Service not running
|
||||||
|
- Health endpoint returns non-200
|
||||||
|
- Database connection failed
|
||||||
|
- Disk space > 90%
|
||||||
|
|
||||||
|
**Frequency**: Every 5 minutes
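
As a rough illustration of the endpoint check described above, the sketch below fetches the HTTP status code with `curl` and treats anything other than 200 as a failure. The function names `fetch_status`, `status_is_healthy`, and `check_endpoint` are hypothetical, and the 10-second timeout is an assumption; the actual `health-check.sh` may do this differently.

```bash
#!/bin/bash
# Sketch only: fetch_status, status_is_healthy and check_endpoint are
# illustrative names, not functions taken from health-check.sh.
HEALTH_URL="${HEALTH_URL:-https://agenticgovernance.digital/health}"

# Print the HTTP status code for a URL ("000" if the request itself fails).
fetch_status() {
    local url="$1"
    # --max-time 10 is an assumed timeout; "|| true" keeps set -e from aborting.
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url" || true
}

# Succeed only for HTTP 200, matching the "non-200 triggers an alert" rule.
status_is_healthy() {
    [[ "$1" == "200" ]]
}

# Report the current state without sending any alert.
check_endpoint() {
    local code
    code=$(fetch_status "$HEALTH_URL")
    if status_is_healthy "$code"; then
        echo "healthy (HTTP $code)"
    else
        echo "unhealthy (HTTP $code)"
        return 1
    fi
}
```

Calling `check_endpoint` prints one line and returns non-zero when the endpoint is unhealthy, so it can be combined with other checks in a wrapper script.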
|
||||||
|
|
||||||
|
### 2. Log Monitor (`log-monitor.sh`)
|
||||||
|
|
||||||
|
**What it monitors:**
|
||||||
|
- ERROR and CRITICAL log entries
|
||||||
|
- Security events (authentication failures, unauthorized access)
|
||||||
|
- Database errors
|
||||||
|
- HTTP 500 errors
|
||||||
|
- Unhandled exceptions
|
||||||
|
|
||||||
|
**Alert Triggers:**
|
||||||
|
- 10+ errors in 5-minute window
|
||||||
|
- 3+ critical errors in 5-minute window
|
||||||
|
- Any security events
|
||||||
|
|
||||||
|
**Frequency**: Every 5 minutes
|
||||||
|
|
||||||
|
**Follow Mode**: Can run continuously for real-time monitoring
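
The threshold rules above can be sketched as a small filter that counts matching lines in one time window. This is a minimal sketch under stated assumptions: `evaluate_window` is a hypothetical helper, and matching on the literal words `ERROR` and `CRITICAL` is a guess about the log format rather than the behaviour of `log-monitor.sh`. Security-event detection is not covered here.

```bash
#!/bin/bash
# Sketch only: evaluate_window is an illustrative helper, and matching on the
# literal words ERROR and CRITICAL is an assumption about the log format.
ERROR_THRESHOLD=10
CRITICAL_THRESHOLD=3

# Read the log lines for one time window from stdin and print a verdict.
evaluate_window() {
    local input errors criticals
    input=$(cat)
    # grep -c prints 0 and exits 1 when nothing matches, hence "|| true".
    errors=$(printf '%s\n' "$input" | grep -c 'ERROR' || true)
    criticals=$(printf '%s\n' "$input" | grep -c 'CRITICAL' || true)

    if (( criticals >= CRITICAL_THRESHOLD || errors >= ERROR_THRESHOLD )); then
        echo "ALERT errors=$errors critical=$criticals"
    else
        echo "OK errors=$errors critical=$criticals"
    fi
}
```

If the service logs to the systemd journal (an assumption), one way to feed it would be `journalctl -u tractatus --since "5 minutes ago" --no-pager | evaluate_window`.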
|
||||||
|
|
||||||
|
### 3. Disk Space Monitor (`disk-monitor.sh`)
|
||||||
|
|
||||||
|
**What it monitors:**
|
||||||
|
- Root filesystem (/)
|
||||||
|
- Var directory (/var)
|
||||||
|
- Log directory (/var/log)
|
||||||
|
- Tractatus application (/var/www/tractatus)
|
||||||
|
- Temp directory (/tmp)
|
||||||
|
|
||||||
|
**Alert Triggers:**
|
||||||
|
- Warning: 80%+ usage
|
||||||
|
- Critical: 90%+ usage
|
||||||
|
|
||||||
|
**Frequency**: Every 15 minutes
|
||||||
|
|
||||||
|
### 4. SSL Certificate Monitor (`ssl-monitor.sh`)
|
||||||
|
|
||||||
|
**What it monitors:**
|
||||||
|
- SSL certificate expiry for agenticgovernance.digital
|
||||||
|
|
||||||
|
**Alert Triggers:**
|
||||||
|
- Warning: Expires in 30 days or less
|
||||||
|
- Critical: Expires in 7 days or less
|
||||||
|
- Critical: Already expired
|
||||||
|
|
||||||
|
**Frequency**: Daily
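
The expiry rules above could be implemented roughly as follows. This is a sketch, not the contents of `ssl-monitor.sh`: the helper names `fetch_cert_end_date`, `days_until`, and `classify_days` are hypothetical, and the date parsing assumes GNU `date` (as shipped on Ubuntu).

```bash
#!/bin/bash
# Sketch only: helper names are illustrative; GNU date is assumed.
WARN_DAYS=30
CRITICAL_DAYS=7

# Print the certificate's notAfter date (needs network access and openssl).
fetch_cert_end_date() {
    local host="$1"
    openssl s_client -connect "${host}:443" -servername "$host" </dev/null 2>/dev/null \
        | openssl x509 -noout -enddate \
        | cut -d= -f2
}

# Whole days from a reference epoch (default: now) to an expiry date string.
days_until() {
    local end_date="$1" now_epoch="${2:-$(date +%s)}"
    local end_epoch
    end_epoch=$(date -u -d "$end_date" +%s)
    echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Map days remaining to the severity levels listed above.
classify_days() {
    local days="$1"
    if (( days < 0 )); then
        echo "CRITICAL (expired)"
    elif (( days <= CRITICAL_DAYS )); then
        echo "CRITICAL"
    elif (( days <= WARN_DAYS )); then
        echo "WARNING"
    else
        echo "OK"
    fi
}
```

A typical call chain would be `classify_days "$(days_until "$(fetch_cert_end_date agenticgovernance.digital)")"`.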
|
||||||
|
|
||||||
|
### 5. Master Monitor (`monitor-all.sh`)
|
||||||
|
|
||||||
|
Orchestrates all monitoring checks in a single run.
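
One plausible shape for such an orchestrator is to run every check and return the worst exit code seen. The sketch below assumes each check signals severity through its exit code (0 = OK, higher = worse), as `disk-monitor.sh` does; `run_all` is a hypothetical helper, not code from `monitor-all.sh`.

```bash
#!/bin/bash
# Sketch only: run_all is an illustrative helper, not code from monitor-all.sh.
# It assumes a higher exit code means a more severe result.

run_all() {
    local worst=0 rc check
    for check in "$@"; do
        rc=0
        "$check" || rc=$?      # keep going even if one check fails
        if (( rc > worst )); then
            worst=$rc
        fi
    done
    return "$worst"
}
```

For example, `run_all ./health-check.sh ./log-monitor.sh ./disk-monitor.sh` would run all three scripts and exit with the highest code any of them returned. Each argument is treated as a single command without arguments.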
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Installation
|
||||||
|
|
||||||
|
### Prerequisites
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Ensure required commands are available
|
||||||
|
sudo apt-get update
|
||||||
|
sudo apt-get install -y curl jq openssl mailutils
|
||||||
|
|
||||||
|
# Install MongoDB shell (if not installed)
|
||||||
|
wget -qO - https://www.mongodb.org/static/pgp/server-7.0.asc | sudo gpg --dearmor -o /usr/share/keyrings/mongodb-server-7.0.gpg
echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-7.0.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list
|
||||||
|
sudo apt-get update
|
||||||
|
sudo apt-get install -y mongodb-mongosh
|
||||||
|
```
|
||||||
|
|
||||||
|
### Deploy Monitoring Scripts
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# From local machine, deploy monitoring scripts to production
|
||||||
|
rsync -avz -e "ssh -i ~/.ssh/tractatus_deploy" \
|
||||||
|
scripts/monitoring/ \
|
||||||
|
ubuntu@vps-93a693da.vps.ovh.net:/var/www/tractatus/scripts/monitoring/
|
||||||
|
```
|
||||||
|
|
||||||
|
### Set Up Log Directory
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# On production server
|
||||||
|
ssh -i ~/.ssh/tractatus_deploy ubuntu@vps-93a693da.vps.ovh.net
|
||||||
|
|
||||||
|
# Create log directory
|
||||||
|
sudo mkdir -p /var/log/tractatus
|
||||||
|
sudo chown ubuntu:ubuntu /var/log/tractatus
|
||||||
|
sudo chmod 755 /var/log/tractatus
|
||||||
|
```
|
||||||
|
|
||||||
|
### Make Scripts Executable
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# On production server
|
||||||
|
cd /var/www/tractatus/scripts/monitoring
|
||||||
|
chmod +x *.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
### Configure Email Alerts
|
||||||
|
|
||||||
|
**Option 1: Using Postfix (Recommended for production)**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Install Postfix
|
||||||
|
sudo apt-get install -y postfix
|
||||||
|
|
||||||
|
# Configure Postfix (select "Internet Site")
|
||||||
|
sudo dpkg-reconfigure postfix
|
||||||
|
|
||||||
|
# Set ALERT_EMAIL environment variable
|
||||||
|
echo 'export ALERT_EMAIL="your-email@example.com"' | sudo tee -a /etc/environment
|
||||||
|
source /etc/environment
|
||||||
|
```
|
||||||
|
|
||||||
|
**Option 2: Using External SMTP (ProtonMail, Gmail, etc.)**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Install sendemail
|
||||||
|
sudo apt-get install -y sendemail libio-socket-ssl-perl libnet-ssleay-perl
|
||||||
|
|
||||||
|
# Configure in monitoring scripts (or use system mail)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Option 3: No Email (Testing)**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Leave ALERT_EMAIL unset - monitoring will log but not send emails
|
||||||
|
# Useful for initial testing
|
||||||
|
```
|
||||||
|
|
||||||
|
### Test Monitoring Scripts
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Test health check
|
||||||
|
cd /var/www/tractatus/scripts/monitoring
|
||||||
|
./health-check.sh --test
|
||||||
|
|
||||||
|
# Test log monitor
|
||||||
|
./log-monitor.sh --since "10 minutes ago" --test
|
||||||
|
|
||||||
|
# Test disk monitor
|
||||||
|
./disk-monitor.sh --test
|
||||||
|
|
||||||
|
# Test SSL monitor
|
||||||
|
./ssl-monitor.sh --test
|
||||||
|
|
||||||
|
# Test master monitor
|
||||||
|
./monitor-all.sh --test
|
||||||
|
```
|
||||||
|
|
||||||
|
Expected output: Each script should run without errors and show `[INFO]` messages.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Cron Configuration
|
||||||
|
|
||||||
|
### Create Monitoring Cron Jobs
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# On production server
|
||||||
|
crontab -e
|
||||||
|
```
|
||||||
|
|
||||||
|
Add the following cron jobs:
|
||||||
|
|
||||||
|
```cron
|
||||||
|
# Tractatus Production Monitoring
|
||||||
|
# Logs: /var/log/tractatus/monitoring.log
|
||||||
|
|
||||||
|
# Master monitoring (every 5 minutes)
|
||||||
|
# Runs: health check, log monitor, disk monitor
|
||||||
|
*/5 * * * * /var/www/tractatus/scripts/monitoring/monitor-all.sh --skip-ssl >> /var/log/tractatus/cron-monitor.log 2>&1
|
||||||
|
|
||||||
|
# SSL certificate check (daily at 3am)
|
||||||
|
0 3 * * * /var/www/tractatus/scripts/monitoring/ssl-monitor.sh >> /var/log/tractatus/cron-ssl.log 2>&1
|
||||||
|
|
||||||
|
# Disk monitor (every 15 minutes - separate from master for frequency control)
|
||||||
|
*/15 * * * * /var/www/tractatus/scripts/monitoring/disk-monitor.sh >> /var/log/tractatus/cron-disk.log 2>&1
|
||||||
|
```
|
||||||
|
|
||||||
|
### Verify Cron Jobs
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# List active cron jobs
|
||||||
|
crontab -l
|
||||||
|
|
||||||
|
# Check cron logs
|
||||||
|
sudo journalctl -u cron -f
|
||||||
|
|
||||||
|
# Wait 5 minutes, then check monitoring logs
|
||||||
|
tail -f /var/log/tractatus/cron-monitor.log
|
||||||
|
```
|
||||||
|
|
||||||
|
### Alternative: Systemd Timers (Optional)
|
||||||
|
|
||||||
|
A more modern alternative to cron that provides better logging and failure handling.
|
||||||
|
|
||||||
|
**Create timer file**: `/etc/systemd/system/tractatus-monitoring.timer`
|
||||||
|
|
||||||
|
```ini
|
||||||
|
[Unit]
|
||||||
|
Description=Tractatus Monitoring Timer
|
||||||
|
Requires=tractatus-monitoring.service
|
||||||
|
|
||||||
|
[Timer]
|
||||||
|
OnBootSec=5min
|
||||||
|
OnUnitActiveSec=5min
|
||||||
|
AccuracySec=1s
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=timers.target
|
||||||
|
```
|
||||||
|
|
||||||
|
**Create service file**: `/etc/systemd/system/tractatus-monitoring.service`
|
||||||
|
|
||||||
|
```ini
|
||||||
|
[Unit]
|
||||||
|
Description=Tractatus Production Monitoring
|
||||||
|
After=network.target tractatus.service
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=oneshot
|
||||||
|
User=ubuntu
|
||||||
|
WorkingDirectory=/var/www/tractatus
|
||||||
|
ExecStart=/var/www/tractatus/scripts/monitoring/monitor-all.sh --skip-ssl
|
||||||
|
StandardOutput=journal
|
||||||
|
StandardError=journal
|
||||||
|
Environment="ALERT_EMAIL=your-email@example.com"
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=multi-user.target
|
||||||
|
```
|
||||||
|
|
||||||
|
**Enable and start:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo systemctl daemon-reload
|
||||||
|
sudo systemctl enable tractatus-monitoring.timer
|
||||||
|
sudo systemctl start tractatus-monitoring.timer
|
||||||
|
|
||||||
|
# Check status
|
||||||
|
sudo systemctl status tractatus-monitoring.timer
|
||||||
|
sudo systemctl list-timers
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Alert Configuration
|
||||||
|
|
||||||
|
### Alert Thresholds
|
||||||
|
|
||||||
|
**Health Check:**
|
||||||
|
- Consecutive failures: 3 (alerts on 3rd failure)
|
||||||
|
- Check interval: 5 minutes
|
||||||
|
- Time to alert: roughly 10 to 15 minutes of downtime (on the third failed check)
|
||||||
|
|
||||||
|
**Log Monitor:**
|
||||||
|
- Error threshold: 10 errors in 5 minutes
|
||||||
|
- Critical threshold: 3 critical errors in 5 minutes
|
||||||
|
- Security events: Immediate alert
|
||||||
|
|
||||||
|
**Disk Space:**
|
||||||
|
- Warning: 80% usage
|
||||||
|
- Critical: 90% usage
|
||||||
|
|
||||||
|
**SSL Certificate:**
|
||||||
|
- Warning: 30 days until expiry
|
||||||
|
- Critical: 7 days until expiry
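
The consecutive-failure rule for the health check can be sketched with a small counter kept in a state file. `STATE_FILE` and `MAX_FAILURES` mirror variable names used in `health-check.sh`, but `record_result` is a hypothetical helper and the script's real bookkeeping may differ.

```bash
#!/bin/bash
# Sketch only: record_result is an illustrative helper, not code from
# health-check.sh. The state file holds a single integer.
STATE_FILE="${STATE_FILE:-/var/tmp/tractatus-health-state}"
MAX_FAILURES="${MAX_FAILURES:-3}"

# Record one check result ("ok" or "fail").
# Returns 0 only when the consecutive-failure threshold has been reached.
record_result() {
    local status="$1"
    local count=0
    if [[ -f "$STATE_FILE" ]]; then
        count=$(cat "$STATE_FILE")
    fi
    if [[ "$status" == "ok" ]]; then
        count=0                 # any success resets the streak
    else
        count=$((count + 1))
    fi
    echo "$count" > "$STATE_FILE"
    [[ "$count" -ge "$MAX_FAILURES" ]]
}
```

With a 5-minute schedule and `MAX_FAILURES=3`, a caller would send an alert only when `record_result fail` succeeds, which first happens on the third failed run in a row.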
|
||||||
|
|
||||||
|
### Customize Alerts
|
||||||
|
|
||||||
|
Edit thresholds in scripts:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Health check thresholds
|
||||||
|
vi /var/www/tractatus/scripts/monitoring/health-check.sh
|
||||||
|
# Change: MAX_FAILURES=3
|
||||||
|
|
||||||
|
# Log monitor thresholds
|
||||||
|
vi /var/www/tractatus/scripts/monitoring/log-monitor.sh
|
||||||
|
# Change: ERROR_THRESHOLD=10
|
||||||
|
# Change: CRITICAL_THRESHOLD=3
|
||||||
|
|
||||||
|
# Disk monitor thresholds
|
||||||
|
vi /var/www/tractatus/scripts/monitoring/disk-monitor.sh
|
||||||
|
# Change: WARN_THRESHOLD=80
|
||||||
|
# Change: CRITICAL_THRESHOLD=90
|
||||||
|
|
||||||
|
# SSL monitor thresholds
|
||||||
|
vi /var/www/tractatus/scripts/monitoring/ssl-monitor.sh
|
||||||
|
# Change: WARN_DAYS=30
|
||||||
|
# Change: CRITICAL_DAYS=7
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Manual Monitoring Commands
|
||||||
|
|
||||||
|
### Check Current Status
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Run all monitors manually
|
||||||
|
cd /var/www/tractatus/scripts/monitoring
|
||||||
|
./monitor-all.sh
|
||||||
|
|
||||||
|
# Run individual monitors
|
||||||
|
./health-check.sh
|
||||||
|
./log-monitor.sh --since "1 hour ago"
|
||||||
|
./disk-monitor.sh
|
||||||
|
./ssl-monitor.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
### View Monitoring Logs
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# View all monitoring logs
|
||||||
|
tail -f /var/log/tractatus/monitoring.log
|
||||||
|
|
||||||
|
# View specific monitor logs
|
||||||
|
tail -f /var/log/tractatus/health-check.log
|
||||||
|
tail -f /var/log/tractatus/log-monitor.log
|
||||||
|
tail -f /var/log/tractatus/disk-monitor.log
|
||||||
|
tail -f /var/log/tractatus/ssl-monitor.log
|
||||||
|
|
||||||
|
# View cron execution logs
|
||||||
|
tail -f /var/log/tractatus/cron-monitor.log
|
||||||
|
```
|
||||||
|
|
||||||
|
### Test Alert Delivery
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Send test alert
|
||||||
|
cd /var/www/tractatus/scripts/monitoring
|
||||||
|
|
||||||
|
# In test mode no email is sent; any alert that would fire is logged
# as "would send alert" instead
|
||||||
|
./health-check.sh --test
|
||||||
|
|
||||||
|
# Force alert by temporarily stopping service
|
||||||
|
sudo systemctl stop tractatus
|
||||||
|
./health-check.sh # Alerts only on the 3rd consecutive failed run (about 15 minutes under the 5-minute cron schedule)
|
||||||
|
sudo systemctl start tractatus
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
### No Alerts Received
|
||||||
|
|
||||||
|
**Check email configuration:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Verify ALERT_EMAIL is set
|
||||||
|
echo $ALERT_EMAIL
|
||||||
|
|
||||||
|
# Test mail command
|
||||||
|
echo "Test email" | mail -s "Test Subject" $ALERT_EMAIL
|
||||||
|
|
||||||
|
# Check mail logs
|
||||||
|
sudo tail -f /var/log/mail.log
|
||||||
|
```
|
||||||
|
|
||||||
|
**Check cron execution:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Verify cron jobs are installed
|
||||||
|
crontab -l
|
||||||
|
|
||||||
|
# Check cron logs
|
||||||
|
sudo journalctl -u cron -n 50
|
||||||
|
|
||||||
|
# Check script logs
|
||||||
|
tail -100 /var/log/tractatus/cron-monitor.log
|
||||||
|
```
|
||||||
|
|
||||||
|
### Scripts Not Executing
|
||||||
|
|
||||||
|
**Check permissions:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ls -la /var/www/tractatus/scripts/monitoring/
|
||||||
|
# Should show: -rwxr-xr-x (executable)
|
||||||
|
|
||||||
|
# Fix if needed
|
||||||
|
chmod +x /var/www/tractatus/scripts/monitoring/*.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
**Check cron PATH:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Add to crontab
|
||||||
|
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
|
||||||
|
|
||||||
|
# Or use full paths in cron commands
|
||||||
|
```
|
||||||
|
|
||||||
|
### High Alert Frequency
|
||||||
|
|
||||||
|
**Increase thresholds:**
|
||||||
|
|
||||||
|
Edit threshold values in scripts (see Alert Configuration section).
|
||||||
|
|
||||||
|
**Increase consecutive failure count:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
vi /var/www/tractatus/scripts/monitoring/health-check.sh
|
||||||
|
# Increase MAX_FAILURES from 3 to 5 or higher
|
||||||
|
```
|
||||||
|
|
||||||
|
### False Positives
|
||||||
|
|
||||||
|
**Review alert conditions:**
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Check recent logs to understand why alerts triggered
|
||||||
|
tail -100 /var/log/tractatus/monitoring.log
|
||||||
|
|
||||||
|
# Run manual check with verbose output
|
||||||
|
./health-check.sh
|
||||||
|
|
||||||
|
# Check if service is actually unhealthy
|
||||||
|
sudo systemctl status tractatus
|
||||||
|
curl https://agenticgovernance.digital/health
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Monitoring Dashboard (Optional - Future Enhancement)
|
||||||
|
|
||||||
|
### Option 1: Grafana + Prometheus
|
||||||
|
|
||||||
|
Self-hosted metrics dashboard (requires setup).
|
||||||
|
|
||||||
|
### Option 2: Simple Web Dashboard
|
||||||
|
|
||||||
|
Create minimal status page showing last check results.
|
||||||
|
|
||||||
|
### Option 3: UptimeRobot Free Tier
|
||||||
|
|
||||||
|
External monitoring service (privacy tradeoff).
|
||||||
|
|
||||||
|
**Not implemented yet** - current solution uses email alerts only.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Best Practices
|
||||||
|
|
||||||
|
### DO:
|
||||||
|
- ✅ Test monitoring scripts before deploying
|
||||||
|
- ✅ Check alert emails regularly
|
||||||
|
- ✅ Review monitoring logs weekly
|
||||||
|
- ✅ Adjust thresholds based on actual patterns
|
||||||
|
- ✅ Document any monitoring configuration changes
|
||||||
|
- ✅ Keep monitoring scripts updated
|
||||||
|
|
||||||
|
### DON'T:
|
||||||
|
- ❌ Ignore alert emails
|
||||||
|
- ❌ Set thresholds too low (alert fatigue)
|
||||||
|
- ❌ Deploy monitoring without testing
|
||||||
|
- ❌ Disable monitoring without planning
|
||||||
|
- ❌ Let log files grow unbounded
|
||||||
|
- ❌ Ignore repeated warnings
|
||||||
|
|
||||||
|
### Monitoring Hygiene
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Rotate monitoring logs weekly
|
||||||
|
sudo logrotate /etc/logrotate.d/tractatus-monitoring
|
||||||
|
|
||||||
|
# Clean up old state files
|
||||||
|
find /var/tmp -name "tractatus-*-state" -mtime +7 -delete
|
||||||
|
|
||||||
|
# Review alert frequency monthly
|
||||||
|
grep "\[ALERT\]" /var/log/tractatus/monitoring.log | wc -l
|
||||||
|
```
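
The hygiene commands refer to `/etc/logrotate.d/tractatus-monitoring`, which this document does not show. A possible configuration is sketched below, wrapped in a hypothetical helper `write_logrotate_config` so it can be written to any path for review first. The 8-week retention and the compression settings are assumptions; the `create` mode and owner follow the log permissions used elsewhere in this guide.

```bash
#!/bin/bash
# Sketch only: write_logrotate_config is an illustrative helper, and the
# retention settings are assumptions rather than project policy.

# Write a weekly logrotate policy for the monitoring logs to the given path.
write_logrotate_config() {
    local target="$1"
    cat > "$target" <<'EOF'
/var/log/tractatus/*.log {
    weekly
    rotate 8
    compress
    delaycompress
    missingok
    notifempty
    create 640 ubuntu ubuntu
}
EOF
}
```

For example, `write_logrotate_config /tmp/tractatus-monitoring` followed by `sudo install -m 644 /tmp/tractatus-monitoring /etc/logrotate.d/tractatus-monitoring` would put the policy in place after inspection.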
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Incident Response
|
||||||
|
|
||||||
|
### When Alert Received
|
||||||
|
|
||||||
|
1. **Acknowledge alert** - Note time received
|
||||||
|
2. **Check current status** - Run manual health check
|
||||||
|
3. **Review logs** - Check what triggered alert
|
||||||
|
4. **Investigate root cause** - See deployment checklist emergency procedures
|
||||||
|
5. **Take action** - Fix issue or escalate
|
||||||
|
6. **Document** - Create incident report
|
||||||
|
|
||||||
|
### Critical Alert Response Time
|
||||||
|
|
||||||
|
- **Health check failure**: Respond within 15 minutes
|
||||||
|
- **Log errors**: Respond within 30 minutes
|
||||||
|
- **Disk space critical**: Respond within 1 hour
|
||||||
|
- **SSL expiry (7 days)**: Respond within 24 hours
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Maintenance
|
||||||
|
|
||||||
|
### Weekly Tasks
|
||||||
|
|
||||||
|
- [ ] Review monitoring logs for patterns
|
||||||
|
- [ ] Check alert email inbox
|
||||||
|
- [ ] Verify cron jobs still running
|
||||||
|
- [ ] Review disk space trends
|
||||||
|
|
||||||
|
### Monthly Tasks
|
||||||
|
|
||||||
|
- [ ] Review and adjust alert thresholds
|
||||||
|
- [ ] Clean up old monitoring logs
|
||||||
|
- [ ] Test manual failover procedures
|
||||||
|
- [ ] Update monitoring documentation
|
||||||
|
|
||||||
|
### Quarterly Tasks
|
||||||
|
|
||||||
|
- [ ] Full monitoring system audit
|
||||||
|
- [ ] Test all alert scenarios
|
||||||
|
- [ ] Review incident response times
|
||||||
|
- [ ] Consider monitoring enhancements
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Monitoring Metrics
|
||||||
|
|
||||||
|
### Success Metrics
|
||||||
|
|
||||||
|
- **Uptime**: Target 99.9% (about 43 minutes of downtime in a 30-day month)
|
||||||
|
- **Alert Response Time**: < 30 minutes for critical
|
||||||
|
- **False Positive Rate**: < 5% of alerts
|
||||||
|
- **Detection Time**: < 5 minutes for critical issues
|
||||||
|
|
||||||
|
### Tracking
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Count successful health checks (a rough input for estimating uptime)
|
||||||
|
grep "Health endpoint OK" /var/log/tractatus/monitoring.log | wc -l
|
||||||
|
|
||||||
|
# Count alerts sent
|
||||||
|
grep "Alert email sent" /var/log/tractatus/monitoring.log | wc -l
|
||||||
|
|
||||||
|
# Review response times (manual from incident reports)
|
||||||
|
```
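
To relate these counts to the uptime target, the arithmetic can be sketched as below. Both helper names are hypothetical. The example figures assume a 30-day month and the 5-minute check interval, which gives 30 × 24 × 12 = 8640 expected checks.

```bash
#!/bin/bash
# Sketch only: both helpers are illustrative and not part of the scripts.

# Allowed downtime in minutes for an uptime target (percent) over N days.
downtime_budget_minutes() {
    local target_percent="$1" days="${2:-30}"
    awk -v t="$target_percent" -v d="$days" \
        'BEGIN { printf "%.1f\n", d * 24 * 60 * (100 - t) / 100 }'
}

# Uptime percentage from successful checks versus expected checks.
uptime_percent() {
    local ok_count="$1" expected="$2"
    awk -v ok="$ok_count" -v n="$expected" \
        'BEGIN { printf "%.2f\n", ok * 100 / n }'
}
```

For instance, `downtime_budget_minutes 99.9 30` prints `43.2`, and `uptime_percent "$(grep -c "Health endpoint OK" /var/log/tractatus/monitoring.log)" 8640` gives an estimate for a log covering one 30-day period.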
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Security Considerations
|
||||||
|
|
||||||
|
### Log Access Control
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Ensure logs are readable only by ubuntu user and root
|
||||||
|
sudo chown ubuntu:ubuntu /var/log/tractatus/*.log
|
||||||
|
sudo chmod 640 /var/log/tractatus/*.log
|
||||||
|
```
|
||||||
|
|
||||||
|
### Alert Email Security
|
||||||
|
|
||||||
|
- Use encrypted email if possible (ProtonMail)
|
||||||
|
- Don't include sensitive data in alert body
|
||||||
|
- Alerts show symptoms, not credentials
|
||||||
|
|
||||||
|
### Monitoring Script Security
|
||||||
|
|
||||||
|
- Scripts run as ubuntu user (not root)
|
||||||
|
- No credentials embedded in scripts
|
||||||
|
- Use environment variables for sensitive config
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Future Enhancements
|
||||||
|
|
||||||
|
### Planned Improvements
|
||||||
|
|
||||||
|
- [ ] **Metrics collection**: Store monitoring metrics in database for trend analysis
|
||||||
|
- [ ] **Status page**: Public status page showing service availability
|
||||||
|
- [ ] **Mobile alerts**: SMS or push notifications for critical alerts
|
||||||
|
- [ ] **Distributed monitoring**: Multiple monitoring locations for redundancy
|
||||||
|
- [ ] **Automated remediation**: Auto-restart service on failure
|
||||||
|
- [ ] **Performance monitoring**: Response time tracking, query performance
|
||||||
|
- [ ] **User impact monitoring**: Track error rates from user perspective
|
||||||
|
|
||||||
|
### Integration Opportunities
|
||||||
|
|
||||||
|
- [ ] **Plausible Analytics**: Monitor traffic patterns, correlate with errors
|
||||||
|
- [ ] **GitHub Actions**: Run monitoring checks in CI/CD
|
||||||
|
- [ ] **Slack integration**: Send alerts to Slack channel
|
||||||
|
- [ ] **Database backup monitoring**: Alert on backup failures
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Support & Documentation
|
||||||
|
|
||||||
|
**Monitoring Scripts**: `/var/www/tractatus/scripts/monitoring/`
|
||||||
|
**Monitoring Logs**: `/var/log/tractatus/`
|
||||||
|
**Cron Configuration**: `crontab -l` (ubuntu user)
|
||||||
|
**Alert Email**: Set via `ALERT_EMAIL` environment variable
|
||||||
|
|
||||||
|
**Related Documents:**
|
||||||
|
- [Production Deployment Checklist](PRODUCTION_DEPLOYMENT_CHECKLIST.md)
|
||||||
|
- [Phase 4 Preparation Checklist](../PHASE-4-PREPARATION-CHECKLIST.md)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Document Status**: Ready for Production
|
||||||
|
**Last Updated**: 2025-10-09
|
||||||
|
**Next Review**: After 1 month of monitoring data
|
||||||
|
**Maintainer**: Technical Lead (Claude Code + John Stroh)
|
||||||
257
scripts/monitoring/disk-monitor.sh
Executable file
|
|
@@ -0,0 +1,257 @@
|
||||||
|
#!/bin/bash
|
||||||
|
#
|
||||||
|
# Disk Space Monitoring Script
|
||||||
|
# Monitors disk space usage and alerts when thresholds exceeded
|
||||||
|
#
|
||||||
|
# Usage:
|
||||||
|
# ./disk-monitor.sh # Check all monitored paths
|
||||||
|
# ./disk-monitor.sh --test # Test mode (no alerts)
|
||||||
|
#
|
||||||
|
# Exit codes:
|
||||||
|
# 0 = OK
|
||||||
|
# 1 = Warning threshold exceeded
|
||||||
|
# 2 = Critical threshold exceeded
|
||||||
|
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
# Configuration
|
||||||
|
ALERT_EMAIL="${ALERT_EMAIL:-}"
|
||||||
|
LOG_FILE="/var/log/tractatus/disk-monitor.log"
|
||||||
|
WARN_THRESHOLD=80 # Warn at 80% usage
|
||||||
|
CRITICAL_THRESHOLD=90 # Critical at 90% usage
|
||||||
|
|
||||||
|
# Paths to monitor
|
||||||
|
declare -A MONITORED_PATHS=(
|
||||||
|
["/"]="Root filesystem"
|
||||||
|
["/var"]="Var directory"
|
||||||
|
["/var/log"]="Log directory"
|
||||||
|
["/var/www/tractatus"]="Tractatus application"
|
||||||
|
["/tmp"]="Temp directory"
|
||||||
|
)
|
||||||
|
|
||||||
|
# Parse arguments
|
||||||
|
TEST_MODE=false
|
||||||
|
|
||||||
|
while [[ $# -gt 0 ]]; do
|
||||||
|
case $1 in
|
||||||
|
--test)
|
||||||
|
TEST_MODE=true
|
||||||
|
shift
|
||||||
|
;;
|
||||||
|
*)
|
||||||
|
echo "Unknown option: $1"
|
||||||
|
exit 3
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
done
|
||||||
|
|
||||||
|
# Logging function
|
||||||
|
log() {
|
||||||
|
local level="$1"
|
||||||
|
shift
|
||||||
|
local message="$*"
|
||||||
|
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
|
||||||
|
|
||||||
|
echo "[$timestamp] [$level] $message"
|
||||||
|
|
||||||
|
if [[ -d "$(dirname "$LOG_FILE")" ]]; then
|
||||||
|
echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Send alert email
|
||||||
|
send_alert() {
|
||||||
|
local subject="$1"
|
||||||
|
local body="$2"
|
||||||
|
|
||||||
|
if [[ "$TEST_MODE" == "true" ]]; then
|
||||||
|
log "INFO" "TEST MODE: Would send alert: $subject"
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ -z "$ALERT_EMAIL" ]]; then
|
||||||
|
log "WARN" "No alert email configured (ALERT_EMAIL not set)"
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
if command -v mail &> /dev/null; then
|
||||||
|
echo "$body" | mail -s "$subject" "$ALERT_EMAIL"
|
||||||
|
log "INFO" "Alert email sent to $ALERT_EMAIL"
|
||||||
|
elif command -v sendmail &> /dev/null; then
|
||||||
|
{
|
||||||
|
echo "Subject: $subject"
|
||||||
|
echo "From: tractatus-monitoring@agenticgovernance.digital"
|
||||||
|
echo "To: $ALERT_EMAIL"
|
||||||
|
echo ""
|
||||||
|
echo "$body"
|
||||||
|
} | sendmail "$ALERT_EMAIL"
|
||||||
|
log "INFO" "Alert email sent via sendmail to $ALERT_EMAIL"
|
||||||
|
else
|
||||||
|
log "WARN" "No email command available"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Get disk usage for path
|
||||||
|
get_disk_usage() {
|
||||||
|
local path="$1"
|
||||||
|
|
||||||
|
# Check if path exists
|
||||||
|
if [[ ! -e "$path" ]]; then
|
||||||
|
echo "N/A"
|
||||||
|
return 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Get usage percentage (remove % sign)
|
||||||
|
df -h "$path" 2>/dev/null | awk 'NR==2 {print $5}' | sed 's/%//' || echo "N/A"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Get human-readable disk usage details
|
||||||
|
get_disk_details() {
|
||||||
|
local path="$1"
|
||||||
|
|
||||||
|
if [[ ! -e "$path" ]]; then
|
||||||
|
echo "Path does not exist"
|
||||||
|
return 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
df -h "$path" 2>/dev/null | awk 'NR==2 {printf "Size: %s | Used: %s | Avail: %s | Use%%: %s | Mounted: %s\n", $2, $3, $4, $5, $6}'
|
||||||
|
}
|
||||||
|
|
||||||
|
# Find largest directories in path
|
||||||
|
find_largest_dirs() {
|
||||||
|
local path="$1"
|
||||||
|
local limit="${2:-10}"
|
||||||
|
|
||||||
|
if [[ ! -e "$path" ]]; then
|
||||||
|
return 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
du -sh "$path"/* 2>/dev/null | sort -rh | head -n "$limit" || echo "Unable to scan directory"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Check single path
|
||||||
|
check_path() {
|
||||||
|
local path="$1"
|
||||||
|
local description="$2"
|
||||||
|
|
||||||
|
local usage=$(get_disk_usage "$path")
|
||||||
|
|
||||||
|
if [[ "$usage" == "N/A" ]]; then
|
||||||
|
log "WARN" "$description ($path): Unable to check"
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ "$usage" -ge "$CRITICAL_THRESHOLD" ]]; then
|
||||||
|
log "CRITICAL" "$description ($path): ${usage}% used (>= $CRITICAL_THRESHOLD%)"
|
||||||
|
return 2
|
||||||
|
elif [[ "$usage" -ge "$WARN_THRESHOLD" ]]; then
|
||||||
|
log "WARN" "$description ($path): ${usage}% used (>= $WARN_THRESHOLD%)"
|
||||||
|
return 1
|
||||||
|
else
|
||||||
|
log "INFO" "$description ($path): ${usage}% used"
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Main monitoring function
|
||||||
|
main() {
|
||||||
|
log "INFO" "Starting disk space monitoring"
|
||||||
|
|
||||||
|
local max_severity=0
|
||||||
|
local issues=()
|
||||||
|
local critical_paths=()
|
||||||
|
local warning_paths=()
|
||||||
|
|
||||||
|
# Check all monitored paths
|
||||||
|
for path in "${!MONITORED_PATHS[@]}"; do
|
||||||
|
local description="${MONITORED_PATHS[$path]}"
|
||||||
|
local exit_code=0
|
||||||
|
|
||||||
|
check_path "$path" "$description" || exit_code=$?
|
||||||
|
|
||||||
|
if [[ "$exit_code" -eq 2 ]]; then
|
||||||
|
max_severity=2
|
||||||
|
critical_paths+=("$path (${description})")
|
||||||
|
elif [[ "$exit_code" -eq 1 ]]; then
|
||||||
|
[[ "$max_severity" -lt 1 ]] && max_severity=1
|
||||||
|
warning_paths+=("$path (${description})")
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
# Send alerts if thresholds exceeded
|
||||||
|
if [[ "$max_severity" -eq 2 ]]; then
|
||||||
|
local subject="[CRITICAL] Tractatus Disk Space Critical"
|
||||||
|
local body="CRITICAL: Disk space usage has exceeded ${CRITICAL_THRESHOLD}% on one or more paths.
|
||||||
|
|
||||||
|
Critical Paths (>= ${CRITICAL_THRESHOLD}%):
|
||||||
|
$(printf -- "- %s\n" "${critical_paths[@]}")
|
||||||
|
"
|
||||||
|
|
||||||
|
# Add warning paths if any
|
||||||
|
if [[ "${#warning_paths[@]}" -gt 0 ]]; then
|
||||||
|
body+="
|
||||||
|
Warning Paths (>= ${WARN_THRESHOLD}%):
|
||||||
|
$(printf -- "- %s\n" "${warning_paths[@]}")
|
||||||
|
"
|
||||||
|
fi
|
||||||
|
|
||||||
|
body+="
|
||||||
|
Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
|
||||||
|
Host: $(hostname)
|
||||||
|
|
||||||
|
Disk Usage Details:
|
||||||
|
$(df -h)
|
||||||
|
|
||||||
|
Largest directories in /var/www/tractatus:
|
||||||
|
$(find_largest_dirs /var/www/tractatus 10)
|
||||||
|
|
||||||
|
Largest log files:
|
||||||
|
$(du -h /var/log/tractatus/*.log 2>/dev/null | sort -rh | head -10 || echo "No log files found")
|
||||||
|
|
||||||
|
Action Required:
|
||||||
|
1. Clean up old log files
|
||||||
|
2. Remove unnecessary files
|
||||||
|
3. Check for runaway processes creating large files
|
||||||
|
4. Consider expanding disk space
|
||||||
|
|
||||||
|
Clean up commands:
|
||||||
|
# Rotate old logs
|
||||||
|
sudo journalctl --vacuum-time=7d
|
||||||
|
|
||||||
|
# Clean up npm cache
|
||||||
|
npm cache clean --force
|
||||||
|
|
||||||
|
# Find large files
|
||||||
|
find /var/www/tractatus -type f -size +100M -exec ls -lh {} \;
|
||||||
|
"
|
||||||
|
|
||||||
|
send_alert "$subject" "$body"
|
||||||
|
log "CRITICAL" "Disk space alert sent"
|
||||||
|
|
||||||
|
elif [[ "$max_severity" -eq 1 ]]; then
|
||||||
|
local subject="[WARN] Tractatus Disk Space Warning"
|
||||||
|
local body="WARNING: Disk space usage has exceeded ${WARN_THRESHOLD}% on one or more paths.
|
||||||
|
|
||||||
|
Warning Paths (>= ${WARN_THRESHOLD}%):
|
||||||
|
$(printf -- "- %s\n" "${warning_paths[@]}")
|
||||||
|
|
||||||
|
Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
|
||||||
|
Host: $(hostname)
|
||||||
|
|
||||||
|
Disk Usage:
|
||||||
|
$(df -h)
|
||||||
|
|
||||||
|
Please review disk usage and clean up if necessary.
|
||||||
|
"
|
||||||
|
|
||||||
|
send_alert "$subject" "$body"
|
||||||
|
log "WARN" "Disk space warning sent"
|
||||||
|
else
|
||||||
|
log "INFO" "All monitored paths within acceptable limits"
|
||||||
|
fi
|
||||||
|
|
||||||
|
exit $max_severity
|
||||||
|
}
|
||||||
|
|
||||||
|
# Run main function
|
||||||
|
main
|
||||||
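The threshold logic behind the disk alerts can be sketched in isolation. This is a minimal illustration under stated assumptions, not the script's actual implementation: `usage_percent` and `classify_usage` are hypothetical helper names, and the 80/90 defaults are taken from the alert triggers listed in the commit message.

```shell
#!/bin/bash
# Hypothetical helpers (not part of disk-monitor.sh) illustrating the
# warning/critical threshold check.

# Print the usage percentage (digits only) of the filesystem holding a path.
# df -P forces the POSIX single-line-per-filesystem output format.
usage_percent() {
    df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Map a usage percentage to a severity: 0 = OK, 1 = warning, 2 = critical.
# Thresholds are inclusive, matching the ">= threshold" wording of the alerts.
classify_usage() {
    local usage="$1" warn="${2:-80}" crit="${3:-90}"
    if (( usage >= crit )); then
        echo 2
    elif (( usage >= warn )); then
        echo 1
    else
        echo 0
    fi
}
```

A caller could combine the two as `classify_usage "$(usage_percent /var/log)"` and keep the highest value seen across all monitored paths as the exit code.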
269
scripts/monitoring/health-check.sh
Executable file
|
|
@@ -0,0 +1,269 @@
|
||||||
|
#!/bin/bash
|
||||||
|
#
|
||||||
|
# Health Check Monitoring Script
|
||||||
|
# Monitors Tractatus application health endpoint and service status
|
||||||
|
#
|
||||||
|
# Usage:
|
||||||
|
# ./health-check.sh # Run check, alert if issues
|
||||||
|
# ./health-check.sh --quiet # Suppress output unless error
|
||||||
|
# ./health-check.sh --test # Test mode (no alerts)
|
||||||
|
#
|
||||||
|
# Exit codes:
|
||||||
|
# 0 = Healthy
# 1 = One or more checks failed (service, health endpoint, database, or disk)
# 3 = Configuration error
|
||||||
|
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
# Configuration
|
||||||
|
HEALTH_URL="${HEALTH_URL:-https://agenticgovernance.digital/health}"
|
||||||
|
SERVICE_NAME="${SERVICE_NAME:-tractatus}"
|
||||||
|
ALERT_EMAIL="${ALERT_EMAIL:-}"
|
||||||
|
LOG_FILE="/var/log/tractatus/health-check.log"
|
||||||
|
STATE_FILE="/var/tmp/tractatus-health-state"
|
||||||
|
MAX_FAILURES=3 # Alert after 3 consecutive failures
|
||||||
|
|
||||||
|
# Parse arguments
|
||||||
|
QUIET=false
|
||||||
|
TEST_MODE=false
|
||||||
|
|
||||||
|
while [[ $# -gt 0 ]]; do
|
||||||
|
case $1 in
|
||||||
|
--quiet) QUIET=true; shift ;;
|
||||||
|
--test) TEST_MODE=true; shift ;;
|
||||||
|
*) echo "Unknown option: $1"; exit 3 ;;
|
||||||
|
esac
|
||||||
|
done
|
||||||
|
|
||||||
|
# Logging function
|
||||||
|
log() {
|
||||||
|
local level="$1"
|
||||||
|
shift
|
||||||
|
local message="$*"
|
||||||
|
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
|
||||||
|
|
||||||
|
if [[ "$QUIET" != "true" ]] || [[ "$level" == "ERROR" ]] || [[ "$level" == "CRITICAL" ]]; then
|
||||||
|
echo "[$timestamp] [$level] $message"
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Log to file if directory exists
|
||||||
|
if [[ -d "$(dirname "$LOG_FILE")" ]]; then
|
||||||
|
echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Get current failure count
|
||||||
|
get_failure_count() {
|
||||||
|
if [[ -f "$STATE_FILE" ]]; then
|
||||||
|
cat "$STATE_FILE"
|
||||||
|
else
|
||||||
|
echo "0"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Increment failure count
|
||||||
|
increment_failure_count() {
|
||||||
|
local count=$(get_failure_count)
|
||||||
|
echo $((count + 1)) > "$STATE_FILE"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Reset failure count
|
||||||
|
reset_failure_count() {
|
||||||
|
echo "0" > "$STATE_FILE"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Send alert email
|
||||||
|
send_alert() {
|
||||||
|
local subject="$1"
|
||||||
|
local body="$2"
|
||||||
|
|
||||||
|
if [[ "$TEST_MODE" == "true" ]]; then
|
||||||
|
log "INFO" "TEST MODE: Would send alert: $subject"
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ -z "$ALERT_EMAIL" ]]; then
|
||||||
|
log "WARN" "No alert email configured (ALERT_EMAIL not set)"
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Try to send email using mail command (if available)
|
||||||
|
if command -v mail &> /dev/null; then
|
||||||
|
echo "$body" | mail -s "$subject" "$ALERT_EMAIL"
|
||||||
|
log "INFO" "Alert email sent to $ALERT_EMAIL"
|
||||||
|
elif command -v sendmail &> /dev/null; then
|
||||||
|
{
|
||||||
|
echo "Subject: $subject"
|
||||||
|
echo "From: tractatus-monitoring@agenticgovernance.digital"
|
||||||
|
echo "To: $ALERT_EMAIL"
|
||||||
|
echo ""
|
||||||
|
echo "$body"
|
||||||
|
} | sendmail "$ALERT_EMAIL"
|
||||||
|
log "INFO" "Alert email sent via sendmail to $ALERT_EMAIL"
|
||||||
|
else
|
||||||
|
log "WARN" "No email command available (install mailutils or sendmail)"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Check health endpoint
|
||||||
|
check_health_endpoint() {
|
||||||
|
log "INFO" "Checking health endpoint: $HEALTH_URL"
|
||||||
|
|
||||||
|
# Make HTTP request with timeout
|
||||||
|
local response
|
||||||
|
local http_code
|
||||||
|
|
||||||
|
response=$(curl -s -w "\n%{http_code}" --max-time 10 "$HEALTH_URL" 2>&1) || {
|
||||||
|
log "ERROR" "Health endpoint request failed: $response"
|
||||||
|
return 1
|
||||||
|
}
|
||||||
|
|
||||||
|
# Extract HTTP code (last line)
|
||||||
|
http_code=$(echo "$response" | tail -n 1)
|
||||||
|
|
||||||
|
# Extract response body (everything except last line)
|
||||||
|
local body=$(echo "$response" | sed '$d')
|
||||||
|
|
||||||
|
# Check HTTP status
|
||||||
|
if [[ "$http_code" != "200" ]]; then
|
||||||
|
log "ERROR" "Health endpoint returned HTTP $http_code"
|
||||||
|
return 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Check response contains expected JSON
|
||||||
|
if ! echo "$body" | jq -e '.status == "ok"' &> /dev/null; then
|
||||||
|
log "ERROR" "Health endpoint response invalid: $body"
|
||||||
|
return 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
log "INFO" "Health endpoint OK (HTTP $http_code)"
|
||||||
|
return 0
|
||||||
|
}
|
||||||
|
|
||||||
|
# Check systemd service status
|
||||||
|
check_service_status() {
|
||||||
|
log "INFO" "Checking service status: $SERVICE_NAME"
|
||||||
|
|
||||||
|
if ! systemctl is-active --quiet "$SERVICE_NAME"; then
|
||||||
|
log "ERROR" "Service $SERVICE_NAME is not active"
|
||||||
|
return 2
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Check if service is enabled
|
||||||
|
if ! systemctl is-enabled --quiet "$SERVICE_NAME"; then
|
||||||
|
log "WARN" "Service $SERVICE_NAME is not enabled (won't start on boot)"
|
||||||
|
fi
|
||||||
|
|
||||||
|
log "INFO" "Service $SERVICE_NAME is active"
|
||||||
|
return 0
|
||||||
|
}
|
||||||
|
|
||||||
|
# Check database connectivity (quick MongoDB ping)
|
||||||
|
check_database() {
|
||||||
|
log "INFO" "Checking database connectivity"
|
||||||
|
|
||||||
|
# Try to connect to MongoDB (timeout 5 seconds)
|
||||||
|
if ! timeout 5 mongosh --quiet --eval "db.adminCommand('ping')" localhost:27017/tractatus_prod &> /dev/null; then
|
||||||
|
log "ERROR" "Database connection failed"
|
||||||
|
return 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
log "INFO" "Database connectivity OK"
|
||||||
|
return 0
|
||||||
|
}
|
||||||
|
|
||||||
|
# Check disk space
|
||||||
|
check_disk_space() {
|
||||||
|
log "INFO" "Checking disk space"
|
||||||
|
|
||||||
|
# Get root filesystem usage percentage
|
||||||
|
local usage=$(df -h / | awk 'NR==2 {print $5}' | sed 's/%//')
|
||||||
|
|
||||||
|
if [[ "$usage" -gt 90 ]]; then
|
||||||
|
log "CRITICAL" "Disk space critical: ${usage}% used"
|
||||||
|
return 1
|
||||||
|
elif [[ "$usage" -gt 80 ]]; then
|
||||||
|
log "WARN" "Disk space high: ${usage}% used"
|
||||||
|
else
|
||||||
|
log "INFO" "Disk space OK: ${usage}% used"
|
||||||
|
fi
|
||||||
|
|
||||||
|
return 0
|
||||||
|
}
|
||||||
|
|
||||||
|
# Main health check
|
||||||
|
main() {
|
||||||
|
log "INFO" "Starting health check"
|
||||||
|
|
||||||
|
local all_healthy=true
|
||||||
|
local issues=()
|
||||||
|
|
||||||
|
# Run all checks
|
||||||
|
if ! check_service_status; then
|
||||||
|
all_healthy=false
|
||||||
|
issues+=("Service not running")
|
||||||
|
fi
|
||||||
|
|
||||||
|
if ! check_health_endpoint; then
|
||||||
|
all_healthy=false
|
||||||
|
issues+=("Health endpoint failed")
|
||||||
|
fi
|
||||||
|
|
||||||
|
if ! check_database; then
|
||||||
|
all_healthy=false
|
||||||
|
issues+=("Database connectivity failed")
|
||||||
|
fi
|
||||||
|
|
||||||
|
if ! check_disk_space; then
|
||||||
|
all_healthy=false
|
||||||
|
issues+=("Disk space issue")
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Handle results
|
||||||
|
if [[ "$all_healthy" == "true" ]]; then
|
||||||
|
log "INFO" "All health checks passed ✓"
|
||||||
|
reset_failure_count
|
||||||
|
exit 0
|
||||||
|
else
|
||||||
|
log "ERROR" "Health check failed: ${issues[*]}"
|
||||||
|
increment_failure_count
|
||||||
|
|
||||||
|
local failure_count=$(get_failure_count)
|
||||||
|
log "WARN" "Consecutive failures: $failure_count/$MAX_FAILURES"
|
||||||
|
|
||||||
|
# Alert if threshold reached
|
||||||
|
if [[ "$failure_count" -ge "$MAX_FAILURES" ]]; then
|
||||||
|
local subject="[ALERT] Tractatus Health Check Failed ($failure_count failures)"
|
||||||
|
local body="Tractatus health check has failed $failure_count times consecutively.
|
||||||
|
|
||||||
|
Issues detected:
|
||||||
|
$(printf -- "- %s\n" "${issues[@]}")
|
||||||
|
|
||||||
|
Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
|
||||||
|
Host: $(hostname)
|
||||||
|
Service: $SERVICE_NAME
|
||||||
|
Health URL: $HEALTH_URL
|
||||||
|
|
||||||
|
Please investigate immediately.
|
||||||
|
|
||||||
|
View logs:
|
||||||
|
sudo journalctl -u $SERVICE_NAME -n 100
|
||||||
|
|
||||||
|
Check service status:
|
||||||
|
sudo systemctl status $SERVICE_NAME
|
||||||
|
|
||||||
|
Restart service:
|
||||||
|
sudo systemctl restart $SERVICE_NAME
|
||||||
|
"
|
||||||
|
|
||||||
|
send_alert "$subject" "$body"
|
||||||
|
log "CRITICAL" "Alert sent after $failure_count consecutive failures"
|
||||||
|
fi
|
||||||
|
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Run main function
|
||||||
|
main
|
||||||
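The consecutive-failure tracking relies on a small counter stored in a state file. The sketch below shows one way such a counter could tolerate a missing or corrupted file. It is an illustration only: `read_counter` and `bump_counter` are hypothetical names, not functions from health-check.sh.

```shell
#!/bin/bash
# Hypothetical helpers (not part of health-check.sh) for a file-backed
# failure counter that survives a missing or corrupted state file.

# Print the stored count, or 0 if the file is absent or not a plain number.
read_counter() {
    local file="$1" value=""
    if [[ -f "$file" ]]; then
        value=$(<"$file")
    fi
    if [[ "$value" =~ ^[0-9]+$ ]]; then
        # 10# forces base 10 so a value such as "08" is not parsed as octal.
        echo "$((10#$value))"
    else
        echo 0
    fi
}

# Increment the stored count, write it back, and print the new value.
bump_counter() {
    local file="$1" next
    next=$(( $(read_counter "$file") + 1 ))
    echo "$next" > "$file"
    echo "$next"
}
```

With helpers like these, an alert would fire once `bump_counter "$STATE_FILE"` returns a value greater than or equal to `MAX_FAILURES`, and a successful run would write `0` back to the file.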
269
scripts/monitoring/log-monitor.sh
Executable file
|
|
@@ -0,0 +1,269 @@
|
||||||
|
#!/bin/bash
|
||||||
|
#
|
||||||
|
# Log Monitoring Script
|
||||||
|
# Monitors Tractatus service logs for errors, security events, and anomalies
|
||||||
|
#
|
||||||
|
# Usage:
|
||||||
|
# ./log-monitor.sh # Monitor the last 5 minutes (default window)
|
||||||
|
# ./log-monitor.sh --since "1 hour ago" # Monitor specific time window
|
||||||
|
# ./log-monitor.sh --follow # Continuous monitoring
|
||||||
|
# ./log-monitor.sh --test # Test mode (no alerts)
|
||||||
|
#
|
||||||
|
# Exit codes:
|
||||||
|
# 0 = No issues found
|
||||||
|
# 1 = Errors detected
|
||||||
|
# 2 = Critical errors detected
|
||||||
|
# 3 = Configuration error
|
||||||
|
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
# Configuration
|
||||||
|
SERVICE_NAME="${SERVICE_NAME:-tractatus}"
|
||||||
|
ALERT_EMAIL="${ALERT_EMAIL:-}"
|
||||||
|
LOG_FILE="/var/log/tractatus/log-monitor.log"
|
||||||
|
STATE_FILE="/var/tmp/tractatus-log-monitor-state"
|
||||||
|
ERROR_THRESHOLD=10 # Alert after 10 errors in window
|
||||||
|
CRITICAL_THRESHOLD=3 # Alert immediately after 3 critical errors
|
||||||
|
|
||||||
|
# Parse arguments
|
||||||
|
SINCE="5 minutes ago"
|
||||||
|
FOLLOW=false
|
||||||
|
TEST_MODE=false
|
||||||
|
|
||||||
|
while [[ $# -gt 0 ]]; do
|
||||||
|
case $1 in
|
||||||
|
--since)
|
||||||
|
SINCE="$2"
|
||||||
|
shift 2
|
||||||
|
;;
|
||||||
|
--follow)
|
||||||
|
FOLLOW=true
|
||||||
|
shift
|
||||||
|
;;
|
||||||
|
--test)
|
||||||
|
TEST_MODE=true
|
||||||
|
shift
|
||||||
|
;;
|
||||||
|
*)
|
||||||
|
echo "Unknown option: $1"
|
||||||
|
exit 3
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
done
|
||||||
|
|
||||||
|
# Logging function
|
||||||
|
log() {
|
||||||
|
local level="$1"
|
||||||
|
shift
|
||||||
|
local message="$*"
|
||||||
|
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
|
||||||
|
|
||||||
|
echo "[$timestamp] [$level] $message"
|
||||||
|
|
||||||
|
# Log to file if directory exists
|
||||||
|
if [[ -d "$(dirname "$LOG_FILE")" ]]; then
|
||||||
|
echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Send alert email
|
||||||
|
send_alert() {
|
||||||
|
local subject="$1"
|
||||||
|
local body="$2"
|
||||||
|
|
||||||
|
if [[ "$TEST_MODE" == "true" ]]; then
|
||||||
|
log "INFO" "TEST MODE: Would send alert: $subject"
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ -z "$ALERT_EMAIL" ]]; then
|
||||||
|
log "WARN" "No alert email configured (ALERT_EMAIL not set)"
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
if command -v mail &> /dev/null; then
|
||||||
|
echo "$body" | mail -s "$subject" "$ALERT_EMAIL"
|
||||||
|
log "INFO" "Alert email sent to $ALERT_EMAIL"
|
||||||
|
elif command -v sendmail &> /dev/null; then
|
||||||
|
{
|
||||||
|
echo "Subject: $subject"
|
||||||
|
echo "From: tractatus-monitoring@agenticgovernance.digital"
|
||||||
|
echo "To: $ALERT_EMAIL"
|
||||||
|
echo ""
|
||||||
|
echo "$body"
|
||||||
|
} | sendmail "$ALERT_EMAIL"
|
||||||
|
log "INFO" "Alert email sent via sendmail to $ALERT_EMAIL"
|
||||||
|
else
|
||||||
|
log "WARN" "No email command available"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Extract errors from logs
|
||||||
|
extract_errors() {
|
||||||
|
local since="$1"
|
||||||
|
|
||||||
|
# Get logs since specified time
|
||||||
|
sudo journalctl -u "$SERVICE_NAME" --since "$since" --no-pager 2>/dev/null || {
|
||||||
|
log "ERROR" "Failed to read journal for $SERVICE_NAME"
|
||||||
|
return 1
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
# Analyze log patterns
|
||||||
|
analyze_logs() {
|
||||||
|
local logs="$1"
|
||||||
|
|
||||||
|
# Count different severity levels
|
||||||
|
local error_count=$(echo "$logs" | grep -ci "\[ERROR\]" || true)
|
||||||
|
local critical_count=$(echo "$logs" | grep -ci "\[CRITICAL\]" || true)
|
||||||
|
local warn_count=$(echo "$logs" | grep -ci "\[WARN\]" || true)
|
||||||
|
|
||||||
|
# Security-related patterns
|
||||||
|
local security_count=$(echo "$logs" | grep -ciE "(SECURITY|unauthorized|forbidden|authentication failed)" || true)
|
||||||
|
|
||||||
|
# Database errors
|
||||||
|
local db_error_count=$(echo "$logs" | grep -ciE "(mongodb|database|connection.*failed)" || true)
|
||||||
|
|
||||||
|
# HTTP errors
|
||||||
|
local http_error_count=$(echo "$logs" | grep -ciE "HTTP.*50[0-9]|Internal Server Error" || true)
|
||||||
|
|
||||||
|
# Unhandled exceptions
|
||||||
|
local exception_count=$(echo "$logs" | grep -ciE "(Unhandled.*exception|TypeError|ReferenceError)" || true)
|
||||||
|
|
||||||
|
log "INFO" "Log analysis: CRITICAL=$critical_count ERROR=$error_count WARN=$warn_count SECURITY=$security_count DB_ERROR=$db_error_count HTTP_ERROR=$http_error_count EXCEPTION=$exception_count"
|
||||||
|
|
||||||
|
# Determine severity
|
||||||
|
if [[ "$critical_count" -ge "$CRITICAL_THRESHOLD" ]]; then
|
||||||
|
log "CRITICAL" "Critical error threshold exceeded: $critical_count critical errors"
|
||||||
|
return 2
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ "$error_count" -ge "$ERROR_THRESHOLD" ]]; then
|
||||||
|
log "ERROR" "Error threshold exceeded: $error_count errors"
|
||||||
|
return 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ "$security_count" -gt 0 ]]; then
|
||||||
|
log "WARN" "Security events detected: $security_count events"
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ "$db_error_count" -gt 5 ]]; then
|
||||||
|
log "WARN" "Database errors detected: $db_error_count errors"
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ "$exception_count" -gt 0 ]]; then
|
||||||
|
log "WARN" "Unhandled exceptions detected: $exception_count exceptions"
|
||||||
|
fi
|
||||||
|
|
||||||
|
return 0
|
||||||
|
}
|
||||||
|
|
||||||
|
# Extract top error messages
|
||||||
|
get_top_errors() {
|
||||||
|
local logs="$1"
|
||||||
|
local limit="${2:-10}"
|
||||||
|
|
||||||
|
echo "$logs" | grep -iE "\[ERROR\]|\[CRITICAL\]" | \
|
||||||
|
sed 's/^.*\] //' | \
|
||||||
|
sort | uniq -c | sort -rn | head -n "$limit"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Main monitoring function
|
||||||
|
main() {
|
||||||
|
log "INFO" "Starting log monitoring (since: $SINCE)"
|
||||||
|
|
||||||
|
# Extract logs
|
||||||
|
local logs
|
||||||
|
logs=$(extract_errors "$SINCE") || {
|
||||||
|
log "ERROR" "Failed to extract logs"
|
||||||
|
exit 3
|
||||||
|
}
|
||||||
|
|
||||||
|
# Count total log entries
|
||||||
|
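# Count non-empty lines only; piping an empty string through "wc -l" would still report 1
local log_count=$(printf '%s' "$logs" | grep -c . || true)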
|
||||||
|
log "INFO" "Analyzing $log_count log entries"
|
||||||
|
|
||||||
|
if [[ "$log_count" -eq 0 ]]; then
|
||||||
|
log "INFO" "No logs found in time window"
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Analyze logs
|
||||||
|
local exit_code=0
|
||||||
|
analyze_logs "$logs" || exit_code=$?
|
||||||
|
|
||||||
|
# If errors detected, send alert
|
||||||
|
if [[ "$exit_code" -ne 0 ]]; then
|
||||||
|
local severity="ERROR"
|
||||||
|
[[ "$exit_code" -eq 2 ]] && severity="CRITICAL"
|
||||||
|
|
||||||
|
local subject="[ALERT] Tractatus Log Monitoring - $severity Detected"
|
||||||
|
|
||||||
|
# Extract top 10 error messages
|
||||||
|
local top_errors=$(get_top_errors "$logs" 10)
|
||||||
|
|
||||||
|
local body="Log monitoring detected $severity level issues in Tractatus service.
|
||||||
|
|
||||||
|
Time Window: $SINCE
|
||||||
|
Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
|
||||||
|
Host: $(hostname)
|
||||||
|
Service: $SERVICE_NAME
|
||||||
|
|
||||||
|
Top Error Messages:
|
||||||
|
$top_errors
|
||||||
|
|
||||||
|
Recent Critical/Error Logs:
|
||||||
|
$(echo "$logs" | grep -iE "\[ERROR\]|\[CRITICAL\]" | tail -n 20)
|
||||||
|
|
||||||
|
Full logs:
|
||||||
|
sudo journalctl -u $SERVICE_NAME --since \"$SINCE\"
|
||||||
|
|
||||||
|
Check service status:
|
||||||
|
sudo systemctl status $SERVICE_NAME
|
||||||
|
"
|
||||||
|
|
||||||
|
send_alert "$subject" "$body"
|
||||||
|
else
|
||||||
|
log "INFO" "No significant issues detected"
|
||||||
|
fi
|
||||||
|
|
||||||
|
exit $exit_code
|
||||||
|
}
|
||||||
|
|
||||||
|
# Follow mode (continuous monitoring)
|
||||||
|
follow_logs() {
|
||||||
|
log "INFO" "Starting continuous log monitoring"
|
||||||
|
|
||||||
|
sudo journalctl -u "$SERVICE_NAME" -f --no-pager | while read -r line; do
|
||||||
|
# Check for error patterns
|
||||||
|
if echo "$line" | grep -qiE "\[ERROR\]|\[CRITICAL\]"; then
|
||||||
|
log "ERROR" "$line"
|
||||||
|
|
||||||
|
# Extract error message
|
||||||
|
local error_msg=$(echo "$line" | sed 's/^.*\] //')
|
||||||
|
|
||||||
|
# Check for critical patterns
|
||||||
|
if echo "$line" | grep -qiE "\[CRITICAL\]|Unhandled.*exception|Database.*failed|Service.*crashed"; then
|
||||||
|
local subject="[CRITICAL] Tractatus Error Detected"
|
||||||
|
local body="Critical error detected in Tractatus logs:
|
||||||
|
|
||||||
|
$line
|
||||||
|
|
||||||
|
Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
|
||||||
|
Host: $(hostname)
|
||||||
|
|
||||||
|
Recent logs:
|
||||||
|
$(sudo journalctl -u $SERVICE_NAME -n 10 --no-pager)
|
||||||
|
"
|
||||||
|
send_alert "$subject" "$body"
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
}
|
||||||
|
|
||||||
|
# Run appropriate mode
|
||||||
|
if [[ "$FOLLOW" == "true" ]]; then
|
||||||
|
follow_logs
|
||||||
|
else
|
||||||
|
main
|
||||||
|
fi
|
||||||
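Counting pattern matches with `grep -c` has a subtlety worth spelling out: when nothing matches, grep still prints `0` but exits with status 1, so a fallback that echoes another `0` produces two lines of output. The sketch below shows a counting helper that always yields a single integer. `count_matches` is a hypothetical name used for illustration, not a function from log-monitor.sh.

```shell
#!/bin/bash
# Hypothetical helper (not part of log-monitor.sh): count the lines of a text
# that match an extended regular expression, case-insensitively.
count_matches() {
    local pattern="$1" text="$2"
    # grep -c prints the count itself, including "0" on no match, but exits 1
    # in that case. "|| true" discards the status without printing a second 0.
    printf '%s\n' "$text" | grep -ciE -- "$pattern" || true
}
```

For example, `count_matches '\[ERROR\]' "$logs"` returns a value that can be compared directly with `-ge` against a threshold such as `ERROR_THRESHOLD`.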
178
scripts/monitoring/monitor-all.sh
Executable file
|
|
@@ -0,0 +1,178 @@
|
||||||
|
#!/bin/bash
|
||||||
|
#
|
||||||
|
# Master Monitoring Script
|
||||||
|
# Orchestrates all monitoring checks for Tractatus production environment
|
||||||
|
#
|
||||||
|
# Usage:
|
||||||
|
# ./monitor-all.sh # Run all monitors
|
||||||
|
# ./monitor-all.sh --test # Test mode (no alerts)
|
||||||
|
# ./monitor-all.sh --skip-ssl # Skip SSL check
|
||||||
|
#
|
||||||
|
# Exit codes:
|
||||||
|
# 0 = All checks passed
|
||||||
|
# 1 = Some warnings
|
||||||
|
# 2 = Some critical issues
|
||||||
|
# 3 = Configuration error
|
||||||
|
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
# Configuration
|
||||||
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
|
LOG_FILE="/var/log/tractatus/monitoring.log"
|
||||||
|
ALERT_EMAIL="${ALERT_EMAIL:-}"
|
||||||
|
|
||||||
|
# Parse arguments
|
||||||
|
TEST_MODE=false
|
||||||
|
SKIP_SSL=false
|
||||||
|
|
||||||
|
while [[ $# -gt 0 ]]; do
|
||||||
|
case $1 in
|
||||||
|
--test)
|
||||||
|
TEST_MODE=true
|
||||||
|
shift
|
||||||
|
;;
|
||||||
|
--skip-ssl)
|
||||||
|
SKIP_SSL=true
|
||||||
|
shift
|
||||||
|
;;
|
||||||
|
*)
|
||||||
|
echo "Unknown option: $1"
|
||||||
|
exit 3
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
done
|
||||||
|
|
||||||
|
# Export configuration for child scripts
|
||||||
|
export ALERT_EMAIL
|
||||||
|
[[ "$TEST_MODE" == "true" ]] && TEST_FLAG="--test" || TEST_FLAG=""
|
||||||
|
|
||||||
|
# Logging function
|
||||||
|
log() {
|
||||||
|
local level="$1"
|
||||||
|
shift
|
||||||
|
local message="$*"
|
||||||
|
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
|
||||||
|
|
||||||
|
echo "[$timestamp] [$level] $message"
|
||||||
|
|
||||||
|
if [[ -d "$(dirname "$LOG_FILE")" ]]; then
|
||||||
|
echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Run monitoring check
|
||||||
|
run_check() {
|
||||||
|
local name="$1"
|
||||||
|
local script="$2"
|
||||||
|
shift 2
|
||||||
|
log "INFO" "Running $name..."

local exit_code=0
# Pass the remaining arguments through individually so quoted values such as "5 minutes ago" stay intact
"$SCRIPT_DIR/$script" "$@" $TEST_FLAG || exit_code=$?
|
||||||
|
|
||||||
|
case $exit_code in
|
||||||
|
0)
|
||||||
|
log "INFO" "$name: OK ✓"
|
||||||
|
;;
|
||||||
|
1)
|
||||||
|
log "WARN" "$name: Warning"
|
||||||
|
;;
|
||||||
|
2)
|
||||||
|
log "CRITICAL" "$name: Critical"
|
||||||
|
;;
|
||||||
|
*)
|
||||||
|
log "ERROR" "$name: Error (exit code: $exit_code)"
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
|
||||||
|
return $exit_code
|
||||||
|
}
|
||||||
|
|
||||||
|
# Main monitoring function
|
||||||
|
main() {
|
||||||
|
log "INFO" "=== Starting Tractatus Monitoring Suite ==="
|
||||||
|
log "INFO" "Timestamp: $(date '+%Y-%m-%d %H:%M:%S %Z')"
|
||||||
|
log "INFO" "Host: $(hostname)"
|
||||||
|
[[ "$TEST_MODE" == "true" ]] && log "INFO" "TEST MODE: Alerts suppressed"
|
||||||
|
|
||||||
|
local max_severity=0
|
||||||
|
local checks_run=0
|
||||||
|
local checks_passed=0
|
||||||
|
local checks_warned=0
|
||||||
|
local checks_critical=0
|
||||||
|
local checks_failed=0
|
||||||
|
|
||||||
|
# Health Check
# Counters use assignment form: ((var++)) returns status 1 when var is 0, which aborts the script under "set -e"
if run_check "Health Check" "health-check.sh"; then
checks_passed=$((checks_passed + 1))
else
local exit_code=$?
[[ $exit_code -eq 1 ]] && checks_warned=$((checks_warned + 1))
[[ $exit_code -eq 2 ]] && checks_critical=$((checks_critical + 1))
[[ $exit_code -ge 3 ]] && checks_failed=$((checks_failed + 1))
[[ $exit_code -gt $max_severity ]] && max_severity=$exit_code
fi
checks_run=$((checks_run + 1))
|
||||||
|
|
||||||
|
# Log Monitor
# Counters use assignment form: ((var++)) returns status 1 when var is 0, which aborts the script under "set -e"
if run_check "Log Monitor" "log-monitor.sh" --since "5 minutes ago"; then
checks_passed=$((checks_passed + 1))
else
local exit_code=$?
[[ $exit_code -eq 1 ]] && checks_warned=$((checks_warned + 1))
[[ $exit_code -eq 2 ]] && checks_critical=$((checks_critical + 1))
[[ $exit_code -ge 3 ]] && checks_failed=$((checks_failed + 1))
[[ $exit_code -gt $max_severity ]] && max_severity=$exit_code
fi
checks_run=$((checks_run + 1))
|
||||||
|
|
||||||
|
# Disk Monitor
# Counters use assignment form: ((var++)) returns status 1 when var is 0, which aborts the script under "set -e"
if run_check "Disk Monitor" "disk-monitor.sh"; then
checks_passed=$((checks_passed + 1))
else
local exit_code=$?
[[ $exit_code -eq 1 ]] && checks_warned=$((checks_warned + 1))
[[ $exit_code -eq 2 ]] && checks_critical=$((checks_critical + 1))
[[ $exit_code -ge 3 ]] && checks_failed=$((checks_failed + 1))
[[ $exit_code -gt $max_severity ]] && max_severity=$exit_code
fi
checks_run=$((checks_run + 1))
|
||||||
|
|
||||||
|
# SSL Monitor (optional)
# Counters use assignment form: ((var++)) returns status 1 when var is 0, which aborts the script under "set -e"
if [[ "$SKIP_SSL" != "true" ]]; then
if run_check "SSL Monitor" "ssl-monitor.sh"; then
checks_passed=$((checks_passed + 1))
else
local exit_code=$?
[[ $exit_code -eq 1 ]] && checks_warned=$((checks_warned + 1))
[[ $exit_code -eq 2 ]] && checks_critical=$((checks_critical + 1))
[[ $exit_code -ge 3 ]] && checks_failed=$((checks_failed + 1))
[[ $exit_code -gt $max_severity ]] && max_severity=$exit_code
fi
checks_run=$((checks_run + 1))
fi
|
||||||
|
|
||||||
|
# Summary
|
||||||
|
log "INFO" "=== Monitoring Summary ==="
|
||||||
|
log "INFO" "Checks run: $checks_run"
|
||||||
|
log "INFO" "Passed: $checks_passed | Warned: $checks_warned | Critical: $checks_critical | Failed: $checks_failed"
|
||||||
|
|
||||||
|
if [[ $max_severity -eq 0 ]]; then
|
||||||
|
log "INFO" "All monitoring checks passed ✓"
|
||||||
|
elif [[ $max_severity -eq 1 ]]; then
|
||||||
|
log "WARN" "Some checks returned warnings"
|
||||||
|
elif [[ $max_severity -eq 2 ]]; then
|
||||||
|
log "CRITICAL" "Some checks returned critical alerts"
|
||||||
|
else
|
||||||
|
log "ERROR" "Some checks failed"
|
||||||
|
fi
|
||||||
|
|
||||||
|
log "INFO" "=== Monitoring Complete ==="
|
||||||
|
|
||||||
|
exit $max_severity
|
||||||
|
}
|
||||||
|
|
||||||
|
# Run main function
|
||||||
|
main
|
||||||
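The per-check bookkeeping in the orchestration script follows one repeated pattern: classify an exit code, bump a counter, and track the highest severity. A minimal sketch of that pattern as a single function is shown here. `summarize_exit_codes` is a hypothetical name, and the mapping (0 = passed, 1 = warning, 2 = critical, 3 or more = failed) is taken from the exit-code table in the script header.

```shell
#!/bin/bash
# Hypothetical helper (not part of monitor-all.sh): fold a list of exit codes
# into summary counters and the maximum severity.
summarize_exit_codes() {
    local passed=0 warned=0 critical=0 failed=0 max=0 code
    for code in "$@"; do
        # Assignment-style increments always return status 0, so they are
        # safe under "set -e", unlike ((var++)) when var is 0.
        case "$code" in
            0) passed=$((passed + 1)) ;;
            1) warned=$((warned + 1)) ;;
            2) critical=$((critical + 1)) ;;
            *) failed=$((failed + 1)) ;;
        esac
        if (( code > max )); then
            max=$code
        fi
    done
    echo "passed=$passed warned=$warned critical=$critical failed=$failed max=$max"
}
```

Collecting each check's exit code into an array and passing it to a function like this would replace the four near-identical blocks with one call, at the cost of deferring the tally until all checks have run.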
319
scripts/monitoring/ssl-monitor.sh
Executable file
|
|
@@ -0,0 +1,319 @@
|
||||||
|
#!/bin/bash
|
||||||
|
#
|
||||||
|
# SSL Certificate Monitoring Script
|
||||||
|
# Monitors SSL certificate expiry and alerts before expiration
|
||||||
|
#
|
||||||
|
# Usage:
|
||||||
|
# ./ssl-monitor.sh # Check all domains
|
||||||
|
# ./ssl-monitor.sh --domain example.com # Check specific domain
|
||||||
|
# ./ssl-monitor.sh --test # Test mode (no alerts)
|
||||||
|
#
|
||||||
|
# Exit codes:
|
||||||
|
# 0 = OK
|
||||||
|
# 1 = Warning (expires soon)
|
||||||
|
# 2 = Critical (expires very soon)
|
||||||
|
# 3 = Expired or error
|
||||||
|
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
# Configuration
|
||||||
|
ALERT_EMAIL="${ALERT_EMAIL:-}"
|
||||||
|
LOG_FILE="/var/log/tractatus/ssl-monitor.log"
|
||||||
|
WARN_DAYS=30 # Warn 30 days before expiry
|
||||||
|
CRITICAL_DAYS=7 # Critical alert 7 days before expiry
|
||||||
|
|
||||||
|
# Default domains to monitor
|
||||||
|
DOMAINS=(
|
||||||
|
"agenticgovernance.digital"
|
||||||
|
)
|
||||||
|
|
||||||
|
# Parse arguments
|
||||||
|
TEST_MODE=false
|
||||||
|
SPECIFIC_DOMAIN=""
|
||||||
|
|
||||||
|
while [[ $# -gt 0 ]]; do
|
||||||
|
case $1 in
|
||||||
|
--domain)
|
||||||
|
SPECIFIC_DOMAIN="$2"
|
||||||
|
shift 2
|
||||||
|
;;
|
||||||
|
--test)
|
||||||
|
TEST_MODE=true
|
||||||
|
shift
|
||||||
|
;;
|
||||||
|
*)
|
||||||
|
echo "Unknown option: $1"
|
||||||
|
exit 3
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
done
|
||||||
|
|
||||||
|
# Override domains if specific domain provided
|
||||||
|
if [[ -n "$SPECIFIC_DOMAIN" ]]; then
|
||||||
|
DOMAINS=("$SPECIFIC_DOMAIN")
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Logging function
|
||||||
|
log() {
|
||||||
|
local level="$1"
|
||||||
|
shift
|
||||||
|
local message="$*"
|
||||||
|
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
|
||||||
|
|
||||||
|
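# Write to stderr so callers that capture a function's result with $(...) do not also capture log lines
echo "[$timestamp] [$level] $message" >&2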
|
||||||
|
|
||||||
|
if [[ -d "$(dirname "$LOG_FILE")" ]]; then
|
||||||
|
echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Send alert email
|
||||||
|
send_alert() {
|
||||||
|
local subject="$1"
|
||||||
|
local body="$2"
|
||||||
|
|
||||||
|
if [[ "$TEST_MODE" == "true" ]]; then
|
||||||
|
log "INFO" "TEST MODE: Would send alert: $subject"
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ -z "$ALERT_EMAIL" ]]; then
|
||||||
|
log "WARN" "No alert email configured (ALERT_EMAIL not set)"
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
if command -v mail &> /dev/null; then
|
||||||
|
echo "$body" | mail -s "$subject" "$ALERT_EMAIL"
|
||||||
|
log "INFO" "Alert email sent to $ALERT_EMAIL"
|
||||||
|
elif command -v sendmail &> /dev/null; then
|
||||||
|
{
|
||||||
|
echo "Subject: $subject"
|
||||||
|
echo "From: tractatus-monitoring@agenticgovernance.digital"
|
||||||
|
echo "To: $ALERT_EMAIL"
|
||||||
|
echo ""
|
||||||
|
echo "$body"
|
||||||
|
} | sendmail "$ALERT_EMAIL"
|
||||||
|
log "INFO" "Alert email sent via sendmail to $ALERT_EMAIL"
|
||||||
|
else
|
||||||
|
log "WARN" "No email command available"
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Get SSL certificate expiry date
|
||||||
|
get_cert_expiry() {
|
||||||
|
local domain="$1"
|
||||||
|
|
||||||
|
# Use openssl to get certificate
|
||||||
|
local expiry_date
|
||||||
|
expiry_date=$(echo | openssl s_client -servername "$domain" -connect "$domain:443" 2>/dev/null | \
|
||||||
|
openssl x509 -noout -enddate 2>/dev/null | \
|
||||||
|
cut -d= -f2) || {
|
||||||
|
log "ERROR" "Failed to retrieve certificate for $domain"
|
||||||
|
return 1
|
||||||
|
}
|
||||||
|
|
||||||
|
echo "$expiry_date"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Get days until expiry
|
||||||
|
get_days_until_expiry() {
|
||||||
|
local expiry_date="$1"
|
||||||
|
|
||||||
|
# Convert expiry date to seconds since epoch
|
||||||
|
local expiry_epoch
|
||||||
|
expiry_epoch=$(date -d "$expiry_date" +%s 2>/dev/null) || {
|
||||||
|
log "ERROR" "Failed to parse expiry date: $expiry_date"
|
||||||
|
return 1
|
||||||
|
}
|
||||||
|
|
||||||
|
# Get current time in seconds since epoch
|
||||||
|
local now_epoch=$(date +%s)
|
||||||
|
|
||||||
|
# Calculate days until expiry
|
||||||
|
local seconds_until_expiry=$((expiry_epoch - now_epoch))
|
||||||
|
local days_until_expiry=$((seconds_until_expiry / 86400))
|
||||||
|
|
||||||
|
echo "$days_until_expiry"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Get certificate details
|
||||||
|
get_cert_details() {
|
||||||
|
local domain="$1"
|
||||||
|
|
||||||
|
echo | openssl s_client -servername "$domain" -connect "$domain:443" 2>/dev/null | \
|
||||||
|
openssl x509 -noout -subject -issuer -dates 2>/dev/null || {
|
||||||
|
echo "Failed to retrieve certificate details"
|
||||||
|
return 1
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
# Check single domain
|
||||||
|
check_domain() {
|
||||||
|
local domain="$1"
|
||||||
|
|
||||||
|
log "INFO" "Checking SSL certificate for $domain"
|
||||||
|
|
||||||
|
# Get expiry date
|
||||||
|
local expiry_date
|
||||||
|
expiry_date=$(get_cert_expiry "$domain") || {
|
||||||
|
log "ERROR" "Failed to check certificate for $domain"
|
||||||
|
return 3
|
||||||
|
}
|
||||||
|
|
||||||
|
# Calculate days until expiry
|
||||||
|
local days_until_expiry
|
||||||
|
days_until_expiry=$(get_days_until_expiry "$expiry_date") || {
|
||||||
|
log "ERROR" "Failed to calculate expiry for $domain"
|
||||||
|
return 3
|
||||||
|
}
|
||||||
|
|
||||||
|
# Check if expired
|
||||||
|
if [[ "$days_until_expiry" -lt 0 ]]; then
|
||||||
|
log "CRITICAL" "$domain: Certificate EXPIRED ${days_until_expiry#-} days ago!"
|
||||||
|
return 3
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Check thresholds
|
||||||
|
if [[ "$days_until_expiry" -le "$CRITICAL_DAYS" ]]; then
|
||||||
|
log "CRITICAL" "$domain: Certificate expires in $days_until_expiry days (expires: $expiry_date)"
|
||||||
|
return 2
|
||||||
|
elif [[ "$days_until_expiry" -le "$WARN_DAYS" ]]; then
|
||||||
|
log "WARN" "$domain: Certificate expires in $days_until_expiry days (expires: $expiry_date)"
|
||||||
|
return 1
|
||||||
|
else
|
||||||
|
log "INFO" "$domain: Certificate valid for $days_until_expiry days (expires: $expiry_date)"
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Main monitoring function
main() {
    log "INFO" "Starting SSL certificate monitoring"

    local max_severity=0
    local expired_domains=()
    local critical_domains=()
    local warning_domains=()

    # Declared once, outside the loop, so that `local` does not mask the
    # exit status of the command substitutions below.
    local domain exit_code expiry_date days_until_expiry

    # Check all domains
    for domain in "${DOMAINS[@]}"; do
        exit_code=0
        # These two values are used only in the alert text; check_domain
        # decides the severity.
        expiry_date=$(get_cert_expiry "$domain" 2>/dev/null || echo "Unknown")
        days_until_expiry=$(get_days_until_expiry "$expiry_date" 2>/dev/null || echo "Unknown")

        check_domain "$domain" || exit_code=$?

        if [[ "$exit_code" -eq 3 ]]; then
            max_severity=3
            expired_domains+=("$domain (EXPIRED or ERROR)")
        elif [[ "$exit_code" -eq 2 ]]; then
            [[ "$max_severity" -lt 2 ]] && max_severity=2
            critical_domains+=("$domain (expires in $days_until_expiry days)")
        elif [[ "$exit_code" -eq 1 ]]; then
            [[ "$max_severity" -lt 1 ]] && max_severity=1
            warning_domains+=("$domain (expires in $days_until_expiry days)")
        fi
    done

    # Send alerts based on severity
    if [[ "$max_severity" -eq 3 ]]; then
        local subject="[CRITICAL] SSL Certificate Expired or Error"
        local body="CRITICAL: SSL certificate has expired or error occurred.

Expired/Error Domains:
$(printf -- "- %s\n" "${expired_domains[@]}")
"

        # Add other alerts if any
        if [[ "${#critical_domains[@]}" -gt 0 ]]; then
            body+="
Critical Domains (<= $CRITICAL_DAYS days):
$(printf -- "- %s\n" "${critical_domains[@]}")
"
        fi

        if [[ "${#warning_domains[@]}" -gt 0 ]]; then
            body+="
Warning Domains (<= $WARN_DAYS days):
$(printf -- "- %s\n" "${warning_domains[@]}")
"
        fi

        body+="
Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
Host: $(hostname)

Action Required:
1. Renew SSL certificate immediately
2. Check Let's Encrypt auto-renewal:
sudo certbot renew --dry-run

Certificate details:
$(get_cert_details "${DOMAINS[0]}")

Renewal commands:
# Test renewal
sudo certbot renew --dry-run

# Force renewal
sudo certbot renew --force-renewal

# Check certificate status
sudo certbot certificates
"

        send_alert "$subject" "$body"
        log "CRITICAL" "SSL certificate alert sent"

    elif [[ "$max_severity" -eq 2 ]]; then
        local subject="[CRITICAL] SSL Certificate Expires Soon"
        local body="CRITICAL: SSL certificate expires in $CRITICAL_DAYS days or less.

Critical Domains (<= $CRITICAL_DAYS days):
$(printf -- "- %s\n" "${critical_domains[@]}")
"

        if [[ "${#warning_domains[@]}" -gt 0 ]]; then
            body+="
Warning Domains (<= $WARN_DAYS days):
$(printf -- "- %s\n" "${warning_domains[@]}")
"
        fi

        body+="
Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
Host: $(hostname)

Please renew certificates soon.

Check renewal:
sudo certbot renew --dry-run
"

        send_alert "$subject" "$body"
        log "CRITICAL" "SSL expiry alert sent"

    elif [[ "$max_severity" -eq 1 ]]; then
        local subject="[WARN] SSL Certificate Expires Soon"
        local body="WARNING: SSL certificate expires in $WARN_DAYS days or less.

Warning Domains (<= $WARN_DAYS days):
$(printf -- "- %s\n" "${warning_domains[@]}")

Time: $(date '+%Y-%m-%d %H:%M:%S %Z')
Host: $(hostname)

Please plan certificate renewal.
"

        send_alert "$subject" "$body"
        log "WARN" "SSL expiry warning sent"
    else
        log "INFO" "All SSL certificates valid"
    fi

    exit "$max_severity"
}

# Run main function
main
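The threshold checks in `check_domain` and the `max_severity` tracking in `main` can be tried in isolation, without any network access. The following is only a minimal sketch: `classify_expiry` is a hypothetical helper that is not part of the script, and its defaults (30 and 7) are assumed from the warning and critical thresholds stated in the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical stand-alone version of the threshold logic (not in the commit).
# Prints a severity using the same codes the script returns:
#   0 = valid, 1 = warning, 2 = critical, 3 = expired
classify_expiry() {
    local days="$1"
    local warn="${2:-30}"      # assumed default: 30-day warning threshold
    local critical="${3:-7}"   # assumed default: 7-day critical threshold

    if (( days < 0 )); then
        echo 3
    elif (( days <= critical )); then
        echo 2
    elif (( days <= warn )); then
        echo 1
    else
        echo 0
    fi
}

# Keep the highest severity seen across several certificates,
# in the same way main() tracks max_severity. The day counts are made up.
max_severity=0
for days in 45 12 3; do
    severity=$(classify_expiry "$days")
    if (( severity > max_severity )); then
        max_severity=$severity
    fi
done

echo "max severity: $max_severity"
```

Because the comparisons use `<=`, a certificate with exactly 7 days left is already critical and one with exactly 30 days left already triggers a warning, which matches the `-le` tests in `check_domain`.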