# External Uptime Monitoring Setup Guide
This guide explains how to set up external uptime monitoring for the Tractatus Umami Analytics instance.
## Monitored Endpoints
### Primary Monitoring Target
- **URL**: `https://analytics.agenticgovernance.digital/api/heartbeat`
- **Expected Response**: HTTP 200 OK
- **Purpose**: Umami application health check
### Secondary Monitoring Targets (Optional)
- **URL**: `https://agenticgovernance.digital/`
- **Expected Response**: HTTP 200 OK
- **Purpose**: Main website availability
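
Before configuring an external service, both endpoints can be probed by hand from any machine that has `curl`. The sketch below is illustrative only: the function names `is_healthy_status` and `check_endpoint` are hypothetical, and the 30-second limit is an assumption that mirrors the monitor timeout suggested later in this guide.

```bash
# Illustrative sketch; function names are hypothetical, not part of the deployment.

# Succeeds only when the given HTTP status code is exactly 200.
is_healthy_status() {
  [ "$1" = "200" ]
}

# Fetches a URL and prints "<status> <url>".
# Returns non-zero unless the status is 200.
check_endpoint() {
  local url="$1" status
  # -s: silent, -o /dev/null: discard the body, -w: print only the status code.
  # curl reports 000 when the connection itself fails.
  status="$(curl -s -o /dev/null -w '%{http_code}' --max-time 30 "$url")"
  echo "$status $url"
  is_healthy_status "$status"
}
```

For example, `check_endpoint https://analytics.agenticgovernance.digital/api/heartbeat` followed by `check_endpoint https://agenticgovernance.digital/` covers both targets listed above.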
## Recommended Service: UptimeRobot (Free Tier)
UptimeRobot provides free uptime monitoring with:
- 50 monitors
- 5-minute check intervals
- Email/SMS alerts
- Status page generation
### Setup Instructions
#### 1. Create Account
1. Visit https://uptimerobot.com
2. Sign up for a free account
3. Verify your email address
#### 2. Add Analytics Monitor
1. Click "Add New Monitor"
2. Configure:
   - **Monitor Type**: HTTP(s)
   - **Friendly Name**: `Tractatus Analytics (Umami)`
   - **URL**: `https://analytics.agenticgovernance.digital/api/heartbeat`
   - **Monitoring Interval**: 5 minutes
   - **Monitor Timeout**: 30 seconds
   - **HTTP Method**: GET
   - **Expected Status Code**: 200
3. Click "Create Monitor"
#### 3. Add Main Website Monitor (Optional)
1. Click "Add New Monitor"
2. Configure:
   - **Monitor Type**: HTTP(s)
   - **Friendly Name**: `Tractatus Website`
   - **URL**: `https://agenticgovernance.digital/`
   - **Monitoring Interval**: 5 minutes
   - **Monitor Timeout**: 30 seconds
3. Click "Create Monitor"
#### 4. Configure Alert Contacts
1. Go to "My Settings" → "Alert Contacts"
2. Add email address for alerts
3. (Optional) Add SMS number for critical alerts
4. Configure alert preferences:
   - **Alert When**: Down
   - **Alert After**: 2 consecutive failures (10 minutes)
   - **Re-Alert After**: 30 minutes
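
The "10 minutes" figure follows from the check interval multiplied by the failure threshold. As a rough illustration of that arithmetic, the hypothetical helper below prints the best and worst case, assuming checks run at a fixed interval and ignoring request timeouts and notification delivery time.

```bash
# Illustrative arithmetic only; the function name is hypothetical.
# Prints the best- and worst-case minutes between an outage starting and the
# alert firing, given the check interval and the consecutive-failure threshold.
alert_delay_range() {
  local interval_min="$1" failures="$2"
  # Best case: the outage starts just before a check, so the first failure is
  # seen almost immediately and only (failures - 1) further intervals pass.
  local best=$(( interval_min * (failures - 1) ))
  # Worst case: the outage starts just after a successful check.
  local worst=$(( interval_min * failures ))
  echo "${best}-${worst} minutes"
}
```

With the settings above, `alert_delay_range 5 2` prints `5-10 minutes`.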
#### 5. Create Public Status Page (Optional)
1. Go to "Status Pages"
2. Click "Add Status Page"
3. Configure:
   - **Title**: Tractatus Services Status
   - **Custom Domain**: (optional) status.agenticgovernance.digital
   - **Monitors**: Select both monitors
4. Enable "Show Uptime Percentage"
5. Enable "Show Response Times"
## Alternative Services
### Pingdom
- **Free Tier**: 1 monitor
- **Check Interval**: 1 minute
- **URL**: https://www.pingdom.com
### Better Uptime
- **Free Tier**: 10 monitors
- **Check Interval**: 3 minutes
- **URL**: https://betteruptime.com
### StatusCake
- **Free Tier**: 10 monitors
- **Check Interval**: 5 minutes
- **URL**: https://www.statuscake.com
## Internal Monitoring (Already Configured)
The following internal monitoring is already set up:
### Docker Health Checks
- **Umami Container**: `curl -f http://localhost:3000/api/heartbeat`
  - Interval: 10 seconds
  - Timeout: 5 seconds
  - Retries: 5
- **PostgreSQL Container**: `pg_isready -U $POSTGRES_USER -d $POSTGRES_DB`
  - Interval: 5 seconds
  - Timeout: 5 seconds
  - Retries: 5
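
Docker records the outcome of these checks on each container, so the current state can be read back with `docker inspect`. The following is a minimal sketch: the function names are hypothetical, and the container names are taken from the commands used elsewhere in this guide. With a 10-second interval and 5 retries, the Umami container should be reported as `unhealthy` within about a minute of the heartbeat starting to fail.

```bash
# Illustrative sketch; function names are hypothetical.

# Prints the health status Docker reports for one container
# (starting, healthy, or unhealthy).
container_health() {
  docker inspect --format '{{.State.Health.Status}}' "$1"
}

# Prints one "<name>: <status>" line per container named in this guide.
report_health() {
  local name
  for name in tractatus-umami tractatus-umami-db; do
    echo "$name: $(container_health "$name")"
  done
}
```

Running `report_health` on the server gives a compact view of both containers without scanning the full `docker ps` output.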
### Automated Backups
- **Schedule**: Daily at 2:00 AM
- **Retention**: 7 days
- **Location**: `~/umami-backups/`
- **Script**: `~/umami-deployment/backup-umami-db.sh`
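
The contents of `backup-umami-db.sh` are not reproduced in this guide. As a hedged illustration of how a 7-day retention rule could be enforced, the sketch below deletes old archives with `find`; the function name is hypothetical, the file-name pattern is assumed from the restore example later in this guide, and `-maxdepth` and `-delete` assume GNU or BSD `find`.

```bash
# Illustrative sketch of 7-day retention; not the contents of backup-umami-db.sh.
# Deletes backup archives in a directory that are older than the given number of days.
prune_old_backups() {
  local dir="$1" days="$2"
  # -mtime +N matches files whose age, counted in whole 24-hour periods, exceeds N.
  find "$dir" -maxdepth 1 -type f -name 'umami_backup_*.sql.gz' -mtime +"$days" -delete
}
```

For example, `prune_old_backups ~/umami-backups 7` would apply the retention period listed above. Replacing `-delete` with `-print` gives a dry run.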
### Disk Usage Monitoring
- **Schedule**: Daily at 3:00 AM
- **Warning Threshold**: 80% disk usage
- **Critical Threshold**: 90% disk usage
- **Location**: `~/umami-backups/disk-monitoring.log`
- **Script**: `~/umami-deployment/monitor-disk-usage.sh`
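
The thresholds above amount to a three-level classification. The sketch below shows one way to express it; it is not the contents of `monitor-disk-usage.sh`, both function names are hypothetical, and it assumes that a value exactly equal to a threshold already counts as reaching it.

```bash
# Illustrative sketch; not the contents of monitor-disk-usage.sh.

# Maps a usage percentage to a level using the thresholds from this guide.
# Assumption: a value equal to a threshold already counts as reaching it.
classify_disk_usage() {
  local pct="$1"
  if [ "$pct" -ge 90 ]; then
    echo CRITICAL
  elif [ "$pct" -ge 80 ]; then
    echo WARNING
  else
    echo OK
  fi
}

# Prints the used percentage of the root filesystem as a bare integer.
root_disk_usage_pct() {
  # df -P prints one line per filesystem; column 5 is the capacity (e.g. 42%).
  df -P / | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

The two pieces combine as `classify_disk_usage "$(root_disk_usage_pct)"`, which prints `OK`, `WARNING`, or `CRITICAL` for the root filesystem.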
## Verification
To verify monitoring is working:
1. **Check Endpoint Manually**:
   ```bash
   curl -I https://analytics.agenticgovernance.digital/api/heartbeat
   # Should return: HTTP/2 200
   ```
2. **Test Alert Flow**:
   - Stop Umami container: `docker stop tractatus-umami`
   - Wait for alert (should arrive within 10 minutes)
   - Restart container: `docker start tractatus-umami`
   - Verify recovery alert
3. **Check Internal Monitoring**:
   ```bash
   # View Docker health status
   docker ps
   # Check backup logs
   tail -20 ~/umami-backups/backup.log
   # Check disk monitoring logs
   tail -20 ~/umami-backups/disk-monitoring.log
   ```
## Alert Response Procedures
### Analytics Down (5+ minutes)
1. Check Docker container status: `docker ps`
2. Check container logs: `docker logs tractatus-umami`
3. Check PostgreSQL status: `docker logs tractatus-umami-db`
4. If needed, restart: `cd ~/umami-deployment && docker compose restart`
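
After a restart, it helps to confirm that the heartbeat has actually recovered rather than waiting for the next external check. The following is a minimal polling sketch; the function name and the default of 12 attempts spaced 5 seconds apart are assumptions chosen for illustration.

```bash
# Illustrative sketch; the function name and defaults are hypothetical.
# Polls a URL until it returns HTTP 200, up to a maximum number of attempts.
wait_for_heartbeat() {
  local url="$1" attempts="${2:-12}" delay="${3:-5}" i=1 status
  while [ "$i" -le "$attempts" ]; do
    status="$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")"
    if [ "$status" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$(( i + 1 ))
  done
  echo "still failing after $attempts attempt(s); last status: $status"
  return 1
}
```

For example, `wait_for_heartbeat https://analytics.agenticgovernance.digital/api/heartbeat` can be run right after step 4; a non-zero exit status means the container logs from steps 2 and 3 deserve another look.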
### High Disk Usage (>80%)
1. Check backup retention: `ls -lh ~/umami-backups/`
2. Remove old backups manually if needed
3. Check PostgreSQL volume: `docker exec tractatus-umami-db du -sh /var/lib/postgresql/data`
4. Consider database cleanup or server upgrade
### Database Corruption
1. Stop Umami: `cd ~/umami-deployment && docker compose stop umami`
2. Restore from backup: `~/umami-deployment/restore-umami-db.sh ~/umami-backups/umami_backup_YYYYMMDD_HHMMSS.sql.gz`
3. Restart services: `cd ~/umami-deployment && docker compose up -d`
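
Before restoring, it is worth confirming that the chosen archive is intact. The sketch below picks the newest backup by file name and checks it with `gzip -t`; the function names are hypothetical, and the approach assumes every backup follows the `umami_backup_YYYYMMDD_HHMMSS.sql.gz` naming shown in step 2.

```bash
# Illustrative sketch; function names are hypothetical.

# Prints the newest backup archive in a directory. The timestamp in the file
# name (YYYYMMDD_HHMMSS) sorts lexicographically, so the last match is the newest.
latest_backup() {
  local dir="$1" f newest=""
  for f in "$dir"/umami_backup_*.sql.gz; do
    if [ -e "$f" ]; then newest="$f"; fi
  done
  if [ -z "$newest" ]; then return 1; fi
  echo "$newest"
}

# Succeeds only if the archive passes gzip's integrity test.
verify_backup() {
  gzip -t "$1"
}
```

A cautious restore could then look like `backup="$(latest_backup ~/umami-backups)" && verify_backup "$backup" && ~/umami-deployment/restore-umami-db.sh "$backup"`. Note that `gzip -t` only detects a damaged archive; it does not validate the SQL inside it.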
## Next Steps
- [ ] Sign up for UptimeRobot
- [ ] Add analytics.agenticgovernance.digital monitor
- [ ] Configure email alerts
- [ ] Test alert delivery
- [ ] (Optional) Create public status page
- [ ] Document response procedures in team wiki
## Maintenance
- Review monitoring logs monthly
- Test restore procedure quarterly
- Update alert contacts when team changes
- Review disk usage trends monthly
---
**Last Updated**: 2025-10-29
**Monitoring Status**: Internal monitoring active, external monitoring pending user setup