Add metrics monitoring stack with VictoriaMetrics, Grafana, and node_exporter
Implement complete monitoring infrastructure following rick-infra principles:

Components:
- VictoriaMetrics: Prometheus-compatible TSDB (7x less RAM usage)
- Grafana: Visualization dashboard with Authentik OAuth/OIDC integration
- node_exporter: System metrics collection (CPU, memory, disk, network)

Architecture:
- All services run as native systemd binaries (no containers)
- localhost-only binding for security
- Grafana uses native OAuth integration with Authentik (not forward_auth)
- Full systemd security hardening enabled
- Proxied via Caddy at metrics.jnss.me with HTTPS

Role Features:
- Unified metrics role (single role for complete stack)
- Automatic role mapping via Authentik groups:
  - authentik Admins OR grafana-admins -> Admin access
  - grafana-editors -> Editor access
  - All others -> Viewer access
- VictoriaMetrics auto-provisioned as default Grafana datasource
- 12-month metrics retention by default
- Comprehensive documentation included

Security:
- OAuth/OIDC SSO via Authentik
- All metrics services bind to 127.0.0.1 only
- systemd hardening (NoNewPrivileges, ProtectSystem, etc.)
- Grafana accessible only via Caddy HTTPS proxy

Documentation:
- roles/metrics/README.md: Complete role documentation
- docs/metrics-deployment-guide.md: Step-by-step deployment guide

Configuration:
- Updated rick-infra.yml to include metrics deployment
- Grafana port set to 3001 (Gitea uses 3000)
- Ready for multi-host expansion (designed for future node_exporter deployment to production hosts)
325
roles/metrics/README.md
Normal file
@@ -0,0 +1,325 @@
# Metrics Role

Complete monitoring stack for rick-infra providing system metrics collection, storage, and visualization with SSO integration.

## Components

### VictoriaMetrics

- **Purpose**: Time-series database for metrics storage
- **Type**: Native systemd service
- **Listen**: `127.0.0.1:8428` (localhost only)
- **Features**:
  - Prometheus-compatible API and PromQL
  - Up to 7x less RAM than Prometheus on high-cardinality workloads, according to the VictoriaMetrics documentation
  - Single binary deployment
  - 12-month data retention by default

### Grafana

- **Purpose**: Metrics visualization and dashboarding
- **Type**: Native systemd service
- **Listen**: `127.0.0.1:3001` (localhost only, proxied via Caddy; port 3000 is used by Gitea)
- **Domain**: `metrics.jnss.me`
- **Features**:
  - OAuth/OIDC integration with Authentik
  - Role-based access control via Authentik groups
  - VictoriaMetrics as default data source

### node_exporter

- **Purpose**: System metrics collection
- **Type**: Native systemd service
- **Listen**: `127.0.0.1:9100` (localhost only)
- **Metrics**: CPU, memory, disk, network, systemd units

## Architecture

```
┌─────────────────────────────────────────────────────┐
│ metrics.jnss.me (Grafana Dashboard)                 │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Caddy (HTTPS)                                   │ │
│ │       ↓                                         │ │
│ │ Grafana (OAuth → Authentik)                     │ │
│ │       ↓                                         │ │
│ │ VictoriaMetrics (Prometheus-compatible)         │ │
│ │       ↑                                         │ │
│ │ node_exporter (System Metrics)                  │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
```
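
Since every service in the diagram is meant to listen on localhost only, it can be useful to confirm that on the host. The following is a minimal sketch, not part of the role: `is_loopback` and `check_listeners` are hypothetical helper names, and the script assumes `ss` from iproute2 is available.

```bash
# Succeeds if a listen address (column 4 of `ss -tln`) is loopback-only.
is_loopback() {
  case "$1" in
    127.*:*|"[::1]":*) return 0 ;;
    *) return 1 ;;
  esac
}

# Reads `ss -H -tln` output on stdin and checks only the ports given as arguments.
# Prints one line per matching listener and returns non-zero if any is exposed.
check_listeners() {
  local re state recvq sendq addr rest rc=0
  re=":($(IFS='|'; echo "$*"))\$"
  while read -r state recvq sendq addr rest; do
    printf '%s\n' "$addr" | grep -Eq "$re" || continue
    if is_loopback "$addr"; then
      echo "ok   $addr"
    else
      echo "OPEN $addr"
      rc=1
    fi
  done
  return "$rc"
}

# Usage (pass the VictoriaMetrics, node_exporter, and Grafana ports configured for the host):
#   ss -H -tln | check_listeners 8428 9100 "$GRAFANA_PORT"
```

Any line starting with `OPEN` means the service is reachable on a non-loopback address and the bind setting should be reviewed.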

## Deployment

### Prerequisites

1. **Caddy role deployed** - Required for HTTPS proxy
2. **Authentik deployed** - Required for OAuth/SSO
3. **Vault variables configured**:
   ```yaml
   # In host_vars/arch-vps/vault.yml
   vault_grafana_admin_password: "secure-admin-password"
   vault_grafana_secret_key: "random-secret-key-32-chars"
   vault_grafana_oauth_client_id: "grafana"
   vault_grafana_oauth_client_secret: "oauth-client-secret-from-authentik"
   ```
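
The placeholder values above should be replaced with random secrets. One possible way to generate them, using only coreutils, is sketched below; `gen_hex` is a hypothetical helper, not something the role ships.

```bash
# Print N random bytes as 2N lowercase hex characters.
gen_hex() {
  head -c "$1" /dev/urandom | od -An -tx1 | tr -d ' \n'
  echo
}

# Usage:
#   gen_hex 16   # 32 characters, e.g. for vault_grafana_secret_key
#   gen_hex 24   # 48 characters, e.g. for vault_grafana_admin_password
# Then paste the values into the vault file, for example with
#   ansible-vault edit host_vars/arch-vps/vault.yml
# (assuming that file is encrypted with ansible-vault).
```

The OAuth client secret is different: it is generated in Authentik and copied into the vault, so it does not need to be created here.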

### Authentik Configuration

Before deployment, create an OAuth2/OIDC provider in Authentik:

1. **Create Provider**:
   - Name: `Grafana`
   - Type: `OAuth2/OpenID Provider`
   - Client ID: `grafana`
   - Client Secret: Generate and save to vault
   - Redirect URIs: `https://metrics.jnss.me/login/generic_oauth`
   - Signing Key: Auto-generated

2. **Create Application**:
   - Name: `Grafana`
   - Slug: `grafana`
   - Provider: Select the Grafana provider created above

3. **Create Groups** (optional, for role mapping):
   - `grafana-admins` - Full admin access
   - `grafana-editors` - Can create/edit dashboards
   - Users without these groups get Viewer access
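
Once the provider and application exist, the OIDC discovery document is a quick way to confirm that the slug is correct and the endpoints are reachable. The sketch below assumes Authentik's usual per-application discovery path (`/application/o/<slug>/.well-known/openid-configuration`); `auth.example.com` is a placeholder for the real Authentik hostname, and `oidc_discovery_url` is a hypothetical helper. The provider page in Authentik shows the exact URL if it differs.

```bash
# Build the OIDC discovery URL for an Authentik application slug.
# $1 = Authentik base URL (a trailing slash is tolerated), $2 = application slug
oidc_discovery_url() {
  local base="${1%/}" slug="$2"
  printf '%s/application/o/%s/.well-known/openid-configuration\n' "$base" "$slug"
}

# Usage (hostname is an assumption, replace with the real Authentik domain):
#   curl -fsS "$(oidc_discovery_url https://auth.example.com grafana)"
```

A JSON response listing `authorization_endpoint`, `token_endpoint`, and `userinfo_endpoint` indicates the provider is published under that slug.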

### Deploy

```bash
# Deploy complete metrics stack
ansible-playbook rick-infra.yml --tags metrics

# Deploy individual components
ansible-playbook rick-infra.yml --tags victoriametrics
ansible-playbook rick-infra.yml --tags grafana
ansible-playbook rick-infra.yml --tags node_exporter
```

### Verify Deployment

```bash
# Check service status
ansible homelab -a "systemctl status victoriametrics grafana node_exporter"

# Check metrics collection
curl http://127.0.0.1:9100/metrics           # node_exporter metrics
curl http://127.0.0.1:8428/metrics           # VictoriaMetrics metrics
curl http://127.0.0.1:8428/api/v1/targets    # Scrape targets

# Access Grafana
curl -I https://metrics.jnss.me/             # Should redirect to Authentik login
```

## Usage

### Access Dashboard

1. Navigate to `https://metrics.jnss.me`
2. Click "Sign in with Authentik"
3. Authenticate via Authentik SSO
4. Access granted based on Authentik group membership

### Role Mapping

Grafana roles are automatically assigned based on Authentik groups:

- **Admin**: Members of `grafana-admins` group
  - Full administrative access
  - Can manage users, data sources, plugins
  - Can create/edit/delete all dashboards

- **Editor**: Members of `grafana-editors` group
  - Can create and edit dashboards
  - Cannot manage users or data sources

- **Viewer**: All other authenticated users
  - Read-only access to dashboards
  - Cannot create or edit dashboards
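
The precedence of this mapping (Admin wins over Editor, everything else falls through to Viewer) can be illustrated with a small shell function. This is only a sketch of the documented rules for reasoning about group membership; `grafana_role_for` is a hypothetical name, and Grafana itself evaluates the JMESPath expression from the role configuration, not this code.

```bash
# Print the Grafana role implied by a list of Authentik group names.
grafana_role_for() {
  local g role=Viewer
  for g in "$@"; do
    case "$g" in
      grafana-admins) echo Admin; return ;;   # Admin takes precedence over everything
      grafana-editors) role=Editor ;;         # remembered unless an admin group follows
    esac
  done
  echo "$role"
}

# Usage:
#   grafana_role_for staff grafana-editors
```

A user in both `grafana-editors` and `grafana-admins` ends up as Admin regardless of the order in which the groups are listed.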

### Creating Dashboards

VictoriaMetrics is pre-configured as Grafana's default data source, so panels can use PromQL queries directly:

```promql
# CPU usage (percent, per instance)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (bytes in use)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Disk usage (percent, per filesystem)
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)

# Network traffic (received bytes per second, per interface)
irate(node_network_receive_bytes_total[5m])
```
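
Queries like these can also be tried outside Grafana through the Prometheus-compatible `/api/v1/query` endpoint, which helps when debugging a panel. A minimal sketch, assuming it runs on the metrics host itself; `vm_query` and the `VM_URL` variable are hypothetical names introduced for illustration.

```bash
# Base URL of VictoriaMetrics; defaults to the localhost listener documented above.
VM_URL="${VM_URL:-http://127.0.0.1:8428}"

# Run an instant PromQL query and print the raw JSON response.
vm_query() {
  curl -fsS -G "$VM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Usage:
#   vm_query 'node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes'
```

`--data-urlencode` takes care of escaping braces, quotes, and brackets, so the expression can be passed exactly as written in a dashboard.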

### Import Community Dashboards

1. Browse dashboards at https://grafana.com/grafana/dashboards/
2. Recommended for node_exporter:
   - Dashboard ID: 1860 (Node Exporter Full)
   - Dashboard ID: 11074 (Node Exporter for Prometheus)
3. Import via Grafana UI: Dashboards → Import → Enter ID

## Configuration

### Customization

Key configuration options in `roles/metrics/defaults/main.yml`:

```yaml
# Data retention
victoriametrics_retention_period: "12"  # months

# Scrape interval
victoriametrics_scrape_interval: "15s"

# OAuth role mapping (JMESPath expression)
grafana_oauth_role_attribute_path: "contains(groups, 'grafana-admins') && 'Admin' || contains(groups, 'grafana-editors') && 'Editor' || 'Viewer'"

# Memory limits
victoriametrics_memory_allowed_percent: "60"
```

### Adding Scrape Targets

Edit `roles/metrics/templates/scrape.yml.j2`:

```yaml
scrape_configs:
  # Add custom application metrics
  - job_name: 'myapp'
    static_configs:
      - targets: ['127.0.0.1:8080']
        labels:
          service: 'myapp'
```
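
Before adding a target, it is worth checking that its endpoint actually serves metrics in the Prometheus text format. The sketch below is a rough heuristic rather than a full parser (for instance, it does not handle `}` inside quoted label values); `looks_like_metrics` is a hypothetical helper and `127.0.0.1:8080` is simply the example target from the snippet above.

```bash
# Succeeds if stdin contains at least one sample line such as
#   http_requests_total{code="200"} 1027
looks_like_metrics() {
  grep -Eq '^[a-zA-Z_:][a-zA-Z0-9_:]*(\{[^}]*\})?[[:space:]]+[-+]?([0-9.]+([eE][-+]?[0-9]+)?|NaN|Inf)'
}

# Usage:
#   curl -fsS http://127.0.0.1:8080/metrics | looks_like_metrics && echo "target looks scrapeable"
```

If the check fails, the endpoint is probably returning HTML or JSON instead, and VictoriaMetrics would report the target as down after deployment.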

## Operations

### Service Management

```bash
# VictoriaMetrics
systemctl status victoriametrics
systemctl restart victoriametrics
journalctl -u victoriametrics -f

# Grafana
systemctl status grafana
systemctl restart grafana
journalctl -u grafana -f

# node_exporter
systemctl status node_exporter
systemctl restart node_exporter
journalctl -u node_exporter -f
```

### Data Locations

```
/var/lib/victoriametrics/    # Time-series data
/var/lib/grafana/            # Grafana database and dashboards
/var/log/grafana/            # Grafana logs
/etc/victoriametrics/        # VictoriaMetrics config
/etc/grafana/                # Grafana config
```

### Backup

VictoriaMetrics data is stored in `/var/lib/victoriametrics`:

```bash
# Stop service
systemctl stop victoriametrics

# Backup data
tar -czf victoriametrics-backup-$(date +%Y%m%d).tar.gz /var/lib/victoriametrics

# Start service
systemctl start victoriametrics
```

Grafana dashboards are stored in a SQLite database at `/var/lib/grafana/grafana.db`:

```bash
# Backup Grafana
systemctl stop grafana
tar -czf grafana-backup-$(date +%Y%m%d).tar.gz /var/lib/grafana /etc/grafana
systemctl start grafana
```
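
The manual steps above can be wrapped in a small helper that also verifies the archive is readable before the service is restarted. This is a sketch under the assumption that the caller stops and starts the service around it; `backup_dir` and the `/root/backups` destination are hypothetical.

```bash
# Archive a directory into DEST_DIR as <name>-backup-YYYYMMDD.tar.gz,
# verify the archive can be listed, and print its path.
# $1 = directory to back up, $2 = existing destination directory
backup_dir() {
  local src="$1" dest_dir="$2" name archive
  name="$(basename "$src")"
  archive="$dest_dir/${name}-backup-$(date +%Y%m%d).tar.gz"
  tar -czf "$archive" -C "$(dirname "$src")" "$name" || return 1
  tar -tzf "$archive" >/dev/null || return 1
  printf '%s\n' "$archive"
}

# Usage (destination directory is an assumption):
#   systemctl stop victoriametrics
#   backup_dir /var/lib/victoriametrics /root/backups
#   systemctl start victoriametrics
```

Using `-C` stores paths relative to the parent directory, so the archive restores as `victoriametrics/...` rather than with absolute paths.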

## Security

### Authentication

- Grafana protected by Authentik OAuth/OIDC
- Local admin account available for emergency access
- All services bind to localhost only

### Network Security

- VictoriaMetrics: `127.0.0.1:8428` (no external access)
- Grafana: `127.0.0.1:3001` (proxied via Caddy with HTTPS)
- node_exporter: `127.0.0.1:9100` (no external access)

### systemd Hardening

All services run with security restrictions:

- `NoNewPrivileges=true`
- `ProtectSystem=strict`
- `ProtectHome=true`
- `PrivateTmp=true`
- Read-only filesystem (except data directories)
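
To confirm that a deployed unit actually carries these settings, the properties can be read back with `systemctl show`. The sketch below assumes that `systemctl show` prints boolean settings as `yes`, which may vary between systemd versions; `check_hardening` is a hypothetical helper.

```bash
# Reads `systemctl show` output on stdin and reports which of the
# expected hardening properties are present. Returns non-zero if any is missing.
check_hardening() {
  local expected="NoNewPrivileges=yes ProtectSystem=strict ProtectHome=yes PrivateTmp=yes"
  local props want rc=0
  props="$(cat)"
  for want in $expected; do
    if printf '%s\n' "$props" | grep -qxF "$want"; then
      echo "ok      $want"
    else
      echo "MISSING $want"
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
#   systemctl show victoriametrics -p NoNewPrivileges -p ProtectSystem -p ProtectHome -p PrivateTmp | check_hardening
```

For a broader view, `systemd-analyze security <unit>` prints an exposure score covering many more sandboxing options.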

## Troubleshooting

### Grafana OAuth Not Working

1. Check Authentik provider configuration:
   ```bash
   # Verify redirect URI matches
   # https://metrics.jnss.me/login/generic_oauth
   ```

2. Check Grafana logs:
   ```bash
   journalctl -u grafana -f
   ```

3. Verify OAuth credentials in vault match Authentik

### No Metrics in Grafana

1. Check VictoriaMetrics scrape targets:
   ```bash
   curl http://127.0.0.1:8428/api/v1/targets
   ```

2. Check node_exporter is running:
   ```bash
   systemctl status node_exporter
   curl http://127.0.0.1:9100/metrics
   ```

3. Check VictoriaMetrics logs:
   ```bash
   journalctl -u victoriametrics -f
   ```

### High Memory Usage

VictoriaMetrics is configured to use at most 60% of available memory for its caches. Lower the limit if other services on the host need more headroom:

```yaml
# In roles/metrics/defaults/main.yml
victoriametrics_memory_allowed_percent: "40"  # Reduce to 40%
```

## See Also

- [VictoriaMetrics Documentation](https://docs.victoriametrics.com/)
- [Grafana Documentation](https://grafana.com/docs/)
- [node_exporter GitHub](https://github.com/prometheus/node_exporter)
- [PromQL Documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/)
- [Authentik OAuth Integration](https://goauthentik.io/docs/providers/oauth2/)