Files
rick-infra/roles/metrics/README.md
Joakim 1f3f111d88 Add metrics monitoring stack with VictoriaMetrics, Grafana, and node_exporter
Implement complete monitoring infrastructure following rick-infra principles:

Components:
- VictoriaMetrics: Prometheus-compatible TSDB (7x less RAM usage)
- Grafana: Visualization dashboard with Authentik OAuth/OIDC integration
- node_exporter: System metrics collection (CPU, memory, disk, network)

Architecture:
- All services run as native systemd binaries (no containers)
- localhost-only binding for security
- Grafana uses native OAuth integration with Authentik (not forward_auth)
- Full systemd security hardening enabled
- Proxied via Caddy at metrics.jnss.me with HTTPS

Role Features:
- Unified metrics role (single role for complete stack)
- Automatic role mapping via Authentik groups:
  - authentik Admins OR grafana-admins -> Admin access
  - grafana-editors -> Editor access
  - All others -> Viewer access
- VictoriaMetrics auto-provisioned as default Grafana datasource
- 12-month metrics retention by default
- Comprehensive documentation included

Security:
- OAuth/OIDC SSO via Authentik
- All metrics services bind to 127.0.0.1 only
- systemd hardening (NoNewPrivileges, ProtectSystem, etc.)
- Grafana accessible only via Caddy HTTPS proxy

Documentation:
- roles/metrics/README.md: Complete role documentation
- docs/metrics-deployment-guide.md: Step-by-step deployment guide

Configuration:
- Updated rick-infra.yml to include metrics deployment
- Grafana port set to 3001 (Gitea uses 3000)
- Ready for multi-host expansion (designed for future node_exporter deployment to production hosts)
2025-12-28 19:18:30 +01:00

326 lines
9.0 KiB
Markdown

# Metrics Role
Complete monitoring stack for rick-infra providing system metrics collection, storage, and visualization with SSO integration.
## Components
### VictoriaMetrics
- **Purpose**: Time-series database for metrics storage
- **Type**: Native systemd service
- **Listen**: `127.0.0.1:8428` (localhost only)
- **Features**:
- Prometheus-compatible API and PromQL
- 7x less RAM usage than Prometheus
- Single binary deployment
- 12-month data retention by default
### Grafana
- **Purpose**: Metrics visualization and dashboarding
- **Type**: Native systemd service
- **Listen**: `127.0.0.1:3000` (localhost only, proxied via Caddy)
- **Domain**: `metrics.jnss.me`
- **Features**:
- OAuth/OIDC integration with Authentik
- Role-based access control via Authentik groups
- VictoriaMetrics as default data source
### node_exporter
- **Purpose**: System metrics collection
- **Type**: Native systemd service
- **Listen**: `127.0.0.1:9100` (localhost only)
- **Metrics**: CPU, memory, disk, network, systemd units
## Architecture
```
┌─────────────────────────────────────────────────────┐
│ metrics.jnss.me (Grafana Dashboard) │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Caddy (HTTPS) │ │
│ │ ↓ │ │
│ │ Grafana (OAuth → Authentik) │ │
│ │ ↓ │ │
│ │ VictoriaMetrics (Prometheus-compatible) │ │
│ │ ↑ │ │
│ │ node_exporter (System Metrics) │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
```
## Deployment
### Prerequisites
1. **Caddy role deployed** - Required for HTTPS proxy
2. **Authentik deployed** - Required for OAuth/SSO
3. **Vault variables configured**:
```yaml
# In host_vars/arch-vps/vault.yml
vault_grafana_admin_password: "secure-admin-password"
vault_grafana_secret_key: "random-secret-key-32-chars"
vault_grafana_oauth_client_id: "grafana"
vault_grafana_oauth_client_secret: "oauth-client-secret-from-authentik"
```
### Authentik Configuration
Before deployment, create OAuth2/OIDC provider in Authentik:
1. **Create Provider**:
- Name: `Grafana`
- Type: `OAuth2/OpenID Provider`
- Client ID: `grafana`
- Client Secret: Generate and save to vault
- Redirect URIs: `https://metrics.jnss.me/login/generic_oauth`
- Signing Key: Auto-generated
2. **Create Application**:
- Name: `Grafana`
- Slug: `grafana`
- Provider: Select Grafana provider created above
3. **Create Groups** (optional, for role mapping):
- `grafana-admins` - Full admin access
- `grafana-editors` - Can create/edit dashboards
- Users without these groups get Viewer access
### Deploy
```bash
# Deploy complete metrics stack
ansible-playbook rick-infra.yml --tags metrics
# Deploy individual components
ansible-playbook rick-infra.yml --tags victoriametrics
ansible-playbook rick-infra.yml --tags grafana
ansible-playbook rick-infra.yml --tags node_exporter
```
### Verify Deployment
```bash
# Check service status
ansible homelab -a "systemctl status victoriametrics grafana node_exporter"
# Check metrics collection
curl http://127.0.0.1:9100/metrics # node_exporter metrics
curl http://127.0.0.1:8428/metrics # VictoriaMetrics metrics
curl http://127.0.0.1:8428/api/v1/targets # Scrape targets
# Access Grafana
curl -I https://metrics.jnss.me/ # Should redirect to Authentik login
```
## Usage
### Access Dashboard
1. Navigate to `https://metrics.jnss.me`
2. Click "Sign in with Authentik"
3. Authenticate via Authentik SSO
4. Access granted based on Authentik group membership
### Role Mapping
Grafana roles are automatically assigned based on Authentik groups:
- **Admin**: Members of `grafana-admins` group
- Full administrative access
- Can manage users, data sources, plugins
- Can create/edit/delete all dashboards
- **Editor**: Members of `grafana-editors` group
- Can create and edit dashboards
- Cannot manage users or data sources
- **Viewer**: All other authenticated users
- Read-only access to dashboards
- Cannot create or edit dashboards
### Creating Dashboards
Grafana comes with VictoriaMetrics pre-configured as the default data source. Use PromQL queries:
```promql
# CPU usage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
# Disk usage
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
# Network traffic
irate(node_network_receive_bytes_total[5m])
```
### Import Community Dashboards
1. Browse dashboards at https://grafana.com/grafana/dashboards/
2. Recommended for node_exporter:
- Dashboard ID: 1860 (Node Exporter Full)
- Dashboard ID: 11074 (Node Exporter for Prometheus)
3. Import via Grafana UI: Dashboards → Import → Enter ID
## Configuration
### Customization
Key configuration options in `roles/metrics/defaults/main.yml`:
```yaml
# Data retention
victoriametrics_retention_period: "12" # months
# Scrape interval
victoriametrics_scrape_interval: "15s"
# OAuth role mapping (JMESPath expression)
grafana_oauth_role_attribute_path: "contains(groups, 'grafana-admins') && 'Admin' || contains(groups, 'grafana-editors') && 'Editor' || 'Viewer'"
# Memory limits
victoriametrics_memory_allowed_percent: "60"
```
### Adding Scrape Targets
Edit `roles/metrics/templates/scrape.yml.j2`:
```yaml
scrape_configs:
# Add custom application metrics
- job_name: 'myapp'
static_configs:
- targets: ['127.0.0.1:8080']
labels:
service: 'myapp'
```
## Operations
### Service Management
```bash
# VictoriaMetrics
systemctl status victoriametrics
systemctl restart victoriametrics
journalctl -u victoriametrics -f
# Grafana
systemctl status grafana
systemctl restart grafana
journalctl -u grafana -f
# node_exporter
systemctl status node_exporter
systemctl restart node_exporter
journalctl -u node_exporter -f
```
### Data Locations
```
/var/lib/victoriametrics/ # Time-series data
/var/lib/grafana/ # Grafana database and dashboards
/var/log/grafana/ # Grafana logs
/etc/victoriametrics/ # VictoriaMetrics config
/etc/grafana/ # Grafana config
```
### Backup
VictoriaMetrics data is stored in `/var/lib/victoriametrics`:
```bash
# Stop service
systemctl stop victoriametrics
# Backup data
tar -czf victoriametrics-backup-$(date +%Y%m%d).tar.gz /var/lib/victoriametrics
# Start service
systemctl start victoriametrics
```
Grafana dashboards are stored in SQLite database at `/var/lib/grafana/grafana.db`:
```bash
# Backup Grafana
systemctl stop grafana
tar -czf grafana-backup-$(date +%Y%m%d).tar.gz /var/lib/grafana /etc/grafana
systemctl start grafana
```
## Security
### Authentication
- Grafana protected by Authentik OAuth/OIDC
- Local admin account available for emergency access
- All services bind to localhost only
### Network Security
- VictoriaMetrics: `127.0.0.1:8428` (no external access)
- Grafana: `127.0.0.1:3000` (proxied via Caddy with HTTPS)
- node_exporter: `127.0.0.1:9100` (no external access)
### systemd Hardening
All services run with security restrictions:
- `NoNewPrivileges=true`
- `ProtectSystem=strict`
- `ProtectHome=true`
- `PrivateTmp=true`
- Read-only filesystem (except data directories)
## Troubleshooting
### Grafana OAuth Not Working
1. Check Authentik provider configuration:
```bash
# Verify redirect URI matches
# https://metrics.jnss.me/login/generic_oauth
```
2. Check Grafana logs:
```bash
journalctl -u grafana -f
```
3. Verify OAuth credentials in vault match Authentik
### No Metrics in Grafana
1. Check VictoriaMetrics scrape targets:
```bash
curl http://127.0.0.1:8428/api/v1/targets
```
2. Check node_exporter is running:
```bash
systemctl status node_exporter
curl http://127.0.0.1:9100/metrics
```
3. Check VictoriaMetrics logs:
```bash
journalctl -u victoriametrics -f
```
### High Memory Usage
VictoriaMetrics is configured to use max 60% of available memory. Adjust if needed:
```yaml
# In roles/metrics/defaults/main.yml
victoriametrics_memory_allowed_percent: "40" # Reduce to 40%
```
## See Also
- [VictoriaMetrics Documentation](https://docs.victoriametrics.com/)
- [Grafana Documentation](https://grafana.com/docs/)
- [node_exporter GitHub](https://github.com/prometheus/node_exporter)
- [PromQL Documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/)
- [Authentik OAuth Integration](https://goauthentik.io/docs/providers/oauth2/)