rick-infra/docs/metrics-deployment-guide.md
Joakim 1f3f111d88 Add metrics monitoring stack with VictoriaMetrics, Grafana, and node_exporter
Implement complete monitoring infrastructure following rick-infra principles:

Components:
- VictoriaMetrics: Prometheus-compatible TSDB (7x less RAM usage)
- Grafana: Visualization dashboard with Authentik OAuth/OIDC integration
- node_exporter: System metrics collection (CPU, memory, disk, network)

Architecture:
- All services run as native systemd binaries (no containers)
- localhost-only binding for security
- Grafana uses native OAuth integration with Authentik (not forward_auth)
- Full systemd security hardening enabled
- Proxied via Caddy at metrics.jnss.me with HTTPS

Role Features:
- Unified metrics role (single role for complete stack)
- Automatic role mapping via Authentik groups:
  - authentik Admins OR grafana-admins -> Admin access
  - grafana-editors -> Editor access
  - All others -> Viewer access
- VictoriaMetrics auto-provisioned as default Grafana datasource
- 12-month metrics retention by default
- Comprehensive documentation included

Security:
- OAuth/OIDC SSO via Authentik
- All metrics services bind to 127.0.0.1 only
- systemd hardening (NoNewPrivileges, ProtectSystem, etc.)
- Grafana accessible only via Caddy HTTPS proxy

Documentation:
- roles/metrics/README.md: Complete role documentation
- docs/metrics-deployment-guide.md: Step-by-step deployment guide

Configuration:
- Updated rick-infra.yml to include metrics deployment
- Grafana port set to 3001 (Gitea uses 3000)
- Ready for multi-host expansion (designed for future node_exporter deployment to production hosts)
2025-12-28 19:18:30 +01:00


# Metrics Stack Deployment Guide
Complete guide to deploying the monitoring stack (VictoriaMetrics, Grafana, node_exporter) on rick-infra.
## Overview
The metrics stack provides:
- **System monitoring**: CPU, memory, disk, network via node_exporter
- **Time-series storage**: VictoriaMetrics (Prometheus-compatible; reported to use up to 7x less RAM than Prometheus)
- **Visualization**: Grafana with Authentik SSO integration
- **Access**: `https://metrics.jnss.me` with role-based permissions
## Architecture
```
User → metrics.jnss.me (HTTPS)
  ↓
Caddy (Reverse Proxy)
  ↓
Grafana (OAuth → Authentik for SSO)
  ↓
VictoriaMetrics (Time-series DB)
  ↓
node_exporter (System Metrics)
```
All services bind to localhost only, following rick-infra security principles; Caddy is the only externally reachable entry point.
## Prerequisites
### 1. Caddy Deployed
```bash
ansible-playbook rick-infra.yml --tags caddy
```
### 2. Authentik Deployed
```bash
ansible-playbook rick-infra.yml --tags authentik
```
### 3. DNS Configuration
Ensure `metrics.jnss.me` points to the arch-vps IP:
```bash
dig +short metrics.jnss.me # Should print 69.62.119.31
```
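The DNS check can also be scripted so that it fails loudly before a deployment run. The following is a minimal sketch and not part of the rick-infra role; the helper names `has_a_record` and `check_dns` are hypothetical, and the IP is the one quoted above.

```bash
# Hypothetical helpers, not part of rick-infra.

# Succeeds if the expected IP appears in a newline-separated list of A records.
has_a_record() {
  local expected="$1" records="$2"
  printf '%s\n' "$records" | grep -Fxq -- "$expected"
}

# Resolves the host with dig and compares the result against the expected IP.
check_dns() {
  local host="$1" expected="$2"
  has_a_record "$expected" "$(dig +short A "$host")"
}
```

Usage, assuming `dig` is installed on the control machine: `check_dns metrics.jnss.me 69.62.119.31 && echo "DNS ok"`.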
## Step 1: Configure Authentik OAuth Provider
### Create OAuth2/OIDC Provider
1. Log in to Authentik at `https://auth.jnss.me`
2. Navigate to **Applications → Providers** → **Create**
3. Configure provider:
   - **Name**: `Grafana`
   - **Type**: `OAuth2/OpenID Provider`
   - **Authentication flow**: `default-authentication-flow`
   - **Authorization flow**: `default-provider-authorization-explicit-consent`
   - **Client type**: `Confidential`
   - **Client ID**: `grafana`
   - **Client Secret**: Click **Generate** and **copy the secret**
   - **Redirect URIs**: `https://metrics.jnss.me/login/generic_oauth`
   - **Signing Key**: Select auto-generated key
   - **Scopes**: `openid`, `profile`, `email`, `groups`
4. Click **Finish**
### Create Application
1. Navigate to **Applications** → **Create**
2. Configure application:
   - **Name**: `Grafana`
   - **Slug**: `grafana`
   - **Provider**: Select `Grafana` provider created above
   - **Launch URL**: `https://metrics.jnss.me`
3. Click **Create**
### Create Groups (Optional)
For role-based access control:
1. Navigate to **Directory → Groups** → **Create**
2. Create groups:
   - **grafana-admins**: Full admin access to Grafana
   - **grafana-editors**: Can create/edit dashboards
3. Add users to groups as needed. Users in neither group get Viewer access.
## Step 2: Configure Vault Variables
Edit vault file:
```bash
ansible-vault edit host_vars/arch-vps/vault.yml
```
Add these variables:
```yaml
# Grafana admin password (for emergency local login)
vault_grafana_admin_password: "your-secure-admin-password"
# Grafana secret key (generate with: openssl rand -base64 32; yields a 44-character string)
vault_grafana_secret_key: "paste-generated-secret-key-here"
# OAuth credentials from Authentik
vault_grafana_oauth_client_id: "grafana"
vault_grafana_oauth_client_secret: "paste-secret-from-authentik-here"
```
Save and close (`:wq` in vim).
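To produce the random values before opening the vault, a small helper along these lines may be convenient. This is a sketch under assumptions: `gen_secret` is a hypothetical name, and the fallback branch assumes a `base64` binary (coreutils) is available when `openssl` is not.

```bash
# Hypothetical helper: print 32 random bytes, base64-encoded (44 characters).
gen_secret() {
  if command -v openssl >/dev/null 2>&1; then
    openssl rand -base64 32
  else
    # Fallback when openssl is not installed (assumes coreutils base64).
    head -c 32 /dev/urandom | base64
  fi
}
```

Run `gen_secret` once per value (for example for `vault_grafana_secret_key` and `vault_grafana_admin_password`) and paste each result into the vault file rather than reusing one value.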
## Step 3: Deploy Metrics Stack
Deploy all components:
```bash
ansible-playbook rick-infra.yml --tags metrics
```
This will:
1. Install and configure VictoriaMetrics
2. Install and configure node_exporter
3. Install and configure Grafana with OAuth
4. Deploy Caddy configuration for `metrics.jnss.me`
Expected output:
```
PLAY RECAP *******************************************************
arch-vps : ok=25 changed=15 unreachable=0 failed=0 skipped=0
```
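The exit status of `ansible-playbook` is the authoritative success signal, but when reviewing a saved log it can help to scan the recap lines mechanically. A minimal sketch, assuming the recap format shown above; `recap_ok` is a hypothetical helper name.

```bash
# Hypothetical helper: reads ansible-playbook output on stdin and succeeds only
# if at least one recap line was seen and every one reports unreachable=0 and
# failed=0.
recap_ok() {
  awk '
    /unreachable=[0-9]+/ && /failed=[0-9]+/ {
      seen = 1
      if ($0 !~ /unreachable=0([^0-9]|$)/ || $0 !~ /failed=0([^0-9]|$)/) bad = 1
    }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}
```

Usage: `ansible-playbook rick-infra.yml --tags metrics | tee deploy.log`, then `recap_ok < deploy.log && echo "recap clean"` (the log file name is only an example).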
## Step 4: Verify Deployment
### Check Services
SSH to arch-vps and verify services:
```bash
# Check all services are running
systemctl status victoriametrics grafana node_exporter
# Check service health
curl http://127.0.0.1:8428/health # VictoriaMetrics
curl http://127.0.0.1:9100/metrics # node_exporter
curl http://127.0.0.1:3001/api/health # Grafana (port 3001; Gitea uses 3000)
```
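The three health checks can be wrapped in one loop that reports each endpoint and returns a non-zero status if any of them fails. This is a sketch, not something shipped with the role; `check_health` is a hypothetical name, and Grafana's port 3001 is taken from the role's configuration notes (Gitea uses 3000), so adjust it if your deployment differs.

```bash
# Hypothetical helper: arguments are "name=url" pairs. Prints one status line
# per endpoint and returns non-zero if any endpoint failed.
check_health() {
  local pair name url rc=0
  for pair in "$@"; do
    name="${pair%%=*}"   # text before the first "="
    url="${pair#*=}"     # text after the first "="
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      printf 'ok    %s\n' "$name"
    else
      printf 'FAIL  %s (%s)\n' "$name" "$url"
      rc=1
    fi
  done
  return "$rc"
}
```

Usage on arch-vps: `check_health victoriametrics=http://127.0.0.1:8428/health node_exporter=http://127.0.0.1:9100/metrics grafana=http://127.0.0.1:3001/api/health`.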
### Check HTTPS Access
```bash
curl -I https://metrics.jnss.me
# Should return 200 or 302 (redirect to Authentik)
```
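To turn the "200 or 302" expectation into a pass/fail check, the status code can be extracted with curl's `-w '%{http_code}'` option. A minimal sketch; `check_https` is a hypothetical helper name.

```bash
# Hypothetical helper: succeeds if the URL answers with HTTP 200 or 302.
check_https() {
  local url="$1" code
  # -w prints only the status code; fall back to 000 if curl itself fails.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url") || code=000
  case "$code" in
    200|302) printf 'ok (%s)\n' "$code" ;;
    *) printf 'unexpected status: %s\n' "$code"; return 1 ;;
  esac
}
```

Usage: `check_https https://metrics.jnss.me`. A status of `000` usually points at DNS, TLS, or connectivity problems rather than at Grafana itself.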
### Check Metrics Collection
```bash
# Check VictoriaMetrics scrape targets
curl http://127.0.0.1:8428/api/v1/targets
# Should show node_exporter as "up"
```
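The targets response is JSON, so reading it by eye gets tedious once more jobs are added. The sketch below counts healthy targets with plain `grep`, assuming the response follows the Prometheus targets API shape with one `"health"` field per active target; `summarize_targets` is a hypothetical name, and a JSON-aware tool would be more robust if one is installed.

```bash
# Hypothetical helper: reads a /api/v1/targets response on stdin, prints
# "<up>/<total> targets up", and succeeds only if there is at least one
# target and all of them are up.
summarize_targets() {
  local fields total up
  # Pull out every "health":"<state>" field, one per line.
  fields=$(grep -o '"health"[[:space:]]*:[[:space:]]*"[a-z]*"' || true)
  total=$(printf '%s' "$fields" | grep -c 'health' || true)
  up=$(printf '%s' "$fields" | grep -c '"up"$' || true)
  printf '%s/%s targets up\n' "$up" "$total"
  [ "$total" -gt 0 ] && [ "$up" -eq "$total" ]
}
```

Usage: `curl -s http://127.0.0.1:8428/api/v1/targets | summarize_targets`.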
## Step 5: Access Grafana
1. Navigate to `https://metrics.jnss.me`
2. Click **"Sign in with Authentik"**
3. Log in with your Authentik credentials
4. You should be redirected to the Grafana dashboard
First login will:
- Auto-create your Grafana user
- Assign role based on Authentik group membership
- Grant access to default organization
## Step 6: Verify Data Source
1. In Grafana, navigate to **Connections → Data sources**
2. Verify **VictoriaMetrics** is listed and default
3. Click on VictoriaMetrics → **Save & test**
4. Should show green "Data source is working" message
## Step 7: Create First Dashboard
### Option 1: Import Community Dashboard (Recommended)
1. Navigate to **Dashboards → Import**
2. Enter dashboard ID: `1860` (Node Exporter Full)
3. Click **Load**
4. Select **VictoriaMetrics** as data source
5. Click **Import**
You now have a comprehensive system monitoring dashboard!
### Option 2: Create Custom Dashboard
1. Navigate to **Dashboards → New → New Dashboard**
2. Click **Add visualization**
3. Select **VictoriaMetrics** data source
4. Enter PromQL query:
```promql
# CPU usage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```
5. Click **Apply**
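A PromQL expression can also be tried from the shell before it goes into a panel, since VictoriaMetrics exposes the Prometheus-compatible `/api/v1/query` endpoint. A minimal sketch; `vm_query` and the `VM_URL` override are hypothetical names, and the default address assumes the port used in the health checks above.

```bash
# Hypothetical helper: run an instant PromQL query against VictoriaMetrics.
# VM_URL is an optional override for the base address.
vm_query() {
  local query="$1" base="${VM_URL:-http://127.0.0.1:8428}"
  # --get sends the URL-encoded data as a query string instead of a POST body.
  curl -fsS --get "$base/api/v1/query" --data-urlencode "query=$query"
}

# Example: vm_query 'up'
```

An empty `result` array in the JSON response usually means the metric name or label selector does not match anything that has been scraped yet.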
## Step 8: Configure Alerting (Optional)
Grafana supports alerting on metrics. Configure via **Alerting → Alert rules**.
Example alert condition for high CPU (true when average idle CPU falls below 20%, i.e. usage is above 80%):
```promql
avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 < 20
```
## Troubleshooting
### OAuth Login Fails
**Symptom**: Redirect to Authentik, but returns error after login
**Solution**:
1. Verify redirect URI in Authentik matches exactly: `https://metrics.jnss.me/login/generic_oauth`
2. Check Grafana logs: `journalctl -u grafana -f`
3. Verify OAuth credentials in vault match Authentik
### No Metrics in Grafana
**Symptom**: Data source working, but no data in dashboards
**Solution**:
1. Check VictoriaMetrics targets: `curl http://127.0.0.1:8428/api/v1/targets`
2. Verify node_exporter is up: `systemctl status node_exporter`
3. Check time range in Grafana (top right) - try "Last 5 minutes"
### Can't Access metrics.jnss.me
**Symptom**: Connection timeout or SSL error
**Solution**:
1. Verify DNS: `dig metrics.jnss.me`
2. Check Caddy is running: `systemctl status caddy`
3. Check Caddy logs: `journalctl -u caddy -f`
4. Verify the Caddy site config exists: `ls /etc/caddy/sites/grafana.caddy`
### Wrong Grafana Role
**Symptom**: User has wrong permissions (e.g., Viewer instead of Admin)
**Solution**:
1. Verify user is in correct Authentik group (`grafana-admins` or `grafana-editors`)
2. Log out of Grafana and log in again
3. Check role mapping expression in `roles/metrics/defaults/main.yml`:
```yaml
grafana_oauth_role_attribute_path: "contains(groups, 'grafana-admins') && 'Admin' || contains(groups, 'grafana-editors') && 'Editor' || 'Viewer'"
```
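The expression is evaluated left to right, so the first matching group wins. To reason about which role a given set of groups should produce, the same precedence can be emulated in shell. This is only an illustration of the logic, not Grafana's JMESPath evaluator; `grafana_role` is a hypothetical name.

```bash
# Hypothetical helper: emulate the role mapping for a list of group names.
# Admin takes precedence over Editor; anything else falls back to Viewer.
grafana_role() {
  local g role=Viewer
  for g in "$@"; do
    case "$g" in
      grafana-admins) role=Admin; break ;;
      grafana-editors) role=Editor ;;
    esac
  done
  printf '%s\n' "$role"
}
```

For example, `grafana_role staff grafana-editors` prints `Editor`. If a user's actual role differs from what this predicts, the `groups` claim sent by Authentik is the first thing to inspect.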
## Next Steps
### Add More Hosts
To monitor additional hosts (e.g., mini-vps):
1. Deploy node_exporter to target host
2. Update VictoriaMetrics scrape config to include remote targets
3. Configure remote_write or federation
### Add Service Metrics
To monitor containerized services:
1. Expose a `/metrics` endpoint in the application (for example on port 8080)
2. Add scrape config in `roles/metrics/templates/scrape.yml.j2`:
```yaml
- job_name: 'myservice'
  static_configs:
    - targets: ['127.0.0.1:8080']
```
3. Redeploy metrics role
### Set Up Alerting
1. Configure notification channels in Grafana (Email, Slack, etc.)
2. Create alert rules for critical metrics
3. Set up on-call rotation if needed
## Security Notes
- All metrics services run on localhost only
- Grafana is the only internet-facing component (via Caddy HTTPS)
- OAuth provides SSO with Authentik (no separate Grafana passwords)
- systemd hardening enabled on all services
- Default admin account should only be used for emergencies
## Resources
- **VictoriaMetrics Docs**: https://docs.victoriametrics.com/
- **Grafana Docs**: https://grafana.com/docs/
- **PromQL Guide**: https://prometheus.io/docs/prometheus/latest/querying/basics/
- **Dashboard Library**: https://grafana.com/grafana/dashboards/
- **Authentik OAuth**: https://goauthentik.io/docs/providers/oauth2/
## Support
For issues specific to rick-infra metrics deployment:
1. Check service logs: `journalctl -u <service> -f`
2. Review role README: `roles/metrics/README.md`
3. Verify vault variables are correctly set
4. Ensure Authentik OAuth provider is properly configured