# Metrics Stack Deployment Guide

Complete guide to deploying the monitoring stack (VictoriaMetrics, Grafana, node_exporter) on rick-infra.

## Overview

The metrics stack provides:

- **System monitoring**: CPU, memory, disk, and network metrics via node_exporter
- **Time-series storage**: VictoriaMetrics (Prometheus-compatible; the VictoriaMetrics project reports up to 7x lower RAM usage than Prometheus)
- **Visualization**: Grafana with Authentik SSO integration
- **Access**: `https://metrics.jnss.me` with role-based permissions

## Architecture

```
User → metrics.jnss.me (HTTPS)
          ↓
Caddy (Reverse Proxy)
          ↓
Grafana (OAuth → Authentik for SSO)
          ↓
VictoriaMetrics (Time-series DB)
          ↑
node_exporter (System Metrics)
```

VictoriaMetrics scrapes node_exporter (hence the upward arrow), and Grafana queries VictoriaMetrics. All three services listen on localhost only, following rick-infra security principles. Caddy is the only component exposed to the internet.

## Prerequisites

### 1. Caddy Deployed

```bash
ansible-playbook rick-infra.yml --tags caddy
```

### 2. Authentik Deployed

```bash
ansible-playbook rick-infra.yml --tags authentik
```

### 3. DNS Configuration

Ensure `metrics.jnss.me` points to the arch-vps IP:

```bash
dig +short metrics.jnss.me
# Should return 69.62.119.31
```

## Step 1: Configure Authentik OAuth Provider

### Create OAuth2/OIDC Provider

1. Log in to Authentik at `https://auth.jnss.me`
2. Navigate to **Applications → Providers** → **Create**
3. Configure the provider:
   - **Name**: `Grafana`
   - **Type**: `OAuth2/OpenID Provider`
   - **Authentication flow**: `default-authentication-flow`
   - **Authorization flow**: `default-provider-authorization-explicit-consent`
   - **Client type**: `Confidential`
   - **Client ID**: `grafana`
   - **Client Secret**: Click **Generate** and **copy the secret**
   - **Redirect URIs**: `https://metrics.jnss.me/login/generic_oauth`
   - **Signing Key**: Select the auto-generated key
   - **Scopes**: `openid`, `profile`, `email`, `groups`
4. Click **Finish**

### Create Application

1. Navigate to **Applications** → **Create**
2. Configure the application:
   - **Name**: `Grafana`
   - **Slug**: `grafana`
   - **Provider**: Select the `Grafana` provider created above
   - **Launch URL**: `https://metrics.jnss.me`
3.
   Click **Create**

### Create Groups (Optional)

For role-based access control:

1. Navigate to **Directory → Groups** → **Create**
2. Create groups:
   - **grafana-admins**: Full admin access to Grafana
   - **grafana-editors**: Can create and edit dashboards
   - All other users get Viewer access
3. Add users to groups as needed

## Step 2: Configure Vault Variables

Edit the vault file:

```bash
ansible-vault edit host_vars/arch-vps/vault.yml
```

Add these variables:

```yaml
# Grafana admin password (for emergency local login)
vault_grafana_admin_password: "your-secure-admin-password"

# Grafana secret key (generate with: openssl rand -base64 32)
vault_grafana_secret_key: "your-random-secret-key"

# OAuth credentials from Authentik
vault_grafana_oauth_client_id: "grafana"
vault_grafana_oauth_client_secret: "paste-secret-from-authentik-here"
```

Save and close (`:wq` in vim).

## Step 3: Deploy Metrics Stack

Deploy all components:

```bash
ansible-playbook rick-infra.yml --tags metrics
```

This will:

1. Install and configure VictoriaMetrics
2. Install and configure node_exporter
3. Install and configure Grafana with OAuth
4. Deploy the Caddy configuration for `metrics.jnss.me`

Expected output (task counts vary between runs; `unreachable=0` and `failed=0` are what matter):

```
PLAY RECAP *******************************************************
arch-vps : ok=25 changed=15 unreachable=0 failed=0 skipped=0
```

## Step 4: Verify Deployment

### Check Services

SSH to arch-vps and verify the services:

```bash
# Check all services are running
systemctl status victoriametrics grafana node_exporter

# Check service health
curl http://127.0.0.1:8428/health       # VictoriaMetrics
curl http://127.0.0.1:9100/metrics      # node_exporter
curl http://127.0.0.1:3000/api/health   # Grafana
```

### Check HTTPS Access

```bash
curl -I https://metrics.jnss.me
# Should return 200 or 302 (redirect to the Grafana login page)
```

### Check Metrics Collection

```bash
# Check VictoriaMetrics scrape targets
curl http://127.0.0.1:8428/api/v1/targets
# node_exporter should be listed with health "up"
```

## Step 5: Access Grafana

1.
   Navigate to `https://metrics.jnss.me`
2. Click **"Sign in with Authentik"**
3. Log in with your Authentik credentials
4. You should be redirected to the Grafana dashboard

The first login will:

- Auto-create your Grafana user
- Assign a role based on your Authentik group membership
- Grant access to the default organization

## Step 6: Verify Data Source

1. In Grafana, navigate to **Connections → Data sources**
2. Verify **VictoriaMetrics** is listed and marked as default
3. Click **VictoriaMetrics** → **Save & test**
4. It should show a green "Data source is working" message

## Step 7: Create First Dashboard

### Option 1: Import Community Dashboard (Recommended)

1. Navigate to **Dashboards → Import**
2. Enter dashboard ID: `1860` (Node Exporter Full)
3. Click **Load**
4. Select **VictoriaMetrics** as the data source
5. Click **Import**

You now have a comprehensive system monitoring dashboard.

### Option 2: Create Custom Dashboard

1. Navigate to **Dashboards → New → New Dashboard**
2. Click **Add visualization**
3. Select the **VictoriaMetrics** data source
4. Enter a PromQL query:

   ```promql
   # CPU usage (%) per instance: 100 minus the average idle percentage
   100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
   ```

5. Click **Apply**

## Step 8: Configure Alerting (Optional)

Grafana supports alerting on metrics. Configure it via **Alerting → Alert rules**.

Example alert condition for high CPU. It fires when average idle CPU drops below 20%, which means usage is above 80%:

```promql
avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 < 20
```

This uses `rate` instead of `irate`. `rate` averages over the whole 5-minute window, so a single brief spike is less likely to fire the alert.

## Troubleshooting

### OAuth Login Fails

**Symptom**: Grafana redirects to Authentik, but an error appears after login

**Solution**:

1. Verify the redirect URI in Authentik matches exactly: `https://metrics.jnss.me/login/generic_oauth`
2. Check Grafana logs: `journalctl -u grafana -f`
3. Verify the OAuth credentials in the vault match Authentik

### No Metrics in Grafana

**Symptom**: Data source is working, but dashboards show no data

**Solution**:

1. Check VictoriaMetrics targets: `curl http://127.0.0.1:8428/api/v1/targets`
2. Verify node_exporter is up: `systemctl status node_exporter`
3.
   Check the time range in Grafana (top right) and try "Last 5 minutes"

### Can't Access metrics.jnss.me

**Symptom**: Connection timeout or SSL error

**Solution**:

1. Verify DNS: `dig metrics.jnss.me`
2. Check Caddy is running: `systemctl status caddy`
3. Check Caddy logs: `journalctl -u caddy -f`
4. Verify the Caddy site config was deployed: `ls /etc/caddy/sites/grafana.caddy`

### Wrong Grafana Role

**Symptom**: User has the wrong permissions (e.g., Viewer instead of Admin)

**Solution**:

1. Verify the user is in the correct Authentik group (`grafana-admins` or `grafana-editors`)
2. Log out of Grafana and log in again, because the role is evaluated at login
3. Check the role mapping expression in `roles/metrics/defaults/main.yml`:

   ```yaml
   # JMESPath: Admin if in grafana-admins, else Editor if in grafana-editors, else Viewer
   grafana_oauth_role_attribute_path: "contains(groups, 'grafana-admins') && 'Admin' || contains(groups, 'grafana-editors') && 'Editor' || 'Viewer'"
   ```

## Next Steps

### Add More Hosts

To monitor additional hosts (e.g., mini-vps):

1. Deploy node_exporter to the target host
2. Get its metrics into VictoriaMetrics in one of two ways:
   - Update the VictoriaMetrics scrape config to include the remote targets
   - Configure remote_write or federation

### Add Service Metrics

To monitor containerized services:

1. Expose a `/metrics` endpoint in the application (e.g., on port 8080)
2. Add a scrape config in `roles/metrics/templates/scrape.yml.j2`:

   ```yaml
   # Entry in the existing scrape_configs: list
   - job_name: 'myservice'
     static_configs:
       - targets: ['127.0.0.1:8080']
   ```

3. Redeploy the metrics role (`ansible-playbook rick-infra.yml --tags metrics`)

### Set Up Alerting

1. Configure contact points (notification channels) in Grafana, such as Email or Slack
2. Create alert rules for critical metrics
3.
   Set up an on-call rotation if needed

## Security Notes

- VictoriaMetrics, Grafana, and node_exporter listen on localhost only
- Grafana is the only internet-facing component, reached through Caddy over HTTPS
- OAuth provides SSO with Authentik (no separate Grafana passwords)
- systemd hardening is enabled on all services
- The default admin account should only be used for emergencies

## Resources

- **VictoriaMetrics Docs**: https://docs.victoriametrics.com/
- **Grafana Docs**: https://grafana.com/docs/
- **PromQL Guide**: https://prometheus.io/docs/prometheus/latest/querying/basics/
- **Dashboard Library**: https://grafana.com/grafana/dashboards/
- **Authentik OAuth**: https://goauthentik.io/docs/providers/oauth2/

## Support

For issues specific to the rick-infra metrics deployment:

1. Check service logs: `journalctl -u <service> -f` (e.g., `grafana`, `victoriametrics`, `node_exporter`)
2. Review the role README: `roles/metrics/README.md`
3. Verify vault variables are correctly set
4. Ensure the Authentik OAuth provider is properly configured
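Item 3 can be partly scripted. Below is a minimal sketch of a hypothetical helper, `check_vault_vars`; it is not part of the metrics role. It reads the decrypted vault on stdin and flags any of the four Step 2 variables that are missing or still hold an example value. The sketch makes two assumptions:

- Each variable is a top-level `key: value` line.
- Any unreplaced placeholder still begins with `your-` or `paste-`, as in Step 2.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of roles/metrics): report metrics-stack vault
# variables from Step 2 that are missing or still set to a placeholder.
# Assumes top-level "name: value" lines and placeholders starting with
# "your-" or "paste-". Returns 0 if everything looks set, 1 otherwise.
check_vault_vars() {
  local content missing=0 var
  content="$(cat)"  # buffer stdin so it can be searched once per variable

  for var in vault_grafana_admin_password \
             vault_grafana_secret_key \
             vault_grafana_oauth_client_id \
             vault_grafana_oauth_client_secret; do
    if ! grep -Eq "^${var}:" <<<"$content"; then
      # Key does not appear at the start of any line
      echo "MISSING: ${var}"
      missing=1
    elif grep -Eq "^${var}:[[:space:]]*\"?(your-|paste-)" <<<"$content"; then
      # Value (optionally quoted) still starts with an example prefix
      echo "PLACEHOLDER: ${var}"
      missing=1
    fi
  done

  [[ $missing -eq 0 ]] && echo "OK: all metrics vault variables set"
  return "$missing"
}
```

Source the function, then pipe the decrypted vault into it: `ansible-vault view host_vars/arch-vps/vault.yml | check_vault_vars`. It exits non-zero if anything needs attention. It cannot confirm that the OAuth client secret matches the one generated in Authentik. Only a successful login through `https://metrics.jnss.me` verifies that.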