Implement complete monitoring infrastructure following rick-infra principles:

Components:
- VictoriaMetrics: Prometheus-compatible TSDB (7x less RAM usage)
- Grafana: Visualization dashboard with Authentik OAuth/OIDC integration
- node_exporter: System metrics collection (CPU, memory, disk, network)

Architecture:
- All services run as native systemd binaries (no containers)
- localhost-only binding for security
- Grafana uses native OAuth integration with Authentik (not forward_auth)
- Full systemd security hardening enabled
- Proxied via Caddy at metrics.jnss.me with HTTPS

Role Features:
- Unified metrics role (single role for complete stack)
- Automatic role mapping via Authentik groups:
  - authentik Admins OR grafana-admins -> Admin access
  - grafana-editors -> Editor access
  - All others -> Viewer access
- VictoriaMetrics auto-provisioned as default Grafana datasource
- 12-month metrics retention by default
- Comprehensive documentation included

Security:
- OAuth/OIDC SSO via Authentik
- All metrics services bind to 127.0.0.1 only
- systemd hardening (NoNewPrivileges, ProtectSystem, etc.)
- Grafana accessible only via Caddy HTTPS proxy

Documentation:
- roles/metrics/README.md: Complete role documentation
- docs/metrics-deployment-guide.md: Step-by-step deployment guide

Configuration:
- Updated rick-infra.yml to include metrics deployment
- Grafana port set to 3001 (Gitea uses 3000)
- Ready for multi-host expansion (designed for future node_exporter deployment to production hosts)

312 lines
8.4 KiB
Markdown
# Metrics Stack Deployment Guide

Complete guide to deploying the monitoring stack (VictoriaMetrics, Grafana, node_exporter) on rick-infra.

## Overview

The metrics stack provides:
- **System monitoring**: CPU, memory, disk, network via node_exporter
- **Time-series storage**: VictoriaMetrics (Prometheus-compatible; the project cites up to 7x less RAM than Prometheus)
- **Visualization**: Grafana with Authentik SSO integration
- **Access**: `https://metrics.jnss.me` with role-based permissions

## Architecture

```
User → metrics.jnss.me (HTTPS)
  ↓
Caddy (Reverse Proxy)
  ↓
Grafana (OAuth → Authentik for SSO)
  ↓
VictoriaMetrics (Time-series DB)
  ↑
node_exporter (System Metrics)
```

All metrics services bind to 127.0.0.1 only, following rick-infra security principles; Caddy is the only component reachable from the internet.

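One way to confirm the localhost-only binding after deployment is to inspect the listening sockets. The following is a minimal sketch, assuming `ss` (iproute2) is available on the host and that the services listen on the ports used in this deployment (8428 for VictoriaMetrics, 9100 for node_exporter, 3001 for Grafana). The helper name `non_loopback_listeners` is hypothetical and not part of the role.

```bash
# Hypothetical helper: print listeners on the given ports that are NOT bound to loopback.
# Reads `ss -tlnH` style lines on stdin; field 4 is "Local Address:Port".
non_loopback_listeners() {
  awk -v ports="$1" '
    BEGIN {
      n = split(ports, p, ",")
      for (i = 1; i <= n; i++) want[p[i]] = 1
    }
    {
      addr = $4
      port = addr
      sub(/.*:/, "", port)    # keep only the text after the last colon
      host = substr(addr, 1, length(addr) - length(port) - 1)
      if ((port in want) && host != "127.0.0.1" && host != "[::1]") print addr
    }'
}

# Usage on the host (no output means every listed port is loopback-only):
#   ss -tlnH | non_loopback_listeners 8428,9100,3001
```

Any line printed by the helper is a socket that is reachable from outside the host and deserves a look at the corresponding service configuration.
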
## Prerequisites

### 1. Caddy Deployed
```bash
ansible-playbook rick-infra.yml --tags caddy
```

### 2. Authentik Deployed
```bash
ansible-playbook rick-infra.yml --tags authentik
```

### 3. DNS Configuration
Ensure `metrics.jnss.me` points to the arch-vps IP:
```bash
dig metrics.jnss.me  # Should return 69.62.119.31
```

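If you want the DNS check to be scriptable rather than read by eye, a small wrapper around `dig +short` can compare the answer against the expected address. This is a sketch only; `dns_answer_includes` is a hypothetical helper name, and the IP is the one given above.

```bash
# Hypothetical helper: succeed if the expected IP is among the answers on stdin
# (one address per line, as printed by `dig +short`).
dns_answer_includes() {
  grep -Fxq -- "$1"
}

# Usage (exit status 0 means the A record matches):
#   dig +short metrics.jnss.me A | dns_answer_includes 69.62.119.31 && echo "DNS OK"
```
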
## Step 1: Configure Authentik OAuth Provider

### Create OAuth2/OIDC Provider

1. Log in to Authentik at `https://auth.jnss.me`

2. Navigate to **Applications → Providers** → **Create**

3. Configure provider:
   - **Name**: `Grafana`
   - **Type**: `OAuth2/OpenID Provider`
   - **Authentication flow**: `default-authentication-flow`
   - **Authorization flow**: `default-provider-authorization-explicit-consent`
   - **Client type**: `Confidential`
   - **Client ID**: `grafana`
   - **Client Secret**: Click **Generate** and **copy the secret** (you will paste it into the vault in Step 2)
   - **Redirect URIs**: `https://metrics.jnss.me/login/generic_oauth`
   - **Signing Key**: Select auto-generated key
   - **Scopes**: `openid`, `profile`, `email`, `groups`

4. Click **Finish**

### Create Application

1. Navigate to **Applications** → **Create**

2. Configure application:
   - **Name**: `Grafana`
   - **Slug**: `grafana`
   - **Provider**: Select `Grafana` provider created above
   - **Launch URL**: `https://metrics.jnss.me`

3. Click **Create**

### Create Groups (Optional)

For role-based access control:

1. Navigate to **Directory → Groups** → **Create**

2. Create groups:
   - **grafana-admins**: Full admin access to Grafana
   - **grafana-editors**: Can create/edit dashboards
   - All other users get Viewer access

3. Add users to groups as needed

## Step 2: Configure Vault Variables

Edit the vault file:
```bash
ansible-vault edit host_vars/arch-vps/vault.yml
```

Add these variables:
```yaml
# Grafana admin password (for emergency local login)
vault_grafana_admin_password: "your-secure-admin-password"

# Grafana secret key (generate with: openssl rand -base64 32)
vault_grafana_secret_key: "your-random-32-char-secret-key"

# OAuth credentials from Authentik
vault_grafana_oauth_client_id: "grafana"
vault_grafana_oauth_client_secret: "paste-secret-from-authentik-here"
```

Save and close (`:wq` in vim).

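Before deploying, it can help to confirm that all four variables actually made it into the vault. The sketch below assumes the vault is a flat YAML file with one `name: value` pair per line, as in the example above; `missing_vault_vars` is a hypothetical helper and not part of the role.

```bash
# Hypothetical pre-flight helper: read decrypted vault YAML on stdin and print
# the name of every variable from this step that is absent or has nothing
# after the colon.
missing_vault_vars() {
  vault_text=$(cat)
  for name in vault_grafana_admin_password vault_grafana_secret_key \
              vault_grafana_oauth_client_id vault_grafana_oauth_client_secret; do
    printf '%s\n' "$vault_text" |
      grep -Eq "^${name}:[[:space:]]*[^[:space:]]" || echo "$name"
  done
}

# Usage (no output means all four variables are set):
#   ansible-vault view host_vars/arch-vps/vault.yml | missing_vault_vars
```

The check only looks at variable names, so it will not catch a mistyped secret; it is meant to catch a forgotten entry before the playbook run.
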
## Step 3: Deploy Metrics Stack

Deploy all components:
```bash
ansible-playbook rick-infra.yml --tags metrics
```

This will:
1. Install and configure VictoriaMetrics
2. Install and configure node_exporter
3. Install and configure Grafana with OAuth
4. Deploy Caddy configuration for `metrics.jnss.me`

Expected output (exact task counts will vary):
```
PLAY RECAP *******************************************************
arch-vps : ok=25 changed=15 unreachable=0 failed=0 skipped=0
```

## Step 4: Verify Deployment

### Check Services

SSH to arch-vps and verify services:
```bash
# Check all services are running
systemctl status victoriametrics grafana node_exporter

# Check service health
curl http://127.0.0.1:8428/health       # VictoriaMetrics
curl http://127.0.0.1:9100/metrics      # node_exporter
curl http://127.0.0.1:3001/api/health   # Grafana (port 3001; Gitea uses 3000)
```

### Check HTTPS Access

```bash
curl -I https://metrics.jnss.me
# Should return 200 or 302 (redirect to Authentik)
```

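The same check can be scripted by asking `curl` for the status code only. This is a minimal sketch; `https_status_ok` is a hypothetical helper that simply encodes the two status codes named above.

```bash
# Hypothetical helper: accept the status codes this guide treats as healthy.
https_status_ok() {
  case "$1" in
    200|302) return 0 ;;
    *)       return 1 ;;
  esac
}

# Usage (curl reports 000 when it cannot connect at all):
#   code=$(curl -s -o /dev/null -w '%{http_code}' https://metrics.jnss.me)
#   if https_status_ok "$code"; then echo "HTTPS OK ($code)"; else echo "unexpected status: $code"; fi
```
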
### Check Metrics Collection

```bash
# Check VictoriaMetrics scrape targets
curl http://127.0.0.1:8428/api/v1/targets

# Should show node_exporter as "up"
```

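The targets response is JSON and can be long, so a quick count of unhealthy targets is sometimes easier to read. The sketch below assumes the response carries one `"health"` field per active target, as in the Prometheus targets API that VictoriaMetrics mirrors; `count_unhealthy_targets` is a hypothetical helper, and it deliberately avoids `jq` so it runs on a minimal host.

```bash
# Hypothetical helper: read the targets JSON on stdin and print how many
# targets report a health value other than "up".
count_unhealthy_targets() {
  grep -o '"health"[[:space:]]*:[[:space:]]*"[a-z]*"' | grep -vc '"up"' || true
}

# Usage (0 means no target reports a problem; it is also 0 when there are no targets at all):
#   curl -s http://127.0.0.1:8428/api/v1/targets | count_unhealthy_targets
```
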
## Step 5: Access Grafana

1. Navigate to `https://metrics.jnss.me`
2. Click **"Sign in with Authentik"**
3. Log in with your Authentik credentials
4. You should be redirected to the Grafana dashboard

First login will:
- Auto-create your Grafana user
- Assign role based on Authentik group membership
- Grant access to default organization

## Step 6: Verify Data Source

1. In Grafana, navigate to **Connections → Data sources**
2. Verify **VictoriaMetrics** is listed and default
3. Click on VictoriaMetrics → **Save & test**
4. Should show green "Data source is working" message

## Step 7: Create First Dashboard

### Option 1: Import Community Dashboard (Recommended)

1. Navigate to **Dashboards → Import**
2. Enter dashboard ID: `1860` (Node Exporter Full)
3. Click **Load**
4. Select **VictoriaMetrics** as data source
5. Click **Import**

You now have a comprehensive system monitoring dashboard!

### Option 2: Create Custom Dashboard

1. Navigate to **Dashboards → New → New Dashboard**
2. Click **Add visualization**
3. Select **VictoriaMetrics** data source
4. Enter PromQL query:
   ```promql
   # CPU usage
   100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
   ```
5. Click **Apply**

## Step 8: Configure Alerting (Optional)

Grafana supports alerting on metrics. Configure via **Alerting → Alert rules**.

Example alert condition for high CPU (true when average idle time drops below 20%, i.e. CPU usage exceeds 80%):
```promql
avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 < 20
```

## Troubleshooting

### OAuth Login Fails

**Symptom**: Redirect to Authentik, but returns error after login

**Solution**:
1. Verify redirect URI in Authentik matches exactly: `https://metrics.jnss.me/login/generic_oauth`
2. Check Grafana logs: `journalctl -u grafana -f`
3. Verify OAuth credentials in vault match Authentik

### No Metrics in Grafana

**Symptom**: Data source working, but no data in dashboards

**Solution**:
1. Check VictoriaMetrics targets: `curl http://127.0.0.1:8428/api/v1/targets`
2. Verify node_exporter is up: `systemctl status node_exporter`
3. Check time range in Grafana (top right) - try "Last 5 minutes"

### Can't Access metrics.jnss.me

**Symptom**: Connection timeout or SSL error

**Solution**:
1. Verify DNS: `dig metrics.jnss.me`
2. Check Caddy is running: `systemctl status caddy`
3. Check Caddy logs: `journalctl -u caddy -f`
4. Verify the Caddy site config was deployed: `ls /etc/caddy/sites/grafana.caddy`

### Wrong Grafana Role

**Symptom**: User has wrong permissions (e.g., Viewer instead of Admin)

**Solution**:
1. Verify the user is in the correct Authentik group (`grafana-admins` or `grafana-editors`)
2. Log out of Grafana and log in again
3. Check role mapping expression in `roles/metrics/defaults/main.yml`:
   ```yaml
   grafana_oauth_role_attribute_path: "contains(groups, 'grafana-admins') && 'Admin' || contains(groups, 'grafana-editors') && 'Editor' || 'Viewer'"
   ```

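The mapping expression checks `grafana-admins` first, then `grafana-editors`, and falls back to Viewer. The sketch below mirrors that precedence in shell so you can reason about which role a given set of groups should produce. `grafana_role_for_groups` is a hypothetical helper for illustration only; Grafana itself evaluates the expression against the `groups` claim, not this function.

```bash
# Hypothetical helper: print the Grafana role that the mapping expression
# would yield for the group names passed as arguments.
grafana_role_for_groups() {
  role=Viewer
  for group in "$@"; do
    case "$group" in
      grafana-admins)  role=Admin; break ;;   # Admin always wins
      grafana-editors) role=Editor ;;         # keep looking for an admin group
    esac
  done
  echo "$role"
}

# Usage:
#   grafana_role_for_groups users grafana-editors
```

If the role shown in Grafana differs from what this precedence predicts, the likely cause is that the `groups` claim is not reaching Grafana, which points back to the scopes configured on the Authentik provider.
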
## Next Steps

### Add More Hosts

To monitor additional hosts (e.g., mini-vps):

1. Deploy node_exporter to target host
2. Update VictoriaMetrics scrape config to include remote targets
3. Configure remote_write or federation

### Add Service Metrics

To monitor containerized services:

1. Expose a `/metrics` endpoint in the application (port 8080 in this example)
2. Add scrape config in `roles/metrics/templates/scrape.yml.j2`:
   ```yaml
   - job_name: 'myservice'
     static_configs:
       - targets: ['127.0.0.1:8080']
   ```
3. Redeploy the metrics role

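Before adding the scrape job, it is worth checking that the endpoint really serves metrics in the Prometheus text format. This is a rough sketch, assuming the example service from the steps above on port 8080; `count_metric_samples` is a hypothetical helper that counts sample lines and skips `# HELP` / `# TYPE` comments and blank lines.

```bash
# Hypothetical helper: count sample lines in Prometheus text exposition output
# read from stdin, ignoring comment lines and blank lines.
count_metric_samples() {
  grep -Evc '^[[:space:]]*(#|$)' || true
}

# Usage (a count of 0 means the endpoint exposes nothing to scrape):
#   curl -s http://127.0.0.1:8080/metrics | count_metric_samples
```
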
### Set Up Alerting

1. Configure contact points (notification channels) in Grafana (Email, Slack, etc.)
2. Create alert rules for critical metrics
3. Set up on-call rotation if needed

## Security Notes

- All metrics services run on localhost only
- Grafana is the only internet-facing component (via Caddy HTTPS)
- OAuth provides SSO with Authentik (no separate Grafana passwords)
- systemd hardening enabled on all services
- Default admin account should only be used for emergencies

## Resources

- **VictoriaMetrics Docs**: https://docs.victoriametrics.com/
- **Grafana Docs**: https://grafana.com/docs/
- **PromQL Guide**: https://prometheus.io/docs/prometheus/latest/querying/basics/
- **Dashboard Library**: https://grafana.com/grafana/dashboards/
- **Authentik OAuth**: https://goauthentik.io/docs/providers/oauth2/

## Support

For issues specific to rick-infra metrics deployment:
1. Check service logs: `journalctl -u <service> -f`
2. Review role README: `roles/metrics/README.md`
3. Verify vault variables are correctly set
4. Ensure Authentik OAuth provider is properly configured