rick-infra/roles/metrics/README.md
Joakim 1f3f111d88 Add metrics monitoring stack with VictoriaMetrics, Grafana, and node_exporter
Implement complete monitoring infrastructure following rick-infra principles:

Components:
- VictoriaMetrics: Prometheus-compatible TSDB (7x less RAM usage)
- Grafana: Visualization dashboard with Authentik OAuth/OIDC integration
- node_exporter: System metrics collection (CPU, memory, disk, network)

Architecture:
- All services run as native systemd binaries (no containers)
- localhost-only binding for security
- Grafana uses native OAuth integration with Authentik (not forward_auth)
- Full systemd security hardening enabled
- Proxied via Caddy at metrics.jnss.me with HTTPS

Role Features:
- Unified metrics role (single role for complete stack)
- Automatic role mapping via Authentik groups:
  - authentik Admins OR grafana-admins -> Admin access
  - grafana-editors -> Editor access
  - All others -> Viewer access
- VictoriaMetrics auto-provisioned as default Grafana datasource
- 12-month metrics retention by default
- Comprehensive documentation included

Security:
- OAuth/OIDC SSO via Authentik
- All metrics services bind to 127.0.0.1 only
- systemd hardening (NoNewPrivileges, ProtectSystem, etc.)
- Grafana accessible only via Caddy HTTPS proxy

Documentation:
- roles/metrics/README.md: Complete role documentation
- docs/metrics-deployment-guide.md: Step-by-step deployment guide

Configuration:
- Updated rick-infra.yml to include metrics deployment
- Grafana port set to 3001 (Gitea uses 3000)
- Ready for multi-host expansion (designed for future node_exporter deployment to production hosts)
2025-12-28 19:18:30 +01:00

Metrics Role

Complete monitoring stack for rick-infra providing system metrics collection, storage, and visualization with SSO integration.

Components

VictoriaMetrics

  • Purpose: Time-series database for metrics storage
  • Type: Native systemd service
  • Listen: 127.0.0.1:8428 (localhost only)
  • Features:
    • Prometheus-compatible API and PromQL
    • Up to 7x less RAM usage than Prometheus, per the VictoriaMetrics documentation
    • Single binary deployment
    • 12-month data retention by default

Grafana

  • Purpose: Metrics visualization and dashboarding
  • Type: Native systemd service
  • Listen: 127.0.0.1:3001 (localhost only, proxied via Caddy; Gitea uses 3000)
  • Domain: metrics.jnss.me
  • Features:
    • OAuth/OIDC integration with Authentik
    • Role-based access control via Authentik groups
    • VictoriaMetrics as default data source

node_exporter

  • Purpose: System metrics collection
  • Type: Native systemd service
  • Listen: 127.0.0.1:9100 (localhost only)
  • Metrics: CPU, memory, disk, network, systemd units

Architecture

┌─────────────────────────────────────────────────────┐
│ metrics.jnss.me (Grafana Dashboard)                │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Caddy (HTTPS)                                   │ │
│ │   ↓                                             │ │
│ │ Grafana (OAuth → Authentik)                     │ │
│ │   ↓                                             │ │
│ │ VictoriaMetrics (Prometheus-compatible)         │ │
│ │   ↑                                             │ │
│ │ node_exporter (System Metrics)                  │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘

Deployment

Prerequisites

  1. Caddy role deployed - Required for HTTPS proxy
  2. Authentik deployed - Required for OAuth/SSO
  3. Vault variables configured:
    # In host_vars/arch-vps/vault.yml
    vault_grafana_admin_password: "secure-admin-password"
    vault_grafana_secret_key: "random-secret-key-32-chars"
    vault_grafana_oauth_client_id: "grafana"
    vault_grafana_oauth_client_secret: "oauth-client-secret-from-authentik"
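
The placeholder values above need to be replaced with real secrets. As a minimal sketch, random hex strings can be generated from /dev/urandom; `gen_secret` is a hypothetical helper for illustration and is not part of the role:

```shell
# gen_secret [LENGTH]: print LENGTH random hex characters (LENGTH must be even).
# Hypothetical helper, not shipped with the metrics role.
gen_secret() (
  len="${1:-32}"
  # Read len/2 random bytes, dump them as hex, keep only the hex digits.
  head -c "$((len / 2))" /dev/urandom | od -An -tx1 | tr -dc '0-9a-f'
  echo
)

# Example: print candidate values to paste into the vault file
# (opened with: ansible-vault edit host_vars/arch-vps/vault.yml)
# gen_secret 32   # e.g. for vault_grafana_secret_key
# gen_secret 48   # e.g. for vault_grafana_admin_password
```

The OAuth client secret is different: it has to match the value generated in Authentik, so it should be copied from there rather than generated locally.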
    

Authentik Configuration

Before deployment, create an OAuth2/OIDC provider in Authentik:

  1. Create Provider:

    • Name: Grafana
    • Type: OAuth2/OpenID Provider
    • Client ID: grafana
    • Client Secret: Generate and save to vault
    • Redirect URIs: https://metrics.jnss.me/login/generic_oauth
    • Signing Key: Auto-generated
  2. Create Application:

    • Name: Grafana
    • Slug: grafana
    • Provider: Select the Grafana provider created above
  3. Create Groups (optional, for role mapping):

    • grafana-admins - Full admin access
    • grafana-editors - Can create/edit dashboards
    • Users without these groups get Viewer access
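
Once the provider and application exist, it can help to confirm that Authentik serves the OIDC endpoints Grafana will call. The sketch below prints the URLs Authentik normally exposes under /application/o/ for a given slug. The base URL `https://auth.example.com` is a placeholder (this document does not name the Authentik host), and `authentik_oidc_urls` is a hypothetical helper:

```shell
# authentik_oidc_urls BASE_URL SLUG: print the OIDC endpoints for one application.
# The key names on the left follow Grafana's generic OAuth settings.
authentik_oidc_urls() (
  base="${1%/}"   # drop a single trailing slash, if present
  slug="$2"
  printf 'discovery=%s/application/o/%s/.well-known/openid-configuration\n' "$base" "$slug"
  printf 'auth_url=%s/application/o/authorize/\n' "$base"
  printf 'token_url=%s/application/o/token/\n' "$base"
  printf 'api_url=%s/application/o/userinfo/\n' "$base"
)

# Example (placeholder host):
# authentik_oidc_urls https://auth.example.com grafana
```

Fetching the discovery URL with curl and comparing its `authorization_endpoint`, `token_endpoint`, and `userinfo_endpoint` fields against the printed values is a quick sanity check before deploying.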

Deploy

# Deploy complete metrics stack
ansible-playbook rick-infra.yml --tags metrics

# Deploy individual components
ansible-playbook rick-infra.yml --tags victoriametrics
ansible-playbook rick-infra.yml --tags grafana
ansible-playbook rick-infra.yml --tags node_exporter

Verify Deployment

# Check service status
ansible homelab -a "systemctl status victoriametrics grafana node_exporter"

# Check metrics collection
curl http://127.0.0.1:9100/metrics  # node_exporter metrics
curl http://127.0.0.1:8428/metrics  # VictoriaMetrics metrics
curl http://127.0.0.1:8428/api/v1/targets  # Scrape targets

# Access Grafana
curl -I https://metrics.jnss.me/  # Should redirect to Authentik login
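
The last check can be scripted. The following is a rough sketch that fetches only the HTTP status code and classifies it; both function names are hypothetical, and the set of status codes treated as a redirect is an assumption about how Grafana and Caddy respond to an unauthenticated request:

```shell
# classify_grafana_status CODE: interpret the status of an unauthenticated request.
classify_grafana_status() {
  case "$1" in
    301|302|303|307|308) echo "redirect" ;;     # expected: sent on to the login flow
    200)                 echo "ok" ;;
    000)                 echo "unreachable" ;;  # curl reports 000 when no response arrived
    *)                   echo "unexpected" ;;
  esac
}

# check_grafana URL: request the URL, discard the body, report the status.
check_grafana() {
  code="$(curl -s -o /dev/null -w '%{http_code}' "$1" || true)"
  echo "$1 -> $code ($(classify_grafana_status "$code"))"
}

# Example:
# check_grafana https://metrics.jnss.me/
```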

Usage

Access Dashboard

  1. Navigate to https://metrics.jnss.me
  2. Click "Sign in with Authentik"
  3. Authenticate via Authentik SSO
  4. Access granted based on Authentik group membership

Role Mapping

Grafana roles are automatically assigned based on Authentik groups:

  • Admin: Members of grafana-admins group

    • Full administrative access
    • Can manage users, data sources, plugins
    • Can create/edit/delete all dashboards
  • Editor: Members of grafana-editors group

    • Can create and edit dashboards
    • Cannot manage users or data sources
  • Viewer: All other authenticated users

    • Read-only access to dashboards
    • Cannot create or edit dashboards
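
The precedence in this mapping (Admin first, then Editor, otherwise Viewer) can be sketched as a small shell function. This only mirrors the rules listed above for illustration; Grafana itself evaluates a JMESPath expression, and `grafana_role_for` is a hypothetical name:

```shell
# grafana_role_for GROUP...: print the Grafana role for a user's Authentik groups.
grafana_role_for() (
  role="Viewer"                                 # default for any authenticated user
  for group in "$@"; do
    case "$group" in
      grafana-admins)  echo "Admin"; exit 0 ;;  # Admin wins regardless of other groups
      grafana-editors) role="Editor" ;;         # keep scanning in case Admin follows
    esac
  done
  echo "$role"
)

# Examples:
# grafana_role_for staff grafana-editors           # prints Editor
# grafana_role_for grafana-editors grafana-admins  # prints Admin
```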

Creating Dashboards

Grafana comes with VictoriaMetrics pre-configured as the default data source. Build panels with PromQL queries such as:

# CPU usage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Disk usage
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)

# Network traffic
irate(node_network_receive_bytes_total[5m])
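
These queries can also be tried outside Grafana through the Prometheus-compatible HTTP API that VictoriaMetrics serves on 127.0.0.1:8428. The sketch below wraps that call and, separately, reproduces the arithmetic of the disk usage query on plain numbers. Both helper names are hypothetical:

```shell
# vm_query EXPR: run an instant PromQL query against the local VictoriaMetrics.
vm_query() {
  curl -s --data-urlencode "query=$1" http://127.0.0.1:8428/api/v1/query
}

# used_percent AVAIL SIZE: 100 - (avail / size * 100), as in the disk usage query.
used_percent() {
  awk -v avail="$1" -v size="$2" \
    'BEGIN { printf "%.1f\n", 100 - (avail / size * 100) }'
}

# Examples:
# vm_query 'up'
# vm_query '100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)'
# used_percent 25 100   # 25 of 100 units free
```

The response is JSON in the Prometheus API format, so a tool such as jq can be used to extract individual values if it is installed.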

Import Community Dashboards

  1. Browse dashboards at https://grafana.com/grafana/dashboards/
  2. Recommended for node_exporter:
    • Dashboard ID: 1860 (Node Exporter Full)
    • Dashboard ID: 11074 (Node Exporter for Prometheus)
  3. Import via Grafana UI: Dashboards → Import → Enter ID

Configuration

Customization

Key configuration options in roles/metrics/defaults/main.yml:

# Data retention
victoriametrics_retention_period: "12"  # months

# Scrape interval
victoriametrics_scrape_interval: "15s"

# OAuth role mapping (JMESPath expression)
grafana_oauth_role_attribute_path: "contains(groups, 'grafana-admins') && 'Admin' || contains(groups, 'grafana-editors') && 'Editor' || 'Viewer'"

# Memory limits
victoriametrics_memory_allowed_percent: "60"
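
For orientation, here is a sketch of how the retention and memory values could end up as VictoriaMetrics command-line flags. The flags themselves exist in VictoriaMetrics, but the mapping from role variables to flags is an assumption; the role's own service template is the authoritative source:

```shell
# vm_flags RETENTION_MONTHS MEMORY_PERCENT: print one flag per line.
# Hypothetical mapping; listen address and data path are taken from this README.
vm_flags() {
  printf '%s\n' \
    "-httpListenAddr=127.0.0.1:8428" \
    "-storageDataPath=/var/lib/victoriametrics" \
    "-retentionPeriod=$1" \
    "-memory.allowedPercent=$2"
}

# Example with the defaults shown above:
# vm_flags 12 60
```

A bare number for `-retentionPeriod` is interpreted as months, which matches the comment in the defaults. The scrape interval is usually set in the scrape configuration file rather than passed as a flag.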

Adding Scrape Targets

Edit roles/metrics/templates/scrape.yml.j2:

scrape_configs:
  # Add custom application metrics
  - job_name: 'myapp'
    static_configs:
      - targets: ['127.0.0.1:8080']
        labels:
          service: 'myapp'
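
When several local services need the same kind of entry, the stanza can be generated instead of typed by hand. This is a small sketch that prints a job in the same shape as the example above; `scrape_job` is a hypothetical helper and the output still has to be pasted into the template:

```shell
# scrape_job NAME TARGET: print a static scrape job for scrape_configs.
scrape_job() {
  cat <<EOF
  - job_name: '$1'
    static_configs:
      - targets: ['$2']
        labels:
          service: '$1'
EOF
}

# Example:
# scrape_job myapp 127.0.0.1:8080
```

After editing the template, re-running the playbook with the victoriametrics tag should roll out the change, and the new job should then appear in the /api/v1/targets output.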

Operations

Service Management

# VictoriaMetrics
systemctl status victoriametrics
systemctl restart victoriametrics
journalctl -u victoriametrics -f

# Grafana
systemctl status grafana
systemctl restart grafana
journalctl -u grafana -f

# node_exporter
systemctl status node_exporter
systemctl restart node_exporter
journalctl -u node_exporter -f

Data Locations

/var/lib/victoriametrics/  # Time-series data
/var/lib/grafana/          # Grafana database and dashboards
/var/log/grafana/          # Grafana logs
/etc/victoriametrics/      # VictoriaMetrics config
/etc/grafana/              # Grafana config

Backup

VictoriaMetrics data is stored in /var/lib/victoriametrics:

# Stop service
systemctl stop victoriametrics

# Backup data
tar -czf victoriametrics-backup-$(date +%Y%m%d).tar.gz /var/lib/victoriametrics

# Start service
systemctl start victoriametrics

Grafana dashboards are stored in a SQLite database at /var/lib/grafana/grafana.db:

# Backup Grafana
systemctl stop grafana
tar -czf grafana-backup-$(date +%Y%m%d).tar.gz /var/lib/grafana /etc/grafana
systemctl start grafana
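
Both procedures follow the same stop, archive, start pattern, so they can be wrapped in one function. The sketch below also restarts the service if tar fails, which the manual steps do not guarantee. The function names are hypothetical and the commands need to run as root:

```shell
# backup_name SERVICE [YYYYMMDD]: archive name in the same pattern as above.
backup_name() {
  echo "$1-backup-${2:-$(date +%Y%m%d)}.tar.gz"
}

# backup_service SERVICE PATH...: stop the service, archive the paths, restart.
backup_service() (
  svc="$1"; shift
  systemctl stop "$svc" || exit 1
  trap 'systemctl start "$svc"' EXIT   # runs on success and on failure
  tar -czf "$(backup_name "$svc")" "$@"
)

# Examples:
# backup_service victoriametrics /var/lib/victoriametrics
# backup_service grafana /var/lib/grafana /etc/grafana
```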

Security

Authentication

  • Grafana protected by Authentik OAuth/OIDC
  • Local admin account available for emergency access
  • All services bind to localhost only

Network Security

  • VictoriaMetrics: 127.0.0.1:8428 (no external access)
  • Grafana: 127.0.0.1:3001 (proxied via Caddy with HTTPS)
  • node_exporter: 127.0.0.1:9100 (no external access)
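
The localhost-only bindings can be checked on the host from the output of `ss -ltn`. The sketch below reads that output on stdin and prints any listener on the given ports that is not bound to a loopback address; it assumes the usual `ss` column layout, where the fourth column is the local address and port. `non_loopback_listeners` is a hypothetical helper:

```shell
# non_loopback_listeners PORT...: filter `ss -ltn` output read from stdin.
# Prints offending "address:port" values; prints nothing if all are loopback.
non_loopback_listeners() {
  awk -v ports="$*" '
    BEGIN { n = split(ports, p, " "); for (i = 1; i <= n; i++) want[p[i]] = 1 }
    $1 == "LISTEN" {
      addr = $4                                   # Local Address:Port column
      port = addr; sub(/.*:/, "", port)           # text after the last colon
      host = substr(addr, 1, length(addr) - length(port) - 1)
      if ((port in want) && host != "127.0.0.1" && host != "[::1]") print addr
    }'
}

# Example (add the Grafana port configured for this host):
# ss -ltn | non_loopback_listeners 8428 9100
```

Empty output means every checked port is bound to loopback only.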

systemd Hardening

All services run with security restrictions:

  • NoNewPrivileges=true
  • ProtectSystem=strict
  • ProtectHome=true
  • PrivateTmp=true
  • Read-only filesystem (except data directories)
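
To confirm that a deployed unit actually carries these settings, the unit text can be scanned for the directives listed above. This is a simple sketch that matches exact lines only, so an equivalent spelling such as `NoNewPrivileges=yes` would be reported as missing. `missing_hardening` is a hypothetical helper:

```shell
# missing_hardening: read a unit file on stdin and print each directive
# from the list above that does not appear as an exact line.
missing_hardening() {
  unit="$(cat)"
  for directive in NoNewPrivileges=true ProtectSystem=strict \
                   ProtectHome=true PrivateTmp=true; do
    printf '%s\n' "$unit" | grep -qxF "$directive" || echo "$directive"
  done
}

# Example:
# systemctl cat victoriametrics | missing_hardening
```

For a broader assessment, `systemd-analyze security victoriametrics` reports an exposure score for the unit along with the settings that contribute to it.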

Troubleshooting

Grafana OAuth Not Working

  1. Check Authentik provider configuration:

    # Verify redirect URI matches
    # https://metrics.jnss.me/login/generic_oauth
    
  2. Check Grafana logs:

    journalctl -u grafana -f
    
  3. Verify OAuth credentials in vault match Authentik

No Metrics in Grafana

  1. Check VictoriaMetrics scrape targets:

    curl http://127.0.0.1:8428/api/v1/targets
    
  2. Check node_exporter is running:

    systemctl status node_exporter
    curl http://127.0.0.1:9100/metrics
    
  3. Check VictoriaMetrics logs:

    journalctl -u victoriametrics -f
    

High Memory Usage

VictoriaMetrics is configured so that its caches use at most 60% of system memory. Lower the value if the host is under memory pressure:

# In roles/metrics/defaults/main.yml
victoriametrics_memory_allowed_percent: "40"  # Reduce to 40%

See Also