Implement complete monitoring infrastructure following rick-infra principles:

Components:
- VictoriaMetrics: Prometheus-compatible TSDB (7x less RAM usage)
- Grafana: Visualization dashboard with Authentik OAuth/OIDC integration
- node_exporter: System metrics collection (CPU, memory, disk, network)

Architecture:
- All services run as native systemd binaries (no containers)
- localhost-only binding for security
- Grafana uses native OAuth integration with Authentik (not forward_auth)
- Full systemd security hardening enabled
- Proxied via Caddy at metrics.jnss.me with HTTPS

Role Features:
- Unified metrics role (single role for complete stack)
- Automatic role mapping via Authentik groups:
  - authentik Admins OR grafana-admins -> Admin access
  - grafana-editors -> Editor access
  - All others -> Viewer access
- VictoriaMetrics auto-provisioned as default Grafana datasource
- 12-month metrics retention by default
- Comprehensive documentation included

Security:
- OAuth/OIDC SSO via Authentik
- All metrics services bind to 127.0.0.1 only
- systemd hardening (NoNewPrivileges, ProtectSystem, etc.)
- Grafana accessible only via Caddy HTTPS proxy

Documentation:
- roles/metrics/README.md: Complete role documentation
- docs/metrics-deployment-guide.md: Step-by-step deployment guide

Configuration:
- Updated rick-infra.yml to include metrics deployment
- Grafana port set to 3001 (Gitea uses 3000)
- Ready for multi-host expansion (designed for future node_exporter deployment to production hosts)

326 lines
9.0 KiB
Markdown
# Metrics Role

Complete monitoring stack for rick-infra providing system metrics collection, storage, and visualization with SSO integration.

## Components

### VictoriaMetrics
- **Purpose**: Time-series database for metrics storage
- **Type**: Native systemd service
- **Listen**: `127.0.0.1:8428` (localhost only)
- **Features**:
  - Prometheus-compatible API and PromQL
  - Up to 7x less RAM usage than Prometheus, according to the VictoriaMetrics project
  - Single binary deployment
  - 12-month data retention by default

### Grafana
- **Purpose**: Metrics visualization and dashboarding
- **Type**: Native systemd service
- **Listen**: `127.0.0.1:3001` (localhost only, proxied via Caddy; port 3000 is used by Gitea)
- **Domain**: `metrics.jnss.me`
- **Features**:
  - OAuth/OIDC integration with Authentik
  - Role-based access control via Authentik groups
  - VictoriaMetrics as default data source

### node_exporter
- **Purpose**: System metrics collection
- **Type**: Native systemd service
- **Listen**: `127.0.0.1:9100` (localhost only)
- **Metrics**: CPU, memory, disk, network, systemd units

## Architecture

```
┌─────────────────────────────────────────────────────┐
│  metrics.jnss.me (Grafana Dashboard)                │
│ ┌─────────────────────────────────────────────────┐ │
│ │  Caddy (HTTPS)                                  │ │
│ │         ↓                                       │ │
│ │  Grafana (OAuth → Authentik)                    │ │
│ │         ↓                                       │ │
│ │  VictoriaMetrics (Prometheus-compatible)        │ │
│ │         ↑                                       │ │
│ │  node_exporter (System Metrics)                 │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
```

## Deployment

### Prerequisites

1. **Caddy role deployed** - Required for HTTPS proxy
2. **Authentik deployed** - Required for OAuth/SSO
3. **Vault variables configured**:
   ```yaml
   # In host_vars/arch-vps/vault.yml
   vault_grafana_admin_password: "secure-admin-password"
   vault_grafana_secret_key: "random-secret-key-32-chars"
   vault_grafana_oauth_client_id: "grafana"
   vault_grafana_oauth_client_secret: "oauth-client-secret-from-authentik"
   ```

### Authentik Configuration

Before deployment, create an OAuth2/OIDC provider in Authentik:

1. **Create Provider**:
   - Name: `Grafana`
   - Type: `OAuth2/OpenID Provider`
   - Client ID: `grafana`
   - Client Secret: Generate and save to vault
   - Redirect URIs: `https://metrics.jnss.me/login/generic_oauth`
   - Signing Key: Auto-generated

2. **Create Application**:
   - Name: `Grafana`
   - Slug: `grafana`
   - Provider: Select the Grafana provider created above

3. **Create Groups** (optional, for role mapping):
   - `grafana-admins` - Full admin access
   - `grafana-editors` - Can create/edit dashboards
   - Users without these groups get Viewer access

### Deploy

```bash
# Deploy complete metrics stack
ansible-playbook rick-infra.yml --tags metrics

# Deploy individual components
ansible-playbook rick-infra.yml --tags victoriametrics
ansible-playbook rick-infra.yml --tags grafana
ansible-playbook rick-infra.yml --tags node_exporter
```

### Verify Deployment

```bash
# Check service status
ansible homelab -a "systemctl status victoriametrics grafana node_exporter"

# Check metrics collection
curl http://127.0.0.1:9100/metrics         # node_exporter metrics
curl http://127.0.0.1:8428/metrics         # VictoriaMetrics metrics
curl http://127.0.0.1:8428/api/v1/targets  # Scrape targets

# Access Grafana
curl -I https://metrics.jnss.me/           # Should redirect to Authentik login
```

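The raw `/metrics` output is long, so it can help to pull out a single value when checking that collection works. The following is a minimal sketch, not part of the role: `metric_value` is a hypothetical helper that reads Prometheus text exposition format on stdin and prints the value of one label-less metric. It assumes the metric has no labels (for example `node_load1`); labelled series such as `node_cpu_seconds_total{...}` would need a different match.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the role): print the value of a
# label-less metric from Prometheus text exposition format read on stdin.
metric_value() {
  # Comment lines start with "#", so their first field never equals the name.
  awk -v name="$1" '$1 == name { print $2; exit }'
}
```

A possible use on the host would be `curl -s http://127.0.0.1:9100/metrics | metric_value node_load1`; empty output suggests the metric is not being exported.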
## Usage

### Access Dashboard

1. Navigate to `https://metrics.jnss.me`
2. Click "Sign in with Authentik"
3. Authenticate via Authentik SSO
4. Access granted based on Authentik group membership

### Role Mapping

Grafana roles are automatically assigned based on Authentik groups:

- **Admin**: Members of `grafana-admins` group
  - Full administrative access
  - Can manage users, data sources, plugins
  - Can create/edit/delete all dashboards

- **Editor**: Members of `grafana-editors` group
  - Can create and edit dashboards
  - Cannot manage users or data sources

- **Viewer**: All other authenticated users
  - Read-only access to dashboards
  - Cannot create or edit dashboards

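The precedence described above (Admin wins over Editor, everything else falls back to Viewer) can be sketched as a small shell function. This is only an illustration of the decision order: Grafana evaluates the mapping itself from the OAuth claims, and `grafana_role` is a hypothetical name that does not exist in the role.

```bash
#!/usr/bin/env bash
# Hypothetical illustration: resolve a Grafana role from Authentik group names
# passed as arguments, using the same precedence as the list above.
grafana_role() {
  local role="Viewer" group
  for group in "$@"; do
    case "$group" in
      grafana-admins)  role="Admin"; break ;;  # highest role, stop looking
      grafana-editors) role="Editor" ;;        # keep scanning for an admin group
    esac
  done
  printf '%s\n' "$role"
}
```

For example, `grafana_role users grafana-editors` prints `Editor`, and a user in both groups resolves to `Admin`.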
### Creating Dashboards

Grafana comes with VictoriaMetrics pre-configured as the default data source. Use PromQL queries:

```promql
# CPU usage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Disk usage
100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)

# Network traffic
irate(node_network_receive_bytes_total[5m])
```

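To sanity-check a panel, the disk usage expression above can be reproduced by hand from two sample values. The sketch below applies the same arithmetic, `100 - (avail / size) * 100`, to numbers passed on the command line. `disk_used_percent` is a hypothetical helper for illustration only and is not part of the role.

```bash
#!/usr/bin/env bash
# Hypothetical helper: apply the disk usage formula to two byte counts.
#   $1 = available bytes (node_filesystem_avail_bytes)
#   $2 = total bytes     (node_filesystem_size_bytes)
disk_used_percent() {
  awk -v avail="$1" -v size="$2" 'BEGIN {
    if (size <= 0) exit 1            # avoid division by zero
    printf "%.1f\n", 100 - (avail / size) * 100
  }'
}
```

With 25 units available out of 100, `disk_used_percent 25 100` prints `75.0`, which is the value the PromQL expression would report for that filesystem.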
### Import Community Dashboards

1. Browse dashboards at https://grafana.com/grafana/dashboards/
2. Recommended for node_exporter:
   - Dashboard ID: 1860 (Node Exporter Full)
   - Dashboard ID: 11074 (Node Exporter for Prometheus)
3. Import via Grafana UI: Dashboards → Import → Enter ID

## Configuration

### Customization

Key configuration options in `roles/metrics/defaults/main.yml`:

```yaml
# Data retention
victoriametrics_retention_period: "12"  # months

# Scrape interval
victoriametrics_scrape_interval: "15s"

# OAuth role mapping (JMESPath expression)
grafana_oauth_role_attribute_path: "contains(groups, 'grafana-admins') && 'Admin' || contains(groups, 'grafana-editors') && 'Editor' || 'Viewer'"

# Memory limits
victoriametrics_memory_allowed_percent: "60"
```

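For orientation, the retention and memory settings above correspond to VictoriaMetrics command-line flags (`-retentionPeriod`, which defaults to months when no unit is given, and `-memory.allowedPercent`). The sketch below shows one plausible mapping. How the role's systemd unit template actually renders these variables is not shown in this README, so the function `vm_flags` and the exact flag set are assumptions; the listen address and data path are taken from elsewhere in this document.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: print the VictoriaMetrics flags that the role variables
# would plausibly translate to. Not the role's actual template.
#   $1 = retention in months   (victoriametrics_retention_period, default 12)
#   $2 = allowed memory percent (victoriametrics_memory_allowed_percent, default 60)
vm_flags() {
  local retention="${1:-12}" mem_percent="${2:-60}"
  printf '%s\n' \
    "-httpListenAddr=127.0.0.1:8428" \
    "-storageDataPath=/var/lib/victoriametrics" \
    "-retentionPeriod=${retention}" \
    "-memory.allowedPercent=${mem_percent}"
}
```

Calling `vm_flags 6 40` would print the flags for six months of retention and a 40% memory allowance, one per line.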
### Adding Scrape Targets

Edit `roles/metrics/templates/scrape.yml.j2`:

```yaml
scrape_configs:
  # Add custom application metrics
  - job_name: 'myapp'
    static_configs:
      - targets: ['127.0.0.1:8080']
        labels:
          service: 'myapp'
```

## Operations

### Service Management

```bash
# VictoriaMetrics
systemctl status victoriametrics
systemctl restart victoriametrics
journalctl -u victoriametrics -f

# Grafana
systemctl status grafana
systemctl restart grafana
journalctl -u grafana -f

# node_exporter
systemctl status node_exporter
systemctl restart node_exporter
journalctl -u node_exporter -f
```

### Data Locations

```
/var/lib/victoriametrics/    # Time-series data
/var/lib/grafana/            # Grafana database and dashboards
/var/log/grafana/            # Grafana logs
/etc/victoriametrics/        # VictoriaMetrics config
/etc/grafana/                # Grafana config
```

### Backup

VictoriaMetrics data is stored in `/var/lib/victoriametrics`:

```bash
# Stop service
systemctl stop victoriametrics

# Backup data
tar -czf victoriametrics-backup-$(date +%Y%m%d).tar.gz /var/lib/victoriametrics

# Start service
systemctl start victoriametrics
```

Grafana dashboards are stored in a SQLite database at `/var/lib/grafana/grafana.db`:

```bash
# Backup Grafana
systemctl stop grafana
tar -czf grafana-backup-$(date +%Y%m%d).tar.gz /var/lib/grafana /etc/grafana
systemctl start grafana
```

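The two backup procedures above share the same archive step, which could be factored into a helper. The following is a minimal sketch under stated assumptions: `backup_dir` is a hypothetical function, the destination directory is chosen by the caller, and stopping and starting the service is left to the caller as in the commands above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: archive one directory into a dated tar.gz file.
#   $1 = directory to back up (e.g. /var/lib/victoriametrics)
#   $2 = existing destination directory for the archive
#   $3 = archive name prefix (e.g. victoriametrics)
# Prints the archive path on success.
backup_dir() {
  local src="$1" dest_dir="$2" name="$3"
  local archive
  archive="${dest_dir}/${name}-backup-$(date +%Y%m%d).tar.gz"
  # -C stores paths relative to the parent directory instead of absolute paths.
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" \
    && printf '%s\n' "$archive"
}
```

A possible use is to run `systemctl stop victoriametrics`, then `backup_dir /var/lib/victoriametrics /root/backups victoriametrics`, then `systemctl start victoriametrics`, where `/root/backups` is an example destination. Listing the result with `tar -tzf` is a cheap check that the archive is readable.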
## Security

### Authentication
- Grafana protected by Authentik OAuth/OIDC
- Local admin account available for emergency access
- All services bind to localhost only

### Network Security
- VictoriaMetrics: `127.0.0.1:8428` (no external access)
- Grafana: `127.0.0.1:3001` (proxied via Caddy with HTTPS)
- node_exporter: `127.0.0.1:9100` (no external access)

### systemd Hardening
All services run with security restrictions:
- `NoNewPrivileges=true`
- `ProtectSystem=strict`
- `ProtectHome=true`
- `PrivateTmp=true`
- Read-only filesystem (except data directories)

## Troubleshooting

### Grafana OAuth Not Working

1. Check Authentik provider configuration:
   ```bash
   # Verify redirect URI matches
   # https://metrics.jnss.me/login/generic_oauth
   ```

2. Check Grafana logs:
   ```bash
   journalctl -u grafana -f
   ```

3. Verify OAuth credentials in vault match Authentik

### No Metrics in Grafana

1. Check VictoriaMetrics scrape targets:
   ```bash
   curl http://127.0.0.1:8428/api/v1/targets
   ```

2. Check node_exporter is running:
   ```bash
   systemctl status node_exporter
   curl http://127.0.0.1:9100/metrics
   ```

3. Check VictoriaMetrics logs:
   ```bash
   journalctl -u victoriametrics -f
   ```

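The targets response from step 1 is JSON and can be hard to scan by eye. As a rough sketch, the helper below counts targets by health state using only `grep`. It assumes the Prometheus-style response shape, where each active target carries a `"health":"up"` or `"health":"down"` field, and assumes compact JSON without spaces around the colon; a JSON-aware tool is more robust where one is installed. `count_targets` is a hypothetical name and not part of the role.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count scrape targets in a given health state.
# Reads the /api/v1/targets JSON response on stdin.
#   $1 = health state to count ("up" or "down")
count_targets() {
  # grep -o prints one line per match; "|| true" keeps a zero count from
  # being treated as a failure.
  { grep -o "\"health\":\"$1\"" || true; } | wc -l | tr -d ' '
}
```

For example, `curl -s http://127.0.0.1:8428/api/v1/targets | count_targets down` printing anything other than `0` points at a target that VictoriaMetrics cannot scrape.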
### High Memory Usage

VictoriaMetrics is configured to use at most 60% of available memory. Lower the limit if the host is short on memory:

```yaml
# In roles/metrics/defaults/main.yml
victoriametrics_memory_allowed_percent: "40"  # Reduce to 40%
```

## See Also

- [VictoriaMetrics Documentation](https://docs.victoriametrics.com/)
- [Grafana Documentation](https://grafana.com/docs/)
- [node_exporter GitHub](https://github.com/prometheus/node_exporter)
- [PromQL Documentation](https://prometheus.io/docs/prometheus/latest/querying/basics/)
- [Authentik OAuth Integration](https://goauthentik.io/docs/providers/oauth2/)