rick-infra/docs/architecture-decisions.md
Joakim ecbeb07ba2 Migrate sigvild-gallery to production environment
- Add multi-environment architecture (homelab + production)
- Create production environment (mini-vps) for client projects
- Create homelab playbook for arch-vps services
- Create production playbook for mini-vps services
- Move sigvild-gallery from homelab to production
- Restructure variables: group_vars/production + host_vars/arch-vps
- Add backup-sigvild.yml playbook with auto-restore functionality
- Fix restore logic to check for data before creating directories
- Add manual variable loading workaround for Ansible 2.20
- Update all documentation for multi-environment setup
- Add ADR-007 documenting multi-environment architecture decision
2025-12-15 16:33:33 +01:00


# Architecture Decision Records (ADR)
This document records the significant architectural decisions made in the rick-infra project.
---
## Unix Socket IPC Architecture
### Context
Containerized applications need to communicate with database and cache services. Communication methods include:
1. **Network TCP/IP**: Standard network protocols
2. **Unix Domain Sockets**: Filesystem-based IPC
### Decision
We will use **Unix domain sockets** for all communication between applications and infrastructure services.
### Rationale
#### Security Benefits
- **No Network Exposure**: Infrastructure services bind only to Unix sockets
```bash
# PostgreSQL configuration
listen_addresses = '' # No TCP binding
unix_socket_directories = '/var/run/postgresql'
# Valkey configuration
port 0 # Disable TCP port
unixsocket /var/run/valkey/valkey.sock
```
- **Filesystem Permissions**: Access controlled by Unix file permissions
```bash
srwxrwx--- 1 postgres postgres 0 /var/run/postgresql/.s.PGSQL.5432
srwxrwx--- 1 valkey valkey 0 /var/run/valkey/valkey.sock
```
- **Group-Based Access**: Simple group membership controls access
```bash
# Add application user to infrastructure groups
usermod -a -G postgres,valkey authentik
```
- **No Network Scanning**: Services invisible to network reconnaissance
#### Performance Advantages
- **Lower Latency**: Unix sockets have ~20% lower latency than TCP loopback
- **Higher Throughput**: Up to 40% higher throughput for local communication
- **Reduced CPU Overhead**: No network stack processing required
- **Efficient Data Transfer**: Direct kernel-level data copying
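The exact gains vary by workload and kernel; a quick stdlib micro-benchmark (an illustrative sketch, not a rigorous measurement) shows the kind of comparison behind these figures:

```python
"""Rough round-trip latency comparison: Unix socket vs TCP loopback."""
import os
import socket
import tempfile
import threading
import time

def _echo(server_sock):
    # Accept one connection and echo bytes until the client disconnects
    conn, _ = server_sock.accept()
    with conn:
        while True:
            data = conn.recv(64)
            if not data:
                break
            conn.sendall(data)

def round_trip_us(family, address, rounds=1000):
    server = socket.socket(family, socket.SOCK_STREAM)
    server.bind(address)
    server.listen(1)
    threading.Thread(target=_echo, args=(server,), daemon=True).start()
    client = socket.socket(family, socket.SOCK_STREAM)
    client.connect(server.getsockname())
    start = time.perf_counter()
    for _ in range(rounds):
        client.sendall(b"ping")
        client.recv(64)
    elapsed = time.perf_counter() - start
    client.close()
    server.close()
    return elapsed / rounds * 1e6  # microseconds per round trip

sock_path = os.path.join(tempfile.mkdtemp(), "bench.sock")
unix_us = round_trip_us(socket.AF_UNIX, sock_path)
tcp_us = round_trip_us(socket.AF_INET, ("127.0.0.1", 0))
print(f"unix: {unix_us:.1f} us/rt, tcp loopback: {tcp_us:.1f} us/rt")
```

Results depend heavily on message size and system load; the comparison is only meaningful when both transports are measured on the same host.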
#### Operational Benefits
- **Connection Reliability**: No ephemeral port exhaustion, TCP resets, or firewall interference for local connections
- **Resource Monitoring**: Standard filesystem monitoring applies
- **Backup Friendly**: No network configuration to backup/restore
- **Debugging**: Standard filesystem tools for troubleshooting
### Implementation Strategy
#### Container Socket Access
```yaml
# Container configuration (Quadlet)
[Container]
# Mount socket directories with proper labels
Volume=/var/run/postgresql:/var/run/postgresql:Z
Volume=/var/run/valkey:/var/run/valkey:Z
# Preserve user namespace and groups
PodmanArgs=--userns=host
Annotation=run.oci.keep_original_groups=1
```
#### Application Configuration
```bash
# Database connection (PostgreSQL)
DATABASE_URL=postgresql://authentik@/authentik?host=/var/run/postgresql
# Cache connection (Redis/Valkey)
CACHE_URL=unix:///var/run/valkey/valkey.sock?db=1&password=secret
```
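The `host=/var/run/postgresql` query parameter points libpq at the socket directory instead of a TCP host; a quick stdlib check of how such a URL decomposes:

```python
from urllib.parse import parse_qs, urlsplit

# Same connection string as the DATABASE_URL example above
url = "postgresql://authentik@/authentik?host=/var/run/postgresql"
parts = urlsplit(url)
user = parts.username                          # role name
dbname = parts.path.lstrip("/")                # database name
socket_dir = parse_qs(parts.query)["host"][0]  # Unix socket directory
print(user, dbname, socket_dir)
# → authentik authentik /var/run/postgresql
```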
#### User Management
```yaml
# Ansible user setup
- name: Add application user to infrastructure groups
  user:
    name: "{{ app_user }}"
    groups:
      - postgres   # For database access
      - valkey     # For cache access
    append: true
```
### Consequences
#### Positive
- **Security**: Eliminated network attack vectors for databases
- **Performance**: Measurably faster database and cache operations
- **Reliability**: More stable connections than network-based
- **Simplicity**: Simpler configuration than network + authentication
#### Negative
- **Container Complexity**: Requires careful container user/group management
- **Learning Curve**: Less familiar than standard TCP connections
- **Port Forwarding**: Cannot use standard port forwarding for debugging
#### Mitigation Strategies
- **Documentation**: Comprehensive guides for Unix socket configuration
- **Testing**: Automated tests verify socket connectivity
- **Tooling**: Helper scripts for debugging socket connections
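A helper script along these lines (a hypothetical sketch of the kind of tooling mentioned above; the repo's actual scripts may differ) can verify that a socket exists and accepts connections:

```python
#!/usr/bin/env python3
"""Minimal Unix socket connectivity checker."""
import os
import socket
import stat

def check_socket(path):
    if not os.path.exists(path):
        return f"{path}: missing"
    if not stat.S_ISSOCK(os.stat(path).st_mode):
        return f"{path}: not a socket"
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.connect(path)
    except OSError as exc:
        return f"{path}: connect failed ({exc})"
    finally:
        s.close()
    return f"{path}: ok"

if __name__ == "__main__":
    for p in ("/var/run/postgresql/.s.PGSQL.5432",
              "/var/run/valkey/valkey.sock"):
        print(check_socket(p))
```

Run as the application user (`sudo -u authentik …`) to also exercise the group-based permission checks.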
### Technical Implementation
```bash
# Test socket connectivity
sudo -u authentik psql -h /var/run/postgresql -U authentik -d authentik
sudo -u authentik redis-cli -s /var/run/valkey/valkey.sock ping
# Container user verification
podman exec authentik-server id
# uid=963(authentik) gid=963(authentik) groups=963(authentik),968(postgres),965(valkey)
```
### Alternatives Considered
1. **TCP with Authentication**: Rejected due to network exposure
2. **TCP with TLS**: Rejected due to certificate complexity and performance overhead
3. **Shared Memory**: Rejected due to implementation complexity
---
## ADR-003: Podman + systemd Container Orchestration
**Technical Story**: Container orchestration solution for secure application deployment with systemd integration.
### Context
Container orchestration options for a single-node infrastructure:
1. **Docker + Docker Compose**: Traditional container orchestration
2. **Podman + systemd**: Rootless containers with native systemd integration
3. **Kubernetes**: Full orchestration platform (overkill for single node)
4. **Nomad**: HashiCorp orchestration solution
### Decision
We will use **Podman with systemd integration (Quadlet)** for container orchestration, deployed as system-level services (rootful containers running as dedicated users).
### Rationale
#### Security Advantages
- **No Daemon Required**: No privileged daemon attack surface
```bash
# Docker: Requires root daemon
sudo systemctl status docker
# Podman: Daemonless operation
podman ps # No daemon needed
```
- **Dedicated Service Users**: Containers run as dedicated system users (not root)
- **Group-Based Access Control**: Unix group membership controls infrastructure access
- **SELinux Integration**: Better SELinux support than Docker
#### systemd Integration Benefits
- **Native Service Management**: Containers as system-level systemd services
```ini
# Quadlet file: /etc/containers/systemd/authentik.pod
[Unit]
Description=Authentik Authentication Pod
[Pod]
PublishPort=0.0.0.0:9000:9000
ShmSize=256m
[Service]
Restart=always
TimeoutStartSec=900
[Install]
WantedBy=multi-user.target
```
- **Dependency Management**: systemd handles service dependencies
- **Resource Control**: systemd resource limits and monitoring
- **Logging Integration**: journald for centralized logging
#### Operational Excellence
- **Familiar Tooling**: Standard systemd commands
```bash
systemctl status authentik-pod
systemctl restart authentik-server
journalctl -u authentik-server -f
```
- **Boot Integration**: Services start automatically at system boot
- **Resource Monitoring**: systemd resource tracking
- **Configuration Management**: Declarative Quadlet files
#### Performance Benefits
- **Lower Overhead**: No daemon overhead for container management
- **Direct Kernel Access**: Better performance than daemon-based solutions
- **Resource Efficiency**: More efficient resource utilization
### Implementation Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ systemd System Services (/system.slice/) │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌───────────────┐ │
│ │ authentik-pod │ │ authentik-server│ │authentik-worker│ │
│ │ .service │ │ .service │ │ .service │ │
│ └─────────────────┘ └─────────────────┘ └───────────────┘ │
│ │ │ │ │
│ └────────────────────┼────────────────────┘ │
│ │ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Podman Pod (rootful, dedicated user) │ │
│ │ │ │
│ │ ┌─────────────────┐ ┌─────────────────────────────────┐ │ │
│ │ │ Server Container│ │ Worker Container │ │ │
│ │ │ User: 966:966 │ │ User: 966:966 │ │ │
│ │ │ Groups: 961,962 │ │ Groups: 961,962 │ │ │
│ │ │ (valkey,postgres)│ │ (valkey,postgres) │ │ │
│ │ └─────────────────┘ └─────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│ Group-based access to infrastructure
┌─────────────────────────────────────────────────────────────┐
│ Infrastructure Services │
│ PostgreSQL: /var/run/postgresql (postgres:postgres-clients)│
│ Valkey: /var/run/valkey (valkey:valkey-clients) │
└─────────────────────────────────────────────────────────────┘
```
#### Quadlet Configuration
```ini
# Pod configuration (authentik.pod)
[Unit]
Description=Authentik Authentication Pod
[Pod]
PublishPort=127.0.0.1:9000:9000
ShmSize=256m
[Service]
Restart=always
[Install]
WantedBy=default.target
```
```ini
# Container configuration (authentik-server.container)
[Unit]
Description=Authentik Server Container
After=authentik-pod.service
Requires=authentik-pod.service
[Container]
ContainerName=authentik-server
Image=ghcr.io/goauthentik/server:2025.10
Pod=authentik.pod
EnvironmentFile=/opt/authentik/.env
User=966:966
PodmanArgs=--group-add 962 --group-add 961
# Volume mounts for sockets and data
Volume=/opt/authentik/media:/media
Volume=/opt/authentik/data:/data
Volume=/var/run/postgresql:/var/run/postgresql:Z
Volume=/var/run/valkey:/var/run/valkey:Z
[Service]
Restart=always
TimeoutStartSec=300
[Install]
WantedBy=multi-user.target
```
### User Management Strategy
```yaml
# Ansible implementation
- name: Create service user
  user:
    name: authentik
    group: authentik
    groups: [postgres-clients, valkey-clients]
    system: true
    shell: /bin/bash
    home: /opt/authentik
    create_home: true
    append: true
```
**Note**: Infrastructure roles (PostgreSQL, Valkey) export client group GIDs as Ansible facts (`postgresql_client_group_gid`, `valkey_client_group_gid`) which are consumed by application container templates for dynamic `--group-add` arguments.
### Consequences
#### Positive
- **Security**: Eliminated privileged daemon attack surface
- **Integration**: Seamless systemd integration for management
- **Performance**: Lower overhead than daemon-based solutions
- **Reliability**: systemd's proven service management
- **Monitoring**: Standard systemd monitoring and logging
#### Negative
- **Learning Curve**: Different from Docker Compose workflows
- **Tooling**: Ecosystem less mature than Docker
- **Documentation**: Fewer online resources and examples
#### Mitigation Strategies
- **Documentation**: Comprehensive internal documentation
- **Training**: Team training on Podman/systemd workflows
- **Tooling**: Helper scripts for common operations
### Technical Implementation
```bash
# Container management (system scope)
systemctl status authentik-pod
systemctl restart authentik-server
podman ps
podman logs authentik-server
# Resource monitoring
systemctl show authentik-server --property=MemoryCurrent
journalctl -u authentik-server -f
# Verify container groups
pgrep -f authentik-server | head -1 | \
  xargs -I {} grep Groups /proc/{}/status
# Output: Groups: 961 962 966
```
### Alternatives Considered
1. **Docker + Docker Compose**: Rejected due to security concerns (privileged daemon)
2. **Kubernetes**: Rejected as overkill for single-node deployment
3. **Nomad**: Rejected to maintain consistency with systemd ecosystem
---
## OAuth/OIDC and Forward Authentication Security Model
**Technical Story**: Centralized authentication and authorization for multiple services using industry-standard OAuth2/OIDC protocols where supported, with forward authentication as a fallback.
### Context
Authentication strategies for multiple services:
1. **Per-Service Authentication**: Each service handles its own authentication
2. **Shared Database**: Services share authentication database
3. **OAuth2/OIDC Integration**: Services implement standard OAuth2/OIDC clients
4. **Forward Authentication**: Reverse proxy handles authentication for services without OAuth support
### Decision
We will use **OAuth2/OIDC integration** as the primary authentication method for services that support it, and **forward authentication** for services that do not support native OAuth2/OIDC integration.
### Rationale
#### OAuth/OIDC as Primary Method
**Security Benefits**:
- **Standard Protocol**: Industry-standard authentication flow (RFC 6749, RFC 7636)
- **Token-Based Security**: Secure JWT tokens with cryptographic signatures
- **Proper Session Management**: Native application session handling with refresh tokens
- **Scope-Based Authorization**: Fine-grained permission control via OAuth scopes
- **PKCE Support**: Protection against authorization code interception attacks
**Integration Benefits**:
- **Native Support**: Applications designed for OAuth/OIDC work seamlessly
- **Better UX**: Proper redirect flows, logout handling, and token refresh
- **API Access**: OAuth tokens enable secure API integrations
- **Standard Claims**: OpenID Connect user info endpoint provides standardized user data
- **Multi-Application SSO**: Proper single sign-on with token sharing
**Examples**: Nextcloud, Gitea, Grafana, many modern applications
#### Forward Auth as Fallback
**Use Cases**:
- Services without OAuth/OIDC support
- Legacy applications that cannot be modified
- Static sites requiring authentication
- Simple internal tools
**Security Benefits**:
- **Zero Application Changes**: Protect existing services without modification
- **Header-Based Identity**: Simple identity propagation to backend
- **Transparent Protection**: Services receive pre-authenticated requests
**Limitations**:
- **Non-Standard**: Not using industry-standard authentication protocols
- **Proxy Dependency**: All requests must flow through authenticating proxy
- **Limited Logout**: Complex logout scenarios across services
- **Header Trust**: Backend must trust proxy-provided headers
#### Shared Benefits (Both Methods)
- **Single Point of Control**: Centralized authentication policy via Authentik
- **Consistent Security**: Same authentication provider across all services
- **Multi-Factor Authentication**: MFA applied consistently via Authentik
- **Audit Trail**: Centralized authentication logging
- **User Management**: One system for all user administration
### Implementation Architecture
#### OAuth/OIDC Flow (Primary Method)
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ User │ │ Service │ │ Authentik │
│ │ │ (OAuth App) │ │ (IdP) │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │
│ Access Service │ │
│─────────────────▶│ │
│ │ │
│ │ No session │
│ 302 → OAuth │ │
│◀─────────────────│ │
│ │ │
│ GET /authorize?client_id=...&redirect_uri=...
│──────────────────────────────────────▶│
│ │ │
│ Login form (if not authenticated) │
│◀────────────────────────────────────│
│ │ │
│ Credentials │ │
│─────────────────────────────────────▶│
│ │ │
│ 302 → callback?code=AUTH_CODE │
│◀────────────────────────────────────│
│ │ │
│ GET /callback?code=AUTH_CODE │
│─────────────────▶│ │
│ │ │
│ │ POST /token │
│ │ code=AUTH_CODE │
│ │─────────────────▶│
│ │ │
│ │ access_token │
│ │ id_token (JWT) │
│ │◀─────────────────│
│ │ │
│ Set-Cookie │ GET /userinfo │
│ 302 → /dashboard │─────────────────▶│
│◀─────────────────│ │
│ │ User claims │
│ │◀─────────────────│
│ │ │
│ GET /dashboard │ │
│─────────────────▶│ │
│ │ │
│ Dashboard │ │
│◀─────────────────│ │
```
#### Forward Auth Flow (Fallback Method)
```
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ User │ │ Caddy │ │ Authentik │ │ Service │
│ │ │ (Proxy) │ │ (Forward) │ │ (Backend) │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │ │
│ GET / │ │ │
│─────────────────▶│ │ │
│ │ │ │
│ │ Forward Auth │ │
│ │─────────────────▶│ │
│ │ │ │
│ │ 401 Unauthorized │ │
│ │◀─────────────────│ │
│ │ │ │
│ 302 → /auth │ │ │
│◀─────────────────│ │ │
│ │ │ │
│ Login form │ │ │
│──────────────────────────────────────▶│ │
│ │ │ │
│ Credentials │ │ │
│──────────────────────────────────────▶│ │
│ │ │ │
│ Set-Cookie │ │ │
│◀──────────────────────────────────────│ │
│ │ │ │
│ GET / │ │ │
│─────────────────▶│ │ │
│ │ │ │
│ │ Forward Auth │ │
│ │─────────────────▶│ │
│ │ │ │
│ │ 200 + Headers │ │
│ │◀─────────────────│ │
│ │ │ │
│ │ Proxy + Headers │ │
│ │─────────────────────────────────────▶│
│ │ │ │
│ │ Response │ │
│ │◀─────────────────────────────────────│
│ │ │ │
│ Content │ │ │
│◀─────────────────│ │ │
```
### OAuth/OIDC Configuration Examples
#### Nextcloud OAuth Configuration
```php
// Nextcloud config.php
'oidc_login_provider_url' => 'https://auth.jnss.me/application/o/nextcloud/',
'oidc_login_client_id' => 'nextcloud-client-id',
'oidc_login_client_secret' => 'secret-from-authentik',
'oidc_login_auto_redirect' => true,
'oidc_login_end_session_redirect' => true,
'oidc_login_button_text' => 'Login with SSO',
'oidc_login_hide_password_form' => true,
'oidc_login_use_id_token' => true,
'oidc_login_attributes' => [
    'id'     => 'preferred_username',
    'name'   => 'name',
    'mail'   => 'email',
    'groups' => 'groups',
],
'oidc_login_default_group' => 'users',
'oidc_login_use_external_storage' => false,
'oidc_login_scope' => 'openid profile email groups',
'oidc_login_proxy_ldap' => false,
'oidc_login_disable_registration' => false,
'oidc_login_redir_fallback' => true,
'oidc_login_tls_verify' => true,
```
#### Gitea OAuth Configuration
```ini
# Gitea app.ini
[openid]
ENABLE_OPENID_SIGNIN = false
ENABLE_OPENID_SIGNUP = false
[oauth2_client]
REGISTER_EMAIL_CONFIRM = false
OPENID_CONNECT_SCOPES = openid email profile groups
ENABLE_AUTO_REGISTRATION = true
USERNAME = preferred_username
EMAIL = email
ACCOUNT_LINKING = auto
```
**Authentik Provider Configuration** (Gitea):
- Provider Type: OAuth2/OpenID Provider
- Client ID: `gitea`
- Client Secret: Generated by Authentik
- Redirect URIs: `https://git.jnss.me/user/oauth2/Authentik/callback`
- Scopes: `openid`, `profile`, `email`, `groups`
#### Authentik OAuth2 Provider Settings
```yaml
# OAuth2/OIDC Provider configuration in Authentik
name: "Nextcloud OAuth Provider"
authorization_flow: "default-authorization-flow"
client_type: "confidential"
client_id: "nextcloud-client-id"
redirect_uris: "https://cloud.jnss.me/apps/oidc_login/oidc"
signing_key: "authentik-default-key"
property_mappings:
- "authentik default OAuth Mapping: OpenID 'openid'"
- "authentik default OAuth Mapping: OpenID 'email'"
- "authentik default OAuth Mapping: OpenID 'profile'"
- "Custom: Groups" # Maps user groups to 'groups' claim
```
### Forward Auth Configuration Examples
#### Caddy Configuration for Forward Auth
```caddyfile
# whoami service with forward authentication
whoami.jnss.me {
# Forward authentication to Authentik
forward_auth https://auth.jnss.me {
uri /outpost.goauthentik.io/auth/caddy
copy_headers Remote-User Remote-Name Remote-Email Remote-Groups
}
# Backend service (receives authenticated requests)
reverse_proxy localhost:8080
}
```
#### Authentik Proxy Provider Configuration
```yaml
# Authentik Proxy Provider for forward auth
name: "Whoami Forward Auth"
type: "proxy"
authorization_flow: "default-authorization-flow"
external_host: "https://whoami.jnss.me"
internal_host: "http://localhost:8080"
skip_path_regex: "^/(health|metrics).*"
mode: "forward_single" # Single application mode
```
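The `skip_path_regex` above is easy to sanity-check before deploying; a short Python sketch using the same pattern:

```python
import re

# Same pattern as skip_path_regex in the provider configuration above
skip = re.compile(r"^/(health|metrics).*")

assert skip.match("/health") is not None          # bypasses auth
assert skip.match("/metrics/requests") is not None  # bypasses auth
assert skip.match("/admin") is None               # still behind forward auth
```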
#### Service Integration (Forward Auth)
Services receive authentication information via HTTP headers:
```python
# Example service code (Python Flask)
from flask import Flask, request, render_template

app = Flask(__name__)

@app.route('/')
def index():
    username = request.headers.get('Remote-User')
    name = request.headers.get('Remote-Name')
    email = request.headers.get('Remote-Email')
    groups = request.headers.get('Remote-Groups', '').split(',')
    return render_template('index.html',
                           username=username,
                           name=name,
                           email=email,
                           groups=groups)
```
### Authorization Policies
Both OAuth and Forward Auth support Authentik authorization policies:
```yaml
# Example authorization policy in Authentik
policy_bindings:
  - policy: "group_admins_only"
    target: "nextcloud_oauth_provider"
    order: 0
  - policy: "require_mfa"
    target: "gitea_oauth_provider"
    order: 1
  - policy: "internal_network_only"
    target: "whoami_proxy_provider"
    order: 0
```
### Decision Matrix: OAuth/OIDC vs Forward Auth
| Criteria | OAuth/OIDC | Forward Auth |
|----------|-----------|-------------|
| **Application Support** | Requires native OAuth/OIDC support | Any application |
| **Protocol Standard** | Industry standard (RFC 6749, 7636) | Proprietary/custom |
| **Token Management** | Native refresh tokens, proper expiry | Session-based only |
| **Logout Handling** | Proper logout flow | Complex, proxy-dependent |
| **API Access** | Full API support via tokens | Header-only |
| **Implementation Effort** | Configure OAuth settings | Zero app changes |
| **User Experience** | Standard OAuth redirects | Transparent |
| **Security Model** | Token-based with scopes | Header trust model |
| **When to Use** | **Nextcloud, Gitea, modern apps** | **Static sites, legacy apps, whoami** |
### Consequences
#### Positive
- **Standards Compliance**: OAuth/OIDC uses industry-standard protocols
- **Security**: Multiple authentication options with appropriate security models
- **Flexibility**: Right tool for each service (OAuth when possible, forward auth when needed)
- **Auditability**: Centralized authentication logging via Authentik
- **User Experience**: Proper SSO across all services
- **Token Security**: OAuth provides secure token refresh and scope management
- **Graceful Degradation**: Forward auth available for services without OAuth support
#### Negative
- **Complexity**: Need to understand two authentication methods
- **Configuration Overhead**: OAuth requires per-service configuration
- **Single Point of Failure**: Authentik failure affects all services
- **Learning Curve**: Team must understand OAuth flows and forward auth model
#### Mitigation Strategies
- **Documentation**: Clear decision guide for choosing OAuth vs forward auth
- **Templates**: Reusable OAuth configuration templates for common services
- **High Availability**: Robust deployment and monitoring of Authentik
- **Monitoring**: Comprehensive monitoring of both authentication flows
- **Testing**: Automated tests for authentication flows
### Security Considerations
#### OAuth/OIDC Security
```yaml
# Authentik OAuth2 Provider security settings
authorization_code_validity: 60 # 1 minute
access_code_validity: 3600 # 1 hour
refresh_code_validity: 2592000 # 30 days
include_claims_in_id_token: true
signing_key: "authentik-default-key"
sub_mode: "hashed_user_id"
issuer_mode: "per_provider"
```
**Best Practices**:
- Use PKCE for all OAuth flows (protection against interception)
- Implement proper token rotation (refresh tokens expire and rotate)
- Validate `aud` (audience) and `iss` (issuer) claims in JWT tokens
- Use short-lived access tokens (1 hour)
- Store client secrets securely (Ansible Vault)
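The `aud`/`iss`/`exp` checks can be sketched as follows (a stdlib-only illustration; the issuer URL and client ID are examples, and real code must verify the token signature with a proper JOSE library before trusting any claim):

```python
import base64
import json
import time

def decode_claims(jwt_token):
    # Decode the payload segment only; signature verification is NOT done here
    payload = jwt_token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))

def validate_claims(claims, issuer, audience):
    if claims.get("iss") != issuer:
        raise ValueError("issuer mismatch")
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if audience not in audiences:
        raise ValueError("audience mismatch")
    if claims.get("exp", 0) < time.time():
        raise ValueError("token expired")
```

In practice OAuth client libraries perform these checks automatically; the sketch only makes explicit what "validate `aud` and `iss`" means.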
#### Forward Auth Security
```yaml
# Authentik Proxy Provider security settings
token_validity: 3600 # 1 hour session
cookie_domain: ".jnss.me"
skip_path_regex: "^/(health|metrics|static).*"
```
**Best Practices**:
- Trust only Authentik-provided headers
- Validate `Remote-User` header exists before granting access
- Use HTTPS for all forward auth endpoints
- Implement proper session timeouts
- Strip user-provided authentication headers at proxy
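A framework-agnostic sketch of the `Remote-User` check (header names follow the `copy_headers` list used in the Caddy examples; the function name is illustrative):

```python
def identity_from_headers(headers):
    """Derive identity solely from proxy-set headers; reject if absent."""
    user = headers.get("Remote-User")
    if not user:
        # Request did not pass through the authenticating proxy
        raise PermissionError("missing Remote-User header")
    groups = [g for g in headers.get("Remote-Groups", "").split(",") if g]
    return {"user": user,
            "email": headers.get("Remote-Email"),
            "groups": groups}
```

This only works if the proxy strips client-supplied `Remote-*` headers, so the backend never sees spoofed values.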
#### Access Control
- **Group-Based Authorization**: Users assigned to groups, groups to applications
- **Policy Engine**: Authentik policies for fine-grained access control
- **MFA Requirements**: Multi-factor authentication for sensitive services
- **IP-Based Restrictions**: Geographic or network-based access control
- **Time-Based Access**: Temporary access grants via policies
#### Audit Logging
```json
{
"timestamp": "2025-12-15T10:30:00Z",
"event": "oauth_authorization",
"user": "john.doe",
"application": "nextcloud",
"scopes": ["openid", "email", "profile", "groups"],
"ip": "192.168.1.100",
"user_agent": "Mozilla/5.0..."
}
```
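Records in this JSONL shape are straightforward to filter; a stdlib sketch (field names as in the sample above, function name illustrative):

```python
import json

def events_for_user(log_lines, user):
    """Return all audit records belonging to a given user."""
    return [rec for rec in map(json.loads, log_lines)
            if rec.get("user") == user]
```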
### Implementation Examples by Service Type
#### OAuth/OIDC Services (Primary Method)
**Nextcloud**:
```caddyfile
cloud.jnss.me {
reverse_proxy localhost:8080
}
# OAuth configured within Nextcloud application
```
**Gitea**:
```caddyfile
git.jnss.me {
reverse_proxy localhost:3000
}
# OAuth configured within Gitea application settings
```
#### Forward Auth Services (Fallback Method)
**Whoami (test/demo service)**:
```caddyfile
whoami.jnss.me {
forward_auth https://auth.jnss.me {
uri /outpost.goauthentik.io/auth/caddy
copy_headers Remote-User Remote-Name Remote-Email Remote-Groups
}
reverse_proxy localhost:8080
}
```
**Static Documentation Site**:
```caddyfile
docs.jnss.me {
forward_auth https://auth.jnss.me {
uri /outpost.goauthentik.io/auth/caddy
copy_headers Remote-User Remote-Groups
}
root * /var/www/docs
file_server
}
```
**Internal API (no OAuth support)**:
```caddyfile
api.jnss.me {
forward_auth https://auth.jnss.me {
uri /outpost.goauthentik.io/auth/caddy
copy_headers Remote-User Remote-Email Remote-Groups
}
reverse_proxy localhost:3000
}
```
#### Selective Protection (Public + Protected Paths)
```caddyfile
app.jnss.me {
# Public endpoints (no auth required)
handle /health {
reverse_proxy localhost:8080
}
handle /metrics {
reverse_proxy localhost:8080
}
handle /public/* {
reverse_proxy localhost:8080
}
# Protected endpoints (forward auth)
handle /admin/* {
forward_auth https://auth.jnss.me {
uri /outpost.goauthentik.io/auth/caddy
copy_headers Remote-User Remote-Groups
}
reverse_proxy localhost:8080
}
# Default: protected
handle {
forward_auth https://auth.jnss.me {
uri /outpost.goauthentik.io/auth/caddy
copy_headers Remote-User Remote-Groups
}
reverse_proxy localhost:8080
}
}
```
### Alternatives Considered
1. **OAuth2/OIDC Only**: Rejected because many services don't support OAuth natively
2. **Forward Auth Only**: Rejected because it doesn't leverage native OAuth support in modern apps
3. **Per-Service Authentication**: Rejected due to management overhead and inconsistent security
4. **Shared Database**: Rejected due to tight coupling between services
5. **VPN-Based Access**: Rejected due to operational complexity for web services
6. **SAML**: Rejected in favor of modern OAuth2/OIDC standards
---
## Rootful Containers with Infrastructure Fact Pattern
**Technical Story**: Enable containerized applications to access native infrastructure services (PostgreSQL, Valkey) via Unix sockets with group-based permissions.
### Context
Containerized applications need to access infrastructure services (PostgreSQL, Valkey) through Unix sockets with filesystem-based permission controls. The permission model requires:
1. **Socket directories** owned by service groups (`postgres-clients`, `valkey-clients`)
2. **Application users** added to these groups for access
3. **Container processes** must preserve group membership to access sockets
Two approaches were evaluated:
1. **Rootless containers (user namespace)**: Containers run in user namespace with UID/GID remapping
2. **Rootful containers (system services)**: Containers run as dedicated system users without namespace isolation
### Decision
We will use **rootful containers deployed as system-level systemd services** with an **Infrastructure Fact Pattern** where infrastructure roles export client group GIDs as Ansible facts for application consumption.
### Rationale
#### Why Rootful Succeeds
**Direct UID/GID Mapping**:
```bash
# Host: authentik user UID 966, groups: 966 (authentik), 961 (valkey-clients), 962 (postgres-clients)
# Container User=966:966 with PodmanArgs=--group-add 961 --group-add 962
# Inside container:
id
# uid=966(authentik) gid=966(authentik) groups=966(authentik),961(valkey-clients),962(postgres-clients)
# Socket access works:
ls -l /var/run/postgresql/.s.PGSQL.5432
# srwxrwx--- 1 postgres postgres-clients 0 ... /var/run/postgresql/.s.PGSQL.5432
```
**Group membership preserved**: Container process has GIDs 961 and 962, matching socket group ownership.
#### Why Rootless Failed (Discarded Approach)
**User Namespace UID/GID Remapping**:
```bash
# Host: authentik user UID 100000, subuid range 200000-265535
# Container User=%i:%i with --userns=host --group-add=keep-groups
# User namespace remaps:
# Host UID 100000 → Container UID 100000 (root in namespace)
# Host GID 961 → Container GID 200961 (remapped into subgid range)
# Host GID 962 → Container GID 200962 (remapped into subgid range)
# Socket ownership on host:
# srwxrwx--- 1 postgres postgres-clients (GID 962)
# Container process groups: 200961, 200962 (remapped)
# Socket expects: GID 962 (not remapped)
# Result: Permission denied ❌
```
**Root cause**: User namespace supplementary group remapping breaks group-based socket access even with `--userns=host`, `--group-add=keep-groups`, and `Annotation=run.oci.keep_original_groups=1`.
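A simplified model of the mismatch (illustrative only; real mappings are defined by `/etc/subgid` ranges, and the flat offset used here is an assumption matching the example GIDs above):

```python
# Supplementary GIDs get shifted into the subgid range inside the namespace
SUBGID_START = 200000

def remapped_gid(host_gid):
    return SUBGID_START + host_gid

socket_gid = 962  # postgres-clients group that owns the socket on the host
process_gids = {remapped_gid(961), remapped_gid(962)}  # {200961, 200962}
print(socket_gid in process_gids)  # → False: kernel denies socket access
```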
### Infrastructure Fact Pattern
#### Infrastructure Roles Export GIDs
Infrastructure services create client groups and export their GIDs as Ansible facts:
```yaml
# PostgreSQL role: roles/postgresql/tasks/main.yml
- name: Create PostgreSQL client access group
  group:
    name: postgres-clients
    system: true

- name: Get PostgreSQL client group GID
  shell: "getent group postgres-clients | cut -d: -f3"
  register: postgresql_client_group_lookup
  changed_when: false

- name: Set PostgreSQL client group GID as fact
  set_fact:
    postgresql_client_group_gid: "{{ postgresql_client_group_lookup.stdout }}"
```
```yaml
# Valkey role: roles/valkey/tasks/main.yml
- name: Create Valkey client access group
  group:
    name: valkey-clients
    system: true

- name: Get Valkey client group GID
  shell: "getent group valkey-clients | cut -d: -f3"
  register: valkey_client_group_lookup
  changed_when: false

- name: Set Valkey client group GID as fact
  set_fact:
    valkey_client_group_gid: "{{ valkey_client_group_lookup.stdout }}"
```
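For reference, the `getent | cut` lookup above is equivalent to a single call to Python's `grp` module; a sketch (the group name is only resolvable on a provisioned host):

```python
import grp

def client_group_gid(group_name):
    """Resolve a group name to its numeric GID via the system NSS database."""
    return grp.getgrnam(group_name).gr_gid

# e.g. client_group_gid("postgres-clients") on a host where the role has run
```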
#### Application Roles Consume Facts
Application roles validate and consume infrastructure facts:
```yaml
# Authentik role: roles/authentik/tasks/main.yml
- name: Validate infrastructure facts are available
  assert:
    that:
      - postgresql_client_group_gid is defined
      - valkey_client_group_gid is defined
    fail_msg: |
      Required infrastructure facts are not available.
      Ensure PostgreSQL and Valkey roles have run first.

- name: Create authentik user with infrastructure groups
  user:
    name: authentik
    groups: [postgres-clients, valkey-clients]
    append: true
```
```ini
# Container template: roles/authentik/templates/authentik-server.container
[Container]
User={{ authentik_uid }}:{{ authentik_gid }}
PodmanArgs=--group-add {{ postgresql_client_group_gid }} --group-add {{ valkey_client_group_gid }}
```
### Implementation Details
#### System-Level Deployment
```ini
# Quadlet files deployed to /etc/containers/systemd/ (not ~/.config/)
# Pod: /etc/containers/systemd/authentik.pod
[Unit]
Description=Authentik Authentication Pod
[Pod]
PublishPort=0.0.0.0:9000:9000
ShmSize=256m
[Service]
Restart=always
[Install]
WantedBy=multi-user.target # System target, not default.target
```
```ini
# Container: /etc/containers/systemd/authentik-server.container
[Container]
User=966:966
PodmanArgs=--group-add 962 --group-add 961
Volume=/var/run/postgresql:/var/run/postgresql:Z
Volume=/var/run/valkey:/var/run/valkey:Z
```
#### Service Management
```bash
# System scope (not user scope)
systemctl status authentik-pod
systemctl restart authentik-server
journalctl -u authentik-server -f
# Verify container location
systemctl status authentik-server | grep CGroup
# CGroup: /system.slice/authentik-server.service ✓
```
### Special Case: Valkey Socket Group Fix
Valkey doesn't natively support socket group configuration (unlike PostgreSQL's `unix_socket_group`). A helper service ensures correct socket permissions:
```ini
# /etc/systemd/system/valkey-socket-fix.service
[Unit]
Description=Fix Valkey socket group ownership and permissions
BindsTo=valkey.service
After=valkey.service
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'i=0; while [ ! -S /var/run/valkey/valkey.sock ] && [ $i -lt 100 ]; do sleep 0.1; i=$((i+1)); done'
ExecStart=/bin/chgrp valkey-clients /var/run/valkey/valkey.sock
ExecStart=/bin/chmod 770 /var/run/valkey/valkey.sock
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
```
Triggered by Valkey service:
```ini
# /etc/systemd/system/valkey.service (excerpt)
[Unit]
Wants=valkey-socket-fix.service
```
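The first `ExecStart` line in `valkey-socket-fix.service` is a poll loop: it waits up to 10 seconds (100 × 0.1 s) for the socket file to appear before the `chgrp`/`chmod` lines run. The same pattern, extracted into a standalone function for illustration — the throwaway Python helper exists only so the loop has a real Unix socket to wait for:

```bash
# Wait up to 10 seconds for a Unix socket to appear at $1; the exit
# status reflects whether the socket showed up in time.
wait_for_socket() {
  i=0
  while [ ! -S "$1" ] && [ "$i" -lt 100 ]; do
    sleep 0.1
    i=$((i + 1))
  done
  [ -S "$1" ]
}

# Demo: bind a throwaway Unix socket in the background, then wait for it.
sock="$(mktemp -u)"
python3 -c 'import socket, sys, time
s = socket.socket(socket.AF_UNIX)
s.bind(sys.argv[1])
time.sleep(2)' "$sock" &
wait_for_socket "$sock" && echo "socket ready"   # → socket ready
wait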
### Consequences
#### Positive
- **Socket Access Works**: Group-based permissions function correctly
- **Security**: Containers run as dedicated users (not root), no privileged daemon
- **Portability**: Dynamic GID facts work across different hosts
- **Consistency**: Same pattern for all containerized applications
- **Simplicity**: No user namespace complexity, standard systemd service management
#### Negative
- **Not "Pure" Rootless**: Containers require root for systemd service deployment
- **Different from Docker**: Less familiar pattern than rootless user services
#### Neutral
- **System vs User Scope**: Different commands (`systemctl` vs `systemctl --user`) but equally capable
- **Deployment Location**: `/etc/containers/systemd/` vs `~/.config/` but same Quadlet functionality
### Validation
```bash
# Verify service location
systemctl status authentik-server | grep CGroup
# → /system.slice/authentik-server.service ✓
# Verify process groups
ps aux | grep '[a]uthentik' | head -1 | awk '{print $2}' | \
  xargs -I {} grep Groups /proc/{}/status   # [a]uthentik excludes the grep itself
# → Groups: 961 962 966 ✓
# Verify socket permissions
ls -l /var/run/postgresql/.s.PGSQL.5432
# → srwxrwx--- postgres postgres-clients ✓
ls -l /var/run/valkey/valkey.sock
# → srwxrwx--- valkey valkey-clients ✓
# Verify HTTP endpoint
curl -I http://127.0.0.1:9000/
# → HTTP/1.1 302 Found ✓
```
### Alternatives Considered
1. **Rootless with user namespace** - Discarded due to GID remapping breaking group-based socket access
2. **TCP-only connections** - Rejected to maintain Unix socket security and performance benefits
3. **Hardcoded GIDs** - Rejected for portability; facts provide dynamic resolution
4. **Directory permissions (777)** - Rejected initially for security; group-based access is more restrictive. This was later reverted to 777 after Nextcloud switched from running as root to www-data, which broke the group-based permissions.
---
## ADR-007: Multi-Environment Infrastructure Architecture
**Date**: December 2025
**Status**: Accepted
**Context**: Separation of homelab services from production client projects
### Decision
Rick-infra will manage two separate environments with different purposes and uptime requirements:
1. **Homelab Environment** (arch-vps)
- Purpose: Personal services and experimentation
- Infrastructure: Full stack (PostgreSQL, Valkey, Podman, Caddy)
- Services: Authentik, Nextcloud, Gitea
- Uptime requirement: Best effort
2. **Production Environment** (mini-vps)
- Purpose: Client projects requiring high uptime
- Infrastructure: Minimal (Caddy only)
- Services: Sigvild Gallery
- Uptime requirement: High availability
### Rationale
**Separation of Concerns**:
- Personal experiments don't affect client services
- Client services isolated from homelab maintenance
- Clear distinction between environments in code
**Infrastructure Optimization**:
- Production runs minimal services (no PostgreSQL/Valkey overhead)
- Homelab can be rebooted/upgraded without affecting clients
- Cost optimization: smaller VPS for production
**Operational Flexibility**:
- Different backup strategies per environment
- Different monitoring/alerting levels
- Independent deployment schedules
### Implementation
**Variable Organization**:
```
rick-infra/
├── group_vars/
│   └── production/          # Production environment config
│       ├── main.yml
│       └── vault.yml
├── host_vars/
│   └── arch-vps/            # Homelab host config
│       ├── main.yml
│       └── vault.yml
└── playbooks/
    ├── homelab.yml          # Homelab deployment
    ├── production.yml       # Production deployment
    └── site.yml             # Orchestrates both
```
**Playbook Structure**:
- `site.yml` imports both homelab.yml and production.yml
- Each playbook manually loads variables (Ansible 2.20 workaround)
- Services deploy only to their designated environment
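A condensed sketch of how these pieces could fit together — the file names match the repository layout above, but the task bodies are illustrative rather than the repo's exact contents:

```yaml
# site.yml — orchestrates both environments
- import_playbook: homelab.yml
- import_playbook: production.yml

# homelab.yml (excerpt) — variables loaded manually as an Ansible 2.20 workaround
- name: Deploy homelab services
  hosts: homelab
  pre_tasks:
    - name: Load host variables explicitly
      ansible.builtin.include_vars:
        file: "{{ playbook_dir }}/../host_vars/arch-vps/main.yml"
```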
**Inventory Groups**:
```yaml
homelab:
  hosts:
    arch-vps:
      ansible_host: 69.62.119.31
production:
  hosts:
    mini-vps:
      ansible_host: 72.62.91.251
```
### Migration Example
**Sigvild Gallery Migration** (December 2025):
- **From**: arch-vps (homelab)
- **To**: mini-vps (production)
- **Reason**: Client project requiring higher uptime
- **Process**:
1. Created backup on arch-vps
2. Deployed to mini-vps with automatic restore
3. Updated DNS (5 min downtime)
4. Removed from arch-vps configuration
### Consequences
**Positive**:
- Clear separation of personal vs. client services
- Reduced blast radius for experiments
- Optimized resource usage per environment
- Independent scaling and management
**Negative**:
- Increased complexity in playbook organization
- Need to manage multiple VPS instances
- Ansible 2.20 variable loading requires workarounds
- Duplicate infrastructure code (Caddy on both)
**Neutral**:
- Services can be migrated between environments with minimal friction
- Backup/restore procedures work across environments
- Hybrid variable layout: group_vars for the production environment, host_vars for the homelab host
### Future Considerations
- Consider grouping multiple client projects on production VPS
- Evaluate if homelab needs full infrastructure stack
- Monitor for opportunities to share infrastructure between environments
- Document migration procedures for moving services between environments
---