rick-infra/docs/socket-permissions-architecture.md
Joakim 4f8da38ca6 Add Nextcloud cloud storage role with split Redis caching strategy
## New Features

- **Nextcloud Role**: Complete cloud storage deployment using Podman Quadlet
  - FPM variant with Caddy reverse proxy and FastCGI
  - PostgreSQL database via Unix socket
  - Valkey/Redis for app-level caching and file locking
  - Automatic HTTPS with Let's Encrypt via Caddy
  - Dual-root pattern: Caddy serves static assets, FPM handles PHP

- **Split Caching Strategy**: Redis caching WITHOUT Redis sessions
  - Custom redis.config.php template for app-level caching only
  - File-based PHP sessions for stability (avoids session lock issues)
  - Prevents cascading failures from session lock contention
  - Documented in role README with detailed rationale

## Infrastructure Updates

- **Socket Permissions**: Update PostgreSQL and Valkey to mode 777
  - Required for containers that switch users (root → www-data)
  - Nextcloud container loses supplementary groups on user switch
  - Security maintained via password authentication (scram-sha-256, requirepass)
  - Documented socket permission architecture in docs/

- **PostgreSQL**: Export client group GID as fact for dependent roles
- **Valkey**: Export client group GID as fact, update socket fix service

## Documentation

- New: docs/socket-permissions-architecture.md
  - Explains 777 vs 770 socket permission trade-offs
  - Documents why group-based access doesn't work for user-switching containers
  - Provides TCP alternative for stricter security requirements

- Updated: All role READMEs with socket permission notes
- New: Nextcloud README with comprehensive deployment, troubleshooting, and Redis architecture documentation

## Configuration

- host_vars: Add Nextcloud vault variables and configuration
- site.yml: Include Nextcloud role in main playbook

## Technical Details

**Why disable Redis sessions?**

The official Nextcloud container enables Redis session handling via REDIS_HOST env var,
which causes severe performance issues:

1. Session lock contention under high concurrency (browser parallel asset requests)
2. Infinite lock retries (default lock_retries=-1) blocking FPM workers
3. Timeout orphaning: reverse proxy kills connection, worker keeps lock
4. Worker pool exhaustion: all 5 default workers blocked on same session lock
5. Cascading failure: new requests queue, more timeouts, more orphaned locks

Solution: Use file-based sessions (reliable, fast for single-server) while keeping
Redis for distributed cache and transactional file locking via custom config file.

This provides optimal performance without the complexity of Redis session debugging.

Tested: Fresh deployment on arch-vps (69.62.119.31)
Domain: https://cloud.jnss.me/
2025-12-14 22:07:08 +01:00


Socket Permissions Architecture Decision

Context

Rick-infra uses Unix domain sockets for PostgreSQL and Valkey (Redis) connections to maximize performance and security. Applications run in Podman containers and need to access these infrastructure services via sockets.

Problem

Different container images have different user models:

  1. Authentik: Runs as a specific user (UID 966) from start to finish
  2. Nextcloud: Starts as root, runs entrypoint scripts, then switches to www-data (UID 33)

When using --group-add with Podman:

  • Supplementary groups are added to the initial user the container runs as
  • Groups are NOT inherited when a container switches users internally
  • Nextcloud's www-data process ends up without socket access
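The effect can be modelled with a simplified version of the filesystem permission check. This is only a sketch: the socket owner UID and the client group GID below are hypothetical values, and directory permissions and user-namespace mapping are ignored. On Linux, connecting to a pathname socket requires write permission on the socket file.

```python
def can_connect(mode: int, owner_uid: int, owner_gid: int,
                uid: int, gids: set[int]) -> bool:
    """Return True if a process may connect to a Unix socket file.

    Simplified model: only the write bit of the matching permission
    class is checked; parent directories are not considered.
    """
    if uid == 0:
        return True  # root bypasses discretionary checks (simplification)
    if uid == owner_uid:
        return bool(mode & 0o200)  # owner write bit
    if owner_gid in gids:
        return bool(mode & 0o020)  # group write bit
    return bool(mode & 0o002)      # other write bit


SOCKET_OWNER_UID = 1000  # hypothetical UID of the service owning the socket
CLIENT_GID = 2000        # hypothetical GID of the socket's client group
WWW_DATA = 33            # www-data UID/GID used by the Nextcloud image

# Authentik: stays UID 966 and keeps the group passed via --group-add.
authentik = can_connect(0o770, SOCKET_OWNER_UID, CLIENT_GID, 966, {966, CLIENT_GID})

# Nextcloud after the switch: www-data holds only its own primary group.
nextcloud_770 = can_connect(0o770, SOCKET_OWNER_UID, CLIENT_GID, WWW_DATA, {WWW_DATA})
nextcloud_777 = can_connect(0o777, SOCKET_OWNER_UID, CLIENT_GID, WWW_DATA, {WWW_DATA})

print(authentik, nextcloud_770, nextcloud_777)  # True False True
```

With mode 770 the www-data process falls into the "other" class and is refused; with 777 it is allowed.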

Decision

Use 777 permissions on Unix sockets for PostgreSQL and Valkey.

Rationale

Why 777 Works

  1. Compatibility: Any container user model can access the sockets
  2. Simplicity: No complex user namespace mapping needed
  3. Security maintained: Password authentication still required
  4. Local-only: Sockets are not network-exposed

Security Analysis

What 777 allows:

  • Any local process can attempt to connect to the socket

What 777 does NOT allow:

  • Authentication bypass - PostgreSQL requires username + password (scram-sha-256), and Valkey requires its requirepass password
  • Network access - Sockets exist only on the local filesystem
  • Remote connections - Nothing is exposed beyond the host

Security layers:

  1. Host access: An attacker must first gain access to the server
  2. Process: Must be running on the same host
  3. Authentication: Must provide valid credentials
  4. Authorization: Database/Redis permissions enforced
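The separation between "can connect" and "can authenticate" can be illustrated with a toy server. This is not how PostgreSQL or Valkey implement authentication; the AUTH wire format, the password, and the function names are all made up for the example. It only shows that a world-accessible socket does not by itself grant access to the service behind it.

```python
import hmac
import os
import socket
import tempfile
import threading

SECRET = b"example-password"  # hypothetical credential, illustration only


def serve_once(server: socket.socket) -> None:
    """Accept one client and answer a single AUTH attempt."""
    conn, _ = server.accept()
    with conn:
        supplied = conn.recv(1024).strip()
        ok = hmac.compare_digest(supplied, b"AUTH " + SECRET)
        conn.sendall(b"+OK" if ok else b"-ERR invalid password")


def attempt(password: bytes) -> bytes:
    """Start a toy server on a 0777 socket and try to authenticate once."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.sock")
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
            server.bind(path)
            os.chmod(path, 0o777)  # any local user may connect
            server.listen(1)
            worker = threading.Thread(target=serve_once, args=(server,))
            worker.start()
            with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
                client.connect(path)
                client.sendall(b"AUTH " + password)
                reply = client.recv(1024)
            worker.join()
    return reply


print(attempt(b"wrong-password"))
print(attempt(SECRET))
```

Both attempts get past the filesystem permission check, but only the one carrying the correct credential is accepted.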

Comparison to TCP Localhost

Using 127.0.0.1:5432 (TCP) has identical security:

  • Localhost-only (not network)
  • Requires authentication
  • Any local process can attempt connection

Socket 777 vs TCP localhost:

  • Same security model: Both require credentials, both are local-only
  • Different performance: Sockets are faster (no TCP/IP stack overhead)
  • Different permissions: Sockets are governed by filesystem permissions, TCP by the bind address and firewall rules

Alternatives Considered

Alternative 1: Group-based Permissions (770)

Implementation:

postgresql_unix_socket_permissions: "0770"
valkey_unix_socket_perm: "770"

Why rejected:

  • Doesn't work for Nextcloud (www-data is not in the supplementary groups after the user switch)
  • Requires all containers to use --group-add
  • Complex UID/GID management
  • Breaks container user-switching patterns

Alternative 2: User Namespace Mapping

Implementation:

--uidmap 33:963:1  # Map www-data to nextcloud
--gidmap 33:963:1

Why rejected:

  • Container's root loses privileges (can't run entrypoint)
  • Very complex configuration
  • Fragile (breaks on image updates)
  • Doesn't solve the fundamental user-switching problem
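A simplified model of the mapping shows why the container's root user is affected. The sketch assumes each map entry is a (container_start, host_start, count) range and treats the second field as the host-side ID, which glosses over the intermediate namespace used in rootless Podman; map_uid is a hypothetical helper, not a Podman API.

```python
from typing import Optional


def map_uid(container_uid: int,
            mappings: list[tuple[int, int, int]]) -> Optional[int]:
    """Translate a container UID through (container_start, host_start, count) ranges."""
    for container_start, host_start, count in mappings:
        if container_start <= container_uid < container_start + count:
            return host_start + (container_uid - container_start)
    return None  # this UID has no mapping in the namespace


# Equivalent of: --uidmap 33:963:1
mappings = [(33, 963, 1)]

print(map_uid(33, mappings))  # 963
print(map_uid(0, mappings))   # None
```

Only UID 33 is mapped. UID 0 has no counterpart outside the container, which matches the first rejection reason above: the entrypoint that expects to start as root cannot do its work.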

Alternative 3: TCP on Localhost

Implementation:

# PostgreSQL
postgresql_listen_addresses: "127.0.0.1"

# Valkey
valkey_bind: "127.0.0.1"
valkey_port: 6379

Why not chosen (but valid alternative):

  • Same security as socket 777
  • No permission issues
  • Abandons Unix socket performance benefits
  • Goes against infrastructure design goal

Status: Documented as alternative, available for users who prefer it
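On the client side, the choice between socket and TCP mostly comes down to the connection target. The helpers below are a hypothetical sketch, not part of the roles; the database and user name "nextcloud" are assumed for illustration. They rely on two conventions: libpq treats a host value that starts with a slash as a socket directory, and Redis-compatible clients commonly accept a unix:// URL for socket paths.

```python
from typing import Optional


def postgres_dsn(dbname: str, user: str, socket_dir: Optional[str] = None,
                 host: str = "127.0.0.1", port: int = 5432) -> str:
    """Build a libpq keyword/value connection string for socket or TCP access."""
    # With a socket directory, libpq derives the file name from the port
    # (for example .s.PGSQL.5432).
    target = socket_dir if socket_dir else host
    return f"host={target} port={port} dbname={dbname} user={user}"


def valkey_url(socket_path: Optional[str] = None,
               host: str = "127.0.0.1", port: int = 6379) -> str:
    """Build a connection URL for Valkey over a Unix socket or TCP."""
    if socket_path:
        return f"unix://{socket_path}"
    return f"redis://{host}:{port}"


# Socket variant (the chosen design)
print(postgres_dsn("nextcloud", "nextcloud", socket_dir="/var/run/postgresql"))
print(valkey_url(socket_path="/var/run/valkey/valkey.sock"))

# TCP variant (Alternative 3)
print(postgres_dsn("nextcloud", "nextcloud"))
print(valkey_url())
```

In both variants the password is supplied separately, so the authentication step is unchanged.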

Alternative 4: Custom Entrypoint

Implementation: Create wrapper that adds www-data to groups before starting FPM.

Why rejected:

  • Requires custom Dockerfile
  • Maintenance burden
  • Breaks on upstream image updates
  • Fragile and complex

Implementation

Files Changed

  1. roles/postgresql/defaults/main.yml: Set postgresql_unix_socket_permissions: "0777"
  2. roles/valkey/defaults/main.yml: Set valkey_unix_socket_perm: "777"
  3. Documentation updated in all affected role READMEs

Migration Path

For existing deployments:

  1. Update the PostgreSQL socket: chmod 777 /var/run/postgresql/.s.PGSQL.5432
  2. Update the Valkey socket: chmod 777 /var/run/valkey/valkey.sock
  3. Restart the services; the role configuration keeps the permissions in place when the sockets are recreated
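After migrating, the resulting mode can be checked programmatically. The helper below is a hypothetical sketch; the demo creates a throwaway socket in a temporary directory so it runs without PostgreSQL or Valkey installed.

```python
import os
import socket
import stat
import tempfile


def socket_mode(path: str) -> int:
    """Return the permission bits of a Unix socket, e.g. 0o777."""
    st = os.stat(path)
    if not stat.S_ISSOCK(st.st_mode):
        raise ValueError(f"{path} is not a Unix socket")
    return stat.S_IMODE(st.st_mode)


def demo() -> str:
    """Create a temporary socket, apply mode 777, and report what stat sees."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sock")
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as srv:
            srv.bind(path)
            os.chmod(path, 0o777)
            return oct(socket_mode(path))


print(demo())  # 0o777
```

On a real host the same function could be pointed at /var/run/postgresql/.s.PGSQL.5432 or /var/run/valkey/valkey.sock.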

Consequences

Positive

  • Works with all container user models (root-switching, single-user, etc.)
  • Simple to understand and maintain
  • No complex UID/GID mapping required
  • Standard pattern, well-documented
  • Authentication still enforced

Negative

  • ⚠️ Any local process can attempt socket connection
  • ⚠️ Requires clear documentation of security model
  • ⚠️ May surprise users expecting tighter filesystem permissions

Neutral

  • Same security model as TCP localhost
  • Alternative (TCP) available for those who prefer it
  • Follows "make it work, make it right, make it fast" philosophy

Validation

Security Validation

  1. Authentication required: Tested - connection requires credentials
  2. Credential handling: PostgreSQL passwords are verified with scram-sha-256, and secrets are kept in the vault
  3. Local-only: Sockets are filesystem objects, not network endpoints
  4. Service isolation: Each service has its own database or namespace

Functional Validation

  1. Authentik: Works with 777 sockets
  2. Nextcloud: Works with 777 sockets (www-data can access)
  3. Gitea: Works with 777 sockets

Monitoring

No additional monitoring required. Standard checks apply:

  • Service authentication logs (failed login attempts)
  • Connection monitoring via application logs
  • Systemd service health

Documentation

All relevant READMEs updated with:

  • Explanation of 777 permission choice
  • Security rationale
  • TCP alternative configuration
  • Clear security model explanation

Future Considerations

This decision can be revisited if:

  1. Container orchestration changes (e.g., Kubernetes with different security contexts)
  2. New containers with different user models emerge
  3. Network isolation requirements change
  4. Regulatory compliance requires stricter filesystem permissions

In such cases, the TCP alternative provides an equivalent security model without filesystem permission concerns.

Decision Date: December 14, 2025
Status: Accepted
Reviewers: rick-infra maintainers