Add Nextcloud cloud storage role with split Redis caching strategy

## New Features

- **Nextcloud Role**: Complete cloud storage deployment using Podman Quadlet
  - FPM variant with Caddy reverse proxy and FastCGI
  - PostgreSQL database via Unix socket
  - Valkey/Redis for app-level caching and file locking
  - Automatic HTTPS with Let's Encrypt via Caddy
  - Dual-root pattern: Caddy serves static assets, FPM handles PHP
- **Split Caching Strategy**: Redis caching WITHOUT Redis sessions
  - Custom redis.config.php template for app-level caching only
  - File-based PHP sessions for stability (avoids session lock issues)
  - Prevents cascading failures from session lock contention
  - Documented in role README with detailed rationale

## Infrastructure Updates

- **Socket Permissions**: Update PostgreSQL and Valkey to mode 777
  - Required for containers that switch users (root → www-data)
  - Nextcloud container loses supplementary groups on user switch
  - Security maintained via password authentication (scram-sha-256, requirepass)
  - Documented socket permission architecture in docs/
- **PostgreSQL**: Export client group GID as fact for dependent roles
- **Valkey**: Export client group GID as fact, update socket fix service

## Documentation

- New: docs/socket-permissions-architecture.md
  - Explains 777 vs 770 socket permission trade-offs
  - Documents why group-based access doesn't work for user-switching containers
  - Provides TCP alternative for stricter security requirements
- Updated: All role READMEs with socket permission notes
- New: Nextcloud README with comprehensive deployment, troubleshooting, and Redis architecture documentation

## Configuration

- host_vars: Add Nextcloud vault variables and configuration
- site.yml: Include Nextcloud role in main playbook

## Technical Details

**Why disable Redis sessions?**

The official Nextcloud container enables Redis session handling via the REDIS_HOST env var, which causes severe performance issues:

1. Session lock contention under high concurrency (browser parallel asset requests)
2. Infinite lock retries (default lock_retries=-1) blocking FPM workers
3. Timeout orphaning: reverse proxy kills connection, worker keeps lock
4. Worker pool exhaustion: all 5 default workers blocked on same session lock
5. Cascading failure: new requests queue, more timeouts, more orphaned locks

Solution: Use file-based sessions (reliable, fast for single-server) while keeping Redis for distributed cache and transactional file locking via a custom config file. This provides optimal performance without the complexity of Redis session debugging.

Tested: Fresh deployment on arch-vps (69.62.119.31)
Domain: https://cloud.jnss.me/
docs/socket-permissions-architecture.md
@@ -0,0 +1,210 @@

# Socket Permissions Architecture Decision

## Context

Rick-infra uses Unix domain sockets for PostgreSQL and Valkey (Redis) connections to maximize performance and security. Applications run in Podman containers and need to access these infrastructure services via sockets.

## Problem

Different container images have different user models:

1. **Authentik**: Runs as a specific user (UID 966) from start to finish
2. **Nextcloud**: Starts as root, runs entrypoint scripts, then switches to www-data (UID 33)

When using `--group-add` with Podman:
- Supplementary groups are added to the **initial user** the container runs as
- Groups are **NOT inherited** when a container switches users internally
- Nextcloud's www-data process ends up without socket access
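
The effect of losing the supplementary group can be sketched with a simplified model of the owner/group/other permission check. On Linux, connecting to a Unix socket requires write permission on the socket file. The socket owner UID, the client group GID, and Authentik's primary GID below are hypothetical values chosen for illustration; the model also ignores root's override, ACLs, and directory search permissions.

```python
import stat


def can_connect(sock_mode, sock_uid, sock_gid, uid, gid, groups=()):
    """Return True if a process has write permission on the socket file.

    Simplified owner/group/other check: the first matching class decides.
    """
    if uid == sock_uid:
        return bool(sock_mode & stat.S_IWUSR)
    if gid == sock_gid or sock_gid in groups:
        return bool(sock_mode & stat.S_IWGRP)
    return bool(sock_mode & stat.S_IWOTH)


# Hypothetical IDs, for illustration only.
SOCKET_OWNER_UID = 70   # assumed UID of the service that owns the socket
CLIENT_GID = 1500       # assumed GID of the client group that owns the socket
AUTHENTIK_UID = 966     # from the text; its primary GID is assumed to match
WWW_DATA = 33           # UID/GID of www-data, from the text

# Authentik never switches users, so the group from --group-add is kept.
authentik_770 = can_connect(0o770, SOCKET_OWNER_UID, CLIENT_GID,
                            AUTHENTIK_UID, AUTHENTIK_UID, groups=(CLIENT_GID,))

# After the entrypoint switches to www-data, the supplementary group is gone.
nextcloud_770 = can_connect(0o770, SOCKET_OWNER_UID, CLIENT_GID,
                            WWW_DATA, WWW_DATA, groups=())
nextcloud_777 = can_connect(0o777, SOCKET_OWNER_UID, CLIENT_GID,
                            WWW_DATA, WWW_DATA, groups=())

print(authentik_770, nextcloud_770, nextcloud_777)  # True False True
```

With mode 770, only the process that still carries the client group gets through; with 777, the "other" write bit admits www-data as well.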

## Decision

**Use 777 permissions on Unix sockets** for PostgreSQL and Valkey.

## Rationale

### Why 777 Works

1. **Compatibility**: Any container user model can access the sockets
2. **Simplicity**: No complex user namespace mapping needed
3. **Security maintained**: Password authentication still required
4. **Local-only**: Sockets are not network-exposed

### Security Analysis

**What 777 allows:**
- ✅ Any local process can **attempt** to connect to the socket

**What 777 does NOT allow:**
- ❌ Authentication bypass - PostgreSQL requires username + password (scram-sha-256), and Valkey requires its `requirepass` password
- ❌ Network access - Sockets are local filesystem only
- ❌ Remote connections - Not exposed beyond localhost

**Security layers:**
1. **Host access**: An attacker must first gain access to the server
2. **Process**: Must be running on the same host
3. **Authentication**: Must provide valid credentials
4. **Authorization**: Database/Redis permissions enforced

### Comparison to TCP Localhost

Using `127.0.0.1:5432` (TCP) has an **equivalent security model**:
- Localhost-only (not network)
- Requires authentication
- Any local process can attempt connection

Socket 777 vs TCP localhost:
- **Same security model**: Both require credentials, both are local-only
- **Different performance**: Sockets are faster (no TCP/IP stack overhead)
- **Different access control**: Sockets use filesystem permissions, TCP relies on network-level controls (bind address, firewall)

## Alternatives Considered

### Alternative 1: Group-based Permissions (770)

**Implementation:**
```yaml
postgresql_unix_socket_permissions: "0770"
valkey_unix_socket_perm: "770"
```

**Why rejected:**
- Doesn't work for Nextcloud (www-data is not in the supplementary groups after the entrypoint switches users)
- Requires all containers to use `--group-add`
- Complex UID/GID management
- Breaks container user-switching patterns

### Alternative 2: User Namespace Mapping

**Implementation:**
```
--uidmap 33:963:1  # Map www-data to nextcloud
--gidmap 33:963:1
```

**Why rejected:**
- Container's root loses privileges (can't run entrypoint)
- Very complex configuration
- Fragile (breaks on image updates)
- Doesn't solve the fundamental user-switching problem
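
To see why a single-ID mapping leaves the container's root without a usable identity, here is a minimal sketch of how `container:host:amount` triples translate IDs. It is a simplified model, not Podman's implementation: in rootless mode the second field refers to an intermediate namespace rather than a real host ID, and `map_to_host` is a hypothetical helper name.

```python
def map_to_host(container_id, mappings):
    """Translate a container UID/GID to its mapped ID, or None if unmapped.

    Each mapping is (container_start, host_start, amount), mirroring the
    container:host:amount triples passed to --uidmap / --gidmap.
    """
    for container_start, host_start, amount in mappings:
        if container_start <= container_id < container_start + amount:
            return host_start + (container_id - container_start)
    return None


# The mapping from the example above: only www-data (33) is mapped.
uidmap = [(33, 963, 1)]

www_data_host = map_to_host(33, uidmap)  # mapped to 963
root_host = map_to_host(0, uidmap)       # no mapping covers container root

print(www_data_host, root_host)  # 963 None
```

Because UID 0 falls outside the only mapped range, the entrypoint that expects to start as root has nothing to run as, which matches the first rejection reason.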

### Alternative 3: TCP on Localhost

**Implementation:**
```yaml
# PostgreSQL
postgresql_listen_addresses: "127.0.0.1"

# Valkey
valkey_bind: "127.0.0.1"
valkey_port: 6379
```

**Why not chosen (but valid alternative):**
- ✅ Same security as socket 777
- ✅ No permission issues
- ❌ Abandons Unix socket performance benefits
- ❌ Goes against the infrastructure's design goal of socket-based connections

**Status:** Documented as alternative, available for users who prefer it
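
On the client side, switching between the two transports is mostly a change of connection target. The sketch below builds illustrative connection strings only and opens no connections. The helper names and the `nextcloud` database and user are hypothetical; the socket paths are the ones used elsewhere in this document, and the `unix://` URL form is the scheme accepted by clients such as redis-py.

```python
def postgres_dsn(dbname, user, use_socket=True):
    """Build a libpq keyword/value connection string (password omitted)."""
    if use_socket:
        # libpq treats a host value starting with "/" as a socket directory.
        return f"host=/var/run/postgresql dbname={dbname} user={user}"
    return f"host=127.0.0.1 port=5432 dbname={dbname} user={user}"


def valkey_url(use_socket=True):
    """Build a Redis-style URL for either the socket or the TCP listener."""
    if use_socket:
        return "unix:///var/run/valkey/valkey.sock"
    return "redis://127.0.0.1:6379"


print(postgres_dsn("nextcloud", "nextcloud"))
print(postgres_dsn("nextcloud", "nextcloud", use_socket=False))
print(valkey_url())
print(valkey_url(use_socket=False))
```

In both cases the credentials are supplied separately, so authentication works the same way regardless of the transport.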

### Alternative 4: Custom Entrypoint

**Implementation:**
Create a wrapper that adds www-data to the required groups before starting FPM.

**Why rejected:**
- Requires custom Dockerfile
- Maintenance burden
- Breaks on upstream image updates
- Fragile and complex

## Implementation

### Files Changed

1. `roles/postgresql/defaults/main.yml`: Set `postgresql_unix_socket_permissions: "0777"`
2. `roles/valkey/defaults/main.yml`: Set `valkey_unix_socket_perm: "777"`
3. Documentation updated in all affected role READMEs

### Migration Path

For existing deployments:
1. Update the PostgreSQL socket permissions: `chmod 777 /var/run/postgresql/.s.PGSQL.5432`
2. Update the Valkey socket permissions: `chmod 777 /var/run/valkey/valkey.sock`
3. Restart services (permissions persist via role configuration)
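
After migrating, it can be useful to confirm that a socket really carries the expected mode. The following is a minimal sketch using only the Python standard library; `socket_mode` is a hypothetical helper, and the demo runs against a temporary socket so it does not depend on PostgreSQL or Valkey being installed.

```python
import os
import socket
import stat
import tempfile


def socket_mode(path):
    """Return the permission bits of a Unix socket file.

    Raises ValueError if the path exists but is not a socket.
    """
    st = os.stat(path)
    if not stat.S_ISSOCK(st.st_mode):
        raise ValueError(f"{path} is not a Unix socket")
    return stat.S_IMODE(st.st_mode)


# Demo on a throwaway socket instead of a real service socket.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.sock")
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(demo_path)        # creates the socket file
    os.chmod(demo_path, 0o777)    # same mode the migration steps apply
    demo_mode = socket_mode(demo_path)
    server.close()

print(oct(demo_mode))  # 0o777
```

On a migrated host, the same helper could be pointed at `/var/run/postgresql/.s.PGSQL.5432` or `/var/run/valkey/valkey.sock`.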

## Consequences

### Positive

- ✅ Works with all container user models (root-switching, single-user, etc.)
- ✅ Simple to understand and maintain
- ✅ No complex UID/GID mapping required
- ✅ Well-documented pattern (`0777` is also PostgreSQL's upstream default for `unix_socket_permissions`)
- ✅ Authentication still enforced

### Negative

- ⚠️ Any local process can attempt socket connection
- ⚠️ Requires clear documentation of security model
- ⚠️ May surprise users expecting tighter filesystem permissions

### Neutral

- ℹ️ Same security model as TCP localhost
- ℹ️ Alternative (TCP) available for those who prefer it
- ℹ️ Follows "make it work, make it right, make it fast" philosophy

## Validation

### Security Validation

1. **Authentication required**: ✅ Tested - connection requires credentials
2. **Password handling**: ✅ Passwords are verified via scram-sha-256 and stored in vault
3. **Local-only**: ✅ Sockets are filesystem objects, not network
4. **Process isolation**: ✅ Each service has separate database/namespace

### Functional Validation

1. **Authentik**: ✅ Works with 777 sockets
2. **Nextcloud**: ✅ Works with 777 sockets (www-data can access)
3. **Gitea**: ✅ Works with 777 sockets

## Monitoring

No additional monitoring required. Standard checks apply:
- Service authentication logs (failed login attempts)
- Connection monitoring via application logs
- Systemd service health

## Documentation

All relevant READMEs updated with:
- Explanation of 777 permission choice
- Security rationale
- TCP alternative configuration
- Clear security model explanation

## Future Considerations

This decision can be revisited if:
1. Container orchestration changes (e.g., Kubernetes with different security contexts)
2. New containers with different user models emerge
3. Network isolation requirements change
4. Regulatory compliance requires stricter filesystem permissions

In such cases, the TCP alternative provides an equivalent security model without filesystem permission concerns.

## References

- [PostgreSQL Role README](../roles/postgresql/README.md)
- [Valkey Role README](../roles/valkey/README.md)
- [Nextcloud Role README](../roles/nextcloud/README.md)
- [Podman User Namespaces Documentation](https://docs.podman.io/en/latest/markdown/podman-run.1.html#userns-mode)
- [Unix Socket Security](https://www.man7.org/linux/man-pages/man7/unix.7.html)

---

**Decision Date**: December 14, 2025
**Status**: Accepted
**Reviewers**: rick-infra maintainers