Nextcloud Cloud Storage Role

Self-contained Nextcloud deployment using Podman Quadlet with FPM, PostgreSQL database, and Valkey cache via Unix sockets.

Features

  • Container: Single Nextcloud FPM container via Podman Quadlet
  • Database: Self-managed PostgreSQL database via Unix socket
  • Cache: Valkey (Redis-compatible) for file locking and caching
  • Web Server: Caddy reverse proxy with FastCGI and automatic HTTPS
  • Security: Group-based socket access, separated data/config volumes
  • Size: ~320MB FPM image (vs 1.1GB Apache variant)

Architecture

Internet → Caddy (HTTPS:443) → FastCGI → Nextcloud FPM Container (127.0.0.1:9000)
                 ↓                                ↓
          Serves static files              PostgreSQL (socket)
          from /opt/nextcloud/html         Valkey (socket)
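
A quick end-to-end check of this chain is to request Nextcloud's status endpoint through Caddy (domain taken from the examples in this README):

# Expect an HTTP 200 and a JSON body containing "installed":true
curl -fsS https://cloud.jnss.me/status.php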

Volume Layout

/opt/nextcloud/
├── html/           # Application code (755 - readable by Caddy for static files)
├── data/           # User files (700 - private to container)
├── config/         # Config with secrets (700 - private to container)
├── custom_apps/    # Installed apps (755 - readable)
└── .env            # Environment variables (600)

Security Model:

  • Caddy serves static assets (CSS/JS/images) directly from /opt/nextcloud/html
  • Caddy cannot access /data or /config (mode 700)
  • User files are only served through authenticated PHP requests via FPM
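
To spot-check these modes on the host after a deployment (paths as laid out above under Volume Layout):

# html/ and custom_apps/ should report 755, data/ and config/ 700, .env 600
stat -c '%a %U %n' /opt/nextcloud/html /opt/nextcloud/custom_apps /opt/nextcloud/data /opt/nextcloud/config /opt/nextcloud/.env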

Dependencies

  • postgresql role (infrastructure)
  • valkey role (infrastructure)
  • caddy role (web server)
  • podman role (container runtime)

Variables

See defaults/main.yml for all configurable variables.

Required Vault Variables

Define these in your host_vars/ with ansible-vault:

vault_nextcloud_db_password: "secure-database-password"
vault_nextcloud_admin_password: "secure-admin-password"
vault_valkey_password: "secure-valkey-password"

Key Variables

# Domain
nextcloud_domain: "cloud.jnss.me"

# Admin user
nextcloud_admin_user: "admin"

# Database
nextcloud_db_name: "nextcloud"
nextcloud_db_user: "nextcloud"

# Cache (use different DB number per service)
nextcloud_valkey_db: 2  # Authentik uses 1

# PHP limits
nextcloud_php_memory_limit: "512M"
nextcloud_php_upload_limit: "512M"

Deployment Strategy

This role uses a two-phase deployment approach to work correctly with the Nextcloud container's initialization process:

Phase 1: Container Initialization (automatic)

  1. Create empty directories for volumes
  2. Deploy environment configuration (.env)
  3. Start Nextcloud container
  4. Container entrypoint detects first-time setup (no version.php)
  5. Container copies Nextcloud files to /var/www/html/
  6. Container runs occ maintenance:install with PostgreSQL
  7. Installation creates config.php with database credentials
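
To watch phase 1 while it runs, follow the unit's journal; the presence of version.php in the html volume indicates the file copy has completed:

# Follow the container output during first start
journalctl -u nextcloud -f

# Check whether the application files have been copied yet
test -f /opt/nextcloud/html/version.php && echo "files copied" || echo "still initializing"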

Phase 2: Custom Configuration (automatic)

  1. Ansible waits for occ status to report installed: true
  2. Ansible deploys custom redis.config.php (overwrites default)
  3. Container restart applies custom configuration
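
The wait in step 1 amounts to polling occ status; a rough shell equivalent of what the role does (the actual Ansible task may differ) is:

# Poll until occ reports the instance as installed (give up after ~5 minutes)
for i in $(seq 1 60); do
    podman exec --user www-data nextcloud php occ status --output=json 2>/dev/null \
        | grep -q '"installed":true' && break
    sleep 5
done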

Why this order?

The Nextcloud container's entrypoint uses version.php as a marker to determine if installation is needed. If you deploy any files into /opt/nextcloud/config/ before the container starts, the initialization process fails:

  • Container copies files including version.php
  • Entrypoint sees version.php exists → assumes already installed
  • Skips running occ maintenance:install
  • Result: Empty config.php, 503 errors

By deploying custom configs after installation completes, we:

  • Allow the container's auto-installation to run properly
  • Override specific configs (like Redis) after the fact
  • Maintain idempotency (subsequent runs just update configs)
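
To confirm that the custom Redis configuration is in effect on an existing install, query the relevant keys via occ:

# Both should return \OC\Memcache\Redis once redis.config.php has been applied
podman exec --user www-data nextcloud php occ config:system:get memcache.locking
podman exec --user www-data nextcloud php occ config:system:get memcache.distributed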

See the official Nextcloud Docker documentation for more details on the auto-configuration process.

Usage

Include in Playbook

- role: nextcloud
  tags: ['nextcloud', 'cloud', 'storage']

Deploy

# Deploy Nextcloud role
ansible-playbook -i inventory/hosts.yml site.yml --tags nextcloud --ask-vault-pass

# Deploy only infrastructure dependencies
ansible-playbook -i inventory/hosts.yml site.yml --tags postgresql,valkey,caddy

Verification

After deployment:

  1. Access Nextcloud:

    https://cloud.jnss.me
    
  2. Check service status:

    ssh root@arch-vps
    systemctl status nextcloud
    podman ps | grep nextcloud
    
  3. View logs:

    # Container logs
    journalctl -u nextcloud -f
    podman logs nextcloud
    
    # Caddy logs
    tail -f /var/log/caddy/nextcloud.log
    
  4. Verify socket access:

    # Check group memberships
    id nextcloud
    # Should show: postgres-clients, valkey-clients
    
    # Check socket permissions
    ls -la /var/run/postgresql/.s.PGSQL.5432
    ls -la /var/run/valkey/valkey.sock
    

Maintenance

OCC Command (Nextcloud CLI)

Run Nextcloud's OCC command-line tool:

# General syntax
podman exec --user www-data nextcloud php occ <command>

# Examples
podman exec --user www-data nextcloud php occ status
podman exec --user www-data nextcloud php occ app:list
podman exec --user www-data nextcloud php occ maintenance:mode --on
podman exec --user www-data nextcloud php occ files:scan --all

Update Nextcloud

The official image runs Nextcloud's upgrade routine automatically when the container is restarted with a newer image than the installed version:

systemctl restart nextcloud

Or pin a specific image version:

# In host_vars or defaults
nextcloud_version: "32-fpm"  # Pin to major version
# Or
nextcloud_version: "32.0.3-fpm"  # Pin to exact version

Backup Strategy

Key directories to backup:

  1. User data: /opt/nextcloud/data
  2. Configuration: /opt/nextcloud/config
  3. Database: PostgreSQL nextcloud database
  4. Custom apps: /opt/nextcloud/custom_apps (optional)

Example backup script:

#!/bin/bash
# Enable maintenance mode
podman exec --user www-data nextcloud php occ maintenance:mode --on

# Backup data and config
tar -czf nextcloud-data-$(date +%Y%m%d).tar.gz /opt/nextcloud/data /opt/nextcloud/config

# Backup database
sudo -u postgres pg_dump nextcloud > nextcloud-db-$(date +%Y%m%d).sql

# Disable maintenance mode
podman exec --user www-data nextcloud php occ maintenance:mode --off
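
A matching restore sketch (filenames follow the script above; the database owner is assumed to be the nextcloud role user):

#!/bin/bash
DATE=20251220  # date stamp of the backup to restore

# Stop Nextcloud before restoring
systemctl stop nextcloud

# Restore data and config (the archive was created with absolute paths, so extract from /)
tar -xzf "nextcloud-data-${DATE}.tar.gz" -C /

# Recreate and restore the database
sudo -u postgres psql -c "DROP DATABASE IF EXISTS nextcloud;"
sudo -u postgres psql -c "CREATE DATABASE nextcloud OWNER nextcloud;"
sudo -u postgres psql nextcloud < "nextcloud-db-${DATE}.sql"

# Start Nextcloud again and leave maintenance mode
systemctl start nextcloud
podman exec --user www-data nextcloud php occ maintenance:mode --off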

Performance Tuning

Adjust PHP limits in host_vars:

nextcloud_php_memory_limit: "1G"      # For large files
nextcloud_php_upload_limit: "10G"     # For large uploads

Redis/Valkey Caching Architecture

This role uses a split caching strategy for optimal performance and stability:

PHP Sessions: File-based (default PHP session handler)

  • Location: /var/www/html/data/sessions/
  • Why: Redis session locking can cause cascading failures under high concurrency
  • Performance: Excellent for single-server deployments

Nextcloud Application Cache: Redis/Valkey

  • memcache.local: APCu (in-memory opcode cache)
  • memcache.distributed: Redis (shared cache, file locking)
  • memcache.locking: Redis (transactional file locking)
  • Configuration: Via custom redis.config.php template
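
The actual template ships with this role; purely as an illustration of the shape of that override (socket path and DB index as used elsewhere in this README, password as a placeholder), it could be written by hand like this:

# Illustration only - the role deploys its own redis.config.php template
cat > /opt/nextcloud/config/redis.config.php <<'EOF'
<?php
$CONFIG = [
  'memcache.local'       => '\OC\Memcache\APCu',
  'memcache.distributed' => '\OC\Memcache\Redis',
  'memcache.locking'     => '\OC\Memcache\Redis',
  'redis' => [
    'host'     => '/var/run/valkey/valkey.sock',  // path as seen inside the container
    'port'     => 0,                              // 0 signals a Unix socket
    'password' => 'CHANGE-ME',                    // vault_valkey_password
    'dbindex'  => 2,                              // nextcloud_valkey_db
  ],
];
EOF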

Why not Redis sessions?

The official Nextcloud Docker image enables Redis session handling when REDIS_HOST is set. However, this can cause severe performance issues:

  1. Session lock contention: Multiple parallel requests (browser loading CSS/JS/images) compete for the same session lock
  2. Infinite retries: Default lock_retries = -1 means workers block forever
  3. Timeout orphaning: When the reverse proxy times out, FPM workers keep running and hold their locks
  4. Worker exhaustion: Limited FPM workers (default 5) all become blocked
  5. Cascading failure: New requests queue, timeouts accumulate, locks orphan

This role disables Redis sessions by not setting REDIS_HOST in the environment, while still providing Redis caching via a custom redis.config.php that is deployed independently.
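
To verify that sessions really are file-based (i.e. the entrypoint did not enable its Redis session handler), check the effective PHP setting inside the container:

# Expect "files"; "redis" would mean REDIS_HOST was set and the entrypoint enabled Redis sessions
podman exec nextcloud php -r 'echo ini_get("session.save_handler"), PHP_EOL;'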

If you need Redis sessions (e.g., multi-server setup with session sharing), you must:

  1. Enable REDIS_HOST in nextcloud.env.j2
  2. Add a custom PHP ini file with proper lock parameters:
    • redis.session.lock_expire = 30 (locks expire after 30 seconds)
    • redis.session.lock_retries = 100 (max 100 retries, not infinite)
    • redis.session.lock_wait_time = 50000 (50ms between retries)
  3. Mount the ini file with a zz- prefix so it loads after the entrypoint's redis-session.ini
  4. Increase FPM workers significantly (15-20+)
  5. Monitor for orphaned session locks
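
A sketch of such an override (the file name and mount target are assumptions; the directives are the phpredis settings listed above):

# Illustration only - create the override on the host...
cat > /opt/nextcloud/zz-redis-session-locking.ini <<'EOF'
; Loaded after the entrypoint's redis-session.ini thanks to the zz- prefix
redis.session.lock_expire = 30
redis.session.lock_retries = 100
redis.session.lock_wait_time = 50000
EOF

# ...then mount it into the container's PHP conf.d directory via the Quadlet unit, e.g.:
# Volume=/opt/nextcloud/zz-redis-session-locking.ini:/usr/local/etc/php/conf.d/zz-redis-session-locking.ini:ro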

Troubleshooting

Container won't start

# Check container logs
journalctl -u nextcloud -n 50
podman logs nextcloud

# Check systemd unit
systemctl status nextcloud

Permission errors

# Verify user groups
id nextcloud

# Should be in: postgres-clients, valkey-clients
# If not, re-run user.yml tasks:
ansible-playbook -i inventory/hosts.yml site.yml --tags nextcloud,user

Database connection errors

# Test PostgreSQL socket
sudo -u nextcloud psql -h /var/run/postgresql -U nextcloud -d nextcloud

# Check socket exists and permissions
ls -la /var/run/postgresql/.s.PGSQL.5432

Caddy FastCGI errors

# Check Caddy can read app files
sudo -u caddy ls -la /opt/nextcloud/html

# Verify FPM is listening
ss -tlnp | grep 9000

# Test TCP connectivity to FPM (it speaks FastCGI, not HTTP, so expect a connection but no valid HTTP response)
curl -v http://127.0.0.1:9000

"Trusted domain" errors

Add domains to nextcloud_trusted_domains:

nextcloud_trusted_domains: "cloud.jnss.me localhost 69.62.119.31"

Or add via OCC:

podman exec --user www-data nextcloud php occ config:system:set trusted_domains 1 --value=cloud.jnss.me

Integration with Authentik SSO

To integrate Nextcloud with Authentik for SSO, see the Authentik documentation for OAuth2/OIDC provider setup.

Security Notes

  • User data (/opt/nextcloud/data) is mode 700 - only container can access
  • Config (/opt/nextcloud/config) is mode 700 - contains database passwords
  • Application files (/opt/nextcloud/html) are mode 755 - Caddy can read for static files
  • All traffic is HTTPS via Caddy with automatic Let's Encrypt certificates
  • Database and cache connections use Unix sockets (no TCP exposure)
  • Container runs as root initially, then switches to www-data (UID 33) for PHP-FPM

Socket Access Pattern

Nextcloud uses a different access pattern than other rick-infra services due to how the official Nextcloud container works:

How it works:

  1. Container starts as root (UID 0)
  2. Entrypoint runs as root to write PHP configuration files
  3. Entrypoint switches to www-data (UID 33) for PHP-FPM process
  4. www-data accesses PostgreSQL and Valkey via Unix sockets

Why 777 socket permissions are needed:

  • The Nextcloud container cannot use --group-add effectively because:
    • --group-add only adds groups to the initial user (root)
    • When the container switches from root to www-data, supplementary groups are lost
    • www-data (UID 33, GID 33) ends up with no access to group-restricted sockets
  • Infrastructure sockets use mode 777 to allow access by any UID
  • Security is maintained via password authentication (PostgreSQL: scram-sha-256, Valkey: requirepass)
  • Sockets are local-only (not network-exposed)

Alternative (TCP): If you prefer to keep the infrastructure sockets group-restricted (mode 770) rather than 777, you can configure Nextcloud to reach PostgreSQL and Valkey over TCP instead:

# In host_vars
postgresql_listen_addresses: "127.0.0.1"
postgresql_unix_socket_permissions: "0770"  # Restrict to group

valkey_bind: "127.0.0.1"
valkey_port: 6379
valkey_unix_socket_enabled: false

# In Nextcloud env
POSTGRES_HOST=127.0.0.1
POSTGRES_PORT=5432
REDIS_HOST=127.0.0.1
REDIS_PORT=6379

This provides the same security level (password-authenticated, localhost-only); the trade-off is slightly lower performance over TCP than over Unix sockets.
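
With the TCP variant, connectivity can be checked from the host using the standard clients (passwords come from the vault variables):

# PostgreSQL over localhost TCP (prompts for vault_nextcloud_db_password)
psql "host=127.0.0.1 port=5432 user=nextcloud dbname=nextcloud"

# Valkey over localhost TCP (expects vault_valkey_password; should answer PONG)
valkey-cli -h 127.0.0.1 -p 6379 -a 'CHANGE-ME' ping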

See infrastructure role documentation (PostgreSQL and Valkey READMEs) for more details on this architectural decision.
